Models to predict the action class from a sequence of frames

We will build a set of models that read an ImageTuple and output the corresponding Category.

torch.cuda.set_device(1)
torch.cuda.get_device_name()
'GeForce RTX 2070 SUPER'

A ResNet-based Encoder

Extracting image features into a latent variable space

Let's build a tensor representing a batch of images:

  • (batch_size, channels, height, width)
x = torch.rand(8, 3, 64, 64)

We will build a basic ResNet-based encoder:

class Encoder[source]

Encoder(arch=resnet34, n_in=3, weights_file=None, head=True, pretrained=True, cut=None, init=kaiming_normal_, custom_head=None, concat_pool=True, lin_ftrs=None, ps=0.5, first_bn=True, bn_final=False, lin_first=False, y_range=None) :: Module

Same as nn.Module, but no need for subclasses to call super().__init__

This encoder reduces images to a latent space:

enc = Encoder(n_in=3, weights_file=None, head=False)
enc.head
Sequential(
  (0): AdaptiveConcatPool2d(
    (ap): AdaptiveAvgPool2d(output_size=1)
    (mp): AdaptiveMaxPool2d(output_size=1)
  )
  (1): Flatten(full=False)
  (2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)

In this case the latent dimension is 1024: AdaptiveConcatPool2d concatenates the average- and max-pooled 512-dimensional ResNet-34 features.

encoded_var = enc(x)
encoded_var.shape
torch.Size([8, 1024])
test_eq(encoded_var.shape, [8,1024])
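
For reference, here is a minimal sketch of how such a ResNet-based encoder could be assembled from plain torchvision (a hypothetical EncoderSketch; the library's Encoder adds weight loading, custom cuts, configurable heads, etc.):

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet34

class EncoderSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # keep the convolutional body, drop the average pool and fc layer
        self.body = nn.Sequential(*list(resnet34().children())[:-2])
        self.bn = nn.BatchNorm1d(1024)

    def forward(self, x):
        feats = self.body(x)                                      # (bs, 512, h', w')
        pooled = torch.cat([F.adaptive_avg_pool2d(feats, 1),      # concat pooling doubles
                            F.adaptive_max_pool2d(feats, 1)], 1)  # the 512 features
        return self.bn(pooled.flatten(1))                         # (bs, 1024)

test_eq(EncoderSketch()(x).shape, [8, 1024])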

Simple Model

A very basic CNN model

This network just uses a plain ResNet and folds the sequence dimension into the batch dimension. It is not optimal.

class SimpleModel[source]

SimpleModel(arch=resnet34, weights_file=None, num_classes=30, seq_len=40, debug=False) :: Module

A simple CNN model
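
The reshaping trick is simple: fold the frame sequence into the batch dimension so the 2D encoder processes every frame at once, then restore it and pool over time. A minimal sketch using the Encoder from above (the real SimpleModel uses attention pooling and a classification head, as the debug trace below shows):

frames = [torch.rand(8, 3, 64, 64) for _ in range(10)]  # an ImageTuple-like input
xs = torch.stack(frames, dim=1)                         # (8, 10, 3, 64, 64)
feats = enc(xs.view(8 * 10, 3, 64, 64))                 # frames folded into the batch -> (80, 1024)
feats = feats.view(8, 10, -1)                           # (8, 10, 1024)
pooled = feats.mean(dim=1)                              # naive temporal pooling
test_eq(pooled.shape, [8, 1024])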

A splitter function to train the parameters of the encoder and the head separately; this argument is needed for the Learner to be able to call Learner.freeze().

simple_splitter[source]

simple_splitter(model)
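
A hedged sketch of what such a splitter might return, assuming the model exposes encoder and head attributes (params is fastai's helper that collects the parameters of a module):

from fastai.torch_core import params

def simple_splitter_sketch(model):
    # first group: the pretrained encoder (frozen by Learner.freeze()),
    # second group: the randomly initialised head
    return [params(model.encoder), params(model.head)]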

A sequence of 10 images:

inp = [torch.rand(64, 3, 64, 64) for _ in range(10)]
sm = SimpleModel(debug=True, seq_len=10)
out = sm(inp)
test_eq(out.shape, [64, 30])
 input len:   (10, torch.Size([64, 3, 64, 64]))
 after stack:   torch.Size([64, 10, 3, 64, 64])
 encoded shape: torch.Size([64, 10, 1024])
 after attention shape: torch.Size([64, 1024])

ConvLSTM

A CNN encoder followed by an LSTM over the frame features

First, the LSTM wrapper, with a reset method to erase the hidden state before each epoch.

class LSTM[source]

LSTM(input_dim, n_hidden, n_layers, bidirectional=False, p=0.5) :: Module

Same as nn.Module, but no need for subclasses to call super().__init__
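
A hedged sketch of what this wrapper might look like (a hypothetical LSTMSketch; the library's LSTM may differ in how it stores and detaches the state):

import torch.nn as nn

class LSTMSketch(nn.Module):
    def __init__(self, input_dim, n_hidden, n_layers, bidirectional=False, p=0.5):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, n_hidden, n_layers, batch_first=True,
                            bidirectional=bidirectional,
                            dropout=p if n_layers > 1 else 0.)
        self.h = None                          # cached hidden state

    def reset(self):
        self.h = None                          # erase the state, e.g. before each epoch

    def forward(self, x):                      # x: (bs, seq_len, input_dim)
        if self.h is not None and self.h[0].shape[1] != x.shape[0]:
            self.h = None                      # batch size changed: drop the stale state
        out, (h, c) = self.lstm(x, self.h)
        self.h = (h.detach(), c.detach())      # keep the state across batches, cut the gradient
        return out, (h, c)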

We take the output of the encoder (the latent dimension) as input_size; num_layers is how many LSTM layers are stacked, and hidden_dim is as before.

Let's build a single-layer LSTM:

lstm = LSTM(512, 512, 1, bidirectional=False)
y = torch.rand(32, 10,  512)

We get back a sequence of the same length, encoded in the hidden dimension:

out, (h,c) = lstm(y)
out.shape, h.shape, c.shape
(torch.Size([32, 10, 512]), torch.Size([1, 32, 512]), torch.Size([1, 32, 512]))

It can deal with different batch sizes now:

out, (h,c) = lstm(torch.rand(16,10,512))
out.shape, h.shape, c.shape
(torch.Size([16, 10, 512]), torch.Size([1, 16, 512]), torch.Size([1, 16, 512]))

With three bidirectional layers, the output dimension doubles to 2 * hidden_dim = 1024 and the hidden state has num_layers * num_directions = 6 slices:

lstm = LSTM(512, 512, 3, bidirectional=True)
out, (h,c) = lstm(torch.rand(16,10,512))
out.shape,  h.shape, c.shape
(torch.Size([16, 10, 1024]),
 torch.Size([6, 16, 512]),
 torch.Size([6, 16, 512]))

class ConvLSTM[source]

ConvLSTM(arch=resnet34, weights_file=None, num_classes=30, lstm_layers=1, hidden_dim=1024, bidirectional=True, attention=True, debug=False) :: Module

Same as nn.Module, but no need for subclasses to call super().__init__

convlstm_splitter[source]

convlstm_splitter(model)

inp = [torch.rand(32, 3, 64, 64) for _ in range(10)]
clstm = ConvLSTM(attention=False, bidirectional=False, lstm_layers=2, debug=True)
test_eq(clstm(inp).shape, [32, 30])
 after stack:   torch.Size([32, 10, 3, 64, 64])
 after encode:   torch.Size([320, 1024])
 before lstm:   torch.Size([32, 10, 1024])
 after lstm:   torch.Size([32, 10, 1024])
 hidden state: torch.Size([2, 32, 1024])
 hidden state flat: torch.Size([32, 2048])
clstm = ConvLSTM(lstm_layers=3, debug=True)
test_eq(clstm(inp).shape, [32, 30])
 after stack:   torch.Size([32, 10, 3, 64, 64])
 after encode:   torch.Size([320, 1024])
 before lstm:   torch.Size([32, 10, 1024])
 after lstm:   torch.Size([32, 10, 2048])
 attention_w: torch.Size([32, 10])
 after attention: torch.Size([32, 2048])
clstm = ConvLSTM(lstm_layers=1, debug=True)
test_eq(clstm(inp).shape, [32, 30])
 after stack:   torch.Size([32, 10, 3, 64, 64])
 after encode:   torch.Size([320, 1024])
 before lstm:   torch.Size([32, 10, 1024])
 after lstm:   torch.Size([32, 10, 2048])
 attention_w: torch.Size([32, 10])
 after attention: torch.Size([32, 2048])
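
The attention pooling over the LSTM outputs can be pictured with a small sketch (a hypothetical TemporalAttention module, consistent with the attention_w / after attention shapes in the debug trace above, not necessarily the library's implementation):

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)                 # one scalar score per time step

    def forward(self, x):                              # x: (bs, seq_len, dim), the LSTM outputs
        w = self.score(x).squeeze(-1).softmax(dim=1)   # (bs, seq_len)  ~ attention_w
        return (x * w.unsqueeze(-1)).sum(dim=1)        # (bs, dim)      ~ after attention

att = TemporalAttention(2048)
test_eq(att(torch.rand(32, 10, 2048)).shape, [32, 2048])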
 

class TimeSformer[source]

TimeSformer(dim, num_frames, num_classes, image_size=224, patch_size=16, channels=3, depth=12, heads=8, dim_head=64, attn_dropout=0.0, ff_dropout=0.0) :: TimeSformer

Base class for all neural network modules.

Your models should also subclass this class.

Modules can also contain other Modules, allowing to nest them in a tree structure. You can assign the submodules as regular attributes:

import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

Submodules assigned in this way will be registered, and will have their parameters converted too when you call .to(), etc.

training (bool): whether this module is in training or evaluation mode.

model = TimeSformer(
    dim = 128,
    image_size = 64,
    patch_size = 16,
    num_frames = 8,
    num_classes = 10,
    depth = 12,
    heads = 8,
    dim_head =  64,
    attn_dropout = 0.1,
    ff_dropout = 0.1
)
video = tuple(torch.randn(2, 3, 64, 64) for _ in range(8)) # a tuple of 8 frames, each (batch, channels, height, width)
test_eq(model(video).shape, (2,10))
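
The model receives the frames as a tuple of per-frame tensors; presumably they are stacked into a single (batch, frames, channels, height, width) tensor before the transformer sees them. A quick check of that stacking (an assumption about the wrapper, not part of its documented API):

test_eq(torch.stack(video, dim=1).shape, [2, 8, 3, 64, 64])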

STAM - PyTorch

Implementation of STAM (Space Time Attention Model), another pure attention model reported to reach state-of-the-art results in video classification, corroborating the findings of TimeSformer. Attention is all we need.

 

class STAM[source]

STAM(dim, image_size, patch_size, num_frames, num_classes, space_depth, space_heads, space_mlp_dim, time_depth, time_heads, time_mlp_dim, space_dim_head=64, time_dim_head=64, dropout=0.0, emb_dropout=0.0) :: STAM

Base class for all neural network modules.


model = STAM(
    dim = 128,
    image_size = 64,     # size of image
    patch_size = 16,      # patch size
    num_frames = 8,       # number of image frames, selected out of video
    space_depth = 12,     # depth of vision transformer
    space_heads = 8,      # heads of vision transformer
    space_mlp_dim = 2048, # feedforward hidden dimension of vision transformer
    time_depth = 6,       # depth of time transformer (in paper, it was shallower, 6)
    time_heads = 8,       # heads of time transformer
    time_mlp_dim = 2048,  # feedforward hidden dimension of time transformer
    num_classes = 10,    # number of output classes
    space_dim_head = 64,  # space transformer head dimension
    time_dim_head = 64,   # time transformer head dimension
    dropout = 0.,         # dropout
    emb_dropout = 0.      # embedding dropout
)
video = tuple(torch.randn(2, 3, 64, 64) for _ in range(8)) # a tuple of 8 frames, each (batch, channels, height, width)
test_eq(model(video).shape, (2,10))
from nbdev.export import *
notebook2script()
Converted 00_core.ipynb.
Converted 01_models.ipynb.
Converted index.ipynb.