Models to predict the action class from a sequence of frames

We will build a set of models that read an ImageTuple and output the corresponding Category.

torch.cuda.set_device(1)
torch.cuda.get_device_name()
'GeForce RTX 2070 SUPER'

A ResNet-based Encoder

Extracting image features into a latent variable space

Let's build a tensor representing a batch of images:

  • (batch_size, channels, height, width)
x = torch.rand(8, 3, 64, 64)

We will build a basic ResNet-based encoder:

class Encoder[source]

Encoder(arch=resnet34, n_in=3, weights_file=None, head=True, pretrained=True, cut=None, init=kaiming_normal_, custom_head=None, concat_pool=True, lin_ftrs=None, ps=0.5, first_bn=True, bn_final=False, lin_first=False, y_range=None) :: Module

Same as nn.Module, but no need for subclasses to call super().__init__

This encoder reduces images to a latent space:

enc = Encoder(n_in=3, weights_file=None, head=False)
enc.head
Sequential(
  (0): AdaptiveConcatPool2d(
    (ap): AdaptiveAvgPool2d(output_size=1)
    (mp): AdaptiveMaxPool2d(output_size=1)
  )
  (1): Flatten(full=False)
  (2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)

In this case the latent dimension is 1024: AdaptiveConcatPool2d concatenates the average- and max-pooled 512-dimensional ResNet-34 features.

encoded_var = enc(x)
encoded_var.shape
torch.Size([8, 1024])
test_eq(encoded_var.shape, [8,1024])
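
For reference, here is a minimal sketch of how such a ResNet-based encoder could be assembled from plain torchvision (a hypothetical EncoderSketch; the library's Encoder adds weight loading, custom cuts, configurable heads, etc.):

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet34

class EncoderSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # keep the convolutional body, drop the average pool and fc layer
        self.body = nn.Sequential(*list(resnet34().children())[:-2])
        self.bn = nn.BatchNorm1d(1024)

    def forward(self, x):
        feats = self.body(x)                                      # (bs, 512, h', w')
        pooled = torch.cat([F.adaptive_avg_pool2d(feats, 1),      # concat pooling doubles
                            F.adaptive_max_pool2d(feats, 1)], 1)  # the 512 features
        return self.bn(pooled.flatten(1))                         # (bs, 1024)

test_eq(EncoderSketch()(x).shape, [8, 1024])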

Simple Model

A very basic CNN model

This network just uses a plain ResNet and folds the sequence dimension into the batch dimension. It is not optimal.

class SimpleModel[source]

SimpleModel(arch=resnet34, weights_file=None, num_classes=30, seq_len=40, debug=False) :: Module

A simple CNN model
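
The reshaping trick is simple: fold the frame sequence into the batch dimension so the 2D encoder processes every frame at once, then restore it and pool over time. A minimal sketch using the Encoder from above (the real SimpleModel uses attention pooling and a classification head, as the debug trace below shows):

frames = [torch.rand(8, 3, 64, 64) for _ in range(10)]  # an ImageTuple-like input
xs = torch.stack(frames, dim=1)                         # (8, 10, 3, 64, 64)
feats = enc(xs.view(8 * 10, 3, 64, 64))                 # frames folded into the batch -> (80, 1024)
feats = feats.view(8, 10, -1)                           # (8, 10, 1024)
pooled = feats.mean(dim=1)                              # naive temporal pooling
test_eq(pooled.shape, [8, 1024])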

A splitter function to train the parameters of the encoder and the head separately; this argument is needed for the Learner to be able to call Learner.freeze().

simple_splitter[source]

simple_splitter(model)
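
A hedged sketch of what such a splitter might return, assuming the model exposes encoder and head attributes (params is fastai's helper that collects the parameters of a module):

from fastai.torch_core import params

def simple_splitter_sketch(model):
    # first group: the pretrained encoder (frozen by Learner.freeze()),
    # second group: the randomly initialised head
    return [params(model.encoder), params(model.head)]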

A sequence of 10 images:

inp = [torch.rand(64, 3, 64, 64) for _ in range(10)]
sm = SimpleModel(debug=True, seq_len=10)
out = sm(inp)
test_eq(out.shape, [64, 30])
 input len:   (10, torch.Size([64, 3, 64, 64]))
 after stack:   torch.Size([64, 10, 3, 64, 64])
 encoded shape: torch.Size([64, 10, 1024])
 after attention shape: torch.Size([64, 1024])

ConvLSTM

A CNN encoder followed by an LSTM over the frame features

First, the LSTM wrapper, with a reset method to erase the hidden state before each epoch.

class LSTM[source]

LSTM(input_dim, n_hidden, n_layers, bidirectional=False, p=0.5) :: Module

Same as nn.Module, but no need for subclasses to call super().__init__
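
A hedged sketch of what this wrapper might look like (a hypothetical LSTMSketch; the library's LSTM may differ in how it stores and detaches the state):

import torch.nn as nn

class LSTMSketch(nn.Module):
    def __init__(self, input_dim, n_hidden, n_layers, bidirectional=False, p=0.5):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, n_hidden, n_layers, batch_first=True,
                            bidirectional=bidirectional,
                            dropout=p if n_layers > 1 else 0.)
        self.h = None                          # cached hidden state

    def reset(self):
        self.h = None                          # erase the state, e.g. before each epoch

    def forward(self, x):                      # x: (bs, seq_len, input_dim)
        if self.h is not None and self.h[0].shape[1] != x.shape[0]:
            self.h = None                      # batch size changed: drop the stale state
        out, (h, c) = self.lstm(x, self.h)
        self.h = (h.detach(), c.detach())      # keep the state across batches, cut the gradient
        return out, (h, c)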

We take the output of the encoder (the latent dimension) as input_size; num_layers is how many LSTM layers are stacked, and hidden_dim is as before.

Let's build a single-layer LSTM:

lstm = LSTM(512, 512, 1, bidirectional=False)
y = torch.rand(32, 10,  512)

We get back a sequence of the same length, encoded in the hidden dimension:

out, (h,c) = lstm(y)
out.shape, h.shape, c.shape
(torch.Size([32, 10, 512]), torch.Size([1, 32, 512]), torch.Size([1, 32, 512]))

It can deal with different batch sizes now:

out, (h,c) = lstm(torch.rand(16,10,512))
out.shape, h.shape, c.shape
(torch.Size([16, 10, 512]), torch.Size([1, 16, 512]), torch.Size([1, 16, 512]))

With three bidirectional layers, the output dimension doubles to 2 * hidden_dim = 1024 and the hidden state has num_layers * num_directions = 6 slices:

lstm = LSTM(512, 512, 3, bidirectional=True)
out, (h,c) = lstm(torch.rand(16,10,512))
out.shape,  h.shape, c.shape
(torch.Size([16, 10, 1024]),
 torch.Size([6, 16, 512]),
 torch.Size([6, 16, 512]))

class ConvLSTM[source]

ConvLSTM(arch=resnet34, weights_file=None, num_classes=30, lstm_layers=1, hidden_dim=1024, bidirectional=True, attention=True, debug=False) :: Module

Same as nn.Module, but no need for subclasses to call super().__init__

convlstm_splitter[source]

convlstm_splitter(model)

inp = [torch.rand(32, 3, 64, 64) for _ in range(10)]
clstm = ConvLSTM(attention=False, bidirectional=False, lstm_layers=2, debug=True)
test_eq(clstm(inp).shape, [32, 30])
 after stack:   torch.Size([32, 10, 3, 64, 64])
 after encode:   torch.Size([320, 1024])
 before lstm:   torch.Size([32, 10, 1024])
 after lstm:   torch.Size([32, 10, 1024])
 hidden state: torch.Size([2, 32, 1024])
 hidden state flat: torch.Size([32, 2048])
clstm = ConvLSTM(lstm_layers=3, debug=True)
test_eq(clstm(inp).shape, [32, 30])
 after stack:   torch.Size([32, 10, 3, 64, 64])
 after encode:   torch.Size([320, 1024])
 before lstm:   torch.Size([32, 10, 1024])
 after lstm:   torch.Size([32, 10, 2048])
 attention_w: torch.Size([32, 10])
 after attention: torch.Size([32, 2048])
clstm = ConvLSTM(lstm_layers=1, debug=True)
test_eq(clstm(inp).shape, [32, 30])
 after stack:   torch.Size([32, 10, 3, 64, 64])
 after encode:   torch.Size([320, 1024])
 before lstm:   torch.Size([32, 10, 1024])
 after lstm:   torch.Size([32, 10, 2048])
 attention_w: torch.Size([32, 10])
 after attention: torch.Size([32, 2048])
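
The attention pooling over the LSTM outputs can be pictured with a small sketch (a hypothetical TemporalAttention module, consistent with the attention_w / after attention shapes in the debug trace above, not necessarily the library's implementation):

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)                 # one scalar score per time step

    def forward(self, x):                              # x: (bs, seq_len, dim), the LSTM outputs
        w = self.score(x).squeeze(-1).softmax(dim=1)   # (bs, seq_len)  ~ attention_w
        return (x * w.unsqueeze(-1)).sum(dim=1)        # (bs, dim)      ~ after attention

att = TemporalAttention(2048)
test_eq(att(torch.rand(32, 10, 2048)).shape, [32, 2048])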
 

class TimeSformer[source]

TimeSformer(dim, num_frames, num_classes, image_size=224, patch_size=16, channels=3, depth=12, heads=8, dim_head=64, attn_dropout=0.0, ff_dropout=0.0) :: TimeSformer

Base class for all neural network modules.

Your models should also subclass this class.

Modules can also contain other Modules, allowing to nest them in a tree structure. You can assign the submodules as regular attributes:

import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

Submodules assigned in this way will be registered, and will have their parameters converted too when you call .to(), etc.

training (bool): whether this module is in training or evaluation mode.

model = TimeSformer(
    dim = 128,
    image_size = 64,
    patch_size = 16,
    num_frames = 8,
    num_classes = 10,
    depth = 12,
    heads = 8,
    dim_head =  64,
    attn_dropout = 0.1,
    ff_dropout = 0.1
)
video = tuple(torch.randn(2, 3, 64, 64) for _ in range(8)) # a tuple of 8 frames, each (batch, channels, height, width)
test_eq(model(video).shape, (2,10))
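
The model receives the frames as a tuple of per-frame tensors; presumably they are stacked into a single (batch, frames, channels, height, width) tensor before the transformer sees them. A quick check of that stacking (an assumption about the wrapper, not part of its documented API):

test_eq(torch.stack(video, dim=1).shape, [2, 8, 3, 64, 64])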

STAM - PyTorch

Implementation of STAM (Space Time Attention Model), another pure attention model reported to reach state-of-the-art results in video classification, corroborating the findings of TimeSformer. Attention is all we need.

 

class STAM[source]

STAM(dim, image_size, patch_size, num_frames, num_classes, space_depth, space_heads, space_mlp_dim, time_depth, time_heads, time_mlp_dim, space_dim_head=64, time_dim_head=64, dropout=0.0, emb_dropout=0.0) :: STAM

Base class for all neural network modules.


model = STAM(
    dim = 128,
    image_size = 64,     # size of image
    patch_size = 16,      # patch size
    num_frames = 8,       # number of image frames, selected out of video
    space_depth = 12,     # depth of vision transformer
    space_heads = 8,      # heads of vision transformer
    space_mlp_dim = 2048, # feedforward hidden dimension of vision transformer
    time_depth = 6,       # depth of time transformer (in paper, it was shallower, 6)
    time_heads = 8,       # heads of time transformer
    time_mlp_dim = 2048,  # feedforward hidden dimension of time transformer
    num_classes = 10,    # number of output classes
    space_dim_head = 64,  # space transformer head dimension
    time_dim_head = 64,   # time transformer head dimension
    dropout = 0.,         # dropout
    emb_dropout = 0.      # embedding dropout
)
video = tuple(torch.randn(2, 3, 64, 64) for _ in range(8)) # a tuple of 8 frames, each (batch, channels, height, width)
test_eq(model(video).shape, (2,10))
from nbdev.export import *
notebook2script()
Converted 00_core.ipynb.
Converted 01_models.ipynb.
Converted index.ipynb.