We will build a set of models that read the ImageTuple and output the corresponding Category.
torch.cuda.set_device(1)
torch.cuda.get_device_name()
Let's build a tensor representing a batch of images:
(batch_size, channels, height, width)
x = torch.rand(8, 3, 64, 64)
We will build a basic ResNet-based encoder that reduces images to a latent space:
enc = Encoder(n_in=3, weights_file=None, head=False)
enc.head
In this case, the latent dimension is 1024:
encoded_var = enc(x)
encoded_var.shape
test_eq(encoded_var.shape, [8,1024])
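For reference, a minimal sketch of what such a ResNet-based encoder could look like. The 1024-dim latent is assumed here to come from concatenating average- and max-pooled features of a resnet34 body; this is an illustration only, and the actual Encoder may be built differently.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet34

class ConcatPool2d(nn.Module):
    "Concatenate adaptive average- and max-pooling over the spatial dims."
    def forward(self, x):
        return torch.cat([F.adaptive_avg_pool2d(x, 1), F.adaptive_max_pool2d(x, 1)], dim=1)

class EncoderSketch(nn.Module):
    "Hypothetical encoder: ResNet body + concat pooling -> 1024-dim latent."
    def __init__(self):
        super().__init__()
        backbone = resnet34()  # untrained backbone, for shape checking only
        self.body = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.pool = ConcatPool2d()
    def forward(self, x):
        feats = self.body(x)                # (bs, 512, h', w')
        return self.pool(feats).flatten(1)  # (bs, 2 * 512) = (bs, 1024)

test_eq(EncoderSketch()(torch.rand(2, 3, 64, 64)).shape, [2, 1024])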
This network simply uses a plain ResNet and folds the sequence dimension onto the batch dimension. It is not optimal.
A splitter function lets us train the encoder and head parameters separately; it is a required argument of the Learner so that Learner.freeze() can be called.
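Such a splitter could look roughly like the sketch below, assuming the model exposes encoder and head attributes (illustrative names; the real model's attributes may differ):
def encoder_head_splitter(model):
    "Hypothetical splitter: one parameter group for the encoder, one for the head."
    return [list(model.encoder.parameters()), list(model.head.parameters())]

# learn = Learner(dls, model, splitter=encoder_head_splitter)
# learn.freeze()  # would then train only the head's parameter group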
A sequence of 10 images (a list of 10 frame batches, each with batch size 64):
inp = [torch.rand(64, 3, 64, 64) for _ in range(10)]
sm = SimpleModel(debug=True, seq_len=10)
out = sm(inp)
test_eq(out.shape, [64, 30])
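The frame-stacking idea described above could look roughly like this sketch: stack the frames on the batch dimension, encode them, average over the sequence, and classify. The aggregation step and the 30 output classes are assumptions made to match the test above; the real SimpleModel may differ.
import torch
import torch.nn as nn

class SimpleModelSketch(nn.Module):
    "Hypothetical frame-stacking model: encode each frame, average, classify."
    def __init__(self, encoder, latent_dim=1024, n_classes=30, seq_len=10):
        super().__init__()
        self.encoder, self.seq_len = encoder, seq_len
        self.head = nn.Linear(latent_dim, n_classes)
    def forward(self, frames):
        bs = frames[0].shape[0]
        x = torch.stack(frames, dim=1)               # (bs, seq_len, 3, 64, 64)
        x = x.view(bs * self.seq_len, *x.shape[2:])  # fold the sequence onto the batch dim
        feats = self.encoder(x).view(bs, self.seq_len, -1)
        return self.head(feats.mean(dim=1))          # (bs, n_classes)

# e.g. SimpleModelSketch(EncoderSketch())(inp).shape -> torch.Size([64, 30])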
First, the LSTM wrapper, with a reset method to erase the hidden state before each epoch. The input_size is the output of the encoder (the latent dimension), num_layers is the number of stacked LSTM layers, and the hidden dim is the same as before.
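A minimal sketch of such a wrapper, assuming a batch_first nn.LSTM that keeps its state between calls; the real implementation may handle detaching and batch-size changes differently.
import torch
import torch.nn as nn

class LSTMSketch(nn.Module):
    "Hypothetical stateful LSTM wrapper with a reset() method."
    def __init__(self, input_size, hidden_size, num_layers=1, bidirectional=False):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                            batch_first=True, bidirectional=bidirectional)
        self.hidden = None  # (h, c) carried between calls
    def reset(self):
        "Erase the hidden state, e.g. before each epoch."
        self.hidden = None
    def forward(self, x):
        if self.hidden is not None and self.hidden[0].shape[1] != x.shape[0]:
            self.hidden = None  # batch size changed: drop the stored state
        out, self.hidden = self.lstm(x, self.hidden)
        self.hidden = tuple(h.detach() for h in self.hidden)  # no grads across batches
        return out, self.hidden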
Let's build a single-layer LSTM with a hidden size of 512:
lstm = LSTM(512, 512, 1, bidirectional=False)
y = torch.rand(32, 10, 512)
We get an output with the same sequence length, encoded in the hidden_dim:
out, (h,c) = lstm(y)
out.shape, h.shape, c.shape
It can deal with different batch sizes now:
out, (h,c) = lstm(torch.rand(16,10,512))
out.shape, h.shape, c.shape
lstm = LSTM(512, 512, 3, bidirectional=True)
out, (h,c) = lstm(torch.rand(16,10,512))
out.shape, h.shape, c.shape
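Assuming the wrapper follows the batch_first conventions of torch.nn.LSTM, the shapes above break down as follows:
# out: (16, 10, 2 * 512)  -> both directions concatenated on the feature dim
# h:   (2 * 3, 16, 512)   -> one slice per layer and direction
# c:   (2 * 3, 16, 512)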
inp = [torch.rand(32, 3, 64, 64) for _ in range(10)]
clstm = ConvLSTM(attention=False, bidirectional=False, lstm_layers=2, debug=True)
test_eq(clstm(inp).shape, [32, 30])
clstm = ConvLSTM(lstm_layers=3, debug=True)
test_eq(clstm(inp).shape, [32, 30])
clstm = ConvLSTM(lstm_layers=1, debug=True)
test_eq(clstm(inp).shape, [32, 30])
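For orientation, a sketch of how an encoder + LSTM classifier along these lines could be put together; this is purely illustrative (the real ConvLSTM also supports attention and may aggregate the sequence differently).
import torch
import torch.nn as nn

class ConvLSTMSketch(nn.Module):
    "Hypothetical video classifier: encoder -> LSTM -> linear head."
    def __init__(self, encoder, latent_dim=1024, hidden_dim=512,
                 lstm_layers=1, bidirectional=False, n_classes=30):
        super().__init__()
        self.encoder = encoder
        self.lstm = nn.LSTM(latent_dim, hidden_dim, lstm_layers,
                            batch_first=True, bidirectional=bidirectional)
        self.head = nn.Linear(hidden_dim * (2 if bidirectional else 1), n_classes)
    def forward(self, frames):
        bs, seq_len = frames[0].shape[0], len(frames)
        x = torch.stack(frames, dim=1).view(bs * seq_len, *frames[0].shape[1:])
        feats = self.encoder(x).view(bs, seq_len, -1)  # (bs, seq_len, latent_dim)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])                   # classify from the last time step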
TimeSformer
Thanks to LucidRains: https://github.com/lucidrains/TimeSformer-pytorch
model = TimeSformer(
dim = 128,
image_size = 64,
patch_size = 16,
num_frames = 8,
num_classes = 10,
depth = 12,
heads = 8,
dim_head = 64,
attn_dropout = 0.1,
ff_dropout = 0.1
)
video = tuple(torch.randn(2, 3, 64, 64) for _ in range(8)) # a tuple of 8 frames, each (batch, channels, height, width)
test_eq(model(video).shape, (2,10))
STAM - Pytorch
Implementation of STAM (Space Time Attention Model), yet another pure and simple SOTA attention model that bests all previous models in video classification. This corroborates the findings of TimeSformer. Attention is all we need.
model = STAM(
dim = 128,
image_size = 64, # size of image
patch_size = 16, # patch size
num_frames = 8, # number of image frames, selected out of video
space_depth = 12, # depth of vision transformer
space_heads = 8, # heads of vision transformer
space_mlp_dim = 2048, # feedforward hidden dimension of vision transformer
time_depth = 6, # depth of time transformer (in paper, it was shallower, 6)
time_heads = 8, # heads of time transformer
time_mlp_dim = 2048, # feedforward hidden dimension of time transformer
num_classes = 10, # number of output classes
space_dim_head = 64, # space transformer head dimension
time_dim_head = 64, # time transformer head dimension
dropout = 0., # dropout
emb_dropout = 0. # embedding dropout
)
video = tuple(torch.randn(2, 3, 64, 64) for _ in range(8)) # a tuple of 8 frames, each (batch, channels, height, width)
test_eq(model(video).shape, (2,10))
from nbdev.export import *
notebook2script()