DataBlock API to construct the DataLoaders

We will create a DataBlock to process our UCR datasets

ucr_path = untar_data(URLs.UCR)
df_train, df_test = load_df_ucr(ucr_path, 'StarLightCurves')
Loading files from: /home/tcapelle/.fastai/data/Univariate2018_arff/StarLightCurves
df_train.head()
att1 att2 att3 att4 att5 att6 att7 att8 att9 att10 ... att1016 att1017 att1018 att1019 att1020 att1021 att1022 att1023 att1024 target
0 0.537303 0.531103 0.528503 0.529403 0.533603 0.540903 0.551103 0.564003 0.579603 0.597603 ... 0.546903 0.545903 0.543903 0.541003 0.537203 0.532303 0.526403 0.519503 0.511403 b'3'
1 0.588398 0.593898 0.599098 0.604098 0.608798 0.613397 0.617797 0.622097 0.626097 0.630097 ... 0.237399 0.246499 0.256199 0.266499 0.277399 0.288799 0.300899 0.313599 0.326899 b'3'
2 -0.049900 -0.041500 -0.033400 -0.025600 -0.018100 -0.010800 -0.003800 0.003000 0.009600 0.015900 ... -0.173801 -0.161601 -0.149201 -0.136401 -0.123201 -0.109701 -0.095901 -0.081701 -0.067100 b'1'
3 1.337005 1.319805 1.302905 1.286305 1.270005 1.254005 1.238304 1.223005 1.208104 1.193504 ... 1.288905 1.298505 1.307705 1.316505 1.324905 1.332805 1.340205 1.347005 1.353205 b'3'
4 0.769801 0.775301 0.780401 0.785101 0.789401 0.793301 0.796801 0.799901 0.802601 0.805101 ... 0.742401 0.744501 0.747301 0.750701 0.754801 0.759501 0.765001 0.771301 0.778401 b'3'

5 rows × 1025 columns

x_cols = df_train.columns[:-1].to_list()
x_cols[:5], x_cols[-1]
(['att1', 'att2', 'att3', 'att4', 'att5'], 'att1024')

TSBlock[source]

TSBlock()

A TimeSeries Block to process one timeseries
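Judging by the summary output below, where the x pipeline is <lambda> -> TSeries.create, TSBlock is presumably just a TransformBlock wired to TSeries.create. A minimal sketch under that assumption (the library's actual definition may differ):

def TSBlock():
    "A TimeSeries Block to process one timeseries"
    # Assumed: the type transform builds a TSeries from the float array
    # returned by get_x (this matches the pipeline shown in summary below).
    return TransformBlock(type_tfms=TSeries.create)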

dblock = DataBlock(blocks=(TSBlock, CategoryBlock),
                   get_x=lambda o: o[x_cols].values.astype(np.float32),
                   get_y=ColReader('target'),
                   splitter=RandomSplitter(0.2))

A good way to debug the DataBlock is to use the summary method:

dblock.summary(df_train)
Setting-up type transforms pipelines
Collecting items from          att1      att2      att3      att4      att5      att6      att7  \
0    0.537303  0.531103  0.528503  0.529403  0.533603  0.540903  0.551103   
1    0.588398  0.593898  0.599098  0.604098  0.608798  0.613397  0.617797   
2   -0.049900 -0.041500 -0.033400 -0.025600 -0.018100 -0.010800 -0.003800   
3    1.337005  1.319805  1.302905  1.286305  1.270005  1.254005  1.238304   
4    0.769801  0.775301  0.780401  0.785101  0.789401  0.793301  0.796801   
..        ...       ...       ...       ...       ...       ...       ...   
995 -0.751000 -0.749100 -0.747100 -0.745200 -0.743300 -0.741500 -0.739600   
996  0.867600  0.860300  0.853300  0.846500  0.840100  0.834000  0.828100   
997  0.087398  0.097398  0.107698  0.118298  0.129298  0.140398  0.151898   
998  0.664799  0.654799  0.646099  0.638599  0.632299  0.627199  0.623099   
999  0.563602  0.569502  0.574902  0.579702  0.583902  0.587702  0.591002   

         att8      att9     att10  ...   att1016   att1017   att1018  \
0    0.564003  0.579603  0.597603  ...  0.546903  0.545903  0.543903   
1    0.622097  0.626097  0.630097  ...  0.237399  0.246499  0.256199   
2    0.003000  0.009600  0.015900  ... -0.173801 -0.161601 -0.149201   
3    1.223005  1.208104  1.193504  ...  1.288905  1.298505  1.307705   
4    0.799901  0.802601  0.805101  ...  0.742401  0.744501  0.747301   
..        ...       ...       ...  ...       ...       ...       ...   
995 -0.737800 -0.735900 -0.734100  ... -0.768600 -0.766100 -0.763500   
996  0.822600  0.817400  0.812500  ...  0.847500  0.849800  0.852100   
997  0.163498  0.175398  0.187498  ...  0.012198  0.022098  0.032398   
998  0.619999  0.617899  0.616699  ...  0.699299  0.698899  0.698099   
999  0.593802  0.596202  0.598302  ...  0.176101  0.203402  0.233902   

      att1019   att1020   att1021   att1022   att1023   att1024  target  
0    0.541003  0.537203  0.532303  0.526403  0.519503  0.511403    b'3'  
1    0.266499  0.277399  0.288799  0.300899  0.313599  0.326899    b'3'  
2   -0.136401 -0.123201 -0.109701 -0.095901 -0.081701 -0.067100    b'1'  
3    1.316505  1.324905  1.332805  1.340205  1.347005  1.353205    b'3'  
4    0.750701  0.754801  0.759501  0.765001  0.771301  0.778401    b'3'  
..        ...       ...       ...       ...       ...       ...     ...  
995 -0.760700 -0.757800 -0.754700 -0.751500 -0.748100 -0.744600    b'2'  
996  0.854400  0.856800  0.859200  0.861700  0.864100  0.866600    b'3'  
997  0.042898  0.053798  0.065098  0.076698  0.088698  0.101098    b'1'  
998  0.696899  0.695299  0.693299  0.690799  0.687799  0.684399    b'3'  
999  0.267602  0.304802  0.345602  0.390102  0.438302  0.490602    b'3'  

[1000 rows x 1025 columns]
Found 1000 items
2 datasets of sizes 800,200
Setting up Pipeline: <lambda> -> TSeries.create
Setting up Pipeline: ColReader -- {'cols': 'target', 'pref': '', 'suff': '', 'label_delim': None} -> Categorize -- {'vocab': None, 'sort': True, 'add_na': False}

Building one sample
  Pipeline: <lambda> -> TSeries.create
    starting from
      att1      -0.525298
att2      -0.516598
att3      -0.508298
att4      -0.500598
att5      -0.493198
             ...   
att1021   -0.638398
att1022   -0.632698
att1023   -0.625898
att1024   -0.617898
target         b'2'
Name: 548, Length: 1025, dtype: object
    applying <lambda> gives
      [-0.52529806 -0.5165981  -0.5082981  ... -0.632698   -0.625898
 -0.617898  ]
    applying TSeries.create gives
      TSeries of size 1x1024
  Pipeline: ColReader -- {'cols': 'target', 'pref': '', 'suff': '', 'label_delim': None} -> Categorize -- {'vocab': None, 'sort': True, 'add_na': False}
    starting from
      att1      -0.525298
att2      -0.516598
att3      -0.508298
att4      -0.500598
att5      -0.493198
             ...   
att1021   -0.638398
att1022   -0.632698
att1023   -0.625898
att1024   -0.617898
target         b'2'
Name: 548, Length: 1025, dtype: object
    applying ColReader -- {'cols': 'target', 'pref': '', 'suff': '', 'label_delim': None} gives
      b'2'
    applying Categorize -- {'vocab': None, 'sort': True, 'add_na': False} gives
      TensorCategory(1)

Final sample: (TSeries(ch=1, len=1024), TensorCategory(1))


Setting up after_item: Pipeline: ToTensor
Setting up before_batch: Pipeline: 
Setting up after_batch: Pipeline: 

Building one batch
Applying item_tfms to the first sample:
  Pipeline: ToTensor
    starting from
      (TSeries of size 1x1024, TensorCategory(1))
    applying ToTensor gives
      (TSeries of size 1x1024, TensorCategory(1))

Adding the next 3 samples

No before_batch transform to apply

Collating items in a batch

No batch_tfms to apply
dls = dblock.dataloaders(df_train, bs=4)
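
We can sanity-check the batch shapes with one_batch: with bs=4 and each sample being a TSeries of size 1x1024 (per the summary above), the inputs should collate to 4x1x1024.

xb, yb = dls.one_batch()
xb.shape, yb.shape  # expected: (torch.Size([4, 1, 1024]), torch.Size([4]))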

The default show_batch method is not very practical for time series, so let's redefine it on the DataLoader class, as sketched below.
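
One way such a redefinition could look, sketched here with fastai's typedispatch mechanism rather than the library's actual method override (hypothetical code, not the library's exact implementation):

import math
import numpy as np
import matplotlib.pyplot as plt
from fastcore.dispatch import typedispatch

@typedispatch
def show_batch(x: TSeries, y, samples, ctxs=None, max_n=9, **kwargs):
    # Hypothetical sketch: draw each decoded sample as one line per channel,
    # titled with its decoded label.
    n = min(len(samples), max_n)
    rows = math.ceil(n / 3)
    fig, axs = plt.subplots(rows, 3, figsize=(12, 3 * rows))
    for (ts, lbl), ax in zip(samples[:n], axs.flatten()):
        for channel in ts:                      # TSeries is (channels, length)
            ax.plot(np.asarray(channel))
        ax.set_title(str(lbl))
    return ctxs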

dls.show_batch()

A handy function to stack df_train and df_valid together; it adds a column recording which frame each row came from.

stack_train_valid[source]

stack_train_valid(df_train, df_valid)

Stack df_train and df_valid, adding valid_col=True/False for rows from df_valid/df_train
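
A minimal sketch of what this helper does, assuming only the behavior documented above (the actual implementation may differ):

import pandas as pd

def stack_train_valid(df_train, df_valid):
    "Stack df_train and df_valid, adds valid_col=True/False for df_valid/df_train"
    # Tag each row with its origin so a column-based splitter can
    # reconstruct the train/valid split after concatenation.
    return pd.concat([df_train.assign(valid_col=False),
                      df_valid.assign(valid_col=True)]).reset_index(drop=True)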

DataLoaders

A custom TSeries DataLoaders class

class TSDataLoaders[source]

TSDataLoaders(*loaders, path='.', device=None) :: DataLoaders

A TimeSeries DataLoader

Overriding the show_batch function to add grid spacing.

Let's test the DataLoader

TSDataLoaders.from_dfs[source]

TSDataLoaders.from_dfs(df_train, df_valid, path='.', x_cols=None, label_col=None, y_block=None, item_tfms=None, batch_tfms=None, bs=64, val_bs=None, shuffle_train=True, device=None)

Create a DataLoader from a df_train and df_valid
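
A plausible sketch of how from_dfs could be assembled from the pieces above; the helper name ts_dataloaders_from_dfs is hypothetical and the library's real implementation may differ:

def ts_dataloaders_from_dfs(df_train, df_valid, x_cols=None, label_col=None,
                            bs=64, val_bs=None):
    # Stack the frames (adds valid_col) and split on that flag, reusing
    # the same DataBlock recipe shown earlier in this section.
    df = stack_train_valid(df_train, df_valid)
    dblock = DataBlock(blocks=(TSBlock, CategoryBlock),
                       get_x=lambda o: o[x_cols].values.astype(np.float32),
                       get_y=ColReader(label_col),
                       splitter=ColSplitter('valid_col'))
    return dblock.dataloaders(df, bs=bs, val_bs=val_bs)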

dls = TSDataLoaders.from_dfs(df_train, df_test, x_cols=x_cols, label_col='target', bs=16, val_bs=64)
dls.show_batch()

Profiling the DataLoader

len(dls.valid_ds)
8236
def cycle_dl(dl):
    "Iterate once over dl, discarding the batches, to time the data pipeline"
    for x, y in dl:
        pass

It is pretty slow, most likely because each sample is extracted from the DataFrame row by row:

%time cycle_dl(dls.valid)
CPU times: user 122 ms, sys: 354 ms, total: 476 ms
Wall time: 2.15 s
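
To test the assumption that the per-row pandas extraction in get_x dominates the cost, one could time that step in isolation (a hypothetical micro-benchmark reusing df_test and x_cols from above):

%time _ = [row[x_cols].values.astype(np.float32) for _, row in df_test.iterrows()]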