The data block API

The data block API lets you customize how to create a DataBunch by isolating the underlying parts of that process in separate blocks, mainly:

  • where are the inputs
  • how to split the data into a training and validation set
  • how to label them
  • possible transforms to apply
  • how to wrap it in dataloaders and create the DataBunch

This is a bit longer than using the factory methods but is way more flexible. As usual, we'll begin with end-to-end examples, then switch to the details of each of those parts.

Examples of use

Let's begin with our traditional MNIST example.

path = untar_data(URLs.MNIST_TINY)
tfms = get_transforms(do_flip=False)
path.ls()
[PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train'),
 PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/test'),
 PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/labels.csv'),
 PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/valid'),
 PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/history.csv'),
 PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/models')]
(path/'train').ls()
[PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train/7'),
 PosixPath('/home/ubuntu/.fastai/data/mnist_tiny/train/3')]

In vision.data, we create an easy DataBunch suitable for classification by simply typing:

data = ImageDataBunch.from_folder(path, ds_tfms=tfms, size=24)

This is aimed at data that is in folders following an ImageNet style, with a train and a valid directory, each containing one subdirectory per class, where all the pictures are. With the data block API, the same thing is achieved like this:

data = (ImageItemList.from_folder(path) #Where to find the data? -> in path and its subfolders
        .split_by_folder()              #How to split in train/valid? -> use the folders
        .label_from_folder()            #How to label? -> use the name of the parent folder
        .add_test_folder()              #Optionally add a test set (here default name is test)
        .transform(tfms, size=64)       #Data augmentation? -> use tfms with a size of 64
        .databunch())                   #Finally? -> use the defaults for conversion to ImageDataBunch
data.show_batch(3, figsize=(6,6), hide_axis=False)
data.train_ds[0], data.test_ds.classes
((Image (3, 64, 64), Category 7), ['7', '3'])

Let's look at another example from vision.data with the planet dataset. This time, it's a multi-label classification problem with the labels in a csv file and no given split between valid and train data, so we use a random split. The factory method is:

planet = untar_data(URLs.PLANET_TINY)
planet_tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.)
data = ImageDataBunch.from_csv(planet, folder='train', size=128, suffix='.jpg', sep = ' ', ds_tfms=planet_tfms)

With the data block API we can rewrite this like this:

data = (ImageItemList.from_csv(planet, 'labels.csv', folder='train', suffix='.jpg')
        #Where to find the data? -> in planet 'train' folder
        .random_split_by_pct()
        #How to split in train/valid? -> randomly with the default 20% in valid
        .label_from_df(sep=' ')
        #How to label? -> use the csv file
        .transform(planet_tfms, size=128)
        #Data augmentation? -> use tfms with a size of 128
        .databunch())                          
        #Finally -> use the defaults for conversion to databunch
data.show_batch(rows=2, figsize=(9,7))

The data block API also allows you to use dataset types for which there is no direct ImageDataBunch factory method. For a segmentation task, for instance, we can use it to quickly get a DataBunch. Let's take the example of the camvid dataset. The images are in an 'images' folder and their corresponding mask is in a 'labels' folder.

camvid = untar_data(URLs.CAMVID_TINY)
path_lbl = camvid/'labels'
path_img = camvid/'images'

We have a file that gives us the names of the classes (what each code inside the masks corresponds to: a pedestrian, a tree, a road...).

codes = np.loadtxt(camvid/'codes.txt', dtype=str); codes
array(['Animal', 'Archway', 'Bicyclist', 'Bridge', 'Building', 'Car', 'CartLuggagePram', 'Child', 'Column_Pole',
       'Fence', 'LaneMkgsDriv', 'LaneMkgsNonDriv', 'Misc_Text', 'MotorcycleScooter', 'OtherMoving', 'ParkingBlock',
       'Pedestrian', 'Road', 'RoadShoulder', 'Sidewalk', 'SignSymbol', 'Sky', 'SUVPickupTruck', 'TrafficCone',
       'TrafficLight', 'Train', 'Tree', 'Truck_Bus', 'Tunnel', 'VegetationMisc', 'Void', 'Wall'], dtype='<U17')

And we define the following function that infers the mask filename from the image filename.

get_y_fn = lambda x: path_lbl/f'{x.stem}_P{x.suffix}'
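
To see what this lambda does, here is a quick standalone check (the image filename is illustrative; the `camvid` paths are assumed to match the layout above):

```python
from pathlib import Path

path_lbl = Path('camvid/labels')
# Build the mask path by inserting '_P' between the stem and the suffix.
get_y_fn = lambda x: path_lbl/f'{x.stem}_P{x.suffix}'

mask = get_y_fn(Path('camvid/images/0006R0_f02910.png'))
# An image '.../images/0006R0_f02910.png' maps to the mask
# '.../labels/0006R0_f02910_P.png'
```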

Then we can easily define a DataBunch using the data block API. Here we need to use tfm_y=True in the transform call because we need the same transforms to be applied to the target mask as were applied to the image.

data = (SegmentationItemList.from_folder(path_img)
        .random_split_by_pct()
        .label_from_func(get_y_fn, classes=codes)
        .transform(get_transforms(), tfm_y=True)
        .databunch())
data.show_batch(rows=2, figsize=(7,5))

One last example for object detection. We use our tiny sample of the COCO dataset here. There is a helper function in the library that reads the annotation file and returns the list of image names along with the list of labelled bboxes associated with each. We convert it to a dictionary that maps each image name to its bboxes and then write the function that will give us the target for each image filename.

coco = untar_data(URLs.COCO_TINY)
images, lbl_bbox = get_annotations(coco/'train.json')
img2bbox = dict(zip(images, lbl_bbox))
get_y_func = lambda o:img2bbox[o.name]

The following code is very similar to what we saw before. The only new addition is the use of a special function to collate the samples in batches. This comes from the fact that our images may have multiple bounding boxes, so we need to pad them to the largest number of bounding boxes.
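
The idea behind that padding can be sketched in plain Python (a simplified illustration of the role `bb_pad_collate` plays, not the library implementation; `pad_samples` and its defaults are invented for this sketch):

```python
def pad_samples(samples, pad_box=(0, 0, 0, 0), pad_lbl=0):
    """Pad each (boxes, labels) sample up to the largest box count.

    `samples` is a list of (boxes, labels) pairs where `boxes` is a
    list of 4-tuples and `labels` a list of ints. Batches must be
    rectangular, so shorter samples get dummy boxes and labels.
    """
    max_len = max(len(boxes) for boxes, _ in samples)
    padded = []
    for boxes, labels in samples:
        n_pad = max_len - len(boxes)
        padded.append((list(boxes) + [pad_box] * n_pad,
                       list(labels) + [pad_lbl] * n_pad))
    return padded

batch = pad_samples([([(1, 2, 3, 4)], [1]),
                     ([(0, 0, 1, 1), (2, 2, 3, 3)], [2, 3])])
# Both samples now contain two boxes and two labels.
```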

data = (ObjectItemList.from_folder(coco)
        #Where are the images? -> in coco
        .random_split_by_pct()                          
        #How to split in train/valid? -> randomly with the default 20% in valid
        .label_from_func(get_y_func)
        #How to find the labels? -> use get_y_func
        .transform(get_transforms(), tfm_y=True)
        #Data augmentation? -> Standard transforms with tfm_y=True
        .databunch(bs=16, collate_fn=bb_pad_collate))   
        #Finally we convert to a DataBunch and we use bb_pad_collate
data.show_batch(rows=2, ds_type=DatasetType.Valid, figsize=(6,6))

Provide inputs

The inputs we want to feed our model are gathered in the following class, which also contains the methods to get the corresponding labels.

show_doc(ItemList, title_level=3, doc_string=False)

class ItemList[source]

ItemList(items:Iterator, create_func:Callable=None, path:PathOrStr='.', label_cls:Callable=None, xtra:Any=None, processor:PreProcessor=None)

This class regroups the inputs for our model in items and saves a path attribute which is where it will look for any files (image files, csv file with labels...). create_func is applied to items to get the final output. label_cls will be called to create the labels from the result of the label function, xtra contains additional information (usually an underlying dataframe), and processor is to be applied to the inputs after the splitting and labelling.

from_folder[source]

from_folder(path:PathOrStr, extensions:StrList=None, recurse=True, **kwargs) → ItemList

Get the list of files in path that have a suffix in extensions. recurse determines if we search subfolders.
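
The recursive scan can be pictured with a standalone `pathlib` sketch (this is an illustration of the behaviour, not the fastai code; the helper name and temp layout are invented):

```python
import tempfile
from pathlib import Path

def get_files(path, extensions=None, recurse=True):
    """Return files under `path`, optionally filtered by suffix."""
    pattern = '**/*' if recurse else '*'
    files = [f for f in Path(path).glob(pattern) if f.is_file()]
    if extensions is not None:
        files = [f for f in files if f.suffix.lower() in extensions]
    return sorted(files)

# Build a tiny folder tree to scan.
root = Path(tempfile.mkdtemp())
(root/'train'/'3').mkdir(parents=True)
(root/'train'/'3'/'a.png').touch()
(root/'train'/'3'/'notes.txt').touch()

pngs = get_files(root, extensions={'.png'})
# Only the .png file survives the extension filter.
```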

from_df[source]

from_df(df:DataFrame, path:PathOrStr='.', col:Union[int, Collection[int], str, StrList]=0, **kwargs) → ItemList

Create an ItemList in path from the inputs in the col of df.

from_csv[source]

from_csv(path:PathOrStr, csv_name:str, col:Union[int, Collection[int], str, StrList]=0, header:str='infer', **kwargs) → ItemList

Create an ItemList in path from the inputs in the col of path/csv_name opened with header.

Split the data

The following functions are methods of ItemList that split the data into an ItemLists in different ways.

random_split_by_pct[source]

random_split_by_pct(valid_pct:float=0.2, seed:int=None) → ItemLists

Split the items randomly by putting valid_pct in the validation set. Set the seed in numpy if passed.
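
The splitting logic itself is simple; a minimal stand-in (not the library code, and using the stdlib random module rather than numpy for self-containment) could look like:

```python
import random

def random_split_by_pct(items, valid_pct=0.2, seed=None):
    """Return (train, valid) lists, putting `valid_pct` of items in valid."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(valid_pct * len(items))
    return shuffled[cut:], shuffled[:cut]

train, valid = random_split_by_pct(list(range(100)), valid_pct=0.2, seed=42)
# 80 items land in train, 20 in valid, with no overlap.
```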

split_by_files[source]

split_by_files(valid_names:ItemList) → ItemLists

Split the data by using the names in valid_names for validation.

split_by_fname_file[source]

split_by_fname_file(fname:PathOrStr, path:PathOrStr=None) → ItemLists

Split the data by using the file names in fname for the validation set. path will override self.path.

split_by_folder[source]

split_by_folder(train:str='train', valid:str='valid') → ItemLists

Split the data depending on the folder (train or valid) in which the filenames are.

jekyll_note("This method looks at the folder immediately after `self.path` for `valid` and `train`.")
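
The folder split amounts to partitioning file paths on the name of the directory immediately below the root, which can be sketched like this (an illustration, not the library code):

```python
from pathlib import Path

def split_by_folder(files, root, train='train', valid='valid'):
    """Partition `files` into (train, valid) by their top-level folder."""
    root = Path(root)
    def top(f): return f.relative_to(root).parts[0]
    return ([f for f in files if top(f) == train],
            [f for f in files if top(f) == valid])

files = [Path('data/train/3/a.png'), Path('data/valid/7/b.png')]
trn, val = split_by_folder(files, 'data')
# trn holds the file under 'train', val the one under 'valid'.
```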

split_by_idx[source]

split_by_idx(valid_idx:Collection[int]) → ItemLists

Split the data according to the indexes in valid_idx.

split_by_valid_func[source]

split_by_valid_func(func:Callable) → ItemLists

Split the data by the result of func (which returns True for the validation set).

split_from_df[source]

split_from_df(col:Union[int, Collection[int], str, StrList]=2)

Split the data from the col in the dataframe in self.xtra.

Labelling the inputs

All of the following are methods of ItemList (an ItemLists delegates them to each of its ItemList). Note that some of them are primarily intended for inputs that are filenames and might not work in general situations.

label_from_list[source]

label_from_list(labels:Iterator, label_cls:Callable=None, template:Callable=None, **kwargs) → LabelList

Label self.items with labels using label_cls and optionally template.

label_from_df[source]

label_from_df(cols:Union[int, Collection[int], str, StrList]=1, sep=None, **kwargs)

Label self.items from the values in cols in self.xtra. If sep is passed, will split the labels accordingly.
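
The sep behaviour, used for the multi-label planet data above, boils down to splitting each label string into a list. A rough pandas-free sketch (the column names mirror the planet csv but the helper is invented):

```python
rows = {'image_name': ['train_0', 'train_1'],
        'tags': ['haze primary', 'clear']}

def labels_from_col(values, sep=None):
    """Return one label per row, or a list of labels per row if `sep`."""
    if sep is None:
        return list(values)
    return [v.split(sep) for v in values]

labels = labels_from_col(rows['tags'], sep=' ')
# [['haze', 'primary'], ['clear']] -> a multi-label target per image
```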

label_const[source]

label_const(const:Any=0, **kwargs) → LabelList

Label every item with const.

label_from_folder[source]

label_from_folder(**kwargs) → LabelList

Give a label to each filename depending on its folder.
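
For filename inputs, this boils down to taking the parent directory's name as the label, as in the MNIST example above (a minimal stand-in, not the library code):

```python
from pathlib import Path

def label_from_folder(files):
    """Use each file's immediate parent folder name as its label."""
    return [f.parent.name for f in files]

labels = label_from_folder([Path('train/3/a.png'), Path('train/7/b.png')])
# ['3', '7']
```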

label_from_func[source]

label_from_func(func:Callable, **kwargs) → LabelList

Apply func to every input to get its label.

This method is primarily intended for inputs that are filenames, but could work in other settings.

label_from_re[source]

label_from_re(pat:str, full_path:bool=False, **kwargs) → LabelList

Apply the re in pat to determine the label of every filename. If full_path, search in the full name.
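
A typical use is pulling the class out of the filename itself, e.g. for pet images named like `german_shorthaired_105.jpg` (an illustrative sketch; the filename, pattern, and helper are assumptions, not the library code):

```python
import re
from pathlib import Path

def label_from_re(files, pat, full_path=False):
    """Label each file by the first capture group of `pat` in its name."""
    regex = re.compile(pat)
    labels = []
    for f in files:
        name = str(f) if full_path else f.name
        labels.append(regex.search(name).group(1))
    return labels

labels = label_from_re([Path('imgs/german_shorthaired_105.jpg')],
                       pat=r'^(.*)_\d+\.jpg$')
# ['german_shorthaired']
```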

class LabelList[source]

LabelList(x:ItemList, y:ItemList, tfms:Union[Callable, Collection[Callable]]=None, tfm_y:bool=False, **kwargs) :: Dataset

The basic dataset in fastai. Inputs are in x, targets in y. Optionally apply tfms to x and also y if tfm_y is True.

process[source]

process(xp=None, yp=None)

Launch the preprocessing on xp and yp.

transform[source]

transform(tfms:Union[Callable, Collection[Callable]], tfm_y:bool=None, **kwargs)

Set the tfms and tfm_y values to be applied to the inputs and targets.

class ItemLists[source]

ItemLists(path:PathOrStr, train:ItemList, valid:ItemList, test:ItemList=None)

Data in path split between several streams of inputs, train, valid and maybe test.

class LabelLists[source]

LabelLists(path:PathOrStr, train:ItemList, valid:ItemList, test:ItemList=None) :: ItemLists

Data in path split between several streams of inputs/targets, train, valid and maybe test.

add_test[source]

add_test(items:Iterator, label:Any=None)

Add a test set containing items from items and an arbitrary label.

add_test_folder[source]

add_test_folder(test_folder:str='test', label:Any=None)

Add a test set containing items from folder test_folder and an arbitrary label.

databunch[source]

databunch(path:PathOrStr=None, **kwargs) → ImageDataBunch

Create a DataBunch from self. path will override self.path; kwargs are passed to DataBunch.create.

Preprocessing

Preprocessing is a step that happens after the data has been split and labelled, where the inputs and targets go through a bunch of PreProcessor.

class PreProcessor[source]

PreProcessor()

Basic class that regroups the functions applied to the train set, recording an inner state (vocabulary, statistics of transforms) that it will keep when applied to the validation and maybe the test set.
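
As a concrete picture of this idea, here is a toy processor that builds a label vocabulary on the training set and reuses it unchanged afterwards (a simplified sketch, not the fastai class):

```python
class CategoryProcessor:
    """Record a label vocabulary on the train set, reuse it on valid/test."""
    def __init__(self):
        self.vocab = None

    def process(self, labels):
        if self.vocab is None:                 # first call: the training set
            self.vocab = sorted(set(labels))
        idx = {c: i for i, c in enumerate(self.vocab)}
        return [idx[l] for l in labels]

proc = CategoryProcessor()
train_ids = proc.process(['cat', 'dog', 'cat'])   # builds the vocab
valid_ids = proc.process(['dog'])                 # reuses the same vocab
# vocab is ['cat', 'dog'], so train_ids == [0, 1, 0] and valid_ids == [1]
```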