Data block

High-level API to quickly get your data into a DataLoaders

source

TransformBlock

 TransformBlock (type_tfms:list=None, item_tfms:list=None,
                 batch_tfms:list=None,
                 dl_type:fastai.data.core.TfmdDL=None,
                 dls_kwargs:dict=None)

A basic wrapper that links default transforms for the data block API

Type Default Details
type_tfms list None One or more Transforms
item_tfms list None ItemTransforms, applied on an item
batch_tfms list None Transforms or RandTransforms, applied by batch
dl_type TfmdDL None Task specific TfmdDL, defaults to TfmdDL
dls_kwargs dict None Additional arguments to be passed to DataLoaders
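
For instance, fastai's ImageBlock is essentially a TransformBlock with image-specific defaults. A minimal sketch of an equivalent custom block, assuming the usual fastai.vision star imports are in scope (the name BWImageBlock is illustrative):

def BWImageBlock():
    "Open items as black & white PIL images; convert ints to floats at batch time"
    return TransformBlock(type_tfms=PILImageBW.create, batch_tfms=IntToFloatTensor)

This is roughly what ImageBlock(cls=PILImageBW), used in the MNIST example below, expands to.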

source

CategoryBlock

 CategoryBlock (vocab:collections.abc.MutableSequence|pandas.core.series.Series=None,
                sort:bool=True, add_na:bool=False)

TransformBlock for single-label categorical targets

Type Default Details
vocab MutableSequence | pd.Series None List of unique class names
sort bool True Sort the classes alphabetically
add_na bool False Add #na# to vocab
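
The block's type transform is Categorize, which maps class names to integer indices once its vocab is known. A quick sketch of that mapping in isolation:

tfm = Categorize(vocab=['cat', 'dog'])
test_eq(tfm('dog'), TensorCategory(1))         # encode: label -> index
test_eq(tfm.decode(TensorCategory(1)), 'dog')  # decode: index -> label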

source

MultiCategoryBlock

 MultiCategoryBlock (encoded:bool=False,
                     vocab:collections.abc.MutableSequence|pandas.core.series.Series=None,
                     add_na:bool=False)

TransformBlock for multi-label categorical targets

Type Default Details
encoded bool False Whether the data comes in one-hot encoded format
vocab MutableSequence | pd.Series None List of unique class names
add_na bool False Add #na# to vocab
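
Here the type transforms are MultiCategorize (one index per label) followed by one-hot encoding; with encoded=True the targets are assumed to already be one-hot. A quick sketch of the first step:

tfm = MultiCategorize(vocab=['cat', 'dog', 'fish'])
test_eq(tfm(['cat', 'fish']), TensorMultiCategory([0, 2]))  # labels -> indices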

source

RegressionBlock

 RegressionBlock (n_out:int=None)

TransformBlock for float targets

Type Default Details
n_out int None Number of output values
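
The block's type transform is RegressionSetup, which casts targets to float tensors. A quick sketch:

tfm = RegressionSetup()
test_eq(tfm(3), tensor(3.))  # int target -> float tensor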

General API

# For the examples below, so not exported
from fastai.vision.core import *
from fastai.vision.data import *

source

DataBlock

 DataBlock (blocks:list=None, dl_type:TfmdDL=None, getters:list=None,
            n_inp:int=None, item_tfms:list=None, batch_tfms:list=None,
            get_items=None, splitter=None, get_y=None, get_x=None)

Generic container to quickly build Datasets and DataLoaders.

Type Default Details
blocks list None One or more TransformBlocks
dl_type TfmdDL None Task specific TfmdDL, defaults to block’s dl_type or TfmdDL
getters list None Getter functions applied to results of get_items
n_inp int None Number of inputs
item_tfms list None ItemTransforms, applied on an item
batch_tfms list None Transforms or RandTransforms, applied by batch
get_items NoneType None
splitter NoneType None
get_y NoneType None
get_x NoneType None

To build a DataBlock you need to give the library four things: the types of your input/labels, and at least two functions: get_items and splitter. You may also need to include get_x and get_y or a more generic list of getters that are applied to the results of get_items.

splitter is a callable which, when called with items, returns a tuple of iterables representing the indices of the training and validation data.
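
fastai ships ready-made splitters (RandomSplitter, GrandparentSplitter, IndexSplitter, …), but any callable with this contract works. A minimal hand-rolled sketch:

def last_fifth_splitter(items):
    "Send the first 80% of items to training, the rest to validation"
    cut = int(0.8 * len(items))
    return list(range(cut)), list(range(cut, len(items)))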

Once those are provided, you automatically get a Datasets or a DataLoaders:


source

DataBlock.datasets

 DataBlock.datasets (source, verbose:bool=False)

Create a Datasets object from source

Type Default Details
source The data source
verbose bool False Show verbose messages
Returns Datasets
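
Even an empty DataBlock is usable as a smoke test: with no blocks or getters, each item is simply wrapped into an (input, target) tuple and split randomly 80/20 by RandomSplitter. A sketch:

dblock = DataBlock()
dsets = dblock.datasets(list(range(10)))
test_eq(len(dsets.train) + len(dsets.valid), 10)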

source

DataBlock.dataloaders

 DataBlock.dataloaders (source, path:str='.', verbose:bool=False,
                        bs:int=64, shuffle:bool=False,
                        num_workers:int=None, do_setup:bool=True,
                        pin_memory=False, timeout=0, batch_size=None,
                        drop_last=False, indexed=None, n=None,
                        device=None, persistent_workers=False,
                        pin_memory_device='', wif=None, before_iter=None,
                        after_item=None, before_batch=None,
                        after_batch=None, after_iter=None,
                        create_batches=None, create_item=None,
                        create_batch=None, retain=None, get_idxs=None,
                        sample=None, shuffle_fn=None, do_batch=None)

Create a DataLoaders object from source

Type Default Details
source The data source
path str . Data source and default Learner path
verbose bool False Show verbose messages
bs int 64 Size of batch
shuffle bool False Whether to shuffle data
num_workers int None Number of CPU cores to use in parallel (default: all available, up to 16)
do_setup bool True Whether to run setup() for batch transform(s)
pin_memory bool False
timeout int 0
batch_size NoneType None
drop_last bool False
indexed NoneType None
n NoneType None
device NoneType None
persistent_workers bool False
pin_memory_device str
wif NoneType None
before_iter NoneType None
after_item NoneType None
before_batch NoneType None
after_batch NoneType None
after_iter NoneType None
create_batches NoneType None
create_item NoneType None
create_batch NoneType None
retain NoneType None
get_idxs NoneType None
sample NoneType None
shuffle_fn NoneType None
do_batch NoneType None
Returns DataLoaders

You can create a DataBlock by passing functions:

mnist = DataBlock(blocks = (ImageBlock(cls=PILImageBW),CategoryBlock),
                  get_items = get_image_files,
                  splitter = GrandparentSplitter(),
                  get_y = parent_label)
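
From the same block, dataloaders builds the DataLoaders directly (MNIST_TINY is the small sample dataset used throughout this page):

dls = mnist.dataloaders(untar_data(URLs.MNIST_TINY), bs=32)
dls.show_batch(max_n=9, figsize=(4,4))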

Each type comes with default transforms that will be applied:

  • at the base level to create items in a tuple (usually input,target) from the base elements (like filenames)
  • at the item level of the datasets
  • at the batch level

They are called, respectively, type transforms, item transforms, and batch transforms. In the case of MNIST, the type transforms are the method to create a PILImageBW (for the input) and the Categorize transform (for the target), the item transform is ToTensor, and the batch transform is IntToFloatTensor, as the tests below confirm. You can add any other transforms by passing them to DataBlock.datasets or DataBlock.dataloaders.

test_eq(mnist.type_tfms[0], [PILImageBW.create])
test_eq(mnist.type_tfms[1].map(type), [Categorize])
test_eq(mnist.default_item_tfms.map(type), [ToTensor])
test_eq(mnist.default_batch_tfms.map(type), [IntToFloatTensor])
dsets = mnist.datasets(untar_data(URLs.MNIST_TINY))
test_eq(dsets.vocab, ['3', '7'])
x,y = dsets.train[0]
test_eq(x.size,(28,28))
show_at(dsets.train, 0, cmap='Greys', figsize=(2,2));
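
As noted above, extra transforms can also be declared when the block is built; a sketch using the standard Resize and Normalize transforms:

mnist_aug = DataBlock(blocks=(ImageBlock(cls=PILImageBW), CategoryBlock),
                      get_items=get_image_files,
                      splitter=GrandparentSplitter(),
                      get_y=parent_label,
                      item_tfms=Resize(32),    # applied to each item
                      batch_tfms=Normalize())  # applied to each batch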

test_fail(lambda: DataBlock(wrong_kwarg=42, wrong_kwarg2='foo'))

We can pass any number of blocks to DataBlock, then define which are the input and target blocks by changing n_inp. For example, setting n_inp=2 treats the first two blocks passed as inputs and the others as targets.

mnist = DataBlock((ImageBlock, ImageBlock, CategoryBlock), get_items=get_image_files, splitter=GrandparentSplitter(),
                   get_y=parent_label)
dsets = mnist.datasets(untar_data(URLs.MNIST_TINY))
test_eq(mnist.n_inp, 2)
test_eq(len(dsets.train[0]), 3)
test_fail(lambda: DataBlock((ImageBlock, ImageBlock, CategoryBlock), get_items=get_image_files, splitter=GrandparentSplitter(),
                  get_y=[parent_label, noop],
                  n_inp=2), msg='get_y contains 2 functions, but must contain 1 (one for each output)')
mnist = DataBlock((ImageBlock, ImageBlock, CategoryBlock), get_items=get_image_files, splitter=GrandparentSplitter(),
                  n_inp=1,
                  get_y=[noop, Pipeline([noop, parent_label])])
dsets = mnist.datasets(untar_data(URLs.MNIST_TINY))
test_eq(len(dsets.train[0]), 3)

Debugging


source

DataBlock.summary

 DataBlock.summary (source, bs:int=4, show_batch:bool=False, **kwargs)

Steps through the transform pipeline for one batch, and optionally calls show_batch(**kwargs) on the transient DataLoaders.

Type Default Details
source The data source
bs int 4 The batch size
show_batch bool False Call show_batch after the summary
kwargs

Besides stepping through the transforms, summary() provides a shortcut to dls.show_batch(...) so you can inspect the data. E.g.

pets.summary(path/"images", bs=8, show_batch=True, unique=True,...)

is a shortcut to:

pets.summary(path/"images", bs=8)
dls = pets.dataloaders(path/"images", bs=8)
dls.show_batch(unique=True,...)  # See the effect of different tfms on the same image.