Data block

High-level API to quickly get your data into a DataLoaders

source

TransformBlock

 TransformBlock (type_tfms:list=None, item_tfms:list=None,
                 batch_tfms:list=None,
                 dl_type:fastai.data.core.TfmdDL=None,
                 dls_kwargs:dict=None)

A basic wrapper that links default transforms for the data block API

Type Default Details
type_tfms list None One or more Transforms
item_tfms list None ItemTransforms, applied on an item
batch_tfms list None Transforms or RandTransforms, applied by batch
dl_type TfmdDL None Task specific TfmdDL, defaults to TfmdDL
dls_kwargs dict None Additional arguments to be passed to DataLoaders
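
For instance, fastai's ImageBlock is essentially a TransformBlock with image-specific defaults. A minimal sketch of an equivalent custom block, assuming the usual fastai.vision star imports are in scope (the name BWImageBlock is illustrative):

def BWImageBlock():
    "Open items as black & white PIL images; convert ints to floats at batch time"
    return TransformBlock(type_tfms=PILImageBW.create, batch_tfms=IntToFloatTensor)

This is roughly what ImageBlock(cls=PILImageBW), used in the MNIST example below, expands to.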

source

CategoryBlock

 CategoryBlock (vocab:collections.abc.MutableSequence|pandas.core.series.Series=None,
                sort:bool=True, add_na:bool=False)

TransformBlock for single-label categorical targets

Type Default Details
vocab MutableSequence | pd.Series None List of unique class names
sort bool True Sort the classes alphabetically
add_na bool False Add #na# to vocab
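
The block's type transform is Categorize, which maps class names to integer indices once its vocab is known. A quick sketch of that mapping in isolation:

tfm = Categorize(vocab=['cat', 'dog'])
test_eq(tfm('dog'), TensorCategory(1))         # encode: label -> index
test_eq(tfm.decode(TensorCategory(1)), 'dog')  # decode: index -> label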

source

MultiCategoryBlock

 MultiCategoryBlock (encoded:bool=False,
                     vocab:collections.abc.MutableSequence|pandas.core.series.Series=None,
                     add_na:bool=False)

TransformBlock for multi-label categorical targets

Type Default Details
encoded bool False Whether the data comes in one-hot encoded format
vocab MutableSequence | pd.Series None List of unique class names
add_na bool False Add #na# to vocab
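
Here the type transforms are MultiCategorize (one index per label) followed by one-hot encoding; with encoded=True the targets are assumed to already be one-hot. A quick sketch of the first step:

tfm = MultiCategorize(vocab=['cat', 'dog', 'fish'])
test_eq(tfm(['cat', 'fish']), TensorMultiCategory([0, 2]))  # labels -> indices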

source

RegressionBlock

 RegressionBlock (n_out:int=None)

TransformBlock for float targets

Type Default Details
n_out int None Number of output values
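
The block's type transform is RegressionSetup, which casts targets to float tensors. A quick sketch:

tfm = RegressionSetup()
test_eq(tfm(3), tensor(3.))  # int target -> float tensor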

General API

# For the examples below, so not exported
from fastai.vision.core import *
from fastai.vision.data import *

source

DataBlock

 DataBlock (blocks:list=None, dl_type:TfmdDL=None, getters:list=None,
            n_inp:int=None, item_tfms:list=None, batch_tfms:list=None,
            get_items=None, splitter=None, get_y=None, get_x=None)

Generic container to quickly build Datasets and DataLoaders.

Type Default Details
blocks list None One or more TransformBlocks
dl_type TfmdDL None Task specific TfmdDL, defaults to block’s dl_type or TfmdDL
getters list None Getter functions applied to results of get_items
n_inp int None Number of inputs
item_tfms list None ItemTransforms, applied on an item
batch_tfms list None Transforms or RandTransforms, applied by batch
get_items NoneType None
splitter NoneType None
get_y NoneType None
get_x NoneType None

To build a DataBlock you need to give the library four things: the types of your input/labels, and at least two functions: get_items and splitter. You may also need to include get_x and get_y or a more generic list of getters that are applied to the results of get_items.

splitter is a callable which, when called with items, returns a tuple of iterables representing the indices of the training and validation data.
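
fastai ships ready-made splitters (RandomSplitter, GrandparentSplitter, IndexSplitter, …), but any callable with this contract works. A minimal hand-rolled sketch:

def last_fifth_splitter(items):
    "Send the first 80% of items to training, the rest to validation"
    cut = int(0.8 * len(items))
    return list(range(cut)), list(range(cut, len(items)))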

Once those are provided, you automatically get a Datasets or a DataLoaders:


source

DataBlock.datasets

 DataBlock.datasets (source, verbose:bool=False)

Create a Datasets object from source

Type Default Details
source The data source
verbose bool False Show verbose messages
Returns Datasets
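
Even an empty DataBlock is usable as a smoke test: with no blocks or getters, each item is simply wrapped into an (input, target) tuple and split randomly 80/20 by RandomSplitter. A sketch:

dblock = DataBlock()
dsets = dblock.datasets(list(range(10)))
test_eq(len(dsets.train) + len(dsets.valid), 10)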

source

DataBlock.dataloaders

 DataBlock.dataloaders (source, path:str='.', verbose:bool=False,
                        bs:int=64, shuffle:bool=False,
                        num_workers:int=None, do_setup:bool=True,
                        pin_memory=False, timeout=0, batch_size=None,
                        drop_last=False, indexed=None, n=None,
                        device=None, persistent_workers=False,
                        pin_memory_device='', wif=None, before_iter=None,
                        after_item=None, before_batch=None,
                        after_batch=None, after_iter=None,
                        create_batches=None, create_item=None,
                        create_batch=None, retain=None, get_idxs=None,
                        sample=None, shuffle_fn=None, do_batch=None)

Create a DataLoaders object from source

Type Default Details
source The data source
path str . Data source and default Learner path
verbose bool False Show verbose messages
bs int 64 Size of batch
shuffle bool False Whether to shuffle data
num_workers int None Number of CPU cores to use in parallel (default: all available, up to 16)
do_setup bool True Whether to run setup() for batch transform(s)
pin_memory bool False
timeout int 0
batch_size NoneType None
drop_last bool False
indexed NoneType None
n NoneType None
device NoneType None
persistent_workers bool False
pin_memory_device str
wif NoneType None
before_iter NoneType None
after_item NoneType None
before_batch NoneType None
after_batch NoneType None
after_iter NoneType None
create_batches NoneType None
create_item NoneType None
create_batch NoneType None
retain NoneType None
get_idxs NoneType None
sample NoneType None
shuffle_fn NoneType None
do_batch NoneType None
Returns DataLoaders

You can create a DataBlock by passing functions:

mnist = DataBlock(blocks = (ImageBlock(cls=PILImageBW),CategoryBlock),
                  get_items = get_image_files,
                  splitter = GrandparentSplitter(),
                  get_y = parent_label)
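
From the same block, dataloaders builds the DataLoaders directly (MNIST_TINY is the small sample dataset used throughout this page):

dls = mnist.dataloaders(untar_data(URLs.MNIST_TINY), bs=32)
dls.show_batch(max_n=9, figsize=(4,4))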

Each type comes with default transforms that will be applied:

  • at the base level to create items in a tuple (usually input,target) from the base elements (like filenames)
  • at the item level of the datasets
  • at the batch level

They are called, respectively, type transforms, item transforms, and batch transforms. In the case of MNIST, the type transforms are the method to create a PILImageBW (for the input) and the Categorize transform (for the target), the item transform is ToTensor, and the batch transform is IntToFloatTensor, as the tests below confirm. You can add any other transforms by passing them to DataBlock.datasets or DataBlock.dataloaders.

test_eq(mnist.type_tfms[0], [PILImageBW.create])
test_eq(mnist.type_tfms[1].map(type), [Categorize])
test_eq(mnist.default_item_tfms.map(type), [ToTensor])
test_eq(mnist.default_batch_tfms.map(type), [IntToFloatTensor])
dsets = mnist.datasets(untar_data(URLs.MNIST_TINY))
test_eq(dsets.vocab, ['3', '7'])
x,y = dsets.train[0]
test_eq(x.size,(28,28))
show_at(dsets.train, 0, cmap='Greys', figsize=(2,2));
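
As noted above, extra transforms can also be declared when the block is built; a sketch using the standard Resize and Normalize transforms:

mnist_aug = DataBlock(blocks=(ImageBlock(cls=PILImageBW), CategoryBlock),
                      get_items=get_image_files,
                      splitter=GrandparentSplitter(),
                      get_y=parent_label,
                      item_tfms=Resize(32),    # applied to each item
                      batch_tfms=Normalize())  # applied to each batch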

test_fail(lambda: DataBlock(wrong_kwarg=42, wrong_kwarg2='foo'))

We can pass any number of blocks to DataBlock, then define which are the input and target blocks by changing n_inp. For example, setting n_inp=2 treats the first two blocks passed as inputs and the others as targets.

mnist = DataBlock((ImageBlock, ImageBlock, CategoryBlock), get_items=get_image_files, splitter=GrandparentSplitter(),
                   get_y=parent_label)
dsets = mnist.datasets(untar_data(URLs.MNIST_TINY))
test_eq(mnist.n_inp, 2)
test_eq(len(dsets.train[0]), 3)
test_fail(lambda: DataBlock((ImageBlock, ImageBlock, CategoryBlock), get_items=get_image_files, splitter=GrandparentSplitter(),
                  get_y=[parent_label, noop],
                  n_inp=2), msg='get_y contains 2 functions, but must contain 1 (one for each output)')
mnist = DataBlock((ImageBlock, ImageBlock, CategoryBlock), get_items=get_image_files, splitter=GrandparentSplitter(),
                  n_inp=1,
                  get_y=[noop, Pipeline([noop, parent_label])])
dsets = mnist.datasets(untar_data(URLs.MNIST_TINY))
test_eq(len(dsets.train[0]), 3)

Debugging


source

DataBlock.summary

 DataBlock.summary (source, bs:int=4, show_batch:bool=False, **kwargs)

Steps through the transform pipeline for one batch, and optionally calls show_batch(**kwargs) on the transient DataLoaders.

Type Default Details
source The data source
bs int 4 The batch size
show_batch bool False Call show_batch after the summary
kwargs

Besides stepping through the transforms, summary() provides a shortcut to dls.show_batch(...) so you can inspect the data. E.g.

pets.summary(path/"images", bs=8, show_batch=True, unique=True,...)

is a shortcut to:

pets.summary(path/"images", bs=8)
dls = pets.dataloaders(path/"images", bs=8)
dls.show_batch(unique=True,...)  # See the effect of different tfms on the same image.