Tabular data

Helper functions to get data into a DataLoaders in the tabular application, and the higher-level class TabularDataLoaders

The main class to get your data ready for model training is TabularDataLoaders and its factory methods. Check out the tabular tutorial for examples of use.

TabularDataLoaders

 TabularDataLoaders (*loaders, path:str|pathlib.Path='.', device=None)

Basic wrapper around several DataLoaders with factory methods for tabular data

	Type	Default	Details
loaders	VAR_POSITIONAL		`DataLoader` objects to wrap
path	str | pathlib.Path	.	Path to store export objects
device	NoneType	None	Device to put `DataLoaders` on

This class should not be used directly; prefer one of the factory methods instead. All of these factory methods accept the following arguments:

cat_names: the names of the categorical variables
cont_names: the names of the continuous variables
y_names: the names of the dependent variables
y_block: the TransformBlock to use for the target
valid_idx: the indices to use for the validation set (defaults to a random split)
bs: the batch size
val_bs: the batch size for the validation DataLoader (defaults to bs)
shuffle_train: whether to shuffle the training DataLoader (deprecated, use shuffle)
n: overrides the number of elements in the dataset
device: the PyTorch device to use (defaults to default_device())
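
To make the role of valid_idx concrete, here is a minimal sketch in plain Python of how a list of validation indices partitions the row positions of a dataset. The helper name split_by_valid_idx is hypothetical and this is not fastai's implementation; it only illustrates the idea that every row not listed in valid_idx ends up in the training set.

```python
def split_by_valid_idx(n_rows, valid_idx):
    """Return (train_idx, valid_idx) as sorted lists of row positions.

    Hypothetical helper for illustration only, not part of fastai.
    """
    valid = set(valid_idx)
    # Every row position that is not a validation index goes to training.
    train_idx = [i for i in range(n_rows) if i not in valid]
    return train_idx, sorted(valid)

# A 1000-row dataset with rows 800..999 held out for validation.
train_idx, val_idx = split_by_valid_idx(1000, range(800, 1000))
print(len(train_idx), len(val_idx))  # 800 200
```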

TabularDataLoaders.from_df

 TabularDataLoaders.from_df (df:pd.DataFrame, path:str|Path='.',
                             procs:list=None, cat_names:list=None,
                             cont_names:list=None, y_names:list=None,
                             y_block:TransformBlock=None,
                             valid_idx:list=None, bs:int=64,
                             shuffle_train:bool=None, shuffle:bool=True,
                             val_shuffle:bool=False, n:int=None,
                             device:torch.device=None,
                             drop_last:bool=None, val_bs:int=None)

Create TabularDataLoaders from df in path using procs

	Type	Default	Details
df	pd.DataFrame
path	str | Path	.	Location of `df`, defaults to current working directory
procs	list	None	List of `TabularProc`s
cat_names	list	None	Column names pertaining to categorical variables
cont_names	list	None	Column names pertaining to continuous variables
y_names	list	None	Names of the dependent variables
y_block	TransformBlock	None	`TransformBlock` to use for the target(s)
valid_idx	list	None	List of indices to use for the validation set, defaults to a random split
bs	int	64	Batch size
shuffle_train	bool	None	(Deprecated, use `shuffle`) Shuffle training `DataLoader`
shuffle	bool	True	Shuffle training `DataLoader`
val_shuffle	bool	False	Shuffle validation `DataLoader`
n	int	None	Size of `Datasets` used to create `DataLoader`
device	device	None	Device to put `DataLoaders` on
drop_last	bool	None	Drop last incomplete batch, defaults to `shuffle`
val_bs	int	None	Validation batch size, defaults to `bs`

Let’s look at an example with the adult dataset:

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv', skipinitialspace=True)
df.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
0	49	Private	101320	Assoc-acdm	12.0	Married-civ-spouse	NaN	Wife	White	Female	0	1902	40	United-States	>=50k
1	44	Private	236746	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	>=50k
2	38	Private	96185	HS-grad	NaN	Divorced	NaN	Unmarried	Black	Female	0	0	32	United-States	<50k
3	38	Self-emp-inc	112847	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	>=50k
4	42	Self-emp-not-inc	82297	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	<50k

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]

dls = TabularDataLoaders.from_df(df, path, procs=procs, cat_names=cat_names, cont_names=cont_names, 
                                 y_names="salary", valid_idx=list(range(800,1000)), bs=64)

dls.show_batch()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num	salary
0	Private	HS-grad	Married-civ-spouse	Adm-clerical	Husband	White	False	24.0	121312.998272	9.0	<50k
1	Private	HS-grad	Never-married	Other-service	Not-in-family	White	False	19.0	198320.000325	9.0	<50k
2	Private	Bachelors	Married-civ-spouse	Sales	Husband	White	False	66.0	169803.999308	13.0	>=50k
3	Private	HS-grad	Divorced	Adm-clerical	Unmarried	White	False	40.0	799280.980929	9.0	<50k
4	Local-gov	10th	Never-married	Other-service	Own-child	White	False	18.0	55658.003629	6.0	<50k
5	Private	HS-grad	Never-married	Handlers-cleaners	Other-relative	White	False	30.0	375827.003847	9.0	<50k
6	Private	Some-college	Never-married	Handlers-cleaners	Own-child	White	False	20.0	173723.999335	10.0	<50k
7	?	Some-college	Never-married	?	Own-child	White	False	21.0	107800.997986	10.0	<50k
8	Private	HS-grad	Never-married	Handlers-cleaners	Own-child	White	False	19.0	263338.000072	9.0	<50k
9	Private	Some-college	Married-civ-spouse	Tech-support	Husband	White	False	35.0	194590.999986	10.0	<50k
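
The call above passes 200 validation indices and bs=64, so it can help to work out how many batches that produces. The sketch below assumes the usual PyTorch-style meaning of drop_last (an incomplete final batch is either kept or discarded); the helper num_batches is hypothetical and only does the arithmetic.

```python
import math

def num_batches(n_items, bs, drop_last):
    """Number of batches produced for n_items with batch size bs.

    Hypothetical helper for illustration only, not part of fastai.
    """
    # Dropping the last batch keeps only full batches; otherwise round up.
    return n_items // bs if drop_last else math.ceil(n_items / bs)

n_valid = len(range(800, 1000))  # 200 validation rows, as in the call above
kept = num_batches(n_valid, 64, drop_last=False)
dropped = num_batches(n_valid, 64, drop_last=True)
print(kept)     # 4: three full batches plus one batch of 8 rows
print(dropped)  # 3: the incomplete batch is discarded
```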

TabularDataLoaders.from_csv

 TabularDataLoaders.from_csv (csv:str|Path|io.BufferedReader,
                              skipinitialspace:bool=True,
                              path:str|Path='.', procs:list=None,
                              cat_names:list=None, cont_names:list=None,
                              y_names:list=None,
                              y_block:TransformBlock=None,
                              valid_idx:list=None, bs:int=64,
                              shuffle_train:bool=None, shuffle:bool=True,
                              val_shuffle:bool=False, n:int=None,
                              device:torch.device=None,
                              drop_last:bool=None, val_bs:int=None)

Create TabularDataLoaders from csv file in path using procs

	Type	Default	Details
csv	str | Path | io.BufferedReader		A csv of training data
skipinitialspace	bool	True	Skip spaces after delimiter
path	str | Path	.	Location of `df`, defaults to current working directory
procs	list	None	List of `TabularProc`s
cat_names	list	None	Column names pertaining to categorical variables
cont_names	list	None	Column names pertaining to continuous variables
y_names	list	None	Names of the dependent variables
y_block	TransformBlock	None	`TransformBlock` to use for the target(s)
valid_idx	list	None	List of indices to use for the validation set, defaults to a random split
bs	int	64	Batch size
shuffle_train	bool	None	(Deprecated, use `shuffle`) Shuffle training `DataLoader`
shuffle	bool	True	Shuffle training `DataLoader`
val_shuffle	bool	False	Shuffle validation `DataLoader`
n	int	None	Size of `Datasets` used to create `DataLoader`
device	device	None	Device to put `DataLoaders` on
drop_last	bool	None	Drop last incomplete batch, defaults to `shuffle`
val_bs	int	None	Validation batch size, defaults to `bs`

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, procs=procs, cat_names=cat_names, cont_names=cont_names, 
                                  y_names="salary", valid_idx=list(range(800,1000)), bs=64)

TabularDataLoaders.test_dl

 TabularDataLoaders.test_dl (test_items, rm_type_tfms=None,
                             process:bool=True, inplace:bool=False, bs=16,
                             shuffle=False, after_batch=None,
                             num_workers=0, verbose:bool=False,
                             do_setup:bool=True, pin_memory=False,
                             timeout=0, batch_size=None, drop_last=False,
                             indexed=None, n=None, device=None,
                             persistent_workers=False,
                             pin_memory_device='', wif=None,
                             before_iter=None, after_item=None,
                             before_batch=None, after_iter=None,
                             create_batches=None, create_item=None,
                             create_batch=None, retain=None,
                             get_idxs=None, sample=None, shuffle_fn=None,
                             do_batch=None)

Create test TabDataLoader from test_items using validation procs

	Type	Default	Details
test_items			Items to create new test `TabDataLoader` formatted the same as the training data
rm_type_tfms	NoneType	None	Number of `Transform`s to be removed from `procs`
process	bool	True	Apply validation `TabularProc`s to `test_items` immediately
inplace	bool	False	Keep separate copy of original `test_items` in memory if `False`
bs	int	16	Size of batch
shuffle	bool	False	Whether to shuffle data
after_batch	NoneType	None
num_workers	int	0	Number of worker processes to use in parallel (`0` loads data in the main process)
verbose	bool	False	Whether to print verbose logs
do_setup	bool	True	Whether to run `setup()` for batch transform(s)
pin_memory	bool	False
timeout	int	0
batch_size	NoneType	None
drop_last	bool	False
indexed	NoneType	None
n	NoneType	None
device	NoneType	None
persistent_workers	bool	False
pin_memory_device	str
wif	NoneType	None
before_iter	NoneType	None
after_item	NoneType	None
before_batch	NoneType	None
after_iter	NoneType	None
create_batches	NoneType	None
create_item	NoneType	None
create_batch	NoneType	None
retain	NoneType	None
get_idxs	NoneType	None
sample	NoneType	None
shuffle_fn	NoneType	None
do_batch	NoneType	None

External structured data files can contain unexpected spaces, e.g. after a comma. We can see this in the first row of adult.csv: "49, Private,101320, ...". Such values often need trimming. Pandas has a convenient parameter, skipinitialspace, which TabularDataLoaders.from_csv() exposes. Without it, if the training label was read as " Private", a category label used later for inference, such as workclass:Private, would be wrongly categorized as 0, i.e. "#na#". Let’s test this feature.

test_data = {
    'age': [49], 
    'workclass': ['Private'], 
    'fnlwgt': [101320],
    'education': ['Assoc-acdm'], 
    'education-num': [12.0],
    'marital-status': ['Married-civ-spouse'], 
    'occupation': [''],
    'relationship': ['Wife'],
    'race': ['White'],
}
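test_df = pd.DataFrame(test_data)  # avoid shadowing the built-in `input`
tdl = dls.test_dl(test_df)

# 'Private' must not be encoded as 0, the code used for "#na#"
test_ne(0, tdl.dataset.iloc[0]['workclass'])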
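
The effect of skipinitialspace can also be reproduced with pandas alone. The sketch below reads two rows written in the style of adult.csv, once with and once without trimming, then looks up the label 'Private' in a vocabulary built from each result. The encode helper and the vocabulary layout (with '#na#' at position 0) are simplified assumptions for illustration, not fastai's Categorify implementation.

```python
import io
import pandas as pd

# Two rows in the style of adult.csv, with a stray space after the first comma.
raw = "age,workclass,fnlwgt\n49, Private,101320\n44, Private,236746\n"

untrimmed = pd.read_csv(io.StringIO(raw))                       # keeps the leading space
trimmed = pd.read_csv(io.StringIO(raw), skipinitialspace=True)  # strips it

def encode(label, vocab):
    """Map a label to its position in vocab; unknown labels fall back to 0 ('#na#')."""
    return vocab.index(label) if label in vocab else 0

# Hypothetical vocabularies: '#na#' first, then the labels seen in the data.
vocab_untrimmed = ['#na#'] + sorted(untrimmed['workclass'].unique().tolist())
vocab_trimmed = ['#na#'] + sorted(trimmed['workclass'].unique().tolist())

code_untrimmed = encode('Private', vocab_untrimmed)
code_trimmed = encode('Private', vocab_trimmed)
print(code_untrimmed, code_trimmed)  # 0 1
```

Without trimming, the only known label is ' Private' (with a leading space), so a clean 'Private' at inference time falls back to 0.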