Base class to deal with tabular data and get a DataBunch

Tabular data handling

This module defines the main class to handle tabular data in the fastai library: TabularDataBunch. As always, there is also a helper function to quickly get your data.

To allow you to easily create a Learner for your data, it provides tabular_learner.

class TabularDataBunch[source]

TabularDataBunch(`train_dl`:DataLoader, `valid_dl`:DataLoader, `fix_dl`:DataLoader=`None`, `test_dl`:Optional[DataLoader]=`None`, `device`:device=`None`, `tfms`:Optional[Collection[Callable]]=`None`, `path`:PathOrStr=`'.'`, `collate_fn`:Callable=`'data_collate'`, `no_check`:bool=`False`) :: DataBunch

Create a DataBunch suitable for tabular data.

The best way to quickly get your data in a DataBunch suitable for tabular data is to organize it in two (or three) dataframes. One for training, one for validation, and if you have it, one for testing. Here we are interested in a subsample of the adult dataset.

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
valid_idx = range(len(df)-2000, len(df))
df.head()
|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|--------|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
dep_var = 'salary'

The initialization of TabularDataBunch is the same as that of DataBunch, so you really want to use the factory method instead.


from_df(`path`, `df`:DataFrame, `dep_var`:str, `valid_idx`:Collection[int], `procs`:Optional[Collection[TabularProc]]=`None`, `cat_names`:OptStrList=`None`, `cont_names`:OptStrList=`None`, `classes`:Collection[T_co]=`None`, `test_df`=`None`, `kwargs`) → DataBunch

Create a DataBunch from df and valid_idx with dep_var.

Optionally, use test_df for the test set. The dependent variable is dep_var, while the categorical and continuous variables are in the cat_names columns and cont_names columns respectively. If cont_names is None, all variables that are neither dependent nor categorical are assumed to be continuous. The TabularProcs in procs are applied to the dataframes as preprocessing, then the categories are replaced by their codes+1 (leaving 0 for nan) and the continuous variables are normalized.

Note that the TabularProcessor should be passed as Callable: the actual initialization with cat_names and cont_names is done during the preprocessing.

procs = [FillMissing, Categorify, Normalize]
data = TabularDataBunch.from_df(path, df, dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_names)
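To make the preprocessing concrete, here is a minimal plain-Python sketch (not fastai code; the function names are illustrative) of what the category encoding and normalization described above amount to: categories become their code+1, with 0 reserved for missing values, and continuous columns are rescaled to zero mean and unit variance.

```python
import math

# Illustrative sketch of the preprocessing fastai applies; these helpers
# are assumptions for exposition, not the fastai implementation.
def categorify(col):
    # Map each category to its code+1, reserving 0 for missing values.
    classes = sorted({v for v in col if v is not None})
    code = {c: i + 1 for i, c in enumerate(classes)}
    return [code.get(v, 0) for v in col], classes

def normalize(col):
    # Rescale a continuous column to zero mean and unit variance.
    mean = sum(col) / len(col)
    std = math.sqrt(sum((x - mean) ** 2 for x in col) / len(col))
    return [(x - mean) / std for x in col]

workclass = ['Private', 'Self-emp-inc', None, 'Private']
codes, classes = categorify(workclass)
print(codes)    # [1, 2, 0, 1]
print(classes)  # ['Private', 'Self-emp-inc']
```

Note how the missing value maps to 0, which is why the embedding tables in the model keep an extra row for nan.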

You can then easily create a Learner for this data with tabular_learner.


tabular_learner(`data`:DataBunch, `layers`:Collection[int], `emb_szs`:Dict[str, int]=`None`, `metrics`=`None`, `ps`:Collection[float]=`None`, `emb_drop`:float=`0.0`, `y_range`:OptRange=`None`, `use_bn`:bool=`True`, `kwargs`)

Get a Learner using data, with metrics, including a TabularModel created using the remaining params.

emb_szs is a dict mapping categorical column names to embedding sizes; you only need to pass sizes for columns where you want to override the default behaviour of the model.
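For intuition, the default embedding sizes are derived from each column's cardinality, with per-column overrides from emb_szs winning. A sketch of this lookup logic follows; the exact sizing formula varies between fastai versions, so treat it as an assumption for illustration.

```python
# Sketch of a default embedding-size heuristic (the exact formula may
# differ between fastai versions; it is an assumption for illustration).
def emb_sz_rule(n_cat: int) -> int:
    return min(600, round(1.6 * n_cat ** 0.56))

# Per-column overrides in sz_dict win; otherwise fall back to the rule.
def def_emb_sz(cardinalities, sz_dict=None):
    sz_dict = sz_dict or {}
    return {name: sz_dict.get(name, emb_sz_rule(n))
            for name, n in cardinalities.items()}

print(def_emb_sz({'workclass': 9, 'native-country': 42},
                 {'native-country': 10}))
# {'workclass': 5, 'native-country': 10}
```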

class TabularList[source]

TabularList(`items`:Iterator[T_co], `cat_names`:OptStrList=`None`, `cont_names`:OptStrList=`None`, `procs`=`None`, `kwargs`) → TabularList :: ItemList

Basic ItemList for tabular data.

Basic class to create a list of inputs in items for tabular data. cat_names and cont_names are the names of the categorical and the continuous variables respectively. A processor will be applied to the inputs, or one will be created from the transforms in procs.


from_df(`df`:DataFrame, `cat_names`:OptStrList=`None`, `cont_names`:OptStrList=`None`, `procs`=`None`, `kwargs`) → ItemList

Create a TabularList of inputs from the columns of df.



get_emb_szs(`sz_dict`=`None`)

Return the default embedding sizes suitable for this data or take the ones in sz_dict.


show_xys(`xs`, `ys`)

Show the xs (inputs) and ys (targets).


show_xyzs(`xs`, `ys`, `zs`)

Show xs (inputs), ys (targets) and zs (predictions).

class TabularLine[source]

TabularLine(`cats`, `conts`, `classes`, `names`) :: ItemBase

An object that will contain the encoded cats, the continuous variables conts, the classes and the names of the columns. This is the basic input for a dataset dealing with tabular data.

class TabularProcessor[source]

TabularProcessor(`ds`:ItemBase=`None`, `procs`=`None`) :: PreProcessor

Regroup the procs in one PreProcessor.

Create a PreProcessor from procs.
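Conceptually, regrouping means chaining the procs so they run as a single preprocessing step, each proc receiving the output of the previous one. A minimal sketch (illustrative names and structure, not the fastai internals):

```python
# Illustrative sketch of regrouping several procs into one processor;
# the class name and structure are assumptions, not fastai internals.
class OneProcessor:
    def __init__(self, procs):
        self.procs = list(procs)

    def __call__(self, items):
        # Apply each proc in order, feeding one's output into the next.
        for proc in self.procs:
            items = proc(items)
        return items

fill_missing = lambda xs: [0 if x is None else x for x in xs]
double = lambda xs: [x * 2 for x in xs]

proc = OneProcessor([fill_missing, double])
print(proc([1, None, 3]))  # [2, 0, 6]
```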