Tabular data

Helper functions to get data into a DataLoaders for the tabular application, and the higher-level class TabularDataLoaders

The main class to get your data ready for model training is TabularDataLoaders, together with its factory methods. Check out the tabular tutorial for usage examples.



TabularDataLoaders


def TabularDataLoaders(
    loaders:VAR_POSITIONAL, # [`DataLoader`](https://docs.fast.ai/data.load.html#dataloader) objects to wrap
    path:str | Path='.', # Path to store export objects
    device:NoneType=None, # Device to put [`DataLoaders`](https://docs.fast.ai/data.core.html#dataloaders)
):

Basic wrapper around several DataLoaders with factory methods for tabular data

This class should not be used directly; prefer one of the factory methods instead. All of these factory methods accept the following arguments:

  • cat_names: the names of the categorical variables
  • cont_names: the names of the continuous variables
  • y_names: the names of the dependent variables
  • y_block: the TransformBlock to use for the target
  • valid_idx: the indices to use for the validation set (defaults to a random split otherwise)
  • bs: the batch size
  • val_bs: the batch size for the validation DataLoader (defaults to bs)
  • shuffle_train: whether to shuffle the training DataLoader (deprecated in favor of shuffle)
  • n: overrides the number of elements in the dataset
  • device: the PyTorch device to use (defaults to default_device())
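
How valid_idx partitions the rows can be sketched in plain Python. This is a minimal illustration, not fastai's implementation: split_indices is a hypothetical helper, and the 20% random hold-out used when valid_idx is not given is an assumption about the default split size.

```python
import random

def split_indices(n_rows, valid_idx=None, valid_pct=0.2, seed=None):
    """Return (train_idx, valid_idx) for a dataset with n_rows rows.

    Hypothetical helper for illustration only; valid_pct=0.2 is an
    assumed default, not taken from the text above.
    """
    if valid_idx is None:
        # No explicit indices: hold out a random fraction of the rows.
        rng = random.Random(seed)
        shuffled = list(range(n_rows))
        rng.shuffle(shuffled)
        n_valid = int(n_rows * valid_pct)
        valid_idx = sorted(shuffled[:n_valid])
    valid_set = set(valid_idx)
    # Every row that is not in the validation set goes to training.
    train_idx = [i for i in range(n_rows) if i not in valid_set]
    return train_idx, list(valid_idx)

# Explicit split: rows 800-999 are used for validation, the rest for training.
train_idx, val_idx = split_indices(1000, valid_idx=list(range(800, 1000)))
print(len(train_idx), len(val_idx))
```

Passing explicit indices makes the split reproducible across runs, which is why the examples on this page use valid_idx=list(range(800,1000)).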


TabularDataLoaders.from_df


def from_df(
    df:pd.DataFrame,
    path:str | Path='.', # Location of `df`, defaults to current working directory
    procs:list=None, # List of [`TabularProc`](https://docs.fast.ai/tabular.core.html#tabularproc)s
    cat_names:list=None, # Column names pertaining to categorical variables
    cont_names:list=None, # Column names pertaining to continuous variables
    y_names:list=None, # Names of the dependent variables
    y_block:TransformBlock=None, # [`TransformBlock`](https://docs.fast.ai/data.block.html#transformblock) to use for the target(s)
    valid_idx:list=None, # List of indices to use for the validation set, defaults to a random split
    bs:int=64, # Batch size
    shuffle_train:bool=None, # (Deprecated, use `shuffle`) Shuffle training [`DataLoader`](https://docs.fast.ai/data.load.html#dataloader)
    shuffle:bool=True, # Shuffle training [`DataLoader`](https://docs.fast.ai/data.load.html#dataloader)
    val_shuffle:bool=False, # Shuffle validation [`DataLoader`](https://docs.fast.ai/data.load.html#dataloader)
    n:int=None, # Size of [`Datasets`](https://docs.fast.ai/data.core.html#datasets) used to create [`DataLoader`](https://docs.fast.ai/data.load.html#dataloader)
    device:torch.device=None, # Device to put [`DataLoaders`](https://docs.fast.ai/data.core.html#dataloaders)
    drop_last:bool=None, # Drop last incomplete batch, defaults to `shuffle`
    val_bs:int=None, # Validation batch size, defaults to `bs`
):

Create TabularDataLoaders from df in path using procs

Let’s look at an example with the adult dataset:

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv', skipinitialspace=True)
df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse NaN Wife White Female 0 1902 40 United-States >=50k
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k
2 38 Private 96185 HS-grad NaN Divorced NaN Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_df(df, path, procs=procs, cat_names=cat_names, cont_names=cont_names, 
                                 y_names="salary", valid_idx=list(range(800,1000)), bs=64)
dls.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary
0 Private HS-grad Married-civ-spouse Adm-clerical Husband White False 24.0 121312.998272 9.0 <50k
1 Private HS-grad Never-married Other-service Not-in-family White False 19.0 198320.000325 9.0 <50k
2 Private Bachelors Married-civ-spouse Sales Husband White False 66.0 169803.999308 13.0 >=50k
3 Private HS-grad Divorced Adm-clerical Unmarried White False 40.0 799280.980929 9.0 <50k
4 Local-gov 10th Never-married Other-service Own-child White False 18.0 55658.003629 6.0 <50k
5 Private HS-grad Never-married Handlers-cleaners Other-relative White False 30.0 375827.003847 9.0 <50k
6 Private Some-college Never-married Handlers-cleaners Own-child White False 20.0 173723.999335 10.0 <50k
7 ? Some-college Never-married ? Own-child White False 21.0 107800.997986 10.0 <50k
8 Private HS-grad Never-married Handlers-cleaners Own-child White False 19.0 263338.000072 9.0 <50k
9 Private Some-college Married-civ-spouse Tech-support Husband White False 35.0 194590.999986 10.0 <50k
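
The education-num_na column in the batch above does not exist in adult.csv; it appears because education-num contains missing values and FillMissing is among the procs. The sketch below shows the general idea in plain Python, assuming the proc fills gaps with the median and records where they were in a boolean column. fill_missing is a hypothetical helper, not the fastai implementation, and it computes the median over just the five rows printed by df.head() rather than over the training set.

```python
from statistics import median

def fill_missing(values):
    """Fill None entries with the median of the observed values.

    Returns (filled_values, na_flags). Hypothetical helper that only
    illustrates the idea of a median fill plus an indicator column.
    """
    observed = [v for v in values if v is not None]
    fill_value = median(observed)
    # Remember which entries were missing before filling them.
    na_flags = [v is None for v in values]
    filled = [fill_value if v is None else v for v in values]
    return filled, na_flags

# education-num for the five rows shown by df.head(); None stands in for NaN.
education_num = [12.0, 14.0, None, 15.0, None]
filled, education_num_na = fill_missing(education_num)
print(filled)
print(education_num_na)
```

The indicator column lets a model distinguish a genuinely observed value from one that was filled in.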


TabularDataLoaders.from_csv


def from_csv(
    csv:str | Path | io.BufferedReader, # A csv of training data
    skipinitialspace:bool=True, # Skip spaces after delimiter
    path:str | Path='.', # Location of `df`, defaults to current working directory
    procs:list=None, # List of [`TabularProc`](https://docs.fast.ai/tabular.core.html#tabularproc)s
    cat_names:list=None, # Column names pertaining to categorical variables
    cont_names:list=None, # Column names pertaining to continuous variables
    y_names:list=None, # Names of the dependent variables
    y_block:TransformBlock=None, # [`TransformBlock`](https://docs.fast.ai/data.block.html#transformblock) to use for the target(s)
    valid_idx:list=None, # List of indices to use for the validation set, defaults to a random split
    bs:int=64, # Batch size
    shuffle_train:bool=None, # (Deprecated, use `shuffle`) Shuffle training [`DataLoader`](https://docs.fast.ai/data.load.html#dataloader)
    shuffle:bool=True, # Shuffle training [`DataLoader`](https://docs.fast.ai/data.load.html#dataloader)
    val_shuffle:bool=False, # Shuffle validation [`DataLoader`](https://docs.fast.ai/data.load.html#dataloader)
    n:int=None, # Size of [`Datasets`](https://docs.fast.ai/data.core.html#datasets) used to create [`DataLoader`](https://docs.fast.ai/data.load.html#dataloader)
    device:torch.device=None, # Device to put [`DataLoaders`](https://docs.fast.ai/data.core.html#dataloaders)
    drop_last:bool=None, # Drop last incomplete batch, defaults to `shuffle`
    val_bs:int=None, # Validation batch size, defaults to `bs`
):

Create TabularDataLoaders from csv file in path using procs

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, procs=procs, cat_names=cat_names, cont_names=cont_names, 
                                  y_names="salary", valid_idx=list(range(800,1000)), bs=64)


TabularDataLoaders.test_dl


def test_dl(
    test_items, # Items to create new test [`TabDataLoader`](https://docs.fast.ai/tabular.core.html#tabdataloader) formatted the same as the training data
    rm_type_tfms:NoneType=None, # Number of `Transform`s to be removed from `procs`
    process:bool=True, # Apply validation [`TabularProc`](https://docs.fast.ai/tabular.core.html#tabularproc)s to `test_items` immediately
    inplace:bool=False, # Keep separate copy of original `test_items` in memory if `False`
    bs:int=16, # Size of batch
    shuffle:bool=False, # Whether to shuffle data
    after_batch:NoneType=None, num_workers:int=0, verbose:bool=False, # Whether to print verbose logs
    do_setup:bool=True, # Whether to run `setup()` for batch transform(s)
    pin_memory:bool=False, timeout:int=0, batch_size:NoneType=None, drop_last:bool=False, indexed:NoneType=None,
    n:NoneType=None, device:NoneType=None, persistent_workers:bool=False, pin_memory_device:str='',
    wif:NoneType=None, before_iter:NoneType=None, after_item:NoneType=None, before_batch:NoneType=None,
    after_iter:NoneType=None, create_batches:NoneType=None, create_item:NoneType=None, create_batch:NoneType=None,
    retain:NoneType=None, get_idxs:NoneType=None, sample:NoneType=None, shuffle_fn:NoneType=None,
    do_batch:NoneType=None
):

Create test TabDataLoader from test_items using validation procs

External structured data files can contain unexpected spaces, e.g. after a comma. We can see this in the first row of adult.csv: "49, Private,101320, ...". Such values often need trimming. Pandas has a convenient parameter, skipinitialspace, which TabularDataLoaders.from_csv() exposes. Without it, a category label used later for inference, such as workclass:Private, would be wrongly encoded as 0, i.e. "#na#", if the training label was read as " Private". Let’s test this feature.

test_data = {
    'age': [49], 
    'workclass': ['Private'], 
    'fnlwgt': [101320],
    'education': ['Assoc-acdm'], 
    'education-num': [12.0],
    'marital-status': ['Married-civ-spouse'], 
    'occupation': [''],
    'relationship': ['Wife'],
    'race': ['White'],
}
test_df = pd.DataFrame(test_data)  # avoid shadowing the built-in `input`
tdl = dls.test_dl(test_df)

test_ne(0, tdl.dataset.iloc[0]['workclass'])
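
Why the leading space matters can be reproduced without fastai. The sketch below uses Python's standard csv module, whose skipinitialspace option plays the same role as the pandas parameter of that name. The vocabulary helpers (build_vocab, encode) are hypothetical stand-ins for category encoding and assume only that index 0 is reserved for "#na#", as described above.

```python
import csv
import io

# One data row in the style of adult.csv, with a space after the first comma.
raw = "age,workclass,fnlwgt\n49, Private,101320\n"

def read_workclass(text, skipinitialspace):
    """Read the workclass column, optionally skipping spaces after delimiters."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=skipinitialspace)
    return [row["workclass"] for row in reader]

def build_vocab(labels):
    """Map each label to an integer code, reserving 0 for '#na#' (assumed convention)."""
    return {label: i for i, label in enumerate(["#na#"] + sorted(set(labels)))}

def encode(label, vocab):
    """Look up a label; anything unseen during training falls back to 0."""
    return vocab.get(label, 0)

untrimmed_vocab = build_vocab(read_workclass(raw, skipinitialspace=False))
trimmed_vocab = build_vocab(read_workclass(raw, skipinitialspace=True))

# The inference-time label has no leading space.
print(encode("Private", untrimmed_vocab))  # 0: only " Private" is in the vocab
print(encode("Private", trimmed_vocab))    # 1: the label is recognized
```

This is the failure that the test_ne check above guards against: with trimming enabled, the encoded workclass of the test row is not 0.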