Helper functions to get data into a `DataLoaders` for the tabular application, and the higher-level class `TabularDataLoaders`

The main class to get your data ready for model training is TabularDataLoaders, along with its factory methods. Check out the tabular tutorial for usage examples.

class TabularDataLoaders

TabularDataLoaders(*loaders, path='.', device=None) :: DataLoaders

Basic wrapper around several DataLoaders with factory methods for tabular data

This class should not be used directly; prefer one of the factory methods instead. All of these factory methods accept the following arguments:

  • cat_names: the names of the categorical variables
  • cont_names: the names of the continuous variables
  • y_names: the names of the dependent variables
  • y_block: the TransformBlock to use for the target
  • valid_idx: the indices to use for the validation set (defaults to a random split otherwise)
  • bs: the batch size
  • val_bs: the batch size for the validation DataLoader (defaults to bs)
  • shuffle_train: whether to shuffle the training DataLoader
  • n: overrides the number of elements in the dataset
  • device: the PyTorch device to use (defaults to default_device())
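
The valid_idx argument takes an explicit list of row indices. As a minimal sketch, a reproducible random hold-out set can be built with the standard library alone. The helper name random_valid_idx and its valid_pct and seed parameters are hypothetical, chosen for illustration; they are not part of fastai.

```python
import random

def random_valid_idx(n_rows, valid_pct=0.2, seed=42):
    """Return a sorted list of row indices to hold out for validation.

    Hypothetical helper for illustration only; not a fastai function.
    """
    rng = random.Random(seed)          # local generator, so the split is reproducible
    idx = list(range(n_rows))
    rng.shuffle(idx)                   # random order of all row indices
    n_valid = int(n_rows * valid_pct)  # size of the hold-out set
    return sorted(idx[:n_valid])

# Hold out 20% of a 1000-row table
valid_idx = random_valid_idx(1000)
print(len(valid_idx))
```

The resulting list could then be passed as valid_idx=valid_idx to any of the factory methods, in place of a fixed range of rows.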

TabularDataLoaders.from_df

TabularDataLoaders.from_df(df, path='.', procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, valid_idx=None, bs=64, val_bs=None, shuffle_train=True, n=None, device=None)

Create from df in path using procs

Let's look at an example with the adult dataset:

from fastai.tabular.all import *  # provides untar_data, URLs, pd and the tabular classes

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv', skipinitialspace=True)
df.head()

|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_df(df, path, procs=procs, cat_names=cat_names, cont_names=cont_names, 
                                 y_names="salary", valid_idx=list(range(800,1000)), bs=64)
dls.show_batch()

|   | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | 11th | Separated | Adm-clerical | Unmarried | Black | False | 55.0 | 213894.000562 | 7.0 | <50k |
| 1 | Private | HS-grad | Married-civ-spouse | Machine-op-inspct | Husband | White | False | 53.0 | 228500.001385 | 9.0 | >=50k |
| 2 | Private | HS-grad | Married-civ-spouse | Tech-support | Husband | White | False | 38.0 | 256864.000909 | 9.0 | >=50k |
| 3 | Private | Bachelors | Married-civ-spouse | Tech-support | Husband | White | False | 40.0 | 247879.997190 | 13.0 | >=50k |
| 4 | Private | Some-college | Divorced | Craft-repair | Not-in-family | White | False | 41.0 | 40151.001925 | 10.0 | >=50k |
| 5 | Private | HS-grad | Married-civ-spouse | Sales | Husband | White | False | 37.0 | 110713.001599 | 9.0 | >=50k |
| 6 | Private | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | False | 38.0 | 278924.000902 | 13.0 | >=50k |
| 7 | Self-emp-not-inc | 11th | Married-civ-spouse | Farming-fishing | Husband | White | False | 60.0 | 220341.999356 | 7.0 | <50k |
| 8 | ? | 9th | Never-married | ? | Not-in-family | White | False | 30.0 | 104965.001013 | 5.0 | <50k |
| 9 | ? | HS-grad | Never-married | ? | Not-in-family | White | False | 21.0 | 105311.997415 | 9.0 | <50k |

TabularDataLoaders.from_csv

TabularDataLoaders.from_csv(csv, skipinitialspace=True, path='.', procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, valid_idx=None, bs=64, val_bs=None, shuffle_train=True, n=None, device=None)

Create from csv file in path using procs

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, procs=procs, cat_names=cat_names, cont_names=cont_names, 
                                  y_names="salary", valid_idx=list(range(800,1000)), bs=64)

External structured data files can contain unexpected spaces, e.g. after a comma. We can see this in the first row of adult.csv: "49, Private,101320, ...". Such values often need trimming. Pandas has a convenient parameter, skipinitialspace, which TabularDataLoaders.from_csv() exposes. Without it, a category label used later for inference, such as workclass:Private, would be wrongly categorized as 0 (or "#na#") if the training label was read as " Private". Let's test this feature.

test_data = {
    'age': [49], 
    'workclass': ['Private'], 
    'fnlwgt': [101320],
    'education': ['Assoc-acdm'], 
    'education-num': [12.0],
    'marital-status': ['Married-civ-spouse'], 
    'occupation': [''],
    'relationship': ['Wife'],
    'race': ['White'],
}
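test_df = pd.DataFrame(test_data)  # renamed to avoid shadowing the built-in input()
tdl = dls.test_dl(test_df)

# 'Private' must not be encoded as 0, the code reserved for unknown labels
test_ne(0, tdl.dataset.iloc[0]['workclass'])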
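
The label mismatch described above can be reproduced without fastai or pandas. The sketch below uses the standard-library csv module, which has an option with the same name, skipinitialspace. The build_vocab and encode helpers are hypothetical, simplified stand-ins for categorical encoding, not fastai's actual implementation; the only assumption carried over from the text is that unknown labels map to index 0 ("#na#").

```python
import csv
import io

# One data row with a stray space after the first comma, as in adult.csv
RAW = "age,workclass,fnlwgt\n49, Private,101320\n"

def read_workclass(raw, skipinitialspace):
    """Read the 'workclass' column from CSV text."""
    reader = csv.DictReader(io.StringIO(raw), skipinitialspace=skipinitialspace)
    return [row["workclass"] for row in reader]

def build_vocab(labels):
    """Hypothetical vocabulary: index 0 is reserved for unknown labels."""
    return ["#na#"] + sorted(set(labels))

def encode(label, vocab):
    """Return the label's index, or 0 when the label was never seen."""
    return vocab.index(label) if label in vocab else 0

untrimmed = build_vocab(read_workclass(RAW, skipinitialspace=False))
trimmed = build_vocab(read_workclass(RAW, skipinitialspace=True))

print(untrimmed, encode("Private", untrimmed))  # vocabulary holds ' Private'
print(trimmed, encode("Private", trimmed))      # vocabulary holds 'Private'
```

With the untrimmed vocabulary, the clean inference label "Private" is not found and falls back to 0. With trimming enabled at read time, it maps to its own category.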