Helper functions to get data into a `DataLoaders` in the tabular application, and the higher-level class `TabularDataLoaders`

The main class to get your data ready for model training is TabularDataLoaders and its factory methods. Check out the tabular tutorial for examples of use.

class TabularDataLoaders

TabularDataLoaders(*loaders, path:(str, Path)='.', device=None) :: DataLoaders

Basic wrapper around several DataLoaders with factory methods for tabular data

This class should not be used directly; prefer one of the factory methods instead. All of the factory methods accept the following arguments:

  • cat_names: the names of the categorical variables
  • cont_names: the names of the continuous variables
  • y_names: the names of the dependent variables
  • y_block: the TransformBlock to use for the target
  • valid_idx: the indices to use for the validation set (defaults to a random split otherwise)
  • bs: the batch size
  • val_bs: the batch size for the validation DataLoader (defaults to bs)
  • shuffle_train: whether to shuffle the training DataLoader (deprecated, use shuffle)
  • n: overrides the number of elements in the dataset
  • device: the PyTorch device to use (defaults to default_device())
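
As an illustration of the valid_idx argument, the following sketch builds a reproducible random list of validation indices by hand, using only the standard library. The helper name random_valid_idx, the 20% fraction, and the seed are assumptions made for this example, not values taken from the library. The resulting list is the kind of value that could be passed as valid_idx to any of the factory methods.

```python
import random

def random_valid_idx(n_rows, valid_pct=0.2, seed=42):
    """Return a sorted list of row indices to hold out for validation.

    Hypothetical helper: `valid_pct` and `seed` are illustrative choices.
    """
    rng = random.Random(seed)          # local RNG, so the global random state is untouched
    n_valid = int(n_rows * valid_pct)  # number of rows to hold out
    return sorted(rng.sample(range(n_rows), n_valid))

valid_idx = random_valid_idx(1000)
valid_set = set(valid_idx)
# Every row that is not held out stays in the training set
train_idx = [i for i in range(1000) if i not in valid_set]
```

Fixing the seed keeps the split identical across runs, which makes validation metrics comparable between experiments.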

TabularDataLoaders.from_df

TabularDataLoaders.from_df(df:pd.DataFrame, path:(str, Path)='.', procs:list=None, cat_names:list=None, cont_names:list=None, y_names:list=None, y_block:TransformBlock=None, valid_idx:list=None, bs:int=64, shuffle_train:bool=None, shuffle:bool=True, val_shuffle:bool=False, n:int=None, device:torch.device=None, drop_last:bool=None, val_bs:int=None)

Create TabularDataLoaders from df in path using procs

| Name | Type | Default | Details |
|---|---|---|---|
| df | pd.DataFrame | | DataFrame to build the DataLoaders from |
| path | (str, Path) | '.' | Location of df, defaults to current working directory |
| procs | list | None | List of TabularProcs |
| cat_names | list | None | Column names pertaining to categorical variables |
| cont_names | list | None | Column names pertaining to continuous variables |
| y_names | list | None | Names of the dependent variables |
| y_block | TransformBlock | None | TransformBlock to use for the target(s) |
| valid_idx | list | None | List of indices to use for the validation set, defaults to a random split |
| Valid Keyword Arguments | | | |
| bs | int | 64 | Batch size (passed to FilteredBase.dataloaders) |
| shuffle_train | bool | None | Deprecated, use shuffle. Shuffle training DataLoader (passed to FilteredBase.dataloaders) |
| shuffle | bool | True | Shuffle training DataLoader (passed to FilteredBase.dataloaders) |
| val_shuffle | bool | False | Shuffle validation DataLoader (passed to FilteredBase.dataloaders) |
| n | int | None | Size of Datasets used to create DataLoader (passed to FilteredBase.dataloaders) |
| device | torch.device | None | Device to put DataLoaders (passed to FilteredBase.dataloaders) |
| drop_last | bool | None | Drop last incomplete batch, defaults to shuffle (passed to FilteredBase.dataloaders) |
| val_bs | int | None | Validation batch size, defaults to bs (passed to FilteredBase.dataloaders) |

Let's look at an example with the adult dataset:

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv', skipinitialspace=True)
df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_df(df, path, procs=procs, cat_names=cat_names, cont_names=cont_names,
                                 y_names="salary", valid_idx=list(range(800,1000)), bs=64)
dls.show_batch()

| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | HS-grad | Married-civ-spouse | Adm-clerical | Husband | White | False | 24.0 | 121312.998272 | 9.0 | <50k |
| 1 | Private | HS-grad | Never-married | Other-service | Not-in-family | White | False | 19.0 | 198320.000325 | 9.0 | <50k |
| 2 | Private | Bachelors | Married-civ-spouse | Sales | Husband | White | False | 66.0 | 169803.999308 | 13.0 | >=50k |
| 3 | Private | HS-grad | Divorced | Adm-clerical | Unmarried | White | False | 40.0 | 799280.980929 | 9.0 | <50k |
| 4 | Local-gov | 10th | Never-married | Other-service | Own-child | White | False | 18.0 | 55658.003629 | 6.0 | <50k |
| 5 | Private | HS-grad | Never-married | Handlers-cleaners | Other-relative | White | False | 30.0 | 375827.003847 | 9.0 | <50k |
| 6 | Private | Some-college | Never-married | Handlers-cleaners | Own-child | White | False | 20.0 | 173723.999335 | 10.0 | <50k |
| 7 | ? | Some-college | Never-married | ? | Own-child | White | False | 21.0 | 107800.997986 | 10.0 | <50k |
| 8 | Private | HS-grad | Never-married | Handlers-cleaners | Own-child | White | False | 19.0 | 263338.000072 | 9.0 | <50k |
| 9 | Private | Some-college | Married-civ-spouse | Tech-support | Husband | White | False | 35.0 | 194590.999986 | 10.0 | <50k |

TabularDataLoaders.from_csv

TabularDataLoaders.from_csv(csv:(str, Path, io.BufferedReader), skipinitialspace:bool=True, path:(str, Path)='.', procs:list=None, cat_names:list=None, cont_names:list=None, y_names:list=None, y_block:TransformBlock=None, valid_idx:list=None, bs:int=64, shuffle_train:bool=None, shuffle:bool=True, val_shuffle:bool=False, n:int=None, device:torch.device=None, drop_last:bool=None, val_bs:int=None)

Create TabularDataLoaders from csv file in path using procs

| Name | Type | Default | Details |
|---|---|---|---|
| csv | (str, Path, io.BufferedReader) | | A csv of training data |
| skipinitialspace | bool | True | Skip spaces after delimiter |
| Valid Keyword Arguments | | | |
| path | (str, Path) | '.' | Location of df, defaults to current working directory (passed to TabularDataLoaders.from_df) |
| procs | list | None | List of TabularProcs (passed to TabularDataLoaders.from_df) |
| cat_names | list | None | Column names pertaining to categorical variables (passed to TabularDataLoaders.from_df) |
| cont_names | list | None | Column names pertaining to continuous variables (passed to TabularDataLoaders.from_df) |
| y_names | list | None | Names of the dependent variables (passed to TabularDataLoaders.from_df) |
| y_block | TransformBlock | None | TransformBlock to use for the target(s) (passed to TabularDataLoaders.from_df) |
| valid_idx | list | None | List of indices to use for the validation set, defaults to a random split (passed to TabularDataLoaders.from_df) |
| bs | int | 64 | Argument passed to TabularDataLoaders.from_df |
| shuffle_train | bool | None | Argument passed to TabularDataLoaders.from_df |
| shuffle | bool | True | Argument passed to TabularDataLoaders.from_df |
| val_shuffle | bool | False | Argument passed to TabularDataLoaders.from_df |
| n | int | None | Argument passed to TabularDataLoaders.from_df |
| device | torch.device | None | Argument passed to TabularDataLoaders.from_df |
| drop_last | bool | None | Argument passed to TabularDataLoaders.from_df |
| val_bs | int | None | Argument passed to TabularDataLoaders.from_df |

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, procs=procs, cat_names=cat_names, cont_names=cont_names,
                                  y_names="salary", valid_idx=list(range(800,1000)), bs=64)

TabularDataLoaders.test_dl

TabularDataLoaders.test_dl(test_items, rm_type_tfms=None, process:bool=True, inplace:bool=False, bs=16, shuffle=False, after_batch=None, num_workers=0, verbose:bool=False, do_setup:bool=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, wif=None, before_iter=None, after_item=None, before_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None)

Create test TabDataLoader from test_items using validation procs

| Name | Type | Default | Details |
|---|---|---|---|
| test_items | | | Items to create new test TabDataLoader formatted the same as the training data |
| rm_type_tfms | | None | Number of Transforms to be removed from procs |
| process | bool | True | Apply validation TabularProcs to test_items immediately |
| inplace | bool | False | Keep separate copy of original test_items in memory if False |
| Valid Keyword Arguments | | | |
| bs | int | 16 | Argument passed to `TabDataLoader.__init__` |
| shuffle | bool | False | Argument passed to `TabDataLoader.__init__` |
| after_batch | | None | Argument passed to `TabDataLoader.__init__` |
| num_workers | int | 0 | Argument passed to `TabDataLoader.__init__` |
| verbose | bool | False | Argument passed to `TabDataLoader.__init__` |
| do_setup | bool | True | Argument passed to `TabDataLoader.__init__` |
| pin_memory | bool | False | Argument passed to `TabDataLoader.__init__` |
| timeout | int | 0 | Argument passed to `TabDataLoader.__init__` |
| batch_size | | None | Argument passed to `TabDataLoader.__init__` |
| drop_last | bool | False | Argument passed to `TabDataLoader.__init__` |
| indexed | | None | Argument passed to `TabDataLoader.__init__` |
| n | | None | Argument passed to `TabDataLoader.__init__` |
| device | | None | Argument passed to `TabDataLoader.__init__` |
| persistent_workers | bool | False | Argument passed to `TabDataLoader.__init__` |
| wif | | None | Argument passed to `TabDataLoader.__init__` |
| before_iter | | None | Argument passed to `TabDataLoader.__init__` |
| after_item | | None | Argument passed to `TabDataLoader.__init__` |
| before_batch | | None | Argument passed to `TabDataLoader.__init__` |
| after_iter | | None | Argument passed to `TabDataLoader.__init__` |
| create_batches | | None | Argument passed to `TabDataLoader.__init__` |
| create_item | | None | Argument passed to `TabDataLoader.__init__` |
| create_batch | | None | Argument passed to `TabDataLoader.__init__` |
| retain | | None | Argument passed to `TabDataLoader.__init__` |
| get_idxs | | None | Argument passed to `TabDataLoader.__init__` |
| sample | | None | Argument passed to `TabDataLoader.__init__` |
| shuffle_fn | | None | Argument passed to `TabDataLoader.__init__` |
| do_batch | | None | Argument passed to `TabDataLoader.__init__` |

External structured data files can contain unexpected spaces, e.g. after a comma. The first row of adult.csv shows this: "49, Private,101320, ...". Such values usually need trimming. Pandas has a convenient parameter for this, skipinitialspace, which TabularDataLoaders.from_csv() exposes. Without it, a category label used later for inference, such as workclass:Private, would be wrongly categorized as 0 ("#na#") if the training label had been read as " Private". Let's test this feature.

test_data = {
    'age': [49], 
    'workclass': ['Private'], 
    'fnlwgt': [101320],
    'education': ['Assoc-acdm'], 
    'education-num': [12.0],
    'marital-status': ['Married-civ-spouse'], 
    'occupation': [''],
    'relationship': ['Wife'],
    'race': ['White'],
}
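# Build a one-row test set (named test_df to avoid shadowing the builtin `input`)
test_df = pd.DataFrame(test_data)
tdl = dls.test_dl(test_df)

# 'Private' must be encoded as a known category, not as 0 ("#na#")
test_ne(0, tdl.dataset.iloc[0]['workclass'])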
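
The whitespace problem can also be reproduced without fastai or pandas. The sketch below uses Python's built-in csv module, whose skipinitialspace option has the same meaning as the pandas parameter, together with a deliberately simplified stand-in for category encoding. The raw string is a small made-up excerpt in the style of adult.csv, and the read_workclass and encode helpers are assumptions for illustration, not the library's Categorify implementation. It shows that a vocabulary built from untrimmed labels does not contain "Private", so the lookup falls back to 0.

```python
import csv
import io

# Made-up one-row sample with a space after the first comma, as in adult.csv
raw = "age,workclass,fnlwgt\n49, Private,101320\n"

def read_workclass(text, skipinitialspace):
    """Return the workclass column parsed from CSV text."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=skipinitialspace)
    return [row["workclass"] for row in reader]

def encode(label, train_labels):
    """Map a label to its 1-based position in the training vocabulary, or 0 if unseen."""
    vocab = {name: i for i, name in enumerate(sorted(set(train_labels)), start=1)}
    return vocab.get(label, 0)  # 0 plays the role of "#na#" in this sketch

untrimmed = read_workclass(raw, skipinitialspace=False)  # label keeps its leading space
trimmed = read_workclass(raw, skipinitialspace=True)     # leading space is dropped

code_untrimmed = encode("Private", untrimmed)  # unseen label, falls back to 0
code_trimmed = encode("Private", trimmed)      # matches the training label
```

With trimming enabled at read time, the label supplied at inference matches the one seen during training, which is what the test_ne check above verifies on the real pipeline.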