Helper functions to get data into a DataLoaders for the tabular application, and the higher-level class TabularDataLoaders
The main way to get your data ready for model training is TabularDataLoaders and its factory methods. Check out the tabular tutorial for examples of use.
def TabularDataLoaders(
    loaders:VAR_POSITIONAL, # [`DataLoader`](https://docs.fast.ai/data.load.html#dataloader) objects to wrap
    path:str | Path='.', # Path to store export objects
    device:NoneType=None, # Device to put [`DataLoaders`](https://docs.fast.ai/data.core.html#dataloaders)
):
Basic wrapper around several DataLoaders with factory methods for tabular data
This class should not be used directly; prefer one of the factory methods instead. All of these factory methods accept the following arguments:
def from_df(
    df:pd.DataFrame,
    path:str | Path='.', # Location of `df`, defaults to current working directory
    procs:list=None, # List of [`TabularProc`](https://docs.fast.ai/tabular.core.html#tabularproc)s
    cat_names:list=None, # Column names pertaining to categorical variables
    cont_names:list=None, # Column names pertaining to continuous variables
    y_names:list=None, # Names of the dependent variables
    y_block:TransformBlock=None, # [`TransformBlock`](https://docs.fast.ai/data.block.html#transformblock) to use for the target(s)
    valid_idx:list=None, # List of indices to use for the validation set, defaults to a random split
    bs:int=64, # Batch size
    shuffle_train:bool=None, # (Deprecated, use `shuffle`) Shuffle training [`DataLoader`](https://docs.fast.ai/data.load.html#dataloader)
    shuffle:bool=True, # Shuffle training [`DataLoader`](https://docs.fast.ai/data.load.html#dataloader)
    val_shuffle:bool=False, # Shuffle validation [`DataLoader`](https://docs.fast.ai/data.load.html#dataloader)
    n:int=None, # Size of [`Datasets`](https://docs.fast.ai/data.core.html#datasets) used to create [`DataLoader`](https://docs.fast.ai/data.load.html#dataloader)
    device:torch.device=None, # Device to put [`DataLoaders`](https://docs.fast.ai/data.core.html#dataloaders)
    drop_last:bool=None, # Drop last incomplete batch, defaults to `shuffle`
    val_bs:int=None, # Validation batch size, defaults to `bs`
):
def from_csv(
    csv:str | Path | io.BufferedReader, # A csv of training data
    skipinitialspace:bool=True, # Skip spaces after delimiter
    path:str | Path='.', # Location of `df`, defaults to current working directory
    procs:list=None, # List of [`TabularProc`](https://docs.fast.ai/tabular.core.html#tabularproc)s
    cat_names:list=None, # Column names pertaining to categorical variables
    cont_names:list=None, # Column names pertaining to continuous variables
    y_names:list=None, # Names of the dependent variables
    y_block:TransformBlock=None, # [`TransformBlock`](https://docs.fast.ai/data.block.html#transformblock) to use for the target(s)
    valid_idx:list=None, # List of indices to use for the validation set, defaults to a random split
    bs:int=64, # Batch size
    shuffle_train:bool=None, # (Deprecated, use `shuffle`) Shuffle training [`DataLoader`](https://docs.fast.ai/data.load.html#dataloader)
    shuffle:bool=True, # Shuffle training [`DataLoader`](https://docs.fast.ai/data.load.html#dataloader)
    val_shuffle:bool=False, # Shuffle validation [`DataLoader`](https://docs.fast.ai/data.load.html#dataloader)
    n:int=None, # Size of [`Datasets`](https://docs.fast.ai/data.core.html#datasets) used to create [`DataLoader`](https://docs.fast.ai/data.load.html#dataloader)
    device:torch.device=None, # Device to put [`DataLoaders`](https://docs.fast.ai/data.core.html#dataloaders)
    drop_last:bool=None, # Drop last incomplete batch, defaults to `shuffle`
    val_bs:int=None, # Validation batch size, defaults to `bs`
):
def test_dl(
    test_items, # Items to create new test [`TabDataLoader`](https://docs.fast.ai/tabular.core.html#tabdataloader) formatted the same as the training data
    rm_type_tfms:NoneType=None, # Number of `Transform`s to be removed from `procs`
    process:bool=True, # Apply validation [`TabularProc`](https://docs.fast.ai/tabular.core.html#tabularproc)s to `test_items` immediately
    inplace:bool=False, # Keep separate copy of original `test_items` in memory if `False`
    bs:int=16, # Size of batch
    shuffle:bool=False, # Whether to shuffle data
    after_batch:NoneType=None,
    num_workers:int=0,
    verbose:bool=False, # Whether to print verbose logs
    do_setup:bool=True, # Whether to run `setup()` for batch transform(s)
    pin_memory:bool=False,
    timeout:int=0,
    batch_size:NoneType=None,
    drop_last:bool=False,
    indexed:NoneType=None,
    n:NoneType=None,
    device:NoneType=None,
    persistent_workers:bool=False,
    pin_memory_device:str='',
    wif:NoneType=None,
    before_iter:NoneType=None,
    after_item:NoneType=None,
    before_batch:NoneType=None,
    after_iter:NoneType=None,
    create_batches:NoneType=None,
    create_item:NoneType=None,
    create_batch:NoneType=None,
    retain:NoneType=None,
    get_idxs:NoneType=None,
    sample:NoneType=None,
    shuffle_fn:NoneType=None,
    do_batch:NoneType=None,
):
Create test TabDataLoader from test_items using validation procs
External structured data files can contain unexpected spaces, e.g. after a comma. We can see this in the first row of adult.csv: "49, Private,101320, ...". Trimming is often needed. Pandas has a convenient parameter, skipinitialspace, that is exposed by TabularDataLoaders.from_csv(). Without it, a category label used later for inference, such as workclass:Private, will be wrongly encoded as 0, i.e. "#na#", if the training label was read as " Private". Let’s test this feature.
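The effect of skipinitialspace can be seen with pandas alone. The sketch below reads the same inline CSV text twice; the text is a hypothetical two-row stand-in modelled on the adult.csv row quoted above, not the real file.

```python
import io
import pandas as pd

# Inline stand-in for a file like adult.csv (second row is illustrative).
raw = "age,workclass,fnlwgt\n49, Private,101320\n38, State-gov,77516\n"

# Default parsing keeps the space after the comma as part of the label.
untrimmed = pd.read_csv(io.StringIO(raw))

# skipinitialspace=True drops spaces that follow the delimiter.
trimmed = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(untrimmed["workclass"].tolist())  # [' Private', ' State-gov']
print(trimmed["workclass"].tolist())    # ['Private', 'State-gov']
```

A model trained on the untrimmed labels would not recognise "Private" at inference time, which is why from_csv sets skipinitialspace to True by default.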