The main class to get your data ready for model training is `TabularDataLoaders` and its factory methods. Check out the tabular tutorial for examples of use.
TabularDataLoaders
TabularDataLoaders (*loaders, path:str|pathlib.Path='.', device=None)
Basic wrapper around several `DataLoader`s with factory methods for tabular data

| | Type | Default | Details |
|---|---|---|---|
| loaders | VAR_POSITIONAL | | `DataLoader` objects to wrap |
| path | str \| pathlib.Path | . | Path to store export objects |
| device | NoneType | None | Device to put `DataLoaders` |
This class should not be used directly; one of the factory methods should be preferred instead. All those factory methods accept as arguments:

- `cat_names`: the names of the categorical variables
- `cont_names`: the names of the continuous variables
- `y_names`: the names of the dependent variables
- `y_block`: the `TransformBlock` to use for the target
- `valid_idx`: the indices to use for the validation set (defaults to a random split otherwise)
- `bs`: the batch size
- `val_bs`: the batch size for the validation `DataLoader` (defaults to `bs`)
- `shuffle_train`: whether to shuffle the training `DataLoader`
- `n`: overrides the number of elements in the dataset
- `device`: the PyTorch device to use (defaults to `default_device()`)
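To make the `valid_idx` argument concrete, here is a minimal sketch (plain Python, not fastai internals) of how a list of indices partitions a dataset: the listed rows become the validation set and every other row goes to the training set.

```python
def split_by_idx(n_rows, valid_idx):
    """Partition row indices 0..n_rows-1 into (train, valid) given explicit valid indices."""
    valid = set(valid_idx)
    train_idx = [i for i in range(n_rows) if i not in valid]
    return train_idx, sorted(valid)

train_idx, valid_idx = split_by_idx(10, valid_idx=[2, 5, 7])
print(train_idx)  # [0, 1, 3, 4, 6, 8, 9]
print(valid_idx)  # [2, 5, 7]
```

When `valid_idx` is not given, the factory methods instead fall back to a random split.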
TabularDataLoaders.from_df
TabularDataLoaders.from_df (df:pd.DataFrame, path:str|Path='.',
procs:list=None, cat_names:list=None,
cont_names:list=None, y_names:list=None,
y_block:TransformBlock=None,
valid_idx:list=None, bs:int=64,
shuffle_train:bool=None, shuffle:bool=True,
val_shuffle:bool=False, n:int=None,
device:torch.device=None,
drop_last:bool=None, val_bs:int=None)
Create `TabularDataLoaders` from `df` in `path` using `procs`
| | Type | Default | Details |
|---|---|---|---|
| df | pd.DataFrame | | |
| path | str \| Path | . | Location of `df`, defaults to current working directory |
| procs | list | None | List of `TabularProc`s |
| cat_names | list | None | Column names pertaining to categorical variables |
| cont_names | list | None | Column names pertaining to continuous variables |
| y_names | list | None | Names of the dependent variables |
| y_block | TransformBlock | None | `TransformBlock` to use for the target(s) |
| valid_idx | list | None | List of indices to use for the validation set, defaults to a random split |
| bs | int | 64 | Batch size |
| shuffle_train | bool | None | (Deprecated, use `shuffle`) Shuffle training `DataLoader` |
| shuffle | bool | True | Shuffle training `DataLoader` |
| val_shuffle | bool | False | Shuffle validation `DataLoader` |
| n | int | None | Size of `Datasets` used to create `DataLoader` |
| device | device | None | Device to put `DataLoaders` |
| drop_last | bool | None | Drop last incomplete batch, defaults to `shuffle` |
| val_bs | int | None | Validation batch size, defaults to `bs` |
Let’s have a look at an example with the adult dataset:

```python
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv', skipinitialspace=True)
df.head()
```
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
```python
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_df(df, path, procs=procs, cat_names=cat_names, cont_names=cont_names,
                                 y_names="salary", valid_idx=list(range(800, 1000)), bs=64)
```
A decoded sample of the processed data:

| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | HS-grad | Married-civ-spouse | Adm-clerical | Husband | White | False | 24.0 | 121312.998272 | 9.0 | <50k |
| 1 | Private | HS-grad | Never-married | Other-service | Not-in-family | White | False | 19.0 | 198320.000325 | 9.0 | <50k |
| 2 | Private | Bachelors | Married-civ-spouse | Sales | Husband | White | False | 66.0 | 169803.999308 | 13.0 | >=50k |
| 3 | Private | HS-grad | Divorced | Adm-clerical | Unmarried | White | False | 40.0 | 799280.980929 | 9.0 | <50k |
| 4 | Local-gov | 10th | Never-married | Other-service | Own-child | White | False | 18.0 | 55658.003629 | 6.0 | <50k |
| 5 | Private | HS-grad | Never-married | Handlers-cleaners | Other-relative | White | False | 30.0 | 375827.003847 | 9.0 | <50k |
| 6 | Private | Some-college | Never-married | Handlers-cleaners | Own-child | White | False | 20.0 | 173723.999335 | 10.0 | <50k |
| 7 | ? | Some-college | Never-married | ? | Own-child | White | False | 21.0 | 107800.997986 | 10.0 | <50k |
| 8 | Private | HS-grad | Never-married | Handlers-cleaners | Own-child | White | False | 19.0 | 263338.000072 | 9.0 | <50k |
| 9 | Private | Some-college | Married-civ-spouse | Tech-support | Husband | White | False | 35.0 | 194590.999986 | 10.0 | <50k |
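The three `procs` used above correspond to standard preprocessing steps. As a rough illustration in plain pandas (made-up data, not the fastai implementation): `Categorify` maps category labels to integer codes, `FillMissing` fills missing continuous values with the median and records a missing-flag column, and `Normalize` standardizes continuous columns.

```python
import pandas as pd

sample_df = pd.DataFrame({'workclass': ['Private', 'Local-gov', 'Private'],
                          'age': [24.0, None, 66.0]})

# Categorify (sketch): categorical labels become integer codes, 0 reserved for '#na#'
sample_df['workclass'] = sample_df['workclass'].astype('category').cat.codes + 1

# FillMissing (sketch): record missingness, then fill with the column median
sample_df['age_na'] = sample_df['age'].isna()
sample_df['age'] = sample_df['age'].fillna(sample_df['age'].median())

# Normalize (sketch): standardize continuous columns to zero mean, unit std
sample_df['age'] = (sample_df['age'] - sample_df['age'].mean()) / sample_df['age'].std()

print(sample_df)
```

Note that `show_batch` displays decoded values (categories and original scales restored), which is why the table above shows readable labels and near-original continuous values rather than normalized ones.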
TabularDataLoaders.from_csv
TabularDataLoaders.from_csv (csv:str|Path|io.BufferedReader,
skipinitialspace:bool=True,
path:str|Path='.', procs:list=None,
cat_names:list=None, cont_names:list=None,
y_names:list=None,
y_block:TransformBlock=None,
valid_idx:list=None, bs:int=64,
shuffle_train:bool=None, shuffle:bool=True,
val_shuffle:bool=False, n:int=None,
device:torch.device=None,
drop_last:bool=None, val_bs:int=None)
Create `TabularDataLoaders` from `csv` file in `path` using `procs`
| | Type | Default | Details |
|---|---|---|---|
| csv | str \| Path \| io.BufferedReader | | A csv of training data |
| skipinitialspace | bool | True | Skip spaces after delimiter |
| path | str \| Path | . | Location of `df`, defaults to current working directory |
| procs | list | None | List of `TabularProc`s |
| cat_names | list | None | Column names pertaining to categorical variables |
| cont_names | list | None | Column names pertaining to continuous variables |
| y_names | list | None | Names of the dependent variables |
| y_block | TransformBlock | None | `TransformBlock` to use for the target(s) |
| valid_idx | list | None | List of indices to use for the validation set, defaults to a random split |
| bs | int | 64 | Batch size |
| shuffle_train | bool | None | (Deprecated, use `shuffle`) Shuffle training `DataLoader` |
| shuffle | bool | True | Shuffle training `DataLoader` |
| val_shuffle | bool | False | Shuffle validation `DataLoader` |
| n | int | None | Size of `Datasets` used to create `DataLoader` |
| device | device | None | Device to put `DataLoaders` |
| drop_last | bool | None | Drop last incomplete batch, defaults to `shuffle` |
| val_bs | int | None | Validation batch size, defaults to `bs` |
```python
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, procs=procs, cat_names=cat_names, cont_names=cont_names,
                                  y_names="salary", valid_idx=list(range(800, 1000)), bs=64)
```
TabularDataLoaders.test_dl
TabularDataLoaders.test_dl (test_items, rm_type_tfms=None,
process:bool=True, inplace:bool=False, bs=16,
shuffle=False, after_batch=None,
num_workers=0, verbose:bool=False,
do_setup:bool=True, pin_memory=False,
timeout=0, batch_size=None, drop_last=False,
indexed=None, n=None, device=None,
persistent_workers=False,
pin_memory_device='', wif=None,
before_iter=None, after_item=None,
before_batch=None, after_iter=None,
create_batches=None, create_item=None,
create_batch=None, retain=None,
get_idxs=None, sample=None, shuffle_fn=None,
do_batch=None)
Create test `TabDataLoader` from `test_items` using validation `procs`
| | Type | Default | Details |
|---|---|---|---|
| test_items | | | Items to create new test `TabDataLoader` formatted the same as the training data |
| rm_type_tfms | NoneType | None | Number of `Transform`s to be removed from `procs` |
| process | bool | True | Apply validation `TabularProc`s to `test_items` immediately |
| inplace | bool | False | Keep separate copy of original `test_items` in memory if `False` |
| bs | int | 16 | Size of batch |
| shuffle | bool | False | Whether to shuffle data |
| after_batch | NoneType | None | |
| num_workers | int | 0 | Number of CPU cores to use in parallel (default: all available up to 16) |
| verbose | bool | False | Whether to print verbose logs |
| do_setup | bool | True | Whether to run `setup()` for batch transform(s) |
| pin_memory | bool | False | |
| timeout | int | 0 | |
| batch_size | NoneType | None | |
| drop_last | bool | False | |
| indexed | NoneType | None | |
| n | NoneType | None | |
| device | NoneType | None | |
| persistent_workers | bool | False | |
| pin_memory_device | str | '' | |
| wif | NoneType | None | |
| before_iter | NoneType | None | |
| after_item | NoneType | None | |
| before_batch | NoneType | None | |
| after_iter | NoneType | None | |
| create_batches | NoneType | None | |
| create_item | NoneType | None | |
| create_batch | NoneType | None | |
| retain | NoneType | None | |
| get_idxs | NoneType | None | |
| sample | NoneType | None | |
| shuffle_fn | NoneType | None | |
| do_batch | NoneType | None | |
External structured data files can contain unexpected spaces, e.g. after a comma. We can see that in the first row of adult.csv: `"49, Private,101320, ..."`. Often trimming is needed. Pandas has a convenient parameter `skipinitialspace` that is exposed by `TabularDataLoaders.from_csv()`. Otherwise, category labels used for inference later, such as `workclass:Private`, will be wrongly categorized to 0 or `"#na#"` if the training label was read as `" Private"`. Let’s test this feature.
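First, the effect of `skipinitialspace` on its own can be seen with plain pandas (a small sketch with made-up data, not the adult dataset):

```python
import io
import pandas as pd

csv = "age,workclass\n49, Private\n44, Self-emp-inc\n"

# Without skipinitialspace, the leading space stays in the category label
raw = pd.read_csv(io.StringIO(csv))
print(raw['workclass'].tolist())    # [' Private', ' Self-emp-inc']

# With skipinitialspace=True, labels match what we would pass at inference time
clean = pd.read_csv(io.StringIO(csv), skipinitialspace=True)
print(clean['workclass'].tolist())  # ['Private', 'Self-emp-inc']
```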
```python
test_data = {
    'age': [49],
    'workclass': ['Private'],
    'fnlwgt': [101320],
    'education': ['Assoc-acdm'],
    'education-num': [12.0],
    'marital-status': ['Married-civ-spouse'],
    'occupation': [''],
    'relationship': ['Wife'],
    'race': ['White'],
}
input = pd.DataFrame(test_data)
tdl = dls.test_dl(input)
test_ne(0, tdl.dataset.iloc[0]['workclass'])
```
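Why the assertion holds can be sketched with a plain dictionary lookup (an illustration, not fastai's actual `Categorify` implementation): a category-to-code mapping built from the training labels sends any unseen string, such as a label with a stray leading space, to the reserved `#na#` code 0.

```python
# Sketch: a category map built at training time, with '#na#' reserved at code 0
cat_map = {'#na#': 0, 'Private': 1, 'Self-emp-inc': 2}

def categorize(label):
    # Unseen labels fall back to the '#na#' code, exactly what would happen
    # to 'Private' at inference if training labels had been read as ' Private'
    return cat_map.get(label, cat_map['#na#'])

print(categorize('Private'))   # 1
print(categorize(' Private'))  # 0
```

Because `adult.csv` was read with `skipinitialspace=True`, the training label is `'Private'` and the test row above maps to a nonzero code, which is what `test_ne(0, ...)` checks.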