The main class to get your data ready for model training is `TabularDataLoaders` and its factory methods. Check out the tabular tutorial for examples of use.
source
TabularDataLoaders
TabularDataLoaders (*loaders, path:str|pathlib.Path='.', device=None)
Basic wrapper around several `DataLoader`s with factory methods for tabular data
This class should not be used directly; one of the factory methods should be preferred instead. All of those factory methods accept the following arguments:

- `cat_names`: the names of the categorical variables
- `cont_names`: the names of the continuous variables
- `y_names`: the names of the dependent variables
- `y_block`: the `TransformBlock` to use for the target
- `valid_idx`: the indices to use for the validation set (defaults to a random split otherwise)
- `bs`: the batch size
- `val_bs`: the batch size for the validation `DataLoader` (defaults to `bs`)
- `shuffle_train`: whether to shuffle the training `DataLoader` or not
- `n`: overrides the number of elements in the dataset
- `device`: the PyTorch device to use (defaults to `default_device()`)
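As a minimal sketch of how these arguments fit together (not from the fastai docs; the toy DataFrame, its column names, and the split below are made up for illustration), `y_block` can be passed explicitly when the target should be treated as continuous:

toy_df = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'green'] * 25,   # categorical feature
    'size':  [1.0, 2.5, 3.0, 0.5] * 25,              # continuous feature
    'price': [10.0, 20.0, 15.0, 5.0] * 25,           # continuous target
})

toy_dls = TabularDataLoaders.from_df(
    toy_df,
    procs=[Categorify, Normalize],
    cat_names=['color'],
    cont_names=['size'],
    y_names='price',
    y_block=RegressionBlock(),          # treat the target as continuous
    valid_idx=list(range(80, 100)),     # explicit validation split
    bs=16,
)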
source
TabularDataLoaders.from_df
TabularDataLoaders.from_df (df:pd.DataFrame, path:str|Path='.',
procs:list=None, cat_names:list=None,
cont_names:list=None, y_names:list=None,
y_block:TransformBlock=None,
valid_idx:list=None, bs:int=64,
shuffle_train:bool=None, shuffle:bool=True,
val_shuffle:bool=False, n:int=None,
device:torch.device=None,
drop_last:bool=None, val_bs:int=None)
Create `TabularDataLoaders` from `df` in `path` using `procs`
| | Type | Default | Details |
|---|---|---|---|
| df | pd.DataFrame | | |
| path | str \| Path | . | Location of `df`, defaults to current working directory |
| procs | list | None | List of `TabularProc`s |
| cat_names | list | None | Column names pertaining to categorical variables |
| cont_names | list | None | Column names pertaining to continuous variables |
| y_names | list | None | Names of the dependent variables |
| y_block | TransformBlock | None | `TransformBlock` to use for the target(s) |
| valid_idx | list | None | List of indices to use for the validation set, defaults to a random split |
| bs | int | 64 | Batch size |
| shuffle_train | bool | None | (Deprecated, use `shuffle`) Shuffle training `DataLoader` |
| shuffle | bool | True | Shuffle training `DataLoader` |
| val_shuffle | bool | False | Shuffle validation `DataLoader` |
| n | int | None | Size of `Datasets` used to create `DataLoader` |
| device | device | None | Device to put `DataLoaders` |
| drop_last | bool | None | Drop last incomplete batch, defaults to `shuffle` |
| val_bs | int | None | Validation batch size, defaults to `bs` |
Let’s have a look at an example with the adult dataset:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv', skipinitialspace=True)
df.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_df(df, path, procs=procs, cat_names=cat_names, cont_names=cont_names,
                                 y_names="salary", valid_idx=list(range(800, 1000)), bs=64)
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | HS-grad | Married-civ-spouse | Adm-clerical | Husband | White | False | 24.0 | 121312.998272 | 9.0 | <50k |
| 1 | Private | HS-grad | Never-married | Other-service | Not-in-family | White | False | 19.0 | 198320.000325 | 9.0 | <50k |
| 2 | Private | Bachelors | Married-civ-spouse | Sales | Husband | White | False | 66.0 | 169803.999308 | 13.0 | >=50k |
| 3 | Private | HS-grad | Divorced | Adm-clerical | Unmarried | White | False | 40.0 | 799280.980929 | 9.0 | <50k |
| 4 | Local-gov | 10th | Never-married | Other-service | Own-child | White | False | 18.0 | 55658.003629 | 6.0 | <50k |
| 5 | Private | HS-grad | Never-married | Handlers-cleaners | Other-relative | White | False | 30.0 | 375827.003847 | 9.0 | <50k |
| 6 | Private | Some-college | Never-married | Handlers-cleaners | Own-child | White | False | 20.0 | 173723.999335 | 10.0 | <50k |
| 7 | ? | Some-college | Never-married | ? | Own-child | White | False | 21.0 | 107800.997986 | 10.0 | <50k |
| 8 | Private | HS-grad | Never-married | Handlers-cleaners | Own-child | White | False | 19.0 | 263338.000072 | 9.0 | <50k |
| 9 | Private | Some-college | Married-civ-spouse | Tech-support | Husband | White | False | 35.0 | 194590.999986 | 10.0 | <50k |
source
TabularDataLoaders.from_csv
TabularDataLoaders.from_csv (csv:str|Path|io.BufferedReader,
skipinitialspace:bool=True,
path:str|Path='.', procs:list=None,
cat_names:list=None, cont_names:list=None,
y_names:list=None,
y_block:TransformBlock=None,
valid_idx:list=None, bs:int=64,
shuffle_train:bool=None, shuffle:bool=True,
val_shuffle:bool=False, n:int=None,
device:torch.device=None,
drop_last:bool=None, val_bs:int=None)
Create `TabularDataLoaders` from `csv` file in `path` using `procs`
| | Type | Default | Details |
|---|---|---|---|
| csv | str \| Path \| io.BufferedReader | | A csv of training data |
| skipinitialspace | bool | True | Skip spaces after delimiter |
| path | str \| Path | . | Location of `df`, defaults to current working directory |
| procs | list | None | List of `TabularProc`s |
| cat_names | list | None | Column names pertaining to categorical variables |
| cont_names | list | None | Column names pertaining to continuous variables |
| y_names | list | None | Names of the dependent variables |
| y_block | TransformBlock | None | `TransformBlock` to use for the target(s) |
| valid_idx | list | None | List of indices to use for the validation set, defaults to a random split |
| bs | int | 64 | Batch size |
| shuffle_train | bool | None | (Deprecated, use `shuffle`) Shuffle training `DataLoader` |
| shuffle | bool | True | Shuffle training `DataLoader` |
| val_shuffle | bool | False | Shuffle validation `DataLoader` |
| n | int | None | Size of `Datasets` used to create `DataLoader` |
| device | device | None | Device to put `DataLoaders` |
| drop_last | bool | None | Drop last incomplete batch, defaults to `shuffle` |
| val_bs | int | None | Validation batch size, defaults to `bs` |
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, procs=procs, cat_names=cat_names, cont_names=cont_names,
                                  y_names="salary", valid_idx=list(range(800, 1000)), bs=64)
source
TabularDataLoaders.test_dl
TabularDataLoaders.test_dl (test_items, rm_type_tfms=None,
process:bool=True, inplace:bool=False, bs=16,
shuffle=False, after_batch=None,
num_workers=0, verbose:bool=False,
do_setup:bool=True, pin_memory=False,
timeout=0, batch_size=None, drop_last=False,
indexed=None, n=None, device=None,
persistent_workers=False,
pin_memory_device='', wif=None,
before_iter=None, after_item=None,
before_batch=None, after_iter=None,
create_batches=None, create_item=None,
create_batch=None, retain=None,
get_idxs=None, sample=None, shuffle_fn=None,
do_batch=None)
Create test `TabDataLoader` from `test_items` using validation `procs`
| | Type | Default | Details |
|---|---|---|---|
| test_items | | | Items to create new test `TabDataLoader` formatted the same as the training data |
| rm_type_tfms | NoneType | None | Number of `Transform`s to be removed from `procs` |
| process | bool | True | Apply validation `TabularProc`s to `test_items` immediately |
| inplace | bool | False | Keep separate copy of original `test_items` in memory if False |
| bs | int | 64 | Size of batch |
| shuffle | bool | False | Whether to shuffle data |
| after_batch | NoneType | None | |
| num_workers | int | None | Number of CPU cores to use in parallel (default: All available up to 16) |
| verbose | bool | False | Whether to print verbose logs |
| do_setup | bool | True | Whether to run `setup()` for batch transform(s) |
| pin_memory | bool | False | |
| timeout | int | 0 | |
| batch_size | NoneType | None | |
| drop_last | bool | False | |
| indexed | NoneType | None | |
| n | NoneType | None | |
| device | NoneType | None | |
| persistent_workers | bool | False | |
| pin_memory_device | str | '' | |
| wif | NoneType | None | |
| before_iter | NoneType | None | |
| after_item | NoneType | None | |
| before_batch | NoneType | None | |
| after_iter | NoneType | None | |
| create_batches | NoneType | None | |
| create_item | NoneType | None | |
| create_batch | NoneType | None | |
| retain | NoneType | None | |
| get_idxs | NoneType | None | |
| sample | NoneType | None | |
| shuffle_fn | NoneType | None | |
| do_batch | NoneType | None | |
External structured data files can contain unexpected spaces, e.g. after a comma. We can see that in the first row of adult.csv: "49, Private,101320, ...". Often trimming is needed. Pandas has a convenient parameter `skipinitialspace` that is exposed by `TabularDataLoaders.from_csv()`. Otherwise, category labels used for inference later, such as workclass:Private, will be wrongly categorized to 0 or "#na#" if the training label was read as " Private". Let’s test this feature.
test_data = {
    'age': [49],
    'workclass': ['Private'],
    'fnlwgt': [101320],
    'education': ['Assoc-acdm'],
    'education-num': [12.0],
    'marital-status': ['Married-civ-spouse'],
    'occupation': [''],
    'relationship': ['Wife'],
    'race': ['White'],
}
input = pd.DataFrame(test_data)
tdl = dls.test_dl(input)

test_ne(0, tdl.dataset.iloc[0]['workclass'])
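Once a model has been trained, the test `DataLoader` returned by `test_dl` can be passed to `get_preds` to obtain predictions for the new items. A sketch, assuming a trained tabular learner named `learn` (not defined in this section):

# `learn` is assumed to exist, e.g. from tabular_learner(dls) followed by training
preds, _ = learn.get_preds(dl=tdl)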