df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']})
make_date(df, 'date')
test_eq(df['date'].dtype, np.dtype('datetime64[ns]'))
Tabular core
DataLoaders
.
Initial preprocessing
make_date
make_date (df, date_field)
Make sure df[date_field] is of the right date type.
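As a rough illustration of what "the right date type" means, the sketch below converts a string column to a pandas datetime column in place. The helper name to_datetime_column is hypothetical, and the use of pd.to_datetime is an assumption about the general approach, not a copy of the make_date source.

```python
import pandas as pd

# Hypothetical stand-in for make_date: convert a column to a datetime dtype in place.
def to_datetime_column(df, date_field):
    # Only convert when the column is not already a datetime column
    if not pd.api.types.is_datetime64_any_dtype(df[date_field]):
        df[date_field] = pd.to_datetime(df[date_field])

df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']})
to_datetime_column(df, 'date')
print(df['date'].dtype)
```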
add_datepart
add_datepart (df, field_name, prefix=None, drop=True, time=False)
Helper function that adds columns relevant to a date in the column field_name of df.
For example, if we have a series of dates, we can generate features such as Year, Month, Day, Dayofweek, Is_month_start, etc., as shown below:
df = pd.DataFrame({'date': ['2019-12-04', None, '2019-11-15', '2019-10-24']})
df = add_datepart(df, 'date')
df.head()
| | Year | Month | Week | Day | Dayofweek | Dayofyear | Is_month_end | Is_month_start | Is_quarter_end | Is_quarter_start | Is_year_end | Is_year_start | Elapsed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019.0 | 12.0 | 49.0 | 4.0 | 2.0 | 338.0 | False | False | False | False | False | False | 1.575418e+09 |
1 | NaN | NaN | NaN | NaN | NaN | NaN | False | False | False | False | False | False | NaN |
2 | 2019.0 | 11.0 | 46.0 | 15.0 | 4.0 | 319.0 | False | False | False | False | False | False | 1.573776e+09 |
3 | 2019.0 | 10.0 | 43.0 | 24.0 | 3.0 | 297.0 | False | False | False | False | False | False | 1.571875e+09 |
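To make the columns above more concrete, here is a minimal standard-library sketch that derives a few of them for a single date. The helper date_parts is hypothetical; it assumes Week is the ISO week number, Dayofweek counts from Monday as 0, and Elapsed is seconds since the Unix epoch at midnight UTC, which agrees with the first row of the table.

```python
from datetime import date, datetime, timezone

def date_parts(d):
    # Hypothetical helper: a few calendar features for one datetime.date
    return {
        'Year': d.year,
        'Month': d.month,
        'Week': d.isocalendar()[1],          # ISO week number
        'Day': d.day,
        'Dayofweek': d.weekday(),            # Monday == 0
        'Dayofyear': d.timetuple().tm_yday,
        'Is_month_start': d.day == 1,
        # seconds since the Unix epoch at midnight UTC
        'Elapsed': datetime(d.year, d.month, d.day, tzinfo=timezone.utc).timestamp(),
    }

parts = date_parts(date(2019, 12, 4))
print(parts)
```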
add_elapsed_times
add_elapsed_times (df, field_names, date_field, base_field)
Add to df, for each event in field_names, the elapsed time according to date_field, grouped by base_field.
df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24'],
                   'event': [False, True, False, True], 'base': [1,1,2,2]})
df = add_elapsed_times(df, ['event'], 'date', 'base')
df.head()
| | date | event | base | Afterevent | Beforeevent | event_bw | event_fw |
---|---|---|---|---|---|---|---|
0 | 2019-12-04 | False | 1 | 5 | 0 | 1.0 | 0.0 |
1 | 2019-11-29 | True | 1 | 0 | 0 | 1.0 | 1.0 |
2 | 2019-11-15 | False | 2 | 22 | 0 | 1.0 | 0.0 |
3 | 2019-10-24 | True | 2 | 0 | 0 | 1.0 | 1.0 |
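The idea of "elapsed time per event, grouped by a base field" can be sketched in plain Python. The helper days_since_event below is hypothetical and only covers one direction (days since the most recent event in the same group); it is not the add_elapsed_times implementation, although on this sample it yields the same values as the Afterevent column.

```python
from datetime import date

# (date, event, base): same values as the sample frame above
rows = [
    (date(2019, 12, 4), False, 1),
    (date(2019, 11, 29), True, 1),
    (date(2019, 11, 15), False, 2),
    (date(2019, 10, 24), True, 2),
]

def days_since_event(rows):
    # Hypothetical helper: days since the latest event within each base group
    out = [None] * len(rows)
    last_event = {}  # base -> date of the most recent event seen so far
    # Walk the rows in chronological order so "most recent" is well defined
    for i in sorted(range(len(rows)), key=lambda i: rows[i][0]):
        d, event, base = rows[i]
        if event:
            last_event[base] = d
        if base in last_event:
            out[i] = (d - last_event[base]).days
    return out

elapsed = days_since_event(rows)
print(elapsed)
```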
cont_cat_split
cont_cat_split (df, max_card=20, dep_var=None)
Helper function that returns column names of cont and cat variables from given df.
This function decides whether a column is continuous or categorical based on the cardinality of its values. If the cardinality is above the max_card parameter (or the column has a float datatype), the column is added to cont_names; otherwise it is added to cat_names. An example is below:
# Example with simple numpy types
df = pd.DataFrame({'cat1': [1, 2, 3, 4], 'cont1': [1., 2., 3., 2.], 'cat2': ['a', 'b', 'b', 'a'],
                   'i8': pd.Series([1, 2, 3, 4], dtype='int8'),
                   'u8': pd.Series([1, 2, 3, 4], dtype='uint8'),
                   'f16': pd.Series([1, 2, 3, 4], dtype='float16'),
                   'y1': [1, 0, 1, 0], 'y2': [2, 1, 1, 0]})
cont_names, cat_names = cont_cat_split(df)
cont_names: ['cont1', 'f16']
cat_names: ['cat1', 'cat2', 'i8', 'u8', 'y1', 'y2']
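The cardinality rule described above can be sketched with plain pandas. The helper split_by_cardinality is hypothetical, and it assumes that only integer columns are compared against max_card while float columns always count as continuous; the real cont_cat_split may differ in details such as dep_var handling.

```python
import pandas as pd

def split_by_cardinality(df, max_card=20):
    # Hypothetical helper illustrating the rule, not the library implementation
    cont, cat = [], []
    for name in df.columns:
        col = df[name]
        is_float = pd.api.types.is_float_dtype(col)
        # Assumption: only integer columns are tested against max_card
        many_ints = pd.api.types.is_integer_dtype(col) and col.nunique() > max_card
        (cont if is_float or many_ints else cat).append(name)
    return cont, cat

df = pd.DataFrame({'small_int': [1, 2, 1, 2], 'price': [1., 2., 3., 2.],
                   'label': ['a', 'b', 'b', 'a'], 'big_int': [10, 20, 30, 40]})
cont, cat = split_by_cardinality(df, max_card=3)
print(cont, cat)
```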
# Example with pandas types and generated columns
df = pd.DataFrame({'cat1': pd.Series(['l','xs','xl','s'], dtype='category'),
                    'ui32': pd.Series([1, 2, 3, 4], dtype='UInt32'),
                    'i64': pd.Series([1, 2, 3, 4], dtype='Int64'),
                    'f16': pd.Series([1, 2, 3, 4], dtype='Float64'),
                    'd1_date': ['2021-02-09', None, '2020-05-12', '2020-08-14'],
                    })
df = add_datepart(df, 'd1_date', drop=False)
df['cat1'] = df['cat1'].cat.set_categories(['xl','l','m','s','xs'], ordered=True)
cont_names, cat_names = cont_cat_split(df, max_card=0)
/home/jhoward/miniconda3/lib/python3.8/site-packages/pandas/core/arrays/categorical.py:2630: FutureWarning: The `inplace` parameter in pandas.Categorical.set_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
res = method(*args, **kwargs)
cont_names: ['ui32', 'i64', 'f16', 'd1_Year', 'd1_Month', 'd1_Week', 'd1_Day', 'd1_Dayofweek', 'd1_Dayofyear', 'd1_Elapsed']
cat_names: ['cat1', 'd1_date', 'd1_Is_month_end', 'd1_Is_month_start', 'd1_Is_quarter_end', 'd1_Is_quarter_start', 'd1_Is_year_end', 'd1_Is_year_start']
df_shrink_dtypes
df_shrink_dtypes (df, skip=[], obj2cat=True, int2uint=False)
Return any possible smaller data types for DataFrame columns. Allows object->category, int->uint, and exclusion.
For example, we will make a sample DataFrame with int, float, bool, and object datatypes:
df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'e': [True, False, True],
                   'date':['2019-12-04','2019-11-29','2019-11-15',]})
df.dtypes
i int64
f float64
e bool
date object
dtype: object
We can then call df_shrink_dtypes to find the smallest possible datatype that can support the data:
dt = df_shrink_dtypes(df)
dt
{'i': dtype('int8'), 'f': dtype('float32'), 'date': 'category'}
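One way to picture how an integer column ends up as int8 is to compare its value range with the limits NumPy reports for each integer type. The helper smallest_int_dtype below is hypothetical and covers integers only; it is a sketch of the general idea, not the df_shrink_dtypes source.

```python
import numpy as np

def smallest_int_dtype(lo, hi, allow_unsigned=False):
    # Hypothetical helper: narrowest NumPy integer dtype that can hold [lo, hi]
    candidates = [np.int8, np.int16, np.int32, np.int64]
    if allow_unsigned and lo >= 0:
        # Only consider unsigned types when no value is negative
        candidates = [np.uint8, np.uint16, np.uint32, np.uint64]
    for t in candidates:
        info = np.iinfo(t)
        if info.min <= lo and hi <= info.max:
            return np.dtype(t)
    raise ValueError("range does not fit in a 64-bit integer")

print(smallest_int_dtype(-100, 100))
print(smallest_int_dtype(0, 254))
print(smallest_int_dtype(0, 254, allow_unsigned=True))
```

The allow_unsigned flag plays the role that int2uint plays in the signature above: unsigned types are only tried when it is set and the data has no negative values.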
df_shrink
df_shrink (df, skip=[], obj2cat=True, int2uint=False)
Reduce DataFrame memory usage by casting to the smaller types returned by df_shrink_dtypes().
df_shrink(df) attempts to make a DataFrame use less memory by fitting numeric columns into the smallest possible datatypes. In addition:
- boolean, category, and datetime64[ns] dtype columns are ignored.
- object type columns are categorified, which can save a lot of memory in large datasets. This can be turned off with obj2cat=False.
- int2uint=True fits int types to uint types if all data in the column is >= 0.
- columns can be excluded by name using skip=['col1','col2'].
To get only the new column data types without actually casting a DataFrame, use df_shrink_dtypes() with the same parameters as df_shrink().
df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'u':[0, 10,254],
                  'date':['2019-12-04','2019-11-29','2019-11-15']})
df2 = df_shrink(df, skip=['date'])
Let’s compare the two:
df.dtypes
i int64
f float64
u int64
date object
dtype: object
df2.dtypes
i int8
f float32
u int16
date object
dtype: object
We can see that the datatypes changed, and even further we can look at their relative memory usages:
Initial Dataframe: 224 bytes
Reduced Dataframe: 173 bytes
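The cell that printed these figures is not shown. A plausible way to obtain such numbers, assuming they come from DataFrame.memory_usage().sum(), is sketched below. The manual astype call stands in for df_shrink, and the date column is left out for brevity; exact byte counts depend on the pandas version and platform, so none are quoted here.

```python
import pandas as pd

df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'u': [0, 10, 254]})
# Manual casts standing in for df_shrink, using the dtypes listed above
df2 = df.astype({'i': 'int8', 'f': 'float32', 'u': 'int16'})

# memory_usage() returns bytes per column (plus the index); sum() gives the total
print(f'Initial Dataframe: {df.memory_usage().sum()} bytes')
print(f'Reduced Dataframe: {df2.memory_usage().sum()} bytes')
```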
Here’s another example using the ADULT_SAMPLE dataset:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
new_df = df_shrink(df, int2uint=True)
Initial Dataframe: 3.907448 megabytes
Reduced Dataframe: 0.818329 megabytes
We reduced the overall memory used by 79%!
Tabular
Tabular (df, procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, splits=None, do_setup=True, device=None, inplace=False, reduce_memory=True)
A DataFrame wrapper that knows which cols are cont/cat/y, and returns rows in __getitem__
- df: A DataFrame of your data
- cat_names: Your categorical x variables
- cont_names: Your continuous x variables
- y_names: Your dependent y variables
  - Note: Mixed y’s, such as regression and classification together, are not currently supported; multiple regression outputs or multiple classification outputs are
- y_block: How to sub-categorize the type of y_names (CategoryBlock or RegressionBlock)
- splits: How to split your data
- do_setup: Whether Tabular will run the data through the procs upon initialization
- device: cuda or cpu
- inplace: If True, Tabular will not keep a separate copy of your original DataFrame in memory. You should ensure pd.options.mode.chained_assignment is None before setting this
- reduce_memory: fastai will attempt to reduce the overall memory usage of the input DataFrame with df_shrink
TabularPandas
TabularPandas (df, procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, splits=None, do_setup=True, device=None, inplace=False, reduce_memory=True)
A Tabular object with transforms
TabularProc
TabularProc (enc=None, dec=None, split_idx=None, order=None)
Base class to write a non-lazy tabular processor for dataframes
These transforms are applied as soon as the data is available rather than as data is called from the DataLoader
Categorify
Categorify (enc=None, dec=None, split_idx=None, order=None)
Transform the categorical variables to something similar to pd.Categorical
While you will not see a visual change in the DataFrame, the classes are stored in to.procs.categorify, as we can see below on a dummy DataFrame:
df = pd.DataFrame({'a':[0,1,2,0,2]})
to = TabularPandas(df, Categorify, 'a')
to.show()
| | a |
---|---|
0 | 0 |
1 | 1 |
2 | 2 |
3 | 0 |
4 | 2 |
Each column’s unique values are stored in a dictionary of column:[values]:
cat = to.procs.categorify
cat.classes
{'a': ['#na#', 0, 1, 2]}
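To show how a vocabulary like the one above can drive an encoding, here is a small pure-Python sketch. The helpers build_vocab and encode are hypothetical; the sketch assumes each value is encoded as its position in the vocabulary and that anything outside the vocabulary falls back to index 0 ('#na#').

```python
def build_vocab(values):
    # Hypothetical helper: '#na#' first, then the sorted unique values
    return ['#na#'] + sorted(set(values))

def encode(values, vocab):
    lookup = {v: i for i, v in enumerate(vocab)}
    # Anything missing from the vocabulary falls back to index 0 ('#na#')
    return [lookup.get(v, 0) for v in values]

vocab = build_vocab([0, 1, 2, 0, 2])
codes = encode([0, 1, 2, 0, 2, 7], vocab)  # 7 was never seen when building vocab
print(vocab, codes)
```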
FillStrategy
FillStrategy ()
Namespace containing the various filling strategies.
Currently, filling with the median, a constant, and the mode are supported.
FillMissing
FillMissing (fill_strategy=<function median>, add_col=True, fill_vals=None)
Fill the missing values in continuous columns.
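A minimal pandas sketch of median filling follows. The helper fill_median is hypothetical; it assumes that add_col=True corresponds to adding a boolean <column>_na indicator, which is consistent with the education-num_na column that appears in the integration example on this page.

```python
import pandas as pd

def fill_median(df, col):
    # Hypothetical helper: median-fill col and record where values were missing
    missing = df[col].isna()
    df[f'{col}_na'] = missing
    # median() skips missing values by default
    df[col] = df[col].fillna(df[col].median())
    return df

df = pd.DataFrame({'education-num': [12.0, 14.0, None, 15.0]})
df = fill_median(df, 'education-num')
print(df)
```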
ReadTabBatch
ReadTabBatch (to)
Transform TabularPandas values into a Tensor with the ability to decode
TabDataLoader
TabDataLoader (dataset, bs=16, shuffle=False, after_batch=None, num_workers=0, verbose:bool=False, do_setup:bool=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, pin_memory_device='', wif=None, before_iter=None, after_item=None, before_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None)
A transformed DataLoader for Tabular data
TabWeightedDL
TabWeightedDL (dataset, bs=16, wgts=None, shuffle=False, after_batch=None, num_workers=0, verbose:bool=False, do_setup:bool=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, pin_memory_device='', wif=None, before_iter=None, after_item=None, before_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None)
A transformed DataLoader for Tabular Weighted data
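The effect of per-item weights on sampling can be illustrated with the standard library alone. The helper weighted_indices is hypothetical and assumes that wgts act as relative sampling probabilities for dataset indices; it says nothing about how TabWeightedDL batches or shuffles internally.

```python
import random

def weighted_indices(weights, n, seed=0):
    # Hypothetical helper: draw n dataset indices with probability proportional to weights
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=n)

# Item 0 has zero weight, item 2 is three times as likely as item 1
idxs = weighted_indices([0.0, 1.0, 3.0], n=1000)
print(idxs[:10])
```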
Integration example
For a more in-depth explanation, see the tabular tutorial
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_test.drop('salary', axis=1, inplace=True)
df_main.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="salary", splits=splits)
dls = to.dataloaders()
dls.valid.show_batch()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Self-emp-not-inc | Prof-school | Divorced | Prof-specialty | Not-in-family | White | False | 65.000000 | 316093.005287 | 15.0 | <50k |
1 | Private | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | False | 69.999999 | 280306.998091 | 13.0 | <50k |
2 | Federal-gov | Some-college | Married-civ-spouse | Adm-clerical | Husband | Black | False | 34.000000 | 199933.999862 | 10.0 | >=50k |
3 | Private | HS-grad | Never-married | Handlers-cleaners | Unmarried | White | False | 24.000001 | 300584.002430 | 9.0 | <50k |
4 | Private | Assoc-voc | Never-married | Other-service | Not-in-family | White | False | 34.000000 | 220630.999335 | 11.0 | <50k |
5 | Private | Bachelors | Divorced | Prof-specialty | Unmarried | White | False | 45.000000 | 289230.003178 | 13.0 | >=50k |
6 | ? | Some-college | Never-married | ? | Own-child | White | False | 26.000000 | 208993.999494 | 10.0 | <50k |
7 | Private | Some-college | Divorced | Adm-clerical | Not-in-family | White | False | 43.000000 | 174574.999446 | 10.0 | <50k |
8 | Self-emp-not-inc | Assoc-voc | Married-civ-spouse | Other-service | Husband | White | False | 63.000000 | 420628.997361 | 11.0 | <50k |
9 | State-gov | Some-college | Married-civ-spouse | Adm-clerical | Husband | Black | False | 25.000000 | 257064.003065 | 10.0 | <50k |
to.show()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
---|---|---|---|---|---|---|---|---|---|---|---|
5516 | Private | HS-grad | Divorced | Exec-managerial | Unmarried | White | False | 49.0 | 140121.0 | 9.0 | <50k |
7184 | Self-emp-inc | Some-college | Never-married | Exec-managerial | Not-in-family | White | False | 70.0 | 207938.0 | 10.0 | <50k |
2336 | Private | Some-college | Never-married | Priv-house-serv | Own-child | White | False | 23.0 | 50953.0 | 10.0 | <50k |
4342 | Private | Assoc-voc | Married-civ-spouse | Machine-op-inspct | Husband | White | False | 46.0 | 27802.0 | 11.0 | <50k |
8474 | Self-emp-not-inc | Assoc-acdm | Married-civ-spouse | Craft-repair | Husband | White | False | 47.0 | 107231.0 | 12.0 | <50k |
5948 | Local-gov | HS-grad | Married-civ-spouse | Transport-moving | Husband | White | False | 40.0 | 55363.0 | 9.0 | <50k |
5342 | Local-gov | HS-grad | Married-civ-spouse | Craft-repair | Husband | White | False | 46.0 | 36228.0 | 9.0 | <50k |
9005 | Private | Bachelors | Married-civ-spouse | Adm-clerical | Husband | White | False | 38.0 | 297449.0 | 13.0 | >=50k |
1189 | Private | Assoc-voc | Divorced | Sales | Not-in-family | Amer-Indian-Eskimo | False | 31.0 | 87950.0 | 11.0 | <50k |
8784 | Private | Assoc-voc | Divorced | Prof-specialty | Own-child | Black | False | 35.0 | 491000.0 | 11.0 | <50k |
We can decode any set of transformed data by calling to.decode_row with our raw data:
row = to.items.iloc[0]
to.decode_row(row)
age 49.0
workclass Private
fnlwgt 140121.0
education HS-grad
education-num 9.0
marital-status Divorced
occupation Exec-managerial
relationship Unmarried
race White
sex Male
capital-gain 0
capital-loss 0
hours-per-week 50
native-country United-States
salary <50k
education-num_na False
Name: 5516, dtype: object
We can make new test datasets based on the training data with to.new().
Since a machine learning model can’t magically understand categories it was never trained on, the data should reflect this. If your test data has missing values in different places than your training data, you should address this before training.
to_tst = to.new(df_test)
to_tst.process()
to_tst.items.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | education-num_na |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10000 | 0.465031 | 5 | 1.319553 | 10 | 1.176677 | 3 | 2 | 1 | 2 | Male | 0 | 0 | 40 | Philippines | 1 |
10001 | -0.926675 | 5 | 1.233650 | 12 | -0.420035 | 3 | 15 | 1 | 4 | Male | 0 | 0 | 40 | United-States | 1 |
10002 | 1.051012 | 5 | 0.145161 | 2 | -1.218391 | 1 | 9 | 2 | 5 | Female | 0 | 0 | 37 | United-States | 1 |
10003 | 0.538279 | 5 | -0.282370 | 12 | -0.420035 | 7 | 2 | 5 | 5 | Female | 0 | 0 | 43 | United-States | 1 |
10004 | 0.758022 | 6 | 1.420768 | 9 | 0.378321 | 3 | 5 | 1 | 5 | Male | 0 | 0 | 60 | United-States | 1 |
We can then convert it to a DataLoader:
tst_dl = dls.valid.new(to_tst)
tst_dl.show_batch()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
---|---|---|---|---|---|---|---|---|---|---|
0 | Private | Bachelors | Married-civ-spouse | Adm-clerical | Husband | Asian-Pac-Islander | False | 45.000000 | 338105.005817 | 13.0 |
1 | Private | HS-grad | Married-civ-spouse | Transport-moving | Husband | Other | False | 26.000000 | 328663.002806 | 9.0 |
2 | Private | 11th | Divorced | Other-service | Not-in-family | White | False | 52.999999 | 209022.000317 | 7.0 |
3 | Private | HS-grad | Widowed | Adm-clerical | Unmarried | White | False | 46.000000 | 162029.998917 | 9.0 |
4 | Self-emp-inc | Assoc-voc | Married-civ-spouse | Exec-managerial | Husband | White | False | 49.000000 | 349230.006300 | 11.0 |
5 | Local-gov | Some-college | Married-civ-spouse | Exec-managerial | Husband | White | False | 34.000000 | 124827.002059 | 10.0 |
6 | Self-emp-inc | Some-college | Married-civ-spouse | Sales | Husband | White | False | 52.999999 | 290640.002462 | 10.0 |
7 | Private | Some-college | Never-married | Sales | Own-child | White | False | 19.000000 | 106272.998239 | 10.0 |
8 | Private | Some-college | Married-civ-spouse | Protective-serv | Husband | Black | False | 71.999999 | 53684.001668 | 10.0 |
9 | Private | Some-college | Never-married | Sales | Own-child | White | False | 20.000000 | 505980.010609 | 10.0 |
# Create a TabWeightedDL
train_ds = to.train
weights = np.random.random(len(train_ds))
train_dl = TabWeightedDL(train_ds, wgts=weights, bs=64, shuffle=True)
train_dl.show_batch()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Local-gov | Masters | Never-married | Prof-specialty | Not-in-family | White | False | 31.000000 | 204469.999932 | 14.0 | <50k |
1 | Self-emp-not-inc | HS-grad | Divorced | Farming-fishing | Not-in-family | White | False | 32.000000 | 34572.002104 | 9.0 | <50k |
2 | ? | Some-college | Widowed | ? | Not-in-family | White | False | 64.000000 | 34099.998990 | 10.0 | <50k |
3 | Private | Some-college | Divorced | Exec-managerial | Not-in-family | White | False | 32.000000 | 251242.999189 | 10.0 | >=50k |
4 | Federal-gov | HS-grad | Married-civ-spouse | Exec-managerial | Husband | White | False | 55.000001 | 176903.999313 | 9.0 | <50k |
5 | Private | 11th | Married-civ-spouse | Transport-moving | Husband | White | False | 50.000000 | 192203.000000 | 7.0 | <50k |
6 | Private | 10th | Never-married | Farming-fishing | Own-child | Black | False | 36.000000 | 181720.999704 | 6.0 | <50k |
7 | Local-gov | Masters | Divorced | Prof-specialty | Not-in-family | Amer-Indian-Eskimo | False | 50.000000 | 220640.001490 | 14.0 | >=50k |
8 | Private | HS-grad | Married-civ-spouse | Adm-clerical | Wife | White | False | 36.000000 | 189381.999993 | 9.0 | >=50k |
9 | Private | Masters | Divorced | Prof-specialty | Unmarried | White | False | 42.000000 | 265697.997341 | 14.0 | <50k |
TabDataLoader’s create_item method
df = pd.DataFrame([{'age': 35}])
to = TabularPandas(df)
dls = to.dataloaders()
print(dls.create_item(0))
# test_eq(dls.create_item(0).items.to_dict(), {'age': 0.5330614747286777, 'workclass': 5, 'fnlwgt': -0.26305443080666174, 'education': 10, 'education-num': 1.169790230219763, 'marital-status': 1, 'occupation': 13, 'relationship': 5, 'race': 3, 'sex': ' Female', 'capital-gain': 0, 'capital-loss': 0, 'hours-per-week': 35, 'native-country': 'United-States', 'salary': 1, 'education-num_na': 1})
age 35
Name: 0, dtype: int8
Other target types
Multi-label categories
one-hot encoded label
def _mock_multi_label(df):
    sal,sex,white = [],[],[]
    for row in df.itertuples():
        sal.append(row.salary == '>=50k')
        sex.append(row.sex == ' Male')
        white.append(row.race == ' White')
    df['salary'] = np.array(sal)
    df['male'] = np.array(sex)
    df['white'] = np.array(white)
    return df
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
df_main.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | male | white |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | True | False | True |
1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | True | True | True |
2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | False | False | False |
3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | True | True | False |
4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | False | False | False |
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
y_names=["salary", "male", "white"]
CPU times: user 66 ms, sys: 0 ns, total: 66 ms
Wall time: 65.3 ms
dls = to.dataloaders()
dls.valid.show_batch()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary | male | white |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Private | HS-grad | Divorced | Exec-managerial | Unmarried | White | False | 47.000000 | 164423.000013 | 9.0 | False | False | True |
1 | Private | Some-college | Married-civ-spouse | Transport-moving | Husband | White | False | 74.999999 | 239037.999499 | 10.0 | False | True | True |
2 | Private | HS-grad | Married-civ-spouse | Sales | Wife | White | False | 45.000000 | 228570.000761 | 9.0 | False | False | True |
3 | Self-emp-not-inc | HS-grad | Married-civ-spouse | Exec-managerial | Husband | Asian-Pac-Islander | False | 45.000000 | 285574.998753 | 9.0 | False | True | False |
4 | Private | Some-college | Never-married | Adm-clerical | Own-child | White | False | 21.999999 | 184812.999966 | 10.0 | False | True | True |
5 | Private | 10th | Married-civ-spouse | Transport-moving | Husband | White | False | 67.000001 | 274450.998865 | 6.0 | False | True | True |
6 | Private | HS-grad | Divorced | Exec-managerial | Unmarried | White | False | 53.999999 | 192862.000000 | 9.0 | False | False | True |
7 | Federal-gov | Some-college | Divorced | Tech-support | Unmarried | Amer-Indian-Eskimo | False | 37.000000 | 33486.997455 | 10.0 | False | False | False |
8 | Private | HS-grad | Never-married | Machine-op-inspct | Other-relative | White | False | 30.000000 | 219318.000010 | 9.0 | False | False | True |
9 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Sales | Husband | White | False | 44.000000 | 167279.999960 | 13.0 | False | True | True |
Not one-hot encoded
def _mock_multi_label(df):
    targ = []
    for row in df.itertuples():
        labels = []
        if row.salary == '>=50k': labels.append('>50k')
        if row.sex == ' Male':   labels.append('male')
        if row.race == ' White': labels.append('white')
        targ.append(' '.join(labels))
    df['target'] = np.array(targ)
    return df
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
df_main.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k | >50k white |
1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k | >50k male white |
2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k | |
3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k | >50k male |
4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k | |
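The target column above packs several labels into one space-separated string. As a rough sketch of how such strings relate to the one-hot layout of the previous section, the hypothetical helper one_hot_labels below splits each string and builds a boolean row per sample. It is an illustration only and is not part of the library.

```python
def one_hot_labels(targets, sep=' '):
    # Hypothetical helper: one-hot encode delimiter-separated label strings
    split = [t.split(sep) if t else [] for t in targets]
    # The vocabulary is the sorted set of all labels seen
    vocab = sorted({label for labels in split for label in labels})
    rows = [[label in labels for label in vocab] for labels in split]
    return vocab, rows

# Target strings taken from the five rows shown above
vocab, rows = one_hot_labels(['>50k white', '>50k male white', '', '>50k male', ''])
print(vocab)
print(rows)
```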
@MultiCategorize
def encodes(self, to:Tabular):
    #to.transform(to.y_names, partial(_apply_cats, {n: self.vocab for n in to.y_names}, 0))
    return to

@MultiCategorize
def decodes(self, to:Tabular):
    #to.transform(to.y_names, partial(_decode_cats, {n: self.vocab for n in to.y_names}))
    return to
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
CPU times: user 68.6 ms, sys: 0 ns, total: 68.6 ms
Wall time: 67.9 ms
to.procs[2].vocab
['-', '_', 'a', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y']
Regression
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
CPU times: user 70.7 ms, sys: 290 µs, total: 71 ms
Wall time: 70.3 ms
to.procs[-1].means
{'fnlwgt': 192085.701, 'education-num': 10.059124946594238}
dls = to.dataloaders()
dls.valid.show_batch()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | fnlwgt | education-num | age |
---|---|---|---|---|---|---|---|---|---|---|
0 | Private | 12th | Never-married | Adm-clerical | Other-relative | Black | False | 503454.004078 | 8.0 | 47.0 |
1 | Federal-gov | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | False | 586656.993690 | 13.0 | 49.0 |
2 | Self-emp-not-inc | Assoc-voc | Married-civ-spouse | Farming-fishing | Husband | White | False | 164607.001243 | 11.0 | 29.0 |
3 | Private | HS-grad | Never-married | Adm-clerical | Not-in-family | Black | False | 155508.999873 | 9.0 | 48.0 |
4 | Private | 11th | Never-married | Other-service | Own-child | White | False | 318189.998679 | 7.0 | 18.0 |
5 | Private | HS-grad | Never-married | Adm-clerical | Other-relative | White | False | 140219.001104 | 9.0 | 47.0 |
6 | Private | Masters | Divorced | #na# | Unmarried | White | True | 235683.001562 | 10.0 | 47.0 |
7 | Private | Bachelors | Married-civ-spouse | Craft-repair | Husband | White | False | 187321.999825 | 13.0 | 43.0 |
8 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Husband | White | False | 104196.002410 | 13.0 | 40.0 |
9 | Private | Some-college | Separated | Priv-house-serv | Other-relative | White | False | 184302.999784 | 10.0 | 25.0 |
Not being used now - for multi-modal
class TensorTabular(fastuple):
    def get_ctxs(self, max_n=10, **kwargs):
        n_samples = min(self[0].shape[0], max_n)
        df = pd.DataFrame(index = range(n_samples))
        return [df.iloc[i] for i in range(n_samples)]

    def display(self, ctxs): display_df(pd.DataFrame(ctxs))

class TabularLine(pd.Series):
    "A line of a dataframe that knows how to show itself"
    def show(self, ctx=None, **kwargs): return self if ctx is None else ctx.append(self)

class ReadTabLine(ItemTransform):
    def __init__(self, proc): self.proc = proc

    def encodes(self, row):
        cats,conts = (o.map(row.__getitem__) for o in (self.proc.cat_names,self.proc.cont_names))
        return TensorTabular(tensor(cats).long(),tensor(conts).float())

    def decodes(self, o):
        to = TabularPandas(o, self.proc.cat_names, self.proc.cont_names, self.proc.y_names)
        to = self.proc.decode(to)
        return TabularLine(pd.Series({c: v for v,c in zip(to.items[0]+to.items[1], self.proc.cat_names+self.proc.cont_names)}))

class ReadTabTarget(ItemTransform):
    def __init__(self, proc): self.proc = proc
    def encodes(self, row): return row[self.proc.y_names].astype(np.int64)
    def decodes(self, o): return Category(self.proc.classes[self.proc.y_names][o])
# tds = TfmdDS(to.items, tfms=[[ReadTabLine(proc)], ReadTabTarget(proc)])
# enc = tds[1]
# test_eq(enc[0][0], tensor([2,1]))
# test_close(enc[0][1], tensor([-0.628828]))
# test_eq(enc[1], 1)
# dec = tds.decode(enc)
# assert isinstance(dec[0], TabularLine)
# test_close(dec[0], pd.Series({'a': 1, 'b_na': False, 'b': 1}))
# test_eq(dec[1], 'a')
# test_stdout(lambda: print(show_at(tds, 1)), """a 1
# b_na False
# b 1
# category a
# dtype: object""")