Tabular core

Basic functions to preprocess tabular data before assembling it in a DataLoaders.

Initial preprocessing

make_date

 make_date (df, date_field)

Make sure df[date_field] is of the right date type.

df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']})
make_date(df, 'date')
test_eq(df['date'].dtype, np.dtype('datetime64[ns]'))
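
As a rough illustration of the idea, the conversion can be sketched in plain pandas. The helper name ensure_datetime is hypothetical and this is not fastai's implementation; it only shows an in-place conversion that is skipped when the column is already a datetime type.

```python
import pandas as pd

def ensure_datetime(df, date_field):
    "Hypothetical helper: convert df[date_field] to a datetime dtype in place."
    # Only convert when the column is not already a datetime column
    if not pd.api.types.is_datetime64_any_dtype(df[date_field]):
        df[date_field] = pd.to_datetime(df[date_field])
    return df

df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29']})
ensure_datetime(df, 'date')
print(df['date'].dtype)
```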

add_datepart

 add_datepart (df, field_name, prefix=None, drop=True, time=False)

Helper function that adds columns relevant to a date in the column field_name of df.

For example, if we have a series of dates, we can generate features such as Year, Month, Day, Dayofweek, Is_month_start, etc., as shown below:

df = pd.DataFrame({'date': ['2019-12-04', None, '2019-11-15', '2019-10-24']})
df = add_datepart(df, 'date')
df.head()

	Year	Month	Week	Day	Dayofweek	Dayofyear	Is_month_end	Is_month_start	Is_quarter_end	Is_quarter_start	Is_year_end	Is_year_start	Elapsed
0	2019.0	12.0	49.0	4.0	2.0	338.0	False	False	False	False	False	False	1.575418e+09
1	NaN	NaN	NaN	NaN	NaN	NaN	False	False	False	False	False	False	NaN
2	2019.0	11.0	46.0	15.0	4.0	319.0	False	False	False	False	False	False	1.573776e+09
3	2019.0	10.0	43.0	24.0	3.0	297.0	False	False	False	False	False	False	1.571875e+09
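
To make a few of these columns concrete, here is a minimal sketch that derives some of them with the pandas .dt accessor. It is an illustration under the assumption that Elapsed is the number of seconds since 1970-01-01, which is consistent with the values in the table above; it is not fastai's code and covers only a subset of the generated columns.

```python
import pandas as pd

# A missing date becomes NaT, which propagates as NaN in the numeric features
dates = pd.to_datetime(pd.Series(['2019-12-04', None, '2019-11-15']))

parts = pd.DataFrame({
    'Year': dates.dt.year,
    'Month': dates.dt.month,
    'Dayofweek': dates.dt.dayofweek,            # Monday is 0
    'Is_month_start': dates.dt.is_month_start,  # False for missing dates
    # Assumed definition: seconds elapsed since the Unix epoch
    'Elapsed': (dates - pd.Timestamp('1970-01-01')).dt.total_seconds(),
})
print(parts)
```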

add_elapsed_times

 add_elapsed_times (df, field_names, date_field, base_field)

For each event in field_names, add to df the elapsed time according to date_field, grouped by base_field.

df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24'],
                   'event': [False, True, False, True], 'base': [1,1,2,2]})
df = add_elapsed_times(df, ['event'], 'date', 'base')
df.head()

	date	event	base	Afterevent	event_bw	event_fw
0	2019-12-04	False	1	5	1.0	0.0
1	2019-11-29	True	1	0	1.0	1.0
2	2019-11-15	False	2	22	1.0	0.0
3	2019-10-24	True	2	0	1.0	1.0
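
The Afterevent column can be read as the number of days since the most recent event within the same base group. The sketch below reproduces those values with a grouped forward fill. It is a simplified illustration, not fastai's implementation, and the column name days_since_event is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']),
                   'event': [False, True, False, True], 'base': [1, 1, 2, 2]})

# Walk each group in chronological order
ordered = df.sort_values(['base', 'date'])
# Keep the date only on rows where the event happened, then carry it forward per group
last_event = ordered['date'].where(ordered['event']).groupby(ordered['base']).ffill()
# Assignment aligns on the index, so the original row order is preserved
df['days_since_event'] = (ordered['date'] - last_event).dt.days
print(df)
```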

cont_cat_split

 cont_cat_split (df, max_card=20, dep_var=None)

Helper function that returns column names of cont and cat variables from given df.

This function decides whether each column is continuous or categorical based on its dtype and the cardinality of its values. A column is added to cont_names if it has a float dtype, or if it has an integer dtype with more unique values than max_card; every other column is added to cat_names. An example is below:

# Example with simple numpy types
df = pd.DataFrame({'cat1': [1, 2, 3, 4], 'cont1': [1., 2., 3., 2.], 'cat2': ['a', 'b', 'b', 'a'],
                   'i8': pd.Series([1, 2, 3, 4], dtype='int8'),
                   'u8': pd.Series([1, 2, 3, 4], dtype='uint8'),
                   'f16': pd.Series([1, 2, 3, 4], dtype='float16'),
                   'y1': [1, 0, 1, 0], 'y2': [2, 1, 1, 0]})
cont_names, cat_names = cont_cat_split(df)

cont_names: ['cont1', 'f16']
cat_names: ['cat1', 'cat2', 'i8', 'u8', 'y1', 'y2']
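
The rule can be sketched in a few lines of pandas. The function name split_by_cardinality is hypothetical, and this is a simplified illustration that ignores dep_var; it is not the library's code.

```python
import pandas as pd

def split_by_cardinality(df, max_card=20):
    "Hypothetical sketch: floats and high-cardinality ints are continuous, the rest categorical."
    cont, cat = [], []
    for name in df.columns:
        col = df[name]
        is_float = pd.api.types.is_float_dtype(col)
        is_big_int = pd.api.types.is_integer_dtype(col) and col.nunique() > max_card
        (cont if is_float or is_big_int else cat).append(name)
    return cont, cat

df = pd.DataFrame({'cat1': [1, 2, 3, 4], 'cont1': [1., 2., 3., 2.], 'cat2': ['a', 'b', 'b', 'a']})
default_split = split_by_cardinality(df)
strict_split = split_by_cardinality(df, max_card=3)
print(default_split)
print(strict_split)
```

With the default max_card the integer column stays categorical because it has only four unique values; lowering max_card to 3 moves it to the continuous list.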

# Example with pandas types and generated columns
df = pd.DataFrame({'cat1': pd.Series(['l','xs','xl','s'], dtype='category'),
                    'ui32': pd.Series([1, 2, 3, 4], dtype='UInt32'),
                    'i64': pd.Series([1, 2, 3, 4], dtype='Int64'),
                    'f16': pd.Series([1, 2, 3, 4], dtype='Float64'),
                    'd1_date': ['2021-02-09', None, '2020-05-12', '2020-08-14'],
                    })
df = add_datepart(df, 'd1_date', drop=False)
df['cat1'] = df['cat1'].cat.set_categories(['xl','l','m','s','xs'], ordered=True)
cont_names, cat_names = cont_cat_split(df, max_card=0)

cont_names: ['ui32', 'i64', 'f16', 'd1_Year', 'd1_Month', 'd1_Week', 'd1_Day', 'd1_Dayofweek', 'd1_Dayofyear', 'd1_Elapsed']
cat_names: ['cat1', 'd1_date', 'd1_Is_month_end', 'd1_Is_month_start', 'd1_Is_quarter_end', 'd1_Is_quarter_start', 'd1_Is_year_end', 'd1_Is_year_start']

df_shrink_dtypes

 df_shrink_dtypes (df, skip=[], obj2cat=True, int2uint=False)

Return any possible smaller data types for DataFrame columns. Allows object->category, int->uint, and exclusion.

For example we will make a sample DataFrame with int, float, bool, and object datatypes:

df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'e': [True, False, True],
                   'date':['2019-12-04','2019-11-29','2019-11-15',]})
df.dtypes

i         int64
f       float64
e          bool
date     object
dtype: object

We can then call df_shrink_dtypes to find the smallest possible datatype that can support the data:

dt = df_shrink_dtypes(df)
dt

{'i': dtype('int8'), 'f': dtype('float32'), 'date': 'category'}
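
For integer columns, the idea of picking the smallest datatype can be illustrated by comparing a column's minimum and maximum against the range of each candidate type. The helper smallest_int_dtype and the candidate tuples are assumptions made for this sketch, not part of fastai.

```python
import numpy as np

SIGNED = (np.int8, np.int16, np.int32, np.int64)
UNSIGNED = (np.uint8, np.uint16, np.uint32, np.uint64)

def smallest_int_dtype(lo, hi, candidates=SIGNED):
    "Hypothetical helper: first dtype in `candidates` whose range covers [lo, hi]."
    for dt in candidates:
        info = np.iinfo(dt)
        if info.min <= lo and hi <= info.max:
            return np.dtype(dt)
    raise ValueError(f"no candidate dtype can hold [{lo}, {hi}]")

print(smallest_int_dtype(-100, 100))         # int8
print(smallest_int_dtype(0, 254))            # int16, since int8 stops at 127
print(smallest_int_dtype(0, 254, UNSIGNED))  # uint8
```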

df_shrink

 df_shrink (df, skip=[], obj2cat=True, int2uint=False)

Reduce DataFrame memory usage, by casting to smaller types returned by df_shrink_dtypes().

df_shrink(df) attempts to make a DataFrame use less memory by fitting numeric columns into the smallest datatypes that can hold their values. In addition:

boolean, category, and datetime64[ns] dtype columns are ignored.
object dtype columns are converted to category, which can save a lot of memory in large datasets. This can be turned off with obj2cat=False.
int2uint=True fits int types into uint types when all data in the column is >= 0.
columns can be excluded by name using skip=['col1','col2'].

To get only the new column data types without actually casting the DataFrame, use df_shrink_dtypes() with the same parameters as df_shrink().

df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'u':[0, 10,254],
                  'date':['2019-12-04','2019-11-29','2019-11-15']})
df2 = df_shrink(df, skip=['date'])

Let’s compare the two:

df.dtypes

i         int64
f       float64
u         int64
date     object
dtype: object

df2.dtypes

i          int8
f       float32
u         int16
date     object
dtype: object

We can see that the datatypes changed. We can also compare their memory usage:

Initial Dataframe: 228 bytes
Reduced Dataframe: 177 bytes
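
One way to make such a comparison is DataFrame.memory_usage. The sketch below is an assumption about how the measurement might be done, not the code that produced the figures above. It casts the numeric columns by hand to the dtypes shown in df2.dtypes and leaves out the date column, which was skipped.

```python
import pandas as pd

df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'u': [0, 10, 254]})
# Cast by hand to the dtypes reported for the shrunken DataFrame
small = df.astype({'i': 'int8', 'f': 'float32', 'u': 'int16'})

before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
saved = before - after
print(f"saved {saved} bytes")
```

The three numeric columns shrink from 24 bytes each to 3, 12, and 6 bytes, a saving of 51 bytes, which equals the difference between the 228 and 177 bytes reported above.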

Here’s another example using the ADULT_SAMPLE dataset:

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
new_df = df_shrink(df, int2uint=True)

Initial Dataframe: 3.907452 megabytes
Reduced Dataframe: 0.818333 megabytes

We reduced the overall memory used by 79%!

Tabular

 Tabular (df, procs=None, cat_names=None, cont_names=None, y_names=None,
          y_block=None, splits=None, do_setup=True, device=None,
          inplace=False, reduce_memory=True)

A DataFrame wrapper that knows which cols are cont/cat/y, and returns rows in __getitem__

df: A DataFrame of your data
cat_names: Your categorical x variables
cont_names: Your continuous x variables
y_names: Your dependent y variables
- Note: Mixed y’s, such as regression and classification together, are not currently supported; however, multiple regression or multiple classification outputs are
y_block: How to sub-categorize the type of y_names (CategoryBlock or RegressionBlock)
splits: How to split your data
do_setup: Whether Tabular runs the data through the procs upon initialization
device: The device to use (cuda or cpu)
inplace: If True, Tabular will not keep a separate copy of your original DataFrame in memory. You should ensure pd.options.mode.chained_assignment is None before setting this
reduce_memory: If True, fastai attempts to reduce the overall memory usage of the input DataFrame with df_shrink

TabularPandas

 TabularPandas (df, procs=None, cat_names=None, cont_names=None,
                y_names=None, y_block=None, splits=None, do_setup=True,
                device=None, inplace=False, reduce_memory=True)

A Tabular object with transforms

TabularProc

 TabularProc (enc=None, dec=None, split_idx=None, order=None)

Base class to write a non-lazy tabular processor for dataframes

These transforms are applied as soon as the data is available, rather than when data is requested from the DataLoader.

Categorify

 Categorify (enc=None, dec=None, split_idx=None, order=None)

Transform the categorical variables to something similar to pd.Categorical

While you will not see a visible change in the DataFrame, the classes are stored in to.procs.categorify, as we can see below on a dummy DataFrame:

df = pd.DataFrame({'a':[0,1,2,0,2]})
to = TabularPandas(df, Categorify, 'a')
to.show()

	a
0	0
1	1
2	2
3	0
4	2

Each column’s unique values are stored in a dictionary of column:[values]:

cat = to.procs.categorify
cat.classes

{'a': ['#na#', 0, 1, 2]}
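
The relationship between values, classes, and integer codes can be sketched with pd.Categorical. This is an illustration of the idea only and assumes, as the classes above suggest, that index 0 is reserved for the '#na#' placeholder; it is not how fastai implements Categorify.

```python
import pandas as pd

col = pd.Series(['b', 'a', None, 'b'])
cats = pd.Categorical(col)

# Reserve index 0 for missing values, followed by the sorted categories
classes = ['#na#'] + list(cats.categories)
# pandas encodes missing values as -1, so shifting by one maps them to '#na#'
codes = cats.codes + 1
decoded = [classes[c] for c in codes]

print(classes)
print(codes.tolist())
print(decoded)
```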

FillStrategy

 FillStrategy ()

Namespace containing the various filling strategies.

Currently, filling with the median, a constant, and the mode are supported.

FillMissing

 FillMissing (fill_strategy=<function median>, add_col=True,
              fill_vals=None)

Fill the missing values in continuous columns.
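
As a rough sketch of the default behaviour suggested by the signature (median fill with add_col=True), the snippet below fills a column with its median and records which rows were missing in a companion column. The _na suffix follows the education-num_na column that appears in the integration example; the code is an illustration in plain pandas, not fastai's implementation.

```python
import pandas as pd

df = pd.DataFrame({'education-num': [12.0, 14.0, None, 15.0, None]})
col = 'education-num'

# Record which rows were missing before filling
df[col + '_na'] = df[col].isna()
# Fill with the median of the observed values (14.0 here)
df[col] = df[col].fillna(df[col].median())
print(df)
```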

ReadTabBatch

 ReadTabBatch (to)

Transform TabularPandas values into a Tensor with the ability to decode

TabDataLoader

 TabDataLoader (dataset, bs=16, shuffle=False, after_batch=None,
                num_workers=0, verbose:bool=False, do_setup:bool=True,
                pin_memory=False, timeout=0, batch_size=None,
                drop_last=False, indexed=None, n=None, device=None,
                persistent_workers=False, pin_memory_device='', wif=None,
                before_iter=None, after_item=None, before_batch=None,
                after_iter=None, create_batches=None, create_item=None,
                create_batch=None, retain=None, get_idxs=None,
                sample=None, shuffle_fn=None, do_batch=None)

A transformed DataLoader for Tabular data

TabWeightedDL

 TabWeightedDL (dataset, bs=16, wgts=None, shuffle=False,
                after_batch=None, num_workers=0, verbose:bool=False,
                do_setup:bool=True, pin_memory=False, timeout=0,
                batch_size=None, drop_last=False, indexed=None, n=None,
                device=None, persistent_workers=False,
                pin_memory_device='', wif=None, before_iter=None,
                after_item=None, before_batch=None, after_iter=None,
                create_batches=None, create_item=None, create_batch=None,
                retain=None, get_idxs=None, sample=None, shuffle_fn=None,
                do_batch=None)

A transformed DataLoader for Tabular Weighted data

Integration example

For a more in-depth explanation, see the tabular tutorial

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_test.drop('salary', axis=1, inplace=True)
df_main.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
0	49	Private	101320	Assoc-acdm	12.0	Married-civ-spouse	NaN	Wife	White	Female	0	1902	40	United-States	>=50k
1	44	Private	236746	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	>=50k
2	38	Private	96185	HS-grad	NaN	Divorced	NaN	Unmarried	Black	Female	0	0	32	United-States	<50k
3	38	Self-emp-inc	112847	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	>=50k
4	42	Self-emp-not-inc	82297	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	<50k

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))

to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="salary", splits=splits)

dls = to.dataloaders()
dls.valid.show_batch()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num	salary
0	Self-emp-not-inc	Prof-school	Divorced	Prof-specialty	Not-in-family	White	False	65.000000	316093.005287	15.0	<50k
1	Private	Bachelors	Married-civ-spouse	Exec-managerial	Husband	White	False	69.999999	280306.998091	13.0	<50k
2	Federal-gov	Some-college	Married-civ-spouse	Adm-clerical	Husband	Black	False	34.000000	199933.999862	10.0	>=50k
3	Private	HS-grad	Never-married	Handlers-cleaners	Unmarried	White	False	24.000001	300584.002430	9.0	<50k
4	Private	Assoc-voc	Never-married	Other-service	Not-in-family	White	False	34.000000	220630.999335	11.0	<50k
5	Private	Bachelors	Divorced	Prof-specialty	Unmarried	White	False	45.000000	289230.003178	13.0	>=50k
6	?	Some-college	Never-married	?	Own-child	White	False	26.000000	208993.999494	10.0	<50k
7	Private	Some-college	Divorced	Adm-clerical	Not-in-family	White	False	43.000000	174574.999446	10.0	<50k
8	Self-emp-not-inc	Assoc-voc	Married-civ-spouse	Other-service	Husband	White	False	63.000000	420628.997361	11.0	<50k
9	State-gov	Some-college	Married-civ-spouse	Adm-clerical	Husband	Black	False	25.000000	257064.003065	10.0	<50k

to.show()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num	salary
5516	Private	HS-grad	Divorced	Exec-managerial	Unmarried	White	False	49.0	140121.0	9.0	<50k
7184	Self-emp-inc	Some-college	Never-married	Exec-managerial	Not-in-family	White	False	70.0	207938.0	10.0	<50k
2336	Private	Some-college	Never-married	Priv-house-serv	Own-child	White	False	23.0	50953.0	10.0	<50k
4342	Private	Assoc-voc	Married-civ-spouse	Machine-op-inspct	Husband	White	False	46.0	27802.0	11.0	<50k
8474	Self-emp-not-inc	Assoc-acdm	Married-civ-spouse	Craft-repair	Husband	White	False	47.0	107231.0	12.0	<50k
5948	Local-gov	HS-grad	Married-civ-spouse	Transport-moving	Husband	White	False	40.0	55363.0	9.0	<50k
5342	Local-gov	HS-grad	Married-civ-spouse	Craft-repair	Husband	White	False	46.0	36228.0	9.0	<50k
9005	Private	Bachelors	Married-civ-spouse	Adm-clerical	Husband	White	False	38.0	297449.0	13.0	>=50k
1189	Private	Assoc-voc	Divorced	Sales	Not-in-family	Amer-Indian-Eskimo	False	31.0	87950.0	11.0	<50k
8784	Private	Assoc-voc	Divorced	Prof-specialty	Own-child	Black	False	35.0	491000.0	11.0	<50k

We can decode any set of transformed data by calling to.decode_row with our raw data:

row = to.items.iloc[0]
to.decode_row(row)

age                             49.0
workclass                    Private
fnlwgt                      140121.0
education                    HS-grad
education-num                    9.0
marital-status              Divorced
occupation           Exec-managerial
relationship               Unmarried
race                           White
sex                             Male
capital-gain                       0
capital-loss                       0
hours-per-week                    50
native-country         United-States
salary                          <50k
education-num_na               False
Name: 5516, dtype: object
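
Conceptually, decoding reverses each processing step: category codes are looked up in the stored classes, and normalized values are scaled back with the stored statistics. The sketch below illustrates this with made-up classes, means, and standard deviations; every value in it is hypothetical, and it is not the code behind to.decode_row.

```python
# Hypothetical state that the processors would have stored during setup
classes = {'workclass': ['#na#', 'Private', 'Self-emp-inc']}
means, stds = {'age': 40.0}, {'age': 10.0}

# A made-up encoded row: a category code and a normalized value
encoded = {'workclass': 1, 'age': 0.5}

decoded = {
    'workclass': classes['workclass'][encoded['workclass']],  # code -> label
    'age': encoded['age'] * stds['age'] + means['age'],       # undo normalization
}
print(decoded)
```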

We can make new test datasets based on the training data with to.new():

Note

Since machine learning models can’t magically understand categories they were never trained on, the data should reflect this. If there are different missing values in your test data, you should address this before training.

to_tst = to.new(df_test)
to_tst.process()
to_tst.items.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	hours-per-week	native-country	education-num_na
10000	0.465031	5	1.319553	10	1.176677	3	2	1	2	Male	40	Philippines	1
10001	-0.926675	5	1.233650	12	-0.420035	3	15	1	4	Male	40	United-States	1
10002	1.051012	5	0.145161	2	-1.218391	1	9	2	5	Female	37	United-States	1
10003	0.538279	5	-0.282370	12	-0.420035	7	2	5	5	Female	43	United-States	1
10004	0.758022	6	1.420768	9	0.378321	3	5	1	5	Male	60	United-States	1
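
The continuous columns above are normalized, which is why they no longer look like raw ages or weights. The sketch below shows the general principle of scaling test data with statistics computed on the training data only. The sample values are taken from the age column shown earlier, the choice of the population standard deviation is an assumption, and the code is not fastai's Normalize.

```python
import numpy as np

train_age = np.array([49.0, 44.0, 38.0, 38.0, 42.0])
test_age = np.array([45.0, 26.0])

# Statistics come from the training data only
mean, std = train_age.mean(), train_age.std()

train_scaled = (train_age - mean) / std
# The test data reuses the training statistics instead of computing its own
test_scaled = (test_age - mean) / std
print(test_scaled)
```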

We can then convert it to a DataLoader:

tst_dl = dls.valid.new(to_tst)
tst_dl.show_batch()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num
0	Private	Bachelors	Married-civ-spouse	Adm-clerical	Husband	Asian-Pac-Islander	False	45.000000	338105.005817	13.0
1	Private	HS-grad	Married-civ-spouse	Transport-moving	Husband	Other	False	26.000000	328663.002806	9.0
2	Private	11th	Divorced	Other-service	Not-in-family	White	False	52.999999	209022.000317	7.0
3	Private	HS-grad	Widowed	Adm-clerical	Unmarried	White	False	46.000000	162029.998917	9.0
4	Self-emp-inc	Assoc-voc	Married-civ-spouse	Exec-managerial	Husband	White	False	49.000000	349230.006300	11.0
5	Local-gov	Some-college	Married-civ-spouse	Exec-managerial	Husband	White	False	34.000000	124827.002059	10.0
6	Self-emp-inc	Some-college	Married-civ-spouse	Sales	Husband	White	False	52.999999	290640.002462	10.0
7	Private	Some-college	Never-married	Sales	Own-child	White	False	19.000000	106272.998239	10.0
8	Private	Some-college	Married-civ-spouse	Protective-serv	Husband	Black	False	71.999999	53684.001668	10.0
9	Private	Some-college	Never-married	Sales	Own-child	White	False	20.000000	505980.010609	10.0

# Create a TabWeightedDL
train_ds = to.train
weights = np.random.random(len(train_ds))
train_dl = TabWeightedDL(train_ds, wgts=weights, bs=64, shuffle=True)

train_dl.show_batch()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num	salary
0	Local-gov	Masters	Never-married	Prof-specialty	Not-in-family	White	False	31.000000	204469.999932	14.0	<50k
1	Self-emp-not-inc	HS-grad	Divorced	Farming-fishing	Not-in-family	White	False	32.000000	34572.002104	9.0	<50k
2	?	Some-college	Widowed	?	Not-in-family	White	False	64.000000	34099.998990	10.0	<50k
3	Private	Some-college	Divorced	Exec-managerial	Not-in-family	White	False	32.000000	251242.999189	10.0	>=50k
4	Federal-gov	HS-grad	Married-civ-spouse	Exec-managerial	Husband	White	False	55.000001	176903.999313	9.0	<50k
5	Private	11th	Married-civ-spouse	Transport-moving	Husband	White	False	50.000000	192203.000000	7.0	<50k
6	Private	10th	Never-married	Farming-fishing	Own-child	Black	False	36.000000	181720.999704	6.0	<50k
7	Local-gov	Masters	Divorced	Prof-specialty	Not-in-family	Amer-Indian-Eskimo	False	50.000000	220640.001490	14.0	>=50k
8	Private	HS-grad	Married-civ-spouse	Adm-clerical	Wife	White	False	36.000000	189381.999993	9.0	>=50k
9	Private	Masters	Divorced	Prof-specialty	Unmarried	White	False	42.000000	265697.997341	14.0	<50k
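
The effect of the wgts argument can be illustrated with NumPy alone. The sketch below assumes that items are drawn with probability proportional to their weight, which is the usual meaning of weighted sampling; it is not the TabWeightedDL code, and the weights are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.0, 1.0, 3.0])

# Normalize the weights into sampling probabilities
probs = weights / weights.sum()
idxs = rng.choice(len(weights), size=1000, p=probs)

# Count how often each item was drawn
counts = np.bincount(idxs, minlength=len(weights))
print(counts)
```

An item with weight 0 is never drawn, and the item with weight 3 is drawn roughly three times as often as the item with weight 1.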

TabDataLoader’s create_item method

df = pd.DataFrame([{'age': 35}])
to = TabularPandas(df)
dls = to.dataloaders()
print(dls.create_item(0))
# test_eq(dls.create_item(0).items.to_dict(), {'age': 0.5330614747286777, 'workclass': 5, 'fnlwgt': -0.26305443080666174, 'education': 10, 'education-num': 1.169790230219763, 'marital-status': 1, 'occupation': 13, 'relationship': 5, 'race': 3, 'sex': ' Female', 'capital-gain': 0, 'capital-loss': 0, 'hours-per-week': 35, 'native-country': 'United-States', 'salary': 1, 'education-num_na': 1})

age    35
Name: 0, dtype: int8

Other target types

Multi-label categories

one-hot encoded label

def _mock_multi_label(df):
    sal,sex,white = [],[],[]
    for row in df.itertuples():
        sal.append(row.salary == '>=50k')
        sex.append(row.sex == ' Male')
        white.append(row.race == ' White')
    df['salary'] = np.array(sal)
    df['male']   = np.array(sex)
    df['white']  = np.array(white)
    return df

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)

df_main.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary	male	white
0	49	Private	101320	Assoc-acdm	12.0	Married-civ-spouse	NaN	Wife	White	Female	0	1902	40	United-States	True	False	True
1	44	Private	236746	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	True	True	True
2	38	Private	96185	HS-grad	NaN	Divorced	NaN	Unmarried	Black	Female	0	0	32	United-States	False	False	False
3	38	Self-emp-inc	112847	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	True	True	False
4	42	Self-emp-not-inc	82297	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	False	False	False

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
y_names=["salary", "male", "white"]

CPU times: user 66 ms, sys: 0 ns, total: 66 ms
Wall time: 65.3 ms

dls = to.dataloaders()
dls.valid.show_batch()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num	salary	male	white
0	Private	HS-grad	Divorced	Exec-managerial	Unmarried	White	False	47.000000	164423.000013	9.0	False	False	True
1	Private	Some-college	Married-civ-spouse	Transport-moving	Husband	White	False	74.999999	239037.999499	10.0	False	True	True
2	Private	HS-grad	Married-civ-spouse	Sales	Wife	White	False	45.000000	228570.000761	9.0	False	False	True
3	Self-emp-not-inc	HS-grad	Married-civ-spouse	Exec-managerial	Husband	Asian-Pac-Islander	False	45.000000	285574.998753	9.0	False	True	False
4	Private	Some-college	Never-married	Adm-clerical	Own-child	White	False	21.999999	184812.999966	10.0	False	True	True
5	Private	10th	Married-civ-spouse	Transport-moving	Husband	White	False	67.000001	274450.998865	6.0	False	True	True
6	Private	HS-grad	Divorced	Exec-managerial	Unmarried	White	False	53.999999	192862.000000	9.0	False	False	True
7	Federal-gov	Some-college	Divorced	Tech-support	Unmarried	Amer-Indian-Eskimo	False	37.000000	33486.997455	10.0	False	False	False
8	Private	HS-grad	Never-married	Machine-op-inspct	Other-relative	White	False	30.000000	219318.000010	9.0	False	False	True
9	Self-emp-not-inc	Bachelors	Married-civ-spouse	Sales	Husband	White	False	44.000000	167279.999960	13.0	False	True	True

Not one-hot encoded

def _mock_multi_label(df):
    targ = []
    for row in df.itertuples():
        labels = []
        if row.salary == '>=50k': labels.append('>50k')
        if row.sex == ' Male':   labels.append('male')
        if row.race == ' White': labels.append('white')
        targ.append(' '.join(labels))
    df['target'] = np.array(targ)
    return df

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)

df_main.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary	target
0	49	Private	101320	Assoc-acdm	12.0	Married-civ-spouse	NaN	Wife	White	Female	0	1902	40	United-States	>=50k	>50k white
1	44	Private	236746	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	>=50k	>50k male white
2	38	Private	96185	HS-grad	NaN	Divorced	NaN	Unmarried	Black	Female	0	0	32	United-States	<50k
3	38	Self-emp-inc	112847	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	>=50k	>50k male
4	42	Self-emp-not-inc	82297	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	<50k
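
The two label formats carry the same information. As a small illustration, a space-separated target column like the one above can be expanded into one indicator column per label with Series.str.get_dummies. This is a plain pandas sketch using the five target values shown in the table; it is not part of the fastai pipeline.

```python
import pandas as pd

target = pd.Series(['>50k white', '>50k male white', '', '>50k male', ''])

# One indicator column per label; rows with an empty target get all zeros
onehot = target.str.get_dummies(sep=' ')
print(onehot)
```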

@MultiCategorize
def encodes(self, to:Tabular):
    #to.transform(to.y_names, partial(_apply_cats, {n: self.vocab for n in to.y_names}, 0))
    return to

@MultiCategorize
def decodes(self, to:Tabular):
    #to.transform(to.y_names, partial(_decode_cats, {n: self.vocab for n in to.y_names}))
    return to

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))

CPU times: user 68.6 ms, sys: 0 ns, total: 68.6 ms
Wall time: 67.9 ms

to.procs[2].vocab

['-', '_', 'a', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y']

Regression

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))

CPU times: user 70.7 ms, sys: 290 µs, total: 71 ms
Wall time: 70.3 ms

to.procs[-1].means

{'fnlwgt': 192085.701, 'education-num': 10.059124946594238}

dls = to.dataloaders()
dls.valid.show_batch()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	fnlwgt	education-num	age
0	Private	12th	Never-married	Adm-clerical	Other-relative	Black	False	503454.004078	8.0	47.0
1	Federal-gov	Bachelors	Married-civ-spouse	Exec-managerial	Husband	White	False	586656.993690	13.0	49.0
2	Self-emp-not-inc	Assoc-voc	Married-civ-spouse	Farming-fishing	Husband	White	False	164607.001243	11.0	29.0
3	Private	HS-grad	Never-married	Adm-clerical	Not-in-family	Black	False	155508.999873	9.0	48.0
4	Private	11th	Never-married	Other-service	Own-child	White	False	318189.998679	7.0	18.0
5	Private	HS-grad	Never-married	Adm-clerical	Other-relative	White	False	140219.001104	9.0	47.0
6	Private	Masters	Divorced	#na#	Unmarried	White	True	235683.001562	10.0	47.0
7	Private	Bachelors	Married-civ-spouse	Craft-repair	Husband	White	False	187321.999825	13.0	43.0
8	Private	Bachelors	Married-civ-spouse	Prof-specialty	Husband	White	False	104196.002410	13.0	40.0
9	Private	Some-college	Separated	Priv-house-serv	Other-relative	White	False	184302.999784	10.0	25.0

Not being used now - for multi-modal

class TensorTabular(fastuple):
    def get_ctxs(self, max_n=10, **kwargs):
        n_samples = min(self[0].shape[0], max_n)
        df = pd.DataFrame(index = range(n_samples))
        return [df.iloc[i] for i in range(n_samples)]

    def display(self, ctxs): display_df(pd.DataFrame(ctxs))

class TabularLine(pd.Series):
    "A line of a dataframe that knows how to show itself"
    def show(self, ctx=None, **kwargs): return self if ctx is None else ctx.append(self)

class ReadTabLine(ItemTransform):
    def __init__(self, proc): self.proc = proc

    def encodes(self, row):
        cats,conts = (o.map(row.__getitem__) for o in (self.proc.cat_names,self.proc.cont_names))
        return TensorTabular(tensor(cats).long(),tensor(conts).float())

    def decodes(self, o):
        to = TabularPandas(o, self.proc.cat_names, self.proc.cont_names, self.proc.y_names)
        to = self.proc.decode(to)
        return TabularLine(pd.Series({c: v for v,c in zip(to.items[0]+to.items[1], self.proc.cat_names+self.proc.cont_names)}))

class ReadTabTarget(ItemTransform):
    def __init__(self, proc): self.proc = proc
    def encodes(self, row): return row[self.proc.y_names].astype(np.int64)
    def decodes(self, o): return Category(self.proc.classes[self.proc.y_names][o])

# tds = TfmdDS(to.items, tfms=[[ReadTabLine(proc)], ReadTabTarget(proc)])
# enc = tds[1]
# test_eq(enc[0][0], tensor([2,1]))
# test_close(enc[0][1], tensor([-0.628828]))
# test_eq(enc[1], 1)

# dec = tds.decode(enc)
# assert isinstance(dec[0], TabularLine)
# test_close(dec[0], pd.Series({'a': 1, 'b_na': False, 'b': 1}))
# test_eq(dec[1], 'a')

# test_stdout(lambda: print(show_at(tds, 1)), """a               1
# b_na        False
# b               1
# category        a
# dtype: object""")