Tabular core

Basic functions to preprocess tabular data before assembling it in a DataLoaders.

Initial preprocessing



make_date

 make_date (df, date_field)

Make sure df[date_field] is of the right date type.

df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']})
make_date(df, 'date')
test_eq(df['date'].dtype, np.dtype('datetime64[ns]'))


add_datepart

 add_datepart (df, field_name, prefix=None, drop=True, time=False)

Helper function that adds columns relevant to a date in the column field_name of df.

For example, given a series of dates we can generate features such as Year, Month, Day, Dayofweek and Is_month_start, as shown below:

df = pd.DataFrame({'date': ['2019-12-04', None, '2019-11-15', '2019-10-24']})
df = add_datepart(df, 'date')
df.head()
Year Month Week Day Dayofweek Dayofyear Is_month_end Is_month_start Is_quarter_end Is_quarter_start Is_year_end Is_year_start Elapsed
0 2019.0 12.0 49.0 4.0 2.0 338.0 False False False False False False 1.575418e+09
1 NaN NaN NaN NaN NaN NaN False False False False False False NaN
2 2019.0 11.0 46.0 15.0 4.0 319.0 False False False False False False 1.573776e+09
3 2019.0 10.0 43.0 24.0 3.0 297.0 False False False False False False 1.571875e+09
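To make the generated columns less opaque, here is a minimal sketch of how a subset of them could be derived with plain pandas. It is not fastai's implementation: `sketch_datepart` is a hypothetical helper, it covers only some of the columns, and the prefix rule (dropping a trailing "date" from the field name) is an assumption inferred from the column names shown on this page.

```python
import re
import pandas as pd

def sketch_datepart(df, field_name, prefix=None):
    """Illustrative only: derive a few date features with pandas `.dt` accessors."""
    dates = pd.to_datetime(df[field_name])
    # Assumption: the default prefix is the field name without a trailing "date",
    # which matches the d1_date -> d1_Year naming seen in this page's outputs.
    if prefix is None:
        prefix = re.sub(r"[Dd]ate$", "", field_name)
    out = df.drop(columns=[field_name])
    for attr in ["year", "month", "day", "dayofweek", "dayofyear",
                 "is_month_end", "is_month_start"]:
        out[prefix + attr.capitalize()] = getattr(dates.dt, attr)
    # Seconds since the Unix epoch; missing dates stay NaN.
    out[prefix + "Elapsed"] = (dates - pd.Timestamp("1970-01-01")) / pd.Timedelta(seconds=1)
    return out

df = pd.DataFrame({"date": ["2019-12-04", None, "2019-11-15"]})
features = sketch_datepart(df, "date")
print(features)
```

Rows with a missing date get NaN in the numeric columns and False in the boolean ones, which is the same pattern visible in row 1 of the output above.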


add_elapsed_times

 add_elapsed_times (df, field_names, date_field, base_field)

For each event in field_names, add to df the elapsed time according to date_field, grouped by base_field.

df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24'],
                   'event': [False, True, False, True], 'base': [1,1,2,2]})
df = add_elapsed_times(df, ['event'], 'date', 'base')
df.head()
         date  event  base  Afterevent  Beforeevent  event_bw  event_fw
0  2019-12-04  False     1           5            0       1.0       0.0
1  2019-11-29   True     1           0            0       1.0       1.0
2  2019-11-15  False     2          22            0       1.0       0.0
3  2019-10-24   True     2           0            0       1.0       1.0


cont_cat_split

 cont_cat_split (df, max_card=20, dep_var=None)

Helper function that returns column names of cont and cat variables from given df.

This function decides whether a column is continuous or categorical from its dtype and the cardinality of its values. Float columns, and integer columns with more than max_card unique values, are added to cont_names; every other column goes to cat_names. An example is below:

# Example with simple numpy types
df = pd.DataFrame({'cat1': [1, 2, 3, 4], 'cont1': [1., 2., 3., 2.], 'cat2': ['a', 'b', 'b', 'a'],
                   'i8': pd.Series([1, 2, 3, 4], dtype='int8'),
                   'u8': pd.Series([1, 2, 3, 4], dtype='uint8'),
                   'f16': pd.Series([1, 2, 3, 4], dtype='float16'),
                   'y1': [1, 0, 1, 0], 'y2': [2, 1, 1, 0]})
cont_names, cat_names = cont_cat_split(df)
cont_names: ['cont1', 'f16']
cat_names: ['cat1', 'cat2', 'i8', 'u8', 'y1', 'y2']
# Example with pandas types and generated columns
df = pd.DataFrame({'cat1': pd.Series(['l','xs','xl','s'], dtype='category'),
                    'ui32': pd.Series([1, 2, 3, 4], dtype='UInt32'),
                    'i64': pd.Series([1, 2, 3, 4], dtype='Int64'),
                    'f16': pd.Series([1, 2, 3, 4], dtype='Float64'),
                    'd1_date': ['2021-02-09', None, '2020-05-12', '2020-08-14'],
                    })
df = add_datepart(df, 'd1_date', drop=False)
df['cat1'] = df['cat1'].cat.set_categories(['xl','l','m','s','xs'], ordered=True)
cont_names, cat_names = cont_cat_split(df, max_card=0)
cont_names: ['ui32', 'i64', 'f16', 'd1_Year', 'd1_Month', 'd1_Week', 'd1_Day', 'd1_Dayofweek', 'd1_Dayofyear', 'd1_Elapsed']
cat_names: ['cat1', 'd1_date', 'd1_Is_month_end', 'd1_Is_month_start', 'd1_Is_quarter_end', 'd1_Is_quarter_start', 'd1_Is_year_end', 'd1_Is_year_start']
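The splitting rule can be sketched in a few lines of pandas. This is an illustration of the behaviour described above, not the library source; `sketch_cont_cat_split` is a hypothetical name, and the handling of `dep_var` as either a string or a list is an assumption.

```python
import pandas as pd

def sketch_cont_cat_split(df, max_card=20, dep_var=None):
    """Illustrative only: split columns into continuous and categorical names."""
    # Assumption: dep_var may be a single column name or a list of names.
    skip = [dep_var] if isinstance(dep_var, str) else list(dep_var or [])
    cont_names, cat_names = [], []
    for label in df.columns:
        if label in skip:
            continue
        col = df[label]
        is_float = pd.api.types.is_float_dtype(col)
        is_high_card_int = pd.api.types.is_integer_dtype(col) and col.nunique() > max_card
        # Floats are always continuous; integers only when they have many distinct values.
        (cont_names if is_float or is_high_card_int else cat_names).append(label)
    return cont_names, cat_names

df = pd.DataFrame({"cat1": [1, 2, 3, 4],
                   "cont1": [1.0, 2.0, 3.0, 2.0],
                   "cat2": ["a", "b", "b", "a"]})
default_split = sketch_cont_cat_split(df)
strict_split = sketch_cont_cat_split(df, max_card=2)
print(default_split, strict_split)
```

Lowering max_card moves integer columns from the categorical list to the continuous one, which is why the second example above passes max_card=0.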


df_shrink_dtypes

 df_shrink_dtypes (df, skip=[], obj2cat=True, int2uint=False)

Return any possible smaller data types for DataFrame columns. Allows object->category, int->uint, and exclusion.

For example, we will make a sample DataFrame with int, float, bool, and object datatypes:

df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'e': [True, False, True],
                   'date':['2019-12-04','2019-11-29','2019-11-15',]})
df.dtypes
i         int64
f       float64
e          bool
date     object
dtype: object

We can then call df_shrink_dtypes to find the smallest possible datatype that can support the data:

dt = df_shrink_dtypes(df)
dt
{'i': dtype('int8'), 'f': dtype('float32'), 'date': 'category'}
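As a rough illustration of how a smaller integer type can be chosen, the sketch below walks the NumPy integer types from narrowest to widest and returns the first one whose range covers the data. `smallest_int_dtype` is a hypothetical helper, and the exact boundary handling in df_shrink_dtypes may differ.

```python
import numpy as np

def smallest_int_dtype(values, int2uint=False):
    """Illustrative only: pick the narrowest integer dtype that can hold `values`."""
    lo, hi = min(values), max(values)
    candidates = [np.int8, np.int16, np.int32, np.int64]
    # Unsigned types are only an option when no value is negative.
    if int2uint and lo >= 0:
        candidates = [np.uint8, np.uint16, np.uint32, np.uint64]
    for candidate in candidates:
        info = np.iinfo(candidate)
        if info.min <= lo and hi <= info.max:
            return np.dtype(candidate)
    raise ValueError("values do not fit in a 64-bit integer")

signed = smallest_int_dtype([-100, 0, 100])
wide = smallest_int_dtype([0, 10, 254])
unsigned = smallest_int_dtype([0, 10, 254], int2uint=True)
print(signed, wide, unsigned)
```

The value 254 does not fit in int8 (maximum 127), so a signed search lands on int16, while allowing unsigned types lets it fit in uint8.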


df_shrink

 df_shrink (df, skip=[], obj2cat=True, int2uint=False)

Reduce DataFrame memory usage, by casting to smaller types returned by df_shrink_dtypes().

df_shrink(df) attempts to make a DataFrame use less memory by fitting numeric columns into the smallest possible datatypes. In addition:

  • boolean, category, and datetime64[ns] dtype columns are ignored.
  • ‘object’ type columns are categorified, which can save a lot of memory in large datasets. This can be turned off with obj2cat=False.
  • int2uint=True fits int types to uint types if all data in the column is >= 0.
  • columns can be excluded by name using skip=['col1','col2'].

To get only the new column data types without actually casting a DataFrame, use df_shrink_dtypes() with the same parameters as df_shrink().

df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'u':[0, 10,254],
                  'date':['2019-12-04','2019-11-29','2019-11-15']})
df2 = df_shrink(df, skip=['date'])

Let’s compare the two:

df.dtypes
i         int64
f       float64
u         int64
date     object
dtype: object
df2.dtypes
i          int8
f       float32
u         int16
date     object
dtype: object

We can see that the datatypes changed. We can also compare their memory usage:

Initial Dataframe: 224 bytes
Reduced Dataframe: 173 bytes
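One way to make such a comparison yourself is pandas' memory_usage. The sketch below casts two columns by hand and excludes the index, so its numbers are not meant to reproduce the figures printed above, which may have been measured differently.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"i": [-100, 0, 100], "f": [-100.0, 0.0, 100.0]})
# Cast by hand to the smaller types reported for these columns earlier on this page.
small = df.astype({"i": np.int8, "f": np.float32})

# index=False counts only the column data, not the index.
before = int(df.memory_usage(index=False).sum())
after = int(small.memory_usage(index=False).sum())
print(f"Initial columns: {before} bytes")
print(f"Reduced columns: {after} bytes")
```

Three int64 and three float64 values take 8 bytes each, while int8 and float32 take 1 and 4 bytes per value respectively.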

Here’s another example using the ADULT_SAMPLE dataset:

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
new_df = df_shrink(df, int2uint=True)
Initial Dataframe: 3.907448 megabytes
Reduced Dataframe: 0.818329 megabytes

We reduced the overall memory used by 79%!



Tabular

 Tabular (df, procs=None, cat_names=None, cont_names=None, y_names=None,
          y_block=None, splits=None, do_setup=True, device=None,
          inplace=False, reduce_memory=True)

A DataFrame wrapper that knows which cols are cont/cat/y, and returns rows in __getitem__

  • df: A DataFrame of your data
  • cat_names: Your categorical x variables
  • cont_names: Your continuous x variables
  • y_names: Your dependent y variables
    • Note: Mixed y’s, such as regression and classification together, are not currently supported; multiple regression or multiple classification outputs are
  • y_block: How to sub-categorize the type of y_names (CategoryBlock or RegressionBlock)
  • splits: How to split your data
  • do_setup: Whether Tabular runs the data through the procs upon initialization
  • device: cuda or cpu
  • inplace: If True, Tabular will not keep a separate copy of your original DataFrame in memory. You should ensure pd.options.mode.chained_assignment is None before setting this
  • reduce_memory: If True, fastai will attempt to reduce the overall memory usage of the input DataFrame with df_shrink


TabularPandas

 TabularPandas (df, procs=None, cat_names=None, cont_names=None,
                y_names=None, y_block=None, splits=None, do_setup=True,
                device=None, inplace=False, reduce_memory=True)

A Tabular object with transforms



TabularProc

 TabularProc (enc=None, dec=None, split_idx=None, order=None)

Base class to write a non-lazy tabular processor for dataframes

These transforms are applied as soon as the data is available, rather than lazily as data is requested from the DataLoader.



Categorify

 Categorify (enc=None, dec=None, split_idx=None, order=None)

Transform the categorical variables to something similar to pd.Categorical

While you will not see a visible change in the DataFrame, the classes are stored in to.procs.categorify, as we can see below on a dummy DataFrame:

df = pd.DataFrame({'a':[0,1,2,0,2]})
to = TabularPandas(df, Categorify, 'a')
to.show()
a
0 0
1 1
2 2
3 0
4 2

Each column’s unique values are stored in a dictionary of column:[values]:

cat = to.procs.categorify
cat.classes
{'a': ['#na#', 0, 1, 2]}
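The idea behind the stored classes can be sketched in plain Python: each distinct value gets an integer code, with index 0 reserved for '#na#'. `fit_classes` and `encode` are hypothetical helpers, and mapping unseen or missing values to index 0 is stated here as an assumption about the intended behaviour rather than a copy of the library code.

```python
def fit_classes(values):
    """Collect the sorted unique non-missing values, reserving index 0 for '#na#'."""
    return ["#na#"] + sorted({v for v in values if v is not None})

def encode(values, classes):
    """Map each value to its index in `classes`; anything unknown falls back to 0."""
    lookup = {value: index for index, value in enumerate(classes)}
    return [lookup.get(value, 0) for value in values]

classes = fit_classes([0, 1, 2, 0, 2])
codes = encode([0, 1, 2, 5, None], classes)
print(classes)
print(codes)
```

The value 5 never appeared when the classes were collected, so it shares the '#na#' code with the missing entry.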


FillStrategy

 FillStrategy ()

Namespace containing the various filling strategies.

Currently, filling with the median, a constant, and the mode are supported.



FillMissing

 FillMissing (fill_strategy=<function median>, add_col=True,
              fill_vals=None)

Fill the missing values in continuous columns.
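A minimal pandas sketch of median filling with an indicator column follows. It assumes the fill value is learned from the training rows and reused elsewhere, and that the indicator column is named with an `_na` suffix as in the `education-num_na` column shown in the outputs on this page; `sketch_fill_missing` is a hypothetical helper, not the FillMissing source.

```python
import pandas as pd

def sketch_fill_missing(train, valid, cont_names):
    """Illustrative only: median-fill continuous columns and flag the filled rows."""
    train, valid = train.copy(), valid.copy()
    for name in cont_names:
        # Assumption: only columns with missing training values are processed.
        if not train[name].isna().any():
            continue
        fill_value = train[name].median()
        for part in (train, valid):
            part[name + "_na"] = part[name].isna()
            part[name] = part[name].fillna(fill_value)
    return train, valid

train = pd.DataFrame({"x": [1.0, None, 3.0]})
valid = pd.DataFrame({"x": [None, 10.0]})
train_filled, valid_filled = sketch_fill_missing(train, valid, ["x"])
print(train_filled)
print(valid_filled)
```

Using the training median for the validation rows keeps statistics from held-out data out of the preprocessing step.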



ReadTabBatch

 ReadTabBatch (to)

Transform TabularPandas values into a Tensor with the ability to decode



TabDataLoader

 TabDataLoader (dataset, bs=16, shuffle=False, after_batch=None,
                num_workers=0, verbose:bool=False, do_setup:bool=True,
                pin_memory=False, timeout=0, batch_size=None,
                drop_last=False, indexed=None, n=None, device=None,
                persistent_workers=False, pin_memory_device='', wif=None,
                before_iter=None, after_item=None, before_batch=None,
                after_iter=None, create_batches=None, create_item=None,
                create_batch=None, retain=None, get_idxs=None,
                sample=None, shuffle_fn=None, do_batch=None)

A transformed DataLoader for Tabular data



TabWeightedDL

 TabWeightedDL (dataset, bs=16, wgts=None, shuffle=False,
                after_batch=None, num_workers=0, verbose:bool=False,
                do_setup:bool=True, pin_memory=False, timeout=0,
                batch_size=None, drop_last=False, indexed=None, n=None,
                device=None, persistent_workers=False,
                pin_memory_device='', wif=None, before_iter=None,
                after_item=None, before_batch=None, after_iter=None,
                create_batches=None, create_item=None, create_batch=None,
                retain=None, get_idxs=None, sample=None, shuffle_fn=None,
                do_batch=None)

A transformed DataLoader for Tabular Weighted data

Integration example

For a more in-depth explanation, see the tabular tutorial.

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_test.drop('salary', axis=1, inplace=True)
df_main.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse NaN Wife White Female 0 1902 40 United-States >=50k
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k
2 38 Private 96185 HS-grad NaN Divorced NaN Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="salary", splits=splits)
dls = to.dataloaders()
dls.valid.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary
0 Self-emp-not-inc Prof-school Divorced Prof-specialty Not-in-family White False 65.000000 316093.005287 15.0 <50k
1 Private Bachelors Married-civ-spouse Exec-managerial Husband White False 69.999999 280306.998091 13.0 <50k
2 Federal-gov Some-college Married-civ-spouse Adm-clerical Husband Black False 34.000000 199933.999862 10.0 >=50k
3 Private HS-grad Never-married Handlers-cleaners Unmarried White False 24.000001 300584.002430 9.0 <50k
4 Private Assoc-voc Never-married Other-service Not-in-family White False 34.000000 220630.999335 11.0 <50k
5 Private Bachelors Divorced Prof-specialty Unmarried White False 45.000000 289230.003178 13.0 >=50k
6 ? Some-college Never-married ? Own-child White False 26.000000 208993.999494 10.0 <50k
7 Private Some-college Divorced Adm-clerical Not-in-family White False 43.000000 174574.999446 10.0 <50k
8 Self-emp-not-inc Assoc-voc Married-civ-spouse Other-service Husband White False 63.000000 420628.997361 11.0 <50k
9 State-gov Some-college Married-civ-spouse Adm-clerical Husband Black False 25.000000 257064.003065 10.0 <50k
to.show()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary
5516 Private HS-grad Divorced Exec-managerial Unmarried White False 49.0 140121.0 9.0 <50k
7184 Self-emp-inc Some-college Never-married Exec-managerial Not-in-family White False 70.0 207938.0 10.0 <50k
2336 Private Some-college Never-married Priv-house-serv Own-child White False 23.0 50953.0 10.0 <50k
4342 Private Assoc-voc Married-civ-spouse Machine-op-inspct Husband White False 46.0 27802.0 11.0 <50k
8474 Self-emp-not-inc Assoc-acdm Married-civ-spouse Craft-repair Husband White False 47.0 107231.0 12.0 <50k
5948 Local-gov HS-grad Married-civ-spouse Transport-moving Husband White False 40.0 55363.0 9.0 <50k
5342 Local-gov HS-grad Married-civ-spouse Craft-repair Husband White False 46.0 36228.0 9.0 <50k
9005 Private Bachelors Married-civ-spouse Adm-clerical Husband White False 38.0 297449.0 13.0 >=50k
1189 Private Assoc-voc Divorced Sales Not-in-family Amer-Indian-Eskimo False 31.0 87950.0 11.0 <50k
8784 Private Assoc-voc Divorced Prof-specialty Own-child Black False 35.0 491000.0 11.0 <50k

We can decode any set of transformed data by calling to.decode_row with our raw data:

row = to.items.iloc[0]
to.decode_row(row)
age                             49.0
workclass                    Private
fnlwgt                      140121.0
education                    HS-grad
education-num                    9.0
marital-status              Divorced
occupation           Exec-managerial
relationship               Unmarried
race                           White
sex                             Male
capital-gain                       0
capital-loss                       0
hours-per-week                    50
native-country         United-States
salary                          <50k
education-num_na               False
Name: 5516, dtype: object
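Decoding reverses each processing step. For a normalized continuous column, that amounts to undoing a standardization, as in the sketch below. It uses the ages from the first five rows of df_main shown earlier and a population standard deviation; the exact statistics the Normalize proc stores are not reproduced here.

```python
from statistics import fmean, pstdev

ages = [49.0, 44.0, 38.0, 38.0, 42.0]
mean, std = fmean(ages), pstdev(ages)

# Encode: centre on the mean and scale by the standard deviation.
encoded = [(age - mean) / std for age in ages]
# Decode: apply the inverse operations in reverse order.
decoded = [value * std + mean for value in encoded]

print(encoded)
print(decoded)
```

The decoded values match the originals up to floating-point rounding, which is also why some decoded ages in the batches on this page show values such as 69.999999.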

We can make new test datasets based on the training data with to.new():

Note

Since machine learning models can’t magically understand categories they were never trained on, the data should reflect this. If there are different missing values in your test data, you should address this before training.

to_tst = to.new(df_test)
to_tst.process()
to_tst.items.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country education-num_na
10000 0.465031 5 1.319553 10 1.176677 3 2 1 2 Male 0 0 40 Philippines 1
10001 -0.926675 5 1.233650 12 -0.420035 3 15 1 4 Male 0 0 40 United-States 1
10002 1.051012 5 0.145161 2 -1.218391 1 9 2 5 Female 0 0 37 United-States 1
10003 0.538279 5 -0.282370 12 -0.420035 7 2 5 5 Female 0 0 43 United-States 1
10004 0.758022 6 1.420768 9 0.378321 3 5 1 5 Male 0 0 60 United-States 1

We can then convert it to a DataLoader:

tst_dl = dls.valid.new(to_tst)
tst_dl.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num
0 Private Bachelors Married-civ-spouse Adm-clerical Husband Asian-Pac-Islander False 45.000000 338105.005817 13.0
1 Private HS-grad Married-civ-spouse Transport-moving Husband Other False 26.000000 328663.002806 9.0
2 Private 11th Divorced Other-service Not-in-family White False 52.999999 209022.000317 7.0
3 Private HS-grad Widowed Adm-clerical Unmarried White False 46.000000 162029.998917 9.0
4 Self-emp-inc Assoc-voc Married-civ-spouse Exec-managerial Husband White False 49.000000 349230.006300 11.0
5 Local-gov Some-college Married-civ-spouse Exec-managerial Husband White False 34.000000 124827.002059 10.0
6 Self-emp-inc Some-college Married-civ-spouse Sales Husband White False 52.999999 290640.002462 10.0
7 Private Some-college Never-married Sales Own-child White False 19.000000 106272.998239 10.0
8 Private Some-college Married-civ-spouse Protective-serv Husband Black False 71.999999 53684.001668 10.0
9 Private Some-college Never-married Sales Own-child White False 20.000000 505980.010609 10.0
# Create a TabWeightedDL
train_ds = to.train
weights = np.random.random(len(train_ds))
train_dl = TabWeightedDL(train_ds, wgts=weights, bs=64, shuffle=True)

train_dl.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary
0 Local-gov Masters Never-married Prof-specialty Not-in-family White False 31.000000 204469.999932 14.0 <50k
1 Self-emp-not-inc HS-grad Divorced Farming-fishing Not-in-family White False 32.000000 34572.002104 9.0 <50k
2 ? Some-college Widowed ? Not-in-family White False 64.000000 34099.998990 10.0 <50k
3 Private Some-college Divorced Exec-managerial Not-in-family White False 32.000000 251242.999189 10.0 >=50k
4 Federal-gov HS-grad Married-civ-spouse Exec-managerial Husband White False 55.000001 176903.999313 9.0 <50k
5 Private 11th Married-civ-spouse Transport-moving Husband White False 50.000000 192203.000000 7.0 <50k
6 Private 10th Never-married Farming-fishing Own-child Black False 36.000000 181720.999704 6.0 <50k
7 Local-gov Masters Divorced Prof-specialty Not-in-family Amer-Indian-Eskimo False 50.000000 220640.001490 14.0 >=50k
8 Private HS-grad Married-civ-spouse Adm-clerical Wife White False 36.000000 189381.999993 9.0 >=50k
9 Private Masters Divorced Prof-specialty Unmarried White False 42.000000 265697.997341 14.0 <50k
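The effect of per-item weights can be illustrated with NumPy alone: indices are drawn with replacement, with probability proportional to each weight. This is a sketch of the sampling idea, assuming weights are normalized to sum to one; it does not call TabWeightedDL.

```python
import numpy as np

rng = np.random.default_rng(0)
wgts = np.array([0.0, 1.0, 3.0])

# Normalize the weights into a probability distribution over item indices.
probs = wgts / wgts.sum()
idxs = rng.choice(len(wgts), size=1000, replace=True, p=probs)

# Count how often each index was drawn.
counts = np.bincount(idxs, minlength=len(wgts))
print(counts)
```

An item with zero weight is never drawn, and an item with three times the weight of another is drawn roughly three times as often.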

TabDataLoader’s create_item method

df = pd.DataFrame([{'age': 35}])
to = TabularPandas(df)
dls = to.dataloaders()
print(dls.create_item(0))
# test_eq(dls.create_item(0).items.to_dict(), {'age': 0.5330614747286777, 'workclass': 5, 'fnlwgt': -0.26305443080666174, 'education': 10, 'education-num': 1.169790230219763, 'marital-status': 1, 'occupation': 13, 'relationship': 5, 'race': 3, 'sex': ' Female', 'capital-gain': 0, 'capital-loss': 0, 'hours-per-week': 35, 'native-country': 'United-States', 'salary': 1, 'education-num_na': 1})
age    35
Name: 0, dtype: int8

Other target types

Multi-label categories

one-hot encoded label

def _mock_multi_label(df):
    sal,sex,white = [],[],[]
    for row in df.itertuples():
        sal.append(row.salary == '>=50k')
        sex.append(row.sex == ' Male')
        white.append(row.race == ' White')
    df['salary'] = np.array(sal)
    df['male']   = np.array(sex)
    df['white']  = np.array(white)
    return df
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
df_main.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary male white
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse NaN Wife White Female 0 1902 40 United-States True False True
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States True True True
2 38 Private 96185 HS-grad NaN Divorced NaN Unmarried Black Female 0 0 32 United-States False False False
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States True True False
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States False False False
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
y_names=["salary", "male", "white"]
# Build the TabularPandas with one-hot encoded multi-label targets
to = TabularPandas(df_main, procs, cat_names, cont_names, y_names=y_names, y_block=MultiCategoryBlock(encoded=True, vocab=y_names), splits=splits)
CPU times: user 66 ms, sys: 0 ns, total: 66 ms
Wall time: 65.3 ms
dls = to.dataloaders()
dls.valid.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary male white
0 Private HS-grad Divorced Exec-managerial Unmarried White False 47.000000 164423.000013 9.0 False False True
1 Private Some-college Married-civ-spouse Transport-moving Husband White False 74.999999 239037.999499 10.0 False True True
2 Private HS-grad Married-civ-spouse Sales Wife White False 45.000000 228570.000761 9.0 False False True
3 Self-emp-not-inc HS-grad Married-civ-spouse Exec-managerial Husband Asian-Pac-Islander False 45.000000 285574.998753 9.0 False True False
4 Private Some-college Never-married Adm-clerical Own-child White False 21.999999 184812.999966 10.0 False True True
5 Private 10th Married-civ-spouse Transport-moving Husband White False 67.000001 274450.998865 6.0 False True True
6 Private HS-grad Divorced Exec-managerial Unmarried White False 53.999999 192862.000000 9.0 False False True
7 Federal-gov Some-college Divorced Tech-support Unmarried Amer-Indian-Eskimo False 37.000000 33486.997455 10.0 False False False
8 Private HS-grad Never-married Machine-op-inspct Other-relative White False 30.000000 219318.000010 9.0 False False True
9 Self-emp-not-inc Bachelors Married-civ-spouse Sales Husband White False 44.000000 167279.999960 13.0 False True True

Not one-hot encoded

def _mock_multi_label(df):
    targ = []
    for row in df.itertuples():
        labels = []
        if row.salary == '>=50k': labels.append('>50k')
        if row.sex == ' Male':   labels.append('male')
        if row.race == ' White': labels.append('white')
        targ.append(' '.join(labels))
    df['target'] = np.array(targ)
    return df
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
df_main.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary target
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse NaN Wife White Female 0 1902 40 United-States >=50k >50k white
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k >50k male white
2 38 Private 96185 HS-grad NaN Divorced NaN Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k >50k male
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k
@MultiCategorize
def encodes(self, to:Tabular):
    #to.transform(to.y_names, partial(_apply_cats, {n: self.vocab for n in to.y_names}, 0))
    return to

@MultiCategorize
def decodes(self, to:Tabular):
    #to.transform(to.y_names, partial(_decode_cats, {n: self.vocab for n in to.y_names}))
    return to
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
# Build the TabularPandas with the space-separated 'target' column as a multi-label target
to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="target", y_block=MultiCategoryBlock(), splits=splits)
CPU times: user 68.6 ms, sys: 0 ns, total: 68.6 ms
Wall time: 67.9 ms
to.procs[2].vocab
['-', '_', 'a', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y']

Regression

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
# Build the TabularPandas with the numeric 'age' column as the regression target
to = TabularPandas(df_main, procs, cat_names, cont_names, y_names='age', splits=splits)
CPU times: user 70.7 ms, sys: 290 µs, total: 71 ms
Wall time: 70.3 ms
to.procs[-1].means
{'fnlwgt': 192085.701, 'education-num': 10.059124946594238}
dls = to.dataloaders()
dls.valid.show_batch()
workclass education marital-status occupation relationship race education-num_na fnlwgt education-num age
0 Private 12th Never-married Adm-clerical Other-relative Black False 503454.004078 8.0 47.0
1 Federal-gov Bachelors Married-civ-spouse Exec-managerial Husband White False 586656.993690 13.0 49.0
2 Self-emp-not-inc Assoc-voc Married-civ-spouse Farming-fishing Husband White False 164607.001243 11.0 29.0
3 Private HS-grad Never-married Adm-clerical Not-in-family Black False 155508.999873 9.0 48.0
4 Private 11th Never-married Other-service Own-child White False 318189.998679 7.0 18.0
5 Private HS-grad Never-married Adm-clerical Other-relative White False 140219.001104 9.0 47.0
6 Private Masters Divorced #na# Unmarried White True 235683.001562 10.0 47.0
7 Private Bachelors Married-civ-spouse Craft-repair Husband White False 187321.999825 13.0 43.0
8 Private Bachelors Married-civ-spouse Prof-specialty Husband White False 104196.002410 13.0 40.0
9 Private Some-college Separated Priv-house-serv Other-relative White False 184302.999784 10.0 25.0

Not being used now - for multi-modal

class TensorTabular(fastuple):
    def get_ctxs(self, max_n=10, **kwargs):
        n_samples = min(self[0].shape[0], max_n)
        df = pd.DataFrame(index = range(n_samples))
        return [df.iloc[i] for i in range(n_samples)]

    def display(self, ctxs): display_df(pd.DataFrame(ctxs))

class TabularLine(pd.Series):
    "A line of a dataframe that knows how to show itself"
    def show(self, ctx=None, **kwargs): return self if ctx is None else ctx.append(self)

class ReadTabLine(ItemTransform):
    def __init__(self, proc): self.proc = proc

    def encodes(self, row):
        cats,conts = (o.map(row.__getitem__) for o in (self.proc.cat_names,self.proc.cont_names))
        return TensorTabular(tensor(cats).long(),tensor(conts).float())

    def decodes(self, o):
        to = TabularPandas(o, self.proc.cat_names, self.proc.cont_names, self.proc.y_names)
        to = self.proc.decode(to)
        return TabularLine(pd.Series({c: v for v,c in zip(to.items[0]+to.items[1], self.proc.cat_names+self.proc.cont_names)}))

class ReadTabTarget(ItemTransform):
    def __init__(self, proc): self.proc = proc
    def encodes(self, row): return row[self.proc.y_names].astype(np.int64)
    def decodes(self, o): return Category(self.proc.classes[self.proc.y_names][o])
# tds = TfmdDS(to.items, tfms=[[ReadTabLine(proc)], ReadTabTarget(proc)])
# enc = tds[1]
# test_eq(enc[0][0], tensor([2,1]))
# test_close(enc[0][1], tensor([-0.628828]))
# test_eq(enc[1], 1)

# dec = tds.decode(enc)
# assert isinstance(dec[0], TabularLine)
# test_close(dec[0], pd.Series({'a': 1, 'b_na': False, 'b': 1}))
# test_eq(dec[1], 'a')

# test_stdout(lambda: print(show_at(tds, 1)), """a               1
# b_na        False
# b               1
# category        a
# dtype: object""")