Tabular core
DataLoaders.
Initial preprocessing
make_date
make_date (df, date_field)
Make sure df[date_field] is of the right date type.
df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']})
make_date(df, 'date')
test_eq(df['date'].dtype, np.dtype('datetime64[ns]'))
add_datepart
add_datepart (df, field_name, prefix=None, drop=True, time=False)
Helper function that adds columns relevant to a date in the column field_name of df.
For example, if we have a series of dates, we can generate features such as Year, Month, Day, Dayofweek, and Is_month_start, as shown below:
df = pd.DataFrame({'date': ['2019-12-04', None, '2019-11-15', '2019-10-24']})
df = add_datepart(df, 'date')
df.head()
| | Year | Month | Week | Day | Dayofweek | Dayofyear | Is_month_end | Is_month_start | Is_quarter_end | Is_quarter_start | Is_year_end | Is_year_start | Elapsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019.0 | 12.0 | 49.0 | 4.0 | 2.0 | 338.0 | False | False | False | False | False | False | 1.575418e+09 |
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | False | False | False | False | False | False | NaN |
| 2 | 2019.0 | 11.0 | 46.0 | 15.0 | 4.0 | 319.0 | False | False | False | False | False | False | 1.573776e+09 |
| 3 | 2019.0 | 10.0 | 43.0 | 24.0 | 3.0 | 297.0 | False | False | False | False | False | False | 1.571875e+09 |
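To make the generated columns concrete, here is a minimal pandas-only sketch of how a few of these features could be derived with the `.dt` accessor. It is not fastai's implementation: the helper name `datepart_sketch` is hypothetical, it covers only a subset of the columns, and it assumes `Elapsed` means whole seconds since the Unix epoch, which is consistent with the values in the table above.

```python
import pandas as pd

def datepart_sketch(df, field_name):
    """Hypothetical helper: add a few date-derived columns (not fastai's code)."""
    out = df.copy()
    col = pd.to_datetime(out[field_name])
    out['Year'] = col.dt.year
    out['Month'] = col.dt.month
    out['Day'] = col.dt.day
    out['Dayofweek'] = col.dt.dayofweek            # Monday=0 ... Sunday=6
    out['Is_month_start'] = col.dt.is_month_start
    # Assumption: Elapsed is the number of whole seconds since 1970-01-01
    out['Elapsed'] = (col - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
    return out

sketch = datepart_sketch(pd.DataFrame({'date': ['2019-12-04', '2019-11-01']}), 'date')
print(sketch)
```

The sketch uses dates without missing values; with missing dates the numeric columns become floats holding NaN, as in the table above.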
add_elapsed_times
add_elapsed_times (df, field_names, date_field, base_field)
For each event in field_names, add to df the elapsed time according to date_field, grouped by base_field.
df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24'],
'event': [False, True, False, True], 'base': [1,1,2,2]})
df = add_elapsed_times(df, ['event'], 'date', 'base')
df.head()
| | date | event | base | Afterevent | Beforeevent | event_bw | event_fw |
|---|---|---|---|---|---|---|---|
| 0 | 2019-12-04 | False | 1 | 5 | 0 | 1.0 | 0.0 |
| 1 | 2019-11-29 | True | 1 | 0 | 0 | 1.0 | 1.0 |
| 2 | 2019-11-15 | False | 2 | 22 | 0 | 1.0 | 0.0 |
| 3 | 2019-10-24 | True | 2 | 0 | 0 | 1.0 | 1.0 |
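As a rough illustration of what the Afterevent column measures, the sketch below computes the number of days since the most recent event within each group using plain pandas. It is only an approximation of one output column under stated assumptions: the column name `days_after_event` is hypothetical, and the Beforeevent, `_bw` and `_fw` columns are not covered.

```python
import pandas as pd

events = pd.DataFrame({
    'date': pd.to_datetime(['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']),
    'event': [False, True, False, True],
    'base': [1, 1, 2, 2],
})

# Process each group in chronological order
events = events.sort_values(['base', 'date'])
# Keep the date only on event rows, then carry it forward within each group
last_event = events['date'].where(events['event']).groupby(events['base']).ffill()
# Days elapsed since the most recent event in the same group
events['days_after_event'] = (events['date'] - last_event).dt.days
print(events)
```

For this data the result agrees with the Afterevent values shown above (5 days for row 0 and 22 days for row 2).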
cont_cat_split
cont_cat_split (df, max_card=20, dep_var=None)
Helper function that returns column names of cont and cat variables from given df.
This function decides whether a column is continuous or categorical based on the cardinality of its values. If the cardinality is above the max_card parameter (or the column has a float datatype), the column is added to cont_names; otherwise it is added to cat_names. An example is below:
# Example with simple numpy types
df = pd.DataFrame({'cat1': [1, 2, 3, 4], 'cont1': [1., 2., 3., 2.], 'cat2': ['a', 'b', 'b', 'a'],
'i8': pd.Series([1, 2, 3, 4], dtype='int8'),
'u8': pd.Series([1, 2, 3, 4], dtype='uint8'),
'f16': pd.Series([1, 2, 3, 4], dtype='float16'),
'y1': [1, 0, 1, 0], 'y2': [2, 1, 1, 0]})
cont_names, cat_names = cont_cat_split(df)
cont_names: ['cont1', 'f16']
cat_names: ['cat1', 'cat2', 'i8', 'u8', 'y1', 'y2']
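The rule described above can be sketched in a few lines of pandas. This is a simplified illustration, not fastai's code: the function name `split_by_cardinality` is hypothetical, and it only treats float columns and integer columns with more than `max_card` unique values as continuous.

```python
import pandas as pd

def split_by_cardinality(df, max_card=20, dep_var=None):
    """Hypothetical sketch of a cardinality-based cont/cat split."""
    cont_names, cat_names = [], []
    for label in df.columns:
        if label == dep_var:
            continue  # the dependent variable is neither cont nor cat
        col = df[label]
        is_float = pd.api.types.is_float_dtype(col)
        is_wide_int = pd.api.types.is_integer_dtype(col) and col.nunique() > max_card
        (cont_names if is_float or is_wide_int else cat_names).append(label)
    return cont_names, cat_names

toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1., 2., 3., 2.], 'c': ['x', 'y', 'y', 'x']})
print(split_by_cardinality(toy))              # default max_card=20
print(split_by_cardinality(toy, max_card=0))  # every integer column becomes continuous
```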
# Example with pandas types and generated columns
df = pd.DataFrame({'cat1': pd.Series(['l','xs','xl','s'], dtype='category'),
'ui32': pd.Series([1, 2, 3, 4], dtype='UInt32'),
'i64': pd.Series([1, 2, 3, 4], dtype='Int64'),
'f16': pd.Series([1, 2, 3, 4], dtype='Float64'),
'd1_date': ['2021-02-09', None, '2020-05-12', '2020-08-14'],
})
df = add_datepart(df, 'd1_date', drop=False)
df['cat1'] = df['cat1'].cat.set_categories(['xl','l','m','s','xs'], ordered=True)
cont_names, cat_names = cont_cat_split(df, max_card=0)
cont_names: ['ui32', 'i64', 'f16', 'd1_Year', 'd1_Month', 'd1_Week', 'd1_Day', 'd1_Dayofweek', 'd1_Dayofyear', 'd1_Elapsed']
cat_names: ['cat1', 'd1_date', 'd1_Is_month_end', 'd1_Is_month_start', 'd1_Is_quarter_end', 'd1_Is_quarter_start', 'd1_Is_year_end', 'd1_Is_year_start']
df_shrink_dtypes
df_shrink_dtypes (df, skip=[], obj2cat=True, int2uint=False)
Return any possible smaller data types for DataFrame columns. Allows object->category, int->uint, and exclusion.
For example, we will make a sample DataFrame with int, float, bool, and object datatypes:
df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'e': [True, False, True],
'date':['2019-12-04','2019-11-29','2019-11-15',]})
df.dtypes
i int64
f float64
e bool
date object
dtype: object
We can then call df_shrink_dtypes to find the smallest possible datatype that can support the data:
dt = df_shrink_dtypes(df)
dt
{'i': dtype('int8'), 'f': dtype('float32'), 'date': 'category'}
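One way to picture the integer part of this search is to walk through candidate types from smallest to largest and stop at the first whose range covers the column's minimum and maximum. The sketch below does that with `np.iinfo`; the function `smallest_int_dtype` and its `allow_unsigned` flag are hypothetical stand-ins, not fastai's implementation.

```python
import numpy as np

def smallest_int_dtype(lo, hi, allow_unsigned=False):
    """Hypothetical sketch: smallest NumPy integer dtype that can hold [lo, hi]."""
    candidates = [np.int8, np.int16, np.int32, np.int64]
    if allow_unsigned and lo >= 0:
        # Unsigned types only make sense when no value is negative
        candidates = [np.uint8, np.uint16, np.uint32, np.uint64]
    for t in candidates:
        info = np.iinfo(t)
        if info.min <= lo and hi <= info.max:
            return np.dtype(t)
    return np.dtype(np.int64)  # fallback for values outside every candidate

print(smallest_int_dtype(-100, 100))                   # fits in int8
print(smallest_int_dtype(0, 254))                      # too large for int8
print(smallest_int_dtype(0, 254, allow_unsigned=True))
```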
df_shrink
df_shrink (df, skip=[], obj2cat=True, int2uint=False)
Reduce DataFrame memory usage, by casting to smaller types returned by df_shrink_dtypes().
df_shrink(df) attempts to make a DataFrame use less memory by fitting numeric columns into the smallest possible datatypes. In addition:
- boolean, category, and datetime64[ns] dtype columns are ignored.
- 'object' type columns are categorified, which can save a lot of memory in large datasets. This can be turned off with obj2cat=False.
- int2uint=True fits int types to uint types, if all data in the column is >= 0.
- columns can be excluded by name using skip=['col1','col2'].
To get only the new column data types without actually casting a DataFrame, use df_shrink_dtypes() with the same parameters as df_shrink().
df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'u':[0, 10,254],
'date':['2019-12-04','2019-11-29','2019-11-15']})
df2 = df_shrink(df, skip=['date'])
Let’s compare the two:
df.dtypes
i int64
f float64
u int64
date object
dtype: object
df2.dtypes
i int8
f float32
u int16
date object
dtype: object
We can see that the datatypes changed. We can also compare the memory usage of the two DataFrames:
Initial Dataframe: 228 bytes
Reduced Dataframe: 177 bytes
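The code that produced these numbers is not shown on this page. As a hedged sketch, a comparison of this kind can be made with pandas' `DataFrame.memory_usage`; the variable names below are hypothetical, the casts are applied by hand with `astype` rather than through `df_shrink`, and the exact byte counts depend on the pandas version and on how the index and object columns are counted.

```python
import pandas as pd

df_mem = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'u': [0, 10, 254]})
# Manually apply the same casts reported in the dtypes listing above
shrunk = df_mem.astype({'i': 'int8', 'f': 'float32', 'u': 'int16'})

initial_bytes = int(df_mem.memory_usage().sum())
reduced_bytes = int(shrunk.memory_usage().sum())
print(f"Initial Dataframe: {initial_bytes} bytes")
print(f"Reduced Dataframe: {reduced_bytes} bytes")
```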
Here’s another example using the ADULT_SAMPLE dataset:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
new_df = df_shrink(df, int2uint=True)
Initial Dataframe: 3.907452 megabytes
Reduced Dataframe: 0.818333 megabytes
We reduced the overall memory used by 79%!
Tabular
Tabular (df, procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, splits=None, do_setup=True, device=None, inplace=False, reduce_memory=True)
A DataFrame wrapper that knows which cols are cont/cat/y, and returns rows in __getitem__
- df: A DataFrame of your data
- cat_names: Your categorical x variables
- cont_names: Your continuous x variables
- y_names: Your dependent y variables
  - Note: Mixed y’s, such as Regression and Classification together, are not currently supported; multiple regression or multiple classification outputs are
- y_block: How to sub-categorize the type of y_names (CategoryBlock or RegressionBlock)
- splits: How to split your data
- do_setup: Whether Tabular will run the data through the procs upon initialization
- device: cuda or cpu
- inplace: If True, Tabular will not keep a separate copy of your original DataFrame in memory. You should ensure pd.options.mode.chained_assignment is None before setting this
- reduce_memory: fastai will attempt to reduce the overall memory usage of the input DataFrame with df_shrink
TabularPandas
TabularPandas (df, procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, splits=None, do_setup=True, device=None, inplace=False, reduce_memory=True)
A Tabular object with transforms
TabularProc
TabularProc (enc=None, dec=None, split_idx=None, order=None)
Base class to write a non-lazy tabular processor for dataframes
These transforms are applied as soon as the data is available rather than as data is called from the DataLoader
Categorify
Categorify (enc=None, dec=None, split_idx=None, order=None)
Transform the categorical variables to something similar to pd.Categorical
While visually in the DataFrame you will not see a change, the classes are stored in to.procs.categorify as we can see below on a dummy DataFrame:
df = pd.DataFrame({'a':[0,1,2,0,2]})
to = TabularPandas(df, Categorify, 'a')
to.show()
| | a |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
| 3 | 0 |
| 4 | 2 |
Each column’s unique values are stored in a dictionary of column:[values]:
cat = to.procs.categorify
cat.classes
{'a': ['#na#', 0, 1, 2]}
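The idea behind this vocabulary can be sketched with plain pandas: collect the sorted unique values, reserve the first slot for missing values, and map each value to its position. This is an illustrative sketch rather than fastai's code, and it assumes that index 0 is the one reserved for `#na#`, as the class list above suggests.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 0, 2])

# Vocabulary: '#na#' first, then the sorted unique values of the column
classes = ['#na#'] + sorted(s.dropna().unique().tolist())
lookup = {v: i for i, v in enumerate(classes)}
# Encode each value as its index in the vocabulary; missing values map to 0
codes = s.map(lookup).fillna(0).astype(int)

print(classes)
print(codes.tolist())
```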
FillStrategy
FillStrategy ()
Namespace containing the various filling strategies.
Currently, filling with the median, a constant, and the mode are supported.
FillMissing
FillMissing (fill_strategy=<function median>, add_col=True, fill_vals=None)
Fill the missing values in continuous columns.
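As a minimal sketch of the median strategy with `add_col=True`, the pandas code below fills missing values and records where they were in a boolean `<name>_na` column, mirroring the `education-num_na` column that appears in the tables further down this page. The helper `fill_missing_median` is hypothetical and is not fastai's implementation.

```python
import pandas as pd

def fill_missing_median(df, cont_names, add_col=True):
    """Hypothetical sketch: median-fill continuous columns, flagging filled rows."""
    out = df.copy()
    for name in cont_names:
        missing = out[name].isna()
        if missing.any():
            if add_col:
                out[name + '_na'] = missing  # True where a value was filled in
            out[name] = out[name].fillna(out[name].median())
    return out

filled = fill_missing_median(pd.DataFrame({'x': [1.0, None, 3.0], 'y': [1.0, 2.0, 3.0]}), ['x', 'y'])
print(filled)
```

Columns without missing values, such as `y` here, are left untouched and get no indicator column in this sketch.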
ReadTabBatch
ReadTabBatch (to)
Transform TabularPandas values into a Tensor with the ability to decode
TabDataLoader
TabDataLoader (dataset, bs=16, shuffle=False, after_batch=None, num_workers=0, verbose:bool=False, do_setup:bool=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, pin_memory_device='', wif=None, before_iter=None, after_item=None, before_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None)
A transformed DataLoader for Tabular data
TabWeightedDL
TabWeightedDL (dataset, bs=16, wgts=None, shuffle=False, after_batch=None, num_workers=0, verbose:bool=False, do_setup:bool=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, pin_memory_device='', wif=None, before_iter=None, after_item=None, before_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None)
A transformed DataLoader for Tabular Weighted data
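Conceptually, a weighted loader draws row indices with probability proportional to their weights. The NumPy sketch below illustrates that idea only; the function `weighted_indices` is hypothetical, and sampling with replacement over normalized weights is an assumption about the behavior rather than something stated on this page.

```python
import numpy as np

def weighted_indices(n, wgts, seed=0):
    """Hypothetical sketch: draw n indices with probability proportional to wgts."""
    rng = np.random.default_rng(seed)
    p = np.asarray(wgts, dtype=float)
    p = p / p.sum()  # normalize so the probabilities sum to 1
    return rng.choice(n, size=n, replace=True, p=p)

# Rows 0 and 4 have zero weight, so they are never drawn
idxs = weighted_indices(5, [0.0, 1.0, 1.0, 2.0, 0.0])
print(idxs)
```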
Integration example
For a more in-depth explanation, see the tabular tutorial
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_test.drop('salary', axis=1, inplace=True)
df_main.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="salary", splits=splits)
dls = to.dataloaders()
dls.valid.show_batch()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Self-emp-not-inc | Prof-school | Divorced | Prof-specialty | Not-in-family | White | False | 65.000000 | 316093.005287 | 15.0 | <50k |
| 1 | Private | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | False | 69.999999 | 280306.998091 | 13.0 | <50k |
| 2 | Federal-gov | Some-college | Married-civ-spouse | Adm-clerical | Husband | Black | False | 34.000000 | 199933.999862 | 10.0 | >=50k |
| 3 | Private | HS-grad | Never-married | Handlers-cleaners | Unmarried | White | False | 24.000001 | 300584.002430 | 9.0 | <50k |
| 4 | Private | Assoc-voc | Never-married | Other-service | Not-in-family | White | False | 34.000000 | 220630.999335 | 11.0 | <50k |
| 5 | Private | Bachelors | Divorced | Prof-specialty | Unmarried | White | False | 45.000000 | 289230.003178 | 13.0 | >=50k |
| 6 | ? | Some-college | Never-married | ? | Own-child | White | False | 26.000000 | 208993.999494 | 10.0 | <50k |
| 7 | Private | Some-college | Divorced | Adm-clerical | Not-in-family | White | False | 43.000000 | 174574.999446 | 10.0 | <50k |
| 8 | Self-emp-not-inc | Assoc-voc | Married-civ-spouse | Other-service | Husband | White | False | 63.000000 | 420628.997361 | 11.0 | <50k |
| 9 | State-gov | Some-college | Married-civ-spouse | Adm-clerical | Husband | Black | False | 25.000000 | 257064.003065 | 10.0 | <50k |
to.show()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5516 | Private | HS-grad | Divorced | Exec-managerial | Unmarried | White | False | 49.0 | 140121.0 | 9.0 | <50k |
| 7184 | Self-emp-inc | Some-college | Never-married | Exec-managerial | Not-in-family | White | False | 70.0 | 207938.0 | 10.0 | <50k |
| 2336 | Private | Some-college | Never-married | Priv-house-serv | Own-child | White | False | 23.0 | 50953.0 | 10.0 | <50k |
| 4342 | Private | Assoc-voc | Married-civ-spouse | Machine-op-inspct | Husband | White | False | 46.0 | 27802.0 | 11.0 | <50k |
| 8474 | Self-emp-not-inc | Assoc-acdm | Married-civ-spouse | Craft-repair | Husband | White | False | 47.0 | 107231.0 | 12.0 | <50k |
| 5948 | Local-gov | HS-grad | Married-civ-spouse | Transport-moving | Husband | White | False | 40.0 | 55363.0 | 9.0 | <50k |
| 5342 | Local-gov | HS-grad | Married-civ-spouse | Craft-repair | Husband | White | False | 46.0 | 36228.0 | 9.0 | <50k |
| 9005 | Private | Bachelors | Married-civ-spouse | Adm-clerical | Husband | White | False | 38.0 | 297449.0 | 13.0 | >=50k |
| 1189 | Private | Assoc-voc | Divorced | Sales | Not-in-family | Amer-Indian-Eskimo | False | 31.0 | 87950.0 | 11.0 | <50k |
| 8784 | Private | Assoc-voc | Divorced | Prof-specialty | Own-child | Black | False | 35.0 | 491000.0 | 11.0 | <50k |
We can decode any set of transformed data by calling to.decode_row with our raw data:
row = to.items.iloc[0]
to.decode_row(row)
age 49.0
workclass Private
fnlwgt 140121.0
education HS-grad
education-num 9.0
marital-status Divorced
occupation Exec-managerial
relationship Unmarried
race White
sex Male
capital-gain 0
capital-loss 0
hours-per-week 50
native-country United-States
salary <50k
education-num_na False
Name: 5516, dtype: object
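Decoding reverses each processing step. For normalization, that amounts to multiplying by the standard deviation and adding the mean back, as the NumPy sketch below shows. The exact statistics fastai stores (for example, which standard deviation estimator it uses) are not stated on this page, so treat the numbers here as illustrative only.

```python
import numpy as np

ages = np.array([49.0, 70.0, 23.0, 46.0])
mean, std = ages.mean(), ages.std()

encoded = (ages - mean) / std   # what a normalizing step would store
decoded = encoded * std + mean  # what decoding recovers

print(encoded)
print(decoded)
```

Because the round trip goes through floating-point arithmetic, decoded values can differ from the originals by a tiny amount, which may explain values such as 69.999999 in the batches shown above.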
We can make new test datasets based on the training data with to.new().
Since machine learning models can’t magically understand categories they were never trained on, the data should reflect this. If your test data has missing values in different columns than your training data, you should address this before training.
to_tst = to.new(df_test)
to_tst.process()
to_tst.items.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | education-num_na |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10000 | 0.465031 | 5 | 1.319553 | 10 | 1.176677 | 3 | 2 | 1 | 2 | Male | 0 | 0 | 40 | Philippines | 1 |
| 10001 | -0.926675 | 5 | 1.233650 | 12 | -0.420035 | 3 | 15 | 1 | 4 | Male | 0 | 0 | 40 | United-States | 1 |
| 10002 | 1.051012 | 5 | 0.145161 | 2 | -1.218391 | 1 | 9 | 2 | 5 | Female | 0 | 0 | 37 | United-States | 1 |
| 10003 | 0.538279 | 5 | -0.282370 | 12 | -0.420035 | 7 | 2 | 5 | 5 | Female | 0 | 0 | 43 | United-States | 1 |
| 10004 | 0.758022 | 6 | 1.420768 | 9 | 0.378321 | 3 | 5 | 1 | 5 | Male | 0 | 0 | 60 | United-States | 1 |
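The point about unseen categories can be illustrated with a small pandas sketch: the vocabulary is built from the training column only, and any test value that is missing or absent from it has no code of its own. Mapping such values to index 0 (`#na#`) is an assumption made for this sketch, not a statement about fastai's exact behavior.

```python
import pandas as pd

train = pd.Series(['Private', 'State-gov', 'Private'])
test = pd.Series(['State-gov', 'Never-worked', None])

# The vocabulary comes from the training data only
classes = ['#na#'] + sorted(train.dropna().unique())
lookup = {v: i for i, v in enumerate(classes)}
# Unseen or missing test values have no entry, so they fall back to 0
test_codes = test.map(lookup).fillna(0).astype(int)

print(classes)
print(test_codes.tolist())
```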
We can then convert it to a DataLoader:
tst_dl = dls.valid.new(to_tst)
tst_dl.show_batch()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | Bachelors | Married-civ-spouse | Adm-clerical | Husband | Asian-Pac-Islander | False | 45.000000 | 338105.005817 | 13.0 |
| 1 | Private | HS-grad | Married-civ-spouse | Transport-moving | Husband | Other | False | 26.000000 | 328663.002806 | 9.0 |
| 2 | Private | 11th | Divorced | Other-service | Not-in-family | White | False | 52.999999 | 209022.000317 | 7.0 |
| 3 | Private | HS-grad | Widowed | Adm-clerical | Unmarried | White | False | 46.000000 | 162029.998917 | 9.0 |
| 4 | Self-emp-inc | Assoc-voc | Married-civ-spouse | Exec-managerial | Husband | White | False | 49.000000 | 349230.006300 | 11.0 |
| 5 | Local-gov | Some-college | Married-civ-spouse | Exec-managerial | Husband | White | False | 34.000000 | 124827.002059 | 10.0 |
| 6 | Self-emp-inc | Some-college | Married-civ-spouse | Sales | Husband | White | False | 52.999999 | 290640.002462 | 10.0 |
| 7 | Private | Some-college | Never-married | Sales | Own-child | White | False | 19.000000 | 106272.998239 | 10.0 |
| 8 | Private | Some-college | Married-civ-spouse | Protective-serv | Husband | Black | False | 71.999999 | 53684.001668 | 10.0 |
| 9 | Private | Some-college | Never-married | Sales | Own-child | White | False | 20.000000 | 505980.010609 | 10.0 |
# Create a TabWeightedDL
train_ds = to.train
weights = np.random.random(len(train_ds))
train_dl = TabWeightedDL(train_ds, wgts=weights, bs=64, shuffle=True)
train_dl.show_batch()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Local-gov | Masters | Never-married | Prof-specialty | Not-in-family | White | False | 31.000000 | 204469.999932 | 14.0 | <50k |
| 1 | Self-emp-not-inc | HS-grad | Divorced | Farming-fishing | Not-in-family | White | False | 32.000000 | 34572.002104 | 9.0 | <50k |
| 2 | ? | Some-college | Widowed | ? | Not-in-family | White | False | 64.000000 | 34099.998990 | 10.0 | <50k |
| 3 | Private | Some-college | Divorced | Exec-managerial | Not-in-family | White | False | 32.000000 | 251242.999189 | 10.0 | >=50k |
| 4 | Federal-gov | HS-grad | Married-civ-spouse | Exec-managerial | Husband | White | False | 55.000001 | 176903.999313 | 9.0 | <50k |
| 5 | Private | 11th | Married-civ-spouse | Transport-moving | Husband | White | False | 50.000000 | 192203.000000 | 7.0 | <50k |
| 6 | Private | 10th | Never-married | Farming-fishing | Own-child | Black | False | 36.000000 | 181720.999704 | 6.0 | <50k |
| 7 | Local-gov | Masters | Divorced | Prof-specialty | Not-in-family | Amer-Indian-Eskimo | False | 50.000000 | 220640.001490 | 14.0 | >=50k |
| 8 | Private | HS-grad | Married-civ-spouse | Adm-clerical | Wife | White | False | 36.000000 | 189381.999993 | 9.0 | >=50k |
| 9 | Private | Masters | Divorced | Prof-specialty | Unmarried | White | False | 42.000000 | 265697.997341 | 14.0 | <50k |
TabDataLoader’s create_item method
df = pd.DataFrame([{'age': 35}])
to = TabularPandas(df)
dls = to.dataloaders()
print(dls.create_item(0))
# test_eq(dls.create_item(0).items.to_dict(), {'age': 0.5330614747286777, 'workclass': 5, 'fnlwgt': -0.26305443080666174, 'education': 10, 'education-num': 1.169790230219763, 'marital-status': 1, 'occupation': 13, 'relationship': 5, 'race': 3, 'sex': ' Female', 'capital-gain': 0, 'capital-loss': 0, 'hours-per-week': 35, 'native-country': 'United-States', 'salary': 1, 'education-num_na': 1})
age 35
Name: 0, dtype: int8
Other target types
Multi-label categories
one-hot encoded label
def _mock_multi_label(df):
    sal,sex,white = [],[],[]
    for row in df.itertuples():
        sal.append(row.salary == '>=50k')
        sex.append(row.sex == ' Male')
        white.append(row.race == ' White')
    df['salary'] = np.array(sal)
    df['male'] = np.array(sex)
    df['white'] = np.array(white)
    return df
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
df_main.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | male | white |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | True | False | True |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | True | True | True |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | False | False | False |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | True | True | False |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | False | False | False |
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
y_names=["salary", "male", "white"]
CPU times: user 66 ms, sys: 0 ns, total: 66 ms
Wall time: 65.3 ms
dls = to.dataloaders()
dls.valid.show_batch()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary | male | white |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | HS-grad | Divorced | Exec-managerial | Unmarried | White | False | 47.000000 | 164423.000013 | 9.0 | False | False | True |
| 1 | Private | Some-college | Married-civ-spouse | Transport-moving | Husband | White | False | 74.999999 | 239037.999499 | 10.0 | False | True | True |
| 2 | Private | HS-grad | Married-civ-spouse | Sales | Wife | White | False | 45.000000 | 228570.000761 | 9.0 | False | False | True |
| 3 | Self-emp-not-inc | HS-grad | Married-civ-spouse | Exec-managerial | Husband | Asian-Pac-Islander | False | 45.000000 | 285574.998753 | 9.0 | False | True | False |
| 4 | Private | Some-college | Never-married | Adm-clerical | Own-child | White | False | 21.999999 | 184812.999966 | 10.0 | False | True | True |
| 5 | Private | 10th | Married-civ-spouse | Transport-moving | Husband | White | False | 67.000001 | 274450.998865 | 6.0 | False | True | True |
| 6 | Private | HS-grad | Divorced | Exec-managerial | Unmarried | White | False | 53.999999 | 192862.000000 | 9.0 | False | False | True |
| 7 | Federal-gov | Some-college | Divorced | Tech-support | Unmarried | Amer-Indian-Eskimo | False | 37.000000 | 33486.997455 | 10.0 | False | False | False |
| 8 | Private | HS-grad | Never-married | Machine-op-inspct | Other-relative | White | False | 30.000000 | 219318.000010 | 9.0 | False | False | True |
| 9 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Sales | Husband | White | False | 44.000000 | 167279.999960 | 13.0 | False | True | True |
Not one-hot encoded
def _mock_multi_label(df):
    targ = []
    for row in df.itertuples():
        labels = []
        if row.salary == '>=50k': labels.append('>50k')
        if row.sex == ' Male': labels.append('male')
        if row.race == ' White': labels.append('white')
        targ.append(' '.join(labels))
    df['target'] = np.array(targ)
    return df
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
df_main.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k | >50k white |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k | >50k male white |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k | |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k | >50k male |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k | |
@MultiCategorize
def encodes(self, to:Tabular):
    #to.transform(to.y_names, partial(_apply_cats, {n: self.vocab for n in to.y_names}, 0))
    return to
@MultiCategorize
def decodes(self, to:Tabular):
    #to.transform(to.y_names, partial(_decode_cats, {n: self.vocab for n in to.y_names}))
    return to
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
CPU times: user 68.6 ms, sys: 0 ns, total: 68.6 ms
Wall time: 67.9 ms
to.procs[2].vocab
['-', '_', 'a', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y']
Regression
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
CPU times: user 70.7 ms, sys: 290 µs, total: 71 ms
Wall time: 70.3 ms
to.procs[-1].means
{'fnlwgt': 192085.701, 'education-num': 10.059124946594238}
dls = to.dataloaders()
dls.valid.show_batch()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | fnlwgt | education-num | age |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | 12th | Never-married | Adm-clerical | Other-relative | Black | False | 503454.004078 | 8.0 | 47.0 |
| 1 | Federal-gov | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | False | 586656.993690 | 13.0 | 49.0 |
| 2 | Self-emp-not-inc | Assoc-voc | Married-civ-spouse | Farming-fishing | Husband | White | False | 164607.001243 | 11.0 | 29.0 |
| 3 | Private | HS-grad | Never-married | Adm-clerical | Not-in-family | Black | False | 155508.999873 | 9.0 | 48.0 |
| 4 | Private | 11th | Never-married | Other-service | Own-child | White | False | 318189.998679 | 7.0 | 18.0 |
| 5 | Private | HS-grad | Never-married | Adm-clerical | Other-relative | White | False | 140219.001104 | 9.0 | 47.0 |
| 6 | Private | Masters | Divorced | #na# | Unmarried | White | True | 235683.001562 | 10.0 | 47.0 |
| 7 | Private | Bachelors | Married-civ-spouse | Craft-repair | Husband | White | False | 187321.999825 | 13.0 | 43.0 |
| 8 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Husband | White | False | 104196.002410 | 13.0 | 40.0 |
| 9 | Private | Some-college | Separated | Priv-house-serv | Other-relative | White | False | 184302.999784 | 10.0 | 25.0 |
Not being used now - for multi-modal
class TensorTabular(fastuple):
    def get_ctxs(self, max_n=10, **kwargs):
        n_samples = min(self[0].shape[0], max_n)
        df = pd.DataFrame(index = range(n_samples))
        return [df.iloc[i] for i in range(n_samples)]
    def display(self, ctxs): display_df(pd.DataFrame(ctxs))
class TabularLine(pd.Series):
    "A line of a dataframe that knows how to show itself"
    def show(self, ctx=None, **kwargs): return self if ctx is None else ctx.append(self)
class ReadTabLine(ItemTransform):
    def __init__(self, proc): self.proc = proc
    def encodes(self, row):
        cats,conts = (o.map(row.__getitem__) for o in (self.proc.cat_names,self.proc.cont_names))
        return TensorTabular(tensor(cats).long(),tensor(conts).float())
    def decodes(self, o):
        to = TabularPandas(o, self.proc.cat_names, self.proc.cont_names, self.proc.y_names)
        to = self.proc.decode(to)
        return TabularLine(pd.Series({c: v for v,c in zip(to.items[0]+to.items[1], self.proc.cat_names+self.proc.cont_names)}))
class ReadTabTarget(ItemTransform):
    def __init__(self, proc): self.proc = proc
    def encodes(self, row): return row[self.proc.y_names].astype(np.int64)
    def decodes(self, o): return Category(self.proc.classes[self.proc.y_names][o])
# tds = TfmdDS(to.items, tfms=[[ReadTabLine(proc)], ReadTabTarget(proc)])
# enc = tds[1]
# test_eq(enc[0][0], tensor([2,1]))
# test_close(enc[0][1], tensor([-0.628828]))
# test_eq(enc[1], 1)
# dec = tds.decode(enc)
# assert isinstance(dec[0], TabularLine)
# test_close(dec[0], pd.Series({'a': 1, 'b_na': False, 'b': 1}))
# test_eq(dec[1], 'a')
# test_stdout(lambda: print(show_at(tds, 1)), """a 1
# b_na False
# b 1
# category a
# dtype: object""")