Transforms to clean and preprocess tabular data

Tabular data preprocessing

Overview

This package contains the base class used to define a transformation for preprocessing dataframes of tabular data, as well as several basic implementations of TabularProc. Preprocessing includes things like:

  • replacing non-numerical variables by categories, then their ids,
  • filling missing values,
  • normalizing continuous variables.

In all those steps we have to be careful to apply to our validation or test set the same correspondence we decided on our training set (which id we give to each category, what value we use for missing data, or the mean/std we use to normalize). To deal with this, we use a special class called TabularProc.

The data used in this page is a subset of the adult dataset. It gives various attributes of individuals, used to train a model to predict whether their salary is greater than \$50k or not.

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
train_df, valid_df = df.iloc[:800].copy(), df.iloc[800:1000].copy()
train_df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse NaN Wife White Female 0 1902 40 United-States >=50k
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k
2 38 Private 96185 HS-grad NaN Divorced NaN Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k

We see it contains numerical variables (like age or education-num) as well as categorical ones (like workclass or relationship). The original dataset is clean, but we removed a few values to give examples of dealing with missing data.

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
cont_names = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

Transforms for tabular data

class TabularProc[source]

TabularProc(`cat_names`:StrList, `cont_names`:StrList)

A processor for tabular dataframes.

Base class for creating transforms for dataframes with categorical variables cat_names and continuous variables cont_names. Note that any column not in one of those lists won't be touched.
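The train/test dispatch can be sketched with plain pandas. The following is an illustrative, simplified stand-in for the fastai class (SimpleProc and the Lowercase subclass are hypothetical names, not library code):

```python
import pandas as pd

# Illustrative sketch of the TabularProc dispatch pattern: __call__
# routes to apply_train or apply_test depending on the `test` flag.
class SimpleProc:
    def __init__(self, cat_names, cont_names):
        self.cat_names, self.cont_names = cat_names, cont_names

    def __call__(self, df, test=False):
        # Apply the correct function to df depending on test.
        func = self.apply_test if test else self.apply_train
        func(df)

    def apply_train(self, df): raise NotImplementedError
    def apply_test(self, df):  self.apply_train(df)

# Hypothetical subclass: lower-case the categorical columns in place.
class Lowercase(SimpleProc):
    def apply_train(self, df):
        for c in self.cat_names: df[c] = df[c].str.lower()

df = pd.DataFrame({'grade': ['A', 'B'], 'score': [1.0, 2.0]})
Lowercase(['grade'], ['score'])(df)
```

A real subclass would store statistics in apply_train (categories, medians, means) so that apply_test can reuse them.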

__call__[source]

__call__(`df`:DataFrame, `test`:bool=`False`)

Apply the correct function to df depending on test.

apply_train[source]

apply_train(`df`:DataFrame)

Function applied to df if it's the train set.

apply_test[source]

apply_test(`df`:DataFrame)

Function applied to df if it's the test set.

The following TabularProc transforms are implemented in the fastai library. Note that the replacement of categories by their codes as well as the normalization of continuous variables are automatically done in a TabularDataset.

class Categorify[source]

Categorify(`cat_names`:StrList, `cont_names`:StrList) :: TabularProc

Transform the categorical variables to the pandas category type.

Variables in cont_names aren't affected.

apply_train[source]

apply_train(`df`:DataFrame)

Transform self.cat_names columns into categorical.

apply_test[source]

apply_test(`df`:DataFrame)

Transform self.cat_names columns into categorical, using the codes decided in apply_train.

tfm = Categorify(cat_names, cont_names)
tfm(train_df)
tfm(valid_df, test=True)

Since we haven't changed the categories by their codes, nothing visible has changed in the dataframe yet, but we can check that the variables are now categorical and view their corresponding codes.

train_df['workclass'].cat.categories
Index([' ?', ' Federal-gov', ' Local-gov', ' Private', ' Self-emp-inc',
       ' Self-emp-not-inc', ' State-gov', ' Without-pay'],
      dtype='object')

The test set will be given the same category codes as the training set.

valid_df['workclass'].cat.categories
Index([' ?', ' Federal-gov', ' Local-gov', ' Private', ' Self-emp-inc',
       ' Self-emp-not-inc', ' State-gov', ' Without-pay'],
      dtype='object')

class FillMissing[source]

FillMissing(`cat_names`:StrList, `cont_names`:StrList, `fill_strategy`:FillStrategy=``, `add_col`:bool=`True`, `fill_val`:float=`0.0`) :: TabularProc

Fill the missing values in continuous columns.

cat_names variables are left untouched (their missing values will be replaced by code 0 in the TabularDataset). fill_strategy is used to replace those nans and, if add_col is True, whenever a column c has missing values, a column named c_nan is added that flags the rows where the value was missing.

apply_train[source]

apply_train(`df`:DataFrame)

Fill missing values in self.cont_names according to self.fill_strategy.

apply_test[source]

apply_test(`df`:DataFrame)

Fill missing values in self.cont_names like in apply_train.

Fills the missing values in the cont_names columns with the values picked during training.

train_df[cont_names].head()
age fnlwgt education-num capital-gain capital-loss hours-per-week
0 49 101320 12.0 0 1902 40
1 44 236746 14.0 10520 0 45
2 38 96185 NaN 0 0 32
3 38 112847 15.0 0 0 40
4 42 82297 NaN 0 0 50
tfm = FillMissing(cat_names, cont_names)
tfm(train_df)
tfm(valid_df, test=True)
train_df[cont_names].head()
age fnlwgt education-num capital-gain capital-loss hours-per-week
0 49 101320 12.0 0 1902 40
1 44 236746 14.0 10520 0 45
2 38 96185 10.0 0 0 32
3 38 112847 15.0 0 0 40
4 42 82297 10.0 0 0 50

Values missing in the education-num column are replaced by 10, which is the median of the column in train_df. Categorical variables are not changed, since nan is simply used as another category.

valid_df[cont_names].head()
age fnlwgt education-num capital-gain capital-loss hours-per-week
800 45 96975 10.0 0 0 40
801 46 192779 10.0 15024 0 60
802 36 376455 10.0 0 0 38
803 25 50053 10.0 0 0 45
804 37 164526 10.0 0 0 40
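The key point — the statistic is computed on the training set only and reused on the validation set — can be sketched with plain pandas (toy data, not the fastai implementation):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({'education-num': [12.0, 14.0, np.nan, 15.0, np.nan]})
valid = pd.DataFrame({'education-num': [np.nan, 9.0]})

# The median is computed on the training set only...
median = train['education-num'].median()

# ...then used to fill nans in both train and validation sets.
train['education-num'] = train['education-num'].fillna(median)
valid['education-num'] = valid['education-num'].fillna(median)
```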

`FillStrategy`

Enum = [MEDIAN, COMMON, CONSTANT]

An enum flag that determines how FillMissing handles missing/nan values:

  • MEDIAN: nans are replaced by the median value of the column
  • COMMON: nans are replaced by the most common value of the column
  • CONSTANT: nans are replaced by fill_val
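In plain pandas terms, the three strategies correspond to the following fill values (an illustrative sketch, not the library code):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 1.0, 4.0, 6.0, np.nan])

# MEDIAN: nans are replaced by the median value of the column
median_filled = s.fillna(s.median())
# COMMON: nans are replaced by the most common value (the mode)
common_filled = s.fillna(s.mode()[0])
# CONSTANT: nans are replaced by a fixed fill_val (here 0.0)
constant_filled = s.fillna(0.0)
```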

class Normalize[source]

Normalize(`cat_names`:StrList, `cont_names`:StrList) :: TabularProc

Normalize the continuous variables.

apply_train[source]

apply_train(`df`:DataFrame)

Compute the means and stds of self.cont_names columns to normalize them.

apply_test[source]

apply_test(`df`:DataFrame)

Normalize self.cont_names with the same statistics as in apply_train.
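As with the other transforms, the statistics come from the training set and are reused unchanged on the test set. A plain-pandas sketch of this behaviour (toy data, not the fastai implementation):

```python
import pandas as pd

train = pd.DataFrame({'age': [20.0, 30.0, 40.0]})
valid = pd.DataFrame({'age': [30.0, 50.0]})

# Means and stds are computed on the training set only...
means, stds = train.mean(), train.std()
train_norm = (train - means) / stds
# ...and the same statistics are reused on the validation set.
valid_norm = (valid - means) / stds
```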

Treating date columns

add_datepart[source]

add_datepart(`df`, `fldname`, `drop`=`True`, `time`=`False`)

Helper function that adds columns relevant to a date in the column fldname of df.

Will drop the fldname column from df if drop is True. The time flag decides whether we go down to the time parts or stick to the date parts.
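A rough pandas equivalent of a few of the columns produced, using the `dt` accessor (the actual helper adds more parts, e.g. week and month-boundary flags; the column names below are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2019-12-04', '2019-12-05'])})

# Extract date parts into separate columns, as add_datepart does.
df['Year']      = df['date'].dt.year
df['Month']     = df['date'].dt.month
df['Day']       = df['date'].dt.day
df['Dayofweek'] = df['date'].dt.dayofweek  # Monday=0 ... Sunday=6

# drop=True behaviour: remove the original date column.
df = df.drop(columns='date')
```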

Splitting data into cat and cont

cont_cat_split[source]

cont_cat_split(`df`, `max_card`=`20`, `dep_var`=`None`)

Helper function that returns the column names of continuous and categorical variables from a given df.

Parameters:

  • df: A pandas data frame.
  • max_card: Maximum cardinality of a numerical categorical variable.
  • dep_var: A dependent variable.

Return:

  • cont_names: A list of names of continuous variables.
  • cat_names: A list of names of categorical variables.
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'a'], 'col3': [0.5, 1.2, 7.5], 'col4': ['ab', 'o', 'o']})
df
col1 col2 col3 col4
0 1 a 0.5 ab
1 2 b 1.2 o
2 3 a 7.5 o
cont_list, cat_list = cont_cat_split(df=df, max_card=20, dep_var='col4')
cont_list, cat_list
(['col3'], ['col1', 'col2'])
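The splitting rule illustrated above can be sketched as a hypothetical reimplementation (simple_cont_cat_split is not the library function; the assumption here is that float columns count as continuous, while int columns are continuous only when their cardinality exceeds max_card):

```python
import pandas as pd

def simple_cont_cat_split(df, max_card=20, dep_var=None):
    # Floats are treated as continuous; ints are continuous only when
    # they take more than max_card distinct values; everything else
    # (strings, etc.) is categorical. dep_var is excluded entirely.
    cont_names, cat_names = [], []
    for col in df.columns:
        if col == dep_var: continue
        if df[col].dtype == float or (df[col].dtype == int
                                      and df[col].nunique() > max_card):
            cont_names.append(col)
        else:
            cat_names.append(col)
    return cont_names, cat_names

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'a'],
                   'col3': [0.5, 1.2, 7.5], 'col4': ['ab', 'o', 'o']})
cont_list, cat_list = simple_cont_cat_split(df, max_card=20, dep_var='col4')
```

With max_card=20, col1 (an int column with only 3 distinct values) lands in the categorical list, while the float column col3 is continuous, matching the output shown above.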