Tabular training

How to use the tabular application in fastai

To illustrate the tabular application, we will use the Adult dataset, where the task is to predict whether a person earns more or less than $50k per year from some general data.

from fastai.tabular.all import *

We can download a sample of this dataset with the usual untar_data command:

path = untar_data(URLs.ADULT_SAMPLE)
path.ls()
(#3) [Path('/home/ml1/.fastai/data/adult_sample/models'),Path('/home/ml1/.fastai/data/adult_sample/export.pkl'),Path('/home/ml1/.fastai/data/adult_sample/adult.csv')]

Then we can have a look at how the data is structured:

df = pd.read_csv(path/'adult.csv')
df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse NaN Wife White Female 0 1902 40 United-States >=50k
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k
2 38 Private 96185 HS-grad NaN Divorced NaN Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k

Some of the columns are continuous (like age), and we will treat them as floating-point numbers that we can feed to our model directly. Others are categorical (like workclass or education), and we will convert each value to an integer index that we will feed to embedding layers. We can specify the names of our categorical and continuous columns, as well as the name of the dependent variable, in the TabularDataLoaders factory methods:

dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [Categorify, FillMissing, Normalize])

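The last part is the list of pre-processors we apply to our data:

- Categorify takes every categorical variable, builds a map from integers to unique categories, and replaces the values with the corresponding index.
- FillMissing fills the missing values in the continuous variables with the median of the existing values (you can choose a specific value if you prefer) and adds a column such as education-num_na recording which rows were filled.
- Normalize normalizes the continuous variables (subtracts the mean and divides by the standard deviation).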
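
To build some intuition for Categorify, FillMissing and Normalize, here is a rough pandas sketch of what each step does on a tiny made-up frame. It is not fastai's implementation: the toy values are invented, and details such as reserving index 0 for missing categories or using the population standard deviation are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

# Tiny invented frame with one categorical and one continuous column.
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", None, "Private"],
    "education-num": [12.0, np.nan, 9.0, 14.0],
})

# Categorify-like step: replace each category with an integer index.
# Assumption for this sketch: index 0 is reserved for missing values.
codes = toy["workclass"].astype("category").cat.codes + 1

# FillMissing-like step: remember which rows were missing,
# then fill the gaps with the median of the observed values.
was_missing = toy["education-num"].isna()
filled = toy["education-num"].fillna(toy["education-num"].median())

# Normalize-like step: subtract the mean and divide by the standard deviation.
normalized = (filled - filled.mean()) / filled.std(ddof=0)

print(codes.tolist(), was_missing.tolist(), filled.tolist())
```

After these steps every column is numeric, which is the form the model consumes.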

To further expose what’s going on below the surface, let’s rewrite this utilizing fastai’s TabularPandas class. We will need to make one adjustment, which is defining how we want to split our data. By default the factory method above used a random 80/20 split, so we will do the same:

splits = RandomSplitter(valid_pct=0.2)(range_of(df))
to = TabularPandas(df, procs=[Categorify, FillMissing,Normalize],
                   cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
                   cont_names = ['age', 'fnlwgt', 'education-num'],
                   y_names='salary',
                   splits=splits)
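
Conceptually, a random 80/20 split just shuffles the row indices and cuts them in two. The NumPy sketch below illustrates the idea; random_split and its seed argument are hypothetical names used only here, not part of fastai.

```python
import numpy as np

def random_split(n_items, valid_pct=0.2, seed=None):
    """Shuffle the indices 0..n_items-1 and return (train_idx, valid_idx)."""
    rng = np.random.default_rng(seed)
    idxs = rng.permutation(n_items)
    cut = int(valid_pct * n_items)  # number of rows held out for validation
    return idxs[cut:].tolist(), idxs[:cut].tolist()

train_idx, valid_idx = random_split(10, valid_pct=0.2, seed=42)
print(len(train_idx), len(valid_idx))
```

Every row index lands in exactly one of the two lists, so the training and validation sets never overlap.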

Once we build our TabularPandas object, our data is completely preprocessed as seen below:

to.xs.iloc[:2]
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num
15780 2 16 1 5 2 5 1 0.984037 2.210372 -0.033692
17442 5 12 5 8 2 5 1 -1.509555 -0.319624 -0.425324

Now we can build our DataLoaders again:

dls = to.dataloaders(bs=64)

Later we will explore why using TabularPandas to preprocess will be valuable.

The show_batch method works as it does for every other application:

dls.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary
0 State-gov Bachelors Married-civ-spouse Prof-specialty Wife White False 41.000000 75409.001182 13.0 >=50k
1 Private Some-college Never-married Craft-repair Not-in-family White False 24.000000 38455.005013 10.0 <50k
2 Private Assoc-acdm Married-civ-spouse Prof-specialty Husband White False 48.000000 101299.003093 12.0 <50k
3 Private HS-grad Never-married Other-service Other-relative Black False 42.000000 227465.999281 9.0 <50k
4 State-gov Some-college Never-married Prof-specialty Not-in-family White False 20.999999 258489.997130 10.0 <50k
5 Local-gov 12th Married-civ-spouse Tech-support Husband White False 39.000000 207853.000067 8.0 <50k
6 Private Assoc-voc Married-civ-spouse Sales Husband White False 36.000000 238414.998930 11.0 >=50k
7 Private HS-grad Never-married Craft-repair Not-in-family White False 19.000000 445727.998937 9.0 <50k
8 Local-gov Bachelors Married-civ-spouse #na# Husband White True 59.000000 196013.000174 10.0 >=50k
9 Private HS-grad Married-civ-spouse Prof-specialty Wife Black False 39.000000 147500.000403 9.0 <50k

We can define a model using the tabular_learner function. When we define our model, fastai will try to infer the loss function from the y_names we specified earlier.

Note: Sometimes with tabular data, your targets may already be encoded as numbers (such as 0 and 1). In that case you should explicitly pass y_block=CategoryBlock when you build your data (for example, to TabularPandas) so fastai won’t presume you are doing regression.
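
One way to picture why numerically encoded targets are ambiguous is a simple dtype check: a column of strings can only be a set of classes, while a column of integers could be either class ids or a quantity. The sketch below illustrates that idea and is not fastai's actual code; guess_task is a hypothetical helper.

```python
import pandas as pd

def guess_task(df, y_names):
    """Guess the task from target dtypes: all-numeric targets look like regression."""
    ys = df[y_names]
    numeric_cols = ys.select_dtypes(include="number").columns
    return "regression" if len(numeric_cols) == len(ys.columns) else "classification"

labels_as_text = pd.DataFrame({"salary": ["<50k", ">=50k", "<50k"]})
labels_as_ints = pd.DataFrame({"salary": [0, 1, 0]})

print(guess_task(labels_as_text, ["salary"]))
print(guess_task(labels_as_ints, ["salary"]))
```

The second frame holds the same labels as the first, yet a dtype-based guess treats it as a regression target, which is why stating the block explicitly is the safer choice.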

learn = tabular_learner(dls, metrics=accuracy)

And we can train that model with the fit_one_cycle method (the fine_tune method won’t be useful here since we don’t have a pretrained model).

learn.fit_one_cycle(1)
epoch train_loss valid_loss accuracy time
0 0.369360 0.348096 0.840756 00:05

We can then have a look at some predictions:

learn.show_results()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary salary_pred
0 5.0 12.0 3.0 8.0 1.0 5.0 1.0 0.324868 -1.138177 -0.424022 0.0 0.0
1 5.0 10.0 5.0 2.0 2.0 5.0 1.0 -0.482055 -1.351911 1.148438 0.0 0.0
2 5.0 12.0 6.0 12.0 3.0 5.0 1.0 -0.775482 0.138709 -0.424022 0.0 0.0
3 5.0 16.0 5.0 2.0 4.0 4.0 1.0 -1.362335 -0.227515 -0.030907 0.0 0.0
4 5.0 2.0 5.0 0.0 4.0 5.0 1.0 -1.509048 -0.191191 -1.210252 0.0 0.0
5 5.0 16.0 3.0 13.0 1.0 5.0 1.0 1.498575 -0.051096 -0.030907 1.0 1.0
6 5.0 12.0 3.0 15.0 1.0 5.0 1.0 -0.555412 0.039167 -0.424022 0.0 0.0
7 5.0 1.0 5.0 6.0 4.0 5.0 1.0 -1.582405 -1.396391 -1.603367 0.0 0.0
8 5.0 3.0 5.0 13.0 2.0 5.0 1.0 -1.362335 0.158354 -0.817137 0.0 0.0

Or use the predict method on a row:

row, clas, probs = learn.predict(df.iloc[0])
row.show()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary
0 Private Assoc-acdm Married-civ-spouse #na# Wife White False 49.0 101319.99788 12.0 >=50k
clas, probs
(tensor(1), tensor([0.4995, 0.5005]))
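
The class index returned by predict is simply the position of the largest probability. The plain-Python sketch below reproduces that step with the numbers printed above; the label order in vocab is an assumption made for illustration, and the real order should be checked against the fitted DataLoaders.

```python
# Probabilities copied from the predict output above.
probs = [0.4995, 0.5005]

# Assumed label order for this sketch (not read from fastai).
vocab = ["<50k", ">=50k"]

# The predicted class is the index of the largest probability.
pred_idx = max(range(len(probs)), key=probs.__getitem__)
pred_label = vocab[pred_idx]

print(pred_idx, pred_label, probs[pred_idx])
```

Here the two probabilities are almost equal, so the model is far from confident about this row even though it picks the correct class.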

To get predictions on a new dataframe, you can use the test_dl method of the DataLoaders. That dataframe does not need to have the dependent variable among its columns.

test_df = df.copy()
test_df.drop(['salary'], axis=1, inplace=True)
dl = learn.dls.test_dl(test_df)

Then Learner.get_preds will give you the predictions:

learn.get_preds(dl=dl)
(tensor([[0.4995, 0.5005],
         [0.4882, 0.5118],
         [0.9824, 0.0176],
         ...,
         [0.5324, 0.4676],
         [0.7628, 0.2372],
         [0.5934, 0.4066]]), None)

Note: Since a machine learning model can’t magically understand categories it was never trained on, the data should reflect this. If your test data contains missing values or categories that do not appear in your training data, you should address this before training.
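
A quick pandas check can flag such mismatches before any training happens. The sketch below is one possible way to do it on invented toy frames; unseen_categories and newly_missing are hypothetical helpers, not fastai functions.

```python
import pandas as pd

def unseen_categories(train_df, test_df, cat_names):
    """Return {column: sorted values} for categories that appear only in test_df."""
    report = {}
    for name in cat_names:
        known = set(train_df[name].dropna().unique())
        new = set(test_df[name].dropna().unique()) - known
        if new:
            report[name] = sorted(new)
    return report

def newly_missing(train_df, test_df, names):
    """Return columns that contain missing values in test_df but none in train_df."""
    return [n for n in names if test_df[n].isna().any() and not train_df[n].isna().any()]

# Invented toy data for illustration.
train_df = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})
test_df = pd.DataFrame({"workclass": ["Private", "Never-worked", None]})

print(unseen_categories(train_df, test_df, ["workclass"]))
print(newly_missing(train_df, test_df, ["workclass"]))
```

An empty result from both helpers suggests the test data uses only categories and missing-value patterns the preprocessing has already seen.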

fastai with Other Libraries

As mentioned earlier, TabularPandas is a powerful and easy preprocessing tool for tabular data. Integration with other tools such as Random Forests and XGBoost requires only one extra step, which the .dataloaders call did for us. Let’s look at our to again. Its values are stored in a DataFrame-like object, from which we can extract the cats, conts, xs and ys if we want to:

to.xs[:3]
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num
25387 5 16 3 5 1 5 1 0.471582 -1.467756 -0.030907
16872 1 16 5 1 4 5 1 -1.215622 -0.649792 -0.030907
25852 5 16 3 5 1 5 1 1.865358 -0.218915 -0.030907

Now that everything is encoded, you can then send this off to XGBoost or Random Forests by extracting the train and validation sets and their values:

X_train, y_train = to.train.xs, to.train.ys.values.ravel()
X_test, y_test = to.valid.xs, to.valid.ys.values.ravel()

And now we can directly send this in!
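
As a rough illustration of that hand-off, the sketch below fits a scikit-learn RandomForestClassifier. To stay self-contained it uses small synthetic arrays as stand-ins for the values extracted from to, so the data, the hyperparameters and the resulting score are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for the encoded features and labels extracted above.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(50, 4))
y_test = (X_test[:, 0] > 0).astype(int)

# Fit a small forest on the training split and score it on the held-out split.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
valid_accuracy = model.score(X_test, y_test)
preds = model.predict(X_test)

print(valid_accuracy)
```

With real data, X_train, y_train, X_test and y_test would be the arrays extracted from to, used in place of the synthetic ones.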