Basic dataset for NLP tasks and helper functions to create a DataBunch

NLP datasets

This module contains the TextDataset class, which is the main dataset you should use for your NLP tasks. It automatically does the preprocessing steps described in text.transform. It also contains all the functions to quickly get a TextDataBunch ready.

Quickly assemble your data

You should get your data in one of the following formats to make the most of the fastai library and use one of the factory methods of one of the TextDataBunch classes:

  • raw text files in folders train, valid, test in an ImageNet style,
  • a csv where some column(s) gives the label(s) and the following one(s) the associated text(s),
  • a dataframe structured the same way,
  • tokens and labels arrays,
  • ids, vocabulary (correspondence from id to word) and labels.

If you are assembling the data for a language model, you should define your labels as always being 0 to respect those formats. The first time you create a DataBunch with one of those functions, your data will be preprocessed automatically. You can save it, so that the next time you call it, loading is almost instantaneous.

Below are the classes that help assemble the raw data into a DataBunch suitable for NLP.

class TextLMDataBunch[source]

TextLMDataBunch(`train_dl`:DataLoader, `valid_dl`:DataLoader, `fix_dl`:DataLoader=`None`, `test_dl`:Optional[DataLoader]=`None`, `device`:device=`None`, `tfms`:Optional[Collection[Callable]]=`None`, `path`:PathOrStr=`'.'`, `collate_fn`:Callable=`'data_collate'`, `no_check`:bool=`False`) :: TextDataBunch

Create a TextDataBunch suitable for training a language model.

All the texts in the datasets are concatenated and the labels are ignored. Instead, the target is the next word in the sentence.

create[source]

create(`train_ds`, `valid_ds`, `test_ds`=`None`, `path`:PathOrStr=`'.'`, `no_check`:bool=`False`, `bs`=`64`, `num_workers`:int=`0`, `device`:device=`None`, `collate_fn`:Callable=`'data_collate'`, `tfms`:Optional[Collection[Callable]]=`None`, `kwargs`) → DataBunch

Create a TextDataBunch in path from the datasets for language modelling.

class TextClasDataBunch[source]

TextClasDataBunch(`train_dl`:DataLoader, `valid_dl`:DataLoader, `fix_dl`:DataLoader=`None`, `test_dl`:Optional[DataLoader]=`None`, `device`:device=`None`, `tfms`:Optional[Collection[Callable]]=`None`, `path`:PathOrStr=`'.'`, `collate_fn`:Callable=`'data_collate'`, `no_check`:bool=`False`) :: TextDataBunch

Create a TextDataBunch suitable for training an RNN classifier.

create[source]

create(`train_ds`, `valid_ds`, `test_ds`=`None`, `path`:PathOrStr=`'.'`, `bs`=`64`, `pad_idx`=`1`, `pad_first`=`True`, `no_check`:bool=`False`, `kwargs`) → DataBunch

Function that transforms the datasets into a DataBunch for classification.

All the texts are grouped by length (with a bit of randomness for the training set), then padded so that all the samples in a batch have the same length.

class TextDataBunch[source]

TextDataBunch(`train_dl`:DataLoader, `valid_dl`:DataLoader, `fix_dl`:DataLoader=`None`, `test_dl`:Optional[DataLoader]=`None`, `device`:device=`None`, `tfms`:Optional[Collection[Callable]]=`None`, `path`:PathOrStr=`'.'`, `collate_fn`:Callable=`'data_collate'`, `no_check`:bool=`False`) :: DataBunch

General class to get a DataBunch for NLP. Subclassed by TextLMDataBunch and TextClasDataBunch.

Factory methods (TextDataBunch)

All those classes have the following factory methods.

from_folder[source]

from_folder(`path`:PathOrStr, `train`:str=`'train'`, `valid`:str=`'valid'`, `test`:Optional[str]=`None`, `classes`:ArgStar=`None`, `tokenizer`:Tokenizer=`None`, `vocab`:Vocab=`None`, `kwargs`)

Create a TextDataBunch from text files in folders.

The folders scanned in path are train, valid and maybe test. Text files in the train and valid folders should be placed in subdirectories according to their classes (not applicable for a language model). tokenizer will be used to parse those texts into tokens.

You can pass a specific vocab for the numericalization step (for instance if you are building a classifier from a language model you fine-tuned). kwargs will be split between the TextDataset function and the class initialization; there you can specify parameters such as max_vocab, chunksize, min_freq, n_labels (see the TextDataset documentation) or bs, bptt and pad_idx (see the sections LM data and classifier data).
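For instance, a minimal sketch of the folder workflow could look like this (the folder layout and class names below are assumptions, not part of the library):

# Hypothetical layout: my_texts/train/neg, my_texts/train/pos, my_texts/valid/neg, my_texts/valid/pos
path = Path('my_texts')                      # assumed folder, adjust to your data
data_lm = TextLMDataBunch.from_folder(path)  # labels are ignored for a language model
data_clas = TextClasDataBunch.from_folder(path, classes=['neg', 'pos'],  # assumed class names
                                          vocab=data_lm.vocab,           # reuse the LM vocab
                                          bs=32)                         # extra kwargs dispatched as above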

from_csv[source]

from_csv(`path`:PathOrStr, `csv_name`, `valid_pct`:float=`0.2`, `test`:Optional[str]=`None`, `tokenizer`:Tokenizer=`None`, `vocab`:Vocab=`None`, `classes`:StrList=`None`, `header`=`'infer'`, `text_cols`:IntsOrStrs=`1`, `label_cols`:IntsOrStrs=`0`, `label_delim`:str=`None`, `kwargs`) → DataBunch

Create a TextDataBunch from texts in csv files.

This method will look for csv_name, and maybe a test csv file, in path. These will be opened with header. You can specify text_cols and label_cols. If there are several text_cols, the texts will be concatenated together with an optional field token. If there are several label_cols, the labels will be assumed to be one-hot encoded and classes will default to label_cols (you can ignore that argument for a language model). tokenizer will be used to parse those texts into tokens.

You can pass a specific vocab for the numericalization step (for instance if you are building a classifier from a language model you fine-tuned). kwargs will be split between the TextDataset function and the class initialization; there you can specify parameters such as max_vocab, chunksize, min_freq, n_labels (see the TextDataset documentation) or bs, bptt and pad_idx (see the sections LM data and classifier data).
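As a sketch, assuming a csv with a label column and two text columns named title and body (these column names are hypothetical), the call could look like:

data_clas = TextClasDataBunch.from_csv(path, 'texts.csv',
                                       text_cols=['title', 'body'],  # several text columns get concatenated
                                       label_cols='label',           # column holding the labels
                                       valid_pct=0.1,                # hold out 10% for validation
                                       bs=48)                        # extra kwargs dispatched as above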

from_df[source]

from_df(`path`:PathOrStr, `train_df`:DataFrame, `valid_df`:DataFrame, `test_df`:OptDataFrame=`None`, `tokenizer`:Tokenizer=`None`, `vocab`:Vocab=`None`, `classes`:StrList=`None`, `text_cols`:IntsOrStrs=`1`, `label_cols`:IntsOrStrs=`0`, `label_delim`:str=`None`, `kwargs`) → DataBunch

Create a TextDataBunch from DataFrames.

This method will use train_df, valid_df and maybe test_df to build the TextDataBunch in path. You can specify text_cols and label_cols. If there are several text_cols, the texts will be concatenated together with an optional field token. If there are several label_cols, the labels will be assumed to be one-hot encoded and classes will default to label_cols (you can ignore that argument for a language model). tokenizer will be used to parse those texts into tokens.

You can pass a specific vocab for the numericalization step (for instance if you are building a classifier from a language model you fine-tuned). kwargs will be split between the TextDataset function and the class initialization; there you can specify parameters such as max_vocab, chunksize, min_freq, n_labels (see the TextDataset documentation) or bs, bptt and pad_idx (see the sections LM data and classifier data).
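A minimal sketch, assuming a dataframe with a label and a text column (the column names and the 80/20 split are assumptions):

import pandas as pd
df = pd.read_csv(path/'texts.csv')
cut = int(0.8 * len(df))
train_df, valid_df = df[:cut], df[cut:]
data_lm   = TextLMDataBunch.from_df(path, train_df, valid_df, text_cols='text')
data_clas = TextClasDataBunch.from_df(path, train_df, valid_df, text_cols='text',
                                      label_cols='label', vocab=data_lm.vocab)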

from_tokens[source]

from_tokens(`path`:PathOrStr, `trn_tok`:Tokens, `trn_lbls`:Collection[Union[int, float]], `val_tok`:Tokens, `val_lbls`:Collection[Union[int, float]], `vocab`:Vocab=`None`, `tst_tok`:Tokens=`None`, `classes`:ArgStar=`None`, `kwargs`) → DataBunch

Create a TextDataBunch from tokens and labels.

This function will create a DataBunch from trn_tok, trn_lbls, val_tok, val_lbls and maybe tst_tok.

You can pass a specific vocab for the numericalization step (for instance if you are building a classifier from a language model you fine-tuned). kwargs will be split between the TextDataset function and the class initialization; there you can specify parameters such as max_vocab, chunksize, min_freq, n_labels, tok_suff and lbl_suff (see the TextDataset documentation) or bs, bptt and pad_idx (see the sections LM data and classifier data).
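A toy sketch, assuming you tokenized the corpus yourself (the tokens and labels below are made up):

trn_tok = [['xxbos', 'great', 'movie'], ['xxbos', 'terrible', 'plot']]
trn_lbls = [1, 0]
val_tok = [['xxbos', 'not', 'bad']]
val_lbls = [1]
data = TextClasDataBunch.from_tokens(path, trn_tok, trn_lbls, val_tok, val_lbls,
                                     classes=['neg', 'pos'])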

from_ids[source]

from_ids(`path`:PathOrStr, `vocab`:Vocab, `train_ids`:Collection[Collection[int]], `valid_ids`:Collection[Collection[int]], `test_ids`:Collection[Collection[int]]=`None`, `train_lbls`:Collection[Union[int, float]]=`None`, `valid_lbls`:Collection[Union[int, float]]=`None`, `classes`:ArgStar=`None`, `processor`:PreProcessor=`None`, `kwargs`) → DataBunch

Create a TextDataBunch from ids, labels and a vocab.

Texts are already preprocessed into train_ids, train_lbls, valid_ids, valid_lbls and maybe test_ids. You can specify the corresponding classes if applicable. You must specify a path and the vocab so that the RNNLearner class can later infer the corresponding sizes in the model it will create. kwargs will be passed to the class initialization.
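A sketch, assuming train_ids/valid_ids and the matching label arrays were produced earlier with the same vocab (all the variable names here are assumptions):

data = TextClasDataBunch.from_ids(path, vocab=data_lm.vocab,
                                  train_ids=train_ids, valid_ids=valid_ids,      # lists of arrays of ids
                                  train_lbls=train_lbls, valid_lbls=valid_lbls,  # matching label arrays
                                  classes=['neg', 'pos'])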

Load and save

To avoid losing time preprocessing the text data more than once, you should save/load your TextDataBunch using these methods.

load[source]

load(`path`:PathOrStr, `cache_name`:PathOrStr=`'tmp'`, `processor`:PreProcessor=`None`, `kwargs`)

Load a TextDataBunch from path/cache_name. kwargs are passed to the dataloader creation.

save[source]

save(`cache_name`:PathOrStr=`'tmp'`)

Save the DataBunch in self.path/cache_name folder.
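Using the IMDB sample from the example below as an illustration, a typical round trip looks like this (the cache folder name is arbitrary):

data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')  # preprocesses the texts once
data_lm.save('tmp_lm')                                 # writes ids, vocab and labels to path/'tmp_lm'
data_lm = TextLMDataBunch.load(path, 'tmp_lm')         # later sessions: reload without re-tokenizing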

Example

Untar the IMDB sample dataset if not already done:

path = untar_data(URLs.IMDB_SAMPLE)
path
PosixPath('/home/ubuntu/.fastai/data/imdb_sample')

Since it comes in the form of csv files, we will use the corresponding from_csv method. Here is an overview of what your file should look like:

pd.read_csv(path/'texts.csv').head()
label text is_valid
0 negative Un-bleeping-believable! Meg Ryan doesn't even ... False
1 positive This is a extremely well-made film. The acting... False
2 negative Every once in a long while a movie will come a... False
3 positive Name just says it all. I watched this movie wi... False
4 negative This movie succeeds at being one of the most u... False

And here is a simple way of creating your DataBunch for language modelling or classification.

data_lm = TextLMDataBunch.from_csv(Path(path), 'texts.csv')
data_clas = TextClasDataBunch.from_csv(Path(path), 'texts.csv')

The TextList input classes

Behind the scenes, the previous functions will create a training, validation and maybe test TextList that will be tokenized and numericalized (if needed) using PreProcessor.

class Text[source]

Text(`ids`, `text`) :: ItemBase

Basic item for text data in numericalized ids.

class TextList[source]

TextList(`items`:Iterator[T_co], `vocab`:Vocab=`None`, `pad_idx`:int=`1`, `kwargs`) :: ItemList

Basic ItemList for text data.

vocab contains the correspondence between ids and tokens, and pad_idx is the id used for padding. You can pass a custom processor in the kwargs to change the defaults for tokenization or numericalization. It should have the following form:

processor = [TokenizeProcessor(tokenizer=SpacyTokenizer('en')), NumericalizeProcessor(max_vocab=30000)]

See below for all the arguments those processors can take.

label_for_lm[source]

label_for_lm(`kwargs`)

A special labelling method for language models.

from_folder[source]

from_folder(`path`:PathOrStr=`'.'`, `extensions`:StrList=`{'.txt'}`, `vocab`:Vocab=`None`, `processor`:PreProcessor=`None`, `kwargs`) → TextList

Get the list of files in path that have a text suffix. recurse determines if we search subfolders.
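Put together with label_for_lm, a sketch of the data block pipeline could look like this (split_by_rand_pct and databunch belong to the data block API documented elsewhere):

data_lm = (TextList.from_folder(path, processor=processor)  # processor as defined above
           .split_by_rand_pct(0.1)                          # random 10% validation split
           .label_for_lm()                                  # the special LM labelling shown above
           .databunch(bs=48))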

show_xys[source]

show_xys(`xs`, `ys`, `max_len`:int=`70`)

Show the xs (inputs) and ys (targets). max_len is the maximum number of tokens displayed.

show_xyzs[source]

show_xyzs(`xs`, `ys`, `zs`, `max_len`:int=`70`)

Show xs (inputs), ys (targets) and zs (predictions). max_len is the maximum number of tokens displayed.

class OpenFileProcessor[source]

OpenFileProcessor(`ds`:Collection[T_co]=`None`) :: PreProcessor

PreProcessor that opens the filenames and reads the texts.

open_text[source]

open_text(`fn`:PathOrStr, `enc`=`'utf-8'`)

Read the text in fn.

class TokenizeProcessor[source]

TokenizeProcessor(`ds`:ItemList=`None`, `tokenizer`:Tokenizer=`None`, `chunksize`:int=`10000`, `mark_fields`:bool=`False`) :: PreProcessor

PreProcessor that tokenizes the texts in ds.

tokenizer is used on chunks of chunksize texts at a time. If mark_fields=True, field tokens are added between the different parts of the texts (used when the texts are read from several columns of a dataframe). See more about tokenizers in the transform documentation.

class NumericalizeProcessor[source]

NumericalizeProcessor(`ds`:ItemList=`None`, `vocab`:Vocab=`None`, `max_vocab`:int=`60000`, `min_freq`:int=`2`) :: PreProcessor

PreProcessor that numericalizes the tokens in ds.

Uses vocab for this (if not None), otherwise creates one from the tokens with max_vocab and min_freq.

Language Model data

A language model is trained to guess what the next word is inside a flow of words. We don't feed it the different texts separately but concatenate them all together in one big array. To create the batches, we split this array into bs chunks of contiguous texts. Note that in all NLP tasks, we don't use the usual convention of sequence length being the first dimension: batch size is the first dimension and sequence length is the second. Here you can read the chunks of text in the rows.

path = untar_data(URLs.IMDB_SAMPLE)
data = TextLMDataBunch.from_csv(path, 'texts.csv')
x,y = next(iter(data.train_dl))
example = x[:15,:15].cpu()
texts = pd.DataFrame([data.train_ds.vocab.textify(l).split(' ') for l in example])
texts
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 xxbos xxmaj name just says it all . i watched this movie with my dad
1 going to see anything you 'll remember . xxbos xxmaj evidently when you offer a
2 of the xxunk xxunk xxmaj superbly shot , this thrilling adult adventure certainly contains some
3 \n\n xxmaj one bright light in the midst of this is xxmaj fred xxmaj xxunk
4 leave you xxunk that you watched it . i feel really bad for those xxmaj
5 xxmaj ginger 's blonde hair ) has a couple of emotional solo numbers , including
6 . xxmaj this is what really attracted me to this film . i was impressed
7 into the hype that one you are somehow white or superior ... you are not
8 come unintentionally , like when they try to explain that an invisible man 's xxunk
9 just watch the movie , and i dear say you 'll see things a bit
10 for the reform movement and meets xxmaj eponine . xxmaj except ... not xxmaj eponine
11 xxmaj well , given the target audience , that may not have been too bad
12 gets roles in movies , in my opinion though she should stick to movies of
13 its share , though far smaller than xxunk even including a basic view of the
14 ruins - scene is xxup almost european - like cinema ( the movie is eager

This is all done internally when we use TextLMDataBunch, by wrapping the dataset in the following pre-loader before calling a DataLoader.

class LanguageModelPreLoader[source]

LanguageModelPreLoader(`dataset`:LabelList, `lengths`:Collection[int]=`None`, `bs`:int=`64`, `bptt`:int=`70`, `backwards`:bool=`False`, `shuffle`:bool=`False`, `drop_last`:bool=`False`) :: Callback

Transforms the texts in dataset into a stream for language modelling.

Takes the texts from dataset with the given lengths (if this argument isn't passed, lengths are computed at initialization). It will prepare the data for batches with a batch size of bs and a sequence length of bptt. If backwards=True, the original text is reversed. If shuffle=True, the texts are shuffled before being read, at the start of each epoch. If drop_last=True, the last batch of texts (with a sequence length < bptt) is discarded.
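A rough, simplified sketch of what TextLMDataBunch.create does with this pre-loader (the exact arguments may differ between fastai versions):

trn_pre = LanguageModelPreLoader(data.train_ds, bs=64, bptt=70, shuffle=True)
val_pre = LanguageModelPreLoader(data.valid_ds, bs=64, bptt=70, shuffle=False)
trn_dl = DataLoader(trn_pre, batch_size=64, shuffle=False)  # yields batches of shape (bs, bptt)
val_dl = DataLoader(val_pre, batch_size=64, shuffle=False)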

Classifier data

When preparing the data for a classifier, we keep the different texts separate, which poses another challenge for the creation of batches: since they don't all have the same length, we can't easily collate them together in batches. To help with this we use two different techniques:

  • padding: each text is padded with the PAD token so that all the texts we picked have the same size
  • sorting the texts (ish): to avoid having together a very long text with a very short one (which would then have a lot of PAD tokens), we regroup the texts by order of length. For the training set, we still add some randomness to avoid showing the same batches at every step of the training.

Here is an example of a batch with padding (the padding index is 1, and the padding is applied before the sentences start).

path = untar_data(URLs.IMDB_SAMPLE)
data = TextClasDataBunch.from_csv(path, 'texts.csv')
iter_dl = iter(data.train_dl)
_ = next(iter_dl)
x,y = next(iter_dl)
x[-10:,:20]
tensor([[   1,    1,    1,    1,    1,    1,    1,    2,   18,  310,    9,    0,
           11,    0,    9,   48,    8,    0,   11, 2301],
        [   1,    1,    1,    1,    1,    1,    1,    2,    4, 1427,   15,    8,
          521,   10,    4,   90,  131,    9, 1427,  242],
        [   1,    1,    1,    1,    1,    1,    1,    2,   18,  175,   55, 2063,
         4677,   14,    8,  209,   22, 1343,   26,   20],
        [   1,    1,    1,    1,    1,    1,    1,    1,    2,    4,   20,   30,
           24,    8,  110,  616,   30,  164,  745,   18],
        [   1,    1,    1,    1,    1,    1,    1,    1,    2,   18,   24, 3560,
           14,  130,    8,   30,   26,   85,  193,    9],
        [   1,    1,    1,    1,    1,    1,    1,    1,    2,   18,  101,    0,
           20,  153,   71,   18,   24, 4055,   17,    4],
        [   1,    1,    1,    1,    1,    1,    1,    1,    2,    4, 1998,  256,
            4,    0,    4,    0,  273,   34,    8,    0],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    2,    4,   20,
           24,   12,  119,   30,   19,   83,   12,  202],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    2,    4,    8,
           79, 1031,  185,   13,   20,   30,   24,    8],
        [   1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    2,   18,
           61,   36,  143,  104,   20,   30, 1408,   51]], device='cuda:0')

This is all done internally when we use TextClasDataBunch, by using the following classes:

class SortSampler[source]

SortSampler(`data_source`:NPArrayList, `key`:KeyFunc) :: Sampler

Go through the text data by order of length.

This pytorch Sampler is used for the validation and (if applicable) the test set.

class SortishSampler[source]

SortishSampler(`data_source`:NPArrayList, `key`:KeyFunc, `bs`:int) :: Sampler

Go through the text data by order of length with a bit of randomness.

This pytorch Sampler is generally used for the training set.
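A sketch of how the two samplers are typically constructed (the token-id lists used for the lengths are assumptions):

trn_lengths = [len(ids) for ids in train_ids]                                 # hypothetical id lists
val_lengths = [len(ids) for ids in valid_ids]
trn_sampler = SortishSampler(train_ids, key=lambda i: trn_lengths[i], bs=64)  # sortish for training
val_sampler = SortSampler(valid_ids, key=lambda i: val_lengths[i])            # strict sort for validation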

pad_collate[source]

pad_collate(`samples`:BatchSamples, `pad_idx`:int=`1`, `pad_first`:bool=`True`) → Tuple[LongTensor, LongTensor]

Function that collects samples and adds padding.

This will collate the samples in batches while adding padding with pad_idx. If pad_first=True, padding is applied at the beginning (before the sentence starts) otherwise it's applied at the end.
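A toy sketch with two samples of different lengths (the ids are made up):

samples = [([2, 18, 310, 9], 0),  # (numericalized text, label)
           ([2, 4, 90], 1)]
x, y = pad_collate(samples, pad_idx=1, pad_first=True)
# x is tensor([[  2,  18, 310,   9],
#              [  1,   2,   4,  90]]): the shorter sample is padded at the front
# y is tensor([0, 1])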