Wikitext data tutorial

Using Datasets, Pipeline, TfmdLists and Transform in text
from fastai.basics import *
from fastai.callback.all import *
from fastai.text.all import *

In this tutorial, we explore the mid-level API for data collection in text applications. We will use the building blocks introduced in the pets tutorial, so you should already be familiar with Transform, Pipeline, TfmdLists and Datasets.

Data

path = untar_data(URLs.WIKITEXT_TINY)

The dataset comes with the articles split across two CSV files, so we read them and concatenate them into one dataframe.

# The CSV files have no header row, so we pass header=None
df_train = pd.read_csv(path/'train.csv', header=None)
df_valid = pd.read_csv(path/'test.csv', header=None)
df_all = pd.concat([df_train, df_valid])
df_all.head()
0
0 \n = 2013 – 14 York City F.C. season = \n \n The 2013 – 14 season was the <unk> season of competitive association football and 77th season in the Football League played by York City Football Club , a professional football club based in York , North Yorkshire , England . Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two . The season ran from 1 July 2013 to 30 June 2014 . \n Nigel Worthington , starting his first full season as York manager , made eight permanent summer signings . By the turn of the year York were only above the relegation z...
1 \n = Big Boy ( song ) = \n \n " Big Boy " <unk> " I 'm A Big Boy Now " was the first single ever recorded by the Jackson 5 , which was released by Steeltown Records in January 1968 . The group played instruments on many of their Steeltown compositions , including " Big Boy " . The song was neither a critical nor commercial success , but the Jackson family were delighted with the outcome nonetheless . \n The Jackson 5 would release a second single with Steeltown Records before moving to Motown Records . The group 's recordings at Steeltown Records were thought to be lost , but they were re...
2 \n = The Remix ( Lady Gaga album ) = \n \n The Remix is a remix album by American recording artist Lady Gaga . Released in Japan on March 3 , 2010 , it contains remixes of the songs from her first studio album , The Fame ( 2008 ) , and her third extended play , The Fame Monster ( 2009 ) . A revised version of the track list was prepared for release in additional markets , beginning with Mexico on May 3 , 2010 . A number of recording artists have produced the songs , including Pet Shop Boys , Passion Pit and The Sound of Arrows . The remixed versions feature both uptempo and <unk> composit...
3 \n = New Year 's Eve ( Up All Night ) = \n \n " New Year 's Eve " is the twelfth episode of the first season of the American comedy television series Up All Night . The episode originally aired on NBC in the United States on January 12 , 2012 . It was written by Erica <unk> and was directed by Beth McCarthy @-@ Miller . The episode also featured a guest appearance from Jason Lee as Chris and Reagan 's neighbor and Ava 's boyfriend , Kevin . \n During Reagan ( Christina Applegate ) and Chris 's ( Will <unk> ) first New Year 's Eve game night , Reagan 's competitiveness comes out causing Ch...
4 \n = Geopyxis carbonaria = \n \n Geopyxis carbonaria is a species of fungus in the genus Geopyxis , family <unk> . First described to science in 1805 , and given its current name in 1889 , the species is commonly known as the charcoal loving elf @-@ cup , dwarf <unk> cup , <unk> <unk> cup , or pixie cup . The small , <unk> @-@ shaped fruitbodies of the fungus are reddish @-@ brown with a whitish fringe and measure up to 2 cm ( 0 @.@ 8 in ) across . They have a short , tapered stalk . Fruitbodies are commonly found on soil where brush has recently been burned , sometimes in great numbers ....
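It can also be worth checking how many articles each dataframe holds, since the split a bit further down is done by position (training rows first, then validation rows); a quick look:

# Number of articles in the training, validation and concatenated dataframes
print(len(df_train), len(df_valid), len(df_all))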

We could tokenize the text on spaces to allow comparison with other results (as is usually done), but here we'll use the standard fastai tokenizer.

splits = [list(range_of(df_train)), list(range(len(df_train), len(df_all)))]
tfms = [attrgetter("text"), Tokenizer.from_df(0), Numericalize()]
dsets = Datasets(df_all, [tfms], splits=splits, dl_type=LMDataLoader)
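If you want to sanity-check what the pipeline produces, you can index into the training set and decode an item back into tokens; a minimal check along these lines (the exact slicing is just illustrative) should work:

# Grab one item from the training set: a tuple holding a single numericalized text
sample = dsets.train[0]
print(sample[0][:20])                  # the first 20 token ids
print(dsets.decode(sample)[0][:200])   # the same ids decoded back into tokens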
bs, sl = 104, 72   # batch size and sequence length
dls = dsets.dataloaders(bs=bs, seq_len=sl)
dls.show_batch(max_n=3)
text text_
0 xxbos = xxmaj mexico xxmaj city xxmaj metropolitan xxmaj cathedral = \n▁\n▁ xxmaj the xxmaj metropolitan xxmaj cathedral of the xxmaj assumption of the xxmaj most xxmaj blessed xxmaj virgin xxmaj mary into xxmaj heaven ( xxmaj spanish : xxunk xxunk de la xxunk de la xxmaj santísima xxunk xxmaj maría a los xxunk ) is the largest cathedral in the xxmaj americas , and seat of the xxmaj roman xxmaj catholic = xxmaj mexico xxmaj city xxmaj metropolitan xxmaj cathedral = \n▁\n▁ xxmaj the xxmaj metropolitan xxmaj cathedral of the xxmaj assumption of the xxmaj most xxmaj blessed xxmaj virgin xxmaj mary into xxmaj heaven ( xxmaj spanish : xxunk xxunk de la xxunk de la xxmaj santísima xxunk xxmaj maría a los xxunk ) is the largest cathedral in the xxmaj americas , and seat of the xxmaj roman xxmaj catholic xxmaj
1 , who had campaigned for a negotiated peace with xxmaj nazi xxmaj germany , was interned by the xxmaj british xxmaj authorities under xxmaj defence xxmaj regulation xxunk , along with most other active fascists in xxmaj britain . xxmaj lady xxmaj mosley was imprisoned a month later . xxmaj max and his brother xxmaj alexander were not included in this internship and as a result were separated from their parents for who had campaigned for a negotiated peace with xxmaj nazi xxmaj germany , was interned by the xxmaj british xxmaj authorities under xxmaj defence xxmaj regulation xxunk , along with most other active fascists in xxmaj britain . xxmaj lady xxmaj mosley was imprisoned a month later . xxmaj max and his brother xxmaj alexander were not included in this internship and as a result were separated from their parents for the
2 jewish xxmaj question to the xxmaj jewish xxmaj state : xxmaj an xxmaj essay on the xxmaj theory of xxmaj zionism ( thesis ) , xxmaj princeton xxmaj university . \n▁\n▁ = = = xxmaj articles and chapters = = = \n▁\n▁ " xxunk and the xxmaj palestine xxmaj question : xxmaj the not - so - strange xxmaj case of xxmaj joan xxmaj peter 's ' xxmaj from xxmaj time xxmaj xxmaj question to the xxmaj jewish xxmaj state : xxmaj an xxmaj essay on the xxmaj theory of xxmaj zionism ( thesis ) , xxmaj princeton xxmaj university . \n▁\n▁ = = = xxmaj articles and chapters = = = \n▁\n▁ " xxunk and the xxmaj palestine xxmaj question : xxmaj the not - so - strange xxmaj case of xxmaj joan xxmaj peter 's ' xxmaj from xxmaj time xxmaj immemorial
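As the table suggests, each target sequence (text_) is simply the input shifted one token to the left, which is what the language model learns to predict. A quick check of that, and of the batch shapes we asked for (illustrative):

# One batch from the training DataLoader: both tensors should be (bs, sl),
# and the target is the input stream offset by one token
xb, yb = dls.one_batch()
print(xb.shape, yb.shape)
print((xb[0, 1:] == yb[0, :-1]).all())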

Model

# Start from the default AWD-LSTM language-model config and raise its dropout values
config = awd_lstm_lm_config.copy()
config.update({'input_p': 0.6, 'output_p': 0.4, 'weight_p': 0.5, 'embed_p': 0.1, 'hidden_p': 0.2})
# Build an AWD-LSTM language model sized to our vocabulary
model = get_language_model(AWD_LSTM, len(dls.vocab), config=config)
opt_func = partial(Adam, wd=0.1, eps=1e-7)
# Mixed precision, gradient clipping, and the RNN callbacks with activation (alpha) and temporal activation (beta) regularization
cbs = [MixedPrecision(), GradientClip(0.1)] + rnn_cbs(alpha=2, beta=1)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), opt_func=opt_func, cbs=cbs, metrics=[accuracy, Perplexity()])
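Before launching training, you could also let fastai's learning-rate finder suggest a value rather than taking 5e-3 on faith; a quick sketch:

# Run the learning-rate finder (the model weights are restored afterwards)
learn.lr_find()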
learn.fit_one_cycle(1, 5e-3, moms=(0.8,0.7,0.8), div=10)
epoch train_loss valid_loss accuracy perplexity time
0 5.503713 5.095897 0.237340 163.350342 02:07
#learn.fit_one_cycle(90, 5e-3, moms=(0.8,0.7,0.8), div=10)
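The commented-out call above is the same schedule run for 90 epochs, if you want to train the model more thoroughly. Once you are happy with the result, you will probably want to keep the weights around; a minimal sketch, with an arbitrary file name:

# Save the weights under learn.path/learn.model_dir; reload later with learn.load('wikitext_lm')
learn.save('wikitext_lm')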