t = dropout_mask(torch.randn(3,4), [4,3], 0.25)
test_eq(t.shape, [4,3])
assert ((t == 4/3) + (t==0)).all()
AWD-LSTM
Basic NLP modules
On top of the PyTorch or the fastai layers, the language models use some custom layers specific to NLP.
dropout_mask
dropout_mask (x:torch.Tensor, sz:list, p:float)
Return a dropout mask of the same type as x, size sz, with probability p to cancel an element.
Name | Type | Details |
---|---|---|
x | Tensor | Source tensor, output will be of the same type as x |
sz | list | Size of the dropout mask as ints |
p | float | Dropout probability |
Returns | Tensor | Multiplicative dropout mask |
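As a rough illustration of what such a mask looks like, here is a minimal sketch that builds one with plain PyTorch calls. The name dropout_mask_sketch is hypothetical and this is not necessarily how the library implements dropout_mask; it only assumes that kept elements are rescaled by 1/(1-p), which is what the test at the top of this page checks (values are either 0 or 4/3 for p=0.25).

```python
import torch

def dropout_mask_sketch(x: torch.Tensor, sz: list, p: float) -> torch.Tensor:
    """Illustrative stand-in for dropout_mask (hypothetical name, assumed behavior)."""
    # new_empty keeps the dtype and device of x; the values are filled in below.
    mask = x.new_empty(*sz)
    # Each element is kept (set to 1) with probability 1 - p, otherwise set to 0.
    mask.bernoulli_(1 - p)
    # Rescale the kept elements so the expected value of the mask is 1.
    return mask.div_(1 - p)

mask = dropout_mask_sketch(torch.randn(3, 4), [4, 3], 0.25)
```

Multiplying an activation tensor by this mask zeroes a fraction of roughly p of its elements and scales the rest, so the expected activation is unchanged.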
RNNDropout
RNNDropout (p:float=0.5)
Dropout with probability p that is consistent on the seq_len dimension.
dp = RNNDropout(0.3)
tst_inp = torch.randn(4,3,7)
tst_out = dp(tst_inp)
for i in range(4):
    for j in range(7):
        if tst_out[i,0,j] == 0: assert (tst_out[i,:,j] == 0).all()
        else: test_close(tst_out[i,:,j], tst_inp[i,:,j]/(1-0.3))
It also supports dropout over a sequence of images where the time dimension is axis 1, for example a batch of 4 sequences of 10 images with 3 channels and size 32 by 32.
_ = dp(torch.rand(4,10,3,32,32))
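To make "consistent on the seq_len dimension" concrete, the following sketch draws one mask per sample with size 1 on axis 1 and lets broadcasting reuse it at every time step. The class name SeqConsistentDropout is hypothetical, and the code is an assumption about how such a layer can be written, not the library's source.

```python
import torch
from torch import nn

class SeqConsistentDropout(nn.Module):
    """Hypothetical sketch of dropout shared across axis 1 (seq_len)."""
    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dropout is only active in training mode.
        if not self.training or self.p == 0.:
            return x
        # The mask has size 1 on axis 1, so broadcasting reuses it at every step.
        mask_shape = (x.size(0), 1, *x.shape[2:])
        mask = x.new_empty(mask_shape).bernoulli_(1 - self.p).div_(1 - self.p)
        return x * mask

dp = SeqConsistentDropout(0.3)
seq = torch.randn(4, 3, 7)
out = dp(seq)
# The same code handles sequences of images: (batch, time, channels, height, width).
imgs_out = dp(torch.rand(4, 10, 3, 32, 32))
```

Because the mask shape is derived from all axes except axis 1, the same layer works for 3-dimensional text activations and 5-dimensional image sequences.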
WeightDropout
WeightDropout (module:nn.Module, weight_p:float, layer_names:str|MutableSequence='weight_hh_l0')
A module that wraps another layer in which some weights will be replaced by 0 during training.
Name | Type | Default | Details |
---|---|---|---|
module | nn.Module | | Wrapped module |
weight_p | float | | Weight dropout probability |
layer_names | str\|MutableSequence | weight_hh_l0 | Name(s) of the parameters to apply dropout to |
module = nn.LSTM(5,7)
dp_module = WeightDropout(module, 0.4)
wgts = dp_module.module.weight_hh_l0
tst_inp = torch.randn(10,20,5)
h = torch.zeros(1,20,7), torch.zeros(1,20,7)
dp_module.reset()
x,h = dp_module(tst_inp,h)
loss = x.sum()
loss.backward()
new_wgts = getattr(dp_module.module, 'weight_hh_l0')
test_eq(wgts, getattr(dp_module, 'weight_hh_l0_raw'))
assert 0.2 <= (new_wgts==0).sum().float()/new_wgts.numel() <= 0.6
assert dp_module.weight_hh_l0_raw.requires_grad
assert dp_module.weight_hh_l0_raw.grad is not None
assert ((dp_module.weight_hh_l0_raw.grad == 0.) & (new_wgts == 0.)).any()
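The idea checked by the assertions above (raw weights stay trainable, a dropped copy is used in the forward pass, and dropped entries receive zero gradient) can be shown on a much simpler layer. The sketch below applies weight dropout to a bias-free linear layer instead of an LSTM; WeightDropLinear is a hypothetical class written for illustration and is not part of the library.

```python
import torch
import torch.nn.functional as F
from torch import nn

class WeightDropLinear(nn.Module):
    """Hypothetical sketch: weight dropout on a linear layer, not on an LSTM."""
    def __init__(self, in_features: int, out_features: int, weight_p: float):
        super().__init__()
        self.weight_p = weight_p
        # The raw weights are the trainable parameter; dropout never edits them in place.
        self.weight_raw = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.dropped_weight = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A dropped copy is derived from the raw weights on every call (training mode only).
        self.dropped_weight = F.dropout(self.weight_raw, p=self.weight_p, training=self.training)
        return F.linear(x, self.dropped_weight)

torch.manual_seed(0)
layer = WeightDropLinear(5, 7, weight_p=0.4)
loss = layer(torch.randn(20, 5)).sum()
loss.backward()
# Positions that were zeroed in the copy used for this forward pass.
zeroed = layer.dropped_weight == 0
```

Gradients flow back to weight_raw through the dropped copy, so the entries in zeroed get a gradient of exactly 0 for this step while the raw values themselves are left untouched.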
EmbeddingDropout
EmbeddingDropout (emb:nn.Embedding, embed_p:float)
Apply dropout with probability embed_p to an embedding layer emb.
Name | Type | Details |
---|---|---|
emb | nn.Embedding | Wrapped embedding layer |
embed_p | float | Embedding layer dropout probability |
enc = nn.Embedding(10, 7, padding_idx=1)
enc_dp = EmbeddingDropout(enc, 0.5)
tst_inp = torch.randint(0,10,(8,))
tst_out = enc_dp(tst_inp)
for i in range(8):
    assert (tst_out[i]==0).all() or torch.allclose(tst_out[i], 2*enc.weight[tst_inp[i]])
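The test above implies that dropout acts on whole embedding rows: each output vector is either all zeros or the original row scaled by 1/(1-embed_p). A minimal sketch of that behavior, using a hypothetical helper embed_with_row_dropout that is assumed for illustration rather than taken from the library, could look like this:

```python
import torch
import torch.nn.functional as F
from torch import nn

def embed_with_row_dropout(emb: nn.Embedding, words: torch.Tensor, embed_p: float) -> torch.Tensor:
    """Hypothetical helper: drop whole rows of the embedding matrix, then look up."""
    # One mask value per vocabulary row, broadcast across the embedding dimension.
    keep = emb.weight.new_empty(emb.weight.size(0), 1).bernoulli_(1 - embed_p)
    masked_weight = emb.weight * keep / (1 - embed_p)
    # Look up the tokens in the masked matrix instead of the original one.
    return F.embedding(words, masked_weight, emb.padding_idx)

emb = nn.Embedding(10, 7, padding_idx=1)
words = torch.randint(0, 10, (8,))
out = embed_with_row_dropout(emb, words, 0.5)
```

Dropping rows rather than individual elements means a dropped word disappears entirely for that forward pass, wherever it occurs in the batch.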
AWD_LSTM
AWD_LSTM (vocab_sz:int, emb_sz:int, n_hid:int, n_layers:int, pad_token:int=1, hidden_p:float=0.2, input_p:float=0.6, embed_p:float=0.1, weight_p:float=0.5, bidir:bool=False)
AWD-LSTM inspired by https://arxiv.org/abs/1708.02182
Name | Type | Default | Details |
---|---|---|---|
vocab_sz | int | | Size of the vocabulary |
emb_sz | int | | Size of embedding vector |
n_hid | int | | Number of features in hidden state |
n_layers | int | | Number of LSTM layers |
pad_token | int | 1 | Padding token id |
hidden_p | float | 0.2 | Dropout probability for hidden state between layers |
input_p | float | 0.6 | Dropout probability for LSTM stack input |
embed_p | float | 0.1 | Embedding layer dropout probability |
weight_p | float | 0.5 | Hidden-to-hidden weight dropout probability for LSTM layers |
bidir | bool | False | If set to True uses bidirectional LSTM layers |
This is the core of an AWD-LSTM model: an embedding layer built from vocab_sz and emb_sz, followed by n_layers stacked LSTMs (bidirectional if bidir is set). The first LSTM goes from emb_sz to n_hid, the last one from n_hid to emb_sz, and all the inner ones from n_hid to n_hid. pad_token is passed to the PyTorch embedding layer. The dropouts are applied as follows:
- the embeddings are wrapped in EmbeddingDropout of probability embed_p;
- the result of this embedding layer goes through an RNNDropout of probability input_p;
- each LSTM has WeightDropout applied with probability weight_p;
- between two of the inner LSTMs, an RNNDropout is applied with probability hidden_p.
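The layer sizes described above can be worked out with a few lines of plain Python. The helper lstm_layer_sizes is hypothetical and covers the unidirectional case only; treating a single layer as both first and last (so it maps emb_sz to emb_sz) is an assumption, since the text does not spell that case out.

```python
def lstm_layer_sizes(emb_sz: int, n_hid: int, n_layers: int) -> list:
    """Hypothetical helper: (input size, output size) of each stacked LSTM."""
    sizes = []
    for i in range(n_layers):
        in_sz = emb_sz if i == 0 else n_hid               # first layer reads embeddings
        out_sz = emb_sz if i == n_layers - 1 else n_hid   # last layer maps back to emb_sz
        sizes.append((in_sz, out_sz))
    return sizes

print(lstm_layer_sizes(20, 10, 3))  # [(20, 10), (10, 10), (10, 20)]
```

Ending on emb_sz keeps the output the same width as the embeddings, whatever value is chosen for n_hid.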
The module returns two lists: the raw outputs of each inner LSTM (before the hidden_p dropout is applied) and the outputs with dropout. Since no dropout is applied to the last output, the two lists share the same last element, which is the output that should be fed to a decoder (in the case of a language model).
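The relationship between the two lists can be sketched with a small loop over stacked LSTMs. This is an illustration of the description above under stated assumptions, not the module's actual forward method: run_stack is a hypothetical helper, plain nn.LSTM layers are used without weight dropout, and nn.Dropout stands in for RNNDropout.

```python
import torch
from torch import nn

def run_stack(rnns, hidden_dps, x):
    """Hypothetical helper: collect raw and dropped-out outputs of stacked LSTMs."""
    raw_outputs, outputs = [], []
    for i, rnn in enumerate(rnns):
        x, _ = rnn(x)                 # raw output of layer i
        raw_outputs.append(x)
        if i != len(rnns) - 1:        # no dropout after the last LSTM
            x = hidden_dps[i](x)
        outputs.append(x)
    return raw_outputs, outputs

# Sizes follow emb_sz=20 and n_hid=10; nn.Dropout is a stand-in for RNNDropout.
rnns = nn.ModuleList([nn.LSTM(20, 10, batch_first=True), nn.LSTM(10, 20, batch_first=True)])
hidden_dps = nn.ModuleList([nn.Dropout(0.2)])
raw_outputs, outputs = run_stack(rnns, hidden_dps, torch.randn(10, 5, 20))
```

Because the last layer skips dropout, raw_outputs[-1] and outputs[-1] are the same tensor object.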
tst = AWD_LSTM(100, 20, 10, 2, hidden_p=0.2, embed_p=0.02, input_p=0.1, weight_p=0.2)
x = torch.randint(0, 100, (10,5))
r = tst(x)
test_eq(tst.bs, 10)
test_eq(len(tst.hidden), 2)
test_eq([h_.shape for h_ in tst.hidden[0]], [[1,10,10], [1,10,10]])
test_eq([h_.shape for h_ in tst.hidden[1]], [[1,10,20], [1,10,20]])
test_eq(r.shape, [10,5,20])
test_eq(r[:,-1], tst.hidden[-1][0][0]) #hidden state is the last timestep in raw outputs
tst.eval()
tst.reset()
tst(x);
tst(x);
awd_lstm_lm_split
awd_lstm_lm_split (model)
Split an RNN model in groups for differential learning rates.
awd_lstm_clas_split
awd_lstm_clas_split (model)
Split an RNN model in groups for differential learning rates.
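For readers unfamiliar with differential learning rates, the sketch below shows the general idea with plain PyTorch parameter groups. The two-module model and the learning rates are made up for illustration, and the grouping is arbitrary; it does not reproduce the groups returned by awd_lstm_lm_split or awd_lstm_clas_split.

```python
import torch
from torch import nn

# A toy model with two parts that will be trained at different speeds (illustrative only).
encoder = nn.Embedding(100, 20)
decoder = nn.Linear(20, 100)

# Each group is a list of parameters; earlier layers usually get smaller learning rates.
groups = [list(encoder.parameters()), list(decoder.parameters())]
lrs = [1e-4, 1e-3]

# torch.optim accepts a list of dicts, one per group, each with its own options.
opt = torch.optim.SGD([{"params": g, "lr": lr} for g, lr in zip(groups, lrs)], lr=1e-3)
```

A splitter function plays the role of building the groups list: it takes the model and returns its parameters partitioned into ordered groups.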