This callback groups together a few tweaks needed to properly train RNNs. They all come from the article "Regularizing and Optimizing LSTM Language Models" by Stephen Merity et al.
Adjusting the learning rate to the sequence length: since we're modifying the bptt at each batch, sometimes by a lot (we randomly divide it by 2), the learning rate has to be adjusted to take this into account, mainly by multiplying it by the ratio of the actual sequence length to the base bptt.
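As a rough sketch (not fastai's actual implementation), this adjustment amounts to rescaling the optimizer's learning rate by that ratio; the names `adjust_lr`, `base_lr`, `seq_len` and `bptt` below are illustrative:

```python
# Hedged sketch: rescale the optimizer's learning rate by seq_len / bptt.
# `base_lr`, `seq_len` and `bptt` are illustrative names, not fastai's API.
def adjust_lr(optimizer, base_lr: float, seq_len: int, bptt: int) -> None:
    for group in optimizer.param_groups:
        group['lr'] = base_lr * seq_len / bptt
```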
Activation Regularization: on top of weight decay, we apply another, similar form of regularization that consists of adding to the loss a scaled factor of the sum of all the squares of the outputs (with dropout applied) of the various layers of the RNN. Intuitively, weight decay tries to get the network to learn small weights; this tries to get the model to learn to produce smaller activations.
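A minimal sketch of such a term, assuming `dropped_output` is the dropout-applied output of one RNN layer and `alpha` is the scaling factor (using a mean rather than a raw sum to normalize, a common choice):

```python
import torch

def activation_regularization(dropped_output: torch.Tensor, alpha: float) -> torch.Tensor:
    # AR: an L2 penalty on the (dropout-applied) activations themselves.
    return alpha * dropped_output.float().pow(2).mean()
```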
Temporal Activation Regularization: lastly, we add to the loss a scaled factor of the sum of the squares of $h_{t+1} - h_t$, where $h_t$ is the output (before dropout is applied) of one layer of the RNN at time step $t$ (word $t$ of the sentence). This will encourage the model to produce activations that don't vary too fast between two consecutive words of the sentence.
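Similarly, a hedged sketch of the TAR term, assuming `raw_output` holds the pre-dropout output of one layer in a batch-first layout (batch, seq_len, hidden); adjust the indexing for a time-first layout:

```python
import torch

def temporal_activation_regularization(raw_output: torch.Tensor, beta: float) -> torch.Tensor:
    # TAR: penalize fast changes between consecutive time steps,
    # computed on the output *before* dropout is applied.
    diff = raw_output[:, 1:] - raw_output[:, :-1]
    return beta * diff.float().pow(2).mean()
```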
Callback that adds to `learner` the RNN tweaks for training on data with `bptt`. `alpha` is the scale for AR, `beta` is the scale for TAR. If `adjust` is False, the learning rate isn't adjusted to the sequence length.
The fastai RNNs return a `last_output` that is a tuple of three elements: the true output (which is returned) and the hidden states before and after dropout (which are saved internally for the next function).
Adjusts the learning rate to the size of `last_input`. Adds to `last_loss` the AR and TAR.
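Putting the pieces together, here is a hedged, self-contained sketch of what such a callback does at each batch. It is not fastai's actual `RNNTrainer` class; the class name, the argument names and the default `alpha`/`beta` values are illustrative, and the hidden states are assumed to be lists with one batch-first tensor per RNN layer:

```python
class RNNRegularizerSketch:
    "Illustrative sketch of the three tweaks; not fastai's actual RNNTrainer."
    def __init__(self, optimizer, base_lr, bptt, alpha=2., beta=1., adjust=True):
        self.optimizer, self.base_lr, self.bptt = optimizer, base_lr, bptt
        self.alpha, self.beta, self.adjust = alpha, beta, adjust

    def on_loss_begin(self, last_output):
        # The model is assumed to return (output, raw_outputs, dropped_outputs):
        # keep the extra hidden states for later, pass only the true output on.
        output, self.raw_outputs, self.dropped_outputs = last_output
        return output

    def on_backward_begin(self, last_loss, seq_len):
        # Rescale the learning rate by the actual sequence length.
        if self.adjust:
            for group in self.optimizer.param_groups:
                group['lr'] = self.base_lr * seq_len / self.bptt
        # AR on the dropout-applied output (last layer, for brevity).
        h_drop = self.dropped_outputs[-1]
        last_loss = last_loss + self.alpha * h_drop.float().pow(2).mean()
        # TAR on the pre-dropout output (batch-first layout assumed).
        h_raw = self.raw_outputs[-1]
        if h_raw.size(1) > 1:
            diff = h_raw[:, 1:] - h_raw[:, :-1]
            last_loss = last_loss + self.beta * diff.float().pow(2).mean()
        return last_loss
```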