Training callbacks

Various callbacks to customize training behavior

source

ShortEpochCallback

 ShortEpochCallback (pct=0.01, short_valid=True)

Fit just pct of an epoch, then stop

learn = synth_learner()
learn.fit(1, cbs=ShortEpochCallback())
epoch  train_loss  valid_loss  time
0                              00:00
learn = synth_learner()
learn.fit(1, cbs=ShortEpochCallback(short_valid=False))
epoch  train_loss  valid_loss  time
0      8.432135                00:00

source

GradientAccumulation

 GradientAccumulation (n_acc=32)

Accumulate gradients before updating weights

When the number of steps per accumulation is higher than the number of batches, the parameters (and therefore validation loss) don’t change at all:

learn = synth_learner()
learn.fit(1, lr=0.01, cbs=GradientAccumulation(n_acc=1000))
# ensure valid_loss didn't change
assert learn.recorder.values[-1][1] == learn.recorder.values[0][1]
epoch  train_loss  valid_loss  time
0      20.987558   26.849480   00:00
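The mechanics behind the callback can be sketched in plain PyTorch (a minimal sketch with a hypothetical tiny model, not fastai's actual implementation): gradients from successive `backward` calls accumulate in each parameter's `.grad`, and the optimizer only steps once `n_acc` samples have been seen.

```python
import torch

# Hypothetical tiny model and random data, purely for illustration
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
n_acc, seen = 8, 0

for step in range(16):
    xb, yb = torch.randn(2, 4), torch.randn(2, 1)      # batch size 2
    loss = torch.nn.functional.mse_loss(model(xb), yb)
    loss.backward()            # gradients accumulate in p.grad
    seen += len(xb)
    if seen >= n_acc:          # step only once n_acc samples have been seen
        opt.step()
        opt.zero_grad()
        seen = 0
```

If `n_acc` exceeds the total number of samples seen during the epoch, the `if` branch never fires and the weights never update, which is exactly what the assertion on `learn.recorder.values` above checks.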

source

GradientClip

 GradientClip (max_norm:float=1.0, norm_type:float=2.0)

Clip norm of gradients

Normally, if we use a learning rate that is too high, our training will diverge. This happens even with mixed precision training, which avoids infinities by using dynamic loss scaling, but still diverges:

fp16 = MixedPrecision()
set_seed(99)
learn = synth_learner(lr=1.1, cuda=True)
learn.fit(3, cbs=fp16)
epoch  train_loss   valid_loss   time
0      38.214138    25.269005    00:00
1      377.145508   890.010376   00:00
2      839.392883   9965.747070  00:00

By adding the GradientClip callback, the norm of the gradients (of order norm_type, default 2) is clipped to at most max_norm (default 1) using nn.utils.clip_grad_norm_, which can avoid loss divergence:

set_seed(99)
learn = synth_learner(lr=1.1, cuda=True)
learn.fit(3, cbs=[GradientClip,fp16])
epoch  train_loss  valid_loss  time
0      2.039428    2.372177    00:00
1      1.402425    0.300728    00:00
2      1.013548    0.332610    00:00
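Under the hood, the clipping amounts to one call to `torch.nn.utils.clip_grad_norm_` between `backward` and the optimizer step. A minimal sketch with a hypothetical model (not fastai's callback code):

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=1.1)  # deliberately large lr

xb, yb = torch.randn(16, 10), torch.randn(16, 1)
loss = torch.nn.functional.mse_loss(model(xb), yb)
loss.backward()

# Clip the global 2-norm of all gradients to max_norm before the step;
# clip_grad_norm_ returns the total norm *before* clipping.
pre_clip_norm = torch.nn.utils.clip_grad_norm_(model.parameters(),
                                               max_norm=1.0, norm_type=2.0)
post_clip_norm = torch.norm(torch.stack([p.grad.norm(2) for p in model.parameters()]), 2)
opt.step()
opt.zero_grad()
```

After the call, the global gradient norm is guaranteed to be at most `max_norm`, so a single pathological batch can no longer produce an arbitrarily large weight update.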

BnFreeze


source

BnFreeze

 BnFreeze (after_create=None, before_fit=None, before_epoch=None,
           before_train=None, before_batch=None, after_pred=None,
           after_loss=None, before_backward=None,
           after_cancel_backward=None, after_backward=None,
           before_step=None, after_cancel_step=None, after_step=None,
           after_cancel_batch=None, after_batch=None,
           after_cancel_train=None, after_train=None,
           before_validate=None, after_cancel_validate=None,
           after_validate=None, after_cancel_epoch=None, after_epoch=None,
           after_cancel_fit=None, after_fit=None)

Basic class handling tweaks of the training loop by changing a Learner in various events


source

set_bn_eval

 set_bn_eval (m:torch.nn.modules.module.Module, use_eval=True)

Set bn layers in eval mode for all recursive children of m.

BnFreeze is useful when you’d like to train two separate models that have a common feature extractor / body. The only part of the model that’s different is the head that you attach for transfer learning.

Learner.freeze() doesn’t suffice here, as the BatchNorm layers are trainable by default and keep tracking the running mean and std of batches. For the feature extractors to fully match, you need to set train_bn=False, and these running stats need to be frozen as well, which is precisely the function of BnFreeze.
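Freezing the running stats boils down to putting every BatchNorm module in eval mode before the forward pass: in eval mode, BatchNorm uses its stored running stats instead of batch stats and stops updating them. A minimal sketch (the helper name `freeze_bn_stats` is made up for illustration; fastai's `set_bn_eval` does this with extra handling for trainable layers):

```python
import torch
from torch import nn

def freeze_bn_stats(m: nn.Module):
    "Put every BatchNorm layer of `m` in eval mode so its running stats stop updating."
    for module in m.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.eval()

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
model.train()                 # training mode would normally update running stats
freeze_bn_stats(model)        # ...but the BatchNorm layer is now in eval mode

before = model[1].running_mean.clone()
model(torch.randn(4, 3, 16, 16))  # a "training" forward pass
# model[1].running_mean is unchanged after the forward pass
```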

path = untar_data(URLs.MNIST_TINY)
dls  = ImageDataLoaders.from_folder(path, valid_pct=0.2)

We first demonstrate the mismatch in the running stats when using only train_bn=False, by creating a Learner…:

learn1 = vision_learner(deepcopy(dls), resnet18, pretrained=True, train_bn=False)

…then grab the first BatchNorm layer and store its running mean:

m = learn1.model[0][1].running_mean.clone()

You can see that the running mean has now changed:

learn1.fit(1, lr=0.02)
test_ne(to_detach(learn1.model[0][1].running_mean), m)
epoch  train_loss  valid_loss  time
0      1.148303    0.739404    00:12

When we use the BnFreeze callback, the running statistics will not be changed during training. This is often important for getting good results from transfer learning.

learn1 = vision_learner(deepcopy(dls), resnet18, pretrained=True, train_bn=False, cbs=BnFreeze)
m = learn1.model[0][1].running_mean.detach().clone()
learn1.fit(1, lr=0.02)
test_eq(to_detach(learn1.model[0][1].running_mean), m)
epoch  train_loss  valid_loss  time
0      0.478594    0.270772    00:10

Channels Last

[Beta] A simple callback to use channels last memory format

This callback sets your model in channels_last memory format before training. This can give speed-ups on modern GPUs. You can read more about this beta feature in the PyTorch docs.
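In plain PyTorch, the conversion is a single `.to(memory_format=torch.channels_last)` call on the model and its inputs; the NCHW shape is unchanged, only the strides are reordered so channels are stored last (NHWC). A minimal sketch:

```python
import torch

# Convert a model and an input batch to channels-last (NHWC) memory format.
# Shapes stay NCHW; only the underlying memory layout changes.
model = torch.nn.Conv2d(3, 8, kernel_size=3).to(memory_format=torch.channels_last)
x = torch.randn(2, 3, 32, 32).contiguous(memory_format=torch.channels_last)

out = model(x)  # convolution propagates the channels-last layout
```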


source

ChannelsLast

 ChannelsLast (after_create=None, before_fit=None, before_epoch=None,
               before_train=None, before_batch=None, after_pred=None,
               after_loss=None, before_backward=None,
               after_cancel_backward=None, after_backward=None,
               before_step=None, after_cancel_step=None, after_step=None,
               after_cancel_batch=None, after_batch=None,
               after_cancel_train=None, after_train=None,
               before_validate=None, after_cancel_validate=None,
               after_validate=None, after_cancel_epoch=None,
               after_epoch=None, after_cancel_fit=None, after_fit=None)

Basic class handling tweaks of the training loop by changing a Learner in various events

learn1 = vision_learner(deepcopy(dls), resnet18, pretrained=True, cbs=ChannelsLast())
learn1.fit(1, lr=0.02)
epoch  train_loss  valid_loss  time
0      0.698547    7.504092    00:29
Note

It works with most timm models.