The Lottery Ticket Hypothesis for Pre-trained BERT Networks

NeurIPS 2020

  • Lottery Ticket Hypothesis - A large neural network contains a sparse sub-network (a "winning ticket") that, trained on its own, accounts for almost all of the full network's performance (accuracy). Extracting and training that sub-network reduces model size with little or no sacrifice in performance.
  • This paper explores whether the Lottery Ticket Hypothesis applies to BERT. Turns out it does!
  • Finds sub-networks at 40% ~ 90% sparsity. The sub-network for the masked language modeling task had a sparsity of 70%.
  • Sub-networks are mostly downstream-task-specific. A sub-network found for one task generally cannot be reused for another task.
  • In earlier research regarding the Lottery Ticket Hypothesis, good sub-networks had to be captured in early stages of pretraining. This paper explores sub-network capturing during fine-tuning.
  • Thoughts: If tickets (sub-networks) are not transferable between tasks, the results are not very exciting from the point of view of building smaller general-purpose language models.
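
The notes above do not say how a ticket is extracted. A common choice in Lottery Ticket work is magnitude pruning: drop the weights with the smallest absolute values and keep a binary mask over the rest. The sketch below assumes that approach on a flat list of toy weights; `magnitude_mask` and `apply_mask` are hypothetical helper names, and a real run would prune in several rounds with retraining in between.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that removes the `sparsity` fraction of
    weights with the smallest absolute value."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by |weight|, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def apply_mask(weights, mask):
    """Zero out the pruned weights; the survivors form the sub-network."""
    return [w * m for w, m in zip(weights, mask)]


# Toy weights (illustrative values, not taken from the paper).
weights = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2, -0.4, 0.08, 0.6, -0.15]
mask = magnitude_mask(weights, sparsity=0.7)  # 70% sparsity
ticket = apply_mask(weights, mask)
print(mask)  # only the three largest-magnitude weights survive
```

In a task-specific setting, the mask would be computed from weights fine-tuned on that task, which is one way to see why a mask found for one task need not suit another.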

How to Train BERT with an Academic Budget


  • Presents methods to train BERT in 24 hours (with 8 GPUs with 12GB VRAM), achieving competitive results with the reference implementation.
  • Combines various cost-saving training methods suggested so far:
    • Use DeepSpeed.
    • Replace masked language model prediction head with sparse token prediction.
    • Pre-mask 10 copies of original corpus. Pre-load corpus to RAM to reduce IO bottleneck.
    • Sequence length of 128. Tried batch sizes 4,096, 8,192, and 16,384.
    • Significantly smaller learning rate warmup proportions (0% ~ 6%) than ones used by original BERT.
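
To make the warmup proportion concrete, here is a minimal sketch of a learning rate schedule with linear warmup followed by linear decay. The linear decay, the function name `lr_at`, and the numbers used are assumptions for illustration; the notes above only state the 0% ~ 6% warmup range.

```python
def lr_at(step, total_steps, peak_lr, warmup_proportion):
    """Learning rate at `step` (0-indexed): linear warmup to `peak_lr`
    over the first `warmup_proportion` of training, then linear decay."""
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Guard against a zero-length decay phase.
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * (total_steps - step) / decay_steps


# Illustrative values only.
total_steps, peak_lr = 1000, 1e-3
for proportion in (0.0, 0.06):
    first = lr_at(0, total_steps, peak_lr, proportion)
    print(f"warmup={proportion:.0%}: lr at step 0 = {first:.2e}")
```

With a 0% warmup the schedule starts at the peak learning rate right away; with 6% it reaches the peak after the first 60 of 1,000 steps.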

Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing

NAACL 2019

  • Posterior collapse is a huge problem in VAEs. One has to balance the prior (KL) loss on the encoder output against the reconstruction loss on the decoder output.
  • Often, a VAE learns to prioritize the reconstruction loss, and the decoder ends up ignoring the output of the encoder.
  • Several studies try to mitigate this phenomenon. One famous example is KL annealing suggested in Bowman et al., where the weight of the prior loss is slowly warmed up (annealed) during the initial training steps, so that the prior loss only starts to matter after the decoder has learned to make more sense.
  • This paper proposes a new scheme for KL annealing, which is to repeat (cycle) KL weight warm-ups during training.
  • Paper results show cyclical annealing is almost always better than monotonic annealing.
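
The two schedules can be sketched as functions from training step to KL weight. This is a rough illustration, assuming a linear ramp inside each cycle followed by a plateau at 1; the parameter names `n_cycles` and `ratio` and their default values are illustrative choices, not settings quoted from the notes above.

```python
def monotonic_beta(step, warmup_steps):
    """KL weight that ramps from 0 to 1 once, then stays at 1."""
    return min(1.0, step / warmup_steps)


def cyclical_beta(step, total_steps, n_cycles=4, ratio=0.5):
    """KL weight that restarts from 0 at the beginning of every cycle.

    Within a cycle the weight rises linearly to 1 over the first
    `ratio` of the cycle and is held at 1 for the remainder.
    """
    cycle_len = total_steps / n_cycles
    position = (step % cycle_len) / cycle_len  # in [0, 1)
    return min(1.0, position / ratio)


total_steps = 1000
for step in (0, 100, 200, 250, 300, 500):
    print(step,
          round(monotonic_beta(step, warmup_steps=500), 2),
          round(cyclical_beta(step, total_steps), 2))
```

The weight returned here would multiply the KL term of the VAE loss at each step; the cyclical version periodically drops it back to 0.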


ICLR 2019

  • To combat posterior collapse, train the VAE encoder to completion before training the decoder.

The Curious Case of Neural Text Degeneration

ICLR 2020

  • Autoregressive decoders have a hard time generating human-like text. The decoding method must find a balance between probabilistically selecting likely tokens and mimicking human token variance.
  • This study suggests nucleus sampling (a.k.a. top-p sampling).
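  • In top-k sampling, the next token is sampled from a pool of the k most likely tokens.
  • In top-p sampling, the next token is sampled from the smallest pool of most likely tokens whose cumulative probability reaches p. The pool size therefore changes with the shape of the distribution.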
  • Results in the paper show top-p sampling better matches human performance in terms of variance and accuracy.
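
The difference between the two truncation rules is easier to see on a toy next-token distribution. The sketch below is a plain-Python illustration; the helper names (`top_k_filter`, `top_p_filter`, `sample`) and the probability values are made up for this example.

```python
import random


def _renormalize(probs, keep):
    """Zero out tokens outside `keep` and rescale the rest to sum to 1."""
    keep = set(keep)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep the k most likely tokens, regardless of their probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches p (the "nucleus")."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)


def sample(probs, rng=random):
    """Draw one token index from a (filtered) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


peaked = [0.9, 0.05, 0.03, 0.02]     # model is confident
flat = [0.3, 0.25, 0.2, 0.15, 0.1]   # model is uncertain

for name, dist in (("peaked", peaked), ("flat", flat)):
    k_pool = sum(1 for x in top_k_filter(dist, k=2) if x > 0)
    p_pool = sum(1 for x in top_p_filter(dist, p=0.8) if x > 0)
    print(name, "top-k pool:", k_pool, "top-p pool:", p_pool)
```

Top-k keeps two candidates in both cases, while the top-p pool shrinks to one token for the peaked distribution and grows to four for the flat one.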

Neural Text Generation with Unlikelihood Training

ICLR 2020

  • This paper tackles neural text degeneration in the training phase (as opposed to decoding phase).
  • The paper proposes unlikelihood training, which adds a penalty when the decoder assigns high probability to tokens it should not produce, such as tokens it has already seen.
  • There are two kinds of unlikelihood training: token-level and sequence-level. They can be used simultaneously.
  • Token-level unlikelihood training penalizes the probability assigned to tokens that already appeared in the preceding context.
  • Sequence-level unlikelihood training penalizes repeated n-grams in sequences generated by the model.
  • Autoregressive decoders trained this way generate sentences that are less dull (dull meaning they produce common tokens that don't contribute much) and less repetitive.
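
A rough sketch of the token-level objective for a single decoding step: the usual likelihood term for the target token, plus a penalty of -log(1 - p(c)) for every token c that already appeared in the context (excluding the target itself). The function name, the `alpha` weight, and the toy probabilities are assumptions for illustration.

```python
import math


def token_unlikelihood_loss(probs, target, context, alpha=1.0):
    """Loss for one decoding step.

    probs   -- model's next-token distribution (list of probabilities)
    target  -- index of the ground-truth next token
    context -- indices of tokens already seen before this step
    alpha   -- weight of the unlikelihood penalty
    """
    # Negative candidates: previously seen tokens, except the true target.
    candidates = set(context) - {target}
    likelihood_term = -math.log(probs[target])
    unlikelihood_term = -sum(math.log(1.0 - probs[c]) for c in candidates)
    return likelihood_term + alpha * unlikelihood_term


# Toy 3-token vocabulary; the model puts most mass on token 1,
# which has already appeared twice in the context.
probs = [0.1, 0.6, 0.3]
print(token_unlikelihood_loss(probs, target=2, context=[]))
print(token_unlikelihood_loss(probs, target=2, context=[1, 1, 0]))
```

The second call yields a larger loss because the model still assigns high probability to tokens that were already used.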

Quantifying Exposure Bias for Open-ended Language Generation


  • Teacher forcing - Feeding the ground-truth tokens, shifted by one step, as decoder inputs during autoregressive model training. This allows parallelized training and faster convergence. Mitigates gradient vanishing/exploding.
  • Exposure bias - Teacher forcing is considered a problem in autoregressive models because model operation differs between training and test time: at test time the model conditions on its own predictions instead of the ground truth. This is believed to lead to degeneration.
  • Paper says it's actually okay to use teacher forcing.
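
The mismatch behind exposure bias can be shown with decoder inputs alone. In this toy sketch the "model" is a hypothetical lookup table that conditions only on the previous token, which is a simplification; real decoders condition on the whole prefix.

```python
BOS = "<s>"


def teacher_forced_inputs(target_tokens):
    """Training: the input at step t is the ground-truth token at t-1."""
    return [BOS] + target_tokens[:-1]


def free_running_inputs(predict_next, length):
    """Test time: the input at step t is the model's own output at t-1."""
    inputs, prev = [], BOS
    for _ in range(length):
        inputs.append(prev)
        prev = predict_next(prev)
    return inputs


# Hypothetical imperfect model: it gets stuck repeating "cat".
toy_model = {"<s>": "the", "the": "cat", "cat": "cat"}
target = ["the", "cat", "sat", "down"]

print(teacher_forced_inputs(target))
print(free_running_inputs(toy_model.get, len(target)))
```

During training the decoder always sees the correct prefix, while at test time an early mistake is fed back in and can compound.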

Revisiting Neural Language Modelling with Syllables

ICLR 2020

  • Language model was trained with syllable tokenization on Penn Treebank.
  • Outperforms character, morphological, and BPE tokenizations in character-level perplexity.

Tagged in:

nlp, papers

Last Update: July 04, 2024