By Seongmin Park in nlp — May 28, 2021

Interesting Papers - April 2021

Will someone please talk some sense into my decoder

The Lottery Ticket Hypothesis for Pre-trained BERT Networks

Lottery Ticket Hypothesis - In a huge neural network, there exists a sub-network that is responsible for almost all of the huge network's performance (accuracy). Extracting and training the sub-network can reduce model size with no sacrifice in performance.
This paper explores whether the Lottery Ticket Hypothesis applies to BERT. Turns out it does!
Sub-networks reach 40% ~ 90% sparsity. The sub-network for the masked language modeling task had a sparsity of 70%.
Sub-networks are mostly downstream-task-specific. A sub-network found for one task cannot be used for another task.
In earlier research on the Lottery Ticket Hypothesis, good sub-networks had to be captured in the early stages of pretraining. This paper explores capturing sub-networks during fine-tuning.
Thoughts: If tickets (sub-networks) are not transferable between tasks, the results are not very exciting from the standpoint of building smaller general-purpose language models.
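
Lottery-ticket sub-networks are commonly found by magnitude pruning: drop the weights with the smallest absolute values and keep a binary mask of the survivors. Below is a minimal sketch of that single pruning step, assuming the weights are a flat Python list. The function names and the toy weights are made up for illustration, and the rewinding and retraining steps are left out.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of smallest-magnitude weights."""
    n_prune = round(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def apply_mask(weights, mask):
    """Zero out pruned weights; the surviving weights form the sub-network."""
    return [w * m for w, m in zip(weights, mask)]


# Toy weights (hypothetical), pruned to 70% sparsity.
weights = [0.05, -1.2, 0.3, -0.01, 0.8, 0.02, -0.6, 0.1, 1.5, -0.04]
mask = magnitude_mask(weights, sparsity=0.7)
ticket = apply_mask(weights, mask)
print(mask)
```

At 70% sparsity, only the three largest-magnitude weights of the ten survive.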

Presents methods to train BERT in 24 hours (on 8 GPUs with 12GB of VRAM each), achieving results competitive with the reference implementation.
A combination of various cost-saving training methods suggested so far:
- Use DeepSpeed.
- Replace masked language model prediction head with sparse token prediction.
- Pre-mask 10 copies of original corpus. Pre-load corpus to RAM to reduce IO bottleneck.
- Sequence length of 128. Tried batch sizes 4,096, 8,192, and 16,384.
- Significantly smaller learning rate warmup proportions (0% ~ 6%) than the ones used by the original BERT.
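
The pre-masking step from the list above can be sketched roughly as follows. This is a simplified illustration, not the paper's code: the 15% masking rate, the "[MASK]" string, and the function names are assumptions, and BERT's random/unchanged token replacement is left out. The returned `targets` are the only positions a sparse token prediction head would need to score.

```python
import random

MASK = "[MASK]"  # assumed mask token string


def premask(tokens, mask_prob, rng):
    """Return (masked_tokens, targets); targets maps position -> original token."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # only these positions need a prediction
        else:
            masked.append(tok)
    return masked, targets


def premask_corpus(corpus, n_copies=10, mask_prob=0.15, seed=0):
    """Build n_copies differently masked versions of the corpus ahead of training."""
    rng = random.Random(seed)
    return [[premask(sent, mask_prob, rng) for sent in corpus]
            for _ in range(n_copies)]


corpus = [["the", "cat", "sat", "on", "the", "mat"], ["hello", "world"]]
copies = premask_corpus(corpus)
print(len(copies))
```

Everything is kept in memory, which mirrors the idea of pre-loading the corpus to RAM so masking never happens inside the training loop.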

Posterior collapse is a huge problem in VAEs. One has to balance the prior (KL) loss on the encoder output against the reconstruction loss on the decoder output.
Often, a VAE with a strong decoder drives the prior loss to zero, and the decoder learns to reconstruct while ignoring the output of the encoder.
Several studies try to mitigate this phenomenon. One famous example is KL annealing, suggested in Bowman et al., where the weight of the prior loss is slowly warmed up (annealed) during the initial training steps, so that the prior loss takes effect only after the decoder has learned to produce more sensible output.
This paper proposes a new scheme for KL annealing, which is to repeat (cycle) KL weight warm-ups during training.
Results in the paper show that cyclical annealing is almost always better than monotonic annealing.
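
To make the difference between the two schedules concrete, here is a rough sketch of the KL weight as a function of the training step. The linear ramp, the number of cycles, and the ramp ratio are illustrative assumptions, and the function names are made up.

```python
def monotonic_beta(step, warmup_steps):
    """KL weight that ramps linearly from 0 to 1 once, then stays at 1."""
    return min(1.0, step / warmup_steps)


def cyclical_beta(step, total_steps, n_cycles=4, ramp_ratio=0.5):
    """KL weight that repeats the 0 -> 1 ramp at the start of every cycle."""
    cycle_len = total_steps / n_cycles
    pos = (step % cycle_len) / cycle_len  # position inside the current cycle, in [0, 1)
    return min(1.0, pos / ramp_ratio)     # ramp for ramp_ratio of the cycle, then hold at 1


for step in (0, 50, 125, 200, 250, 300):
    print(step, monotonic_beta(step, 125), cyclical_beta(step, 1000))
```

The returned value would multiply the prior loss term in the VAE objective at each step.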

To combat posterior collapse, train the VAE encoder to completion before training the decoder.

Autoregressive decoders have a hard time generating human-like text. The network must find a balance between probabilistically selecting likely tokens and mimicking human token variance.
This study proposes nucleus sampling (a.k.a. top-p sampling).
In top-k sampling, next tokens are sampled from a pool of the top k most likely tokens.
In top-p sampling, next tokens are sampled from the smallest pool of most likely tokens whose cumulative probability reaches p.
Results in the paper show that top-p sampling better matches human performance in terms of variance and accuracy.
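
A small sketch of how the two candidate pools differ, assuming the next-token distribution is given as a plain dict of token to probability. The toy distribution and function names are made up for illustration.

```python
import random


def top_k_pool(probs, k):
    """Keep the k most likely tokens."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


def top_p_pool(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    pool, cumulative = {}, 0.0
    for token, prob in ranked:
        pool[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return pool


def sample(pool, rng):
    """Sample one token from the pool; random.choices treats the values as relative weights."""
    tokens = list(pool)
    return rng.choices(tokens, weights=[pool[t] for t in tokens], k=1)[0]


probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}  # toy distribution
rng = random.Random(0)
print(sorted(top_k_pool(probs, 2)), sorted(top_p_pool(probs, 0.9)))
print(sample(top_p_pool(probs, 0.9), rng))
```

The pool size in top-p sampling changes with the shape of the distribution: a peaked distribution yields a small pool, a flat one yields a large pool, whereas top-k always keeps exactly k tokens.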

This paper tackles neural text degeneration in the training phase (as opposed to decoding phase).
The paper proposes unlikelihood training, which adds a penalty if the decoder repeats tokens it saw before.
There are two kinds of unlikelihood training: token-level and sequence-level. They can be used simultaneously.
Token-level unlikelihood training penalizes assigning probability to tokens that already appeared in the preceding context.
Sequence-level unlikelihood training penalizes repeated n-grams in generated sequences.
Autoregressive decoders trained this way generate sentences that are less dull (fewer common tokens that don't contribute much) and less repetitive.
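
A rough sketch of the token-level loss for a single decoding step, assuming the model's next-token distribution is a dict of token to probability. The usual likelihood term is kept, and each previously seen token other than the target adds a penalty of -log(1 - p). The weight `alpha`, the toy numbers, and the function name are illustrative assumptions.

```python
import math


def token_unlikelihood_loss(probs, target, context, alpha=1.0):
    """Loss for one step: likelihood of the target plus penalties on already-seen tokens."""
    likelihood = -math.log(probs[target])
    # Negative candidates: tokens from the preceding context, excluding the true target.
    candidates = set(context) - {target}
    unlikelihood = -sum(math.log(1.0 - probs[c]) for c in candidates)
    return likelihood + alpha * unlikelihood


probs = {"the": 0.5, "cat": 0.25, "sat": 0.25}  # toy next-token distribution
plain = token_unlikelihood_loss(probs, "sat", context=[])
penalized = token_unlikelihood_loss(probs, "sat", context=["the", "cat", "the"])
print(plain, penalized)
```

The more probability the model puts on tokens it has already produced, the larger the penalty, which pushes the distribution away from repeats.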

Teacher forcing - Feeding the ground-truth previous tokens (the target sequence delayed by one step) as decoder inputs during autoregressive model training, instead of the model's own predictions. This allows parallelized training and faster convergence, and mitigates vanishing/exploding gradients.
Exposure bias - Teacher forcing is considered a problem in autoregressive models because model operation differs between training and test time: the model never conditions on its own predictions during training. This is said to lead to degeneration.
The paper says it's actually okay to use teacher forcing.
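
The training/test mismatch can be sketched as follows. During training the decoder inputs are the ground truth shifted by one position, while at test time each input is whatever the model produced before. The "<s>" start token, the function names, and the toy lookup-table "model" are assumptions for illustration.

```python
BOS = "<s>"  # assumed start-of-sequence token


def teacher_forcing_inputs(target):
    """Training: decoder inputs are the ground-truth tokens shifted right by one."""
    return [BOS] + target[:-1]


def free_running_decode(next_token, max_len):
    """Test time: the model conditions on its own previous outputs."""
    prefix = [BOS]
    for _ in range(max_len):
        prefix.append(next_token(prefix))
    return prefix[1:]


# Toy "model": predicts the next token from the last token only (hypothetical).
table = {"<s>": "a", "a": "b", "b": "a"}
print(teacher_forcing_inputs(["a", "b", "c"]))
print(free_running_decode(lambda prefix: table[prefix[-1]], 4))
```

Because all teacher-forced inputs are known in advance, every position can be computed in one pass, which is where the parallelism comes from.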

A language model was trained with syllable tokenization on Penn Treebank.
It outperforms character, morphological, and BPE tokenizations in character-level perplexity.
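
Character-level perplexity makes models with different tokenizations comparable: the total negative log-likelihood of the text is normalized by the number of characters rather than the number of tokens. A minimal sketch, assuming per-token negative log-likelihoods in nats; how characters are counted (spaces, end-of-sentence markers) is an assumption the paper may handle differently.

```python
import math


def char_level_perplexity(token_nlls, n_chars):
    """Perplexity per character: exp(total NLL / number of characters)."""
    return math.exp(sum(token_nlls) / n_chars)


# Toy example: 3 tokens, each predicted with probability 1/4, spanning 6 characters.
token_nlls = [math.log(4)] * 3
print(char_level_perplexity(token_nlls, n_chars=6))
```

In the toy example the token-level perplexity is 4, but spread over twice as many characters it becomes about 2 per character.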