Interesting Papers - April 2021
Will someone please talk some sense into my decoder
The Lottery Ticket Hypothesis for Pre-trained BERT Networks
NeurIPS 2020
- Lottery Ticket Hypothesis - Inside a huge neural network, there exists a sparse sub-network that is responsible for almost all of the full network's performance (accuracy). Extracting and training that sub-network can reduce model size with no sacrifice in performance (see the pruning sketch after these notes).
- This paper explores whether the Lottery Ticket Hypothesis applies to BERT. Turns out it does!
- Achieves 40% ~ 90% sparsity. Sub-network for masked language modeling task had sparsity of 70%.
- Sub-networks are mostly downstream-task-specific. A sub-network found for one task generally cannot be reused for another.
- In earlier research regarding the Lottery Ticket Hypothesis, good sub-networks had to be captured in early stages of pretraining. This paper explores sub-network capturing during fine-tuning.
- Thoughts: If tickets (sub-networks) are not transferable between tasks, the results are not very exciting from the point of view of building smaller, generalized language models.
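A minimal sketch of iterative magnitude pruning, the standard procedure for finding such tickets. The `train_fn` callable, pruning fraction, and number of rounds are placeholders; for the BERT setting the rewind target would be the pre-trained weights rather than a random initialization.

```python
import copy
import torch

def find_lottery_ticket(model, train_fn, prune_fraction=0.2, rounds=5):
    """Iterative magnitude pruning (IMP) sketch.

    model    -- torch.nn.Module whose weight matrices get pruned
    train_fn -- placeholder: trains `model` in place (e.g. fine-tunes on a task)
    Returns binary masks that define the sparse sub-network ("ticket").
    """
    rewind_state = copy.deepcopy(model.state_dict())        # weights to rewind to each round
    masks = {name: torch.ones_like(p, dtype=torch.bool)
             for name, p in model.named_parameters() if p.dim() > 1}

    for _ in range(rounds):
        # NOTE: a full implementation would also re-apply the masks after every
        # optimizer step inside train_fn so pruned weights stay at zero.
        train_fn(model)
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name not in masks:
                    continue
                alive = param[masks[name]].abs()             # surviving weights only
                threshold = alive.quantile(prune_fraction)
                masks[name] &= param.abs() > threshold       # drop the smallest-magnitude 20%
            for name, param in model.named_parameters():     # rewind and apply the mask
                param.copy_(rewind_state[name])
                if name in masks:
                    param.mul_(masks[name].to(param.dtype))
    return masks
```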
How to Train BERT with an Academic Budget
2021
- Presents methods to train BERT in 24 hours (on 8 GPUs with 12 GB of VRAM each), achieving results competitive with the reference implementation.
- Combines various economical training methods suggested so far:
  - Use DeepSpeed.
  - Replace the masked language model prediction head with sparse token prediction (see the sketch after this list).
  - Pre-mask 10 copies of the original corpus. Pre-load the corpus into RAM to reduce the I/O bottleneck.
  - Sequence length of 128. Tried batch sizes of 4,096, 8,192, and 16,384.
  - Significantly smaller learning-rate warmup proportions (0% ~ 6%) than the one used by the original BERT.
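A minimal sketch of the sparse-prediction idea as I understand it: run the expensive vocabulary projection only at the masked positions instead of at every position. The function name and toy sizes are mine, not the paper's code.

```python
import torch
import torch.nn as nn

def sparse_mlm_loss(hidden, labels, lm_head, ignore_index=-100):
    """Compute the masked-language-model loss only at masked positions.

    hidden  -- (batch, seq_len, d_model) encoder outputs
    labels  -- (batch, seq_len) target ids, with ignore_index at non-masked positions
    lm_head -- nn.Linear(d_model, vocab); applied only to the masked tokens, so the
               large vocabulary projection is skipped everywhere else
    """
    mask = labels != ignore_index                  # which positions were masked
    logits = lm_head(hidden[mask])                 # (n_masked, vocab)
    return nn.functional.cross_entropy(logits, labels[mask])

# toy usage with illustrative sizes
d_model, vocab = 16, 100
lm_head = nn.Linear(d_model, vocab)
hidden = torch.randn(2, 8, d_model)
labels = torch.full((2, 8), -100)                  # -100 = not masked
labels[0, 3], labels[1, 5] = 7, 42                 # two masked positions and their targets
loss = sparse_mlm_loss(hidden, labels, lm_head)
```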
Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing
NAACL 2019
- Posterior collapse is a major problem in VAEs. Training has to balance the prior (KL) loss on the encoder output against the reconstruction loss on the decoder output.
- Often the VAE learns to prioritize the reconstruction loss, and the decoder ends up ignoring the latent code from the encoder.
- Several studies try to mitigate this phenomenon. One famous example is the KL annealing suggested in Bowman et al., where the weight of the prior loss is slowly warmed up (annealed) during the initial training steps, so that the decoder first learns to produce sensible outputs before the prior is enforced.
- This paper proposes a new KL annealing scheme: repeat (cycle) the KL weight warm-up several times during training (sketched below).
- The paper's results show that cyclical annealing is almost always better than monotonic annealing.
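A small sketch of the linear cyclical schedule as I read it: split training into M cycles and, within each cycle, ramp the KL weight β linearly from 0 to 1 over the first fraction R of the steps, then hold it at 1. The defaults M=4 and R=0.5 below are just illustrative.

```python
def kl_weight(step, total_steps, n_cycles=4, ratio=0.5):
    """Cyclical KL annealing: within each cycle, beta ramps linearly from 0 to 1
    over the first `ratio` of the cycle, then stays at 1 for the remainder."""
    cycle_len = total_steps / n_cycles
    tau = (step % cycle_len) / cycle_len      # position inside the current cycle, in [0, 1)
    return min(1.0, tau / ratio)

# monotonic annealing is the special case n_cycles=1; the VAE loss at each step
# would be reconstruction_loss + kl_weight(step, total_steps) * kl_loss
betas = [kl_weight(t, total_steps=10_000) for t in range(10_000)]
```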
Lagging Inference Networks and Posterior Collapse in Variational Autoencoders
ICLR 2019
- To combat posterior collapse, train the VAE encoder (inference network) aggressively: during early training, give it many gradient updates, until its objective stops improving, before each decoder update, then switch back to standard joint training.
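A rough sketch of one outer step of that aggressive phase; `elbo_step` and the two optimizers are placeholders, and the fixed inner-step budget stands in for the paper's convergence test.

```python
def aggressive_phase_step(batch, elbo_step, encoder_opt, decoder_opt, inner_steps=30):
    """One outer step of the 'aggressive' phase: many encoder-only updates,
    then a single decoder update, so the approximate posterior keeps up with
    the decoder instead of lagging behind (and collapsing to the prior).

    elbo_step(batch)          -- placeholder returning the negative ELBO as a tensor
    encoder_opt / decoder_opt -- optimizers over encoder / decoder parameters only
    """
    for _ in range(inner_steps):        # paper: loop until the encoder objective converges
        loss = elbo_step(batch)
        encoder_opt.zero_grad()
        loss.backward()
        encoder_opt.step()
    loss = elbo_step(batch)             # single decoder update
    decoder_opt.zero_grad()
    loss.backward()
    decoder_opt.step()
```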
The Curious Case of Neural Text Degeneration
ICLR 2020
- Autoregressive decoders have a hard time generating human-like text. The model must balance picking high-probability tokens against matching the variance of human word choice.
- This study suggests nucleus sampling (a.k.a. top-p sampling).
- In top-k sampling, the next token is sampled from the pool of the k most likely tokens.
- In top-p (nucleus) sampling, the next token is sampled from the smallest pool of most-likely tokens whose cumulative probability exceeds p (see the sketch below).
- Results in the paper show that top-p sampling best matches human text in terms of variance and accuracy.
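A small numpy sketch of the nucleus sampler (my own minimal version, not the paper's code); `p=0.9` and the toy distribution are illustrative.

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    """Top-p (nucleus) sampling: sample from the smallest set of most-likely
    tokens whose cumulative probability exceeds p, after renormalizing."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                             # token ids, most likely first
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1    # size of the nucleus
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()       # renormalize inside the nucleus
    return rng.choice(nucleus, p=nucleus_probs)

# toy next-token distribution over a 5-token vocabulary
probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
next_token = nucleus_sample(probs, p=0.9)    # samples only from the top tokens covering 90%
```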
Neural Text Generation with Unlikelihood Training
ICLR 2020
- This paper tackles neural text degeneration in the training phase (as opposed to decoding phase).
- The paper proposes unlikelihood training, which penalizes the decoder for assigning probability to "negative candidate" tokens, such as tokens it has already produced.
- There are two kinds of unlikelihood training: token-level and sequence-level. They can be used simultaneously.
- Token-level unlikelihood training penalizes probability mass placed on tokens that already appeared earlier in the context (sketched below).
- Sequence-level unlikelihood training penalizes repeated n-grams in generated continuations.
- Autoregressive decoders trained this way generate sentences that are less dull (fewer overly frequent tokens that contribute little) and less repetitive.
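A minimal sketch of the token-level term for a single sequence; the weighting coefficient and toy tensors are illustrative.

```python
import torch
import torch.nn.functional as F

def token_unlikelihood_loss(logits, targets):
    """Token-level unlikelihood term (sketch): at each step t, penalize probability
    mass placed on tokens that already appeared earlier in the sequence.

    logits  -- (seq_len, vocab) decoder logits for one sequence
    targets -- (seq_len,) ground-truth token ids
    """
    probs = logits.softmax(dim=-1)
    seq_len = targets.size(0)
    loss = logits.new_zeros(())
    for t in range(1, seq_len):
        prev_tokens = targets[:t].unique()
        # the true next token is not penalized, even if it appeared before
        negatives = prev_tokens[prev_tokens != targets[t]]
        if negatives.numel() == 0:
            continue
        loss = loss - torch.log(1.0 - probs[t, negatives] + 1e-8).sum()
    return loss / seq_len

# toy usage: combine with the usual MLE loss, weighted by an (illustrative) alpha
logits = torch.randn(6, 10)                       # 6 steps, vocabulary of 10
targets = torch.tensor([1, 4, 1, 1, 7, 4])
total = F.cross_entropy(logits, targets) + 0.5 * token_unlikelihood_loss(logits, targets)
```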
Quantifying Exposure Bias for Open-ended Language Generation
2021
- Teacher forcing - Feeding the ground-truth tokens (shifted one step back) to the decoder as inputs during autoregressive training, instead of the model's own predictions. This lets all positions be trained in parallel, speeds up convergence, and mitigates vanishing/exploding gradients (see the sketch below).
- Exposure bias - Teacher forcing is a problem for autoregressive models because their operation differs between training and test time: at test time the model conditions on its own (possibly wrong) predictions, which can lead to degeneration.
- The paper's takeaway: it is actually okay to keep using teacher forcing; the measured impact of exposure bias is smaller than commonly assumed.
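A small sketch of one teacher-forced training step; the toy model, vocabulary size, and special-token ids are mine.

```python
import torch
import torch.nn as nn

def teacher_forced_step(model, token_ids, bos_id, pad_id):
    """One teacher-forced step: the decoder input at step t is the *ground-truth*
    token t-1, never the model's own prediction, so every position can be
    predicted in parallel during training.

    model     -- placeholder: maps (batch, seq_len) ids to (batch, seq_len, vocab) logits
    token_ids -- (batch, seq_len) ground-truth sequences
    """
    bos = torch.full_like(token_ids[:, :1], bos_id)
    decoder_input = torch.cat([bos, token_ids[:, :-1]], dim=1)   # shift right
    logits = model(decoder_input)
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        token_ids.reshape(-1),
        ignore_index=pad_id,
    )

# toy "model": an embedding followed by a projection back to the vocabulary
vocab = 50
toy_model = nn.Sequential(nn.Embedding(vocab, 32), nn.Linear(32, vocab))
loss = teacher_forced_step(toy_model, torch.randint(2, vocab, (4, 12)), bos_id=0, pad_id=1)
```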
Revisiting Neural Language Modelling with Syllables
ICLR 2020
- A language model was trained with syllable tokenization on the Penn Treebank.
- Outperforms character, morphological, and BPE tokenizations in character-level perplexity.
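Since the tokenizations produce different numbers of tokens, the comparison only works with perplexity normalized per character; a tiny sketch of the standard conversion (my addition, not necessarily the paper's exact formula).

```python
import math

def char_level_perplexity(total_nll_nats, num_chars):
    """Perplexity normalized per character: exp(total negative log-likelihood / #chars).
    Equivalently, ppl_char = ppl_token ** (num_tokens / num_chars)."""
    return math.exp(total_nll_nats / num_chars)

# toy numbers: token-level perplexity 40 over 100 tokens spanning 500 characters
num_tokens, num_chars, token_ppl = 100, 500, 40.0
total_nll = num_tokens * math.log(token_ppl)
print(char_level_perplexity(total_nll, num_chars))   # 40 ** (100 / 500) ≈ 2.09
```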