
Probabilistic Formulation of Unsupervised Style Transfer

ICLR 2021

  • Proposes a generative method for unsupervised text style transfer and machine translation (SOTA in both).
  • Uses a standard generative text model with an encoder-decoder architecture; KL annealing is applied during training.
  • Given an observed sentence in a domain, assumes a latent variable exists for the sentence in another domain.
  • Each domain has a pretrained language model that acts as a Bayesian prior over the latent variable. This prior is what makes unsupervised training possible: without it, the encoder's outputs would be indistinguishable between sentences from different domains.
  • Uses variational inference to jointly train transduction in both directions, maximizing a lower bound on the marginal log likelihood p(x) (= p(sentence)); see the sketch after this list.
  • Uses a SINGLE encoder-decoder (VAE) for both directions of transduction. This is not explicitly mentioned in the paper but can be deduced from the "parameter sharing" section, and is confirmed during the review process as well as in the official code. The transduction direction is specified by supplying a conditioning variable to every token in the decoder.
  • Method is elegant and flexible, and can be generalized to most seq2seq tasks.
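
For reference, the bound being maximized has, roughly and in my own notation, the standard ELBO form, with x the observed sentence in one domain, y the latent sentence in the other domain, and p_LM that domain's pretrained language-model prior:

```latex
\log p(x) \;\geq\; \mathbb{E}_{q(y \mid x)}\big[\log p(x \mid y)\big] \;-\; \mathrm{KL}\big(q(y \mid x)\,\|\,p_{\mathrm{LM}}(y)\big)
```

Because y is a sequence of discrete tokens, sampling from q(y | x) needs a relaxation such as Gumbel-softmax, which is the point of contrast with the continuous-latent VAE in the CoNLL 2016 paper below.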

Hash Embeddings for Efficient Word Representations

NIPS 2017

  • Modern language models employ token embedding layers in their structure to convert discrete token sequences to continuous word embeddings. This design limits the size of a language model's vocabulary, because the number of embedding parameters & resulting computation explodes with vocabulary size.
  • The authors suggest a hashing mechanism to keep the embedding parameter size in check. The technique can also be extended to do away with pre-defined model vocabulary dictionaries altogether.
  • The whole point of this is to create a layer in our model that is better (compared to traditional token embeddings) at flexibly converting discrete tokens to continuous embeddings.
  • There are two versions of this methodology:
  1. When the whole vocabulary of our model is defined.
  2. When we want the model to accept unseen words at inference time.
  • For each token, learn a unique combination of vectors from a shared, fixed-size pool instead of a separate embedding per token.
  • If the pool size << vocab size, we achieve large parameter savings with no performance loss (see the sketch after this list).
  • Optionally do away with vocabulary dictionary altogether.
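
A minimal PyTorch sketch of the idea as I understand it (module and parameter names are mine, not the paper's; the paper also covers a variant where the importance weights are indexed by a second hash rather than the token id):

```python
import torch
import torch.nn as nn

class HashEmbedding(nn.Module):
    """Sketch of a hash embedding layer: each token id is hashed by k functions
    into a shared pool of component vectors, and a small table of per-token
    importance weights mixes the k components into the final embedding.
    Parameters: pool_size * dim + weight_rows * k, instead of vocab_size * dim."""

    def __init__(self, pool_size=10_000, num_hashes=2, dim=64, weight_rows=100_000, seed=0):
        super().__init__()
        self.pool_size = pool_size
        self.weight_rows = weight_rows
        self.components = nn.Embedding(pool_size, dim)            # shared component vectors
        self.importance = nn.Embedding(weight_rows, num_hashes)   # per-token mixing weights
        # Fixed random odd multipliers stand in for k independent hash functions.
        g = torch.Generator().manual_seed(seed)
        self.register_buffer("mult", torch.randint(1, 2**31 - 1, (num_hashes,), generator=g) * 2 + 1)

    def forward(self, token_ids):                                  # (batch, seq)
        buckets = (token_ids.unsqueeze(-1) * self.mult) % self.pool_size    # (batch, seq, k)
        comps = self.components(buckets)                           # (batch, seq, k, dim)
        w = self.importance(token_ids % self.weight_rows).unsqueeze(-1)     # (batch, seq, k, 1)
        return (w * comps).sum(dim=-2)                             # (batch, seq, dim)

# Example: embed a batch of token ids without a vocab_size x dim table.
emb = HashEmbedding()
print(emb(torch.randint(0, 1_000_000, (2, 5))).shape)              # torch.Size([2, 5, 64])
```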

CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

2021

  • Does away with pre-defined tokenizers. Tokenization becomes converting text to a sequence of character level Unicode codepoints.
  • Makes the model flexible, especially in multilingual and OOV situations.
  • The whole point of subword encodings is to create the smallest chunks of text that carry meaning. CANINE achieves this by applying strided convolution (downsampling) to the character sequence inputs.
  • A reverse process (upsampling) is applied to the downsampled hidden states to obtain outputs that match the model's output dimensions: the downsampled representations are concatenated with the character-level input representations at each character position to produce character-level outputs (see the sketch after this list).
  • Can be used as a drop-in replacement for existing word embedding layers.
  • Uses hash embeddings to save parameters (28% fewer than mBERT).
  • Thoughts: meaningful and reliable tokenization is the hardest part in dealing with text in the wild. I hope we see more research into enabling character level text inputs. Or ditching tokenizers altogether.
  • Low hanging fruit for future research: Experiment with strides and advanced methods of convolution.
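
A toy sketch of the downsample/upsample idea (this is not the exact CANINE architecture; the layer sizes, the single transformer layer standing in for the deep stack, and the assumption that sequence length is a multiple of the downsampling rate are my simplifications):

```python
import torch
import torch.nn as nn

class DownUpSketch(nn.Module):
    """Downsample character states with a strided convolution, run a deep stack
    on the shorter sequence, then upsample by concatenating each character's
    input state with its corresponding downsampled state."""

    def __init__(self, dim=128, rate=4):
        super().__init__()
        self.rate = rate
        self.down = nn.Conv1d(dim, dim, kernel_size=rate, stride=rate)     # downsampling
        self.deep = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)                                # back to model dim

    def forward(self, char_states):                    # (batch, n_chars, dim), n_chars % rate == 0
        h = self.down(char_states.transpose(1, 2)).transpose(1, 2)         # (batch, n_chars/rate, dim)
        h = self.deep(h)                                                   # stand-in for the deep encoder
        h_up = h.repeat_interleave(self.rate, dim=1)                       # (batch, n_chars, dim)
        return self.proj(torch.cat([char_states, h_up], dim=-1))           # char-level outputs

x = torch.randn(2, 32, 128)          # 32 character positions
print(DownUpSketch()(x).shape)       # torch.Size([2, 32, 128])
```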

GECToR – Grammatical Error Correction: Tag, Not Rewrite

ACL 2020

  • SOTA in grammatical error correction (GEC).
  • Phrases GEC as a sequence tagging problem. Not the first to do this, though. LaserTagger and PIE were precursors.
  • 5000 possible token transformation tags.
  • Two types of token tags: Basic transformations (KEEP, DELETE, APPEND, REPLACE) and g-transformations (CASE, MERGE, SPLIT, NOUN_FORM, VERB_FORM)
  • APPEND and REPLACE are token-dependent: APPEND_s is a different tag from APPEND_ed. The tag vocabulary is pre-determined from the data.
  • Iterative sequence tagging approach: the tagger is applied repeatedly, and most corrections are in place after the second pass (see the toy example after this list).
  • Low hanging fruit for future research: deal with OOV in token-dependent tags (APPEND, REPLACE).
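
A toy illustration of what applying one pass of the basic tags could look like (this is my own simplified code, not GECToR's; g-transformations and the merge/split logic are omitted):

```python
def apply_tags(tokens, tags):
    """Apply per-token basic transformations: KEEP, DELETE, APPEND_<tok>, REPLACE_<tok>."""
    out = []
    for tok, tag in zip(tokens, tags):
        if tag == "KEEP":
            out.append(tok)
        elif tag == "DELETE":
            continue
        elif tag.startswith("APPEND_"):                 # token-dependent: APPEND_s vs APPEND_ed
            out.extend([tok, tag[len("APPEND_"):]])
        elif tag.startswith("REPLACE_"):
            out.append(tag[len("REPLACE_"):])
    return out

# One correction pass; in the real system the tagger is re-run on its own output
# until it predicts only KEEP (or an iteration limit is hit).
tokens = ["she", "go", "to", "school", "yesterday"]
tags   = ["KEEP", "REPLACE_went", "KEEP", "KEEP", "KEEP"]
print(apply_tags(tokens, tags))      # ['she', 'went', 'to', 'school', 'yesterday']
```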

Generating Sentences from a Continuous Space

CoNLL 2016

  • Demonstrates a text VAE with an encoder-decoder architecture.
  • Sentences can be sampled and interpolated from the latent space.
  • Uses KL annealing, where the regularization (KL) term in the VAE's ELBO objective is gradually weighted more as training progresses. This ensures the decoder is somewhat functional before the regularization loss kicks in (minimal sketch after this list).
  • No need for Gumbel softmax as in Probabilistic Formulation of Unsupervised Style Transfer, because the latent distribution is fitted to a Gaussian.
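
A minimal sketch of the annealed objective, assuming a diagonal-Gaussian posterior and a simple linear warm-up schedule (the exact shape of the paper's schedule is a detail I'm glossing over):

```python
import torch
import torch.nn.functional as F

def kl_weight(step, warmup_steps=10_000):
    """KL annealing: the weight on the KL term ramps from 0 to 1 over warmup_steps."""
    return min(1.0, step / warmup_steps)

def vae_loss(recon_logits, targets, mu, logvar, step):
    # Reconstruction term: token-level cross-entropy from the decoder.
    recon = F.cross_entropy(recon_logits.reshape(-1, recon_logits.size(-1)), targets.reshape(-1))
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal-Gaussian posterior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
    return recon + kl_weight(step) * kl
```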

Efficient String Matching: An Aid to Bibliographic Search

Communications of the ACM 1975

  • Introduces the famous Aho-Corasick algorithm. Multiple patterns (=keywords) can be searched within text in a single pass.
  • Builds a keyword trie (prefix tree) over the patterns. The vanilla implementation is not particularly memory efficient.
  • A failure link exists for every node in the trie. If the next text character is not a valid edge from the current node, jump to the node representing the longest proper suffix of the current node's string that still exists in the trie, and retry the character from there.
  • Each node concatenates all output patterns of its failure node to its own set of output patterns; otherwise patterns that are suffixes of other matches (e.g., "he" inside "she") would be missed.
  • The paper omits an important detail in the pseudocode for building failure links: the while loop that walks up the failure links needs a stopping condition for when the current node is the root (see the sketch below).
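
A readability-oriented Python sketch of the construction and search, including the root stopping condition mentioned above (the node layout and names are mine):

```python
from collections import deque

def build_automaton(patterns):
    """Build the keyword trie, failure links, and output sets."""
    root = {"next": {}, "fail": None, "out": []}
    for pat in patterns:                                   # 1. keyword trie
        node = root
        for ch in pat:
            node = node["next"].setdefault(ch, {"next": {}, "fail": None, "out": []})
        node["out"].append(pat)

    queue = deque()
    for child in root["next"].values():                    # depth-1 nodes fail to root
        child["fail"] = root
        queue.append(child)

    while queue:                                           # 2. BFS to set failure links
        node = queue.popleft()
        for ch, child in node["next"].items():
            fail = node["fail"]
            # Walk failure links until an edge for ch exists, stopping at the root
            # (the condition the paper's pseudocode leaves implicit).
            while fail is not root and ch not in fail["next"]:
                fail = fail["fail"]
            child["fail"] = fail["next"][ch] if ch in fail["next"] else root
            child["out"] += child["fail"]["out"]           # inherit outputs (e.g., "he" in "she")
            queue.append(child)
    return root

def search(root, text):
    """Single pass over text; reports (end_index, pattern) pairs."""
    node, hits = root, []
    for i, ch in enumerate(text):
        while node is not root and ch not in node["next"]:
            node = node["fail"]
        node = node["next"].get(ch, root)
        hits.extend((i, pat) for pat in node["out"])
    return hits

print(search(build_automaton(["he", "she", "his", "hers"]), "ushers"))
# [(3, 'she'), (3, 'he'), (5, 'hers')]
```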

Fuzzified Aho-Corasick Automata

  • Creates a nondeterministic finite automaton version of the Aho-Corasick algorithm.
  • Needs transition matrices between characters in advance, so not much use for me, eh.

Tagged in:

nlp, papers

Last Update: February 24, 2024