These are some of the questions that bugged me the most during my exploration of Word2Vec. They mostly dwell on specific implementation decisions in the original paper.

How are Word2Vec embeddings trained?

Word2Vec is an unsupervised algorithm. Training pairs are generated by sliding a "window" across the text dataset.
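A minimal sketch of the sliding window, assuming the text is already split into tokens. The function name `sliding_windows` and the `window` parameter are made up for this illustration and do not come from any particular implementation.

```python
from typing import Iterator, List, Tuple


def sliding_windows(tokens: List[str], window: int = 2) -> Iterator[Tuple[str, List[str]]]:
    """Yield (center word, surrounding words) for every position in the text."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after
        yield center, left + right


tokens = "the quick brown fox jumps".split()
for center, neighbours in sliding_windows(tokens, window=1):
    print(center, neighbours)
```

No labels are needed anywhere: the raw text alone supplies every (word, neighbours) example, which is why the training counts as unsupervised.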

How do Continuous Bag-of-Words (CBOW) and Skip-gram differ in their training?

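CBOW trains on {input, input, ... , output} examples: the surrounding words all go in together, and the model predicts the word in the middle. Skip-gram trains on {input, output} pairs: the word in the middle goes in, and the model predicts one surrounding word at a time.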
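To make the difference concrete, here is a small sketch that builds both kinds of training examples from the same tokens. The helper names (`context_of`, `cbow_examples`, `skipgram_examples`) are hypothetical and chosen only for this example.

```python
from typing import List, Tuple


def context_of(tokens: List[str], i: int, window: int) -> List[str]:
    """Words within `window` positions of index i, excluding the word at i."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """One example per position: all surrounding words -> the middle word."""
    return [(context_of(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """One example per pair: the middle word -> a single surrounding word."""
    return [
        (tokens[i], neighbour)
        for i in range(len(tokens))
        for neighbour in context_of(tokens, i, window)
    ]


tokens = ["the", "cat", "sat"]
print(cbow_examples(tokens, window=1))
print(skipgram_examples(tokens, window=1))
```

For the same text, skip-gram produces more (and smaller) examples than CBOW, since every surrounding word becomes its own pair.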

What should be the dimension (hyperparameter) of the resulting word embedding?

The common advice is to start with 300 and tune to your needs.

Before beginning the training process, many implementations preprocess the training data to create a dictionary (and often an index-to-word and a word-to-index map) of the whole vocabulary. Wouldn't this be impractical for huge datasets?

The dictionary (which indexes every distinct word in the dataset) scales with the size of the vocabulary, not with the total size of the dataset, so it usually stays manageable even when the corpus is huge.
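A rough sketch of that preprocessing pass, assuming the tokens can be streamed one at a time. The function `build_vocab` and its `min_count` cutoff are illustrative names, not taken from a specific library.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def build_vocab(token_stream: Iterable[str], min_count: int = 1) -> Tuple[Dict[str, int], List[str]]:
    """Build word-to-index and index-to-word maps in a single pass over the tokens."""
    counts = Counter(token_stream)  # memory grows with the number of distinct words only
    # Most frequent words get the smallest indices; rare words can be dropped.
    index_to_word = [word for word, count in counts.most_common() if count >= min_count]
    word_to_index = {word: i for i, word in enumerate(index_to_word)}
    return word_to_index, index_to_word


corpus = ("the cat sat on the mat " * 1000).split()
word_to_index, index_to_word = build_vocab(corpus)
print(len(corpus), "tokens,", len(index_to_word), "distinct words")
```

The toy corpus is repeated a thousand times, yet the two maps hold the same handful of entries they would hold for a single copy.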

Many discussions mention training two vectors (context and embedding). What is that all about?

During gradient descent, only the weight updates back-propagated to the input words are applied to the final word embeddings. The updates for the context words go into a separate embedding matrix, which is discarded after training.


The two matrices are kept separate to differentiate words that appear multiple times in a sentence. A word used as context is unlikely to convey the same meaning as the input word. If a single matrix represented both context and input words, repeated words in a sentence would lose the nuance that their position in the sentence entails.
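The two-matrix setup can be sketched in a few lines of plain Python. This sketch assumes skip-gram with negative sampling as the loss, which is one common choice and not something the answers above specify. The names `w_in`, `w_out` and `sgns_step`, the learning rate and the toy sizes are all made up for illustration.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sgns_step(w_in: Matrix, w_out: Matrix, center: int, context: int,
              negatives: List[int], lr: float = 0.05) -> None:
    """One SGD step on a single (center, context) pair plus a few negative words."""
    v = w_in[center]                        # row of the input (embedding) matrix
    grad_v = [0.0] * len(v)
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = w_out[word]                     # row of the separate context matrix
        score = sigmoid(sum(a * b for a, b in zip(v, u)))
        g = lr * (label - score)
        for k in range(len(v)):
            grad_v[k] += g * u[k]           # read u[k] before it is changed
            u[k] += g * v[k]                # update the context-side vector
    for k in range(len(v)):
        v[k] += grad_v[k]                   # update the embedding-side vector


random.seed(0)
vocab_size, dim = 5, 4
w_in = [[(random.random() - 0.5) / dim for _ in range(dim)] for _ in range(vocab_size)]
w_out = [[0.0] * dim for _ in range(vocab_size)]

for _ in range(200):
    sgns_step(w_in, w_out, center=0, context=1, negatives=[2, 3])

embeddings = w_in  # kept after training; w_out is the matrix that gets discarded
```

Each word index has one row in `w_in` and another in `w_out`, so a word seen as input and the same word seen as context are updated independently.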

Last Update: February 24, 2024