Negative Sampling (Mikolov et al., 2013)

December 04, 2022

Goals

Improve the quality of skip-gram word vectors and speed up training, via subsampling of frequent words and negative sampling (a simplified alternative to hierarchical softmax / NCE), and extend the representations from words to phrases.

The original skip-gram model

c: the size of the training context (which can be a function of the center word w_t); p(w_{t+j} | w_t): defined by the softmax
v_w and v'_w: the "input" and "output" vector representations of w; W: the number of words in the vocabulary
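
For reference, the two formulas these symbols belong to, as given in the paper: the skip-gram training objective (the average log probability it maximizes) and the softmax that defines p(w_{t+j} | w_t).

```latex
% Skip-gram training objective over a corpus of T words
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)

% Softmax over the vocabulary of W words
p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)}
```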

Methods for improvement

i) Noise Contrastive Estimation (NCE): the idea that a good model should be able to distinguish data from noise by means of logistic regression.

ii) Negative Sampling (NEG): a simplified version of NCE. NCE requires both samples and the numerical probabilities of the noise distribution, while NEG requires samples only (its objective is given after this list).

iii) NCE approximately maximizes the log probability of the softmax, but the skip-gram model only cares about the quality of the vector representations, so NEG is free to simplify NCE as long as that quality is preserved.
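
The NEG objective, which replaces every log p(w_O | w_I) term in the skip-gram objective (from the paper). k noise words w_i are drawn from a noise distribution P_n(w); the paper found the unigram distribution raised to the 3/4 power to work best.

```latex
\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\!\left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]
```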

If the score v'_{w_O}^T v_{w_I} of a true (center, context) pair gets bigger, log σ(v'_{w_O}^T v_{w_I}) gets closer to 0 (its maximum); the terms for the sampled wrong pairs use the negated score (-v'_{w_i}^T v_{w_I}), so they behave in the opposite way and push the scores of wrong pairs down.
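
A minimal NumPy sketch of this objective for a single (center, context) pair; the function and variable names below are illustrative assumptions, not the paper's reference implementation, and the k negative vectors are assumed to be already sampled from P_n(w).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_loss(center_vec, context_vec, negative_vecs):
    """Negative-sampling loss for one (center, context) pair.

    center_vec:    input vector v_{w_I} of the center word, shape (d,)
    context_vec:   output vector v'_{w_O} of the observed context word, shape (d,)
    negative_vecs: output vectors v'_{w_i} of k sampled noise words, shape (k, d)
    """
    # True pair: drive sigma(v'_{w_O}^T v_{w_I}) toward 1, so its log toward 0
    pos_term = np.log(sigmoid(context_vec @ center_vec))
    # Noise pairs: drive sigma(-v'_{w_i}^T v_{w_I}) toward 1, i.e. wrong-pair scores down
    neg_term = np.sum(np.log(sigmoid(-negative_vecs @ center_vec)))
    # The model maximizes pos_term + neg_term; the loss is the negation
    return -(pos_term + neg_term)
```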

Subsampling of frequent words

P(w_i): the probability of discarding word w_i; f(w_i): the frequency of word w_i; t: a chosen frequency threshold (around 10^-5 in the paper)
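
The discard probability defined by these symbols (from the paper):

```latex
P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}
```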

The more frequent a word is, the more likely it is to be discarded => This counteracts the imbalance between rare and frequent words; it also increases the training speed and the accuracy of the vectors of rare words.
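
For example, with t = 10^-5, a word that accounts for 1% of all tokens is discarded with probability 1 - sqrt(10^-5 / 10^-2) ≈ 0.97, while a word whose frequency is at or below t is never discarded.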

Empirical results: word analogies

i) Task: predict the fourth word from the first three (e.g., Germany : Berlin :: France : ?), covering both syntactic and semantic analogies; questions are answered with simple vector arithmetic (see the sketch after this list)

ii) Data: an internal Google dataset with one billion words as training data; the vocabulary is 692K words after discarding all words that occur fewer than 5 times

iii) Results: (Accuracy) NEG > HS, NEG > NCE; (Speed) models with subsampling train faster than the others / The linear structure of the skip-gram vectors is what makes them suited to this analogical test; the paper also notes that highly non-linear models improve on the task as the training data grows, suggesting that they too prefer word representations with a linear structure
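
A minimal sketch of how such an analogy question can be scored with the learned vectors; the `vectors` dictionary and the cosine nearest-neighbor search are illustrative assumptions about the setup, not the paper's evaluation code.

```python
import numpy as np

def solve_analogy(a, b, c, vectors):
    """Answer 'a : b :: c : ?' (e.g. 'Germany : Berlin :: France : ?').

    vectors: dict mapping each vocabulary word to its embedding (1-D NumPy array).
    Returns the word whose vector is closest, by cosine similarity,
    to vec(b) - vec(a) + vec(c); the three input words are excluded.
    """
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)

    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        score = float(vec @ target) / np.linalg.norm(vec)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Usage with trained embeddings:
# solve_analogy("Germany", "Berlin", "France", vectors)  # expected: "Paris"
```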

Learning phrases

i) Task: a new analogical reasoning task for phrases (e.g., New York : New York Times :: Baltimore : Baltimore Sun)

ii) Results: NEG-15 (k=15) > NEG-5 (k=5); models with subsampling > models without it; HS-Huffman with subsampling > HS-Huffman without it

iii) With 33 billion training words, HS, dim = 1,000, and c = the entire sentence, accuracy reached 72%; the combination of HS and subsampling showed the best performance

Additive compositionality

Skip-gram vectors can be combined into meaningful representations by element-wise addition; the paper's examples include vec("Russia") + vec("river") being close to vec("Volga River") and vec("Germany") + vec("capital") being close to vec("Berlin").
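
A minimal sketch of this composition; `compose` and the cosine nearest-neighbor lookup are illustrative assumptions, not code from the paper.

```python
import numpy as np

def compose(word_a, word_b, vectors):
    """Element-wise sum of two word vectors, then return the closest
    vocabulary word by cosine similarity (vectors: word -> NumPy array)."""
    query = vectors[word_a] + vectors[word_b]
    query = query / np.linalg.norm(query)
    scores = {w: float(v @ query) / np.linalg.norm(v)
              for w, v in vectors.items() if w not in (word_a, word_b)}
    return max(scores, key=scores.get)

# Usage with trained embeddings:
# compose("Germany", "capital", vectors)  # expected: "Berlin"
```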

Categories: Paper Review, NLP

Original post: https://cheonkamjeong.blogspot.com/2022/12/paper-review-negative-sampling-mikolov.html