Negative Sampling (Mikolov et al., 2013)

December 04, 2022

Goals

Improve the quality of skip-gram word vectors and speed up training, via subsampling of frequent words and negative sampling (a simplified alternative to hierarchical softmax / NCE), and extend the representations from words to phrases.

The original skip-gram model

c: the size of the training context (which can be a function of the center word w_t); p(w_{t+j} | w_t): defined by the softmax
v_w and v'_w: the "input" and "output" vector representations of w; W: the number of words in the vocabulary
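
For reference, the two formulas these symbols belong to, as given in the paper: the skip-gram training objective (the average log probability it maximizes) and the softmax that defines p(w_{t+j} | w_t).

```latex
% Skip-gram training objective over a corpus of T words
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)

% Softmax over the vocabulary of W words
p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)}
```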

Methods for improvement

i) Noise Contrastive Estimation (NCE): the idea that a good model should be able to distinguish data from noise by means of logistic regression.

ii) Negative Sampling (NEG): a simplified version of NCE. NCE requires both samples and the numerical probabilities of the noise distribution, while NEG requires samples only (its objective is given after this list).

iii) NCE approximately maximizes the log probability of the softmax, but the skip-gram model only cares about the quality of the vector representations, so NEG is free to simplify NCE as long as that quality is preserved.
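
The NEG objective, which replaces every log p(w_O | w_I) term in the skip-gram objective (from the paper). k noise words w_i are drawn from a noise distribution P_n(w); the paper found the unigram distribution raised to the 3/4 power to work best.

```latex
\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\!\left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]
```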

If the score v'_{w_O}^T v_{w_I} of a true (center, context) pair gets bigger, log σ(v'_{w_O}^T v_{w_I}) gets closer to 0 (its maximum); the terms for the sampled wrong pairs use the negated score (-v'_{w_i}^T v_{w_I}), so they behave in the opposite way and push the scores of wrong pairs down.
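
A minimal NumPy sketch of this objective for a single (center, context) pair; the function and variable names below are illustrative assumptions, not the paper's reference implementation, and the k negative vectors are assumed to be already sampled from P_n(w).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_loss(center_vec, context_vec, negative_vecs):
    """Negative-sampling loss for one (center, context) pair.

    center_vec:    input vector v_{w_I} of the center word, shape (d,)
    context_vec:   output vector v'_{w_O} of the observed context word, shape (d,)
    negative_vecs: output vectors v'_{w_i} of k sampled noise words, shape (k, d)
    """
    # True pair: drive sigma(v'_{w_O}^T v_{w_I}) toward 1, so its log toward 0
    pos_term = np.log(sigmoid(context_vec @ center_vec))
    # Noise pairs: drive sigma(-v'_{w_i}^T v_{w_I}) toward 1, i.e. wrong-pair scores down
    neg_term = np.sum(np.log(sigmoid(-negative_vecs @ center_vec)))
    # The model maximizes pos_term + neg_term; the loss is the negation
    return -(pos_term + neg_term)
```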

Subsampling of frequent words

P(w_i): the probability of discarding word w_i; f(w_i): the frequency of word w_i; t: a chosen frequency threshold (around 10^-5 in the paper)
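
The discard probability defined by these symbols (from the paper):

```latex
P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}
```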

The more frequent a word is, the more likely it is to be discarded => This counteracts the imbalance between rare and frequent words; it also increases the training speed and the accuracy of the vectors of rare words.
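
For example, with t = 10^-5, a word that accounts for 1% of all tokens is discarded with probability 1 - sqrt(10^-5 / 10^-2) ≈ 0.97, while a word whose frequency is at or below t is never discarded.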

Empirical results: word analogies

i) Task: predict the fourth word from the first three (e.g., Germany : Berlin :: France : ?), covering both syntactic and semantic analogies; questions are answered with simple vector arithmetic (see the sketch after this list)

ii) Data: an internal Google dataset with one billion words as training data; the vocabulary is 692K words after discarding all words that occur fewer than 5 times

iii) Results: (Accuracy) NEG > HS, NEG > NCE; (Speed) models with subsampling train faster than the others / The linear structure of the skip-gram vectors is what makes them suited to this analogical test; the paper also notes that highly non-linear models improve on the task as the training data grows, suggesting that they too prefer word representations with a linear structure
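
A minimal sketch of how such an analogy question can be scored with the learned vectors; the `vectors` dictionary and the cosine nearest-neighbor search are illustrative assumptions about the setup, not the paper's evaluation code.

```python
import numpy as np

def solve_analogy(a, b, c, vectors):
    """Answer 'a : b :: c : ?' (e.g. 'Germany : Berlin :: France : ?').

    vectors: dict mapping each vocabulary word to its embedding (1-D NumPy array).
    Returns the word whose vector is closest, by cosine similarity,
    to vec(b) - vec(a) + vec(c); the three input words are excluded.
    """
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)

    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        score = float(vec @ target) / np.linalg.norm(vec)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Usage with trained embeddings:
# solve_analogy("Germany", "Berlin", "France", vectors)  # expected: "Paris"
```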

Learning phrases

i) Task: a new analogical reasoning task for phrases (e.g., New York : New York Times :: Baltimore : Baltimore Sun)

ii) Results: NEG-15 (k=15) > NEG-5 (k=5); models with subsampling > models without it; HS-Huffman with subsampling > HS-Huffman without it

iii) With 33 billion training words, HS, dim = 1,000, and c = the entire sentence, accuracy reached 72%; the combination of HS and subsampling showed the best performance

Additive compositionality

Skip-gram vectors can be combined into meaningful representations by element-wise addition; the paper's examples include vec("Russia") + vec("river") being close to vec("Volga River") and vec("Germany") + vec("capital") being close to vec("Berlin").
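
A minimal sketch of this composition; `compose` and the cosine nearest-neighbor lookup are illustrative assumptions, not code from the paper.

```python
import numpy as np

def compose(word_a, word_b, vectors):
    """Element-wise sum of two word vectors, then return the closest
    vocabulary word by cosine similarity (vectors: word -> NumPy array)."""
    query = vectors[word_a] + vectors[word_b]
    query = query / np.linalg.norm(query)
    scores = {w: float(v @ query) / np.linalg.norm(v)
              for w, v in vectors.items() if w not in (word_a, word_b)}
    return max(scores, key=scores.get)

# Usage with trained embeddings:
# compose("Germany", "capital", vectors)  # expected: "Berlin"
```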

Categories: Paper Review, NLP

Original post: https://cheonkamjeong.blogspot.com/2022/12/paper-review-negative-sampling-mikolov.html