Representing Words, Phrases & their Compositionality — Skip Gram Model

Representing words in a vector space helps NLP algorithms perform better, since similar words end up close together and models can exploit that structure. In this post I walk through a classical NLP technique that was, for a long time, an efficient way to learn high-quality distributed vector representations capturing a large number of syntactic and semantic word relationships: the Skip-Gram model.
SKIP GRAM MODEL
|| paper ||
Clearly, the vocabulary of any language in use is quite large, so we need techniques that can learn the context of a word on their own, without hand-labelled data. Skip-Gram is one such technique. (It is often described as self-supervised: we don't have labels associated with words, so this isn't supervised learning in the usual sense; the training pairs are generated from the raw text itself.)

The Skip-Gram model takes a target word and tries to predict the surrounding context words. Given a sequence of training words w1, w2, w3, ..., wT, the objective of the Skip-Gram model is to maximise the average log probability below, where c is the size of the training context. (In practice it is more convenient to put a minus sign in front and minimise the result as a loss function, instead of maximising it.)
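$$\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)$$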

The Skip-Gram model defines p(w_{t+j} | w_t) using the softmax function:
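$$p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)}$$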

Here v_w and v'_w are the input and output vector representations of w, w_I is the input (target) word, w_O is the output (context) word, and W is the number of words in the vocabulary.
Beyond the math, what we are doing is creating context-target word pairs to train our model (see the sketch below). Intuitively, words that keep appearing in similar pairs will end up with similar vectors, because the model has to make similar predictions for them. The softmax layer converts the output scores into a probability distribution over the vocabulary.
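As a rough illustration of the pair generation (assuming a whitespace-tokenised corpus and a window size of 2, both of which are my choices here, not fixed by the paper), it can look like this:

```python
from typing import List, Tuple

def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Generate (target, context) pairs from a list of tokens."""
    pairs = []
    for i, target in enumerate(tokens):
        # Look at up to `window` words on each side of the target word.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
for target, context in skipgram_pairs(sentence, window=2)[:6]:
    print(target, "->", context)
```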
THE INFAMOUS FAKE TASK
Once we have our context-target pairs, we train a shallow neural network with a single hidden (projection) layer to perform a prediction task that we will never actually use it for. We are instead interested in the weights of that hidden layer, which are learned during training. This is why it is called a 'Fake Task': we are not interested in the predictions of the model but in the by-product (the word vectors/embeddings).
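Here is a minimal sketch of that fake-task network, written in PyTorch as my own illustration (the original word2vec implementation was in C), using the plain full-softmax objective without the paper's negative-sampling or hierarchical-softmax speed-ups; the vocabulary size, embedding dimension, and word indices are made up:

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    """Minimal Skip-Gram network: an embedding lookup (the hidden layer whose
    weights we keep) followed by a linear layer scored over the vocabulary."""

    def __init__(self, vocab_size: int, embedding_dim: int = 100):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embedding_dim)            # input vectors v_w
        self.out_embed = nn.Linear(embedding_dim, vocab_size, bias=False)  # output vectors v'_w

    def forward(self, target_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.in_embed(target_ids)   # look up the target word's vector
        return self.out_embed(hidden)        # scores for every word in the vocabulary

# One training step on a tiny batch of (target, context) index pairs.
model = SkipGram(vocab_size=10_000, embedding_dim=100)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()              # cross-entropy = -log p(context | target)

target_ids = torch.tensor([4, 17, 42])       # hypothetical target word indices
context_ids = torch.tensor([17, 4, 43])      # their context word indices

logits = model(target_ids)
loss = loss_fn(logits, context_ids)
loss.backward()
optimiser.step()

# The by-product we actually care about: one row per word in the vocabulary.
word_vectors = model.in_embed.weight.detach()
```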
I'll be adding the implementation soon -> here.