
Softmax is the output-layer function that activates the nodes in the last step of a neural network's computation. It is a very popular activation function, used with several neural net architectures along with tanh & sigmoid.
Here is a graph showing its dominance over other variants:

I’ll briefly describe a variant which runs extremely fast compared to the conventional softmax.
“O(n) reduced to O(log n)” kind of fast.
If you’re into algorithms, you’ll know how beneficial this is in terms of computational complexity.
CONVENTIONAL SOFTMAX

Each element of the activated output vector denotes the probability of the output word being the j-th word in the vocabulary, given our input word I. It goes without saying that the elements of the output vector add up to 1, & hence each element lies in [0, 1].
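For reference, here is the conventional softmax written out, using word2vec-style notation (v_{w_I} for the input word's vector & v'_{w_j} for the j-th word's output vector; this notation is assumed here):

P(w_j \mid w_I) = \frac{\exp\left({v'_{w_j}}^{\top} v_{w_I}\right)}{\sum_{k=1}^{V} \exp\left({v'_{w_k}}^{\top} v_{w_I}\right)}

The normalising sum in the denominator runs over the entire vocabulary, which is exactly where the cost comes from.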
Clearly, the computational complexity of this algorithm is O(V), where V is the size of our vocabulary. Vocabulary sizes can be very large, which in turn makes the softmax take a very long time to compute.
Turns out, we can significantly reduce this time complexity by using a conventional data structure — a Binary Tree.
HIERARCHICAL SOFTMAX
Our main objective here is to bring down the complexity:
O(V) to O(log V)
It allows much faster evaluation compared to the conventional softmax. The words sit at the leaves of a binary tree: specifically, the leaf at index i carries the probability of the i-th word, i.e. what would be position i in the output softmax vector. The probability of each word is the product of the probabilities of choosing the correct branch at every node on the path from the root to that word's leaf. Effectively, a word's probability is accumulated edge by edge along that path.
These branch probabilities are produced using the sigmoid function. The sigmoid's argument is the dot product of the input word's vector representation and the (output) vector attached to the internal node being traversed.
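Here is a minimal sketch of that path-product computation, assuming each word stores the internal nodes on its path and the left/right decision taken at each of them (all names below are illustrative, not from any particular library):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_softmax_prob(input_vec, path_nodes, path_directions, node_vecs):
    """Probability of one word under hierarchical softmax.

    input_vec       : vector representation of the input (context) word
    path_nodes      : indices of the internal nodes from the root to the word's leaf
    path_directions : +1 for a "left" branch at that node, -1 for "right"
    node_vecs       : matrix holding one trainable vector per internal node
    """
    prob = 1.0
    for node, direction in zip(path_nodes, path_directions):
        score = np.dot(node_vecs[node], input_vec)
        # sigmoid(x) is the probability of one branch and sigmoid(-x) of the
        # other, so the probabilities over all leaves still sum to 1
        prob *= sigmoid(direction * score)
    return prob
```

Since only the nodes on a single root-to-leaf path are touched, computing one word's probability costs O(log V) dot products instead of O(V).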
TRAINING MECHANISM
It is very similar to that of the Skip-Gram model: each training pair consists of a context word and a target word. The model is trained by maximising the log likelihood of the target word (or rather minimising the negative log likelihood, since we treat it as a loss function).
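Concretely, the negative log likelihood of the target word decomposes into a sum of per-node binary logistic losses along its path. A minimal sketch, reusing the sigmoid helper & path conventions from the snippet above (again, illustrative names):

```python
def skipgram_hs_loss(input_vec, path_nodes, path_directions, node_vecs):
    # Negative log likelihood of the target word given the context word:
    # -log of the path product, accumulated as a sum for numerical stability.
    loss = 0.0
    for node, direction in zip(path_nodes, path_directions):
        score = np.dot(node_vecs[node], input_vec)
        loss += -np.log(sigmoid(direction * score))
    return loss
```

During training, only the input word's vector & the handful of node vectors on the path receive gradient updates, which is what keeps the per-pair cost logarithmic in the vocabulary size.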
There are several considerations when constructing the binary tree:
- Every split, as in a decision tree, should occur at the cleanest separation.
- Words with similar context should be on the same side of a node.
- A balanced tree helps with faster training and testing, as all the words sit at the same depth (a balanced construction is sketched below).
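As one concrete (hypothetical) way to satisfy the balance requirement, here is a sketch that splits the vocabulary in half recursively & records each word's path in the conventions used above:

```python
def build_balanced_paths(vocab_size):
    """Build a balanced binary tree over words 0 .. vocab_size - 1 and return,
    for every word, the internal-node ids and the +1/-1 branch decisions on
    the path from the root to that word's leaf.
    A binary tree with V leaves has exactly V - 1 internal nodes."""
    paths = {w: ([], []) for w in range(vocab_size)}
    next_id = [0]  # running counter for internal-node ids

    def split(words):
        if len(words) == 1:  # a single word becomes a leaf; nothing to split
            return
        node = next_id[0]
        next_id[0] += 1
        mid = len(words) // 2
        left, right = words[:mid], words[mid:]
        for w in left:
            paths[w][0].append(node)
            paths[w][1].append(+1)
        for w in right:
            paths[w][0].append(node)
            paths[w][1].append(-1)
        split(left)
        split(right)

    split(list(range(vocab_size)))
    return paths
```

With this, node_vecs from the earlier sketches can simply be a (vocab_size - 1) x d matrix of trainable vectors indexed by the internal-node ids. Note that this toy builder only guarantees balance; grouping words with similar contexts on the same side needs extra information about the corpus.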
Note that hierarchical softmax is not an approximate optimisation algorithm. It accelerates the optimisation by modifying the model itself, but it could well end up quite biased because of those very modifications.
I’ll be adding in the implementation soon -> here.