Word Embeddings
NLP algorithms make use of machine learning and neural networks to train models that can accomplish tasks. But machine learning and neural networks are fundamentally mathematical, and they require numerical values as input. This presents a conundrum, because Natural Language Processing and Computational Linguistics live in the realm of text analysis, as opposed to numbers. In short, machine learning and neural networks require numbers as input, but NLP can only offer text. So how can text be fed into such models?
This is where word embeddings come into play. The basic idea is that each word in a text is represented by a vector. Consider a vocabulary of unique words built from a sample sentence. Each word in the sample sentence corresponds to an index in that vocabulary. If a word in the sample sentence corresponds to the i-th index of the vocabulary, then its representative vector has a 1 at the i-th index and a 0 everywhere else. This very basic word embedding is known as a one-hot vector, which is used throughout machine learning to attribute values to categories and non-numeric data.
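As a minimal sketch, assuming a short example sentence and plain NumPy (none of the names here are prescribed above; they are purely illustrative), one-hot vectors can be built like this:

```python
import numpy as np

# Illustrative sample sentence (an assumption for this sketch)
sentence = "the cat sat on the mat".split()

# Build a vocabulary of unique words, each mapped to an index
vocab = sorted(set(sentence))
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a vector with a 1 at the word's index and 0 everywhere else."""
    vector = np.zeros(len(vocab))
    vector[word_to_index[word]] = 1.0
    return vector

print(vocab)           # ['cat', 'mat', 'on', 'sat', 'the']
print(one_hot("cat"))  # [1. 0. 0. 0. 0.]
print(one_hot("sat"))  # [0. 0. 0. 1. 0.]
```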
One-hot encoding is one way to have a numeric representation of a word. However, there are problems with this method, particularly when it comes to NLP. A representation like the one above has as many dimensions as the vocabulary has words, which complicates computation. Furthermore, there is no meaning associated with each numeric value. ‘Glasses’ have one numeric representation and ‘aviators’ have another, but with one-hot vectors, glasses end up exactly as close to aviators as they are to water bottles.
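A quick check makes the problem visible: any two distinct one-hot vectors are orthogonal, so every pair of words is equally (dis)similar. The three-word vocabulary here is, again, just an assumption for illustration.

```python
import numpy as np

# Hypothetical three-word vocabulary, purely for illustration
vocab = ["glasses", "aviators", "water_bottle"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Any two *different* one-hot vectors have similarity 0
print(cosine(one_hot["glasses"], one_hot["aviators"]))      # 0.0
print(cosine(one_hot["glasses"], one_hot["water_bottle"]))  # 0.0
```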
Term Frequency — Inverse Document Frequency
TF-IDF is a word embedding designed to measure the relevance of a word in a text. So, it can be useful in determining how important a word is to a document. As the name implies, TF-IDF requires collecting data about the frequency of words. One might think that the most relevant words in a text are those with the highest frequency, but generally that is not true. Instead, words with higher frequencies tend to be stopwords (“the”, “a”, etc.), or are merely irrelevant and inconsequential. TF-IDF is interesting in that it takes into account both word frequency and overall rarity / scarcity in order to calculate the relevance of a word to a document. Specifically, the calculation for TF-IDF is:
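A common way to write this, consistent with the description that follows (where i is a term, j is a document, N is the total number of documents in the dataset, and df(i) is the number of documents containing term i), is:

$$\mathrm{tfidf}(i, j) = \mathrm{tf}(i, j) \times \log\left(\frac{N}{\mathrm{df}(i)}\right)$$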
Word frequency is simple in that it accounts for the popularity of term i in document j as a factor towards relevance. Inverse document frequency is more interesting in that it measures the rarity of term i across the dataset, because the rarer term i is, the more value its presence brings to a document j. Now, consider a separate word i that appears in every document of the dataset. If every document contains this word, then it is not a rare word and holds no particular relevance to any specific topic. In that case N / df(i) = 1 and log(1) = 0, effectively nullifying the TF-IDF score of any word that has no rarity and therefore no relevance / value to a document j. The inverse document frequency of a word decreases as the word appears in more documents, and it approaches 0 when the word appears in almost every document. Thus, inverse document frequency captures the idea that rare words, when present, are the most relevant.
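As a rough sketch of that behavior, here is the plain log(N / df) variant computed by hand over a tiny made-up corpus (libraries such as scikit-learn use smoothed variants, so their exact numbers will differ):

```python
import math
from collections import Counter

# Tiny made-up corpus, purely for illustration
documents = [
    "the cat chased the mouse".split(),
    "the mouse ate the cheese".split(),
    "the dog slept".split(),
]
N = len(documents)

def tf(term: str, doc: list[str]) -> int:
    """Raw count of the term in one document."""
    return Counter(doc)[term]

def idf(term: str) -> float:
    """log(N / df): equals 0 when the term appears in every document."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(N / df)

def tfidf(term: str, doc: list[str]) -> float:
    return tf(term, doc) * idf(term)

# "the" appears in every document, so its score is nullified everywhere
print(tfidf("the", documents[0]))     # 0.0
# "cheese" is rare, so it scores highly in the one document that uses it
print(tfidf("cheese", documents[1]))  # log(3/1) ≈ 1.10
```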
At first, I wondered why inverse document frequency by itself is not enough to calculate the relevance of word i in document j. However, imagine a passing remark in a document that uses an extremely rare word that is not actually relevant to the document at all. Thus, I believe inverse document frequency cannot account for how a word is used within a single document, which is why it is multiplied by word frequency.
Word2Vec
Word2Vec is a more advanced word embedding technique that is meant to learn associations between words from the contexts in which they appear. Foundationally, it is based on the assumption that words appearing in similar contexts tend to have similar meanings. There are two main variants of Word2Vec: Continuous Bag of Words (CBOW) and Skip-Gram. The two variants are similar, yet at the same time opposites of one another. Word2Vec involves a shallow neural network with 1 input layer, 1 hidden layer, and 1 output layer. For CBOW, the input is the surrounding context (vectorized, of course), which is used to predict the current word as the target of the neural network. During training, a window over the context and target word iteratively shifts across the document, each word in the window is encoded in vector format, and the result is fed to the CBOW neural network.
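As a rough illustration of that sliding window (the sentence and window size here are arbitrary assumptions), each position in the text yields one (context words, center word) training pair for CBOW:

```python
# Sketch of generating CBOW training pairs with a sliding context window.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2  # number of context words taken on each side of the center word

cbow_pairs = []
for i, center in enumerate(sentence):
    # Collect up to `window` words on either side of the center word
    context = [
        sentence[j]
        for j in range(max(0, i - window), min(len(sentence), i + window + 1))
        if j != i
    ]
    cbow_pairs.append((context, center))

# CBOW: the context predicts the center word
print(cbow_pairs[2])
# (['the', 'quick', 'fox', 'jumps'], 'brown')
```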
For Skip-Gram, the opposite applies. The input is the current word, and it is sent into the neural network in order to predict the target context. Thus, the process is inherently similar but also reversed, as you can see below.
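Continuing the same illustrative setup as the CBOW sketch above, Skip-Gram simply flips each pair so that the center word predicts each of its context words:

```python
# Sketch of generating Skip-Gram training pairs: the mirror image of CBOW.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

skipgram_pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            # Skip-Gram: the center word predicts each context word
            skipgram_pairs.append((center, sentence[j]))

print(skipgram_pairs[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```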
Ultimately, Word2Vec is a word embedding technique that captures the meaning of words by analyzing the relationship between them through its very interesting training method. Unfortunately, I will not be going deeper into the inner workings of this technique, but check out the resources below because it certainly is fascinating!
Sources and other interesting reading material:
https://lena-voita.github.io/nlp_course/word_embeddings.html#research_thinking
https://towardsdatascience.com/word2vec-explained-49c52b4ccb71
https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa