# Word vectors - Why they are fundamental for NLP and how to create them (part 1)

Who is this for: developers who are interested in NLP but don't know where to begin, and people with a data science background who want to learn NLP.

How do you represent language in a way that can be used in NLP tasks? You need to represent words in a way that captures their meaning. For example, the meaning of `probability` is `the extent to which an event is likely to occur`.

Can you just use the default way computers represent strings? Ultimately, all data on a computer is stored as sequences of 0s and 1s. A string is a sequence of characters, and each character is represented by a number, which is in turn stored as 0s and 1s. This is useful for computers to store and compare words, but it doesn't capture the meaning of a word like `probability` at all.
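To make this concrete, here is a minimal sketch of what Python stores for the word `probability`. The variable names are only for illustration.

```python
word = "probability"

# Each character maps to a number (its Unicode code point)
code_points = [ord(ch) for ch in word]
print(code_points[:4])  # [112, 114, 111, 98]

# Each number is ultimately stored as bits; this is the UTF-8 encoding
bits = [format(byte, "08b") for byte in word.encode("utf-8")]
print(bits[:4])  # ['01110000', '01110010', '01101111', '01100010']
```

Nothing in these numbers says anything about events or likelihood. They only identify which characters make up the word.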

If this were a comp sci lecture, I would be telling you about WordNet and one-hot vectors. Maybe even show you how to implement them, get you all excited, and then tell you that they don't really work well. I'll briefly mention what they are so you can understand why modern methods work better.

## WordNet

WordNet is like a thesaurus in that it groups words together based on their senses. For example, `motorcar` is a kind of motor vehicle, and it in turn has more specific kinds below it, like `compact` and `gas guzzler`. Trying to capture the relationships of each word, and all the senses of each word, is extremely difficult. Agreeing on the senses and boundaries of a word is also not simple. These are just some of the limitations of using WordNet.

## One-hot vectors

One method you may see in many NLP tutorials is to use one-hot vectors to represent words.

```
Sentence: I have a blue dog

I:    [ 1 0 0 0 0 ]
have: [ 0 1 0 0 0 ]
a:    [ 0 0 1 0 0 ]
blue: [ 0 0 0 1 0 ]
dog:  [ 0 0 0 0 1 ]
```

First you have to create a `vocabulary`: a list of all the words that you will create vectors for. Any word that is not part of your vocabulary is represented as an unknown word, `<UNK>`. Each word that is in the vocabulary is represented by a vector with a `1` at the index of its position in the vocabulary list and a `0` at every other position.

In the above example, we have 5 words in our vocabulary. `blue` is in the fourth spot in our vocabulary, so its vector has a `1` in the fourth position and a `0` everywhere else.

Although this is commonly used in beginner NLP tutorials, it doesn't capture the meaning of a word: every vector is equally different from every other vector, so `blue` is no closer to `dog` than it is to `have`.
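Here is a minimal sketch of one-hot vectors for the example sentence. The helper names `one_hot` and `dot` are made up for this illustration, and unknown words (`<UNK>`) are not handled. The dot product is used as a simple similarity score: for any two different words it is always 0, whatever the words mean.

```python
vocabulary = ["I", "have", "a", "blue", "dog"]

def one_hot(word, vocabulary):
    """Return a list with a 1 at the word's vocabulary index and 0 elsewhere."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

blue = one_hot("blue", vocabulary)
dog = one_hot("dog", vocabulary)

print(blue)             # [0, 0, 0, 1, 0]
print(dot(blue, dog))   # 0
print(dot(blue, blue))  # 1
```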

## Contextual meaning

> In most cases, the meaning of a word is its use.
>
> Wittgenstein

Contextual meaning brings us to how modern NLP represents words. The meaning of a word is determined by how it is used. Try this thought experiment. If you didn't know what `tiktok` was, one way to find out is to gather all the sentences where `tiktok` is used, then work out its meaning from how it's used in those sentences.

```
Example list of sentences:

Tiktok attracts these users in so many ways.
Tiktok users can shoot, edit, and share 15-second videos.
Download tiktok on the App store.
```

From these example sentences, you can figure out that it's a mobile app that has many users, and people use it to make and share 15-second videos. This kind of resembles how people actually learn what new words mean. But how do we create a program to do this for us?

## Co-occurrence Embeddings

This plot shows word embeddings built from a co-occurrence matrix for the words Siddhartha, Govinda, Buddha, and enlightenment from the book Siddhartha by Hermann Hesse. It looks like Buddha is closer to enlightenment than Siddhartha is!

It looks interesting, but what does it actually mean? The plot above is made using all the words from the book Siddhartha. As you go through each word, its context is the set of words that appear nearby, within a fixed-size window.

```
... A goal stood before *Siddhartha*, a single goal...
... he'll find the same *Siddhartha* and Govinda ...
... and teachers, *Siddhartha* began to speak ...
```

If our window is just 1 word, then in the first sentence we see that `before` and `a` are in the context. In the second sentence, `same` and `and` are in the context. In the third sentence, `teachers` and `began` are in the context.
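As a rough sketch, collecting the context of a word can be written in a few lines. The function name `context_words` is made up for this illustration, and the tokens are the first excerpt, lowercased and with punctuation removed.

```python
def context_words(tokens, target, window_size=1):
    """Collect the words within window_size positions of each occurrence of target."""
    contexts = []
    for i, token in enumerate(tokens):
        if token == target:
            # Don't let the window start before the beginning of the list
            start = max(0, i - window_size)
            # Words to the left and right of the target, excluding the target itself
            contexts.extend(tokens[start:i] + tokens[i + 1:i + window_size + 1])
    return contexts

tokens = ['a', 'goal', 'stood', 'before', 'Siddhartha', 'a', 'single', 'goal']

print(context_words(tokens, 'Siddhartha'))
# ['before', 'a']
print(context_words(tokens, 'Siddhartha', window_size=2))
# ['stood', 'before', 'a', 'single']
```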

As a simple example, we can try to calculate the co-occurrences of this sentence: `['a', 'goal', 'stood', 'before', 'Siddhartha', 'a', 'single', 'goal']`.

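| | Siddhartha | a | before | goal | single | stood |
|---|---|---|---|---|---|---|
| Siddhartha | 0 | 1 | 1 | 0 | 0 | 0 |
| a | 1 | 0 | 0 | 1 | 1 | 0 |
| before | 1 | 0 | 0 | 0 | 0 | 1 |
| goal | 0 | 1 | 0 | 0 | 1 | 1 |
| single | 0 | 1 | 0 | 1 | 0 | 0 |
| stood | 0 | 0 | 1 | 1 | 0 | 0 |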

Let's go down the left-hand column. The first word is `Siddhartha`, and in the sentence above we can see that there is only one instance of this word. For that instance, it has two context words (one on each side): `before` and `a`. In that row, you can see there is a `1` under both of these words.
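As a quick sanity check before working with the whole book, the counts for this small sentence can be reproduced with the standard library. This is only a sketch: `co_occurrence_counts` is a name made up for this illustration, and it uses a window of 1.

```python
from collections import Counter

def co_occurrence_counts(tokens, window_size=1):
    """Count how often each (centre word, context word) pair occurs within the window."""
    counts = Counter()
    for i, centre in enumerate(tokens):
        start = max(0, i - window_size)
        # Words on both sides of the centre word, excluding the centre word itself
        for context in tokens[start:i] + tokens[i + 1:i + window_size + 1]:
            counts[(centre, context)] += 1
    return counts

tokens = ['a', 'goal', 'stood', 'before', 'Siddhartha', 'a', 'single', 'goal']
counts = co_occurrence_counts(tokens)

# Sorting puts the capitalised 'Siddhartha' first
vocab = sorted(set(tokens))

# Print one row per centre word, with columns in the same sorted order
for centre in vocab:
    print(centre, [counts[(centre, context)] for context in vocab])
```

Each printed row should match the corresponding row of the matrix for this sentence.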

Now, let's program!

First we have to get the text from the book. Fortunately, the copyright has expired, so it's available on Project Gutenberg.

```python
import urllib.request
import string

book = []
book_url_txt = 'https://www.gutenberg.org/cache/epub/2500/pg2500.txt'

# Append each line in the book to the list
for line in urllib.request.urlopen(book_url_txt):
    book.append(line.decode('utf-8').strip())

# Remove the beginning opening remarks that are not useful
book = book[48:]

# Split up each non-empty line into individual words
book = [sent.split() for sent in book if sent != '']

# Remove punctuation from each word and flatten everything into one list of words
book = [word.translate(str.maketrans('', '', string.punctuation))
        for li in book for word in li]

# Take a look at the first 20 words
print(book[:20])
"""
['house', 'in', 'the', 'sunshine', 'of', 'the', 'riverbank', 'near', 'the', 'boats',
 'in', 'the', 'shade', 'of', 'the', 'Salwood', 'forest', 'in', 'the', 'shade']
"""
```

Now that we have our dataset, the next step is to get all the distinct words.

```python
def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): sorted list of distinct words across the corpus
            num_corpus_words (integer): number of distinct words across the corpus
    """
    # Flatten the documents into a single list of words
    all_corpus_words = [y for x in corpus for y in x]
    corpus_words = sorted(set(all_corpus_words))
    num_corpus_words = len(corpus_words)

    return corpus_words, num_corpus_words

# The function expects a list of documents, so we wrap our single book in a list
corpus_words, num_corpus_words = distinct_words([book])
print(num_corpus_words)
# 4470
```

We have 4470 distinct words in our book! If you `print(corpus_words)` you can see what they are. Now we're ready to create a co-occurrence matrix just like the small example above.

```python
import numpy as np

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).

        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.

              For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
              "All" will co-occur with "<START>", "that", "glitters", "is", and "not".

        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (a symmetric numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)):
                Co-occurrence matrix of word counts.
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)
    word2ind = dict(zip(words, range(len(words))))

    M = np.zeros((num_words, num_words))

    # Iterate over each document in the corpus
    for text in corpus:
        # Iterate over each word in the document
        for i, word in enumerate(text):
            # We don't want start_index to be less than 0
            start_index = max(0, i - window_size)
            # Need to + 1 because Python slices do not include the last index
            end_index = i + window_size + 1
            # Context words on both sides, without the centre word itself
            window_words = text[start_index:i] + text[i + 1:end_index]
            # In our sample matrix above, the left-hand column contains the "centre" words
            row_index_of_centre_word = word2ind[word]
            # Iterate over each context/window word
            for w in window_words:
                column_index_of_context_word = word2ind[w]
                M[row_index_of_centre_word, column_index_of_context_word] += 1

    return M, word2ind

# This function expects a list of documents.
# Since we only have one document (1 book), we have to wrap it in a list
M, word2ind = compute_co_occurrence_matrix([book], window_size=6)
print(M.shape)
# (4470, 4470)

print(word2ind)
"""
{
  ...
  'anew': 700,
  'anger': 701,
  'angry': 702,
  ...
}
"""
```

The shape of this new matrix is 4470 by 4470 (number of distinct words in the book). We're almost there! How do we turn this huge matrix into something we can plot out on a 2-d graph? This means we have to turn one dimension of the matrix from 4470 down to 2. What is this magic?!

There's a technique called Singular Value Decomposition (SVD) that can do this (let me know if you want a post explaining SVD). We will use `scikit-learn`'s `TruncatedSVD` class, which works well with sparse matrices.

```python
from sklearn.decomposition import TruncatedSVD

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

        Params:
            M (numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """
    n_iters = 10
    # M.shape is a tuple, so take its first entry (the number of words)
    print("Running Truncated SVD over %i words..." % M.shape[0])

    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)

    return M_reduced

M_reduced = reduce_to_k_dim(M)
# Running Truncated SVD over 4470 words...

print(M_reduced.shape)
# (4470, 2)
```
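To get a feel for what the reduction does, here is a toy sketch using NumPy's full SVD on a tiny symmetric matrix. The matrix values are invented for illustration and do not come from the book. `TruncatedSVD` uses a randomized solver, so its output may differ from this in sign and in small numerical details.

```python
import numpy as np

# Toy symmetric "co-occurrence" matrix for 4 made-up words (values are invented)
M_toy = np.array([
    [0., 2., 1., 0.],
    [2., 0., 0., 1.],
    [1., 0., 0., 3.],
    [0., 1., 3., 0.],
])

# Full SVD: M_toy = U @ diag(S) @ Vt, with singular values sorted largest first
U, S, Vt = np.linalg.svd(M_toy)

k = 2
# Keep the first k columns of U, scaled by the k largest singular values (U * S)
embeddings = U[:, :k] * S[:k]
print(embeddings.shape)  # (4, 2)

# Multiplying back gives a rank-k approximation of the original matrix
M_approx = embeddings @ Vt[:k, :]
print(np.round(M_approx, 2))
```

Each row of `embeddings` is a 2-d vector for one word, which is what makes the scatter plot possible.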

Now that we have a 4470x2 matrix, we can visualize it! We're going to use Matplotlib to create a scatter plot. Substitute the words in `words_to_check` for any words you want to plot.

```python
import matplotlib.pyplot as plt

def plot_embeddings(M_reduced, word2ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".

        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus, 2)): matrix of 2-dimensional word embeddings
            word2ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """
    for word in words:
        i = word2ind[word]
        # The first component is the x coordinate, the second is the y coordinate
        x = M_reduced[i][0]
        y = M_reduced[i][1]
        plt.scatter(x, y, marker='*', color='green')
        plt.text(x, y, word, fontsize=12)
    plt.show()

words_to_check = ['Siddhartha', 'Govinda', 'Buddha', 'enlightenment']
plot_embeddings(M_reduced, word2ind, words_to_check)
```

There you have it! Your own 2-d plot of word embeddings based on a co-occurrence matrix. It's not magic at all ;)
