Word2Vec using continuous bag-of-words

Word2Vec is a method for producing word embeddings, which are dense vector representations of words. In contrast to our previous article, which used sparse representations (mostly zeros), dense vectors contain real-valued numbers in every dimension.
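To make the contrast concrete, here's a tiny made-up sketch (the sizes and numbers are arbitrary, just for illustration):

# A sparse one-hot vector for one word in a 10,000-word vocabulary: a single 1, the rest zeros
one_hot = [0.0] * 10_000
one_hot[42] = 1.0

# A dense embedding for the same word: far fewer dimensions, all of them real-valued
dense = [0.21, -1.37, 0.05, 0.88]  # real embeddings typically have ~100-300 dimensions

print(len(one_hot), len(dense))  # 10000 4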

The algorithm we will be focusing on here is called continuous bag-of-words (CBOW).

[Figure: the CBOW architecture]

CBOW predicts a word given its context: the n words to the left and right of the target word. If you do this over a large corpus with a sliding window, you can create embeddings for every word.

Since each word in our vocabulary appears as both a context word and a target word, the output is two embeddings per word: the target embedding and the context embedding. We will only need the target embedding. (Image below from Jurafsky and Martin, SLP 3rd ed.)

[Figure: CBOW target and context embeddings]

First, we will import the needed modules from PyTorch and use the book Siddhartha as our corpus.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import urllib.request
import string

book = []
book_url_txt = 'https://www.gutenberg.org/cache/epub/2500/pg2500.txt'

# Append each line in the book to the list
for line in urllib.request.urlopen(book_url_txt):
    book.append(line.decode('utf-8').strip())

# Remove the beginning opening remarks that are not useful
book = book[48:]

# Split each line into individual words
book = [sent.split() for sent in book if sent != '']

# Lowercase each word, strip punctuation, and flatten into a single list of tokens
book = [word.lower().translate(str.maketrans('', '', string.punctuation))
        for li in book for word in li]

Here, we are doing some preprocessing work to make the corpus useful:

  • splitting each line into individual words
  • lowercasing and removing punctuation
  • keeping distinct words only

At the end, our vocabulary will contain only the 4076 distinct words.

def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): sorted list of distinct words across the corpus
            num_corpus_words (integer): number of distinct words across the corpus
    """
    # Flatten the corpus, deduplicate with a set, and sort for a stable ordering
    all_corpus_words = [y for x in corpus for y in x]
    corpus_words = sorted(set(all_corpus_words))
    num_corpus_words = len(corpus_words)

    return corpus_words, num_corpus_words


corpus_words, num_corpus_words = distinct_words([book])

num_corpus_words
# 4076

Now we want to create two indexes. The first is a word-to-index mapping so that we can easily look up the index of each word in our vocabulary. The second is an index-to-word mapping that lets us do the reverse: look up a word given its index.

word_to_ix = dict(zip(corpus_words, range(num_corpus_words)))
ix_to_word = {ix:word for ix, word in enumerate(corpus_words)}

word_to_ix['enlightenment']
# 1171
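And ix_to_word gives us the reverse lookup:

ix_to_word[1171]
# 'enlightenment'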

This part is where we create the context and target words for our training set. We are using a window size of n=2, meaning that for each target word, we will be looking at 2 words on the left and 2 words on the right as the context.

If you take a look at the first instance of our training data below:

  • the sliding window covers the son of the brahman
  • the context words are the, son, the, brahman
  • the target word is of, the center word of the sliding window

Each training instance is created by moving this sliding window one word to the right.

data = []
for i in range(2, len(book) - 2):
    context = [book[i-2], book[i-1], book[i+1], book[i+2]]
    target = book[i]
    data.append((context, target))

print(data[:5])
"""
[(['the', 'son', 'the', 'brahman'], 'of'),
(['son', 'of', 'brahman', 'in'], 'the'),
(['of', 'the', 'in', 'the'], 'brahman'),
(['the', 'brahman', 'the', 'shade'], 'in'),
(['brahman', 'in', 'shade', 'of'], 'the')]
"""

Now we will create a dataset class that makes training with mini-batches easier. We don't want to train on one sliding-window instance at a time, but rather on a mini-batch of training instances.

We're using the PyTorch Dataset class. The most important piece is the __getitem__ function, which loads one sample (a single sliding-window context and target) at a given index. Based on the index, it returns the context_vector and target_index. The context_vector is built from the index of each context word in our vocabulary, returned as a tensor.

One thing to note for the target_index is that we squeeze it so that, once samples are batched, the targets have shape (batch_size,), which is the shape our loss function expects.

def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)

class CBOWDataset(Dataset):
    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        row = self.dataset[index]
        context = row[0]
        target = row[1]
        context_vector = make_context_vector(context, word_to_ix)
        target_index = torch.tensor([word_to_ix[target]]).squeeze_(-1)  # scalar tensor; batching yields shape (batch_size,)
        return {'x_data': context_vector,
                'y_target': target_index}

dataset = CBOWDataset(data)
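As a quick check, the first sample gives a context tensor of four word indices and a scalar target index (the exact index values depend on the vocabulary ordering):

sample = dataset[0]
print(sample['x_data'].shape)    # torch.Size([4]) -- indices of 'the', 'son', 'the', 'brahman'
print(sample['y_target'].shape)  # torch.Size([])  -- scalar index of 'of'; batching stacks these into (batch_size,)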

Once we have a dataset of context and target vectors, we want to turn this dataset into batches using the PyTorch DataLoader. The DataLoader just wraps an iterable around the Dataset to make it easy to access samples. Our batch size is 32 samples, and we've set drop_last=True to drop the last incomplete batch if the dataset size is not divisible by the batch size.

def generate_batches(dataset, batch_size, shuffle=True, drop_last=True, device="cpu"):
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last)
    for data_dict in dataloader:
        out_data_dict = {}
        for name, tensor in data_dict.items():
            out_data_dict[name] = tensor.to(device)
        yield out_data_dict

batch_generator = generate_batches(dataset, 32)
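If we peek at one batch (note this consumes a batch from the generator; the training loop below recreates the generator each epoch), the shapes are exactly what the model and loss function expect:

batch = next(batch_generator)
print(batch['x_data'].shape)    # torch.Size([32, 4]) -- 32 context windows of 4 word indices
print(batch['y_target'].shape)  # torch.Size([32])    -- one target index per sample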

Now we have the model. It's a simple two-layer model with the ReLU activation function applied to the first layer's output. We don't apply softmax here since our loss function does that more efficiently.

Our model will be learning the embeddings, which are initialized by nn.Embedding. This is basically a lookup table with one row per word in our vocabulary, and each word will have a dimension size of embedding_dim=100 (set below). Unlike manually created features, the embedding dimensions usually aren't interpretable.

Once the embeddings are initialized, they are updated via gradient descent.
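To see the lookup-table behaviour in isolation, here's a tiny sketch with toy sizes (not the ones we'll actually use):

toy_embeddings = nn.Embedding(num_embeddings=10, embedding_dim=4)

# Indexing with word indices returns the corresponding rows of the weight matrix
print(toy_embeddings(torch.tensor([2, 7])).shape)  # torch.Size([2, 4])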

class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim, padding_idx=0)
        self.linear1 = nn.Linear(embedding_dim, 32)
        self.linear2 = nn.Linear(32, vocab_size)

    def forward(self, inputs, apply_softmax=False):
        embeds = self.embeddings(inputs).sum(dim=1)  # sum the context word embeddings
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        if apply_softmax:
            out = F.log_softmax(out, dim=1)
        return out
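As a sanity check, here's a sketch of the forward pass using a fake batch of random word indices (not real data): a batch of 4-word context windows comes out as one score per vocabulary word.

toy_model = CBOW(vocab_size=num_corpus_words, embedding_dim=100)
fake_batch = torch.randint(0, num_corpus_words, (32, 4))  # 32 context windows of 4 word indices
print(toy_model(fake_batch).shape)  # torch.Size([32, 4076]) -- one score per word in the vocabulary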

Some parameters we are setting for our model. Each word will have 100 dimensions in our embeddings. We're using the cross-entropy loss, which internally applies a softmax to the model's raw scores over the vocabulary, making it simple to determine the most likely predicted target word (highest score). We're also using the Adam optimizer instead of stochastic gradient descent to update our model weights.

EMBEDDING_DIM = 100
losses = []
loss_func = nn.CrossEntropyLoss()
model = CBOW(num_corpus_words, EMBEDDING_DIM)
optimizer = optim.Adam(model.parameters(), lr=0.01)
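As a quick illustration of why the targets need shape (batch_size,), here's a sketch with random scores and random target indices (not real model outputs):

fake_scores = torch.randn(32, num_corpus_words)           # raw scores, shape (batch_size, vocab_size)
fake_targets = torch.randint(0, num_corpus_words, (32,))  # target word indices, shape (batch_size,)
print(loss_func(fake_scores, fake_targets))               # a single scalar loss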

Here's our standard training procedure.

for epoch in range(1000):
    total_loss = 0
    # Recreate the batch generator each epoch; a generator is exhausted after one pass
    for batch_index, batch_dict in enumerate(generate_batches(dataset, 32)):
        optimizer.zero_grad()
        y_pred = model(batch_dict['x_data'])  # raw scores; CrossEntropyLoss applies the softmax internally
        loss = loss_func(y_pred, batch_dict['y_target'])
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
    losses.append(total_loss)

torch.save(model.state_dict(), 'cbow')

Now we can test how well the embeddings have been trained! For this manual test, we look at the n closest words to our target word. The distance is calculated with torch.dist, which computes the p-norm of the difference between two tensors. Since it defaults to p=2, we're looking at the Euclidean distance. Check out an example using the word river.
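Before the full test, here's a quick made-up example of what torch.dist returns with its default p=2:

a = torch.tensor([0.0, 0.0])
b = torch.tensor([3.0, 4.0])
print(torch.dist(a, b))  # tensor(5.) -- the Euclidean distance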

# Testing
def pretty_print(results):
    """
    Pretty print embedding results.
    """
    for item in results:
        print("...[%.2f] - %s" % (item[1], item[0]))

def get_closest(target_word, word_to_ix, embeddings, n=5):
    """
    Get the n closest words to your word.
    """
    # Calculate distances to all other words
    word_embedding = embeddings[word_to_ix[target_word.lower()]]
    distances = []
    for word, index in word_to_ix.items():
        if word == "<MASK>" or word == target_word:
            continue
        distances.append((word, torch.dist(word_embedding, embeddings[index])))

    results = sorted(distances, key=lambda x: x[1])[1:n+2]
    return results

word = input('Enter a word: ')
embeddings = model.embeddings.weight.data
pretty_print(get_closest(word, word_to_ix, embeddings, n=5))

"""
Enter a word: river

...[11.68] - carrying
...[12.22] - herons
...[12.25] - herd
...[12.39] - faithful
...[12.41] - lazy
...[12.45] - constant
"""

References