In [66]:
import nltk
import re
import numpy as np

# You'll need to run, in the terminal
# pip install tabulate
from tabulate import tabulate

# The Distributional Hypothesis

The basic idea behind the distributional hypothesis can be directly taken from Wikipedia:

> Linguistic items that appear in similar contexts have similar meanings.

Here, we can think of "linguistic items" as words. A more precise (and radical) version can be found in [these slides by Marco Baroni](https://www.cs.utexas.edu/~mooney/cs388/slides/dist-sem-intro-NLP-class-UT.pdf):

- "The meaning of a word _is_ the _set of contexts_ in which it occurs in texts."

- "_Important aspects of_ the meaning of a word _are a function of_ the set of contexts in which it occurs in texts."

Those slides also offer us a few examples of how we are able to extract the meaning of certain words from the context. What does the word "wampimuk" mean in the following sentences?

> `(1) Ugh, I think I had too much wampimuk last night!`

> `(2) The other day when I was walking through the woods, I saw a wampimuk sleeping just besides the path.`

> `(3) Goofey was running late for his appointment at the wampimuk.`

The slides cite as reference McDonald & Ramscar (2001), from which I got these examples:

> **Context A: ‘urn’**
>
> On his recent holiday in Ghazistan, Joe slipped easily into the customs of the locals. In the hotel
restaurant there was a samovar dispensing tea at every table. Guests simply served themselves from
the samovar whenever they liked. Joe’s table had an elaborately crafted samovar. It was the first
earthenware samovar that he had seen.
>
> **Context B: ‘kettle’**
>
> On his recent holiday in Ghazistan, Joe slipped easily into the customs of the locals. His hotel room
featured a samovar and a single hob. Each morning Joe boiled water in the samovar for tea. Like others
he had seen on his holiday, Joe’s samovar was blackened from years of use. He imagined that at some
point it would be replaced with an electric samovar


# Context

In our first introduction to Distributional Semantics, we said that we wanted to calculate semantic similarities between "linguistic items". We argued that deciding whether two items are semantically similar is roughly the same as deciding whether their meanings are similar. This introduced an indirection: now we needed to define what _meaning_ is. As we saw, meaning is a difficult concept, so we decided to approach it indirectly, through the contexts in which words appear. That leaves us with yet another problem: what is context?

As mentioned already in the Corpus Linguistics class, generally speaking, context can include a huge variety of things:
- previous sentences
- other words in the same sentence
- socioeconomic status of the speaker
- pretty much any other characteristic of the speaker
- thematic context of the utterance
- date and time
- who is the current president of the USA
- the weather
- ...

For example, a sentence like

> Donald talked nonsense before and for sure he will talk nonsense again.

may or may not be interpreted as giving you information about the speaker's political views, depending on several of the variables mentioned above. Unfortunately, when doing distributional semantics, most of these variables are not available: you usually only have a text corpus, without any "external" details about the context in which the data were collected. Therefore, you typically use something like the following:
- words of the same sentence/paragraph/document

> The quick brown fox jumps over the lazy dog .

- words within a window of a certain size around the word of interest (in the example, the window size is 2)

> The quick brown fox jumps over the lazy dog .

- variations of the former two, e.g. using other information such as POS-tags or stopword lists

> The quick brown fox jumps over the lazy dog .

Still... in principle it's up to you to define what context makes sense for your application.
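The first two definitions above can be sketched in a few lines. This is a minimal illustration and not part of the original notebook: the helper names `sentence_context` and `window_context` are made up for this example, and the window is assumed to be symmetric (the same number of tokens on each side of the word).

```python
def sentence_context(tokens, position):
    """All other tokens of the sentence (everything except the target position)."""
    return tokens[:position] + tokens[position + 1:]


def window_context(tokens, position, window_size):
    """Up to `window_size` tokens on each side of the target position."""
    left = tokens[max(0, position - window_size):position]
    right = tokens[position + 1:position + 1 + window_size]
    return left + right


tokens = "The quick brown fox jumps over the lazy dog .".split()
target = tokens.index("fox")

print(sentence_context(tokens, target))
print(window_context(tokens, target, window_size=2))
```

With a window size of 2, the context of `fox` shrinks from the whole sentence to its two neighbours on each side.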

## Representing Context

In the Machine Learning classes, we saw how we normally want to transform our sentences into _something_ numerical that we can use in our models. We did it using a Bag of Words representation, in which each sentence was converted into a list of numbers. Here we want to do something similar: we want to convert our words into vectors that _somehow_ give us information about all the contexts in which they appear. As our definition of "context", let's use all the other words of the same sentence.

So, to make it more concrete: our goal is going to be to convert each word into a vector that _somehow_ gives us information about all the sentences in which it appears.

Let's first define a list of sentences, that we can work with:

In [6]:
sentences = [
 'The owner put the dog on a leash for taking a walk outside.',
 'In the park recently, a dog was barking.',
 'The new tiger was the attraction of the zoo.',
 'The cat didn\'t care about its owner.',
 'To the surprise of the dog, the cat was not in the mood for playing.',
 'Many kids were playing in the park.',
 'There is only one park in this ugly city.',
]

### Preprocessing

Our first step is going to be to tokenize them, in the same way as we did it before:

In [108]:
sentences = [nltk.word_tokenize(s) for s in sentences]

In [109]:
# Let's just see what we've got with the previous line
for s in sentences:
 print(s)

['The', 'owner', 'put', 'the', 'dog', 'on', 'a', 'leash', 'for', 'taking', 'a', 'walk', 'outside', '.']
['In', 'the', 'park', 'recently', ',', 'a', 'dog', 'was', 'barking', '.']
['The', 'new', 'tiger', 'was', 'the', 'attraction', 'of', 'the', 'zoo', '.']
['The', 'cat', 'did', "n't", 'care', 'about', 'its', 'owner', '.']
['To', 'the', 'surprise', 'of', 'the', 'dog', ',', 'the', 'cat', 'was', 'not', 'in', 'the', 'mood', 'for', 'playing', '.']
['Many', 'kids', 'were', 'playing', 'in', 'the', 'park', '.']
['There', 'is', 'only', 'one', 'park', 'in', 'this', 'ugly', 'city', '.']


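Our next step is to filter out things that do not look like words (since they will probably not contribute much to the meanings of the words). In serious applications, we may want to do something more sophisticated here. In our example, we simply keep every token that starts with a word character (a letter, digit or underscore), which removes pure punctuation such as `.` and `,`. Notice that `re.match` only checks the beginning of a token, so tokens like `n't` or `M.Sc.` survive this filter; keeping only tokens made up entirely of letters would require a stricter test.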

In [110]:
new_sentences = []

for s in sentences:
 # Creates a new list of tokens, keeping only tokens that start with a word character
 # (letter, digit or underscore); this drops pure punctuation such as '.' and ','
 new_s = [w for w in s if re.match(r'\w+', w)]
 # Inserts the new list (i.e., `new_s`) in the new_sentences list
 new_sentences.append(new_s)

In [111]:
sentences = new_sentences

for s in new_sentences:
 print(s)

['The', 'owner', 'put', 'the', 'dog', 'on', 'a', 'leash', 'for', 'taking', 'a', 'walk', 'outside']
['In', 'the', 'park', 'recently', 'a', 'dog', 'was', 'barking']
['The', 'new', 'tiger', 'was', 'the', 'attraction', 'of', 'the', 'zoo']
['The', 'cat', 'did', "n't", 'care', 'about', 'its', 'owner']
['To', 'the', 'surprise', 'of', 'the', 'dog', 'the', 'cat', 'was', 'not', 'in', 'the', 'mood', 'for', 'playing']
['Many', 'kids', 'were', 'playing', 'in', 'the', 'park']
['There', 'is', 'only', 'one', 'park', 'in', 'this', 'ugly', 'city']
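Note that `n't` is still present in the output above: `re.match` only anchors at the start of a token, so any token that begins with a word character passes. For comparison, here is a small sketch of a stricter filter that keeps only tokens consisting entirely of letters, using the built-in `str.isalpha()`. The variable names are chosen for this example only.

```python
import re

tokens = ['The', 'cat', 'did', "n't", 'care', 'about', 'its', 'owner', '.']

# The filter used above: re.match only checks the beginning of the token
starts_with_word_char = [w for w in tokens if re.match(r'\w+', w)]

# Stricter alternative: the whole token must consist of letters
only_letters = [w for w in tokens if w.isalpha()]

print(starts_with_word_char)
print(only_letters)
```

Which of the two is preferable depends on whether tokens such as `n't` should count as context words in your application.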


Finally, we want to lowercase all tokens:

In [112]:
new_sentences = []

for s in sentences:
 # Creates a new list of tokens, uncapitalizing everything
 new_s = [w.lower() for w in s]
 # Inserts the new list (i.e., `new_s`) in the new_sentences list
 new_sentences.append(new_s)

In [113]:
sentences = new_sentences

for s in sentences:
 print(s)

['the', 'owner', 'put', 'the', 'dog', 'on', 'a', 'leash', 'for', 'taking', 'a', 'walk', 'outside']
['in', 'the', 'park', 'recently', 'a', 'dog', 'was', 'barking']
['the', 'new', 'tiger', 'was', 'the', 'attraction', 'of', 'the', 'zoo']
['the', 'cat', 'did', "n't", 'care', 'about', 'its', 'owner']
['to', 'the', 'surprise', 'of', 'the', 'dog', 'the', 'cat', 'was', 'not', 'in', 'the', 'mood', 'for', 'playing']
['many', 'kids', 'were', 'playing', 'in', 'the', 'park']
['there', 'is', 'only', 'one', 'park', 'in', 'this', 'ugly', 'city']


### Calculating co-occurrences between `dog` and other words

Let's focus on a single word, `dog`, to see what it would take to represent the context around that single word. Since we defined (for this example) "context" as "the sentence in which the word appears", we want to go through all our sentences and only keep those that contain the word `dog`:

In [23]:
# Get all sentences with dog
sentences_with_dog = [s for s in sentences if 'dog' in s]

for s in sentences_with_dog:
 print(s)

['the', 'owner', 'put', 'the', 'dog', 'on', 'a', 'leash', 'for', 'taking', 'a', 'walk', 'outside']
['in', 'the', 'park', 'recently', 'a', 'dog', 'was', 'barking']
['to', 'the', 'surprise', 'of', 'the', 'dog', 'the', 'cat', 'was', 'not', 'in', 'the', 'mood', 'for', 'playing']


In our Corpus Linguistics class, we saw how we could count the frequencies of the words in a corpus using a `FreqDist` object. We will use it again here. Let's first initialize a `FreqDist` object without any data:

In [27]:
fd_dog = nltk.FreqDist()

In [26]:
fd_dog

FreqDist({})

The `FreqDist` object has a method `update()` that receives a list of tokens and adds them to its counts. We'll use it to count the occurrences of all the words in our three sentences.
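To see what `update()` does in isolation, here is a small sketch. It uses `collections.Counter` from the standard library as a stand-in so that it runs without NLTK; `FreqDist` is a subclass of `Counter`, so `update()` behaves the same way. The token lists are invented for the example.

```python
from collections import Counter

counts = Counter()

# Each call adds the tokens of one "sentence" to the running counts
counts.update(['the', 'dog', 'saw', 'the', 'cat'])
counts.update(['the', 'cat', 'slept'])

print(counts['the'])     # 3
print(counts['unseen'])  # 0: words that were never counted default to zero
```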

The problem is: we do not want to count the occurrences of `dog`. After all, we know that `dog` occurs in all of these sentences. So we will first remove the word `dog` from each of the sentences, using the same kind of code we used before:

In [28]:
new_sentences_with_dog = []

for s in sentences_with_dog:
 # Creates a new list of tokens, without dog
 new_s = [w for w in s if w != 'dog']
 # Inserts the new list (i.e., `new_s`) in the new_sentences_with_dog list
 new_sentences_with_dog.append(new_s)

sentences_with_dog = new_sentences_with_dog

In [29]:
for s in sentences_with_dog:
 print(s)

['the', 'owner', 'put', 'the', 'on', 'a', 'leash', 'for', 'taking', 'a', 'walk', 'outside']
['in', 'the', 'park', 'recently', 'a', 'was', 'barking']
['to', 'the', 'surprise', 'of', 'the', 'the', 'cat', 'was', 'not', 'in', 'the', 'mood', 'for', 'playing']


Now we can finally use the `update` function of the `FreqDist` object:

In [30]:
for s in sentences_with_dog:
 fd_dog.update(s)

Let's see what we've got:

In [32]:
print('dog\n', fd_dog.items())

dog
 dict_items([('the', 7), ('owner', 1), ('put', 1), ('on', 1), ('a', 3), ('leash', 1), ('for', 2), ('taking', 1), ('walk', 1), ('outside', 1), ('in', 2), ('park', 1), ('recently', 1), ('was', 2), ('barking', 1), ('to', 1), ('surprise', 1), ('of', 1), ('cat', 1), ('not', 1), ('mood', 1), ('playing', 1)])


### Doing the same with `park` (in less code)

Of course, we could do the exact same thing for another word, say `park`:

In [35]:
fd_park = nltk.FreqDist()

# Iterates through all sentences that contain 'park'...
for s in [s for s in sentences if 'park' in s]:
 # ... updating the `fd_park` variable with the counts of all words except 'park'
 fd_park.update([w for w in s if w != 'park'])

# Prints the result
print('park\n', fd_park.items())

park
 dict_items([('in', 3), ('the', 2), ('recently', 1), ('a', 1), ('dog', 1), ('was', 1), ('barking', 1), ('many', 1), ('kids', 1), ('were', 1), ('playing', 1), ('there', 1), ('is', 1), ('only', 1), ('one', 1), ('this', 1), ('ugly', 1), ('city', 1)])


### Representing the context of a list of words of interest

Now we want to make our example a little more complicated and create code that would do the exact same as we did for `dog` or `park` above, only for a longer list of words. Let's call them our "words of interest". Our goal will be to represent the context of (i.e., create word vectors for) each one of them.

In [38]:
# Let's define a set of words of interest
words_of_interest = ['dog', 'park', 'owner', 'tiger', 'cat', 'a', 'was']

Again, we want to initialize one `FreqDist` object for each one of our words. For that, we will use a dictionary. If you don't remember how dictionaries work, take a look at the Python tutorial of the second class. The `for word in words_of_interest` syntax below is analogous to the list comprehension syntax used for lists.

In [42]:
# We initialize empty FreqDist
co_occurrences = {word: nltk.FreqDist() for word in words_of_interest}
co_occurrences

{'dog': FreqDist({}),
 'park': FreqDist({}),
 'owner': FreqDist({}),
 'tiger': FreqDist({}),
 'cat': FreqDist({}),
 'a': FreqDist({}),
 'was': FreqDist({})}
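If the dictionary comprehension is unfamiliar, the sketch below builds the same kind of dictionary twice, once with an explicit loop and once with a comprehension. It uses `collections.Counter` as a stand-in for `nltk.FreqDist` so that it runs on its own, and the word list is shortened for the example.

```python
from collections import Counter

words = ['dog', 'park', 'owner']

# Explicit loop
with_loop = {}
for word in words:
    with_loop[word] = Counter()

# Dictionary comprehension: the same result in one line
with_comprehension = {word: Counter() for word in words}

# Each key gets its own, independent counter
with_comprehension['dog'].update(['leash'])
print(with_comprehension)
```

The comprehension creates a fresh counter for every key. That matters here, because each word of interest needs its own counts.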

Now we will adapt the code we used for `park` to run for several words. We basically "wrap" it with a new for loop, iterating through each of our words of interest.

_(this may be a little complicated. Try playing a little with the code here to understand what is going on)_

In [44]:
for w_of_interest in words_of_interest:
    # Iterates through all sentences that contain 'w_of_interest'...
    for s in [s for s in sentences if w_of_interest in s]:
        # ... updating the `FreqDist` object associated with 'w_of_interest' with the counts
        # of all words except the 'w_of_interest'
        co_occurrences[w_of_interest].update([w for w in s if w != w_of_interest])

Let's print what we've got, to see what it looks like.

In [45]:
for w_of_interest in words_of_interest:
 print(w_of_interest, "\n", co_occurrences[w_of_interest].items(), "\n--\n")

dog 
 dict_items([('the', 7), ('owner', 1), ('put', 1), ('on', 1), ('a', 3), ('leash', 1), ('for', 2), ('taking', 1), ('walk', 1), ('outside', 1), ('in', 2), ('park', 1), ('recently', 1), ('was', 2), ('barking', 1), ('to', 1), ('surprise', 1), ('of', 1), ('cat', 1), ('not', 1), ('mood', 1), ('playing', 1)]) 
--

park 
 dict_items([('in', 3), ('the', 2), ('recently', 1), ('a', 1), ('dog', 1), ('was', 1), ('barking', 1), ('many', 1), ('kids', 1), ('were', 1), ('playing', 1), ('there', 1), ('is', 1), ('only', 1), ('one', 1), ('this', 1), ('ugly', 1), ('city', 1)]) 
--

owner 
 dict_items([('the', 3), ('put', 1), ('dog', 1), ('on', 1), ('a', 2), ('leash', 1), ('for', 1), ('taking', 1), ('walk', 1), ('outside', 1), ('cat', 1), ('did', 1), ("n't", 1), ('care', 1), ('about', 1), ('its', 1)]) 
--

tiger 
 dict_items([('the', 3), ('new', 1), ('was', 1), ('attraction', 1), ('of', 1), ('zoo', 1)]) 
--

cat 
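 dict_items([('the', 5), ('did', 1), ("n't", 1), ('care', 1), ('about', 1), ('its', 1), ('owner', 1), ('to', 1), ('surprise', 1), ('of', 1), ('dog', 1), ('was', 1), ('not', 1), ('in', 1), ('mood', 1), ('for', 1), ('playing', 1)]) 
--

a 
 dict_items([('the', 3), ('owner', 1), ('put', 1), ('dog', 2), ('on', 1), ('leash', 1), ('for', 1), ('taking', 1), ('walk', 1), ('outside', 1), ('in', 1), ('park', 1), ('recently', 1), ('was', 1), ('barking', 1)]) 
--

was 
 dict_items([('in', 2), ('the', 8), ('park', 1), ('recently', 1), ('a', 1), ('dog', 2), ('barking', 1), ('new', 1), ('tiger', 1), ('attraction', 1), ('of', 2), ('zoo', 1), ('to', 1), ('surprise', 1), ('cat', 1), ('not', 1), ('mood', 1), ('for', 1), ('playing', 1)]) 
--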

After we've done this, we can have a look at how often each of our words of interest co-occur with any of the other words in our sentences. Let's try this with `dog`, to see if we get the exact same results as before:

In [143]:
# Now we can have a look at how often each word co-occurred
# with, say, 'dog'
print(co_occurrences['dog'].items())

dict_items([('the', 7), ('owner', 1), ('put', 1), ('on', 1), ('a', 3), ('leash', 1), ('for', 2), ('taking', 1), ('walk', 1), ('outside', 1), ('in', 2), ('park', 1), ('recently', 1), ('was', 2), ('barking', 1), ('to', 1), ('surprise', 1), ('of', 1), ('cat', 1), ('not', 1), ('mood', 1), ('playing', 1)])


### From co-occurrence counts to vectors

The current format of our counts is just too cumbersome. As already mentioned several times, we want vectors, but what we currently have is just a list containing pairs $(word, count)$. Notice that the list has a variable length:

In [46]:
# Length of the list associated with the word dog
len(co_occurrences['dog'].items())

22

In [47]:
# Length of the list associated with the word park
len(co_occurrences['park'].items())

18

This is not very handy. Transforming this list into a vector, however, is not hard. What you usually do is fix a vocabulary of context words. Then, for each word $w$ in our list of words of interest, you count the co-occurrences of $w$ with each of the words in the vocabulary. Words that are not in the vocabulary are ignored.

Having a vocabulary will cause our "list of counts" to always have the same number of elements. Hence, the whole data can be stored and displayed as a matrix.
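The idea can be sketched on a toy scale before applying it to our data. The counts and the four-word vocabulary below are invented for the illustration, and `collections.Counter` again stands in for `FreqDist` so the snippet is self-contained.

```python
from collections import Counter

# Toy co-occurrence counts (a stand-in for the FreqDist objects above)
counts = {
    'dog': Counter({'the': 7, 'a': 3, 'was': 2, 'leash': 1}),
    'park': Counter({'in': 3, 'the': 2, 'was': 1, 'kids': 1}),
}

# A fixed vocabulary of context words; anything else is ignored
vocab = ['the', 'a', 'in', 'was']

# One count per vocabulary word, always in the same order
vectors = {word: [counts[word][ctx] for ctx in vocab] for word in counts}

for word, vector in vectors.items():
    print(word, vector)
```

Both vectors now have exactly `len(vocab)` entries, and position `i` always refers to the same context word, which is what lets us stack them into a matrix.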

Let's do an example. Let's start by getting all the words that are "relevant" to us, i.e., the words that co-occur with any of our words of interest:

In [49]:
# First we get all words that co-occur with each of our words of interest.
# This is done with the following line.
#
# The first for-loop gets each of our "words of interest"
# The second for-loop gets each of the words that co-occur with them
all_words = [w for word in co_occurrences
 for w in co_occurrences[word]]

all_words

# If this is not clear, try running
#
# for word in co_occurrences:
# print(word)
# 
# , which will give you each of our words of interest. Then let's
# fixate into one of our words of interest (say, 'dog'), and run
# the next for loop. This will give all the words that co-occur
# with 'dog':
#
# for w in co_occurrences['dog']:
# print(w)

['the',
 'a',
 'for',
 'in',
 'was',
 'owner',
 'put',
 'on',
 'leash',
 'taking',
 'walk',
 'outside',
 'park',
 'recently',
 'barking',
 'to',
 'surprise',
 'of',
 'cat',
 'not',
 'mood',
 'playing',
 'in',
 'the',
 'recently',
 'a',
 'dog',
 'was',
 'barking',
 'many',
 'kids',
 'were',
 'playing',
 'there',
 'is',
 'only',
 'one',
 'this',
 'ugly',
 'city',
 'the',
 'a',
 'put',
 'dog',
 'on',
 'leash',
 'for',
 'taking',
 'walk',
 'outside',
 'cat',
 'did',
 "n't",
 'care',
 'about',
 'its',
 'the',
 'new',
 'was',
 'attraction',
 'of',
 'zoo',
 'the',
 'did',
 "n't",
 'care',
 'about',
 'its',
 'owner',
 'to',
 'surprise',
 'of',
 'dog',
 'was',
 'not',
 'in',
 'mood',
 'for',
 'playing',
 'the',
 'dog',
 'owner',
 'put',
 'on',
 'leash',
 'for',
 'taking',
 'walk',
 'outside',
 'in',
 'park',
 'recently',
 'was',
 'barking',
 'the',
 'in',
 'dog',
 'of',
 'park',
 'recently',
 'a',
 'barking',
 'new',
 'tiger',
 'attraction',
 'zoo',
 'to',
 'surprise',
 'cat',
 'not',
 'mood',


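The next step is similar to what we did when we created a vocabulary: we keep only the most common words. Note that iterating over a `FreqDist` yields each distinct word once, so `all_words` contains a context word once for every word of interest it co-occurs with; the counts below therefore tell us with how many of our words of interest a context word co-occurs. Calling the `most_common` function provides pairs $(word, count)$. Since we want a vocabulary, we are only interested in the words, so we drop the counts.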

To try to make things concrete, we'll create a very small vocabulary, containing only 10 words.

In [58]:
most_commons = nltk.FreqDist(all_words).most_common(10)
most_commons

[('the', 7),
 ('for', 5),
 ('in', 5),
 ('was', 5),
 ('dog', 5),
 ('a', 4),
 ('recently', 4),
 ('barking', 4),
 ('of', 4),
 ('playing', 4)]

In [59]:
# And then we get rid of the numbers (the frequencies)
vocab = [c for c,count in most_commons]
vocab

['the', 'for', 'in', 'was', 'dog', 'a', 'recently', 'barking', 'of', 'playing']

Ok... now we have a vocabulary containing the words that appear most often alongside our words of interest. This vocabulary has a fixed length (here, 10). The rest is very similar to our Bag of Words representation, but instead of counting the occurrences of words in a sentence, we count the co-occurrences (i.e., the appearances in the same sentence) of the vocabulary words with our words of interest.

In [60]:
# Now, the `co_matrix` contains a matrix with the following format:
# * rows: each of our words of interest ('dog', 'park', 'owner', ...)
# * columns: each of our `context` words (the most common words)
co_matrix = np.array([[co_occurrences[word][ctx] for ctx in vocab] for word in words_of_interest])
co_matrix

array([[7, 2, 2, 2, 0, 3, 1, 1, 1, 1],
 [2, 0, 3, 1, 1, 1, 1, 1, 0, 1],
 [3, 1, 0, 0, 1, 2, 0, 0, 0, 0],
 [3, 0, 0, 1, 0, 0, 0, 0, 1, 0],
 [5, 1, 1, 1, 1, 0, 0, 0, 1, 1],
 [3, 1, 1, 1, 2, 0, 1, 1, 0, 0],
 [8, 1, 2, 0, 2, 1, 1, 1, 2, 1]])

In the code below, we create a function that shows this in a nicer format that is hopefully more intuitive. See if you can understand what the function is doing:

In [72]:
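def show_co_matrix(mat, words, vocab, max_vocab_size=None):
    # One table row per word of interest: the word followed by its counts.
    # The backslash in the header is escaped ('\\') so that Python does not
    # treat '\ ' as an (invalid) escape sequence.
    if max_vocab_size:
        # Show only the first `max_vocab_size` context columns
        print(tabulate([[word] + list(mat[i, :max_vocab_size]) for i, word in enumerate(words)],
                       headers=['word \\ context'] + vocab[:max_vocab_size]))
    else:
        print(tabulate([[word] + list(mat[i]) for i, word in enumerate(words)],
                       headers=['word \\ context'] + vocab))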

Now we use the function to see our data in a nice way:

In [74]:
show_co_matrix(co_matrix, words_of_interest, vocab, max_vocab_size=10)

word \ context      the    for    in    was    dog    a    recently    barking    of    playing
----------------  -----  -----  ----  -----  -----  ---  ----------  ---------  ----  ---------
dog                   7      2     2      2      0    3           1          1     1          1
park                  2      0     3      1      1    1           1          1     0          1
owner                 3      1     0      0      1    2           0          0     0          0
tiger                 3      0     0      1      0    0           0          0     1          0
cat                   5      1     1      1      1    0           0          0     1          1
a                     3      1     1      1      2    0           1          1     0          0
was                   8      1     2      0      2    1           1          1     2          1


This is what we call a **co-occurrence matrix**. If you find the code above hard to follow, you may want to watch [this](https://www.youtube.com/watch?v=-F-PlgX8GcY) and [this](https://www.youtube.com/watch?v=Qy-Lq9Ae2KM), in which a person calculates the co-occurrences by hand (I don't really love these videos, but they are the only minimally decent videos I've found on the topic).
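One thing a matrix like this makes possible is comparing rows numerically. As an illustration that goes slightly beyond what this notebook computes, the sketch below measures the cosine similarity between a few rows copied from the table above. Cosine similarity is only one possible choice, and the helper `cosine` is written for this example.

```python
import math

# Rows copied from the co-occurrence matrix above
rows = {
    'dog':   [7, 2, 2, 2, 0, 3, 1, 1, 1, 1],
    'cat':   [5, 1, 1, 1, 1, 0, 0, 0, 1, 1],
    'tiger': [3, 0, 0, 1, 0, 0, 0, 0, 1, 0],
}


def cosine(u, v):
    """Cosine of the angle between two count vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


for other in ['cat', 'tiger']:
    print('dog vs', other, round(cosine(rows['dog'], rows[other]), 3))
```

With so little data, and with `the` dominating every row, these numbers should not be read as saying much about meaning.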

### Stop word Removal

The notion of a _stop word_ is a bit fuzzy, but you can think of it as a _frequent word_ that is assumed to be _irrelevant_ to the task at hand. In our case, the idea is that these words appear in almost any sentence, and therefore contribute little information about the context.

Typically the most common words are considered to be stop words.
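As a rough sketch of that idea, the snippet below derives a stop word list directly from corpus frequencies. The toy sentences and the cut-off `k` are arbitrary choices made for this illustration; real stop word lists are usually curated by hand.

```python
from collections import Counter

sentences = [
    ['the', 'owner', 'put', 'the', 'dog', 'on', 'a', 'leash', 'for', 'a', 'walk'],
    ['in', 'the', 'park', 'a', 'dog', 'was', 'barking'],
    ['the', 'new', 'tiger', 'was', 'the', 'attraction', 'of', 'the', 'zoo'],
]

# Count every token in the corpus
frequencies = Counter(w for s in sentences for w in s)

# Treat the k most frequent words as stop words (k is an arbitrary choice)
k = 2
corpus_stopwords = {w for w, _ in frequencies.most_common(k)}

# Drop the stop words from every sentence
filtered = [[w for w in s if w not in corpus_stopwords] for s in sentences]

print(sorted(corpus_stopwords))
print(filtered[0])
```

On a corpus this small the frequency ranking is fragile: a larger `k` would quickly start removing content words such as `dog`.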

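Luckily, NLTK comes with lists of stop words that we can use (you may need to run `nltk.download('stopwords')` once). Note that calling `words()` without an argument, as below, returns the stop words of all the languages NLTK covers, which is why the set contains many non-English entries; for an English-only list, you could call `nltk.corpus.stopwords.words('english')` instead.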

In [78]:
stopwords = set(nltk.corpus.stopwords.words())
stopwords

{'sebisanya',
 'hubiésemos',
 'houveria',
 'terkira',
 'foram',
 "you've",
 'štirinajstega',
 'mint',
 'memperlihatkan',
 'nismo',
 'osmo',
 'dari',
 'triintridesetim',
 'stesse',
 'mu',
 'usai',
 'vsakogar',
 'persoalan',
 'hab',
 'валекин ',
 'морт',
 'masih',
 'и',
 'štiriindvajsetim',
 'with',
 'multi',
 'osemdesetega',
 'ellos',
 'себя',
 'әлденеше',
 'vsakršno',
 'enajstima',
 'enkraten',
 'esses',
 'tretjima',
 'o',
 'noilta',
 'nekaka',
 'sesuatunya',
 'deiner',
 'pukul',
 'houverem',
 'seusai',
 'mojimi',
 'facciano',
 'stanno',
 'kakšni',
 'maraš',
 'smelo',
 'ujarnya',
 'terus',
 'kedua',
 'यहाँसम्म',
 'seluruh',
 'shouldn',
 'după',
 'nəhayət',
 'verjetno',
 'पटक',
 'izpod',
 'petnajste',
 'deyil',
 'lesz',
 'sunt',
 'jim',
 'untuk',
 'isso',
 'هلا',
 'ed',
 'εκεινο',
 'itulah',
 'эй',
 'dovolite',
 'eres',
 'dvanajstih',
 'muito',
 'ert',
 'тот',
 'dvakratni',
 'замон',
 'ले',
 'le-tema',
 'lagi',
 'pe',
 'hänessä',
 'mengingatkan',
 'warst',
 'ӯббо',
 'egyre',
 'bodimo',


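To use this, we just need to add one more step to our "preprocessing" stage. Because our stop word set mixes all languages, it also removes some words that are not English stop words, such as `dog`, `cat` and `care` in the output below.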

In [118]:
new_sentences = []

for s in sentences:
 # Creates a new list of tokens, dropping every token that is in the stop word set
 new_s = [w for w in s if w not in stopwords]
 # Inserts the new list (i.e., `new_s`) in the new_sentences list
 new_sentences.append(new_s)

In [119]:
new_sentences

[['owner', 'put', 'leash', 'taking', 'walk', 'outside'],
 ['park', 'recently', 'barking'],
 ['new', 'tiger', 'attraction', 'zoo'],
 ["n't", 'owner'],
 ['surprise', 'mood', 'playing'],
 ['many', 'kids', 'playing', 'park'],
 ['park', 'ugly', 'city']]

Then we would need to rerun all the other steps above to generate the co-occurrence matrix. The following function does that. See if you can understand what it is doing:

In [120]:
# `compute_context_stuff` does exactly what we did above. We only wrap
# things into a function so that we can reuse all of it later
def compute_context_stuff(sentences, words, remove_stopwords=False, vocab_size=10):
    # One (initially empty) FreqDist per word of interest
    co_occurrences = {word: nltk.FreqDist() for word in words}

    for sentence in sentences:
        for word in words:
            if word in sentence:
                co_occurrences[word].update([w.lower() for w in sentence
                                             if re.match(r'\w+', w) and w != word
                                             # We remove stop words with this line!
                                             # We only include a word in our list if it
                                             # is not in the stopwords list
                                             and (not remove_stopwords or w.lower() not in stopwords)])

    # The vocabulary: the `vocab_size` context words that co-occur with
    # the largest number of words of interest
    vocab = [c for c, count in nltk.FreqDist([w for word in co_occurrences
                                              for w in co_occurrences[word]]).most_common(vocab_size)]

    # Rows: words of interest; columns: context words from the vocabulary
    co_matrix = np.array([[co_occurrences[word][ctx] for ctx in vocab] for word in words])
    return vocab, co_matrix

Then we just run the function, passing our data, our words of interest, and our vocabulary size:

In [121]:
vocab, co_matrix = compute_context_stuff(sentences, words_of_interest, remove_stopwords=True, vocab_size=10)

In [122]:
show_co_matrix(co_matrix, words_of_interest, vocab)

word \ context      recently    barking    playing    owner    put    leash    taking    walk    outside    park
----------------  ----------  ---------  ---------  -------  -----  -------  --------  ------  ---------  ------
dog                        1          1          1        1      1        1         1       1          1       1
park                       1          1          1        0      0        0         0       0          0       0
owner                      0          0          0        0      1        1         1       1          1       0
tiger                      0          0          0        0      0        0         0       0          0       0
cat                        0          0          1        1      0        0         0       0          0       0
a                          1          1          0        1      1        1         1       1          1       1
was                        1          1          1        0      0        0         0       0          0       1


### The Effect of Context

In the entire discussion above, we defined meaning in terms of context, and we arbitrarily decided that our context was "the entire sentence". It should be quite clear, therefore, that changing what we use as context should have an impact on the vectors we obtain to represent the meaning of our words of interest. We already saw how removing stop words changes the context vectors that we obtain. Removing stop words can be seen as one way of modifying what we define as context.

**Question**: How exactly does stop word removal affect the "meaning" we obtain? Which information is lost in this process and which nuances of meaning might not be accessible?

#### The effect of window sizes

We discussed above another way of defining context, namely, by defining a window around our word of interest and only considering the words that are inside that window. The function below redefines our `compute_context_stuff` function to take a window size into account.

In [123]:
def compute_context_stuff(sentences, words, remove_stopwords=False, vocab_size=10, window_size=5):
    co_occurrences = {word: nltk.FreqDist() for word in words}

    for sentence in sentences:
        for word in words:
            if word in sentence:
                # Position of the first occurrence of the word in the sentence
                word_pos = sentence.index(word)
                # Window around the word. The slice end is exclusive, so as written this
                # takes `window_size` tokens to the left of the word but at most
                # `window_size - 1` tokens to its right.
                window = sentence[max(0, word_pos - window_size):min(word_pos + window_size, len(sentence) - 1)]
                co_occurrences[word].update([w.lower() for w in window
                                             if re.match(r'\w+', w) and w != word
                                             and (not remove_stopwords or w.lower() not in stopwords)])

    # Use the `vocab_size` parameter instead of a hard-coded 10
    vocab = [c for c, count in nltk.FreqDist([w for word in co_occurrences
                                              for w in co_occurrences[word]]).most_common(vocab_size)]

    co_matrix = np.array([[co_occurrences[word][ctx] for ctx in vocab] for word in words])
    return vocab, co_matrix
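One detail of the slice in the function above is worth spelling out: a slice `sentence[start:end]` excludes `end`, so `min(word_pos+window_size, len(sentence)-1)` yields `window_size` tokens to the left of the word but at most `window_size - 1` to its right, and never the last token of the sentence. In addition, `sentence.index(word)` only finds the first occurrence of the word. If a symmetric window is wanted, a variant could look like the sketch below. `symmetric_window` is a name made up for this example, and it is not the code that produced the outputs in this notebook.

```python
def symmetric_window(sentence, word, window_size):
    """Collect up to `window_size` tokens on each side of every occurrence of `word`."""
    context = []
    for pos, token in enumerate(sentence):
        if token != word:
            continue
        start = max(0, pos - window_size)
        end = min(len(sentence), pos + window_size + 1)  # slice end is exclusive
        # Keep everything in the window except the target position itself
        context.extend(sentence[start:pos] + sentence[pos + 1:end])
    return context


sentence = ['in', 'the', 'park', 'recently', 'a', 'dog', 'was', 'barking']
print(symmetric_window(sentence, 'dog', window_size=1))
print(symmetric_window(sentence, 'barking', window_size=2))
```

Whether the window should be symmetric at all is a modelling decision; some applications deliberately look only to the left or only to the right.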

We want to take a look at a few possible window sizes, to see what it looks like. Let's try window sizes 1, 3 and 6:

In [124]:
for window_size in [1,3,6]:
 print("\ncontexts for window size %d\n"%window_size)
 contexts, co_matrix = compute_context_stuff(sentences, words_of_interest, window_size=window_size)
 show_co_matrix(co_matrix, words_of_interest, contexts, max_vocab_size=10)


contexts for window size 1

word \ context      the    a    one    its    new    on    recently    dog    tiger    cat
----------------  -----  ---  -----  -----  -----  ----  ----------  -----  -------  -----
dog                   2    1      0      0      0     0           0      0        0      0
park                  2    0      1      0      0     0           0      0        0      0
owner                 1    0      0      1      0     0           0      0        0      0
tiger                 0    0      0      0      1     0           0      0        0      0
cat                   2    0      0      0      0     0           0      0        0      0
a                     0    0      0      0      0     1           1      0        0      0
was                   0    0      0      0      0     0           0      1        1      1

contexts for window size 3

word \ context      the    recently    was    a    dog    put    on    park    cat    in
----------------  -----  ----------  -----  ---  -----  -----  ----  ------  -----  ----
dog                   3           1      1    2      0      1     1       1      1     0
park                  2           1      0    1      0      0     0       0      0     3
owner                 2           0      0    0      0      1     0       0      0     0
tiger                 2           0      1    0      0      0     0       0      0     0
cat                   3           0      1    0      1      0     0       0      0     0
a                     2           1      1    0      2      0     1       1      0     0
was                   3           1      0    1      2      0     0       0      1     1

contexts for window size 6

word \ context      the    in    was    dog    a    recently    of    put    on    for
----------------  -----  ----  -----  -----  ---  ----------  ----  -----  ----  -----
dog                   6     2      2      0    2           1     1      1     1      1
park                  2     3      1      1    1           1     0      0     0      0
owner                 2     0      0      1    1           0     0      1     1      0
tiger                 3     0      1      0    0           0     1      0     0      0
cat                   5     1      1      1    0           0     1      0     0      0
a                     3     1      1      2    0           1     0      1     1      1
was                   7     2      0      2    1           1     2      0     0      1

### References

 * McDonald, S., & Ramscar, M. (2001). Testing the distributional hypothesis: The influence of context on judgements of semantic similarity. In Proceedings of the Annual Meeting of the Cognitive Science Society (Vol. 23, No. 23).