{
"cells": [
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"import nltk\n",
"import re\n",
"import numpy as np\n",
"\n",
"# You'll need to run, in the terminal\n",
"# pip install tabulate\n",
"from tabulate import tabulate"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The Distributional Hypothesis\n",
"\n",
"The basic idea behind the distributional hypothesis can be directly taken from Wikipedia:\n",
"\n",
"> Linguistic items that appear in similar contexts have similar meanings.\n",
"\n",
"Here, we can think of \"linguistic items\" as words. A more precise (and radical) version can be found in [these slides by Marco Baroni](https://www.cs.utexas.edu/~mooney/cs388/slides/dist-sem-intro-NLP-class-UT.pdf):\n",
"\n",
"- \"The meaning of a word _is_ the _set of contexts_ in which it occurs in texts.\"\n",
"\n",
"- \"_Important aspects of_ the meaning of a word _are a function of_ the set of contexts in which it occurs in texts.\"\n",
"\n",
"Those slides also offer us a few examples of how we are able to extract the meaning of certain words from the context. What does the word \"wampimuk\" mean in the following sentences?\n",
"\n",
"> `(1) Ugh, I think I had too much wampimuk last night!`\n",
"\n",
"> `(2) The other day when I was walking through the woods, I saw a wampimuk sleeping just besides the path.`\n",
"\n",
"> `(3) Goofey was running late for his appointment at the wampimuk.`\n",
"\n",
"The slides cite as reference McDonald & Ramscar (2001), from which I got these examples:\n",
"\n",
"> **Context A: ‘urn’**\n",
">\n",
"> On his recent holiday in Ghazistan, Joe slipped easily into the customs of the locals. In the hotel\n",
"restaurant there was a samovar dispensing tea at every table. Guests simply served themselves from\n",
"the samovar whenever they liked. Joe’s table had an elaborately crafted samovar. It was the first\n",
"earthenware samovar that he had seen.\n",
">\n",
"> **Context B: ‘kettle’**\n",
">\n",
"> On his recent holiday in Ghazistan, Joe slipped easily into the customs of the locals. His hotel room\n",
"featured a samovar and a single hob. Each morning Joe boiled water in the samovar for tea. Like others\n",
"he had seen on his holiday, Joe’s samovar was blackened from years of use. He imagined that at some\n",
"point it would be replaced with an electric samovar\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Context\n",
"\n",
"In our first introduction to Distributional Semantics, we said that we wanted to calculate semantic similarities between \"linguistic items\". We argued that deciding whether two items are semantically similar was roughly the same as deciding whether their meaning is similar. Then we introduced an indirection to our problem: now we needed to define what _meaning_ was. As we saw, meaning is a difficult concept, and then, we concluded that we would try to get it indirectly based on the contexts in which the words appear. Now we have once another problem: what is context?\n",
"\n",
"As mentioned already in the Corpus Linguistics class, generally speaking, context can include a huge variety of things:\n",
"- previous sentences\n",
"- other words in the same sentence\n",
"- socioeconomical status of the speaker\n",
"- pretty much any other characteristic of the speaker\n",
"- thematic context of the utterance\n",
"- date and time\n",
"- who is the current president of the USA\n",
"- the weather\n",
"- ...\n",
"\n",
"For example, a sentence like\n",
"\n",
"> Donald talked nonsense before and for sure he will talk nonsense again.\n",
"\n",
"may or may not be interpreted to give you information about the speaker's political views, depending on several of the variables mentioned above. Unfortunately, when performing distributional semantics, several of these interpretations are not applicable: you usually only have a text corpus without any \"external\" details about the context in which these data were collected. Therefore, you typically use something like the following:\n",
"- words of the same sentence/paragraph/document\n",
"\n",
"> The quick brown fox jumps over the lazy dog .\n",
"\n",
"- words within a window of a certain size around the word of interest (in the example, the window size is 2)\n",
"\n",
"> The quick brown fox jumps over the lazy dog .\n",
"\n",
"- variations of the former two, e.g. using other information such as POS-tags or stopword lists\n",
"\n",
"> The quick brown fox jumps over the lazy dog .\n",
"\n",
"Still... in principle it's up to you to define what context makes sense for your application."
]
},
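{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the window-based definition a bit more concrete, here is a small sketch (an extra helper; it is not used in the rest of the notebook) that extracts the context words within a window of a given size around a position:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def window_context(tokens, position, window_size=2):\n",
"    # Takes up to `window_size` tokens to the left and to the right\n",
"    # of `position`, excluding the token at `position` itself\n",
"    left = tokens[max(0, position - window_size):position]\n",
"    right = tokens[position + 1:position + 1 + window_size]\n",
"    return left + right\n",
"\n",
"example = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']\n",
"window_context(example, example.index('fox'), window_size=2)"
]
},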
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Representing Context\n",
"\n",
"In the Machine Learning classes, we saw how we normally want to transform our sentences into _something_ numerical, that we can use in our models. We did it using a Bag of Words representation, in which each sentence was converted into a list of numbers. Here we want to do something similar: we want to convert our words into vectors that _somehow_ give us information about all the contexts in which it appears. As our definition of \"context\", let's use all the other words of the same sentence.\n",
"\n",
"So, to make it more concrete: our goal is going to be to convert each word into a vector that _somehow_ gives us information about all the sentences in which it appears.\n",
"\n",
"Let's first define a list of sentences, that we can work with:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"sentences = [\n",
" 'The owner put the dog on a leash for taking a walk outside.',\n",
" 'In the park recently, a dog was barking.',\n",
" 'The new tiger was the attraction of the zoo.',\n",
" 'The cat didn\\'t care about its owner.',\n",
" 'To the surprise of the dog, the cat was not in the mood for playing.',\n",
" 'Many kids were playing in the park.',\n",
" 'There is only one park in this ugly city.',\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Preprocessing\n",
"\n",
"Our first step is going to be to tokenize them, in the same way as we did it before:"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [],
"source": [
"sentences = [nltk.word_tokenize(s) for s in sentences]"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['The', 'owner', 'put', 'the', 'dog', 'on', 'a', 'leash', 'for', 'taking', 'a', 'walk', 'outside', '.']\n",
"['In', 'the', 'park', 'recently', ',', 'a', 'dog', 'was', 'barking', '.']\n",
"['The', 'new', 'tiger', 'was', 'the', 'attraction', 'of', 'the', 'zoo', '.']\n",
"['The', 'cat', 'did', \"n't\", 'care', 'about', 'its', 'owner', '.']\n",
"['To', 'the', 'surprise', 'of', 'the', 'dog', ',', 'the', 'cat', 'was', 'not', 'in', 'the', 'mood', 'for', 'playing', '.']\n",
"['Many', 'kids', 'were', 'playing', 'in', 'the', 'park', '.']\n",
"['There', 'is', 'only', 'one', 'park', 'in', 'this', 'ugly', 'city', '.']\n"
]
}
],
"source": [
"# Let's just see what we've got with the previous line\n",
"for s in sentences:\n",
" print(s)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our next step is going to be to filter out things that do not look like words (since, well, they will probably not contribute a lot to the meanings of the words). In serious applications, we may want to something complicated here. In our example, we will just remove any token that is not composed entirely by letters. (notice that this would eliminate words like `n't` or `M.Sc.`)"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [],
"source": [
"new_sentences = []\n",
"\n",
"for s in sentences:\n",
" # Creates a new list of tokens keeping only tokens containing exclusively letters\n",
" new_s = [w for w in s if re.match('[\\w]+', w)]\n",
" # Inserts the new list (i.e., `new_s`) in the new_sentences list\n",
" new_sentences.append(new_s)"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['The', 'owner', 'put', 'the', 'dog', 'on', 'a', 'leash', 'for', 'taking', 'a', 'walk', 'outside']\n",
"['In', 'the', 'park', 'recently', 'a', 'dog', 'was', 'barking']\n",
"['The', 'new', 'tiger', 'was', 'the', 'attraction', 'of', 'the', 'zoo']\n",
"['The', 'cat', 'did', \"n't\", 'care', 'about', 'its', 'owner']\n",
"['To', 'the', 'surprise', 'of', 'the', 'dog', 'the', 'cat', 'was', 'not', 'in', 'the', 'mood', 'for', 'playing']\n",
"['Many', 'kids', 'were', 'playing', 'in', 'the', 'park']\n",
"['There', 'is', 'only', 'one', 'park', 'in', 'this', 'ugly', 'city']\n"
]
}
],
"source": [
"sentences = new_sentences\n",
"\n",
"for s in new_sentences:\n",
" print(s)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we want to uncapitalize those capital letters:"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {},
"outputs": [],
"source": [
"new_sentences = []\n",
"\n",
"for s in sentences:\n",
" # Creates a new list of tokens, uncapitalizing everything\n",
" new_s = [w.lower() for w in s]\n",
" # Inserts the new list (i.e., `new_s`) in the new_sentences list\n",
" new_sentences.append(new_s)"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['the', 'owner', 'put', 'the', 'dog', 'on', 'a', 'leash', 'for', 'taking', 'a', 'walk', 'outside']\n",
"['in', 'the', 'park', 'recently', 'a', 'dog', 'was', 'barking']\n",
"['the', 'new', 'tiger', 'was', 'the', 'attraction', 'of', 'the', 'zoo']\n",
"['the', 'cat', 'did', \"n't\", 'care', 'about', 'its', 'owner']\n",
"['to', 'the', 'surprise', 'of', 'the', 'dog', 'the', 'cat', 'was', 'not', 'in', 'the', 'mood', 'for', 'playing']\n",
"['many', 'kids', 'were', 'playing', 'in', 'the', 'park']\n",
"['there', 'is', 'only', 'one', 'park', 'in', 'this', 'ugly', 'city']\n"
]
}
],
"source": [
"sentences = new_sentences\n",
"\n",
"for s in sentences:\n",
" print(s)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Calculating co-occurrences between `dog` and other words\n",
"\n",
"Let's focus on a single word, `dog`, to see what it would take to represent the context around that single word. Since we defined (for this example) \"context\" as \"the sentence in which the word appears\", we want to go through all our sentences and only keep those that contain the word `dog`:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['the', 'owner', 'put', 'the', 'dog', 'on', 'a', 'leash', 'for', 'taking', 'a', 'walk', 'outside']\n",
"['in', 'the', 'park', 'recently', 'a', 'dog', 'was', 'barking']\n",
"['to', 'the', 'surprise', 'of', 'the', 'dog', 'the', 'cat', 'was', 'not', 'in', 'the', 'mood', 'for', 'playing']\n"
]
}
],
"source": [
"# Get all sentences with dog\n",
"sentences_with_dog = [s for s in sentences if 'dog' in s]\n",
"\n",
"for s in sentences_with_dog:\n",
" print(s)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our Corpus Linguistics class, we saw how we could count the frequencies of the words in a corpus using a `FreqDist` object. We will use it again here. Let's first initialize a `FreqDist` object without any data:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"fd_dog = nltk.FreqDist()"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"FreqDist({})"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fd_dog"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `FreqDist` object has a function `update()` that receives a list of tokens and includes them in its countings. We'll use it to count the occurrences of all the words in our three sentences.\n",
"\n",
"The problem is: we do not want to count the occurrences of `dog`. After all, we know that `dog` occurs in all of these sentences. So we will first remove the word dog from each of the sentence using the same kind of code we used before:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"new_sentences_with_dog = []\n",
"\n",
"for s in sentences_with_dog:\n",
" # Creates a new list of tokens, without dog\n",
" new_s = [w for w in s if w != 'dog']\n",
" # Inserts the new list (i.e., `new_s`) in the sentences_with_dog list\n",
" new_sentences_with_dog.append(new_s)\n",
"\n",
"sentences_with_dog = new_sentences_with_dog"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['the', 'owner', 'put', 'the', 'on', 'a', 'leash', 'for', 'taking', 'a', 'walk', 'outside']\n",
"['in', 'the', 'park', 'recently', 'a', 'was', 'barking']\n",
"['to', 'the', 'surprise', 'of', 'the', 'the', 'cat', 'was', 'not', 'in', 'the', 'mood', 'for', 'playing']\n"
]
}
],
"source": [
"for s in sentences_with_dog:\n",
" print(s)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can finally use the `update` function of the `FreqDist` object:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"for s in sentences_with_dog:\n",
" fd_dog.update(s)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see what we've got:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"dog\n",
" dict_items([('the', 7), ('owner', 1), ('put', 1), ('on', 1), ('a', 3), ('leash', 1), ('for', 2), ('taking', 1), ('walk', 1), ('outside', 1), ('in', 2), ('park', 1), ('recently', 1), ('was', 2), ('barking', 1), ('to', 1), ('surprise', 1), ('of', 1), ('cat', 1), ('not', 1), ('mood', 1), ('playing', 1)])\n"
]
}
],
"source": [
"print('dog\\n', fd_dog.items())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Doing the same with `park` (in less code)\n",
"\n",
"Of course, we could the exact same thing for another word, say `park`:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"park\n",
" dict_items([('in', 3), ('the', 2), ('recently', 1), ('a', 1), ('dog', 1), ('was', 1), ('barking', 1), ('many', 1), ('kids', 1), ('were', 1), ('playing', 1), ('there', 1), ('is', 1), ('only', 1), ('one', 1), ('this', 1), ('ugly', 1), ('city', 1)])\n"
]
}
],
"source": [
"fd_park = nltk.FreqDist()\n",
"\n",
"# Iterates through all sentences that contain 'park'...\n",
"for s in [s for s in sentences if 'park' in s]:\n",
" # ... updating the `fd_park` variable with the counts of all words except 'park'\n",
" fd_park.update([w for w in s if w != 'park'])\n",
"\n",
"# Prints the result\n",
"print('park\\n', fd_park.items())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Representing the context of a list of words of interest\n",
"\n",
"Now we want to make our example a little more complicated and create code that would do the exact same as we did for `dog` or `park` above, only for a longer list of words. Let's call them our \"words of interest\". Our goal will be to represent the context of (i.e., create word vectors for) each one of them."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"# Let's define a set of words of interest\n",
"words_of_interest = ['dog', 'park', 'owner', 'tiger', 'cat', 'a', 'was']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, we want to initialize one `FreqDist` object for each one of our words. For that, we will use a dictionary. If you don't remember how dictionaries work, take a look at the Python tutorial of the second class. The `for word in words_of_interest` syntax below is analogous to the list comprehension syntax used for lists."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'dog': FreqDist({}),\n",
" 'park': FreqDist({}),\n",
" 'owner': FreqDist({}),\n",
" 'tiger': FreqDist({}),\n",
" 'cat': FreqDist({}),\n",
" 'a': FreqDist({}),\n",
" 'was': FreqDist({})}"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# We initialize empty FreqDist\n",
"co_occurrences = {word: nltk.FreqDist() for word in words_of_interest}\n",
"co_occurrences"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we will adapt the code we used for `park` to run for several words. We basically \"wrap\" it with a new for loop, iterating through each of our words of interest.\n",
"\n",
"_(this may be a little complicated. Try playing a little with the code here to understand what is going on)_"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
"for w_of_interest in words_of_interest:\n",
" # Iterates through all sentences that contain 'w_of_interest'...\n",
" for s in [s for s in sentences if w_of_interest in s]:\n",
" # ... updating the `FreqDist` object associated with 'w_of_interest' with the counts\n",
" # of all words except the 'w_of_interest'\n",
" co_occurrences[w_of_interest].update([w for w in s if w != w_of_interest])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's print what we've got, to see what it looks like."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"dog \n",
" dict_items([('the', 7), ('owner', 1), ('put', 1), ('on', 1), ('a', 3), ('leash', 1), ('for', 2), ('taking', 1), ('walk', 1), ('outside', 1), ('in', 2), ('park', 1), ('recently', 1), ('was', 2), ('barking', 1), ('to', 1), ('surprise', 1), ('of', 1), ('cat', 1), ('not', 1), ('mood', 1), ('playing', 1)]) \n",
"--\n",
"\n",
"park \n",
" dict_items([('in', 3), ('the', 2), ('recently', 1), ('a', 1), ('dog', 1), ('was', 1), ('barking', 1), ('many', 1), ('kids', 1), ('were', 1), ('playing', 1), ('there', 1), ('is', 1), ('only', 1), ('one', 1), ('this', 1), ('ugly', 1), ('city', 1)]) \n",
"--\n",
"\n",
"owner \n",
" dict_items([('the', 3), ('put', 1), ('dog', 1), ('on', 1), ('a', 2), ('leash', 1), ('for', 1), ('taking', 1), ('walk', 1), ('outside', 1), ('cat', 1), ('did', 1), (\"n't\", 1), ('care', 1), ('about', 1), ('its', 1)]) \n",
"--\n",
"\n",
"tiger \n",
" dict_items([('the', 3), ('new', 1), ('was', 1), ('attraction', 1), ('of', 1), ('zoo', 1)]) \n",
"--\n",
"\n",
"cat \n",
" dict_items([('the', 5), ('did', 1), (\"n't\", 1), ('care', 1), ('about', 1), ('its', 1), ('owner', 1), ('to', 1), ('surprise', 1), ('of', 1), ('dog', 1), ('was', 1), ('not', 1), ('in', 1), ('mood', 1), ('for', 1), ('playing', 1)]) \n",
"--\n",
"\n",
"a \n",
" dict_items([('the', 3), ('owner', 1), ('put', 1), ('dog', 2), ('on', 1), ('leash', 1), ('for', 1), ('taking', 1), ('walk', 1), ('outside', 1), ('in', 1), ('park', 1), ('recently', 1), ('was', 1), ('barking', 1)]) \n",
"--\n",
"\n",
"was \n",
" dict_items([('in', 2), ('the', 8), ('park', 1), ('recently', 1), ('a', 1), ('dog', 2), ('barking', 1), ('new', 1), ('tiger', 1), ('attraction', 1), ('of', 2), ('zoo', 1), ('to', 1), ('surprise', 1), ('cat', 1), ('not', 1), ('mood', 1), ('for', 1), ('playing', 1)]) \n",
"--\n",
"\n"
]
}
],
"source": [
"for w_of_interest in words_of_interest:\n",
" print(w_of_interest, \"\\n\", co_occurrences[w_of_interest].items(), \"\\n--\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After we've done this, we can have a look at how often each of our words of interest co-occur with any of the other words in our sentences. Let's try this with `dog`, to see if we get the exact same results as before:"
]
},
{
"cell_type": "code",
"execution_count": 143,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"dict_items([('the', 7), ('owner', 1), ('put', 1), ('on', 1), ('a', 3), ('leash', 1), ('for', 2), ('taking', 1), ('walk', 1), ('outside', 1), ('in', 2), ('park', 1), ('recently', 1), ('was', 2), ('barking', 1), ('to', 1), ('surprise', 1), ('of', 1), ('cat', 1), ('not', 1), ('mood', 1), ('playing', 1)])\n"
]
}
],
"source": [
"# Now we can have a look at how often each word co-occurred\n",
"# with, say, 'dog'\n",
"print(co_occurrences['dog'].items())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### From co-occurrence counts to vectors\n",
"\n",
"The current format of our counts is just too cumbersome. As already mentioned several times, we want vectors, but what we currently have is just a list containing pairs $(word, count)$. Notice that the list has a variable length:"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"22"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Length of the list associated with the word dog\n",
"len(co_occurrences['dog'].items())"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"18"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Length of the list associated with the word park\n",
"len(co_occurrences['park'].items())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is not very handy. Transforming this list into a vector, however, is not super hard. What you usually do is to fix a vocabulary for the context words. Then, for each word $w$ of our list words of interest, you count the co-occurrences of $w$ with each of the words in the vocabulary. We ignore the words that are not in the vocabulary.\n",
"\n",
"Having a vocabulary will cause our \"list of counts\" to always have the same number of elements. Hence, the whole data can be stored and displayed as a matrix.\n",
"\n",
"Let's do an example. Let's start by getting all the words that are \"relevant\" to us, i.e., the words that co-occur with any of our words of interest:"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"['the',\n",
" 'a',\n",
" 'for',\n",
" 'in',\n",
" 'was',\n",
" 'owner',\n",
" 'put',\n",
" 'on',\n",
" 'leash',\n",
" 'taking',\n",
" 'walk',\n",
" 'outside',\n",
" 'park',\n",
" 'recently',\n",
" 'barking',\n",
" 'to',\n",
" 'surprise',\n",
" 'of',\n",
" 'cat',\n",
" 'not',\n",
" 'mood',\n",
" 'playing',\n",
" 'in',\n",
" 'the',\n",
" 'recently',\n",
" 'a',\n",
" 'dog',\n",
" 'was',\n",
" 'barking',\n",
" 'many',\n",
" 'kids',\n",
" 'were',\n",
" 'playing',\n",
" 'there',\n",
" 'is',\n",
" 'only',\n",
" 'one',\n",
" 'this',\n",
" 'ugly',\n",
" 'city',\n",
" 'the',\n",
" 'a',\n",
" 'put',\n",
" 'dog',\n",
" 'on',\n",
" 'leash',\n",
" 'for',\n",
" 'taking',\n",
" 'walk',\n",
" 'outside',\n",
" 'cat',\n",
" 'did',\n",
" \"n't\",\n",
" 'care',\n",
" 'about',\n",
" 'its',\n",
" 'the',\n",
" 'new',\n",
" 'was',\n",
" 'attraction',\n",
" 'of',\n",
" 'zoo',\n",
" 'the',\n",
" 'did',\n",
" \"n't\",\n",
" 'care',\n",
" 'about',\n",
" 'its',\n",
" 'owner',\n",
" 'to',\n",
" 'surprise',\n",
" 'of',\n",
" 'dog',\n",
" 'was',\n",
" 'not',\n",
" 'in',\n",
" 'mood',\n",
" 'for',\n",
" 'playing',\n",
" 'the',\n",
" 'dog',\n",
" 'owner',\n",
" 'put',\n",
" 'on',\n",
" 'leash',\n",
" 'for',\n",
" 'taking',\n",
" 'walk',\n",
" 'outside',\n",
" 'in',\n",
" 'park',\n",
" 'recently',\n",
" 'was',\n",
" 'barking',\n",
" 'the',\n",
" 'in',\n",
" 'dog',\n",
" 'of',\n",
" 'park',\n",
" 'recently',\n",
" 'a',\n",
" 'barking',\n",
" 'new',\n",
" 'tiger',\n",
" 'attraction',\n",
" 'zoo',\n",
" 'to',\n",
" 'surprise',\n",
" 'cat',\n",
" 'not',\n",
" 'mood',\n",
" 'for',\n",
" 'playing']"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# First we get all words that co-occur with each of our words of interest.\n",
"# This is done with the following line.\n",
"#\n",
"# The first for-loop gets each of our \"words of interest\"\n",
"# The second for-loop gets each of the words that co-occur with them\n",
"all_words = [w for word in co_occurrences\n",
" for w in co_occurrences[word]]\n",
"\n",
"all_words\n",
"\n",
"# If this is not clear, try running\n",
"#\n",
"# for word in co_occurrences:\n",
"# print(word)\n",
"# \n",
"# , which will give you each of our words of interest. Then let's\n",
"# fixate into one of our words of interest (say, 'dog'), and run\n",
"# the next for loop. This will give all the words that co-occur\n",
"# with 'dog':\n",
"#\n",
"# for w in co_occurrences['dog']:\n",
"# print(w)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next step is similar to what we did when we created a vocabulary: we filter our set of words to get only the most common words. Calling the `most_common` function, however, provides pairs $(word, count)$. Since we want to create a vocabulary, we are only interested in the words. Therefore, we remove the counts.\n",
"\n",
"To try to make things concrete, we'll create a very small vocabulary, containing only 10 words."
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[('the', 7),\n",
" ('for', 5),\n",
" ('in', 5),\n",
" ('was', 5),\n",
" ('dog', 5),\n",
" ('a', 4),\n",
" ('recently', 4),\n",
" ('barking', 4),\n",
" ('of', 4),\n",
" ('playing', 4)]"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"most_commons = nltk.FreqDist(all_words).most_common(10)\n",
"most_commons"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"['the', 'for', 'in', 'was', 'dog', 'a', 'recently', 'barking', 'of', 'playing']"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# And then we get rid of the numbers (the frequencies)\n",
"vocab = [c for c,count in most_commons]\n",
"vocab"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok... now we have a vocabulary containing all the words that appear the most along with our words of interest. This vocabulary has a length. The rest of it is very similar to our Bag of Words representation, but instead of counting the occurrences of words in a sentence, we are counting the co-ocurrences (i.e., the appearances in the same sentence) of the words with our words of interest."
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[7, 2, 2, 2, 0, 3, 1, 1, 1, 1],\n",
" [2, 0, 3, 1, 1, 1, 1, 1, 0, 1],\n",
" [3, 1, 0, 0, 1, 2, 0, 0, 0, 0],\n",
" [3, 0, 0, 1, 0, 0, 0, 0, 1, 0],\n",
" [5, 1, 1, 1, 1, 0, 0, 0, 1, 1],\n",
" [3, 1, 1, 1, 2, 0, 1, 1, 0, 0],\n",
" [8, 1, 2, 0, 2, 1, 1, 1, 2, 1]])"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Now, the `co_matrix` contains a matrix with the following format:\n",
"# * rows: each of our words of interest ('dog', 'park', 'owner', ...)\n",
"# * columns: each of our `context` words (the most common words)\n",
"co_matrix = np.array([[co_occurrences[word][ctx] for ctx in vocab] for word in words_of_interest])\n",
"co_matrix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the code below, we create a function that shows this in a nicer format, that hopefully more intuitive. See if you can understand what the functioin is doing:"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [],
"source": [
"def show_co_matrix(mat, words, vocab, max_vocab_size=None):\n",
" if max_vocab_size:\n",
" print(tabulate([[word]+list(mat[i,:max_vocab_size]) for i,word in enumerate(words)],\n",
" headers=['word \\ context']+vocab[:max_vocab_size]))\n",
" else:\n",
" print(tabulate([[word]+list(mat[i]) for i,word in enumerate(words)],\n",
" headers=['word \\ context']+vocab))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we use the function to see our data in a nice way:"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"word \\ context the for in was dog a recently barking of playing\n",
"---------------- ----- ----- ---- ----- ----- --- ---------- --------- ---- ---------\n",
"dog 7 2 2 2 0 3 1 1 1 1\n",
"park 2 0 3 1 1 1 1 1 0 1\n",
"owner 3 1 0 0 1 2 0 0 0 0\n",
"tiger 3 0 0 1 0 0 0 0 1 0\n",
"cat 5 1 1 1 1 0 0 0 1 1\n",
"a 3 1 1 1 2 0 1 1 0 0\n",
"was 8 1 2 0 2 1 1 1 2 1\n"
]
}
],
"source": [
"show_co_matrix(co_matrix, words_of_interest, vocab, max_vocab_size=10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"This is what we call a **co-occurrence matrix**. If you find that the coding above is hard to follow, you may want to watch [this](https://www.youtube.com/watch?v=-F-PlgX8GcY) and [this](https://www.youtube.com/watch?v=Qy-Lq9Ae2KM), in which a person calculates the co-occurences by hand (I don't really love these videos, but they are the only minimally decent video I've found on the topic)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stop word Removal\n",
"\n",
"The notion _stop word_ is a bit fuzzy, but you can see it as a _frequent word_ that is assumed to be _irrelevant_ to the task at hand. In our case, the idea is that these words would appear in almost any sentence, and therefore they wouldn't contribute a lot of information about the context.\n",
"\n",
"Typically the most common words are considered to be stop words.\n",
"\n",
"Luckily NLTK comes with a list of stop words that we can use:"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"{'sebisanya',\n",
" 'hubiésemos',\n",
" 'houveria',\n",
" 'terkira',\n",
" 'foram',\n",
" \"you've\",\n",
" 'štirinajstega',\n",
" 'mint',\n",
" 'memperlihatkan',\n",
" 'nismo',\n",
" 'osmo',\n",
" 'dari',\n",
" 'triintridesetim',\n",
" 'stesse',\n",
" 'mu',\n",
" 'usai',\n",
" 'vsakogar',\n",
" 'persoalan',\n",
" 'hab',\n",
" 'валекин ',\n",
" 'морт',\n",
" 'masih',\n",
" 'и',\n",
" 'štiriindvajsetim',\n",
" 'with',\n",
" 'multi',\n",
" 'osemdesetega',\n",
" 'ellos',\n",
" 'себя',\n",
" 'әлденеше',\n",
" 'vsakršno',\n",
" 'enajstima',\n",
" 'enkraten',\n",
" 'esses',\n",
" 'tretjima',\n",
" 'o',\n",
" 'noilta',\n",
" 'nekaka',\n",
" 'sesuatunya',\n",
" 'deiner',\n",
" 'pukul',\n",
" 'houverem',\n",
" 'seusai',\n",
" 'mojimi',\n",
" 'facciano',\n",
" 'stanno',\n",
" 'kakšni',\n",
" 'maraš',\n",
" 'smelo',\n",
" 'ujarnya',\n",
" 'terus',\n",
" 'kedua',\n",
" 'यहाँसम्म',\n",
" 'seluruh',\n",
" 'shouldn',\n",
" 'după',\n",
" 'nəhayət',\n",
" 'verjetno',\n",
" 'पटक',\n",
" 'izpod',\n",
" 'petnajste',\n",
" 'deyil',\n",
" 'lesz',\n",
" 'sunt',\n",
" 'jim',\n",
" 'untuk',\n",
" 'isso',\n",
" 'هلا',\n",
" 'ed',\n",
" 'εκεινο',\n",
" 'itulah',\n",
" 'эй',\n",
" 'dovolite',\n",
" 'eres',\n",
" 'dvanajstih',\n",
" 'muito',\n",
" 'ert',\n",
" 'тот',\n",
" 'dvakratni',\n",
" 'замон',\n",
" 'ले',\n",
" 'le-tema',\n",
" 'lagi',\n",
" 'pe',\n",
" 'hänessä',\n",
" 'mengingatkan',\n",
" 'warst',\n",
" 'ӯббо',\n",
" 'egyre',\n",
" 'bodimo',\n",
" 'ataukah',\n",
" 'для',\n",
" 'ба',\n",
" 'eûmes',\n",
" 'khususnya',\n",
" 'don',\n",
" 'निम्न',\n",
" 'about',\n",
" 'zich',\n",
" 'nossa',\n",
" 'până',\n",
" 'هاك',\n",
" 'sintiendo',\n",
" 'бізбен',\n",
" 'सबै',\n",
" 'stoterem',\n",
" 'können',\n",
" 'fata',\n",
" 'sí',\n",
" 'iar',\n",
" 'poate',\n",
" 'τα',\n",
" 'kelihatan',\n",
" 'egyetlen',\n",
" 'sebuah',\n",
" 'meiner',\n",
" 'enimi',\n",
" 'تين',\n",
" 'voi',\n",
" 'desetem',\n",
" 'svojih',\n",
" 'sementara',\n",
" 'care',\n",
" 'spričo',\n",
" 'опять',\n",
" 'šesto',\n",
" 'svojem',\n",
" 'desi',\n",
" 'naših',\n",
" 'pertama',\n",
" 'dvojnimi',\n",
" 'τὸ',\n",
" 'moja',\n",
" 'менен',\n",
" 'sendirian',\n",
" 'ama',\n",
" 'trojnem',\n",
" 'biasanya',\n",
" 'akhiri',\n",
" 'ž',\n",
" 'estuvieses',\n",
" 'sedemindvajsetem',\n",
" 'estoy',\n",
" 'онда',\n",
" 'ο',\n",
" 'dvainšestdeset',\n",
" 'aki',\n",
" 'petindvajsete',\n",
" 'unos',\n",
" 'فيها',\n",
" 'nostri',\n",
" 'yox',\n",
" 'sama',\n",
" 'ste',\n",
" 'هيهات',\n",
" 'أنتما',\n",
" 'чун-ки',\n",
" 'intre',\n",
" 'petinosemdeset',\n",
" 'fossero',\n",
" 'ponjo',\n",
" 'štiridesetemu',\n",
" 'avendo',\n",
" 'enako',\n",
" 'mei',\n",
" 'sudah',\n",
" 'peti',\n",
" 'estada',\n",
" 'त',\n",
" 'le-to',\n",
" 'jenes',\n",
" 'šest',\n",
" 'tu',\n",
" 'setidaknya',\n",
" 'šeste',\n",
" 'kenessä',\n",
" 'osemnajsto',\n",
" 'нибудь',\n",
" 'هيا',\n",
" 'petintridesetim',\n",
" 'tai',\n",
" 'very',\n",
" 'va',\n",
" 'dovoljen',\n",
" 'асло',\n",
" 'petinštiridesetega',\n",
" 'қаңқ-қаңқ',\n",
" 'емес',\n",
" 'tristotim',\n",
" 'dahulu',\n",
" 'totusi',\n",
" 'fuimos',\n",
" 'καί',\n",
" 'tolikšnima',\n",
" 'josta',\n",
" 'бірдеме',\n",
" 'dilihat',\n",
" 'nekakšne',\n",
" 'kakršnimakoli',\n",
" 'között',\n",
" 'com',\n",
" 'osmih',\n",
" 'nekakim',\n",
" 'he',\n",
" 'moči',\n",
" 'لا',\n",
" 'eurent',\n",
" 'myself',\n",
" 'meissä',\n",
" 'οὐδ',\n",
" 'sai',\n",
" 'før',\n",
" 'želeti',\n",
" 'kakimi',\n",
" 'amelyek',\n",
" 'бірнеше',\n",
" 'während',\n",
" 'să',\n",
" 'sea',\n",
" 'jelas',\n",
" 'барои он ки',\n",
" 'saya',\n",
" 'sewaktu',\n",
" 'otuz',\n",
" 'tolik',\n",
" 'эх',\n",
" 'sənə',\n",
" 'mitt',\n",
" 'petinosemdeseti',\n",
" 'eller',\n",
" 'maka',\n",
" 'ati',\n",
" 'tambah',\n",
" 'degli',\n",
" 'menegaskan',\n",
" 'does',\n",
" 'demikianlah',\n",
" 'ain',\n",
" 'eens',\n",
" 'можно',\n",
" 'bodite',\n",
" 'hotiva',\n",
" 'tuve',\n",
" 'ses',\n",
" 'onim',\n",
" 'ҳм',\n",
" 'egy',\n",
" 'ponj',\n",
" 'nell',\n",
" 'jenem',\n",
" 'bunların',\n",
" 'mulanya',\n",
" 'vsakršen',\n",
" 'katerimi',\n",
" 'vagyok',\n",
" 'bilakah',\n",
" 'by',\n",
" 'le-tistih',\n",
" 'كل',\n",
" 'nerde',\n",
" 'manches',\n",
" 'but',\n",
" 'le-taki',\n",
" 'διὰ',\n",
" 'berapalah',\n",
" 'إليكم',\n",
" 'қайқаң-құйқаң',\n",
" 'noče',\n",
" 'enaka',\n",
" 'devetnajstemu',\n",
" 'ise',\n",
" 'esteve',\n",
" 'mungkinkah',\n",
" 'olin',\n",
" \"shan't\",\n",
" 'एउटै',\n",
" 'τὰς',\n",
" 'ničemur',\n",
" 'le-take',\n",
" 'secukupnya',\n",
" 'same',\n",
" 'tue',\n",
" 'era',\n",
" 'хоть',\n",
" 'qırx',\n",
" 'korleis',\n",
" 'sentido',\n",
" 'अन्तर्गत',\n",
" 'hvordan',\n",
" 'ins',\n",
" 'han',\n",
" 'sinulle',\n",
" 'pred',\n",
" 'ungkap',\n",
" 'å',\n",
" 'tandas',\n",
" 'tristotima',\n",
" 'dvojemu',\n",
" 'ohne',\n",
" 'لستما',\n",
" 'multe',\n",
" 'petnajsta',\n",
" 'deja',\n",
" 'eden',\n",
" 'petdesetih',\n",
" 'dimulailah',\n",
" 'कसैले',\n",
" 'petdesetimi',\n",
" 'avez',\n",
" 'sedemindvajseti',\n",
" 'keine',\n",
" 'jag',\n",
" 'هم',\n",
" 'no',\n",
" 'вақте ки',\n",
" 'аз афташ',\n",
" 'petintrideseta',\n",
" 'hotel',\n",
" 'suya',\n",
" 'siksi',\n",
" 'berlalu',\n",
" 'li',\n",
" 'sangatlah',\n",
" 'кәне',\n",
" 'noastre',\n",
" 'joiksi',\n",
" 'jadilah',\n",
" 'hanno',\n",
" 'petinosemdeseto',\n",
" 'oricît',\n",
" 'ذواتا',\n",
" 'petinštiridesete',\n",
" 'μη',\n",
" 'dvakratne',\n",
" 'ओठ',\n",
" 'petinštiridesetimi',\n",
" 'akik',\n",
" 'birşey',\n",
" 'यस्तो',\n",
" 'miért',\n",
" 'tretjemu',\n",
" 'del',\n",
" 'ilyenkor',\n",
" 'mulţi',\n",
" 'onsuzda',\n",
" 'наход ки',\n",
" 'stotih',\n",
" 'mistä',\n",
" 'saremmo',\n",
" 'naproti',\n",
" 'sebe',\n",
" 'memisalkan',\n",
" 'midva',\n",
" 'τό',\n",
" 'зеро',\n",
" 'menyebutkan',\n",
" 'pet',\n",
" 'kira',\n",
" 'căci',\n",
" 'takšen',\n",
" 'weren',\n",
" 'multă',\n",
" 'acela',\n",
" 'deira',\n",
" 'dulu',\n",
" 'keistä',\n",
" 'want',\n",
" 'құрау-құрау',\n",
" 'لدى',\n",
" 'faiz',\n",
" 'kala',\n",
" 'joilta',\n",
" 'замоно',\n",
" 'atat',\n",
" 'deseter',\n",
" 'εἴτε',\n",
" 'ditujukan',\n",
" 'ὅπερ',\n",
" 'sinusta',\n",
" 'барша',\n",
" 'bodita',\n",
" 'namreč',\n",
" 'जसबाट',\n",
" 'estadas',\n",
" 'sedmo',\n",
" 'vart',\n",
" 'لهن',\n",
" 'berkali-kali',\n",
" 'muy',\n",
" 'гар-чи',\n",
" 'stesti',\n",
" 'हुन',\n",
" 'inginkan',\n",
" 'eni',\n",
" 'will',\n",
" 'nisem',\n",
" 'آه',\n",
" 'tiste',\n",
" 'ҳмм',\n",
" 'τον',\n",
" 'далаң-далаң',\n",
" 'ezen',\n",
" 'eram',\n",
" 'nami',\n",
" 'vrh',\n",
" 'marale',\n",
" 'nečemu',\n",
" 'dua',\n",
" 'sale',\n",
" 'ander',\n",
" 'ешқайсы',\n",
" 'οσο',\n",
" 'takšni',\n",
" 'enak',\n",
" 'тағы',\n",
" 'boleh',\n",
" 'यसपछि',\n",
" 'čigava',\n",
" 'deinen',\n",
" 'dvajsete',\n",
" 'inikah',\n",
" 'ай',\n",
" 'meyakinkan',\n",
" 'seas',\n",
" 'triintridesetega',\n",
" 'temuintemu',\n",
" 'охир',\n",
" 'számára',\n",
" 'našega',\n",
" 'kenenä',\n",
" 'hänet',\n",
" 'apoi',\n",
" 'zijn',\n",
" 'sela',\n",
" 'dasselbe',\n",
" 'šestintridesetih',\n",
" 'durch',\n",
" 'meinem',\n",
" 'morava',\n",
" 'onedve',\n",
" 'हरेकog',\n",
" 'мой',\n",
" 'ведь',\n",
" 'nyt',\n",
" 'enem',\n",
" 'zase',\n",
" 'suo',\n",
" 'بما',\n",
" 'कतै',\n",
" 'peta',\n",
" 'siete',\n",
" 'vsem',\n",
" 'bəzən',\n",
" 'atitea',\n",
" 'semata',\n",
" 'morete',\n",
" 'takšnem',\n",
" 'that',\n",
" 'dieses',\n",
" 'мен',\n",
" 'hayas',\n",
" 'منها',\n",
" 'всех',\n",
" 'estado',\n",
" 'fueses',\n",
" 'estarán',\n",
" 'e',\n",
" 'kolikima',\n",
" 'enkratnemu',\n",
" 'mednju',\n",
" 'meski',\n",
" 'у',\n",
" 'oseminštiridesetih',\n",
" 'skoznjo',\n",
" 'moramo',\n",
" 'zo',\n",
" 'estivemos',\n",
" 'morala',\n",
" 'skozte',\n",
" 'tridesetim',\n",
" 'das',\n",
" 'τὸν',\n",
" 'sepantasnya',\n",
" 'τῶν',\n",
" 'бұл',\n",
" 'tuyos',\n",
" 'олар',\n",
" 'onlardan',\n",
" 'diperlukannya',\n",
" 'निम्नानुसार',\n",
" 'naokoli',\n",
" 'etmək',\n",
" 'yaitu',\n",
" 'tvojem',\n",
" 'zmoreva',\n",
" 'dvakratno',\n",
" 'denne',\n",
" 'starò',\n",
" 'вдруг',\n",
" 'zapored',\n",
" 'minhas',\n",
" 'bertutur',\n",
" 'melainkan',\n",
" 'لم',\n",
" 'moralo',\n",
" 'kakršne',\n",
" 'marsikoga',\n",
" 'هذا',\n",
" 'tengáis',\n",
" 'εκεινοσ',\n",
" 'ποῦ',\n",
" 'هنالك',\n",
" 'tretjimi',\n",
" 'istima',\n",
" 'begitu',\n",
" 'cele',\n",
" 'sekiranya',\n",
" 'lhes',\n",
" 'what',\n",
" 'ذلك',\n",
" 'pak',\n",
" 'какой',\n",
" 'сенен\\tонан',\n",
" 'ισωσ',\n",
" 'vaš',\n",
" 'temale',\n",
" 'onimi',\n",
" 'zal',\n",
" 'δαίσ',\n",
" 'τησ',\n",
" 'altfel',\n",
" 'padahal',\n",
" 'सक्छ',\n",
" 'acei',\n",
" 'mojih',\n",
" 'dituturkannya',\n",
" 'sedeminšestdesetimi',\n",
" 'kalau',\n",
" 'deim',\n",
" 'ott',\n",
" 'serez',\n",
" 'أيها',\n",
" 'bung',\n",
" 'zu',\n",
" 'पनि',\n",
" 'ваҳ',\n",
" 'marate',\n",
" 'nagy',\n",
" 'впрочем',\n",
" 'ben',\n",
" 'уже',\n",
" 'szerint',\n",
" 'tisočerim',\n",
" 'nate',\n",
" 'avrebbero',\n",
" 'sebegitu',\n",
" 'было',\n",
" 'houveríamos',\n",
" 'marsikaterimi',\n",
" 'weiter',\n",
" 'takemu',\n",
" 'δαὶ',\n",
" 'hendaklah',\n",
" 'qu',\n",
" 'fummo',\n",
" 'vannak',\n",
" 'cite',\n",
" 'әлдене',\n",
" 'tocmai',\n",
" 'le-tiste',\n",
" 'donde',\n",
" 'siitä',\n",
" 'kva',\n",
" 'də',\n",
" 'ayant',\n",
" 'tutti',\n",
" 'сіздің',\n",
" 'sarò',\n",
" 'deseterega',\n",
" 'ол',\n",
" 'εκεινοι',\n",
" 'petinosemdesetemu',\n",
" 'sebabnya',\n",
" 'hotita',\n",
" 'joksi',\n",
" 'temile',\n",
" 'njihovega',\n",
" 'min',\n",
" 'nikomur',\n",
" 'uneori',\n",
" 'дегенмен',\n",
" 'बीचमा',\n",
" 'trideset',\n",
" 'erais',\n",
" 'šestindvajset',\n",
" 'və',\n",
" 'tolikšnih',\n",
" 'हुन्छ',\n",
" 'boste',\n",
" 'қабл',\n",
" 'yoxdur',\n",
" 'mengatakan',\n",
" 'kogarkoli',\n",
" 'तपाई',\n",
" 'bulan',\n",
" 'kateremu',\n",
" 'numa',\n",
" 'etmə',\n",
" 'najinem',\n",
" 'katerem',\n",
" 'sebaik-baiknya',\n",
" 'nerede',\n",
" 'eues',\n",
" 'təəssüf',\n",
" 'diibaratkan',\n",
" 'desetima',\n",
" 'hendaknya',\n",
" 'faranno',\n",
" 'арнайы',\n",
" 'misalnya',\n",
" 'andere',\n",
" 'dă',\n",
" 'stotemu',\n",
" 'митың-митың',\n",
" 'prvih',\n",
" 'осындай',\n",
" 'ἐὰν',\n",
" 'alla',\n",
" 'njunima',\n",
" 'beginikah',\n",
" 'दुई',\n",
" 'uten',\n",
" 'petstotih',\n",
" 'petdeseto',\n",
" 'll',\n",
" 'nokor',\n",
" 'देखेको',\n",
" 'enakimi',\n",
" 'onder',\n",
" 'nekom',\n",
" 'waktunya',\n",
" 'con',\n",
" 'αυτων',\n",
" 'ena',\n",
" 'ἧς',\n",
" 'tristoti',\n",
" 'utolsó',\n",
" 'disebutkannya',\n",
" 'acel',\n",
" 'le-onemu',\n",
" 'तिर',\n",
" 'yours',\n",
" 'nekemu',\n",
" 'avesse',\n",
" 'ذلكم',\n",
" 'ayants',\n",
" 'auriez',\n",
" 'كليكما',\n",
" 'houve',\n",
" 'nimeni',\n",
" 'hubieron',\n",
" 'suatu',\n",
" 'vajinemu',\n",
" 'чуть',\n",
" 'le-tista',\n",
" 'obe',\n",
" 'kepadanya',\n",
" 'drugo',\n",
" 'भए',\n",
" 'zakaj',\n",
" 'بنا',\n",
" 'δέ',\n",
" 'takle',\n",
" 'το',\n",
" 'avem',\n",
" 'كيف',\n",
" 'olitte',\n",
" 'tampak',\n",
" 'қыңқ',\n",
" 'күрт',\n",
" 'tivera',\n",
" 'povrh',\n",
" 'marava',\n",
" 'porque',\n",
" 'trojega',\n",
" 'alles',\n",
" 'аз рӯи',\n",
" 'düz',\n",
" 'छु',\n",
" 'dvajsetem',\n",
" 'facevo',\n",
" 'hoti',\n",
" 'petinštiridesetem',\n",
" 'dvoji',\n",
" 'devetnajste',\n",
" 'starai',\n",
" 'том',\n",
" 'panjang',\n",
" 'mykje',\n",
" 'kolikimi',\n",
" 'былп',\n",
" 'devetdeseto',\n",
" 'dovoljeno',\n",
" 'борт',\n",
" 'notre',\n",
" 'euer',\n",
" 'estivessem',\n",
" 'diesen',\n",
" 'sebagaimana',\n",
" 'kolikšnemu',\n",
" 'takole',\n",
" 'néha',\n",
" 'nimic',\n",
" 'takale',\n",
" 'anderer',\n",
" 'sedemsto',\n",
" 'štirinajstih',\n",
" 'sebelumnya',\n",
" 'dvajsetimi',\n",
" 'faremmo',\n",
" 'någon',\n",
" 'eussiez',\n",
" 'hajamos',\n",
" 'bei',\n",
" 'cuma',\n",
" 'just',\n",
" 'tiver',\n",
" 'dvestoti',\n",
" 'manchen',\n",
" 'tistima',\n",
" 'petega',\n",
" 'd',\n",
" 'sampai',\n",
" 'njihovima',\n",
" 'tuvieses',\n",
" 'petinosemdesetima',\n",
" 'чем',\n",
" 'teve',\n",
" 'ἐν',\n",
" 'fost',\n",
" 'berakhir',\n",
" 'ولو',\n",
" 'осы',\n",
" 'tretja',\n",
" 'četrtemu',\n",
" 'eusses',\n",
" 'وإذ',\n",
" 'नत्र',\n",
" 'fra',\n",
" 'dvainšestdesetih',\n",
" 'mənə',\n",
" 'enaindvajseta',\n",
" 'kiranya',\n",
" 'inainte',\n",
" 'menjelaskan',\n",
" 'fussions',\n",
" 'lenni',\n",
" 'desetero',\n",
" 'där',\n",
" 'diberikan',\n",
" 'seinem',\n",
" 'بمن',\n",
" 'nobenima',\n",
" 'ca',\n",
" 'osemnajst',\n",
" 'menanyai',\n",
" 'dvajseto',\n",
" 'зеро ки',\n",
" 'vnovič',\n",
" 'than',\n",
" 'cam',\n",
" 'triinšestdesetih',\n",
" 'zoper',\n",
" 'enkratnima',\n",
" 'लागि',\n",
" 'careia',\n",
" 'لستن',\n",
" 'bagaikan',\n",
" 'unuia',\n",
" 'ela',\n",
" 'dit',\n",
" 'səksən',\n",
" 'тағыда',\n",
" 'тарс-тұрс',\n",
" 'apatah',\n",
" 'honom',\n",
" 'ана',\n",
" 'sana',\n",
" 'hers',\n",
" 'tengan',\n",
" 'таман',\n",
" 'nobena',\n",
" 'lîngă',\n",
" 'whom',\n",
" 'dirinya',\n",
" 'tolikšnim',\n",
" 'devetega',\n",
" 'njima',\n",
" 'si',\n",
" 'шояд',\n",
" 'veel',\n",
" 'vreun',\n",
" 'deires',\n",
" 'petdesetem',\n",
" 'біздің',\n",
" 'tisočera',\n",
" 'لسن',\n",
" 'nikakršna',\n",
" 'şu',\n",
" 'ё',\n",
" 'nobene',\n",
" 'hänellä',\n",
" 'sedemdesetega',\n",
" 'koga',\n",
" 'tuvo',\n",
" 'aus',\n",
" 'olur',\n",
" 'welcher',\n",
" 'dijelaskan',\n",
" 'dikarenakan',\n",
" 'petnajstima',\n",
" 'auraient',\n",
" 'amikor',\n",
" 'memihak',\n",
" 'above',\n",
" 'sedeminšestdesetim',\n",
" 'sur',\n",
" 'fossem',\n",
" 'сізге',\n",
" 'osemdesetimi',\n",
" 'habrán',\n",
" 'soll',\n",
" 'über',\n",
" 'tähän',\n",
" 'acestei',\n",
" 'trojem',\n",
" 'marajte',\n",
" 'यसो',\n",
" 'me',\n",
" 'koska',\n",
" 'var',\n",
" 'δαί',\n",
" 'tretjim',\n",
" 'also',\n",
" 'štirimi',\n",
" 'ἐφ',\n",
" 'kajne',\n",
" 'بي',\n",
" 'totuşi',\n",
" 'vil',\n",
" 'vsej',\n",
" 'altmış',\n",
" 'serás',\n",
" 'pas',\n",
" 'vostre',\n",
" 'será',\n",
" 'obema',\n",
" 'le-takimi',\n",
" 'اللاتي',\n",
" 'želiš',\n",
" 'dedi',\n",
" 'few',\n",
" 'jelasnya',\n",
" 'kenen',\n",
" 'нас',\n",
" 'otro',\n",
" 'while',\n",
" 'kinilah',\n",
" 'sedmimi',\n",
" 'que',\n",
" 'osmemu',\n",
" 'vom',\n",
" 'neradi',\n",
" 'stotega',\n",
" 'здесь',\n",
" 'τὰ',\n",
" 'njuno',\n",
" 'čemu',\n",
" 'sul',\n",
" 'тарбаң-тарбаң',\n",
" 'ἅμα',\n",
" 'stava',\n",
" 'unor',\n",
" 'bolehkah',\n",
" 'nekimi',\n",
" 'noille',\n",
" 'elas',\n",
" 'भित्री',\n",
" 'triindvajsetih',\n",
" 'dvaindevetdesete',\n",
" 'لولا',\n",
" 'awalnya',\n",
" 'батыр-бұтыр',\n",
" 'dno',\n",
" 'ὅ',\n",
" 'meus',\n",
" 'česar',\n",
" 'valamint',\n",
" 'kok',\n",
" 'drugim',\n",
" 'upp',\n",
" 'kurang',\n",
" 'estávamos',\n",
" 'मात्र',\n",
" 'μετά',\n",
" 'petnajsti',\n",
" 'devetdesete',\n",
" 'trojim',\n",
" 'diberikannya',\n",
" 'kamilah',\n",
" 'deze',\n",
" 'noget',\n",
" 'jeg',\n",
" 'neki',\n",
" 'smem',\n",
" 'tenía',\n",
" 'varit',\n",
" 'le-tistima',\n",
" 'deseterih',\n",
" 'kolikšnima',\n",
" 'sekalipun',\n",
" 'kdorkoli',\n",
" 'antes',\n",
" 'había',\n",
" 'õ',\n",
" 'drugačnega',\n",
" 'πως',\n",
" 'jeder',\n",
" 'berkenaan',\n",
" 'seveda',\n",
" 'vsemu',\n",
" 'гөрі',\n",
" 'να',\n",
" 'eussions',\n",
" 'üçün',\n",
" 'avessero',\n",
" 'enima',\n",
" 'heille',\n",
" 'hočeš',\n",
" 'जस्तोसुकै',\n",
" 'quem',\n",
" 'perlukah',\n",
" 'kako',\n",
" 'ned',\n",
" 'which',\n",
" 'during',\n",
" 'ihren',\n",
" 'sedeminpetdeset',\n",
" 'étantes',\n",
" 'poleg',\n",
" 'navkljub',\n",
" 'belumlah',\n",
" 'dovoljeni',\n",
" 'кейін',\n",
" 'dog',\n",
" 'doi',\n",
" 'le-takšnem',\n",
" 'пай-пай',\n",
" 'undeva',\n",
" 'le-tej',\n",
" 'til',\n",
" 'estuviese',\n",
" 'құрау',\n",
" 'está',\n",
" 'coi',\n",
" 'σύ',\n",
" 'seterusnya',\n",
" 'них',\n",
" 'эти',\n",
" 'dvakratna',\n",
" 'mar',\n",
" 'bilər',\n",
" 'als',\n",
" 'houveriam',\n",
" 'anderem',\n",
" 'devetimi',\n",
" 'ollut',\n",
" 'estaba',\n",
" 'estuviste',\n",
" 'betulkah',\n",
" 'tehle',\n",
" 'ho',\n",
" 'peter',\n",
" 'av',\n",
" 'из',\n",
" 'dovoli',\n",
" 'पाँचौं',\n",
" 'šestnajstemu',\n",
" 'waktu',\n",
" 'ditt',\n",
" 'marajo',\n",
" 'nuestra',\n",
" 'biri',\n",
" 'suyos',\n",
" 'yourselves',\n",
" 'mora',\n",
" 'бұндай',\n",
" 'нет',\n",
" 'zavoljo',\n",
" 'sabo',\n",
" ...}"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"stopwords = set(nltk.corpus.stopwords.words())\n",
"stopwords"
]
},
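{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that calling `stopwords.words()` with no argument returns the stop word lists for *every* language bundled with NLTK, which is why the set above mixes so many languages. If our corpus is English only, we can restrict the list by passing the language name (a small illustrative step, not required for what follows):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Restrict the stop word list to a single language\n",
"english_stopwords = set(nltk.corpus.stopwords.words('english'))\n",
"sorted(english_stopwords)[:10]"
]
},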
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To use this, we just need to add one more step to our \"preprocessing\" stage:"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [],
"source": [
"new_sentences = []\n",
"\n",
"for s in sentences:\n",
" # Creates a new list of tokens, keeping only those that are not stop words\n",
" new_s = [w for w in s if w not in stopwords]\n",
" # Inserts the new list (i.e., `new_s`) in the new_sentences list\n",
" new_sentences.append(new_s)"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[['owner', 'put', 'leash', 'taking', 'walk', 'outside'],\n",
" ['park', 'recently', 'barking'],\n",
" ['new', 'tiger', 'attraction', 'zoo'],\n",
" [\"n't\", 'owner'],\n",
" ['surprise', 'mood', 'playing'],\n",
" ['many', 'kids', 'playing', 'park'],\n",
" ['park', 'ugly', 'city']]"
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_sentences"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we would need to rerun all the other steps above to generate the co-occurrence matrix. The following function does that. See if you can understand what it is doing:"
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [],
"source": [
"# `compute_context_stuff` does exactly what we did above; we wrap\n",
"# the steps into a function so that we can reuse them later\n",
"def compute_context_stuff(sentences, words, remove_stopwords=False, vocab_size=10):\n",
" co_occurrences = {word: nltk.FreqDist() for word in words}\n",
"\n",
" for sentence in sentences:\n",
" for word in words:\n",
" if word in sentence:\n",
" co_occurrences[word].update([w.lower() for w in sentence\n",
" if re.match(r'[\\w]+', w) and w != word\n",
" # We remove stop words with this line!\n",
" # We only include a word in our list if it\n",
" # is not in the stopwords list\n",
" and (not remove_stopwords or w.lower() not in stopwords)])\n",
" \n",
" vocab = [c for c,count in nltk.FreqDist([w for word in co_occurrences\n",
" for w in co_occurrences[word]]).most_common(vocab_size)]\n",
"\n",
" co_matrix = np.array([[co_occurrences[word][ctx] for ctx in vocab] for word in words])\n",
" return vocab, co_matrix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we just run the function, passing our data, our words of interest, and our vocabulary size:"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [],
"source": [
"vocab, co_matrix = compute_context_stuff(sentences, words_of_interest, remove_stopwords=True, vocab_size=10)"
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"word \\ context recently barking playing owner put leash taking walk outside park\n",
"---------------- ---------- --------- --------- ------- ----- ------- -------- ------ --------- ------\n",
"dog 1 1 1 1 1 1 1 1 1 1\n",
"park 1 1 1 0 0 0 0 0 0 0\n",
"owner 0 0 0 0 1 1 1 1 1 0\n",
"tiger 0 0 0 0 0 0 0 0 0 0\n",
"cat 0 0 1 1 0 0 0 0 0 0\n",
"a 1 1 0 1 1 1 1 1 1 1\n",
"was 1 1 1 0 0 0 0 0 0 1\n"
]
}
],
"source": [
"show_co_matrix(co_matrix, words_of_interest, vocab)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Effect of Context\n",
"\n",
"In the entire discussion above, we defined meaning in terms of context, and we arbitrarily decided that our context was \"the entire sentence\". It should be quite clear, therefore, that changing what we use as context should have an impact on the vectors we obtain to represent the meaning of our words of interest. We already saw how removing stop words changes the context vectors that we obtain. Removing stop words can be seen as one way of modifying what we define as context.\n",
"\n",
"**Question**: How exactly does stop word removal affect the \"meaning\" we obtain? Which information is lost in this process and which nuances of meaning might not be accessible?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### The effect of window sizes\n",
"\n",
"We discussed above another way of defining context, namely, by defining a window around our word of interest and only considering the words that fall inside that window. The function below redefines our `compute_context_stuff` function to take a window size into account."
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {},
"outputs": [],
"source": [
"def compute_context_stuff(sentences, words, remove_stopwords=False, vocab_size=10, window_size=5):\n",
" co_occurrences = {word: nltk.FreqDist() for word in words}\n",
"\n",
" for sentence in sentences:\n",
" for word in words:\n",
" if word in sentence:\n",
" word_pos = sentence.index(word)\n",
" # Keep only tokens within window_size positions of the word; slices clamp\n",
" # at the sequence end, so no upper bound check is needed\n",
" co_occurrences[word].update([w.lower() for w in sentence[max(0, word_pos-window_size):word_pos+window_size+1]\n",
" if re.match(r'[\\w]+', w) and w != word\n",
" and (not remove_stopwords or w.lower() not in stopwords)])\n",
" \n",
" vocab = [c for c,count in nltk.FreqDist([w for word in co_occurrences\n",
" for w in co_occurrences[word]]).most_common(vocab_size)]\n",
" \n",
" co_matrix = np.array([[co_occurrences[word][ctx] for ctx in vocab] for word in words])\n",
" return vocab, co_matrix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at a few possible window sizes to see how the resulting co-occurrence matrices differ. We'll try window sizes 1, 3, and 6:"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"contexts for window size 1\n",
"\n",
"word \\ context the a one its new on recently dog tiger cat\n",
"---------------- ----- --- ----- ----- ----- ---- ---------- ----- ------- -----\n",
"dog 2 1 0 0 0 0 0 0 0 0\n",
"park 2 0 1 0 0 0 0 0 0 0\n",
"owner 1 0 0 1 0 0 0 0 0 0\n",
"tiger 0 0 0 0 1 0 0 0 0 0\n",
"cat 2 0 0 0 0 0 0 0 0 0\n",
"a 0 0 0 0 0 1 1 0 0 0\n",
"was 0 0 0 0 0 0 0 1 1 1\n",
"\n",
"contexts for window size 3\n",
"\n",
"word \\ context the recently was a dog put on park cat in\n",
"---------------- ----- ---------- ----- --- ----- ----- ---- ------ ----- ----\n",
"dog 3 1 1 2 0 1 1 1 1 0\n",
"park 2 1 0 1 0 0 0 0 0 3\n",
"owner 2 0 0 0 0 1 0 0 0 0\n",
"tiger 2 0 1 0 0 0 0 0 0 0\n",
"cat 3 0 1 0 1 0 0 0 0 0\n",
"a 2 1 1 0 2 0 1 1 0 0\n",
"was 3 1 0 1 2 0 0 0 1 1\n",
"\n",
"contexts for window size 6\n",
"\n",
"word \\ context the in was dog a recently of put on for\n",
"---------------- ----- ---- ----- ----- --- ---------- ---- ----- ---- -----\n",
"dog 6 2 2 0 2 1 1 1 1 1\n",
"park 2 3 1 1 1 1 0 0 0 0\n",
"owner 2 0 0 1 1 0 0 1 1 0\n",
"tiger 3 0 1 0 0 0 1 0 0 0\n",
"cat 5 1 1 1 0 0 1 0 0 0\n",
"a 3 1 1 2 0 1 0 1 1 1\n",
"was 7 2 0 2 1 1 2 0 0 1\n"
]
}
],
"source": [
"for window_size in [1,3,6]:\n",
" print(\"\\ncontexts for window size %d\\n\"%window_size)\n",
" contexts, co_matrix = compute_context_stuff(sentences, words_of_interest, window_size=window_size)\n",
" show_co_matrix(co_matrix, words_of_interest, contexts, max_vocab_size=10)"
]
},
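{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we have a co-occurrence matrix, we can quantify how similar two words are by comparing their rows. A common choice is *cosine similarity*: the cosine of the angle between two vectors, which is 1 when they point in the same direction and 0 when they are orthogonal. The sketch below reuses `co_matrix` and `words_of_interest` from the cells above, so the exact numbers depend on which window size was computed last:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Cosine similarity: dot product divided by the product of the vector lengths\n",
"def cosine_similarity(u, v):\n",
"    norm = np.linalg.norm(u) * np.linalg.norm(v)\n",
"    # A word with an all-zero context vector has undefined similarity;\n",
"    # we return 0 by convention\n",
"    if norm == 0:\n",
"        return 0.0\n",
"    return np.dot(u, v) / norm\n",
"\n",
"# Compare every pair of our words of interest\n",
"for i, w1 in enumerate(words_of_interest):\n",
"    for j in range(i + 1, len(words_of_interest)):\n",
"        w2 = words_of_interest[j]\n",
"        print('%-8s %-8s %.3f' % (w1, w2, cosine_similarity(co_matrix[i], co_matrix[j])))"
]
},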
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### References\n",
"\n",
" * McDonald, S., & Ramscar, M. (2001). Testing the distributional hypothesis: The influence of context on judgements of semantic similarity. In Proceedings of the Annual Meeting of the Cognitive Science Society (Vol. 23, No. 23)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}