In [1]:
import nltk

Removing stop-words
==================

Stop-words are (often short) words that carry little "meaning" of their own and mostly mark a grammatical function in our sentences. Examples include pronouns (_I_, _me_, _you_, ..., _your_, _his_, ..., _myself_, _yourself_, ...) and connectives such as conjunctions and prepositions (_and_, _of_, _because_, ...).

For an easy way to remove stop words, take a look [here](https://chrisalbon.com/machine_learning/preprocessing_text/remove_stop_words/). Basically, `nltk` ships a list of such words for several languages (you may need to run `nltk.download('stopwords')` once before using it):

In [2]:
en_stopwords = nltk.corpus.stopwords.words('english')
en_stopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

Then you can remove stop words from your text using a list comprehension. For example, let's say you wanted to work with _Alice in Wonderland_:

In [3]:
alice = nltk.corpus.gutenberg.words('carroll-alice.txt')
alice[:100]

['[',
 'Alice',
 "'",
 's',
 'Adventures',
 'in',
 'Wonderland',
 'by',
 'Lewis',
 'Carroll',
 '1865',
 ']',
 'CHAPTER',
 'I',
 '.',
 'Down',
 'the',
 'Rabbit',
 '-',
 'Hole',
 'Alice',
 'was',
 'beginning',
 'to',
 'get',
 'very',
 'tired',
 'of',
 'sitting',
 'by',
 'her',
 'sister',
 'on',
 'the',
 'bank',
 ',',
 'and',
 'of',
 'having',
 'nothing',
 'to',
 'do',
 ':',
 'once',
 'or',
 'twice',
 'she',
 'had',
 'peeped',
 'into',
 'the',
 'book',
 'her',
 'sister',
 'was',
 'reading',
 ',',
 'but',
 'it',
 'had',
 'no',
 'pictures',
 'or',
 'conversations',
 'in',
 'it',
 ',',
 "'",
 'and',
 'what',
 'is',
 'the',
 'use',
 'of',
 'a',
 'book',
 ",'",
 'thought',
 'Alice',
 "'",
 'without',
 'pictures',
 'or',
 'conversation',
 "?'",
 'So',
 'she',
 'was',
 'considering',
 'in',
 'her',
 'own',
 'mind',
 '(',
 'as',
 'well',
 'as',
 'she',
 'could',
 ',']

You can then remove the stop words with:

In [4]:
alice_content_words = [word for word in alice if word not in en_stopwords]
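
One caveat: the membership test above is case-sensitive, and the entries in `en_stopwords` are all lowercase, so capitalized tokens such as _So_ or _I_ are kept. Looking each word up in a list is also slow for a whole book. Here is a minimal sketch of one way around both issues. It uses a tiny hand-written stop-word list in place of `en_stopwords`, so it runs without any corpus download; the names `stopword_set` and `content_words` are just illustrative.

```python
# Hypothetical mini stop-word list, standing in for en_stopwords.
stopwords = ["so", "she", "was", "in", "her", "own"]

# A few tokens from the opening of the book.
tokens = ["So", "she", "was", "considering", "in", "her", "own", "mind", "("]

# A set gives fast membership tests, unlike a list.
stopword_set = set(stopwords)

# Case-sensitive filter: the capitalized "So" slips through.
case_sensitive = [t for t in tokens if t not in stopword_set]

# Case-insensitive filter that also drops punctuation-only tokens.
content_words = [
    t for t in tokens
    if t.lower() not in stopword_set and t.isalpha()
]

print(case_sensitive)  # ['So', 'considering', 'mind', '(']
print(content_words)   # ['considering', 'mind']
```

Whether dropping punctuation is desirable depends on what you want to do with the text afterwards, so treat the `isalpha()` check as optional.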

In [5]:
alice_content_words

['[',
 'Alice',
 "'",
 'Adventures',
 'Wonderland',
 'Lewis',
 'Carroll',
 '1865',
 ']',
 'CHAPTER',
 'I',
 '.',
 'Down',
 'Rabbit',
 '-',
 'Hole',
 'Alice',
 'beginning',
 'get',
 'tired',
 'sitting',
 'sister',
 'bank',
 ',',
 'nothing',
 ':',
 'twice',
 'peeped',
 'book',
 'sister',
 'reading',
 ',',
 'pictures',
 'conversations',
 ',',
 "'",
 'use',
 'book',
 ",'",
 'thought',
 'Alice',
 "'",
 'without',
 'pictures',
 'conversation',
 "?'",
 'So',
 'considering',
 'mind',
 '(',
 'well',
 'could',
 ',',
 'hot',
 'day',
 'made',
 'feel',
 'sleepy',
 'stupid',
 '),',
 'whether',
 'pleasure',
 'making',
 'daisy',
 '-',
 'chain',
 'would',
 'worth',
 'trouble',
 'getting',
 'picking',
 'daisies',
 ',',
 'suddenly',
 'White',
 'Rabbit',
 'pink',
 'eyes',
 'ran',
 'close',
 '.',
 'There',
 'nothing',
 'VERY',
 'remarkable',
 ';',
 'Alice',
 'think',
 'VERY',
 'much',
 'way',
 'hear',
 'Rabbit',
 'say',
 ',',
 "'",
 'Oh',
 'dear',
 '!',
 'Oh',
 'dear',
 '!',
 'I',
 'shall',
 'late',
 "!'",

In [6]:
# You can also do it with several other languages
es_stopwords = nltk.corpus.stopwords.words('spanish')
es_stopwords

['de',
 'la',
 'que',
 'el',
 'en',
 'y',
 'a',
 'los',
 'del',
 'se',
 'las',
 'por',
 'un',
 'para',
 'con',
 'no',
 'una',
 'su',
 'al',
 'lo',
 'como',
 'más',
 'pero',
 'sus',
 'le',
 'ya',
 'o',
 'este',
 'sí',
 'porque',
 'esta',
 'entre',
 'cuando',
 'muy',
 'sin',
 'sobre',
 'también',
 'me',
 'hasta',
 'hay',
 'donde',
 'quien',
 'desde',
 'todo',
 'nos',
 'durante',
 'todos',
 'uno',
 'les',
 'ni',
 'contra',
 'otros',
 'ese',
 'eso',
 'ante',
 'ellos',
 'e',
 'esto',
 'mí',
 'antes',
 'algunos',
 'qué',
 'unos',
 'yo',
 'otro',
 'otras',
 'otra',
 'él',
 'tanto',
 'esa',
 'estos',
 'mucho',
 'quienes',
 'nada',
 'muchos',
 'cual',
 'poco',
 'ella',
 'estar',
 'estas',
 'algunas',
 'algo',
 'nosotros',
 'mi',
 'mis',
 'tú',
 'te',
 'ti',
 'tu',
 'tus',
 'ellas',
 'nosotras',
 'vosotros',
 'vosotras',
 'os',
 'mío',
 'mía',
 'míos',
 'mías',
 'tuyo',
 'tuya',
 'tuyos',
 'tuyas',
 'suyo',
 'suya',
 'suyos',
 'suyas',
 'nuestro',
 'nuestra',
 'nuestros',
 'nuestras',
 'vuestro'

In [7]:
de_stopwords = nltk.corpus.stopwords.words('german')
de_stopwords

['aber',
 'alle',
 'allem',
 'allen',
 'aller',
 'alles',
 'als',
 'also',
 'am',
 'an',
 'ander',
 'andere',
 'anderem',
 'anderen',
 'anderer',
 'anderes',
 'anderm',
 'andern',
 'anderr',
 'anders',
 'auch',
 'auf',
 'aus',
 'bei',
 'bin',
 'bis',
 'bist',
 'da',
 'damit',
 'dann',
 'der',
 'den',
 'des',
 'dem',
 'die',
 'das',
 'dass',
 'daß',
 'derselbe',
 'derselben',
 'denselben',
 'desselben',
 'demselben',
 'dieselbe',
 'dieselben',
 'dasselbe',
 'dazu',
 'dein',
 'deine',
 'deinem',
 'deinen',
 'deiner',
 'deines',
 'denn',
 'derer',
 'dessen',
 'dich',
 'dir',
 'du',
 'dies',
 'diese',
 'diesem',
 'diesen',
 'dieser',
 'dieses',
 'doch',
 'dort',
 'durch',
 'ein',
 'eine',
 'einem',
 'einen',
 'einer',
 'eines',
 'einig',
 'einige',
 'einigem',
 'einigen',
 'einiger',
 'einiges',
 'einmal',
 'er',
 'ihn',
 'ihm',
 'es',
 'etwas',
 'euer',
 'eure',
 'eurem',
 'euren',
 'eurer',
 'eures',
 'für',
 'gegen',
 'gewesen',
 'hab',
 'habe',
 'haben',
 'hat',
 'hatte',
 'hatten',
 '

In [8]:
pt_stopwords = nltk.corpus.stopwords.words('portuguese')
pt_stopwords

['de',
 'a',
 'o',
 'que',
 'e',
 'é',
 'do',
 'da',
 'em',
 'um',
 'para',
 'com',
 'não',
 'uma',
 'os',
 'no',
 'se',
 'na',
 'por',
 'mais',
 'as',
 'dos',
 'como',
 'mas',
 'ao',
 'ele',
 'das',
 'à',
 'seu',
 'sua',
 'ou',
 'quando',
 'muito',
 'nos',
 'já',
 'eu',
 'também',
 'só',
 'pelo',
 'pela',
 'até',
 'isso',
 'ela',
 'entre',
 'depois',
 'sem',
 'mesmo',
 'aos',
 'seus',
 'quem',
 'nas',
 'me',
 'esse',
 'eles',
 'você',
 'essa',
 'num',
 'nem',
 'suas',
 'meu',
 'às',
 'minha',
 'numa',
 'pelos',
 'elas',
 'qual',
 'nós',
 'lhe',
 'deles',
 'essas',
 'esses',
 'pelas',
 'este',
 'dele',
 'tu',
 'te',
 'vocês',
 'vos',
 'lhes',
 'meus',
 'minhas',
 'teu',
 'tua',
 'teus',
 'tuas',
 'nosso',
 'nossa',
 'nossos',
 'nossas',
 'dela',
 'delas',
 'esta',
 'estes',
 'estas',
 'aquele',
 'aquela',
 'aqueles',
 'aquelas',
 'isto',
 'aquilo',
 'estou',
 'está',
 'estamos',
 'estão',
 'estive',
 'esteve',
 'estivemos',
 'estiveram',
 'estava',
 'estávamos',
 'estavam',
 'estivera'

Stemming
========

Let's go on with our "cleaning" of Alice in Wonderland. Now, instead of keeping each token "as is", we might want to keep just its stem. This is the task of "Stemming": finding the "stem" of the word, that is, the part of the word that is not inflected. This might be useful if we are interested in, for example, how often a certain word appears in our corpus. For example, if we wanted to know how many times the word _walk_ was used in Alice in Wonderland, we might want to include the likes of _walked_, _Walked_ (i.e., capitalized), _walking_, _walks_, or (maybe) even _walker_.
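
To make the idea concrete, here is a toy sketch of suffix stripping used to count the forms of _walk_. The suffix list and the function `naive_stem` are made up for illustration; this is not how any of the `nltk` stemmers work internally, only the general idea.

```python
from collections import Counter

# Hypothetical, hand-picked suffix list; real stemmers use a much
# richer set of ordered rules.
SUFFIXES = ("ing", "ed", "er", "s")


def naive_stem(word):
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = ["walk", "Walked", "walking", "walks", "walker", "talk"]
counts = Counter(naive_stem(token) for token in tokens)

print(counts["walk"])  # 5
```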

Due to time constraints, we won't talk much here about how this is implemented. Importantly, `nltk` has several methods for stemming. The most popular seems to be the `PorterStemmer`:

In [9]:
# Creates a stemmer object that we can use to stem words
stemmer = nltk.stem.PorterStemmer()

In [10]:
# Then we can use this object to stem any given string:
stemmer.stem('globalization')

'global'

In [11]:
# But sometimes Stemming does not do what intuitively we'd probably want
stemmer.stem('was')

'wa'

In [12]:
stemmer.stem('deconstructing')

'deconstruct'

So then we do this with the entire Alice book:

In [13]:
alice_stemmed_content_words = [stemmer.stem(word) for word in alice_content_words]

In [14]:
alice_stemmed_content_words

['[',
 'alic',
 "'",
 'adventur',
 'wonderland',
 'lewi',
 'carrol',
 '1865',
 ']',
 'chapter',
 'I',
 '.',
 'down',
 'rabbit',
 '-',
 'hole',
 'alic',
 'begin',
 'get',
 'tire',
 'sit',
 'sister',
 'bank',
 ',',
 'noth',
 ':',
 'twice',
 'peep',
 'book',
 'sister',
 'read',
 ',',
 'pictur',
 'convers',
 ',',
 "'",
 'use',
 'book',
 ",'",
 'thought',
 'alic',
 "'",
 'without',
 'pictur',
 'convers',
 "?'",
 'So',
 'consid',
 'mind',
 '(',
 'well',
 'could',
 ',',
 'hot',
 'day',
 'made',
 'feel',
 'sleepi',
 'stupid',
 '),',
 'whether',
 'pleasur',
 'make',
 'daisi',
 '-',
 'chain',
 'would',
 'worth',
 'troubl',
 'get',
 'pick',
 'daisi',
 ',',
 'suddenli',
 'white',
 'rabbit',
 'pink',
 'eye',
 'ran',
 'close',
 '.',
 'there',
 'noth',
 'veri',
 'remark',
 ';',
 'alic',
 'think',
 'veri',
 'much',
 'way',
 'hear',
 'rabbit',
 'say',
 ',',
 "'",
 'Oh',
 'dear',
 '!',
 'Oh',
 'dear',
 '!',
 'I',
 'shall',
 'late',
 "!'",
 '(',
 'thought',
 'afterward',
 ',',
 'occur',
 'ought',
 'wonde

As you can see, however, the output of the `PorterStemmer` is quite rough. Another possibility is the `SnowballStemmer`:

In [15]:
# You need to specify the language for it. It seems to work quite well for English
stemmer = nltk.stem.SnowballStemmer('english')

In [16]:
# A quick look at Alice's output shows only small differences from the previous result
# (e.g. 'i' instead of 'I', and 'sudden' instead of 'suddenli')
alice_stemmed_content_words = [stemmer.stem(word) for word in alice_content_words]
alice_stemmed_content_words

['[',
 'alic',
 "'",
 'adventur',
 'wonderland',
 'lewi',
 'carrol',
 '1865',
 ']',
 'chapter',
 'i',
 '.',
 'down',
 'rabbit',
 '-',
 'hole',
 'alic',
 'begin',
 'get',
 'tire',
 'sit',
 'sister',
 'bank',
 ',',
 'noth',
 ':',
 'twice',
 'peep',
 'book',
 'sister',
 'read',
 ',',
 'pictur',
 'convers',
 ',',
 "'",
 'use',
 'book',
 ",'",
 'thought',
 'alic',
 "'",
 'without',
 'pictur',
 'convers',
 "?'",
 'so',
 'consid',
 'mind',
 '(',
 'well',
 'could',
 ',',
 'hot',
 'day',
 'made',
 'feel',
 'sleepi',
 'stupid',
 '),',
 'whether',
 'pleasur',
 'make',
 'daisi',
 '-',
 'chain',
 'would',
 'worth',
 'troubl',
 'get',
 'pick',
 'daisi',
 ',',
 'sudden',
 'white',
 'rabbit',
 'pink',
 'eye',
 'ran',
 'close',
 '.',
 'there',
 'noth',
 'veri',
 'remark',
 ';',
 'alic',
 'think',
 'veri',
 'much',
 'way',
 'hear',
 'rabbit',
 'say',
 ',',
 "'",
 'oh',
 'dear',
 '!',
 'oh',
 'dear',
 '!',
 'i',
 'shall',
 'late',
 "!'",
 '(',
 'thought',
 'afterward',
 ',',
 'occur',
 'ought',
 'wonder'
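
The two outputs are long, so the differences are easy to miss by eye. A quick way to spot them is to compare the lists position by position. The sketch below hard-codes a few aligned tokens copied from the two outputs above (the variable names are made up for illustration); with the full lists you would compare the Porter and Snowball results directly.

```python
# Selected, position-aligned tokens copied from the two outputs above.
porter_excerpt = ["chapter", "I", "So", "consid", "suddenli", "white", "Oh", "dear"]
snowball_excerpt = ["chapter", "i", "so", "consid", "sudden", "white", "oh", "dear"]

# Collect every position where the two stemmers disagree.
differences = [
    (porter, snowball)
    for porter, snowball in zip(porter_excerpt, snowball_excerpt)
    if porter != snowball
]

for porter, snowball in differences:
    print(f"Porter: {porter!r:12} Snowball: {snowball!r}")
```

In this excerpt, most of the differences come down to lowercasing, plus the different treatment of _suddenly_.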

Or you can also use the `LancasterStemmer`:

In [17]:
stemmer = nltk.stem.LancasterStemmer()

In [18]:
# This seems even worse! =/
alice_stemmed_content_words = [stemmer.stem(word) for word in alice_content_words]
alice_stemmed_content_words

['[',
 'al',
 "'",
 'adv',
 'wonderland',
 'lew',
 'carrol',
 '1865',
 ']',
 'chapt',
 'i',
 '.',
 'down',
 'rabbit',
 '-',
 'hol',
 'al',
 'begin',
 'get',
 'tir',
 'sit',
 'sist',
 'bank',
 ',',
 'noth',
 ':',
 'twic',
 'peep',
 'book',
 'sist',
 'read',
 ',',
 'pict',
 'convers',
 ',',
 "'",
 'us',
 'book',
 ",'",
 'thought',
 'al',
 "'",
 'without',
 'pict',
 'convers',
 "?'",
 'so',
 'consid',
 'mind',
 '(',
 'wel',
 'could',
 ',',
 'hot',
 'day',
 'mad',
 'feel',
 'sleepy',
 'stupid',
 '),',
 'wheth',
 'pleas',
 'mak',
 'daisy',
 '-',
 'chain',
 'would',
 'wor',
 'troubl',
 'get',
 'pick',
 'daisy',
 ',',
 'sud',
 'whit',
 'rabbit',
 'pink',
 'ey',
 'ran',
 'clos',
 '.',
 'ther',
 'noth',
 'very',
 'remark',
 ';',
 'al',
 'think',
 'very',
 'much',
 'way',
 'hear',
 'rabbit',
 'say',
 ',',
 "'",
 'oh',
 'dear',
 '!',
 'oh',
 'dear',
 '!',
 'i',
 'shal',
 'lat',
 "!'",
 '(',
 'thought',
 'afterward',
 ',',
 'occur',
 'ought',
 'wond',
 ',',
 'tim',
 'seem',
 'quit',
 'nat',
 ');',

Lemmatization
============

The terms Stemming and Lemmatization are used rather loosely by most people, and often interchangeably. Therefore, I don't intend to draw a sharp line between the two. However, some people would say that Stemming just "cuts off" affixes, keeping the part of the word that does not change due to inflection, while Lemmatization is a little more complex: it searches for the "root" of a particular word, that is, its dictionary form (the lemma).

For example, you might say that the stem of a word like _written_ might be _writ_ (since this is the part of the word that doesn't change due to inflection), while its lemma would be _write_ (despite _written_ having an additional _t_). [This Stack Overflow answer is quite good, in fact](https://stackoverflow.com/a/1787121).

`nltk` also provides a lemmatizer, based on WordNet (a resource we will talk more about in the future). Let's look at an example:

In [19]:
# WordNetLemmatizer lives in the nltk.stem package
lemmatizer = nltk.stem.WordNetLemmatizer()

In [20]:
# Now we can lemmatize just like we did with the stemmer
lemmatizer.lemmatize('globalization')

'globalization'

In [21]:
# It works better if you pass a "Part of Speech"
#  * 'v' indicates "verb"
#  * 'n' indicates "noun"
#  * 'a' indicates "adjective"
#  * 'r' indicates "adverb"
lemmatizer.lemmatize('underwent', pos='v')

'undergo'

Here are some other examples of things we'd like to have. Notice how this is very different from what a Stemmer would do:

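$$
\begin{aligned}
\text{was} &\to \text{be} \\
\text{were} &\to \text{be} \\
\text{written} &\to \text{write} \\
\text{driven} &\to \text{drive} \\
\text{went} &\to \text{go}
\end{aligned}
$$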
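
Irregular forms like these cannot be recovered just by cutting suffixes; some kind of lookup is needed. The toy sketch below contrasts the two approaches. The table `IRREGULAR_LEMMAS` is hand-written from the examples above and merely stands in for a real lexical resource such as WordNet, and `crude_stem` is a deliberately simplistic, made-up suffix stripper.

```python
# Hand-written lookup table taken from the examples above; a real
# lemmatizer consults a lexical resource instead.
IRREGULAR_LEMMAS = {
    "was": "be",
    "were": "be",
    "written": "write",
    "driven": "drive",
    "went": "go",
}


def toy_lemmatize(word):
    """Return the lemma from the table, or the lowercased word itself."""
    word = word.lower()
    return IRREGULAR_LEMMAS.get(word, word)


def crude_stem(word, suffixes=("ten", "en", "s")):
    """Strip the first matching suffix (illustrative only)."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


for word in ["was", "were", "written", "driven", "went"]:
    print(f"{word:8} stem: {crude_stem(word):6} lemma: {toy_lemmatize(word)}")
```

The stripper turns _written_ into _writ_ and leaves _went_ untouched, while the lookup returns _write_ and _go_.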

Part of Speech Tagging
===================

In the previous section, we talked a little about "Parts of Speech" (PoS). These are basically "word categories". We will talk more about PoS Tagging in another class; here we will just see how to do it using `nltk`.

Very similarly to the other tools we just saw, we can tag words with their categories using the function `pos_tag()`.

However, word categories only make sense in the contexts in which the words appear. For example, the word _sound_ could be a noun (as in _I heard a sound_), an adjective (_he is back safe and sound_) or even a verb (_those ideas sound horrible!_). Therefore, the tagger receives a list of words, which are processed together:

In [22]:
sentence = 'my dinosaur has five legs'
tokenized = nltk.word_tokenize(sentence)

In [23]:
tokenized

['my', 'dinosaur', 'has', 'five', 'legs']

In [24]:
nltk.pos_tag(tokenized)

[('my', 'PRP$'),
 ('dinosaur', 'NN'),
 ('has', 'VBZ'),
 ('five', 'CD'),
 ('legs', 'NNS')]
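
Since the tagger returns a plain list of `(word, tag)` pairs, ordinary Python is enough to work with the result. A small sketch, with the output above copied in by hand so that it runs on its own (the names `nouns` and `tag_counts` are just illustrative):

```python
from collections import Counter

# Output of nltk.pos_tag(tokenized), copied from the cell above.
tagged = [
    ("my", "PRP$"),
    ("dinosaur", "NN"),
    ("has", "VBZ"),
    ("five", "CD"),
    ("legs", "NNS"),
]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged if tag.startswith("NN")]

# How often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

print(nouns)             # ['dinosaur', 'legs']
print(tag_counts["NN"])  # 1
```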

Of course, we can do the exact same with Alice's book:

In [25]:
alice = nltk.corpus.gutenberg.words('carroll-alice.txt')

In [26]:
alice

['[', 'Alice', "'", 's', 'Adventures', 'in', ...]

In [27]:
alice_pos_tagged = nltk.pos_tag(alice)

In [28]:
alice_pos_tagged

[('[', 'JJ'),
 ('Alice', 'NNP'),
 ("'", 'POS'),
 ('s', 'NN'),
 ('Adventures', 'NNS'),
 ('in', 'IN'),
 ('Wonderland', 'NNP'),
 ('by', 'IN'),
 ('Lewis', 'NNP'),
 ('Carroll', 'NNP'),
 ('1865', 'CD'),
 (']', 'NNP'),
 ('CHAPTER', 'NNP'),
 ('I', 'PRP'),
 ('.', '.'),
 ('Down', 'RP'),
 ('the', 'DT'),
 ('Rabbit', 'NNP'),
 ('-', ':'),
 ('Hole', 'NNP'),
 ('Alice', 'NNP'),
 ('was', 'VBD'),
 ('beginning', 'VBG'),
 ('to', 'TO'),
 ('get', 'VB'),
 ('very', 'RB'),
 ('tired', 'JJ'),
 ('of', 'IN'),
 ('sitting', 'VBG'),
 ('by', 'IN'),
 ('her', 'PRP$'),
 ('sister', 'NN'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('bank', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('of', 'IN'),
 ('having', 'VBG'),
 ('nothing', 'NN'),
 ('to', 'TO'),
 ('do', 'VB'),
 (':', ':'),
 ('once', 'RB'),
 ('or', 'CC'),
 ('twice', 'VB'),
 ('she', 'PRP'),
 ('had', 'VBD'),
 ('peeped', 'VBN'),
 ('into', 'IN'),
 ('the', 'DT'),
 ('book', 'NN'),
 ('her', 'PRP$'),
 ('sister', 'NN'),
 ('was', 'VBD'),
 ('reading', 'VBG'),
 (',', ','),
 ('but', 'CC'),
 ('it', '

PoS Tagsets
=======

But what do all these tags mean?

It is very hard to define exactly how many classes of words there are. Take, for example, the auxiliary verbs _have_ and _be_, when used in sentences like _She had not seen that he is burning alive_. Should they be considered the same as the verbs _seen_ and _burning_? And what about different verb forms? Should they be treated differently? For example, present and past participles like _burning_ and _eaten_ can be used as adjectives (e.g., _the burning house_ or _the eaten apple_), but simple past verb forms like _ate_ or _knew_ (normally) cannot. Should they also be considered part of a different class?

This is made even more complex when you consider that some words are written in the same way, but pronounced in completely different ways. For example, the word _object_, when pronounced with the stress on the *o*, is a noun; but when the same word is pronounced with the stress on the *e*, it is a verb. Another example is the verb _read_, whose past tense is written the same way but pronounced differently.


PoS Tags for English
-----------------------

[The questions above do not have an easy answer](http://itre.cis.upenn.edu/~myl/languagelog/archives/002974.html), but most NLP applications choose a set of classes (i.e., a "tagset") and do their best to fit all the words into sensible tags. It won't always work, but hopefully it will work most of the time. Some popular tagsets are the [Penn Treebank's tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) (used to annotate the Penn Treebank) and the [Brown Corpus' tagset](http://www.hit.uib.no/icame/brown/bcm.html#bc6) (used to annotate the [Brown Corpus](http://www.hit.uib.no/icame/brown/bcm.html)). Other influential tagsets are the [UCREL](http://ucrel.lancs.ac.uk/claws7tags.html) tags (used to tag the [British National Corpus](http://www.natcorp.ox.ac.uk/)) and the [Universal POS tags](http://universaldependencies.org/u/pos/).
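
As a small illustration, here is a hand-written glossary for the Penn Treebank tags that appeared in the dinosaur sentence, together with a coarse Universal POS equivalent for each. The dictionary `PENN_TAGS` is partial and made up for this example. The mapping to Universal tags is a simplification: for instance, a form like _has_ used as an auxiliary would typically be tagged `AUX` rather than `VERB` in the Universal scheme.

```python
# Partial glossary: Penn Treebank tag -> (description, Universal POS tag).
PENN_TAGS = {
    "PRP$": ("possessive pronoun", "PRON"),
    "NN": ("noun, singular or mass", "NOUN"),
    "NNS": ("noun, plural", "NOUN"),
    "VBZ": ("verb, 3rd person singular present", "VERB"),
    "CD": ("cardinal number", "NUM"),
}

# The tagged dinosaur sentence from the previous section.
tagged = [
    ("my", "PRP$"),
    ("dinosaur", "NN"),
    ("has", "VBZ"),
    ("five", "CD"),
    ("legs", "NNS"),
]

# Print each word with its tag, coarse tag, and description.
for word, tag in tagged:
    description, universal = PENN_TAGS[tag]
    print(f"{word:10} {tag:5} {universal:5} {description}")

# Re-tag the sentence with the coarser tagset.
universal_tagged = [(word, PENN_TAGS[tag][1]) for word, tag in tagged]
```

Going from a fine-grained tagset to a coarse one loses information (here, the singular/plural distinction between `NN` and `NNS`), which is one reason different projects settle on different tagsets.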

PoS Tags for German
----------------------

A typical tagset for the German language is the STTS tagset, the one used for annotating the [TIGER Treebank](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.en.html) ([you can find the tags at the end of this file](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf)). The NLTK has support for these tags.

Wrapping Up
==========

We have seen lots of things that we can do with the NLTK that might be useful for creating a new corpus (or just for processing any amount of text we might get, from whichever source). Due to time constraints, we could not cover:

* Annotated corpora
* Corpora in other languages
* Other resources available through the NLTK

However, [you can find more information on these topics in chapter 2 of the NLTK book](https://www.nltk.org/book/ch02.html), and you are more than encouraged to take a look there. Specifically, you can find some annotated corpora in section 1.6, and corpora in other languages in section 1.7. Finally, you can take a look at the other resources in section 4.