Tokenization
=======

From Wikipedia _(if you search for "Tokenization", you will land on "Word segmentation")_:

>_Word segmentation is the problem of dividing a string of written language into its component words._


For example, our goal is to transform strings like

    "The woman drank her coffee"

into the list

    "The", "woman", "drank", "her", "coffee"

Notice that you definitely still want to keep the information about the order of the words. That is, for the example above, getting the list

    "woman", "coffee", "drank", "her", "the"

is definitely not a correct tokenization output.

English is a "well-behaved" language in that it divides words mostly with spaces. For this
class, we will only consider languages that do use spaces as delimiters of words. Tokenizing
text without spaces at word boundaries is a hard problem and is briefly discussed at the
end of this file.

It is useful to make a distinction between _tokens_ and _types_. Consider the following sentence:

    The woman saw another woman.

Any normal human being would probably agree that this sentence has 5 words. However, what is the number of tokens in this sentence? The answer is 6: we should probably consider the period at the end as a token too.

Now, what about the number of _types_ that this sentence has? What is a _type_? A _type_ is each one of the _distinct_ tokens in the corpus. In the sentence above, the token _woman_ appears twice. Therefore, the number of types in it is 5: "the", "woman", "saw", "another", and "." (period).

Now, what about the following sentence?

    The woman saw the other woman.

Here, again, the word _woman_ appears twice, so it should definitely be counted only once. But what about the word _the_? The first time it is capitalized, but the second time it is not. Should these be counted separately? _(well... this is all debatable)_

When talking about corpora, you might see descriptions saying that a corpus has (for example) 1 million _tokens_, but (for example) 80000 _types_. This is what it means.

_As we learnt, a type that appears only once in a corpus is called a **Hapax Legomenon** (a Greek expression meaning "said only once")._
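The token/type distinction can be checked with a few lines of Python. This is only a minimal sketch: the list `tokens` is written out by hand (it is not the output of any tokenizer), and `collections.Counter` is used to find the hapax legomena.

```python
from collections import Counter

# Hand-tokenized version of "The woman saw the other woman."
tokens = ["The", "woman", "saw", "the", "other", "woman", "."]

# Types are the distinct tokens; a set removes the repetitions
types_cased = set(tokens)
types_lower = {token.lower() for token in tokens}

print(len(tokens))       # number of tokens
print(len(types_cased))  # number of types when "The" and "the" are kept apart
print(len(types_lower))  # number of types when case is ignored

# Hapax legomena: types that occur exactly once (case ignored here)
counts = Counter(token.lower() for token in tokens)
hapaxes = sorted(t for t, n in counts.items() if n == 1)
print(hapaxes)
```

Whether "The" and "the" count as one type or two is exactly the debatable decision mentioned earlier; the code just makes the consequence of each choice visible.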

Why do you want to know about Tokenization?
-------------------------------------------

Tokenization is already implemented in many NLP libraries, like NLTK or spaCy, and most people could probably safely just take it for granted.

However, it might be that certain algorithms require tokenization to be done in a certain way. For example, it might sometimes be beneficial to tokenize words like _wouldn't_ into _would_ and _n't_, while at other times this might not matter. In such cases, you are likely to want to implement your own tokenization. It shouldn't be a hard task, but it is useful to have a notion of which tools are out there to help us do it.

A Naïve approach
----------------

It seems from the example above that you could just separate the sentence by spaces:

In [1]:
"The woman drank her coffee".split(' ')

['The', 'woman', 'drank', 'her', 'coffee']

But this approach does not work well with punctuation:

In [2]:
# Notice how the punctuation now became a problem...
"Woman! Drink coffee... Ok?".split(' ')

['Woman!', 'Drink', 'coffee...', 'Ok?']

In [3]:
# Or what about parentheses? And commas?
"The woman's coffee was (desperately, quickly) drunk.".split(' ')

['The', "woman's", 'coffee', 'was', '(desperately,', 'quickly)', 'drunk.']
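Punctuation is not the only weakness of `split(' ')`. The sketch below (with a made-up input string) shows that splitting on a single space also produces empty strings when two spaces are adjacent and ignores other whitespace such as tabs, whereas `split()` with no argument splits on any run of whitespace.

```python
text = "The  woman drank\ther coffee "

# Splitting on exactly one space: adjacent spaces yield empty strings,
# and the tab character is not treated as a delimiter
by_space = text.split(' ')
print(by_space)

# Splitting with no argument: any run of whitespace is one delimiter
by_whitespace = text.split()
print(by_whitespace)
```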

Another approach
-------------------

It looks like we could define some characters as punctuation. Say, we could decide to treat any of the following characters as a special token:

    ".!?&,()[]/'

For this kind of treatment, we will need some more powerful tools...


Regular Expressions
===========

A Regular Expression (regex) is _"a sequence of characters that define a search pattern. Usually this pattern is then used by string searching algorithms for "find" or "find and replace" operations on strings"_. (Wikipedia)

As Wikipedia says, regexes look for patterns. These patterns can be as simple as "abc" (which searches for the letter `a`, followed by the letter `b`, followed by the letter `c`), or as complicated as "a sequence of three letters, followed by three numbers, followed by a slash, at the end of the line". These are defined using a certain language that, despite being a little hard to read, is very powerful for string manipulation.


We will look into many more examples now...

In [4]:
# Import the Python package for Regular Expressions
import re

# This Python package contains several utility functions for string manipulation with
# Regular Expressions. One interesting function is `search(regex, str)`:
if re.search('abc', 'my string has abc in it!'):
    print("Found abc")

Found abc
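It may help to know what `re.search` actually returns. As a small sketch: on success it returns a match object (which is "truthy", hence the `if` works), and on failure it returns `None`. The match object can tell you what was matched and where.

```python
import re

match = re.search('abc', 'my string has abc in it!')
if match:
    print(match.group(0))              # the text that was matched
    print(match.start(), match.end())  # where it was found in the string

# When the pattern is not found, the result is None
no_match = re.search('xyz', 'my string has abc in it!')
print(no_match)
```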



If your regex is complicated to read, you might want to take a look at
[this website](https://regex101.com/) or
[this website](https://regexr.com/). You might also find [this reference website](https://www.regular-expressions.info/) a good place to read more about them.

Let's say that, instead of matching only the string `abc`, you would like to check if your string contains "any number (including none) of letters `a`, followed by the letters `b` and `c`". This could be done with the `*` operator. For example, this piece of code:

In [59]:
my_string = 'my string has no letter a followed by bc'
if re.search('a*bc', my_string):
    print("Found a*bc")

Found a*bc


In [2]:
my_string = 'my string now is aaaaaaaaabc'
if re.search('a*bc', my_string):
    print("Found a*bc")

Found a*bc


will print `Found a*bc` twice: once because there are zero `a`s before the letters `bc`, and once because there are many `a`s before the letters `bc`.
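To see what the pattern matched in each case, one option is `re.findall`, which returns the list of all matched substrings. The two input strings in this sketch are written for illustration (the second is built with `'a' * 9` so the number of `a`s is explicit).

```python
import re

no_a = 'my string has no letter a followed by bc'
many_a = 'my string now is ' + 'a' * 9 + 'bc'

# With zero a's available, a*bc matches just "bc"
matches_no_a = re.findall('a*bc', no_a)
print(matches_no_a)

# With many a's available, the star takes all of them
matches_many_a = re.findall('a*bc', many_a)
print(matches_many_a)
```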

Replacing
-----------

Regular Expressions are also useful for changing parts of a string. Let's say that the corpus you took your data from is full of errors, and you would like to fix some of them. Specifically, every time you find the word `the`, it contains several `t`s where it should contain only one. You now want to replace that group of "any number of `t`s followed by an `h` and an `e`" with the right word: `the`. Maybe you could try the code below:

In [61]:
my_string = 'tttttttthe dog chases the cat'

# re.sub (short for "substitute") is a function that receives three parameters:
# 1) A regular expression to be matched
# 2) A string to be put in place of the matched regular expression
# 3) The input line where the replacing will happen
re.sub("t*he", 'the', my_string)

'the dog chases the cat'

The `sub()` function will be used a lot throughout this text, and referring to each of its parameters might be a little confusing. Most of the time, I'll just say `regex` referring to the first parameter (which is actually what we are most interested in); but to avoid confusion, whenever I really need to be specific about the parameters I will try to refer to them by the following names:

 1. **pattern**: the regular expression to be searched for in the _input string_
 2. **replacement string**: the string to be put in the place of the _pattern_
 3. **input string**: the string that contains the data you want to modify

The problem with this code is that it will also match "no `t`s":

In [8]:
my_string = 'he is hungry'
re.sub("t*he", 'the', my_string)

'the is hungry'

In English, `he` is a valid word, and you would not like to "fix" it too. There is a special regex operator for "at least one" occurrence of a letter, the `+` operator:

In [9]:
my_string = 'he is hungry'
line = re.sub("t+he", 'the', my_string)
print(line)

my_string = 'tttttttthe dog chases the cat'
line = re.sub("t+he", 'the', my_string)
print(line)

he is hungry
the dog chases the cat


Character Classes
--------------------

Let's say that your corpus has some other problem: because of the way it was extracted from the internet, it contains many characters that are not part of English words, like `#`, `%` or even `@`. Let's say that you would like to remove them all. You could of course write a rule to replace each one of the characters by an empty string. For example:

In [10]:
line = "I #like coco@nut%"
line = re.sub("#", "", line)
line = re.sub("%", "", line)
re.sub("@", "", line)

'I like coconut'

But it would probably be tedious (and error prone) to have to create one line for each of the characters. Instead, regexes have a special syntax to allow you to match any character of a group:

In [11]:
line = "I #like coco@nut%"
re.sub("[#@%]", "", line)

'I like coconut'

These square brackets are called [character classes](https://www.regular-expressions.info/refcharclass.html).

Of course, you can always use the `*` and the `+` operator after the brackets to match multiple occurrences of the characters. For example, if you want to match any number of occurrences of the special characters followed by the word `nut`, you can use:

In [63]:
line = "@##%@%# <-- this will not be matched; but @#%@#%nut this will"
re.sub("[@#%]*nut", "_", line)

'@##%@%# <-- this will not be matched; but _ this will'

The problem with our current solution for eliminating special characters is that we need to have a list of all special characters that we want to eliminate. We need to know that our corpus contains the character `#`, and then we need to insert it in our code line (i.e., in the text between the square brackets in our code).

It would be nice if we could simply say what are the characters that we want, and then remove _any other character_. Regexes allow us to do that too. We can match any characters that are _not_ in a certain list by inserting a `^` as the first character inside the square brackets. For example:

In [13]:
line = "I #like coco@nut%"

# This will match only the characters that are NOT @, # or %,
# and replace them with an empty string
re.sub("[^@#%]", "", line)

'#@%'
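The same negation can be turned around to express "keep only what I want". The sketch below assumes that the characters we want to keep are the unaccented letters and the space; the `a-z` and `A-Z` inside the brackets are ranges, a shorthand for listing every letter.

```python
import re

line = "I #like coco@nut%"

# Remove every character that is NOT a letter (a-z, A-Z) or a space
cleaned = re.sub("[^a-zA-Z ]", "", line)
print(cleaned)
```

Note that this particular whitelist would also delete digits and accented letters, so it is only appropriate when you are sure your corpus does not need them.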

Some character classes are already "predefined" and you can just use them. For example, the class of all word characters (letters, digits and the underscore) can be written as `\w`, and the class of all non-word characters can be written as `\W` (notice that the W is capital here). Similarly, the class of all whitespace characters is `\s` and all the non-whitespace characters are `\S`. Finally, if you want to match just any character (except a line break), you can always use `.`. For example:

In [70]:
line = "<?#$> I like coconut .,[]{}/%!"
print("ORIGINAL:           ", line)

# Replaces all the word characters (letters, digits, underscore) by "_"
# (the r'...' prefix marks a raw string, so Python leaves the backslash alone)
print("MATCHES ALL WORDS:  ", re.sub(r'\w', '_', line))

# Replaces all the non-word characters by "_"
print("MATCHES NON-WORDS:  ", re.sub(r'\W', '_', line))

# Replaces all the spaces by "_"
print("MATCHES ALL SPACES: ", re.sub(r'\s', '_', line))

# Replaces all the non-spaces by "_"
print("MATCHES NON-SPACES: ", re.sub(r'\S', '_', line))

# Replaces all the characters by "_"
print("MATCHES EVERYTHING: ", re.sub('.', '_', line))

ORIGINAL:            <?#$> I like coconut .,[]{}/%!
MATCHES ALL WORDS:   <?#$> _ ____ _______ .,[]{}/%!
MATCHES NON-WORDS:   ______I_like_coconut__________
MATCHES ALL SPACES:  <?#$>_I_like_coconut_.,[]{}/%!
MATCHES NON-SPACES:  _____ _ ____ _______ _________
MATCHES EVERYTHING:  ______________________________
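One detail worth checking for yourself is that `\w` is a bit broader than "letters": it also matches digits and the underscore. The sketch below uses a made-up line to compare `\w`, the digit class `\d`, and an explicit letters-only class.

```python
import re

line = "room_42 costs $7!"

# \w matches letters, digits and the underscore
word_chars = re.sub(r'\w', 'X', line)
print(word_chars)

# An explicit class that matches (unaccented) letters only
letters_only = re.sub(r'[a-zA-Z]', 'X', line)
print(letters_only)

# \d matches digits only
digits_only = re.sub(r'\d', '#', line)
print(digits_only)
```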


Groupings
-----------

So far we have always been using regex to match either an entire string, or a set of characters, but never things like "either `abc` or `def`". Before, we had created a regex like

In [71]:
re.sub('abc', 'blah', 'my string contains abc')

'my string contains blah'

If we now wanted to find "either `abc` or `def`" and replace both of them with `blah`, we can't use square brackets (i.e., `[ ]`), or otherwise we will match _each one of the characters inside the brackets_:

In [16]:
re.sub('[abcdef]', 'blah', 'my string contains abc')

'my string blahontblahins blahblahblah'

The right way to match multiple strings is by using parentheses (i.e., `(` and `)`):

In [73]:
re.sub('(abc|def)', 'blah', 'my string contains def')

'my string contains blah'

Notice that the `|` operator works as an "OR" operator. That is, the regex above will match either `abc` or `def` and replace either of the two sequences with `blah`.
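The parentheses matter because `|` applies to everything on each of its sides until a parenthesis stops it. The sketch below (with a made-up example string) compares a pattern where the alternatives are limited to one letter with one where the parentheses are left out.

```python
import re

text = 'gray grey'

# "gr", then either "a" or "e", then "y"
grouped = re.sub('gr(a|e)y', 'X', text)
print(grouped)

# Without parentheses this means: either "gra" or "ey"
ungrouped = re.sub('gra|ey', 'X', text)
print(ungrouped)
```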

Groupings are also useful if you want to swap things around. This works because each matched group can later be referred to by its number. For example, let's say your corpus is composed of several sentences of the type

    the dog chases a cat
    the cat chases a mouse
    the mouse eats the cheese
    the human eats the beef

And you'd like to swap the two nouns in each sentence (the one following the first article and the one following the second), so that you would end up getting something like

    the cat chases a dog
    the mouse chases a cat
    the cheese eats the mouse
    the beef eats the human

Before solving this problem, remember that `(\S)*` will match "any run of characters that are not spaces", which, if preceded by a space, will probably capture a word:

In [18]:
# Gets each word preceded by a space
re.sub(r'\s(\S)*', ' blah', 'Each word here will become " blah"')

'Each blah blah blah blah blah blah'

Another thing that you'll need to know is that groupings can be referred to using something called [Backreference](https://www.regular-expressions.info/backref.html). So, for example, if I wanted to swap the words `a` and the `the` of a sentence in my corpus (without knowing beforehand where the `a` and where the `the` is going to appear), I can do something like:

In [19]:
re.sub('(the|a) (.*) (the|a)', '\\3 \\2 \\1', 'the cat and a dog')

'a cat and the dog'

There is a lot happening in this line. The first thing to notice is that this is the first time the replacement string (remember the names of the parameters of the `sub()` function above) has something "special" in it. Notice how there are three groups of parentheses in the pattern, and how they correspond to the _backreferences_ `\\1`, `\\2` and `\\3` in the replacement string. Since we inverted the order of `\\3` and `\\1`, the `a` and the `the` were swapped in the output.

We then have `(.*)`. As we discussed before, just the `.` matches any character, and putting `.*` matches `whatever number of "any characters"`. By putting this between parentheses (i.e., in a _group_), we have a way to refer back to it in our replacement string.
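If you are unsure what each group captured, you can inspect it before doing any replacement. This sketch uses `re.search` and the `groups()` method of the match object; it also shows that a raw string (`r'\3'`) is an alternative way of writing the doubled backslash (`'\\3'`).

```python
import re

pattern = r'(the|a) (.*) (the|a)'
sentence = 'the cat and a dog'

# groups() returns what each pair of parentheses captured, in order
match = re.search(pattern, sentence)
print(match.groups())

# These two replacement strings are the same three backreferences
swapped_escaped = re.sub(pattern, '\\3 \\2 \\1', sentence)
swapped_raw = re.sub(pattern, r'\3 \2 \1', sentence)
print(swapped_escaped)
print(swapped_raw)
```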

Now armed with this knowledge, we can fix our corpus with something like:

In [20]:
line1 = 'the dog chases a cat'
line2 = 'the cat chases a mouse'
line3 = 'the mouse eats the cheese'
line4 = 'the human eats the beef'

# The r'...' prefix marks raw strings, so Python leaves the backslashes alone:
# r'\1' is the same backreference as '\\1'
print(re.sub(r'(the|a) (\S*) (.*) (the|a) (\S*)', r'\1 \5 \3 \4 \2', line1))
print(re.sub(r'(the|a) (\S*) (.*) (the|a) (\S*)', r'\1 \5 \3 \4 \2', line2))
print(re.sub(r'(the|a) (\S*) (.*) (the|a) (\S*)', r'\1 \5 \3 \4 \2', line3))
print(re.sub(r'(the|a) (\S*) (.*) (the|a) (\S*)', r'\1 \5 \3 \4 \2', line4))

the cat chases a dog
the mouse chases a cat
the cheese eats the mouse
the beef eats the human
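Repeating the same long pattern on every line is error prone. One possible tidy-up, sketched here, is to compile the pattern once with `re.compile` and apply it in a loop; the name `swap_nouns` is just an illustrative choice.

```python
import re

# Compile the pattern once and reuse it for every line
swap_nouns = re.compile(r'(the|a) (\S*) (.*) (the|a) (\S*)')

lines = [
    'the dog chases a cat',
    'the cat chases a mouse',
    'the mouse eats the cheese',
    'the human eats the beef',
]

# Groups 2 and 5 (the two nouns) trade places; everything else stays put
swapped = [swap_nouns.sub(r'\1 \5 \3 \4 \2', line) for line in lines]
for line in swapped:
    print(line)
```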


If we know the exact size of the words, we might want to match an exact number of characters (instead of using a star and matching _any number of characters_). It is possible to match an exact number of characters using another special syntax: putting a number between `{` and `}` immediately after some character in our regex. For example, in the lines of code below, `\S{3}` means "match exactly three non-space characters", and the parentheses surrounding it just put it in a group (so that it can be backreferenced in the replacement string):

In [21]:
line1 = 'the dog chases a cat'        # <-- this line will change
line3 = 'the mouse eats the cheese'   # <-- this line won't

# (raw strings again: r'\1' is the same backreference as '\\1')
print(re.sub(r'(the|a) (\S{3}) (.*) (the|a) (\S*)', r'\1 \5 \3 \4 \2', line1))
print(re.sub(r'(the|a) (\S{3}) (.*) (the|a) (\S*)', r'\1 \5 \3 \4 \2', line3))

the cat chases a dog
the mouse eats the cheese


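As you can see, since `mouse` is composed of 5 characters, `\S{3}` is unable to match it (the spaces surrounding the parentheses in the pattern force the group to cover the entire word), and thus the line won't change.

Writing two numbers between the braces gives a minimum and a maximum number of occurrences (more on this below). In particular, `{0,1}` means "zero or one occurrence", which makes whatever precedes it optional. In the example below, the word `like` is missing from the input, but the pattern still matches:

In [84]:
line = "I not coconut"
re.sub("I (like){0,1}", "You despise ", line)

'You despise not coconut'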

These [quantifiers](https://www.regular-expressions.info/refrepeat.html) can also be used for groups. For example, the following regular expression transforms instances of `thethethe` or `aaa` into `the` and `a`:

In [22]:
re.sub('(the|a){3}', '\\1', 'thethethe dog chases aaa cat')

'the dog chases a cat'

This syntax also allows you to specify a minimum and a maximum number of times when something might occur. The regex below fixes any number of repetitions of `the` and `a` between 2 and 7:

In [23]:
re.sub('(the|a){2,7}', '\\1',
       'thethethethethe dog chases aaa cat and thethethethe cat chases aaaaaaa mouse')

'the dog chases a cat and the cat chases a mouse'
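The operators `+` and `?` can be read as shorthands for the brace syntax: `+` behaves like `{1,}` (one or more) and `?` behaves like `{0,1}` (zero or one). The sketch below checks these equivalences on two made-up sentences.

```python
import re

text = 'he is hungry, but tttthe dog is not'

# "+" and "{1,}" both mean "one or more"
with_plus = re.sub('t+he', 'the', text)
with_braces = re.sub('t{1,}he', 'the', text)
print(with_plus)

# "?" and "{0,1}" both mean "zero or one", i.e. the group is optional
optional_short = re.sub('I (like )?', 'You despise ', 'I like coconut')
optional_braces = re.sub('I (like ){0,1}', 'You despise ', 'I like coconut')
print(optional_short)
```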

Miscellaneous
---------------

### Mixing Classes and Groups

You should notice that classes and groups can of course be used together. If you want to match both `the` and `The`, you can always write something like:

In [46]:
re.sub('([Tt]he){2,7}', '\\1',
       'ThethetheThethe dog chases theThe cat and theTheThethe cat chases the mouse')

'the dog chases The cat and the cat chases the mouse'

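However, use this kind of combination sparingly: as you can see, the first `the cat` got a capital `T`. This happens because, when a group is repeated, the backreference holds only the last repetition that was matched (here, `The`).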
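One possible way around this, sketched below, is to capture only the first occurrence and match the repetitions in a separate group. The `(?:...)` syntax is a non-capturing group: it groups things for the quantifier without creating a backreference. Keeping the first occurrence is an assumption about what you want; you might just as well decide to lowercase everything.

```python
import re

text = 'ThethetheThethe dog chases theThe cat and theTheThethe cat chases the mouse'

# Group 1 captures the FIRST the/The; the repetitions that follow are
# matched by a non-capturing group and simply discarded
deduplicated = re.sub(r'([Tt]he)(?:[Tt]he){1,6}', r'\1', text)
print(deduplicated)
```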

### [Anchors](https://www.regular-expressions.info/refanchors.html)

There may be cases when you'd like to make some replacement only if the match happens at the beginning or at the end of your input string. For example, you might want to change a `the` into `The` if it occurs at the beginning of the sentence, or insert some punctuation at the end if it has none. Regexes have two special characters for these tasks: a character that matches the beginning of the input string (`^`) and a character that matches its end (`$`).

For example, if you'd like to capitalize the `the` in the beginning of your sentence, you might do:

In [88]:
re.sub('^the', 'The', 'the dog barks')

'The dog barks'

and adding punctuation might be done with something like:

In [26]:
# Remember that [^.!?] matches any character that is not '.', '!' or '?'
re.sub('([^.!?])$', '\\1.', 'the dog barks')

'the dog barks.'
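The two anchors can be combined in a small helper. The function name `tidy_sentence` is made up for this sketch; it capitalizes a leading `the` and adds a final period only when the sentence does not already end in punctuation.

```python
import re

def tidy_sentence(sentence):
    # ^ anchors the match to the beginning of the string
    sentence = re.sub('^the ', 'The ', sentence)
    # $ anchors the match to the end; the group keeps the last character
    sentence = re.sub(r'([^.!?])$', r'\1.', sentence)
    return sentence

print(tidy_sentence('the dog barks'))
print(tidy_sentence('the dog barks!'))  # already punctuated: no period added
print(tidy_sentence('a cat sleeps'))    # no leading "the": only the period
```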

Further Theoretical Details
-------------------------------

The set of all languages that can be expressed by Regular Expressions is the set of the Regular Languages. Regular Languages are the simplest type of Formal Languages, but the theory of Formal Languages is not going to be discussed here. It is important to say, however, that they have a strong connection with the formalisms used by generative syntacticians. Further details about these topics can be found in:

  * [Chomsky, N. (1956). Three models for the description of language](http://static.stevereads.com/papers_to_read/three_models_for_the_description_of_language.pdf)
  * [Hopcroft et al. (2013). Introduction to Automata Theory, Languages, and Computation (3rd ed.)](https://en.wikipedia.org/wiki/Introduction_to_Automata_Theory,_Languages,_and_Computation)

Useful cleaning tips
=============

There are a ton of things you might want to do with your data to make it clean. Sometimes you might want to remove capitals, or you might want to transform all `&` into ` and `. You might want to remove multiple spaces, or leading and trailing spaces, or you might even want to remove hashtags, or extract URLs (more on this at the end of the class), ... the list goes on and on.

You have already learned how to transform `&` into ` and `. Let's write a regex for this:

In [27]:
re.sub('&', ' and ', 'Johnson & Johnson')

'Johnson  and  Johnson'

Of course, now you have multiple spaces around the `and`. You can also clean this neatly with something we learned:

In [28]:
# \s+ matches 1 or more space characters
# (the r'...' prefix marks a raw string, so Python leaves the backslash alone)
re.sub(r'\s+', ' ', 'Johnson  and  Johnson')

'Johnson and Johnson'

This collapses ALL runs of multiple spaces in an input string. But let's say you would only like to remove the leading spaces (on the left) or the trailing spaces (on the right). Python is your friend, offering the methods `lstrip()`, `rstrip()` and `strip()`. In the three examples below, notice how the multiple spaces surrounding the word `spaces` are not changed.

In [29]:
'    eliminate   spaces  to the left    '.lstrip()


'eliminate   spaces  to the left    '

In [30]:
'    eliminate   spaces   to the right    '.rstrip()

'    eliminate   spaces   to the right'

In [31]:
'    eliminate   spaces   left and right    '.strip()

'eliminate   spaces   left and right'
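When you want both effects at once (no leading or trailing spaces, and single spaces in between), a common idiom that needs no regex is to split on whitespace and join again with one space. The sketch below compares it with the regex-plus-`strip()` combination.

```python
import re

text = '    eliminate   spaces   left and right    '

# split() drops all runs of whitespace; join puts single spaces back
normalized = ' '.join(text.split())
print(normalized)

# The same result with the tools seen so far
with_regex = re.sub(r'\s+', ' ', text).strip()
print(with_regex)
```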

It might also be that whichever algorithm you want to process your text with does not "like" to differentiate between non-capitalized and capitalized first letters. Changing this can be easily done with `lower()`, `upper()` and `capitalize()`:

In [32]:
"I like Coconut with Chocolate".lower()

'i like coconut with chocolate'

In [33]:
"I like Coconut with Chocolate".upper()

'I LIKE COCONUT WITH CHOCOLATE'

In [34]:
"I like Coconut with Chocolate".capitalize()

'I like coconut with chocolate'

It may also be useful to check whether a given substring is present in a string. We did this before with regular expressions in the following way:

In [35]:
my_string = 'my string now is aaaaaaaaabc'
if re.search('abc', my_string):
    print("Found abc")

Found abc


It turns out that, if you are just looking for a small sequence of characters (without any of the special functionalities that regular expressions give you), you can do it in a much simpler way with the operator `in`. For example:

In [36]:
if 'abc' in my_string:
    print("Found abc")

Found abc


Finally, you might just want to remove the hashtags from your text. Doing that should be in your bloodstream at this point. All you need to do is use a regex to find them and replace them with nothing.

In [37]:
re.sub(r'#[\w]*', '', 'This is a #hashtag test with #ßŧrænge #cħæracters')

'This is a  test with  '

[_Do notice, however, that hashtags do not seem to follow very well any consistent standard =/_](https://shkspr.mobi/blog/2010/02/hashtag-standards/)
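Sometimes you would rather keep the hashtags somewhere (as metadata, for example) instead of just throwing them away. A minimal sketch of that idea uses `re.findall` to collect them before removing them; treating a hashtag as `#` followed by one or more `\w` characters is an assumption of this example, not a standard.

```python
import re

text = 'This is a #hashtag test with #ßŧrænge #cħæracters'

# Collect the hashtags first...
hashtags = re.findall(r'#\w+', text)
print(hashtags)

# ...then remove them and tidy up the leftover spaces
cleaned = ' '.join(re.sub(r'#\w+', '', text).split())
print(cleaned)
```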

Tokenizing with all we have learned
=================

Remember that the whole motivation for having learned about Regexes and the Python string utilities was that we wanted to perform Tokenization. We had assumed that it would be a good idea to consider special characters like `".!?&,()[]/'` as separate tokens. It should now be clear that this kind of tokenization can be easily performed using the tricks we learned above:

In [38]:
line = "Mary&Jane knew, despite Bob's negation, that he had (cruelly) killed Elisa!"

# 1. Separate special characters into separate words
# (i.e., insert spaces around them). The raw triple-quoted string lets the
# pattern contain both ' and " without extra escaping
line = re.sub(r'''(["'.!?&,()\[\]/])''', r' \1 ', line)

# 2. Remove multiple spaces (this is actually optional if you just call `.split()`, and
# is done here only as an example)
line = re.sub(r'\s+', ' ', line)

# 3. Remove trailing spaces
line = line.strip()

# 4. "Tokenize"
line.split()

['Mary',
 '&',
 'Jane',
 'knew',
 ',',
 'despite',
 'Bob',
 "'",
 's',
 'negation',
 ',',
 'that',
 'he',
 'had',
 '(',
 'cruelly',
 ')',
 'killed',
 'Elisa',
 '!']

_Finally... notice, however, that even after all of our efforts, we still end up getting things like `Dr.` or `B.Sc.` "wrong" (or, at least, the result we get is not the most intuitive). Probably the best is to just use some library (although most of the libraries also get many things wrong anyway)._
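To see the problem concretely, the steps can be wrapped in a small function and tried on a sentence with abbreviations. The function name `simple_tokenize` and the test sentence are made up for this sketch.

```python
import re

def simple_tokenize(line):
    # Put spaces around the special characters, then split on whitespace
    line = re.sub(r'''(["'.!?&,()\[\]/])''', r' \1 ', line)
    return line.split()

tokens = simple_tokenize("Dr. Smith (B.Sc.) didn't come.")
print(tokens)
```

The abbreviations are broken at every period and `didn't` falls apart into three tokens, which is rarely what you want.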

After all of this, how is Tokenization actually done?
=====================================================

This entire story was presented only to show how much Python offers for
processing strings, both with the `str` class and with Regular Expressions.
However, tokenization does not need to be implemented by hand: NLTK already gives you a
function for that.

In [39]:
# Import the libraries for tokenizing
import nltk
from nltk import word_tokenize

# If you still do not have the NLTK tokenizer data installed, you might have to
# uncomment and run the following line to download it (depending on your NLTK
# version, the resource may be called 'punkt_tab' instead of 'punkt')
#nltk.download('punkt')

tokens = word_tokenize("The woman's coffee was (desperately, quickly) drunk.")
print(tokens)

['The', 'woman', "'s", 'coffee', 'was', '(', 'desperately', ',', 'quickly', ')', 'drunk', '.']


Last words on Tokenization
==========================

English is a "well-behaved" language in that it divides words mostly with spaces. Other
languages, however, might not give any hint on where words start or end. For example,
Chinese is written without any indication of word boundaries. This style of writing is referred
to as [_Scriptio continua_](https://en.wikipedia.org/wiki/Scriptio_continua). The examples
below are from the Wikipedia article on the topic. The first one is from Chinese:

>北京在中国北方；广州在中国南方。

>北京在中国北方广州在中国南方

>北京　   在　 中国　    北方   ； 广州　     在　 中国　    南方。

>Běijīng zài Zhōngguó běifāng; Guǎngzhōu zài Zhōngguó nánfāng.

>Beijing is in Northern China; Guangzhou is in Southern China.
    

Older Indo-European languages were also once written without any space between words. For example,
both Latin and Greek were initially written with neither spaces nor punctuation. The example below is from Latin:

>NEQVEPORROQVISQVAMESTQVIDOLOREMIPSVMQVIADOLORSITAMETCONSECTETVRADIPISCIVELIT

>NEQVE•PORRO•QVISQVAM•EST•QVI•DOLOREM•IPSVM•QVIA•DOLOR•SIT•AMET•CONSECTETVR•ADIPISCI•VELIT

>Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit

>Nobody likes pain for its own sake, or looks for it and wants to have it, just because it is pain

More recently, the internet has brought the topic back with hashtags and URLs like
[gibtespommesheute.de](http://www.gibtesheutepommes.de/). Hashtag segmentation is an
open research problem without a definitive solution, partly because of the limited
pragmatic information computer systems have access to. A
[funny example](http://cs.uccs.edu/~jkalita/work/reu/REU2015/FinalPapers/05Reuter.pdf)
could be the hashtag #brainstorm, which could be segmented as

    brainstorm --> bra in storm
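The ambiguity is easy to reproduce. The sketch below is a brute-force illustration (not a serious segmentation algorithm): it enumerates every way of splitting a string into words from a tiny, hand-picked vocabulary. The function name and the vocabulary are both made up for this example.

```python
def segmentations(text, vocabulary):
    """Return every way of splitting `text` into words from `vocabulary`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocabulary:
            # Keep this word and segment the rest of the string recursively
            for rest in segmentations(text[end:], vocabulary):
                results.append([prefix] + rest)
    return results

toy_vocabulary = {"bra", "in", "storm", "brain", "brainstorm"}
candidates = segmentations("brainstorm", toy_vocabulary)
for words in candidates:
    print(" ".join(words))
```

All three candidates are made of valid English words; choosing the intended one requires knowledge that the string itself does not contain.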


"Filtering" (Preparing your data for Machine Learning)
==================================

I want to finish this class with some filtering tips that are likely to be useful when preparing your data for Machine Learning (ML is the topic of the next class).

Removing URLs
-------------

Very often, Machine Learning algorithms expect to have some "vocabulary" of all types appearing in the corpus. If you have URLs in your corpus (for example, because the corpus was collected from random places on the internet), you might end up with a vocabulary that is littered with things like _gibtespommesheute_, _localhost_ or even _pblandfort_ (which might or might not be what you want).

Identifying URLs is a little tricky: you have probably already seen some programs identifying them incorrectly. Here, we will simply look for things like _http://_, _https://_, or _www._ at the beginning of any string. Importantly, the name of the protocol (HTTP, HTTPS, SFTP, ...) is normally a sequence of a few letters, followed by a colon and two slashes. We will also assume that the URL ends when we find a space.

In [40]:
#########
# Replaces URLs with #URL#
#########

line1 = "I went to www.google.com to look for it"
line2 = "I don't like to buy in http://amazon.com like you do"
line3 = "I got a aptp://blah.blah.blah ... have you ever seen a url like this?"

# (the r'...' prefix marks raw strings, so Python leaves the backslashes alone)
line1 = re.sub(r'www\.[\S]*', '#URL#', line1)
print(line1)

line2 = re.sub(r'[\w]+://[\S]*', '#URL#', line2)
line3 = re.sub(r'[\w]+://[\S]*', '#URL#', line3)
print(line2)
print(line3)

I went to #URL# to look for it
I don't like to buy in #URL# like you do
I got a #URL# ... have you ever seen a url like this?
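The two patterns can also be merged into a single one with a group and the `|` operator. This is only a sketch under the same simplifying assumptions (a URL starts with a protocol or with `www.` and ends at the next space); the fourth example line is made up.

```python
import re

# Either "<letters>://" or "www.", followed by any non-space characters
url_regex = re.compile(r'(\w+://|www\.)\S*')

lines = [
    "I went to www.google.com to look for it",
    "I don't like to buy in http://amazon.com like you do",
    "I got a aptp://blah.blah.blah ... have you ever seen a url like this?",
    "See http://www.example.com/page now",
]

replaced = [url_regex.sub('#URL#', line) for line in lines]
for line in replaced:
    print(line)
```

Since the URL is assumed to end at a space, any punctuation glued to its end (as in `www.google.com.`) would be swallowed as well.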


Removing HTML tags
------------------

Just like in the previous section, if you "scraped" your corpus automatically from the internet, you might have lots of HTML tags in your corpus that were accidentally taken in the process, which you likely want to clean out.

Of course, there are several ways in which you can do this. We will do this in a very simple way, using a regex to eliminate anything that is in between a `<` and a `>`.

In [41]:
#########
# Replaces HTML tags with #TAG#
#########

line1 = "I got a <b>MUCH</b> better car than yours..."

line1 = re.sub('<[^>]*>', ' #TAG# ', line1)

# (remove multiple spaces)
line1 = re.sub(r'\s+', ' ', line1)
print(line1)

I got a #TAG# MUCH #TAG# better car than yours...


The problem with this approach is that, if you have the characters `<` and `>` genuinely used in your data, then this will not only eliminate them, but also likely cause a lot of mess in your data.

(This is just one example of where it pays off to think carefully about what you are doing before you start applying it to the entire corpus.)
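The mess is easy to reproduce. In the sketch below, a made-up line uses `<` and `>` as comparison signs, and the simple regex eats everything between them. A slightly stricter pattern, which requires a letter right after the `<` (or after `</`), avoids this particular accident; it is still only a heuristic, and a real HTML parser is the safer choice for serious work.

```python
import re

line = "if a < b and c > d then <b>stop</b>"

# The simple pattern also removes the text between the comparison signs
naive = ' '.join(re.sub('<[^>]*>', ' #TAG# ', line).split())
print(naive)

# Stricter heuristic: a tag must start with "<" or "</" followed by a letter
stricter = ' '.join(re.sub(r'</?[a-zA-Z][^>]*>', ' #TAG# ', line).split())
print(stricter)
```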

### Extracting attributes in your HTML tags

HTML tags sometimes contain attributes that are interpreted by your browser in several ways. A common example of a tag that normally appears with an attribute is the `<a>` tag. It normally appears with the attribute _href_ pointing to some other place, e.g.,

    <a href="http://some_other_place.com">

Therefore, it might be that, instead of completely removing the tag, you would like to keep this information in some other variable. In the case of the `<a>` tag, you might want to keep the link as metadata related to the tag. Let's define a function to do this...

In [42]:
#########
# Example of getting an attribute from a tag
#########

line1 = '<a   href="some_other_place.com">Click here to go to some_other_place.com</a>'

def get_url_from_href_attribute(line):
    # Look for a tag that carries a non-empty href attribute
    tag_regex = re.search(r'<\S+\s+href="[^"]+">', line)
    if not tag_regex:
        return None

    # Now we have only the tag. We still need some filtering
    print(tag_regex.group(0))
    tag_str = tag_regex.group(0)

    # Let's now extract only the attribute that we care about
    attribute = re.search('href="[^"]+"', tag_str)

    # Notice that we are guaranteed to have the attribute, because we matched it
    # in the previous regex
    #if not attribute:
    #    return None

    print(attribute.group(0))
    attr_str = attribute.group(0)

    # Finally, we will take the part that we want
    link = re.search('"[^"]+"', attr_str).group(0)
    link = link[1:-1]
    print(link)

    return link

extracted_link = get_url_from_href_attribute(line1)
print("Extracted link: ", extracted_link)

<a   href="some_other_place.com">
href="some_other_place.com"
some_other_place.com
Extracted link:  some_other_place.com
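The same extraction can be done in a single step with a capturing group, since `group(1)` of the match object returns just the part between the parentheses. The helper below is a hypothetical sketch (the name `get_attribute_value` and the example tags are made up); it takes the attribute name as a parameter and uses `re.escape` so that the name is treated as plain text.

```python
import re

def get_attribute_value(text, attribute):
    """Return the value of `attribute` in the first tag that has it, or None."""
    # "<", anything that is not ">", a space, the attribute name, ="...";
    # the parentheses capture only what is between the double quotes
    pattern = r'<[^>]*\s' + re.escape(attribute) + r'="([^"]*)"'
    match = re.search(pattern, text)
    return match.group(1) if match else None

link = get_attribute_value('<a   href="some_other_place.com">Click here</a>', 'href')
image = get_attribute_value('<img alt="a cat" src="cat.png">', 'src')
missing = get_attribute_value('<b>MUCH</b>', 'href')
print(link, image, missing)
```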


Replacing numbers
-----------------

Unless you are interested in the meanings of the sentences, it is often irrelevant which numbers are used in a sentence. For example, consider the sentences:

    (1) She bought 50 packets of coffee last night.
    (2) She bought 349 packets of coffee last night.
    (3) She bought three packets of coffee last night.

While in case 3 the number is written in letters, it is quite irrelevant how many packets of coffee (whatever this means) _she_ bought if your goal is, for example, to have an algorithm to learn the structure of the sentence: it is the same in all three cases. So if you want to train some algorithm to learn how to deal with numbers, you probably want to "trick" it into believing all numbers are "the same word".

A common way to do this is by replacing any number with some special symbol. This is easily done with regex (hopefully by now these ideas are already in your bloodstream):

In [43]:
#########
# Replaces any numbers with `###`
#########

line = "She bought 349 packets of coffee; he bought 3 packets of beer, and 49 of chocolate"

line = re.sub('[0-9]+', '###', line)
print(line)

She bought ### packets of coffee; he bought ### packets of beer, and ### of chocolate


In [44]:
# Notice that this will also take the numbers that are connected to words
# (which might or might not be what you want)

line = "I ran 5km yesterday."

line = re.sub('[0-9]+', '###', line)
print(line)

I ran ###km yesterday.
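Numbers written with a decimal point or a thousands separator are another case to watch. With `[0-9]+` alone they are split into two placeholders; one possible refinement, sketched below on a made-up sentence, allows groups of digits joined by `.` or `,` to count as a single number.

```python
import re

line = "It costs 3.50 euros, or 1,000 coins."

# Digits only: "3.50" and "1,000" each become two placeholders
digits_only = re.sub('[0-9]+', '###', line)
print(digits_only)

# Digits, optionally followed by more digit groups joined by "." or ","
whole_numbers = re.sub(r'[0-9]+([.,][0-9]+)*', '###', line)
print(whole_numbers)
```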


Things for the next few classes
===============================

It would be awesome if you could have some packages already installed in your computer for the next class (to avoid having to solve problems in the class, when the time is limited).

    # Activate your virtual environment
    # (if you are using Anaconda, you might not need this step, and you will probably want
    # to install using the package manager that came along with it)
    $ source /path/to/your/virtual/environment/bin/activate

    # Install packages
    # scikit-learn (imported as `sklearn`): contains several Machine Learning utility functions
    # scipy and numpy: useful vector/matrix functions
    # matplotlib: for showing plots on the screen
    $ pip install scikit-learn scipy numpy matplotlib

Also, we are going to use a dataset for Sentiment Analysis, with ~5000 positive sentences, and ~5000 negative sentences.

This dataset can be found [here](https://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz).

In [45]:
import random

positives = [sentence.strip() for sentence in open('rt-polaritydata/rt-polaritydata/rt-polarity.pos', 'r', encoding = 'latin1')]
negatives = [sentence.strip() for sentence in open('rt-polaritydata/rt-polaritydata/rt-polarity.neg', 'r', encoding = 'latin1')]

print("Positive samples")
print(random.sample(positives, 5))
print("-----")
print("Negative samples")
print(random.sample(negatives, 5))

Positive samples
["a movie that will surely be profane , politically charged music to the ears of cho's fans .", 'grant is certainly amusing , but the very hollowness of the character he plays keeps him at arms length', 'with " ichi the killer " , takashi miike , japan\'s wildest filmmaker gives us a crime fighter carrying more emotional baggage than batman . . .', 'an intriguing cinematic omnibus and round-robin that occasionally is more interesting in concept than in execution .', 'the movie is full of fine performances , led by josef bierbichler as brecht and monica bleibtreu as helene weigel , his wife .']
-----
Negative samples
["the sad thing about knockaround guys is its lame aspiration for grasping the coolness vibes when in fact the film isn't as flippant or slick as it thinks it is .", '" an entire film about researchers quietly reading dusty old letters . "', "it's hard to like a film about a guy who is utterly unlikeable , and shiner , starring michael caine as an aging bri

### Exercise

1) Write the code to open the two datasets and tokenize the sentences. We will use the tokenized sentences from the next lecture on.

2) Create a vocabulary for the dataset. That is, use nltk.FreqDist() to create a list of all _types_ (the different tokens) in the dataset. A _vocabulary_ is basically the set of all types present in the dataset.