What are Convolutions?

I have been wanting to write this blog post for quite some time. A little more than one year ago I got acquainted with Convolutional Neural Networks, and it didn’t immediately strike me why they are called that way. I eventually read this blog post, which helped a lot to clarify things; but I thought I could try to give more details on what exactly is meant when one says “Convolution” here.

This blog post builds upon the description given there, so, if you haven’t read it yet, stop here and go take a look at that blog post first. Some of the discussion here will overlap with the discussion there.

In the sections that follow, I’ll introduce convolutions (actually, I’ll let Khan Academy do that for me), then introduce a procedure to calculate them, motivate a discussion about discrete convolutions, show why it makes sense to represent the convolving functions as vectors, and extend the definition to the 2D space. The next blog post will explain why these are useful for signal processing and what their relation to Convolutional Neural Networks is.

Convolutions

Convolutions are a very common operation in signal processing. While colah’s blog post presents them in a more abstract/intuitive statistical way, I find that a more gory, calculus-driven introduction from Khan Academy might help you realize that the concept is just an integral:

In this Khan Academy video, Sal finds a closed formula for the convolution by solving the integral. Given that a convolution is an integral, you might consider that it represents the area below some curve. But what curve exactly? I’ll discuss this more in the next section. For now, what is worth understanding is that there are several ways in which you can think of convolutions, and it might help a lot if you allow yourself to switch views at different points in time.

A concrete example

If you go to the Wikipedia article on convolutions, you may find the following two (awesome) images:

Convolution of a function with itself.

Convolution of a spiky function with a box.

What these images are saying is that you can calculate the value of the convolution $f \ast g$ at the point $t$ by following a very simple procedure. I’ll define two functions $f$ and $g$ to make the steps easier to follow. Let

\[f(x) = \begin{cases} 1 & \text{if } 0 \leq x \leq 1 \\ 0 & \text{otherwise} \end{cases}\]

and

\[g(x) = 2 \times f(x)\]

Here we have the two curves:

Two signals

(I used Google Spreadsheets to do this, so you’ll notice the lines are not exact, but you should be able to get the idea)

First: flip $g$ horizontally (i.e., $g'(x) = g(-x)$). Let’s give the flipped $g$ a name, say $g'$. (If you don’t flip $g$, then what you are calculating is actually called “cross-correlation”, which is simply another typical operation in signal processing.)

Flipped signal

Second: shift $g'$ horizontally by $t$ units. If $t$ is positive, then $g'$ will be shifted to the right; otherwise, it will be shifted to the left. For our example, let’s say that $t=0.3$. I’ll call this function $g_{shifted}'$.

Shifted signal

Third: this is the step where the problems arise. What you want now is to multiply the two curves at each point between $-\infty$ and $+\infty$ and calculate the area below the curve formed by this multiplication. Let’s assume that the functions are zero most of the time (just like in our example), and non-zero only in a small section of their domain. Because we are multiplying the two values, we only care about the points where both functions are not 0; everywhere else, the product is 0 anyway. Let’s assume that both functions are non-zero only in an interval $[a, b]$. In this case, our problem reduces to calculating the integral of the product of $f$ and $g_{shifted}'$ inside that interval. Even then, it could still be a challenge to calculate the integral of $f$ times $g_{shifted}'$ in that interval.

Calculate area below curve

(While searching for a way to understand this procedure, I came across this very nice demo. In it you can define your own functions and play around to find out what the convolution will look like.)
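If you would rather check this numerically than solve the integral, a few lines of NumPy approximate the area for our two box functions. This is just a rough sketch: the integration window and step count are arbitrary choices of mine.

```python
import numpy as np

# The functions defined above: f is 1 on [0, 1], and g = 2 * f.
def f(x):
    return np.where((x >= 0) & (x <= 1), 1.0, 0.0)

def g(x):
    return 2.0 * f(x)

def conv_value(t, n=200_001):
    """Approximate (f * g)(t), the integral of f(tau) * g(t - tau) d tau,
    with a plain Riemann sum over a window that contains all the action."""
    tau = np.linspace(-2.0, 3.0, n)
    return np.sum(f(tau) * g(t - tau)) * (tau[1] - tau[0])

print(conv_value(0.3))  # ~0.6: the curves overlap on [0, 0.3], where the product is 1 * 2
```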

The problem with continuous convolutions is that we would have to actually calculate an integral. But what if our functions were actually “discrete”? Fortunately for us, most applications in Image Processing require discrete signals, and for our purposes it would be perfectly OK to discretize these continuous signals.

Calculate sum of elements below curve

After discretization, all the concepts we have discussed so far follow the same logic. Instead of an integral, we now have a sum. So, given the interval $[a, b]$, we could calculate the convolution as

\[(f \ast g)(t) = \sum^b_{i=a}{f(i) \times g_{shifted}'(i)}\]

And fortunately this sum is easy to calculate.

Note: the avid reader may notice that the integral over an interval spanning only a point should have been 0 (and therefore the convolution should always have become 0 after the discretization). The reason why this does not happen has to do with the Dirac delta function, and I won’t go into many details here. You can just assume that the discretized version of the signal is a sum of Dirac delta functions.

In the example above I discretized the functions using 1 point for each 0.05 step in $x$. That would make the discussion below very hard to follow. So, to make things simpler, in all the text that follows I’ll use steps of 0.25 instead. The image below shows how the original functions $f$ and $g$ would look discretized this way.

Discretized curves with steps of 0.25

1D discrete convolutions

It turns out that the functions $f$ and $g$ used in convolutions are, in reality, most of the time composed almost entirely of zeros (as assumed before). This allows for a much more compact representation of the functions as vectors of values. For example, $f$ and $g$ could be represented as:

\(f = [\dots 0, 0, 1, 1, 1, 1, 0, 0, \dots] \\ g = [\dots 0, 0, 2, 2, 2, 2, 0, 0, \dots] \\\) (Of course, the number of 1s and 2s depends on how the discretization was performed)
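If you want to reproduce this discretization yourself, a minimal sketch follows; the window around $[0, 1]$ and the half-open treatment of the endpoints (which is why there are four 1s, not five) are my own arbitrary choices.

```python
import numpy as np

x = np.arange(-0.5, 1.5, 0.25)            # a window around [0, 1], step 0.25
f = ((x >= 0) & (x < 1)).astype(float)    # half-open endpoints -> four 1s
g = 2 * f

print(f)  # [0. 0. 1. 1. 1. 1. 0. 0.]
print(g)  # [0. 0. 2. 2. 2. 2. 0. 0.]
```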

Now let’s say I’d like to calculate the value of the convolution between $f$ and $g$ at the point $t = $ some coordinate. It is hard to point to the exact place, so I’ll make the place bold:

\(f = [\dots 0, 0, 1, 1, \textbf{1}, 1, 0, 0, \dots] \\\) (For future reference, I’ll call this position $t=2$)

The way to calculate it is just the same:

  • Flip $g$ (but it has no effect here, because $g$ is symmetric anyway);

  • Move $g$ horizontally by $t$: this is a little abstract here; but if we align the $f$ and $g$ the way they were initially aligned, then we should get:

\[f = [\dots 0, 0, 1, 1, \textbf{1}, \textbf{1}, 0, 0, 0, 0, \dots] \\ g = [\dots 0, 0, 0, 0, \textbf{2}, \textbf{2}, 2, 2, 0, 0, \dots] \\\]
  • Multiply all elements position by position and sum them all.
\[(f \ast g)(t) = (1 \times 2) + (1 \times 2) = 4\]

You might have noticed how these operations resemble dot products. You could have implemented them as:

\[(f \ast g)(t) = [1, 1] \bullet [2, 2]\]

This way, if you wanted to calculate the convolution for many different values of $t$, you could just keep shifting the vector $g$.

\[\begin{align*} \text{When } t &= 0 \\ f &= [\dots 0, 0, \textbf{1}, \textbf{1}, \textbf{1}, \textbf{1}, 0, \dots] \\ g &= [\dots 0, 0, \textbf{2}, \textbf{2}, \textbf{2}, \textbf{2}, 0, \dots] \\ (f \ast g)(t) &= [1, 1, 1, 1] \bullet [2, 2, 2, 2] = 8 \\ \\ \text{When } t &= 1 \\ f &= [\dots 0, 0, 1, \textbf{1}, \textbf{1}, \textbf{1}, 0, 0, \dots] \\ g &= [\dots 0, 0, 0, \textbf{2}, \textbf{2}, \textbf{2}, 2, 0, \dots] \\ (f \ast g)(t) &= [1, 1, 1] \bullet [2, 2, 2] = 6 \\ \\ \text{And, } & \text{finally, if you consider all values of } t \\ f &= [\dots 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, \dots] \\ g &= [\dots 0, 0, 0, 0, 2, 2, 2, 2, 0, 0, \dots] \\ (f \ast g)(t) &= [\dots 0, 2, 4, 6, 8, 6, 4, 2, 0, 0, \dots] \\ \end{align*}\]

Unfortunately, these are still vectors with an infinite number of dimensions, which are hard to store in our computers’ limited memory. It is worth noting that, very often, the functions $f$ and $g$ for which we want to calculate a convolution are 0 most of the time. Since we know that the result of the convolution in these regions will be zero, we can just drop those zeros:

\(\begin{align*} f &= [0, 0, 0, 0, 1, 1, 1, 1, 0, 0] \\ g &= [0, 0, 0, 0, 2, 2, 2, 2, 0, 0] \\ (f \ast g) &= [0, 2, 4, 6, 8, 6, 4, 2, 0, 0] \\ \end{align*}\) (As you can see, I kept some of the zeros. I could have removed them. It was my choice)

And congratulations: we have just arrived at a very compact representation of our functions.
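With these compact vectors, the whole computation fits in a few lines of NumPy. Here is a minimal sketch; note that np.roll wraps around at the borders, which is harmless here only because the borders hold zeros.

```python
import numpy as np

f = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0])
g = np.array([0, 0, 0, 0, 2, 2, 2, 2, 0, 0])

def conv_at(f, g, t):
    # Step 1 (flipping g) is skipped: this g is symmetric about its own
    # support, so flipping changes nothing. Step 2 shifts by t positions,
    # and step 3 multiplies pointwise and sums.
    return int(np.sum(f * np.roll(g, t)))

print([conv_at(f, g, t) for t in range(-4, 6)])
# [0, 2, 4, 6, 8, 6, 4, 2, 0, 0] -- the same result as above

# np.convolve(f, g) computes the same values; its "full" mode just returns
# them at a different offset, in a vector of length len(f) + len(g) - 1.
```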

Note: The entire discussion so far supposed that we would keep $f$ still and always transform $g$ according to our three steps to calculate the convolutions. It turns out that convolutions are commutative, and therefore the entire procedure would have also worked by holding $g$ still and changing $f$ in the same way. (Incidentally, they are also associative)

But what does all of this mean?

When I started talking about convolutions, I said that they are used a lot in the context of signal processing. It might be a good idea to forget for a while that these vectors are functions and consider them signals. (This video might help to convince you that this is a sensible idea.) In that case, what a convolution does is take two signals as input and generate a new one based on those two. What the new signal looks like depends on where both signals are non-zero. In the next blog post you’ll see how this can be used in meaningful ways, like finding borders in an image, blurring an image, or even shifting a signal in a certain direction.

Most importantly, convolutions are a very simple operation (composed of sums and multiplications that can be done in parallel), which can be easily implemented in hardware. They are a great tool to have in hand when solving difficult problems.

2D Convolutions

It shouldn’t be a big leap to extend these concepts to the 2D space.

Let us skip all the discussion about continuous functions and vectors with infinitely many elements and consider our current state: functions $f$ and $g$ are represented as small vectors, and we want to calculate the convolution of those two functions (vectors) at any point $t$. If we now define new $f$ and $g$ in a 2D space, then we can represent them as matrices. For example, if we now redefine $f$ as

\[f(x, y) = \begin{cases} 1 & \text{if } 0 \leq x,y \leq 1 \\ 0 & \text{otherwise} \end{cases}\]

and rediscretize it in the same way we did before, then we would get a matrix that looks something like:

\[f = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ \end{bmatrix}\]

(Do not forget: I was the one who decided to keep a border with zeros. I could have left many more columns and rows with zeros in the borders. This may seem irrelevant for now, but will be useful when we discuss kernels in the next blog post.)

Let us define a new $g$, that after discretization looks like the following:

\[g = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0.5 & 0 \\ 0 & 0 & 0 \\ \end{bmatrix}\]

How would the convolution then be calculated? Same steps:

  • Flip the matrix $g$ (both horizontally and vertically), generating $g’$.

  • Shift $g’$ (according to the place where you want to evaluate the convolution). Basically, you want to align $g’$ with some part of $f$.

  • Multiply the aligned elements and sum their result. (A code sketch of these three steps follows.)
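Here is the promised sketch of those three steps in NumPy. It computes a “valid” convolution, meaning $g'$ is only placed where it fits entirely inside $f$; real code would normally call an optimized routine such as scipy.signal.convolve2d instead of looping in Python.

```python
import numpy as np

def conv2d_valid(f, g):
    g_flipped = g[::-1, ::-1]           # step 1: flip horizontally and vertically
    gh, gw = g_flipped.shape
    out = np.zeros((f.shape[0] - gh + 1, f.shape[1] - gw + 1))
    for i in range(out.shape[0]):       # step 2: every alignment of g' inside f
        for j in range(out.shape[1]):
            window = f[i:i + gh, j:j + gw]
            out[i, j] = np.sum(window * g_flipped)  # step 3: multiply and sum
    return out
```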

An example calculated by hand

Before concluding this blog post, I want to calculate an example by hand. If you did not understand everything so far, this should clarify whatever is missing. Let’s define two new functions $f$ and $g$ that, after discretization and “vectorization”, become the following matrices:

\[f = \begin{bmatrix} 0 & 3 & 6 & 3 \\ 3 & 6 & 3 & 6 \\ 6 & 3 & 6 & 3 \\ 3 & 6 & 3 & 0 \\ \end{bmatrix}\] \[g = \begin{bmatrix} 0 & 3 & 0 \\ 0 & 1 & 2 \\ 4 & 0 & 0 \\ \end{bmatrix}\]

If you think of $f$ as an image, you might interpret it as two diagonal lines (the values with 6) surrounded by some “shade” (the values with 3). The function $g$, on the other hand, is hard to interpret. I chose a very asymmetric matrix to show how the flipping (the first step in our calculation) affects the final values in $g$.

Let’s calculate $(f \ast g)(0,0)$. The first step is to flip $g$ to create $g'$:

\[g' = \begin{bmatrix} 0 & 0 & 4 \\ 2 & 1 & 0 \\ 0 & 3 & 0 \\ \end{bmatrix}\]

Then we align the matrix $g’$ with the part of $f$ that corresponds to position $(0,0)$. This part might cause some confusion. Where exactly is $(0,0)$? There is no actual “right answer” to where this point should be after discretization, and we don’t have the original function formula to help us find out. I’ll call this “the border problem” and refer to it in the next blog post. For now, I’ll just align with the points “we know” and forget about any zeros that might lurk beyond the border of the matrix representing $f$. This will give us a so-called “valid” convolution.

Finally, we need to multiply each element pointwise and sum all of the results. To make things clearer, if $A$ and $B$ denote the two matrices of the same size that we now have, then what we want to do is:

\[A \odot B = \sum_{i,j}{A_{i,j} \times B_{i,j}}\]

where I am representing this “pointwise multiplication followed by sum” by the operator $\odot$. In our specific case, we get:

\[\begin{split} (f \ast g)(0,0) &= \begin{bmatrix} 0 & 3 & 6 \\ 3 & 6 & 3 \\ 6 & 3 & 6 \\ \end{bmatrix} \odot \begin{bmatrix} 0 & 0 & 4 \\ 2 & 1 & 0 \\ 0 & 3 & 0 \\ \end{bmatrix} \\ &= (0 \times 0) + (3 \times 0) + (6 \times 4) + (3 \times 2) + (6 \times 1) + (3 \times 0) + (6 \times 0) + (3 \times 3) + (6 \times 0) \\ &= 45 \end{split}\]

Easy, right?

Now, to calculate $(f \ast g)(1,0)$, we just move the matrix $g'$ to the right, aligning it with the next submatrix of $f$:

\[\begin{split} (f \ast g)(1,0) &= \begin{bmatrix} 3 & 6 & 3 \\ 6 & 3 & 6 \\ 3 & 6 & 3 \\ \end{bmatrix} \odot \begin{bmatrix} 0 & 0 & 4 \\ 2 & 1 & 0 \\ 0 & 3 & 0 \\ \end{bmatrix} \\ &= (3 \times 0) + (6 \times 0) + (3 \times 4) + (6 \times 2) + (3 \times 1) + (6 \times 0) + (3 \times 0) + (6 \times 3) + (3 \times 0) \\ &= 45 \end{split}\]

And the other two elements are calculated the same way:

\[\begin{split} (f \ast g)(0,1) &= \begin{bmatrix} 3 & 6 & 3 \\ 6 & 3 & 6 \\ 3 & 6 & 3 \\ \end{bmatrix} \odot \begin{bmatrix} 0 & 0 & 4 \\ 2 & 1 & 0 \\ 0 & 3 & 0 \\ \end{bmatrix} \\ &= (3 \times 0) + (6 \times 0) + (3 \times 4) + (6 \times 2) + (3 \times 1) + (6 \times 0) + (3 \times 0) + (6 \times 3) + (3 \times 0) \\ &= 45 \end{split}\] \[\begin{split} (f \ast g)(1,1) &= \begin{bmatrix} 6 & 3 & 6 \\ 3 & 6 & 3 \\ 6 & 3 & 0 \\ \end{bmatrix} \odot \begin{bmatrix} 0 & 0 & 4 \\ 2 & 1 & 0 \\ 0 & 3 & 0 \end{bmatrix} \\ &= (6 \times 0) + (3 \times 0) + (6 \times 4) + (3 \times 2) + (6 \times 1) + (3 \times 0) + (6 \times 0) + (3 \times 3) + (0 \times 0) \\ &= 45 \end{split}\]

Resulting in the final matrix:

\[(f \ast g) = \begin{bmatrix} 45 & 45 \\ 45 & 45 \\ \end{bmatrix}\]
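If you want a machine to double-check the hand calculation, SciPy’s convolve2d does the flipping internally (it computes a true convolution, not a cross-correlation), and its “valid” mode matches the alignment we used:

```python
import numpy as np
from scipy.signal import convolve2d

f = np.array([[0, 3, 6, 3],
              [3, 6, 3, 6],
              [6, 3, 6, 3],
              [3, 6, 3, 0]])
g = np.array([[0, 3, 0],
              [0, 1, 2],
              [4, 0, 0]])

# Each output entry is np.sum(window * g_flipped) -- the "odot" from above.
print(convolve2d(f, g, mode="valid"))
# [[45 45]
#  [45 45]]
```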

Conclusions

In this blog post I expect to have given you a very intuitive understanding of how convolutions are calculated and a notion of what they are doing. It should help you make the connection between all those integrals you find in Khan Academy or Wikipedia and the discrete convolution operation you see in some Neural Networks. If that connection still hasn’t clicked, the examples in the next blog post will definitely help you realize what is going on.

I had not planned for this blog post to become so long. In the next blog post I’ll show applications of convolutions from the image processing field, and how they connect to Convolutional Neural Networks. As a bonus, I want to show a very elegant application of convolutions from Neural Turing Machines.

Stay tuned =)

UPDATE: Thanks to Fotini Simistira for pointing some mistakes in my calculations.

What is Machine Learning?

When I started getting in touch with Artificial Intelligence (AI), no one could give me a clear distinction between all those buzzwords such as “Artificial Intelligence”, “Pattern Recognition”, “Data Mining” and “Machine Learning” (ML). It took me a long time to actually be able to distinguish the meanings of these and, to tell the truth, it is still not very clear to me how exactly the first three relate to each other. The meaning of “Machine Learning”, however, is very simple, and I believe it should have been made clearer from day one. This post has the goal of separating “Machine Learning” from this mess, making it very clear when something is ML and when something is not, and what the relation between ML and AI is.

That said, while I do expect you to have a clear notion of what is and what is not part of the field after you read this blog post, I don’t intend to give you a better definition of the expression “Machine Learning” than the definitions you may have already found elsewhere on the web.

Why the confusion?

When I sat down to write this blog post, I thought of taking a look at how others had defined the field before (because, of course, I didn’t expect to come up with any magic new definition). I came across this video (from the amazing Machine Learning course on Coursera by Andrew Ng):

And this video:

If you watch them, you will see these three definitions of the field:

Field of study that gives computers the ability to learn without being explicitly programmed.

A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E.

These first two definitions are good, because they make it clear that the programmer doesn’t explicitly tell the machine what it should do: the behavior of the machine depends entirely on the data it has previously seen (and, of course, on the way the machine learns).

"The extraction of knowledge from data"

A problem with this last definition is that it does not clarify who extracts the knowledge from the data. If the programmer writes rules that conform to the patterns in the data, does this count as Machine Learning? (more on this in the next section)

If you look back at these three definitions, they may cause you to be confused about what exactly the relation between Machine Learning and other fields is. Why is it “Machine Learning” and not, say, “Artificial Intelligence”? (and how is it related to AI, in the first place?) And when am I applying AI that is not ML? And can I apply ML that is not AI?

So… how do I explain this?

A simple story

As I said, I don’t actually intend to give any better definition. Any definition I could present here could probably be invalidated by some more “socratic” scrutiny. Instead (and following the field itself), my explanation is based on examples.

Say you are a photographer who takes two types of pictures: (1) landscapes; and (2) people’s faces (say, for their CVs). Let’s assume you have them all stored in a folder on your computer. One fine day you decide to organize your pictures into two folders (landscapes and cv). You start dumping some of your stored photos into the two folders, but after some 50000 images you realize they are too many and think it would be nice to have some automatic method to do that.

Lacking a better idea, you decide that an easy way to check whether an image is of a human is to count the amount of pixels that have a color similar to a human skin color. You create a rule that says something like “if the image has more than 500 pixels of those colors, then it is a cv picture”.
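Such a rule might look something like the sketch below. The color thresholds are completely made up for illustration (real “skin color” detection is much subtler), and the image loading assumes the Pillow library.

```python
import numpy as np
from PIL import Image  # assumes Pillow is installed

def looks_like_cv_picture(path, threshold=500):
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=int)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # A crude, made-up "skin-like" color range.
    skin = (r > 95) & (g > 40) & (b > 20) & (r > g) & (r > b)
    return int(skin.sum()) > threshold
```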

You write a program that counts the number of pixels of those colors in each image and moves the files to the respective folders. You run your program and realize that your cv folder now has a lot of images of sandy landscapes. Your program did not do what you hoped it would.

You take a closer look at the images you had already classified, trying to find patterns that could help you identify the two types of images. You make, say, a histogram of the colors in the two types of images and realize that there is a set of rules you could apply that would work in most of your cases. You implement those rules, run your program, and feel relatively satisfied: you did as well as you could.

Was this AI? Was this ML? (1)

Now… look back at the story and think about it: did you just do AI? I would say that the answer should probably be yes: if your program had worked, someone who had never seen how it works would most probably have believed your program was “intelligent”. On the other hand, your rules were engineered by you, and the machine was only supposed to apply them to decide what to do. While you may feel you “taught” the machine what it should do in each case, all the teaching was done through “explicit programming” (see the first definition of Machine Learning in the previous section). In other words: it was not the machine that learned; it was you.

Revisiting your story

You think you could achieve better performance, but you are not sure exactly how. Instead of simply counting the colors of your pixels and using this to conclude the type of the image, you think it would be a nice idea to use some fancy image processing techniques. You recall something called the “Discrete Fourier Transform”, which gives you a representation of the images in the “frequency domain”. You apply it to some of your already classified images and realize that, indeed, the cv pictures have a very specific pattern of frequencies that you can develop a set of rules for… and anything that does not conform to that pattern seems to be a landscape. You implement your algorithm and feel a little more confident that this time there are fewer errors.
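In code, that frequency-domain representation is one call away. Below is a sketch: the file name is a placeholder, and the “low frequency” window (the central quarter of the shifted spectrum) is an arbitrary choice of mine.

```python
import numpy as np
from PIL import Image  # assumes Pillow is installed

img = np.asarray(Image.open("some_picture.jpg").convert("L"), dtype=float)
spectrum = np.fft.fftshift(np.abs(np.fft.fft2(img)))  # low frequencies centered

# A hand-crafted rule could compare the energy in low vs. high frequencies;
# any threshold here would be just as made up as the 500-pixel rule above.
h, w = spectrum.shape
low = spectrum[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4].sum()
print(low / spectrum.sum())  # fraction of energy in the low frequencies
```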

Was this AI? Was this ML? (2)

If you consider the first version of your “image classifier” to be AI, then you should probably consider this second version AI as well. (One common complaint among AI practitioners is that people stop considering something AI as soon as they find out how it works.)

On the other hand, could you say you just did ML? My answer is still no: applying some fancy “feature extraction” algorithm to understand your data does not mean you are now doing Machine Learning. Again, all the extraction of knowledge is done by you, and you just explicitly instructed the machine to follow what you considered to be the best set of rules you could think of.

So… well… then… actually… when is it ML?

Let’s look one last time at your story. You came a long way to get here: you used some exploratory statistics to find out what the colors in each type of image can tell about them, and also what frequency patterns your images most commonly have. You implemented a rule-based classifier that decided in which folder to put each one of your images.

But you feel all of these rules you developed are not good enough. It would be great if the program could learn by itself what folder to put each image in, based on examples of images of each type. Enter Machine Learning!

When you apply Machine Learning, you don’t tell the machine what rules to use: you give it data, and you expect the program to figure out by itself what to do. For example, let’s say you still have those 50000 images you had manually classified at the beginning of our story. Let’s also say you heard of some popular image classification model called a Convolutional Neural Network, for which you found some code on the internet that you would like to try. In this case, for each one of your 50000 “training” images, you also tell the model what answer you expect it to give back (i.e., either cv or landscape). When you start, the model makes a lot of mistakes, outputting cv many times when it was supposed to output landscape, and vice-versa. But every time it makes a mistake, it updates its internal variables in a way that makes it more likely to answer right the next time. When the model is done training, you should expect it to answer right most of the time (at least for those 50000 images you were using to train it).
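The story talks about a Convolutional Neural Network, but the “update the internal variables after each mistake” loop is easier to see in a tiny stand-in model. Below is a minimal sketch using a logistic-regression-style classifier on made-up two-number “features”; everything here (data included) is hypothetical and only illustrates the loop, not an actual CNN.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up training data: two features per picture, two clusters.
landscapes = rng.normal(loc=0.0, size=(100, 2))   # expected answer: 0
cv_pictures = rng.normal(loc=2.0, size=(100, 2))  # expected answer: 1
X = np.vstack([landscapes, cv_pictures])
y = np.array([0.0] * 100 + [1.0] * 100)

w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(200):                        # repeated passes over the data
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # current guesses, in [0, 1]
    # Wrong guesses nudge the internal variables (w, b) toward better answers.
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

predictions = 1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5
print("training accuracy:", np.mean(predictions == y))  # right most of the time
```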

Now that it is done training, you put those 50000 images aside and use the already trained model to put all of your other images into their respective folders. Notice that you didn’t tell the model what to look for in each image. You may have told it how to update its internal variables, but you didn’t explicitly develop any rule. The “rules” were found by the model itself. It learned them!

Some final thoughts

Machine Learning is a huge field, and it has been going through quite a hype in the last few years. The example I gave in the previous subsections uses something called “Supervised Learning”, which is when you tell the model the expected answer for each training instance. There are other models for which you don’t have to explicitly point out the “right answer” every time (and which are useful when you don’t have those 50000 manually labeled examples you trained your Convolutional Neural Network with).

The models may also learn several different types of things. In the case of Convolutional Neural Networks, it is actually not 100% clear what the networks are learning; however, other types of models may learn, say, rules, just like the ones you were trying to develop manually at the beginning of our story.

Other resources

There are a lot of resources on the internet about both AI and ML. Some, however, are my favorites, and I thought of linking them here.

On Artificial Intelligence:

On Machine Learning:

First Post

Good morning and welcome to the Black Mesa Transit System. This automated train is provided for the security and convenience of the Black Mesa Research Facility personnel. The time is 8:47 AM. Current topside temperature is 93 degrees with an estimated high of 105. The Black Mesa Compound is maintained at a pleasant 68 degrees at all times.

Well… this is the greeting delivered by the Black Mesa Transit System. In the upcoming weeks I intend to have some serious content on this blog, which will hopefully replace this dummy blog post.

In the meantime, maybe you could take a look at the Poems and Songs sections in the sidebar =)