A quick explanation of why CNNs are nowadays almost always used for computer vision tasks.
Although CNNs date back to the late 1980s, their breakthrough in competitions came later: a CNN won its first image recognition contest in 2011, and the following year, in 2012, the winners of four image recognition contests were using a CNN as their base architecture! Even now, the CNN still reigns as the queen of computer vision!
Let’s start off this explanation with a question:
What do we do in computer vision?
Instead of the vector-like features of tabular data or encoded text data, here we deal with images. Images do not necessarily come with labels or sub-labels for their regions, so features have to be extracted from the raw pixels, or the image has to be smartly reduced. A simple example illustrates this: say the task is to estimate somebody’s age from an image. We then need to extract the right features to base the inference on, such as the presence of wrinkles, white hair, etc. The key word for this part: feature extraction!
How to extract features from images?
Say we want to extract wrinkles: how are we supposed to find them in an image (not manually, obviously)? Well, wrinkles (fig.1) more or less look like this (fig.2).
In order to capture such lines from the image, an intuitive solution is to use matrix filtering. See the following example (fig.3):
Here the filter is a 3×3 matrix with value 1 on its diagonals and 0 elsewhere. This filter therefore acts as a mask that responds to cross-shaped regions, or to totally dark regions. The filter is multiplied entry-wise with the region being analyzed and the products are summed; a high result means that the filter shape and the region match. Note that filter matrices do not have to be restricted to the values 0 and 1: they can also contain -1 to impose a stricter shape search, since pixels falling outside the shape then lower the score. Proceeding the same way for wrinkles, we can use the following filters (fig.4).
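To make the matching score concrete, here is a minimal sketch in Python with NumPy. It assumes the filter has ones on both diagonals (an X shape) and that a pixel value of 1 means "dark"; the names x_filter, strict_filter and match_score are made up for this illustration and do not come from the figures.

```python
import numpy as np

# Assumed filter: 1 on both diagonals (an X shape), 0 elsewhere.
x_filter = np.array([[1, 0, 1],
                     [0, 1, 0],
                     [1, 0, 1]])

# Stricter variant: -1 wherever the X shape should NOT be present.
strict_filter = np.array([[ 1, -1,  1],
                          [-1,  1, -1],
                          [ 1, -1,  1]])

def match_score(region, kernel):
    """Multiply region and kernel entry by entry, then sum the products."""
    return int(np.sum(region * kernel))

# Two toy 3x3 regions (assumption: 1 = dark pixel, 0 = light pixel).
x_region = x_filter.copy()                # a perfect X
full_region = np.ones((3, 3), dtype=int)  # a totally dark region

print(match_score(x_region, x_filter))          # 5
print(match_score(full_region, x_filter))       # 5
print(match_score(x_region, strict_filter))     # 5
print(match_score(full_region, strict_filter))  # 1
```

With only 0 and 1 in the filter, the totally dark region scores as high as the real X. With -1 entries added, the dark region drops to 1 while the X keeps its score of 5, which is what a stricter shape search means here.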
We can apply these filters across the original image and keep only the high values, which indicate that the shape has been detected. This way we know both which features were extracted from the original image and where they are located. (Remark: this step is better known as a convolution layer, but we will not dwell on it here, as another article will explain the layers of a CNN in detail.)
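As a rough sketch of this step, the Python snippet below slides a filter over every position of a toy image and keeps the positions whose score reaches a threshold. The 5×5 image, the diagonal filter, the threshold of 3 and the function name slide_filter are all assumptions chosen for illustration; they are not the wrinkle filters from the figures.

```python
import numpy as np

def slide_filter(image, kernel):
    """Score every kernel-sized region of the image (no padding, stride 1).

    The kernel is not flipped, which is how deep learning libraries
    usually implement what they call a convolution.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    response = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i:i + kh, j:j + kw]
            response[i, j] = np.sum(region * kernel)
    return response

# Toy 5x5 image containing one diagonal line (1 = dark pixel).
image = np.eye(5)

# Hypothetical 3x3 filter that looks for a diagonal stroke.
diag_filter = np.eye(3)

response = slide_filter(image, diag_filter)

# Keep only the high values: positions where the whole stroke is present.
threshold = 3  # assumed: all three pixels of the filter must match
detections = np.argwhere(response >= threshold)

print(response)
print(detections)  # top-left corners of the regions that matched
```

The response map is high only where the diagonal stroke lies, so the list of detections tells us both what was found and where.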
Filters? But what kind of filters should I use in general?
Unlike in the previous example, the filters needed for a general feature extraction task can have various sizes, and once they get large, enumerating or guessing them becomes too difficult. Plus, most of the time we do not even know which features we should be extracting for the classification. Well, guess what? A CNN does not require us to guess which filters to use. Indeed, training a CNN means training these filters, which are the weights of the network. Thus a CNN can uncover details that humans would never notice, for instance to determine the age of someone in a picture. THAT is the strength of a CNN. (So far we have worked with only a few filters applied at the same step, but usually many more filters of different sizes are used, and they are applied one after another in successive layers.)
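To give a feel for what "training the filters" means, here is a deliberately simplified sketch in Python with NumPy. It is not how a full CNN is trained (a real network has many layers, non-linearities and a classification loss, and is trained with backpropagation). It assumes a single 3×3 filter, random toy images, and a squared-error loss against the responses of a hidden filter called true_filter; all names and numbers are made up for the example.

```python
import numpy as np

def extract_patches(image, k):
    """Return every k x k region of a 2-D image as a flattened row."""
    h, w = image.shape
    return np.array([image[i:i + k, j:j + k].ravel()
                     for i in range(h - k + 1)
                     for j in range(w - k + 1)])

rng = np.random.default_rng(0)

# Hypothetical "unknown" filter that the training should rediscover.
true_filter = np.array([[1., 0., 1.],
                        [0., 1., 0.],
                        [1., 0., 1.]])

# Toy training set: random 8x8 images and the scores the hidden filter gives.
images = rng.standard_normal((20, 8, 8))
patches = np.vstack([extract_patches(img, 3) for img in images])  # (720, 9)
targets = patches @ true_filter.ravel()

# Start from an all-zero filter; its 9 entries are the trainable weights.
weights = np.zeros(9)
learning_rate = 0.1  # assumed value, small enough for this toy problem

for _ in range(500):
    errors = patches @ weights - targets               # prediction error
    gradient = 2.0 * patches.T @ errors / len(patches) # d(mean sq. error)/dw
    weights -= learning_rate * gradient                # gradient descent step

learned_filter = weights.reshape(3, 3)
print(np.round(learned_filter, 2))
```

Nobody tells the procedure what the filter should look like: it only sees images and desired outputs, and the printed matrix should end up very close to true_filter. A CNN does the same thing at a much larger scale, for many filters at once.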