Rules vs. Learning
For automated decisions, we can use explicit hard-coded rules. Or we can let machines learn the rules from data. What is the difference?
We are familiar with rules for making decisions. Consider the example of a traffic light. When the light is red, we know we must stop and wait. When the light turns green, we know we can go. And when we approach a traffic light that is just switching from green to yellow, we need to decide whether to accelerate or brake. That decision could be based on our speed and/or the distance to the traffic light. The following image shows an example of a rule-based decision for this problem:
When a computer has all the required information (status of the traffic light, distance to the traffic light), we can easily automate this simple decision using a program similar to the one in the image above.
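As a rough sketch, such a program could look like the following Python snippet. The speed and distance thresholds are made up for illustration and are not part of the figure above.

```python
# A minimal sketch of a rule-based decision for the traffic-light problem.
# The thresholds (speed in km/h, distance in metres) are invented for illustration.

def decide(light: str, speed_kmh: float, distance_m: float) -> str:
    """Return 'go', 'stop', or 'brake' based on hard-coded rules."""
    if light == "green":
        return "go"
    if light == "red":
        return "stop"
    # light is yellow: decide based on speed and distance
    if speed_kmh > 50 and distance_m < 20:
        return "go"     # too fast and too close to stop safely
    return "brake"

print(decide("yellow", speed_kmh=60, distance_m=15))  # -> go
print(decide("yellow", speed_kmh=30, distance_m=40))  # -> brake
```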
The image below shows a cat. We know this immediately when we look at the image. This is because our brains have evolved to recognize important concepts in our environment automatically. Recognizing a tiger standing in front of us was, and is, crucial for our survival.
For computers, it is not nearly that easy. A computer is a general-purpose machine that has no concept of things like cats or other objects. To a computer, the image below is nothing but a long list of numbers. To make things worse, these numbers aren't decimal but binary numbers, consisting only of 0 and 1, or bits. Depending on whether the image is color or grayscale, a chunk of bits of a certain length represents the color of a single pixel.
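We can see this for ourselves by loading an image and printing its raw pixel values. The snippet below is a small sketch that assumes a hypothetical file cat.png and the Pillow and NumPy libraries.

```python
# Sketch: what an image looks like to a computer.
# Assumes a file "cat.png" exists and that Pillow and NumPy are installed.
from PIL import Image
import numpy as np

img = np.array(Image.open("cat.png").convert("L"))  # "L" = 8-bit grayscale
print(img.shape)         # (height, width)
print(img.dtype)         # uint8: each pixel is one byte, 0..255
print(img.ravel()[:20])  # the first 20 values of the long list of numbers
```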
So unless we teach our computer how to recognize a cat in a stream of thousands or millions of 0s and 1s, it won't be able to do it. The question then is: Can we teach a computer with a set of rules what a cat looks like in an image? Just like we did with the traffic light example above? Or do we need a different approach?
You can probably guess the answer, but let's at least give it a try first: What could a set of rules look like that tells us whether an image contains a cat or not? Here is an idea, albeit a naive one: Let's imagine we create a prototypical cat blueprint and compare a new image pixel by pixel with that blueprint to calculate a similarity score. We could then define a similarity threshold to make the decision. This might work for simple, standardized objects that always appear the same in an image. But cats can appear very differently: sitting, running, jumping, chasing. They can be close to the camera or far away. Cats come in different shapes and colors. The possible combinations are endless, and this makes it impossible to come up with a blueprint filter for a cat.
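As a sketch, the naive blueprint idea could be written like this. The threshold of 0.8 and the random toy images are arbitrary choices for illustration, not values from the text.

```python
import numpy as np

def is_cat(image: np.ndarray, blueprint: np.ndarray, threshold: float = 0.8) -> bool:
    """Naive blueprint matching: mean pixel-by-pixel agreement vs. a fixed threshold."""
    # both arrays are assumed to have the same shape, with values scaled to [0, 1]
    similarity = 1.0 - np.mean(np.abs(image - blueprint))
    return similarity >= threshold

# toy usage with random "images", just to show the mechanics
rng = np.random.default_rng(0)
blueprint = rng.random((64, 64))
print(is_cat(blueprint, blueprint))             # True: identical to the blueprint
print(is_cat(rng.random((64, 64)), blueprint))  # almost certainly False
```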
Before we solve the cat-recognition problem, why not start with something simpler?
To simplify the problem of recognizing something in an image, let's scale it down. We now focus on small images of 28 by 28 pixels, where each pixel can be black (0), white (1), or any shade of gray in between. The task is to classify each such image into one of 10 categories, representing the digits 0 through 9.
But let's go back to the basics and assume we don't have any of that yet. A handwritten digit from the MNIST dataset is represented in the same way as the cat image above. It consists of a 28 by 28 matrix (784 pixels in total), in which each value represents a shade of gray between 0 (black) and 1 (white). To be precise, the 2D matrix is something we impose for better visual representation. All the computer sees is a long list of 784 × 8 bits, assuming we use 1 byte to represent each shade of gray. Knowing that one row has 28 pixels allows us to display the image correctly.
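Here is a short sketch of this representation, assuming the MNIST data is loaded via Keras (any other MNIST loader would work the same way).

```python
# Sketch: a single MNIST digit as the computer sees it.
# Assumes TensorFlow/Keras is installed; the loader is an implementation detail.
from tensorflow.keras.datasets import mnist

(x_train, y_train), _ = mnist.load_data()

flat = x_train[0].ravel() / 255.0  # 784 values between 0 and 1 -- the "long list"
print(flat.shape)                  # (784,)

image = flat.reshape(28, 28)       # knowing a row has 28 pixels lets us rebuild the grid
print(image.shape)                 # (28, 28)
print(y_train[0])                  # the label of this image
```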
Now that we have a simpler and well-defined problem, can we come up with a system of rules to classify any image from the MNIST dataset into one of the 10 categories and thereby recognize the digit correctly? You can try this yourself in the following exercise:
You might come up with a decent solution to the exercise, but it most likely won't be as good as state-of-the-art algorithms with an accuracy of about 99.75%. And it will be quite an effort to implement. But what makes it so difficult?
If all humans wrote the same way and all digits looked the same, the problem would be straightforward to solve. The prototypical blueprint approach suggested above would work perfectly. We could define a filter for every digit from 0 to 9 and apply each filter to a new image. The filter with the highest similarity value wins and becomes our prediction. This is essentially as if we wrote IF..ELSE rules comparing the image pixel by pixel to the digits 0 through 9; a filter is just a better way to imagine this.
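A minimal sketch of this blueprint-per-digit idea is shown below. It again assumes a Keras MNIST loader and simply takes the first training example of each digit as its blueprint; both choices are assumptions made for illustration.

```python
import numpy as np
from tensorflow.keras.datasets import mnist  # assumption: any MNIST loader would do

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# one rigid "blueprint" per digit: simply the first training example of each class
blueprints = np.stack([x_train[y_train == d][0] for d in range(10)])

def predict(image: np.ndarray) -> int:
    # pixel-by-pixel comparison; the most similar blueprint wins
    distances = np.abs(blueprints - image).reshape(10, -1).sum(axis=1)
    return int(np.argmin(distances))

preds = np.array([predict(img) for img in x_test[:1000]])
print("accuracy:", (preds == y_test[:1000]).mean())  # far below state of the art
```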
However, as we can see from the image below, people write digits very differently, and they don't look the same at all. Just look at instances of the same digit and compare them: you will find many versions of it. With this high degree of variation, our filter-based approach will probably fail, or at least perform much worse than in a pixel-perfect world. Variation is a problem for rule-based systems.
To illustrate the problem with the filter approach, look at the two 3s that are layered on top of each other, one in blue, the other in grayscale. Both clearly represent a 3, but their pixels are in different locations. A naive filter that doesn't account for this variation wouldn't work here.
To cope with the high variation in handwritten digits, we need less strict rules that generalize better. An example of such a rule or "fuzzy" filter is shown in the figure below. For each digit, a pixel in the blue area increases the likelihood for that particular class, while a red pixel decreases it.
Let's look at the filter for the zero. The red area in the middle means that when we find pixels in that area in a new image, it is less likely that the digit in the image is a 0. In contrast, when we find pixels lying on the oval blue shape surrounding this red area, it is more likely a 0. To calculate the score for the class "0", we compare the image pixel by pixel with the fuzzy filter and add or subtract a value depending on whether the pixels fall in the blue or red areas. We calculate the scores for all 10 classes and choose the highest as the winner.
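In code, this scoring scheme could look like the following sketch. The random filters are stand-ins used only to show the mechanics; real fuzzy filters would have to be crafted, or, as we'll see, learned, for each digit.

```python
import numpy as np

def class_score(image: np.ndarray, fuzzy_filter: np.ndarray) -> float:
    """Blue areas (positive weights) add to the score where the image has ink,
    red areas (negative weights) subtract from it."""
    return float(np.sum(image * fuzzy_filter))

def classify(image: np.ndarray, filters: np.ndarray) -> int:
    """filters has shape (10, 28, 28); the class with the highest score wins."""
    scores = [class_score(image, f) for f in filters]
    return int(np.argmax(scores))

# toy usage with random values, just to show the mechanics
rng = np.random.default_rng(0)
filters = rng.uniform(-1, 1, size=(10, 28, 28))
image = rng.random((28, 28))
print(classify(image, filters))
```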
The fuzzy filter from above is not something we want to hard-code manually. Note that the filter doesn't just contain full blue or full red, but all shades in between. This means the filter doesn't simply assign positive or negative values; it assigns a continuous range of weights. Some pixels are more important than others.
This would be a very cumbersome task to do manually. We'd have to go through all 70,000 images and work out a precise value between -1 and +1 for each pixel of each filter, while making sure we don't overfit the filters to the last images we looked at. This is exactly what a machine learning algorithm does when learning the rules (here: the filters) to classify handwritten digits, and it can do it much faster and better than a human could. In the following sections, we'll learn how a machine learning algorithm learns from data.
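As a preview, the sketch below lets a simple learning algorithm, a multinomial logistic regression from scikit-learn, learn one weight per pixel and per class from the data. These learned weights play the same role as the fuzzy filters above. The Keras loader and the accuracy of roughly 92% are assumptions based on typical results, not figures from this text.

```python
import numpy as np
from tensorflow.keras.datasets import mnist          # assumption: any MNIST loader works
from sklearn.linear_model import LogisticRegression

(x_train, y_train), (x_test, y_test) = mnist.load_data()
X_train = x_train.reshape(-1, 784) / 255.0           # one row of 784 pixel values per image
X_test = x_test.reshape(-1, 784) / 255.0

# learns one weight per pixel and per class -- the "fuzzy filters" -- from the data
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))   # typically around 0.92
learned_filters = clf.coef_.reshape(10, 28, 28)      # one learned "fuzzy filter" per digit
```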
This problem is famous in machine learning and is used as a benchmark for new machine learning algorithms. The benchmark is based on the so-called MNIST dataset (Modified National Institute of Standards and Technology database), which contains a total of 70,000 images of handwritten digits. Yann LeCun et al. introduced an approach called Convolutional Neural Networks in their paper Gradient-based learning applied to document recognition, which is still the best-known neural network architecture for solving this problem today. State-of-the-art algorithms correctly recognize about 99.75% of all digits.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998.