This post summarizes the ideas behind pattern recognition and explains a few terms that come up again and again in the field.
Specifically, we will cover the following:
- What is pattern recognition and where is it used?
- Machine Learning: How can computers learn?
- Training set and test set: Why we need two different data sets
- Preprocessing: The art of data cleaning
- Feature extraction: Clever data reduction to the most important values
- The Classifier: The heart of any pattern recognition software
We will limit ourselves to very abstract concepts and hardly look at concrete implementations or even pseudo code. This post should only serve to get a feeling for pattern recognition.
“Pattern recognition is the ability to recognize regularities, repetitions or similarities in a set of data,” says Wikipedia [1]. So in pattern recognition we look for rules in a given set of data and then use these rules to obtain new information.
Let’s imagine that we receive an order from Deutsche Post (the German postal service), which reads as follows: “We need a piece of software that automatically reads the addresses on parcels and routes each parcel according to the recipient address”.
Our task now is to write text recognition software, which is of course not easy for us as beginners in the field of pattern recognition. Text recognition is one of the classic examples of pattern recognition. However, the following areas also belong to pattern recognition:
- Voice recognition (What did the user say?)
- Object recognition (What is visible in the picture? Dogs? Cats? Cars?)
- Weather forecast (How will the weather be tomorrow? Will it rain?)
- Stock price forecast (How is the price behaving? Will it rise soon? When should I sell or buy?)
- Production monitoring (When will my product break down? Which steps must be carried out to extend its lifetime? Can this already be seen during production?)
- Translation programs (Translating texts into another language, e.g. English <-> German)
- Antivirus systems (Is the file a virus? If so, how threatening? Do I know this kind of virus?)
- Much more
Pattern recognition can be found practically everywhere and in every part of computer science. Almost everyone has probably written a very simple variant of pattern recognition software at some point.
But now back to our text recognition software. The first idea that comes to mind is to define fixed rules for what a digit or letter should look like. We start with the digits and ignore the letters completely for now. Let us simply draw up our first rule: “A zero is always a closed circle”.
This sounds logical and makes sense at first glance. So when our program gets an image of a digit and has to say which digit it is, we recognize the 0 by the fact that all ink pixels in the image are connected to each other and together form a closed circle.
There are plenty of algorithms, some simple and some complex, for checking whether pixels are arranged in a circle. We define the same kind of rules for the other digits, so a 1 could always be “a line with a hook”. We construct a rule for each of our digits, and after initial tests it all looks good. So we proudly hand over our first pattern recognition software to Deutsche Post and are already looking forward to our award.
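To make the first rule a bit more concrete, here is a minimal sketch of how “a zero is a closed circle” could be checked. It assumes the digit arrives as a small grid of characters (`#` for ink, `.` for background), and the names `has_enclosed_hole` and `looks_like_zero` are made up for this post. Instead of fitting an actual circle, the sketch only tests whether the ink encloses some background, which is a simplification of the rule.

```python
from collections import deque


def has_enclosed_hole(image):
    """Return True if some background region is fully surrounded by ink.

    Assumption: `image` is a list of equal-length strings,
    '#' = ink pixel, '.' = background pixel.
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    queue = deque()

    # Start a flood fill from every background pixel on the image border.
    for y in range(height):
        for x in range(width):
            on_border = y in (0, height - 1) or x in (0, width - 1)
            if on_border and image[y][x] == ".":
                seen[y][x] = True
                queue.append((y, x))

    # Spread through all background pixels reachable from the border.
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < height and 0 <= nx < width:
                if image[ny][nx] == "." and not seen[ny][nx]:
                    seen[ny][nx] = True
                    queue.append((ny, nx))

    # Background that the fill never reached is enclosed by ink.
    return any(
        image[y][x] == "." and not seen[y][x]
        for y in range(height)
        for x in range(width)
    )


def looks_like_zero(image):
    """Hand-written rule: a zero is ink that closes around a hole."""
    return has_enclosed_hole(image)


ZERO = [
    ".###.",
    "#...#",
    "#...#",
    "#...#",
    ".###.",
]
ONE = [
    "..#..",
    ".##..",
    "..#..",
    "..#..",
    "..#..",
]

print(looks_like_zero(ZERO), looks_like_zero(ONE))
```

On these two hand-drawn samples the rule separates the zero from the one, which matches our optimistic first tests.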
So far so good … Right?
After a short time Deutsche Post writes to us with something like this:
“The software hardly recognizes letters and numbers. Out of 100 letters/numbers only about 2 are recognized correctly! Total junk!” Outraged and astonished, we take the long way and drive directly to the head office of the post office. After seeing a few hundred misrecognized examples, we believe the client and go back to the drawing board:
“So what happened?”
Our problem has two parts. First of all, we did not notice that our rules do not always work, because we only tested our software on digits and letters that we had written ourselves. For our own 0s the rule was perhaps perfect and sufficient. But unfortunately the software has to work for any handwriting.
Our second problem is even worse than our first:
Obviously, our defined rules were not good enough. We quickly realize that a 0 is not always a circle and does not always have to be closed. If we think about it a little more, we also notice that the 0 could just as well have a line through the middle. A 0 can also be more angular than oval, and we can think of many more special cases that we have not considered. But establishing a rule for each case would be a huge amount of work. Above all, we can never be sure that we have covered all possible cases. It is therefore quite difficult to establish concrete rules that work in all cases. And this is exactly where machine learning comes into play.
What is machine learning?
Instead of setting exact rules, we show the computer thousands, sometimes millions, of examples of an object in various situations/exposures/sizes/etc. and hope that the algorithm is smart enough to recognize the common pattern in the data.
So we give the computer knowledge by showing it examples (and telling it what each one is), similar to how we would teach a child. The computer itself should discover rules such as “0s are mostly oval” and combine them in such a way that it can easily distinguish between the different digits. Since the computer finds the rules on its own and we have to set up few or no rules by hand, we can have hundreds of different rules per digit, which together might then work for, say, 99% of the cases. Finding so many rules may take an enormous amount of time, but that does not bother the computer! We just let it search for rules for a few days, because finding the rules usually only has to happen once.
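As a rough illustration of “learning from labelled examples”, the sketch below averages all examples of a digit into one template and assigns a new image to the closest template. This nearest-template approach is only one very simple learning method chosen for this post, not the method a real system would necessarily use. The 3x3 “images” and the function names `train` and `predict` are made up.

```python
def train(examples):
    """Average all examples of each label into one template per label.

    Assumption: `examples` is a list of (pixels, label) pairs, where
    `pixels` is a flat list of 0/1 values of the same length everywhere.
    """
    sums, counts = {}, {}
    for pixels, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(pixels)
            counts[label] = 0
        sums[label] = [s + p for s, p in zip(sums[label], pixels)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in sums[label]]
        for label in sums
    }


def predict(templates, pixels):
    """Return the label whose template has the smallest squared distance."""
    def distance(template):
        return sum((t - p) ** 2 for t, p in zip(template, pixels))

    return min(templates, key=lambda label: distance(templates[label]))


# Made-up 3x3 training images, written row by row (1 = ink).
training_examples = [
    ([1, 1, 1, 1, 0, 1, 1, 1, 1], "0"),  # closed ring
    ([0, 1, 1, 1, 0, 1, 1, 1, 1], "0"),  # ring with a missing corner
    ([0, 1, 0, 0, 1, 0, 0, 1, 0], "1"),  # vertical bar
    ([1, 1, 0, 0, 1, 0, 0, 1, 0], "1"),  # bar with a small hook
]
templates = train(training_examples)

# Two images that were not part of the training examples.
unseen_zero = [1, 1, 1, 1, 0, 1, 1, 1, 0]
unseen_one = [0, 1, 0, 0, 1, 0, 0, 1, 1]
print(predict(templates, unseen_zero), predict(templates, unseen_one))
```

Nobody wrote a rule such as “a zero is a ring” here. The templates come purely from the labelled examples.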
Our solutions are as follows:
- We collect more examples from different people and in different fonts so that our computer can find rules for all kinds of writing. To avoid another slip and to find out how good the rules found really are, we divide our data into a training set and a test set.
- We throw machine learning at our data and hope for good results.
Training and test sets
Once we have a few thousand pictures of digits, we could start with machine learning. But the problem is, as with our manual rules before, that we do not know how well the rules work for “unknown” handwriting and fonts.
We need a reliable way to measure how well our rules work for handwriting the computer has never seen before, so that we can estimate how many digits will really be recognized correctly. Keep in mind that we cannot train on everyone’s handwriting: collecting digits from every single person is simply not feasible in terms of time and effort. Instead, we want to find out whether our algorithm is suitable for use in real life. To find this out, we divide the examples that we show the computer into two groups: the training set and the test set.
The training set contains the examples that are shown to the computer during learning; the computer uses the complete training set for machine learning. The test set, on the other hand, is usually much smaller than the training set. It contains only examples which are not in the training set and which the computer has never seen before.
To find out how well our rules work in reality, we apply the learned rules to our test set. Since the test set contains only examples the computer has not seen, the hit rate on it gives a realistic estimate of how well the algorithm will work in practice.
Let us imagine we get 100 different examples of each digit from the post office. The training set could then consist of 90 randomly selected examples per digit and the test set of the remaining 10. The hit rate on the training set could be about 99%, because the rules were found based on this set. On the test set, on the other hand, perhaps only 80% of all digits are predicted correctly.
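The 90/10 split per digit described above could be sketched as follows. The function name `split_per_label`, the fixed random seed and the placeholder strings standing in for real images are all assumptions made for this illustration.

```python
import random


def split_per_label(examples, train_fraction=0.9, seed=0):
    """Randomly split (example, label) pairs into training and test set.

    The split is done separately for each label, so every digit ends up
    with the same share of examples in both sets.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable

    by_label = {}
    for example, label in examples:
        by_label.setdefault(label, []).append((example, label))

    training_set, test_set = [], []
    for label in sorted(by_label):
        items = by_label[label][:]
        rng.shuffle(items)
        cut = round(len(items) * train_fraction)
        training_set.extend(items[:cut])
        test_set.extend(items[cut:])
    return training_set, test_set


# Placeholder data: 100 dummy "images" per digit, represented by a name.
examples = [(f"digit{d}_sample{i}", d) for d in range(10) for i in range(100)]
training_set, test_set = split_per_label(examples)
print(len(training_set), len(test_set))
```

The important property is that no example appears in both sets, so the test set really only contains examples the computer has never seen.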
So let us assume that we have a hit rate of 80% for the test set with machine learning. Put simply, this means that about 20% of all letters will be sent to the wrong recipient or will not be delivered at all. Compared to before, we have improved our algorithm significantly, but we are still a long way from our goal. There must be a way to achieve even better performance!
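The hit rate itself is just the share of correct predictions. A small sketch, with a made-up function name `hit_rate` and made-up predictions for ten test digits:

```python
def hit_rate(predictions, true_labels):
    """Fraction of predictions that match the true labels."""
    if not true_labels or len(predictions) != len(true_labels):
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(p == t for p, t in zip(predictions, true_labels))
    return hits / len(true_labels)


# Made-up test set of ten digits; the 7 and the 9 are misread.
true_labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
predictions = [0, 1, 2, 3, 4, 5, 6, 1, 8, 4]

rate = hit_rate(predictions, true_labels)
print(f"hit rate: {rate:.0%}, error rate: {1 - rate:.0%}")
```

Eight of the ten digits are read correctly, so the hit rate is 80% and the remaining 20% are errors.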
Preprocessing – Cleaning helps!
Everyone will recognize the sentence “If you keep your room tidied up, you will find your things again” from childhood. As annoyed as we were by this statement and the associated tidying up, we knew it was true. The same applies to pattern recognition software:
Tidying up is half the battle
Cleanup, also known as preprocessing, removes irregularities and contradictory statements from our data. Preprocessing thus brings the data into a uniform form. Let us take a look at these four pictures of one and the same number:
Obviously, this is always the same digit in the same font, and yet our machine learning algorithm will extract at least four additional rules.
The character is a 0 if it satisfies the properties of 0s and:
- the background is white and 0 is red.
- the background is green and 0 is red.
- the background is grey and 0 is red.
- the background is grey and 0 is green.
To cover every color combination, we would need millions of examples in all sorts of color combinations for our computer to learn all these colors. Preprocessing helps us to avoid this problem. We will clean up the data a little bit! If we convert all the images to black and white (also called binarization), our computer no longer needs to learn which colors can be combined, but can concentrate on the more important properties. So we take this work off the computer and say right from the beginning: “The color of the pictures does not matter” by converting all pictures into black and white before we show them to the computer.
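A minimal sketch of such a binarization step is shown below. It assumes that an image is a list of rows of `(r, g, b)` tuples and that the background covers more pixels than the digit; both are assumptions for this illustration, as are the names `binarize` and `paint`. The brightness weights are the common luma weights for converting RGB to grayscale.

```python
def to_gray(pixel):
    """Approximate brightness of an (r, g, b) pixel using luma weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def binarize(image):
    """Turn a color image into rows of 0/1 values, where 1 marks the digit."""
    gray = [[to_gray(p) for p in row] for row in image]
    values = [v for row in gray for v in row]

    # Threshold halfway between the darkest and the brightest pixel.
    threshold = (min(values) + max(values)) / 2
    bright = [[v > threshold for v in row] for row in gray]

    # Assumption: the background covers more pixels than the digit,
    # so the less frequent brightness class is treated as the digit.
    bright_count = sum(b for row in bright for b in row)
    digit_is_bright = bright_count < len(values) / 2
    return [[int(b == digit_is_bright) for b in row] for row in bright]


WHITE, GREY = (255, 255, 255), (128, 128, 128)
RED, GREEN = (255, 0, 0), (0, 255, 0)
SHAPE = [".#.", ".#.", ".#."]  # tiny stand-in for a digit


def paint(shape, background, ink):
    """Build a color image from a shape and two colors."""
    return [[ink if c == "#" else background for c in row] for row in shape]


variants = [
    paint(SHAPE, WHITE, RED),
    paint(SHAPE, GREEN, RED),
    paint(SHAPE, GREY, RED),
    paint(SHAPE, GREY, GREEN),
]
results = [binarize(v) for v in variants]
print(all(r == results[0] for r in results))
```

All four color combinations end up as the same black and white image, so the learning step no longer has to deal with color at all.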
Let’s have a look at the pictures:
This looks much better, but you may have the feeling that it is still not perfect. Now you really notice that the pictures are also slightly rotated. Of course, this means more rules for the computer to learn. This can also be corrected by automatically rotating all images so that they share the same orientation. The next preprocessing step is to bring all digits to the same size, stroke thickness, and so on. As you can see, there are several possible preprocessing steps, and each step can further improve the hit rate.
By now we have brought all our pictures into a similar form: they are all about the same size, all black and white, all oriented the same way, and so on. Even so, we might not get beyond a 95% hit rate on our test set, which is still not good enough.
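Bringing all digits to the same size could look roughly like the sketch below: crop the image to the area that contains ink, then rescale it to a fixed grid. It assumes binary images given as lists of 0/1 rows and uses simple nearest-neighbour sampling; the function names and the target size of 4 are made up. Rotation correction is not covered here.

```python
def crop_to_ink(image):
    """Cut away empty rows and columns around the digit."""
    rows = [y for y, row in enumerate(image) if any(row)]
    cols = [x for x in range(len(image[0])) if any(row[x] for row in image)]
    if not rows:
        return image  # blank image, nothing to crop
    return [
        row[cols[0]:cols[-1] + 1]
        for row in image[rows[0]:rows[-1] + 1]
    ]


def resize(image, size):
    """Rescale to size x size pixels with nearest-neighbour sampling."""
    height, width = len(image), len(image[0])
    return [
        [image[y * height // size][x * width // size] for x in range(size)]
        for y in range(size)
    ]


def normalize_size(image, size=4):
    """Crop to the digit and bring it to a fixed size."""
    return resize(crop_to_ink(image), size)


# The same made-up shape, once small inside a large canvas, once large.
small = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
large = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
print(normalize_size(small) == normalize_size(large))
```

After this step, position and size of the digit inside the picture no longer matter to the learning algorithm.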
Feature Extraction – First Aid for Computers
To further increase our hit rate, we can help the computer find rules.
Until now, the computer always got the complete image and had to find rules based on the pixel values. But if we tell the computer in advance what to look for, we can combine our human knowledge with machine learning and hopefully get results that are exceptionally good.
Let’s take a look at these A letters for clarification:
Since we have applied our preprocessing steps to it, the As all look very similar. Nevertheless, they differ in the font. Some A letters are a bit more curved, others are a bit more angular and others are shaky. We as humans know of course that the font has no influence on which letter is shown, but of course the computer does not. By splitting the whole letter into “perfect” lines and circles, we save ourselves a bunch of rules.
We no longer say to the computer: “At position 0,0 is a white pixel, at position 1,0 a black pixel,….” and this for all pixels in the image, but we can limit ourselves to saying to the computer: “At position 12,3 is a line that goes to 20,5, at position 2,2 is a circle with radius 5,…”.
This way we combine the pixels into larger objects and help the computer to find rules. Feature extraction brings the data into a form that makes it easier for the computer to find rules. Preprocessing and feature extraction cannot always be clearly distinguished from each other; often enough the two steps overlap.
In summary, however, one can say that preprocessing consists of procedures that try to bring the data into a uniform form (same color, same position, etc.). Feature extraction, on the other hand, tries to incorporate human knowledge and to reduce the data to the most important information.
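To give a feeling for “lines instead of pixels”, here is a small sketch that describes a binary image by its straight horizontal and vertical strokes. It is deliberately limited: circles and diagonal lines are ignored, and the function name `extract_lines` and the minimum stroke length are assumptions for this illustration. Positions are written as `(x, y)`.

```python
def extract_lines(image, min_length=3):
    """Describe a binary image as horizontal and vertical strokes.

    Returns a list of ((x_start, y_start), (x_end, y_end)) pairs for every
    straight run of ink that is at least `min_length` pixels long.
    """
    height, width = len(image), len(image[0])
    lines = []

    # Horizontal runs, row by row.
    for y in range(height):
        x = 0
        while x < width:
            if image[y][x]:
                start = x
                while x < width and image[y][x]:
                    x += 1
                if x - start >= min_length:
                    lines.append(((start, y), (x - 1, y)))
            else:
                x += 1

    # Vertical runs, column by column.
    for x in range(width):
        y = 0
        while y < height:
            if image[y][x]:
                start = y
                while y < height and image[y][x]:
                    y += 1
                if y - start >= min_length:
                    lines.append(((x, start), (x, y - 1)))
            else:
                y += 1

    return lines


# A made-up 5x5 letter "T" (1 = ink).
LETTER_T = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
print(extract_lines(LETTER_T))
```

Instead of 25 pixel values, the classifier now only receives two lines: one from 0,0 to 4,0 and one from 2,0 to 2,4.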
After helping our computer to extract the rules, we may get a 99.9% hit rate. That is not perfect, but finally good enough to use. The remaining 0.1% must then be viewed and deciphered by an employee of the post office.
The Classifier – The Heart
“Stop! Stop! What is a classifier now?” you might think.
In fact, we have assumed the existence of a classifier all along without calling it one. We used machine learning to make the computer construct its own rules. But machine learning is only the process of learning, not the resulting program. Simply put, a classifier is the program that finds the rules, manages them, applies them, and finally tells us which digit it is. The classifier is the last step in our pattern recognition pipeline. All information and all rules come together in it, and based on experience and the learned rules the classifier then delivers a result. There are dozens of different classifiers, such as artificial neural networks, SVMs, k-nearest-neighbour classifiers, expert systems and many more. Each classifier is better or worse suited to different application areas and feature extraction methods.
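As one concrete example, a k-nearest-neighbour classifier can be sketched in a few lines: it stores all training examples and lets the `k` most similar ones vote on the answer. The class name, the `fit`/`predict` method names and the two features per digit (number of enclosed holes, height divided by width) are hypothetical choices for this sketch, and the feature values are invented.

```python
from collections import Counter


class KNearestNeighbours:
    """Classify a feature vector by a vote among its k closest examples."""

    def __init__(self, k=3):
        self.k = k
        self.examples = []

    def fit(self, feature_vectors, labels):
        # "Learning" here simply means remembering the labelled examples.
        self.examples = list(zip(feature_vectors, labels))
        return self

    def predict(self, features):
        def squared_distance(other):
            return sum((a - b) ** 2 for a, b in zip(features, other))

        nearest = sorted(
            self.examples, key=lambda example: squared_distance(example[0])
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


# Invented features: (number of enclosed holes, height / width).
training_features = [
    (1, 1.4), (1, 1.5), (1, 1.3),   # zeros
    (0, 3.0), (0, 2.8), (0, 3.2),   # ones
    (2, 1.5), (2, 1.6), (2, 1.4),   # eights
]
training_labels = ["0", "0", "0", "1", "1", "1", "8", "8", "8"]

classifier = KNearestNeighbours(k=3).fit(training_features, training_labels)
print(classifier.predict((1, 1.45)), classifier.predict((0, 2.9)))
```

Other classifiers replace the “remember and vote” idea with something else, but from the outside they all do the same job: features go in, a label comes out.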
In this section we summarize what we have learnt today in a graph to better understand the dependencies and the process (examples for each stage of the pipeline are given in brackets).
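The order of the stages can also be written down as a tiny program. Every stage below is a deliberately trivial placeholder invented for this sketch: the threshold of 128, the single “ink fraction” feature and the fixed rule inside `classify` stand in for real preprocessing, real feature extraction and a trained classifier.

```python
def preprocess(gray_image):
    """Placeholder preprocessing: pixels darker than 128 count as ink (1)."""
    return [[1 if value < 128 else 0 for value in row] for row in gray_image]


def extract_features(binary_image):
    """Placeholder feature: the fraction of pixels that are ink."""
    pixels = [p for row in binary_image for p in row]
    return sum(pixels) / len(pixels)


def classify(ink_fraction):
    """Placeholder rule standing in for a trained classifier."""
    return "0" if ink_fraction > 0.5 else "1"


def recognize(gray_image):
    """The pipeline: raw image -> preprocessing -> features -> classifier."""
    return classify(extract_features(preprocess(gray_image)))


# Made-up 3x3 grayscale images (0 = black, 255 = white).
ring = [
    [20, 30, 20],
    [25, 240, 30],
    [20, 25, 30],
]
bar = [
    [250, 10, 240],
    [245, 20, 250],
    [240, 15, 245],
]
print(recognize(ring), recognize(bar))
```

In a real project each of these three functions would be swapped for something far more capable, while the order of the stages stays the same.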
What I find particularly fascinating about this topic is the enormous variety that can be found in every pattern recognition project. There are a few methods and procedures that you can almost always use and that nearly always improve the hit rate. But which combination of methods really yields THE best hit rate can usually only be estimated in advance, not reliably predicted or even calculated.
This is exactly what makes the field of pattern recognition so extensive and complex. In general, there is no perfect solution to any problem in this area.
In game programming, for example, I know I can find the shortest path between A and B via Dijkstra or A* (given non-negative edge costs and, for A*, an admissible heuristic), and I know the algorithm will give an optimal result no matter how often and on which problems I apply it. Unfortunately, there is no such algorithm in pattern recognition, even though many people already dream of such “intelligence”.
Marcel Rupprecht, "The Pattern-Recognition-Pipeline", in Neural Ocean, 2018 (accessed March 30, 2020), //neuralocean.de/index.php/2018/03/28/the-pattern-recognition-pipeline/.