Today we have a look at how you can create a text recognition software with simple methods without being dependent on Tesseract and similar.
We will get to know the following:
- Semantic Local Binary Patterns (S-LBP): As feature extraction method1
- Histograms: As method to efficiently merge/collect features
- k-Nearest-Neighbor Classifier: As simple method to classify samples
A binary for windows can be found here: SimpleOCR
This binary is very close to this tutorial but uses a few more advanced techniques.
In principle, the requirements are limited, you should only know how pictures are roughly structured and have worked through the following post once:
For this small project we limit ourselves to numbers of the same typeface. At first glance, this may make everything exceptionally simple, but it will not be as simple as you think. Our approach should also work for slight changes (such as resizing or shifting).
Some time ago I extracted some sequences of numbers from a game. These sequences of numbers indicate the current score and are sufficiently complex and yet simple enough to write a simple text recognition based on these numbers. A few examples of the extracted numbers:
We see that the images are of different sizes and the numbers can theoretically be arbitrarily long or short (but at least one digit).
To make life a little harder for us, we add a few sequences of numbers with different colors:
The complete set of numbers is available at Download Numbers.
The best way to solve such problems is always to work through the pattern recognition pipeline. Especially at the beginning. This prevents forgetting something.
The first thing you notice are the different colors. The colors have no influence on the digits. A 2 in yellow looks exactly like a 2 in red or blue. Therefore, the first step is to color the numbers uniformly. The optimum would be to binarize the numbers, i.e. to color the numbers black and the background white. But the longer you think about it, the harder this preprocessing step seems to become. After all, we cannot just say:
“Turn the darkest pixels on the image black and the brightest pixels white” (or the other way around). Because we have both numbers in very bright colours and numbers in very dark colours. You can now think about complex algorithms, but sometimes it makes sense not to spend too much time preprocessing in such situations, because the problem could solve itself later with feature extraction. Not applying preprocessing is not good either. But what we can do easily is to convert the color image into a grayscale image. Converting color images to grayscale images is very easy and fast. All you have to do is go over every pixel in the image and apply the following formula:
Translated with www.DeepL.com/Translator
Grayvalue = 0.33 * Red + 0.33 * Green + 0.33 * Blue
There are other formulas to create a grayscale image, but for us this formula is perfect. If you apply that to all the pictures, you get something like this:
Obviously, the differently colored images are still different, but they now all have at least the same color, namely gray. Another preprocessing step would be to resize all digits to the same size or to cut off the border to the left of the numbers. It is also not worth bringing all numbers to the same size due to our feature extraction method. Removing the left margin is not too complicated and we will split the complete number into single digits in a later step anyway. That is why we are postponing the step to later. Converting the images into grayscale images is the only preprocessing step we really use for this data set.
In the next step we extract some features from our grayscale image to help the computer recognize rules, there are many possible feature extraction methods that we could use and that would all give similar good results. However, we limit ourselves to a variation of local binary patterns (also abbreviated as LBP).
Local Binary Pattern
LBP is a good feature extraction method that is very easy to implement and understand. Therefore, it enjoys great popularity. The idea behind LBP is simple. LBP analyzes each pixel in the image individually and takes a closer look at the direct neighborhood of this pixel. Each pixel that is significantly brighter than the analyzed pixel is marked with a 1 and each pixel that is darker than or similarly bright to the pixel in the center is marked with a 0. Normally you set a fixed limit to reach, until you say, that a pixel differs clearly from the pixel in the middle in brightness. For a grayscale image with values from 0 to 255, you usually take values in the range 10-40. The sequence of 0s and 1s around the pixel in the center are taken as a binary number and assigned to the pixel in the center. If we look at the whole graphically, the principle should become clear quickly (in this example the neighboring pixel must be at least 10 brighter):
LBPs are not that complicated. LBPs in themselves are well suited for various problems, but even better is an extension of the LBPs, the Semantic-LBPs1. A research paper some time ago stated, 1 that for pattern recognition those LBP patterns are important that have no interruption in the ones or zeros. That means this would be LBPs, which have no interruption in the respective one or zero sequence:
These sequences have got interruptions:
According to the results of Mu et al.1< those patterns without interruption have the greatest influence on the detection rate. Of course, we benefit from this realization because we like to reduce our data to the bare essentials in order to reduce the processing time and to extract only the most necessary rules.
So we throw out all calculated LBPs that have interruptions. We also kick out LBPs that only consist of 1s or only 0s. LBPs consisting of only 0s mean that no pixel in the neighborhood differs from the pixel in the center. So that means we found an area. The largest area is the background and of course we do not want to recognize it, so we do not need these LBPs either. It is a different situation when all the neighbouring pixels differ too much, then we are looking at an isolated pixel that does not belong to a digit, because no digit in the world consists of a single pixel. After we have applied the LBP algorithm to the whole image and threw out the unused LBPs, we get an image that looks something like this:
In this picture I have all LBPs that were not thrown out colored in white and all others in black. You can see that the remaining LBPs roughly represent the edges of the digits. This is the first step towards SLBP. But there are still a few small things left that can improve the LBP. To understand the next steps, we must first understand what histograms are and what we want to use them for.
Histograms are often used to make feature extraction more robust against resizing, displacement and similar.
The idea of a histogram is very simple. You just count how often a certain feature appears in an image. We will illustrate this again with an example. Let us imagine that we have the task of deciding whether an image contains a circle or a line. Both the circle and the line are represented by white pixels and the background is always black. The circle and the line are always the same length/size, but can appear anywhere in the image. Some examples:
The easiest way to find out whether a circle or a line is visible is to count the white or black pixels. Since the circle is much larger and consists of more white pixels, you can easily distinguish between line and circle by the sum of white pixels. This sum is also called histogram. In a histogram, all pixels of one color are counted and stored. Let us take a look at a small part of a grayscale image (numbers in the cells correspond to concrete pixel values):
The corresponding histogram would look like this (all non-existent values were omitted):
- 3: 2
- 4: 2
- 8: 2
- 11: 2
- 15: 3
- 201: 2
- 211: 3
All other, non-existent values (also called bins) would be 0.
Very often histograms are also shown as a bar chart:
Now we create the histogram for some of our circles and strokes.
The histogram is always visible to the right of the image.
0 (= black): 3219
1 (= white): 1681
0 (= black): 4619
1 (= white): 281
0 (= black): 4619
1 (= white): 281
0 (= black): 3219
1 (= white): 1681
You see, no matter where the circle or the line is, the sum of the white pixels is always the same. This makes it easy to distinguish between circle and line. Now we want to go one step further, we want to distinguish not only if a circle or a line is visible, but also on which side the line is (left, right, up, down). This is also easily possible with histograms. The advantage of histograms is also their biggest disadvantage. Histograms are robust against displacements. At the same time, however, this means that you cannot determine the location of an object with a histogram solely. Sometimes this is intentional, but usually you only want to be robust against displacement up to a certain degree. In such cases, you use a trick. The whole image is not summarized in one histogram, but the image is divided into areas for each of which a separate histogram is calculated. This allows us to make slight shifts and still locate the object roughly. We always divide our image into 9 equal histograms:
Then we calculate for each histogram how many white pixels there are in the area:
It is obvious that we now have to do a few more checks to distinguish between circle and line. Therefore we have managed to be up to a certain degree robust regarding displacements and still be able to locate the object roughly. To sum up: Whenever robustness regarding displacement is required, one should consider using one or more histograms.
Semantic Local Binary Pattern
After we know what LBPs are and what histograms are used for, we can finally go to the S-LBPs. We already reduced the LBPs to uninterrupted samples. But a problem remains. Obviously the pattern “00111110” does not differ very much from the pattern “00011100”, if you look at another example graphically it becomes even clearer:
The resulting value, on the other hand, differs significantly. Thus, “00011100” would correspond to the value 28 and “00111110” to the value 62. These two values differ significantly, although they are actually very similar. For this reason, Mu et al.<1 proposed not to store the value of the resulting number per pixel, but to store the position of the center 1 and the number of 1s. This will transfom the two samples as follows:
- 00111110 –> Position: 4; Length: 5
- 00011100 –> Position: 4; Length: 3
“Yes that sounds good, but why do we even care?”, may come to your head now. The answer is simple: we want to combine our LBPs with histograms to achieve robustness against shifts. But if we save our LBPs directly, all values “00011100” would get an extra bin and all “00111110” patterns too, although they are actually very similar. With the new spelling we work around this problem by grouping by places. Thus “00011100” and “0011111010” land on the same place. To be able to handle the two patterns differently, we do not add 1 for each pattern in the histogram, but the length of the LBP. In concrete terms, this histogram would come out for our two patterns:
- 4: 8
To make our SLBPs perfect, one small thing is still missing. If you look at the resulting SLBPs for a simple stroke, you quickly notice that LBPs always occur in pairs:
We already found in a previous chapter that our LBPs are mainly lying on edges without any interruption. This can also be seen in the picture above. Logically, each object with a background has an edge on the left side and an edge on the right side. The resulting LBPs are therefore only mirrored. This means that we can summarize our SLBPs by including the original SLBP and the mirrored version in the same place in the histogram. We define the digits 0 to 3 (inclusive) as “original” SLBPs and 4 to 8 as “mirrored”.
This puts the following SLBPs all in the same place (position 1):
- 11100000 –> Position: 1; Length: 3
- 00001110 –> Position: 5; Length: 3
- 00000100 –> Position: 5; Length: 1
Feature Extraction Overview
In the last chapters we learned that LBPs describe the neighborhood of a pixel. SLBPs reduce LBPs to those patterns without interruptions because these patterns are much more meaningful than those with interruptions. In addition, we got to know histograms that allow us robustness against shifts. We found out that splitting the image into several histograms represents a compromise between displacement robustness and localization of the object. Finally, we refined the SLBPs a bit and realized that grouping by the center 1 in SLBPs is a good way to go. Now that we have finally solved the feature extraction extensively, we can come to the classifier.
The principle of the k-Nearest-Neighbor Classifier (often abbreviated as kNN-Classifier, not to be confused with KNN for german version of Artificial Neural Network) is very simple. A classifier is used to assign an array of extracted features (also called feature vector) to a class. In our case, the classifier has the task of finding out the number based on our extracted SLBP histograms. The kNN classifier uses the sledgehammer method. We know that every classifier must be trained before it can be used. The classifier must first see examples for 1s, 2s, etc., so that it can make its own rules. The kNN classifier has also to go through this phase. With the kNN-Classifier, however, the training phase is very simple. The kNN-Classifier simply saves all examples shown during training. This means that the training of a kNN-Classifier is limited to copying the training data. Therefore a kNN classifier is trained quickly. If you now give the kNN classifier a number to classify, the classifier does nothing else than compare the given number with each stored number. For each trained example, it calculates how well the example matches the given number. The kNN classifier then sorts the scores per trained example in such a way that the trained examples with the greatest similarity are at the beginning. Then the kNN classifier grabs the first k ratings and finds out which number occurs most frequently in these top k numbers. The most common number is then the result of the classifier. We will illustrate this once again with a concrete example. Let’s imagine the classifier was trained with 3 different examples per number and the first 10 ratings look like this (the bigger the number the more similar the stored example is to the given number):
Given number: 2
Generated similarity list:
- 5: 0.94
- 2: 0.93
- 2: 0.85
- 2: 0.71
- 5: 0.53
- 8: 0.44
- 0: 0.22
- 8: 0.21
- 8: 0.19
- 0: 0.19
A kNN classifier with k=1 now looks at the first number with the highest rating. In this example this is the number 5, so the classifier would return 5 as solution, which is of course wrong. Unfortunately, the given number looks like a 5. A Classifier with k=4 looks at the top 4 results and returns the result that dominates. The top 4 results are:
- 5: 0.94
- 2: 0.93
- 2: 0.85
- 2: 0.71
Thus, the 2 occurs three times and the 5 only once.
The classifier with k=4 would now return 2 as result. The result is correct this time. It can be helpful to look at the first k results and not just the best. Nevertheless, one should not choose k too big, otherwise one could find three times the 2 and three times the 5 and would again have no idea what the correct result is. A small picture to illustrate the idea (Classifier with k=4):
Graphically, we want to determine the question mark in the picture by looking at the k closest neighbours.
The Whole Image
In this chapter we finally program our number recognition. But before we can combine all the knowledge we have learned, we have to divide the complete number into individual digits. Because we can only recognize individual digits, not whole numbers. Splitting the numbers, however, is not very difficult. In our case, there is usually always a small space between two digits. This means that we always split our numbers if we only have black pixels in a pixel column. Additionally, we add as a condition that we always split after a maximum of 15 pixels. This avoids problems with figures that do not have such a black space.
Our kNN classifier measures the similarity of the stored data and the current sample by calculating the difference between each position for each histogram and summing up all differences.
Function calculateSimilarity(originalDataSLBPHistograms, trainedDataSLBPHistograms)
// Iterate over every histogram
For i=0 To originalDataSLBPHistograms.size()
// Iterate over all possible positions
For j=0 To 4
// The less the returned score is the more similar are the two given SLBP-Histogram-Arrays
This means that this function does nothing else but look at how many SLBPs of which length/position should be present in the respective histogram and the less present than expected, the more dissimilar the two given samples are.
All in all, our complete program looks like this in pseudo code (based on OpenCV commands):
// Load image as grayscale image
img = imread("classify.jpg",CV_LOAD_IMAGE_GRAYSCALE)
// Get image LBPs. To consider one pixel as 1, it has to be a value which is higher than centrePixelValue+10.
lbpImg = getLBP(img,10)
// Remove all LBPs with gaps (so remove things like "00110111") and convert to SLBP
slbp = convertToSLBP(lbpImg)
// Remove symmetric patterns
For y=0 To slbp.height
For x=0 To slbp.width
If slbp[y][x].position >=4 Then
slbp[y][x].position -= 4
// Split the whole number into several digits
splitDigits = splitNumber(slbp)
ForEach digit In splitDigits
// Calculate histograms of SLBP features. Use a total of 2x3 histograms (2 for width and 3 for height)
histogramArray = calculateHistogramArrayOfSLBP(digit,2,3)
// Classify current digit using kNN with k=1 and append it to result string
maxCharSize = 15
foundXCoordsArray = new Array()
countCols = new Array()
For x=0 To slbp.width
For y=0 To slbp.height
// Check if current SLBP is a valid one (without any gaps)
If slbp[y][x].length!=-1 Then
// Everytime there is a valid SLBP in this colum, add one to countCols[x]
For x=0 To slbp.width
If countCols[x]>0 && foundValidPixel==true Then
// Found first valid SLBP belonging to a number after invalid SLBPs
ElseIf countCols[x]<=0 && foundValidPixel==1 && (foundXCoords.empty() || x-foundXCoords.back()>2) Then
// Found first invalid SLBP belonging NOT to a number after valid SLBPs
// The last found valid number has to be at least 3 pixel width
// If the maximum size of one digit is reached, add the end point of last digit and starting point of new digit
If foundXCoords.size()>0 && x-foundXCoords.back()>maxCharSize Then
digitArray = new Array()
// Extract all digits, foundXCoords[foundX] and foundXCoords[foundX+1] always describe starting and ending point of one digit
For foundX=0 To foundXCoords.size()-1 Step 2
// Extract digit using x,y,width,height
extractedRegion = extractRegion(slbp,foundXCoords[foundX],0,foundXCoords[foundX+1]-foundXCoords[foundX],slbp.height)
Marcel Rupprecht, "Do It Yourself – Optical Character Recognition", in Neural Ocean, 2018, //neuralocean.de/index.php/2018/03/28/do-it-yourself-optical-character-recognition/.
title="Do It Yourself – Optical Character Recognition",
year="2018 (accessed March 30, 2020)"