In many applications, real-time image analysis is mandatory. An industrial robot that needs five seconds to recognize objects in an image is of very limited use. At the same time, modern applications require detecting objects and their positions as accurately as possible. High-performance GPUs optimized for deep learning, combined with recent research advances like YOLO, largely circumvent these problems. However, these approaches have two major drawbacks that may matter, depending on the use case:
- The power consumption and size of such computers are prohibitive for embedded systems
- GPUs are generally quite expensive compared to small embedded systems
Industrial robots obviously do not suffer from these points, because their power consumption and size mostly do not matter. Prices for high-quality industrial robots range from a few thousand dollars up to millions. In general, this means that a 400€ computer as the processing unit will not bother anyone.
In small systems like smartphones or embedded cameras, however, size, time, and power consumption matter a lot. Therefore we cannot solve object detection on a smartphone solely with deep neural networks, at least not if we want near real-time processing. This is where a method kicks in that was used for many years before CNNs and deep learning were “discovered”: instead of training only one classifier, we train a cascade of classifiers.
Imagine we use a very simple sliding-window search to detect objects. This produces many regions for the CNN to evaluate, and without a modern GPU, or under power and time constraints, our CNN approach would take forever. At this point even Faster R-CNN and similar architectures would fail, because their speedup mainly comes from shifting the region proposals onto the GPU via the CNN itself. The YOLO architecture may tackle this problem, but has worse detection rates than R-CNN. A cascade of classifiers may be able to solve this problem.
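To get a feel for how many candidate regions a naive sliding-window search produces, here is a minimal sketch. The function name `sliding_windows` and all parameter values are illustrative assumptions, not taken from any particular detector:

```python
def sliding_windows(img_w, img_h, win=24, step=8, scale=1.25, max_scales=10):
    """Yield (x, y, w, h) candidate regions at multiple scales.

    All parameters are illustrative; a real detector tunes the base
    window size, step, and scale factor to its task.
    """
    size = float(win)
    for _ in range(max_scales):
        s = int(size)
        if s > min(img_w, img_h):
            break
        # Scale the stride with the window so coverage stays comparable.
        stride = max(1, int(step * size / win))
        for y in range(0, img_h - s + 1, stride):
            for x in range(0, img_w - s + 1, stride):
                yield (x, y, s, s)
        size *= scale

# Even a small 384x288 image yields thousands of candidate regions:
n_regions = sum(1 for _ in sliding_windows(384, 288))
```

Running a full CNN on every one of these regions is exactly the cost the cascade is meant to avoid.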
We train our CNN as usual and also run the sliding-window search as before. But instead of evaluating the CNN on every region proposal, in the first stage we apply a very simple classifier, e.g. AdaBoost with a few hundred decision stumps, trained only on fast-to-compute features like standard deviation and mean (these can be computed quickly using integral images). We train this weak classifier to differentiate foreground from background, face from non-face, or sign from non-sign (this totally depends on your use case). However, we make sure that the classifier tends to see objects/faces/signs even when the region does not contain any, so that true positives are almost never rejected. This way we may filter out about 50% of the non-object regions in the first stage. The remaining region proposals are then filtered by another classifier that uses more advanced features like LBP histograms. After a few stages, enough regions have been filtered out that we can apply our CNN directly to the survivors for the final classification.
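The first stage described above can be sketched as follows. This is a minimal illustration, not the article's exact classifier: a plain contrast threshold stands in for the AdaBoost stage, and integral images of the pixel values and their squares give O(1) per-window mean and standard deviation. All names and the threshold value are assumptions:

```python
import math

def integral_tables(img):
    """Summed-area tables of pixel values and squared pixel values.
    S[y][x] holds the sum of img over the rectangle [0, y) x [0, x)."""
    h, w = len(img), len(img[0])
    S = [[0] * (w + 1) for _ in range(h + 1)]
    S2 = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row = row2 = 0
        for x in range(w):
            v = img[y][x]
            row += v
            row2 += v * v
            S[y + 1][x + 1] = S[y][x + 1] + row
            S2[y + 1][x + 1] = S2[y][x + 1] + row2
    return S, S2

def window_stats(S, S2, x, y, w, h):
    """Mean and standard deviation of a w x h window in O(1),
    using four table lookups per sum."""
    n = w * h
    s = S[y + h][x + w] - S[y][x + w] - S[y + h][x] + S[y][x]
    s2 = S2[y + h][x + w] - S2[y][x + w] - S2[y + h][x] + S2[y][x]
    mean = s / n
    var = max(0.0, s2 / n - mean * mean)
    return mean, math.sqrt(var)

def first_stage(S, S2, regions, min_std=5.0):
    """Cheap first stage: keep only windows with enough contrast.
    The threshold is deliberately permissive, biased toward keeping
    candidates, since a rejected true object is lost for good."""
    return [r for r in regions if window_stats(S, S2, *r)[1] >= min_std]
```

Later stages would apply progressively more expensive features to the shrinking candidate set, with the CNN reserved for the final stage.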
A very popular framework that uses such a cascade of classifiers is the Viola–Jones object detection framework. This work was done in 2001, a few years before CNNs became popular. The framework was able to detect faces in images in real time with good accuracy. It used a 38-stage cascade and added features the deeper the stage was. The speed of this system is unbelievable once you know what hardware they used for their tests:
> On a 700 MHz Pentium III processor, the face detector can process a 384 by 288 pixel image in about 0.067 seconds (using a starting scale of 1.25 and a step size of 1.5 described below). This is roughly 15 times faster than the Rowley–Baluja–Kanade detector and about 600 times faster than the Schneiderman–Kanade detector.
Almost every modern smartphone has at least four times the processing power of that old Pentium III. The exact same algorithm might therefore run in about 0.01675 seconds per frame, which corresponds to a framerate of roughly 60 FPS.
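The back-of-the-envelope calculation behind that number, assuming a flat 4x speedup (a rough assumption, since architectures differ in more ways than clock speed):

```python
t_piii = 0.067     # seconds per 384x288 frame on the 700 MHz Pentium III
speedup = 4        # assumed: a modern phone is at least 4x faster
t_phone = t_piii / speedup   # 0.01675 seconds per frame
fps = 1.0 / t_phone          # just under 60 frames per second
```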
Even though this approach is rarely used today, it may improve run time and accuracy when combined with CNNs and deep learning.