So I finally found an architecture which seems to work pretty well. We also continued annotating the images we took and labeled about 1000 more traffic signs/traffic lights. In total we now have about 3000 traffic signs/lights in our own database + 800 from the German Traffic Sign Detection Benchmark. We applied a lot of data augmentation (different JPEG compression levels, rotation between -10° and 10°, brightness + contrast adjustments, simple color balancing, …). So in total we have about 600,000 positive training images, which are augmented further during training (e.g. Gaussian blur, sharpening, …). However, the next big challenge is to get our NN working reasonably fast.
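To make the augmentation step a bit more concrete, here is a minimal sketch of how the parameters for one augmented copy could be sampled and how a brightness/contrast adjustment could be applied. Only the rotation range (-10° to 10°) comes from our actual setup; the brightness, contrast and JPEG quality ranges, as well as the function names, are assumptions chosen for illustration, and the image is a plain nested list of grayscale values rather than a real image object.

```python
import random

def sample_augmentation(rng):
    """Draw one random set of augmentation parameters."""
    return {
        "rotation_deg": rng.uniform(-10.0, 10.0),  # range used in the post
        "brightness": rng.uniform(-30.0, 30.0),    # assumed range (offset)
        "contrast": rng.uniform(0.8, 1.2),         # assumed range (gain)
        "jpeg_quality": rng.randint(40, 95),       # assumed range
    }

def adjust_brightness_contrast(pixels, brightness, contrast):
    """Scale around mid-grey (128), add an offset, clamp to 0..255."""
    return [
        [min(255, max(0, round((p - 128) * contrast + 128 + brightness)))
         for p in row]
        for row in pixels
    ]

rng = random.Random(42)
params = sample_augmentation(rng)
patch = [[0, 64, 128], [128, 192, 255]]  # tiny stand-in for a sign crop
augmented = adjust_brightness_contrast(
    patch, params["brightness"], params["contrast"]
)
```

Rotation and JPEG re-compression are left out here because they need an image library; the sampled values would simply be passed on to it.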
At the moment there are two really big bottlenecks.
- Extracting possible regions of interest using Selective Search takes approximately 2-5s (depending on the selected mode).
- Classifying 111 rectangles with the current NN takes ~23s on a CPU, i.e. about 200ms per region of interest. On a GPU it takes 800ms for 111 rectangles, i.e. 7-8ms per rectangle.
Obviously, if we do not improve performance, we would have to wait about half a minute until we get any results. This is of course much too slow. As mentioned earlier, we want to reach 1-2 FPS. This means we have about 500-1000ms per image. Some simple math reveals that at ~200ms per rectangle we can test only about 2 rectangles per image at 2 FPS and 4-5 at 1 FPS for whether they contain a traffic sign (ignoring the time taken for Selective Search). This means that if more than 5 traffic signs are visible at the same time, we will exceed our limit, too. A solution would be to use some kind of GPU-boosted device, like the NVIDIA Jetson Developer Kit. This may decrease the time for classifying one ROI (Region of Interest) to 10-30ms. At the lower end of that range we could check roughly all 111 rectangles in about one second (still without Selective Search).
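The budget calculation above can be written down as a few lines of Python. This is just the arithmetic from the measurements in this post (23s on CPU and 800ms on GPU for 111 rectangles); the function names are made up for this sketch, and Selective Search time is ignored as in the text.

```python
def per_roi_ms(total_ms, n_rois):
    """Average classification time per region of interest."""
    return total_ms / n_rois

def rois_per_frame(budget_ms, roi_ms):
    """Number of whole ROIs that fit into one frame's time budget."""
    return int(budget_ms // roi_ms)

N_ROIS = 111
cpu_ms = per_roi_ms(23_000, N_ROIS)  # about 207 ms per ROI
gpu_ms = per_roi_ms(800, N_ROIS)     # about 7.2 ms per ROI

for fps in (1, 2):
    budget_ms = 1000 / fps
    print(
        f"{fps} FPS: {rois_per_frame(budget_ms, cpu_ms)} ROIs on CPU, "
        f"{rois_per_frame(budget_ms, gpu_ms)} ROIs on GPU"
    )
```

With the unrounded 207ms per ROI, the CPU fits 4 rectangles into one second and 2 into half a second, which is where the "handful of rectangles per image" limit comes from.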
However, I do not want to buy a new device, nor do I want to cancel the challenge at this stage. That is why I am going to search for alternatives. At the moment the best detection and classification rates are achieved by Deep Neural Networks. So there is no real alternative to NNs for the classification task (some of you may think of the CNN architecture called YOLO [1], but this architecture offers worse detection rates for quite “context-less” objects than region-proposal-based methods do). But we can reduce the number of parameters of the network to make test time faster. Even more testing on NN architectures needs to be done… Yay… Getting 10-15 ROIs checked in ~750ms is my current goal (50-75ms per ROI).
But even if we find an architecture which works quite well while only taking 75ms per ROI at test time, there is still the time taken by Selective Search, which is 2-5s. Even 2s is too much, and decreasing the “accuracy” of Selective Search even further gives poor results. I truly have to think about some alternatives. Faster RCNN [2] would be a real alternative. However, Faster RCNN does not directly solve the problem of the time taken to find region proposals. It uses the first layers of the network to compute the ROIs. Because this “ROI detection” is moved to the GPU, the “algorithm” of course runs much faster than Selective Search on a CPU. However, as the laptop I use does not contain an NVIDIA card, I have to think about alternative ways to find regions.
In an old project of mine I used some kind of improved Local Binary Patterns (LBPs) combined with a simple linear SVM to detect humans in an image using a sliding window approach. Even though it was only a simple sliding window approach, I still managed to analyze an image of 1280×900 in ~1s and got quite “OK” results. In another project I used Evolutionary Algorithms to speed up a simple sliding window template search while still achieving great accuracy. It may be worth a try to combine an SVM, which classifies a given region into background/foreground similar to RCNN, with Evolutionary Algorithms to quickly find interesting regions. By using LBPs for feature extraction we more or less ensure that everything runs fast enough. Maybe Particle Swarms can also help us decrease the time for finding ROIs.
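For readers who have not seen LBPs before, here is a minimal sketch of the plain 8-neighbour variant and the histogram that would be fed to the SVM as a feature vector. This is the textbook version, not the improved variant from my old project, and the function names are made up for this example. The image is a nested list of grayscale values.

```python
# Neighbours in clockwise order, starting at the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
           (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(img, y, x):
    """8-bit LBP code of pixel (y, x): bit i is set if neighbour i >= centre."""
    center = img[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(OFFSETS):
        if img[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(img):
    """Normalised 256-bin histogram of LBP codes over all interior pixels."""
    height, width = len(img), len(img[0])
    hist = [0] * 256
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            hist[lbp_code(img, y, x)] += 1
    total = sum(hist)
    return [count / total for count in hist] if total else hist

window = [
    [10, 10, 10, 10],
    [10, 50, 50, 10],
    [10, 50, 50, 10],
    [10, 10, 10, 10],
]
features = lbp_histogram(window)  # 256 values, one per LBP code
```

Each pixel only needs eight comparisons, which is why this kind of feature is cheap enough for scanning many candidate windows on a CPU.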
This will be my next big step towards real-time detection. Hopefully the concept I am thinking of works as expected. I think it is worth a try because the SVM does not have to work perfectly: it only has to find the “best 10-15 foreground regions”. Finding the real traffic signs in these regions will be done by the CNN, which has already demonstrated its power in the tests of the last few days!
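The two-stage idea can be sketched as follows: a cheap scoring function ranks all candidate regions, only the best k are kept, and only those are passed to the expensive classifier. Nothing of this exists yet; `score_fn` stands in for the planned SVM, `classify_fn` for the CNN, and both names as well as the toy stand-ins below are purely hypothetical.

```python
import heapq

def select_top_regions(regions, score_fn, k=15):
    """Keep the k regions with the highest foreground score."""
    return heapq.nlargest(k, regions, key=score_fn)

def detect(regions, score_fn, classify_fn, k=15):
    """Run the expensive classifier only on the k best-scoring regions."""
    detections = []
    for region in select_top_regions(regions, score_fn, k):
        label = classify_fn(region)  # None means "no traffic sign"
        if label is not None:
            detections.append((region, label))
    return detections

# Toy stand-ins: regions are (x, y, w, h) tuples, the "SVM" prefers
# square regions, the "CNN" accepts regions that are at least 32 px wide.
regions = [(0, 0, 40, 40), (5, 5, 100, 20), (50, 60, 30, 30), (9, 9, 64, 60)]
toy_score = lambda r: -abs(r[2] - r[3])
toy_classify = lambda r: "sign" if r[2] >= 32 else None

result = detect(regions, toy_score, toy_classify, k=3)
```

With k fixed at 10-15, the CNN cost per image stays bounded no matter how many candidate windows the first stage looks at.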
Let us keep going!