This repository implements the OCR branch of the method introduced in the paper E2E-MLT - an Unconstrained End-to-End Method for Multi-Language Scene Text by Busta et al. (paper). We wanted to see how the OCR branch performs on traffic signs.
We have noticed that pre-training the network on synthetic data helps improve the overall performance and speeds up the training process. For this purpose, we used a subset of the synthetic word dataset of the Visual Geometry Group of the University of Oxford (website). Our subset contains roughly 20K training and 2.5K validation images. Here are some samples of these words:
As we mentioned above, we want to recognize the writing on traffic signs. Thus, we use images collected from test drives to generate our training data. Such an image can look like this: We then crop the writings and use them to fine-tune our pre-trained network. For example, the crops from the image above are:
Note that E2E-MLT can learn to recognize multiple languages at the same time. We only trained it on English and German characters.
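To give an idea of what "English and German characters" means in practice, here is a minimal sketch of such a character set and the corresponding index mapping. The variable names and the exact character inventory are our own illustration, not the codec actually used by the network:

```python
import string

# Hypothetical character set for an English + German recognizer.
# Whether punctuation is included and how the blank symbol is handled
# depends on the actual codec of the network.
GERMAN_EXTRA = "ÄÖÜäöüß"
ALPHABET = string.digits + string.ascii_letters + string.punctuation + " " + GERMAN_EXTRA

# Map characters to integer class indices; index 0 is commonly
# reserved for the CTC blank symbol.
char_to_idx = {c: i + 1 for i, c in enumerate(ALPHABET)}
idx_to_char = {i: c for c, i in char_to_idx.items()}
```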
All the details regarding the hyperparameters, loss function, etc. can be found in the paper mentioned at the very beginning.
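In short, the paper trains the recognition branch with a CTC loss. The following is a minimal, self-contained sketch of what a single CTC training step can look like in PyTorch; all shapes, tensors, and values below are dummy placeholders and do not reflect the actual training code:

```python
import torch
import torch.nn as nn

# Hypothetical setup: the recognizer outputs per-timestep class scores
# of shape (T, N, C) for T timesteps, batch size N, and C classes.
T, N, C = 50, 4, 100
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)  # stand-in for model output

# Dummy targets: each sample is a sequence of character indices > 0,
# since index 0 is the CTC blank.
targets = torch.randint(1, C, (N, 12), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # in a real training loop, followed by optimizer.step()
```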
Note, however, that the detection was done separately beforehand, i.e. our labels include the contour points of each traffic sign within the bigger picture as well as the ground truth transcriptions.
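A minimal sketch of how such crops could be extracted from the contour points, assuming each region is labeled with four corner points; the function below and its assumptions are ours, not part of the repository:

```python
import cv2
import numpy as np

def crop_sign(image, contour):
    """Rectify and crop a text region given its four contour points.

    `contour` is assumed to be a (4, 2) array of corner points ordered
    top-left, top-right, bottom-right, bottom-left.
    """
    (tl, tr, br, bl) = contour
    width = int(max(np.linalg.norm(tr - tl), np.linalg.norm(br - bl)))
    height = int(max(np.linalg.norm(bl - tl), np.linalg.norm(br - tr)))
    dst = np.array([[0, 0], [width - 1, 0],
                    [width - 1, height - 1], [0, height - 1]], dtype=np.float32)
    # Warp the quadrilateral onto an axis-aligned rectangle.
    M = cv2.getPerspectiveTransform(contour.astype(np.float32), dst)
    return cv2.warpPerspective(image, M, (width, height))
```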
After training/fine-tuning the network for 10 epochs on our data, we use it for prediction. For the sake of continuity, we use the same image, i.e. the same crops, to give an impression of what the network does.
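To make the prediction step concrete, here is a minimal sketch of inference on a single rectified crop with a greedy CTC decode. The model interface and the `idx_to_char` mapping from above are hypothetical stand-ins, not the repository's actual API:

```python
import torch

def predict(model, crop_tensor, idx_to_char):
    """Run a single crop through the recognizer and greedily decode
    the CTC output.

    `crop_tensor` is assumed to be a preprocessed (1, C, H, W) tensor;
    `model` is assumed to return per-timestep log-probabilities of
    shape (T, 1, num_classes).
    """
    model.eval()
    with torch.no_grad():
        log_probs = model(crop_tensor)          # (T, 1, num_classes)
    best = log_probs.argmax(dim=2).squeeze(1)   # best class per timestep
    chars = []
    prev = 0
    for idx in best.tolist():
        # Greedy CTC decoding: drop blanks (index 0) and repeats.
        if idx != 0 and idx != prev:
            chars.append(idx_to_char[idx])
        prev = idx
    return "".join(chars)
```

The figure below summarizes all the steps we discussed so far.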