How to train Tesseract 3.01

Optical Character Recognition (OCR) is a very popular technology nowadays: it enables machines to automatically identify text in digital images. A lot of research has been done, resulting in many different techniques and publications. Many of the OCR tools currently on the market are expensive and closed source; only a few are free and open source.

One of the most popular OCR tools is Tesseract, because it’s free and very accurate. Tesseract is an open-source OCR engine, which means it’s not a complete tool but just an engine. This makes it possible for programmers like us to integrate it into our own applications. Tesseract was originally developed at HP as PhD research. The main purpose was to create a new OCR feature for their scanners, but the feature was never released. Later on, HP stopped working on Tesseract and released the project as open source. Currently Tesseract is maintained by Google, and the initial developer, Ray Smith, has moved to Google as well. The release as open source caused some changes to the engine too. Some parts that Tesseract used, like neural networks, weren’t open source, so they were replaced (it’s a little bit strange because the code is still available, but isn’t used anymore. Google?).

Training

Another advantage is that Tesseract can be trained, but unfortunately the procedure isn’t well documented. That’s why I will try to explain it more clearly, so other people will be able to train Tesseract too.

The basic idea of how the engine works can be found on the figure below.

In the first three steps, the connected component analysis and the line/word splitting are done. Afterwards the words are split into characters and each character/component goes through a two-pass recognition process. In the first pass, the results of the recognized characters/words are passed to an adaptive classifier, which uses them as training data. After that the text is recognized a second time, now using the adaptive classifier.

The reason the text is recognized a second time is that the first pass gives us information about the context of the text. It could be that, at the end of the first pass, words are recognized that could be helpful for words at the beginning of the text. That’s essentially how we (most of us) humans try to learn something: the first time we read the whole book/text to know roughly what’s in it; afterwards we read it a second, third or fourth time and try to understand everything.

Let’s train something

Recently a publication was released by Nick White, who tried to train Tesseract for ancient Greek. In his publication he mentions the difficulty of finding good documentation about the procedure. He explains the procedure in more detail, but unfortunately doesn’t list the commands he used (the publication has been added to the Tesseract repository).

When Tesseract is installed on a Linux system, several commands become available:

  • tesseract
  • unicharset_extractor
  • mftraining
  • cntraining
  • combine_tessdata

The first command, tesseract, recognizes images, but it can also be used to train and to create boxes. When we recognize something, the result is stored in a txt file. Below I demonstrate recognizing a number (this is for my master’s thesis, where I used Tesseract).
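For reference, a basic recognition run looks like this (the file name is hypothetical; Tesseract appends the .txt extension to the output base itself):

```shell
# Recognize number.jpg and write the recognized text to number.txt.
tesseract number.jpg number
cat number.txt
```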





Although the recognition is almost perfect, it does recognize characters which we humans wouldn’t consider relevant. Wrong recognition can be caused by overtraining, or by undertraining (I don’t think that word exists). With undertraining I mean that a font family is used which is very different from the standard fonts. The Tesseract installation ships with standard training data for English (on the Tesseract website there is training data available for other languages). If you use the standard training data to recognize characters in a font that differs a lot from the standard font families, accuracy can be low. That’s why in some situations it can be helpful to create your own training data to increase the accuracy.

In the following steps I will explain how to train Tesseract for your own font:

  • Create boxes
  • Edit the boxes manually
  • Create the training file
  • Extract the unicharset
  • Set the font properties
  • Clustering
  • Rename the output files
  • Combine the files

The commands used are (I will discuss them in detail below):

tesseract eng.matrx60x40.exp0.jpg eng.matrx60x40.exp0 batch.nochop makebox
tesseract eng.matrx60x40.exp0.jpg eng.matrx60x40.exp0.box  nobatch box.train
unicharset_extractor eng.matrx60x40.exp0.box
echo "matrx60x40 1 0 0 0 0" > font_properties
mftraining -F font_properties -U unicharset -O eng.unicharset eng.matrx60x40.exp0.box.tr
cntraining eng.matrx60x40.exp0.box.tr
# rename all files created by mftraining and cntraining, adding the prefix eng.
combine_tessdata eng.

Make Box

In the first step, the image we are using to train our specific font is recognized by Tesseract and boxes are created. Each box corresponds to a component that represents a single character. A box has a position and size, and the character that has been recognized in that region.
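As an illustration, each line of the generated .box file holds one character followed by the coordinates of its bounding box (left, bottom, right, top) and a page number; the coordinate values below are made up:

```
7 36 58 65 99 0
```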

In this step we have to manually correct the mistakes Tesseract made. Maybe all characters are wrong because the font you are using isn’t very common. To make this job a lot easier, several tools are available. These tools are called “Tesseract box editors”; the one I used is jTessBoxEditor. jTessBoxEditor is written in Java and thus platform independent, and it has options to merge and split boxes, which can be handy. After you have corrected the mistakes, we can create the training file. The image I used is called eng.matrx60x40.exp0.jpg and a box file “eng.matrx60x40.exp0.box” will be created.


tesseract eng.matrx60x40.exp0.jpg eng.matrx60x40.exp0 batch.nochop makebox

Training file

In this procedure we tell Tesseract the correct results: we point out the mistakes it made, so it won’t make the same mistakes in the next recognition. This produces the training file eng.matrx60x40.exp0.box.tr, which is used in the clustering step.


tesseract eng.matrx60x40.exp0.jpg eng.matrx60x40.exp0.box nobatch box.train

Unicharset

This step extracts the character set from the box file.


unicharset_extractor eng.matrx60x40.exp0.box

Font properties

Create a new file, font_properties, describing the font you are trying to train. Each property is a 0/1 flag.
Syntax: fontname italic bold fixed serif fraktur


echo "matrx60x40 1 0 0 0 0" > font_properties

Clustering

In this step the character features are clustered: mftraining generates the shape prototypes, and cntraining generates the character normalization data.


mftraining -F font_properties -U unicharset -O eng.unicharset eng.matrx60x40.exp0.box.tr
cntraining eng.matrx60x40.exp0.box.tr
Renaming

We rename all the files generated by mftraining & cntraining. We add the prefix “eng.” to all the files.
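The renaming can be scripted; a minimal sketch (the file names are the usual mftraining/cntraining outputs and may differ between Tesseract versions):

```shell
# Prefix the clustering output files with the language code "eng.".
# inttemp and pffmtable come from mftraining, normproto from cntraining;
# your version may also produce Microfeat or shapetable.
for f in inttemp pffmtable normproto Microfeat shapetable; do
    if [ -f "$f" ]; then mv "$f" "eng.$f"; fi
done
```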

Combining

In this last step we combine all the files, and the training data (eng.traineddata) is created.


combine_tessdata eng.

Move trainingdata

Move the training data to the tessdata folder; on Mac OS X this is “/usr/local/share/tessdata”.
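For example (the destination is the Mac OS X default mentioned above; adjust it to your installation, and note that the folder usually requires root permissions):

```shell
# Install the combined training data so the "eng" language can use it.
sudo mv eng.traineddata /usr/local/share/tessdata/
```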

Hey, I'm Cédric, a software engineer who's motivated to broaden his horizon and to discover and learn new methodologies. With a huge interest in web technologies, artificial intelligence and image processing, I try to understand our mountainous world a little bit better.