Home

Welcome to the PERO OCR wiki of UB Mannheim!

Please note that this wiki is not the official documentation. It is a work in progress and may contain errors.

Installation

The instructions here document our experience with a local installation of the PERO OCR web application.

Preconditions

We tested the installation on the current stable release of Debian GNU/Linux. At least the following packages are needed:

apt install libpq-dev libpython3-dev python3-venv

TODO: Run test installation on a minimal Debian installation to find other required packages.
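
Until that TODO is resolved, a quick check shows which of the listed packages are still missing on a given machine. The following is a minimal sketch, assuming a Debian-based system where dpkg is available. missing_packages is a hypothetical helper name, and packages that were removed but not purged may still be reported as present.

```shell
# Print every package from the argument list that dpkg does not know as installed.
# Assumption: `dpkg -s PACKAGE` exits with a non-zero status for unknown packages.
missing_packages() {
    for pkg in "$@"; do
        if ! dpkg -s "$pkg" >/dev/null 2>&1; then
            echo "$pkg"
        fi
    done
}

# Usage: list what still has to be installed with apt.
#   missing_packages libpq-dev libpython3-dev python3-venv
```

An empty result means that all three packages are already installed.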

Install pero-ocr

pero-ocr is installed in a virtual Python environment with the following instructions.

# Create and enter a base directory for the installation.
mkdir -p ~/src/github/DCGM
cd ~/src/github/DCGM

# Create and activate the virtual Python environment.
python3.9 -m venv venv3.9
source venv3.9/bin/activate
pip install -U pip setuptools wheel

# Get source code for pero-ocr.
git clone https://github.com/DCGM/pero-ocr.git
cd pero-ocr

# Edit setup.py and remove the line for tensorflow-gpu before continuing.
vim setup.py

# Run installation.
pip install .
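
The manual edit of setup.py (before running the installation) can also be scripted. This is a sketch only, assuming that tensorflow-gpu appears on a line of its own in setup.py. drop_requirement is a hypothetical helper name, and the diff against the backup should be reviewed before running pip install.

```shell
# Delete every line that mentions the given requirement from a file.
# The unmodified file is kept as FILE.bak.
# Assumption: the requirement sits on a line of its own.
drop_requirement() {
    sed -i.bak "/$1/d" "$2"
}

# Usage (from the pero-ocr source directory):
#   drop_requirement tensorflow-gpu setup.py
#   diff setup.py.bak setup.py    # review what was removed
```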

Install pero_ocr_web

pero_ocr_web is installed in the same virtual Python environment with the following instructions.

# Enter the base directory for the installation.
cd ~/src/github/DCGM

# Activate the virtual Python environment (if necessary).
source venv3.9/bin/activate

# Get source code for pero_ocr_web.
git clone https://github.com/DCGM/pero_ocr_web.git
cd pero_ocr_web

# Install required Python packages.
pip install arabic_reshaper dateutils drymail config psycopg2 email-validator Flask Flask-Bootstrap Flask-Dropzone Flask-JSGlue Flask-Login Flask-WTF Jinja2 Levenshtein natsort numpy SQLAlchemy opencv-python

# Install Node packages, then return to the pero_ocr_web source directory.
cd app
npm install
npm audit fix
cd ..

# Start the web application.
python run_app.py

Flask-JSGlue imports Markup from jinja2, which fails with Jinja2 3.1 or later because that release removed Markup from the jinja2 package. Patch $VIRTUAL_ENV/lib/python*/site-packages/flask_jsglue.py so that Markup is imported from markupsafe instead:

--- /home/stweil/src/github/DCGM/venv3.9/lib/python3.9/site-packages/flask_jsglue.py    2022-06-01 18:09:23.499253413 +0200
+++ ../pero_ocr_web/venv3.9/lib/python3.9/site-packages/flask_jsglue.py 2022-06-01 15:15:25.720490791 +0200
@@ -1,5 +1,6 @@
 from flask import current_app, make_response, url_for
-from jinja2 import Markup
+from jinja2.utils import markupsafe
+from markupsafe import Markup
 import re, json
 
 JSGLUE_JS_PATH = '/jsglue.js'
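
To apply the patch, the exact path of flask_jsglue.py inside the virtual environment is needed. The following is a minimal sketch, assuming the environment is activated so that VIRTUAL_ENV is set. site_package_file and the patch file name flask_jsglue.patch are hypothetical.

```shell
# Print the path(s) of a file inside the site-packages directory of a virtual environment.
# $1: file name, $2: environment directory (defaults to the active $VIRTUAL_ENV).
site_package_file() {
    for f in "${2:-$VIRTUAL_ENV}"/lib/python*/site-packages/"$1"; do
        [ -f "$f" ] && echo "$f"
    done
    return 0
}

# Usage, assuming the diff was saved as flask_jsglue.patch:
#   patch "$(site_package_file flask_jsglue.py)" < flask_jsglue.patch
```

Passing the target file to patch explicitly avoids any dependence on the file names in the diff header, which refer to one particular machine.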

Now continue the installation:

# Fix configuration templates.
# Look for "/home" and make sure that those files and folders exist.
vim config_example.py layout_client/config_example.ini ocr_client/config_example.ini

# Copy the configuration templates.
cp config_example.py config.py
cp layout_client/config_example.ini layout_client/config.ini
cp ocr_client/config_example.ini ocr_client/config.ini
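
The search for "/home" mentioned above can be done with grep before opening the editor. This is a small sketch, and list_home_paths is a hypothetical helper name.

```shell
# List every line that mentions /home in the given files,
# prefixed with file name and line number.
# The extra /dev/null argument makes grep print the file name even for a single file.
list_home_paths() {
    grep -n '/home' "$@" /dev/null
}

# Usage (from the pero_ocr_web directory):
#   list_home_paths config_example.py layout_client/config_example.ini ocr_client/config_example.ini
```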

# Update database_url in layout_client/register_layout_detectors.py, then add the layout detectors.
vim layout_client/register_layout_detectors.py
python layout_client/register_layout_detectors.py -d $HOME/data/pero_ocr_web_data/db.sqlite

# Update database_url in ocr_client/register_baseline.py, then register the baselines.
vim ocr_client/register_baseline.py
python ocr_client/register_baseline.py -d $HOME/data/pero_ocr_web_data/db.sqlite

# Update database_url in ocr_client/register_language_model.py, then add the OCR language models.
vim ocr_client/register_language_model.py
python ocr_client/register_language_model.py -d $HOME/data/pero_ocr_web_data/db.sqlite

# Update database_url in ocr_client/register_ocr.py, then add the OCR engines.
vim ocr_client/register_ocr.py
python ocr_client/register_ocr.py -d $HOME/data/pero_ocr_web_data/db.sqlite
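
The four registration calls differ only in the script name, so they can be generated in a loop. This is a sketch under the assumption that database_url has already been updated in each script and that the commands are run from the pero_ocr_web directory. registration_commands is a hypothetical helper name.

```shell
# Print one registration command per script, all using the same database file.
# $1: database file (defaults to the path used in the steps above).
registration_commands() {
    db="${1:-$HOME/data/pero_ocr_web_data/db.sqlite}"
    for script in \
        layout_client/register_layout_detectors.py \
        ocr_client/register_baseline.py \
        ocr_client/register_language_model.py \
        ocr_client/register_ocr.py
    do
        echo "python $script -d $db"
    done
}

# Usage: review the commands first, then execute them.
#   registration_commands
#   registration_commands | sh
```

Printing the commands instead of running them directly keeps the step reviewable. Piping to sh only works as written if the database path contains no spaces.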

Run PERO OCR web application

Did you create all required folders (see config.py above)? Then you are ready to start the web application pero_ocr_web with the following instructions.

# Enter the base directory for the installation.
cd ~/src/github/DCGM

# Activate the virtual Python environment (if necessary).
source venv3.9/bin/activate

# Enter the source directory for pero_ocr_web.
cd pero_ocr_web

# Run the worker processes for layout analysis and OCR requests.
# This can also be done in a separate terminal, so log output is not mixed with the web application.
python run_clients.py &

# Start the web application.
python run_app.py
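
If no separate terminal is at hand, the worker output can instead be sent to a log file so that it does not mix with the output of the web application. This is a minimal sketch, and start_logged, BACKGROUND_PID and clients.log are hypothetical names.

```shell
# Start a command in the background with stdout and stderr redirected to a log file.
# The process ID is stored in BACKGROUND_PID so the process can be stopped later.
start_logged() {
    logfile="$1"
    shift
    "$@" >"$logfile" 2>&1 &
    BACKGROUND_PID=$!
}

# Usage (replaces "python run_clients.py &"):
#   start_logged clients.log python run_clients.py
#   python run_app.py
#   kill "$BACKGROUND_PID"    # stop the workers after the web application exits
```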

Now connect your web browser to http://127.0.0.1:2000/, sign up, log in and add your first document. Also upload a few page images.
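
If the page does not load, it can help to check from the command line whether the web application answers on that address at all. This is a sketch assuming curl is installed, and wait_for_url is a hypothetical helper name.

```shell
# Poll a URL until it answers without an HTTP error or the attempts are used up.
# $1: URL, $2: number of attempts (default 10), one second apart.
wait_for_url() {
    attempts="${2:-10}"
    while [ "$attempts" -gt 0 ]; do
        if curl --silent --fail --output /dev/null "$1"; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 1
    done
    return 1
}

# Usage:
#   wait_for_url http://127.0.0.1:2000/ && echo "web application is up"
```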

The layout analysis and OCR expect a trusted user with e-mail address [email protected] and password client. Sign up using these values and set that user to trusted:

python utils/set_user_to_trusted.py -d $HOME/data/pero_ocr_web_data/db.sqlite -e [email protected]

Useful commands

# Dump database.
sqlite3 ~/data/pero_ocr_web_data/db.sqlite .dump
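
The dump is printed to the terminal. For a backup it is more practical to write it to a file. This is a minimal sketch, assuming the sqlite3 command line tool is installed. backup_db and the file names are hypothetical.

```shell
# Write a SQL dump of a SQLite database to a file.
# $1: database file, $2: output file for the dump.
backup_db() {
    sqlite3 "$1" .dump > "$2"
}

# Usage:
#   backup_db ~/data/pero_ocr_web_data/db.sqlite db-backup.sql
#   sqlite3 restored.sqlite < db-backup.sql                  # rebuild a database from the dump
#   sqlite3 ~/data/pero_ocr_web_data/db.sqlite .tables       # list the tables
```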

Open issues

~~TODO: The next step would be running the layout analysis. That step is not possible because there is no layout detector offered.~~ This was fixed by running register_layout_detectors.py.

TODO: It is now possible to trigger the layout analysis. ~~But it is not started. Maybe some worker process is still missing?~~ It starts in the worker ("client") process, but requires layout detectors (complex_printed_and_handwritten_layout, complex_printed_and_handwritten_layout_(experimental), printed_layout, simple_threshold_region). Most of them are currently unavailable.

TODO: Models for OCR and language models are also required and mostly unavailable.

TODO: Some parts of the original pero_ocr_web code expect an SQLite database, while other parts require a PostgreSQL database. Which kind of database should be used?
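
One way to approach this question is to list which source files mention which database engine. This is a rough sketch, assuming it is run from the pero_ocr_web source directory. files_mentioning is a hypothetical helper name, and a match in a file is only a hint at a dependency, not proof of one.

```shell
# List the Python files below a directory that mention a given word (case-insensitive).
# $1: search word, $2: directory to search.
files_mentioning() {
    find "$2" -name '*.py' -exec grep -l -i -e "$1" {} +
}

# Usage:
#   files_mentioning sqlite .
#   files_mentioning postgres .
```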

TODO: Even with available models, OCR throws an error:

REQUEST
##############################################################
REQUEST ID: eb325fcd-3c5b-459f-8ccd-53711805df78
LAYOUT DETECTOR ID: 31aa5818-5e73-4825-b6c2-9013096901b5
IMAGES IDS:
9385ae5c-235b-403e-bf20-3ec27d3565a2
##############################################################

GETTING LAYOUT DETECTOR: 31aa5818-5e73-4825-b6c2-9013096901b5

GETTING IMAGES
##############################################################
1/1 GETTING IMAGE: 9385ae5c-235b-403e-bf20-3ec27d3565a2
##############################################################

STARTING PARSE FOLDER: /Users/stweil/src/github/DCGM/pero-ocr/user_scripts/parse_folder.py
##############################################################
python /Users/stweil/src/github/DCGM/pero-ocr/user_scripts/parse_folder.py -c ./layout_detector/config.ini /Users/stweil/data/pero_ocr_web_data/layout_analysis_client/eb325fcd-3c5b-459f-8ccd-53711805df78
[WARNING] 2022-06-05 17:22:56,082 - tensorflow - From /Users/stweil/src/github/DCGM/venv3.8/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
numba available, importing jit
./layout_detector/./ParseNet.pb loaded
graph initialized
LayoutEngine params are line_end_weight:1.0 vertical_line_connection_range:5 smooth_line_predictions:True line_detection_threshold:0.2 adaptive_downsample:True
/Users/stweil/src/github/DCGM/venv3.8/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
  warnings.warn(
/Users/stweil/src/github/DCGM/venv3.8/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=VGG16_Weights.IMAGENET1K_V1`. You can also use `weights=VGG16_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Pretrained layers
[Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False), Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False), Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), MaxPool2d(kernel_size=(2, 1), stride=(2, 1), padding=0, dilation=1, ceil_mode=False)]
Traceback (most recent call last):
  File "/Users/stweil/src/github/DCGM/pero-ocr/user_scripts/parse_folder.py", line 338, in <module>
    main()
  File "/Users/stweil/src/github/DCGM/pero-ocr/user_scripts/parse_folder.py", line 287, in main
    raise Exception(
Exception: Either INPUT_IMAGE_PATH or INPUT_XML_PATH has to be specified. Both are missing in ./layout_detector/config.ini.
##############################################################
PARSE FOLDER FAILED, SETTING REQUEST TO FAILED
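
The exception says that neither INPUT_IMAGE_PATH nor INPUT_XML_PATH is set in ./layout_detector/config.ini. The following sketch only confirms that for a given file and does not fix the problem. has_input_path is a hypothetical helper name, and the pattern assumes the usual key = value INI syntax.

```shell
# Succeed if an INI file assigns INPUT_IMAGE_PATH or INPUT_XML_PATH.
# Keys are matched case-insensitively, leading whitespace is ignored.
has_input_path() {
    grep -i -q -E '^[[:space:]]*(INPUT_IMAGE_PATH|INPUT_XML_PATH)[[:space:]]*[=:]' "$1"
}

# Usage (path as printed in the log above):
#   has_input_path ./layout_detector/config.ini || echo "no input path configured"
```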

It would also be nice to run PERO OCR under a URL path such as https://ocr-bw.bib.uni-mannheim.de/pero-ocr. This requires additional code changes. Setting APPLICATION_ROOT alone was not sufficient.

Links

PERO OCR (GitHub) – https://github.com/DCGM/pero-ocr
PERO OCR web application (GitHub) – https://github.com/DCGM/pero_ocr_web
PERO API documentation – https://app.swaggerhub.com/apis/LachubCz/PERO-API/1.0.3
LGTM results for pero_ocr_web – https://lgtm.com/projects/g/DCGM/pero_ocr_web/alerts

Publications

Machine learning in Czech libraries - OCR for early printed and handwritten documents – https://opus4.kobv.de/opus4-bib-info/frontdoor/deliver/index/docId/17860/file/Zabicka_OCR_2022_Leipzig.pdf
PERO OCR for old(er) documents (Michal Hradis, Petr Zabicka, Alzbeta Zavrelova) – https://docplayer.cz/171143330-Pero-ocr-for-old-er-documents-michal-hradis-petr-zabicka-alzbeta-zavrelova.html
User manual – https://www.fit.vut.cz/research/product-file/666/User%20manual%20OCR.pdf
AT-ST: Self-Training Adaptation Strategy for OCR in Domains with Limited Transcriptions – https://arxiv.org/pdf/2104.13037v1.pdf
Handwritten Text Recognition in Historical Documents – https://repositum.tuwien.at/retrieve/10807
