You might have outdated tesseract tessdata files in either the system-wide or the user-local folders, which cause tesseract to abort. Try deleting these files or moving them out of the way. This is particularly likely if the major version of tesseract changed, e.g. when switching from tesseract 3.x to tesseract 4.x.
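Moving a tessdata folder aside rather than deleting it keeps the old files recoverable. The following is a minimal sketch of that step; the function name `backup_tessdata` and the folder path passed to it are placeholders, and the real folder locations have to be looked up in the gImageReader preferences dialog.

```shell
# Move a tessdata folder out of the way instead of deleting it,
# so that it can be restored later if needed.
# The folder path is an assumption supplied by the caller.
backup_tessdata() {
    dir="${1%/}"
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    # Append a timestamp so repeated backups do not collide.
    backup="$dir.bak.$(date +%Y%m%d%H%M%S)"
    mv -- "$dir" "$backup" && echo "$backup"
}
```

Calling `backup_tessdata /path/to/tessdata` prints the name of the backup folder on success. System-wide folders will typically require root permissions.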
If you compiled the application yourself or installed the package from a Linux distribution's repository, tesseract may also have been updated since gImageReader was compiled against it. Tesseract does not always bump the library version when the library changes in an incompatible way, so such incompatibilities are easy to miss and can cause applications that load the tesseract library, such as gImageReader, to crash silently. In this case, try recompiling gImageReader against the currently installed version of tesseract, or ask your distribution's package maintainer to rebuild the package.
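One way to see which tesseract library the binary resolves at run time is to filter the output of `ldd`. This is a rough sketch for Linux only; the helper name `tesseract_libs` is made up for illustration, and the binary name `gimagereader-gtk` is assumed to match your installation.

```shell
# Keep only the libtesseract lines from `ldd`-style output read on stdin.
tesseract_libs() {
    grep -i 'libtesseract'
}

# Usage (not executed here):
#   ldd "$(command -v gimagereader-gtk)" | tesseract_libs
#   tesseract --version    # version of the installed command-line tool, if present
```

Keep in mind that, as described above, a matching library file name does not guarantee compatibility, so a rebuild may still be necessary.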
Another frequent cause is using a non-English locale with gImageReader < 3.3.1: tesseract reads its parameters with sscanf, which respects the system locale, but the parameters it reads are always written in the C locale. This results in a crash with a backtrace similar to:
[...]
#4 0xb7c4e276 in ERRCODE::error(char const*, TessErrorLogCode, char const*, ...) const () from /usr/lib/i386-linux-gnu/libtesseract.so.4
#5 0xb7c4e454 in err_exit() () from /usr/lib/i386-linux-gnu/libtesseract.so.4
#6 0xb7c409cf in DoError(int, char const*) () from /usr/lib/i386-linux-gnu/libtesseract.so.4
#7 0xb7b90ebd in ReadParamDesc(tesseract::TFile*, unsigned short) () from /usr/lib/i386-linux-gnu/libtesseract.so.4
#8 0xb7ba5564 in tesseract::Classify::ReadNormProtos(tesseract::TFile*) () from /usr/lib/i386-linux-gnu/libtesseract.so.4
#9 0xb7b888fb in tesseract::Classify::InitAdaptiveClassifier(tesseract::TessdataManager*) () from /usr/lib/i386-linux-gnu/libtesseract.so.4
[...]
The workaround is to invoke gImageReader with the numeric locale overridden, e.g.:
LC_NUMERIC=C gimagereader-gtk
gImageReader >= 3.3.1 internally overrides the locale when initializing tesseract.
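To check whether the numeric locale could be the culprit, you can look at the decimal separator currently in effect. The sketch below uses the standard `locale` utility; the function names are made up for illustration. Note that if `LC_ALL` is set in your environment, it takes precedence over `LC_NUMERIC`, so overriding only `LC_NUMERIC` would have no effect in that case.

```shell
# Print the decimal separator of the numeric locale currently in effect.
decimal_point() {
    locale decimal_point
}

# Succeed only if numbers use "." as the decimal separator, as in the C locale.
numeric_locale_is_c_like() {
    [ "$(decimal_point)" = "." ]
}
```

If `numeric_locale_is_c_like` fails (for example because the separator is a comma), the locale issue described above is a plausible cause for gImageReader < 3.3.1.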
On Windows, gImageReader uses TWAIN for communicating with scanners. gImageReader implements the TWAIN 2.2 specification, and supports both TWAIN 1.x and TWAIN 2.x data sources (drivers). Typically, there are two main causes for a scanner not being recognized under Windows:
- There is no TWAIN driver installed for the scanner: TWAIN drivers should be installed in C:\WINDOWS\twain_32 or C:\WINDOWS\twain_64. Check whether these locations contain any drivers related to your scanner, or check the manufacturer's website for a TWAIN driver download.
- TWAIN driver architecture mismatch: If your scanner ships a 32-bit TWAIN driver (i.e. installed in C:\WINDOWS\twain_32), then you'll need to use the 32-bit (aka i686) version of gImageReader. If your scanner ships a 64-bit TWAIN driver, you'll need to use the 64-bit (aka x86_64) version of gImageReader.
There is a bug in certain versions of SANE scanner drivers which causes gImageReader to crash when a second page is scanned. If you are affected by this issue, try using the latest sane-backends.
You can see the paths in the gImageReader preferences dialog. In particular, the paths depend on whether you have selected system-wide or user-local paths in the preferences dialog.
Automatic download of spelling dictionaries / (un)installation of tesseract language definitions fails
- On Windows, if you don't have write permissions for the location where gImageReader is installed, you can select user-local paths in the preferences dialog. gImageReader will then store these files below your home folder.
- On Linux, (un)installation to/from system-wide folders only works if you have PackageKit installed. Alternatively, you may manage the corresponding packages directly via the distribution's package management tools. Or you can select user-local paths in the preferences dialog, and gImageReader will store the files below your home folder, without using any package management tool.
Make sure you've downloaded the actual traineddata file from the GitHub tessdata page.
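A quick sanity check on a downloaded file is to see whether it starts like a web page rather than binary data. The sketch below assumes that the typical mistake is saving the HTML page that displays the file instead of the file itself; the function name `looks_like_html` is made up for illustration, and passing the check does not prove the file is a valid traineddata file.

```shell
# Succeed if the file starts like an HTML page rather than binary traineddata.
# Only the first 512 bytes are inspected.
looks_like_html() {
    head -c 512 -- "$1" | LC_ALL=C grep -a -i -q -e '<!doctype html' -e '<html'
}
```

If `looks_like_html eng.traineddata` succeeds, the download is a web page and needs to be repeated using the raw file link.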
The available image formats depend on which Qt image format plugins are installed on the system. If you are using gImageReader compiled against Qt5, make sure the Qt5 image plugins package is installed (on Debian/Ubuntu it's called qt5-image-formats-plugins, on Fedora qt5-qtimageformats).
If you are using the Qt5 interface of gImageReader, you can choose in the program options whether the text output is encoded using system encoding or UTF-8. Default is system encoding.
No, the gImageReader Windows installer bundles the necessary tesseract files. If you want to install tesseract separately for other uses, you can certainly do so, but it has no effect on gImageReader.
The easiest way is by using the integrated tessdata manager: from the language selection menu, select Manage languages... and select the languages you need. You can also install the languages manually, as described in the manual (Help entry in the application menu).
The tesseract language definition is used by tesseract (the OCR engine) for the actual recognition. The spelling dictionary is used for spell-checking the recognized text in the output pane.
- For PDF sources, it specifies the resolution, in dots per inch (DPI), at which the PDF is rendered to the image on which recognition is performed. For PDF sources, gImageReader defaults to 300 (i.e. 300 DPI).
- For non-PDF sources (i.e. images such as JPG, PNG, TIFF, etc.), it represents the scaling factor applied to the image on which recognition is performed. So 100 means 100%, i.e. the original size, 50 means 50%, i.e. half the size, and so on. For non-PDF sources, gImageReader defaults to 100 (i.e. 100%).
- Many image formats (TIFF, JPG, PNG and BMP, to name a few) do support specifying the DPI in the metadata, but few images actually store the physically correct DPI; most store just the screen DPI (say 72 or 96) or nothing at all. The current behaviour means that, by default, the input image is recognized as-is (i.e. at 100% scale).
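The two meanings of the value can be illustrated with some simple arithmetic. This is a sketch under stated assumptions: page sizes are given in PDF points (1 point = 1/72 inch), the function names are made up for illustration, and the integer truncation used here may differ from the rounding gImageReader applies internally.

```shell
# Pixels along one side of a PDF page rendered at a given DPI.
# $1: side length in PDF points (1/72 inch), $2: DPI.
pdf_pixels() {
    echo $(( $1 * $2 / 72 ))
}

# Pixels along one side of an image after applying a percentage scale.
# $1: side length in pixels, $2: scale in percent.
scaled_pixels() {
    echo $(( $1 * $2 / 100 ))
}
```

For example, an A4 page (595 x 842 points) rendered at 300 DPI gives `pdf_pixels 595 300` and `pdf_pixels 842 300`, i.e. roughly 2479 x 3508 pixels, while a 2000 pixel wide image at a value of 50 is recognized at `scaled_pixels 2000 50`, i.e. 1000 pixels.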
You might be using a tesseract library compiled with OpenMP support enabled. The upstream default and recommendation is to disable OpenMP. You can either rebuild tesseract with OpenMP support disabled, or limit the maximum number of OpenMP threads via an environment variable before starting gImageReader, e.g.:
$ export OMP_THREAD_LIMIT=1
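If you would rather not change the environment of the whole shell session, the limit can be applied to a single command instead. A minimal sketch, with `run_single_threaded` being a made-up helper name:

```shell
# Run a command with OpenMP limited to one thread.
# `env` sets the variable only for the launched command,
# so the calling shell's environment stays unchanged.
run_single_threaded() {
    env OMP_THREAD_LIMIT=1 "$@"
}
```

For example, `run_single_threaded gimagereader-gtk` starts the GTK interface with the limit applied, assuming the binary is named as in the locale workaround above.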