Feature extraction from tesseract output #2

OlivierBinette · 2021-01-04T17:20:35Z

Given an hOCR output, tidy up the data and extract relevant features.

Tidy data is a table such that:

every row represents a word,
columns represent features of the word.

Relevant features are:

Font size,
Capitalization styles,
Line/paragraph/continuous area to which the word belongs,
Bounding box of the line/paragraph/continuous area,
other
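
As an illustration of the "capitalization styles" feature, here is a minimal sketch of how a word could be assigned a style label. The function name and the five labels are assumptions made for this example; nothing in the issue fixes them.

```python
def capitalization_style(word: str) -> str:
    """Classify a word by the case of its alphabetic characters.

    Labels (hypothetical): "upper", "lower", "title", "mixed", "none".
    Punctuation and digits are ignored.
    """
    letters = [c for c in word if c.isalpha()]
    if not letters:
        return "none"   # e.g. numbers or pure punctuation
    if all(c.isupper() for c in letters):
        return "upper"  # e.g. "ALICE" (single capitals such as "A" land here too)
    if all(c.islower() for c in letters):
        return "lower"
    if letters[0].isupper() and all(c.islower() for c in letters[1:]):
        return "title"  # e.g. "Alice's"
    return "mixed"      # e.g. "iPhone"
```

Ignoring non-letters keeps trailing punctuation ("rabbit,") from changing the label, which seems preferable for OCR output where punctuation is attached to words.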

neel216 · 2021-01-11T01:21:39Z

Uploaded an initial version of the hOCR data extraction. For each word, it parses the word text, word confidence, line number, line height, paragraph number, column area number, and page number, along with the bounding box of each of these elements, and saves the result as a CSV. The CSV is a bit messy because each row stores the word together with the relationship info for all of its parent elements. We may want to switch to a file format that better represents relationships between elements, although CSV was the easiest to start with.
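
For reference, the one-row-per-word extraction described above could be sketched roughly as follows. This is not the uploaded script: the sample markup is a hand-written snippet in the shape of tesseract's hOCR, and the column names and helper functions are assumptions for illustration. Each row keeps only the ids of its parent elements, so parent bounding boxes could live in a separate table instead of being repeated on every word.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical hand-written snippet shaped like tesseract hOCR output.
SAMPLE = """<div class='ocr_page' id='page_1' title='bbox 0 0 640 480'>
 <div class='ocr_carea' id='block_1_1' title='bbox 36 92 580 160'>
  <p class='ocr_par' id='par_1_1' title='bbox 36 92 580 160'>
   <span class='ocr_line' id='line_1_1' title='bbox 36 92 580 122; x_size 30'>
    <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 96'>Alice</span>
    <span class='ocrx_word' id='word_1_2' title='bbox 109 92 150 116; x_wconf 93'>was</span>
   </span>
  </p>
 </div>
</div>"""

# hOCR container classes mapped to the (assumed) column names used below.
LEVELS = {"ocr_page": "page", "ocr_carea": "carea", "ocr_par": "par", "ocr_line": "line"}


def parse_title(title):
    """Split a title attribute like 'bbox 1 2 3 4; x_wconf 96' into a dict of token lists."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


def extract_words(hocr_text):
    """Return one dict per ocrx_word, tagged with the ids of its parent elements."""
    rows = []

    def walk(elem, parents):
        cls = elem.get("class", "")
        if cls in LEVELS:
            # Remember this container's id for every word below it.
            parents = {**parents, LEVELS[cls]: elem.get("id")}
        elif cls == "ocrx_word":
            props = parse_title(elem.get("title", ""))
            x0, y0, x1, y1 = (int(v) for v in props["bbox"])
            rows.append({
                **parents,
                "text": "".join(elem.itertext()).strip(),
                "conf": float(props.get("x_wconf", ["nan"])[0]),
                "x0": x0, "y0": y0, "x1": x1, "y1": y1,
            })
            return
        for child in elem:
            walk(child, parents)

    walk(ET.fromstring(hocr_text), {})
    return rows


def to_csv(rows):
    """Serialize the word rows as CSV text with a fixed column order."""
    fields = ["page", "carea", "par", "line", "text", "conf", "x0", "y0", "x1", "y1"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Calling `to_csv(extract_words(SAMPLE))` yields a header line followed by one line per word. The walk relies on `class` attributes rather than tag names, so it does not depend on the XHTML namespace that real hOCR files declare.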

The feature extraction uses the hOCR standards defined here. The sampleInput.hocr I tested on was generated using tesseract 4.1.1 with the command:

tesseract alice_1.png sampleInput hocr

As far as I can tell, the new LSTM version of tesseract doesn't support the x_fsize font size property, but it does support the x_size property that defines line height. It should be possible to estimate font size from this and other properties of the image, but line height might be enough for our purposes.
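
A rough sketch of that calculation, assuming x_size is a height in pixels and the scan resolution (DPI) is known: since one typographic point is 1/72 inch, a pixel height converts to points as height * 72 / dpi. The function names are hypothetical, and the result is only a proxy for font size because line height also depends on the typeface.

```python
import re


def line_height_px(title):
    """Read the x_size value from an hOCR line's title attribute, or None if absent."""
    match = re.search(r"\bx_size\s+([0-9.]+)", title)
    return float(match.group(1)) if match else None


def approx_point_size(height_px, dpi):
    """Convert a pixel height to typographic points (1 pt = 1/72 inch).

    Assumption: dpi is the resolution of the scanned image.
    """
    return height_px * 72.0 / dpi


# Example with a made-up title attribute and an assumed 300 DPI scan.
title = "bbox 36 92 580 122; baseline 0 -6; x_size 30; x_descenders 6; x_ascenders 7"
height = line_height_px(title)
points = approx_point_size(height, dpi=300)
```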

neel216 · 2021-03-13T06:03:58Z

Did some research after seeing Olivier's comment #16 and just wanted to update the command here to include x_fsize while still using tesseract 4.1.1 (the LSTM version):

tesseract alice_1.png sampleInput hocr -c hocr_font_info=1

For some reason this results in a read_params_file error for the -c flag and the config option, but the hOCR file looks fine and now includes x_fsize for each word.
It's worth noting that text that is clearly the same font size shows some variation in x_fsize, so we may want to find a way to normalize these font size values for use in #5.
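
One possible way to normalize, sketched below, is to replace each word's x_fsize with the median over its line, on the assumption that words on the same line usually share a font size. This is only one candidate approach; the dict keys ("line", "fsize", "fsize_norm") and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import median


def normalize_font_sizes(words):
    """Add an 'fsize_norm' key holding the median x_fsize of each word's line.

    `words` is a list of dicts with (assumed) keys 'line' and 'fsize'.
    The input dicts are not modified; new dicts are returned.
    """
    sizes_by_line = defaultdict(list)
    for word in words:
        sizes_by_line[word["line"]].append(word["fsize"])
    line_median = {line: median(sizes) for line, sizes in sizes_by_line.items()}
    return [{**word, "fsize_norm": line_median[word["line"]]} for word in words]


# Made-up values: body text jitters around 12, a headline around 24.
words = [
    {"text": "Alice", "line": "line_1_1", "fsize": 11},
    {"text": "was", "line": "line_1_1", "fsize": 12},
    {"text": "beginning", "line": "line_1_1", "fsize": 12},
    {"text": "CHAPTER", "line": "line_1_2", "fsize": 24},
    {"text": "I.", "line": "line_1_2", "fsize": 25},
]
normalized = normalize_font_sizes(words)
```

The median is used rather than the mean so that a single badly estimated word has little effect on the line's value. Lines that genuinely mix font sizes would need a different grouping.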

OlivierBinette added this to the Baseline approach milestone Jan 4, 2021

OlivierBinette added the todo label Jan 4, 2021

This was referenced Jan 4, 2021

Study text continuity metrics #4

Open

Identify article titles #5

Open

Implement rule-based article extraction #6

Open

Run batch OCR on newspaper scans #9

Open

neel216 self-assigned this Jan 4, 2021
