Feature extraction from tesseract output #2

Open
OlivierBinette opened this issue Jan 4, 2021 · 2 comments

@OlivierBinette
Member

Given hOCR output, tidy up the data and extract relevant features.

Tidy data is a table such that:

  • every row represents a word,
  • columns represent features of the word.

Relevant features are:

  • Font size,
  • Capitalization styles,
  • Line/paragraph/continuous area to which the word belongs,
  • Bounding box of the line/paragraph/continuous area,
  • other
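
As one illustration of the "capitalization styles" feature, a word could be mapped to a small set of labels. This is only a sketch: the function name and the label set ("upper", "lower", "title", "mixed", "none") are assumptions, not something already decided in this issue.

```python
def capitalization_style(word: str) -> str:
    """Classify a word by the case pattern of its letters (hypothetical labels)."""
    letters = [c for c in word if c.isalpha()]
    if not letters:
        return "none"   # digits or punctuation only
    if all(c.isupper() for c in letters):
        return "upper"  # e.g. "ALICE"
    if all(c.islower() for c in letters):
        return "lower"  # e.g. "alice"
    if letters[0].isupper() and all(c.islower() for c in letters[1:]):
        return "title"  # e.g. "Alice"
    return "mixed"      # anything else, e.g. "aLiCe"
```

Non-letter characters are ignored, so a word such as "Alice's" is still labeled by its letters alone.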
@neel216
Collaborator

neel216 commented Jan 11, 2021

Uploaded an initial version of the hOCR data extraction. For each word, it parses the word text, word confidence, line number, line height, paragraph number, column area number, page number, and the bounding box for each of these elements, then saves everything as a CSV. The CSV is a bit messy because it stores the relationship info for the word and all of its parent elements, so we may want to switch to a file format that better represents relationships between elements, although CSV was the easiest to start with.
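
For readers following along, here is a minimal sketch of this kind of extraction using only the Python standard library. It is not the uploaded code: the names `extract_words`, `parse_title`, and the column names are made up for illustration. It assumes tesseract-style hOCR in which words are `ocrx_word` spans nested inside `ocr_line`, `ocr_par`, `ocr_carea`, and `ocr_page` elements, and it ignores other line classes such as captions or headers.

```python
from html.parser import HTMLParser

# hOCR class -> column holding the id of the enclosing element.
LEVELS = {"ocr_page": "page", "ocr_carea": "area", "ocr_par": "par", "ocr_line": "line"}


def parse_title(title):
    """Split a title like 'bbox 1 2 3 4; x_wconf 95' into {name: [values]}."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


class _WordCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []       # one dict per word
        self.context = {}    # ids of the enclosing page/area/par/line
        self.current = None  # row for the word being read, if any
        self.depth = 0       # nesting depth inside the current word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = attrs.get("class") or ""
        props = parse_title(attrs.get("title") or "")
        if self.current is not None:
            self.depth += 1  # e.g. <strong> or <em> inside a word
        elif cls in LEVELS:
            self.context[LEVELS[cls]] = attrs.get("id")
            if cls == "ocr_line":
                size = props.get("x_size")
                self.context["line_height"] = float(size[0]) if size else None
        elif cls == "ocrx_word":
            conf = props.get("x_wconf")
            self.current = {
                "text": "",
                "conf": float(conf[0]) if conf else None,
                "bbox": tuple(int(v) for v in props.get("bbox", [])),
                **self.context,
            }

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"] += data

    def handle_endtag(self, tag):
        if self.current is None:
            return
        if self.depth == 0:  # closing tag of the word span itself
            self.rows.append(self.current)
            self.current = None
        else:
            self.depth -= 1


def extract_words(hocr_text):
    """Return a list of tidy rows (one per word) from an hOCR string."""
    collector = _WordCollector()
    collector.feed(hocr_text)
    return collector.rows
```

Each row is a flat dict, so the list can be written out with `csv.DictWriter` or loaded into a data frame. Storing only the parent ids on each word, rather than every parent's full details, is one way to keep the table less messy.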

The feature extraction uses hOCR standards defined here. The sampleInput.hocr I tested on was generated using tesseract 4.1.1 with the command:

tesseract alice_1.png sampleInput hocr

As far as I can tell, the new LSTM version of tesseract doesn't support the x_fsize font size property, but it does support the x_size property that defines line height. It is possible to calculate font size from this and other properties of the image, but line height might be enough for our purposes.
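
As a rough illustration of such a calculation: if the scan resolution is known, a pixel height converts to points at 72 points per inch. The sketch below assumes that x_size is measured in pixels and that the image DPI is available. The function name is hypothetical, and line height is only an approximation of the true font size.

```python
def pixels_to_points(height_px: float, dpi: float) -> float:
    """Convert a height in pixels to typographic points (72 points per inch)."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return height_px * 72.0 / dpi
```

For example, a 25-pixel line height on a 150 DPI scan corresponds to 12 points.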

@neel216
Collaborator

neel216 commented Mar 13, 2021

Did some research after seeing Olivier's comment #16 and just wanted to update the command here to include x_fsize while still using tesseract 4.1.1 (the LSTM version):

tesseract alice_1.png sampleInput hocr -c hocr_font_info=1

For some reason it results in a read_params_file error for the -c and config option, but the hOCR file looks fine and now includes x_fsize for each word.
It's worth noting that text that is clearly the same font size shows some variation in its x_fsize values, so we may want to figure out a way to normalize these values for use in #5.
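
One possible way to normalize, sketched here as an assumption rather than a decided approach: replace each word's font size with the median over the line it belongs to. The row layout (dicts with "line" and "fsize" keys) and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import median


def normalize_font_sizes(rows, group_key="line", size_key="fsize"):
    """Add an 'fsize_norm' column holding the median font size of each group."""
    groups = defaultdict(list)
    for row in rows:
        if row.get(size_key) is not None:
            groups[row[group_key]].append(row[size_key])
    medians = {key: median(sizes) for key, sizes in groups.items()}
    # Rows are copied so the input list is left unchanged; a group with no
    # usable sizes gets None.
    return [{**row, "fsize_norm": medians.get(row[group_key])} for row in rows]
```

Grouping by paragraph instead of line (`group_key="par"`) would smooth more aggressively, at the risk of hiding real size changes within a paragraph.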
