Uploaded an initial version of the hOCR data extraction. For each word, it parses the word text, word confidence, line number, line height, paragraph number, column area number, and page number, along with the bounding box of each of these elements, and saves the result as a CSV. The CSV is a bit messy because every row stores the word together with the relationship info for all of its parent elements, so we may want to switch to a file format that better represents relationships between elements. CSV was just the easiest to start with.
The feature extraction uses the hOCR standards defined here. The sampleInput.hocr I tested on was generated using tesseract 4.1.1 with the command: tesseract alice_1.png sampleInput hocr
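For anyone following along, here is a minimal sketch of what this kind of per-word extraction could look like using only the Python standard library. It is not the uploaded script: the function names (parse_title, extract_words, rows_to_csv), the column names, and the tiny inline sample are all made up for illustration, and it only handles the ocr_page, ocr_carea, ocr_par, ocr_line, and ocrx_word classes.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature hOCR document, shaped like tesseract output.
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class='ocr_page' id='page_1' title='image "alice_1.png"; bbox 0 0 800 600; ppageno 0'>
 <div class='ocr_carea' id='block_1_1' title="bbox 10 10 400 60">
  <p class='ocr_par' id='par_1_1' title="bbox 10 10 400 60">
   <span class='ocr_line' id='line_1_1' title="bbox 10 10 400 60; baseline 0 -5; x_size 40">
    <span class='ocrx_word' id='word_1_1' title='bbox 10 10 120 60; x_wconf 96'>Alice</span>
    <span class='ocrx_word' id='word_1_2' title='bbox 130 10 200 60; x_wconf 91'>was</span>
   </span>
  </p>
 </div>
</div>
</body></html>"""

# Map hOCR classes to the column prefix used for each parent level.
LEVELS = {"ocr_page": "page", "ocr_carea": "carea", "ocr_par": "par", "ocr_line": "line"}


def parse_title(title):
    """Split an hOCR title attribute ("bbox 0 0 1 1; x_wconf 96") into a dict."""
    props = {}
    for part in title.split(";"):
        part = part.strip()
        if part:
            key, _, value = part.partition(" ")
            props[key] = value.strip()
    return props


def extract_words(hocr_text):
    """Return one dict per word, carrying the ids and bboxes of its parents."""
    rows = []

    def walk(elem, context):
        classes = elem.get("class", "").split()
        props = parse_title(elem.get("title", ""))
        for cls in classes:
            if cls in LEVELS:
                # Copy so sibling elements do not see each other's context.
                context = dict(context)
                context[LEVELS[cls] + "_id"] = elem.get("id")
                context[LEVELS[cls] + "_bbox"] = props.get("bbox")
                if cls == "ocr_line":
                    context["line_height"] = props.get("x_size")
        if "ocrx_word" in classes:
            row = dict(context)
            row["word"] = "".join(elem.itertext()).strip()
            row["word_conf"] = props.get("x_wconf")
            row["word_bbox"] = props.get("bbox")
            rows.append(row)
            return
        for child in elem:
            walk(child, context)

    walk(ET.fromstring(hocr_text), {})
    return rows


def rows_to_csv(rows):
    """Serialize the word rows as CSV text, one row per word."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = extract_words(SAMPLE_HOCR)
csv_text = rows_to_csv(rows)
```

Repeating the parent ids and bounding boxes on every word row is what makes the table wide, which is the messiness mentioned above.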
As far as I can tell, the new LSTM version of tesseract doesn't support the x_fsize font size property, but it does support the x_size property that defines line height. It is possible to calculate a font size from this and other properties of the image, but line height might be enough for our purposes.
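As a rough illustration of such a calculation, the sketch below converts a pixel height to points using the image resolution (1 pt = 1/72 inch). Treating x_size as a stand-in for font size is an assumption on my part, as is the helper name px_to_points, and the DPI has to be known or guessed for the scanned image.

```python
def px_to_points(height_px, dpi):
    """Convert a height in pixels to typographic points (1 pt = 1/72 inch)."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return height_px * 72.0 / dpi


# Assumed example values: a line with x_size 40 scanned at 300 DPI.
approx_size_pt = px_to_points(40, 300)
```

Since x_size measures the line rather than the glyphs, the result is at best a proportional estimate of font size.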
Did some research after seeing Olivier's comment #16 and just wanted to update the command here to include x_fsize while still using tesseract 4.1.1 (the LSTM version):
For some reason it results in a read_params_file error for the -c and config option, but the hOCR file looks fine and now includes x_fsize for each word.
It's worth noting that text that is clearly the same font size shows some variation in x_fsize, so we may want to figure out a way to normalize these font size values for use in #5.
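One possible way to normalize, sketched below, is to replace each word's x_fsize with the median over its line. This is only a suggestion, not something decided in this thread: the function name normalize_font_sizes, the row layout (dicts with line_id and x_fsize keys), and the sample values are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def normalize_font_sizes(rows, group_key="line_id"):
    """Add an x_fsize_norm column holding the median x_fsize of each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row["x_fsize"])
    medians = {key: median(sizes) for key, sizes in groups.items()}
    # Return new dicts so the input rows are left unchanged.
    return [dict(row, x_fsize_norm=medians[row[group_key]]) for row in rows]


# Made-up rows: body text that jitters between 11 and 12, plus a heading line.
words = [
    {"word": "Alice", "line_id": "line_1_1", "x_fsize": 11},
    {"word": "was", "line_id": "line_1_1", "x_fsize": 12},
    {"word": "beginning", "line_id": "line_1_1", "x_fsize": 11},
    {"word": "CHAPTER", "line_id": "line_1_2", "x_fsize": 20},
    {"word": "I", "line_id": "line_1_2", "x_fsize": 22},
]
normalized = normalize_font_sizes(words)
```

Grouping by paragraph instead of line (by passing a different group_key) would smooth more aggressively, at the cost of hiding lines that really do mix sizes.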
Given an hOCR output, tidy-up the data and extract relevant features.
Tidy data is a table such that:
Relevant features are: