Parsing hOCR
- hocr.parse.hocr_page_get_dimensions(hocr_page)
Returns the dimensions (width, height) of a hocr page as returned by hocr_page_iterator.
Args:
hocr_page: a page as returned by hocr_page_iterator
Returns:
(width, height): tuple of (int, int)
- hocr.parse.hocr_page_get_scan_res(hocr_page)
Returns the X and Y resolution (in DPI) of a hocr page as returned by hocr_page_iterator.
Args:
hocr_page: a page as returned by hocr_page_iterator
Returns:
(x_res, y_res): tuple of (int, int)
Or (None, None) if the scan_res property is not present.
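In hOCR, page-level properties such as bbox and scan_res are stored in the title attribute of the ocr_page element. The following is a minimal sketch of how such values can be read with the standard library only. The helper names parse_title_props, page_dimensions and page_scan_res are hypothetical and are not part of hocr.parse, and the library's own implementation may differ in detail.

```python
import xml.etree.ElementTree as ET


def parse_title_props(title):
    """Split an hOCR title attribute into a {name: [values]} dict.

    Assumption: properties are separated by ';' and no value contains a ';'.
    """
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if parts:
            props[parts[0]] = parts[1:]
    return props


def page_dimensions(page):
    """Return (width, height) from the bbox property (KeyError if absent)."""
    x0, y0, x1, y1 = (int(v) for v in parse_title_props(page.get("title", ""))["bbox"])
    return (x1 - x0, y1 - y0)


def page_scan_res(page):
    """Return (x_res, y_res), or (None, None) when scan_res is not present."""
    values = parse_title_props(page.get("title", "")).get("scan_res")
    if not values or len(values) != 2:
        return (None, None)
    return (int(values[0]), int(values[1]))


# A hand-written page element, standing in for one yielded by hocr_page_iterator.
page = ET.fromstring(
    '<div class="ocr_page" title="bbox 0 0 2480 3508; ppageno 0; scan_res 300 300"/>'
)
bare_page = ET.fromstring('<div class="ocr_page" title="bbox 0 0 100 200"/>')

dimensions = page_dimensions(page)
resolution = page_scan_res(page)
missing_resolution = page_scan_res(bare_page)
```

Callers should check for the (None, None) case before using the resolution, for example before converting pixel coordinates to inches.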
- hocr.parse.hocr_page_iterator(fd_or_path)
Returns an iterator over the pages of a (potentially large) hOCR XML file, reading the file in a streaming manner.
Args:
fd_or_path: open file to operate on, or a path (str).
Returns:
An iterator that yields each hOCR page as an ElementTree.Element.
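To illustrate what streaming iteration over hOCR pages can look like, here is a sketch built on xml.etree.ElementTree.iterparse. The function iter_ocr_pages is a hypothetical stand-in and is not the library's implementation; it only shows the general technique of yielding each ocr_page element once it has been fully parsed and then releasing its contents.

```python
import io
import xml.etree.ElementTree as ET


def iter_ocr_pages(fd_or_path):
    """Yield each ocr_page element once its closing tag has been parsed."""
    for _event, elem in ET.iterparse(fd_or_path, events=("end",)):
        # Tags may carry an XHTML namespace: '{http://www.w3.org/1999/xhtml}div'.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "div" and elem.get("class") == "ocr_page":
            yield elem
            # Runs when the caller asks for the next page: drop this page's
            # children and attributes so memory use stays bounded.
            elem.clear()


SAMPLE = b"""<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="ocr_page" title="bbox 0 0 100 200"><span class="ocrx_word">a</span></div>
<div class="ocr_page" title="bbox 0 0 300 400"><span class="ocrx_word">b</span></div>
</body></html>"""

# Each page must be used before advancing the iterator, because it is cleared afterwards.
titles = [page.get("title") for page in iter_ocr_pages(io.BytesIO(SAMPLE))]
```

Because a page is cleared as soon as the next one is requested, collecting the yielded elements into a list and reading them later would not work in this sketch; extract what is needed from each page inside the loop.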
- hocr.parse.hocr_page_to_photo_data(hocr_page, minimum_page_area_pct=10)
Parses a single hocr_page into photo data.
Args:
hocr_page: a single hocr_page as returned by hocr_page_iterator
(optional) minimum_page_area_pct: the minimum percentage of the page area that a picture must occupy
Returns:
A list of bounding boxes where photos were found
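The effect of minimum_page_area_pct can be sketched as a simple area filter. The helper filter_by_page_area below is hypothetical; it assumes bounding boxes of the form (x0, y0, x1, y1) and an inclusive threshold, neither of which is confirmed by the documentation above.

```python
def filter_by_page_area(boxes, page_width, page_height, minimum_page_area_pct=10):
    """Keep boxes that cover at least minimum_page_area_pct percent of the page."""
    page_area = page_width * page_height
    kept = []
    for x0, y0, x1, y1 in boxes:
        area_pct = 100.0 * (x1 - x0) * (y1 - y0) / page_area
        if area_pct >= minimum_page_area_pct:  # assumption: threshold is inclusive
            kept.append((x0, y0, x1, y1))
    return kept


# On a 1000 x 2000 page, the first box covers 12.5% and the second 0.5%.
candidate_boxes = [(0, 0, 500, 500), (0, 0, 100, 100)]
photos = filter_by_page_area(candidate_boxes, 1000, 2000)
```

Lowering the threshold keeps smaller regions; with minimum_page_area_pct=0 every candidate box would be kept.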
- hocr.parse.hocr_page_to_word_data(hocr_page, scaler=1)
Parses a single hocr_page into word data.
Args:
hocr_page: a single hocr_page as returned by hocr_page_iterator
(optional) scaler: a factor by which to scale font sizes
Returns:
A list of paragraphs, each paragraph containing a list of lines, and each line containing a list of words, plus properties.
Paragraphs have the following attributes:
‘lines’: the lines that form this paragraph
Lines have the following attributes:
‘words’: the words that form this line
‘bbox’: bounding box (tuple of 4 floats)
‘baseline’: baseline of the line (tuple of 2 floats)
Words have the following attributes:
‘text’: word text, str
‘bbox’: bounding box (tuple of 4 floats)
‘fontsize’: fontsize as a float, or 0.
‘writing_direction’: See WRITING_DIRECTION_* constants
‘confidence’: word confidence, 0 - 100
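The nested paragraph, line and word structure described above can be walked with plain loops. The sketch below uses hand-built sample data in the documented shape (the ‘writing_direction’ key is omitted because the constant values are not given here) and a hypothetical helper, paragraphs_to_text, that turns the structure back into plain text.

```python
def paragraphs_to_text(paragraphs):
    """Join words with spaces, lines with newlines, paragraphs with blank lines."""
    paragraph_texts = []
    for paragraph in paragraphs:
        line_texts = []
        for line in paragraph["lines"]:
            line_texts.append(" ".join(word["text"] for word in line["words"]))
        paragraph_texts.append("\n".join(line_texts))
    return "\n\n".join(paragraph_texts)


def make_word(text, confidence):
    """Build a sample word dict; bbox and fontsize values are made up."""
    return {"text": text, "bbox": (0.0, 0.0, 10.0, 10.0),
            "fontsize": 12.0, "confidence": confidence}


def make_line(words):
    """Build a sample line dict; bbox and baseline values are made up."""
    return {"words": words, "bbox": (0.0, 0.0, 100.0, 10.0), "baseline": (0.0, -2.0)}


sample_paragraphs = [
    {"lines": [make_line([make_word("Hello", 96), make_word("world", 91)]),
               make_line([make_word("second", 88), make_word("line", 90)])]},
    {"lines": [make_line([make_word("New", 75), make_word("paragraph", 80)])]},
]

text = paragraphs_to_text(sample_paragraphs)
```

With the library installed, the same helper could presumably be applied to the value returned by hocr_page_to_word_data for each page.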
- hocr.parse.hocr_page_to_word_data_fast(hocr_page)
Parses a single hocr_page into word data.
Args:
hocr_page: a single hocr_page as returned by hocr_page_iterator
Returns:
A list of paragraphs, each paragraph containing a list of lines, and each line containing a list of words, plus properties.
Paragraphs have the following attributes:
‘lines’: the lines that form this paragraph
Lines have the following attributes:
‘words’: the words that form this line
Words have the following attributes:
‘text’: word text, str
‘bbox’: bounding box (tuple of 4 floats)
‘confidence’: word confidence, 0 - 100
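Since the fast variant still reports ‘text’, ‘bbox’ and ‘confidence’ for every word, one use for its output is flagging words the OCR engine was unsure about. The following sketch works on hand-built data in the documented shape; low_confidence_words and its threshold parameter are hypothetical.

```python
def low_confidence_words(paragraphs, threshold=60):
    """Return (text, bbox) for every word whose confidence is below threshold."""
    flagged = []
    for paragraph in paragraphs:
        for line in paragraph["lines"]:
            for word in line["words"]:
                if word["confidence"] < threshold:
                    flagged.append((word["text"], word["bbox"]))
    return flagged


# Sample data containing only the keys documented for the fast variant.
fast_paragraphs = [
    {"lines": [
        {"words": [
            {"text": "clear", "bbox": (10.0, 10.0, 60.0, 30.0), "confidence": 95},
            {"text": "smudg3d", "bbox": (70.0, 10.0, 150.0, 30.0), "confidence": 41},
        ]},
    ]},
]

suspect_words = low_confidence_words(fast_paragraphs)
```

The bounding boxes returned alongside the text make it possible to highlight the flagged words on the page image for manual review.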