Parsing hOCR

hocr.parse.hocr_page_get_dimensions(hocr_page)[source]

Returns the dimensions (width, height) of a hocr page as returned by hocr_page_iterator.

Args:

  • hocr_page: a page as returned by hocr_page_iterator

Returns:

  • (width, height): tuple of (int, int)
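As an illustration of where these numbers come from: the hOCR format stores page geometry in a `bbox x0 y0 x1 y1` property inside the page element's `title` attribute. The sketch below reads that property with the standard library only. `page_dimensions_sketch` and `SAMPLE_PAGE` are hypothetical names, and computing the size as `x1 - x0` by `y1 - y0` is an assumption; this is not the library's own implementation.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for a page element; real pages come from hocr_page_iterator.
SAMPLE_PAGE = ET.fromstring(
    '<div class="ocr_page" title="bbox 0 0 2480 3508; ppageno 0"/>'
)


def page_dimensions_sketch(page):
    """Hypothetical helper: derive (width, height) from the page's bbox property."""
    # The title attribute holds ';'-separated properties, e.g. "bbox 0 0 2480 3508".
    for prop in page.attrib.get("title", "").split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            x0, y0, x1, y1 = (int(v) for v in parts[1:5])
            # Assumption: size is the extent of the box, not just its far corner.
            return (x1 - x0, y1 - y0)
    raise ValueError("page has no bbox property")


width, height = page_dimensions_sketch(SAMPLE_PAGE)
print(width, height)
```

With the real library, the equivalent call would presumably be `hocr_page_get_dimensions(page)` on a page obtained from `hocr_page_iterator`.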

hocr.parse.hocr_page_get_scan_res(hocr_page)[source]

Returns the X and Y resolution (in DPI) of a hocr page as returned by hocr_page_iterator.

Args:

  • hocr_page: a page as returned by hocr_page_iterator

Returns:

  • (x_res, y_res): tuple of (int, int)

Or (None, None) if the scan_res property is not present.
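The `(None, None)` fallback matters in practice, because not every OCR engine writes a `scan_res` property. The following is a minimal sketch of that behaviour using only the standard library; `page_scan_res_sketch` is a hypothetical helper written for illustration and may differ from how the library parses the property.

```python
import xml.etree.ElementTree as ET


def page_scan_res_sketch(page):
    """Hypothetical helper: return (x_res, y_res), or (None, None) if absent."""
    for prop in page.attrib.get("title", "").split(";"):
        parts = prop.split()
        # hOCR writes the property as "scan_res <x_res> <y_res>".
        if len(parts) == 3 and parts[0] == "scan_res":
            return (int(parts[1]), int(parts[2]))
    return (None, None)


with_res = ET.fromstring(
    '<div class="ocr_page" title="bbox 0 0 2480 3508; scan_res 300 300"/>'
)
without_res = ET.fromstring('<div class="ocr_page" title="bbox 0 0 2480 3508"/>')

print(page_scan_res_sketch(with_res))
print(page_scan_res_sketch(without_res))
```

Callers that need a resolution for further calculations should check for `None` and fall back to a default of their own choosing.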

hocr.parse.hocr_page_iterator(fd_or_path)[source]

Returns an iterator to iterate over a (potentially large) hOCR XML file in a streaming manner.

Args:

  • fd_or_path: open file to operate on, or a path (str).

Returns:

  • Iterator yielding one ElementTree.Element per hOCR page.
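To show what streaming means here, the sketch below walks an hOCR document with `xml.etree.ElementTree.iterparse` and yields each `ocr_page` element as soon as its closing tag is seen, so the whole file never has to be held in memory. `iter_pages_sketch` and `SAMPLE_HOCR` are hypothetical names; the library's iterator may be implemented differently.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE_HOCR = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class="ocr_page" id="page_0" title="bbox 0 0 100 200"/>
  <div class="ocr_page" id="page_1" title="bbox 0 0 100 200"/>
 </body>
</html>
"""


def iter_pages_sketch(fd_or_path):
    """Hypothetical stand-in: yield each ocr_page element once it is fully parsed."""
    # iterparse accepts either a path or an open binary file object.
    for _event, elem in ET.iterparse(fd_or_path, events=("end",)):
        if elem.get("class") == "ocr_page":
            yield elem
            # Free the page's contents after the caller has moved on.
            # Consequence: a page must be processed before requesting the next one.
            elem.clear()


page_ids = [page.get("id") for page in iter_pages_sketch(io.BytesIO(SAMPLE_HOCR))]
print(page_ids)
```

The usual pattern with the real function would be a plain `for page in hocr_page_iterator(path):` loop that handles each page before moving to the next.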

hocr.parse.hocr_page_to_photo_data(hocr_page, minimum_page_area_pct=10)[source]

Parses a single hocr_page into photo data.

Args:

  • hocr_page: a single hocr_page as returned by hocr_page_iterator

  • (optional) minimum_page_area_pct: the minimum percentage of the page area that a picture must occupy to be reported

Returns:

A list of bounding boxes where photos were found
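The effect of an area threshold such as `minimum_page_area_pct` can be sketched as follows. The helper `filter_boxes_by_area_sketch` is hypothetical, boxes are assumed to be `(x0, y0, x1, y1)` tuples, and treating a box exactly at the threshold as large enough is an assumption; the library may apply the check differently.

```python
def filter_boxes_by_area_sketch(boxes, page_width, page_height,
                                minimum_page_area_pct=10):
    """Hypothetical helper: keep boxes covering at least the given page percentage."""
    page_area = page_width * page_height
    kept = []
    for x0, y0, x1, y1 in boxes:
        area_pct = 100.0 * (x1 - x0) * (y1 - y0) / page_area
        # Assumption: a box exactly at the threshold is kept.
        if area_pct >= minimum_page_area_pct:
            kept.append((x0, y0, x1, y1))
    return kept


# On a 1000 x 1000 page: the first box covers 25% of the page, the second 0.25%.
candidates = [(0, 0, 500, 500), (10, 10, 60, 60)]
large = filter_boxes_by_area_sketch(candidates, 1000, 1000)
print(large)
```

Raising the threshold suppresses small decorative images; lowering it reports more candidates at the cost of more false positives.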

hocr.parse.hocr_page_to_word_data(hocr_page, scaler=1)[source]

Parses a single hocr_page into word data.

Args:

  • hocr_page: a single hocr_page as returned by hocr_page_iterator

  • (optional) scaler: a scalar to scale font sizes by

Returns:

A list of paragraphs, each paragraph containing a list of lines, and each line containing a list of words, plus properties.

Paragraphs have the following attributes:

  • ‘lines’: the lines that form this paragraph

Lines have the following attributes:

  • ‘words’: the words that form this line

  • ‘bbox’: bounding box (tuple of 4 floats)

  • ‘baseline’: baseline of the line (tuple of 2 floats)

Words have the following attributes:

  • ‘text’: word text, str

  • ‘bbox’: bounding box (tuple of 4 floats)

  • ‘fontsize’: fontsize as a float, or 0.

  • ‘writing_direction’: See WRITING_DIRECTION_* constants

  • ‘confidence’: word confidence, 0 - 100
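The nested result can be walked with three loops: paragraphs, then lines, then words. The sketch below rebuilds plain text from a hand-written sample shaped like the structure described above, assuming the quoted attribute names are dictionary keys. `sample_paragraphs` and `paragraphs_to_text` are hypothetical, and the ‘writing_direction’ key is left out of the sample because the values of the WRITING_DIRECTION_* constants are not shown here.

```python
# Hand-written sample mimicking the documented structure (not real OCR output).
sample_paragraphs = [
    {
        "lines": [
            {
                "bbox": (10.0, 10.0, 200.0, 30.0),
                "baseline": (0.0, -3.0),
                "words": [
                    {"text": "Hello", "bbox": (10.0, 10.0, 90.0, 30.0),
                     "fontsize": 12.0, "confidence": 96},
                    {"text": "world", "bbox": (100.0, 10.0, 200.0, 30.0),
                     "fontsize": 12.0, "confidence": 91},
                ],
            },
            {
                "bbox": (10.0, 40.0, 160.0, 60.0),
                "baseline": (0.0, -3.0),
                "words": [
                    {"text": "from", "bbox": (10.0, 40.0, 70.0, 60.0),
                     "fontsize": 12.0, "confidence": 88},
                    {"text": "hOCR", "bbox": (80.0, 40.0, 160.0, 60.0),
                     "fontsize": 12.0, "confidence": 93},
                ],
            },
        ]
    }
]


def paragraphs_to_text(paragraphs):
    """Join words with spaces, lines with newlines, paragraphs with blank lines."""
    blocks = []
    for paragraph in paragraphs:
        lines = [
            " ".join(word["text"] for word in line["words"])
            for line in paragraph["lines"]
        ]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


text = paragraphs_to_text(sample_paragraphs)
print(text)
```

In real use, `sample_paragraphs` would be replaced by the return value of `hocr_page_to_word_data(page)`.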

hocr.parse.hocr_page_to_word_data_fast(hocr_page)[source]

Parses a single hocr_page into word data.

Args:

  • hocr_page: a single hocr_page as returned by hocr_page_iterator

Returns:

A list of paragraphs, each paragraph containing a list of lines, and each line containing a list of words, plus properties.

Paragraphs have the following attributes:

  • ‘lines’: the lines that form this paragraph

Lines have the following attributes:

  • ‘words’: the words that form this line

Words have the following attributes:

  • ‘text’: word text, str

  • ‘bbox’: bounding box (tuple of 4 floats)

  • ‘confidence’: word confidence, 0 - 100
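Because the fast variant still reports ‘text’, ‘bbox’ and ‘confidence’ for every word, it is enough for tasks such as flagging uncertain words. The sketch below does this on a hand-written sample, again assuming the quoted attribute names are dictionary keys. `fast_sample`, `low_confidence_words` and the threshold of 50 are all hypothetical choices for illustration.

```python
# Hand-written sample shaped like the reduced structure above (not real OCR output).
fast_sample = [
    {
        "lines": [
            {
                "words": [
                    {"text": "The", "bbox": (10.0, 10.0, 50.0, 30.0), "confidence": 95},
                    {"text": "qu1ck", "bbox": (60.0, 10.0, 130.0, 30.0), "confidence": 30},
                    {"text": "fox", "bbox": (140.0, 10.0, 180.0, 30.0), "confidence": 89},
                ]
            }
        ]
    }
]


def low_confidence_words(paragraphs, threshold=50):
    """Return the text of every word whose confidence is below the threshold."""
    return [
        word["text"]
        for paragraph in paragraphs
        for line in paragraph["lines"]
        for word in line["words"]
        if word["confidence"] < threshold
    ]


suspect = low_confidence_words(fast_sample)
print(suspect)
```

The same loop works on the output of `hocr_page_to_word_data`, since that structure contains these keys as well.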