Parsing hOCR

hocr.parse.hocr_page_get_dimensions(hocr_page)

Returns the dimensions (width, height) of an hOCR page as returned by hocr_page_iterator.

Args:

hocr_page: a page as returned by hocr_page_iterator

Returns:

(width, height): tuple of (int, int)
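
In hOCR, page geometry is normally carried in the `title` attribute of the `ocr_page` element as a `bbox x0 y0 x1 y1` property. The sketch below shows one way such dimensions could be derived from that property. It is not the library's implementation: `page_dimensions_from_title` is a hypothetical helper, and the sample element is hand-written.

```python
import xml.etree.ElementTree as ET


def page_dimensions_from_title(page):
    """Hypothetical helper: derive (width, height) from an ocr_page bbox."""
    # hOCR properties are separated by ';' inside the title attribute.
    for prop in page.get("title", "").split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            x0, y0, x1, y1 = (int(v) for v in parts[1:5])
            return (x1 - x0, y1 - y0)
    raise ValueError("ocr_page element has no bbox property")


# Hand-written sample element, not output from a real OCR run.
page = ET.fromstring(
    '<div class="ocr_page" title="bbox 0 0 2480 3508; scan_res 300 300"/>'
)
print(page_dimensions_from_title(page))
```

With the library itself, the same tuple would come from calling hocr_page_get_dimensions on a page yielded by hocr_page_iterator.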

hocr.parse.hocr_page_get_scan_res(hocr_page)

Returns the X and Y resolution (in DPI) of an hOCR page as returned by hocr_page_iterator.

Args:

hocr_page: a page as returned by hocr_page_iterator

Returns:

(x_res, y_res): tuple of (int, int)

Or (None, None) if the scan_res property is not present.
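
Because the resolution may come back as (None, None), callers need to guard before using it. As an illustration, the sketch below combines the documented return shapes of hocr_page_get_dimensions and hocr_page_get_scan_res to estimate the physical page size. `page_size_inches` is a hypothetical helper, and the input values are made up for the example.

```python
def page_size_inches(dimensions, scan_res):
    """Hypothetical helper: convert pixel dimensions to inches, or None."""
    width, height = dimensions
    x_res, y_res = scan_res
    # scan_res is documented as (None, None) when the property is absent.
    if not x_res or not y_res:
        return None
    return (width / x_res, height / y_res)


# Example values only: a 2400 x 3300 pixel page scanned at 300 DPI.
print(page_size_inches((2400, 3300), (300, 300)))
print(page_size_inches((2400, 3300), (None, None)))
```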

hocr.parse.hocr_page_iterator(fd_or_path)

Returns an iterator to iterate over a (potentially large) hOCR XML file in a streaming manner.

Args:

fd_or_path: open file to operate on, or a path (str).

Returns:

Iterator yielding one ElementTree.Element per hOCR page.
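
To illustrate what streaming means here, the following is a minimal sketch built on the standard library's `xml.etree.ElementTree.iterparse`. It is an assumption about how such an iterator could work, not the library's actual code: `iter_ocr_pages` is a hypothetical name, and the sample document is hand-written.

```python
import io
import xml.etree.ElementTree as ET


def iter_ocr_pages(fd_or_path):
    """Hypothetical sketch: yield ocr_page elements one at a time."""
    for _event, elem in ET.iterparse(fd_or_path, events=("end",)):
        # Tags may carry the XHTML namespace, e.g. '{http://...}div'.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "div" and elem.get("class") == "ocr_page":
            yield elem
            # Drop the page's children and attributes once the caller
            # is done with it, so finished pages do not accumulate.
            elem.clear()


# Hand-written two-page sample, not real OCR output.
SAMPLE = b"""<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="ocr_page" id="page_000001" title="bbox 0 0 100 200"/>
<div class="ocr_page" id="page_000002" title="bbox 0 0 100 200"/>
</body></html>"""

page_ids = [page.get("id") for page in iter_ocr_pages(io.BytesIO(SAMPLE))]
print(page_ids)
```

In this sketch each page must be consumed before the next one is requested, because the element is cleared as soon as the generator resumes.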

hocr.parse.hocr_page_to_photo_data(hocr_page, minimum_page_area_pct=10)

Parses a single hocr_page into photo data.

Args:

hocr_page: a single hocr_page as returned by hocr_page_iterator
(optional) minimum_page_area_pct: the minimum percentage of the page area that a picture must occupy to be reported

Returns:

A list of bounding boxes where photos were found
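
The area threshold can be pictured as a simple filter over candidate boxes. The sketch below assumes that boxes are (x0, y0, x1, y1) tuples in pixels and that the threshold is inclusive; both are assumptions, and `filter_by_page_area` is a hypothetical helper rather than part of the library.

```python
def filter_by_page_area(bboxes, page_dimensions, minimum_page_area_pct=10):
    """Hypothetical helper: keep boxes covering enough of the page."""
    page_width, page_height = page_dimensions
    page_area = page_width * page_height
    kept = []
    for x0, y0, x1, y1 in bboxes:
        # Share of the page covered by this box, as a percentage.
        area_pct = 100.0 * (x1 - x0) * (y1 - y0) / page_area
        if area_pct >= minimum_page_area_pct:
            kept.append((x0, y0, x1, y1))
    return kept


# Example values only: one box covers 25% of the page, the other 0.25%.
boxes = [(0, 0, 500, 500), (0, 0, 50, 50)]
print(filter_by_page_area(boxes, (1000, 1000)))
```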

hocr.parse.hocr_page_to_word_data(hocr_page, scaler=1)

Parses a single hocr_page into word data.

Args:

hocr_page: a single hocr_page as returned by hocr_page_iterator
(optional) scaler: a scalar to scale font sizes by

Returns:

A list of paragraphs, each paragraph containing a list of lines, and each line containing a list of words, plus properties.

Paragraphs have the following attributes:

‘lines’: the lines that form this paragraph

Lines have the following attributes:

‘words’: the words that form this line
‘bbox’: bounding box (tuple of 4 floats)
‘baseline’: baseline of the line (tuple of 2 floats)

Words have the following attributes:

‘text’: word text, str
‘bbox’: bounding box (tuple of 4 floats)
‘fontsize’: fontsize as a float, or 0.
‘writing_direction’: See WRITING_DIRECTION_* constants
‘confidence’: word confidence, 0 - 100
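
The nested structure described above can be walked with plain loops. The sketch below rebuilds plain text from a hand-written sample that mirrors the documented shape; the sample is not real output, it omits ‘writing_direction’ because the constant values are not listed here, and `paragraphs_to_text` is a hypothetical helper.

```python
# Hand-written sample mirroring the documented structure (not real output).
paragraphs = [
    {
        "lines": [
            {
                "bbox": (10.0, 10.0, 200.0, 40.0),
                "baseline": (0.0, -5.0),
                "words": [
                    {"text": "Hello", "bbox": (10.0, 10.0, 90.0, 40.0),
                     "fontsize": 12.0, "confidence": 96},
                    {"text": "world", "bbox": (100.0, 10.0, 200.0, 40.0),
                     "fontsize": 12.0, "confidence": 91},
                ],
            },
            {
                "bbox": (10.0, 50.0, 200.0, 80.0),
                "baseline": (0.0, -5.0),
                "words": [
                    {"text": "Second", "bbox": (10.0, 50.0, 110.0, 80.0),
                     "fontsize": 12.0, "confidence": 88},
                    {"text": "line", "bbox": (120.0, 50.0, 200.0, 80.0),
                     "fontsize": 12.0, "confidence": 93},
                ],
            },
        ]
    }
]


def paragraphs_to_text(paragraphs):
    """Hypothetical helper: join words into lines, lines into paragraphs."""
    lines_out = []
    for paragraph in paragraphs:
        for line in paragraph["lines"]:
            lines_out.append(" ".join(word["text"] for word in line["words"]))
        lines_out.append("")  # blank line between paragraphs
    return "\n".join(lines_out).rstrip()


print(paragraphs_to_text(paragraphs))
```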

hocr.parse.hocr_page_to_word_data_fast(hocr_page)

Parses a single hocr_page into word data.

Args:

hocr_page: a single hocr_page as returned by hocr_page_iterator

Returns:

A list of paragraphs, each paragraph containing a list of lines, and each line containing a list of words, plus properties.

Paragraphs have the following attributes:

‘lines’: the lines that form this paragraph

Lines have the following attributes:

‘words’: the words that form this line

Words have the following attributes:

‘text’: word text, str
‘bbox’: bounding box (tuple of 4 floats)
‘confidence’: word confidence, 0 - 100
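
Since this variant keeps only text, bounding box, and confidence per word, it lends itself to quick quality checks. The sketch below flags low-confidence words in a hand-written sample shaped like the structure above; `low_confidence_words` and the threshold of 60 are illustrative assumptions, not part of the library.

```python
# Hand-written sample mirroring the documented structure (not real output).
fast_paragraphs = [
    {
        "lines": [
            {
                "words": [
                    {"text": "clear", "bbox": (0.0, 0.0, 50.0, 20.0),
                     "confidence": 97},
                    {"text": "sm0dged", "bbox": (60.0, 0.0, 130.0, 20.0),
                     "confidence": 41},
                ]
            }
        ]
    }
]


def low_confidence_words(paragraphs, threshold=60):
    """Hypothetical helper: list (text, confidence) below a threshold."""
    return [
        (word["text"], word["confidence"])
        for paragraph in paragraphs
        for line in paragraph["lines"]
        for word in line["words"]
        if word["confidence"] < threshold
    ]


print(low_confidence_words(fast_paragraphs))
```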
