Reading text from hOCR

hocr.text.get_paragraph_hocr_words(paragraph)

Find all the words in a hOCR paragraph.

Args:

paragraph: hOCR paragraph as returned by hocr_paragraphs.

Returns a list of the hOCR words in the paragraph. For the result to be usable for matching purposes, run this only on merged hOCR paragraphs as returned by hocr_paragraphs.

hocr.text.hocr_get_plaintext_page_offsets(fd_or_path)

Builds a list of start and end bytes for each ocr_page in the plain text file. That is, assuming the plain text was generated from a hOCR XML file, it records which part of the plain text belongs to which ocr_page element, and where that text starts and ends. Together with hocr_get_xml_page_offsets, this can be used to construct a “lookup” table.

Args:

fd_or_path: hOCR file to operate on, or a path (str).

Returns a list of tuples (start_byte, end_byte), one for each ocr_page element in the hOCR file. The start and end bytes give the position, in the plain text, of the text extracted from that page of the XML file.

hocr.text.hocr_get_xml_page_offsets(fd_or_path)

Builds a list of start and end bytes for each ocr_page element in the XML file. This can be used to construct a “lookup” table, together with hocr_get_plaintext_page_offsets.

Args:

fd_or_path: hOCR file to operate on, or a path (str).

Returns a list of tuples (start_byte, end_byte), one for each ocr_page element in the hOCR file. The start and end bytes point to the position of the page element in the XML file.
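The “lookup” table mentioned above can be sketched as follows. This is a minimal illustration rather than part of the library: the helper names and the sample offsets are made up, and it assumes that both lists are in page order and that end_byte is exclusive. Check how the actual return values behave before relying on either assumption.

```python
from bisect import bisect_right


def page_for_plaintext_offset(plaintext_offsets, byte_pos):
    """Return the index of the page whose plain text range contains byte_pos.

    plaintext_offsets is a list of (start_byte, end_byte) tuples in page
    order. ASSUMPTION: end_byte is exclusive.
    """
    starts = [start for start, _ in plaintext_offsets]
    # Find the last page that starts at or before byte_pos.
    idx = bisect_right(starts, byte_pos) - 1
    if idx < 0:
        return None
    start, end = plaintext_offsets[idx]
    return idx if start <= byte_pos < end else None


def xml_range_for_plaintext_offset(plaintext_offsets, xml_offsets, byte_pos):
    """Map a plain text byte position to the (start, end) of its page in the XML."""
    idx = page_for_plaintext_offset(plaintext_offsets, byte_pos)
    return None if idx is None else xml_offsets[idx]


# Hypothetical offsets, standing in for the two lists returned by
# hocr_get_plaintext_page_offsets and hocr_get_xml_page_offsets.
plaintext_offsets = [(0, 120), (121, 300), (301, 450)]
xml_offsets = [(512, 9000), (9001, 21000), (21001, 30500)]

xml_range = xml_range_for_plaintext_offset(plaintext_offsets, xml_offsets, 150)
```

With the two lists paired by index, a match found at some byte position in the plain text can be traced back to the byte range of its page in the XML, so only that page needs to be read and parsed.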

hocr.text.hocr_page_text(page)

Extract text from a hOCR XML page element.

Args:

page: hOCR XML page element

Returns: page contents (str)
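To make the idea of extracting text from a page element concrete, here is a rough, standalone sketch using only the standard library. It is not the library's implementation: sketch_page_text is a hypothetical helper, the sample markup is invented, and the choice to join words with spaces and lines with newlines is an assumption that may differ from what hocr_page_text produces.

```python
import xml.etree.ElementTree as ET

# Invented sample; real hOCR files usually carry an XHTML namespace,
# which would require namespace-qualified tag names.
SAMPLE = """<div class="ocr_page" id="page_1">
  <p class="ocr_par">
    <span class="ocr_line">
      <span class="ocrx_word">Hello</span>
      <span class="ocrx_word">world</span>
    </span>
    <span class="ocr_line">
      <span class="ocrx_word">again</span>
    </span>
  </p>
</div>"""


def sketch_page_text(page):
    """Join ocrx_word texts with spaces and ocr_line elements with newlines."""
    lines = []
    for line in page.iter("span"):
        if line.get("class") != "ocr_line":
            continue
        words = [
            "".join(word.itertext()).strip()
            for word in line.iter("span")
            if word.get("class") == "ocrx_word"
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)


page = ET.fromstring(SAMPLE)
text = sketch_page_text(page)
```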

hocr.text.hocr_page_text_from_word_data(word_data)

Extract text from a pre-parsed hOCR page.

Args:

word_data: as returned by hocr_page_to_word_data or hocr_page_to_word_data_fast

Returns: page contents (str)

hocr.text.hocr_paragraph_text(paragraph)

Reconstruct text that matches the FTS text from a hOCR paragraph. Returns a tuple: the first item is the text, and the second is a boolean indicating whether this paragraph is to be merged into the next one. See hocr_paragraphs for more information.

Args:

paragraph: hOCR paragraph as returned by hocr_paragraphs

Returns:

Tuple of (str, bool), where the str is the paragraph text and the bool indicates whether this text continues, that is, whether it is to be merged with the next paragraph.
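As an illustration of how the boolean might be consumed, the sketch below folds a sequence of (text, merge_with_next) tuples into complete paragraphs. The helper merge_paragraph_texts and the sample data are hypothetical, and joining with a single space is an assumption; the library may join continued paragraphs differently, for example around hyphenation.

```python
def merge_paragraph_texts(results, joiner=" "):
    """Combine (text, merge_with_next) tuples into complete paragraphs.

    ASSUMPTION: continued paragraphs are joined with `joiner`.
    """
    merged = []
    pending = None  # text carried over from a paragraph flagged for merging
    for text, merge_with_next in results:
        current = text if pending is None else pending + joiner + text
        if merge_with_next:
            pending = current
        else:
            merged.append(current)
            pending = None
    if pending is not None:
        # The last paragraph asked to be merged but nothing followed it.
        merged.append(pending)
    return merged


# Hypothetical tuples, shaped like the return value of hocr_paragraph_text.
results = [
    ("The quick brown", True),
    ("fox jumps.", False),
    ("New paragraph.", False),
]
paragraphs = merge_paragraph_texts(results)
```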
