Reading text from hOCR

hocr.text.get_paragraph_hocr_words(paragraph)

Find all the words in a hOCR paragraph.

Args:

  • paragraph: hOCR paragraph as returned by hocr_paragraphs.

Returns a list of the hOCR words in a hOCR paragraph. For the result to be usable for matching purposes, run this only on merged hOCR paragraphs as returned by hocr_paragraphs.

hocr.text.hocr_get_plaintext_page_offsets(fd_or_path)

Builds a list of start and end bytes for each ocr_page in the plain text file. That is, assuming the plain text was generated from a hOCR XML file, the list records which part of the plain text belongs to which ocr_page element, and where that text starts and ends. Together with hocr_get_xml_page_offsets, this can be used to construct a “lookup” table.

Args:

  • fd_or_path: hOCR file to operate on, or a path (str).

Returns a list of tuples (start_byte, end_byte), one for each ocr_page element in the hOCR file. The start and end bytes give the position, in the plain text, of the text extracted from that page of the XML file.
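The sketch below illustrates how such (start_byte, end_byte) tuples might be consumed. It does not call the library: the plain text and the offsets are hypothetical sample data, and it assumes that end_byte is exclusive and that the plain text is UTF-8 encoded. The point it makes is that the offsets index bytes, not characters, so the text should be sliced before it is decoded.

```python
# Hypothetical sample data: plain text for two pages, encoded as UTF-8.
# "è" takes two bytes, so byte offsets and character offsets differ.
plaintext = "Première page\nSecond page\n".encode("utf-8")

# Offsets in the shape described above: one (start_byte, end_byte) per page.
# Assumption: end_byte is exclusive.
offsets = [(0, 15), (15, 27)]


def slice_pages(data, page_offsets):
    """Return the decoded text of every page, sliced by byte offsets."""
    return [data[start:end].decode("utf-8") for start, end in page_offsets]


page_texts = slice_pages(plaintext, offsets)
for number, text in enumerate(page_texts):
    print(number, repr(text))
```

Slicing the decoded str with these offsets instead would shift every page that follows a multi-byte character.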

hocr.text.hocr_get_xml_page_offsets(fd_or_path)

Builds a list of start and end bytes for each ocr_page element in the XML file. This can be used to construct a “lookup” table, together with hocr_get_plaintext_page_offsets.

Args:

  • fd_or_path: hOCR file to operate on, or a path (str).

Returns a list of tuples (start_byte, end_byte), one for each ocr_page element in the hOCR file. The start and end bytes point to the position of the page element in the XML file.
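One way the “lookup” table mentioned above could be used is sketched here. The two lists are hypothetical stand-ins for the return values of hocr_get_plaintext_page_offsets and hocr_get_xml_page_offsets, the helper name xml_range_for_text_offset is invented for illustration, and end offsets are assumed to be exclusive. Given a byte offset in the plain text, the helper finds the page that contains it and returns the byte range of that page in the XML file.

```python
from bisect import bisect_right

# Hypothetical sample data, one tuple per ocr_page, in page order.
plaintext_offsets = [(0, 15), (15, 27), (27, 40)]
xml_offsets = [(120, 900), (900, 1800), (1800, 2500)]


def xml_range_for_text_offset(text_offset, text_ranges, xml_ranges):
    """Map a plain text byte offset to (page_index, xml_byte_range).

    Returns None when the offset does not fall inside any page.
    Assumes text_ranges is sorted and that end offsets are exclusive.
    """
    starts = [start for start, _ in text_ranges]
    # Index of the last page whose start is <= text_offset.
    index = bisect_right(starts, text_offset) - 1
    if index < 0:
        return None
    start, end = text_ranges[index]
    if not (start <= text_offset < end):
        return None
    return index, xml_ranges[index]


print(xml_range_for_text_offset(16, plaintext_offsets, xml_offsets))
```

With the XML byte range in hand, only that slice of the hOCR file needs to be read and parsed, rather than the whole document.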

hocr.text.hocr_page_text(page)

Extract text from a hOCR XML page element.

Args:

  • page: hOCR XML page element

Returns: page contents (str)
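To give a rough idea of what extracting text from a page element involves, here is a minimal sketch built on the standard library's xml.etree.ElementTree. It is not the library's implementation: the function page_text_sketch and the sample markup are made up for illustration, and the choice to join words with spaces and lines with newlines is an assumption, so the output of hocr_page_text may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified hOCR page (no namespace, no bbox titles).
SAMPLE = """<div class="ocr_page" id="page_1">
  <p class="ocr_par">
    <span class="ocr_line">
      <span class="ocrx_word">Hello</span>
      <span class="ocrx_word">hOCR</span>
    </span>
    <span class="ocr_line">
      <span class="ocrx_word">world</span>
    </span>
  </p>
</div>"""


def page_text_sketch(page):
    """Join the words of each ocr_line with spaces, and lines with newlines."""
    lines = []
    for element in page.iter():
        if element.get("class") != "ocr_line":
            continue
        words = []
        for child in element.iter():
            if child.get("class") == "ocrx_word":
                # itertext() also picks up text nested in e.g. <strong>.
                word = "".join(child.itertext()).strip()
                if word:
                    words.append(word)
        lines.append(" ".join(words))
    return "\n".join(lines)


page = ET.fromstring(SAMPLE)
print(page_text_sketch(page))
```

Matching on the class attribute rather than on tag names keeps the sketch working when the elements carry an XHTML namespace.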

hocr.text.hocr_page_text_from_word_data(word_data)

Extract text from a pre-parsed hOCR page.

Args:

  • word_data: as returned by hocr_page_to_word_data or hocr_page_to_word_data_fast

Returns: page contents (str)

hocr.text.hocr_paragraph_text(paragraph)

Reconstruct text that matches the FTS text from a hOCR paragraph. Returns a tuple: the first item is the text, and the second is a boolean indicating whether this paragraph is to be merged into the next one. See hocr_paragraphs for more information.

Args:

  • paragraph: hOCR paragraph as returned by hocr_paragraphs

Returns:

  • Tuple of (str, bool), where the str is the paragraph text and the bool indicates whether this text continues and is to be merged with the next paragraph.
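The (str, bool) return shape lends itself to a simple accumulation loop. The sketch below shows one way a caller might combine consecutive results; it works on hypothetical sample tuples rather than real library output, the helper merge_paragraph_texts is invented for illustration, and joining the pieces with a single space is an assumption (hocr_paragraphs may already perform this merge differently).

```python
def merge_paragraph_texts(results, joiner=" "):
    """Combine (text, continues) tuples into complete paragraph strings.

    A True flag means the text is to be merged with the next paragraph,
    so pieces are collected until a tuple with a False flag closes them.
    """
    merged = []
    parts = []
    for text, continues in results:
        parts.append(text)
        if not continues:
            merged.append(joiner.join(parts))
            parts = []
    # Keep a trailing piece whose continuation never arrived.
    if parts:
        merged.append(joiner.join(parts))
    return merged


# Hypothetical sample data in the shape described above.
sample = [
    ("The quick brown", True),
    ("fox jumps.", False),
    ("Next paragraph.", False),
]
merged_texts = merge_paragraph_texts(sample)
print(merged_texts)
```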