Reading text from hOCR
- hocr.text.get_paragraph_hocr_words(paragraph)
Find all the words in a hOCR paragraph.
Args:
paragraph: hOCR paragraph as returned by hocr_paragraphs.
Returns: a list of hOCR words in the paragraph. For the result to be usable for matching purposes, only run this on merged hOCR paragraphs as returned by hocr_paragraphs.
- hocr.text.hocr_get_plaintext_page_offsets(fd_or_path)
Builds a list of start and end bytes for each ocr_page in the plain text file. That is, if the plain text were generated from a hOCR XML file, it records which span of plain text belongs to which ocr_page element, and where that span starts and ends. Together with hocr_get_xml_page_offsets, this can be used to construct a lookup table.
Args:
fd_or_path: hOCR file to operate on, or a path (str).
Returns: a list of tuples (start_byte, end_byte), one for each ocr_page element in a hOCR file. The start and end bytes give the position, within the plain text, of the text extracted from that page of the XML file.
- hocr.text.hocr_get_xml_page_offsets(fd_or_path)
Builds a list of start and end bytes for each ocr_page element in the XML file. Together with hocr_get_plaintext_page_offsets, this can be used to construct a lookup table.
Args:
fd_or_path: hOCR file to operate on, or a path (str).
Returns: a list of tuples (start_byte, end_byte), one for each ocr_page element in a hOCR file. The start and end bytes point to the position of the page element in the XML file.
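Since both offset functions return one tuple per ocr_page, the two lists presumably line up index by index, and a lookup can be sketched with the standard library alone. The offset values below are made up for illustration, and page_for_plaintext_offset is a hypothetical helper, not part of hocr.text. The sketch also assumes that end offsets are exclusive, which the descriptions above do not state.

```python
from bisect import bisect_right

# Hypothetical values standing in for the results of
# hocr_get_plaintext_page_offsets() and hocr_get_xml_page_offsets().
plaintext_offsets = [(0, 120), (120, 480), (480, 510)]
xml_offsets = [(310, 9000), (9000, 41000), (41000, 43500)]


def page_for_plaintext_offset(byte_pos, text_offsets, xml_page_offsets):
    """Map a byte position in the plain text to (page index, XML byte range).

    Assumes both lists hold one entry per page in the same order and that
    end offsets are exclusive. Returns None if no page covers byte_pos.
    """
    starts = [start for start, _ in text_offsets]
    # Index of the last page whose start is <= byte_pos.
    idx = bisect_right(starts, byte_pos) - 1
    if idx < 0:
        return None
    start, end = text_offsets[idx]
    if not (start <= byte_pos < end):
        return None
    return idx, xml_page_offsets[idx]


result = page_for_plaintext_offset(200, plaintext_offsets, xml_offsets)
```

With a result like this, a match found at some byte position in the plain text can be traced back to the byte range of its page element, so only that slice of the XML needs to be read and parsed.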
- hocr.text.hocr_page_text(page)
Extract text from a hOCR XML page element.
Args:
page: hOCR XML page element
Returns: page contents (str)
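To make the idea of extracting text from a page element concrete, here is a minimal stand-in written with xml.etree.ElementTree only. It is not the library's implementation: sketch_page_text and the sample markup are hypothetical, the sample omits the XHTML namespace that real hOCR files usually carry, and the whitespace and line handling of hocr_page_text itself may differ.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written hOCR-like fragment (no XHTML namespace, for brevity).
SAMPLE_PAGE = """
<div class="ocr_page" id="page_1">
  <p class="ocr_par">
    <span class="ocr_line">
      <span class="ocrx_word">Hello</span>
      <span class="ocrx_word">hOCR</span>
    </span>
    <span class="ocr_line">
      <span class="ocrx_word">world</span>
    </span>
  </p>
</div>
"""


def sketch_page_text(page):
    """Join ocrx_word texts with spaces, one output line per ocr_line."""
    lines = []
    for line in page.iter("span"):
        if line.get("class") != "ocr_line":
            continue
        # Collect the non-empty word texts nested inside this line.
        words = [
            (w.text or "").strip()
            for w in line.iter("span")
            if w.get("class") == "ocrx_word"
        ]
        lines.append(" ".join(word for word in words if word))
    return "\n".join(lines)


page = ET.fromstring(SAMPLE_PAGE)
text = sketch_page_text(page)
```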
- hocr.text.hocr_page_text_from_word_data(word_data)
Extract text from a pre-parsed hOCR page.
Args:
word_data: as returned by hocr_page_to_word_data or hocr_page_to_word_data_fast
Returns: page contents (str)
- hocr.text.hocr_paragraph_text(paragraph)
Reconstruct text that matches the FTS text from a hOCR paragraph. Returns a tuple: the first item is the text, and the second is a boolean indicating whether this paragraph is to be merged into the next one. See hocr_paragraphs for more information.
Args:
paragraph: hOCR paragraph as returned by hocr_paragraphs
Returns:
Tuple of (str, bool), where the str is the paragraph text and the bool indicates whether this text is to be merged with the next paragraph.
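One way a caller might consume the (str, bool) tuples is sketched below. merge_paragraph_texts is a hypothetical helper, and the sample tuples are hand-written rather than produced by hocr_paragraph_text. Joining the pieces by plain concatenation is an assumption; how the library itself joins merged paragraphs is not described here.

```python
def merge_paragraph_texts(results):
    """Combine (text, merge_with_next) tuples into complete paragraphs.

    Assumption: pieces flagged for merging are joined by plain
    concatenation, with no separator added.
    """
    merged = []
    pending = ""
    for text, merge_with_next in results:
        pending += text
        if not merge_with_next:
            merged.append(pending)
            pending = ""
    # Keep a trailing piece that was flagged to merge but had no successor.
    if pending:
        merged.append(pending)
    return merged


# Hand-written stand-ins for values returned by hocr_paragraph_text().
sample = [
    ("The quick brown ", True),
    ("fox jumps.", False),
    ("Next paragraph.", False),
]
paragraphs = merge_paragraph_texts(sample)
```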