Searching

hocr.searching.hocr_get_fts_text(fd_or_path)

Return text that can be ingested into a full-text search engine such as Solr or Elasticsearch.

Args:

  • fd_or_path: File descriptor or path to hOCR file

Returns:

Yields one tuple of (str, list of int) per page: the page text and a list of the word confidences on that page.
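As a minimal sketch of how the yielded tuples might be consumed, the snippet below turns each (page text, confidences) pair into a dict that could be sent to a search engine. The generator sample_pages is a hypothetical stand-in for hocr_get_fts_text so that the example runs without an hOCR file. The field names page, text, and mean_confidence are likewise assumptions, not a schema required by any search engine.

```python
from statistics import mean


def sample_pages():
    """Hypothetical stand-in for hocr_get_fts_text: yields (text, confidences)."""
    yield "Hello world", [96, 88]
    yield "", []  # stand-in for a page with no recognised words
    yield "Second page of text", [90, 70, 80, 100]


def to_index_docs(pages):
    """Build one search-engine document per page from (text, confidences) tuples."""
    docs = []
    for page_number, (text, confidences) in enumerate(pages):
        docs.append({
            "page": page_number,
            "text": text,
            # Guard against empty pages, where mean() would raise.
            "mean_confidence": mean(confidences) if confidences else None,
        })
    return docs


docs = to_index_docs(sample_pages())
```

Because the pages arrive from a generator, a real caller could also stream them to the search engine in batches instead of collecting a list first.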

hocr.searching.hocr_get_page_lookup_table(fd_or_path)

Create a lookup table for a given hOCR document. The table makes it possible to jump directly to the XML of a specific page.

Args:

  • fd_or_path: File descriptor or path to the hOCR file

Returns:

Lookup table (a list of lists), with one entry per page of the form:

  • [text_start_byte, text_end_byte, xml_start_byte, xml_end_byte]
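To make the entry layout concrete, here is a small sketch that gives the four fields names and reports how many bytes each page occupies. The PageEntry type and the sample numbers are hypothetical, and treating each range as half-open (end exclusive) is an assumption rather than documented behaviour.

```python
from typing import NamedTuple


class PageEntry(NamedTuple):
    """Hypothetical named view of one lookup table entry."""
    text_start_byte: int
    text_end_byte: int
    xml_start_byte: int
    xml_end_byte: int


# Made-up table for a three-page document, in the documented field order.
lookup_table = [
    [0, 120, 512, 4096],
    [120, 300, 4096, 9000],
    [300, 310, 9000, 9800],
]

entries = [PageEntry(*row) for row in lookup_table]

# Assuming half-open ranges, sizes are simply end minus start.
text_sizes = [e.text_end_byte - e.text_start_byte for e in entries]
xml_sizes = [e.xml_end_byte - e.xml_start_byte for e in entries]
```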

hocr.searching.hocr_load_lookup_table(path)

Load a lookup table from JSON.

Args:

  • path: File to load from

Returns:

  • Lookup table

hocr.searching.hocr_lookup_by_plaintext_offset(page_lookup_data, pos_bytes_plain)

Get the lookup index and lookup data for the page that corresponds to the plaintext offset given in pos_bytes_plain.

Args:

  • page_lookup_data: Lookup table as returned by hocr_load_lookup_table or hocr_get_page_lookup_table.

  • pos_bytes_plain: Byte offset into the plaintext of the hOCR file.
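One way such a lookup can work is a binary search over the text start offsets. The sketch below is an illustration, not the library's implementation: find_page is a hypothetical name, and it assumes the table is sorted by text_start_byte and that text_end_byte is exclusive.

```python
from bisect import bisect_right


def find_page(page_lookup_data, pos_bytes_plain):
    """Return (index, entry) for the page containing the offset, or None."""
    starts = [entry[0] for entry in page_lookup_data]
    # Index of the last page whose text starts at or before the offset.
    index = bisect_right(starts, pos_bytes_plain) - 1
    if index < 0:
        return None
    entry = page_lookup_data[index]
    # Reject offsets that fall past the end of that page's text.
    if pos_bytes_plain >= entry[1]:
        return None
    return index, entry


# Made-up lookup table for a three-page document.
table = [
    [0, 120, 512, 4096],
    [120, 300, 4096, 9000],
    [300, 310, 9000, 9800],
]
```

Building the list of start offsets on every call costs linear time; a caller performing many lookups would compute it once and reuse it.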

hocr.searching.hocr_lookup_page_by_dat(fp, dat)

Get the XML for a specific hOCR page that corresponds to the lookup data dat.

Args:

  • fp: file pointer to hOCR file

  • dat: lookup table entry for the page
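Reading a page by its lookup entry amounts to a seek followed by a bounded read. The following is a minimal sketch, assuming fp is opened in binary mode and that xml_end_byte is exclusive; read_page_xml is a hypothetical helper and the in-memory document is made up for illustration.

```python
import io


def read_page_xml(fp, dat):
    """Read the raw XML bytes of one page given its lookup entry."""
    _, _, xml_start_byte, xml_end_byte = dat
    fp.seek(xml_start_byte)
    return fp.read(xml_end_byte - xml_start_byte)


# Tiny stand-in for an hOCR file with two page elements.
header = b"<html><body>"
page_one = b"<div class='ocr_page' id='page_0'>first</div>"
page_two = b"<div class='ocr_page' id='page_1'>second</div>"
document = header + page_one + page_two + b"</body></html>"

# Entry for the second page; the text offsets are placeholders here.
start = len(header) + len(page_one)
dat = [0, 0, start, start + len(page_two)]

xml = read_page_xml(io.BytesIO(document), dat)
```

Only the requested byte range is read, so the cost is independent of how large the rest of the file is.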

hocr.searching.hocr_lookup_page_by_plaintext_offset(fp, page_lookup_data, pos_bytes_plain)

Get the XML for the hOCR page that corresponds to the plaintext offset given in pos_bytes_plain.

Args:

  • fp: file pointer to hOCR file

  • page_lookup_data: Lookup table as returned by hocr_load_lookup_table or hocr_get_page_lookup_table.

  • pos_bytes_plain: Byte offset into the plaintext of the hOCR file.

hocr.searching.hocr_save_lookup_table(lookup_table, fd_or_path)

Save a lookup table to JSON.

Args:

  • lookup_table: Lookup table as returned by hocr_get_page_lookup_table

  • fd_or_path: File to save to
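Since the table is a plain list of integer lists, it round-trips through JSON without any custom encoding. The sketch below uses only the standard json module with an in-memory buffer. It is not the library's implementation, and the actual on-disk format (for example, whether the file is compressed) is not specified here.

```python
import io
import json

# Made-up lookup table for a two-page document.
lookup_table = [
    [0, 120, 512, 4096],
    [120, 300, 4096, 9000],
]

# Save: serialise the nested lists to a text stream.
buffer = io.StringIO()
json.dump(lookup_table, buffer)

# Load: rewind and parse; JSON arrays come back as Python lists.
buffer.seek(0)
restored = json.load(buffer)
```

Saving the table once and reloading it later avoids re-scanning the hOCR file each time a page needs to be located.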