Searching

hocr.searching.hocr_get_fts_text(fd_or_path)

Return text suitable for ingestion into a full-text search engine such as Solr or Elasticsearch.

Args:

Returns:

Yields one tuple of (str, list of int) per page: the page text and the list of word confidences for that page.
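As a rough illustration of how the yielded tuples might be consumed, the sketch below turns (page text, confidences) pairs into one document per page for an indexer. It deliberately does not call the library: `sample_pages` is a hypothetical stand-in that yields tuples in the documented shape, and the field names (`page`, `text`, `mean_confidence`) are assumptions chosen for this example.

```python
from statistics import mean


def sample_pages():
    """Hypothetical stand-in yielding (page text, word confidences) tuples."""
    yield "Hello world", [96, 88]
    yield "", []  # a blank page has no words and therefore no confidences
    yield "Second page of text", [75, 80, 91, 60]


def build_documents(pages):
    """Turn (text, confidences) tuples into one dict per page for indexing."""
    documents = []
    for page_number, (text, confidences) in enumerate(pages):
        documents.append({
            "page": page_number,
            "text": text,
            # Guard against pages without words: mean() rejects an empty list.
            "mean_confidence": mean(confidences) if confidences else None,
        })
    return documents


documents = build_documents(sample_pages())
```

With the library installed, the iterable returned by `hocr_get_fts_text(fd_or_path)` could presumably be passed to `build_documents` in place of `sample_pages()`.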

hocr.searching.hocr_get_page_lookup_table(fd_or_path)

Create a lookup table for a given hOCR document. The table makes it possible to jump quickly to a specific page in the XML.

Args:

Returns:

Lookup table (a list of lists), with each list entry:

hocr.searching.hocr_load_lookup_table(path)

Load a lookup table from JSON.

Args:

Returns:

hocr.searching.hocr_lookup_by_plaintext_offset(page_lookup_data, pos_bytes_plain)

Get the lookup index and data for the page that corresponds to the plaintext offset given by pos_bytes_plain.

Args:

page_lookup_data: Lookup table as returned by hocr_load_lookup_table or hocr_get_page_lookup_table.
pos_bytes_plain: Byte offset into the plaintext of the hOCR file.
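The idea of mapping a plaintext offset to a page can be sketched with a binary search over the page start offsets. This is not the library's implementation: the entry layout `[text_start, text_end, xml_start, xml_end]` is an assumption made for the example (the entry format is not spelled out in this section), and `find_page_for_offset` is a hypothetical helper.

```python
from bisect import bisect_right

# Hypothetical table: one [text_start, text_end, xml_start, xml_end] entry per page.
# This layout is an assumption; the real entry format is not shown in this section.
page_lookup_data = [
    [0, 120, 500, 4000],
    [120, 300, 4000, 9500],
    [300, 310, 9500, 9900],
]


def find_page_for_offset(table, pos_bytes_plain):
    """Return (index, entry) for the page whose plaintext range covers the offset."""
    starts = [entry[0] for entry in table]
    # The rightmost page whose start is <= the offset is the only candidate.
    index = bisect_right(starts, pos_bytes_plain) - 1
    if index < 0 or pos_bytes_plain >= table[index][1]:
        raise ValueError(f"offset {pos_bytes_plain} is outside the document")
    return index, table[index]


index, entry = find_page_for_offset(page_lookup_data, 150)
```

The sketch treats each plaintext range as half-open (start inclusive, end exclusive), so an offset equal to a page boundary resolves to the later page.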

hocr.searching.hocr_lookup_page_by_dat(fp, dat)

Get the XML for the hOCR page that corresponds to the lookup data dat.

Args:

hocr.searching.hocr_lookup_page_by_plaintext_offset(fp, page_lookup_data, pos_bytes_plain)

Get the XML for the hOCR page that corresponds to the plaintext offset given by pos_bytes_plain.

Args:

fp: File pointer to the hOCR file.
page_lookup_data: Lookup table as returned by hocr_load_lookup_table or hocr_get_page_lookup_table.
pos_bytes_plain: Offset in plaintext of the hOCR file.
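Once a page's byte range in the XML is known, fetching it is a seek followed by a bounded read, which is what makes a lookup table cheaper than re-parsing the whole file. The sketch below shows that step in isolation, under the assumption that a lookup entry carries the XML start and end offsets. `read_page_xml` is a hypothetical helper, the sample bytes are made up, and the return type of the real function is not stated here.

```python
import io

# Made-up hOCR-like bytes for illustration; real files are far larger.
hocr_bytes = (
    b"<html><body>"
    b"<div class='ocr_page' id='page_0'>first</div>"
    b"<div class='ocr_page' id='page_1'>second</div>"
    b"</body></html>"
)


def read_page_xml(fp, xml_start, xml_end):
    """Read only the bytes of one page from a binary file object."""
    fp.seek(xml_start)
    return fp.read(xml_end - xml_start)


# Derive the byte range of the second page (a lookup table would store this).
marker = b"<div class='ocr_page' id='page_1'>"
xml_start = hocr_bytes.index(marker)
xml_end = hocr_bytes.index(b"</div>", xml_start) + len(b"</div>")

fp = io.BytesIO(hocr_bytes)
page_xml = read_page_xml(fp, xml_start, xml_end)
```

Opening the file in binary mode matters for this approach, since the offsets are byte positions and not character positions.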

hocr.searching.hocr_save_lookup_table(lookup_table, fd_or_path)

Save a lookup table to JSON.

Args:
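Because the lookup table is a list of lists, it maps directly onto JSON arrays. The following is a minimal sketch of a save and load round trip using only the standard `json` module. It does not call `hocr_save_lookup_table` or `hocr_load_lookup_table`, the table values are hypothetical, and whether the library writes plain or compressed JSON is not stated in this section.

```python
import json
import os
import tempfile

# Hypothetical entries; the real entry layout is not shown in this section.
lookup_table = [[0, 120, 500, 4000], [120, 300, 4000, 9500]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lookup_table.json")

    # Save: nested lists of ints serialize to nested JSON arrays.
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(lookup_table, fh)

    # Load: the arrays come back as plain Python lists.
    with open(path, "r", encoding="utf-8") as fh:
        restored = json.load(fh)
```

Storing the table next to the hOCR file avoids rebuilding it on every lookup, at the cost of having to regenerate it whenever the hOCR file changes.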