Searching
- hocr.searching.hocr_get_fts_text(fd_or_path)
Return text that can be ingested into a full-text search engine such as Solr or Elasticsearch.
Args:
fd_or_path: File descriptor or path to hOCR file
Returns:
Repeatedly yields a tuple of (str, list of int): the page text and a list of word confidences on the page.
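As a rough illustration of how the yielded (page text, word confidences) tuples might be consumed, the sketch below turns them into one record per page for a search index. It does not call the library: `fake_fts_pages` is a hypothetical stand-in for the generator, and `build_index_docs` and the record field names are assumptions made for this example.

```python
from statistics import mean


def fake_fts_pages():
    """Hypothetical stand-in yielding (page_text, word_confidences) tuples."""
    yield "Hello world", [96, 88]
    yield "", []  # a page without recognized words
    yield "Second page text", [90, 75, 60]


def build_index_docs(pages):
    """Turn (text, confidences) tuples into one dict per page."""
    docs = []
    for page_number, (text, confidences) in enumerate(pages):
        docs.append({
            "page": page_number,
            "text": text,
            # mean() raises on empty input, so guard pages without words.
            "mean_confidence": mean(confidences) if confidences else None,
        })
    return docs


docs = build_index_docs(fake_fts_pages())
```

Keeping a per-page confidence summary next to the text is one way to let a search front end down-rank poorly recognized pages; whether that is useful depends on the index schema.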
- hocr.searching.hocr_get_page_lookup_table(fd_or_path)
Create lookup table for a given hOCR document. This allows for quickly jumping to specific XML pages.
Args:
fd_or_path: file descriptor or filepath to the hOCR file
Returns:
Lookup table (a list of lists), where each entry has the form:
[text_start_byte, text_end_byte, xml_start_byte, xml_end_byte]
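To make the entry layout concrete, here is a minimal sketch that builds a table of the same shape from in-memory pages. It is not the library's implementation: `build_lookup_table` and `xml_header_len` are hypothetical, and it assumes that pages are contiguous and that each range is half-open (start inclusive, end exclusive), which the text above does not specify.

```python
def build_lookup_table(pages, xml_header_len=0):
    """Build [text_start, text_end, xml_start, xml_end] entries.

    pages: list of (plaintext_str, xml_bytes) pairs, one per page.
    xml_header_len: bytes of XML preceding the first page (assumed known).
    """
    table = []
    text_pos = 0
    xml_pos = xml_header_len
    for text, xml in pages:
        # Offsets are byte offsets, so measure the encoded text length.
        text_len = len(text.encode("utf-8"))
        table.append([text_pos, text_pos + text_len, xml_pos, xml_pos + len(xml)])
        text_pos += text_len
        xml_pos += len(xml)
    return table


pages = [
    ("caf\u00e9 ", b"<div class='ocr_page'>cafe</div>"),
    ("menu", b"<div class='ocr_page'>menu</div>"),
]
table = build_lookup_table(pages, xml_header_len=10)
```

Note that the first page's text is 5 characters but 6 bytes in UTF-8, which is why byte offsets and character offsets must not be mixed.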
- hocr.searching.hocr_load_lookup_table(path)
Load lookup table from JSON.
Args:
path: Path of the file to load from
Returns:
Lookup table
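The exact on-disk layout is not described here. Assuming the table is stored as a plain JSON array of four-integer lists, a round trip with the standard `json` module would look roughly like this (the file name and sample values are made up for the example):

```python
import json
import os
import tempfile

# Sample table in the documented entry shape:
# [text_start_byte, text_end_byte, xml_start_byte, xml_end_byte]
table = [[0, 120, 512, 2048], [120, 300, 2048, 4096]]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "lookup.json")

    # Write the table once, for example after scanning the hOCR file.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(table, f)

    # Read it back later instead of rescanning the document.
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)
```

Because JSON has no tuple type, the entries come back as lists, which matches the "list of lists" shape described for the lookup table.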
- hocr.searching.hocr_lookup_by_plaintext_offset(page_lookup_data, pos_bytes_plain)
Get the lookup index and data for a page that corresponds to the plaintext offset as specified in pos_bytes_plain.
Args:
page_lookup_data: Lookup table as returned by hocr_load_lookup_table or hocr_get_page_lookup_table.
pos_bytes_plain: Offset in plaintext of the hOCR file.
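One plausible way to resolve a plaintext offset to a page is a binary search over the `text_start_byte` column. The sketch below is an illustration under stated assumptions, not the library's code: `find_page_for_offset` is a hypothetical name, the `(index, entry)` return shape is a guess based on the description above, and text ranges are assumed to be sorted and half-open.

```python
from bisect import bisect_right


def find_page_for_offset(page_lookup_data, pos_bytes_plain):
    """Return (index, entry) for the page containing the offset, or None."""
    starts = [entry[0] for entry in page_lookup_data]
    # Rightmost entry whose text_start_byte is <= the requested offset.
    index = bisect_right(starts, pos_bytes_plain) - 1
    if index < 0:
        return None  # offset lies before the first page
    entry = page_lookup_data[index]
    if pos_bytes_plain >= entry[1]:
        return None  # offset lies past this page's text_end_byte
    return index, entry


table = [[0, 120, 512, 2048], [120, 300, 2048, 4096]]
hit = find_page_for_offset(table, 150)
miss = find_page_for_offset(table, 300)
```

Rebuilding `starts` on every call keeps the sketch short; for many lookups against the same table it would be cheaper to compute that list once.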
- hocr.searching.hocr_lookup_page_by_dat(fp, dat)
Get the XML for a specific hOCR page that corresponds to the lookup data dat.
Args:
fp: file pointer to hOCR file
dat: lookup table entry for the page
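The idea behind fetching a page from its lookup entry can be sketched as a seek followed by a bounded read. This is an assumption-laden illustration rather than the library's implementation: `read_page_xml` is a hypothetical helper, the file is assumed to be opened in binary mode, the XML range is assumed to be half-open, and the real function may return a parsed element instead of raw bytes.

```python
import io


def read_page_xml(fp, dat):
    """Read the raw bytes between xml_start_byte and xml_end_byte."""
    _text_start, _text_end, xml_start, xml_end = dat
    fp.seek(xml_start)
    return fp.read(xml_end - xml_start)


# In-memory stand-in for an hOCR file opened with mode "rb".
header = b"<html><body>"
page0 = b"<div class='ocr_page' id='page_0'>first</div>"
page1 = b"<div class='ocr_page' id='page_1'>second</div>"
fp = io.BytesIO(header + page0 + page1)

# Entry for the second page; the text offsets are placeholders here.
xml_start = len(header) + len(page0)
dat = [5, 11, xml_start, xml_start + len(page1)]
page_xml = read_page_xml(fp, dat)
```

Only the requested page is read, so the cost is independent of the document's total size, which is the point of keeping byte offsets in the table.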
- hocr.searching.hocr_lookup_page_by_plaintext_offset(fp, page_lookup_data, pos_bytes_plain)
Get the XML for a specific hOCR page that corresponds to the plaintext offset as specified in pos_bytes_plain.
Args:
fp: file pointer to hOCR file
page_lookup_data: Lookup table as returned by hocr_load_lookup_table or hocr_get_page_lookup_table.
pos_bytes_plain: Offset in plaintext of the hOCR file.