Welcome to archive-hocr-tools’s documentation!¶
Tools Usage¶
hocr-combine-stream¶
Combines several hOCR files into a single hOCR file. It works in streaming mode, so its memory footprint stays low regardless of input file size.
Usage:
hocr-combine-stream -g 'hocr-page-*.html' > hocr-combined.html
hocr-fold-chars¶
Convert a character-based hOCR file (with ocrx_cinfo entries) into a word-based hOCR file with ocrx_word entries. This can be useful if you don’t care about per-character information and wish to decrease the file size to increase throughput (network or computational).
Usage:
hocr-fold-chars -f hocr-with-characters.html > hocr-with-words.html
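To picture what folding means, each word can be seen as the concatenation of its characters, with a bounding box that is the union of the character boxes. The following is only a minimal sketch of that idea, not the tool's implementation: `fold_chars` is a hypothetical helper, and boxes are assumed to be `(x0, y0, x1, y1)` tuples as in an hOCR `bbox` property.

```python
from typing import List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1), as in an hOCR bbox


def fold_chars(chars: List[Tuple[str, BBox]]) -> Tuple[str, BBox]:
    """Merge per-character (text, bbox) pairs into one word-level pair.

    Hypothetical helper for illustration only.
    """
    if not chars:
        raise ValueError("a word needs at least one character")
    text = "".join(c for c, _ in chars)
    # The word box is the smallest box that contains every character box.
    x0 = min(b[0] for _, b in chars)
    y0 = min(b[1] for _, b in chars)
    x1 = max(b[2] for _, b in chars)
    y1 = max(b[3] for _, b in chars)
    return text, (x0, y0, x1, y1)


word = fold_chars([("h", (10, 20, 18, 40)), ("i", (19, 22, 23, 40))])
print(word)
```

Dropping the per-character entries and keeping only the merged pair is what makes the word-based file smaller.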
hocr-text¶
Creates a text-only version of a hOCR file. Words below a certain (currently hardcoded) confidence level are discarded. The output can be used to turn hOCR files into large text files for ingestion into full-text search engines, or just to quickly read or search for text.
Usage:
hocr-text -f hocr-file.html > hocr-plain.txt
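The confidence filter described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: `MIN_CONFIDENCE` is a made-up value (the documentation only says the real cutoff is hardcoded), and `words_to_text` and the `confidence` key are hypothetical names.

```python
from typing import Dict, Iterable

# Hypothetical cutoff; the real tool's hardcoded value is not documented here.
MIN_CONFIDENCE = 50


def words_to_text(words: Iterable[Dict], min_conf: int = MIN_CONFIDENCE) -> str:
    """Join the text of all words whose confidence is at least min_conf."""
    kept = [w["text"] for w in words if w["confidence"] >= min_conf]
    return " ".join(kept)


sample = [
    {"text": "clean", "confidence": 96},
    {"text": "n0ise", "confidence": 12},
    {"text": "words", "confidence": 88},
]
print(words_to_text(sample))
```

Because discarded words never reach the output, any offsets computed on the plain text only line up with tools that apply the same filter.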
hocr-lookup-create¶
Creates a “lookup table” that maps the start and end of each page, in both the plaintext and the XML. It can be used to quickly parse only a subset of a big hOCR file. Text offsets match the output of hocr-text, so words below the same hardcoded confidence level are discarded here as well.
Can be used in combination with fts-text-match to quickly highlight matches from a full-text search (FTS) engine.
Usage:
hocr-lookup-create -f hocr-file.html > hocr-file-lookup.json
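The value of such a table is that a plaintext offset can be resolved to a page with a binary search, without parsing the whole document. Below is a minimal sketch of that lookup. The entry layout `(text_start, text_end, xml_start, xml_end)` with an exclusive `text_end` is an assumption made for illustration, and `find_page` is a hypothetical helper, not a function from the library.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Assumed layout: (text_start, text_end, xml_start, xml_end), ends exclusive.
PageEntry = Tuple[int, int, int, int]


def find_page(table: List[PageEntry], text_offset: int) -> Optional[int]:
    """Return the index of the page containing text_offset, or None."""
    starts = [entry[0] for entry in table]
    # Rightmost page whose start is <= text_offset.
    idx = bisect_right(starts, text_offset) - 1
    if idx < 0:
        return None
    if text_offset >= table[idx][1]:
        return None  # offset falls past the end of that page
    return idx


table = [(0, 100, 500, 2000), (100, 250, 2000, 4100), (250, 300, 4100, 5000)]
print(find_page(table, 120))
```

Once the page index is known, the XML offsets of that entry say which byte range of the hOCR file needs to be read and parsed.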
Searching tools¶
fts-text-annotate¶
Annotates a plain-text file as produced by hocr-text with {{{ and }}} around matching text. Resulting file can be used as input for fts-text-match.
Usage:
fts-text-annotate -f hocr-plain.txt -p textbooks > hocr-plain-hl.txt
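The annotation format itself is simple to reproduce. The sketch below wraps every occurrence of a phrase in `{{{` and `}}}`. It treats the phrase as a literal, case-insensitive string, which is an assumption: how the real tool interprets its `-p` argument is not specified here, and `annotate` is a hypothetical helper.

```python
import re


def annotate(text: str, phrase: str) -> str:
    """Wrap each case-insensitive occurrence of phrase in {{{ and }}}."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    # Keep the matched text exactly as it appears in the input.
    return pattern.sub(lambda m: "{{{" + m.group(0) + "}}}", text)


print(annotate("Old textbooks and Textbooks", "textbooks"))
```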
fts-text-match¶
Finds matches given a hOCR file, a highlighted plaintext file (per fts-text-annotate) and a lookup table (per hocr-lookup-create).
Matches are reported (including word bounds) to standard output, as JSON lines.
Usage:
fts-text-match --hocr hocr-file.html --annotated-text hocr-plain-hl.txt --table hocr-file-lookup.json
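One step implied by this workflow is translating the `{{{` and `}}}` markers back into offsets in the un-annotated text, so they can be checked against the lookup table. The following sketch shows that step only. It works on character offsets for simplicity, whereas a byte-oriented lookup table would need the same arithmetic on encoded bytes, and `highlight_spans` is a hypothetical helper rather than part of the tool.

```python
import re
from typing import List, Tuple

HIGHLIGHT = re.compile(r"\{\{\{(.*?)\}\}\}", re.DOTALL)
MARKER_CHARS = len("{{{") + len("}}}")


def highlight_spans(annotated: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, text) for each highlight, in plain-text offsets."""
    spans = []
    removed = 0  # marker characters that precede the current match
    for m in HIGHLIGHT.finditer(annotated):
        start = m.start() - removed
        end = start + len(m.group(1))
        spans.append((start, end, m.group(1)))
        removed += MARKER_CHARS
    return spans


annotated = "Old {{{textbooks}}} and {{{Textbooks}}}"
for span in highlight_spans(annotated):
    print(span)
```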
abbyy-to-hocr¶
Converts an Abbyy XML file to a hOCR file, preserving as much information as possible.
Usage:
abbyy-to-hocr --infile abbyy_file.xml > converted_hocr_file.xml
Testing tools¶
TBD:
hocr-text-paragraphs
hocr-lookup-check
hocr-lookup-reconstruct
Library usage¶
Print all words in a hOCR file¶
Open a hOCR file and get all the word information for each page:
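# Assumption: these functions live in the hocr.parse module; adjust the
# import if your installed version lays them out differently.
from hocr.parse import (hocr_page_iterator, hocr_page_get_dimensions,
                        hocr_page_to_word_data)

hocr_iter = hocr_page_iterator('test_hocr.html.gz')
for page in hocr_iter:
    # Page width and height
    w, h = hocr_page_get_dimensions(page)
    # Nested structure: paragraphs -> lines -> words
    word_data = hocr_page_to_word_data(page)
    for paragraph in word_data:
        for line in paragraph['lines']:
            for word in line['words']:
                print(word['text'], word['bbox'])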
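The nested paragraph, line and word structure used in the loop above can also be flattened into one string per line. The sketch below does this on hand-written sample data shaped like that structure, so it runs without a hOCR file. `page_lines` is a hypothetical helper, and the sample only includes the `lines`, `words`, `text` and `bbox` keys that appear in the example above.

```python
from typing import Dict, Iterator, List


def page_lines(word_data: List[Dict]) -> Iterator[str]:
    """Yield one space-joined string per line of a page's word data."""
    for paragraph in word_data:
        for line in paragraph["lines"]:
            yield " ".join(word["text"] for word in line["words"])


# Hand-written stand-in for the word data of one page.
sample_page = [
    {"lines": [
        {"words": [{"text": "Hello", "bbox": [0, 0, 40, 12]},
                   {"text": "world", "bbox": [45, 0, 90, 12]}]},
        {"words": [{"text": "again", "bbox": [0, 15, 42, 27]}]},
    ]},
]

for text in page_lines(sample_page):
    print(text)
```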
Create a lookup table for a hOCR file¶
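import sys
# Assumption: module paths follow the package layout (hocr.searching, hocr.text)
from hocr.searching import (hocr_get_page_lookup_table,
                            hocr_lookup_page_by_plaintext_offset)
from hocr.text import hocr_page_text

filename = sys.argv[1]
# Build lookup table
page_info = hocr_get_page_lookup_table(filename)
# Find the last page in the document, take its plain text start byte and
# subtract one, to land on the second-to-last page of the document
second_last_page = page_info[-1][0] - 1
# Assumption: f is the hOCR file itself, opened in binary mode
with open(filename, 'rb') as f:
    page = hocr_lookup_page_by_plaintext_offset(f, page_info, second_last_page)
    txt = hocr_page_text(page)
print('Text', txt)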