Welcome to archive-hocr-tools’s documentation!

Tools Usage

hocr-combine-stream

Combines several hOCR files into a single hOCR file. It works in streaming mode, so its memory footprint stays low regardless of input file size.

Usage:

hocr-combine-stream -g 'hocr-page-*.html' > hocr-combined.html

hocr-fold-chars

Converts a character-based hOCR file (with ocrx_cinfo entries) into a word-based hOCR file with ocrx_word entries. This is useful if you don’t need per-character information and want a smaller file to improve network or computational throughput.

Usage:

hocr-fold-chars -f hocr-with-characters.html > hocr-with-words.html
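
The central step in folding characters into words is merging the per-character boxes into one word box. The following is a minimal sketch of that idea, assuming each character carries an hOCR-style bounding box (x0, y0, x1, y1). fold_chars is a hypothetical helper for illustration and is not part of archive-hocr-tools; the real tool may handle more than this (confidences, whitespace and so on).

```python
def fold_chars(chars):
    """Merge (character, bbox) pairs into a single (word, bbox) pair.

    Each bbox is (x0, y0, x1, y1); the word box is the union of the
    character boxes.
    """
    text = ''.join(c for c, _ in chars)
    x0 = min(b[0] for _, b in chars)
    y0 = min(b[1] for _, b in chars)
    x1 = max(b[2] for _, b in chars)
    y1 = max(b[3] for _, b in chars)
    return text, (x0, y0, x1, y1)


# Hypothetical character data for the word "cat"
chars = [('c', (10, 20, 18, 40)), ('a', (19, 22, 27, 40)), ('t', (28, 18, 34, 41))]
word, bbox = fold_chars(chars)
print(word, bbox)
```

Storing one box per word instead of one per character is what reduces the file size.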

hocr-text

Creates a text-only version of a hOCR file. Words below a certain (currently hardcoded) confidence level are discarded. The output can be ingested as a large text file into full-text search engines, or used to quickly read or search the text.

Usage:

hocr-text -f hocr-file.html > hocr-plain.txt
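
To illustrate confidence-based filtering, here is a minimal sketch that reads the x_wconf value from the title attribute of each ocrx_word element and keeps only words at or above a threshold. The threshold MIN_CONF, the helper confident_words and the sample markup are all assumptions made for illustration; the value hardcoded in hocr-text and its exact handling of words may differ.

```python
import re
import xml.etree.ElementTree as ET

MIN_CONF = 50  # hypothetical threshold; the tool's hardcoded value may differ

# Hypothetical hOCR fragment with three words of varying confidence
SAMPLE = '''<div class="ocr_page">
  <span class="ocr_line" title="bbox 0 0 200 30">
    <span class="ocrx_word" title="bbox 0 0 50 30; x_wconf 96">Hello</span>
    <span class="ocrx_word" title="bbox 60 0 90 30; x_wconf 12">w0r1d</span>
    <span class="ocrx_word" title="bbox 100 0 150 30; x_wconf 88">again</span>
  </span>
</div>'''


def confident_words(hocr_xml, min_conf=MIN_CONF):
    """Return the text of ocrx_word elements whose x_wconf is >= min_conf."""
    words = []
    for el in ET.fromstring(hocr_xml).iter():
        if el.get('class') != 'ocrx_word':
            continue
        m = re.search(r'x_wconf\s+(\d+(?:\.\d+)?)', el.get('title', ''))
        # Words without a confidence value are kept in this sketch.
        if m is None or float(m.group(1)) >= min_conf:
            words.append(el.text or '')
    return words


print(' '.join(confident_words(SAMPLE)))
```

The sketch loads the whole fragment into memory for brevity; a large hOCR file would more sensibly be processed page by page.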

hocr-lookup-create

Creates a “lookup table” that maps the start and end of each page (in both the plaintext and the XML). It can be used to quickly parse only a subset of a large hOCR file. Text offsets are as hocr-text would report them (so words below the same hardcoded confidence level are also discarded).

Can be used in combination with fts-text-match to quickly highlight matches from a full-text search (FTS) engine.

Usage:

hocr-lookup-create -f hocr-file.html > hocr-file-lookup.json
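
The sketch below shows why such a table makes partial parsing cheap: a plaintext offset can be mapped to a page with a binary search, and that page's XML offsets then say which bytes of the hOCR file to read. The row layout (text_start, text_end, xml_start, xml_end), the sample numbers and the helper page_for_text_offset are assumptions for illustration; the JSON actually written by hocr-lookup-create may be structured differently.

```python
from bisect import bisect_right

# Hypothetical rows: (text_start, text_end, xml_start, xml_end), one per page.
table = [
    (0, 120, 500, 4200),
    (120, 310, 4200, 9100),
    (310, 450, 9100, 12800),
]


def page_for_text_offset(table, offset):
    """Return the index of the page whose plaintext range contains offset."""
    starts = [row[0] for row in table]
    idx = bisect_right(starts, offset) - 1
    if idx < 0 or offset >= table[idx][1]:
        raise ValueError('offset outside document')
    return idx


idx = page_for_text_offset(table, 200)
xml_start, xml_end = table[idx][2], table[idx][3]
print(idx, xml_start, xml_end)
```

With xml_start and xml_end known, a reader could seek to xml_start in the hOCR file and parse only xml_end - xml_start bytes instead of the whole document.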

Searching tools

fts-text-annotate

Annotates a plain-text file, as produced by hocr-text, with {{{ and }}} around matching text. The resulting file can be used as input for fts-text-match.

Usage:

fts-text-annotate -f hocr-plain.txt -p textbooks > hocr-plain-hl.txt
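
To show how a consumer might interpret the {{{ and }}} markers, here is a minimal sketch that strips them and records where each highlighted span falls in the un-annotated text. find_highlights is a hypothetical helper, not part of archive-hocr-tools, and it works in character offsets, whereas the tools may well work in byte offsets.

```python
import re

MARKED = re.compile(r'\{\{\{(.*?)\}\}\}', re.DOTALL)


def find_highlights(annotated):
    """Return (plain_text, spans); spans are (start, end) offsets into plain_text."""
    plain_parts, spans = [], []
    pos = 0        # current position in the annotated text
    plain_len = 0  # length of the plain text emitted so far
    for m in MARKED.finditer(annotated):
        before = annotated[pos:m.start()]
        plain_parts.append(before)
        plain_len += len(before)
        spans.append((plain_len, plain_len + len(m.group(1))))
        plain_parts.append(m.group(1))
        plain_len += len(m.group(1))
        pos = m.end()
    plain_parts.append(annotated[pos:])
    return ''.join(plain_parts), spans


text, spans = find_highlights('Old {{{textbooks}}} and new {{{textbooks}}}.')
print(text)
print(spans)
```

Offsets into the un-annotated text are what make it possible to relate a highlight back to a page via a lookup table.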

fts-text-match

Finds matches given a hOCR file, a highlighted plaintext file (per fts-text-annotate) and a lookup table (per hocr-lookup-create).

Matches are reported (including word bounds) to standard out, as JSON lines.

Usage:

fts-text-match --hocr hocr-file.html --annotated-text hocr-plain-hl.txt --table hocr-file-lookup.json
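
Because the output is JSON lines, each line can be parsed independently. The following is a minimal sketch of consuming such output; parse_json_lines is a hypothetical helper, and the field names in the sample are invented for illustration and may not match what fts-text-match emits.

```python
import json


def parse_json_lines(output):
    """Parse JSON-lines text: one JSON document per non-empty line."""
    return [json.loads(line) for line in output.splitlines() if line.strip()]


# Hypothetical sample; the real field names emitted by fts-text-match may differ.
sample = '{"text": "textbooks", "page": 3}\n{"text": "textbooks", "page": 7}\n'
for match in parse_json_lines(sample):
    print(match)
```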

abbyy-to-hocr

Converts an Abbyy XML file to a hOCR file, preserving as much information as possible.

Usage:

abbyy-to-hocr --infile abbyy_file.xml > converted_hocr_file.xml

Testing tools

TBD:

  • hocr-text-paragraphs

  • hocr-lookup-check

  • hocr-lookup-reconstruct

Library usage

Create a lookup table for a hOCR file

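import sys

# NOTE: the module paths below are an assumption; adjust them to wherever
# these functions live in your installed version of archive-hocr-tools.
from hocr.searching import (hocr_get_page_lookup_table,
                            hocr_lookup_page_by_plaintext_offset)
from hocr.text import hocr_page_text

filename = sys.argv[1]

# Build lookup table
page_info = hocr_get_page_lookup_table(filename)
# Find the last page in the document, take the plain text start byte and
# subtract one, to get to the second last page of the document
second_last_page = page_info[-1][0] - 1
# `f` is assumed to be the hOCR file itself, opened in binary mode
with open(filename, 'rb') as f:
    page = hocr_lookup_page_by_plaintext_offset(f, page_info, second_last_page)
    txt = hocr_page_text(page)
print('Text', txt)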

Components
