Welcome to archive-hocr-tools’s documentation!

Tools Usage

hocr-combine-stream

Combines several hOCR files into a single hOCR file. It works in streaming mode, so its memory footprint stays low regardless of input file size.

Usage:

hocr-combine-stream -g 'hocr-page-*.html' > hocr-combined.html

hocr-fold-chars

Converts a character-based hOCR file (with ocrx_cinfo entries) into a word-based hOCR file with ocrx_word entries. This is useful if you don’t need per-character information and want a smaller file to improve network or computational throughput.

Usage:

hocr-fold-chars -f hocr-with-characters.html > hocr-with-words.html
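
The folding described above can be pictured as merging per-character boxes into one word box. The sketch below is only an illustration of that idea: the `fold_chars` helper and the dict layout (`text`, `bbox`) are hypothetical and are not taken from the tool's source.

```python
# Hypothetical sketch: fold per-character entries into a single word entry.
# The dict layout is illustrative, not the tool's internal format.

def fold_chars(chars):
    """Merge character dicts ({'text', 'bbox'}) into one word dict.

    Each bbox is [x0, y0, x1, y1]; the word bbox is the union of all
    character boxes.
    """
    x0 = min(c['bbox'][0] for c in chars)
    y0 = min(c['bbox'][1] for c in chars)
    x1 = max(c['bbox'][2] for c in chars)
    y1 = max(c['bbox'][3] for c in chars)
    text = ''.join(c['text'] for c in chars)
    return {'text': text, 'bbox': [x0, y0, x1, y1]}


chars = [
    {'text': 'h', 'bbox': [10, 20, 18, 40]},
    {'text': 'i', 'bbox': [19, 18, 23, 40]},
]
word = fold_chars(chars)
print(word)
```

Keeping only the union box and the joined text is what makes the word-based file smaller than the character-based one.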

hocr-text

Creates a text-only version of a hOCR file. Words below a certain (currently hardcoded) confidence level are discarded. Can be used to turn hOCR files into large text files for ingestion into full-text search engines, or just to quickly read or search for text.

Usage:

hocr-text -f hocr-file.html > hocr-plain.txt
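
As a rough illustration of the confidence filtering mentioned above, the following sketch drops low-confidence words before joining the rest into plain text. The threshold value, the `confidence` key and the `words_to_text` helper are assumptions made for this example; the tool's actual hardcoded level is not stated here.

```python
# Hypothetical threshold; the value hardcoded in hocr-text is not documented here.
MIN_CONFIDENCE = 50


def words_to_text(words, min_confidence=MIN_CONFIDENCE):
    """Join the text of all words whose confidence meets the threshold."""
    kept = [w['text'] for w in words if w['confidence'] >= min_confidence]
    return ' '.join(kept)


words = [
    {'text': 'Hello', 'confidence': 96},
    {'text': '~#', 'confidence': 12},   # likely OCR noise, dropped
    {'text': 'world', 'confidence': 88},
]
plain = words_to_text(words)
print(plain)
```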

hocr-lookup-create

Creates a “lookup table” that maps the start and end of each page (in both the plaintext and the XML). Can be used to quickly parse only a subset of a big hOCR file. Text offsets match what hocr-text would report, including the discarding of words below the same hardcoded confidence level.

Can be used in combination with fts-text-match to quickly highlight matches from a full-text search (FTS) engine.

Usage:

hocr-lookup-create -f hocr-file.html > hocr-file-lookup.json
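
To show why such a table makes partial parsing cheap, here is a minimal sketch that finds the page containing a given plaintext offset with a binary search. The table layout (one `(text_start, text_end)` pair per page) and the `page_index_for_offset` helper are assumptions for illustration; the JSON written by hocr-lookup-create may be structured differently.

```python
from bisect import bisect_right

# Hypothetical table: one (text_start, text_end) pair per page, sorted by start.
# End offsets are exclusive.
table = [(0, 120), (120, 450), (450, 900)]


def page_index_for_offset(table, offset):
    """Return the index of the page whose plaintext range contains offset."""
    starts = [entry[0] for entry in table]
    idx = bisect_right(starts, offset) - 1
    if idx < 0 or offset >= table[idx][1]:
        raise ValueError('offset %d is outside the document' % offset)
    return idx


print(page_index_for_offset(table, 130))
```

With the page index in hand, a reader would only need to seek to that page's XML range instead of parsing the whole file.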

Searching tools

fts-text-annotate

Annotates a plain-text file, as produced by hocr-text, with {{{ and }}} around matching text. The resulting file can be used as input for fts-text-match.

Usage:

fts-text-annotate -f hocr-plain.txt -p textbooks > hocr-plain-hl.txt
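
The annotation format can be illustrated with a small stand-in. This sketch wraps case-insensitive substring matches in {{{ and }}}; it is a simplified substitute for whatever matching the tool itself performs, and the `annotate` helper is hypothetical.

```python
import re


def annotate(text, term):
    """Wrap every case-insensitive occurrence of term in {{{ and }}}."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return pattern.sub(lambda m: '{{{' + m.group(0) + '}}}', text)


annotated = annotate('Old textbooks and new Textbooks.', 'textbooks')
print(annotated)
```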

fts-text-match

Finds matches given a hOCR file, a highlighted plaintext file (as produced by fts-text-annotate) and a lookup table (as produced by hocr-lookup-create).

Matches are reported (including word bounds) to standard output, as JSON lines.

Usage:

fts-text-match --hocr hocr-file.html --annotated-text hocr-plain-hl.txt --table hocr-file-lookup.json
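
One step such a matcher needs is to translate the {{{ and }}} markers back into offsets in the un-annotated text, which can then be looked up in the page table. The sketch below shows that step only, using character offsets; the `highlight_offsets` helper is hypothetical and does not reflect the tool's actual implementation or its JSON output format.

```python
import re

# Non-greedy match of one highlighted span, markers included.
HIGHLIGHT = re.compile(r'\{\{\{(.*?)\}\}\}', re.DOTALL)
MARKER_CHARS = len('{{{') + len('}}}')


def highlight_offsets(annotated):
    """Return (start, end, text) for each highlight, in un-annotated offsets."""
    spans = []
    removed = 0  # marker characters that precede the current match
    for m in HIGHLIGHT.finditer(annotated):
        start = m.start() - removed
        text = m.group(1)
        spans.append((start, start + len(text), text))
        removed += MARKER_CHARS
    return spans


spans = highlight_offsets('Old {{{textbooks}}} and new {{{Textbooks}}}.')
print(spans)
```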

abbyy-to-hocr

Converts an Abbyy XML file to a hOCR file, preserving as much information as possible.

Usage:

abbyy-to-hocr --infile abbyy_file.xml > converted_hocr_file.xml

Testing tools

TBD:

hocr-text-paragraphs
hocr-lookup-check
hocr-lookup-reconstruct

Library usage

Print all words in a hOCR file

Open a hOCR file and get all the word information for each page:

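# Module path assumed; adjust if the package layout differs.
from hocr.parse import (hocr_page_iterator, hocr_page_get_dimensions,
                        hocr_page_to_word_data)

hocr_iter = hocr_page_iterator('test_hocr.html.gz')
for page in hocr_iter:
    # Page width and height
    w, h = hocr_page_get_dimensions(page)
    # Paragraphs, each holding lines, each holding words
    word_data = hocr_page_to_word_data(page)

    for paragraph in word_data:
        for line in paragraph['lines']:
            for word in line['words']:
                print(word['text'], word['bbox'])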

Create a lookup table for a hOCR file

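import sys

# Module paths assumed; adjust if the package layout differs.
from hocr.searching import (hocr_get_page_lookup_table,
                            hocr_lookup_page_by_plaintext_offset)
from hocr.text import hocr_page_text

filename = sys.argv[1]

# Build lookup table
page_info = hocr_get_page_lookup_table(filename)
# Take the plaintext start offset of the last page and subtract one;
# that offset falls inside the second-to-last page of the document
second_last_page = page_info[-1][0] - 1
# The original snippet used an undefined `f`; open the hOCR file here
# (binary mode assumed)
with open(filename, 'rb') as f:
    page = hocr_lookup_page_by_plaintext_offset(f, page_info, second_last_page)
    txt = hocr_page_text(page)
print('Text', txt)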
