.. archive-hocr-tools documentation master file, created by
   sphinx-quickstart on Thu Jan 7 16:12:28 2021.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to archive-hocr-tools's documentation!
==============================================

Tools Usage
-----------

hocr-combine-stream
~~~~~~~~~~~~~~~~~~~

Tool to combine several hOCR files into a single hOCR file. Works in streaming
mode, so it has a low memory footprint regardless of input file size.

Usage::

    hocr-combine-stream -g 'hocr-page-*.html' > hocr-combined.html

hocr-fold-chars
~~~~~~~~~~~~~~~

Converts a character-based hOCR file (with ``ocrx_cinfo`` entries) into a
word-based hOCR file with ``ocrx_word`` entries. This is useful if you don't
need per-character information and want a smaller file to increase throughput
(network or computational).

Usage::

    hocr-fold-chars -f hocr-with-characters.html > hocr-with-words.html

hocr-text
~~~~~~~~~

Creates a text-only version of a hOCR file. Discards words whose confidence is
below a certain (currently hardcoded) level. Can be used to turn hOCR files
into large text files for ingestion into full-text search engines, or just to
quickly read or search for text.

Usage::

    hocr-text -f hocr-file.html > hocr-plain.txt

hocr-lookup-create
~~~~~~~~~~~~~~~~~~

Creates a "lookup table" that maps the start and end of pages (in both
plaintext and XML). Can be used to quickly parse only a subset of a big hOCR
file. Text offsets match those that `hocr-text`_ would report (words below the
same hardcoded confidence level are likewise discarded). Can be used in
combination with `fts-text-match`_ to quickly highlight matches from a
full-text search (FTS) engine.

Usage::

    hocr-lookup-create -f hocr-file.html > hocr-file-lookup.json

Searching tools
~~~~~~~~~~~~~~~

fts-text-annotate
~~~~~~~~~~~~~~~~~

Annotates a plain-text file as produced by `hocr-text`_ with ``{{{`` and
``}}}`` around matching text.
The resulting file can be used as input for `fts-text-match`_.

Usage::

    fts-text-annotate -f hocr-plain.txt -p textbooks > hocr-plain-hl.txt

fts-text-match
~~~~~~~~~~~~~~

Finds matches given a hOCR file, a highlighted plaintext file (per
`fts-text-annotate`_) and a lookup table (per `hocr-lookup-create`_).
Matches are reported (including word bounds) to standard output, as JSON lines.

Usage::

    fts-text-match --hocr hocr-file.html --annotated-text hocr-plain-hl.txt --table hocr-file-lookup.json

abbyy-to-hocr
~~~~~~~~~~~~~

Converts an Abbyy XML file to a hOCR file, preserving as much information as
possible.

Usage::

    abbyy-to-hocr --infile abbyy_file.xml > converted_hocr_file.xml

Testing tools
-------------

TBD:

* hocr-text-paragraphs
* hocr-lookup-check
* hocr-lookup-reconstruct

Library usage
-------------

Print all words in a hOCR file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Open a hOCR file and get all the word information for each page::

    # Module path assumed from the component names listed under "Components"
    from hocr.parse import (hocr_page_iterator, hocr_page_get_dimensions,
                            hocr_page_to_word_data)

    hocr_iter = hocr_page_iterator('test_hocr.html.gz')
    for page in hocr_iter:
        # Page width and height
        w, h = hocr_page_get_dimensions(page)
        word_data = hocr_page_to_word_data(page)
        for paragraph in word_data:
            for line in paragraph['lines']:
                for word in line['words']:
                    print(word['text'], word['bbox'])

Create a lookup table for a hOCR file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    import sys

    # Module paths assumed from the component names listed under "Components"
    from hocr.searching import (hocr_get_page_lookup_table,
                                hocr_lookup_page_by_plaintext_offset)
    from hocr.text import hocr_page_text

    filename = sys.argv[1]

    # Build lookup table
    page_info = hocr_get_page_lookup_table(filename)

    # Find the last page in the document, take the plain text start byte and
    # subtract one, to get to the second last page of the document
    second_last_page = page_info[-1][0] - 1

    # Binary mode is an assumption here, since the table stores byte offsets
    with open(filename, 'rb') as f:
        page = hocr_lookup_page_by_plaintext_offset(f, page_info,
                                                    second_last_page)
        txt = hocr_page_text(page)
        print('Text', txt)

Components
----------

.. toctree::
   :maxdepth: 2
   :caption: Sections:

   parse.rst
   text.rst
   searching.rst

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`