From e5fbc00de99f2f641106ea47f413878848fa9709 Mon Sep 17 00:00:00 2001
From: Nick White
Date: Mon, 15 Jun 2020 14:02:12 +0100
Subject: Update tools tour post; final

---
title: "A tour of our tools"
date: 2020-06-15
categories: [binarisation, preprocessing, software, code, tools]
---

All of the code that powers our OCR work is released as open source,
freely available for anyone to take, use, and build on as they see
fit. This is a brief tour of some of the software we have written to
help with our mission of creating machine transcriptions of early printed
books.

Almost all of our development these days is in the [Go](https://golang.org)
language, because we love it. That means that all of the tools we'll
discuss should work fine on any platform: Linux, Mac, Windows, or anything
else supported by the Go tooling. We rely on the
[Tesseract OCR engine](https://github.com/tesseract-ocr/tesseract) with
specially crafted training sets for our OCR, but in reality a lot has
to happen before and after the recognition step to ensure high quality
output for difficult cases like historical printed works. These pre- and
post-processing steps, as well as the automatic management and combination
of them into a fast and reliable pipeline, have been the focus of much of
our development work, and are the focus of this post.

The tools are split across several different Go packages, each of which
contains some tools in the `cmd/` directory and some shared library
functions.
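As a rough illustration, a checkout of one of these packages looks
something like this (the exact file names below are invented for the
sketch, but the `cmd/` layout is how the packages are organised):

```
preproc/
├── cmd/
│   ├── binarize/
│   │   └── main.go    # the binarize command
│   └── wipe/
│       └── main.go    # the wipe command
├── binarize.go        # shared library code, importable
└── ...                # from your own Go projects
```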
They are all thoroughly documented, and the documentation can
be read online at [pkg.go.dev](https://pkg.go.dev).

## [bookpipeline](https://rescribe.xyz/bookpipeline) ([docs](https://pkg.go.dev/rescribe.xyz/bookpipeline))

The central package behind our work is bookpipeline, which is also the
name of the main command in our armoury. The `bookpipeline` command brings
together most of the preprocessing, OCR and postprocessing tools we have
and ties them to the cloud computing infrastructure we use, so that book
processing can easily be scaled to run simultaneously on as many servers
as we need, with strong fault tolerance, redundancy, and all that fun
stuff. It does this by organising the different tasks into queues; each
server checks the queues and carries out the tasks in an appropriate
order.

If you want to run the full pipeline yourself you'll have to set
the appropriate account details in `cloud.go` in the package, or you can
just try the 'local' connection type to get a reduced-functionality
version which runs entirely locally.

As with all of our commands, you can run bookpipeline with the `-h` flag
to get an overview of how to use it: `bookpipeline -h`.

There are several other commands in the package that interact with
the pipeline, such as `booktopipeline`, which uploads a book to the
pipeline, and `lspipeline`, which lists important status information about
how the pipeline is getting along.

There are also several commands which are useful outside of the pipeline
environment: `confgraph`, `pagegraph` and `pdfbook`. `confgraph` and
`pagegraph` create graphs showing the OCR confidence of different parts
of a book or an individual page, given hOCR input.
`pdfbook` creates a searchable PDF
from a directory of hOCR and image files. There are several tools online
that can do this, but our `pdfbook` has several great features they lack:
it can smartly reduce the size and quality of pages while maintaining the
correct DPI and OCR coordinates, and it uses 'strokeless' text for the
invisible text layer, which works reliably with all PDF readers.

## [preproc](https://rescribe.xyz/preproc) ([docs](https://pkg.go.dev/rescribe.xyz/preproc))

preproc is a package of image preprocessing tools which we use to prepare
page images for OCR. They are designed to be very fast, and to work well
even in the common (for us) case of weird and dirty pages which have been
scanned badly. Many of the operations take advantage of our
[integralimg](https://rescribe.xyz/integralimg) ([docs](https://pkg.go.dev/rescribe.xyz/integralimg))
package, which uses clever mathematics to make the image operations very
fast.

There are two main commands in the preproc package (plus a number of
exported functions to use in your own Go projects), `binarize` and `wipe`,
as well as a command that combines the two, `preprocess`. The `binarize`
tool binarises an image; that is, it takes a colour or greyscale image and
makes it black and white. This sounds simple, but as our
[binarisation](/categories/binarisation) posts here have described, doing
it well takes a lot of work, and can make a massive difference to OCR
quality. The `wipe` tool detects the content area of a page (where the
text is), and removes everything outside of it. This is important to stop
noise in the margins from negatively affecting the final OCR result.

## [utils](https://rescribe.xyz/utils) ([docs](https://pkg.go.dev/rescribe.xyz/utils))

The utils package contains a variety of small utilities and packages that
we needed.
Probably the most useful for others is the `hocr` package
(`https://rescribe.xyz/utils/pkg/hocr`), which parses a hOCR file and
provides several handy functions, such as calculating the total confidence
for a page using word or character level OCR confidences. The `hocrtotxt`
command is also a handy, simple command to output plain text from a
hOCR file.

Other useful tools include `eeboxmltohocr`, which converts the XML
available from the [Early English Books Online](https://eebo.chadwyck.com)
project into hOCR; `fonttobytes`, which outputs a Go-format byte list for
a font file, enabling it to be easily included in a Go binary (as used by
`bookpipeline`); `dehyphenate`, which follows simple rules to dehyphenate
a text file; and `pare-gt`, which splits some files from a directory of
ground truth files (partial page images with a corresponding
transcription) out into a separate directory. We use `pare-gt` extensively
in preparing new OCR training sets, in particular to extract good OCR
testing data.

## Summary

We are proud of the tools we've written, but there will inevitably be
plenty of shortcomings and missing features. We're very keen for others to
try them out and let us know what works well and what doesn't, so that we
can improve them for everyone. While we have written everything to be as
robust and correct as possible, there are sure to be plenty of bugs
lurking, so please [do let us know](mailto:info@rescribe.xyz) if anything
doesn't work right or needs to be explained better, if there are features
you would find useful, or anything else you want to share. And of course,
if you want to share patches fixing or changing things, that would be even
better! We hope that by releasing these tools, and explaining how they
work and can be used, more of the world's historical culture can be made
more accessible and searchable, and be used and understood in new ways.