---
title: "A tour of our tools"
date: 2020-06-01
categories: [binarisation, preprocessing, software, code, tools]
---

All of the code that powers our OCR work is released as open source, freely available for anyone to take, use, and build on as they see fit. However, until very recently we hadn't taken the time to document it all and make it easily discoverable. That has now changed, so we thought it would be an ideal time to give a brief tour of some of the software we have written to help with our mission of creating machine transcriptions of early printed books.

Almost all of our development these days is in the [Go](https://golang.org) language, because we love it. That means that all of the tools we'll discuss should work fine on any platform: Linux, Mac, Windows, or anything else supported by the Go tooling. While we've written everything to be as robust as possible, there are sure to be plenty of bugs lurking, so please [do let us know](mailto:info@rescribe.xyz) if anything doesn't work right, if there are features you would find useful, or anything else you want to share. And of course, if you want to share patches fixing or changing things, that would be even better!

The tools are split across several different Go packages, each of which contains some tools in the `cmd/` directory and some shared library functions.
They are all thoroughly documented, and the documentation can be read online at [pkg.go.dev](https://pkg.go.dev).

## [bookpipeline](https://rescribe.xyz/bookpipeline) ([docs](https://pkg.go.dev/rescribe.xyz/bookpipeline))

The central package behind our work is bookpipeline, which is also the name of the main command in our armoury. The `bookpipeline` command brings together most of the preprocessing, OCR and postprocessing tools we have and ties them to the cloud computing infrastructure we use, so that book processing can easily be scaled to run simultaneously on as many servers as we need, with strong fault tolerance, redundancy, and all that fun stuff. If you want to run the full pipeline yourself you'll have to set the appropriate account details in `cloud.go` in the package, or you can just try the 'local' connection type to get a reduced-functionality version which runs entirely locally.

There are several other commands in the package that interact with the pipeline, such as `booktopipeline`, which uploads a book to the pipeline, and `lspipeline`, which lists important status information about how the pipeline is getting along.

There are also several commands which are useful outside of the pipeline environment: `confgraph`, `pagegraph` and `pdfbook`. `confgraph` and `pagegraph` create graphs showing the OCR confidence of different parts of a book or an individual page, given hOCR input. `pdfbook` creates a searchable PDF from a directory of hOCR and image files. There are several tools online that can do this, but our `pdfbook` has several great features they lack: it can smartly reduce the size and quality of pages while maintaining the correct DPI and OCR coordinates, and it uses 'strokeless' text for the invisible text layer, which works reliably with all PDF readers.
## [preproc](https://rescribe.xyz/preproc) ([docs](https://pkg.go.dev/rescribe.xyz/preproc))

preproc is a package of image preprocessing tools which we use to prepare page images for OCR. They are designed to be very fast, and to work well even in the common (for us) case of weird and dirty pages which have been scanned badly. Many of the operations take advantage of our [integralimg](https://rescribe.xyz/integralimg) ([docs](https://pkg.go.dev/rescribe.xyz/integralimg)) package, which uses clever mathematics to make the image operations very fast.

There are two main commands (plus a number of exported functions to use in your own Go projects) in the preproc package, `binarize` and `wipe`, as well as a command that combines the two processes together, `preprocess`. The `binarize` tool binarises an image; that is, takes a colour or grey image and makes it black and white. This sounds simple, but as our [binarisation](/categories/binarisation) posts here have described, doing it well takes a lot of work, and can make a massive difference to OCR quality. The `wipe` tool detects a content area in the page (where the text is), and removes everything outside of it. This is important to avoid various scanning artifacts or decorative marginalia negatively affecting the final OCR result.

## [utils](https://rescribe.xyz/utils) ([docs](https://pkg.go.dev/rescribe.xyz/utils))

The utils package contains a variety of small utilities and packages that we needed. Probably the most useful for others would be the `hocr` package (`https://rescribe.xyz/utils/pkg/hocr`), which parses a hOCR file and provides several handy functions such as calculating the total confidence for a page using word or character level OCR confidences. The `hocrtotxt` command is also a very handy simple command to output plain text from a hOCR file.
Other useful tools include `eeboxmltohocr`, which converts the XML available from the [Early English Books Online](https://eebo.chadwyck.com) project into hOCR; `fonttobytes`, which outputs a Go-format byte list for a font file, enabling it to be easily included in a Go binary (as used by `bookpipeline`); `dehyphenate`, which follows simple rules to dehyphenate a text file; and `pare-gt`, which splits some of the files in a directory of ground truth files (partial page images with corresponding transcriptions) out into a separate directory. We use `pare-gt` extensively in preparing new OCR training sets, in particular to extract good OCR testing data.

## Summary

We are proud of the tools we've written, but inevitably there will be lots of shortcomings and missing features. We're very keen for others to try them out and let us know what works well and what doesn't, so that we can improve them for everyone. OCR is a quite well-established technology, but the lack of free and open source tooling around it has held back its potential. We hope that by releasing these tools, and explaining how they work and can be used, more of the world's historical culture can be made more accessible, searchable, and used and understood in new ways.