-rw-r--r-- | content/posts/tool-overview/index.md (renamed from content/posts/tools-released/index.md) | 52
1 file changed, 31 insertions, 21 deletions
diff --git a/content/posts/tools-released/index.md b/content/posts/tool-overview/index.md
index f15e197..ecaaa06 100644
--- a/content/posts/tools-released/index.md
+++ b/content/posts/tool-overview/index.md
@@ -1,25 +1,25 @@
 ---
 title: "A tour of our tools"
-date: 2020-06-01
+date: 2020-06-15
 categories: [binarisation, preprocessing, software, code, tools]
 ---
 
 All of the code that powers our OCR work is released as open source,
 freely available for anyone to take, use, and build on as they see
-fit. However until very recently we hadn't taken the time to document
-it all, and make it easily discoverable. That has now changed, so we
-thought it would be an ideal time to give a brief tour of some of the
-software we have written to help with our mission creating machine
-transcriptions of early printed books.
+fit. This is a brief tour of some of the software we have written to
+help with our mission of creating machine transcriptions of early printed
+books.
 
 Almost all of our development these days is in the
 [Go](https://golang.org) language, because we love it. That means that
 all of the tools we'll discuss should work fine on any platform, Linux,
 Mac, Windows or anything
-else supported by the Go tooling. While we're written everything to be as
-robust and good as possible, there are sure to be plenty of bugs lurking,
-so please [do let us know](mailto:info@rescribe.xyz) if anything doesn't
-work right, if there are features you would find useful, or anything else
-you want to share. And of course, if you want to share patches fixing or
-changing things, that would be even better!
+else supported by the Go tooling. We rely on the
+[Tesseract OCR engine](https://github.com/tesseract-ocr/tesseract) with
+specially crafted training sets for our OCR, but in reality a lot of things
+have to happen before and after the recognition step to ensure high quality
+output for difficult cases like historical printed works. These pre- and
+post-processing processes, as well as the automatic managing and combining
+of them into a fast and reliable pipeline, have been the focus of much of
+our development work, and are the focus of this post.
 
 The tools are split across several different Go packages, which each
 contain some tools in the `cmd/` directory, and some shared library
@@ -34,11 +34,17 @@ together most of the preprocessing, ocr and postprocessing tools we have
 and ties them to the cloud computing infrastructure we use, so that book
 processing can be easily scaled to run simultaneously on as many servers
 as we need, with strong fault tolerance, redundancy, and all that fun
-stuff. If you want to run the full pipeline yourself you'll have to set
+stuff. It does this by organising the different tasks into queues, which
+are then checked, and the tasks are done in an appropriate order.
+
+If you want to run the full pipeline yourself you'll have to set
 the appropriate account details in `cloud.go` in the package, or you
 can just try the 'local' connection type to get a reduced functionality
 version which just runs locally.
+
+As with all of our commands, you can run bookpipeline with the `-h` flag to
+get an overview of how to use it, like this `bookpipeline -h`.
 
 There are several other comands in the package that interact with the
 pipeline, such as `booktopipeline`, which uploads a book to the
 pipeline, and `lspipeline`, which lists important status information about
@@ -72,8 +78,8 @@ makes it black and white. This sounds simple, but as our
 [binarisation](/categories/binarisation) posts here have described, doing it
 well takes a lot of work, and can make a massive difference to OCR quality.
 The `wipe` tool detects a content area in the page (where the text is), and
-removes everything outside of it. This is important to avoid various scanning
-artifacts or decorative marginalia negatively affecting the final OCR result.
+removes everything outside of it. This is important to avoid noise in the
+margins from negatively affecting the final OCR result.
 
 ## [utils](https://rescribe.xyz/utils) ([docs](https://pkg.go.dev/rescribe.xyz/utils))
 
@@ -97,11 +103,15 @@ sets, in particular to extract good OCR testing data.
 
 ## Summary
 
-We are proud of the tools we've written, but inevitably there will be lots
+We are proud of the tools we've written, but there will inevitably be plenty
 of shortcomings and missing features. We're very keen for others to try
 them out and let us know what works well and what doesn't, so that we can
-improve them for everyone. OCR is a quite well-established technology, but
-the lack of free and open source tooling around it has held back its
-potential. We hope that by releasing these tools, and explaining how they
-work and can be used, more of the world's historical culture can be more
-accessible, searchable, and used and understood and new ways.
+improve them for everyone. While we have written everything to be as robust
+and correct as possible, there are sure to be plenty of bugs lurking, so
+please [do let us know](mailto:info@rescribe.xyz) if anything doesn't work
+right or needs to be explained better, if there are features you would find
+useful, or anything else you want to share. And of course, if you want to
+share patches fixing or changing things, that would be even better! We hope
+that by releasing these tools, and explaining how they work and can be used,
+more of the world's historical culture can be more accessible and searchable,
+and be used and understood and new ways.