From e5fbc00de99f2f641106ea47f413878848fa9709 Mon Sep 17 00:00:00 2001
From: Nick White <git@njw.name>
Date: Mon, 15 Jun 2020 14:02:12 +0100
Subject: Update tools tour post; final

---
 content/posts/tool-overview/index.md  | 117 ++++++++++++++++++++++++++++++++++
 content/posts/tools-released/index.md | 107 -------------------------------
 2 files changed, 117 insertions(+), 107 deletions(-)
 create mode 100644 content/posts/tool-overview/index.md
 delete mode 100644 content/posts/tools-released/index.md

diff --git a/content/posts/tool-overview/index.md b/content/posts/tool-overview/index.md
new file mode 100644
index 0000000..ecaaa06
--- /dev/null
+++ b/content/posts/tool-overview/index.md
@@ -0,0 +1,117 @@
+---
+title: "A tour of our tools"
+date: 2020-06-15
+categories: [binarisation, preprocessing, software, code, tools]
+---
+All of the code that powers our OCR work is released as open source,
+freely available for anyone to take, use, and build on as they see
+fit. This is a brief tour of some of the software we have written to
+help with our mission of creating machine transcriptions of early printed
+books.
+
+Almost all of our development these days is in the [Go](https://golang.org)
+language, because we love it. That means that all of the tools we'll
+discuss should work fine on any platform: Linux, Mac, Windows or anything
+else supported by the Go tooling. We rely on the
+[Tesseract OCR engine](https://github.com/tesseract-ocr/tesseract) with
+specially crafted training sets for our OCR, but in reality a lot of things
+have to happen before and after the recognition step to ensure high quality
+output for difficult cases like historical printed works. These pre- and
+post-processing steps, as well as the automatic managing and combining
+of them into a fast and reliable pipeline, have been the focus of much of
+our development work, and are the focus of this post.
+
+The tools are split across several different Go packages, which each
+contain some tools in the `cmd/` directory, and some shared library
+functions. They are all thoroughly documented, and the documentation can
+be read online at [pkg.go.dev](https://pkg.go.dev).
+
+## [bookpipeline](https://rescribe.xyz/bookpipeline) ([docs](https://pkg.go.dev/rescribe.xyz/bookpipeline))
+
+The central package behind our work is bookpipeline, which is also the
+name of the main command in our armoury. The `bookpipeline` command brings
+together most of the preprocessing, OCR and postprocessing tools we have
+and ties them to the cloud computing infrastructure we use, so that book
+processing can be easily scaled to run simultaneously on as many servers
+as we need, with strong fault tolerance, redundancy, and all that fun
+stuff. It does this by organising the different tasks into queues, which
+are then checked, and the tasks are done in an appropriate order.
+
+If you want to run the full pipeline yourself you'll have to set
+the appropriate account details in `cloud.go` in the package, or you can
+just try the 'local' connection type to get a reduced functionality version
+which just runs locally.
+
+As with all of our commands, you can run bookpipeline with the `-h` flag to
+get an overview of how to use it, like this: `bookpipeline -h`.
+
+There are several other commands in the package that interact with
+the pipeline, such as `booktopipeline`, which uploads a book to the
+pipeline, and `lspipeline`, which lists important status information about
+how the pipeline is getting along.
+
+There are also several commands which are useful outside of the pipeline
+environment: `confgraph`, `pagegraph` and `pdfbook`. `confgraph` and `pagegraph`
+create graphs showing the OCR confidence of different parts of a book
+or individual page, given hOCR input. `pdfbook` creates a searchable PDF
+from a directory of hOCR and image files - there are several tools online
+that could do this, but our `pdfbook` has several great features they lack:
+it can smartly reduce the size and quality of pages while maintaining the
+correct DPI and OCR coordinates, and it uses 'strokeless' text for the invisible
+text layer, which works reliably with all PDF readers.
+
+## [preproc](https://rescribe.xyz/preproc) ([docs](https://pkg.go.dev/rescribe.xyz/preproc))
+
+preproc is a package of image preprocessing tools which we use to prepare
+page images for OCR. They are designed to be very fast, and to work well
+even in the common (for us) case of weird and dirty pages which have been
+scanned badly. Many of the operations take advantage of our
+[integralimg](https://rescribe.xyz/integralimg) ([docs](https://pkg.go.dev/rescribe.xyz/integralimg))
+package, which uses clever mathematics to make the image operations very
+fast.
+
+There are two main commands (plus a number of exported functions to use in
+your own Go projects) in the preproc package, `binarize` and `wipe`, as well
+as a command that combines the two processes together, `preprocess`. The
+`binarize` tool binarises an image; that is, takes a colour or grey image and
+makes it black and white. This sounds simple, but as our
+[binarisation](/categories/binarisation) posts here have described, doing it
+well takes a lot of work, and can make a massive difference to OCR quality.
+The `wipe` tool detects a content area in the page (where the text is), and
+removes everything outside of it. This is important to avoid noise in the
+margins from negatively affecting the final OCR result.
+
+## [utils](https://rescribe.xyz/utils) ([docs](https://pkg.go.dev/rescribe.xyz/utils))
+
+The utils package contains a variety of small utilities and packages that
+we needed. Probably the most useful for others would be the `hocr` package
+(`https://rescribe.xyz/utils/pkg/hocr`), which parses a hOCR file and
+provides several handy functions such as calculating the total confidence
+for a page using word or character level OCR confidences. The `hocrtotxt`
+command is also a handy, simple way to output plain text from a
+hOCR file.
+
+Other useful tools include `eeboxmltohocr`, which converts the XML available
+from the [Early English Books Online](https://eebo.chadwyck.com) project into
+hOCR, `fonttobytes`, which outputs a Go format byte list for a font file
+enabling it to be easily included in a Go binary (as used by `bookpipeline`),
+`dehyphenate` which follows simple rules to dehyphenate a text file, and
+`pare-gt`, which splits some files in a directory containing ground truth
+files (partial page images with a corresponding transcription) out into a
+separate directory, which we use extensively in preparing new OCR training
+sets, in particular to extract good OCR testing data.
+
+## Summary
+
+We are proud of the tools we've written, but there will inevitably be plenty
+of shortcomings and missing features. We're very keen for others to try
+them out and let us know what works well and what doesn't, so that we can
+improve them for everyone. While we have written everything to be as robust
+and correct as possible, there are sure to be plenty of bugs lurking, so
+please [do let us know](mailto:info@rescribe.xyz) if anything doesn't work
+right or needs to be explained better, if there are features you would find
+useful, or anything else you want to share. And of course, if you want to
+share patches fixing or changing things, that would be even better! We hope
+that by releasing these tools, and explaining how they work and can be used,
+more of the world's historical culture can be more accessible and searchable,
+and be used and understood in new ways.
diff --git a/content/posts/tools-released/index.md b/content/posts/tools-released/index.md
deleted file mode 100644
index f15e197..0000000
--- a/content/posts/tools-released/index.md
+++ /dev/null
@@ -1,107 +0,0 @@
----
-title: "A tour of our tools"
-date: 2020-06-01
-categories: [binarisation, preprocessing, software, code, tools]
----
-All of the code that powers our OCR work is released as open source,
-freely available for anyone to take, use, and build on as they see
-fit. However until very recently we hadn't taken the time to document
-it all, and make it easily discoverable. That has now changed, so we
-thought it would be an ideal time to give a brief tour of some of the
-software we have written to help with our mission creating machine
-transcriptions of early printed books.
-
-Almost all of our development these days is in the [Go](https://golang.org)
-language, because we love it. That means that all of the tools we'll
-discuss should work fine on any platform, Linux, Mac, Windows or anything
-else supported by the Go tooling. While we're written everything to be as
-robust and good as possible, there are sure to be plenty of bugs lurking,
-so please [do let us know](mailto:info@rescribe.xyz) if anything doesn't
-work right, if there are features you would find useful, or anything else
-you want to share. And of course, if you want to share patches fixing or
-changing things, that would be even better!
-
-The tools are split across several different Go packages, which each
-contain some tools in the `cmd/` directory, and some shared library
-functions. They are all thoroughly documented, and the documentation can
-be read online at [pkg.go.dev](https://pkg.go.dev).
-
-## [bookpipeline](https://rescribe.xyz/bookpipeline) ([docs](https://pkg.go.dev/rescribe.xyz/bookpipeline))
-
-The central package behind our work is bookpipeline, which is also the
-name of the main command in our armoury. The `bookpipeline` command brings
-together most of the preprocessing, ocr and postprocessing tools we have
-and ties them to the cloud computing infrastructure we use, so that book
-processing can be easily scaled to run simultaneously on as many servers
-as we need, with strong fault tolerance, redundancy, and all that fun
-stuff. If you want to run the full pipeline yourself you'll have to set
-the appropriate account details in `cloud.go` in the package, or you can
-just try the 'local' connection type to get a reduced functionality version
-which just runs locally.
-
-There are several other comands in the package that interact with
-the pipeline, such as `booktopipeline`, which uploads a book to the
-pipeline, and `lspipeline`, which lists important status information about
-how the pipeline is getting along.
-
-There are also several commands which are useful outside of the pipeline
-environment; `confgraph`, `pagegraph` and `pdfbook`. `confgraph` and `pagegraph`
-create a graphs showing the OCR confidence of different parts of a book
-or individual page, given hOCR input. `pdfbook` creates a searchable PDF
-from a directory of hOCR and image files - there are several tools online 
-that could do this, but our `pdfbook` has several great features they lack;
-it can smartly reduce the size and quality of pages while maintaining the
-correct DPI and OCR coordinates, and it uses 'strokeless' text for the invisible
-text layer, which works reliably with all PDF readers.
-
-## [preproc](https://rescribe.xyz/preproc) ([docs](https://pkg.go.dev/rescribe.xyz/preproc))
-
-preproc is a package of image preprocessing tools which we use to prepare
-page images for OCR. They are designed to be very fast, and to work well
-even in the common (for us) case of weird and dirty pages which have been
-scanned badly. Many of the operations take advantage of our
-[integralimg](https://rescribe.xyz/integralimg) ([docs](https://pkg.go.dev/rescribe.xyz/integralimg))
-package, which uses clever mathematics to make the image operations very
-fast.
-
-There are two main commands (plus a number of exported functions to use in
-your own Go projects) in the preproc package, `binarize` and `wipe`, as well
-as a command that combines the two processes together, `preprocess`. The
-`binarize` tool binarises an image; that is, takes a colour or grey image and
-makes it black and white. This sounds simple, but as our
-[binarisation](/categories/binarisation) posts here have described, doing it
-well takes a lot of work, and can make a massive difference to OCR quality.
-The `wipe` tool detects a content area in the page (where the text is), and
-removes everything outside of it. This is important to avoid various scanning
-artifacts or decorative marginalia negatively affecting the final OCR result.
-
-## [utils](https://rescribe.xyz/utils) ([docs](https://pkg.go.dev/rescribe.xyz/utils))
-
-The utils package contains a variety of small utilities and packages that
-we needed. Probably the most useful for others would be the `hocr` package
-(`https://rescribe.xyz/utils/pkg/hocr`), which parses a hOCR file and
-provides several handy functions such as calculating the total confidence
-for a page using word or character level OCR confidences. The `hocrtotxt`
-command is also a very handy simple command to output plain text from a
-hOCR file.
-
-Other useful tools include `eeboxmltohocr`, which converts the XML available
-from the [Early English Books Online](https://eebo.chadwyck.com) project into
-hOCR, `fonttobytes`, which outputs a Go format byte list for a font file
-enabling it to be easily included in a Go binary (as used by `bookpipeline`),
-`dehyphenate` which follows simple rules to dehyphenate a text file, and
-`pare-gt`, which splits some files in a directory containing ground truth
-files (partial page images with a corresponding transcription) out into a
-separate directory, which we use extensively in preparing new OCR training
-sets, in particular to extract good OCR testing data.
-
-## Summary
-
-We are proud of the tools we've written, but inevitably there will be lots
-of shortcomings and missing features. We're very keen for others to try
-them out and let us know what works well and what doesn't, so that we can
-improve them for everyone. OCR is a quite well-established technology, but
-the lack of free and open source tooling around it has held back its
-potential. We hope that by releasing these tools, and explaining how they
-work and can be used, more of the world's historical culture can be more
-accessible, searchable, and used and understood and new ways.
-- 
cgit v1.2.1-24-ge1ad