---
title: "A tour of our tools"
date: 2020-06-15
categories: [binarisation, preprocessing, software, code, tools]
---
All of the code that powers our OCR work is released as open source,
freely available for anyone to take, use, and build on as they see
fit. This is a brief tour of some of the software we have written to
help with our mission of creating machine transcriptions of early printed
books.

Almost all of our development these days is in the [Go](https://golang.org)
language, because we love it. That means that all of the tools we'll
discuss should work fine on any platform: Linux, Mac, Windows or anything
else supported by the Go tooling. We rely on the
[Tesseract OCR engine](https://github.com/tesseract-ocr/tesseract) with
specially crafted training sets for our OCR, but in reality a lot of things
have to happen before and after the recognition step to ensure high quality
output for difficult cases like historical printed works. These pre- and
post-processing steps, along with the tooling that automatically manages and
combines them into a fast and reliable pipeline, have been the focus of much
of our development work, and they are the focus of this post.

The tools are split across several different Go packages, which each
contain some tools in the `cmd/` directory, and some shared library
functions. They are all thoroughly documented, and the documentation can
be read online at [pkg.go.dev](https://pkg.go.dev).

## [bookpipeline](https://rescribe.xyz/bookpipeline) ([docs](https://pkg.go.dev/rescribe.xyz/bookpipeline))

The central package behind our work is bookpipeline, which is also the
name of the main command in our armoury. The `bookpipeline` command brings
together most of the preprocessing, OCR and postprocessing tools we have
and ties them to the cloud computing infrastructure we use, so that book
processing can be easily scaled to run simultaneously on as many servers
as we need, with strong fault tolerance, redundancy, and all that fun
stuff. It does this by organising the different tasks into queues, which
are checked regularly so that tasks are picked up and run in an appropriate
order.
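
Here's a rough sketch of that queue-driven design in Go. None of the type or
queue names below come from bookpipeline itself; it's just an illustration of
how a worker checks the queues in order and processes what it finds.

```go
// A minimal, hypothetical sketch of a queue-driven worker loop.
package main

import "fmt"

// Task identifies a book waiting to be processed. In the real pipeline a
// task also says where to fetch the book's files from.
type Task struct {
	Book string
}

func main() {
	// Queues are checked in a fixed order, so a book's earlier stages are
	// finished before later stages run on their output.
	queues := []struct {
		stage string
		tasks []Task
	}{
		{"preprocess", []Task{{Book: "example-book"}}},
		{"ocr", nil},
		{"analyse", nil},
	}

	for _, q := range queues {
		for _, t := range q.tasks {
			// A real worker downloads the book, runs the stage, uploads the
			// results, adds a task to the next queue, and only then deletes
			// this one, so a crashed server leaves the task for another
			// server to retry.
			fmt.Printf("%s: would process %s\n", q.stage, t.Book)
		}
	}
}
```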

If you want to run the full pipeline yourself you'll have to set
the appropriate account details in `cloud.go` in the package, or you can
just try the 'local' connection type to get a reduced-functionality version
that runs entirely on your own machine.

As with all of our commands, you can run bookpipeline with the `-h` flag to
get an overview of how to use it, like this: `bookpipeline -h`.

There are several other commands in the package that interact with
the pipeline, such as `booktopipeline`, which uploads a book to the
pipeline, and `lspipeline`, which lists important status information about
how the pipeline is getting along.

There are also several commands which are useful outside of the pipeline
environment: `confgraph`, `pagegraph` and `pdfbook`. `confgraph` and `pagegraph`
create graphs showing the OCR confidence of different parts of a book
or individual page, given hOCR input. `pdfbook` creates a searchable PDF
from a directory of hOCR and image files. There are other tools online that
can do this, but our `pdfbook` has several great features they lack:
it can smartly reduce the size and quality of pages while maintaining the
correct DPI and OCR coordinates, and it uses 'strokeless' text for the invisible
text layer, which works reliably with all PDF readers.

## [preproc](https://rescribe.xyz/preproc) ([docs](https://pkg.go.dev/rescribe.xyz/preproc))

preproc is a package of image preprocessing tools which we use to prepare
page images for OCR. They are designed to be very fast, and to work well
even in the common (for us) case of weird and dirty pages which have been
scanned badly. Many of the operations take advantage of our
[integral](https://rescribe.xyz/integral) ([docs](https://pkg.go.dev/rescribe.xyz/integral))
package, which uses integral images (summed-area tables) to make the image
operations very fast.
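
To give a flavour of why that helps, here's a small self-contained sketch of
the integral image (summed-area table) idea. It isn't the integral package's
own API, but it shows the trick: after one pass over the image, the sum of
any rectangular region can be read off with just four lookups, however large
the region is.

```go
// An illustrative summed-area table; not the integral package itself.
package main

import "fmt"

// newIntegral builds a table where in[y+1][x+1] holds the sum of all pixels
// above and to the left of (x, y), inclusive.
func newIntegral(pix [][]int) [][]int {
	h, w := len(pix), len(pix[0])
	in := make([][]int, h+1)
	for y := range in {
		in[y] = make([]int, w+1)
	}
	for y := 0; y < h; y++ {
		for x := 0; x < w; x++ {
			in[y+1][x+1] = pix[y][x] + in[y][x+1] + in[y+1][x] - in[y][x]
		}
	}
	return in
}

// sum returns the sum of the rectangle with corners (x0, y0) and (x1, y1),
// inclusive, using only four lookups regardless of the rectangle's size.
func sum(in [][]int, x0, y0, x1, y1 int) int {
	return in[y1+1][x1+1] - in[y0][x1+1] - in[y1+1][x0] + in[y0][x0]
}

func main() {
	pix := [][]int{
		{1, 2, 3},
		{4, 5, 6},
		{7, 8, 9},
	}
	in := newIntegral(pix)
	fmt.Println(sum(in, 0, 0, 2, 2)) // 45: the whole image
	fmt.Println(sum(in, 1, 1, 2, 2)) // 28: the bottom-right 2x2 block
}
```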

There are two main commands (plus a number of exported functions to use in
your own Go projects) in the preproc package, `binarize` and `wipe`, as well
as a command that combines the two processes together, `preprocess`. The
`binarize` tool binarises an image; that is, takes a colour or grey image and
makes it black and white. This sounds simple, but as our
[binarisation](/categories/binarisation) posts here have described, doing it
well takes a lot of work, and can make a massive difference to OCR quality.
The `wipe` tool detects a content area in the page (where the text is), and
removes everything outside of it. This is important to prevent noise in the
margins from negatively affecting the final OCR result.
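
As a toy illustration of what binarisation means, here's a deliberately naive
sketch that thresholds a greyscale image at a fixed level using only the Go
standard library. The real `binarize` tool uses much more sophisticated
adaptive methods, as our earlier posts describe, and the filenames here are
just placeholders.

```go
// A naive global-threshold binariser, for illustration only.
package main

import (
	"image"
	"image/color"
	"image/png"
	"log"
	"os"
)

func main() {
	f, err := os.Open("page.png") // placeholder input file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	img, _, err := image.Decode(f)
	if err != nil {
		log.Fatal(err)
	}

	// A fixed global threshold; real binarisation picks this adaptively,
	// often per region of the page.
	const threshold = 128

	b := img.Bounds()
	out := image.NewGray(b)
	for y := b.Min.Y; y < b.Max.Y; y++ {
		for x := b.Min.X; x < b.Max.X; x++ {
			g := color.GrayModel.Convert(img.At(x, y)).(color.Gray)
			if g.Y < threshold {
				out.SetGray(x, y, color.Gray{Y: 0}) // ink
			} else {
				out.SetGray(x, y, color.Gray{Y: 255}) // background
			}
		}
	}

	o, err := os.Create("page-binarised.png")
	if err != nil {
		log.Fatal(err)
	}
	defer o.Close()
	if err := png.Encode(o, out); err != nil {
		log.Fatal(err)
	}
}
```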

## [utils](https://rescribe.xyz/utils) ([docs](https://pkg.go.dev/rescribe.xyz/utils))

The utils package contains a variety of small utilities and packages that
we needed. Probably the most useful for others would be the `hocr` package
(`https://rescribe.xyz/utils/pkg/hocr`), which parses a hOCR file and
provides several handy functions such as calculating the total confidence
for a page using word or character level OCR confidences. The `hocrtotxt`
command is also handy; it simply outputs the plain text from a hOCR file.
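
If you haven't met the format before, hOCR is just HTML with some extra
attributes, and each word's confidence lives in its `title` attribute as an
`x_wconf` value. This rough sketch shows the idea of averaging those values
for a page; it isn't the `hocr` package itself, which parses the format
properly rather than pattern matching it, and the filename is a placeholder.

```go
// Extract x_wconf word confidences from a hOCR file with a regular
// expression and print their mean.
package main

import (
	"fmt"
	"log"
	"os"
	"regexp"
	"strconv"
)

var wconfRe = regexp.MustCompile(`x_wconf ([0-9.]+)`)

func main() {
	b, err := os.ReadFile("page.hocr") // placeholder input file
	if err != nil {
		log.Fatal(err)
	}

	matches := wconfRe.FindAllStringSubmatch(string(b), -1)
	if len(matches) == 0 {
		log.Fatal("no word confidences found")
	}

	var total float64
	for _, m := range matches {
		c, err := strconv.ParseFloat(m[1], 64)
		if err != nil {
			log.Fatal(err)
		}
		total += c
	}
	fmt.Printf("average word confidence: %.2f\n", total/float64(len(matches)))
}
```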

Other useful tools include `eeboxmltohocr`, which converts the XML available
from the [Early English Books Online](https://eebo.chadwyck.com) project into
hOCR, `fonttobytes`, which outputs a Go format byte list for a font file
enabling it to be easily included in a Go binary (as used by `bookpipeline`),
`dehyphenate`, which follows simple rules (of the kind sketched below) to
dehyphenate a text file, and `pare-gt`, which moves a portion of the files in
a directory of ground truth (partial page images with corresponding
transcriptions) into a separate directory; we use it extensively when
preparing new OCR training sets, in particular to set aside good OCR testing
data.
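
The dehyphenation rules are simple enough to sketch in a few lines. This
isn't the actual `dehyphenate` code, but it shows the kind of rule involved:
a line ending in a hyphen has its last word joined to the first word of the
following line.

```go
// Join words broken across lines by a trailing hyphen, reading from stdin.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	var lines []string
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}

	for i := 0; i < len(lines); i++ {
		line := lines[i]
		// If a line ends in a hyphen, glue it to the first word of the
		// next line and drop that word from the next line.
		if strings.HasSuffix(line, "-") && i+1 < len(lines) {
			next := strings.Fields(lines[i+1])
			if len(next) > 0 {
				line = strings.TrimSuffix(line, "-") + next[0]
				lines[i+1] = strings.Join(next[1:], " ")
			}
		}
		fmt.Println(line)
	}
}
```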

## Summary

We are proud of the tools we've written, but there will inevitably be plenty
of shortcomings and missing features. We're very keen for others to try
them out and let us know what works well and what doesn't, so that we can
improve them for everyone. While we have written everything to be as robust
and correct as possible, there are sure to be plenty of bugs lurking, so
please [do let us know](mailto:info@rescribe.xyz) if anything doesn't work
right or needs to be explained better, if there are features you would find
useful, or anything else you want to share. And of course, if you want to
share patches fixing or changing things, that would be even better! We hope
that by releasing these tools, and explaining how they work and can be used,
more of the world's historical culture can be made accessible and searchable,
and be used and understood in new ways.