---
title: "A tour of our tools"
date: 2020-06-01
categories: [binarisation, preprocessing, software, code, tools]
---
All of the code that powers our OCR work is released as open source,
freely available for anyone to take, use, and build on as they see
fit. However, until very recently we hadn't taken the time to document
it all and make it easily discoverable. That has now changed, so we
thought it would be an ideal time to give a brief tour of some of the
software we have written to help with our mission of creating machine
transcriptions of early printed books.

Almost all of our development these days is in the [Go](https://golang.org)
language, because we love it. That means that all of the tools we'll
discuss should work fine on any platform: Linux, Mac, Windows, or anything
else supported by the Go tooling. While we've written everything to be as
robust as possible, there are sure to be plenty of bugs lurking,
so please [do let us know](mailto:info@rescribe.xyz) if anything doesn't
work right, if there are features you would find useful, or anything else
you want to share. And of course, if you want to share patches fixing or
changing things, that would be even better!

The tools are split across several different Go packages, which each
contain some tools in the `cmd/` directory, and some shared library
functions. They are all thoroughly documented, and the documentation can
be read online at [pkg.go.dev](https://pkg.go.dev).

## [bookpipeline](https://rescribe.xyz/bookpipeline) ([docs](https://pkg.go.dev/rescribe.xyz/bookpipeline))

The central package behind our work is bookpipeline, which is also the
name of the main command in our armoury. The `bookpipeline` command brings
together most of the preprocessing, OCR and postprocessing tools we have
and ties them to the cloud computing infrastructure we use, so that book
processing can be easily scaled to run simultaneously on as many servers
as we need, with strong fault tolerance, redundancy, and all that fun
stuff. If you want to run the full pipeline yourself you'll have to set
the appropriate account details in `cloud.go` in the package, or you can
try the 'local' connection type to get a reduced-functionality version
that runs entirely locally.

There are several other commands in the package that interact with
the pipeline, such as `booktopipeline`, which uploads a book to the
pipeline, and `lspipeline`, which lists important status information about
how the pipeline is getting along.

There are also several commands which are useful outside of the pipeline
environment: `confgraph`, `pagegraph` and `pdfbook`. `confgraph` and `pagegraph`
create graphs showing the OCR confidence of different parts of a book
or individual page, given hOCR input. `pdfbook` creates a searchable PDF
from a directory of hOCR and image files. There are several tools online
that can do this, but our `pdfbook` has several great features they lack:
it can smartly reduce the size and quality of pages while maintaining the
correct DPI and OCR coordinates, and it uses 'strokeless' text for the invisible
text layer, which works reliably with all PDF readers.

## [preproc](https://rescribe.xyz/preproc) ([docs](https://pkg.go.dev/rescribe.xyz/preproc))

preproc is a package of image preprocessing tools which we use to prepare
page images for OCR. They are designed to be very fast, and to work well
even in the common (for us) case of weird and dirty pages which have been
scanned badly. Many of the operations take advantage of our
[integralimg](https://rescribe.xyz/integralimg) ([docs](https://pkg.go.dev/rescribe.xyz/integralimg))
package, which uses clever mathematics to make the image operations very
fast.

There are two main commands in the preproc package, `binarize` and `wipe`
(plus a number of exported functions to use in your own Go projects), as
well as a command that combines the two, `preprocess`. The
`binarize` tool binarises an image; that is, it takes a colour or greyscale image and
makes it black and white. This sounds simple, but as our
[binarisation](/categories/binarisation) posts here have described, doing it
well takes a lot of work, and can make a massive difference to OCR quality.
The `wipe` tool detects a content area in the page (where the text is), and
removes everything outside of it. This is important to avoid various scanning
artifacts or decorative marginalia negatively affecting the final OCR result.

## [utils](https://rescribe.xyz/utils) ([docs](https://pkg.go.dev/rescribe.xyz/utils))

The utils package contains a variety of small utilities and packages that
we needed. Probably the most useful for others would be the `hocr` package
(`https://rescribe.xyz/utils/pkg/hocr`), which parses a hOCR file and
provides several handy functions such as calculating the total confidence
for a page using word- or character-level OCR confidences. The `hocrtotxt`
command is also a handy way to get the plain text out of a
hOCR file.

Other useful tools include `eeboxmltohocr`, which converts the XML available
from the [Early English Books Online](https://eebo.chadwyck.com) project into
hOCR; `fonttobytes`, which outputs a Go-format byte list for a font file,
enabling it to be easily embedded in a Go binary (as used by `bookpipeline`);
`dehyphenate`, which follows simple rules to dehyphenate a text file; and
`pare-gt`, which splits some of the files in a directory of ground truth
(partial page images with corresponding transcriptions) out into a
separate directory. We use `pare-gt` extensively in preparing new OCR
training sets, in particular to extract good OCR testing data.

## Summary

We are proud of the tools we've written, but inevitably there will be lots
of shortcomings and missing features. We're very keen for others to try
them out and let us know what works well and what doesn't, so that we can
improve them for everyone. OCR is a well-established technology, but the
lack of free and open source tooling around it has held back its
potential. We hope that by releasing these tools, and explaining how they
work and can be used, more of the world's historical culture can be made
accessible, searchable, and used and understood in new ways.