// Copyright 2020 Nick White.
// Use of this source code is governed by the GPLv3
// license that can be found in the LICENSE file.

/*
Package bookpipeline contains various tools and functions for the OCR of
books, with a focus on distributed OCR using short-lived virtual servers.

Introduction

The book pipeline is a way to split the different processes that make up book OCR
into small jobs, which can be processed when a computer is ready for them. It
is currently implemented with Amazon's AWS cloud systems, and can scale from
zero to many computers, with jobs being processed faster when more servers are
available.

The central piece of software is the bookpipeline command, which is part of
the rescribe.xyz/bookpipeline package. Presuming you have the go tools
installed, you can install it, along with useful tools to control the system,
with this command:
  go get -u rescribe.xyz/bookpipeline/...

All of the tools provided in the bookpipeline package describe what they do
and how they work when given the '-h' flag, so for example to get usage
information on the booktopipeline tool simply run the following:
  booktopipeline -h

You'll also need to set up your ~/.aws/credentials appropriately so that the
tools work.
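
The credentials file uses the standard AWS format; it should contain
something like the following, where the values shown are the standard AWS
documentation placeholders, to be replaced with your own keys:
  [default]
  aws_access_key_id = AKIAIOSFODNN7EXAMPLE
  aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY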

Managing servers

Most of the time the bookpipeline is expected to be run from potentially
short-lived servers on Amazon's EC2 system. EC2 provides servers with no
guarantee of stability (though in practice they tend to be stable), called
"Spot Instances", which we use for bookpipeline. bookpipeline can handle a
process or server being suddenly destroyed without warning (more on this
later), so Spot Instances are perfect for us. We have set up a machine image
with bookpipeline preinstalled and set to launch at bootup, which is all
that's needed to run a bookpipeline instance. Presuming the bookpipeline
package has been installed on your computer (see above), a spot instance can
be started with the command:
  spotme

You can keep an eye on the servers (spot or otherwise) that are running, and
the jobs left to do and in progress, with the "lspipeline" tool (which is
also part of the bookpipeline package). It's recommended to use this with
the ssh private key for the servers, so that it can also report on what each
server is currently doing, but it can run successfully without it. It takes
a little while to run, so be patient. It can be run with the command:
  lspipeline -i key.pem

Spot instances can be terminated over ssh, using their IP address, which can
be found with lspipeline, like so:
  ssh -i key.pem admin@<ip-address> sudo poweroff

The bookpipeline program is run as a service managed by systemd on the
servers. The system is fully resilient in the face of unexpected failures.
See the section "How the pipeline works" for details on this. bookpipeline
can be managed like any other systemd service. A few examples:
  # show all logs for bookpipeline:
  ssh -i key.pem admin@<ip-address> journalctl -n all -u bookpipeline
  # restart bookpipeline
  ssh -i key.pem admin@<ip-address> systemctl restart bookpipeline

Using the pipeline

Books can be added to the pipeline using the "booktopipeline" tool. This
takes a directory of page images as input, and uploads them all to S3, adding
a job to the pipeline queue to start processing them. So it can be used like
this:
  booktopipeline -v ExcellentBook/

Getting a finished book

Once a book has been finished, it can be downloaded using the
"getpipelinebook" tool. This has several options to download specific parts
of a book, but the default case will download the best hOCR for each page,
PDFs, and the best, conf and graph.png files. Use it like this:
  getpipelinebook ExcellentBook

To get the plain text from the book, use the hocrtotxt tool, which is part
of the rescribe.xyz/utils package. You can get the package, and run the tool,
like this:
  go get -u rescribe.xyz/utils/...
  hocrtotxt ExcellentBook/0010_bin0.2.hocr > ExcellentBook/0010_bin0.2.txt

How the pipeline works

The central part of the book pipeline is several SQS queues, which contain
jobs which need to be done by a server running bookpipeline. The exact content
of the SQS messages varies from queue to queue, as some jobs need more
information than others. Each queue is checked at least once every couple of
minutes on any server that isn't currently processing a job.
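
As an illustration of what checking a queue involves, here is a minimal
sketch of receiving a message using the aws-sdk-go package; the function
name and structure are assumptions for illustration, not the exact code
bookpipeline uses:
  import (
          "github.com/aws/aws-sdk-go/aws"
          "github.com/aws/aws-sdk-go/aws/session"
          "github.com/aws/aws-sdk-go/service/sqs"
  )

  // checkQueue long polls a queue for up to 20 seconds, returning
  // the first message found, or nil if there was none.
  func checkQueue(queueURL string) (*sqs.Message, error) {
          conn := sqs.New(session.Must(session.NewSession()))
          resp, err := conn.ReceiveMessage(&sqs.ReceiveMessageInput{
                  QueueUrl:            aws.String(queueURL),
                  MaxNumberOfMessages: aws.Int64(1),
                  WaitTimeSeconds:     aws.Int64(20),
          })
          if err != nil || len(resp.Messages) == 0 {
                  return nil, err
          }
          return resp.Messages[0], nil
  }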

When a job is taken from the queue by a process, it is hidden from the queue
for 2 minutes so that no other process can take it. While processing a job,
the process sends a message to the queue once per minute telling it to keep
the job hidden for another 2 minutes. This is called the "heartbeat": if the
process fails for any reason the heartbeat will stop, and within 2 minutes
the job will reappear on the queue for another process to have a go at. Once
a job is completed successfully it is deleted from the queue.
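
Continuing the sketch above (with "time" added to the imports), the
heartbeat could look something like this, using ChangeMessageVisibility and
DeleteMessage from the same aws-sdk-go package; again this is an
illustrative sketch rather than the exact code bookpipeline uses, and error
handling is elided for brevity:
  // heartbeat keeps a message hidden from other processes until
  // done is closed, at which point the message is deleted.
  func heartbeat(conn *sqs.SQS, queueURL string, msg *sqs.Message, done chan struct{}) {
          t := time.NewTicker(time.Minute)
          defer t.Stop()
          for {
                  select {
                  case <-done:
                          // job finished successfully; remove it from the queue
                          conn.DeleteMessage(&sqs.DeleteMessageInput{
                                  QueueUrl:      aws.String(queueURL),
                                  ReceiptHandle: msg.ReceiptHandle,
                          })
                          return
                  case <-t.C:
                          // keep the job hidden for another 2 minutes
                          conn.ChangeMessageVisibility(&sqs.ChangeMessageVisibilityInput{
                                  QueueUrl:          aws.String(queueURL),
                                  ReceiptHandle:     msg.ReceiptHandle,
                                  VisibilityTimeout: aws.Int64(120),
                          })
                  }
          }
  }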

Queues

rescribepreprocess

Each message in the rescribepreprocess queue is a bookname, optionally
followed by a space and the name of the training to use. Each page of the
bookname will be binarised with several different parameters and then
wiped, with each version uploaded to S3; the path of each preprocessed
page, plus the training name if it was provided, is then added to the
rescribeocrpage queue. The pages are binarised with different parameters
because it can be difficult to determine which binarisation level will be
best prior to OCR, so several different options are used, and in the
rescribeanalyse step the best one is chosen, based on the confidence of the
OCR output.

  example message: APolishGentleman_MemoirByAdamKruczkiewicz
  example message: APolishGentleman_MemoirByAdamKruczkiewicz rescribelatv7

rescribewipeonly

This queue works the same as rescribepreprocess, except that it doesn't
binarise the pages, only runs the wiper. Hence it is designed for books
which have been prebinarised.

  example message: APolishGentleman_MemoirByAdamKruczkiewicz
  example message: APolishGentleman_MemoirByAdamKruczkiewicz rescribefrav2

rescribeocr

This queue is no longer used, as it could result in processes that took more
than 12 hours to complete, which was unreliable with SQS. Instead pages are
submitted individually to the rescribeocrpage queue by the preprocess and wipe
functions, which has the added advantage that different pages can be processed
in parallel on different servers, enabling books to be processed significantly
faster. The code for processing books from the rescribeocr queue is still
present in bookpipeline, and the queue is still checked, but it is not
expected to be used.

  example message: APolishGentleman_MemoirByAdamKruczkiewicz
  example message: APolishGentleman_MemoirByAdamKruczkiewicz rescribefrav2

rescribeocrpage

This queue contains the path of individual pages, optionally followed by
a space and the name of the training to use. Each page is OCRed, and the
results are uploaded to S3. After each page is OCRed, a check is made to
see whether all pages that look like they were preprocessed have
corresponding .hocr files. If so, the bookname is added to the
rescribeanalyse queue.

  example message: APolishGentleman_MemoirByAdamKruczkiewicz/00162_bin0.0.png
  example message: APolishGentleman_MemoirByAdamKruczkiewicz/00162_bin0.0.png rescribelatv7
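
The completeness check can be thought of as follows: given a listing of a
book's files on S3, every page image that looks preprocessed (such as the
_bin files in the examples above) must have a matching .hocr file. A minimal
sketch, using the standard "strings" package, and assuming such a listing is
available as a slice of names:
  // complete reports whether every preprocessed page image in
  // names has a corresponding .hocr file.
  func complete(names []string) bool {
          have := make(map[string]bool)
          for _, n := range names {
                  have[n] = true
          }
          for _, n := range names {
                  if strings.Contains(n, "_bin") && strings.HasSuffix(n, ".png") {
                          if !have[strings.TrimSuffix(n, ".png")+".hocr"] {
                                  return false
                          }
                  }
          }
          return true
  }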

rescribeanalyse

A message on the rescribeanalyse queue contains only a book name. The
confidences for each page are calculated and saved in the 'conf' file, and
the best version of each page is decided upon and saved in the 'best' file.
PDFs and a confidence graph are then generated.

  example message: APolishGentleman_MemoirByAdamKruczkiewicz

Queue manipulation

The queues should generally only be messed with by the bookpipeline and
booktopipeline tools, but if you're feeling ambitious you can take a look at
a couple of tools:
  - addtoqueue
  - unstickocr

Remember that messages in a queue are hidden for a few minutes when they are
read, so for example you couldn't straightforwardly delete a message which was
currently being processed by a server, as you wouldn't be able to see it.

Page naming

At present the bookpipeline has some silly limitations on file names for book
pages to be recognised. This is something which will be fixed in due course.
  Pages that are to be fully processed: *[0-9]{4}.jpg$
  Pages that are to be wiped only: *[0-9]{6}(.bin)?.png$
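
For example, a page named 0010.jpg will be fully processed, while pages
named 000123.png or 000123.bin.png will be wiped only.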
*/
package bookpipeline