---
title: "Desktop Tool"
date: 2020-11-11
categories: [software, code, tools]
---
While [our pipeline](/posts/tool-overview) works well for running OCR
over a corpus efficiently using cloud servers, it was hard to get the
features of the pipeline on your own computer. So we recently spent a
bit of time creating a new tool which is designed to run
self-contained on a desktop computer. We're calling the tool *rescribe*,
because why not? At the moment it's a command-line-only tool.

## Install dependencies

*rescribe* is a part of our [bookpipeline](https://rescribe.xyz/bookpipeline)
package, and we provide pre-built executables for it, which can be
downloaded for each platform here:

* [Linux](https://rescribe.xyz/rescribe/0.3.0/rescribe)
* [OS X](https://rescribe.xyz/rescribe/0.3.0/osx/rescribe)
* [Windows](https://rescribe.xyz/rescribe/0.3.0/rescribe.exe)

Note that if you're on Linux or OS X you will probably need to run
`chmod +x rescribe` after downloading, to make it executable.

Next, you need to install the Tesseract OCR engine, which the
tool uses for the core OCR step. If you're on Linux this should be
available from your package manager. If you're on a Mac,
[follow these instructions](https://tesseract-ocr.github.io/tessdoc/Home.html#macos), and if you're on Windows,
[download and run an installer from this page](https://github.com/UB-Mannheim/tesseract/wiki).
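
If you want to confirm that Tesseract is reachable before going any further, a small check along these lines may help. It is only a sketch for a Linux or OS X shell; the function names `have_cmd` and `check_tesseract` are made up for this example and are not part of *rescribe*.

```shell
#!/bin/sh
# have_cmd NAME: succeed if NAME can be found on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# check_tesseract: print the installed Tesseract version, or a hint if
# the command cannot be found.
check_tesseract() {
    if have_cmd tesseract; then
        # Older Tesseract releases print the version on stderr, so merge both streams.
        tesseract --version 2>&1 | head -n 1
    else
        echo "tesseract not found on PATH; install it first" >&2
        return 1
    fi
}
```

Running `check_tesseract` after loading these functions should either print a version line or tell you that the install step still needs doing.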

Finally, you will need to download an OCR training set for the
language / script you're interested in. We provide trainings for
[Caroline Minuscule](https://manuscriptocr.org),
[early printed Latin](https://latinocr.org) and
[Ancient Greek](https://ancientgreekocr.org). Any other Tesseract
OCR training set will also work fine.

## Usage

Still here? Great. Now open up a terminal window. Don't worry, it
will be worth it. If you're on Windows, you can type `cmd.exe` into
the run box. On OS X it's under Applications -> Utilities -> Terminal,
and if you're on Linux I bet you already know where to find your
terminal.

You use *rescribe* by giving it the path of a training file to use
and the directory containing the book or manuscript pages you want
to OCR. Basic usage looks like this:
```
./rescribe -t ../trainings/carolinems.traineddata mybook
```
This will run *rescribe* with the training at
*../trainings/carolinems.traineddata* over all pages in the
directory *mybook*. A successful run will add several new files to
*mybook*:

* A PDF file named after the directory (`mybook.pdf` in the above
  example), which is fully searchable.
* A `text` directory, containing plain text versions of the OCR
  results for each page.
* A `hocr` directory, containing hOCR formatted OCR results for each
  page.
* A `graph.png` file, which shows the OCR confidence of each page (a
  rough indicator of the quality of the OCR over the book).
* A `conf` file, which lists the OCR confidence of each page, at each
  preprocessing [binarisation threshold](/posts/adaptive-binarisation)
  attempted.
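
To see at a glance whether a run produced everything in the list above, you could use a check like the following. This is a sketch rather than anything *rescribe* provides: `check_outputs` is a name invented here, and it assumes all of the outputs sit directly inside the book directory, as described above.

```shell
#!/bin/sh
# check_outputs DIR: report which of the expected rescribe outputs exist
# in DIR. Returns non-zero if any of them is missing.
check_outputs() {
    dir=${1%/}                  # drop a trailing slash, if any
    name=$(basename "$dir")     # the PDF is named after the directory
    missing=0
    # Outputs that should be regular files.
    for f in "$dir/$name.pdf" "$dir/graph.png" "$dir/conf"; do
        if [ -f "$f" ]; then
            echo "ok       $f"
        else
            echo "missing  $f"
            missing=1
        fi
    done
    # Outputs that should be directories.
    for d in "$dir/text" "$dir/hocr"; do
        if [ -d "$d" ]; then
            echo "ok       $d/"
        else
            echo "missing  $d/"
            missing=1
        fi
    done
    return "$missing"
}
```

For the example above you would call it as `check_outputs mybook`.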

## Limitations

One limitation at the moment is that *rescribe* is very sensitive
to how page images are named. It will only work on pages named
`<anything>0001.png` or `<anything>0001.jpg`, where *`0001`* is any
four-digit number (and *`<anything>`* is anything!).
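
Before starting a long run, it may be worth checking a directory for images that don't fit this naming scheme. The sketch below takes the pattern literally (four digits immediately before a lowercase `.png` or `.jpg` extension), which is an assumption about how strict *rescribe* is. The function names `is_page_name` and `list_bad_names` are invented for this example.

```shell
#!/bin/sh
# is_page_name FILE: succeed if the file name ends in four digits
# followed by .png or .jpg.
is_page_name() {
    case "$(basename "$1")" in
        *[0-9][0-9][0-9][0-9].png|*[0-9][0-9][0-9][0-9].jpg) return 0 ;;
        *) return 1 ;;
    esac
}

# list_bad_names DIR: print every .png or .jpg file in DIR whose name
# does not fit the pattern. Files with other extensions are not examined.
list_bad_names() {
    for f in "$1"/*.png "$1"/*.jpg; do
        [ -e "$f" ] || continue    # glob matched nothing
        is_page_name "$f" || echo "$f"
    done
}
```

Running `list_bad_names mybook` prints nothing if every PNG and JPEG page in *mybook* is named in a way the tool should accept.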

There are likely to be bugs! [Let us know](mailto:info@rescribe.xyz)
of any issues you have, any features you'd like, or just that you're
enjoying using it!