---
title: "Desktop Tool"
date: 2020-11-11
categories: [software, code, tools]
---
While [our pipeline](/posts/tool-overview) works well for running OCR
over a whole corpus efficiently on cloud servers, it has been hard to
get the same features on your own computer. So we recently spent a bit
of time creating a new tool designed to run self-contained on a desktop
computer. We're calling the tool *rescribe*, because why not? At the
moment it's a command-line-only tool.

## Install and build

*rescribe* is a part of our [bookpipeline](https://rescribe.xyz/bookpipeline)
package, and is written in Go, so there are a few things you need
to install to get it working.

1. Firstly, you need to [download and install the Go tools](https://golang.org/dl/),
which will be used to build and install *rescribe*.
2. Next, you need to install the Tesseract OCR engine, which the
tool uses for the core OCR step. If you're on Linux this should be
available from your package manager. If you're on a Mac,
[follow these instructions](https://tesseract-ocr.github.io/tessdoc/Home.html#macos), and if you're on Windows,
[download and run an installer from this page](https://github.com/UB-Mannheim/tesseract/wiki).
3. Then you'll need to install [git](https://git-scm.com/downloads)
if you don't already have it, so you can get the bookpipeline package.
4. Download an OCR training set for the language you're interested in.
We provide trainings for [Caroline Minuscule](https://manuscriptocr.org),
[early printed Latin](https://latinocr.org) and
[Ancient Greek](https://ancientgreekocr.org).

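Before moving on, it can save some head-scratching to confirm that the
programs from steps 1 to 3 are actually on your `PATH`. Here is a minimal
sketch for a POSIX shell (Linux, macOS, or Git Bash on Windows). The
`missing_tools` helper is a name made up for this post, not part of
*rescribe*, and it assumes the programs are installed under their usual
command names `go`, `tesseract` and `git`.

```shell
# missing_tools is a hypothetical helper, not part of bookpipeline.
# It prints the name of each given command that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Assumes the usual command names; prints nothing if all three are found.
missing_tools go tesseract git
```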
Still here? Great. Now open up a terminal window. Don't worry, it
will be worth it.

1. Clone the latest version of the bookpipeline package:
`git clone https://git.rescribe.xyz/bookpipeline`
2. Change into the bookpipeline directory and build the rescribe tool:
```
cd bookpipeline
go build ./cmd/rescribe
```

Now everything is ready for action, and there will be an executable
inside the bookpipeline directory called *rescribe* (*rescribe.exe*
on Windows).

## Usage

You use *rescribe* by giving it the path of a training file to use
and the directory containing the book or manuscript pages you want
to OCR. Basic usage looks like this:
```
rescribe -t ../trainings/carolinems.traineddata mybook
```
This runs *rescribe* with the training file at
*../trainings/carolinems.traineddata* over all pages in the
directory *mybook*.

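If you have several books, each in its own directory, the same command
can be repeated with a small shell loop. This is only a sketch: the
`ocr_all_books` function, the `RESCRIBE` variable and the `books`
directory are assumptions made for illustration, and the only *rescribe*
option it relies on is the `-t` shown above.

```shell
# ocr_all_books is a hypothetical wrapper, not something rescribe provides.
# Usage: ocr_all_books <training file> <directory of book directories>
# It runs rescribe once per subdirectory, using only the -t option.
ocr_all_books() {
  training=$1
  parent=$2
  for book in "$parent"/*/; do
    [ -d "$book" ] || continue   # nothing to do if there are no subdirectories
    # RESCRIBE can point at the executable; defaults to ./rescribe.
    "${RESCRIBE:-./rescribe}" -t "$training" "${book%/}"
  done
}

# Example call (the "books" directory is an assumption):
# ocr_all_books ../trainings/carolinems.traineddata books
```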
One limitation at the moment is that *rescribe* is very sensitive
to how page images are named. It will only work on pages named
`<anything>0001.png` or `<anything>0001.jpg`, where *`0001`* is any
four-digit number (and *`<anything>`* is anything!).
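
To spot badly named files before starting a long run, the naming rule
above can be sketched as a shell pattern. This is only an approximation
of the rule as described here, not the check *rescribe* itself performs,
and `is_page_name` is a made-up helper name. It is case-sensitive, so it
treats `.PNG` as a mismatch, which may or may not be how the tool behaves.

```shell
# is_page_name is a hypothetical helper approximating the naming rule above:
# it succeeds if the name ends in four digits followed by .png or .jpg.
is_page_name() {
  case "$1" in
    *[0-9][0-9][0-9][0-9].png | *[0-9][0-9][0-9][0-9].jpg) return 0 ;;
    *) return 1 ;;
  esac
}

# List the files in mybook that do not fit the pattern.
for f in mybook/*; do
  [ -e "$f" ] || continue   # skip if the directory is empty or missing
  is_page_name "$f" || echo "does not match the naming pattern: $f"
done
```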