author Nick White <git@njw.name> 2020-11-11 17:32:01 +0000
committer Nick White <git@njw.name> 2020-11-11 17:32:01 +0000
commit 70959ff850ebca6e38c947d03b86efc94f8a8dd4
tree 350f0f41e7d539741b83fcf609a80e21ed1c790e /content/posts
parent cef132388d253dfd75a51b5f94d016ea4cc29b81
Add desktop-tool draft
Diffstat (limited to 'content/posts')
-rw-r--r-- content/posts/desktop-tool/index.md 65
1 files changed, 65 insertions, 0 deletions
diff --git a/content/posts/desktop-tool/index.md b/content/posts/desktop-tool/index.md
new file mode 100644
index 0000000..66c6b9f
--- /dev/null
+++ b/content/posts/desktop-tool/index.md
@@ -0,0 +1,65 @@
+---
+title: "Desktop Tool"
+date: 2020-11-11
+categories: [software, code, tools]
+---
+While [our pipeline](/posts/tool-overview) works well for running OCR
+on a corpus efficiently using cloud servers, it was hard to get the
+features of the pipeline on your own computer. So we recently spent a
+bit of time creating a new tool which is designed to run
+self-contained on a desktop computer. We're calling the tool *rescribe*,
+because why not? At the moment it's a command-line-only tool.
+
+## Install and build
+
+*rescribe* is a part of our [bookpipeline](https://rescribe.xyz/bookpipeline)
+package, and is written in Go, so there are a few things you need
+to install to get it working.
+
+1. Firstly, you need to [download and install the Go tools](https://golang.org/dl/),
+which will be used to build and install *rescribe*.
+2. Next, you need to install the Tesseract OCR engine, which the
+tool uses for the core OCR step. If you're on Linux this should be
+available from your package manager. Otherwise,
+[follow these instructions if you're on a Mac](https://tesseract-ocr.github.io/tessdoc/Home.html#macos), or
+[download and run an installer from this page for Windows](https://github.com/UB-Mannheim/tesseract/wiki).
+3. Then you'll need to install [git](https://git-scm.com/downloads)
+if you don't already have it, so you can get the bookpipeline package.
+4. Download an OCR training set for the language you're interested in.
+We provide trainings for [Caroline Minuscule](https://manuscriptocr.org),
+[early printed Latin](https://latinocr.org) and
+[Ancient Greek](https://ancientgreekocr.org).
+
+Still here? Great. Now open up a terminal window. Don't worry, it
+will be worth it.
+
+1. Clone the latest version of the bookpipeline package:
+`git clone https://git.rescribe.xyz/bookpipeline`
+2. Change into the bookpipeline directory and build the rescribe tool:
+```
+cd bookpipeline
+go build ./cmd/rescribe
+```
+
+Now everything is ready for action, and there will be an executable
+inside the bookpipeline directory called *rescribe* (*rescribe.exe*
+on Windows).
+
+## Usage
+
+You use *rescribe* by giving it the path of a training file to use
+and the directory containing the book or manuscript pages you want
+to OCR. Basic usage looks like this:
+```
+rescribe -t ../trainings/carolinems.traineddata mybook
+```
+This will run *rescribe* with the training file at
+*../trainings/carolinems.traineddata* over all pages in the
+directory *mybook*.
+
+One limitation at the moment is that *rescribe* is very sensitive
+to how page images are named. It will only work on pages named
+`<anything>0001.png` or `<anything>0001.jpg`, where *`0001`* is any
+four-digit number (and *`<anything>`* is anything!).
+
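
Not part of the commit: the page-naming rule in the post's Usage section can be checked before a run. The following is a minimal Go sketch under stated assumptions. The regular expression is only one reading of "`<anything>0001.png` or `<anything>0001.jpg`" and is not taken from the rescribe source. The `validPageName` helper and the idea of a separate checker program are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// pageName is an assumed reading of the naming rule in the post:
// any prefix, then four digits, then ".png" or ".jpg".
// rescribe's own check may differ.
var pageName = regexp.MustCompile(`^.*[0-9]{4}\.(png|jpg)$`)

// validPageName reports whether name fits the assumed pattern.
func validPageName(name string) bool {
	return pageName.MatchString(name)
}

func main() {
	// Check the directory given as the first argument,
	// or the current directory if none is given.
	dir := "."
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}

	entries, err := os.ReadDir(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Report every regular entry whose name does not fit the pattern.
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		if !validPageName(e.Name()) {
			fmt.Println("unexpected page name:", e.Name())
		}
	}
}
```

Saved as its own `main.go`, this could be run as `go run main.go mybook` to list any files in *mybook* whose names do not fit the assumed pattern.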
+