author | Nick White <git@njw.name> | 2020-11-11 17:32:01 +0000 |
---|---|---|
committer | Nick White <git@njw.name> | 2020-11-11 17:32:01 +0000 |
commit | 70959ff850ebca6e38c947d03b86efc94f8a8dd4 (patch) | |
tree | 350f0f41e7d539741b83fcf609a80e21ed1c790e /content/posts | |
parent | cef132388d253dfd75a51b5f94d016ea4cc29b81 (diff) |
Add desktop-tool draft
Diffstat (limited to 'content/posts')
-rw-r--r-- | content/posts/desktop-tool/index.md | 65 |
1 files changed, 65 insertions, 0 deletions
diff --git a/content/posts/desktop-tool/index.md b/content/posts/desktop-tool/index.md
new file mode 100644
index 0000000..66c6b9f
--- /dev/null
+++ b/content/posts/desktop-tool/index.md
@@ -0,0 +1,65 @@
+---
+title: "Desktop Tool"
+date: 2020-11-11
+categories: [software, code, tools]
+---
+While [our pipeline](/posts/tool-overview) works well for OCR of
+a corpus efficiently using cloud servers, it was hard to get the
+features of the pipeline on your own computer. So we spent a bit of
+time recently creating a new tool which is designed to run self-
+contained on a desktop computer. We're calling the tool *rescribe*,
+because why not? At the moment it's a command line only tool.
+
+## Install and build
+
+*rescribe* is a part of our [bookpipeline](https://rescribe.xyz/bookpipeline)
+package, and is written in Go, so there are a few things you need
+to install to get it working.
+
+1. Firstly, you need to [download and install the Go tools](https://golang.org/dl/),
+which will be used to build and install *rescribe*.
+2. Next, you need to install the Tesseract OCR engine, which the
+tool uses for the core OCR step. If you're on Linux this should be
+available from your package manager,
+[follow these instructions if you're on a Mac](https://tesseract-ocr.github.io/tessdoc/Home.html#macos), or
+[download and run an installer from this page for Windows](https://github.com/UB-Mannheim/tesseract/wiki).
+3. Then you'll need to install [git](https://git-scm.com/downloads)
+if you don't already have it, so you can get the bookpipeline package.
+4. Download an OCR training set for the language you're interested in.
+We provide trainings for [Caroline Miniscule](https://manuscriptocr.org),
+[early printed Latin](https://latinocr.org) and
+[Ancient Greek](https://ancientgreekocr.org).
+
+Still here? Great. Now open up a terminal window. Don't worry, it
+will be worth it.
+
+1. Clone the latest version of the bookpipeline package:
+`git clone https://git.rescribe.xyz/bookpipeline`
+2. Change into the bookpipeline directory and build the rescribe tool:
+```
+cd bookpipeline
+go build ./cmd/rescribe
+```
+
+Now everything is ready for action, and there will be an executable
+inside the bookpipeline directory called *rescribe* (*rescribe.exe*
+on Windows).
+
+## Usage
+
+You use *rescribe* by giving it the path of a training file to use
+and the directory containing the book or manuscript pages you want
+to OCR. Basic usage looks like this:
+```
+rescribe -t ../trainings/carolinems.traineddata mybook
+```
+This will run rescribe with a training at
+*../trainings/carolinems.traineddata* over all pages in the
+directory *mybook*.
+
+One limitation at the moment is that *rescribe* is very sensitive
+to how page images are named. It will only work on pages named
+`<anything>0001.png` or `<anything>0001.jpg`, where *`0001`* is any
+four digit number (and *`<anything>`* is anything!).
+
+
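
The page-naming limitation described at the end of the post can be checked before running the tool. The following is a minimal sketch in Go, not code taken from bookpipeline: the `pageName` pattern and the `isPageName` helper are hypothetical names, and the pattern assumes a literal reading of the rule (any prefix, four digits immediately before a lowercase `.png` or `.jpg` extension). Whether *rescribe* itself accepts uppercase extensions is not stated in the post, so this sketch rejects them.

```go
package main

import (
	"fmt"
	"regexp"
)

// pageName is an assumed pattern for the naming rule in the post:
// any prefix, then four digits, then ".png" or ".jpg".
// It is an illustration only and is not rescribe's own matching logic.
var pageName = regexp.MustCompile(`^.*[0-9]{4}\.(png|jpg)$`)

// isPageName reports whether a file name ends in four digits
// followed by a lowercase .png or .jpg extension.
func isPageName(name string) bool {
	return pageName.MatchString(name)
}

func main() {
	// Example file names (hypothetical) to show which ones fit the rule.
	names := []string{"page0001.png", "0042.jpg", "scan_12.png", "page0001.tiff"}
	for _, n := range names {
		fmt.Printf("%-14s %v\n", n, isPageName(n))
	}
}
```

Because the prefix may be anything, a name with more than four trailing digits also matches under this reading, since the extra leading digits count as part of the prefix.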