From 70959ff850ebca6e38c947d03b86efc94f8a8dd4 Mon Sep 17 00:00:00 2001
From: Nick White
Date: Wed, 11 Nov 2020 17:32:01 +0000
Subject: Add desktop-tool draft

---
 content/posts/desktop-tool/index.md | 65 +++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)
 create mode 100644 content/posts/desktop-tool/index.md

(limited to 'content/posts')

diff --git a/content/posts/desktop-tool/index.md b/content/posts/desktop-tool/index.md
new file mode 100644
index 0000000..66c6b9f
--- /dev/null
+++ b/content/posts/desktop-tool/index.md
@@ -0,0 +1,65 @@
+---
+title: "Desktop Tool"
+date: 2020-11-11
+categories: [software, code, tools]
+---
+While [our pipeline](/posts/tool-overview) works well for efficiently
+OCRing a corpus using cloud servers, it was hard to get the features
+of the pipeline on your own computer. So we recently spent a bit of
+time creating a new tool which is designed to run self-contained
+on a desktop computer. We're calling the tool *rescribe*, because
+why not? At the moment it's a command-line-only tool.
+
+## Install and build
+
+*rescribe* is a part of our [bookpipeline](https://rescribe.xyz/bookpipeline)
+package, and is written in Go, so there are a few things you need
+to install to get it working.
+
+1. Firstly, you need to [download and install the Go tools](https://golang.org/dl/),
+which will be used to build and install *rescribe*.
+2. Next, you need to install the Tesseract OCR engine, which the
+tool uses for the core OCR step. If you're on Linux this should be
+available from your package manager;
+[follow these instructions if you're on a Mac](https://tesseract-ocr.github.io/tessdoc/Home.html#macos), or
+[download and run an installer from this page for Windows](https://github.com/UB-Mannheim/tesseract/wiki).
+3. Then you'll need to install [git](https://git-scm.com/downloads)
+if you don't already have it, so you can get the bookpipeline package.
+4. Download an OCR training set for the language you're interested in.
+We provide trainings for [Caroline Minuscule](https://manuscriptocr.org),
+[early printed Latin](https://latinocr.org) and
+[Ancient Greek](https://ancientgreekocr.org).
+
+Still here? Great. Now open up a terminal window. Don't worry, it
+will be worth it.
+
+1. Clone the latest version of the bookpipeline package:
+`git clone https://git.rescribe.xyz/bookpipeline`
+2. Change into the bookpipeline directory and build the rescribe tool:
+```
+cd bookpipeline
+go build ./cmd/rescribe
+```
+
+Now everything is ready for action, and there will be an executable
+inside the bookpipeline directory called *rescribe* (*rescribe.exe*
+on Windows).
+
+## Usage
+
+You use *rescribe* by giving it the path of a training file to use
+and the directory containing the book or manuscript pages you want
+to OCR. Basic usage looks like this:
+```
+rescribe -t ../trainings/carolinems.traineddata mybook
+```
+This will run rescribe with a training at
+*../trainings/carolinems.traineddata* over all pages in the
+directory *mybook*.
+
+One limitation at the moment is that *rescribe* is very sensitive
+to how page images are named. It will only work on pages whose names
+end in `0001.png` or `0001.jpg`, where *`0001`* is any four-digit
+number (and whatever comes before it can be anything!).
+
+
-- 
cgit v1.2.1-24-ge1ad
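
The page-naming limitation described at the end of the Usage section can be checked before running the tool. The sketch below is only an illustration of that rule as the post words it: the pattern and the helper name `looksLikePage` are assumptions made for this example, not code taken from *rescribe* itself, and the tool's real check may be stricter or looser.

```go
package main

import (
	"fmt"
	"regexp"
)

// pagePattern is an assumed reading of the rule in the post: anything,
// then four digits, then ".png" or ".jpg" at the very end of the name.
// It is not copied from the rescribe source.
var pagePattern = regexp.MustCompile(`[0-9]{4}\.(png|jpg)$`)

// looksLikePage (hypothetical helper) reports whether a file name fits
// the assumed naming rule.
func looksLikePage(name string) bool {
	return pagePattern.MatchString(name)
}

func main() {
	// Example names, made up for illustration.
	names := []string{"0001.png", "scan_0042.jpg", "page1.png", "0001.tiff"}
	for _, name := range names {
		fmt.Printf("%-14s %v\n", name, looksLikePage(name))
	}
}
```

Running a check like this over a directory listing first makes it easier to spot pages that would need renaming before being passed to the tool.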