From 852edcd1c8ff81e8701ebd97dd48c18afa741f08 Mon Sep 17 00:00:00 2001
From: Nick White
Date: Tue, 17 Nov 2020 14:28:54 +0000
Subject: Improve desktop-tool page

---
 content/posts/desktop-tool/index.md | 72 ++++++++++++++++++++++---------------
 1 file changed, 43 insertions(+), 29 deletions(-)

diff --git a/content/posts/desktop-tool/index.md b/content/posts/desktop-tool/index.md
index 66c6b9f..21deb69 100644
--- a/content/posts/desktop-tool/index.md
+++ b/content/posts/desktop-tool/index.md
@@ -10,56 +10,70 @@ time recently creating a new tool which is designed to run self-
 contained on a desktop computer. We're calling the tool *rescribe*,
 because why not? At the moment it's a command line only tool.
 
-## Install and build
+## Install dependencies
 
 *rescribe* is a part of our [bookpipeline](https://rescribe.xyz/bookpipeline)
-package, and is written in Go, so there are a few things you need
-to install to get it working.
+package, and we provide pre built executables for it which can be
+downloaded for each platform here:
 
-1. Firstly, you need to [download and install the Go tools](https://golang.org/dl/),
-which will be used to build and install *rescribe*.
-2. Next, you need to install the Tesseract OCR engine, which the
+* [Linux](https://rescribe.xyz/rescribe/0.3.0/rescribe)
+* [OS X](https://rescribe.xyz/rescribe/0.3.0/osx/rescribe)
+* [Windows](https://rescribe.xyz/rescribe/0.3.0/rescribe.exe)
+
+Note that if you're on Linux or OS X you will probably need to run
+`chmod +x rescribe` after downloading, to make it executable.
+
+Next, you need to install the Tesseract OCR engine, which the
 tool uses for the core OCR step. If you're on Linux this should be
 available from your package manager,
 [follow these instructions if you're on a Mac](https://tesseract-ocr.github.io/tessdoc/Home.html#macos),
 or [download and run an installer from this page for Windows](https://github.com/UB-Mannheim/tesseract/wiki).
-3. Then you'll need to install [git](https://git-scm.com/downloads)
-if you don't already have it, so you can get the bookpipeline package.
-4. Download an OCR training set for the language you're interested in.
-We provide trainings for [Caroline Miniscule](https://manuscriptocr.org),
-[early printed Latin](https://latinocr.org) and
-[Ancient Greek](https://ancientgreekocr.org).
-
-Still here? Great. Now open up a terminal window. Don't worry, it
-will be worth it.
-
-1. Clone the latest version of the bookpipeline package:
-`git clone https://git.rescribe.xyz/bookpipeline`
-2. Change into the bookpipeline directory and build the rescribe tool:
-```
-cd bookpipeline
-go build ./cmd/rescribe
-```
-Now everything is ready for action, and there will be an executable
-inside the bookpipeline directory called *rescribe* (*rescribe.exe*
-on Windows).
+Finally, you will need to download an OCR training set for the
+language / script you're interested in. We provide trainings for
+[Caroline Miniscule](https://manuscriptocr.org),
+[early printed Latin](https://latinocr.org) and
+[Ancient Greek](https://ancientgreekocr.org). Any other Tesseract
+OCR training set will also work fine.
 
 ## Usage
 
+Still here? Great. Now open up a terminal window. Don't worry, it
+will be worth it. If you're on Windows, you can type cmd.exe into
+the run box, on OSX it's under Applications -> Utilities -> Terminal,
+and if you're on Linux I bet you already know where to find your
+terminal.
+
 You use *rescribe* by giving it the path of a training file to use
 and the directory containing the book or manuscript pages you want
 to OCR. Basic usage looks like this:
 
 ```
-rescribe -t ../trainings/carolinems.traineddata mybook
+./rescribe -t ../trainings/carolinems.traineddata mybook
 ```
 
 This will run rescribe with a training at
 *../trainings/carolinems.traineddata* over all pages in the
-directory *mybook*.
+directory *mybook*. A successful run will add several new files to
+*mybook*:
+
+* A PDF file named after the directory (`mybook.pdf` in the above
+  example), which is fully searchable.
+* A `text` directory, containing plain text versions of the OCR
+  results for each page.
+* A `hocr` directory, containing hOCR formatted OCR results for each
+  page.
+* A `graph.png` file, which shows the OCR confidence of each page (a
+  rough indicator of the quality of the OCR over the book).
+* A `conf` file, which lists the OCR confidence of each page, at each
+  preprocessing [binarisation threshold](/posts/adaptive-binarisation)
+  attempted.
+
+## Limitations
 
 One limitation at the moment is that *rescribe* is very sensitive
 to how page images are named. It will only work on pages named
 `0001.png` or `0001.jpg`, where *`0001`* is any four digit
 number (and *``* is anything!).
-
+There are likely to be bugs! [Let us know](mailto:info@rescribe.xyz)
+of any issues you have, any features you'd like, or just that you're
+enjoying using it!
-- 
cgit v1.2.1-24-ge1ad