author Nick White <git@njw.name> 2020-11-17 14:28:54 +0000
committer Nick White <git@njw.name> 2020-11-17 14:28:54 +0000
commit 852edcd1c8ff81e8701ebd97dd48c18afa741f08
tree 9be3c76802ba5b1ba4867578bbba3dee444591a6
parent 70959ff850ebca6e38c947d03b86efc94f8a8dd4
Improve desktop-tool page
-rw-r--r-- content/posts/desktop-tool/index.md 72
1 files changed, 43 insertions, 29 deletions
diff --git a/content/posts/desktop-tool/index.md b/content/posts/desktop-tool/index.md
index 66c6b9f..21deb69 100644
--- a/content/posts/desktop-tool/index.md
+++ b/content/posts/desktop-tool/index.md
@@ -10,56 +10,70 @@ time recently creating a new tool which is designed to run self-
contained on a desktop computer. We're calling the tool *rescribe*,
because why not? At the moment it's a command line only tool.
-## Install and build
+## Install dependencies
*rescribe* is a part of our [bookpipeline](https://rescribe.xyz/bookpipeline)
-package, and is written in Go, so there are a few things you need
-to install to get it working.
+package, and we provide pre built executables for it which can be
+downloaded for each platform here:
-1. Firstly, you need to [download and install the Go tools](https://golang.org/dl/),
-which will be used to build and install *rescribe*.
-2. Next, you need to install the Tesseract OCR engine, which the
+* [Linux](https://rescribe.xyz/rescribe/0.3.0/rescribe)
+* [OS X](https://rescribe.xyz/rescribe/0.3.0/osx/rescribe)
+* [Windows](https://rescribe.xyz/rescribe/0.3.0/rescribe.exe)
+
+Note that if you're on Linux or OS X you will probably need to run
+`chmod +x rescribe` after downloading, to make it executable.
+
+Next, you need to install the Tesseract OCR engine, which the
tool uses for the core OCR step. If you're on Linux this should be
available from your package manager,
[follow these instructions if you're on a Mac](https://tesseract-ocr.github.io/tessdoc/Home.html#macos), or
[download and run an installer from this page for Windows](https://github.com/UB-Mannheim/tesseract/wiki).
-3. Then you'll need to install [git](https://git-scm.com/downloads)
-if you don't already have it, so you can get the bookpipeline package.
-4. Download an OCR training set for the language you're interested in.
-We provide trainings for [Caroline Miniscule](https://manuscriptocr.org),
-[early printed Latin](https://latinocr.org) and
-[Ancient Greek](https://ancientgreekocr.org).
-
-Still here? Great. Now open up a terminal window. Don't worry, it
-will be worth it.
-
-1. Clone the latest version of the bookpipeline package:
-`git clone https://git.rescribe.xyz/bookpipeline`
-2. Change into the bookpipeline directory and build the rescribe tool:
-```
-cd bookpipeline
-go build ./cmd/rescribe
-```
-Now everything is ready for action, and there will be an executable
-inside the bookpipeline directory called *rescribe* (*rescribe.exe*
-on Windows).
+Finally, you will need to download an OCR training set for the
+language / script you're interested in. We provide trainings for
+[Caroline Miniscule](https://manuscriptocr.org),
+[early printed Latin](https://latinocr.org) and
+[Ancient Greek](https://ancientgreekocr.org). Any other Tesseract
+OCR training set will also work fine.
## Usage
+Still here? Great. Now open up a terminal window. Don't worry, it
+will be worth it. If you're on Windows, you can type cmd.exe into
+the run box, on OSX it's under Applications -> Utilities -> Terminal,
+and if you're on Linux I bet you already know where to find your
+terminal.
+
You use *rescribe* by giving it the path of a training file to use
and the directory containing the book or manuscript pages you want
to OCR. Basic usage looks like this:
```
-rescribe -t ../trainings/carolinems.traineddata mybook
+./rescribe -t ../trainings/carolinems.traineddata mybook
```
This will run rescribe with a training at
*../trainings/carolinems.traineddata* over all pages in the
-directory *mybook*.
+directory *mybook*. A successful run will add several new files to
+*mybook*:
+
+* A PDF file named after the directory (`mybook.pdf` in the above
+ example), which is fully searchable.
+* A `text` directory, containing plain text versions of the OCR
+ results for each page.
+* A `hocr` directory, containing hOCR formatted OCR results for each
+ page.
+* A `graph.png` file, which shows the OCR confidence of each page (a
+ rough indicator of the quality of the OCR over the book).
+* A `conf` file, which lists the OCR confidence of each page, at each
+ preprocessing [binarisation threshold](/posts/adaptive-binarisation)
+ attempted.
+
+## Limitations
One limitation at the moment is that *rescribe* is very sensitive
to how page images are named. It will only work on pages named
`<anything>0001.png` or `<anything>0001.jpg`, where *`0001`* is any
four digit number (and *`<anything>`* is anything!).
-
+There are likely to be bugs! [Let us know](mailto:info@rescribe.xyz)
+of any issues you have, any features you'd like, or just that you're
+enjoying using it!
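
Outside the patch itself, the page-naming rule in the Limitations section can be sketched as a quick check to run over a directory listing before starting an OCR run. This is a minimal sketch in Python, not code from *rescribe* or bookpipeline: the helper names `is_valid_page_name` and `unusable_pages` are hypothetical, and it assumes the rule means "any prefix, then four digits immediately before a lowercase `.png` or `.jpg` extension".

```python
import re

# Assumed reading of the rule in the post: <anything>NNNN.png or
# <anything>NNNN.jpg, where NNNN is any four digits. Lowercase
# extensions only; this is an assumption, not rescribe's actual check.
PAGE_NAME = re.compile(r"^.*[0-9]{4}\.(png|jpg)$")


def is_valid_page_name(filename: str) -> bool:
    """Return True if filename fits the assumed page-naming pattern."""
    return PAGE_NAME.match(filename) is not None


def unusable_pages(filenames):
    """Return the names that would not fit the assumed pattern."""
    return [name for name in filenames if not is_valid_page_name(name)]
```

Calling `unusable_pages` on the contents of a directory such as *mybook* would list the files that may need renaming before they fit the pattern described above.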