--- title: "Desktop Tool" date: 2020-11-11 categories: [software, code, tools] --- While [our pipeline](/posts/tool-overview) works well for OCR of a corpus efficiently using cloud servers, it was hard to get the features of the pipeline on your own computer. So we spent a bit of time recently creating a new tool which is designed to run self- contained on a desktop computer. We're calling the tool *rescribe*, because why not? At the moment it's a command line only tool. ## Install dependencies *rescribe* is a part of our [bookpipeline](https://rescribe.xyz/bookpipeline) package, and we provide pre built executables for it which can be downloaded for each platform here - save them to wherever you want to run the program from: * [Linux](https://rescribe.xyz/rescribe/0.3.0/rescribe) * [OS X](https://rescribe.xyz/rescribe/0.3.0/osx/rescribe) * [Windows](https://rescribe.xyz/rescribe/0.3.0/rescribe.exe) Next, you need to install the Tesseract OCR engine, which the tool uses for the core OCR step. If you're on Linux this should be available from your package manager, [follow these instructions if you're on a Mac](https://tesseract-ocr.github.io/tessdoc/Home.html#macos), or [download and run an installer from this page for Windows](https://github.com/UB-Mannheim/tesseract/wiki). Finally, you will need to download an OCR training set for the language / script you're interested in. We provide trainings for [Caroline Miniscule](https://manuscriptocr.org), [early printed Latin](https://latinocr.org) and [Ancient Greek](https://ancientgreekocr.org). Any other Tesseract OCR training set will also work fine. ## Usage Still here? Great. Now open up a terminal window. Don't worry, it will be worth it. If you're on Windows, you can type cmd.exe into the run box, on OSX it's under Applications -> Utilities -> Terminal, and if you're on Linux I bet you already know where to find your terminal. Firstly, if you're on Linux or OSX you will probably need to make the program executable after downloading it, so do that now by running `chmod +x rescribe`. You'll only have to do that once. You use *rescribe* by giving it the path of a training file to use and the directory containing the book or manuscript pages you want to OCR. Basic usage looks like this: ``` ./rescribe -t ../trainings/carolinems.traineddata mybook ``` This will run rescribe with a training at *../trainings/carolinems.traineddata* over all pages in the directory *mybook*. A successful run will add several new files to *mybook*: * A PDF file named after the directory (`mybook.pdf` in the above example), which is fully searchable. * A `text` directory, containing plain text versions of the OCR results for each page. * A `hocr` directory, containing hOCR formatted OCR results for each page. * A `graph.png` file, which shows the OCR confidence of each page (a rough indicator of the quality of the OCR over the book). * A `conf` file, which lists the OCR confidence of each page, at each preprocessing [binarisation threshold](/posts/adaptive-binarisation) attempted. ## Limitations One limitation at the moment is that *rescribe* is very sensitive to how page images are named. It will only work on pages named `0001.png` or `0001.jpg`, where *`0001`* is any four digit number (and *``* is anything!). There are likely to be bugs! [Let us know](mailto:info@rescribe.xyz) of any issues you have, any features you'd like, or just that you're enjoying using it!