Age | Commit message | Author
2020-12-15 | [rmbook] Add -dryrun flag | Nick White
2020-12-14 | Add rmbook tool | Nick White
2020-12-14 | Update preproc module used to incorporate an important crash fix | Nick White
2020-12-07 | [rescribe] Fix up *.hocr glob, which ensures that using a savedir that already has a hocr directory in it will work (v0.3.2) | Nick White
2020-12-07 | [rescribe] Allow saving of results to somewhere other than a directory named after the book being processed | Nick White
2020-12-04 | Ensure mkdir will succeed in upload | Nick White
2020-12-03 | [rescribe] Fix portability issue where hocrs may not be correctly moved and txt-ified on windows | Nick White
2020-12-03 | Don't upload binarised pdf twice needlessly | Nick White
This can also result in the file being uploaded twice simultaneously, as up() runs in a separate goroutine. That can cause failures on Windows, where one upload process attempts to remove the file while the other still has it open for upload. It could probably also fail if one upload completed (and so deleted the file) before the other started.
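One way to avoid the double upload described above is to make sure each path is claimed by at most one uploader. The following is only a sketch under stated assumptions: `uploadTracker`, `claim`, and the file names are hypothetical and are not taken from bookpipeline.

```go
package main

import (
	"fmt"
	"sync"
)

// uploadTracker is a hypothetical helper (not from bookpipeline) that
// records which paths have already been claimed for upload.
type uploadTracker struct {
	mu   sync.Mutex
	seen map[string]bool
}

func newUploadTracker() *uploadTracker {
	return &uploadTracker{seen: make(map[string]bool)}
}

// claim reports whether the caller is the first to request an upload of
// path. Only the first caller gets true, so the file is uploaded (and
// later removed) by exactly one goroutine.
func (u *uploadTracker) claim(path string) bool {
	u.mu.Lock()
	defer u.mu.Unlock()
	if u.seen[path] {
		return false
	}
	u.seen[path] = true
	return true
}

func main() {
	tracker := newUploadTracker()
	// The file names here are made up for illustration.
	for _, p := range []string{"book.binarised.pdf", "book.colour.pdf", "book.binarised.pdf"} {
		if tracker.claim(p) {
			fmt.Println("uploading", p)
		} else {
			fmt.Println("skipping duplicate", p)
		}
	}
}
```

The mutex matters because the claims may come from separate goroutines; without it, two goroutines could both see the path as unclaimed.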
2020-11-30 | Merge branch 'master' of ssh://hammerhead/home/nick/rescribe/src/bookpipeline | Nick White
2020-11-30 | Add getstats tool | Nick White
2020-11-24 | [booktopipeline] Add a check to disallow adding a book that already exists | Nick White
This is important because if a book that has already been processed is added again, an analyse job will be added every time a page is OCRed, which will clog up the pipeline with unnecessary work. Also, if a book were added with the same name but differently named files, or a different number of pages, the results would almost certainly not be as intended. If a book really does need to be added under a particular name, either the original directory can be removed on S3, or "v2" or similar can be appended to the book name before calling booktopipeline.
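A check of this kind could be sketched as below. The `lister` interface, `memStore`, and `checkNewBook` are hypothetical names introduced for illustration; they are not bookpipeline's actual API, and the real check may work differently.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// lister is a hypothetical interface standing in for the storage
// backend (S3 or local); it is not bookpipeline's actual API.
type lister interface {
	List(prefix string) ([]string, error)
}

// memStore is an in-memory lister used only for this illustration.
type memStore []string

func (m memStore) List(prefix string) ([]string, error) {
	var out []string
	for _, k := range m {
		if strings.HasPrefix(k, prefix) {
			out = append(out, k)
		}
	}
	return out, nil
}

var errBookExists = errors.New("book already exists")

// checkNewBook returns errBookExists if anything is already stored
// under "<name>/". The trailing slash stops "mybook" from matching
// keys that belong to "mybookv2".
func checkNewBook(s lister, name string) error {
	keys, err := s.List(name + "/")
	if err != nil {
		return fmt.Errorf("listing %s: %w", name, err)
	}
	if len(keys) > 0 {
		return errBookExists
	}
	return nil
}

func main() {
	store := memStore{"mybook/0001.png", "mybook/0002.png"}
	fmt.Println(checkNewBook(store, "mybook"))   // book already exists
	fmt.Println(checkNewBook(store, "mybookv2")) // <nil>
}
```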
2020-11-18 | Switch to a maintained version of gofpdf | Nick White
2020-11-18 | Describe rescribe tool in documentation (v0.3.1) | Nick White
2020-11-17 | Add trimqueue and logwholequeue utilities which can help deal with weird queue states | Nick White
2020-11-17 | Remove _bin0.x from txt filenames (v0.3.0) | Nick White
2020-11-16 | Some changes to ensure the pipeline works correctly on Windows | Nick White
In a couple of places a file was uploaded while still open, and the subsequent attempt to remove it causes an error on Windows. The allOCRed function also assumed that the path separator would be a /, which is always correct for AWS, and correct for local use on Linux and OSX, but not for local use on Windows. Fixed by leaving the separator alone. Also, the local connection was not stripping a leading \ as it did a leading /, which caused an issue with local use on Windows. Local use on Windows is now tested and working, at least through Wine.
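The separator problems described above can be illustrated with a short sketch. The helper names `stripLeadingSep` and `baseName` and the example paths are hypothetical; this is not the code from the commit, which fixed allOCRed by leaving the separator untouched rather than by parsing it.

```go
package main

import (
	"fmt"
	"strings"
)

// stripLeadingSep removes any leading '/' or '\' from a key, so the
// same key is handled identically on Linux, macOS and Windows.
func stripLeadingSep(key string) string {
	return strings.TrimLeft(key, `/\`)
}

// baseName returns the part of p after the last '/' or '\', without
// assuming which separator the platform uses. If p has no separator,
// LastIndexAny returns -1 and the whole string is returned.
func baseName(p string) string {
	i := strings.LastIndexAny(p, `/\`)
	return p[i+1:]
}

func main() {
	fmt.Println(stripLeadingSep(`\book\0001.png`)) // book\0001.png
	fmt.Println(stripLeadingSep("/book/0001.png")) // book/0001.png
	fmt.Println(baseName(`book\hocr\0001.hocr`))   // 0001.hocr
	fmt.Println(baseName("book/hocr/0001.hocr"))   // 0001.hocr
}
```

Plain string functions are used here instead of `path/filepath` so that the behaviour does not depend on the operating system the code runs on.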
2020-11-16 | [rescribe] Default to an appropriate tesscmd for Windows | Nick White
2020-11-16 | [rescribe] Add txt output, only keep colour pdf, and reorganise files so they're more user-friendly | Nick White
2020-11-16 | [rescribe] Mention in usage that things can be saved in a different directory | Nick White
2020-11-16 | Add makefile for generating cross compiled rescribe binaries | Nick White
2020-11-10 | gofmt | Nick White
2020-11-10 | [rescribe] Enable custom paths to tesseract command to be set (also improve some error output) | Nick White
2020-11-10 | [rescribe] Change -t to the path of the traineddata file, and set TESSDATA_PREFIX accordingly | Nick White
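Deriving both values from a single traineddata path might look like the sketch below. It assumes Tesseract 4 or later, where TESSDATA_PREFIX names the directory that holds the .traineddata files; `splitTraining` and the example path are hypothetical and not taken from rescribe.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// splitTraining splits the path of a .traineddata file into the
// directory (a candidate for TESSDATA_PREFIX) and the language name
// (a candidate for tesseract's -l option).
func splitTraining(p string) (dir, lang string) {
	dir = filepath.Dir(p)
	lang = strings.TrimSuffix(filepath.Base(p), ".traineddata")
	return dir, lang
}

func main() {
	// The path below is only an example.
	dir, lang := splitTraining("/usr/share/tessdata/eng.traineddata")
	fmt.Println("TESSDATA_PREFIX=" + dir)
	fmt.Println("-l", lang)
}
```

The resulting `TESSDATA_PREFIX=...` string could then be appended to the environment of the command that runs tesseract.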
2020-11-10 | [rescribe] Handle errors in processbook correctly, and improve console output | Nick White
2020-11-10 | [getpipelinebook] Rewrite to use internal package functions | Nick White
2020-11-10 | Switch booktopipeline to use internal pipeline functions | Nick White
2020-11-09 | Add a couple of things that should not be forgotten | Nick White
2020-11-09 | Switch Preprocess() to take the thresholds to use, and have rescribe tool only use 0.1,0.2,0.3 (separatelocal) | Nick White
2020-11-09 | [rescribe] Local only combo tool basically now working. Testing is still minimal. | Nick White
2020-11-09 | [rescribe] work in progress at a self-contained local pipeline processor, called rescribe | Nick White
2020-11-09 | [bookpipeline] Split most functionality out to package internal/pipeline | Nick White
No functionality changes, but this should make it easier to make custom builds using the pipeline in slightly different ways.
2020-11-09 | Add -autostop, so time to shutdown can be specified, and so the process can just be stopped after a period, rather than the whole computer shut down | Nick White
2020-11-09 | [bookpipeline] Improve interface, particularly for local use, by disabling (failing) log saving, mail sending, and removing erroneous references to AWS | Nick White
2020-11-09 | Set hocr config options directly rather than relying on 'hocr' config file | Nick White
This ensures that bookpipeline will still work even if TESSDATA_PREFIX has been set to a directory without configs in it.
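Setting the options directly can be done with tesseract's `-c VAR=VALUE` option. The sketch below builds such an argument list; `hocrArgs` and the file names are hypothetical, and the two variables shown are assumed to mirror what the stock `hocr` config file sets, so check them against the Tesseract version in use.

```go
package main

import (
	"fmt"
	"strings"
)

// hocrArgs builds tesseract arguments that request hOCR output by
// setting variables with -c, instead of naming the "hocr" config file
// (which would have to be found under TESSDATA_PREFIX).
func hocrArgs(img, outBase, lang string) []string {
	return []string{
		img, outBase,
		"-l", lang,
		"-c", "tessedit_create_hocr=1",
		"-c", "hocr_font_info=0",
	}
}

func main() {
	args := hocrArgs("0001.png", "0001", "eng")
	// Print the command line rather than running it, so the sketch
	// works without tesseract installed. To run it instead, pass the
	// slice to exec.Command("tesseract", args...).
	fmt.Println("tesseract " + strings.Join(args, " "))
}
```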
2020-11-06 | Fix the README to be valid markdown in the local example | Nick White
2020-11-06 | Document the local mode | Nick White
2020-11-06 | Add git clone advice to readme | Nick White
2020-10-21 | Fix a bug that caused analyse step to not be triggered with local connection | Nick White
2020-10-20 | Improve logging by using Println, which ensures there is a space between arguments, even if all are strings | Nick White
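The spacing difference this commit relies on comes from Go's fmt rules, which log.Print and log.Println follow. Print adds a space between operands only when neither is a string, while Println always adds one. A small sketch, with hypothetical wrapper names and made-up log text:

```go
package main

import "fmt"

// withPrint formats its arguments the way fmt.Print and log.Print do.
func withPrint(args ...interface{}) string {
	return fmt.Sprint(args...)
}

// withPrintln formats its arguments the way fmt.Println and
// log.Println do, including the trailing newline.
func withPrintln(args ...interface{}) string {
	return fmt.Sprintln(args...)
}

func main() {
	fmt.Printf("%q\n", withPrint("OCRing", "0001.png"))   // "OCRing0001.png"
	fmt.Printf("%q\n", withPrint("pages:", 12))           // "pages:12"
	fmt.Printf("%q\n", withPrint(1, 2))                   // "1 2"
	fmt.Printf("%q\n", withPrintln("OCRing", "0001.png")) // "OCRing 0001.png\n"
}
```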
2020-10-20 | Fix local queue deletion properly | Nick White
2020-10-20 | Hopefully fix off-by-one error causing errors with local bookpipeline | Nick White
2020-10-20 | Add postprocess-bythresh cmd | Nick White
2020-10-20 | Update spot image to use | Nick White
2020-09-22 | [booktopipeline] Check that all images are valid before adding to pipeline | Nick White
2020-09-15 | Abort and delete a failed wipeonly job, like we do with preprocessing | Nick White
There was no reason not to do this with wipeonly as well, and sure enough a single broken PNG image in a wipeonly task would cause the queue to exponentially fill as happened previously.
2020-09-07 | Update spot instance ami once again | Nick White
2020-09-01 | Update spot instance ami to use | Nick White
2020-09-01 | Fix confusing usage message for booktopipeline | Nick White
2020-08-24 | update getsamplepages to just get jpg pages | Nick White
2020-08-19 | Add getsamplepages | Nick White