Age | Commit message | Author | |
---|---|---|---|
2021-06-21 | rescribe: Set up so only Tesseract needed for the build platform is embedded | Nick White | |
2021-06-21 | rescribe: Embed Tesseract into binary so that no Tesseract install is necessary | Nick White | |
2021-05-31 | Fix bug after changing pipeliner for tests, to ensure DeleteObjects is available to Pipeliner | Nick White | |
2021-03-16 | rescribe: change default training directory to trainings/v0.3.3 | Nick White | |
2021-02-22 | lspipeline: Rename to lspipeline-ng, and restore pre concurrency version to lspipeline as there are some hard to debug issues in concurrency version | Nick White | |
2021-02-15 | getsamplepages: Add -prefix option, and use 'best' to get random page numbers | Nick White | |
The -prefix option is useful to us. Previously only a .jpg for page number 100 was retrieved, which failed if the book had fewer (or unusually named) pages, and also didn't provide a corresponding .hocr at all (bug introduced with 48958d2). Using 'best', which is (effectively) randomly sorted, provides a guaranteed-to-exist page, and a random one at that. | |||
2021-01-26 | Make ListObjectsWithMeta generic again and create a specialised ListObjectWithMeta for single file listing, so we can still be as fast, but do not have a misleading api | Nick White | |
2021-01-26 | Improve lspipeline concurrency by removing WaitGroup stuff | Nick White | |
2021-01-26 | Speed up lspipeline by making s3 requests concurrently and only processing single results from ListObjects requests | Nick White | |
2020-12-15 | [rmbook] Append / to end of bookname, to ensure e.g. "1" doesn't match all books starting with "1" | Nick White | |
2020-12-15 | [rmbook] Add -dryrun flag | Nick White | |
2020-12-14 | Add rmbook tool | Nick White | |
2020-12-07 | [rescribe] Fix up *.hocr glob, which ensures that using a savedir that already has a hocr directory in it will work (tag: v0.3.2) | Nick White | |
2020-12-07 | [rescribe] Allow saving of results to somewhere other than a directory named after the book being processed | Nick White | |
2020-12-03 | [rescribe] Fix portability issue where hocrs may not be correctly moved and txt-ified on windows | Nick White | |
2020-11-30 | Merge branch 'master' of ssh://hammerhead/home/nick/rescribe/src/bookpipeline | Nick White | |
2020-11-30 | Add getstats tool | Nick White | |
2020-11-24 | [booktopipeline] Add a check to disallow adding a book that already exists | Nick White | |
This is important as if a book is added which has already been done, then an analyse job will be added every time a page is OCRed, which will clog up the pipeline with unnecessary work. Also if a book was added with the same name but differently named files, or a different number of pages, the results would almost certainly not be as intended. In the case of a book really wanting to be added with a particular name, either the original directory can be removed on S3, or "v2" or similar can be appended to the book name before calling booktopipeline. | |||
2020-11-17 | Add trimqueue and logwholequeue utilities which can help deal with weird queue states | Nick White | |
2020-11-17 | Remove _bin0.x from txt filenames (tag: v0.3.0) | Nick White | |
2020-11-16 | [rescribe] Default to an appropriate tesscmd for Windows | Nick White | |
2020-11-16 | [rescribe] Add txt output, only keep colour pdf, and reorganise files so they're more user-friendly | Nick White | |
2020-11-16 | [rescribe] Mention in usage that things can be saved in a different directory | Nick White | |
2020-11-10 | gofmt | Nick White | |
2020-11-10 | [rescribe] Enable custom paths to tesseract command to be set (also improve some error output) | Nick White | |
2020-11-10 | [rescribe] Change -t to the path of the traineddata file, and set TESSDATA_PREFIX accordingly | Nick White | |
2020-11-10 | [rescribe] Handle errors in processbook correctly, and improve console output | Nick White | |
2020-11-10 | [getpipelinebook] Rewrite to use internal package functions | Nick White | |
2020-11-10 | Switch booktopipeline to use internal pipeline functions | Nick White | |
2020-11-09 | Add a couple of things that should not be forgotten | Nick White | |
2020-11-09 | Switch Preprocess() to take the thresholds to use, and have rescribe tool only use 0.1,0.2,0.3 (branch: separatelocal) | Nick White | |
2020-11-09 | [rescribe] Local only combo tool basically now working. Testing is still minimal. | Nick White | |
2020-11-09 | [rescribe] work in progress at a self-contained local pipeline processor, called rescribe | Nick White | |
2020-11-09 | [bookpipeline] Split most functionality out to package internal/pipeline | Nick White | |
No functionality changes, but this should make it easier to make custom builds using the pipeline in slightly different ways. | |||
2020-11-09 | Add -autostop, so time to shutdown can be specified, and so the process can just be stopped after a period, rather than the whole computer shut down | Nick White | |
2020-11-09 | [bookpipeline] Improve interface, particularly for local use, by disabling (failing) log saving, mail sending, and removing erroneous references to AWS | Nick White | |
2020-11-09 | Set hocr config options directly rather than relying on 'hocr' config file | Nick White | |
This ensures that bookpipeline will still work even if TESSDATA_PREFIX has been set to a directory without configs in it. | |||
2020-10-20 | Improve logging by using Println, which ensures there is a space between arguments, even if all are strings | Nick White | |
2020-10-20 | Add postprocess-bythresh cmd | Nick White | |
2020-09-22 | [booktopipeline] Check that all images are valid before adding to pipeline | Nick White | |
2020-09-15 | Abort and delete a failed wipeonly job, like we do with preprocessing | Nick White | |
There was no reason not to do this with wipeonly as well, and sure enough a single broken PNG image in a wipeonly task would cause the queue to exponentially fill as happened previously. | |||
2020-09-01 | Fix confusing usage message for booktopipeline | Nick White | |
2020-08-24 | update getsamplepages to just get jpg pages | Nick White | |
2020-08-19 | Add getsamplepages | Nick White | |
2020-08-18 | Update preproc to v0.4.0 to enable vertical wiping | Nick White | |
2020-07-28 | Allow override of autodetected queues for booktopipeline | Nick White | |
2020-07-28 | Autodetect queue for booktopipeline based on file extension | Antonia Karaisl | |
2020-07-27 | Use os.Getenv() to find config dir, rather than rely on os.UserConfigDir(), as that isn't present on go1.11 | Nick White | |
2020-07-27 | Switch mail settings to an externally set file | Nick White | |
2020-07-21 | [bookpipeline] If preprocessing fails, email us and remove the job from the queue | Nick White | |
This prevents the current situation where a failed preprocessing job is endlessly repeated, potentially spawning thousands of ocrpage jobs in its wake each time. Note that the email stuff works but requires putting secrets into .go files, so need to rewrite that to read from somewhere more sensible like a dotfile on the host. | |||