Age | Commit message | Author | |
---|---|---|---|
2021-06-22 | rescribe: Make it clearer that embedded training files are available to use | Nick White | |
2021-06-22 | rescribe: add embedded tesseract for linux | Nick White | |
2021-06-22 | rescribe: allow use of embedded training even if -systess is used | Nick White | |
2021-06-22 | rescribe: Add go generate command to download the needed files to embed | Nick White | |
2021-06-22 | rescribe: Add an embedded tessdata | Nick White | |
2021-06-21 | rescribe: Set up so only Tesseract needed for the build platform is embedded | Nick White | |
2021-06-21 | rescribe: Embed Tesseract into binary so that no Tesseract install is necessary | Nick White | |
2021-05-31 | Fix bug after changing pipeliner for tests, to ensure DeleteObjects is available to Pipeliner | Nick White | |
2021-03-16 | rescribe: change default training directory to trainings/v0.3.3 | Nick White | |
2021-02-22 | lspipeline: Rename to lspipeline-ng, and restore pre concurrency version to lspipeline as there are some hard to debug issues in concurrency version | Nick White | |
2021-02-15 | getsamplepages: Add -prefix option, and use 'best' to get random page numbers | Nick White | |
The -prefix option is useful to us. Previously only a .jpg for page number 100 was retrieved, which failed if the book had fewer (or unusually named) pages, and also didn't provide a corresponding .hocr at all (bug introduced with 48958d2). Using 'best', which is (effectively) randomly sorted, provides a page that is guaranteed to exist, and a random one at that. | |||
2021-01-26 | Make ListObjectsWithMeta generic again and create a specialised ListObjectWithMeta for single file listing, so we can still be as fast, but do not have a misleading api | Nick White | |
2021-01-26 | Improve lspipeline concurrency by removing WaitGroup stuff | Nick White | |
2021-01-26 | Speed up lspipeline by making s3 requests concurrently and only processing single results from ListObjects requests | Nick White | |
2020-12-15 | [rmbook] Append / to end of bookname, to ensure e.g. "1" doesn't match all books starting with "1" | Nick White | |
2020-12-15 | [rmbook] Add -dryrun flag | Nick White | |
2020-12-14 | Add rmbook tool | Nick White | |
2020-12-07 | [rescribe] Fix up *.hocr glob, which ensures that using a savedir that already has a hocr directory in it will work [v0.3.2] | Nick White | |
2020-12-07 | [rescribe] Allow saving of results to somewhere other than a directory named after the book being processed | Nick White | |
2020-12-03 | [rescribe] Fix portability issue where hocrs may not be correctly moved and txt-ified on windows | Nick White | |
2020-11-30 | Merge branch 'master' of ssh://hammerhead/home/nick/rescribe/src/bookpipeline | Nick White | |
2020-11-30 | Add getstats tool | Nick White | |
2020-11-24 | [booktopipeline] Add a check to disallow adding a book that already exists | Nick White | |
This is important as if a book is added which has already been done, then an analyse job will be added every time a page is OCRed, which will clog up the pipeline with unnecessary work. Also if a book was added with the same name but differently named files, or a different number of pages, the results would almost certainly not be as intended. In the case of a book really wanting to be added with a particular name, either the original directory can be removed on S3, or "v2" or similar can be appended to the book name before calling booktopipeline. | |||
2020-11-17 | Add trimqueue and logwholequeue utilities which can help deal with weird queue states | Nick White | |
2020-11-17 | Remove _bin0.x from txt filenames [v0.3.0] | Nick White | |
2020-11-16 | [rescribe] Default to an appropriate tesscmd for Windows | Nick White | |
2020-11-16 | [rescribe] Add txt output, only keep colour pdf, and reorganise files so they're more user-friendly | Nick White | |
2020-11-16 | [rescribe] Mention in usage that things can be saved in a different directory | Nick White | |
2020-11-10 | gofmt | Nick White | |
2020-11-10 | [rescribe] Enable custom paths to tesseract command to be set (also improve some error output) | Nick White | |
2020-11-10 | [rescribe] Change -t to the path of the traineddata file, and set TESSDATA_PREFIX accordingly | Nick White | |
2020-11-10 | [rescribe] Handle errors in processbook correctly, and improve console output | Nick White | |
2020-11-10 | [getpipelinebook] Rewrite to use internal package functions | Nick White | |
2020-11-10 | Switch booktopipeline to use internal pipeline functions | Nick White | |
2020-11-09 | Add a couple of things that should not be forgotten | Nick White | |
2020-11-09 | Switch Preprocess() to take the thresholds to use, and have rescribe tool only use 0.1,0.2,0.3 [separatelocal] | Nick White | |
2020-11-09 | [rescribe] Local only combo tool basically now working. Testing is still minimal. | Nick White | |
2020-11-09 | [rescribe] work in progress at a self-contained local pipeline processor, called rescribe | Nick White | |
2020-11-09 | [bookpipeline] Split most functionality out to package internal/pipeline | Nick White | |
No functionality changes, but this should make it easier to make custom builds using the pipeline in slightly different ways. | |||
2020-11-09 | Add -autostop, so time to shutdown can be specified, and so the process can just be stopped after a period, rather than the whole computer shut down | Nick White | |
2020-11-09 | [bookpipeline] Improve interface, particularly for local use, by disabling (failing) log saving, mail sending, and removing erroneous references to AWS | Nick White | |
2020-11-09 | Set hocr config options directly rather than relying on 'hocr' config file | Nick White | |
This ensures that bookpipeline will still work even if TESSDATA_PREFIX has been set to a directory without configs in it. | |||
2020-10-20 | Improve logging by using Println, which ensures there is a space between arguments, even if all are strings | Nick White | |
2020-10-20 | Add postprocess-bythresh cmd | Nick White | |
2020-09-22 | [booktopipeline] Check that all images are valid before adding to pipeline | Nick White | |
2020-09-15 | Abort and delete a failed wipeonly job, like we do with preprocessing | Nick White | |
There was no reason not to do this with wipeonly as well, and sure enough a single broken PNG image in a wipeonly task would cause the queue to exponentially fill as happened previously. | |||
2020-09-01 | Fix confusing usage message for booktopipeline | Nick White | |
2020-08-24 | update getsamplepages to just get jpg pages | Nick White | |
2020-08-19 | Add getsamplepages | Nick White | |
2020-08-18 | Update preproc to v0.4.0 to enable vertical wiping | Nick White | |
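
The 2020-12-15 rmbook fix listed above (appending / to the book name) addresses a common prefix-matching pitfall. The following is a minimal sketch of the idea, not the code from that commit: the function names `bookPrefix` and `matchKeys`, the sample keys, and the check for an already-present trailing slash are all assumptions made for illustration, and a plain string slice stands in for an S3 listing.

```go
package main

import (
	"fmt"
	"strings"
)

// bookPrefix returns the key prefix that selects exactly one book.
// Hypothetical helper: the trailing "/" stops "1" from also
// matching "10", "1789_book", and so on.
func bookPrefix(bookname string) string {
	if strings.HasSuffix(bookname, "/") {
		return bookname
	}
	return bookname + "/"
}

// matchKeys returns the keys that start with prefix, standing in
// for a prefix-based object listing.
func matchKeys(keys []string, prefix string) []string {
	var out []string
	for _, k := range keys {
		if strings.HasPrefix(k, prefix) {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	keys := []string{"1/0001.png", "10/0001.png", "1789_book/0001.png"}

	// A bare book name matches every book whose name starts with "1".
	fmt.Println(matchKeys(keys, "1"))

	// With the trailing slash only book "1" is selected.
	fmt.Println(matchKeys(keys, bookPrefix("1")))
}
```

With the sample keys, the first call returns all three keys and the second returns only `1/0001.png`, which is the behaviour a delete tool needs before removing anything.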
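
The 2020-10-20 logging change listed above relies on a difference in how Go's Print and Println families join their arguments. The sketch below shows that difference using `fmt.Sprint` and `fmt.Sprintln`, whose spacing rules the standard `log` package's `Print` and `Println` follow. The wrapper names `joinPrint` and `joinPrintln` and the sample strings are illustrative assumptions, not taken from the repository.

```go
package main

import "fmt"

// joinPrint formats like Print: a space is added between two
// operands only when neither of them is a string.
func joinPrint(args ...interface{}) string {
	return fmt.Sprint(args...)
}

// joinPrintln formats like Println: a space is always added
// between operands, and a newline is appended.
func joinPrintln(args ...interface{}) string {
	return fmt.Sprintln(args...)
}

func main() {
	// Two strings run together with Print-style formatting.
	fmt.Printf("%q\n", joinPrint("Processing", "book1"))

	// Println-style formatting separates them.
	fmt.Printf("%q\n", joinPrintln("Processing", "book1"))

	// Non-string operands do get a space even with Print.
	fmt.Printf("%q\n", joinPrint(1, 2))
}
```

Because most log arguments in a tool like this are strings, Print-style calls tend to produce run-together messages, which is the problem the commit message describes.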