path: root/cmd
Age  Commit message  Author
2021-10-12  rescribe: fix lookup of external training file  (v0.5.3)  Nick White
2021-10-01  rescribe: Include new tessdata in embed getter  (v0.5.2)  Nick White
2021-10-01  rescribe: Add embedded lat.traineddata  Nick White
2021-10-01  rescribe: Add both original training path and embedded version on error output for training file not found, so that it's clear that the file specified may not exist  Nick White
2021-08-24  rescribe: improve makefile to match the way we deploy to the website  Nick White
2021-08-19  lspipeline-ng: Limit number of book details requests so we don't run into EC2's rate limiting  (v0.5.0)  Nick White
2021-08-18  rescribe: Update documentation on how to deal with M1 signing, and move makefile to where it makes sense  Nick White
2021-08-17  pipeline: use regular storage for tests, rather than a separate one  Nick White
2021-08-02  rescribe: Add experimental m1 build  Nick White
2021-08-02  internal/pipeline: Add test (incomplete but working) for UploadImages  Nick White
2021-07-20  Cleanup thanks to go vet  Nick White
2021-07-13  gofmt  Nick White
2021-07-12  Add necessary pipeliner dependency for testqueue (probably remove this from internal library later as it's only needed for tests)  Nick White
2021-07-12  Add test for upAndQueue function  Nick White
This involved adding a test queue, so it can be run safely without interfering with the pipeline.
2021-07-08  rescribe: Exit with an error if directory doesn't exist  Nick White
2021-06-29  rescribe: add documentation on how to generate embedded data  Nick White
2021-06-29  rescribe: Add embed target for darwin (osx) too  Nick White
2021-06-22  rescribe: Remove erroneous unnecessary mkdir  Nick White
2021-06-22  rescribe: Make it clearer that embedded training files are available to use  Nick White
2021-06-22  rescribe: add embedded tesseract for linux  Nick White
2021-06-22  rescribe: allow use of embedded training even if -systess is used  Nick White
2021-06-22  rescribe: Add go generate command to download the needed files to embed  Nick White
2021-06-22  rescribe: Add an embedded tessdata  Nick White
2021-06-21  rescribe: Set up so only Tesseract needed for the build platform is embedded  Nick White
2021-06-21  rescribe: Embed Tesseract into binary so that no Tesseract install is necessary  Nick White
2021-05-31  Fix bug after changing pipeliner for tests, to ensure DeleteObjects is available to Pipeliner  Nick White
2021-03-16  rescribe: change default training directory to trainings/  (v0.3.3)  Nick White
2021-02-22  lspipeline: Rename to lspipeline-ng, and restore pre concurrency version to lspipeline as there are some hard to debug issues in concurrency version  Nick White
2021-02-15  getsamplepages: Add -prefix option, and use 'best' to get random page numbers  Nick White
The -prefix option is useful to us. Previously only a .jpg for page number 100 was retrieved, which failed if the book had fewer (or unusually named) pages, and also didn't provide a corresponding .hocr at all (bug introduced with 48958d2). Using 'best', which is (effectively) randomly sorted, provides a page that is guaranteed to exist, and a random one at that.
2021-01-26  Make ListObjectsWithMeta generic again and create a specialised ListObjectWithMeta for single file listing, so we can still be as fast, but do not have a misleading api  Nick White
2021-01-26  Improve lspipeline concurrency by removing WaitGroup stuff  Nick White
2021-01-26  Speed up lspipeline by making s3 requests concurrently and only processing single results from ListObjects requests  Nick White
2020-12-15  [rmbook] Append / to end of bookname, to ensure e.g. "1" doesn't match all books starting with "1"  Nick White
2020-12-15  [rmbook] Add -dryrun flag  Nick White
2020-12-14  Add rmbook tool  Nick White
2020-12-07  [rescribe] Fix up *.hocr glob, which ensures that using a savedir that already has a hocr directory in it will work  (v0.3.2)  Nick White
2020-12-07  [rescribe] Allow saving of results to somewhere other than a directory named after the book being processed  Nick White
2020-12-03  [rescribe] Fix portability issue where hocrs may not be correctly moved and txt-ified on windows  Nick White
2020-11-30  Merge branch 'master' of ssh://hammerhead/home/nick/rescribe/src/bookpipeline  Nick White
2020-11-30  Add getstats tool  Nick White
2020-11-24  [booktopipeline] Add a check to disallow adding a book that already exists  Nick White
This is important as if a book is added which has already been done, then an analyse job will be added every time a page is OCRed, which will clog up the pipeline with unnecessary work. Also if a book was added with the same name but differently named files, or a different number of pages, the results would almost certainly not be as intended. In the case of a book really wanting to be added with a particular name, either the original directory can be removed on S3, or "v2" or similar can be appended to the book name before calling booktopipeline.
2020-11-17  Add trimqueue and logwholequeue utilities which can help deal with weird queue states  Nick White
2020-11-17  Remove _bin0.x from txt filenames  (v0.3.0)  Nick White
2020-11-16  [rescribe] Default to an appropriate tesscmd for Windows  Nick White
2020-11-16  [rescribe] Add txt output, only keep colour pdf, and reorganise files so they're more user-friendly  Nick White
2020-11-16  [rescribe] Mention in usage that things can be saved in a different directory  Nick White
2020-11-10  gofmt  Nick White
2020-11-10  [rescribe] Enable custom paths to tesseract command to be set (also improve some error output)  Nick White
2020-11-10  [rescribe] Change -t to the path of the traineddata file, and set TESSDATA_PREFIX accordingly  Nick White
2020-11-10  [rescribe] Handle errors in processbook correctly, and improve console output  Nick White