path: root/cmd
Age  Commit message  Author
2021-08-17  pipeline: use regular storage for tests, rather than a separate one  Nick White
2021-08-02  rescribe: Add experimental m1 build  Nick White
2021-08-02  internal/pipeline: Add test (incomplete but working) for UploadImages  Nick White
2021-07-20  Cleanup thanks to go vet  Nick White
2021-07-13  gofmt  Nick White
2021-07-12  Add necessary pipeliner dependency for testqueue (probably remove this from internal library later as it's only needed for tests)  Nick White
2021-07-12  Add test for upAndQueue function  Nick White
    This involved adding a test queue, so it can be run safely without interfering with the pipeline.
2021-07-08  rescribe: Exit with an error if directory doesn't exist  Nick White
2021-06-29  rescribe: add documentation on how to generate embedded data  Nick White
2021-06-29  rescribe: Add embed target for darwin (osx) too  Nick White
2021-06-22  rescribe: Remove erroneous unnecessary mkdir  Nick White
2021-06-22  rescribe: Make it clearer that embedded training files are available to use  Nick White
2021-06-22  rescribe: add embedded tesseract for linux  Nick White
2021-06-22  rescribe: allow use of embedded training even if -systess is used  Nick White
2021-06-22  rescribe: Add go generate command to download the needed files to embed  Nick White
2021-06-22  rescribe: Add an embedded tessdata  Nick White
2021-06-21  rescribe: Set up so only Tesseract needed for the build platform is embedded  Nick White
2021-06-21  rescribe: Embed Tesseract into binary so that no Tesseract install is necessary  Nick White
2021-05-31  Fix bug after changing pipeliner for tests, to ensure DeleteObjects is available to Pipeliner  Nick White
2021-03-16  rescribe: change default training directory to trainings/v0.3.3  Nick White
2021-02-22  lspipeline: Rename to lspipeline-ng, and restore pre-concurrency version to lspipeline, as there are some hard-to-debug issues in the concurrency version  Nick White
2021-02-15  getsamplepages: Add -prefix option, and use 'best' to get random page numbers  Nick White
    The -prefix option is useful to us. Previously only a .jpg for page number 100 was retrieved, which failed if the book had fewer (or unusually named) pages, and also didn't provide a corresponding .hocr at all (bug introduced with 48958d2). Using 'best', which is (effectively) randomly sorted, provides a page that is guaranteed to exist, and a random one at that.
2021-01-26  Make ListObjectsWithMeta generic again and create a specialised ListObjectWithMeta for single file listing, so we can still be as fast, but do not have a misleading API  Nick White
2021-01-26  Improve lspipeline concurrency by removing WaitGroup stuff  Nick White
2021-01-26  Speed up lspipeline by making s3 requests concurrently and only processing single results from ListObjects requests  Nick White
2020-12-15  [rmbook] Append / to end of bookname, to ensure e.g. "1" doesn't match all books starting with "1"  Nick White
2020-12-15  [rmbook] Add -dryrun flag  Nick White
2020-12-14  Add rmbook tool  Nick White
2020-12-07  [rescribe] Fix up *.hocr glob, which ensures that using a savedir that already has a hocr directory in it will work  (v0.3.2)  Nick White
2020-12-07  [rescribe] Allow saving of results to somewhere other than a directory named after the book being processed  Nick White
2020-12-03  [rescribe] Fix portability issue where hocrs may not be correctly moved and txt-ified on windows  Nick White
2020-11-30  Merge branch 'master' of ssh://hammerhead/home/nick/rescribe/src/bookpipeline  Nick White
2020-11-30  Add getstats tool  Nick White
2020-11-24  [booktopipeline] Add a check to disallow adding a book that already exists  Nick White
    This is important as if a book is added which has already been done, then an analyse job will be added every time a page is OCRed, which will clog up the pipeline with unnecessary work. Also if a book was added with the same name but differently named files, or a different number of pages, the results would almost certainly not be as intended. In the case of a book really wanting to be added with a particular name, either the original directory can be removed on S3, or "v2" or similar can be appended to the book name before calling booktopipeline.
2020-11-17  Add trimqueue and logwholequeue utilities which can help deal with weird queue states  Nick White
2020-11-17  Remove _bin0.x from txt filenames  (v0.3.0)  Nick White
2020-11-16  [rescribe] Default to an appropriate tesscmd for Windows  Nick White
2020-11-16  [rescribe] Add txt output, only keep colour pdf, and reorganise files so they're more user-friendly  Nick White
2020-11-16  [rescribe] Mention in usage that things can be saved in a different directory  Nick White
2020-11-10  gofmt  Nick White
2020-11-10  [rescribe] Enable custom paths to tesseract command to be set (also improve some error output)  Nick White
2020-11-10  [rescribe] Change -t to the path of the traineddata file, and set TESSDATA_PREFIX accordingly  Nick White
2020-11-10  [rescribe] Handle errors in processbook correctly, and improve console output  Nick White
2020-11-10  [getpipelinebook] Rewrite to use internal package functions  Nick White
2020-11-10  Switch booktopipeline to use internal pipeline functions  Nick White
2020-11-09  Add a couple of things that should not be forgotten  Nick White
2020-11-09  Switch Preprocess() to take the thresholds to use, and have rescribe tool only use 0.1,0.2,0.3  (separatelocal)  Nick White
2020-11-09  [rescribe] Local only combo tool basically now working. Testing is still minimal.  Nick White
2020-11-09  [rescribe] work in progress at a self-contained local pipeline processor, called rescribe  Nick White
2020-11-09  [bookpipeline] Split most functionality out to package internal/pipeline  Nick White
    No functionality changes, but this should make it easier to make custom builds using the pipeline in slightly different ways.
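
The 2020-12-15 rmbook entry in the log appends "/" to the book name so that "1" does not match every book whose name starts with "1". The effect can be illustrated with a minimal Go sketch. This is not the rmbook code: the helper names bookPrefix and matchingKeys and the sample keys are hypothetical, and an in-memory slice stands in for an S3 listing.

```go
package main

import (
	"fmt"
	"strings"
)

// bookPrefix is a hypothetical helper (not taken from rmbook). It appends a
// trailing "/" to the book name, unless one is already present, so that the
// prefix "1" cannot also select books named "10" or "1984".
func bookPrefix(bookname string) string {
	if strings.HasSuffix(bookname, "/") {
		return bookname
	}
	return bookname + "/"
}

// matchingKeys returns the keys that start with prefix. It stands in for a
// prefix-based object listing; the real tool would query storage instead.
func matchingKeys(keys []string, prefix string) []string {
	var out []string
	for _, k := range keys {
		if strings.HasPrefix(k, prefix) {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	// Made-up object keys for three books named "1", "10" and "1984".
	keys := []string{"1/0001.jpg", "10/0001.jpg", "1984/0001.jpg"}

	// Bare book name as prefix: all three books match.
	fmt.Println(matchingKeys(keys, "1")) // [1/0001.jpg 10/0001.jpg 1984/0001.jpg]

	// Book name with trailing "/": only book "1" matches.
	fmt.Println(matchingKeys(keys, bookPrefix("1"))) // [1/0001.jpg]
}
```

For a tool that deletes objects, the narrower prefix matters more than usual, since a bare "1" would otherwise select unrelated books for removal.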