path: root/cmd
Age | Commit message | Author
2021-11-23 | rescribe: Remove debugging printfs related to PDF parsing | Nick White
2021-11-23 | rescribe: Improve pdf consumption by ensuring only jpg or png are saved to upload | Nick White
2021-11-23 | gofmt, plus update documentation of recently changed pipeline.UploadImages | Nick White
2021-11-22 | rescribe: Add support for reading images directly from PDFs | Nick White
    There are several TODO items before this can be considered "good enough", let alone complete. See the comments in the code for details. On a good day, with a fair wind, though, this works.
2021-11-22 | rescribe: replace errors.New with fmt.Errorf | Nick White
2021-11-09 | lspipeline-ng: Remove debugging printf | Nick White
2021-11-02 | rescribe: handle directories with spaces correctly | Nick White
2021-10-26 | rescribe: Separate gui code, and organise it better (should be no functional change) | Nick White
2021-10-25 | rescribe: wip gui using fyne | Nick White
2021-10-12 | rescribe: fix lookup of external training file (tag: v0.5.3) | Nick White
2021-10-01 | rescribe: Include new tessdata in embed getter (tag: v0.5.2) | Nick White
2021-10-01 | rescribe: Add embedded lat.traineddata | Nick White
2021-10-01 | rescribe: Add both original training path and embedded version on error output for training file not found, so that it's clear that the file specified may not exist | Nick White
2021-08-24 | rescribe: improve makefile to match the way we deploy to the website | Nick White
2021-08-19 | lspipeline-ng: Limit number of book details requests so we don't run into EC2's rate limiting (tag: v0.5.0) | Nick White
2021-08-18 | rescribe: Update documentation on how to deal with M1 signing, and move makefile to where it makes sense | Nick White
2021-08-17 | pipeline: use regular storage for tests, rather than a separate one | Nick White
2021-08-02 | rescribe: Add experimental m1 build | Nick White
2021-08-02 | internal/pipeline: Add test (incomplete but working) for UploadImages | Nick White
2021-07-20 | Cleanup thanks to go vet | Nick White
2021-07-13 | gofmt | Nick White
2021-07-12 | Add necessary pipeliner dependency for testqueue (probably remove this from internal library later as it's only needed for tests) | Nick White
2021-07-12 | Add test for upAndQueue function | Nick White
    This involved adding a test queue, so it can be run safely without interfering with the pipeline.
2021-07-08 | rescribe: Exit with an error if directory doesn't exist | Nick White
2021-06-29 | rescribe: add documentation on how to generate embedded data | Nick White
2021-06-29 | rescribe: Add embed target for darwin (osx) too | Nick White
2021-06-22 | rescribe: Remove erroneous unnecessary mkdir | Nick White
2021-06-22 | rescribe: Make it clearer that embedded training files are available to use | Nick White
2021-06-22 | rescribe: add embedded tesseract for linux | Nick White
2021-06-22 | rescribe: allow use of embedded training even if -systess is used | Nick White
2021-06-22 | rescribe: Add go generate command to download the needed files to embed | Nick White
2021-06-22 | rescribe: Add an embedded tessdata | Nick White
2021-06-21 | rescribe: Set up so only Tesseract needed for the build platform is embedded | Nick White
2021-06-21 | rescribe: Embed Tesseract into binary so that no Tesseract install is necessary | Nick White
2021-05-31 | Fix bug after changing pipeliner for tests, to ensure DeleteObjects is available to Pipeliner | Nick White
2021-03-16 | rescribe: change default training directory to trainings/v0.3.3 | Nick White
2021-02-22 | lspipeline: Rename to lspipeline-ng, and restore pre concurrency version to lspipeline as there are some hard to debug issues in concurrency version | Nick White
2021-02-15 | getsamplepages: Add -prefix option, and use 'best' to get random page numbers | Nick White
    The -prefix option is useful to us. Previously only a .jpg for page number 100 was retrieved, which failed if the book had fewer (or unusually named) pages, and also didn't provide a corresponding .hocr at all (bug introduced with 48958d2). Using 'best', which is (effectively) randomly sorted, provides a page that is guaranteed to exist, and a random one at that.
2021-01-26 | Make ListObjectsWithMeta generic again and create a specialised ListObjectWithMeta for single file listing, so we can still be as fast, but do not have a misleading api | Nick White
2021-01-26 | Improve lspipeline concurrency by removing WaitGroup stuff | Nick White
2021-01-26 | Speed up lspipeline by making s3 requests concurrently and only processing single results from ListObjects requests | Nick White
2020-12-15 | [rmbook] Append / to end of bookname, to ensure e.g. "1" doesn't match all books starting with "1" | Nick White
2020-12-15 | [rmbook] Add -dryrun flag | Nick White
2020-12-14 | Add rmbook tool | Nick White
2020-12-07 | [rescribe] Fix up *.hocr glob, which ensures that using a savedir that already has a hocr directory in it will work (tag: v0.3.2) | Nick White
2020-12-07 | [rescribe] Allow saving of results to somewhere other than a directory named after the book being processed | Nick White
2020-12-03 | [rescribe] Fix portability issue where hocrs may not be correctly moved and txt-ified on windows | Nick White
2020-11-30 | Merge branch 'master' of ssh://hammerhead/home/nick/rescribe/src/bookpipeline | Nick White
2020-11-30 | Add getstats tool | Nick White
2020-11-24 | [booktopipeline] Add a check to disallow adding a book that already exists | Nick White
    This is important because if a book that has already been processed is added again, an analyse job will be added every time a page is OCRed, which will clog up the pipeline with unnecessary work. Also, if a book was added with the same name but differently named files, or a different number of pages, the results would almost certainly not be as intended. If a book really needs to be added with a particular name, either the original directory can be removed on S3, or "v2" or similar can be appended to the book name before calling booktopipeline.
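
The duplicate-book check described in the 2020-11-24 entry can be sketched roughly as follows. This is a minimal illustration only, not the code from booktopipeline: the objectLister interface, the in-memory memStore, and the bookExists and addBook functions are all hypothetical names, and the real tool talks to S3 rather than a slice of keys. The sketch also appends "/" to the book name before listing, in the spirit of the 2020-12-15 rmbook change, so that a book named "1" is not confused with "10".

```go
package main

import (
	"fmt"
	"strings"
)

// objectLister is a hypothetical stand-in for a storage client.
// It is an assumption for this sketch, not bookpipeline's real interface.
type objectLister interface {
	ListObjects(prefix string) ([]string, error)
}

// memStore is an in-memory objectLister used only for illustration.
type memStore []string

// ListObjects returns every stored key that starts with prefix.
func (m memStore) ListObjects(prefix string) ([]string, error) {
	var keys []string
	for _, k := range m {
		if strings.HasPrefix(k, prefix) {
			keys = append(keys, k)
		}
	}
	return keys, nil
}

// bookExists reports whether any object is stored under bookname.
// The trailing "/" keeps a book named "1" from matching "10/...".
func bookExists(l objectLister, bookname string) (bool, error) {
	keys, err := l.ListObjects(strings.TrimSuffix(bookname, "/") + "/")
	if err != nil {
		return false, fmt.Errorf("listing objects for %s: %w", bookname, err)
	}
	return len(keys) > 0, nil
}

// addBook refuses to add a book whose name is already in use.
func addBook(l objectLister, bookname string) error {
	exists, err := bookExists(l, bookname)
	if err != nil {
		return err
	}
	if exists {
		return fmt.Errorf("book %s already exists; remove it from storage or use a new name such as %s-v2", bookname, bookname)
	}
	// Uploading pages and queueing work would happen here.
	return nil
}

func main() {
	store := memStore{"10/0001.jpg", "aristotle/0001.jpg"}
	fmt.Println(addBook(store, "1"))         // "1" is not in use, so no error
	fmt.Println(addBook(store, "aristotle")) // rejected as a duplicate
}
```

Checking before any upload or queueing happens means a duplicate name is rejected without adding work to the pipeline.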