Age  Commit message  Author
2021-06-22  rescribe: Add an embedded tessdata  Nick White
2021-06-21  Merge remote-tracking branch 'ssh/master'  Nick White
2021-06-21  rescribe: Set up so only Tesseract needed for the build platform is embedded  Nick White
2021-06-21  rescribe: Embed Tesseract into binary so that no Tesseract install is necessary  Nick White
2021-06-21  update spot image used  Nick White
2021-06-15  pipeline: Ignore hidden files when checking and uploading  Nick White
This prevents issues if a .DS_Store file is present in a directory.
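A minimal sketch of the kind of check this commit describes, assuming a hidden file is simply one whose base name starts with a dot. The helper names `isHidden` and `visibleFiles` are hypothetical and are not taken from the bookpipeline source.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isHidden reports whether the final element of path starts with a dot,
// as .DS_Store does. Hypothetical helper, not the pipeline's own code.
func isHidden(path string) bool {
	return strings.HasPrefix(filepath.Base(path), ".")
}

// visibleFiles returns the entries of names that are not hidden.
func visibleFiles(names []string) []string {
	var out []string
	for _, n := range names {
		if !isHidden(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	names := []string{"0001.jpg", ".DS_Store", "0002.jpg"}
	fmt.Println(visibleFiles(names)) // prints [0001.jpg 0002.jpg]
}
```

Filtering before both the check and the upload means a stray .DS_Store neither fails validation nor ends up in the bucket.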
2021-05-31  local: Only create a file once we are sure that it will be writeable  Nick White
2021-05-31  Add a test for up(), and document download() and up() properly  Nick White
2021-05-31  Fix bug after changing pipeliner for tests, to ensure DeleteObjects is available to Pipeliner  Nick White
2021-05-19  Close process channel after writing to err channel in download(), in case of an error  Nick White
This is needed so that in tests the error can be selected out reliably, rather than an empty process signal.
2021-05-19  Add tests for download()  Nick White
2021-05-19  Fix syntax with another Errorf call  Nick White
2021-05-19  Local download now tries to open the source file before creating a destination file, so if it fails an empty file isn't left behind  Nick White
2021-05-19  Add basic DeleteObjects implementation to local.go  Nick White
2021-05-19  Fix syntax for some fmt.Errorf calls  Nick White
2021-04-12  Update preproc dependency  Nick White
2021-03-16  rescribe: change default training directory to trainings/v0.3.3  Nick White
2021-02-22  lspipeline: Rename to lspipeline-ng, and restore pre concurrency version to lspipeline as there are some hard to debug issues in concurrency version  Nick White
2021-02-15  getsamplepages: Add -prefix option, and use 'best' to get random page numbers  Nick White
The -prefix option is useful to us. Previously only a .jpg for page number 100 was retrieved, which failed if the book had fewer (or unusually named) pages, and also didn't provide a corresponding .hocr at all (bug introduced with 48958d2). Using 'best', which is (effectively) randomly sorted, provides a page that is guaranteed to exist, and a random one at that.
2021-02-05  Merge branch 'master' of ssh://ssh.phx.nearlyfreespeech.net/home/public/bookpipeline  Nick White
2021-02-05  Update go-chart dependency  Nick White
2021-02-01  Update AWS dependency to 1.37.1  Nick White
2021-02-01  Ensure DeleteObjects can handle over 1000 files to delete; fixes rmbook for large books  Nick White
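S3 accepts at most 1000 keys in a single DeleteObjects request, so deleting a large book means splitting the key list into batches. The sketch below shows only that batching step. The `batches` helper and `maxBatch` constant are hypothetical names, and the actual request to S3 is left out rather than guessed at.

```go
package main

import "fmt"

// maxBatch is the most keys S3 accepts in one DeleteObjects request.
const maxBatch = 1000

// batches splits keys into consecutive slices of at most size elements.
// Hypothetical helper; it returns nil if size is not positive.
func batches(keys []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var out [][]string
	for len(keys) > size {
		out = append(out, keys[:size])
		keys = keys[size:]
	}
	if len(keys) > 0 {
		out = append(out, keys)
	}
	return out
}

func main() {
	// 2500 made-up keys standing in for the pages of a large book.
	keys := make([]string, 2500)
	for i := range keys {
		keys[i] = fmt.Sprintf("book/%04d.jpg", i)
	}
	for _, b := range batches(keys, maxBatch) {
		// A real implementation would issue one delete request per batch here.
		fmt.Println(len(b))
	}
	// prints 1000, 1000 and 500 on separate lines
}
```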
2021-01-26  Make ListObjectsWithMeta generic again and create a specialised ListObjectWithMeta for single file listing, so we can still be as fast, but do not have a misleading api  Nick White
2021-01-26  Improve lspipeline concurrency by removing WaitGroup stuff  Nick White
2021-01-26  Speed up lspipeline by making s3 requests concurrently and only processing single results from ListObjects requests  Nick White
2021-01-26  Stop limiting keys returned from listobjectprefixes' api usage; this speeds up the request markedly  Nick White
2020-12-15  [rmbook] Append / to end of bookname, to ensure e.g. "1" doesn't match all books starting with "1"  Nick White
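Why the trailing slash matters can be shown with a plain prefix match over some made-up keys. The helpers `bookPrefix` and `matching` are hypothetical and only illustrate the matching behaviour, not how rmbook lists objects.

```go
package main

import (
	"fmt"
	"strings"
)

// bookPrefix returns name with exactly one trailing slash appended if it
// has none, so that it matches only keys inside that book's directory.
func bookPrefix(name string) string {
	if strings.HasSuffix(name, "/") {
		return name
	}
	return name + "/"
}

// matching returns the keys that start with prefix.
func matching(keys []string, prefix string) []string {
	var out []string
	for _, k := range keys {
		if strings.HasPrefix(k, prefix) {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	keys := []string{"1/0001.jpg", "10/0001.jpg", "1799_book/0001.jpg"}
	fmt.Println(matching(keys, "1"))             // prints [1/0001.jpg 10/0001.jpg 1799_book/0001.jpg]
	fmt.Println(matching(keys, bookPrefix("1"))) // prints [1/0001.jpg]
}
```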
2020-12-15  [rmbook] Add -dryrun flag  Nick White
2020-12-14  Add rmbook tool  Nick White
2020-12-14  Update preproc module used to incorporate an important crash fix  Nick White
2020-12-07  [rescribe] Fix up *.hocr glob, which ensures that using a savedir that already has a hocr directory in it will work  (tag: v0.3.2)  Nick White
2020-12-07  [rescribe] Allow saving of results to somewhere other than a directory named after the book being processed  Nick White
2020-12-04  Ensure mkdir will succeed in upload  Nick White
2020-12-03  [rescribe] Fix portability issue where hocrs may not be correctly moved and txt-ified on windows  Nick White
2020-12-03  Don't upload binarised pdf twice needlessly  Nick White
This can also result in the file being uploaded twice simultaneously, as up() is running in a separate goroutine. This can cause failures on Windows as the file is attempted to be removed by one upload process while being open to upload by the other process. Probably it could also fail if the process completed by one (so the file was deleted) before being started by the other.
2020-11-30  Merge branch 'master' of ssh://hammerhead/home/nick/rescribe/src/bookpipeline  Nick White
2020-11-30  Add getstats tool  Nick White
2020-11-24  [booktopipeline] Add a check to disallow adding a book that already exists  Nick White
This is important as if a book is added which has already been done, then an analyse job will be added every time a page is OCRed, which will clog up the pipeline with unnecessary work. Also if a book was added with the same name but differently named files, or a different number of pages, the results would almost certainly not be as intended. In the case of a book really wanting to be added with a particular name, either the original directory can be removed on S3, or "v2" or similar can be appended to the book name before calling booktopipeline.
2020-11-18  Switch to a maintained version of gofpdf  Nick White
2020-11-18  Describe rescribe tool in documentation  (tag: v0.3.1)  Nick White
2020-11-17  Add trimqueue and logwholequeue utilities which can help deal with weird queue states  Nick White
2020-11-17  Remove _bin0.x from txt filenames  (tag: v0.3.0)  Nick White
2020-11-16  Some changes to ensure the pipeline works correctly on Windows  Nick White
There were a couple of places where a file was uploaded while still open, which resulted in an attempt to remove it, which causes an error from Windows. The allOCRed function also included an assumption that the path separator would be a /, which is always correct for AWS, and correct for local on Linux and OSX, but not for local Windows. Fixed by leaving the separator well alone. Also, the local connection was not stripping leading \, like it did /, which caused an issue with Windows local. Windows local is now tested and working, at least through wine.
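The leading-separator fix mentioned above might look something like this sketch, which strips both kinds of separator from the front of a path and leaves the rest alone. `stripLeadingSeparators` is a hypothetical name; the local connection code may do this differently.

```go
package main

import (
	"fmt"
	"strings"
)

// stripLeadingSeparators removes any leading / or \ characters from p,
// leaving separators inside the path untouched. Hypothetical helper.
func stripLeadingSeparators(p string) string {
	return strings.TrimLeft(p, `/\`)
}

func main() {
	fmt.Println(stripLeadingSeparators("/book/0001.jpg"))  // prints book/0001.jpg
	fmt.Println(stripLeadingSeparators(`\book\0001.jpg`)) // prints book\0001.jpg
}
```

Leaving interior separators untouched matches the commit's approach of not assuming which separator the platform uses.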
2020-11-16  [rescribe] Default to an appropriate tesscmd for Windows  Nick White
2020-11-16  [rescribe] Add txt output, only keep colour pdf, and reorganise files so they're more user-friendly  Nick White
2020-11-16  [rescribe] Mention in usage that things can be saved in a different directory  Nick White
2020-11-16  Add makefile for generating cross compiled rescribe binaries  Nick White
2020-11-10  gofmt  Nick White
2020-11-10  [rescribe] Enable custom paths to tesseract command to be set (also improve some error output)  Nick White