Age | Commit message | Author | |
---|---|---|---|
2021-07-12 | Add necessary pipeliner dependency for testqueue (probably remove this from internal library later as it's only needed for tests) | Nick White | |
2021-07-12 | Add test for upAndQueue function | Nick White | |
| This involved adding a test queue, so it can be run safely without interfering with the pipeline. | | |
2021-07-08 | rescribe: Exit with an error if directory doesn't exist | Nick White | |
2021-06-29 | rescribe: add documentation on how to generate embedded data | Nick White | |
2021-06-29 | rescribe: Add embed target for darwin (osx) too | Nick White | |
2021-06-22 | rescribe: Remove erroneous unnecessary mkdir | Nick White | |
2021-06-22 | rescribe: Make it clearer that embedded training files are available to use | Nick White | |
2021-06-22 | rescribe: add embedded tesseract for linux | Nick White | |
2021-06-22 | rescribe: allow use of embedded training even if -systess is used | Nick White | |
2021-06-22 | cloud: update spot image to latest version that won't attempt to build rescribe tool | Nick White | |
2021-06-22 | rescribe: Add go generate command to download the needed files to embed | Nick White | |
2021-06-22 | rescribe: Add an embedded tessdata | Nick White | |
2021-06-21 | Merge remote-tracking branch 'ssh/master' | Nick White | |
2021-06-21 | rescribe: Set up so only Tesseract needed for the build platform is embedded | Nick White | |
2021-06-21 | rescribe: Embed Tesseract into binary so that no Tesseract install is necessary | Nick White | |
2021-06-21 | update spot image used | Nick White | |
2021-06-15 | pipeline: Ignore hidden files when checking and uploading | Nick White | |
| This prevents issues if a .DS_Store file is present in a directory (see the sketch after this log). | | |
2021-05-31 | local: Only create a file once we are sure that it will be writeable | Nick White | |
2021-05-31 | Add a test for up(), and document download() and up() properly | Nick White | |
2021-05-31 | Fix bug after changing pipeliner for tests, to ensure DeleteObjects is available to Pipeliner | Nick White | |
2021-05-19 | Close process channel after writing to err channel in download(), in case of an error | Nick White | |
| This is needed so that in tests the error can be selected out reliably, rather than an empty process signal (see the channel-ordering sketch after this log). | | |
2021-05-19 | Add tests for download() | Nick White | |
2021-05-19 | Fix syntax with another Errorf call | Nick White | |
2021-05-19 | Local download now tries to open the source file before creating a destination file, so if it fails an empty file isn't left behind (sketch below) | Nick White | |
2021-05-19 | Add basic DeleteObjects implementation to local.go | Nick White | |
2021-05-19 | Fix syntax for some fmt.Errorf calls | Nick White | |
2021-04-12 | Update preproc dependency | Nick White | |
2021-03-16 | rescribe: change default training directory to trainings/v0.3.3 | Nick White | |
2021-02-22 | lspipeline: Rename to lspipeline-ng, and restore pre-concurrency version to lspipeline as there are some hard-to-debug issues in the concurrency version | Nick White | |
2021-02-15 | getsamplepages: Add -prefix option, and use 'best' to get random page numbers | Nick White | |
| The -prefix option is useful to us. Previously only a .jpg for page number 100 was retrieved, which failed if the book had fewer (or unusually named) pages, and also didn't provide a corresponding .hocr at all (bug introduced with 48958d2). Using 'best', which is (effectively) randomly sorted, provides a guaranteed-to-exist page, and a random one at that. | | |
2021-02-05 | Merge branch 'master' of ssh://ssh.phx.nearlyfreespeech.net/home/public/bookpipeline | Nick White | |
2021-02-05 | Update go-chart dependency | Nick White | |
2021-02-01 | Update AWS dependency to 1.37.1 | Nick White | |
2021-02-01 | Ensure DeleteObjects can handle over 1000 files to delete; fixes rmbook for large books (batching sketch below) | Nick White | |
2021-01-26 | Make ListObjectsWithMeta generic again and create a specialised ListObjectWithMeta for single file listing, so we can still be as fast, but do not have a misleading API | Nick White | |
2021-01-26 | Improve lspipeline concurrency by removing WaitGroup stuff | Nick White | |
2021-01-26 | Speed up lspipeline by making S3 requests concurrently and only processing single results from ListObjects requests | Nick White | |
2021-01-26 | Stop limiting keys returned from listobjectprefixes' API usage; this speeds up the request markedly | Nick White | |
2020-12-15 | [rmbook] Append / to end of bookname, to ensure e.g. "1" doesn't match all books starting with "1" (sketch below) | Nick White | |
2020-12-15 | [rmbook] Add -dryrun flag | Nick White | |
2020-12-14 | Add rmbook tool | Nick White | |
2020-12-14 | Update preproc module used to incorporate an important crash fix | Nick White | |
2020-12-07 | [rescribe] Fix up *.hocr glob, which ensures that using a savedir that already has a hocr directory in it will work (tag: v0.3.2) | Nick White | |
2020-12-07 | [rescribe] Allow saving of results to somewhere other than a directory named after the book being processed | Nick White | |
2020-12-04 | Ensure mkdir will succeed in upload | Nick White | |
2020-12-03 | [rescribe] Fix portability issue where hocrs may not be correctly moved and txt-ified on Windows | Nick White | |
2020-12-03 | Don't upload binarised pdf twice needlessly | Nick White | |
| This can also result in the file being uploaded twice simultaneously, as up() runs in a separate goroutine. This can cause failures on Windows, as one upload process attempts to remove the file while the other still has it open for uploading. It could probably also fail if one upload completed (so the file was deleted) before the other started. | | |
2020-11-30 | Merge branch 'master' of ssh://hammerhead/home/nick/rescribe/src/bookpipeline | Nick White | |
2020-11-30 | Add getstats tool | Nick White | |
2020-11-24 | [booktopipeline] Add a check to disallow adding a book that already exists | Nick White | |
| This is important, as if a book which has already been done is added, an analyse job will be added every time a page is OCRed, which will clog up the pipeline with unnecessary work. Also, if a book was added with the same name but differently named files, or a different number of pages, the results would almost certainly not be as intended. If a book really needs to be added with a particular name, either the original directory can be removed on S3, or "v2" or similar can be appended to the book name before calling booktopipeline. | | |
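
A minimal sketch of the hidden-file check from the 2021-06-15 pipeline commit, assuming a simple name-based test; the real pipeline's path handling may differ:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isHidden reports whether a file's base name starts with ".",
// which covers macOS's .DS_Store and similar hidden files.
func isHidden(path string) bool {
	return strings.HasPrefix(filepath.Base(path), ".")
}

func main() {
	for _, p := range []string{"book/0001.jpg", "book/.DS_Store"} {
		if isHidden(p) {
			fmt.Println("skipping", p)
			continue
		}
		fmt.Println("would upload", p)
	}
}
```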
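The channel-ordering sketch for the 2021-05-19 download() fix. This illustrates the pattern only: the real download() has a different signature, and the channel names here are assumptions. Because the error channel is unbuffered, the send blocks until a receiver takes the error, so the process channel cannot be closed before the error is consumed:

```go
package main

import (
	"errors"
	"fmt"
)

// download sketches the ordering fix: on failure the error is sent
// on errc *before* the process channel is closed.
func download(process chan string, errc chan error) {
	// ... copying work would happen here ...
	errc <- errors.New("simulated download failure")
	// Only close process after the error send completes. With an
	// unbuffered errc the send above blocks until the error is
	// received, so a test select can never see a closed process
	// channel before it sees the error.
	close(process)
}

func main() {
	process := make(chan string)
	errc := make(chan error)

	go download(process, errc)

	select {
	case err := <-errc:
		fmt.Println("got error:", err) // always taken here
	case _, ok := <-process:
		fmt.Println("process closed early, ok =", ok) // unreachable with the fix
	}
}
```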
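A sketch of the ordering behind the 2021-05-19 local download change (and the related 2021-05-31 commit about only creating a file once it is known to be writeable): open the source first, and create the destination only if that succeeds. copyFile is a hypothetical stand-in for the repository's download code:

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// copyFile opens the source before creating the destination, so a
// failed download doesn't leave an empty destination file behind.
func copyFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return fmt.Errorf("opening source %s: %w", src, err)
	}
	defer in.Close()

	out, err := os.Create(dst) // only created once src opened OK
	if err != nil {
		return fmt.Errorf("creating destination %s: %w", dst, err)
	}
	defer out.Close()

	_, err = io.Copy(out, in)
	return err
}

func main() {
	if err := copyFile("nonexistent.jpg", "copy.jpg"); err != nil {
		fmt.Println(err) // fails before copy.jpg is ever created
	}
}
```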
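The batching sketch for the 2021-02-01 DeleteObjects fix. S3's DeleteObjects call accepts at most 1,000 keys per request, so deleting a large book means chunking. deleteAll and its del callback are illustrative stand-ins, not the repository's actual DeleteObjects API:

```go
package main

import "fmt"

// maxKeysPerRequest is the S3 DeleteObjects limit: at most 1,000
// object keys may be deleted in a single request.
const maxKeysPerRequest = 1000

// deleteAll splits keys into chunks of up to 1,000 and calls del
// once per chunk; del stands in for a real S3 DeleteObjects call.
func deleteAll(keys []string, del func(batch []string) error) error {
	for len(keys) > 0 {
		n := len(keys)
		if n > maxKeysPerRequest {
			n = maxKeysPerRequest
		}
		if err := del(keys[:n]); err != nil {
			return fmt.Errorf("delete batch failed: %w", err)
		}
		keys = keys[n:]
	}
	return nil
}

func main() {
	// A large book can easily have more than 1,000 page files.
	keys := make([]string, 2500)
	for i := range keys {
		keys[i] = fmt.Sprintf("examplebook/%04d.jpg", i)
	}
	_ = deleteAll(keys, func(batch []string) error {
		fmt.Printf("would delete %d objects\n", len(batch))
		return nil
	})
	// Prints three batches: 1000, 1000 and 500.
}
```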
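A sketch of the 2020-12-15 rmbook prefix fix, assuming S3-style keys of the form bookname/filename; keysForBook and the example keys are made up for illustration. Appending "/" makes the prefix match exact book names only:

```go
package main

import (
	"fmt"
	"strings"
)

// keysForBook matches keys under the book's own directory only, so
// that "1" matches keys under "1/" but not "10/" or "123/".
func keysForBook(all []string, book string) []string {
	prefix := book + "/"
	var matched []string
	for _, k := range all {
		if strings.HasPrefix(k, prefix) {
			matched = append(matched, k)
		}
	}
	return matched
}

func main() {
	keys := []string{"1/0001.jpg", "10/0001.jpg", "123/0001.jpg"}
	fmt.Println(keysForBook(keys, "1")) // [1/0001.jpg] only
}
```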