bookpipeline - Tools to process books in a cloud-based pipeline system

Age	Commit message	Author
2021-07-12	Add necessary pipeliner dependency for testqueue (probably remove this from internal library later as it's only needed for tests)	Nick White

2021-07-12	Add test for upAndQueue function	Nick White
	This involved adding a test queue, so it can be run safely without interfering with the pipeline.
2021-07-08	rescribe: Exit with an error if directory doesn't exist	Nick White

2021-06-29	rescribe: add documentation on how to generate embedded data	Nick White

2021-06-29	rescribe: Add embed target for darwin (osx) too	Nick White

2021-06-22	rescribe: Remove erroneous unnecessary mkdir	Nick White

2021-06-22	rescribe: Make it clearer that embedded training files are available to use	Nick White

2021-06-22	rescribe: add embedded tesseract for linux	Nick White

2021-06-22	rescribe: allow use of embedded training even if -systess is used	Nick White

2021-06-22	cloud: update spot image to latest version that won't attempt to build rescribe tool	Nick White

2021-06-22	rescribe: Add go generate command to download the needed files to embed	Nick White

2021-06-22	rescribe: Add an embedded tessdata	Nick White

2021-06-21	Merge remote-tracking branch 'ssh/master'	Nick White

2021-06-21	rescribe: Set up so only Tesseract needed for the build platform is embedded	Nick White

2021-06-21	rescribe: Embed Tesseract into binary so that no Tesseract install is necessary	Nick White

2021-06-21	update spot image used	Nick White

2021-06-15	pipeline: Ignore hidden files when checking and uploading	Nick White
	This prevents issues if a .DS_Store file is present in a directory.
2021-05-31	local: Only create a file once we are sure that it will be writeable	Nick White

2021-05-31	Add a test for up(), and document download() and up() properly	Nick White

2021-05-31	Fix bug after changing pipeliner for tests, to ensure DeleteObjects is available to Pipeliner	Nick White

2021-05-19	Close process channel after writing to err channel in download(), in case of an error	Nick White
	This is needed so that in tests the error can be selected out reliably, rather than an empty process signal.
2021-05-19	Add tests for download()	Nick White

2021-05-19	Fix syntax with another Errorf call	Nick White

2021-05-19	Local download now tries to open the source file before creating a destination file, so if it fails an empty file isn't left behind	Nick White

2021-05-19	Add basic DeleteObjects implementation to local.go	Nick White

2021-05-19	Fix syntax for some fmt.Errorf calls	Nick White

2021-04-12	Update preproc dependency	Nick White

2021-03-16	rescribe: change default training directory to trainings/v0.3.3	Nick White

2021-02-22	lspipeline: Rename to lspipeline-ng, and restore pre-concurrency version to lspipeline as there are some hard-to-debug issues in the concurrency version	Nick White

2021-02-15	getsamplepages: Add -prefix option, and use 'best' to get random page numbers	Nick White
	The -prefix option is useful to us. Previously only a .jpg for page number 100 was retrieved, which failed if the book had fewer (or unusually named) pages, and also didn't provide a corresponding .hocr at all (bug introduced with 48958d2). Using 'best', which is (effectively) randomly sorted, provides a page that is guaranteed to exist, and a random one at that.
2021-02-05	Merge branch 'master' of ssh://ssh.phx.nearlyfreespeech.net/home/public/bookpipeline	Nick White

2021-02-05	Update go-chart dependency	Nick White

2021-02-01	Update AWS dependency to 1.37.1	Nick White

2021-02-01	Ensure DeleteObjects can handle over 1000 files to delete; fixes rmbook for large books	Nick White

2021-01-26	Make ListObjectsWithMeta generic again and create a specialised ListObjectWithMeta for single file listing, so we can still be as fast, but do not have a misleading API	Nick White

2021-01-26	Improve lspipeline concurrency by removing WaitGroup stuff	Nick White

2021-01-26	Speed up lspipeline by making S3 requests concurrently and only processing single results from ListObjects requests	Nick White

2021-01-26	Stop limiting keys returned from listobjectprefixes' API usage; this speeds up the request markedly	Nick White

2020-12-15	[rmbook] Append / to end of bookname, to ensure e.g. "1" doesn't match all books starting with "1"	Nick White

2020-12-15	[rmbook] Add -dryrun flag	Nick White

2020-12-14	Add rmbook tool	Nick White

2020-12-14	Update preproc module used to incorporate an important crash fix	Nick White

2020-12-07	[rescribe] Fix up *.hocr glob, which ensures that using a savedir that already has a hocr directory in it will work (v0.3.2)	Nick White

2020-12-07	[rescribe] Allow saving of results to somewhere other than a directory named after the book being processed	Nick White

2020-12-04	Ensure mkdir will succeed in upload	Nick White

2020-12-03	[rescribe] Fix portability issue where hocrs may not be correctly moved and txt-ified on Windows	Nick White

2020-12-03	Don't upload binarised pdf twice needlessly	Nick White
	This can also result in the file being uploaded twice simultaneously, as up() is running in a separate goroutine. This can cause failures on Windows, as one upload process attempts to remove the file while the other process still has it open for upload. It could probably also fail if one process completed (so the file was deleted) before the other started.
2020-11-30	Merge branch 'master' of ssh://hammerhead/home/nick/rescribe/src/bookpipeline	Nick White

2020-11-30	Add getstats tool	Nick White

2020-11-24	[booktopipeline] Add a check to disallow adding a book that already exists	Nick White
	This is important because if a book is added which has already been done, then an analyse job will be added every time a page is OCRed, which will clog up the pipeline with unnecessary work. Also, if a book was added with the same name but differently named files, or a different number of pages, the results would almost certainly not be as intended. If a book really needs to be added with a particular name, either the original directory can be removed on S3, or "v2" or similar can be appended to the book name before calling booktopipeline.
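
The 2021-02-01 entry above notes that DeleteObjects had to be made to handle more than 1000 files. An S3 DeleteObjects request accepts at most 1000 keys per call, so a longer key list has to be split into batches. The Go sketch below shows one way to do that splitting. The names chunkKeys and maxDeleteBatch are hypothetical, chosen for illustration, and are not taken from the bookpipeline source; the actual delete request is left as a comment.

```go
package main

import "fmt"

// maxDeleteBatch is the per-request key limit of S3's DeleteObjects call.
// The constant name is an assumption for this sketch.
const maxDeleteBatch = 1000

// chunkKeys (hypothetical helper) splits keys into consecutive slices of
// at most size elements each. It returns nil if size is not positive.
func chunkKeys(keys []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var batches [][]string
	for start := 0; start < len(keys); start += size {
		end := start + size
		if end > len(keys) {
			end = len(keys)
		}
		batches = append(batches, keys[start:end])
	}
	return batches
}

func main() {
	// Build an example key list larger than one batch.
	keys := make([]string, 2500)
	for i := range keys {
		keys[i] = fmt.Sprintf("book/%04d.jpg", i)
	}
	for i, b := range chunkKeys(keys, maxDeleteBatch) {
		// A real implementation would issue one delete request per batch here.
		fmt.Printf("batch %d: %d keys\n", i, len(b))
	}
}
```

With 2500 keys this yields three batches of 1000, 1000 and 500 keys, each small enough for a single request.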