interface, to ensure no "duplicate function" errors when compiling
they need
We were using Pipeliner as a catch-all, but it's nicer if each function
can state exactly what it needs (e.g. download functionality), so
decompose the interface accordingly.
be needed
internal library later, as it's only needed for tests
This involved adding a test queue, so it can be run safely without
interfering with the pipeline.
rescribe tool
This prevents issues if a .DS_Store file is present in a directory.
available to Pipeliner
an error
This is needed so that in tests the error can be selected out reliably,
rather than an empty struct signal.
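The pattern described here can be sketched as below, under the assumption (based on the surrounding fragments) that completion was previously signalled with an empty struct over a channel; `process` and `run` are hypothetical names:

```go
package main

import (
	"errors"
	"fmt"
)

// process runs in a goroutine and reports its result over errc.
// Sending an error value (nil on success) rather than an empty
// struct{} signal means a test can select the result out of the
// channel and inspect it directly.
func process(fail bool, errc chan error) {
	if fail {
		errc <- errors.New("process failed")
		return
	}
	errc <- nil
}

func run(fail bool) error {
	errc := make(chan error)
	go process(fail, errc)
	// A caller or test can reliably select on the error value here.
	return <-errc
}

func main() {
	fmt.Println(run(false))
	fmt.Println(run(true))
}
```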
destination file, so if it fails an empty file isn't left behind
lspipeline, as there are some hard-to-debug issues in the concurrency version
The -prefix option is useful to us.
Previously only a .jpg for page number 100 was retrieved, which
failed if the book had fewer (or unusually named) pages, and also
didn't provide a corresponding .hocr at all (bug introduced with
48958d2). Using 'best', which is (effectively) randomly sorted,
provides a page that is guaranteed to exist, and a random one at that.
ssh://ssh.phx.nearlyfreespeech.net/home/public/bookpipeline
large books
ListObjectWithMeta for single-file listing, so we can still be as fast, but do not have a misleading API
single results from ListObjects requests
up the request markedly
books starting with "1"
already has a hocr directory in it will work
after the book being processed