bookpipeline - Tools to process books in a cloud based pipeline system

Age	Commit message	Author
2021-10-01	rescribe: Add embedded lat.traineddata	Nick White

2021-10-01	rescribe: Add both original training path and embedded version on error output for training file not found, so that it's clear that the file specified may not exist	Nick White
2021-08-24	rescribe: improve makefile to match the way we deploy to the website	Nick White

2021-08-19	lspipeline-ng: Limit number of book details requests so we don't run into EC2's rate limiting (tag: v0.5.0)	Nick White
2021-08-18	rescribe: Update documentation on how to deal with M1 signing, and move makefile to where it makes sense	Nick White
2021-08-17	pipeline: use regular storage for tests, rather than a separate one	Nick White

2021-08-02	rescribe: Add experimental m1 build	Nick White

2021-08-02	internal/pipeline: Add test (incomplete but working) for UploadImages	Nick White

2021-07-20	Cleanup thanks to go vet	Nick White

2021-07-13	gofmt	Nick White

2021-07-12	Add necessary pipeliner dependency for testqueue (probably remove this from internal library later as it's only needed for tests)	Nick White
2021-07-12	Add test for upAndQueue function	Nick White
	This involved adding a test queue, so it can be run safely without interfering with the pipeline.
2021-07-08	rescribe: Exit with an error if directory doesn't exist	Nick White

2021-06-29	rescribe: add documentation on how to generate embedded data	Nick White

2021-06-29	rescribe: Add embed target for darwin (osx) too	Nick White

2021-06-22	rescribe: Remove erroneous unnecessary mkdir	Nick White

2021-06-22	rescribe: Make it clearer that embedded training files are available to use	Nick White

2021-06-22	rescribe: add embedded tesseract for linux	Nick White

2021-06-22	rescribe: allow use of embedded training even if -systess is used	Nick White

2021-06-22	rescribe: Add go generate command to download the needed files to embed	Nick White

2021-06-22	rescribe: Add an embedded tessdata	Nick White

2021-06-21	rescribe: Set up so only Tesseract needed for the build platform is embedded	Nick White

2021-06-21	rescribe: Embed Tesseract into binary so that no Tesseract install is necessary	Nick White

2021-05-31	Fix bug after changing pipeliner for tests, to ensure DeleteObjects is available to Pipeliner	Nick White
2021-03-16	rescribe: change default training directory to trainings/v0.3.3	Nick White

2021-02-22	lspipeline: Rename to lspipeline-ng, and restore pre concurrency version to lspipeline as there are some hard to debug issues in concurrency version	Nick White
2021-02-15	getsamplepages: Add -prefix option, and use 'best' to get random page numbers	Nick White
	The -prefix option is useful to us. Previously only a .jpg for page number 100 was retrieved, which failed if the book had fewer (or unusually named) pages, and also didn't provide a corresponding .hocr at all (bug introduced with 48958d2). Using 'best', which is (effectively) randomly sorted, provides a page that is guaranteed to exist, and a random one at that.
2021-01-26	Make ListObjectsWithMeta generic again and create a specialised ListObjectWithMeta for single file listing, so we can still be as fast, but do not have a misleading api	Nick White
2021-01-26	Improve lspipeline concurrency by removing WaitGroup stuff	Nick White

2021-01-26	Speed up lspipeline by making s3 requests concurrently and only processing single results from ListObjects requests	Nick White
2020-12-15	[rmbook] Append / to end of bookname, to ensure e.g. "1" doesn't match all books starting with "1"	Nick White
2020-12-15	[rmbook] Add -dryrun flag	Nick White

2020-12-14	Add rmbook tool	Nick White

2020-12-07	[rescribe] Fix up *.hocr glob, which ensures that using a savedir that already has a hocr directory in it will work (tag: v0.3.2)	Nick White
2020-12-07	[rescribe] Allow saving of results to somewhere other than a directory named after the book being processed	Nick White
2020-12-03	[rescribe] Fix portability issue where hocrs may not be correctly moved and txt-ified on windows	Nick White
2020-11-30	Merge branch 'master' of ssh://hammerhead/home/nick/rescribe/src/bookpipeline	Nick White

2020-11-30	Add getstats tool	Nick White

2020-11-24	[booktopipeline] Add a check to disallow adding a book that already exists	Nick White
	This is important because if a book is added which has already been done, an analyse job will be added every time a page is OCRed, which will clog up the pipeline with unnecessary work. Also, if a book was added with the same name but differently named files, or a different number of pages, the results would almost certainly not be as intended. If a book really does need to be added with a particular name, either the original directory can be removed on S3, or "v2" or similar can be appended to the book name before calling booktopipeline.
2020-11-17	Add trimqueue and logwholequeue utilities which can help deal with weird queue states	Nick White
2020-11-17	Remove _bin0.x from txt filenames (tag: v0.3.0)	Nick White

2020-11-16	[rescribe] Default to an appropriate tesscmd for Windows	Nick White

2020-11-16	[rescribe] Add txt output, only keep colour pdf, and reorganise files so they're more user-friendly	Nick White
2020-11-16	[rescribe] Mention in usage that things can be saved in a different directory	Nick White

2020-11-10	gofmt	Nick White

2020-11-10	[rescribe] Enable custom paths to tesseract command to be set (also improve some error output)	Nick White
2020-11-10	[rescribe] Change -t to the path of the traineddata file, and set TESSDATA_PREFIX accordingly	Nick White
2020-11-10	[rescribe] Handle errors in processbook correctly, and improve console output	Nick White

2020-11-10	[getpipelinebook] Rewrite to use internal package functions	Nick White

2020-11-10	Switch booktopipeline to use internal pipeline functions	Nick White