bookpipeline - Tools to process books in a cloud based pipeline system

Age	Commit message	Author
2021-11-23	rescribe: Remove debugging printfs related to PDF parsing	Nick White

2021-11-23	rescribe: Improve pdf consumption by ensuring only jpg or png are saved to upload	Nick White

2021-11-23	gofmt, plus update documentation of recently changed pipeline.UploadImages	Nick White

2021-11-22	rescribe: Add support for reading images directly from PDFs	Nick White
	There are several TODO items before this can be considered "good enough", let alone complete. See the comments in the code for details. On a good day, with a fair wind, though, this works.
2021-11-22	rescribe: replace errors.New with fmt.Errorf	Nick White

2021-11-09	lspipeline-ng: Remove debugging printf	Nick White

2021-11-02	rescribe: handle directories with spaces correctly	Nick White

2021-10-26	rescribe: Separate gui code, and organise it better (should be no functional change)	Nick White

2021-10-25	rescribe: wip gui using fyne	Nick White

2021-10-12	rescribe: fix lookup of external training file (tag: v0.5.3)	Nick White

2021-10-01	rescribe: Include new tessdata in embed getter (tag: v0.5.2)	Nick White

2021-10-01	rescribe: Add embedded lat.traineddata	Nick White

2021-10-01	rescribe: Add both original training path and embedded version on error output for training file not found, so that it's clear that the file specified may not exist	Nick White

2021-08-24	rescribe: improve makefile to match the way we deploy to the website	Nick White

2021-08-19	lspipeline-ng: Limit number of book details requests so we don't run into EC2's rate limiting (tag: v0.5.0)	Nick White

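One common way to cap the number of requests in flight is a buffered channel used as a counting semaphore. The following is a minimal sketch of that idea only, not the code from this commit; `fetchAll`, its `fetch` callback and the limit of 2 are hypothetical names and values chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchAll calls fetch once per name, with at most limit calls running at
// the same time. Results are returned in the same order as names.
// limit must be at least 1, otherwise the first send on sem blocks forever.
// (Hypothetical helper; the real lspipeline-ng code may differ.)
func fetchAll(names []string, limit int, fetch func(string) string) []string {
	results := make([]string, len(names))
	sem := make(chan struct{}, limit) // buffered channel as a counting semaphore
	var wg sync.WaitGroup
	for i, name := range names {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit requests are already running
		go func(i int, name string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when this request finishes
			results[i] = fetch(name)
		}(i, name)
	}
	wg.Wait()
	return results
}

func main() {
	books := []string{"book1", "book2", "book3"}
	details := fetchAll(books, 2, func(name string) string {
		return "details for " + name // stand-in for a real network request
	})
	for _, d := range details {
		fmt.Println(d)
	}
}
```

Each goroutine writes only to its own slice index, so no extra locking is needed for the results.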
2021-08-18	rescribe: Update documentation on how to deal with M1 signing, and move makefile to where it makes sense	Nick White

2021-08-17	pipeline: use regular storage for tests, rather than a separate one	Nick White

2021-08-02	rescribe: Add experimental m1 build	Nick White

2021-08-02	internal/pipeline: Add test (incomplete but working) for UploadImages	Nick White

2021-07-20	Cleanup thanks to go vet	Nick White

2021-07-13	gofmt	Nick White

2021-07-12	Add necessary pipeliner dependency for testqueue (probably remove this from internal library later as it's only needed for tests)	Nick White

2021-07-12	Add test for upAndQueue function	Nick White
	This involved adding a test queue, so it can be run safely without interfering with the pipeline.
2021-07-08	rescribe: Exit with an error if directory doesn't exist	Nick White

2021-06-29	rescribe: add documentation on how to generate embedded data	Nick White

2021-06-29	rescribe: Add embed target for darwin (osx) too	Nick White

2021-06-22	rescribe: Remove erroneous unnecessary mkdir	Nick White

2021-06-22	rescribe: Make it clearer that embedded training files are available to use	Nick White

2021-06-22	rescribe: add embedded tesseract for linux	Nick White

2021-06-22	rescribe: allow use of embedded training even if -systess is used	Nick White

2021-06-22	rescribe: Add go generate command to download the needed files to embed	Nick White

2021-06-22	rescribe: Add an embedded tessdata	Nick White

2021-06-21	rescribe: Set up so only Tesseract needed for the build platform is embedded	Nick White

2021-06-21	rescribe: Embed Tesseract into binary so that no Tesseract install is necessary	Nick White

2021-05-31	Fix bug after changing pipeliner for tests, to ensure DeleteObjects is available to Pipeliner	Nick White

2021-03-16	rescribe: change default training directory to trainings/v0.3.3	Nick White

2021-02-22	lspipeline: Rename to lspipeline-ng, and restore pre concurrency version to lspipeline as there are some hard to debug issues in concurrency version	Nick White

2021-02-15	getsamplepages: Add -prefix option, and use 'best' to get random page numbers	Nick White
	The -prefix option is useful to us. Previously only a .jpg for page number 100 was retrieved, which failed if the book had fewer (or unusually named) pages, and also didn't provide a corresponding .hocr at all (bug introduced with 48958d2). Using 'best', which is (effectively) randomly sorted, provides a page that is guaranteed to exist, and a random one at that.
2021-01-26	Make ListObjectsWithMeta generic again and create a specialised ListObjectWithMeta for single file listing, so we can still be as fast, but do not have a misleading api	Nick White

2021-01-26	Improve lspipeline concurrency by removing WaitGroup stuff	Nick White

2021-01-26	Speed up lspipeline by making s3 requests concurrently and only processing single results from ListObjects requests	Nick White

2020-12-15	[rmbook] Append / to end of bookname, to ensure e.g. "1" doesn't match all books starting with "1"	Nick White

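The effect of the trailing slash can be shown with plain string prefix matching. This is a small sketch under the assumption that objects are stored under keys of the form `bookname/filename`; `bookPrefix`, `keysWithPrefix` and the example keys are hypothetical and are not taken from rmbook itself.

```go
package main

import (
	"fmt"
	"strings"
)

// bookPrefix returns a key prefix that selects exactly one book.
// Appending "/" stops the name "1" from also matching "10" or "1848-poems".
// (Hypothetical helper, for illustration only.)
func bookPrefix(bookname string) string {
	if strings.HasSuffix(bookname, "/") {
		return bookname
	}
	return bookname + "/"
}

// keysWithPrefix returns the keys that start with prefix, in input order.
func keysWithPrefix(keys []string, prefix string) []string {
	var out []string
	for _, k := range keys {
		if strings.HasPrefix(k, prefix) {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	// Hypothetical object keys, assuming a bookname/filename layout.
	keys := []string{"1/0001.jpg", "10/0001.jpg", "1848-poems/0001.jpg"}
	fmt.Println(keysWithPrefix(keys, "1"))             // bare name: every key above matches
	fmt.Println(keysWithPrefix(keys, bookPrefix("1"))) // with "/": only book "1" matches
}
```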
2020-12-15	[rmbook] Add -dryrun flag	Nick White

2020-12-14	Add rmbook tool	Nick White

2020-12-07	[rescribe] Fix up *.hocr glob, which ensures that using a savedir that already has a hocr directory in it will work (tag: v0.3.2)	Nick White

2020-12-07	[rescribe] Allow saving of results to somewhere other than a directory named after the book being processed	Nick White

2020-12-03	[rescribe] Fix portability issue where hocrs may not be correctly moved and txt-ified on windows	Nick White

2020-11-30	Merge branch 'master' of ssh://hammerhead/home/nick/rescribe/src/bookpipeline	Nick White

2020-11-30	Add getstats tool	Nick White

2020-11-24	[booktopipeline] Add a check to disallow adding a book that already exists	Nick White
	This is important as if a book is added which has already been done, then an analyse job will be added every time a page is OCRed, which will clog up the pipeline with unnecessary work. Also if a book was added with the same name but differently named files, or a different number of pages, the results would almost certainly not be as intended. In the case of a book really wanting to be added with a particular name, either the original directory can be removed on S3, or "v2" or similar can be appended to the book name before calling booktopipeline.
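A check of this kind might look roughly like the sketch below, which refuses a book name when any object already exists under it. The `lister` interface, the in-memory `memStore` and `checkNewBook` are hypothetical stand-ins written for this example; they are not the booktopipeline code or the real storage API.

```go
package main

import (
	"fmt"
	"strings"
)

// lister is a hypothetical, minimal storage interface used only in this sketch.
type lister interface {
	ListObjects(prefix string) ([]string, error)
}

// memStore is an in-memory stand-in for remote storage such as S3.
type memStore []string

// ListObjects returns the stored keys that start with prefix.
func (m memStore) ListObjects(prefix string) ([]string, error) {
	var out []string
	for _, k := range m {
		if strings.HasPrefix(k, prefix) {
			out = append(out, k)
		}
	}
	return out, nil
}

// checkNewBook returns an error if any object already exists under bookname/.
func checkNewBook(store lister, bookname string) error {
	existing, err := store.ListObjects(bookname + "/")
	if err != nil {
		return fmt.Errorf("could not list objects for %s: %w", bookname, err)
	}
	if len(existing) > 0 {
		return fmt.Errorf("book %s already exists; remove it or use another name such as %sv2", bookname, bookname)
	}
	return nil
}

func main() {
	store := memStore{"mybook/0001.jpg", "mybook/0002.jpg"}
	fmt.Println(checkNewBook(store, "mybook"))   // an error: the book is already present
	fmt.Println(checkNewBook(store, "mybookv2")) // <nil>: nothing stored under this name
}
```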