bookpipeline - Tools to process books in a cloud-based pipeline system

Age	Commit message	Author
2019-12-11	Clarify use of -training in tools	Nick White

2019-12-11	Clean up and correct book name parsing in the pipeline, and update usage of getpipelinebook	Nick White
2019-12-11	Add ability to set a different training for the ocr job	Nick White

2019-12-11	Use aws.go with mkpipeline too, plus fix one log.Fatal call in aws.go which should have been handled by caller	Nick White
2019-12-06	Don't abort PDF generation if pages aren't found, just do the best that can be done and move on; not all books will have all page types (such as wipeonly books)	Nick White
2019-12-05	Remove (the generally empty) files in the case of a failed download	Nick White

2019-12-05	Default getpipelinebook to downloading pdfs instead of images	Nick White

2019-12-05	Fix the PDF in analyse step part of bookpipeline	Nick White

2019-12-05	Add pdf generation to analyse step (untested)	Nick White

2019-12-03	Rewrite lspipeline book listing part to be much faster by taking advantage of the aws CommonPrefixes output	Nick White
2019-12-03	Don't pause between OCR page jobs; this should save us significant amounts of time when there are large numbers of pages	Nick White
2019-11-29	Make error message clear what page is causing issues	Nick White

2019-11-26	Improve usage notice	Nick White

2019-11-26	Ensure error in file walking is correctly returned	Nick White

2019-11-20	Add x/image to go.mod	Nick White

2019-11-20	Merge branch 'addpdf'	Nick White

2019-11-20	Implement an image resizing option in PDF generation, so that smaller PDFs can be generated	Nick White
2019-11-19	Send pages to the individual OCR Page queue by default	Nick White
	This now concludes the OCR Page queue stuff; it should all be working out of the box now.
2019-11-19	Add ocrpage queue for processing individual pages	Nick White
	This should be a good way to get around the ongoing heartbeat issue, as individual page jobs will never come close to the 12 hour mark that can cause the bug. The OCR page processing is done and working now; still to do is to populate the queue (rather than the ocr queue) after preprocessing / wiping.
2019-11-12	Merge branch 'addpdf'	Nick White

2019-11-12	Embed a font, compressed, into the binary	Nick White

2019-11-12	Fix sleep in unstickocr	Nick White

2019-11-12	Add unstickocr tool, until the heartbeat bug is eliminated	Nick White

2019-11-12	Add spotme command to start appropriate spot instances	Nick White

2019-11-12	Merge branch 'addpdf'	Nick White

2019-11-11	Add go.mod and go.sum	Nick White

2019-11-11	Switch to main gofpdf, now our SetTextRenderingMode has been merged	Nick White

2019-11-01	Compress the font with zlib, and include it in repo	Nick White

2019-10-31	Add capability to embed font files into tool	Nick White

2019-10-31	PDF: add functionality to use "best" file if it exists	Nick White

2019-10-31	PDF: add space to each word to ensure copy-paste ability from more PDF readers	Nick White

2019-10-31	PDF: lay out every word with coordinates separately	Nick White
	I presumed this would mean that multiple words next to each other couldn't be reliably searched for, but this seems not to be the case.
2019-10-31	Add flag to switch between binarised and colour output	Nick White

2019-10-31	Move PDF handling code to a separate file	Nick White

2019-10-31	Many improvements to pdfbook; basically working now	Nick White

2019-10-31	Add work in progress PDF producer	Nick White

2019-10-29	Print heartbeat error on failure	Nick White

2019-10-29	Debugging: kill process immediately a heartbeat error is detected (systemd will restart it soon thereafter)	Nick White
2019-10-29	Another attempt to fix the ongoing heartbeat issue	Nick White
	This time wait up to 1 second between attempts, reduce long polling time significantly, and attempt for longer before giving up.
2019-10-28	Try to fix heartbeat renew issue more fully	Nick White
	This approach first sets the remaining visibility timeout to zero. This should ensure that the message is available to re-find as soon as the process looks for it. Correspondingly the delay between checks is much shorter, as there shouldn't be a reason for much delay.
2019-10-23	getpipelinebook: default to downloading corresponding page images, and add option to download the original page images too	Nick White
2019-10-23	Manually calculate yticks, so they fall on reasonable numbers	Nick White

2019-10-23	Add more annotations to graph; anything outside of the 80% "normal" band gets an annotation now, and that band is labelled	Nick White
2019-10-17	Adjust the heartbeat searching function to hopefully have better luck at finding it and not letting another process steal it.	Nick White
2019-10-16	Rewrite booktopipeline to use bookpipeline aws interface	Nick White

2019-10-16	Sort book list in lspipeline by modified date	Nick White

2019-10-16	Ensure booktopipeline complains if given too many arguments	Nick White

2019-10-16	Another attempted fix to "too many open files" issue	Nick White

2019-10-16	Ensure files are promptly closed by booktopipeline	Nick White

2019-10-11	Ensure graph produces output by falling back on generic page numbers if none can be determined	Nick White