bookpipeline - Tools to process books in a cloud based pipeline system

Age	Commit message	Author
2022-03-21	rescribe: Update copyright years and add TODO file	Nick White

2022-03-21	rescribe: Update traineddata descriptions in command line version	Nick White

2022-03-21	Update tessdata to only include a few trainings	Nick White

2022-03-21	rescribe: Improve cli wording and simplify PDF stuff slightly	Nick White

2022-03-21	Only generate full-size PDF if requested	Nick White
	This avoids the issue that large PDFs require a lot of RAM, so there are chances of running out of memory. Plus it's a waste of space and time.
2022-03-11	Add initial support for full-size PDF generation	Nick White
	Some issues: 1) The PDF generation stores every page in memory while it constructs it. That means that there's a higher chance of failure due to running out of memory with these. There's no getting around this except by improving the PDF generation library, which is not easy. 2) Currently I've just changed the pipeline to always generate these full size PDFs, and then the rescribe tool will just delete them if they weren't requested. This is bad in particular because of point 1, and would probably cause failures in the server pipeline as a result. Therefore the plan is to add a tag to queue messages so that full size generation can be selectively enabled. Also, it should be split from the loop with colour PDF generation, as holding them both in RAM at the same time is unnecessary.
2022-03-11	Name PDF extracted images so they sort correctly	Nick White

2022-02-28	rescribe: Add " searchable" to file name for saved PDF	Nick White

2022-02-28	Add PreNoWipe queue, that just does binarisation but no wiping	Nick White

2022-02-23	rescribe: fix typo with embedded getgbook running	Nick White

2022-02-23	rescribe: Add embedded support for getgbook, for linux only so far	Nick White

2022-02-21	Ensure that no new console windows are opened on Windows when executing Tesseract	Nick White
2022-01-31	rescribe: Add context cancelling to extractPdfImgs(), so it's no longer possible to get the gui into a bad state by cancelling before startProcess began (hopefully)	Nick White
2022-01-31	rescribe: Ensure status isn't overwritten after an abort, when wipe-only preprocessing	Nick White
2022-01-31	Make pipeline context-aware, so the rescribe tool can cancel jobs	Nick White

2022-01-10	rescribe: Rename PDFs taking into account that in some cases one or the other of binarised or colour may not exist	Nick White
2022-01-10	rescribe: handle PDF errors much more gracefully	Nick White

2021-12-20	rescribe: Ensure temporary tesseract data is only removed when the program ends, so multiple books can be processed by the gui one after the other	Nick White
2021-12-20	rescribe: Ensure temporary tesseract dir is removed in gui mode too	Nick White

2021-12-20	whitespace and error clarity changes	Nick White

2021-12-20	fixed -png flag and changed rescribe tool to save binarized png in separate folder	Antonia Rescribe
2021-12-06	pipeline: process jpg or png regardless of whether in wipe or preprocess queue	Nick White

2021-11-23	rescribe: Remove debugging printfs related to PDF parsing	Nick White

2021-11-23	rescribe: Improve pdf consumption by ensuring only jpg or png are saved to upload	Nick White
2021-11-22	rescribe: Add support for reading images directly from PDFs	Nick White
	There are several TODO items before this can be considered "good enough", let alone complete. See the comments in the code for details. On a good day, with a fair wind, though, this works.
2021-11-22	rescribe: replace errors.New with fmt.Errorf	Nick White

2021-11-02	rescribe: handle directories with spaces correctly	Nick White

2021-10-26	rescribe: Separate gui code, and organise it better (should be no functional change)	Nick White
2021-10-25	rescribe: wip gui using fyne	Nick White

2021-10-12	rescribe: fix lookup of external training file (tag: v0.5.3)	Nick White

2021-10-01	rescribe: Add embedded lat.traineddata	Nick White

2021-10-01	rescribe: Add both original training path and embedded version on error output for training file not found, so that it's clear that the file specified may not exist	Nick White
2021-08-17	pipeline: use regular storage for tests, rather than a separate one	Nick White

2021-08-02	internal/pipeline: Add test (incomplete but working) for UploadImages	Nick White

2021-07-20	Cleanup thanks to go vet	Nick White

2021-07-13	gofmt	Nick White

2021-07-08	rescribe: Exit with an error if directory doesn't exist	Nick White

2021-06-29	rescribe: Add embed target for darwin (osx) too	Nick White

2021-06-22	rescribe: Remove erroneous unnecessary mkdir	Nick White

2021-06-22	rescribe: Make it clearer that embedded training files are available to use	Nick White

2021-06-22	rescribe: add embedded tesseract for linux	Nick White

2021-06-22	rescribe: allow use of embedded training even if -systess is used	Nick White

2021-06-22	rescribe: Add go generate command to download the needed files to embed	Nick White

2021-06-22	rescribe: Add an embedded tessdata	Nick White

2021-06-21	rescribe: Set up so only Tesseract needed for the build platform is embedded	Nick White

2021-06-21	rescribe: Embed Tesseract into binary so that no Tesseract install is necessary	Nick White

2021-05-31	Fix bug after changing pipeliner for tests, to ensure DeleteObjects is available to Pipeliner	Nick White
2021-03-16	rescribe: change default training directory to trainings/ (tag: v0.3.3)	Nick White

2020-12-07	[rescribe] Fix up *.hocr glob, which ensures that using a savedir that already has a hocr directory in it will work (tag: v0.3.2)	Nick White
2020-12-07	[rescribe] Allow saving of results to somewhere other than a directory named after the book being processed	Nick White