path: root/cmd/bookpipeline
Age  Commit message  Author
2020-11-10  [rescribe] Enable custom paths to tesseract command to be set (also improve some error output)  Nick White
2020-11-09  Switch Preprocess() to take the thresholds to use, and have rescribe tool only use 0.1,0.2,0.3  (separatelocal)  Nick White
2020-11-09  [bookpipeline] Split most functionality out to package internal/pipeline  Nick White
No functionality changes, but this should make it easier to make custom builds using the pipeline in slightly different ways.
2020-11-09  Add -autostop, so time to shutdown can be specified, and so the process can just be stopped after a period, rather than the whole computer shut down  Nick White
2020-11-09  [bookpipeline] Improve interface, particularly for local use, by disabling (failing) log saving, mail sending, and removing erroneous references to AWS  Nick White
2020-11-09  Set hocr config options directly rather than relying on 'hocr' config file  Nick White
This ensures that bookpipeline will still work even if TESSDATA_PREFIX has been set to a directory without configs in it.
2020-09-15  Abort and delete a failed wipeonly job, like we do with preprocessing  Nick White
There was no reason not to do this with wipeonly as well, and sure enough a single broken PNG image in a wipeonly task would cause the queue to exponentially fill as happened previously.
2020-08-18  Update preproc to v0.4.0 to enable vertical wiping  Nick White
2020-07-27  Use os.Getenv() to find config dir, rather than rely on os.UserConfigDir(), as that isn't present on go1.11  Nick White
2020-07-27  Switch mail settings to an externally set file  Nick White
2020-07-21  [bookpipeline] If preprocessing fails, email us and remove the job from the queue  Nick White
This prevents the current situation where a failed preprocessing job is endlessly repeated, potentially spawning thousands of ocrpage jobs in its wake each time. Note that the email stuff works but requires putting secrets into .go files, so need to rewrite that to read from somewhere more sensible like a dotfile on the host.
2020-07-20  Merge branch 'master' of https://git.rescribe.xyz/bookpipeline  Nick White
2020-07-20  Update preproc to v0.1.4 to take advantage of vertical wiping parameters, and change WipeFile() to take advantage of them  (v0.2.5)  Nick White
2020-06-16  [getallhocrs] Skip files which have already been downloaded  Nick White
2020-06-03  Hopefully fix last bug in analyse step of bookpipeline  Nick White
2020-06-03  Fix bug in analyse step of bookpipeline  Nick White
2020-06-02  Fix race condition that could cause errors to be silently discarded  Nick White
This was a nasty one. By closing the up channel, the up() function would finish and send to the done channel. This means that the select between err and done would be random as to which was picked, whereas of course if there has been an error that path must be taken.
2020-05-29  [bookpipeline] Remove local copy of original page image once preprocessed  Nick White
2020-05-29  Merge branch 'minimisedisk'  (v0.2.4)  Nick White
2020-05-26  Add -c conntype for necessary tools to allow local connection to be used  Nick White
2020-05-22  Fix bookpipeline failing if shutdown option isn't used  Nick White
2020-05-22  [untested] Use less disk space  (minimisedisk)  Nick White
There are several ways that disk usage is reduced with this patch:
- Files are deleted as soon as they have been uploaded
- Once a page image has been added to a PDF, immediately delete it
This should allow much larger books to be processed without needing bigger disks.
2020-04-14  Briefly document each of the commands in a godoc friendly way, and improve the cloudsettings documentation slightly  Nick White
2020-04-07  Remove unused OCR queue (was superseded by the ocrpage queue some time ago)  Nick White
2020-04-07  gofmt  Nick White
2020-03-31  Disable autoshutdown by default for bookpipeline, and update to ami 0.11 (which reenables it for spot instances)  Nick White
2020-03-31  [bookpipeline] Fix typo in previous commit and rename HeartbeatTime to HeartbeatSeconds, as it is not a Time  Nick White
2020-03-31  [bookpipeline] Stop using filepath.Join for storage keys, as we want to ensure it is always a / delimiter  Nick White
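The reasoning behind this change is that `filepath.Join` uses the operating system's path separator (a backslash on Windows), whereas storage keys should always use `/`. A minimal sketch of one way to build such keys follows. The helper name `storageKey` and the example key parts are hypothetical, and using `path.Join` is an assumption; the commit may simply concatenate strings instead.

```go
package main

import (
	"fmt"
	"path"
)

// storageKey is a hypothetical helper that joins key parts with "/"
// regardless of the host operating system. Note that path.Join also
// cleans the result, collapsing repeated slashes and dropping any
// trailing slash, which may or may not be wanted for a given store.
func storageKey(parts ...string) string {
	return path.Join(parts...)
}

func main() {
	fmt.Println(storageKey("mybook", "0001.png")) // prints "mybook/0001.png"
}
```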
2020-03-31  [bookpipeline] Improve logging output  Nick White
2020-03-31  [bookpipeline] Add (experimental) log saving functionality  Nick White
2020-03-30  [bookpipeline] Clean up autoshutdown  Nick White
2020-03-30  [bookpipeline] Enable real shutdown when bookpipeline has been idle for 5 minutes  Nick White
2020-03-30  [bookpipeline] Neaten shutdown fix  Nick White
2020-03-30  [bookpipeline] Fix hang bug when restarting shutdown timer  Nick White
2020-03-30  Rewrite autoshutdown to do things right [bugs excluded] (wip)  Nick White
2020-03-24  [bookpipeline] Improve autoshutdown wip  Nick White
2020-03-24  [bookpipeline] Add experimental (dummy) shutdown part  Nick White
2020-03-23  Add Log() function to Pipeliner interface  Nick White
This simplifies things nicely from using conn.GetLogger().Println() to conn.Log()
2020-03-23  Replace errors.New(fmt.Sprintf with fmt.Errorf  Nick White
Embarrassing I hadn't noticed the fmt.Errorf function before, but better late than never.
2020-03-23  Don't try to make a graph with one line (it will fail), and don't mark analysis as failed if graph isn't made for that reason  Nick White
2020-02-27  Add documentation, license notices, and license  Nick White
2020-02-05  Fix allOCRed for wipeonly books (hopefully)  Nick White
allOCRed was checking for wipePattern files, however they should have been transformed into the regular preprocessedPattern for OCR anyway, so shouldn't have been directly OCRed. Thus, allOCRed was mistakenly looking for .hocr versions of the original wipePattern files, which never would have been produced.
2019-12-13  Hopefully fix empty training bug  Nick White
2019-12-13  Mention training in ocr error message  Nick White
2019-12-13  Print stdout and stderr output when tesseract fails  Nick White
2019-12-11  Fix typo incorrectly screwing up PDFs  Nick White
2019-12-11  Clarify use of -training in tools  Nick White
2019-12-11  Clean up and correct book name parsing in the pipeline, and update usage of getpipelinebook  Nick White
2019-12-11  Add ability to set a different training for the ocr job  Nick White
2019-12-06  Don't abort PDF generation if pages aren't found, just do the best that can be done and move on; not all books will have all page types (such as wipeonly books)  Nick White