Age  Commit message  Author
2019-08-13  Correct typo in bucket name for pipelinepreprocess; tested and seems to work, remarkably  Nick White
2019-08-13  Add bonus verbose log points  Nick White
2019-08-13  Add booktopipeline tool (only lightly tested)  Nick White
2019-08-13  Reduce SQS WaitTime to something in-spec, and add bonus verbose log points  Nick White
2019-08-13  Switch ksizes to use by preprocmulti  Nick White
2019-08-13  Add basic verbose logging capabilities to pipelinepreprocess  Nick White
2019-07-25  Add first draft of pipelinepreprocess - completely untested, will contain bugs  Nick White
2019-07-19  rename setupawspipeline to mkpipeline  Nick White
2019-07-19  rename pipelineaws to setupawspipeline  Nick White
2019-07-19  Add aws pipeline setup  Nick White
2019-06-25  Remove 0.6 binarisation threshold option from preprocmulti  Nick White
2019-06-25  Experimentally adjust wipe threshold according to binarisation level  Nick White
2019-06-11  Name hocrs as pdfimages does, and preserve entities for hocr  Nick White
2019-06-11  Add basic utility to turn an eebo xml into a set of hocr files (for hocr2pdf)  Nick White
2019-06-03  Add option to disable wiping for preproc and preprocmulti  Nick White
2019-06-03  Add -m option to wipe to set minimum content area for wipe to proceed  Nick White
            If content is very light or sparse it may be better to not wipe at all than wipe almost all of the content leaving a small strip. This is done now by aborting the wipe if the detected content takes up less than the minimum % of the page (default is 30%).
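The minimum content area check described in the commit above can be sketched roughly as follows. This is an illustration only, not the repository's code: the helper names `contentPercent` and `shouldWipe` are hypothetical, and it assumes the content area is measured as the share of the page width between a detected left and right edge.

```go
package main

// contentPercent returns the share of the page width, in percent, covered by
// the detected content between the left and right edges.
// Hypothetical helper: measuring by width alone is an assumption.
func contentPercent(left, right, pageWidth int) float64 {
	if pageWidth <= 0 || right <= left {
		return 0
	}
	return 100 * float64(right-left) / float64(pageWidth)
}

// shouldWipe reports whether the wipe should proceed. It returns false when
// the detected content covers less than minPercent of the page, in which
// case the caller would leave the page unwiped.
func shouldWipe(left, right, pageWidth int, minPercent float64) bool {
	return contentPercent(left, right, pageWidth) >= minPercent
}
```

With the default of 30 mentioned in the commit message, `shouldWipe(400, 600, 1000, 30)` would abort the wipe, because the detected content covers only 20% of the page width.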
2019-05-15  Return an error if page average calculation can't be done with hocr  Nick White
2019-05-14  Rewrite pgconf to be more accurate by measuring average word confidence rather than average line confidence  Nick White
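A minimal sketch of averaging word confidences over a page of hOCR, under stated assumptions: word confidences are taken to appear as `x_wconf N` inside title attributes, as Tesseract's hOCR output writes them, and a regular expression stands in for proper XML parsing. The function name `avgWordConf` is hypothetical and is not taken from the pgconf source.

```go
package main

import (
	"errors"
	"regexp"
	"strconv"
)

// wconfRe matches a word confidence such as "x_wconf 93" or "x_wconf 93.5".
// Assumption: confidences are stored this way in the hOCR title attributes.
var wconfRe = regexp.MustCompile(`x_wconf (\d+(?:\.\d+)?)`)

// avgWordConf returns the mean of all word confidences found in hocr.
// It returns an error rather than NaN when no confidences are present.
func avgWordConf(hocr string) (float64, error) {
	matches := wconfRe.FindAllStringSubmatch(hocr, -1)
	if len(matches) == 0 {
		return 0, errors.New("no word confidences found")
	}
	var total float64
	for _, m := range matches {
		c, err := strconv.ParseFloat(m[1], 64)
		if err != nil {
			return 0, err
		}
		total += c
	}
	return total / float64(len(matches)), nil
}
```

Averaging per word gives every word equal weight, whereas averaging per line would give a two-word line the same weight as a twelve-word line.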
2019-05-14  pgconf: Don't print NaN if a page has no lines, and show the percentage, rather than float, for easier comparison  Nick White
2019-05-14  Add pgconf tool that prints the overall confidence for a whole page of hocr  Nick White
2019-05-14  Basic cleanup of preprocmulti  Nick White
2019-05-14  gofmt  Nick White
2019-05-14  Add preprocmulti tool, that outputs multiple binarisation options quickly  Nick White
2019-05-13  Add preproc command, that binarises and preprocesses together  Nick White
            Surprisingly opening an image takes a significant amount of the total processing time, so this actually saves quite a bit of time in the grand scheme of things.
2019-05-13  Define flags in each test, so they aren't erroneously picked up and used by cmds as they were defined in global package space  Nick White
2019-05-13  Use general integralimg functions for wipe functions  Nick White
2019-05-13  Add -slow flag to test to skip slow tests by default  Nick White
2019-05-13  Reorganise image manipulation to separate integral image parts  Nick White
            Also unify everything else under preproc/. Note that the UsefulImg interface should be used by the main functions, to simplify things, but this hasn't been done yet.
2019-05-13  Start switching preproc to use interfaces more  Nick White
2019-05-13  Rename cleanup to wipe, and only export main function  Nick White
2019-05-13  Rename cleanup package to preproc, and add basic cmd version  Nick White
2019-05-13  Improve error handling in sauvola tests  Nick White
2019-05-13  Make cleanup a basic library  Nick White
2019-05-13  Add some basic tests for cleanup  Nick White
2019-05-13  Use the simplified findbestedge function, and simplify code  Nick White
2019-04-18  Simplify cleanup code  Nick White
2019-04-18  Put edge in middle of window slice, rather than at left side, and gofmt  Nick White
2019-04-18  Add basic cleanup tool; working, but more refinements planned.  Nick White
            This uses integral image calculations, so they have been exported in the binarization package.
2019-04-17  Add basic dehyphenate tool  Nick White
2019-03-28  Remove todo for integral image testing for now  Nick White
2019-03-28  Improve tests; test regular sauvola, and add option to update golden files  Nick White
2019-03-26  Add zeroinv option for binarize command  Nick White
2019-03-26  Move sauvola binarization tool to cmd/binarize  Nick White
2019-03-26  Better error handling with hocr lines  Nick White
2019-02-25  Generalise get text from hocr lines  Nick White
2019-02-25  Add tool to extract plain text from hocr  Nick White
2019-02-15  Separate out binarize into a package, and start adding tests for it  Nick White
2019-01-30  Set window size automatically based on resolution  Nick White
2019-01-30  Remove dependency on Imger package  Nick White
2019-01-30  Add integral image functionality to enable massive speedup of Sauvola  Nick White
            Note that there are some very small differences to the output compared to the basic algorithm, but this doesn't make much difference. This is due to minor differences with the standard deviation calculation throughout, and with mean calculation at edges, for reasons I'm unclear about.
            WIP integral image speedup. mean is working
            Very WIP, but mean is perfect once full window is used
            Integral version all working!
            Remove debugging info
            Organise code better
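For readers unfamiliar with the technique, the speedup comes from precomputing summed-area tables so that the mean and standard deviation of any window can be read off in constant time instead of looping over every pixel in the window. The following is a generic sketch of that idea, not the code from this repository: the `integral` type, its methods, and the use of a plain `[][]uint8` pixel grid are all assumptions made for illustration. The threshold formula is the standard Sauvola one, T = m * (1 + k * (s/R - 1)).

```go
package main

import "math"

// integral holds summed-area tables of pixel values and squared pixel
// values. Each table has an extra leading row and column of zeros so that
// window lookups need no special cases at the top and left borders.
type integral struct {
	sum, sq [][]float64
}

// newIntegral builds both tables in one pass over a rectangular pixel grid.
func newIntegral(px [][]uint8) integral {
	h := len(px)
	w := 0
	if h > 0 {
		w = len(px[0])
	}
	in := integral{sum: make([][]float64, h+1), sq: make([][]float64, h+1)}
	for y := range in.sum {
		in.sum[y] = make([]float64, w+1)
		in.sq[y] = make([]float64, w+1)
	}
	for y := 0; y < h; y++ {
		for x := 0; x < w; x++ {
			v := float64(px[y][x])
			in.sum[y+1][x+1] = v + in.sum[y][x+1] + in.sum[y+1][x] - in.sum[y][x]
			in.sq[y+1][x+1] = v*v + in.sq[y][x+1] + in.sq[y+1][x] - in.sq[y][x]
		}
	}
	return in
}

// area returns the total of table t over the half-open rectangle
// [x0,x1) x [y0,y1) using four lookups.
func area(t [][]float64, x0, y0, x1, y1 int) float64 {
	return t[y1][x1] - t[y0][x1] - t[y1][x0] + t[y0][x0]
}

// meanStdDev returns the mean and population standard deviation of the
// non-empty window [x0,x1) x [y0,y1), computed as sqrt(E[v^2] - E[v]^2).
func (in integral) meanStdDev(x0, y0, x1, y1 int) (mean, sd float64) {
	n := float64((x1 - x0) * (y1 - y0))
	mean = area(in.sum, x0, y0, x1, y1) / n
	variance := area(in.sq, x0, y0, x1, y1)/n - mean*mean
	if variance < 0 { // guard against tiny negative values from rounding
		variance = 0
	}
	return mean, math.Sqrt(variance)
}

// sauvolaThreshold applies the standard Sauvola formula to a window's mean
// and standard deviation; k and r are the usual tuning parameters.
func sauvolaThreshold(mean, sd, k, r float64) float64 {
	return mean * (1 + k*(sd/r-1))
}
```

Computing the variance as E[v^2] - E[v]^2 can round slightly differently from a direct two-pass calculation over the window. Whether that accounts for the small output differences noted in the commit message is not established here.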