Age         Author      Commit message
2020-02-11  Nick White  Rename traintessv5.sh to traintess.sh
2020-01-26  Nick White  Fix fast version of training in traintessv5.sh
                        Tesseract relies on the position of arguments for lstmtraining, surprisingly. Anyway, this formulation works correctly.
2020-01-22  Nick White  Try to capture git revision of ground truth used
2020-01-22  Nick White  Create fast version of training
2020-01-22  Nick White  Replace traintessv4 with traintessv5 script, which was used for fra-engbase training (very minor edits)
2019-10-23  Nick White  Add dir-to-pdfv3.sh, for use alongside bookpipeline
2019-10-23  Nick White  Add worst directory for fullocrdir script
2019-07-23  Nick White  fix hocrtotxtdir.sh
2019-07-15  Nick White  fix more eebotopdf bugs; hopefully more resiliant now
2019-07-15  Nick White  Add unpaperdir.sh
2019-07-15  Nick White  Ensure eebotopdf.sh uses a /tmp dir for tmp files
2019-07-15  Nick White  ensure eebo pdfs are saved to the appropriate directory
2019-07-15  Nick White  Make fullocrdir.sh only do things that haven't been done before
2019-06-25  Nick White  Add fixoverwiped.sh script
2019-06-11  Nick White  Fix bug in checkoverwiping script
2019-06-11  Nick White  Add checkoverwiping script
2019-06-11  Nick White  Add eebotopdf script
2019-06-10  Nick White  Do bookgraph as standard when doing fullocrdir
2019-06-05  Nick White  Rename bookgraphv2.sh to the canonical bookgraph
                        Add word count to the graph. Use a scaled figure so it's easy to compare with the confidence.
2019-06-05  Nick White  Ensure bookgraph uses directory name even when run with . for current dir, and ensure temp dirs are destroyed
2019-06-03  Nick White  Add dir-to-pdfv2 script
2019-06-03  Nick White  Fix dir-to-pdf output naming
2019-05-15  Nick White  Adjust fullocrdir.sh to latest version of pgconf
2019-05-14  Nick White  Add bookgraphv2, to go hand in hand with fullocrdir
2019-05-14  Nick White  fix typo
2019-05-14  Nick White  Add fullocrdir script, which does multiple binarisation options and picks the ones with the highest confidence
2019-05-08  Nick White  Ensure dir-to-pdf saves to dirname.pdf not dirname/.pdf, and handle all different naming conventions
2019-05-08  Nick White  Make scrape scripts executable
2019-05-08  Nick White  Make scrapers more robust, and have them scrape into a directory per book
2019-05-08  Nick White  Make BNF scraper much more robust
2019-05-08  Nick White  Allow an argument to set pdf savefile, and resize pdf images to be way smaller
2019-05-08  Nick White  Rename pdf prep tool as it creates the pdf too now
2019-05-08  Nick White  Use sane page numbering for erara scraper
2019-05-08  Nick White  Add scrape-erara.sh script (not fully tested)
2019-05-08  Nick White  Set DPI for images, and maximally compress jpg (with binarisation it doesn't make much difference)
2019-05-08  Nick White  Add format-for-hocr-pdf.sh script
2019-04-23  Nick White  Save dehyphenated text to a different file, rather than overwriting the original
2019-04-23  Nick White  Add dehyphenate script
2019-04-09  Nick White  Modify traintessv4.sh to include step to construct final training
2019-04-02  Nick White  Fix bugs in traintessv4.sh
2019-04-02  Nick White  Add tesseractv4 training script
2019-03-26  Nick White  Make book graph scripts more robust to dodgy page filenames, and name bookgraph better
2019-03-26  Nick White  Add nonewlines script
2019-03-11  Nick White  Add basic bsb scraper
2019-02-25  Nick White  Make bookgraph script more readable
2019-02-25  Nick White  Add various helper scripts