Age  Commit message  Author
2021-07-23  dehyphenate: Update to reflect multiple page support in hocr package  Nick White
2021-07-23  iiifdownloader: Fixed error printing  Nick White
2021-07-23  gofmt  Nick White
2021-06-08  iiifdownloader: remove old and incorrect part which could cause errors  Nick White
2021-05-11  Handle pages with png suffix correctly  Nick White
    Example book: https://content.staatsbibliothek-berlin.de/dc/687222079/manifest
2021-05-11  Update dlgbook usage statement to reflect that we include the bookid now  Nick White
2021-05-10  dlgbook: Strip special characters from authors as well as titles  Nick White
2021-05-10  dlgbook: add google book id to the end of the directory name, and limit lengths of title and author to ensure it never meets ext4 size limits  Nick White
2021-03-25  extracthocrlines: ensure opened files are closed promptly, to forego any too many open files errors  Nick White
2021-03-25  extracthocrlines: Fix syntax error typo  Nick White
2021-03-23  extracthocrlines: Replace -e with -b, its opposite, and make it default  Nick White
2021-03-23  extracthocrlines: Skip empty text lines  Nick White
2021-03-23  hocr: Add ability to specify a custom image path for hocr line extraction, and use it in extracthocrlines  Nick White
2021-03-16  Report book dir to use before starting getgbook, in case it is unwanted  Nick White
2021-03-16  dlgbook: add new tool to wrap around getgbook, automatically setting the author, year and title and naming the directory appropriately  Nick White
2021-02-09  Add extracthocrlines tool  Nick White
2021-02-09  hocr: Use extracted page name for line naming  Nick White
    This means that even in multi page hocrs with lines with the same id (like line_1_1), then the page name will be different, so extracthocrlines now won't mistakenly name different lines the same and therefore overwrite them.
2021-02-09  hocr: Use image specified in ocr_page title, so can support multipage hocrs cleanly  Nick White
2021-02-02  [eeboxmltohocr] Fix bug causing error if there were many hocr files  Nick White
2021-01-25  Fix generic IIIF downloading to fix special-case for the Bodleian only  Nick White
    This was also triggering for erara, causing it to fail. As it's clearly a Bodleian special case, we now check the URL is Bodleian before applying it.
2020-11-06  Add git clone advice to readme  Nick White
2020-10-26  [iiifdownloader] Add -insecure flag to ignore TLS errors  Nick White
    At the time of writing, the https://manuscrits-france-angleterre.org website has expired certificates, which make accessing their images a pain. While the issue is obviously with them, it's reasonable for us to add a -insecure flag (emphatically not the default) to override cert checking for cases like this.
2020-10-13  [iiifdownloader] Catch SIGINT when writing a file to remove half-written files before exit  Nick White
2020-10-13  Improve error handling, and ensure incomplete page downloads are removed  Nick White
2020-10-12  [analysestats] skip zero confidence pages from stats  Nick White
2020-09-28  [iiifdownloader] Add a TODO to switch to tile based downloading  Nick White
2020-09-28  [iiifdownloader] Work around oxford needing the iiif suffix adding to its id  Nick White
2020-09-28  [iiifdownloader] Default to iiifmanifest type if none is given and no definitive service can be found  Nick White
2020-09-28  Make page numbering more generic to handle more iiif variety, and add harvardartmuseums iiif manifest example url  Nick White
2020-09-28  Add ability to pass -service to choose which download type to use, plus add a -bookdir flag to set download directory  Nick White
2020-09-22  [analysestats] complete  Nick White
2020-09-22  [analysestats] Parse hocr for training used  Nick White
2020-09-21  Add wip analysestats command  Nick White
2020-09-21  Use strings.Replace rather than strings.ReplaceAll so that it works on older versions of go  Nick White
2020-09-12  Add todos  Nick White
2020-09-08  Add the option to force METS usage for BSB  Nick White
2020-09-08  Switch to using generic page downloader for BNF  Nick White
2020-09-08  Improve urlToPgName so it can be used by BNF too  Nick White
2020-09-08  Improve urlToPgName and documentation  Nick White
2020-09-08  Sanitise URLs so that // in url doesn't cause issues (bsb site can spew these)  Nick White
2020-09-08  Switch from METS to IIIF manifest for BSB downloading, as it returns higher quality images with no visible watermark  Nick White
2020-09-08  [iiifdownloader] BSB downloading works now, by parsing METS XML  Nick White
2020-09-07  [iiifdownloader] Split out NoPgNums downloading to its own function  Nick White
2020-09-07  Add skeleton of bsb support  Nick White
2020-08-25  Move dehyphenate string code into its own function  Nick White
2020-08-25  Fixes to dehyphenate  Nick White
    - Ensure a final hyphen on the last word of a page isn't removed
    - Only try to add a next word if there is one to take
    - Ensure that if a single word is on the following line, which is taken, then the line is blanked
2020-08-25  Add text mode for dehyphenate tool  Nick White
2020-06-23  [iiifdownloader] Only remove 1 duplicate page, as 2nd one may not be duplicate (no way of knowing as if it is its downsized)  Nick White
2020-06-23  [iiifdownloader] Add support for BNF urls with a dot after book id  Nick White
2020-06-23  Add IIIF downloader, that just supports BNF for now  Nick White