Age | Commit message | Author
2021-08-19 | lspipeline-ng: Limit number of book details requests so we don't run into EC2's rate limiting [v0.5.0] | Nick White
2021-08-18 | rescribe: Update documentation on how to deal with M1 signing, and move makefile to where it makes sense | Nick White
2021-08-17 | pdf: Stretch words to fit in their boxes, for more perfect embedding | Nick White
    - Words are stretched to fit their boxes, which means the accuracy is now very high indeed. This was done by modifying gofpdf to add the SetCellStretchToFit function, which will hopefully be upstreamed in due course.
    - Copy pasting from a PDF works well, with lines rarely if ever being erroneously broken by the PDF reader. There was quite a bit of trial-and-error to improve this, and the stretched text plus a space being added after the word in CellFormat worked best (and it preserves accuracy of word and character locations).
2021-08-17 | pipeline: use regular storage for tests, rather than a separate one | Nick White
2021-08-09 | pdf: use same line height and origin for all words on a line as it makes things neater in the PDF in most cases | Nick White
2021-08-09 | pdf: significantly improve character coordinates | Nick White
    A few good changes to make word coordinate lookups significantly more accurate:
    - Set font size dynamically based on the line height (previously it was fixed as size 10)
    - Correct height and width of word boxes (previously they were way too large, which probably didn't make a difference in the general case, but now they're correct)
    - Set word box margin to zero
    Also change PDF size to A5 paper, as that's closer to an average book page size.
2021-08-02 | rescribe: Add experimental m1 build | Nick White
2021-08-02 | internal/pipeline: Add test (incomplete but working) for UploadImages | Nick White
2021-07-27 | internal/pipeline: Add test to check that hidden files are skipped | Nick White
2021-07-27 | Update dependencies | Nick White
2021-07-27 | internal/pipeline: add tests for DetectQueueType | Nick White
2021-07-27 | internal/pipeline: Add notreadable test to CheckImages | Nick White
2021-07-27 | internal/pipeline: Add a test for CheckImages | Nick White
2021-07-20 | Cleanup thanks to go vet | Nick White
2021-07-19 | internal/pipeline: Be more explicit with exactly what functions are in each interface, to ensure no "duplicate function" errors when compiling | Nick White
2021-07-13 | Fix up tests a bit | Nick White
2021-07-13 | gofmt | Nick White
2021-07-13 | internal/pipeline: Reorganise interfaces so that functions only declare what they need | Nick White
    We were using Pipeliner as a catch-all, but it's nicer if the functions can just state that e.g. they need download functionality, so decompose things so that that's how we do things now.
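The interface reorganisation described in the commit above can be sketched roughly as follows. This is a minimal illustration only, not the code in internal/pipeline: the `Downloader` and `Uploader` interfaces, their method signatures, the `memStore` type and the `fetch` function are all hypothetical names chosen for the example. Only the `Pipeliner` name comes from the log.

```go
package main

import "fmt"

// Downloader and Uploader are hypothetical narrow interfaces; the real
// method sets in internal/pipeline may differ.
type Downloader interface {
	Download(bucket, key, path string) error
}

type Uploader interface {
	Upload(bucket, key, path string) error
}

// Pipeliner is composed from the narrow interfaces rather than being the
// one catch-all type that every function asks for.
type Pipeliner interface {
	Downloader
	Uploader
}

// memStore is a toy in-memory implementation used only for this sketch.
type memStore struct {
	objects map[string]string
}

func (m *memStore) Download(bucket, key, path string) error {
	if _, ok := m.objects[bucket+"/"+key]; !ok {
		return fmt.Errorf("no such object %s/%s", bucket, key)
	}
	return nil
}

func (m *memStore) Upload(bucket, key, path string) error {
	m.objects[bucket+"/"+key] = path
	return nil
}

// fetch only needs download functionality, so it asks for a Downloader
// and can be tested with anything that implements that single method.
func fetch(d Downloader, bucket, key string) error {
	return d.Download(bucket, key, "/tmp/"+key)
}

func main() {
	s := &memStore{objects: map[string]string{}}
	var p Pipeliner = s // a full Pipeliner still satisfies Downloader
	if err := p.Upload("books", "0001.png", "0001.png"); err != nil {
		fmt.Println("upload failed:", err)
		return
	}
	fmt.Println("fetch error:", fetch(p, "books", "0001.png"))
}
```

A side effect of this style is that test doubles only have to implement the methods a function actually uses.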
2021-07-13 | aws: Only look up test queue id when asked for, as for most Init()s it won't be needed | Nick White
2021-07-12 | Add necessary pipeliner dependency for testqueue (probably remove this from internal library later as it's only needed for tests) | Nick White
2021-07-12 | Add test for upAndQueue function | Nick White
    This involved adding a test queue, so it can be run safely without interfering with the pipeline.
2021-07-08 | rescribe: Exit with an error if directory doesn't exist | Nick White
2021-06-29 | rescribe: add documentation on how to generate embedded data | Nick White
2021-06-29 | rescribe: Add embed target for darwin (osx) too | Nick White
2021-06-22 | rescribe: Remove erroneous unnecessary mkdir | Nick White
2021-06-22 | rescribe: Make it clearer that embedded training files are available to use | Nick White
2021-06-22 | rescribe: add embedded tesseract for linux | Nick White
2021-06-22 | rescribe: allow use of embedded training even if -systess is used | Nick White
2021-06-22 | cloud: update spot image to latest version that won't attempt to build rescribe tool | Nick White
2021-06-22 | rescribe: Add go generate command to download the needed files to embed | Nick White
2021-06-22 | rescribe: Add an embedded tessdata | Nick White
2021-06-21 | Merge remote-tracking branch 'ssh/master' | Nick White
2021-06-21 | rescribe: Set up so only Tesseract needed for the build platform is embedded | Nick White
2021-06-21 | rescribe: Embed Tesseract into binary so that no Tesseract install is necessary | Nick White
2021-06-21 | update spot image used | Nick White
2021-06-15 | pipeline: Ignore hidden files when checking and uploading | Nick White
    This prevents issues if a .DS_Store file is present in a directory.
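As a rough illustration of the hidden-file check described above, the sketch below filters out names that start with a dot. It assumes "hidden" means a leading dot, as with .DS_Store; the `isHidden` and `visible` helpers are hypothetical and are not taken from the pipeline code.

```go
package main

import (
	"fmt"
	"strings"
)

// isHidden reports whether a file name starts with a dot, as .DS_Store does.
// Treating "leading dot" as the definition of hidden is an assumption here.
func isHidden(name string) bool {
	return strings.HasPrefix(name, ".")
}

// visible returns the names that are not hidden, preserving their order.
func visible(names []string) []string {
	var out []string
	for _, n := range names {
		if !isHidden(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	names := []string{".DS_Store", "0001.png", "0002.png"}
	// Only the page images remain to be checked and uploaded.
	fmt.Println(visible(names))
}
```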
2021-05-31 | local: Only create a file once we are sure that it will be writeable | Nick White
2021-05-31 | Add a test for up(), and document download() and up() properly | Nick White
2021-05-31 | Fix bug after changing pipeliner for tests, to ensure DeleteObjects is available to Pipeliner | Nick White
2021-05-19 | Close process channel after writing to err channel in download(), in case of an error | Nick White
    This is needed so that in tests the error can be selected out reliably, rather than an empty process signal.
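The ordering described above (send the error first, close the process channel afterwards) can be sketched as follows. This is a simplified stand-in for download(), not the real function: the signature, the `fetch` callback and `errFetch` are hypothetical. With an unbuffered error channel, the send only completes once the receiver has taken the error, so the process channel is still open and empty at the moment a `select` runs.

```go
package main

import (
	"errors"
	"fmt"
)

// errFetch is a hypothetical error used to simulate a failed download.
var errFetch = errors.New("fetch failed")

// download is a simplified stand-in: it fetches each key and passes it on
// through process, reporting the first failure on errc.
func download(keys []string, fetch func(string) error, process chan<- string, errc chan<- error) {
	for _, k := range keys {
		if err := fetch(k); err != nil {
			errc <- err    // blocks until the receiver has taken the error
			close(process) // only then signal that no more work is coming
			return
		}
		process <- k
	}
	close(process)
}

func main() {
	process := make(chan string)
	errc := make(chan error)
	failing := func(string) error { return errFetch }
	go download([]string{"0001.png"}, failing, process, errc)

	// Because process is closed only after the error has been received,
	// the error case is the only one that can be ready here.
	select {
	case err := <-errc:
		fmt.Println("error:", err)
	case _, ok := <-process:
		fmt.Println("process signal, open:", ok)
	}
}
```

If the close came before the send, the closed process channel would also be ready, and `select` could pick either case.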
2021-05-19 | Add tests for download() | Nick White
2021-05-19 | Fix syntax with another Errorf call | Nick White
2021-05-19 | Local download now tries to open the source file before creating a destination file, so if it fails an empty file isn't left behind | Nick White
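A minimal sketch of the ordering described in the commit above: open the source first, and only create the destination once that has succeeded. The `copyFile` and `checkNoEmptyFileLeft` functions are hypothetical and written for this example; they are not the code in local.go.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// copyFile opens src before creating dst, so a missing or unreadable
// source never leaves an empty destination file behind.
func copyFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return fmt.Errorf("could not open source %s: %v", src, err)
	}
	defer in.Close()

	out, err := os.Create(dst)
	if err != nil {
		return fmt.Errorf("could not create destination %s: %v", dst, err)
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		return fmt.Errorf("could not copy %s to %s: %v", src, dst, err)
	}
	return out.Close()
}

// checkNoEmptyFileLeft copies from a source that does not exist and
// verifies that no destination file was created as a side effect.
func checkNoEmptyFileLeft() error {
	dir, err := os.MkdirTemp("", "copyfile")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	dst := filepath.Join(dir, "out.txt")
	if err := copyFile(filepath.Join(dir, "missing.txt"), dst); err == nil {
		return fmt.Errorf("expected an error for a missing source")
	}
	if _, err := os.Stat(dst); !os.IsNotExist(err) {
		return fmt.Errorf("destination %s was left behind", dst)
	}
	return nil
}

func main() {
	if err := checkNoEmptyFileLeft(); err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println("no empty destination file was left behind")
}
```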
2021-05-19 | Add basic DeleteObjects implementation to local.go | Nick White
2021-05-19 | Fix syntax for some fmt.Errorf calls | Nick White
2021-04-12 | Update preproc dependency | Nick White
2021-03-16 | rescribe: change default training directory to trainings/v0.3.3 | Nick White
2021-02-22 | lspipeline: Rename to lspipeline-ng, and restore pre concurrency version to lspipeline as there are some hard to debug issues in concurrency version | Nick White
2021-02-15 | getsamplepages: Add -prefix option, and use 'best' to get random page numbers | Nick White
    The -prefix option is useful to us. Previously only a .jpg for page number 100 was retrieved, which failed if the book had fewer (or unusually named) pages, and also didn't provide a corresponding .hocr at all (bug introduced with 48958d2). Using 'best', which is (effectively) randomly sorted, provides a guaranteed to exist page, and a random one at that.
2021-02-05 | Merge branch 'master' of ssh://ssh.phx.nearlyfreespeech.net/home/public/bookpipeline | Nick White