Age | Commit message | Author | |
---|---|---|---|
2021-08-19 | lspipeline-ng: Limit number of book details requests so we don't run into EC2's rate limiting [v0.5.0] | Nick White | |
2021-08-18 | rescribe: Update documentation on how to deal with M1 signing, and move makefile to where it makes sense | Nick White | |
2021-08-17 | pdf: Stretch words to fit in their boxes, for more perfect embedding | Nick White | |
- Words are stretched to fit their boxes, which means the accuracy is now very high indeed. This was done by modifying gofpdf to add the SetCellStretchToFit function, which will hopefully be upstreamed in due course. - Copy pasting from a PDF works well with lines rarely if ever being erroneously broken by the PDF reader. There was quite a bit of trial-and-error to improve this, and the stretched text plus a space being added after the word in CellFormat was the best (plus preserves accuracy of word and character locations). | |||
2021-08-17 | pipeline: use regular storage for tests, rather than a separate one | Nick White | |
2021-08-09 | pdf: use same line height and origin for all words on a line as it makes things neater in the PDF in most cases | Nick White | |
2021-08-09 | pdf: significantly improve character coordinates | Nick White | |
A few good changes to make word coordinate lookups significantly more accurate: - Set font size dynamically based on the line height (previously it was fixed as size 10) - Correct height and width of word boxes (previously they were way too large, which probably didn't make a difference in the general case, but now they're correct) - Set word box margin to zero. Also change PDF size to A5 paper, as that's closer to an average book page size. | |||
2021-08-02 | rescribe: Add experimental m1 build | Nick White | |
2021-08-02 | internal/pipeline: Add test (incomplete but working) for UploadImages | Nick White | |
2021-07-27 | internal/pipeline: Add test to check that hidden files are skipped | Nick White | |
2021-07-27 | Update dependencies | Nick White | |
2021-07-27 | internal/pipeline: add tests for DetectQueueType | Nick White | |
2021-07-27 | internal/pipeline: Add notreadable test to CheckImages | Nick White | |
2021-07-27 | internal/pipeline: Add a test for CheckImages | Nick White | |
2021-07-20 | Cleanup thanks to go vet | Nick White | |
2021-07-19 | internal/pipeline: Be more explicit with exactly what functions are in each interface, to ensure no "duplicate function" errors when compiling | Nick White | |
2021-07-13 | Fix up tests a bit | Nick White | |
2021-07-13 | gofmt | Nick White | |
2021-07-13 | internal/pipeline: Reorganise interfaces so that functions only declare what they need | Nick White | |
We were using Pipeliner as a catch-all, but it's nicer if the functions can just state that e.g. they need download functionality, so decompose things so that that's how we do things now. | |||
2021-07-13 | aws: Only look up test queue id when asked for, as for most Init()s it won't be needed | Nick White | |
2021-07-12 | Add necessary pipeliner dependency for testqueue (probably remove this from internal library later as it's only needed for tests) | Nick White | |
2021-07-12 | Add test for upAndQueue function | Nick White | |
This involved adding a test queue, so it can be run safely without interfering with the pipeline. | |||
2021-07-08 | rescribe: Exit with an error if directory doesn't exist | Nick White | |
2021-06-29 | rescribe: add documentation on how to generate embedded data | Nick White | |
2021-06-29 | rescribe: Add embed target for darwin (osx) too | Nick White | |
2021-06-22 | rescribe: Remove erroneous unnecessary mkdir | Nick White | |
2021-06-22 | rescribe: Make it clearer that embedded training files are available to use | Nick White | |
2021-06-22 | rescribe: add embedded tesseract for linux | Nick White | |
2021-06-22 | rescribe: allow use of embedded training even if -systess is used | Nick White | |
2021-06-22 | cloud: update spot image to latest version that won't attempt to build rescribe tool | Nick White | |
2021-06-22 | rescribe: Add go generate command to download the needed files to embed | Nick White | |
2021-06-22 | rescribe: Add an embedded tessdata | Nick White | |
2021-06-21 | Merge remote-tracking branch 'ssh/master' | Nick White | |
2021-06-21 | rescribe: Set up so only Tesseract needed for the build platform is embedded | Nick White | |
2021-06-21 | rescribe: Embed Tesseract into binary so that no Tesseract install is necessary | Nick White | |
2021-06-21 | update spot image used | Nick White | |
2021-06-15 | pipeline: Ignore hidden files when checking and uploading | Nick White | |
This prevents issues if a .DS_Store file is present in a directory. | |||
2021-05-31 | local: Only create a file once we are sure that it will be writeable | Nick White | |
2021-05-31 | Add a test for up(), and document download() and up() properly | Nick White | |
2021-05-31 | Fix bug after changing pipeliner for tests, to ensure DeleteObjects is available to Pipeliner | Nick White | |
2021-05-19 | Close process channel after writing to err channel in download(), in case of an error | Nick White | |
This is needed so that in tests the error can be selected out reliably, rather than an empty process signal. | |||
2021-05-19 | Add tests for download() | Nick White | |
2021-05-19 | Fix syntax with another Errorf call | Nick White | |
2021-05-19 | Local download now tries to open the source file before creating a destination file, so if it fails an empty file isn't left behind | Nick White | |
2021-05-19 | Add basic DeleteObjects implementation to local.go | Nick White | |
2021-05-19 | Fix syntax for some fmt.Errorf calls | Nick White | |
2021-04-12 | Update preproc dependency | Nick White | |
2021-03-16 | rescribe: change default training directory to trainings/v0.3.3 | Nick White | |
2021-02-22 | lspipeline: Rename to lspipeline-ng, and restore pre-concurrency version to lspipeline as there are some hard-to-debug issues in concurrency version | Nick White | |
2021-02-15 | getsamplepages: Add -prefix option, and use 'best' to get random page numbers | Nick White | |
The -prefix option is useful to us. Previously only a .jpg for page number 100 was retrieved, which failed if the book had fewer (or unusually named) pages, and also didn't provide a corresponding .hocr at all (bug introduced with 48958d2). Using 'best', which is (effectively) randomly sorted, provides a page that is guaranteed to exist, and a random one at that. | |||
2021-02-05 | Merge branch 'master' of ssh://ssh.phx.nearlyfreespeech.net/home/public/bookpipeline | Nick White | |