Age | Commit message | Author |
|
This avoids the problem that large PDFs require a lot of RAM, which
risks running out of memory. It is also a waste of space and
time.
|
Some issues:
1) The PDF generation stores every page in memory while it constructs the PDF. That means
there's a higher chance of failure due to running out of memory with full size PDFs. There's no
getting around this except by improving the PDF generation library, which is not easy.
2) Currently I've just changed the pipeline to always generate these full size PDFs, and
the rescribe tool then just deletes them if they weren't requested. This is bad, in
particular because of point 1, and would probably cause failures in the server
pipeline as a result.
Therefore the plan is to add a tag to queue messages so that full size generation can be
selectively enabled.
Also, it should be split from the loop with colour PDF generation, as holding them both in RAM at
the same time is unnecessary.
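The plan above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline's actual code: the tag name `fullpdf`, the message layout "bookname [tag ...]", and the names `job`, `parseMsg` and `pdfKinds` are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// fullPDFTag is a hypothetical tag name; the log does not say what the real tag is.
const fullPDFTag = "fullpdf"

// job describes one unit of work parsed from a queue message (assumed type).
type job struct {
	Book    string
	FullPDF bool
}

// parseMsg splits a message body of the assumed form "bookname [tag ...]"
// and records whether full size PDF generation was requested.
func parseMsg(body string) (job, error) {
	fields := strings.Fields(body)
	if len(fields) == 0 {
		return job{}, fmt.Errorf("empty queue message")
	}
	j := job{Book: fields[0]}
	for _, f := range fields[1:] {
		if f == fullPDFTag {
			j.FullPDF = true
		}
	}
	return j, nil
}

// pdfKinds lists the PDFs to build for a job. A caller would generate them
// one after another, so that the colour and full size PDFs are never both
// held in memory at the same time.
func pdfKinds(j job) []string {
	kinds := []string{"binarised", "colour"}
	if j.FullPDF {
		kinds = append(kinds, "colour-fullsize")
	}
	return kinds
}

func main() {
	for _, body := range []string{"mybook", "mybook fullpdf"} {
		j, err := parseMsg(body)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(j.Book, pdfKinds(j))
	}
}
```

With a tag like this, the full size PDF is only built when the message asks for it, rather than always being generated and deleted afterwards.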
|
Tesseract
|
possible to get the gui into a bad state by cancelling before startProcess began (hopefully)
|
preprocessing
|
other of binarised or colour may not exist
|
ends, so multiple books can be processed by the gui one after the other
|
folder
|
upload
|
There are several TODO items before this can be considered "good
enough", let alone complete. See the comments in the code for
details.
On a good day, with a fair wind, though, this works.
|
change)
|
output for training file not found, so that it's clear that the file specified may not exist
|
available to Pipeliner
|
already has a hocr directory in it will work
|
after the book being processed
|
txt-ified on Windows
|