author Nick White <git@njw.name> 2021-10-25 17:31:47 +0100
committer Nick White <git@njw.name> 2021-10-25 17:31:47 +0100
commit 44ba2d62c37b387063b2b5b3cada41902c089e36
tree 59ff441a5431a29b92032a59f9fb4f13597d8e88
parent 18cc85896bd33eb94fee5d20efe6c557357980d0
pdfs: wording improvements throughout
-rw-r--r-- content/posts/pdfs/index.md 39
1 files changed, 19 insertions, 20 deletions
diff --git a/content/posts/pdfs/index.md b/content/posts/pdfs/index.md
index 02e8457..19e27b3 100644
--- a/content/posts/pdfs/index.md
+++ b/content/posts/pdfs/index.md
@@ -15,9 +15,7 @@ exactly what "good" means in this context.
## What makes a good searchable PDF?
-If you've worked with PDFs a lot, from a variety of different
-sources, you may have seen various issues crop up. A good
-searchable PDF is fundamentally a PDF which is clear and easy to
+A good searchable PDF is fundamentally a PDF which is clear and easy to
use, and where everything works as expected. More helpful in
understanding how to get there is to consider some of the many ways
that a searchable PDF can fail to live up to that goal.
@@ -31,11 +29,12 @@ that a searchable PDF can fail to live up to that goal.
displayed.
* When searching a PDF, the highlighted area for a match doesn't
exactly match the location of the result.
-* PDF files are too large to easily use, meaning they're difficult
- to share, and can cause issues on older computers.
-* Page images are too compressed to be useable.
-* The DPI is set incorrectly causing PDFs to be rendered far too
- zoomed in or out by default, both on screen and if printed.
+* PDF files are very large, meaning they're difficult to share,
+ slow to work with, and can cause issues on older computers.
+* Page images are so compressed that the text is difficult or
+ impossible to read.
+* The DPI is set incorrectly, which can cause PDFs to be rendered
+ far too large or small by default, both on screen and if printed.
This list should give a sense of how the simple goal of producing
a searchable PDF that just works as expected actually requires a
@@ -55,9 +54,7 @@ While that may sound manageable, a book with 500 page images of
this size would therefore take around 250MB, which is an
annoyingly heavy PDF to open, store, and share.
-While the optimum image quality to pick ultimately depends on the
-use to which the PDF is to be put, in the general case of wanting
-a readable, searchable PDF, we found that using a fixed image
+To produce a readable, searchable PDF, we found that using a fixed image
height of around 1000 pixels was about right. This results in a
PDF which is high quality, but is rather more manageable, clocking
in at around 70MB for a 500 page book.
@@ -72,13 +69,13 @@ common method is to use a font which is completely invisible,
however this is somewhat of a hack, and can cause issues with
Unicode characters, and generally be unreliable.
-The ideal method, which we have implemented, is to use a feature
+The best method, which we have implemented, is to use a feature
of the PDF specification called "Text Rendering Mode" to request
-that the text should be laid out as normal, but not "stroked"
-(outlined) or "filled". Not many PDF rendering libraries seem to
+that the text should be laid out as normal, but not actually
+drawn in the final output. Not many PDF rendering libraries seem to
support this, but we were able to easily add it to the gofpdf
package we use, and
-[get the support upstream](https://github.com/jung-kurt/gofpdf/pull/331)
+[get the support integrated upstream into the official library](https://github.com/jung-kurt/gofpdf/pull/331)
for others to benefit from. We could then use a simple and
reliable freely licensed Unicode font, DejaVu Condensed, so that
any complex characters and layout is handled correctly, while the
@@ -135,17 +132,19 @@ each word to ensure that they precisely fill the width dimensions
that the OCR engine reports. This can be done using a feature of
PDF called "Horizontal Stretch", which again we had to
[add to the gofpdf library](https://github.com/nickjwhite/gofpdf/commit/3e3f1fbc561c4ae28e33d92959d58f2aa74428c8)
-to implement. This work has also been submitted upstream, but as
-it's more niche it's not yet clear whether it will be accepted.
+to implement.
With this last step done, the text lines up pretty perfectly with
the text in the underlying image, while perfectly supporting
search and copying of any amount of text.
## Conclusion
-Our PDF support is now the best available anywhere, as far as
-we're aware. Give it a go with the
-[rescribe tool](https://rescribe.xyz/rescribe)
+PDFs may seem straightforward, but producing really high
+quality searchable PDFs from OCR took us some work to get
+right.
+
+You can see the results with the latest versions of
+the [rescribe tool](https://rescribe.xyz/rescribe)
or the `pdfbook` tool in the
[bookpipeline package](https://rescribe.xyz/bookpipeline).
Let us know how you get on by [email](mailto:info@rescribe.xyz) or