pdfs: wording improvements throughout

author: Nick White <git@njw.name> 2021-10-25 17:31:47 +0100
committer: Nick White <git@njw.name> 2021-10-25 17:31:47 +0100
commit: 44ba2d62c37b387063b2b5b3cada41902c089e36 (patch)
tree: 59ff441a5431a29b92032a59f9fb4f13597d8e88 /content/posts
parent: 18cc85896bd33eb94fee5d20efe6c557357980d0 (diff)
1 file changed, 19 insertions, 20 deletions
diff --git a/content/posts/pdfs/index.md b/content/posts/pdfs/index.md
index 02e8457..19e27b3 100644
--- a/content/posts/pdfs/index.md
+++ b/content/posts/pdfs/index.md
@@ -15,9 +15,7 @@ exactly what "good" means in this context.
 
 ## What makes a good searchable PDF?
 
-If you've worked with PDFs a lot, from a variety of different
-sources, you may have seen various issues crop up. A good
-searchable PDF is fundamentally a PDF which is clear and easy to
+A good searchable PDF is fundamentally a PDF which is clear and easy to
 use, and where everything works as expected. More helpful in
 understanding how to get there is to consider some of the many ways
 that a searchable PDF can fail to live up to that goal.
@@ -31,11 +29,12 @@ that a searchable PDF can fail to live up to that goal.
   displayed.
 * When searching a PDF, the highlighted area for a match doesn't
   exactly match the location of the result.
-* PDF files are too large to easily use, meaning they're difficult
-  to share, and can cause issues on older computers.
-* Page images are too compressed to be useable.
-* The DPI is set incorrectly causing PDFs to be rendered far too
-  zoomed in or out by default, both on screen and if printed.
+* PDF files are very large, meaning they're difficult to share,
+  slow to work with, and can cause issues on older computers.
+* Page images are so compressed that the text is difficult or
+  impossible to read.
+* The DPI is set incorrectly, which can cause PDFs to be rendered
+  far too large or small by default, both on screen and if printed.
 
 This list should give a sense of how the simple goal of producing
 a searchable PDF that just works as expected actually requires a
@@ -55,9 +54,7 @@ While that may sound manageable, a book with 500 page images of
 this size would therefore take around 250MB, which is an
 annoyingly heavy PDF to open, store, and share.
 
-While the optimum image quality to pick ultimately depends on the
-use to which the PDF is to be put, in the general case of wanting
-a readable, searchable PDF, we found that using a fixed image
+To produce a readable, searchable PDF, we found that using a fixed image
 height of around 1000 pixels was about right. This results in a
 PDF which is high quality, but is rather more manageable, clocking
 in at around 70MB for a 500 page book.
@@ -72,13 +69,13 @@ common method is to use a font which is completely invisible,
 however this is somewhat of a hack, and can cause issues with
 Unicode characters, and generally be unreliable.
 
-The ideal method, which we have implemented, is to use a feature
+The best method, which we have implemented, is to use a feature
 of the PDF specification called "Text Rendering Mode" to request
-that the text should be laid out as normal, but not "stroked"
-(outlined) or "filled". Not many PDF rendering libraries seem to
+that the text should be laid out as normal, but not actually
+drawn in the final output. Not many PDF rendering libraries seem to
 support this, but we were able to easily add it to the gofpdf
 package we use, and
-[get the support upstream](https://github.com/jung-kurt/gofpdf/pull/331)
+[get the support integrated upstream into the official library](https://github.com/jung-kurt/gofpdf/pull/331)
 for others to benefit from. We could then use a simple and
 reliable freely licensed Unicode font, DejaVu Condensed, so that
 any complex characters and layout is handled correctly, while the
@@ -135,17 +132,19 @@ each word to ensure that they precisely fill the width dimensions
 that the OCR engine reports. This can be done using a feature of
 PDF called "Horizontal Stretch", which again we had to
 [add to the gofpdf library](https://github.com/nickjwhite/gofpdf/commit/3e3f1fbc561c4ae28e33d92959d58f2aa74428c8)
-to implement. This work has also been submitted upstream, but as
-it's more niche it's not yet clear whether it will be accepted.
+to implement.
 With this last step done, the text lines up pretty perfectly with
 the text in the underlying image, while perfectly supporting
 search and copying of any amount of text.
 
 ## Conclusion
 
-Our PDF support is now the best available anywhere, as far as
-we're aware. Give it a go with the
-[rescribe tool](https://rescribe.xyz/rescribe)
+PDFs may seem straightforward, but producing really high
+quality searchable PDFs from OCR took us some work to get
+right.
+
+You can see the results with the latest versions of
+the [rescribe tool](https://rescribe.xyz/rescribe)
 or the `pdfbook` tool in the
 [bookpipeline package](https://rescribe.xyz/bookpipeline).
 Let us know how you get on by [email](mailto:info@rescribe.xyz) or