-rw-r--r-- content/posts/pdfs/index.md 154
-rw-r--r-- images/pdfs/lineheighttofontsize-smaller.xcf bin 0 -> 880128 bytes
-rw-r--r-- images/pdfs/lineheighttofontsize.xcf bin 0 -> 2964524 bytes
3 files changed, 154 insertions, 0 deletions
diff --git a/content/posts/pdfs/index.md b/content/posts/pdfs/index.md
new file mode 100644
index 0000000..1e8d11e
--- /dev/null
+++ b/content/posts/pdfs/index.md
@@ -0,0 +1,154 @@
+---
+title: "Making great PDFs for OCRed works"
+date: 2021-09-06
+categories: [pdf, software, code]
+---
+Recently we have been putting some effort into improving the PDF
+output from our tools, and the improvements have all made it into
+the latest release of [rescribe](https://rescribe.xyz/rescribe)
+(v0.5.1). PDFs are an interesting, sometimes tricky file format to
+produce correctly, so here we'll run through some of the ways we
+get really good PDFs out of our pipeline, and exactly what "good"
+means in this context.
+
+## What makes a good searchable PDF?
+
+![TODO: Example of nice pdf with nice search highlighting](example-01.png)
+
+If you've worked with PDFs a lot, from a variety of different
+sources, you may have seen various issues crop up. A good
+searchable PDF is fundamentally a PDF which is clear and easy to
+use, and where everything works as expected. To understand how to
+get there, it is more helpful to consider some of the many ways
+that a searchable PDF can fail to live up to that goal.
+
+* Text copy-pasted from a PDF has no spaces between words, extra
+ line breaks, or other formatting issues.
+* Non-ASCII characters aren't correctly embedded, so searching for
+ words which contain them, or copying them from a PDF, doesn't
+ work.
+* The PDF doesn't open in some PDF readers, or causes warnings to
+ be displayed.
+* When searching a PDF, the highlighted area for a match doesn't
+ exactly match the location of the result.
+* PDF files are too large to easily use, meaning they're difficult
+ to share, and can cause issues on older computers.
+* Page images are too compressed to be useable.
+* The DPI is set incorrectly, causing PDFs to be rendered far too
+ zoomed in or out by default, both on screen and if printed.
+
+This list should give a sense of how the simple goal of producing
+a searchable PDF that just works as expected actually requires a
+good deal of care and thought to get right. In the rest of this
+post we'll go over some of the more interesting and tricky issues
+we tackled to get the PDF output of our tools to the excellent
+state it is in today.
+
+## Finding a good image size to embed
+
+![TODO: Example of too low quality vs fine quality](example-01.png)
+
+We prefer to do our OCR on images scanned at a high DPI, as they
+produce more accurate results, but they also take a lot of disk
+space. As an example, the page images we use which are derived
+from Google Books tend to be around 1600x2500 pixels, which even
+when compressed as JPEG results in each image being around 500KB.
+While that may sound manageable, a book with 500 page images of
+this size would therefore take around 250MB, which is an
+annoyingly heavy PDF to open, store, and share.
+
+While the optimum image quality to pick ultimately depends on the
+use to which the PDF is to be put, in the general case of wanting
+a readable, searchable PDF, we found that using a fixed image
+height of around 1000 pixels was about right. This results in a
+PDF which is high quality, but is rather more manageable, clocking
+in at around 70MB for a 500 page book.
+
+## Hiding the text layer
+
+Interestingly there are quite a few different methods for creating
+the text layer on a searchable PDF made up of page images. One
+common method is to use a font which is completely invisible.
+However, this is something of a hack, which can cause issues with
+Unicode characters and is generally unreliable.
+
+![TODO: Example of copy pasting unicode going badly](example-01.png)
+
+The ideal method, which we have implemented, is to use a feature
+of the PDF specification called "Text Rendering Mode" to request
+that the text should be laid out as normal, but not "stroked"
+(outlined) or "filled". Not many PDF generation libraries seem to
+support this, but we were able to easily add it to the gofpdf
+package we use, and
+[get the support upstream](https://github.com/jung-kurt/gofpdf/pull/331)
+for others to benefit from. We could then use a simple and
+reliable freely licensed Unicode font, DejaVu Condensed, so that
+any complex characters and layout are handled correctly, while the
+only text visible in the PDF is that in the original page images.
+
+## Aligning the text with the image
+
+The trickiest part of getting a really good searchable PDF is
+lining up the text layer perfectly with the printed text in the
+image. The font chosen for the PDF text layer will rarely have
+precisely the same dimensions as the text in the image, so it's
+common to have text which doesn't line up precisely with its
+location on the image. There are several ways to address this, but
+doing it right isn't easy.
+
+One way would be to specify the exact location of each character,
+using the coordinates directly from the OCR engine. This has some
+appeal, as the starting location of each character should be
+exactly correct. However, it leaves the PDF viewer to decide where
+spaces between words should be, and which line each character is
+on. While that may sound straightforward, in a large corpus, even
+of good quality scans, this causes constant errors. As a result,
+blocks of text copied from the PDF are garbled, and searching by
+word or phrase will generally fail, as spaces are incorrectly
+added or omitted by the PDF viewer.
+
+![Example of setting font size using lineheight](example-lineheighttofontsize.webp)
+
+Specifying the location of each word, instead, addresses the issue
+of spaces being incorrectly added or omitted, but several issues
+remain. Each word can have its highest and lowest points at
+different levels, depending on whether it contains characters
+with ascenders and descenders (such as 'b' and 'p'), so each word
+appears to move up and down relative to its neighbours. This
+affects the ability of the PDF viewer to determine whether words
+are on the same line more than you would expect, resulting once
+again in unreliable search and text copying. The solution is
+simple: always set the vertical position and height of each word
+to that of the line as a whole, ignoring the per-word dimensions
+given by the OCR engine.
+
+![TODO: Example of nostretch vs final](example-01.png)
+
+That improves things a lot, but there's still one issue, which is
+that each word may be too narrow or wide to match with the text
+in the underlying image, as the font dimensions will not precisely
+match those in the image. This isn't the end of the world, as it
+doesn't affect searching or copying of text, but it is unexpected
+and messy, and we can fix it. There are two parts to the fix. The
+easy one is to set the font size for each word to match the line
+height, which gets us most of the way there. Perfecting things,
+though, is trickier. The goal is to stretch the characters in
+each word to ensure that they precisely fill the width that the
+OCR engine reports. This can be done using a feature of PDF
+called "Horizontal Stretch", which again we had to
+[add to the gofpdf library](https://github.com/nickjwhite/gofpdf/commit/3e3f1fbc561c4ae28e33d92959d58f2aa74428c8)
+to implement. This work has also been submitted upstream, but as
+it's more niche it's not yet clear whether it will be accepted.
+With this last step done, the text lines up almost perfectly with
+the text in the underlying image, while fully supporting search
+and copying of any amount of text.
+
+## Conclusion
+
+Our PDF support is now the best available anywhere, as far as
+we're aware. Give it a go with the
+[rescribe tool](https://rescribe.xyz/rescribe)
+or the
+[bookpipeline package (for experts)](https://rescribe.xyz/bookpipeline),
+and let us know how you get on by [email](mailto:info@rescribe.xyz) or
+[@RescribeOCR on Twitter](https://twitter.com/rescribeOCR).
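Note on the post's "Finding a good image size to embed" section: the size arithmetic there can be checked with a minimal Go sketch. This is not the rescribe implementation. The function names `scaleToHeight` and `bookSizeMB` are hypothetical, and the 140KB per-page figure is only inferred from the post's "around 70MB for a 500 page book".

```go
package main

import "fmt"

// scaleToHeight returns the dimensions of an image resized to a fixed
// target height, keeping the aspect ratio and rounding the width to
// the nearest pixel. (Hypothetical helper, not part of rescribe.)
func scaleToHeight(w, h, targetH int) (int, int) {
	if h <= 0 || targetH <= 0 {
		return 0, 0
	}
	newW := (w*targetH + h/2) / h
	return newW, targetH
}

// bookSizeMB estimates the total size of a book's page images in MB,
// treating 1MB as 1000KB as the post's round figures do.
func bookSizeMB(pages, perPageKB int) float64 {
	return float64(pages*perPageKB) / 1000
}

func main() {
	// A typical Google Books derived page image, per the post.
	w, h := scaleToHeight(1600, 2500, 1000)
	fmt.Printf("embedded page image: %dx%d\n", w, h)

	// 500 pages at roughly 500KB each, the full-size case in the post.
	fmt.Printf("full-size book: %.0fMB\n", bookSizeMB(500, 500))

	// 140KB per page is an assumption derived from 70MB / 500 pages.
	fmt.Printf("downscaled book: %.0fMB\n", bookSizeMB(500, 140))
}
```

Fixing the height rather than the width keeps every page at a consistent reading size, whatever the aspect ratio of the original scan.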
diff --git a/images/pdfs/lineheighttofontsize-smaller.xcf b/images/pdfs/lineheighttofontsize-smaller.xcf
new file mode 100644
index 0000000..ec846ab
--- /dev/null
+++ b/images/pdfs/lineheighttofontsize-smaller.xcf
Binary files differ
diff --git a/images/pdfs/lineheighttofontsize.xcf b/images/pdfs/lineheighttofontsize.xcf
new file mode 100644
index 0000000..cc98840
--- /dev/null
+++ b/images/pdfs/lineheighttofontsize.xcf
Binary files differ
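Note on the post's "Hiding the text layer" and "Aligning the text with the image" sections: the three ideas (invisible text, line-based vertical placement, and per-word horizontal stretch) map onto standard PDF text operators, namely `Tr` for the rendering mode (mode 3 neither fills nor strokes) and `Tz` for horizontal scaling as a percentage. The Go sketch below emits those operators directly rather than using gofpdf, so it is only an illustration of the technique, not the rescribe code. The types `Box` and `Word`, the functions `hStretch`, `pdfEscape`, `wordOps` and `halfEmWidth`, and the font resource name `/F1` are all hypothetical. Two simplifications are assumed: a literal string operand is only correct for a simple single-byte font (a Unicode font such as the one the post uses needs encoded glyph IDs instead), and the baseline is placed at the bottom of the line box without allowing for descenders.

```go
package main

import (
	"fmt"
	"strings"
)

// Box is a rectangle in PDF points with the origin at the bottom left
// of the page. (Hypothetical type for this sketch.)
type Box struct {
	X, Y, W, H float64
}

// Word is one OCRed word with the box the OCR engine reported for it.
type Word struct {
	Text string
	Box  Box
}

// hStretch returns the horizontal scaling percentage that makes text
// of the given natural width exactly fill targetW.
func hStretch(naturalW, targetW float64) float64 {
	if naturalW <= 0 {
		return 100
	}
	return targetW / naturalW * 100
}

// pdfEscape escapes the characters that are special inside a PDF
// literal string.
func pdfEscape(s string) string {
	return strings.NewReplacer(`\`, `\\`, `(`, `\(`, `)`, `\)`).Replace(s)
}

// halfEmWidth is a stand-in for real font metrics: it assumes every
// character is half an em wide. A real implementation would ask the
// font for the string width.
func halfEmWidth(s string, size float64) float64 {
	return float64(len([]rune(s))) * size * 0.5
}

// wordOps returns content stream operators that place one word
// invisibly. The vertical position and font size come from the line,
// not the word, and the word is stretched to the OCR-reported width.
func wordOps(w Word, line Box, measure func(string, float64) float64) string {
	size := line.H
	tz := hStretch(measure(w.Text, size), w.Box.W)
	return fmt.Sprintf("BT /F1 %.2f Tf 3 Tr %.2f Tz 1 0 0 1 %.2f %.2f Tm (%s) Tj ET",
		size, tz, w.Box.X, line.Y, pdfEscape(w.Text))
}

func main() {
	line := Box{X: 72, Y: 700, W: 400, H: 12}
	// The word's own Y and H differ from the line's and are ignored.
	word := Word{Text: "example", Box: Box{X: 72, Y: 698, W: 63, H: 9}}
	fmt.Println(wordOps(word, line, halfEmWidth))
}
```

Taking the horizontal position and width from the word but the vertical position and size from the line is what keeps neighbouring words on a common baseline, so a viewer can group them into one line when searching or copying.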