---
title: "Making great PDFs for OCRed works"
date: 2021-09-06
categories: [pdf, software, code]
---

Recently we have put some effort into improving the PDF output from our
tools, and those improvements have all made it into the latest release of
[rescribe](https://rescribe.xyz/rescribe) (v0.5.1). PDFs are an interesting,
sometimes tricky file format to produce correctly, so here we'll run through
some of the ways we get really good PDFs out of our pipeline, and exactly
what "good" means in this context.

## What makes a good searchable PDF?

![TODO: Example of nice pdf with nice search highlighting](example-01.png)

If you've worked a lot with PDFs from a variety of different sources, you
may have seen all sorts of issues crop up. A good searchable PDF is
fundamentally one which is clear and easy to use, and where everything works
as expected. It's more helpful, when thinking about how to get there, to
consider some of the many ways a searchable PDF can fail to live up to that
goal:

* Text copied from the PDF has no spaces between words, extra line breaks,
  or other formatting problems.
* Non-ASCII characters aren't correctly embedded, so searching for words
  which contain them, or copying them from the PDF, doesn't work.
* The PDF doesn't open in some PDF readers, or causes warnings to be
  displayed.
* When searching the PDF, the highlighted area for a match doesn't exactly
  match the location of the result.
* The PDF file is too large to use easily, making it difficult to share and
  liable to cause problems on older computers.
* Page images are so heavily compressed that they're barely usable.
* The DPI is set incorrectly, causing the PDF to be rendered far too zoomed
  in or out by default, both on screen and when printed.

This list should give a sense of how the simple goal of producing a
searchable PDF that just works as expected actually requires a good deal of
care and thought to get right. In the rest of this post we'll go over some
of the more interesting and tricky issues we tackled to get the PDF output
of our tools to the excellent state it is in today.

## Finding a good image size to embed

![TODO: Example of too low quality vs fine quality](example-01.png)

We prefer to do our OCR on images scanned at a high DPI, as they produce
more accurate results, but they also take up a lot of disk space. As an
example, the page images we use from Google Books tend to be around
1600x2500 pixels, which even when compressed as JPEG results in each image
being around 500KB. While that may sound manageable, a book with 500 page
images of this size would therefore come to around 250MB, which is an
annoyingly heavy PDF to open, store, and share.

While the optimum image quality ultimately depends on what the PDF will be
used for, in the general case of wanting a readable, searchable PDF we found
that scaling page images to a fixed height of around 1000 pixels was about
right. This results in a PDF which is still high quality but rather more
manageable, clocking in at around 70MB for a 500 page book.
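As a rough illustration of that resizing step, here's a minimal sketch in Go
using the standard library image packages and golang.org/x/image/draw. It
isn't the code our pipeline actually uses; the target height, the
interpolation kernel and the JPEG quality are just assumptions to show the
shape of the operation.

```go
package main

import (
	"image"
	"image/jpeg" // registers the JPEG decoder used by image.Decode
	"log"
	"os"

	"golang.org/x/image/draw"
)

// targetHeight is an assumption based on the ~1000 pixel figure above;
// tune it for your own material.
const targetHeight = 1000

// resizePage scales a JPEG page image down to targetHeight pixels high,
// preserving its aspect ratio, and writes the result out as a JPEG.
func resizePage(inPath, outPath string) error {
	in, err := os.Open(inPath)
	if err != nil {
		return err
	}
	defer in.Close()

	src, _, err := image.Decode(in)
	if err != nil {
		return err
	}

	b := src.Bounds()
	h := targetHeight
	if b.Dy() < h {
		h = b.Dy() // never scale up
	}
	w := b.Dx() * h / b.Dy()

	dst := image.NewRGBA(image.Rect(0, 0, w, h))
	// CatmullRom is a decent quality/speed tradeoff for downscaling scans.
	draw.CatmullRom.Scale(dst, dst.Bounds(), src, b, draw.Src, nil)

	out, err := os.Create(outPath)
	if err != nil {
		return err
	}
	defer out.Close()
	return jpeg.Encode(out, dst, &jpeg.Options{Quality: 80})
}

func main() {
	if err := resizePage("page.jpg", "page-small.jpg"); err != nil {
		log.Fatal(err)
	}
}
```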
## Hiding the text layer

Interestingly, there are quite a few different methods for creating the text
layer of a searchable PDF made up of page images. One common method is to
use a font which is completely invisible; however, this is something of a
hack, can cause issues with Unicode characters, and is generally unreliable.

![TODO: Example of copy pasting unicode going badly](example-01.png)

The ideal method, which we have implemented, is to use a feature of the PDF
specification called "Text Rendering Mode" to request that the text be laid
out as normal, but neither "stroked" (outlined) nor "filled". Not many PDF
rendering libraries seem to support this, but we were able to easily add it
to the gofpdf package we use, and
[get the support upstream](https://github.com/jung-kurt/gofpdf/pull/331)
for others to benefit from. We could then use a simple and reliable, freely
licensed Unicode font, DejaVu Condensed, so that any complex characters and
layout are handled correctly, while the only text visible in the PDF is that
in the original page images.

## Aligning the text with the image

The trickiest part of getting a really good searchable PDF is lining up the
text layer perfectly with the printed text in the image. The font chosen for
the PDF text layer will rarely have precisely the same dimensions as the
text in the image, so it's common to see text which doesn't line up
precisely with its location on the image. There are several ways to address
this, but doing it right isn't easy.

One way would be to specify the exact location of each character, using the
coordinates directly from the OCR engine. This has some appeal, as the
starting location of each character should be exactly correct. However, it
leaves the PDF viewer to decide where the spaces between words should be,
and which line each character is on. While that may sound straightforward,
in a large corpus, even one of good quality scans, this causes constant
errors. Blocks of text copied from the PDF end up garbled, and searching by
word or phrase will generally fail, as spaces are incorrectly added or
omitted by the PDF viewer.

![Example of setting font size using lineheight](example-lineheighttofontsize.webp)

Specifying the location of each word, instead, addresses the issue of spaces
being incorrectly added or omitted, but several issues remain. Each word can
have its highest and lowest points at different levels, depending on whether
characters with ascenders or descenders (such as 'b' and 'p') are present,
so each word appears to move up and down relative to its neighbours. This
disrupts the PDF viewer's ability to work out which words are on the same
line more than you might expect, resulting once again in unreliable search
and text copying. The solution is simple: always set the vertical position
and height of each word to that of the line as a whole, ignoring the
dimensions given by the OCR engine.

![TODO: Example of nostretch vs final](example-01.png)

That improves things a lot, but there's still one issue: each word may be
too narrow or too wide to match the text in the underlying image, as the
font dimensions will not precisely match those in the image. This isn't the
end of the world, as it doesn't affect searching or copying of text, but it
is unexpected and messy, and we can fix it. There are two parts to the fix.
The easy one is to set the font size for each word to match the line height,
which gets us most of the way there. Perfecting things, though, is trickier.
The goal is to stretch the characters in each word so that they precisely
fill the width that the OCR engine reports. This can be done using a feature
of PDF called "Horizontal Stretch", which again we had to
[add to the gofpdf library](https://github.com/nickjwhite/gofpdf/commit/3e3f1fbc561c4ae28e33d92959d58f2aa74428c8)
to implement. This work has also been submitted upstream, but as it's more
niche it's not yet clear whether it will be accepted.
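To make the bookkeeping concrete, here's a rough sketch in Go of the
geometry described above. The `Word`, `Line` and `Placement` types and the
toy string-measuring function are invented for this illustration, not taken
from rescribe or gofpdf; in a real implementation the measuring would be
done by the PDF library (for example gofpdf's `GetStringWidth`), and each
placement would then be drawn invisibly using text rendering mode 3 (the PDF
`Tr` operator) with the computed horizontal scaling (the `Tz` parameter).

```go
package main

import "fmt"

// Word and Line describe OCR output geometry in PDF points. These types
// are invented for this sketch; they're not from the rescribe tools.
type Word struct {
	Text  string
	X     float64 // left edge of the word's bounding box
	Width float64 // width of the word's bounding box
}

type Line struct {
	Y      float64 // top of the line's bounding box
	Height float64 // height of the line's bounding box
	Words  []Word
}

// Placement holds what the PDF library needs to draw one word of the
// hidden text layer. The word is drawn with text rendering mode 3
// ("3 Tr"), i.e. neither filled nor stroked, so only the page image
// remains visible.
type Placement struct {
	Text     string
	X, Y     float64 // baseline origin of the word
	FontSize float64 // set to the line height
	Scaling  float64 // horizontal scaling in percent (the "Tz" parameter)
}

// layOutLine computes a placement for every word on a line. measure
// reports how wide a string is in the chosen font at the given size.
func layOutLine(l Line, measure func(s string, size float64) float64) []Placement {
	var ps []Placement
	for _, w := range l.Words {
		// Every word gets the line's vertical extent and the line height
		// as its font size, ignoring the word's own top and bottom, so
		// ascenders and descenders don't make words jiggle relative to
		// their neighbours.
		p := Placement{
			Text:     w.Text,
			X:        w.X,
			Y:        l.Y + l.Height, // baseline at the bottom of the line box (a simplification)
			FontSize: l.Height,
			Scaling:  100,
		}
		// Stretch or squeeze the word so it exactly fills the width the
		// OCR engine reported for it.
		if natural := measure(w.Text, l.Height); natural > 0 {
			p.Scaling = w.Width / natural * 100
		}
		ps = append(ps, p)
	}
	return ps
}

func main() {
	// A toy measurer: pretend every character is half an em wide.
	measure := func(s string, size float64) float64 {
		return float64(len([]rune(s))) * size * 0.5
	}
	line := Line{Y: 100, Height: 20, Words: []Word{
		{Text: "Making", X: 72, Width: 80},
		{Text: "great", X: 160, Width: 55},
		{Text: "PDFs", X: 223, Width: 58},
	}}
	for _, p := range layOutLine(line, measure) {
		fmt.Printf("%q at (%.0f, %.0f), size %.0f, scaling %.0f%%\n",
			p.Text, p.X, p.Y, p.FontSize, p.Scaling)
	}
}
```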
With this last step done, the text lines up almost perfectly with the text
in the underlying image, while fully supporting search and copying of any
amount of text.

## Conclusion

Our PDF support is now the best available anywhere, as far as we're aware.
Give it a go with the
[rescribe tool](https://rescribe.xyz/rescribe)
or the
[bookpipeline package (for experts)](https://rescribe.xyz/bookpipeline),
and let us know how you get on by [email](mailto:info@rescribe.xyz) or
[@RescribeOCR on twitter](https://twitter.com/rescribeOCR).