From 18cc85896bd33eb94fee5d20efe6c557357980d0 Mon Sep 17 00:00:00 2001
From: Nick White
Date: Mon, 11 Oct 2021 17:37:32 +0100
Subject: Finish (probably) pdf post

---
 content/posts/pdfs/badquality.webp | Bin 0 -> 73496 bytes
 content/posts/pdfs/index.md | 32 +++++++++++++--------------
 content/posts/pdfs/lineheighttofontsize.webp | Bin 0 -> 169320 bytes
 content/posts/pdfs/nicesearch.webp | Bin 0 -> 77038 bytes
 content/posts/pdfs/stretch.webp | Bin 0 -> 182550 bytes
 images/pdfs/stretch-smaller.xcf | Bin 0 -> 942805 bytes
 images/pdfs/stretch.xcf | Bin 0 -> 2901120 bytes
 7 files changed, 15 insertions(+), 17 deletions(-)
 create mode 100644 content/posts/pdfs/badquality.webp
 create mode 100644 content/posts/pdfs/lineheighttofontsize.webp
 create mode 100644 content/posts/pdfs/nicesearch.webp
 create mode 100644 content/posts/pdfs/stretch.webp
 create mode 100644 images/pdfs/stretch-smaller.xcf
 create mode 100644 images/pdfs/stretch.xcf

diff --git a/content/posts/pdfs/badquality.webp b/content/posts/pdfs/badquality.webp
new file mode 100644
index 0000000..5c1b932
Binary files /dev/null and b/content/posts/pdfs/badquality.webp differ
diff --git a/content/posts/pdfs/index.md b/content/posts/pdfs/index.md
index 1e8d11e..02e8457 100644
--- a/content/posts/pdfs/index.md
+++ b/content/posts/pdfs/index.md
@@ -1,19 +1,19 @@
 ---
 title: "Making great PDFs for OCRed works"
-date: 2021-09-06
+date: 2021-10-11
 categories: [pdf, software, code]
 ---
 Recently we have been putting some effort into improving the PDF output from our tools, which have all made it into the latest
-release of [rescribe](https://rescribe.xyz/rescribe) (v0.5.1). PDFs
-are an interesting, sometimes tricky file format to produce
-correctly, so here we'll run through some of the ways we get really
-good PDFs out of our pipeline, and exactly what "good" means in
-this context.
+release of [rescribe](https://rescribe.xyz/rescribe) (v0.5.1).
While
+they may seem simple, PDFs are a surprisingly complex, sometimes
+tricky file format to produce correctly, so here we'll run through
+some of the ways we get really good PDFs out of our pipeline, and
+exactly what "good" means in this context.
-## What makes a good searchable PDF?
+![Example of PDF with good search highlighting](nicesearch.webp)
-![TODO: Example of nice pdf with nice search highlighting](example-01.png)
+## What makes a good searchable PDF?
 If you've worked with PDFs a lot, from a variety of different sources, you may have seen various issues crop up. A good
@@ -46,8 +46,6 @@ state it is in today.
 ## Finding a good image size to embed
-![TODO: Example of too low quality vs fine quality](example-01.png)
-
 We prefer to do our OCR on images scanned with high DPI, as they produce more accurate results, but they also take a lot of disk space. As an example, the page images we use derived from books
@@ -64,6 +62,8 @@ height of around 1000 pixels was about right. This results in a PDF which is high quality, but is rather more manageable, clocking in at around 70MB for a 500 page book.
+{{< figure src="badquality.webp" caption="Example of a PDF with overly compressed images." >}}
+
 ## Hiding the text layer
 Interestingly there are quite a few different methods for creating
@@ -72,8 +72,6 @@ common method is to use a font which is completely invisible, however this is somewhat of a hack, and can cause issues with Unicode characters, and generally be unreliable.
-![TODO: Example of copy pasting unicode going badly](example-01.png)
-
 The ideal method, which we have implemented, is to use a feature of the PDF specification called "Text Rendering Mode" to request that the text should be laid out as normal, but not "stroked"
@@ -107,7 +105,7 @@ in blocks of text copied from the PDF being garbled, and means searching by word or phrase will generally fail, as spaces are incorrectly added or omitted by the PDF viewer.
-![Example of setting font size using lineheight](example-lineheighttofontsize.webp)
+{{< figure src="lineheighttofontsize.webp" caption="Example of setting font size using lineheight (hidden text layer in blue)." >}}
 Specifying the location of each word, instead, addresses the issue of spaces being incorrectly added or omitted, but there are still
@@ -122,7 +120,7 @@ copying. The solution to this is simple, to always set the vertical position and height of each word to that of the line as a whole, ignoring the dimensions given by the OCR engine.
-![TODO: Example of nostretch vs final](example-01.png)
+{{< figure src="stretch.webp" caption="Example of the effect of adding horizontal stretch (hidden text layer in blue)." >}}
 That improves things a lot, but there's still one issue, which is that each word may be too narrow or wide to match with the text
@@ -148,7 +146,7 @@ search and copying of any amount of text. Our PDF support is now the best available anywhere, as far as we're aware. Give it a go with the [rescribe tool](https://rescribe.xyz/rescribe)
-or the
-[bookpipeline package (for experts)](https://rescribe.xyz/bookpipeline),
-and let us know how you get on by [email](mailto:info@rescribe.xyz) or
+or the `pdfbook` tool in the
+[bookpipeline package](https://rescribe.xyz/bookpipeline).
+Let us know how you get on by [email](mailto:info@rescribe.xyz) or
 [@RescribeOCR on twitter](https://twitter.com/rescribeOCR).
diff --git a/content/posts/pdfs/lineheighttofontsize.webp b/content/posts/pdfs/lineheighttofontsize.webp
new file mode 100644
index 0000000..c312429
Binary files /dev/null and b/content/posts/pdfs/lineheighttofontsize.webp differ
diff --git a/content/posts/pdfs/nicesearch.webp b/content/posts/pdfs/nicesearch.webp
new file mode 100644
index 0000000..9b296e9
Binary files /dev/null and b/content/posts/pdfs/nicesearch.webp differ
diff --git a/content/posts/pdfs/stretch.webp b/content/posts/pdfs/stretch.webp
new file mode 100644
index 0000000..858d909
Binary files /dev/null and b/content/posts/pdfs/stretch.webp differ
diff --git a/images/pdfs/stretch-smaller.xcf b/images/pdfs/stretch-smaller.xcf
new file mode 100644
index 0000000..0152545
Binary files /dev/null and b/images/pdfs/stretch-smaller.xcf differ
diff --git a/images/pdfs/stretch.xcf b/images/pdfs/stretch.xcf
new file mode 100644
index 0000000..a56389d
Binary files /dev/null and b/images/pdfs/stretch.xcf differ
--
cgit v1.2.1-24-ge1ad