author: Nick White <git@njw.name>, 2021-10-11 17:37:32 +0100
committer: Nick White <git@njw.name>, 2021-10-11 17:37:32 +0100
commit: 18cc85896bd33eb94fee5d20efe6c557357980d0
tree: d110364d4aab081b707e5fa36952e6ebe41880b5 (/content/posts)
parent: 7d76916b4a33d92e4af324374dda6e10700dc8a7
Finish (probably) pdf post
Diffstat (limited to 'content/posts')
-rw-r--r-- content/posts/pdfs/badquality.webp | bin 0 -> 73496 bytes
-rw-r--r-- content/posts/pdfs/index.md | 32
-rw-r--r-- content/posts/pdfs/lineheighttofontsize.webp | bin 0 -> 169320 bytes
-rw-r--r-- content/posts/pdfs/nicesearch.webp | bin 0 -> 77038 bytes
-rw-r--r-- content/posts/pdfs/stretch.webp | bin 0 -> 182550 bytes
5 files changed, 15 insertions, 17 deletions
diff --git a/content/posts/pdfs/badquality.webp b/content/posts/pdfs/badquality.webp
new file mode 100644
index 0000000..5c1b932
--- /dev/null
+++ b/content/posts/pdfs/badquality.webp
Binary files differ
diff --git a/content/posts/pdfs/index.md b/content/posts/pdfs/index.md
index 1e8d11e..02e8457 100644
--- a/content/posts/pdfs/index.md
+++ b/content/posts/pdfs/index.md
@@ -1,19 +1,19 @@
---
title: "Making great PDFs for OCRed works"
-date: 2021-09-06
+date: 2021-10-11
categories: [pdf, software, code]
---
Recently we have been putting some effort into improving the PDF
output from our tools, which have all made it into the latest
-release of [rescribe](https://rescribe.xyz/rescribe) (v0.5.1). PDFs
-are an interesting, sometimes tricky file format to produce
-correctly, so here we'll run through some of the ways we get really
-good PDFs out of our pipeline, and exactly what "good" means in
-this context.
+release of [rescribe](https://rescribe.xyz/rescribe) (v0.5.1). While
+they may seem simple, PDFs are a surprisingly complex, sometimes
+tricky file format to produce correctly, so here we'll run through
+some of the ways we get really good PDFs out of our pipeline, and
+exactly what "good" means in this context.
-## What makes a good searchable PDF?
+![Example of PDF with good search highlighting](nicesearch.webp)
-![TODO: Example of nice pdf with nice search highlighting](example-01.png)
+## What makes a good searchable PDF?
If you've worked with PDFs a lot, from a variety of different
sources, you may have seen various issues crop up. A good
@@ -46,8 +46,6 @@ state it is in today.
## Finding a good image size to embed
-![TODO: Example of too low quality vs fine quality](example-01.png)
-
We prefer to do our OCR on images scanned with high DPI, as they
produce more accurate results, but they also take a lot of disk
space. As an example, the page images we use derived from books
@@ -64,6 +62,8 @@ height of around 1000 pixels was about right. This results in a
PDF which is high quality, but is rather more manageable, clocking
in at around 70MB for a 500 page book.
+{{< figure src="badquality.webp" caption="Example of a PDF with overly compressed images." >}}
+
## Hiding the text layer
Interestingly there are quite a few different methods for creating
@@ -72,8 +72,6 @@ common method is to use a font which is completely invisible,
however this is somewhat of a hack, and can cause issues with
Unicode characters, and generally be unreliable.
-![TODO: Example of copy pasting unicode going badly](example-01.png)
-
The ideal method, which we have implemented, is to use a feature
of the PDF specification called "Text Rendering Mode" to request
that the text should be laid out as normal, but not "stroked"
@@ -107,7 +105,7 @@ in blocks of text copied from the PDF being garbled, and means
searching by word or phrase will generally fail, as spaces are
incorrectly added or omitted by the PDF viewer.
-![Example of setting font size using lineheight](example-lineheighttofontsize.webp)
+{{< figure src="lineheighttofontsize.webp" caption="Example of setting font size using lineheight (hidden text layer in blue)." >}}
Specifying the location of each word, instead, addresses the issue
of spaces being incorrectly added or omitted, but there are still
@@ -122,7 +120,7 @@ copying. The solution to this is simple, to always set the
vertical position and height of each word to that of the line as
a whole, ignoring the dimensions given by the OCR engine.
-![TODO: Example of nostretch vs final](example-01.png)
+{{< figure src="stretch.webp" caption="Example of the effect of adding horizontal stretch (hidden text layer in blue)." >}}
That improves things a lot, but there's still one issue, which is
that each word may be too narrow or wide to match with the text
@@ -148,7 +146,7 @@ search and copying of any amount of text.
Our PDF support is now the best available anywhere, as far as
we're aware. Give it a go with the
[rescribe tool](https://rescribe.xyz/rescribe)
-or the
-[bookpipeline package (for experts)](https://rescribe.xyz/bookpipeline),
-and let us know how you get on by [email](mailto:info@rescribe.xyz) or
+or the `pdfbook` tool in the
+[bookpipeline package](https://rescribe.xyz/bookpipeline).
+Let us know how you get on by [email](mailto:info@rescribe.xyz) or
[@RescribeOCR on twitter](https://twitter.com/rescribeOCR).
diff --git a/content/posts/pdfs/lineheighttofontsize.webp b/content/posts/pdfs/lineheighttofontsize.webp
new file mode 100644
index 0000000..c312429
--- /dev/null
+++ b/content/posts/pdfs/lineheighttofontsize.webp
Binary files differ
diff --git a/content/posts/pdfs/nicesearch.webp b/content/posts/pdfs/nicesearch.webp
new file mode 100644
index 0000000..9b296e9
--- /dev/null
+++ b/content/posts/pdfs/nicesearch.webp
Binary files differ
diff --git a/content/posts/pdfs/stretch.webp b/content/posts/pdfs/stretch.webp
new file mode 100644
index 0000000..858d909
--- /dev/null
+++ b/content/posts/pdfs/stretch.webp
Binary files differ