Finish (probably) pdf post

author: Nick White <git@njw.name> 2021-10-11 17:37:32 +0100
committer: Nick White <git@njw.name> 2021-10-11 17:37:32 +0100
commit: 18cc85896bd33eb94fee5d20efe6c557357980d0 (patch)
tree: d110364d4aab081b707e5fa36952e6ebe41880b5
parent: 7d76916b4a33d92e4af324374dda6e10700dc8a7 (diff)
7 files changed, 15 insertions, 17 deletions
diff --git a/content/posts/pdfs/badquality.webp b/content/posts/pdfs/badquality.webp
new file mode 100644
index 0000000..5c1b932
--- /dev/null
+++ b/content/posts/pdfs/badquality.webp
diff --git a/content/posts/pdfs/index.md b/content/posts/pdfs/index.md
index 1e8d11e..02e8457 100644
--- a/content/posts/pdfs/index.md
+++ b/content/posts/pdfs/index.md
@@ -1,19 +1,19 @@
 ---
 title: "Making great PDFs for OCRed works"
-date: 2021-09-06
+date: 2021-10-11
 categories: [pdf, software, code]
 ---
 Recently we have been putting some effort into improving the PDF
 output from our tools, which have all made it into the latest
-release of [rescribe](https://rescribe.xyz/rescribe) (v0.5.1). PDFs
-are an interesting, sometimes tricky file format to produce
-correctly, so here we'll run through some of the ways we get really
-good PDFs out of our pipeline, and exactly what "good" means in
-this context.
+release of [rescribe](https://rescribe.xyz/rescribe) (v0.5.1). While
+they may seem simple, PDFs are a surprisingly complex, sometimes
+tricky file format to produce correctly, so here we'll run through
+some of the ways we get really good PDFs out of our pipeline, and
+exactly what "good" means in this context.
 
-## What makes a good searchable PDF?
+![Example of PDF with good search highlighting](nicesearch.webp)
 
-![TODO: Example of nice pdf with nice search highlighting](example-01.png)
+## What makes a good searchable PDF?
 
 If you've worked with PDFs a lot, from a variety of different
 sources, you may have seen various issues crop up. A good
@@ -46,8 +46,6 @@ state it is in today.
 
 ## Finding a good image size to embed
 
-![TODO: Example of too low quality vs fine quality](example-01.png)
-
 We prefer to do our OCR on images scanned with high DPI, as they
 produce more accurate results, but they also take a lot of disk
 space. As an example, the page images we use derived from books
@@ -64,6 +62,8 @@ height of around 1000 pixels was about right. This results in a
 PDF which is high quality, but is rather more manageable, clocking
 in at around 70MB for a 500 page book.
 
+{{< figure src="badquality.webp" caption="Example of a PDF with overly compressed images." >}}
+
 ## Hiding the text layer
 
 Interestingly there are quite a few different methods for creating
@@ -72,8 +72,6 @@ common method is to use a font which is completely invisible,
 however this is somewhat of a hack, and can cause issues with
 Unicode characters, and generally be unreliable.
 
-![TODO: Example of copy pasting unicode going badly](example-01.png)
-
 The ideal method, which we have implemented, is to use a feature
 of the PDF specification called "Text Rendering Mode" to request
 that the text should be laid out as normal, but not "stroked"
@@ -107,7 +105,7 @@ in blocks of text copied from the PDF being garbled, and means
 searching by word or phrase will generally fail, as spaces are
 incorrectly added or omitted by the PDF viewer.
 
-![Example of setting font size using lineheight](example-lineheighttofontsize.webp)
+{{< figure src="lineheighttofontsize.webp" caption="Example of setting font size using lineheight (hidden text layer in blue)." >}}
 
 Specifying the location of each word, instead, addresses the issue
 of spaces being incorrectly added or omitted, but there are still
@@ -122,7 +120,7 @@ copying. The solution to this is simple, to always set the
 vertical position and height of each word to that of the line as
 a whole, ignoring the dimensions given by the OCR engine.
 
-![TODO: Example of nostretch vs final](example-01.png)
+{{< figure src="stretch.webp" caption="Example of the effect of adding horizontal stretch (hidden text layer in blue)." >}}
 
 That improves things a lot, but there's still one issue, which is
 that each word may be too narrow or wide to match with the text
@@ -148,7 +146,7 @@ search and copying of any amount of text.
 Our PDF support is now the best available anywhere, as far as
 we're aware. Give it a go with the
 [rescribe tool](https://rescribe.xyz/rescribe)
-or the
-[bookpipeline package (for experts)](https://rescribe.xyz/bookpipeline),
-and let us know how you get on by [email](mailto:info@rescribe.xyz) or
+or the `pdfbook` tool in the
+[bookpipeline package](https://rescribe.xyz/bookpipeline).
+Let us know how you get on by [email](mailto:info@rescribe.xyz) or
 [@RescribeOCR on twitter](https://twitter.com/rescribeOCR).
diff --git a/content/posts/pdfs/lineheighttofontsize.webp b/content/posts/pdfs/lineheighttofontsize.webp
new file mode 100644
index 0000000..c312429
--- /dev/null
+++ b/content/posts/pdfs/lineheighttofontsize.webp
diff --git a/content/posts/pdfs/nicesearch.webp b/content/posts/pdfs/nicesearch.webp
new file mode 100644
index 0000000..9b296e9
--- /dev/null
+++ b/content/posts/pdfs/nicesearch.webp
diff --git a/content/posts/pdfs/stretch.webp b/content/posts/pdfs/stretch.webp
new file mode 100644
index 0000000..858d909
--- /dev/null
+++ b/content/posts/pdfs/stretch.webp
diff --git a/images/pdfs/stretch-smaller.xcf b/images/pdfs/stretch-smaller.xcf
new file mode 100644
index 0000000..0152545
--- /dev/null
+++ b/images/pdfs/stretch-smaller.xcf
diff --git a/images/pdfs/stretch.xcf b/images/pdfs/stretch.xcf
new file mode 100644
index 0000000..a56389d
--- /dev/null
+++ b/images/pdfs/stretch.xcf