Write more of adaptive binarisation; basically done

author: Nick White <git@njw.name> 2020-11-06 16:11:15 +0000
committer: Nick White <git@njw.name> 2020-11-06 16:11:15 +0000
commit: b5f11cf4910cef277defad486604a3aab7b7bf74 (patch)
tree: 3b8a3bf1c82e89b639d375bc96133ef9206122fa
parent: 4779d2cdaa89d06ce88d2c7de968f3f861821454 (diff)
1 files changed, 68 insertions, 7 deletions
diff --git a/content/posts/adaptive-binarisation/index.md b/content/posts/adaptive-binarisation/index.md
index e4e2232..46316f2 100644
--- a/content/posts/adaptive-binarisation/index.md
+++ b/content/posts/adaptive-binarisation/index.md
@@ -1,8 +1,7 @@
 ---
 title: "Adaptive Binarisation"
-date: 2019-12-17
-draft: true
-categories: [binarisation, preprocessing, image manipulation]
+date: 2020-11-06
+categories: [binarisation, preprocessing, image manipulation, software, code, tools]
 ---
 The [previous post](/posts/binarisation-introduction) covered the
 basics of binarisation, and introduced the Otsu algorithm, a good
@@ -32,9 +31,71 @@ screwed up)
 
 A popular algorithm for this adaptive binarisation technique was
 described in a 2000 paper by J. Sauvola
-[Adaptive document image binarization](http://www.ee.oulu.fi/mvg/files/pdf/pdf_24.pdf),
-and is now generally just refered to as the "Sauvola algorithm".
+[Adaptive document image binarization](http://www.ee.oulu.fi/mvg/files/pdf/pdf_24.pdf).
+The key part of the paper which is still used today is his modification
+of an earlier algorithm for adaptive binarisation by Niblack (1986), to
+add standard deviation to the calculation of a local threshold for
+an area. This improves the algorithm significantly, enabling it to take
+better account of variations in particularly dark or light areas of an
+image, such as in the gutter or a particularly stained area.
 
-only some of what sauvola actually outlined in the paper is generally used today, which is a modification of niblack's (1986). sauvola's paper suggests several different binarisation methods, with an algorithm to switch between them, but these days it's just the main "binarization of textual components" part (3.3) that the paper is remembered for, and is generally simply refered to as "sauvola".
+There are two parameters that need to be set for the Sauvola algorithm,
+a general *threshold value* called *k*, and the *window size*. The
+*threshold value* works much the same way as in other binarisation
+algorithms, essentially setting how aggressive the algorithm should be
+in removing dark areas, at the expense of potentially removing some
+parts of text. It can be between 0 and 1, with a higher number
+resulting in more aggressive binarisation, and is generally set to
+somewhere between 0.05 and 0.4 for scanned books. The *window size* is
+the number of pixels around each pixel which are taken into account by
+the algorithm to determine whether to set the pixel to black or white.
+A larger window size results in less local adjustment for different
+areas of a page, but too low and a good deal of noise can be incorrectly
+detected as text.
 
-there are two variables, 'k', the so-called 'threshold value', and the window size.
+A major disadvantage of adaptive binarisation methods is that they
+tend to be much slower than global threshold binarisation, as a new
+threshold must be calculated for every pixel. In the case of the
+Sauvola algorithm, this requires calculating the mean and standard
+deviation of every pixel in the window around each pixel, for each
+pixel, which adds up to a lot of calculations.
+
+Fortunately, however, we can use some clever mathematics to
+dramatically reduce the amount of comuptation that is necessary for
+adaptive binarisation. The key is using *Integral Images*, also
+known as *Summed Area Tables*. These work by calculating the sum of
+each pixel above and to the left of each pixel, which can then be
+used to calculate the mean and standard deviation of any area in an
+image by a very simple calculation of adding two numbers together
+and subtracting another two numbers from the Integral Image - a
+process that is many times faster than calculating the mean and
+standard deviation from scratch for each pixel. The *Integral Image*
+essentially does all of the calculation once for the whole image.
+This technique is well described in the paper
+[Efficient implementation of local adaptive thresholding techniques using integral images](http://citeseerx.ist.psu.edu/viewdoc/download;?doi=10.1.1.109.2748&rep=rep1&type=pdf)
+ by Shafait, Keysers and Breuel.
+
+## Our Tools
+
+There are surprisingly few free software tools to do binarisation
+with the Sauvola algorithm, and none that I could find that can
+use *Integral Images* to speed the processing up. When you're just
+binarising a few images, maybe that isn't so important, but when
+you're operating on a scale of many thousands of images, potentially
+processing each one multiple times with different threshold values,
+cutting the processing time from 60 seconds to 0.1 seconds makes a
+very big difference.
+
+So, we wrote a Go package which does Sauvola binarisation, which
+contains both standalone command line tools and can be used as a
+package in your own Go projects. The same package contains some
+other image preprocessing functionality, so the package is called
+[rescribe.xyz/preproc](https://rescribe.xyz/preproc). The command
+line binarisation tool in the package is *binarize*, and the
+relevant Go functions are Sauvola() and IntegralSauvola().
+
+The Integral Image support is provided by another package we wrote,
+[rescribe.xyz/integralimg](https://rescribe.xyz/integralimg), which
+is general purpose and has a test suite, so will also be useful for
+other image processing tools which require calculating the mean or
+standard deviation for different areas of an image.
author	Nick White <git@njw.name>	2020-11-06 16:11:15 +0000
committer	Nick White <git@njw.name>	2020-11-06 16:11:15 +0000
commit	b5f11cf4910cef277defad486604a3aab7b7bf74 (patch)
tree	3b8a3bf1c82e89b639d375bc96133ef9206122fa
parent	4779d2cdaa89d06ce88d2c7de968f3f861821454 (diff)