---
title: "Adaptive Binarisation"
date: 2020-11-06
categories: [binarisation, preprocessing, image manipulation, software, code, tools]
---
The [previous post](/posts/binarisation-introduction) covered the
basics of binarisation, and introduced the Otsu algorithm, a good
method for finding a global threshold value for a page. But there
are inevitable limitations to using a global threshold for
binarisation. A better approach is to use a threshold that adapts
over different regions of the page, so that as the conditions of the
page change, so can the threshold. This technique is called adaptive
binarisation.

For each pixel of an image, adaptive binarisation considers the
pixels around it to determine a good threshold. This means that even
in an area which is heavily shaded, for example near the spine of a
book, the text will still be correctly differentiated from the
background: both may be darker than the text in the rest of the
page, but it is each pixel's darkness relative to its surroundings
that matters.

<!--
(diagram showing 2 different areas of a page, one light and one dark,
comparing global and local thresholding [can be fake, as the global
threshold diagram was])
(actually can probably just have a dark area of a page, comparing global
and local thresholding, setting the global one such that the image is
screwed up)
-->

A popular algorithm for this adaptive binarisation technique was
described in a 2000 paper by J. Sauvola and M. Pietikäinen,
[Adaptive document image binarization](http://citeseerx.ist.psu.edu/viewdoc/download;?doi=10.1.1.98.880&rep=rep1&type=pdf).
The key part of the paper which is still used today is its
modification of an earlier adaptive binarisation algorithm by
Niblack (1986), adding the local standard deviation to the
calculation of the threshold for an area. This improves the
algorithm significantly, enabling it to better account for
variations in particularly dark or light areas of an image, such as
the gutter or a particularly stained area.
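
To make the difference concrete, here is a rough sketch of the two
threshold formulas as Go functions (this is just the textbook
formulation, not code from any particular implementation), where
`mean` and `stddev` are calculated over the window of pixels around
the pixel being considered:

```go
// Sketch of the two per-pixel threshold formulas. mean and stddev are
// the mean and standard deviation of the window around the pixel, k is
// the user-set threshold parameter, and R is the maximum possible
// standard deviation (128 for 8 bit greyscale images).

// Niblack (1986): the threshold simply tracks the local mean and
// standard deviation.
func niblackThreshold(mean, stddev, k float64) float64 {
	return mean + k*stddev
}

// Sauvola (2000): the standard deviation is normalised by R, which
// pulls the threshold further below the mean in flat, low contrast
// areas such as shading or stains, so they are not picked up as text.
func sauvolaThreshold(mean, stddev, k, R float64) float64 {
	return mean * (1 + k*((stddev/R)-1))
}
```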

There are two parameters that need to be set for the Sauvola
algorithm: a general *threshold value* called *k*, and the *window
size*. The *threshold value* works in much the same way as in other
binarisation algorithms, essentially setting how aggressive the
algorithm should be in removing dark areas, at the expense of
potentially removing some parts of the text. It can be between 0 and
1, with a higher number resulting in more aggressive binarisation.
The *window size* is the size of the area around each pixel which is
taken into account by the algorithm when deciding whether to set
that pixel to black or white. A larger window size results in less
local adjustment for different areas of a page, but too small a
window and a good deal of noise can be incorrectly detected as text.

A major disadvantage of adaptive binarisation methods is that they
tend to be much slower than global threshold binarisation, as a new
threshold must be calculated for every pixel. In the case of the
Sauvola algorithm, this means calculating the mean and standard
deviation of all the pixels in the window around every single pixel
in the image, which adds up to a lot of calculations.
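
To make that cost concrete, here is a rough sketch of what a naive
implementation has to do for an 8 bit greyscale image (this is not
the code from our package, just an illustration) - for every single
pixel, two further nested loops over its whole window:

```go
package binarise

import (
	"image"
	"image/color"
	"math"
)

// naiveSauvola binarises a greyscale image the slow way, recomputing
// the local mean and standard deviation from scratch for every pixel.
// windowSize is the side length of the square window, and k is the
// Sauvola threshold parameter.
func naiveSauvola(img *image.Gray, k float64, windowSize int) *image.Gray {
	b := img.Bounds()
	out := image.NewGray(b)
	r := windowSize / 2
	for y := b.Min.Y; y < b.Max.Y; y++ {
		for x := b.Min.X; x < b.Max.X; x++ {
			// Sum the window around (x, y), clipped to the image edges.
			var sum, sumSq, n float64
			for wy := y - r; wy <= y+r; wy++ {
				for wx := x - r; wx <= x+r; wx++ {
					if wx < b.Min.X || wx >= b.Max.X || wy < b.Min.Y || wy >= b.Max.Y {
						continue
					}
					v := float64(img.GrayAt(wx, wy).Y)
					sum += v
					sumSq += v * v
					n++
				}
			}
			mean := sum / n
			variance := sumSq/n - mean*mean
			if variance < 0 {
				variance = 0
			}
			stddev := math.Sqrt(variance)
			// Sauvola threshold, with R = 128 for 8 bit images.
			threshold := mean * (1 + k*((stddev/128)-1))
			if float64(img.GrayAt(x, y).Y) < threshold {
				out.SetGray(x, y, color.Gray{Y: 0}) // text
			} else {
				out.SetGray(x, y, color.Gray{Y: 255}) // background
			}
		}
	}
	return out
}
```

With a typical window size of around 40 pixels, the inner loops run
well over a thousand times for every pixel of the image, which is
where the slowness comes from.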

Fortunately, however, we can use some clever mathematics to
dramatically reduce the amount of computation that is necessary for
adaptive binarisation. The key is using *Integral Images*, also
known as *Summed Area Tables*. These work by storing, for each
pixel, the sum of all the pixels above and to the left of it. The
mean and standard deviation of any rectangular area of an image can
then be found with a very simple calculation: adding two numbers
from the Integral Image and subtracting two others - a process that
is many times faster than calculating the mean and standard
deviation from scratch for each pixel. The *Integral Image*
essentially does all of the calculation once for the whole image.
This technique is well described in the paper
[Efficient implementation of local adaptive thresholding techniques using integral images](http://citeseerx.ist.psu.edu/viewdoc/download;?doi=10.1.1.109.2748&rep=rep1&type=pdf)
by Shafait, Keysers and Breuel.
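
As a small sketch of the idea (simplified compared to what our
packages actually do), building the table takes a single pass over
the image, after which the sum of any window comes from at most four
lookups:

```go
package binarise

import "image"

// integralImage builds a summed area table for a greyscale image:
// entry [y][x] holds the sum of every pixel at or above and to the
// left of (x, y).
func integralImage(img *image.Gray) [][]int64 {
	b := img.Bounds()
	w, h := b.Dx(), b.Dy()
	integral := make([][]int64, h)
	for y := 0; y < h; y++ {
		integral[y] = make([]int64, w)
		var rowSum int64
		for x := 0; x < w; x++ {
			rowSum += int64(img.GrayAt(b.Min.X+x, b.Min.Y+y).Y)
			integral[y][x] = rowSum
			if y > 0 {
				integral[y][x] += integral[y-1][x]
			}
		}
	}
	return integral
}

// windowSum returns the sum of the pixels in the rectangle from
// (x0, y0) to (x1, y1) inclusive, using at most four table lookups
// however large the window is.
func windowSum(integral [][]int64, x0, y0, x1, y1 int) int64 {
	sum := integral[y1][x1]
	if y0 > 0 {
		sum -= integral[y0-1][x1]
	}
	if x0 > 0 {
		sum -= integral[y1][x0-1]
	}
	if x0 > 0 && y0 > 0 {
		sum += integral[y0-1][x0-1]
	}
	return sum
}
```

The mean of a window is then just its sum divided by the number of
pixels in it, and a second integral image built from the squared
pixel values gives the mean of the squares, from which the standard
deviation follows.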

## Our Tools

There are surprisingly few free software tools to do binarisation
with the Sauvola algorithm, and none that I could find that can
use *Integral Images* to speed the processing up. When you're just
binarising a few images, maybe that isn't so important, but when
you're operating on a scale of many thousands of images, potentially
processing each one multiple times with different threshold values,
cutting the processing time from 60 seconds to 0.1 seconds makes a
very big difference.

So, we wrote a Go package which does Sauvola binarisation, which
provides standalone command line tools and can also be used as a
library in your own Go projects. The same package contains some
other image preprocessing functionality, so the package is called
[rescribe.xyz/preproc](https://rescribe.xyz/preproc) ([docs](https://pkg.go.dev/rescribe.xyz/preproc)).
The command line binarisation tool in the package is *binarize*, and
the relevant Go functions are `Sauvola()` and `IntegralSauvola()`.
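
As an illustration, using it from Go looks something like the sketch
below (the parameter order and exact signatures here may differ from
the real ones, so check the package docs linked above for details):

```go
package main

import (
	"image"
	"image/png"
	"log"
	"os"

	"rescribe.xyz/preproc"
)

func main() {
	f, err := os.Open("page.png")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	img, err := png.Decode(f)
	if err != nil {
		log.Fatal(err)
	}

	// The binarisation works on greyscale images, so convert first.
	b := img.Bounds()
	gray := image.NewGray(b)
	for y := b.Min.Y; y < b.Max.Y; y++ {
		for x := b.Min.X; x < b.Max.X; x++ {
			gray.Set(x, y, img.At(x, y))
		}
	}

	// Binarise with a k of 0.3 and a window size of 41 pixels; the
	// argument order here is illustrative rather than definitive.
	binarised := preproc.IntegralSauvola(gray, 0.3, 41)

	out, err := os.Create("page-binarised.png")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()
	if err := png.Encode(out, binarised); err != nil {
		log.Fatal(err)
	}
}
```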

The Integral Image support is provided by another package we wrote,
[rescribe.xyz/integral](https://rescribe.xyz/integral) ([docs](https://pkg.go.dev/rescribe.xyz/integral)),
which will also be useful for other image processing tools that
require calculating the mean or standard deviation for different
areas of an image. It includes tests and examples, so it should be
easy to pick up and use in your own projects.