Author: David Hood
Date Last Modified: 25th August 2003

Much of the large-scale scanning done by the project involves the scanning and OCR of public historical sources such as trades directories, telephone directories, and electoral rolls. These are normally scans of photocopied reproductions of the original source, as this is the only way we are able to get access to the material. In many cases the material contains extraneous information, such as sections of the facing page.

As well as the photocopy including extra information, the location of the original page on the photocopy may vary depending on where the original was placed in relation to the copier.

These problems have made scanning a very manual, labour-intensive process. The techniques described below offer automatic solutions to these problems, requiring less manual labour.

The overall solution is to locate the original page on the scanned photocopy and to trim the excess, unwanted parts of the image prior to optical character recognition. For the specific implementation below to succeed, the following is assumed:

- The photocopies of the original pages were made at the same size. As such, the desired original page will occupy the same area on every photocopy, though it may be in a different location.
- The photocopies were scanned taking the same area of the photocopied page at a constant resolution. This ensures that the original images occupy the same area in the scans.
- The images are scanned in black and white. This makes automatic identification of the printed area unambiguous. If non-text images were being analysed, this would not be a requirement.
- Metrics of the text width can be obtained for one image. As the originals are all the same size and scanned at the same resolution, the settings for one source apply to all others.
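
Under these assumptions, one way to locate and trim the page can be sketched as follows. This is a minimal illustration, not the project's actual implementation. It assumes the scan is already available as a grid of 0 (white) and 1 (black) pixels, and that the page width and height in pixels have been measured once from a sample image. The function names (`column_profile`, `row_profile`, `best_window`, `locate_and_crop`) are hypothetical. The heuristic used here is to slide a window of the known size along the image and keep the position that contains the most black pixels, since a window that drifts onto the gutter or a fragment of the facing page loses part of the text block.

```python
from typing import List, Tuple

# Assumed representation: a list of rows, 1 = black pixel, 0 = white pixel.
Image = List[List[int]]


def column_profile(image: Image) -> List[int]:
    """Count the black pixels in each column."""
    return [sum(column) for column in zip(*image)]


def row_profile(image: Image) -> List[int]:
    """Count the black pixels in each row."""
    return [sum(row) for row in image]


def best_window(profile: List[int], size: int) -> int:
    """Return the start of the fixed-size window holding the most black pixels.

    If several positions tie, the first one found is returned.
    """
    if size <= 0 or size > len(profile):
        raise ValueError("window size must fit inside the image")
    total = sum(profile[:size])
    best_total, best_start = total, 0
    for start in range(1, len(profile) - size + 1):
        # Slide one step: add the entering value, drop the leaving one.
        total += profile[start + size - 1] - profile[start - 1]
        if total > best_total:
            best_total, best_start = total, start
    return best_start


def locate_and_crop(
    image: Image, page_width: int, page_height: int
) -> Tuple[Tuple[int, int], Image]:
    """Find the (left, top) corner of the page and return it with the crop."""
    left = best_window(column_profile(image), page_width)
    top = best_window(row_profile(image), page_height)
    cropped = [row[left:left + page_width]
               for row in image[top:top + page_height]]
    return (left, top), cropped


if __name__ == "__main__":
    # Toy scan: a 4x3 text block at column 5, row 2, plus a one-pixel-wide
    # strip of the facing page in column 0, separated by a white gutter.
    scan = [[0] * 10 for _ in range(6)]
    for y in range(2, 5):
        scan[y][0] = 1
        for x in range(5, 9):
            scan[y][x] = 1
    origin, page = locate_and_crop(scan, page_width=4, page_height=3)
    print(origin)  # (5, 2)
```

Because the page size is fixed across all scans, only the window position needs to be found for each image. The width and height are measured once and reused, which is what the final assumption above allows.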