A simple, effective post-processing OCR improvement.

The Nokia library organization recently scanned a large batch of corporate publications dating back to 1912. Since a primary use of this resource is comprehensive searching for people or events of historical interest, accurate search and retrieval based on optical character recognition (OCR) is a significant concern. If words are incorrectly interpreted, the search engine cannot match what the user is searching for with the relevant terms in the text.

Looking at the initial OCR output, it immediately became evident that the OCR was frequently failing to distinguish spacing between words from spacing that merely separated letters within a word. For example, it was very common to see OCR output of "T hey"--that is, capital "T" followed by a space, followed by the letters (word?) "hey." This problem occurred with OCR from both Adobe Professional 7 and the Tesseract open source OCR engine. After comparing results from these two OCR packages, we used the Adobe OCR-produced text for the rest of this effort.

Before restricting this post-processing to just incorrectly interpreted spacing, we first investigated more comprehensive post-processing of the OCR. The literature that pertained to OCR post-processing improvement was primarily from the early to mid-1990s. The papers, algorithms, and open source code that were more current didn't fill our needs, went beyond our requirements, or required significantly more training or development. So while it would be desirable to fix an OCR error in which a character within a word such as "develop" is misrecognized, the solution presented here does not find and correct this type of problem.

It does, however, provide a simple, low-cost approach to post-processing the OCR that yields significant improvement with minimal overhead. The improvement is readily apparent both in searching, which was the primary requirement for this exercise, and in user interfaces presenting the text.

How This Post-Processing Approach Works

The basic approach presented in this paper is quite simple. The algorithm works from the start to the end of each scanned page. For each sequence of letters, the algorithm tries combining it with the sequences of letters that follow, testing whether they combine to make a word. It tries the longest combination first, then works down to combining just two sequences. So the OCR output example of "man u facture" is put together into one word: "manufacture." As a result of working from longest to shortest, a sequence such as "im port ant" will become "important," not the two shorter words "import" and "ant."
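The longest-first merging described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's Perl code: the function name merge_split_words, the max_parts limit of four pieces, and the case-insensitive dictionary lookup are all hypothetical choices that the article does not specify.

```python
def merge_split_words(text, words, max_parts=4):
    """Greedy longest-first merge of space-separated letter sequences.

    `words` is a set of lowercase dictionary words. `max_parts` (an
    assumed limit, not from the article) caps how many pieces are tried.
    """
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        merged = None
        # Try the longest combination first, then shrink down to two pieces.
        for n in range(min(max_parts, len(tokens) - i), 1, -1):
            parts = tokens[i:i + n]
            candidate = "".join(parts)
            if all(p.isalpha() for p in parts) and candidate.lower() in words:
                merged = candidate
                i += n
                break
        if merged is None:
            # Nothing combined into a known word: keep the piece as it is.
            merged = tokens[i]
            i += 1
        out.append(merged)
    return " ".join(out)


if __name__ == "__main__":
    vocabulary = {"they", "hey", "manufacture", "important", "import", "ant"}
    print(merge_split_words("T hey man u facture im port ant", vocabulary))
```

Because three-piece combinations are tested before two-piece ones, "im port ant" is resolved to "important" before "import" is ever considered on its own.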

The algorithm skips sequences of non-alphanumeric characters between sequences of characters. So a sequence such as "im.~ port" will combine into "import." But a character sequence such as "im -0. Port" will be left as is. Stronger approaches that attempt to correctly resolve these potentially inaccurate scans have been discussed and tried. But for this endeavor, the simplest approach was used.
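One way to picture the separator rule is the following Python sketch. It assumes, as the two examples suggest, that letter sequences may be joined across a run of non-alphanumeric characters (spaces, punctuation) but not across a run that contains a digit. The function name merge_across_noise, the token pattern, and the max_parts limit are hypothetical and are not taken from the Perl module.

```python
import re

# Split text into runs of letters, runs of digits, and runs of anything else.
TOKEN = re.compile(r"(?P<word>[A-Za-z]+)|(?P<digits>[0-9]+)|(?P<sep>[^A-Za-z0-9]+)")


def merge_across_noise(text, words, max_parts=4):
    """Merge letter runs separated only by non-alphanumeric characters."""
    tokens = [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]
    out = []
    i = 0
    while i < len(tokens):
        kind, value = tokens[i]
        if kind != "word":
            out.append(value)
            i += 1
            continue
        # Indices of letter runs reachable by stepping over separator runs only.
        run = [i]
        j = i + 1
        while (len(run) < max_parts and j + 1 < len(tokens)
               and tokens[j][0] == "sep" and tokens[j + 1][0] == "word"):
            run.append(j + 1)
            j += 2
        # Longest combination first, then shorter ones, down to two pieces.
        for n in range(len(run), 1, -1):
            candidate = "".join(tokens[k][1] for k in run[:n])
            if candidate.lower() in words:
                out.append(candidate)   # skipped separators are dropped
                i = run[n - 1] + 1
                break
        else:
            out.append(value)           # no merge found: keep the run as is
            i += 1
    return "".join(out)


if __name__ == "__main__":
    vocabulary = {"import"}
    print(merge_across_noise("im.~ port", vocabulary))
    print(merge_across_noise("im -0. Port", vocabulary))
```

In the second call the digit "0" sits between the two letter sequences, so no combination is attempted and the text is returned unchanged.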

That is essentially the complete algorithm. It was implemented in a short Perl package available via GitHub, as described below.

Dictionaries/Word Lists Used

One significant issue is which sequences are accepted as English words. This is especially important since we want proper names to be correctly recognized as words, which makes it possible for users to find people and places accurately. Three corpuses of "words" were used for each processed publication: a large generic list of English words, a small specialized list, and a publication-specific set of words.

1. The Open Book Project makes available a spelling list, which we used. Since the goal was a simple approach that, in particular, does not deal with the difficulties in English of parts of speech and pluralization, this set filled the need. The list of 53,000-plus words is available at openbookproject.net/py4fun/spellcheck/spell.words.

2. A small, known list of proper nouns of particular interest to Bell Labs or telecommunications was added to be sure they were identified as words (e.g., proper nouns such as "Shannon" and "Penzias"). We also added words that we discovered were missing from the general list above--this included terminology and words of specific interest in the set of scanned materials.

3. For each of the scanned publications we had available, we manually entered the table of contents. These were used with the hope of adding corpus-specific terminology that was relevant to the individual publications and articles.
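As a rough illustration, the three sources can be combined into a single lookup set. The sketch below is an assumption about how such a merge might be done, not a description of the Perl package: the helper names, the lowercasing, and the choice to split table-of-contents text into letter sequences are all hypothetical.

```python
import re


def load_words(lines):
    """Turn an iterable of one-word-per-line entries into a lowercase set."""
    return {line.strip().lower() for line in lines if line.strip()}


def words_from_text(text):
    """Extract letter sequences from free text such as a table of contents."""
    return {w.lower() for w in re.findall(r"[A-Za-z]+", text)}


def build_dictionary(general_lines, special_lines, toc_text):
    """Union of the generic list, the specialized list, and per-publication terms."""
    return (load_words(general_lines)
            | load_words(special_lines)
            | words_from_text(toc_text))


if __name__ == "__main__":
    general = ["import\n", "ant\n", "important\n"]       # stand-in for the spelling list
    special = ["Shannon", "Penzias"]                       # proper nouns of interest
    toc = "Transistor Developments at Murray Hill"         # made-up contents entry
    vocabulary = build_dictionary(general, special, toc)
    print(sorted(vocabulary))
```

Storing everything in one set keeps each candidate lookup cheap, which matters because the merging step tests many combinations per page.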

Results

This fairly simple approach to improving the OCR output was surprisingly effective. The basic data is in the table below. Since the resulting text is primarily used for searching, this was a noticeable improvement. A secondary benefit was in the presentation of the search results. When a user retrieves a page (or document), he or she is presented with an image of the page--not the OCR. However, interfaces and applications presenting retrieval hits in context make a better impression when the text is a close approximation to the scanned image.

Some examples should make the effectiveness of this approach clear. While the quality of the scanned source material--and thus the OCR-produced text--varied, the results of this effort were still similar. Presented above are two representative samples; Sample 1 is from the June 1966 Bell Laboratories News, while Sample 2 is from the May 1927 Bell Laboratories Record.

Conclusion

This approach is admittedly far from complete or comprehensive. Even in the aforementioned examples, it is easy to see how more comprehensive methods would improve the OCR further. However, this approach allowed us--with around 100 lines of Perl code and not much processing--to improve the words identified from the OCR by as much as 18%. While additional improvement would certainly be desirable, the approach significantly improved the usability of our scanned resources.

The actual software used will be made available via GitHub (github.com/rwaldstein/post-processOCR). It consists of one small Perl module, postOCR.pm, with a subroutine, cleanstr(), which takes an OCR output string and returns the improved version.

Robert Waldstein (Robert.Waldstein@nokia.com) is a distinguished member of the technical staff at Nokia Bell Labs in Murray Hill, N.J. His responsibilities include systems design, programming, and strategic planning for the corporation's integrated information solutions unit, global library network, and InfoView access platform.

Optical Character Recognition Results of String Merging

                                    Years        Number      Number of           Unmatched  Words Created by      Percent
Name of Publication                 Scanned    of Pages  Matched Words  Strings of Letters   Merging Strings  Improvement
Bell Labs News                      1961-1996     6,460      5,394,098             940,557           415,960         7.7%
  (after 1996 captured electronically)
Bell Telephone Quarterly            1922-1976    11,036      2,357,702             934,042           161,325         6.8%
Bell Laboratories Record            1925-1986    29,268      7,837,299             866,516         1,269,994        16.2%
Bell Laboratories Reporter          1952-1969     4,924        866,513             375,618           158,920        18.3%
The Western Electric Engineer       1957-1983       804        231,315              69,320            33,033        14.3%
Western Electric Items of Interest  1950-1975     3,687        398,160              38,428            47,522        11.9%
Western Electric Magazine           1948-1983     9,795      1,940,724             408,824           309,191        15.9%
Western Electric News               1912-1932    10,055      3,975,902             532,583           202,790         5.1%
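
The article does not state how the Percent Improvement figures were calculated. The reported values are, however, consistent with dividing the words created by merging by the number of matched words, and the short Python check below reproduces them under that assumption. The formula and the function name percent_improvement are inferences from the table, not something the author specifies.

```python
def percent_improvement(matched_words, merged_words):
    """Assumed formula: merged words as a percentage of matched words."""
    return round(100.0 * merged_words / matched_words, 1)


# (matched words, words created by merging) copied from the table.
TABLE = {
    "Bell Labs News": (5394098, 415960),
    "Bell Telephone Quarterly": (2357702, 161325),
    "Bell Laboratories Record": (7837299, 1269994),
    "Bell Laboratories Reporter": (866513, 158920),
    "The Western Electric Engineer": (231315, 33033),
    "Western Electric Items of Interest": (398160, 47522),
    "Western Electric Magazine": (1940724, 309191),
    "Western Electric News": (3975902, 202790),
}

if __name__ == "__main__":
    for name, (matched, merged) in TABLE.items():
        print(f"{name}: {percent_improvement(matched, merged)}%")
```

Under this reading, the 18.3% reported for Bell Laboratories Reporter is the largest gain in the table.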
COPYRIGHT 2016 Information Today, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2016 Gale, Cengage Learning. All rights reserved.

Author:Waldstein, Robert
Publication:Computers in Libraries
Date:Dec 1, 2016