A simple, effective post-processing OCR improvement.
Looking at the initial OCR output, it is immediately evident that a significant error was being made in discriminating whether a space separated two words or merely split the letters within a single word. For example, it was very common to see OCR output of "T hey"--that is, a capital "T" followed by a space, followed by the letters (word?) "hey." This problem occurred both with OCR from Adobe Professional 7 and with the Tesseract open source OCR engine. After comparing results from these two OCR packages, the rest of this effort used the Adobe OCR-produced text.
Before restricting this post-processing to just incorrectly interpreted spacing, we first investigated more comprehensive post-processing of the OCR. The literature on OCR post-processing improvement dates primarily from the early to mid-1990s. The papers, algorithms, and open source code that were more current didn't fill our needs, went beyond our requirements, or required significantly more training or development. So while it would also be desirable to fix a word containing a misrecognized character, the solution presented here does not find and correct that type of problem.
It does, however, provide a simple, low-cost approach to post-processing the OCR that makes a significant improvement with minimal overhead. Both for searching, which was the primary requirement for this exercise, and in user interfaces presenting the text, the improvement is readily apparent.
How This Post-Processing Approach Works
The basic approach presented in this paper is quite simple. The algorithm works from the start to the end of each scanned page. For each sequence of letters, it tries combining that sequence with the following sequences of letters, testing whether they join to form a word. It does this by first trying the longest run of sequences, then working down to combining just two sequences. So the OCR output example of "man u facture" is put together into one word: "manufacture." And because the algorithm works from longest to shortest, a sequence such as "im port ant" will become "important," not the two shorter words "import" and "ant."
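The longest-first merging loop can be sketched as follows. This is an illustrative Python re-implementation, not the article's code (the actual implementation is a short Perl module), and the tiny word set stands in for the full dictionaries described later:

```python
# Toy word set; the real runs used the dictionaries described below.
WORDS = {"manufacture", "important", "import", "ant", "man"}

def merge_tokens(tokens, words):
    """Greedily join runs of letter sequences, longest run first."""
    out = []
    i = 0
    while i < len(tokens):
        merged = None
        # Try the longest possible run first, down to a run of two tokens,
        # so "im port ant" becomes "important", not "import" + "ant".
        for j in range(len(tokens), i + 1, -1):
            candidate = "".join(tokens[i:j])
            if candidate.lower() in words:
                merged = candidate
                i = j
                break
        if merged is None:
            out.append(tokens[i])   # no combination formed a word
            i += 1
        else:
            out.append(merged)
    return out

print(merge_tokens("man u facture".split(), WORDS))  # ['manufacture']
print(merge_tokens("im port ant".split(), WORDS))    # ['important']
```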
The algorithm skips sequences of non-alphanumeric characters between sequences of characters. So a sequence such as "im.~ port" will combine into "import." But a character sequence such as "im -0. Port" will be left as is. Stronger approaches that attempt to correctly resolve these potentially inaccurate scans have been discussed and tried. But for this endeavor, the simplest approach was used.
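The separator rule can be sketched in the same hedged way (again an illustrative Python fragment, not the author's Perl, and showing only the simplest two-token case): a run of punctuation between two letter runs may be skipped only when it contains no letters or digits at all.

```python
import re

WORDS = {"import"}  # toy dictionary for the sketch

def tokenize(text):
    # Alternating runs of word characters and of punctuation; whitespace
    # is dropped.  "im.~ port" -> ['im', '.~', 'port']
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]+", text)

def merge_pair(tokens, words):
    """Join two letter runs separated by a purely punctuation run when
    the result is a known word; otherwise leave the tokens untouched."""
    out = []
    i = 0
    while i < len(tokens):
        if (i + 2 < len(tokens)
                and tokens[i].isalnum()
                and not any(c.isalnum() for c in tokens[i + 1])  # pure punctuation
                and tokens[i + 2].isalnum()
                and (tokens[i] + tokens[i + 2]).lower() in words):
            out.append(tokens[i] + tokens[i + 2])
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_pair(tokenize("im.~ port"), WORDS))    # ['import']
print(merge_pair(tokenize("im -0. Port"), WORDS))  # unchanged: the "0" blocks merging
```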
Fundamentally, that is the complete algorithm. It was implemented in a short Perl package available via GitHub as described below.
Dictionaries/Word Lists Used
One significant issue is which character sequences are accepted as English words. This is especially important since we want proper names to be correctly recognized as words, which makes it possible for users to find people and places accurately. Three corpora of "words" were used for each processed publication: a large generic list of English words, a small specialized list, and a publication-specific set of words.
1. The Open Book Project makes available a spelling list that was used. Since the desire was to keep the approach simple and, in particular, not deal with the difficulties of English parts of speech and pluralization, this set filled the need. The list of 53,000-plus words is available at openbookproject.net/py4fun/spellcheck/spell.words.
2. A small, known list of proper nouns of particular interest to Bell Labs or telecommunications was added to be sure they were identified as words (e.g., "Shannon" and "Penzias"). We also added words missing from the generic list as they were discovered--this included terminology and words of specific interest in the set of scanned materials.
3. For each of the scanned publications we had available, we manually entered the table of contents. These were used with the hope of adding corpus-specific terminology that was relevant to the individual publications and articles.
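The three sources can be combined into a single lookup set along these lines. This is a minimal Python sketch under stated assumptions--the function name, argument shapes, and normalization are illustrative, not the article's code:

```python
import re

def build_dictionary(generic_words, proper_nouns, toc_text):
    """Union of the three word sources, normalized to lowercase.
    Structure and names are illustrative, not the article's Perl code."""
    words = {w.lower() for w in generic_words}   # 1. large generic English list
    words |= {w.lower() for w in proper_nouns}   # 2. Bell Labs / telecom proper nouns
    # 3. every alphabetic token from the manually entered tables of contents
    words |= {w.lower() for w in re.findall(r"[A-Za-z]+", toc_text)}
    return words

d = build_dictionary(["the", "of"], ["Shannon", "Penzias"],
                     "The Traveling-Wave Tube")
print("shannon" in d and "traveling" in d)  # True
```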
This fairly simple approach to improving the OCR output was surprisingly effective. The basic data is in the accompanying table. And since the resulting text is primarily used for searching, this was a noticeable improvement. A secondary benefit was in the presentation of the search results. When a user retrieves a page (or document), he or she is presented with an image of the page--not the OCR. However, interfaces and applications presenting retrieval hits in context make a better impression when the text is a close approximation of the scanned image.
Some examples should make the effectiveness of this approach clear. While the quality of the scanned source material--and thus the OCR-produced text--varied, the results of this effort were still similar. Presented above are two representative samples; Sample 1 is from the June 1966 Bell Laboratories News, while Sample 2 is from the May 1927 Bell Laboratories Record.
This approach is admittedly far from complete or comprehensive. Even in the aforementioned examples, it is easy to see how more comprehensive methods would improve the OCR further. However, this approach allowed us--with around 100 lines of Perl code and not much processing--to improve the words identified from the OCR by as much as 18%. While additional improvement would certainly be desirable, the approach significantly improved the usability of our scanned resources.
The actual software used will be made available via GitHub (github.com/rwaldstein/post-processOCR). It consists of one small Perl module, called postOCR.pm, with a subroutine cleanstr()--which takes an OCR output string and returns the improved version.
Robert Waldstein (Robert.Waldstein@nokia.com) is a distinguished member of the technical staff at Nokia Bell Labs in Murray Hill, N.J. His responsibilities include systems design, programming, and strategic planning for the corporation's integrated information solutions unit, global library network, and InfoView access platform.
Optical Character Recognition Results of String Merging

| Name of Publication | Years Scanned | Number of Pages | Number of Matched Words | Unmatched Strings of Letters | Words Created by Merging Strings | Percent Improvement |
| --- | --- | --- | --- | --- | --- | --- |
| Bell Labs News (after 1996 captured electronically) | 1961-1996 | 6,460 | 5,394,098 | 940,557 | 415,960 | 7.7% |
| Bell Telephone Quarterly | 1922-1976 | 11,036 | 2,357,702 | 934,042 | 161,325 | 6.8% |
| Bell Laboratories Record | 1925-1986 | 29,268 | 7,837,299 | 866,516 | 1,269,994 | 16.2% |
| Bell Laboratories Reporter | 1952-1969 | 4,924 | 866,513 | 375,618 | 158,920 | 18.3% |
| The Western Electric Engineer | 1957-1983 | 804 | 231,315 | 69,320 | 33,033 | 14.3% |
| Western Electric Items of Interest | 1950-1975 | 3,687 | 398,160 | 38,428 | 47,522 | 11.9% |
| Western Electric Magazine | 1948-1983 | 9,795 | 1,940,724 | 408,824 | 309,191 | 15.9% |
| Western Electric News | 1912-1932 | 10,055 | 3,975,902 | 532,583 | 202,790 | 5.1% |
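As a quick arithmetic check on the table (an inference from the published numbers, not a formula stated in the article), the "Percent Improvement" column is consistent with dividing the words created by merging by the matched-word count:

```python
# Rows taken from the table: (matched words, words created by merging,
# reported percent improvement).  The formula is inferred from the
# published numbers, not stated in the article.
rows = {
    "Bell Labs News":             (5_394_098, 415_960, 7.7),
    "Bell Laboratories Reporter":   (866_513, 158_920, 18.3),
    "Western Electric News":      (3_975_902, 202_790, 5.1),
}
for name, (matched, created, reported) in rows.items():
    improvement = round(100 * created / matched, 1)
    assert improvement == reported  # matches the reported column
    print(name, improvement)
```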
Title Annotation: optical character recognition
Publication: Computers in Libraries
Date: Dec 1, 2016