# A very fast substring search algorithm.

A VERY FAST SUBSTRING SEARCH ALGORITHMA fundamental technique used in computer science is to search for a specific substring in a larger body of text. Algorithms that do this rank with sorting algorithms as cornerstones of software methodology. Substring search algorithms can be used to find reference keywords in documents and all usages of some variable in source code, to monitor input text streams for certain event names or prompt words, or to locate items in a list stored in a computer as flat text. The string search technique is so fundamental that most large computer programs use it in one form or another. Often the execution of this technique in code accounts for a substantial percentage of the work a program does, and increases in the efficiency of these search routines can significatly speed up a computer program.

The substring search problem is to find all occurrences of a given pattern string p as a substring of a larger string of text t. Several important algorithms have been discovered that are more efficient than the straight-forward (SF) approach. Two of the most notable algorithms, published over a decade ago, are the Knuth-Morris-Pratt (KMP) [3] and the Boyer-Moore (BM) [1] algorithms. Both the KMP and BM algorithms have worst-case linear-search behavior, improving the quadratic SF algorithm. In practice, however, on commonplace English text, the BM algorithm is several (usually three or more) times faster than the other two that are about the same [4]. In this article, and improvement to the BM algorithm is presented that results in an even faster substring search algorithm. This new algorithm does not require that the pattern string be scanned in any particular order. Three variations of the algorithm are given that use three different pattern string scan orders. These include: (1) a "Quick Search" algorithm; (2) a "Maximal Shift" algorithm; and (3) an "Optimal Mismatch" algorithm. All three are very fast substring search algorithms.

Existing Algorithms

Let p[i] be the i-th character in the pattern string p=p[0]...p[m-1] of length m, and let r[j] be the j-th character in the text string t=t[0]... t[n-1] of length n>m. We will assume that the pattern string p is located at position k in the text string t in testing for a substring match. That is, p[0] is aligned with t[k], p[1] is aligned with t[k+1], and p[i] is aligned with t[k+1] up to i=m-1.

The SF algorithm is the obvious one that most programmers would use to code a substring search. The pattern string p is aligned with the extreme left of the text, at position k=0, and then the pattern characters are scanned from left to right, p[0] p[1]...p[m-1], testing for matches against the corresponding text characters. If all match, then a substring has been found. If any mismatch is found, the pattern string is shifted to the right one step, incrementing k by 1, and the pattern string is rescanned left to right starting again from its leftmost position at p[0]. This SF algorithm is easy to code. The main drawback of using the SF algorithm is that it is a quadratic algorithm with worst-case O(mm) search time. In practice, however, for each pattern string position, a mismatch is usually detected with the first character tested, and the expected running time is O(n).

The KMP algorithm [3] improves the SF algorithm with a worst-case O(m+n) search time. The fundamental idea behind the KMP algorithm is to use already-known matches to permit shifting the pattern string forward by a delta, [delta], of more than one character when a mismatch is found. The KMP method starts the same as the SF and scans the pattern string n the same left to right direction from p[0] through p[m-1]. When a character mismatch is found, however, between p[i] and t[k+i], for example, the KMP algorithm shifts the pattern string right in order to align already-matched and scanned text with the nearest matching prefix of the pattern. Additionally, a different pattern character is brought to the mismatch position, since we already know that the current one is a mismatch. These prefix shifts for each mismatch position in the pattern string can be determined from the pattern string alone, and an initial O(m) time is needed to precompute them. After this shift of [delta]>=1, testing characters of the text string then resumes at the point where the last mismatch was found, namely, at t[k+u] in the text string and at p[i[delata]]<0, p is when (i - [delta])[is greater than]=0. If (i - [delta])[is less than]0, p is shifted to position (k+i+1) in t. Thus, there is no backtracking in t, resulting in a worst-case O(n) search time. Nevertheless, in practice, the KMP and SF algorithms perform about the same [4] because the expected search time statistics are dominated by the event of a mismatch for the first character tested, p[0].

The BM algorithm [1] changes the direction of scanning the pattern string by testing the last character of it first and then proceeding right to left through the pattern string, p[m-1] p[m-2]...p[0], in testing for matches with the text. Similar to KMP, when a mismatch is found, at p[m-i], for instance, the information gained from known matches is used to shift the pattern right as much as possible. The shifts are generally larger than those for the KMP algorithm. For the mismatching character in the text string, t[k+m-1], the BM algorithm uses a precomputed table to find the index of its first leftward occurrence from the end of the pattern string. If this occures to the left of the position of the mismatch, then the difference is defined to be [delta]1 for that position in the pattern string. Since most often, the last character of p, p[m-1], gives a mismatch, [delta]1 is generally positive. for reasonably short patterns, the expected value of [delta]1 is almost (m-1). Using the BM [delta]1, for the patten shift after a mismatch yields a substring search algorithm that is usually better than three times as fast as those of the SF of KMP in practice [4]. This improved algorithm, however, has a worst-case O(mn) search time. By further incorporating the KMP idea to compute a [delta]2 and using the maximum of [delta]1 and [delta]2 for the actual [delta] shift used, one gets an O(m+n) algorithm. This second shift value is computed by taking the already-matched suffix of p to the right of the first mismatched character and by finding the next leftward occurrence of it in p. Additionally, the character at themismatch position must be different from the current one. Like [delta]1, [delta]2 is precomputed as a function of the position in p where a mismatch first occurs, and both [delta]1 and [delta]2 can be precomputed in O(m) time.

An improved Algorithm

One can recode the intent of the [[delta].sub.1] of the BM technique. First, note that the pattern string always shifts right by at least one character. Hence, the character in the text string immediately past the end of the pattern string, namely t[k+m], must be involved in testing for a substring match at the next position of the pattern string. Thus, a new [Delta]1 can be computed to be the index of the first leftward occurence of this charactrer from the end of the text string. As with the BM [delta]1, this index can be precomputed as a function of the text string alphabet, for instance, in a table TD1[c], whose value for any character c of the alphabet is its leftward index from the end of p (so that the last character of p has index 1). then,

[Delta]1=TD1[t[k+m]].

Whenever a mismatch is found, this value of [Delta]1 is the amount of shift p to the right. This either aligns that character in p with the text character t[k+m], or, when that character does not occur in p, shifts p right past it to texto position (k+m+1). Note that this [Delta]1 is computed as an absolute pattern shift and is not defined relative to the position in p of the last mismatch. Using this [Delta]1 instead of the BM [delta]1 has the following advantages:

(1) [Delta]1 [is greater than] = 1 always, and so it can be used by itself to simply and quickly code a fast, practical algorithm. The BM [delta]1, however, is sometimes [is less than] = 0, in which case either a shift of just 1 or [delta]2 is used.

(2) In practice, one expects that [Delta]1[is greater than]=[delta]1+1. Also, whenever the last character of the pattenr string matches the text character, then one expects that [Delta]1 [is greater than] = [delta]1 + 2, and so on. Thus using [Delta]1 results in a faster algorithm than that of BM. this is mostly true for short pattern strings, and the effect of this increase in speed decreases as the pattern string gets longer.

(3) [Delta]1 does NOT depend on the order in which the pattern string p is scanned. This is because it is defined relative to a text string character that lies outside the current comparison range of the pattern string. The BM [delta]1, however, depends strongly ont eh right to left pattern string scan order for its definition and efficient usage.

This last point is important, for it means that the pattern string p may be scanned in any order at all. One could scan it forward, backward, or use any other ordering of the subindices of the pattern string. Let an index ordering be represented as an integer array I[]={I[0],, ..., I[m-1]} that is a permutation of {0, ..., m-1}. Then, I[j] is the location in the pattern string of the j-th scan element, and p[I[j]] is the character of the pattern string at that location, for each j=0...(m-1).

For any specific order of scanning a pattern string, once can define a [Delta]2 shift that is similar to the KMP [delta] or the BM [delta]2. Then, for the substring search algorithm, oen uses a [Delata] pattern shift that is the maximum of [Delta]1 and [Delta]2. Like the BM [delata]2, this new [Delta]2 can be precomputed as a function of the position in p where a mismatch first occurs. Let TD2[j] be the precomputed [Delta]2 for when a mismatch first occurs at I[j] for the scan ordering I[ ]. To precompute the table TD2[ ], first consider p[I[0]]. If a mismatch occurs testing this character, we know that the next character of p that becomes aligned with the corresponding text location must differ from p[I[0]]; otherwise there would be another mismatch. Find the maximum i[less than][0] such that p[i] does not equal p[I[0]]. Then, TD2[0] = (I[0]-i) is the minimum shift where this holds. Next, TD2_1] is the minimum amount one must shift left so that p[I[0]] matches its corresponding character, but p[I[1]] does not.

Continue defining TD2[j] to be the minimum left shift so that p[I[0]] ... P[I[j-1]] match their aligned characters in p, but such that p[I[j]] does not. The [Delta]2 shift table for a specific ordered pattern can be precomputed with Algorithm (1) (Note: the algorithms described in this article will be given in Pascal for clarity of presentation. A complete implementation in the C language is also given in the Appendix.) where the matchshift (j, 1shift) function returns the alue of the next leftward shift, after an initial left shift 1shift, for which each of the first-j-ordered pattern characters match their correspoinding aligned string character. That is, this is the minimum value of mshift[is less than] = 1shift[is less than] = 0, such that either (I[i] - mshift)[is less than], 0, or p[I[i]] = p[I[i] - mshift], for each i=0...(j-1).

Note that if the pattern string is scanned in the forward direction, with I[ ] = {0, 1, ..., m-1{, then the [Dealta] we have computed is the KMP [delta]. Also, if the pattern string is scanned in the reverse direction, with I[ ] = {m - 1, m - 2, ..., 0{, our [Delata]2 is the same as the BM [delta]2.

Given a specific ordered pattern and precomputed shift tables TD1[ ] and TD2[ ] for it, the new substring search algorithm (Algorithm (2))is easy to code.

Similar to the KMP and BM algorithms, Algorithm (2) should have linear O(n) worst-case behavior for any scan order. It would seem that the KMP algorithm uses a best possible scan order for remembering already scanned text in order to avoid backtracking and should produce the best worst-case behavior. On the other hand, the BM scan order produces the worst possible performance, since it does not remember any scanned text. A proof of the O(n) linearity of the search algorithm for any fixed scan order of the pattern string might include the KMO and BM algorithms as special cases and reveal a relation between the bounds on their worst-case behavior. The proof of the linearity of this new search should be similar to the proof for the BM alrorithm [2, 3]. The details of a complete proof, however, have not yet been worked out, so we simply conjecture that linearity holds.

* Conjecture: The abitary scan order substring search algorithms has O(n) worst-case behavior.

Having a [Delta]1 and a [Delta]that can be used with any substring scan order creates new alg orithmic possibilities. Three different variations based on three different scan orderings are given here.

The Quick Search (QS)

Algorithm.

To quickly code a fast substring search alogorithm, the easy-to-code SF pattern string scan order can be used with the easy-to-compute [Delta]1 for the pattern string shift at each stage. No [Delta]2 is used. This is a simple, fast practical algorithm. Because it can be both coded and debugged quickly and it executes quickly, it can be called the Quick Search algorithm. This simplified search algorithm is shown in Algorithm (3).

If one augmented this straightforward search with the [Delta]2 shift and added code to stop backtracking in the text, one would get a fast algorithm with the KMP algorithm tight bound on worst-case behavior.

The Maximal Shift (MS)

Algorithm.

One can try to choose a scan order that somehow maximizes the [Delta]2 shift values that depend on it. One way of doing this is to first pick the character in the pattern string p whose next leftward occurrence in p is a maximal distance away. Test this character firt. If it matches the corresponding text character, then we have to maximally shift the pattern string p right, before the next valid comparison position is reached. Repeat this selection process with the remaining characters of p. One could also take into account the character where a mismatch is first detected to skip over pattern string subsequences of repeated characters. Other refinements could maximize higher-order [Delta]2 shifts, but this would not result in a significant increase of search efficiency.

C code for constructing a "Maximal Shift" (MS) ordered pattern is given in the Appendix. This code sorts with a comparison of the minimum left shifts needed to match the characters being compared. The sort first picks the character with the maximal minimum left shift. If two characters have the same minimum left-shift value, we use the BM heuristic to first select the one closest to the end of the pattern string.

The Optimal Mismatch

(OM) Algorithm.

An algorithm that is even faster in practice can be acheived by using a pattern string scan order that optimizes the change of getting a mismatch at each test position. This is done by ordering the characters of the pattern string p from the one least likely to occur in the text alphabet to the one most likely to occur. USe this as the pattern scan order. This increases cre probability of finding a mismatch as soon as possible and results in greater efficiency.

C code for constructing an "Optimal Mismatch" (OM) ordered pattern is given in the Appendix. This code sorts with a comparison of the frequency of occurrence of the pattern string's characters in the text alphabet. When two characters have the same frequency, the BM heuristic is used to first select the one closest to the end of the pattern string. One could go beyond this by first selecting, for characters of near equal frequency, the one that would give the maximal [Delta]2 shift. Even further, one could compute the number of expected comparisons for each character that would results from the probability of it matching or not matching and the expected shift associated with each event. This does not give much better results than the simplified OM algorithm that we have used.

To illustrate the impact of the OM algorithm, note that over 20 percent of English words end in the letter 'e', the most commonly occurring character in English text with about a 10 percent occurrence rate. Thus, many words that are searched for in text using the BM algorithm often get a match for the fist character tested. Testing the least probable character of a word first, considerably improves this statistic. The average ratio of the text occurrence probability of the last letter of a word to the least likely letter in it is almost 5, making a mismatch on the first character tested five times more probable in general. For some words (one percent), this ratio can be as high as 50 or more. For words ending in 'e', the average ratio is almost 9.

Comparison of the

Algorithms.

To compare the BM, QS, MS, and OM algorithms, each eas used to search for the same strings in large fixed text buffers, and the number of comparisons made with text characters was counted. The algorithms were coded in the C programming language (see Appendix). The first text buffer used was formed from the [UNIX.sup.TM] spelling dictionary file, /usr/dic/words, by discarding nonalphabetic characters and converting alphabetics to lower case. This resulted in about 200K characters of text. Then, all occurrences of each alphabetic word in /usr/dict/words were searched for in this text buffer using each of the four algorithms. After counting the number of character comparisons made for each word and algorithm,the fraction of the total characters of text was computed and recorded. Finally, for each pattern string length, the average of this fraction was computed for each algorithm. The resulting statistics (Table I) show that the QS, MS, and OM algorithms are all faster than the BM algorithm. Table I compares the BM, QS, MS, and OM algorithms as a function of the pattern string length, "Plen." The "Words" column gives the number of words of each each length searched for in the text. These results show that the OM algorithm is the fastest one of all.

Next, the increase in speed of the fastest algorithm, the OM algorithm, over the BM algorithm, was computed as the ratio BM/OM of the values in Table I for each word, and the average for each pattern legnth was computed as shown in Table II. This shows a dramatic increase in search speed for short pattern strings and a general increase of almost 10 percent for longer strings. Table II also shows the minimum and maximum values of the BM/OM ratio for any individual word that occurred for each string length. This shows that the OM algorithm is at least as good as the BM one and can sometimes be significantly faster.

A further test to compare the BM and OM algorithms was done using the text buffer formed by concatenating all the UNIX manual pages. The raw, unformatted manual pages were only filtered by throwing away format command lines and by converting alphabetic characters from upper to lower case. Nonalphabetic special characters, including all white spaces, were retained in the text. This resulted in almost 3 megabytes of technical English text. Again, each alphabetic word from the UNIX dictionary was searched for in this text buffer using the BM and OM search algorithms. The results are given in Table III. It is interesting to note that the speedup ratio BM/OM, as a function of Plen is almost exactly the same as computed in the dictionary text search tests.

Conclusion

Throughout the history of computer science, there has been an evolving discovery of new, fast string search algorithms. Theoretical work in automata theory, in the 1960s, led directly to the algorithms of the 1970s. Of these, the Boyer-Moore (BM) algorithm became notorious as the fastest technique available to search for a single fixed substring. It was also notorious for going against natural intuition by scanning the pattern string in reverse order and thus gaining its startling efficiency. Following this work, many improvements have been made to the BM algorithm that can increase search speed for certain types of patterns. All of these improved algorithms, however, still depend on the BM technique of scanning the pattern string in reverse order to achieve their efficiency.

This article has presented an extension of the BM algorithm that does away with dependence on the scan order of the pattern string. In fact, the pattern can be scanned in any arbitrary order, and there is still an increase in efficiency over the BM algorithm. It is then shown how to select scan orders that increase this efficiency even more. Three specific new algorithms are presented that use three different pattern string scan orders. These algorithms are called the "Quick Search" (QS), the "Maximal Shift" (MS), and the "Optimal Mismatch" (OM) algorithms.

The first of these, the QS algorithm, is very easy to implement and scans the pattern in the most natural forward order. It is almost as easy to understand, code, and debug as the slow, straightforward algorithm that most programmers tend to use. Using the QS algorithm, however, will most often give superior search speeds to even the BM algorithm. When a programmer is called on to rapidly code a string search, the QS algorithm should be his or her choice.

The final algorithm, the OM algorithm, is the fastest one of all. It gains its efficiency by first testing the least probable pattern string character and thus detects mismatches more quickly. This event dominates search statistics and results in a significant increase in speed (see Tables I, II, and III). The greatest gains are for short pattern stings, where there is a 20 percent or greater increase in search speed for normal English text. For longer strings, the relative advantage becomes smaller, and one can expect a text search speed increase of about 10 percent.

In applications not involving English text, the algorithms presented in this article should still give better performance than the Boyer-Moore string search. The statistics on the degree of improvement would be different since they depend on the size of the text alphabet and the frequency of occurrence of the alphabet characters in the application text.

Acknowledgments.

Although this work was not the result of a directly funded research project, it is most certainly a side-effect of the projects I have worked on at JHU-APL. I would like to acknowledge the stimulating technical environment in which I work and the resources made available to me on a daily basis. These factors have made this work possible.

References

[1] Boyer, R.S., and Moore, J.S. A fast string searching algorithm. Commun. ACM 20, 10 (Oct. 1977), 762-772.

[2] Guibas, L.J., and Odlyzko, A.M. A new proof of the linearity of the Boyer-Moore String Searching Algorithm. SIAM J. Comput. 9, 4 (Nov. 1980), 672-682.

[3] Knut, D.E., Morris, J.H., and Pratt, V.R. Fast pattern matching in strings. Siam J. Comput. 6, 2 (June 1977), 323-350.

[4] Smit, G.V. A comparison of three string matching algorithms. Softw.--Prac. and Exp. 12, 1 (Jan. 1982), 57-66.

CR Categories and Subject Descriptors: F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems--sorting and searching

General Terms: Algorithms

Additional Key Words and Phrases: Boyer-Moore, Knuth-Morris-Pratt, pattern, search, substring

DANIEL M. SUNDAY is currently a mathematician at the Johns Hopkins University Applied Physics Laboratory (JHU/APL) and works as the designer and developed of software systems. HE is also the Lead Software Engineer on a project developing color display systems for the U.S. Navy. Author's Present Address: The Johns Hopkins University, Applied Physics Laboratory, Johns Hopkins Road, Laurel, MD 20723-6099. dan@aplvax.jhuapl.edu

Printer friendly Cite/link Email Feedback | |

Title Annotation: | includes related article on C implementation |
---|---|

Author: | Sunday, Daniel M. |

Publication: | Communications of the ACM |

Article Type: | technical |

Date: | Aug 1, 1990 |

Words: | 4196 |

Previous Article: | Physical design equivalencies in database conversion. |

Next Article: | Calendar of events. |

Topics: |