Fast hashing of variable-length text strings.
In the literature on hashing techniques, most authors spend little time discussing any particular hashing function, but make do with an allusion to Knuth  in their haste to get to the interesting topics of table organization and collision resolution. The relatively rare articles on hashing functions themselves  tend to discuss algorithms that operate on values of predetermined length or that make heavy use of operations (multiplication, division, or shifts of long bit strings) that are absent from the instruction sets of smaller microprocessors.
This article proposes a hashing function specifically tailored to variable-length text strings. This function takes as input a word W consisting of some number n of characters, C[.sub.1], C[.sub.2],..., C[.sub.N], each character being represented by one byte, and returns an index in the range 0-255. An auxiliary table T of 256 randomish bytes is used in the process. Here is the proposed algorithm:(l)
h := 0 ;
for i in 1..n loop
h[i] := T[ h[i-1] xor C[il ]
end loop ;
Notice that the processing of each additional character of text requires only an exclusive-OR operation and an indexed memory read. Also note that it is not necessary to know the length of the string at the beginning of the computation, a property useful when the end of the text string is indicated by a special character rather than by a separately stored length variable.
Two desirable properties of this algorithm for hashing variable-length strings derive from the technique of cryptographic checksums or message authentication codes [4), from which it is adapted. First, a good cryptographic checksum ensures that small changes to the data result in large and seemingly random changes to the checksum. In the hashing adaptation, this results in good separation of very similar strings. Second, on a good cryptographic checksum the effect of changing one part of the data must not be cancelled by an easily computed change to some other part. In hashing, this ensures good separation of anagrams, the downfall of hashing strategies that begin with a length-reducing exclusive-OR of substrings.
The auxiliary table T is obviously crucial to this algorithm, yet I have found very few constraints on its construction. Since the hashing function can only return values that appear in T, each index from 0 to 255 must appear in T exactly once. In other words, T must be a permutation of the values (0 ... 255). Obviously, if T[il = i, the corresponding h is merely a longitudinal exclusive-OR checksum, which is a bad hashing function because it does not separate anagrams. I have experimented by filling T with randomly generated permutations of (0 . . . 255) and have found no outstanding good or bad arrangements. (An attempt to promote greater dispersal among very similar short strings by clever choice of T, however, turned out to be a very bad idea.)
For the interested reader who does not want to generate his own random permutations, Table I presents the permutation used in the tests described later in this article.
The purpose of any text hashing function is to take text strings-even very similar text strings-and map them onto integers that are spread as uniformly as possible over the intended range of output values. In the absence of prior knowledge about the strings being hashed, a perfectly uniform output distribution cannot be expected. The best result that one can expect to achieve consistently is a seemingly random mapping of input strings onto output values. To see how well h does its job, one might ask the following questions.
* If h is applied to a string of random bytes, is each of the 256 possible outcomes equally likely? The answer, probably not surprisingly, is yes. From the algorithm given earlier, it is clear that if the last input character, C[n], is random-equally likely to take any value, and uncorrelated with any preceding character-then all final values of h are equally likely.
* If two input strings differ by a single bit, will their hash function values collide more often than by
Smaller Character Range
If the range of the input characters can be limited, a smaller auxiliary table T may be used, and the range of h can be limited accordingly. For example, if the input string can be limited to digits and uppercase letters, then each character can be mapped into the range [0, 63] as it is processed; T can be a table of 64 values in the range [0, 63], and h will then return a value in that range. For example, the 26,544 spelling-checker entries consisting entirely of letters and digits were hashed with a function that mapped the digits onto [0, 9] and both uppercase and lowercase alphabets onto [10, 35]. A 64-element T was built by eliminating all entries exceeding 63 from the table presented earlier. Distribution over the 64 output values was as even as would be expected from a random function X[.sup.2] = 59.17, 63 degrees of freedom, p = 0.614), while the exclusive-ORs of successive values were insignificantly less uniform (X[.sup.2] = 81.69, 63 d.f., p = 0.057).
Larger Range of Hash Values
In some applications, a range of hash indices larger than 256 is needed. Here is a simple way to get 16 bits of hash index from the function h:
(1) Apply h to the string, calling the result Hi.
(2) Add 1 (modulo 256) to the first character of the string.
(3) Apply h to the modified string to get H2.
(4) Concatenate Hi and H2 to get a 16-bit index.
When this algorithm was applied to the 26,662-word spelling-checker dictionary, 4,721 collisions occurred. Since perfectly random hashing would produce, on the average, 4,757 collisions, we conclude that this 16-bit extension of h performs essentially as well as random hashing.
In a second test, the 65,536 possible 16-bit index values were grouped and tallied in 533 bins. (The number of bins was chosen so that the average bin would catch about 50 of the 26,662 tallies.) The resulting distribution of tallies was consistent with the hypothesis of a uniform distribution X2 = 558.6, 532 d.f., p = 0.205).
Permuted Index Space
Some users of hashing functions who are concerned with collision handling prefer to think of the hashing function as producing a permutation of the index space, thereby specifying not just a single hash index, but a succession of hash indices to be tried in case of collisions . The function h is well suited to this sort of application. By repeatedly incrementing the first character of the input string, modulo 256, one causes the hash index returned by h to pass through all 256 possible index values in a very irregular manner. This is derived from the assertion that strings of equal length differing in only one character cannot produce the same hashing function value.
A hashing function is perfect, with respect to some list of words, if it maps the words in the list onto distinct values, that is, with no collisions. A perfect hashing function is minimal if the integers onto which that particular list of words is mapped form a contiguous set, that is, a set with no holes. (See, for example, , , and [3, pp. 506-507].) Minimal perfect hashing functions are useful in applications where a predetermined set of high-frequency words is expected and the hash value is to be used to index an array relating to those words. If the hashing function is minimal, no elements in the array are wasted (unused).
The table T at the heart of this new hashing function can sometimes be modified to produce a minimal, perfect hashing function over a modest list of words. In fact, one can usually choose the exact value of the function for a particular word. For example, Knuth  illustrates perfect hashing with an algorithm that maps a list of 31 common English words onto unique integers between -10 and 30. The table T presented in Table Il maps these same 31 words onto the integers from 1 to 31, in alphabetical order.
Although the procedure for constructing the table in Table II is too involved to be detailed here, the following highlights will enable the interested reader to repeat the process.
(1) A table T was constructed by pseudorandom permutation of the integers (0 ... 255).
(2) One by one, the desired values were assigned to the words in the list. Each assignment was effected by exchanging two elements in the table.
(3) For each word, the first candidate considered for exchange was T[h[n-1] xor C[n]]}, the last table element referenced in the computation of the hash function for that word.
(4) A table element could not be exchanged if it was referenced during the hashing of a previously assigned word or if it was referenced earlier in the hashing of the same word.
(5) If the necessary exchange was forbidden by Rule 4, attention was shifted to the previously referenced table element, T[h[n-2] xor C[n-1]]}.
This procedure is not always successful. For example, using the ASCII character codes, if the word "a" hashes to 0 and the word "i" hashes to 15, it turns out that the word "in" must hash to 0. Initial attempts to map Knuth's 31 words onto the integers (0 ... 30) failed for exactly this reason. The shift to the range (1 ... 31) was an ad hoc tactic to circumvent this problem.
Does this tampering with T damage the statistical behavior of the hashing function? Not seriously. When the 26,662 dictionary entries are hashed into 256 bins, the resulting distribution is still not significantly different from uniform (X[.sup.2] = 266.03, 255 d.f., p = 0.30). Hashing the 128 randomly selected dictionary words resulted in an average of 27.5 collisions versus 26.8 with the unmodified T. When this function is extended as described above to produce 16-bit hash indices, the same test produces a substantially greater number of collisions (4,870 versus 4,721 with the unmodified T), although the distribution still is not significantly different from uniform (X[.subp2] = 565.2, 532 d.f., p = 0.154).
The main advantages of the hashing function presented here are:
(1) No restriction is placed on the length of the text string.
(2) The length of the text string does not need to be known beforehand.
(3) Very little arithmetic is performed on each character being hashed.
(4) Similar strings are not likely to collide.
(5) Minimal, perfect hashing functions can be built in this form.
Its principal disadvantages are:
(1) Output value ranges that are not powers of 2 are somewhat more complicated to provide.
(2) More auxiliary memory (the 256-byte table T) is required by this hashing function than by many traditional functions.
This hashing function is expected to be particularly useful in situations where good separation of similar words is needed, very limited instruction sets are available, or perfect hashing is desired.
1. Cichelli, R. J. Minimal perfect hash functions made simple. Commun. ACM 23, 1 (Jan. 1980),17.
2. Knott, G. D. Hashing functions. Comput. 1. 18, 3 (1974), 265-278.
3. Knuth, D. E. The Art of Computer Programming. Vol. III, Searching and Sorting. Addison-Wesloy, Reading, Mass., 1973.
4. Meyer, C., and Matyas, S. Cryptography. John Wiley & Sons, New York, 1982.
5. Sprugnoli, R. Perfect hashing functions: A single probe retrieving method for static sets. Commun. ACM 20, 11 (Nov. 1977), 841.
6. Ullman, J. D. A note on the efficiency of hashing functions. 1. ACM 19, 3 July 1972), 569-575,
CR Categories and Subject Descriptors: E.2 [Data]: Data Storage Representations-hash-table representations; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing-indexing methods; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval-search process.
General Terms: Algorithms.
Additional Key Words and Phrases: Hashing, scatter storage.
ABOUT THE AUTHOR:
PETER PEARSON is a computer scientist at Lawrence Livermore National Laboratory, where his work tends to emphasize microcomputers, statistics, cryptology, and physics. Author's Present Address: 5624 Victoria Lane, Livermore, CA 94550 Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
|Printer friendly Cite/link Email Feedback|
|Title Annotation:||hashing algorithms|
|Author:||Pearson, Peter K.|
|Publication:||Communications of the ACM|
|Date:||Jun 1, 1990|
|Previous Article:||Skip lists: a probabilistic alternative to balanced trees.|
|Next Article:||Concurrent operations on extendible hashing and its performance.|