Printer Friendly

Notes on fast hashing of variable length text strings.

I read with interest Peter Pearson's article, "Fast Hashing of Variable-Length Text Strings" (JUne 1990, pp. 677-680). In it he defines a hash function, given a text [C.sub.1] . . . [C.sub.n], by Exclusive OR'ing the bytes and modifying each intermediate result through a table of 256 randomish bytes.

He is right, there is not much literature about hashing functions, but this idea is not new. I developed it in my diploma thesis in 1976 [2] and published a modified version in 1977 at the annual conference of the german Gesellschaft fur Informatik (GI).

Here I decided to omit an additional table of randomish bytes. Although it makes possible a nearly perfect hashing, on a practical level it suffices the pure EOR-checksum if used with variable names in compilers. Statistics considering programmer's naming habits in Fortran and Pascal programs show a sufficient homogeneous distribution. Perhaps in environments with EBCDIC instead of SCII coding it will be better to EOR the result additionally with a byte containing the length of the string.

Of course the performance advantage (omitting the table access against an occasional collision) is not large. Otherwise, one can write an extremely fast scanner (including hashing) using the XC, EDMK and EX-instructions of the IBM/370-Assembler if s/he is not afraid of machine-dependent hacking instead of constructing a scanner out of regular expressions for at least one time.

I. Dittmer Bremer StraBe 8 A D-4514 Ostercappeln West Germany

In his article, Pearson suggests a new algorithm to map variable-length string onto integer. A string C consists in a sequence of characters [C.sub.0], [C.sub.n]. Each character is represented by one byte (in the range of 0 to 255). The algorithm needs an auxiliary table named T whose index goes from 0 to 255. This table memorizes a permutation of the values (0, 1, ..., 255) (for example, by a shuffling algorithm, [4, Section 3.4.2]). The proposed solution is the following:

integer hash (array c);

h = 0; n = ArrayLength(c) - 1; for i = 0 to n do

h = T[bitXor (h, c[i])];/* XOR bit by bit */ return h;

This algorithm returns a small integer between 0 and 255, an element of the table T. How does it work? As an example, one can limit the alphabet to the four symbols {'a', 'b', 'c', 'd'}. The corresponding codes are {'a' = 0, 'b' = 1, 'c' = 2, 'd' = 3}. Two bits are needed to code each character of the alphabet. The table T is set to the array [2 3 1 0], a random permutation of the value 0 to 3.

hash ('bc') = T [T[0 bit Xor 1] bitXor 2 ] = T[ 3 bitxor 2 ] = 3 hash ('cb') = T[ T[0 bitXor 2] bitXor 1 ] = T[ 1 bitXor 1 ] = 2 hash ('ab') = T[ T[0 bitXor 0] bitXor 1 ] = T[ 2 bitXor 1 ] = 0 hash ('ba') = T[ T[0 bitXor 1] bitXor 0 ] = T[ 3 bitXor 0 ] = 0

With that example, we easily see that the proposed solution can separate the anagrams (the string 'bc' and 'cb'). However, this fact may not be applicable to all anagrams (e.g. the words 'ab' and 'ba' give the same integer).

Before investigating the problem of varying the table size, we had to study the statistical behavior of the hash function and, especially the question, "Are the 256 outcomes equally probable?" The author does one traditional chi-square goodness-of-fit test with an English dictionary of 26,662 words. His answer is yes ([H.sup.2] = 255.64, d.f. 255, p = 0.477). I have used the same T table with a French dictionary of 52,748 entries (word in lowercase but with some accents which means a dozen codes between 128 and 255). The mean words length is 8.81334 and the variance is 7.85056 (standard deviation = 2.801885). The chi-square test result I obtained is alarming ([H.sup.2] = 309.383, d.f. 255, p = 0.0103).

If one wants an extended range for the hash value, the author proposes the following algorithm:

integer hash 16(array c);

h1 = hash(c); c[1] = (c[1] + 1) \\ 256; /* c[1] + 1 modulo 256 */ h2 = hash(c); h1 = bitShift(h1, 8); /* shift 8 bits to the left */ return (h1 + h2); /* concatenate h1 and h2 */

This version leads to outcomes from 0 to 65,535. A common problem amongst all hashing methods is the definition of the collision-resolution policy (What we do when h(C) = h(C'), for a string C [is not equal to] C' (i.e. hash('ab') = hash('ba'))). To solve the collision problem, Pearson suggests:

"The function h (hash in this paper) us well-suited to this sort of application. By repeatedly incrementing the first character of the input string, modulo 256, one causes the hash index returned by h to pass through all 256 possible index values in a very irregular manner."

Since the key itself changes, we do not encounter the primary or secondary clustering problem [3, Section 6.4]. What if the same collision-resolution policy, for a hash function, is used with a table of size 65,536? In that case, we increment the first and the second letter of the word to generate all the combinations of the first two letters of the string ([p.sup.j] = h([c.sup.0] + j \\ 256, [c.sup.1] + j // 256, [c.sup.2, ..., [c.sup.n])). With all those combinations, can we produce a permutation of the space index? The answer is no as shown in the following example.

We want to map a sequence of words composed by a combination of our four letters into a 4-bits integer. We apply a simpler version of the algorithm hash 16. The generation of all two-letter combinations of our alphabet gives the following results ([unkeyable] means the concatenation operator):

If we consider the probe secquence implied by the string 'aa', we generate the strings {'ba', 'ca', 'da', 'ab', 'bb', 'cd', 'db', 'ac', 'bc', 'cc', 'dc', 'ad', 'bd', 'cd', 'dd'} and the following addresses {3, E, 9, 1, 6, B, C, B, C, 1, 6, E, 9, 4, 3}.

Thus, we cannot obtain a permutation of the space index. The proposed collision-resolution policy does not work. Since the table size is a multiple of two, we can calculate [h.sup.2](C) as an odd value in such a way that [h.sup.2](C) is relatively prime to the table size [1]. In this case, the probe sequence defined by [p.sup.j] = h(C) - j[middot] [h.sup.2](C) does not suffer from secondary clustering. The function [h.sup.2] is the following:

integer hash2 (array c);

c[1] = (c[1] + 1) \\ 256; /* c[1]

+ 1 modulo 256 */ h = hash(c); return (bitOr(h, 1));/* to be sure that h is odd */

Jacques Savoy University of Montreal, Dept IRO P.O. Box 6128, station A Montreal (Qc) (Canada) H3C 3J7

I would like to add the following comment to Pearson's Fast Hashing of Variable Length Text Strings. This hashing function needs not to be limited to strings, we have been using it to hash addresses, strings and any other data structures that we use as keys in our CAD software.

From our experience, a good hashing function with a range greater than 8 bits can be built by concatenating the last values of the sequence generated by the algorithm, in example h[n-3]//h[n-2]// h[n-1]//h[n] to generate a 32 bit function. The order of the concatenation depends on which bit should be more "random."

James Litsios Layout & Verification Dept. CSEM Nevchatel, Switzerland

Response

I would be happy to cede priority in the invention of this algorithm to Ingo Dittmer. That my researches did not lead me to his thesis was entirely by oversight. If not inventor, I will cheerfully settle for the role of publicist for this underappreciated algorithm.

Jacques Savoy's observation of a distinctly uneven distribution of values for hashings of French words is indeed alarming: French and English are quite similar, after all, and the algorithm was in no way "tuned" for English. I can think of no accident of common letters, accented letters, common prefixes or suffixesd to explain such an uneven distribution.

Users of James Litsio's extension to 32 bits should be alerted to the fact that if two words share the same last three characters and hash to the same 8-bit value, their extended 32-bit hash values will be the same. More generally, since the last three characters can be calculated from the value of this 32-bit hashing function, the distribution of such 32-bit values will simply be the product of the distribution of 8-bit hash values and the distribution of terminal character triplets. The distribution of terminal character triplets is far from uniform, so collisions will be far more frequent than would be expected for a 32-bit hashing function.

Peter Pearson 5624 Victoria Lane. Livermore, CA 94550

References

[1] Bell, J.R., Kaman, C.H. The linear quotient hash code. Commun. ACM, 13, 11 (Nov. 1970), pp. 675-677.

[2] Dittmer, I. Implementation eines Einschrittcompilers fur die Programmiersprache PASCAL auf der Rechenanlage IBM/360 der Universitat Munster. Diplomarbeit Munster 1976 und doert angegebene Literatur.

[3] Knuth, D.E. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, Reading Mass., 1973.

[4] Knuth, D.E. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley, Reading Mass., 2nd 3d., 1981.
COPYRIGHT 1991 Association for Computing Machinery, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 1991 Gale, Cengage Learning. All rights reserved.

Article Details
Printer friendly Cite/link Email Feedback
Title Annotation:Technical Correspondence
Author:Dittmer, I.; Savoy, Jacques; Litsios, James; Pearson, Peter
Publication:Communications of the ACM
Date:Nov 1, 1991
Words:1575
Previous Article:CAPS: a coding aid for PASM.
Next Article:Calendar of events.

Terms of use | Privacy policy | Copyright © 2021 Farlex, Inc. | Feedback | For webmasters |