# Minimal space, average linear time duplicate deletion.

MINIMAL SPACE AVERAGE LINEAR TIME DUPLICATE DELETION If a file contains two or more records with equal keys, all but one of the instance of this record are called duplicates. Detecting and deleting duplicates is a task that frequently reccurs in data processing. Typical examples are:

* identifying passengers in air travel with double bookings,

* purging erroneous double entries in order processing, mailing lists, etc.,

* identifying all distinct words in a text,

*obtaining a list of all distinct values in a measuring experiment,

* catching duplicates produced by a random number generator.

Av ery common application occurs in relational database systems where relational theory requires that the tuples of a relation form a set in the mathematical sense. Thus following projection, a system should detect and remove all those lines in the table which only differed on the now-deleted columns(s).

The generally accepted method for this task is to sort the recods which move records with identical key fields into consecutive locations. The duplicate records are then deleted either on-the-fly during the sort or in a second pass. If the given file contains N records with 1 [is less than or equal to] k [is less than or equal to] distinct values, deletion on-the-fly or use of a sorting method with a speed-up for duplicates may reduce the sorting time from 0(N log N) to /(N log k) [5, 11, 14, 15]. We should also mention that it is possible to even do stable, minimal space duplicate deletion in time 0 (N log N). It is achieved by first applying a stable, minimal space-sorting method (e.g. [13]), to the given data and then a stable, minimal space, linear time separation algorithm, (e.g. [8]), to the output of the sort. Because of the high constans involved, these methods are of more theoretical interest.

A sorting solution has the additional advantage that the list of unique values, called fod-list (free-of-duplicates-list) is in ascending descending respectively, order. Furthermore, it is known that purging duplicates from a list requires at least [OMEGA](N log N) steps under the general decision tree model [2], p. 292, thus sorting is worst case optimal. A similar result for detecting the maximal repeated element in a list can be found in [11], p. 5.

On the other hand, if key-to-address mappings are permitted, and if the output does not need to be in ascending (descending) order--which is true for tuples in a relational table--solutions based on hashing can do better, at least on the average. One example of hashing is in approximation the number of unique values using a bit-array [3]. Another example is Knuth's linear order hashing as a solution for finding a sorted list of M random numbers in the range 1..N in which no integer occurs more than once [4]. In the same way, the solution presented here produces an unordered fod-list with duplicates in the tail part of the file using one or more hashing passes over the input file. Contrary to intuition, no extra space is required and even a speed-up in the presence of duplicates can be shown.

If a sorted fod-list is required, deleting duplicates in linear time and then sorting the fod-list still be faster in the presence of duplicates because it takes time 0(N + k log k) versus 0(N log k) otherwise.

Apart from space and time requirements, duplicate deletion algorithms (for short dd-algorithms) can be classified into stable, totally stable, order preserving, and totally order preserving methods. We say a dd-algorithm is stable if the first (last) instance in left-to-right input order is picked and retained for the fod-list it is totally stable if records with equal keys in the list of duplicates retain their relative input orde. Stable methods are useful in situations where records have identical keys but are distinguishable on non-key fields.

If the dd-algorithm is stable and produces the fod-list in input orde, then it is called order preserving. If the duplicates retain the input order as well, then it is said to be totally order preserving. As an example, consider a plane's schedule which can be represented as a directed graph as shown in Figure 1. The output from various dd-algorithms is shown in Figure 2, where the input is a list of airport codes as key and departure time as satellite information.

Stable sorting algorithms, e.g. Mergesort [9] or linked list versions of Quicksort [14], give rise to totally stable dd-algorithms. They obviously do not yield order-preserving dd-algorithms without an additional index field. As will be shown, the hashing method introduced here is unstable, but can easily be made stable and even totally order-preserving with additional space of order N. The question is open as to whether there exists a (totally) stable and/or (totally) order-preserving, linear time, minimal space dd-algorithm based on hashing.

Interface Definition

Given a sequence of N records in sequential allocation, the task is to produce a permutation of the input such that afterwards the 1 [is less than or equal to] k [is less than or equal to] N records with pairwise distinct keys are in the front and the N - k duplicates are in the back. Figure 3 shows the general set-up as type-declarations in Pascal-notation.

Declaring the proper type for the domain of key values is up to the user. Assuming that the reader is familiar with the basic principle of hashing (see e.g. [9]), it suffices to mention that the hashing method below will use the division-remainder method as key-to-address mapping. For key types which do not directly permit this mapping, the user must provide his or her own hash functions with the interface definition from Figure 4.

For the common case of text values, several methods have been suggested in the literature [9, 10]. Preference should be given to computationally simple methods because--unlike in scatter storage applications--dd-algorithms are applied only once to a file. Using (f(first character)*f(middle character)*f(last character) + length of text) MOD p + d is a possible choice, where f delivers the numerical code of its character argument. For the following discussion, we may thus assume integers as input without restriction of generality.

Rhe reason for using an offset d in the definition of the hash function becomes apparent in the discussion of the method below. It also permits us to call the duplicate deletion procedure for an input sequence with arbitrary bounds l and r, l [is less than or equal to] r. (1) This way we avoid creating the impression that the method depends on having the data in locations 0..N - 1. In general, the input then resides in locations s, s + 1, ..., s + N - 1 = t (1 [is less than or equal to] s [is less than or equal to]

MaxSize - N + 1, N 0) of an array x. The necessary declarations are:

VAR x : Sequence s,t,u : Index

Deleting duplicates in x[s..t] is achieved by calling a procedure DDT in the form DDT(s,t,u,x) (2). Afterwards, the fod-list is in x[s..u] and duplicates are in x[u + 1..t]. The method's external specification can be taken from Figure 5. Note that inside DDT variables of type Index may run one past the array bounds I and MaxSize.

Description of Method

Definition of Outer Loop

The basic idea of DDT is to scan the input file in one or more passes, reducing the number of unhandled records. Each pass has two phases: a scatter phase which uses hashing to assign records to new addresses, and a collect phase, where detected duplicates and detected unique values are moved out of the way. This leaves only those keys for the next pass for which no decision could be made in the previous pass(es). Figure 5 shows the basic loop and Figure 6 gives a pictorial description of the general situation during a single pass.

In Figure 6, the terms proven fodlist and proven duplicates refer to records which have been extracted from the input in previous pases. Let the term home address of a key denote the location which the hash function returns for a key argument. The terms natives, synonyms, duplicates and unhandled in Figure 6 have the following meanings:

* native a record positioned in the home address,

* synoym a record which collided with some native and must therefore be positioned somewhere else the key is not equal to the native's key,

* duplicate a record with identical key value as some native, and

* unhandled a record for which no decision has been made yet.

Although we shall not formally prove the correctness of DDT, we will state the key observations that are necessary to understand the method. The first one is in a way a redefinition of duplicate deletion and describes a condition that holds at the entry of each iteration (the so-called loop invariant).

Observation 1: If several records have the same key value, then DDT keeps either all instances among the unhandled records or one instance within the fod-list and all other instances within the list of duplicates. Clearly, this observation holds when calling DDT and--if it can be maintained through all passes exhausting the list of unhandled records--it will yield the desire output.

The interface definitions for Scatter and Collect can be taken from Figures 9 and 12. In the following, we indicate how Scatter and Collect maintain the invariant from Observation 1.

Hashing in Place

Let us have a closer look at Scatter--the procedure which hashes its input thereby producing three categories: natives, synonyms and duplicates. The hash function we use for record a[i] is a[i].key MOD (r + 1 - el) + el which produces a home address in the range el..r inclusive. Remember that a record that occupies its home address is called a native. The correctness of Scatter now hinges on the following fact.

Observation 2: During the Scatter-phase, a native is never moved.

Assume Addr1 is indeed equal to Addr2. This opens up two subcases. If the key of the unhandled record is equal to the key in a[Addr1], then the unhandled record is a duplicate and must be moved out of the way. Note that in principle any place will do. It can never be mistaken for a native in the future because the key-to-address mapping is a function and therefore each record can have exactly one home address. We must keep duplicates separated from synonyms, however, which brings up the second subcase. If the unhandled record and a[Addr1] (the record at the home address) have unequal keys, then the unhandled record cannot become a native, it is not a duplicate and must therefore be handled as a synonym.

Assume now that Addr1 is unequal Addr2, that is, the home address of a[Addr1] is not Addr1. Then Addr1 is occupied by a record which is not a native. This record should be moved out of this location to make space for the unhandled record which becomes a native.

The implications of Observation 2 should now become clear. The home address of a record is the place which each unhandled record visits to clarify its identity. If it is already occupied by a native, the unhandled record must either become a duplicate or a synonym depending on equality of keys. If the home address is free, the first visitor grabs the location and becomes a native and is later added to the fod-list. If all keys go through this visit process, then it is guaranteed that all [n.sub.i] records with equal key value i are split into either one native and [n.sub.1] - 1 duplicates or become [n.sub.1] synonyms. This brings us one step closer to the goal from Observation 1.

What is left now is to devise an intelligent way of keeping synonyms, natives, duplicates and unhandled records apart. One way to do this is to have two 'pointers,' low and high, move from the left, respectively, right, end of the list of untreated records until they meet. Figure 7 illustrates the invariant for this loop.

Let a[low] be the next unhandled record. We determine its home address Addr1. If low = Addr1 then we are looking at a native and we advance low. Otherwise we look at the record in the home address, that is, compute Addr2, and determine whether a native resides there (Addr1 = Addr2).

If yes and a[low] is a synonym, simply advance low. If yes and a[low].key is equal to a[Addr1].key, then a[low] is a duplicate and must move to the right into the list of natives and duplicates. We therefore decrease high until we find a record that is not a native, that is, until a[high] has a home address unequal high. This search must stop at least on low itself since a[low] was already tested and found to be a non-native. Now a[low] and a[high] switch their places and a[low] becomes the unhandled record.

Next, consider the case in which the home address Addr1 is free, that is, occupied by a non-native. Clearly, a[low] should move to Addr1 but we must be careful where to place the removed record from Addr1.

If Addr1[is greater than or equal to] high, then a[Addr1] had been identified as a duplicate before and it must remain in this category. We therefore decrease high as before until we find a place occupied by a non-native. Then we can do the triangular switch Circle(a[low], a[high], a[Addr1]) depicted in Figure 8.

If Addr1 low, then a[Addr1] had been identified as a synonym before. Since any place in a[el..low] is as good as any other for a synonym, we simply switch a[low] and a[Addrl] and increment low.

Finally, if low [is less than or equal to] Addr1 high, i.e. a[Addr1] is among the unhandled records, we may again swap places but must keep low stationary in order to consider a[low] next.

In summary, there were six alternatives above for placing the unhandled record in a[low]:

* a[low] becomes a duplicate,

* a[low] becomes a synonym,

* a[low] is a native,

* a[low] becomes a native and displaces a duplicate,

* a[low] becomes a native and displaces a synonym,

* a[low] becomes a native and displaces an unhandled record,

None of the six alternatives ever moved a native. Observation 3 is a slightly more formal summary of the behavior of Scatter.

Observation 3: If the input a[el..r] to Scatter contains exactly n, records with one particular key value i, then in the output of Scatter, either none of the records occupy a home position and all of them come to rest in positions in a[el..low - 1] or one of them occupies a home address and all [n.sub.1] - 1 other instances rest in positions in a[low..r]. Furthermore, at least one record becomes a native.

In other words, Scatter maintains the invariant from Observation 1: records with equal keys are either all synonyms or one of them is a native and the others are duplicates--tertium non est! The complete listing is given in Figure 9. The needed procedures Swap and Circle appear together with Collect as shown in Figure 12.

Separating Records in Place

The task of the second phase in each pass of DDT is to collect the natives and attach them to the fodlist, collect the duplicates and attach them to the list of duplicates in the tail, and collect the synonyms into a[el' ..r'] as unhandled records for the next pass. There are several methods that can be used to do this. In fact our solution seems fairly complicated but turns out to be efficient.

In general, we need a way to separate a sequence of records in place according to the criterion 'is a[i] a native?'. A similar task occurs in Quicksort [7] with the criterion 'is a[i] less than the pivot?'. The major difference here, however, is that once a record has been moved, we cannot recall at a later time from the key value alone whether it was a native, synonym or duplicate.

A convenient way to perform this separation in place is to use a wheel. This data structure has been used by many programmers before, but to our knowledge has never been formally introduced except possibly in connection with descriptions of life in 'Flatland' [1]. (3)

The wheel works as follows: Given a sequence a[1..N] of records where each record either does or does not have property P, we separate a such that all k(0 [is less than or equal to] k [is less than or equal to] N) records with property P are afterwards in a[1 ..k] and all others are in a[k + 1 ..N]. This can be done by either forming a P-Wheel (a sequence of records which have all property P) and rolling this wheel from the right to the left through array a, or inversely by forming a non-P-Wheel which rolls from left to right through a.

Figure 10 shows a P-Wheel with a. The two indices i and j determine the diameter (extension) d = j - of the wheel. If the record a[i] in front of the P-Wheel has property P, the wheel is expanded, that is, i is decremented. Otherwise a[i] and a[j] switch places and i, j are decremented. This can be interpreted as rolling the wheel to the left. Clearly, at most N exchanges of records, respectively, expansions of records are needed to separate N records which makes the method linear in time using O(1) extra space.

Whether to roll a P-Wheel or a non-P-Wheel depends on the circumstances. If it is known that there are more P-records in sequence a than non-P-records, then it is cheaper to move a P-Wheel (on our computer a call of Swap(a[i], a[j]) together with i := i - 1 j := j -1 costs at least four times more than a single decrement). On the other hand, the records not in the wheel retain their relative order which is useful in many applications. Records added to the wheel usually lose their relative input order. There seems to be no cheap solution to this defect. If there were a solution, a simple stable Quicksort could be devised!

It is possible to use more than one wheel, for instance, to separate the sequence according to more than one criterion. These wheels may either move in parallel (used later for Collect) or pass through each other. This concludes our regression to this astonishing little data structure.

Given the terminology we outlined, it is now very easy to describe the procedure Collect: we first move a wheel of natives from low - 1 down to el. This extends the fod-list by the wheel in its final position. Next the collected synonyms and the duplicates form two parallel wheels which are rolled from the original position low up to r. Duplicates in front of the two wheels extend the right wheel, natives cause a triangular exchange Circle(a[i], a[low], a[j]) shown in Figure 11.

When the two wheels reach r, the wheel of duplicates extends the list of proven duplicates and the synonyms are in the middle for the next pass of DTT. Figure 12 gives the complete listing for Collect.

The reason for first choosing a wheel of natives and then two wheels of synonyms and duplicates is that the analysis of Scatter in the next section will show that for random input the fraction of natives is, on the average, 1 - 1/e [is nearly equal] .63 of all unhandled records. Thus it is cheaper to roll a wheel of natives than a wheel of synonyms. A similar argument favors rolling two wheels of synonyms and duplicates in parallel.

The correctness of the wheel technique to separate natives from synonyms and duplicates now follows because we do not move any record until it is immediately in front of a wheel. Again we give a more formal summary of Collect.

Observation 4: Procedure Collect takes synonyms and natives in a[el..low -], duplicates and natives in a[low..r] (el low [is less than or equal to] r + 1) and separates them such that afterwards all natives are in a[el..el' - 1], all synonyms are in a[el' ..r'] and all duplicates are in a[r' + 1..r] (el el' [is less than or equal to] r' + 1 [is less than or equal to] r + 1).

Concerning the question of stability mentioned in the introduction, the reader might be tempted to believe that the wheels in Collect are the reason for DDT's instability. This is not true Collect could easily be changed so that the wheels keep records with equal keys among synonyms and duplicates in relative input order. The trouble rests with Scatter which cannot pick the first representative of several records with equal keys. Switching a[low] and a[Addr1] moves a record from a random location either further up or down in the sequence, thus permuting synonyms and duplicates. On the other hand, using a separate area to either hold natives or synonyms and duplicates would turn DDT totally stable.

In the present form, algorithm DDT is unstable but deletes duplicates with O(1) extra space since both Scatter and Collect only exchange records. A formal proof of correctness can easily be constructed from Observations 1-4.

How Fast Is the DDT-Algorithm?

In Appendix 1 we show that, on the average, for an input of N distinct keys, DDT makes In N passes over

the file. In each pass, about 63.2% of the records become natives and only 36.8% are synonyms and must be treated in the next pass. In total this produces a linear running time proportional to 1.58N.

In the presence of duplicates, that is, if there are only k N distinct keys, the required number of steps decreases to less than ln k. Thus, DDT actually profits from the presence of duplicates. Intuitively speaking, only k potential natives compete for N locations which correspond to a scatter storage performance with a low load factor [alpha] = k/N 1. For example, if each key occurs twice, that is, if [alpha] = 0.5, DDT only needs about 1.27N steps.

Another positive effect is that the running time varies only with the choice of key values, not the input order. Since non-natives (synonyms or duplicates) give up their location in O(1) steps for the first arriving native, the algorithm--like most hashing-based methods--is immune to particular orderings of the input.

What about possible degeneration? In theory, the worst case happens when all keys repeatedly map to one and the same home address. This leads to a running time of O([N.sup.2]) because exactly one record is removed in each pass. For this to occur, however, keys essentially must be some variation of N! (N factorial), since the series of hash functions used is mod N, mod (N - 1), . . . , mod 2. This quickly finds its limit in the size of keys.

A more practical threat is a biased key space. A typical example would be a file of 'all even keys'. Let N be even for the first pass. This implies that only N/2 of the N locations are possible targets for a record. On the other hand, the probability that one of the N/2 locations receives a native increases to 1 - 1/[e.sup.2] [is nearly equal] 0.864 versus the earlier 0.632. The net effect in the first pass is that we are left with [n.sub.1] [is nearly equal] 0.567N records for the second pass versus 0.367N. If our bad luck continues and an even number of natives is removed in each of the following passes, the total effort becomes about 2.31N.

DDT, however, has a natural safeguard built in: the divisor changes at random from step to step because the number of removed natives is a random variable. Thus even in the well-known case of text keys and division by a power of 2, a disastrous first pass will be followed by normal passes. Moreover, the divisors [n.sub.i] and [n.sub.i + 1] in successive passes need not be relatively prime as in the case of collision resolution with multiple hashing--in pass i + 1 the collision cause (i.e., the native) is no longer present!

Separate Chaining and Sorting

as Alternatives

The algorithm proposed here needs no extra space and scans the file in multiple passes, displacing records in place. An obvious alternative is a one-pass hashing scheme where collisions are resolved through separate linked lists for the synonyms. This is the well-known hashing with separate chaining [9, 513-518]. The average number of probes for unsuccessful search is [alpha] + [e.sup.-[alpha]], where [alpha] is the present load factor [9]. Thus the effort of a dd-algorithm with chaining becomes

[Mathematical Expression Omitted]

N/e or about 1.13N for N records without duplicates.

This is better than the 1.58N of DDT, and the dd-algorithm is even totally stable. Initializing the chains and deallocating the extra space afterwards, however, reduces the advantage. In fact, the running times reported in Table I for separate chaining could only be achieved by marking the heap of the run-time system prior to hashing and releasing the total space in one step after moving the records back into array a[1..k]. Disposing the nodes one by one created random holes in the heap which deteriorates the performance.

Table I also gives the reader a quick impression of the absolute speed of DDT in comparison with competitive sorts. We have chosen two data sets: N integers from the interval [1..N] and N integers taken at random from the interval [0..32766] ([2.sup.15] - 1 = 32767 was reserved as a stopper for QuickPurge in Table I). The first set is useful for comparisons since DDT completes it in one pass. In both sets a replication factor M = N/k runs from 1, 2, . . . , N, that is, from no duplicates to a unary file. The performance measure is execution time listed in 1/100th of a second on an IBM PS/2-80, averaged over 10 experiments and reported for the value N = 2048.

The competitors are Sedgewick's fast version of Quicksort [12], a Quicksort version with on-the-fly duplicate deletion (QuickPurge), a textbook Heapsort [2, 6] and a Heapsort version with on-the-fly duplicate deletion (Deapsort). This version is based on Floyd's suggestion of using a path of maximal sons [9] which explains its faster performance even for M = 1. For the standard Quicksort and Heapsort algorithms, a procedure, ThrowBack was added after the sort that moves the duplicates in place to the tail end of the file. Its linear running time is included in their figures.

The empirical data clearly speak out in favor of our in-place hashing method. Note that DDT's coding favors input without any duplicates over input with duplicates only (both one pass). This seems reasonable since the former case should occur quite frequently while the latter is the exception. Having given the complete listing of DDT, the reader is invited to verify our results.

Conclusion

Detecting and deleting duplicates is a frequent task in data processing. As an alternative to sorting, a new method based on hashing is proposed which moves duplicates in place to the tail end of the file. Starting with the abstract concept of multipass scatter and collect phases, the concrete algorithm, called DDT, is derived and listed in the form of three short Pascal-procedures.

Astraightforward analysis shows that algorithm DDT works in minimal space and requires, on the average, linear time. Degeneration is possible, but unlikely since the divisor changes between passes as a random variable. Finally, empirical data from running DDT against one-pass separate chaining and competitive sorts indicate that the constants are low enough to recommend its use in practice for input of any size.

(1) In the following, the letter l has been substituted by its phonetic equivalent el to avoid confusion with the similar-looking digit 1.

(2) DDT stands for Duplication Deletion Technique. The reader is free to associate with it a method which weeds out unwanted items in one application.

(3) Note that we do not claim to have invented the wheel, we simply define it in the context of serious algorithmic design.

References

[1.] Abbott, E.A. Flatland. A romance of many dimensions. Dover, Mineola, N.Y., 1952.

[2.] Aho, A.V., Hopcroft, J.E., and Ullman, J.D. Data Structures and Algorithms. Addison-Wesley, Reading, Mass., 1983.

[3.] Astrahan, M.M., Schkolnick, M., and Whang, K.-Y. Approximating the number of unique values of an attribute without sorting, Inf. Syst. 12, 1 (1987), 11-15.

[4.] Bentley, J. and Knuth, D.E. Programming pearls, literate programming. Commun. ACM 29, 5 (May 1986), 364-369.

[5.] Bitton, D. and DeWitt, D.J. Duplicate record elimination in large data files. ACM Trans. Database Syst. 8 (June 1983), 255-265.

[6.] Floyd, R.W. Algorithm 245, Treesort3. Commun. ACM 7, 12 (Dec. 1964), 701.

[7.] Hoare, C.A.R. Quicksort, Comput. J., 5 (1962), 10-15.

[8.] Huang, B.-C. and Langston, M.A. Stable duplicate-key extraction with optimal time and space bounds. Acta Informatica 26 (1989), 473-484.

[9.] Knuth, D.E. The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley, Reading, Mass., 1973.

[10.] Lum, V.Y., Yuen, P.S.T., and Dodd, M. Key-to-address transform techniques: A fundamental performance study on large existing formatted files. Commun. ACM 14, 4 (Apr. 1971), 228-239.

[11.] Munro, I., Spira, P.M. Sorting and searching in multisets. SIAM J. Comput., 5, 1 (Mar. 1976), 1-8.

[12.] Sedgewick, R. Implementing quicksort programs. Commun. ACM 21, 10 (Oct. 1978), 847-857 Corrigendum in Commun. ACM 22, 6 (1979), 368.

[13.] Trabb Pardo, L. Stable sorting and merging with optimal space and time bounds. SIAM J. Comput. 6 (1977), 351-372.

[14.] Wegner, L. Sorting a linked list with equal keys. IPL, 15, 5 (Dec. 1982), 205-208.

[15.] Wegner, L. Quicksort for equal keys. IEEE Trans. Comput. C-34, 4 (Apr. 1985), 362-367.