An insertion algorithm for a minimal internal path length binary search tree.
1. INTRODUCTION Methods that maintain balance in binary search trees have received considerable attention in computer science literature. Since a binary search tree that grows in a random manner will have, on the average, an expected search path lenght of approximately 1.386 log.sub.2.n , we can expect an improvement in excess of 25 percent in the search path length if the tree can be kept to a minimum height. The cost of maintaining a tree to achieve optimal or near optimal search performance can be justified if searching is the predominant activity in the tree, with insertions of new keys into the tree being a relatively infrequent activity. The algorithmic strategies that have previously been used to introduce and maintain balance in binary search trees have been either local or global rebalancing strategies.
In the local strategies, a reasonable notion of balance is introduced and the binary tree insertion algorithm is modified to detect a deviation from balance during an insertion and to restore the tree to balance by performing sequences of subtree rotations. The most popular local balancing strategies have been those for AVL, or height balanced, trees  and for bounded balance (BB), or weight balanced trees . A recent algorithm performs subtree rotations whenever the total internal path length of the tree is reduced .
The global balancing algorithms inspect all of the nodes in a tree in order to reconstruct a balanced tree. The global strategies produce either a perfectly balanced tree [2, 3, 9] or a route balanced tree [4, 11] (i.e., a tree with minimal internal path length). Several global tree balancing algorithms are discussed in , including two that adopt parallel procedures.
We introduce a balancing algorithm for binary search trees whose overall performance in terms of the trees produced and the run-time behavior of the algorithm, is between that of the local and global balancing algorithms. The trees produced by this algorithm have minimal internal path length and are balanced with respect to both the number of complete levels and the number of levels in the two subtrees of each node in a tree. In the worst case, this algorithm requires time that is linear in the number of nodes in the tree (i.e., becomes global, in a sense); the worst-case behavior occurs only as the tree comes close to actually being a complete tree. Rather than performing subtree rotations as the local balancing algorithms do, or detachning and reattaching nodes as the global algorithms do, this algorithm maintains its notion of balance by moving key values within the existing tree; if an insertion would introduce imbalance, then keys are shifted in the tree in an inorder fashion until an insertion of a new node can take place that will preserve the balance criterion. Lastly, some results from simulations that show the method to have promising average case behavior are offered.
2. THE BALANCING CRITERION
The insertion algorithm for this balancing strategy will maintain a binary search tree as a nearly complete tree. We define a binary tree to be nearly complete to level l if it is complete to level l-1 and level l is either empty or is the last non-empty level of the tree. The root of the tree is at level 0. A binary tree having n nodes must have at least *log.sub.2.(n + 1)* levels and the trees that have exactly this number of levels are those that are nearly complete to level *log.sub.2.(n + 1)*. - 1. It can be seen from this, that over all binary trees with n nodes, those with minimal internal and external path length are precisely the nearly complete trees. For this reason, nearly complete trees aare also called ruote balanced trees . The perfectly balanced trees are a subset of the nearly complete trees.
If a tree that is nearly complete to level l is not actually complete, then adding a new node to level l, (i.e., as a child of a level l-1 node), again yields a nearly complete tree. Therefore, to maintain the property of a tree being nearly complete, new nodes should only be added to level l, the last incomplete level, until level l is actually completed. If a normal binary tree insertion calls for the attachment of a new node as a child of a node p that is at level l, then some ancestor of p must have substrees that are nearly compelte to different levels; in fact, this ancestor of p will be the one nearest to p that has a vacancy at level l (of the whole tree). The latter fact will be used by the insertion algorithm to detect when rebalancing actions must be initiated.
To retain the property of a tree being nearly complete, the insertion algorithm will ensure that the number of complete levels in the subtrees of each node differ by one at most, and that one subtree of a node will not be allowed to grow beyond its last complete level until the other subtree of the node completes the same level. To monitor these properties the algorithm will maintain a counter for each node of the number of complete levels in the subtree having the node as the root. A typical Pascal data structure for implementing the tree, along with several global variable declarations, is given below: type tree = *node; node = record key: keytype; left, right: tree; completelevels: integer end; imbalancetype = (left, right, none); var pivot: tree; imbalance: imbalancetype;
An imbalance condition is detected during an insertion at a node p if the subtree of p where the insertion would normally take place has more complete levels than the other subtree of p. In this case, p becomes the pivot node for rebalancing and the variable imbalance is set to indicate which subtree would become too deep. The keys are displaced in an inorder fashion to make room for the new key in this subtree of p. A key that is displaced moves to the node that holds its inorder successor or predecessor, depending on whether the imbalance indicated is left or right, respectively. The key that is displaced out of the subtree of p replaces the key in p, that in turn becomes the key to insert into the other subtree of p.
A simple example will help to illustrate the key displacement actions that maintain the nearly complete property of a tree. The insertion of 5 into the nearly compelte tree that is depicted at the top in Figure 1, will cause a left imbalance condition at the indicated pivot node. Instead of attaching a new node as a child of the node with key 6, keys will be displaced to the nodes of their inorder successors until a value is displaced back up to the pivot node. Until the left imbalance condition is removed, the displaced key will always be the larger of the key to be inserted and the key in the node currently being inspected, with the displaced value becoming the new key to be inserted into the tree. In Figure 1, 5 will replace 6 in the leaf into and this will begin a sequence of displacements that culminates in the displacement of 12 from the left subtree of the pivot. Twelve then replace 15 in th epivot node and 15 is inserted into the right subtree of the pivot node. This latter insertion, that continues the inorder displacement pattern in Figure 1, also begins as though it were a new insertion with no imbalance condition; new imbalance conditions could arise as new nodes are inspected in the right subtree. The tree that results from the rebalancing actions is shown at the bottom in Figure 1, with an asterisk next to a node's key indicating that the key value was changed during the rebalancing.
A recursive implementation of the procedure to insert into a nearly complete tree is given in Figure 2. A sequence of recursive calls to the procedure insert continues until a node with 0 or 1 child is inspected; such a node will have exactly one complete level in the tree of which it is the root and will be at level l or level l-1 if the tree is nearly complete to level l. At the lowest levels of the tree, the procedure insert invokes the procedure displace _or_attach whose task is to either attach a new node to the tree if there is no imbalance condition, or to displace a value up the tree if there is an imbalance condition as shown in Figure 3.
As mentioned earlier, an imbalance condition causes a key that is displaced from a node p to move to the node that is the inorder successor or predecessor of p, depending on whether the imbalance condition is left or right, respectively. There, the key is swapped with the successor or predecessor key. The newly displaced value continues the process, moving down the tree by another invocation of insert if it was displaced from an interior node, and moving up the tree if it was displaced from a leaf node. The key displacement that takes place on return to the pivot node for an imbalance condition will remove the imbalance condition. As long as an imbalance condition exists, no new imbalance conditions can arise until a key is displaced to the pivot, since the active procedure will always be confined to a complete subtree of the pivot node until the pivot node's key is displaced.
Any node that has a new node attached in one of its subtrees must have its level counter recalculated. The only nodes whose complete level counts can change are those that were inspected and no imbalance condition was present or those that were pivot nodes for rebalance actions. The placement of the calls to the procedure resetlevels is suggestive of these observations. The number of complete levels in a tree is one more than the minimum of the numbers of complete levels in its two subtrees; the code is straightforward and is omitted.
It should be noted that the insertion algorithm to maintain a nearly complete tree can be parallelized. The initial insert process examines each node along a path from the root of the tree to a node at level l-1, where a new node wil be attached. Whenever an imbalance and the key to be inserted into the tree are swapped; the previous key to insert moves into the complete subtree of the node, in order to displace a key to the now vacant node, and the initial insert procedure proceeds down the tree with the node's key as the key to be inserted. The process of displacing a key from the complete subtree to the vacant node can be done in parallel with the initial insert process, since these processes will be examining nodes in different subtrees. Similarly, whenever a key displacement process finds itself moving to the wrong subtree of a node, it performs a key swap, proceeds with a new key into the subtree from which it must displace a key, and activates another displacement process to displace a key-from the other subtree to the current node. We note that it is only the initial insert process that must recompute the level counters in nodes.
3. SPACE AND TIME REQUIREMENTS
Assuming a fixed size allocation in each node to store the complete level counts, the additional space required for the storage of a tree having n nodes is O(n). Rather than store complete level counts in each node, the algorithm can be implemented using a bit flag in each node to indicate if the subtree of the node has any gaps in the level that is currently being filled. When a level is completed the level flags throughout the tree will have the same value; to avoid having to change all of the flag values in the tree, the interpretation of the bit flag can just be switched while the next level is filled. The signal that the lowest level of the tree has been completed is the resetting of the flag value in the root of the tree. The extra space requirements for the execution of the insertion algorithm are O(log.sub.2.n), owing to the recursive nature of the algorithm as presented here. An iterative version of the algorithm can be given when extra space requirements for execution are O(1).
To estimate the time requirements of the insertion algorithm we will count the number of calls to the procedures insert and displace_or_attach. The number of node inspections, pointer dereferences, and comparisons can be estimated based on the number obtained. The behavior of the algorithm is dependent on both the shape of the tree (the locations of the nodes in the last level of the tree) and the relationship of the key to be inserted with the keys already present in the tree. Because of the combinatorially explosive nature of the problem, the worst-case behavior of the algorithm is concentrated on and some simulation results will follow.
We write the number of nodes in a binary tree that is nearly complete to level 1 as n = 2.sup.l - 1 + K, 0 [is less than or =] k < 2.sup.l.. This is to emphasize that the tree is complete through level l-1 and that k nodes have been placed on level l. The worst-case behavior of the insertion algorithm for an insertion into a nearly complete tree of n nodes is W(n) = 1 + 2k - t(k), where t(k) is the number of ones in the binary representation of k. This can be justified as follows: If an imbalance at a node p occurs during insertion, say to the left, then in the worst case, a procedure call would be made for every node in the complete left subtree of p. The insertion of p*.key into the right subtree of p could also cause an imbalance condition, but only if there is a complete subtree in the right subtree of p that has some leaves at level l that abut the leaves of the left subtree of p. This follows since the key that is displaced out of p is smaller than all keys in the right subtree of p, and therefore it must occupy the leftmost leaf of the right subtree of p and it must be in level l of the whole tree. Continuing in this manner, we see that the worst case can occur only if all k leaves fill consecutive positions at level l and are all descendants of the node where the imbalance condition was first detected. The number of nodes on the path from the root to the actual position where a new node is attached is l = *log.sub.2.(n + 1)*. The number of nodes in the left subtree of p that are also to the left of this path can be shown to be equal to 2k - t(k). Each of the values in these nodes would be displaced in the worst case, and each such displacement requires a procedure call. The number of procedure calls in the worst case equals l + 2k - t(k). The worst case occurs when the k leaves are consecutive on level l, as described, and the new key's attachment position for a regular search tree insertion is at the leftmost leaf. The case for right imbalance is analogous.
We note that any insertion into a tree that is nearly complete to level l requires that at least 1 nodes be inspected. From this discussion we see that the worst-case behavior of the insertion algorithm is linear in the number of nodes in the last level of the tree at the time of the insertion. We see that the worst-case behavior becomes linear in the number of nodes in the tree itself as the last level of the tree nears completion.
Some results of executions of the insertion algorithm are shown in Table I. Data on the number of extra calls needed to maintain balance during insertions was collected during the creation of 99 trees of 1023 nodes each. The mean number of extra procedure calls needed during insertions and the standard deviation are given for various values of n. Various percentiles for the number of extra calls in the 99 data collection runs for the values of n are also shown.
Table I shows the algorithm's efforts to maintain the balance of the number of complete levels in the subtrees of the nodes in a tree increases dramatically as a level nears completion. In the simulations it was found that the number of extra calls made for each insertion when filling levels 8 and 9 corresponded to less than 10 percent of the nodes in the tree until the level being filled was about 80 percent full. By filling level 9 fewer than 10 percent of the nodes were inspected to accomplish a rebalancing action until level 9 was almost 90 percent full. When the last level was level 7 or greater, the percentage of the nodes inspected because of rebalancing requirements was negligible until the last level was more than 40 percent full.
Table II provides a comparison of several local and global balancing algorithms. H.sub.n denotes the minimum height for a tree with n nodes that is *log.sub.2.(n + 1)*. I.sub.n denotes the minimum internal path length for a tree with n nodes that is asymptotically equal to nlog.sub.2.n . The worst height and internal path entries for the local balancing strategies (the top three rows of Table II) are asymptotic behaviors. The entries for the remaining strategies indicate average behaviors.
The last five entries in Table II are global balancing algorithms, the last four are summarized in . Chang and lyengar actually provide two global balancing algorithms in , one that is a parallel algorithm. Both of these algorithms exhibit the properties indicated in the table. The global balancing algorithm given in  is optimal in its use of both time and space. Although Gonnet's internal path reduction algorithm  has a worst case running time that is O(n), its amortized worst case time is O(log.sub.2.n).
We have presented a balancing strategy and insertion algorithm for binary search trees that maintains trees of minimum height and minimum internal path length, and therefore is optimal for searching (assuming equal likelihood for searches). With respect to the searching effort, it produces trees that are comparable to the perfectly balanced trees that are produced by many global balancing algorithms, and it can be expected to produce trees that are more efficiently searched than those produced by local balancing algorithms. Its storage requirements are the same asthose of the local balancing algorithms and slightly more than those of the global algorithms. While the local balancing algorithms have worst case or amortized worst case running times that are O(log.sub.2.n) and the global algorithms have O(n) expected run times, this algorithms has a worst case run time of O(l + 2k), where n = 2.sub.l - 1 + 2k(0 [is less than or =] k < 2.sub.l.), which can be seen to increase from log.sub.2.n to n. The algorithm presented here is a compromise between the algorithms previously mentioned, maintaining optimal search trees with a strategy that tries to localize efforts but, in its worst case, can become global in nature.