Concurrent operations on priority queues.
1. INTRODUCTION Priority queues play important roles in computer applications as diverse as discrete event simulation and finding shortest paths in a graph. The basic operations on a priority queue are enqueue and dequeue (sometimes called insert and delete-min). Enqueue places an item in the queue and dequeue removes and returns the highest priority item from the queue; by convention, higher priorities are represented by numerically smaller values.
Since many of the applications of priority queues are computationally intensive, it is interesting to consider how they migh make effective use of MIMD multiprocessors with global shared memory. When multiple processes communicate through a shared priority queue, most implementations require that processes take exclusive use of the entire queue for the duration of each operation. The time taken for a call to enqueue followed by a call to dequeue on an n item priority queue can be no better than O(logn n) in the worst case; as a result, contention can be a major problem for large queues.
The contention problems resulting from concurrent use of priority queues are very similar to those resulting from concurrent use of search trees. Concurrent implementations of search trees have been intensively investigated during the last decade, primarily in the context of database systems [1, 5, 10]. In these implementations, multiple processes may simultaneously operate on the tree because each process holds exclusive use of only a subset of the items in the tree. As a result, the severity of the contention problem is determined not by the duration of entire tree operations but by the length of time that processes hold exclousive use of particular items.
Most fast priority queue implementations explicitly represent the priority queue as a heap ordered binary tree. These include leftist trees , pagodas  and skew heaps [13, 14]. Other fast priority queue implementations include implicit heaps  and Binomial Queues [4, 15]. All of these have an O(long n) average time for an enqueue followed by a dequeue.
The hold model is frequently used to test the performance of priority queues . Under the hold model, items are repeatedly dequeued and then re-enqueued with a randomly reduced priority; this sequences of operations is known as a hold operation. The hold model allows the average combined time for enqueue and dequeue to be measured as a function of the queue size and random number distribution. Measurements using the hold model on a number of machines indicate that implicit heaps are somewhat slower than the rest of the implementations mentioned, but that the remaining implementations are sufficiently similar that there is no clear best choice .
When these implementations were examined for opportunities to introduce fine grained concurrency control, most were eliminated because they compute the new root of the tree representing the queue only at the very end of the enqueue or dequeue procedures; as a result, these implementations require users to take exclusive use of the root of the tree for the entire duration of each operation.
The implicit heap is disqualified for a different reason: With this implementation, enqueue involves a bottom-up search from a leaf of the (implicit) tree, while dequeue involves a top-down search from the root. If users take exclusive use of individual items in the heap as the operations progress, deadlock can result when a bottom-up search collides with a top-down search (as outlined in , this deadlock may be avoidable). An alternative implicit heap using top-down searches for both enqueue and dequeue does not suffer from this problem, but it was not considered because for the vast majority of access patterns, it is about half the speed of the conventional implicit heap. Another scheme for parallel access to implicit heaps is mentioned in .
2. THE SEQUENTIAL IMPLEMENTATION
The only priority queue implementation to survive this superficial examination was the top-down variant of skew heaps [13, 14]. A priority queue represented as a skew heap is a heap ordered binary tree, that is, a tree where the priority of each item is higher than (numerically less than) the priorities of any descendant of that item (see Figure 1). Dequeue returns the root of the heap and melds its two subheaps, while enqueue treats the item to be enqueued as a one item heap and melds it with the original one. The meld operation combines two heaps into a new heap representing the updated queue. This is done by merging, as in merge-sort, the right branches of the two heaps from the root downward, exchanging the left and right children of each item visited in the process. The exchange of children is used to ensure that the amortized cost of melding will be bounded by O(long n) for an n item tree.
The concurrent version of the skew heap priority queue implementation was produced by a sequence of transformations starting with Sleator and Tarjan's recursive version presented on page 238 of . A Pascal transliteration of this code is given in Figure 2. In this transliteration, some changes were made; these include moving the rxmeld function into position as a local function in rmeld, unfolding parallel assignments, renaming some of the variables, and moving the responsibility for allocating and deallocating items to the user. This code assumes that users will distinguish between variables of type queue, holding priority queues, and variables of type ref, holding pointers to items that can be enqueued; the fact that their representation is the same should be viewed as coincidental.
Although the code in Figure 2 computes the values to be returned by rxmeld before the recursive call, the actual value is not put in place until after the call. As a result, the user process must hold exclusive use of the location that is to receive the return value for the entire duration of the meld operation. This problem can be eliminated by converting all of the functions in the original version to procedures that return their results using parameters passed by reference. The result of this transformation is shown in Figure 3. This version of enqueue has a different specification from that in Figure 2. Unlike the original, it modifies the caller's pointer to the queue; the original returned a new pointer to the head of the updated queue. In fact, the original version modified the data structure the old pointer referred to, so enqueue could never be viewed as a pure function on the priority queue abstract data type. Because of this, the change to enqueue should not have any effect on its utility.
The code in Figure 3 is recursive, but a good compiler could generate fast iterative code by taking advantage of the fact that the recursive call is the last operation in the procedure (tail recursion). Converting this to iterative code in Pascal almost doubles the size of the program because Pascal does not consider parameters by reference to be equivalent to pointer variables. In languages such as Algol 68 and C, where pointers and parameters passed by reference are equivalent, iterative code can be produced that is as compact as the recursive version.
3. THE CONCURRENT IMPLEMENTATION
The concurrent version of the skew heap priority queue implementation given here follows from the observation that the iterative version in Figure 3 never holds references to more than three items in the tree at anytime: The pointers q1 and q2 to the roots of the subtrees being melded, and q, which references either the root or the left pointer of the leftmost item in the updated tree. As long as the process performing a queue operation holds exclusive use of these three items, multiple processes should be able to perform simultaneous operations on the queue. The code in Figure 4 does this, using Dijkstra's P and V operations for synchronization. The semaphores named mutex in each variable of thype item or queue in Figure 4 are used for mutual exclusion. Each variable of thype queue also contains count, a semaphore used to solve the producer-consumer problem (this may be inappropriate for some applications). Initial values for these semaphores are give in comments on the declarations; they are assumed to be manipulated only by the code shown. Any number of processes may share any number of queue variables and apply arbitrary sequences of enqueues and dequeues to them, but no processes will ever reference or manipulate items currently in some queue.
Early in the development of his code, it was hoped that additional concurrency could be allowed by guarding each pointer in the data structure with its own semaphore. This allows exclusive use of the right pointer to be released before the recursive call on line 7 in Figure 4, while exclusive use of the left pointer must be retained until line 6 in the recursive call. Unfortunately, early release of the right pointer does not lead to any gain, since both pointers are always claimed at the same time.
It may appear that the mutex semaphores in the root item in the queue and in the variable of type queue are redundant, since two mutual exclusion semaphores should never be needed to protect one object. In fact, this is not the case; mutex in a variable of type queue protects not the head item of the queue, but the pointer head itself. similarly, mutex in each item protects the pointers left and right stored in the item, not the item itself.
This code was converted to Concurrent Pascal-S, with a single queue shared by 3 producers and 3 consumers. The producer and consumer processes were trivial, consisting of tight loops moving data between a free list and the queue. Priorities were assigned in numerically increasing order, resulting in first-in first-out queue behavior. With 25 items initially in the free list, and an average queue length of 20 items, there were usually 2 processes active in the rmeld procedure at any instant (the maximum observed in 3,000 dequeues was 5, the maximum observed in any 100 consecutive dequeues was usually 4). Thus, even for small queues, this code introduces significant opportunities for concurrency.
The correctness of the sequential parts of the code in Figure 4 follows from the correctness of Sleator and Tarjan's original code and from the simple transformations that were applied to it. The correctness problems that remain are those involving interactions between concurrent processes.
Intuitively, each process performing an enqueue or dequeue can be viewed as holding a "bubble" of mutual exclusion that enters the tree at the root an percolates along branches of the tree toward the leaves. The fact that the bubble of mutual exclusion always enters the tree at the root and travels along branches toward the leaves ensures that there will never be a deadlock. All changes a process makes to the tree structure are within this bubble, and the bubble always holds exclusive use of at least one and never more than two items. The latter is ensured by the ordering of lines 5 and 6 in Figure 4. Items are added to the bubble (with P operations on the associated semaphores) before any pointers they hold are inspected or modified. Items are never released from the bubble (with a V operation) until all changes to the item have been made. Note that the code never modifies the priority of an item in the queue so it is unnecessary to include items in the bubble in order to check their priorities.
The following two invariants describe the bubble of mutual exclusion on entry to rmeld and rxmeld (lines 1 and 8 in Figure 4): First, the current process has exclusive use of the pointer variable referenced by q, so mutex (the associated semaphore) is zero. Second, no other process has exclusive use of any variable holding a pointer to the items referenced by q1 and q2. Figure 5 illustrates this.
In the calls to rmeld, the first invariant is ensured on lines 9 and 10 of Figure 4 by P operations on the mutual exclusion semaphore associated with the head of the queue. Within rmeld, this invariant is not modified on the path to rxmeld, and within rxmeld, it is ensured prior to the recursve call to rxmeld on line 7 by the P operation on line 5.
The second invariant is ensured in enqueue by the P operation on line 9 and by the assumption that the current process is the only one entitled to enqueue the new item. In dequeue, it is ensured by the P operation on line 11 that takes exclusive use of the old head item in the queue. Curiously, it is perfectly safe to release exclusive use of the old head immediately, on line 12! The empty critical section bounded by lines 11 and 12 may appear to be useless code, but it is essential to ensure that any other process modifying the old head of the queue finishes with it before the dequeting process is allowed to continue. In fact, the V operation on line 12 is not strictly needed; exclusive use of the old head item in the queue can be released at any time, as long as it is released before the item is re-enqueued.
The second invariant is not modified on the path from rmeld to rxmeld, and it is preserved on the path from the entry of rxmeld to the recursive call (line 7 in Figure 4). Exchanging q1 and q2 (lines 2 to 3) does not modify the invariant. If any pointer to q2 remains in the queue, it is destroyed by the assignment on line 4 prior to the V operation on the associated semaphore (line 6). Finally, the new value of q1 is only fetched after the P operation on the semaphore guarding it (line 5).
The user of a shared instance of an abstract data type such as a priority queue will usually assume that the operations on it are serializable. This implies that, if multiple operations are applied to the shared variable, the operations can be described as taking place in some specific order; if a single process were to perform these operations in the same order, it would get the same results. This is trivially true when each operation on the shared object takes exclusive use of the entire object, as is the case when monitors are used. When operations take exclusive use of only parts of a shared object, on the other hand, the situation can be much more complex. For example, in, it was necessary to distinguish between simple serializability and user view serializability in order to show that operations on shared search trees were serializable.
For the concurrent skew heap implementation, the situation is simple. Operations on the skew heap performed by multiple processes are serialized in the order that the processes take exclusive use of the pointer to the root. This follows from the fact that the bubble of mutual exclusion held by a process isolates all items in the queue that may need to be inspected by that process from all processes starting queue operations at a later time. Processes that arrive later may only modify those parts of the data structure that no longer concern the current process, and modifications made by processes that arrived earlier will have been completed by the time the current process inspects any of the affected items. Thus, even though the entire data structure may never exist in the state resulting from one operation prior to the start of the next, the necessary parts of the data structure will be in this state at the time the next operation needs them.
6. CONCLUDING REMARKS
Most of the binary semaphores used in a shared skew heap will have very low contention, assuming that the processes involved run at comparable speeds. This is because the left and right pointers of each item visited are exchanged during melding, thus forcing later operations to visit different parts of the heap. The only binary semaphore for which there is likely to be much contention is the one guarding the pointer to the root. This semaphore will rarely be held for long, since it will be released on line 6 of rxmeld in Figure 4 prior to the first recursive call. As a result, the most appropriate binary semaphore implementation for this implementation on a MIMD machine is probably a busy wait on an exchange or test-and-set instruction.
The above argument suggests that this implementation reduces the time that any process will block others from starting queue operations from O(log n) to O(1) in a MIMD environment where all processors run at comparable speeds. The maximum time that a process will hold exclusive use of the pointer to the root is determined by the path from the P on line 10 of Figure 4 to rmeld to rxmeld through the exchange operation on lines 2 and 3 to the V on line 6. Along this path, there are 5 assignment statements, 3 tests, 2 calls, 2 P operations, and a V operation. As argued above, these P operations should rarely block.
This priority queue implementation can be used to speed up algorithms for finding such things as minimum cost spanning trees or shortest paths in a graph. These can be expressed in terms of a process that cyclically dequeues information about a graph vertex, updates the graph, and enqueues information about some set of vertices. The classic discrete event simulation algorithm can also be described in these terms: A process cyclically dequeues an event notice, updates the state of the simulation model, and enqueues new event notices. The code in Figure 6 outlines the structure common to these problems.
The bulk of the priority queue manipulation needed by this code can be done concurrently by modifying enqueue and dequeue so that they spawn new processes with each call to rmeld. Each of the new processes finishes one priority queue operation and then terminates. This involves the dynamic creation of a large number of processes; on real systems, this is unfortunately expensive, and it rarely pays to create more processes than there are processors. Fortunately, this solution can be restructured in terms of a fixed number support processes, each of which cyclically performs meld operations requested by the main process.
In order to test this idea, the Concurrent Pascal-S implementation was modified to include 4 support processes, each of which performs rmeld operations at the request of either enequeue or dequeue. A main process was written that repeatedly performed hold operations. During 100 hold operations on a 25 item queue, an average of 2.4 of the support processes were busy; 3 or more support processes were busy 40 percent of the time, and there was always enough work to keep one support process busy. The extent that useful applications can be speeded up by this approach remains to be investigated, but this data suggest that, at least for large queue and simple applications, significant speedup is possible.
This solution can only make effective use of concurrency when the priority queue operations are more expensive than the operations in the body of the main process. When this is not the case, effective use of multiple processors will require multiple concurrent instances of the main loop illustrated above. In, a solution to the discrete event simulation problem in these terms is outlined. This introduces a number of new synchronization constraints that replace the semaphore named count in Figure 4. This approach to discrete-event simulaion is currently being tested.
|Printer friendly Cite/link Email Feedback|
|Author:||Jones, Douglas W.|
|Publication:||Communications of the ACM|
|Date:||Jan 1, 1989|
|Previous Article:||Fast parallel thinning algorithms: parallel speed and connectivity preservation.|
|Next Article:||News track.|