# A General Approach to Dynamic Packet Routing with Bounded Buffers.

Abstract. We prove a sufficient condition for the stability packet routing algorithms. Our approach reduces the problem of steady state analysis to the easier and better understood question of static routing. We show that certain high probability and worst case bounds on the quasi-static (finite past) performance of a routing algorithm imply bounds on the performance of the dynamic version of that algorithm. Our technique is particularly useful in analyzing routing on networks with bounded buffers where complicated dependencies make standard queuing techniques inapplicable.We present several applications of our approach. In all cases we start from a known static algorithm, and modify it to fit our framework. In particular we give the first dynamic algorithms for routing on a butterfly or two-dimensional mesh with bounded buffers. Both the injection rate for which the algorithm, is stable, and the expected time a packet spends in the system are optimal up to constant factors. Our approach is also applicable to the recently introduced adversarial input model.

Categories and Subject Descriptors: C.2.1 [Computer-Communication Networks]: Network Architecture and Design; C.2.2 [Computer-Communication Networks]: Network Protocols; F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems; G.2.2 [Discrete Mathematics]: Graph Theory

General Terms: Algorithm, Theory

1. Introduction

The rigorous analysis of the dynamic performance of routing algorithms is one of the most challenging current goals in the study of communication networks. So far, most theoretical work on this area has focused on static routing: A set of packets is injected into the system at time 0, and the routing algorithm is measured by the time it takes to deliver all the packets to their destinations, assuming that no new packets are injected in the meantime (see Leighton [1992] for an extensive survey). In practice, however, networks are rarely used in this "batch" mode. Most real-life networks operate in a dynamic mode whereby new packets are continuously injected into the system. Each processor usually controls only the rate at which it injects its own packets and has only a limited knowledge of the global state.

This situation is better modeled by a stochastic paradigm whereby the packets are continuously injected according to some interarrival distribution, and the routing algorithm is evaluated according to its long-term behavior. In particular, quantities of interest are the maximum arrival rate for which the system is stable (i.e., the arrival rate that ensures that the expected number of packets waiting in queues does not grow with time), and the expected time a packet spends in the system in the steady state. The performance of a dynamic algorithm is a function of the interarrival distribution. The goal is to develop algorithms that perform close to optimal for any interarrival distribution.

Several recent articles have addressed the dynamic routing problem, in the context of packet routing on arrays,(1) on the hypercube and the butterfly [Stamoulis and Tsitsiklis 1991] and general networks [Scheidler and Voecking 1996]. Except for Broder and Upfal [1996], the analyses in these works assume a Poisson arrival distribution and require unbounded queues in the routing switches (though some works give a high probability bound on the size of the queue used [Leighton 1990; Kahale and Leighton 1995]). Unbounded queues allow the application of some tools from queuing theory (see [Harcol-Balter and Black 1994; Harcol-Balter and Wolf 1995]) and help reduce the correlation between events in the system, thus simplifying the analysis at the cost of a less realistic model.

Here, we focus on analyzing dynamic packet routing in networks with bounded buffers at the switching nodes, a setting that most accurately models real networks. Our goal is to build on the vast amount of work that has been done for static routing in order to obtain results for the dynamic situation. Rather than produce a new analysis for each routing network and algorithm, we develop a general technique that "reduces" the problem of dynamic routing to the better understood problem of static routing.

In Section 2, we prove a general theorem that shows that any communication scheme (a routing algorithm and a network) that satisfies a given set of conditions, defined only with respect to a finite history is stable up to a certain interarrival rate. Furthermore we bound the expected routing time. At first glance these conditions seems very restrictive and hard to satisfy, but in fact, as we show later, many of the previous results on static routing can be easily modified to fit into our framework. The theorem applies to any interarrival distribution: the stability results and the expected routing time of a packet inside the network depend only on the expectation of the interarrival distribution. The relationship between the interarrival distribution and the waiting time in the input queues is more complicated and is formulated in the theorem.

In Sections 3, 4, and 5, we present three applications of our general theorem to packet routing on the butterfly network. In Section 6, we present an application to packet routing on a mesh. We assume that packets arrive according to an arbitrary interarrival distribution and have random destinations. In Section 7, we present similar results for an alternative input model, the adversarial model [Borodin et al. 2001], whereby probabilistic assumptions are replaced by a deterministic condition on edge congestion.

Section 3 presents the first dynamic packet routing algorithm for a butterfly network with bounded buffers under constant injection rate. Our algorithm is stable for any interarrival distribution with expectation greater than some absolute constant. The expected routing time in an n-input butterfly is O(log n) and in the case of geometric interarrival times the expected time a packet spends in the input queue is also O(log n). Thus, the performance of the algorithm is within constant factors from optimal in all parameters. Our dynamic algorithm is based on the static routing results of Ranade [1987] and Maggs and Sitaraman [1992].

The above algorithm is not a "pure" queuing protocol (in such a protocol packets always move forward unless progress is impeded by an already-full queue) because as in the algorithms devised in Ranade [1987] and Maggs and Sitaraman [1992] it generates and uses extra messages and mechanisms to coordinate the routing. Maggs and Sitaraman [1992] studied the question of a "pure" queuing protocol routing with bounded buffers. They gave an algorithm that routes n packets on an n input butterfly with bounded buffers in O(log n) steps. Based on their technique, we develop in Section 4 a simple greedy algorithm for dynamic routing. It is stable for any interarrival distribution with expectation [Omega](log n), the routing time is O(log n), and in the case of a geometric interarrival distribution the expected wait in the queues is also O(log n).

In Section 5, we apply our approach to a dynamic version of the simple oblivious routing algorithm on the butterfly described in Upfal [1984] and Leighton [1992]. This algorithm routes n log n packets (all logarithms in this paper are base 2) on an n log n butterfly in expected O(log n) steps, and with high probability no buffer has more than O(log n) packets. Our dynamic version of this algorithm uses a butterfly with buffers of size O(log n) and is stable for any interarrival distribution with expectation greater than some absolute constant. The expected routing time is O(log n) and the expected time a packet waits in a queue in the case of geometric interarrival distribution is also O(log n). Note that for dynamic routing, which is an infinite process, it does not suffice to have a high probability bound on the size of the buffer memory needed at a given time: we must prove that the algorithm is stable for some fixed buffer size.

The result of Section 2 is couched in terms of a general network, but the above examples are all concerned with the butterfly network. In Section 6, we exhibit the generality of the result by applying it to an n x n mesh. Our dynamic algorithm is based on the Greedy Algorithm of Section 1.7 of Leighton [1992]. Our algorithm is stable for any interarrival distribution with expectation at least Cn for some fixed constant C. The expected time a packet spends in the network is O(n). In the case of Poisson arrival (geometric interarrival distribution) the expected time the packet spends in the queue is also O(n). This is optimal up to constant factors. Leighton [1990] studied this problem and obtained similar results provided that the buffers in the routing switches are unbounded. More precisely, Leighton's algorithm ensures that at any fixed time, with high probability, no routing queue has more than 4 packets. However, for any sufficiently long execution, the maximum size of any queue exceeds any given bound. Our results build on Leighton's analysis, by augmenting his algorithm with a simple flow control mechanism, which ensures that every routing queue is bounded at all times, and thus only finite buffers are needed.

In an attempt to avoid probabilistic assumptions on the input, Borodin et al. [2001] defined the adversarial input model. Instead of probabilistic assumptions, for any time interval there is an absolute bound on the number of generated packets that must traverse any particular edge. Surprisingly, our general technique can be applied here as well. In Section 7, we briefly sketch how the results of Sections 3-6 can be extended to this model.

These examples demonstrate several ways of applying our scheme. The analysis required is similar to the analysis used in the proof of the corresponding static case with several small modifications. Most notably, as often done in practice, we sometimes augment the original static algorithm with a simple "flow control" mechanism, such as acknowledgments. Our general theorem can be applied to other topologies and algorithms provided that an appropriate static case analysis can be constructed.

2. The Stability Criterion

Our model is as follows: We are given a routing algorithm A acting on a network [Gamma](n) with n inputs and n outputs. Each input receives new packets with an interarrival distribution F. We distinguish between usual and unusual distributions. We first describe the situation for usual distributions. By this we mean that the probability that the number of arrivals in any time period significantly exceeds its expectation falls off exponentially. A more precise definition is left until later. In the usual case, the packets are placed into an unbounded FIFO queue at the input node. Packets have an output destination chosen independently and uniformly at random. When a packet reaches the front of its queue, it is called active. At some point after becoming active, the packet is removed from its queue and eventually routed to its destination. For convenience, we assume that a packet chooses its random destination upon becoming active.

In an arbitrary input distribution, we modify our routing scheme as follows. We maintain at each input node v two queues, [Q.sub.1] and [Q.sub.2]. On arrival, packets are placed in [Q.sub.1]; the front packet in [Q.sub.1] leaves it to [Q.sub.2] according to a geometric service time at a rate greater than the arrival rate of F; then [Q.sub.2] feeds the network as above. The precise details are discussed in Theorem 2.1 below.

We are interested in determining under what conditions the queuing system is ergodic (or stable), that is, under which conditions the expected length of the input queues is bounded as t [right arrow] [infinity]. To this purpose we have to study the interdeparture time, which is the interval from when a packet becomes active until it leaves the queue, and the packet next in line (if any) becomes active. Besides stability, we are also interested in the expected time a packet spends in the queue, and the expected time it spends in the network.

Since the interarrival times are independent, if the interdeparture times are also independent, then each queue can simply be viewed as a G/G/1 system and the stability condition would trivially be that the interdeparture rate exceeds the interarrival rate. However, the usual situation is that there are complex interactions among packets during routing and thus the interdeparture times are highly dependent and hard to analyze.

The goal of this section is to define a set of relatively simple sufficient conditions such that if the routing algorithm satisfies them, then the system is stable up to a certain interarrival rate and we can bound the expected time a packet spends in the queue and in the network. This is captured in the following. We assume that the system is empty of packets at time t = 0. We use [H.sub.t] to denote the history of the process up to time t, that is, the outcome of the random choices made at times 1 through t.

Let [Mu] denote the expected interarrival time and then p = 1/[Mu] is the interarrival rate.

THEOREM 2.1. Assume that the randomized routing algorithm A acting on the network [Gamma](n) is characterized by four parameters a, b, m, and T, where a and b are positive constants, and m and T are positive integers that might depend on n and satisfy 1/[n.sup.a] [is less than] m/T [is less than] 1 and T [is less than] [n.sup.b]. Assume that the algorithm satisfies the following conditions:

(1) Every packet is delivered at most [n.sup.a] steps after becoming active.

(2) For every time [Tau] [is greater than or equal to] 0, there exists an event [E.sub.[Tau]] with the following properties:

(a) [??E.sub.[Tau]] implies that any packet that at time [Tau] was among the first m packets in its queue, is delivered before time [Tau] + T.

(b) For any fixed time [Tau],

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

(c) [E.sub.[Tau]] is determined by [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Thus, for any k [is greater than or equal to] 1,

(1) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

If there exists a positive constant [Epsilon] such that the interarrival distribution F has an interarrival rate smaller than (1 - [Epsilon])m|T, then

(1) The system is stable.

(2) The expected elapsed time between when a packet becomes active and it is delivered is at most 2T + O(1).

(3) The expected time a packet spends in the input queue is bounded by O(T) + f(T/m), where f is a function that depends only on F and not on the routing process. (For "usual" distributions such as geometric f(T/m) = 0 (T/m)).

PROOF. Assume first that the interarrival time is geometric, that is, at each step, each input receives a new packet with some fixed probability p [is less than] (1 - [Epsilon])m/T. (We will show later how to extend the proof to a general interarrival distribution).

Fix an input v and let Q(t) denote the length of the queue at node v at time t. Let

[Pi](t, L) = Pr(Q(t) [is greater than or equal to] L).

We show that the system is stable by proving a uniform bound, independent of t, on [Pi](t, L). Let

[Beta] = [Epsilon]/4 m/T and U = [(T/m).sup.3] [n.sup.a+b+1].

(Hence U [is greater than] m.) We will establish the bound using the following two inequalities:

--For L [is greater than or equal to] U

(2) [Pi](t, L) [is less than or equal to] [Pi](t - ?? L/2 ??, (1 + [Beta]) L) + [Delta]exp(-[Gamma]L).

--For m [is less than or equal to] L [is less than] U

(3) [Pi](t, L) [is less than or equal to] [Pi](?? [(t - 2U/[Epsilon]p).sup.+] ??) + 2U/[Epsilon]p [B.sub.E] + exp(-[Phi]pL).

--(Define as usual [x.sup.+] to be max{0, x}.)

where [Gamma] = [Omega]([(m/T).sup.3]/[n.sup.a+b]), [Delta] = O([n.sup.b]), and [Phi] is a positive constant. Since [Pi](t, L) = 0 for t [is less than] L, these inequalities imply that for L [is greater than or equal to] U,

[Pi](t, L) [is less than or equal to] [Delta]exp(-[Gamma]L) + [Delta]exp(-(1 + [Beta])[Gamma]L) + [Delta]exp(-[(1 + [Beta]).sup.2][Gamma]L) + ... [is less than or equal to] [Delta]exp(-[Gamma]L)(1 + exp(-[Beta][Gamma]L) + exp(-2[Beta][Gamma]L) + ...) = [Delta]exp(-[Gamma]L)/1 - exp(-[Beta][Gamma]L).

and that for m [is less than or equal to] L [is less than] U,

[Pi](t, L) [is less than or equal to] [Delta]exp(-[Gamma]U)/1 - exp(-[Beta][Gamma]U) + 2U/[Epsilon]p [B.sub.E] + exp(-[Phi]pL).

Combining the two bounds, we get

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Since this holds for any interarrival rate bounded by (1 - [Epsilon])m/T, by Little's Theorem the expected time a packet spends in the queue is O(T).

We now turn to proving the recurrence (2). Since the inequality is trivially true for t [is less than] L, assume that t [is greater than or equal to] L. Let [t.sub.0] = t - ?? L/2 ??. Let I denote the number of packets arriving at input v between [t.sub.0] and t, and let J denote the number of packets leaving the queue at v during this interval. Let [s.sub.i] denote the interdeparture time of the ith packet to become active at v after time [t.sub.0], that is, the interval from when this packet reaches the front of queue until it departs. (If there was an active packet at time [t.sub.0] then [s.sub.1] denotes how long it took that packet to depart.) Let

M = ?? (1 - [Beta] m/T L/2 ??.

We claim that if Q(t) [is greater than or equal to] L, then at least one of the following three events holds:

[F.sub.a] [equivalent] Q([t.sub.0]) [is greater than or equal to] (1 + [Beta])L. (Large initial queue.)

[F.sub.b] [equivalent] I [is greater than or equal to] (1 + [Beta])p L/2 - 1. (Excessive number of new arrivals.)

[F.sub.c] [equivalent] [s.sub.1] + [s.sub.2] + ... + [s.sub.M] [is greater than] L/2. (Slow processing.)

Indeed assume ?? [F.sub.a], ?? [F.sub.b], and ?? [F.sub.c] and consider two cases:

Case 1. Q([t.sub.0]) [is greater than] L/2. This means that at time [t.sub.0] the queue contained more than M packets, and ?? [F.sub.c] implies that M packets left the queue by time t = [t.sub.0] + ?? L/2 ??. Thus, J [is greater than or equal to] M, and

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Case 2. Q([t.sub.0]) [is less than or equal to] L/2. Then

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Thus, in order to prove the recurrence (2) it suffices to show that for L [is greater than or equal to] U

(4) Pr([F.sub.b]) [is less than or equal to] exp(-[Gamma]L)

and that

(5) Pr([F.sub.c] [conjunction] (Q(t) [is greater than or equal to] L)) [is less than or equal to] O([n.sub.b])exp(-[Gamma]L)

Equation (4) follows immediately from standard bounds on the binomial distribution

Pr([F.sub.b]) = Pr(I [is greater than or equal to] (1 +[Beta])p L/2 -1) [is less than or equal to] exp(-[[Beta].sup.2]pL/7).

To prove Eq. (5) note that if at any time during [[t.sub.0], t] the queue at v contains less than m packets, then Q(t) [is greater than or equal to] L only if I [is greater than or equal to] L - m and the probability of the latter can be bounded as above. So let's assume that for all [Tau] [element of] [[t.sub.0], t], we have Q([Tau]) [is greater than or equal to] m.

Let now z denote the number of occurrences of [E.sub.[Tau]] during [[t.sub.0], t]. By the hypothesis of the theorem [s.sub.1] + [s.sub.2] + ... + [s.sub.M] [is less than] [Mn.sup.a]. We partition the interval [[t.sub.0], [t.sub.0] + [Mn.sup.a]] into [2n.sup.b] sets, [T.sub.1], [T.sub.2], ..., T [2n.sup.b] where [T.sub.i] = {[t.sub.0] + i - 1 + [2kn.sup.b]: 0 [is less than or equal to] k [is less than or equal to] ?? ([Mn.sup.a] - i + 1)/([2n.sup.b]) ??}. Let [z.sub.i] denote the number of occurrences of [E.sub.[Tau]] for [Tau] [element of [T.sub.i]. Note that if packet i becomes active at time [Tau] and if ?? [E.sub.[Tau]] then we have [s.sub.i] + [s.sub.i+1] + ... + [s.sub.i+m] [is less than or equal to] T; if [E.sub.[Tau]], we can use the bound [s.sub.i] [is less than or equal to] [n.sup.a]. Thus, we have the following series of implications:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

It follows from (1) that

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

So

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

This completes the proof of recurrence (2) and we turn to recurrence (3). If t [is less than] L/(2p), then for some constant [Phi],

Pr(Q(t) [is greater than or equal to] L) [is less than or equal to] Pr(L packets arrive in [0, L/(2p)]) [is less than or equal to] exp(-[Phi]L).

Hence, assume that t [is greater than] L/(2p) and let [t.sub.0] = ?? [(t - 2U/([Epsilon]p)).sup.+] ??. Define the following three events:

[F.sub.a] [equivalent] Q([t.sub.0]) [is greater than or equal to] U.

[F.sub.b] [equivalent] The event [E.sub.[Tau]] does not occur for any [Tau] [element of] [[t.sub.0], t].

[F.sub.c] [equivalent] The queue at v receives at most (1 - ([Epsilon]/2))m/T [Theta] new packets in any interval

[t - [Theta], t] with [Theta] [is greater than or equal to] L/2p.

We bound Pr(Q(t) [is greater than or equal to] L) via the inequality

(6) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

By definition

Pr([F.sub.a]) = [Pi]([t.sub.0], U).

Clearly

Pr(?? [F.sub.b] [is less than or equal to] (2U/[Epsilon]p + 1) [B.sub.E],

and since (1 - ([Epsilon]/2))(m/T) [Theta] [is greater than or equal to] (1 + ([Epsilon]/2))p [Theta]

(7) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

for a constant [Phi].

Now assume ?? [F.sub.a], [F.sub.b], and [F.sub.c] and notice that if [F.sub.b] holds, then as long as the queue is not empty it loses at least m packets in any interval of T steps. If Q(t) [is greater than or equal to] L, we claim that these assumptions imply that there is a step in the interval [[t.sub.0], t] in which the queue is empty; otherwise

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

which is less than m since if [t.sub.0] = 0 then Q([t.sub.0]) = 0, and otherwise t - [t.sub.0] = ?? 2U/([Epsilon]p) ?? and Q([t.sub.0]) [is less than or equal to] U - 1.

Thus, under the assumptions ?? [F.sub.a], [F.sub.b], and [F.sub.c], if there are L packets in the queue at time t, then there is an interval [t - [Theta]', t], such that

(i) the queue was empty at time t - [Theta]' - 1;

(ii) the queue was not empty in any step in the interval [t - [Theta]', t]

(iii) at least L + m ?? [Theta]'/T ?? [is greater than] L + (m[Theta]'/T) - m new packets arrived at the queue in that interval.

But if L [is greater than or equal to] m and [Theta]' [is greater than] L/(2p) then (iii) contradicts [F.sub.c]. So we only have to consider the probability that (iii) holds for an interval with L [is less than] [Theta]' [is less than] L/(2p). This is bounded by

(8) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

This completes the proof of Eq. (3).

Let us now see how to go from a geometric interarrival distribution to something more general. We observe that in the proof above the interarrival distribution is only required to satisfy (4), (7), and (8). Suppose that the interarrival time is a random variable X with distribution F. Let p = 1/E(X) [is less than] 1. We say that F is usual if there exist constants [A.sub.0] and [A.sub.1] such that in any interval of length t, the number N of arrivals satisfies

Pr(N [is greater than or equal to] (1 + [Epsilon])pt) [is less than or equal to] [A.sub.0]exp([-A.sub.1][[Epsilon].sup.2]pt)

for any 0 [is less than or equal to] [Epsilon] [is less than or equal to] 1. Clearly, if F is usual, then our proof will go essentially unchanged provided that p [is less than] (1 - [Epsilon])m/T.

Assume finally that the arrival of packets to the queue is governed by some arbitrary interarrival distribution F. Let [Q.sub.1] and [Q.sub.2] be the two queues in front of a generic node v, as described at the beginning of this section. We move packets from the front of [Q.sub.1] to the end of [Q.sub.2] with probability p = (1 - [Epsilon])m/T. Our analysis has shown that [Q.sub.2] is stable, and that the expected wait in [Q.sub.2] is O(T). The queue [Q.sub.1] is a G/M/1 queue. Thus, if the expected interarrival time to [Q.sub.1] is smaller than p, then the queue is stable and the expected waiting time in [Q.sub.1] is determined (see Kleinrock [1975] for details) by the distribution F, as follows: Let x be the nontrivial (i.e., x [is not equal to] 4: 1) root of the equation (the Laplace transform)

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

The expected wait in the queue is then x/(p(1 - x)).

We now bound the expected time that a packet takes to reach its destination once it becomes active. Consider a long interval of time [0, L]. Suppose that [x.sub.a] packets arrive during this interval and [E.sub.[Tau]] occurs for [x.sub.b] values of [Tau]. Given this, the average time that a packet takes to reach its destination once it becomes active is at most

(x.sub.a] - [x.sub.b]) T + [x.sub.b][n.sup.a]/[x.sub.a] [is less than or equal to] T + [x.sub.b][n.sup.a]/[x.sub.a].

Now

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Consequently, the expected time that a packet takes to reach its destination once it becomes active is at most 2T as claimed. []

In the next four sections, we deal with applications of Theorem 2.1 to the cases where the underlying topology is (i) a butterfly with log n levels (rows) of n nodes (switches) or (ii) an n x n mesh. In both cases, there are buffers on edges and unbounded queues at input vertices. We show stability for several protocols under suitable assumptions about input rate and internal buffer size. We will explicitly consider geometric interarrival distributions. The general case is implicitly dealt with as in the proof of the main theorem.

3. Dynamic Routing on a Butterfly with Constant Injection Rate and Bounded Buffers

For this section, we assume that the buffer size q is a sufficiently large constant. We first fix m = [Theta](log n) and we will subsequently describe a protocol and define E, T, a, and b to satisfy the conditions of Theorem 2.1.

Our approach is based on the second algorithm of Maggs and Sitaraman [1992] and in places we follow their description very closely. This algorithm uses tokens whose main role is to define a wave number for each packet. We will assume that tokens occupy the same amount of space as a packet. Imagine that behind each input node queue there is an infinite sequence of tokens, packets and blanks. The odd positions are always taken by tokens and the even positions contain packets or blanks, where the packets occur randomly with probability p. The tokens are labeled 1, 2,.... The label of a token is referred to as its wave number. As opposed to Maggs and Sitaraman [1992], we actually use these labels within the algorithm, not only in its analysis.

At each time step, we examine the front of the sequence. If it is blank, then we simply delete this blank and go to the next time step. If there is a token or packet, then we delete it from the sequence and place it in the back of the input queue. The front element (which could be a packet or a token) of the queue tries to enter the network only if it is eligible (we define this subsequently). An eligible packet enters the system if the buffer on the edge that it intends to use is, or becomes not full during the current time step. Upon entrance into the network a token splits into two tokens, one for each outgoing edge. Thus, both buffers need to have space before an eligible token can enter.

The wave number w([Pi]) of packet [Pi] is the wave number of the token that immediately precedes it in entering the network. The rank of a packet is a pair (w, c) where w is the wave number and c is the column number of its input. The rank of a token is given by its wave number. Ranks are ordered lexicographically.

An important invariant of the algorithm is that packets go through a switch in increasing order of rank.

A switch labeled (l, c = [c.sub.0], [c.sub.1], ..., [c.sub.L-1]) where l is the level and L = log n, has a 0-edge entering it from switch (l - 1, c - [c.sub.l][2.sub.l-1]) and a 1-edge entering it from switch (l - 1, c - ([c.sub.l] - 1)[2.sup.l]). The buffer of the i-edge is called the i-buffer.

The behavior of each switch is governed by a simple set of rules. By forwarding a packet or token, we mean sending it to the appropriate queue in the next level. If that queue is full, the switch tries again in consecutive time steps until it succeeds. A switch can either be in 0-mode or 1-mode and is initialized to be in 0-mode. In i-mode, a switch forwards packets in the i-buffer until a token is at the head of the i-buffer. At that time, if i = 0, then the switch simply changes to 1-mode; otherwise, if i = 1, then there will be tokens at the front of both queues and the switch waits until it can forward both tokens, each to one of its outgoing edges. (These tokens have the same wave number). It then switches back to 0-mode.

It will be important in the subsequent analysis to ensure that if [Pi] and [Pi]' are packets or tokens residing simultaneously in the network then ||w(Pi) - w(Pi')|| [is less than or equal to] A log n for some constant A [is greater than] 0. This is achieved as follows: At every time step, every output node generates two chips. The 2n chips generated at time t will be referred to as generation t. Each generation travels back through the network one level at a time. The chips make their journey so that each chip occupies a different edge at each step. By the time a chip of generation t has reached a switch s, it has iteratively computed the lowest wave number of any packet/token which left the network at time t from an output node reachable from s. Thus, when generation t reaches the input nodes, each input node knows the lowest wave number w*(t) of any packet/token that left the network at time t. This happens at time t + log n. w* is initialized at zero and if no packet leaves the network at time t then w*(t) = w*(t - 1). Note that if [Pi] is a packet/token which is in the network at time t or later then w([Pi]) [is greater than or equal to] w*(t) since packets go through network switches in increasing order of rank.

At time [Tau], a packet/token [Pi] will be eligible to enter the network, only if

w([Pi]) [is less than or equal to w*([Tau] - log n) + A log n.

It follows that if [Pi] is any packet/token already in the network at time [Tau], or eligible at time [Tau], then

(9) w*([Tau] - log n) [is less than or equal to] w([Pi]) [is less than or equal to] w*([Tau] - log n) + A log n.

We focus now on one of the first m packets of a queue at time [Tau]. Denote it by [Pi]. Assume for the time being that [Pi] is eligible at time [Tau]. Maggs and Sitaraman define a delay sequence of packets and tokens in a familiar way--an (r, f) delay sequence consists of (i) a path P from an output node to an input node, (ii) a sequence [s.sub.1], [s.sub.2], ..., [s.sub.r] of not necessarily distinct buffers, (iii) a sequence [[Pi].sub.1], [[Pi].sub.2], ..., [[Pi].sub.r] of distinct packets and tokens and (iv) a nonincreasing sequence [w.sub.1], [w.sub.2], ..., [w.sub.r] of wave numbers. The wave numbers of the tokens are shown to decrease strictly as one moves along the delay path, in fact they decrease by one from one token to the next. The length [Lambda] = [Lambda](P) is equal to 2f - log n where f is the number of forward edges of the path. It is a little confusing at first, but here forward edges go from level x to level x + 1, for some x, and are traced backwards by P, assuming it is directed from output to input. It is assumed that [[Pi].sub.i] goes through buffer [s.sub.i] and has wave number [w.sub.i]. Maggs and Sitaraman [1992] show (Lemma 4.1) that if packet [Pi] takes log n + d time to exit from the network, then there is a (d + (q - 2)f, f) delay sequence, with [[Pi].sub.1] = [Pi], for some f [is greater than or equal to] 0.

We have to argue that the delay sequence does not contain many tokens. Let k denote the number of tokens in our delay sequence. We see that

k [is less than or equal to] A log n,

since the wave numbers of tokens decrease by one along the delay path, Eq. (9) holds, and any packet/token on the delay path must be in the network at some time after [Tau], and thus has wave number at least w*([Tau] - log n).

If we assume (and we will subsequently remove this assumption)

A: The destinations of packets under consideration are random,

then the expected number of delay sequences for [Pi] can be bounded as follows.

Choose [Lambda] = 2f - log n for the length of a path P. Let P denote the set of possible delay paths and note that |P| [is less than or equal to] [4.sup.[Lambda]]. Choose a delay d [is greater than or equal to] K log n where K is a large constant. (We assume q [much greater than] K [much greater than] A.) Let r = d + (q - 2)f. We have to count the number of (r, f) delay sequences with delay path P. Choose positive integers [a.sub.1], [a.sub.2], ..., [a.sub.k] so that [a.sub.1] + [a.sub.2] + ... + [a.sub.k] [is less than or equal to] r and along our path there are tokens at positions [a.sub.1], [a.sub.1] + [a.sub.2], ..., [a.sub.1] + [a.sub.2] + ... + [a.sub.k]. There are [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] choices for the [a.sub.i]'s.

Let J = [r]\{[a.sub.1], [a.sub.1] + [a.sub.2], ..., [a.sub.1] + [a.sub.2] + ... + [a.sub.k]}. Now choose an edge buffer [s.sub.j] for each j [element of] J. Observe that having chosen P our choices are now restricted. However, for each edge in P, we can choose the multiplicity of its buffer in the delay sequence. This can be done in at most ([MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]) ways.

Let [d.sub.j] be the depth of the edge with buffer [s.sub.j]. There are [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] inputs that could send a packet along this edge. The probability that there is such a packet with a particular wave number (fixed by the preceding token) is at most [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Thus, the expected number of delay sequences is at most

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(for sufficiently large q, K)

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],

which can be made O([n.sup.-B]) for any positive constant B by choosing K sufficiently large.

Let us now deal with Assumption A. One cannot assert that the destinations of packets in the network at time [Tau] are random. There is a tendency for "bad" configurations to "linger." However, one can assert that the destinations of packets with wave numbers in [w, w + k - 1] are random for any fixed w. What we have actually proved is that there is unlikely to be a delay sequence made up from random packets with wave numbers in [w, w + k - 1] where w = [w.sub.r]. We know however that

w*([Tau]) [is less than or equal to] [w.sub.[Tau]] [is less than or equal to] w*([Tau]) + A log n,

and thus we can assume conservatively that if [[Tau].sub.0] = [Tau] - 2AnS log n and [w.sub.0] = w*([[Tau].sub.0]), then

[w.sub.0] + 2A log n [is less than or equal to] [w.sub.[Tau]] [is less than or equal to] [w.sub.0] + A(2nS + 1)log n.

Here S is a polynomial upper bound (proved below in Lemma 3.1) on the time taken for an active packet or token to get through the network. We use the facts:

(a) w*(t - 1) [is less than or equal to] w*(t) [is less than or equal to] w*(t - 1) + 1.

(b) w*(t + nS) [is greater than or equal to] w*(t) + 1.

Of course, [w.sub.0] itself is a random variable. The conditional distribution of the destinations in wave w for w [is greater than] [w.sub.0] + A log n are random because no packet in this wave could have been in the network at time [[Tau].sub.0]. Let us therefore define [E.sub.[Tau]] [equivalent] There exists w [element of] [[w.sub.0] + 2A log n, [w.sub.0] + A(2nS + 1) log n + (K + 1) log n] and a delay sequence of length exceeding K log n made from packets in waves [w, w + A log n].

The probability of [E.sub.[Tau]] is O(Sn log n/[n.sup.B]) and can be made suitably small. If [E.sub.[Tau]] does not occur, then all of the eligible packets among the first m in each queue at time [Tau] will be serviced in time (K + 1) log n.

We have therefore dealt with eligible packets at time [Tau]. Some of the first m packets in the queue can be ineligible. If [E.sub.[Tau]] does not occur, then they will be serviced in a further (K + 1) log n time because they will be eligible at time [Tau] + (K + 1) log n (all eligible packets in the network at time [Tau] will be out) and the extra (K + 1) log n in the definition of [E.sub.[Tau], means there are no delay paths for them either. We thus take T = 2(K + 1) log n.

We now have to give an estimate for S.

LEMMA 3.1. Under this protocol, no packet takes more than S [is less than or equal to] 4An [log.sup.2] n steps to complete its service once it has become active.

PROOF. If we trace an input-output path, then the tokens we meet have wave numbers that decrease by one each time. This is a basic property of the scheduling protocol. At each time step, at least one packet or token of lowest wave number moves. Thus, if X denotes the set of lowest wave tokens or packets at time t, then X will be through the network at time t + 2n log n. The network can have no more than A log n distinct eligible wave numbers at any time and so we get an upper bound of 2An [log.sup.2] n for eligible packets. An active but ineligible packet might then have to wait this long to become eligible. []

From our definition of [E.sub.[Tau], we see that it depends only on the destinations of packets that have wave numbers in the range [[w.sub.0] + 2A log n, [w.sub.0] + A(2nS + 1) log n + (K + 1) log n]. Every packet that has already made a choice of its destination by time [t.sub.0] - S is out of the system by time [t.sub.0] and thus has wave number at most [w.sub.0] + A log n. On the other hand, packets that enter the network after time [t.sub.0] + nS(A(2nS + 1) + K + 1) log n has wave number [is greater than or equal to] [w.sub.0] + (A(2nS + 1) + K + 1) log n. Hence, [E.sub.[Tau]] depends only on the destination of packets enter the network at times in the range [[Tau] - 2AnS log n - S, [Tau] + (A(2nS - 1) + K + 1)nS log n]. We can thus take b = 5 in the main theorem. To define a, suppose a packet [Pi] becomes active at time [Tau]. Then all packets currently in the network will have left it by the time [Tau] + S. If [Pi] has not left the queue by this time, then [Pi] will certainly be eligible and can now enter. Thus, we can take a = 2. Thus, all the conditions of Theorem 2.1 are satisfied, and we have proved:

THEOREM 3.1. There is a constant C, such that the above algorithm is stable for any interarrival distribution with expectation at least C. The expected time a packet spends in the network is O(log n), and in the case of geometric interarrival time the expected time a packet spends in the system is O(log n).

After running the algorithm for a long time, wave numbers could become very large. To avoid the storage of very large numbers, wave numbers can be stored mod 2A log n and eligibility defined to take account of this in the obvious way.

4. Greedy Dynamic Routing on a Butterfly with Bounded Buffers and Injection Rate O(1/log n)

We present in this section a simple greedy algorithm ("pure queuing protocol") that can sustain an interarrival distribution with expectation [Omega](log n) using buffers of size q = O(1) in the routing switches. The algorithm and analysis is based on the static result of Maggs and Sitaraman [1992].

We first describe the behavior of switches in the network:

--Packets are selected from buffers in FIFO order.

--A switch V alternates between the two switches W, W' feeding it. If at time t - 1 switch V received a packet from switch W, at time t it first checks switch W'. If the buffer of W' is nonempty, then W' send a packet to V; otherwise, V returns to switch W.

The dynamic algorithm uses a token-based flow control mechanism. Each input has m = O(1) tokens. A token can be in one of three modes: enabled, used or suspended. Initially, all tokens are in enabled mode. To inject a packet into the network, the input needs an enabled token. A packet is sent with a token and the mode of the token switches to used mode. When a packet is delivered the token (acknowledgment) is returned to the input node. Let [t.sub.s] be the last time a given token was sent with a packet, let [t.sub.r] be the last time it returns to the input node. If [t.sub.r] - [t.sub.s] [is less than or equal to] K log n, then the token becomes enabled again at time [t.sub.s] + K log n. If [t.sub.r] - [t.sub.s] [is less than or equal to] K log n, then the token mode is switched to suspended mode for 3mnS = O([n.sup.3]) steps (S is defined in Lemma 4.1). Then it is switched back to enabled mode (K is a constant fixed in the proof). This flow mechanism guarantees that an input cannot inject more than m packets in each interval of K log n steps, and that the input does not inject new packets when the network is congested.

We use a separate butterfly network [Gamma]'(n) to route tokens back to their sources. The output nodes of [Gamma](n) are identified with the inputs for [Gamma]'(n). A token from input j of [Gamma](n) which reaches output k of [Gamma](n) must travel from input k of [Gamma]'(n) to output j of [Gamma]'(n).

LEMMA 4.1. Under this protocol no packet takes more than S = 4[qn.sup.2] + 2K log n steps to complete its service once it has become active.

PROOF. Our network [Gamma](n) [union] [Gamma]'(n) has 2 log n levels. We prove by induction on i that if a buffer B at level i is nonempty (level 0 is the output level of [Gamma]'(n), which is the input level of [Gamma](n)), then after at most [2.sup.i] steps the front of the queue moves onto the next level. This is clear for i = 0 and our protocol ensures that after at most 2 x [2.sup.i-1] steps the switch will be able to move the front of B. Let [Mu] be some packet waiting in B. FIFO selection then ensures that a packet spends at most [q2.sup.i] time steps at level i. Thus, a token takes at most [2qn.sup.2] time steps to reach its output once it obtains a packet. By the same argument, a packet has to wait at most [2qn.sup.2] + K log n time steps to obtain a token, once it has become active. []

The proof in Maggs and Sitaraman [1992] is based on a delay tree argument. To apply this proof technique here, we analyze the delays in [Gamma](n) and [Gamma]'(n) separately. Note that since we assume that a processor has sufficient buffer space to receive all packets sent to it, delays in one network do not propagate to the other network. In our setting, the delay tree in [Gamma](n) is defined as follows: Fix a packet [Pi], which is one of the first m packets of an input queue at time [Tau]. Let M([Pi]) be the set of packets that were in the network during any step t in which [Pi] was in the network. A node v of [Gamma](n) is full with respect to [Pi] if at least q packets in M([Pi]) traversed v during this time. The spine SP([Pi]) of [Pi]'s delay tree DT = DT([Pi]) is the path in [Gamma](n) from the input node, j say, to [Gamma]'s output node back to the output node of [Gamma]'(n) labeled j. Let F([Pi]) be the set of full nodes with respect to [Pi] in the network [Gamma](n). DT([Pi]) consists of SP([Pi]) plus any node reachable from it by a path consisting entirely of nodes in F([Pi]). (Paths are directed from nodes of the spine down to lower numbered levels.) The number of packets on the delay tree DT([Pi]), denote by [h.sub.1]([Pi]), is the sum over all the nodes of the tree, of the number of packets in M([Pi]) visiting each of the nodes. (Note that packets can be counted several times in this count). A similar construction gives DT'([Pi]), a delay tree of [Pi] in [Gamma]'(n). Let [h.sub.2]([Pi]), denote the number of packets on DT', and let h([Pi]) = [h.sub.1]([Pi]) + [h.sub.2]([Pi]).

By Theorem 2.1 in Maggs and Sitaraman [1992], the time a packet [Pi] spends in the network is bounded by log n + [h.sub.1]([Pi]), and the time it takes the corresponding token to return to the source is log n + [h.sub.2]([Pi]).

Let

T = 2K log n,

and define the event:

[E.sub.[Tau]]. There exist delay trees DT and DT' such that considering all the packets and tokens that are in the network at time [Tau], and the packets and tokens entering the network during the interval [[Tau], [Tau] + T],

h([Pi]) [is greater than or equal to] (K - 1) log n.

(For each pair of trees D T and D T', we imagine a packet [Pi] that followed SP(DT), and a token that followed SP(DT'), both were in the network during the interval [[Tau], [Tau] + T]. We compute the maximum of h([Pi]) over all DT and DT'.)

Clearly, ?? [E.sub.[Tau]] implies that any packet that is among the first m packets in its queue at time [Tau] will be delivered, and its token returned to the source, before time [Tau] + T, since a packet waits no more than K log n until it has an enabled token, and then its routing takes no more than K log n steps.

LEMMA 4.2. For any [Alpha] [is greater than] 0, there exists K = K([Alpha]) such that if [Tau] - 2K log n [is less than or equal to] t [is less than or equal to] [Tau] - K log n then

Pr([E.sub.[Tau]]|[H.sub.t]) [is less than or equal to] [n.sup.-[Alpha]],

provided [H.sub.t] [subset or equal to] [E.sub.t].

PROOF. Under the conditions of the lemma, any packet and token in the network at time t will have left the network by time [Tau]. The destinations of packets generated in the interval [t + 1, [Tau] + T] are random and independent of [H.sub.t]. Once a token changes to a used state, it requires at least K log n steps in order to become enabled again. Thus, each input node can inject at most 4m new packets between times t + 1 and [Tau] + T, each having a new random destination.

Theorem 2.5 of Maggs and Sitaraman [1992] shows that if each input node injects 1 packet to a random destination then for some [K.sub.0] = [K.sub.0]([Alpha]) and q sufficiently large

Pr([exists][Pi]:[h.sub.1]([Pi]) [is greater than or equal to] [K.sub.0] log n) = O([n.sup.-[Alpha]]).

So if each input injects at most 4m packets with random destinations into the network, then

Pr([exists][Pi]:[h.sub.1]([Pi]) [is greater than or equal to] 4m[K.sub.0] log n) = O([n.sup.-[Alpha]]).

A symmetric argument proves the same bound for [h.sub.2](H). Choosing K = 4m[K.sub.0] + 1 yields the lemma. []

Let [[Tau].sub.0] = [Tau] - 2mnS and

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

COROLLARY 4.1

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

LEMMA 4.3

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

PROOF. The network at time [Tau] contains at most m packets per input node. We must find a way to be able to treat the destinations of these packets as random given [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. For input i and time t, let [D.sub.i,t] be the set of m destinations associated with the tokens for i. The occurrence of events [E.sub.t], t [element of] [[Tau].sub.0], [Tau]] can be determined from the sets [D.sub.i,t], t [element of] [[Tau].sub.0], [Tau] + T]. Imagine that initially, that is, at time [[Tau].sub.0] each token has a destination and it gets a new one whenever it returns from a trip through the network. We will view the sequence [D.sub.i,t], t = [[Tau].sub.0], ..., [Tau] + T as being produced as follows: We start with an arbitrary set of destinations [D.sub.i,[Tau]]. Each token j has due time [d.sub.j] for its next change of destination. At time [d.sub.j] the destination of token j is replaced with a new random destination and [d.sub.j] is replaced by [d.sub.j] + K log n unless either (i)j is late and becomes suspended at some time in the interval [[d.sub.j], [d.sub.j] + S] or (ii) there is no active packet waiting for j at time [d.sub.j] + K log n, in which case j gets a new due date determined by the arrival distribution and not by the workings of the network, which we treat as an adversary. This adversary is limited to arbitrarily choosing tokens to make late. We can assume that the adversary makes its decision on token j at time [d.sub.j].

Suppose that there is a time t [element of] [[Tau].sub.0], [Tau] - 2K log n] such that the adversary decides not to make any token late during the interval [t, t + S]. Then at the end of this time interval the destinations of unsuspended tokens are completely random and so the (conditional) probability Pr([E.sub.t+s]) [is less than or equal to] [n.sub.-[Alpha]]. On the other hand, if the adversary tries to delay at least one token in every such interval then there will be no tokens in the network at time [Tau]' = [Tau].sub.0] + mnS and [E.sub.[Tau]'] will occur, unless there are fewer than m new arrivals at some input in the period [[Tau], [Tau] + mnS]. The probability of the latter is negligible and the lemma follows. []

Combining Corollary 4.1 and Lemma 4.3, we obtain

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Taking T = 2K log n, m = O(1), a = b = 3, [Alpha] = 16 and applying Theorem 2.1 we obtain

THEOREM 4.1. There is a constant C, such that the above algorithm is stable for any injection rate with expected interarrival time greater than C log n. The expected time a packet spends in the network is O(log n). In the case of geometric interarrival time the expected time a packet spends in the system is O(log n).

5. Greedy Dynamic Routing on a Butterfly with Buffers of Size O(log n) and Constant Injection Rate

The algorithm in the previous section sustains an injection rate that is only up to O(1/log n) of the network capacity. We now present a greedy algorithm that is stable for any interarrival distribution with expectation bounded by some constant C; thus, a constant fraction of the network capacity. This algorithm, however, requires buffers of size q = O(log n).

The algorithm and analysis is based on the static result in Section 3.4.4 in Leighton [1992]. When a packet [Pi] is injected into the network it receives a random priority number r([Pi]) chosen uniformly at random from the interval [1, ..., 8eK log n] (K is a constant fixed in the proof). Packets are generally selected from the buffers according to their random priority numbers. This does not guarantee that a packet will get through the network in a reasonable time, since it is conceivable that under this protocol a packet could stay in the network for a very long time. Consequently, once every 10K log n steps we change the selection rule for one step so that packets are selected according to age and FIFO order. We call this a special step.

The algorithm uses the same token-based flow control mechanism as the one described in the previous section, only the values of K and S will change. It also uses a second network [Gamma]'(n) to route the tokens back to the inputs. The tokens inherit the priorities of their packets, for the return phase, as well as their age status.

This time m is of order log n.

LEMMA 5.1. Under this protocol, no packet takes more than S = 40[Kqn.sup.2] log n + K log n steps to complete its service once it has become active.

PROOF. An argument similar to that given in the proof of Lemma 4.1 ensures that a packet's token arrives back at its input within 4[qn.sup.2] special steps of getting a packet. []

Let

T = 2K log n

and define the event:

[E.sub.[Tau]]. There is a delay sequence (in [Gamma](n) [union] [Gamma]'(n)) of length K log n at some time in the interval [[Tau], [Tau] + T].

The definition of a delay sequence is similar to that given in Section 3. A full definition can be found for example on p. 550 of Leighton [1992]. It is basically a path P from an input node to an output node (back to an input node) plus a sequence of nodes [v.sub.1], [v.sub.2], ..., [v.sub.p] of P. The nodes are in order on P but there can be repetition in the sequence. There is a sequence of packets [[Pi].sub.1], [[Pi].sub.2], ..., [[Pi].sub.p] where [[Pi].sub.i] is required to go through node [v.sub.i]. Finally, there is the condition that r([[Pi].sub.i]) [is less than or equal to] r([[Pi].sub.i+1]) for 1 [is less than or equal to] i [is less than] p. Because of our special steps we have to relax this last condition for ?? p/(10K log n) ?? values of i.

Assume first that the buffers are unbounded. Then (see, e.g., the proof of Theorem 3.26 in Leighton [1992]) the event ?? [E.sub.[Tau]] implies that a packet that received an enabled token in the interval [[Tau], [Tau] + K log n] is delivered within K log n steps, that is, before time [Tau] + T. In these circumstances, each token becomes enabled at least once in the interval [[Tau], [Tau] + K log n], and so the first m packets in each queue at time [Tau] are delivered by time [Tau] + T. However, if no packet was delayed more than K log n steps, then no buffer had more than K log n packets at any step in that interval and we get the same performance as if each buffer had size q = K log n. Thus, with this queue size, ?? [E.sub.[Tau]] implies that the first m packets in each queue at time [Tau] are delivered by the time [Tau] + T.

Using the almost identical argument to that given in the previous section, we can prove that for any [Beta] [is greater than] 0 we can choose K = K([Beta]) such that

Pr([E.sub.[Tau]]|[H.sub.[Tau]]-2mnS]) [is less than or equal to] [n.sup.-[Beta]].

Taking T = 2K log n, m = O(log n), a = b = 3, [Beta] = 13 and applying Theorem 2.1 we obtain

THEOREM 5.1. There is a constant C, such that the above algorithm is stable for any injection rate with expected interarrival time greater than C. The expected time a packet spends in the network is O(log n). In the case of geometric interarrival time the expected time a packet spends in the system is O(log n).

6. Dynamic Routing on a Two-Dimensional Mesh with Bounded Buffers under Injection Rate O(1/n)

Our dynamic algorithm is based on the Greedy Algorithm of Section 1.7 of Leighton [1992].

A packet takes the shortest one-bend route from origin to destination, first (left or right) on its origin row to its destination column, then up or down on that column to the packet's destination.

We assume that a switch can receive up to four packets per step, one from each incoming edge, and send four packets per step, one through each outgoing edge. A switch maintains a buffer for each outgoing edge. When there is a space in a buffer the switch receives packets to that buffer according to the following rule: Except for special steps, packets that have to travel farthest have the highest priority. There will be a special step every 10Kn steps, where K is a sufficiently large constant (K [is greater than or equal to] [2e.sup.2] will do). In such a step, priority is given to age, oldest first. Packets of the same age are dealt with in lexicographic order of pair (origin, destination).

The algorithm uses a token-based admission control mechanism similar to that in the previous two sections, with one packet per input, that is, m = 1. Let [t.sub.s] be the last time a given token was sent with a packet, and let [t.sub.r] be the last time it returns to its input node. If [t.sub.r] - [t.sub.s] [is less than or equal to] Kn, then the token becomes enabled again at time [t.sup.s] + Kn + Z, where Z is a random number chosen uniformly from [0, Kn]. If [t.sub.r] - [t.sub.s] [is greater than] Kn, then the token mode is switched to suspended mode until time [t.sup.r] + [3n.sup.2]S + Z steps, then it is switched again to enabled mode, where S is defined in Lemma 6.1 below and Z is chosen as above.

We use a separate network to route tokens back to their sources and the routing mirrors that of the main network.

This flow mechanism guarantees that an input cannot inject more than one packet within each interval of Kn steps. Furthermore, the probability that a token becomes enabled at any fixed time is at most 1/(Kn).

LEMMA 6.1. Under this protocol no packet takes more than S = [100Kn.sup.6] steps to reach its destination once it has become active.

PROOF. Consider a packet P. Let [P.sub.0] be its predecessor in the queue. [P.sub.0] does not leave the queue until it has an enabled token. At that time, there are no more than [n.sup.2] other packets in the network. Consider the progress of the highest priority old packet [Pi] in the network. If [Pi] is moving along a column then it moves at every special time step. If it is moving along a row, then it could fail to move because further along that row there is contention for a column buffer at that special time step. [Pi] waits at most Kn + [n.sup.2] special steps before making another move. This is because the packet waiting to move along the column in question will be the oldest packet trying to get into the column edge buffer. Thus, after Kn + [Kn.sup.2] + [n.sup.3] special steps [Pi] will have reached its destination column and will reach its final destination within a further n special steps. So after at most [Kn.sup.3] + [Kn.sup.4] + [n.sup.5] special steps, [P.sub.0] will be the highest priority packet in the network and will be delivered within a further Kn + [Kn.sup.2] + [n.sup.3] special steps. Thus, [P.sub.0] gets to its destination at most Kn + [Kn.sub.2] + (K + 1)[n.sup.3] + [Kn.sup.4] + [n.sup.5] [is less than or equal to] [2n.sup.5] special steps after leaving the queue. The used token comes back after at most another [2n.sup.5] special steps and after at most 2Kn steps is reactivated. Finally, after at most another [2n.sup.5] special steps the packet P is delivered. The sum of these delays is less than [100Kn.sup.6].

Next, we turn to the definition of the event [E.sub.[Tau]]. Our analysis is based on the technique in Leighton [1990] as described in Leighton [1992, Sect. 1.7.2]. As in Leighton [1990], we relate the execution of the algorithm to an artificial execution on a wide-channel model in which an arbitrary number of packets can traverse an edge at any step, and no packet is ever delayed. We assume that the execution on the wide-channel starts at time [Tau].

Let [Alpha] and q be constants to be defined later, let

[d.sub.0] = ([Alpha] + 3)log n + log(4K) + 1.

This serves as a suitable high-probability upper bound on the delay of any given packet. We now define the events [E.sub.[Tau]] as the union of the following events:

(1) There is a row edge e (in the network that routes the packets), a t [is greater than or equal to] 0, and an interval [[t.sub.0], [t.sub.0] + t + [d.sub.0]] [subset or equal to] [[Tau], [Tau] + Kn], such that at least t + [d.sub.0] - 1 packets traverse edge e in that interval in the wide-channel model.

(2) There is a column edge e (in the network that routes the packets), a t [is greater than or equal to] 0, and an interval [[t.sub.0], [t.sub.0] + t + [2d.sub.0] [subset or equal to] [[Tau], [Tau] + Kn], such that at least t + [d.sub.0] - 1 packets traverse edge e in that interval in the wide-channel model.

(3) A routing buffer (in the network that routes the packets) has q packets in some step in the interval [[Tau], [Tau] + Kn].

(4) There is a row edge e (in the mirror network that routes the tokens), a t [is greater than or equal to] 0, and an interval [[t.sub.0], [t.sub.0] + t + [d.sub.0]] [subset or equal to] [[Tau], [Tau] + Kn], such that at least t + [d.sub.0] - 1 tokens traverse edge e in that interval in the wide-channel model.

(5) There is a column edge e (in the mirror network that routes the tokens), a t [is greater than or equal to] 0, and an interval [[t.sub.0], [t.sub.0] + t + [2d.sub.0]] [subset or equal to] [[Tau], [Tau] + Kn], such that at least t + [d.sub.0] - 1 tokens traverse edge e in that interval in the wide-channel model.

(6) A routing buffer (in the network that routes the tokens) has q tokens in some step in the interval [[Tau], [Tau] + Kn].

We say that a packet was delayed d steps in traversing an edge if there is a d step gap between the time it traverses the edge in the wide-channel model and the time it traverses the edge in the standard model. Leighton's analysis in Leighton [1990] is based on the following fact (see Corollary 1.9 and Lemma 1.10 in Leighton [1992]): If buffers are unbounded and the farthest to go packet always has the highest priority, then a packet is delayed d steps in traversing a row edge e only if there is an interval of t + d steps such that t + d packets cross edge e in that interval in the wide-channel model.

This must be modified to account for the special steps. Then, we can only postulate that at least t + d - ?? (t + d)/(10Kn) ?? packets cross e. Similarly, if a packet is delayed d steps in crossing a column edge e, then, assuming no packet is delayed more than [d.sub.0] steps on a row, there is an interval of t + do + d steps in which at least t + d - ?? (t + d)/(10Kn) ?? packets cross edge e in the wide-channel model (see Leighton [1992] for a detailed proof). Thus, we have the following corollary that satisfies requirement (2.b) in the general theorem:

COROLLARY 6.1. The event ?? [E.sub.[Tau]] implies that any packet with an enabled token at time [Tau] is delivered within the next 2n + [2d.sub.0] [is less than or equal to] Kn steps.

PROOF. Any packet with an enabled token is delivered within 2n steps in the wide-channel model. In the standard model its additional delay is at most [2d.sub.0].

It follows that if [E.sub.[Tau]], does not occur then any packet which is active at time [Tau] is delivered within T = 3Kn time steps.

We now bound the probability of the event [E.sub.[Tau]] given [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. Once we have proved the equivalent of Lemma 4.2, we can make the same arguments as in Corollary 4.1 and Lemma 4.3, with [[Tau].sub.0] = [Tau] - [2n.sup.2]S. The effect of adding Z is similar to that of waiting for a packet to arrive in the butterfly examples, that is, it is outside the control of the adversary. We will argue next that, for any [Alpha] [is greater than] 0, there is a K = K([Alpha]) such that if [Tau] - 2Kn [is less than or equal to] t [is less than or equal to] [Tau] - Kn, then

(10) Pr([E.sub.[Tau]]|[H.sub.t]) = O([n.sup-[Alpha]),

provided [H.sub.t] [subset or equal to] [E.sub.[Tau]].

For an edge e and an interval [[t.sub.i], [t.sub.2]] [subset or equal to] [[Tau], [Tau] + Kn], let [E.sub.0](e, [t.sub.1], [t.sub.2], r) be the event that in the wide-channel model r packets cross e during time interval [[t.sub.1], [t.sub.2]]. Note that every token is used at most once in the interval and its destination can be treated as random.

Case 1. e is a row edge:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(The nodes on the row under consideration have a total of n tokens. Each token is used at most once in the interval. Choose r of them to transport the packets of interest. The probability that a token becomes enabled at any fixed time is at most 1/(Kn).)

Case 2. e is a column edge:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(There is a total of [n.sup.2] tokens. Each token is used at most once in the interval. Choose r of them to transport the packets of interest. The probability that a token becomes enabled at any fixed time is at most 1/(Kn) and the probability that the token uses a particular column is 1/n.)

Thus, the probability that there is a row edge e, a t [is greater than or equal to] 0, and an interval [[t.sub.0], [t.sub.0] + t + [d.sub.0]] [subset or equal to] [[Tau], [Tau] + Kn], such that t + [d.sub.0] packets traverse edge e in that interval in the wide-channel model is bounded by

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

for K [is greater than or equal to] [e.sup.2]. (There are Kn possible values for [t.sub.0] and [is less than or equal to] [2n.sup.2] edges.)

Similarly, the probability that there is a column edge e, a t [is greater than or equal to] 0, and an interval [[t.sub.0], [t.sub.0] + t + [2d.sub.0]] [subset or equal to] [[Tau], [Tau] + Kn], such that t + [d.sub.0] packets traverse edge e in that interval in the wide-channel model is bounded by

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],

provided that K [is greater than or equal to] [2e.sup.2].

Remark 1. It follows (see Leighton [1992, Sect. 1.7.2, Lemma 1.10]) that with probability at least 1 - [2n.sup.[Alpha]], for every edge w and time interval I of [d.sub.0] steps, there is a step in I in which w is empty.

Next, we bound the probability that any buffer is full in the interval [[Tau], [Tau] + Kn]. From Remark 1, we can assume that some row edge has q packets in its buffer only if there is a window of do steps in which some input tries to inject at least q packets. The probability of this is at most

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

for sufficiently large q.

From Remark 1, we can further assume that a column edge has q packets in its buffer only if there is a window of [d.sub.0] steps in which q packets turn at e. Assuming no row delay of [d.sub.0] or more, the probability of this is bounded by

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

for sufficiently large q (see Leighton [1992, Sect. 1.7.2, Theorem 1.13]). (The factor [Kn.sup.3] bounds the number of (interval I = [[t.sub.1], [t.sub.2]], edge e) pairs. There are [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] choices of token. There is a probability 1/n that its destination uses e. There is a probability of at most [2d.sub.0]/(Kn) that it becomes enabled at a time which means it would cross e during [[t.sub.1] - [d.sub.0], [t.sub.2]] in the wide-channel model.)

This completes the proof of (10).

Taking T = 3Kn, m = 1, a = b = 9, [Alpha] = 50 and applying Theorem 2.1 we obtain

THEOREM 6.1. There is a constant C, such that the above algorithm is stable for any injection rate with expected interarrival time greater than Cn. The expected time a packet spends in the network is O(n). In the case of geometric interarrival time, the expected time a packet spends in the system is O(n).

7. Adversarial Model

In order to avoid probabilistic assumptions on the input, Borodin et al. [2001] defined the adversarial input model. Instead of probabilistic assumptions, restrictions are placed on the amount of required traffic through each edge. More precisely, for an edge e of the network and a time interval I, we let [Theta] (e, I) denote the number of messages arriving during interval I whose input-output path contains e. An adversary has injection rate a if for all e and I:

(11) [Theta](e, I) [is less than or equal to] [Alpha]|I|,

where |I| is the length of I.

Surprisingly, the main results of this paper can be extended to this model.

7.1. BUTTERFLY WITH CONSTANT INJECTION RATE A AND BOUNDED BUFFERS. We will show how to extend the result of Section 3 to the adversarial model. We assume that (11) holds for some (sufficiently small) [Alpha] [is greater than] 0.

When a packet arrives it now adds a random offset between 1 and c log n to its wave number. We route packets in the way previously described. The only issue is the likelihood of a long delay sequence. As described previously, we have a set of buffers [s.sub.j], j [element of] J and a set of wave numbers W = [w, w + A log n]. For each buffer we have a wave number [w.sub.j] [element of] W, where [w.sub.j+1] [is less than or equal to] [w.sub.j] and there is a set [P.sub.j] of packets that want to use this edge. Let [F.sub.j] be the event: there exists a choice [[Pi].sub.j] [element of] [P.sub.j] such that

(i) [[Pi].sub.j] has wave number [w.sub.j] and

(ii) [[Pi].sub.j] [is not an element of] {[[Pi].sub.1], [[Pi].sub.2], ..., [[Pi].sub.j-1]}.

Our assumptions about input rate imply that Pr([F.sub.j]) [is less than or equal to] [Beta] = [Alpha](c + A)/c [is less than] 1. More importantly, we have

(12) Pr([F.sub.j]|[F.sub.1], [F.sub.2], ..., [F.sub.j-1]) [is less than or equal to] [Beta].

Condition on the occurrence of [F.sub.1], [F.sub.2], ..., [F.sub.j-1], let [P'.sub.j] = [P.sub.j]\{[[Pi].sub.1], [[Pi].sub.2], ..., [[Pi].sub.j-1]} be the current set of choices for [[Pi].sub.j]. The choice of wave number offset by packets is done independently and so the probability of [[F.sub.j] does not increase. Inequality (12) is enough to prove the unlikelihood of long delay sequences. The event [E.sub.[Tau]] can be defined in the same way as before. Thus, we prove the following theorem:

THEOREM 7.1. There is a constant [Alpha] [is greater than] 0, such that for any adversary with injection rate [Alpha] the system is stable, and the expected time a packet spends in the system is O(log n).

7.2. BUTTERFLY WITH O(1/log n) INJECTION RATE AND BOUNDED BUFFERS. Let [Pi] be the first packet that takes time D + log n to reach its destination after becoming active at time t. From Maggs and Sitaraman [1992], we know that the delay tree of H is hit at least D times. We need only consider hits from messages in the time interval [t - D - log n, t + D + log n]. By assumption there are at most

((2D + 2 log n + 1)[Alpha]/log n)(D/q + log n)

such hits. So if D = [Delta] log n and

(13) 2([Delta] + 1)[Alpha]([Delta]/q + 1) [is less than] [Delta],

then [Pi] cannot be delayed by more than D. Now for small [Alpha] and large q, [Delta] = 3[Alpha] satisfies (13) and we conclude that every packet is delivered in time (1 + 3[Alpha])log n.

7.3. BUTTERFLY WITH CONSTANT INJECTION RATE A AND BUFFERS OF SIZE O(log n). In this case, we will have to use our stability theorem (Theorem 2.1). Putting p = 2[Alpha], we know that in time interval of length [[Tau], [Tau] + K log n], no more than 2[Alpha]K log n packets will arrive at any input. Also, the existence of a delay sequence depends only on the priority calculation, since we are guaranteed that no edge sees more than [Alpha]K log n packets in the interval (removing the necessity to prove Theorem 3.2.5 of Leighton [1992].

7.4. MESH WITH INJECTION RATE [Alpha]/n, [Alpha] [is less than] 1/2 AND BOUNDED BUFFERS. This case is straightforward. The assumption (11) implies that in the wide-channel model, fewer packets cross an edge e than are required for the first two possibilities of event [E.sub.[Tau]] assuming [d.sub.0] [is greater than] 1/(1 - 2[Alpha]). We can take q = [d.sub.0] since for every edge e we can argue that in any interval of [d.sub.0] steps there is a step in which e is empty.

(1) See, for example, Leighton [1990], Mitzenmacher [1994], Kahale and Leighton [1995], and Broder and Upfal [1996].

REFERENCES

BORODIN, A., KLEINBERG, J., RAGHAVAN, P., SUDAN, M., AND WILLIAMSON, D. P. 2001. Adversarial queuing theory. J. ACM 48, 1 (Jan.), 13-38.

BRODER, A. Z., AND UPFAL, E. 1996. Dynamic deflection routing in arrays. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing (Philadelphia, Pa., May 22-24). ACM, New York, pp. 348-355.

HARCOL-BALTER, M., AND BLACK, P. 1994. Queuing analysis of oblivious packet-routing networks. In Proceedings of the 5th Annual ACM-SIAM Symposium on Discrete Algorithms (Arlington, Va., Jan. 23-25). ACM, New York, pp. 583-592.

HARCOL-BALTER, M., AND WOLF, D. 1995. Bounding delays in packet-routing networks. In Proceedings of the 27th Annual ACM Symposium on Theory of Computing (Las Vegas, Nev., May 29-June 1). ACM, New York, pp. 248-257.

KAHALE, N., AND LEIGHTON, T. 1995. Greedy dynamic routing on arrays. In Proceedings of the 6th Annual ACM-SIAM Symposium on Discrete Algorithms (San Francisco, Calif., Jan. 22-24). ACM, New York, pp. 558-566.

KLEINROCK, L. 1975. Queuing Systems. Volume I: Theory. Wiley, New York.

LEIGHTON, F. T. 1990. Average case analysis of greedy routing algorithms on arrays. In Proceedings of the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures (Island of Crete, Greece, July 2-6). ACM, New York, pp. 2-10.

LEIGHTON, F. T. 1992. Introduction to Parallel Algorithms and Architectures. Morgan-Kaufmann, San Mateo, Calif.

MAGGS, B. M., AND SITARAMAN, R. K. 1992. Simple algorithms for routing on butterfly networks with bounded queues. In Proceedings of the 24th Annual ACM Symposium on Theory of Computing (Victoria, B.C., Canada, May 4-6). ACM, New York, pp. 150-161.

MITZENMACHER, M. 1994. Bounds on the greedy routing algorithms for array networks. In Proceedings of the 6th Annual ACM Symposium on Parallel Algorithms and Architectures (Cape May, N.J., June 27-29). ACM, New York, pp. 346-353.

RANADE, A. G. 1987. How to emulate shared memory. In Proceedings of the 28th Annual Symposium on Foundations of Computer Science (Oct.). IEEE Computer Society Press, Los Alamitos, Calif., pp. 185-194.

SCHEIDELER, C., AND VOECKING, B. 1996. Universal continuous routing strategies. In Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures (Padua, Italy, June 24-26). ACM, New York, pp. 142-151.

STAMOULIS, G. D., AND TSITSIKLIS, J. N. 1991. The efficiency of greedy routing in hypercubes and butterflies. In Proceedings of the 3rd Annual ACM Symposium on Parallel Algorithms and Architectures (Hilton Head, S.C., July 21-24). ACM, New York, pp. 248-259.

UPFAL, E. 1984. Efficient schemes for parallel communication. J. ACM 31, 3 (July), 507-517.

The research of A. M. Frieze was supported in part by National Science Foundation Grants CCR 99-25008 and CCR 95-30974.

The research of E. Upfal was supported in part ny NSF Grant CCR 97-31477.

Part of E. Upfal's work was done at the IBM Almaden Research Center, San Jose, California.

Author's addresses: A. Z. Broder, Digital Systems Research Center, 130 Lytton Ave., Palo Alto, CA 94301, e-mail: broder@pa.dec.com; A. M. Frieze, Department of Mathematical Sciences, Carnegie Mellon University, Pittsburgh, PA, e-mail: afip@andrew.cmu.edu; E. Upfal, Computer Science Department, Brown University, Box 1910, Providence, RI 02912 e-mail: eli@cs.brown.edu.

Printer friendly Cite/link Email Feedback | |

Author: | BRODER, ANDREI Z.; FRIEZE, ALAN M.; UPFAL, ELI |
---|---|

Publication: | Journal of the Association for Computing Machinery |

Geographic Code: | 1USA |

Date: | Mar 1, 2001 |

Words: | 13821 |

Previous Article: | Concurrent Threads and Optimal Parallel Minimum Spanning Trees Algorithm. |

Topics: |