
A Global Communication Optimization Technique Based on Data-Flow Analysis and Linear Algebra.

1. INTRODUCTION

Distributed-memory multiprocessors such as the IBM SP-2 and the Intel Paragon are attractive for high-performance computing in that they offer potentially high levels of flexibility, scalability, and performance. But the need for explicit message passing resulting from the lack of a globally shared address space renders programming these machines a difficult task. The main objective behind the efforts such as High Performance Fortran (HPF) [Koelbel et al. 1994] and Fortran D [Hiranandani et al. 1992] is to raise the level of programming by allowing the user to write programs with a shared address space view augmented with directives that specify data mapping. The compilers for such languages are responsible for partitioning the computation, inserting the necessary commands that implement the required message passing for access to nonlocal data.

On such machines, the time (cost) to access nonlocal data is usually orders of magnitude higher than that of accessing local data. For example, on the Intel Paragon the processor cycle time is 20 nanoseconds whereas the remote memory access time is between 10,000 and 30,000 nanoseconds, depending on the distance between communicating processors [Hennessy and Patterson 1990]. Therefore, it is imperative that the frequency and volume of nonlocal accesses be reduced as much as possible. In particular, in message-passing programs, the startup cost for the messages can easily dominate the execution time. For example, on the Intel Paragon the message startup time is approximately 1,720 times the transfer time per word; on the IBM SP-2 this figure is around 360 [Foster 1994]. These figures indicate that optimizing communication is very important. Several software efforts have been aimed at reducing the communication overhead. The main goal of these optimizations is to increase the performance of programs by combining messages in various ways to reduce the overall communication overhead. The most common optimization technique used by previous researchers is message vectorization [Balasundaram et al. 1990; Banerjee et al. 1995; Bozkus et al. 1994; Hiranandani et al. 1992; Gerndt 1990]. In message vectorization, instead of naively inserting send and recv operations just before references to nonlocal data, communication is hoisted to outer loops. Essentially, this optimization replaces many small messages with one large message, thereby reducing the number of messages. For example, consider the program fragment shown in Figure 1(a), and assume that all arrays are distributed across processors blockwise in the second dimension. Figures 1(b) and (c) show naively inserted messages and message vectorization, respectively, for a processor p before loop bounds reduction (a technique that allows processors to execute only those iterations which have assignments that write to local memory [Hiranandani et al. 1992]) and guard insertion (a technique that guarantees correct execution of statements within loop nests). The notation send{B, q, n} means that n elements of array B should be sent to processor q; recv{B, q, n} is defined similarly. For this discussion, we are not concerned with exactly which elements are sent and received. Notice that the version in Figure 1(c) reduces the message startup cost as well as the latency. Some researchers [Adve et al. 1997; Hiranandani et al. 1992] have also considered message coalescing, a technique that combines messages due to different references to the same array, and message aggregation, which combines messages due to references to different arrays destined for the same processor into a single message. In general, due to private physical memory spaces, generating communication code for message-passing architectures can be very difficult, because it requires the correct nonlocal elements to be transferred to the memories of the processors that will use them. Optimizing compilers for data-parallel languages automate this time-consuming task of deriving node programs based on the data distribution specified by the programmer.

Fig. 1. (a) A code fragment. (b) Naive communication placement. (c) Message vectorization. (d) Global communication optimization.
DO j = 2, 255
 DO i = 1, 255
  A(i,j)=B(i,j)+B(i,j-1)
 END DO
END DO

DO j = 2, 255
 DO i = 2, 256
  C(i, j)=B(i,j-1)+C(i,j)
 END DO
END DO

(a)

DO j = 2, 255
 DO i = 1, 255
  send {B,p+1,1}, recv {B,p-1,1}
  A(i,j)=B(i,j)+B(i,j-1)
 END DO
END DO

DO j = 2, 255
 DO i = 2, 256
  send {B,p+1,1}, recv {B,p-1,1}
  C(i,j)=B(i,j-1)+C(i,j)
 END DO
END DO

(b)

send {B,p+1,255}, recv {B,p-1,255}
DO j = 2, 255
 DO i = 1, 255
  A(i,j)=B(i,j)+B(i,j-1)
 END DO
END DO

send {B,p+1,255}, recv {B,p-1,255}
DO j = 2, 255
 DO i = 2, 256
  C(i,j)=B(i,j-1)+C(i,j)
 END DO
END DO

(c)

send {B,p+1,256}, recv {B,p-1,256}
DO j = 2, 255
 DO i = 1, 255
  A(i,j)=B(i,j)+B(i,j-1)
 END DO
END DO

DO j = 2, 255
 DO i = 2, 256
  C(i,j)=B(i,j-1)+C(i,j)
 END DO
END DO

(d)


The main problem with the optimizations mentioned above is that they optimize communication for a single nest at a time. This restriction prevents a compiler from performing interloop optimizations such as global elimination of redundant communication. To see this, consider Figure 1(d) which shows the global optimization of the same program fragment via elimination of redundant communication. Notice that this version, compared with the message-vectorized program in Figure 1(c), reduces both the number of messages and the communication volume.

Recently a number of authors have proposed techniques based on data-flow analysis to optimize communication across multiple loop nests [Chakrabarti et al. 1996; Gong et al. 1993; Gupta et al. 1995b; Kennedy and Sethi 1995; Yuan et al. 1997a; 1997b]. Most of these approaches use a variant of the Regular Section Descriptors (RSDs) introduced by Callahan and Kennedy [1988]. The two most notable representations are the Available Section Descriptor (ASD) [Gupta et al. 1995b] and the Section Communication Descriptor (SCD) [Yuan et al. 1997a; 1997b]. Associated with each array that is referenced in the program is an RSD that describes the portion of the array being referenced. Although this representation is convenient for simple array sections such as those found in pure block or cyclic distributions, it is hard to embed alignment and general distribution information into it. Apart from inadequate support for block-cyclic distributions, working with section descriptors may sometimes result in overestimation of the communication sets, since regular sections are not closed under the union and difference operators. The resulting inaccuracy may grow linearly with the number of data-flow equations to be evaluated, thus defeating the purpose of global communication optimization.

This problem can be illustrated using the program fragment given in Figure 2(a), assuming that arrays X and Y are distributed blockwise across two processors, 0 and 1. The RSDs corresponding to these two communications are also shown next to the loop statements. Notice that all communication is from processor 0 to processor 1. The problem here is that a data-flow approach based on RSDs to combine these communications will be unable to represent the combined communication as an RSD. This means that even if all the communication can be hoisted above the i loop, the two communications can only be concatenated, resulting in redundant communication, as these two sets have some common elements. Moreover, since the communication cannot be taken out of the t loop because of a data dependence [Wolfe 1996; Zima and Chapman 1991], the redundant communication will occur T times.

Fig. 2. (a) An example code fragment that shows the shortcomings of RSDs. (b) Code generated by using the Omega library to enumerate the communication set in (a). process_element() is an implementation-specific function to handle enumerated elements.
(a)

real X(1000), Y(1000)

DO t = 1, T
  DO i = 0, 49    → S_i = (1:200:4)
    Y(i+600) = X(4*i+1)
  END DO

  DO j = 17, 99   → S_j = (50:300:3)
    Y(j+500) = X(3*j-1)
  END DO

  DO i = 1, 1000
    X(i) = f(Y(i),X(i))
  END DO
END DO

(b)

for (i = 1; i <= 49; i += 4) {
  process_element (i);
}
for (i = 50; i <= 197; i++) {
  if (Mod(i-1,4) == 0) {
    process_element (i);
  }
  if (Mod(i+1,3) == 0
       && -i-16 <= 12*Div(-i-10,12)) {
    process_element (i);
  }
}
for (i = 200; i <= 299; i += 3) {
  process_element (i);
}


On the other hand, we represent these sets in our framework as S_i := {[d] : ∃α (d = 1 + 4α and 1 ≤ d ≤ 197)} and S_j := {[d] : ∃α (1 + d = 3α and 50 ≤ d ≤ 299)}. Then by using the Omega library [Kelly et al. 1995], we derive the code shown in Figure 2(b) which can enumerate all the elements in S_i + S_j. As a result, each element will be communicated once and only once. It should be stressed that the same problem with RSDs can occur with set difference (−) operations. For instance, the RSD difference between (1:1000:3) and (1:1000:7) cannot be represented as a single RSD. Unfortunately, the inaccuracies originating from the union (and difference) operations on the RSDs accumulate as the data-flow process proceeds, making the final communication sets imprecise.
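The same point can be made with a small, self-contained sketch. The Python fragment below is for illustration only (it is not the Omega-generated code of Figure 2(b), and the helper names are ours): it enumerates the union of S_i and S_j by testing each candidate element against both membership predicates, so every element is visited exactly once even though the two sets overlap.

# Illustrative sketch (not the Omega-generated code of Figure 2(b)):
# enumerate the union of S_i = {d : d = 1 + 4a, 1 <= d <= 197} and
# S_j = {d : d + 1 = 3a, 50 <= d <= 299} so that each element appears once.

def in_S_i(d):
    return 1 <= d <= 197 and (d - 1) % 4 == 0

def in_S_j(d):
    return 50 <= d <= 299 and (d + 1) % 3 == 0

def enumerate_union(lo=1, hi=299):
    """Yield every element of S_i + S_j exactly once, in increasing order."""
    for d in range(lo, hi + 1):
        if in_S_i(d) or in_S_j(d):
            yield d

union = list(enumerate_union())
overlap = [d for d in union if in_S_i(d) and in_S_j(d)]
# The two sets have 50 and 84 elements, 13 of them shared: 121 distinct
# elements, versus 134 transfers if the two sets were simply concatenated.
print(len(union), len(overlap))

The difference between 121 distinct elements and 134 naive transfers is exactly the redundancy that a concatenation of the two RSDs would incur, and it would be repeated on every iteration of the t loop.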

In this article, we make the following contributions:

(1) We show that the problem of global communication optimization for regular scientific codes can be cast in a linear algebra framework. This allows the compiler to easily apply traditional loop-based optimization techniques such as message vectorization, message coalescing, and message aggregation, as well as global optimizations such as redundant communication elimination and communication hoisting.

(2) We present two different approaches, primarily for hoisting communication and minimizing the number of messages, respectively, that are aimed at reducing communication overhead and show the trade-off between these two. Both these approaches are accurate; using the linear algebra framework proposed by Ancourt et al. [1997], they are able to handle the optimization problem at the granularity of individual array elements.

(3) We show that the global communication sets resulting from our analysis can be enumerated by our use of the Omega library [Kelly et al. 1995; Pugh 1992] from the University of Maryland. Although the Omega library works on Presburger formulas, and the best-known asymptotic upper bound of any algorithm for verifying Presburger formulas is triply exponential in the length of the formula, the library is much more efficient for the practical cases that arise in compilation.

(4) We compare our approach both qualitatively and quantitatively to the previous work which focused on a single loop nest at a time.

The remainder of this article is organized as follows. Section 2 briefly describes some important concepts such as control flow graphs, interval analysis, dependence analysis, and the linear algebra framework used throughout the article. We present our approach in detail in Section 3 and show how it uses both the linear algebra framework and data-flow analysis. Section 4 discusses the effect of hoisting communication vis-a-vis reducing the number of messages. In Section 5, we present details of communication generation. Section 6 reports experimental results on a 16-node IBM SP-2 distributed-memory message-passing machine and shows that our technique is effective in reducing the number of communication messages, volume of communication, and execution time. Section 7 discusses related work, and Section 8 concludes the article.

2. PRELIMINARIES

The main idea of this work is to show that a global communication optimization problem can be put into a linear algebra framework and that doing so might be useful in practice. Our approach gives the compiler the ability to represent communication sets globally as equalities and inequalities as well as to use polyhedron scanning techniques to perform optimizations such as redundant communication elimination and global message coalescing which were not possible under the loop-nest-based communication optimization schemes. The following subsections give information about the basic concepts used throughout the article.

2.1 Control Flow Graph

We concentrate on structured programs with conditional statements and nested loops but without arbitrary goto statements. Our technique, however, can be extended to deal with jumps out of loops as well. We assume that array subscript functions, loop bounds, and conditional expressions are affine functions of enclosing loop indices and symbolic constants. We also assume that the number of processors is known beforehand.

A basic block is a sequence of consecutive statements in which the flow of control enters at the beginning and leaves at the end without the possibility of branching except maybe at the end [Aho et al. 1986]. A control flow graph (CFG) is a directed graph constructed by basic blocks and represents the flow-of-control information of the program.

For our purposes, the CFG can be thought of as a directed graph G = (V, E) where each v ∈ V represents either a basic block or a (reduced) interval that represents a loop, and each e ∈ E represents an edge between blocks. In this article, depending on the context, we use the term node interchangeably for a statement, a block, or an interval. Two unique nodes s and t denote the start and terminal nodes, respectively, of a CFG. One might think of these nodes as dummy statements. It is assumed that every node n ∈ V lies on a path from s to t. We define the sets of all successors and predecessors of a node n as succ(n) = {m | (n, m) ∈ E} and pred(n) = {m | (m, n) ∈ E}, respectively. We say node i dominates node j in the CFG if every path from s to j goes through i. We write this relation as j ∈ dom(i). The CFGs we consider also have the following properties: (a) empty else branches are added to if/endif constructs; (b) all the nonlocal references in the loop bounds and if-conditions are taken just above the respective constructs; and (c) as in Gupta et al. [1995b], any edge that goes directly from a block with more than one successor to a block with more than one predecessor is split. This last transformation, shown in Figure 3, eliminates all critical edges [Knoop et al. 1994].

[Figure 3 ILLUSTRATION OMITTED]
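To make property (c) concrete, the following small sketch (the adjacency-list CFG encoding and all names are ours, not the compiler's) splits every critical edge of a graph by placing a fresh, empty node on each edge that runs from a block with several successors to a block with several predecessors.

# Sketch of critical-edge splitting on a CFG given as an adjacency list
# (node -> list of successors). Names are illustrative only.

def split_critical_edges(succ):
    """Insert an empty node on every edge u->v where u has more than one
    successor and v has more than one predecessor."""
    pred = {n: [] for n in succ}
    for u, vs in succ.items():
        for v in vs:
            pred.setdefault(v, []).append(u)
    new_succ = {n: list(vs) for n, vs in succ.items()}
    counter = 0
    for u, vs in succ.items():
        for v in vs:
            if len(vs) > 1 and len(pred[v]) > 1:      # u->v is a critical edge
                mid = "split_%d" % counter
                counter += 1
                new_succ[u] = [mid if x == v else x for x in new_succ[u]]
                new_succ[mid] = [v]
    return new_succ

# Example: "a" branches to "b" and "join"; "b" also falls through to "join",
# so the edge a->join is critical and receives a fresh empty node.
cfg = {"a": ["b", "join"], "b": ["join"], "join": []}
print(split_critical_edges(cfg))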

2.2 Interval Analysis

We assume that, prior to our analysis, the compiler has performed all loop-level transformations [Banerjee 1994; Wolfe 1996; Zima and Chapman 1991] to enhance parallelism (e.g., loop permutation, loop distribution) and optimize communication. Our technique is based on interval analysis performed on the CFG. As explained in Allen and Cocke [1976], interval analysis consists of a contraction phase and an expansion phase. For programs written in a structured language, an interval corresponds to a loop, and there is a well-defined algorithm to partition a CFG into disjoint intervals [Aho et al. 1986]. We use a version of the interval detection algorithm that identifies Tarjan's intervals [Tarjan 1974].

The contraction phase collects information about what is generated and what is killed inside each interval. Then the interval is reduced to a single node and annotated with the information collected. This is a recursive procedure and stops when the reduced CFG contains no more cycles. In other words, the main purpose of this phase is to percolate the influence of each node to the outside into an increasingly more global context.

After the contraction phase, the expansion phase is run. In each step of this phase, a node (reduced interval) is expanded, and the information regarding the nodes in that interval is computed. In our case, at each step of the expansion phase, communication required for the intervals (loops) is determined.

Figure 4 shows the two phases of the interval analysis for an example CFG. In this figure, as shown by the dashed arrows, the contraction phase proceeds from left to right, whereas the expansion phase proceeds in the reverse direction. As an example, the block marked with 3,4 represents an interval (a loop) containing blocks 3 and 4. It is also possible to adapt our approach to work with an interval-flow graph, which is basically a CFG with an interval structure imposed on it [Kennedy and Sethi 1997; van Hanxleden and Kennedy 1993; 1994].

[Figure 4 ILLUSTRATION OMITTED]

It should be noted that since we assume that our input programs are structured, irreducible (intermediate) CFGs [Aho et al. 1986] cannot occur during our analysis.
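The overall shape of the two phases can be sketched as follows, under the simplifying assumption that the structured program is already available as a tree of intervals whose leaves are basic blocks; summarize() and propagate() stand in for the actual set computations of Section 3, and all names are illustrative.

# Skeleton of the two-phase interval analysis: summaries flow bottom-up
# during contraction, contexts flow top-down during expansion.

class Interval:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []   # nested intervals / blocks, in order
        self.summary = None              # filled by the contraction phase
        self.context = None              # filled by the expansion phase

def contract(node, summarize):
    """Bottom-up (innermost first): reduce each interval to a summary node."""
    child_summaries = [contract(c, summarize) for c in node.children]
    node.summary = summarize(node, child_summaries)
    return node.summary

def expand(node, context, propagate):
    """Top-down (outermost first): push context back into each interval."""
    node.context = context
    for c in node.children:
        expand(c, propagate(node, c, context), propagate)

# Toy use: a summary is just a set of names; the context is everything seen
# in the enclosing intervals so far.
prog = Interval("i-loop", [Interval("stmt2"),
                           Interval("j-loop", [Interval("stmt4")])])
contract(prog, lambda n, subs: {n.name}.union(*subs) if subs else {n.name})
expand(prog, set(), lambda parent, child, ctx: ctx | parent.summary)
print(prog.children[1].context)

The only point of the sketch is the traversal order; in the compiler the summaries are the KILL, GEN, and POST_GEN sets, and the propagated context is the RECV information of Section 3.4.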

2.3 Data Dependence

Let S_x and S_y be two statements (not necessarily distinct) enclosed by nested loops. Data dependences determine which iterations of the loops can be executed in parallel. A flow dependence exists from statement S_x to statement S_y if S_x writes a value that is subsequently (in sequential execution) read by S_y. Such a dependence implies that the instances of S_x and S_y must execute as if some of the nest levels were executed sequentially. An antidependence exists between S_x and S_y if S_x reads a value that is subsequently modified by S_y. An output dependence exists between S_x and S_y if S_x writes a value that is subsequently written by S_y as well. Data dependences are loop-independent if the accesses to the same memory location occur in the same loop iteration; if the accesses occur in different loop iterations, they are said to be loop-carried. Note that in that case not all loop nest levels need to contribute to the dependence. The outermost loop level that contributes to the dependence is said to carry that dependence. In-depth discussion of data dependence analysis techniques is beyond the scope of this article and can be found elsewhere [Wolfe 1996; Zima and Chapman 1991].

2.4 Linear Algebra Framework

HPF-like languages provide compiler directives that allow the user to perform data allocation onto local memories. The compiler then uses these distribution directives to partition computation across processors. It has been shown in Ancourt et al. [1997] that linear algebra provides a powerful framework to generate code for distributed-memory message-passing machines, taking into account compiler directives.

Most of the compilers for distributed-memory message-passing machines use the owner-computes rule, which simply assigns each computation to the processor that owns the data being computed [Balasundaram et al. 1990; Gerndt 1990; Hiranandani et al. 1992]. In this article, we also assume the owner-computes rule; our framework, however, can be modified to handle the cases where this rule is relaxed. In such cases, the LHS references can also introduce communication. For clarity of the presentation, we do not consider relaxing the owner-computes rule in this article.

Our approach uses the affine framework introduced by Ancourt et al. [1997]. In this framework, data arrays, templates, and processors are all declared as Cartesian grids as in HPF [Koelbel et al. 1994]. The data arrays are first aligned to templates, and then these templates are distributed across the memories of the processors. Consider the following program fragment under a compilation scheme based on HPF-like directives and the owner-computes rule. A cyclic(C) attribute indicates that the template (or array) dimension in question will be partitioned into blocks of size C, and these blocks are assigned to processors in a round-robin fashion. The block and cyclic(1) distributions are just two common cases of the general cyclic(C) distribution.
real X(a_l:a_u)
!HPF$ template T(t_l:t_u)
!HPF$ processors PROC(p_l:p_u)
!HPF$ align X(j) with T(α*j+β)
!HPF$ distribute T(cyclic(C)) onto PROC

DO i = i_l, i_u
  X(γ_L*i + θ_L) = ... X(γ_R*i + θ_R) ...
END DO


Let R_L = X(γ_L*i + θ_L) and R_R = X(γ_R*i + θ_R). In the rest of the article, for the sake of simplicity, we will sometimes refer to the subscript expressions as data (array) elements when the intention is clear. Assuming p and q denote two processors, we define the following sets.

Own(X, q) = {d | d ∈ X and d is owned by q}

Compute(X, R_L, q) = {i | γ_L*i + θ_L ∈ Own(X, q) and i_l ≤ i ≤ i_u}

View(X, R_R, q) = {d | ∃ι s.t. ι ∈ Compute(X, R_L, q) and

d = X(γ_R*ι + θ_R) and i_l ≤ ι ≤ i_u}

CommSet(X, R_R, p, q) = Own(X, q) ∩ View(X, R_R, p).

Intuitively, the set Own(X, q) refers to the elements mapped onto processor q through compiler directives. Similar Own sets are defined for the other arrays as well. The set of iterations to be executed by q due to a LHS reference R_L is given by Compute(X, R_L, q). Of course, during the execution of this local iteration set, some elements (local or nonlocal) denoted by the RHS reference R_R will be required; the set View(X, R_R, q) defines these elements. Finally, CommSet(X, R_R, p, q) defines the elements that should be communicated from processor q to processor p due to reference R_R.

It should be noted that in general there may be more than one RHS reference, and the computation may involve multidimensional arrays and a multilevel nest, in which case d and i denote data and iteration vectors, respectively. Also, in the most general case, α, γ_L, and γ_R are matrices, and β, θ_L, and θ_R are vectors.

The definition of the own set above is rather informal. For a more precise definition, we take into account the block-cyclic distribution and define the own set as

Own(X, q) = {d | ∃ t, c, l such that t = α*d + β

and t = C*P*c + C*q + l and a_l ≤ d ≤ a_u

and p_l ≤ q ≤ p_u and t_l ≤ t ≤ t_u and 0 ≤ l ≤ C − 1},

where P = p_u − p_l + 1. In this formulation, t = α*d + β represents the alignment information, and t = C*P*c + C*q + l denotes the distribution information. In other words, each array element d is mapped onto a point in a local two-dimensional array. This point can be represented by a pair (c, l) and gives the local address of the data item in a processor. Simple block and cyclic(1) distributions can easily be handled within this framework by setting c = 0 and l = 0, respectively. As an example, Figure 5(a) shows the global and local addresses of a one-dimensional array distributed in a block-cyclic manner across three processors with C = 4. Figures 5(b) and (c), on the other hand, illustrate two-dimensional views of the global and local addresses, respectively. For each processor, the horizontal dimension corresponds to the c coordinate whereas the vertical dimension denotes l. For example, the 55th element of the (global) array is mapped onto processor 1 with c = 4 and l = 3 as local coordinates.

[Figure 5 TABULAR DATA NOT REPRODUCIBLE IN ASCII]
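For the common case of identity alignment (t = d) and 0-based processor and template numbering, the mapping from a global index to the (q, c, l) triple can be computed directly, as in the following sketch (ours, not the compiler's code); it reproduces the Figure 5 example in which element 55 lands on processor 1 with c = 4 and l = 3.

# Sketch of the block-cyclic mapping t = C*P*c + C*q + l, assuming identity
# alignment (t = d) and 0-based processor/template numbering.

def owner(d, C, P):
    """Return (q, c, l): owning processor, block (course) index, and offset
    within the block, for global template cell d."""
    q = (d // C) % P          # which processor the block containing d lands on
    c = d // (C * P)          # how many full rounds of P blocks precede it
    l = d % C                 # position of d inside its block
    assert C * P * c + C * q + l == d
    return q, c, l

print(owner(55, C=4, P=3))    # -> (1, 4, 3), as in the Figure 5 example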

The relation t = α*d + β can be generalized by adding a replication matrix V which eliminates the replicated dimension from the equations: V*t = α*d + β. In the case where no replication is specified, V is the identity matrix. Also, in order to take the collapsed dimensions (the dimensions that are not distributed across processors) into account, another projection matrix Y can be used: Y*t = C*P*c + C*q + l. All the elements on a collapsed dimension are stored on the same processor. Notice that these projection matrices are only useful if we adhere to a matrix form for describing the relations. We do not need them if the relations are described on a per-dimension basis. In the rest of the article we assume (1) that identity alignment is used and (2) that arrays are directly distributed across processors. For an in-depth discussion of the linear algebra framework for compiling distributed-memory programs, we refer the reader to Ancourt et al. [1997].

2.5 Parafrase-2 and Omega Library

Parafrase-2 [Polychronopoulos et al. 1989] is used as the front end in our compilation framework. It is a parallelizing compiler implemented as a source-to-source code restructurer that consists of several passes for analysis, transformation, parallelism detection, and code generation. In order to obtain the loops that enumerate the elements in the ownership and communication sets, we use the Omega library [Kelly et al. 1995]. This library is essentially a set of C++ classes for manipulating integer tuple relations and sets defined using Presburger formulas. We implemented a framework that obtains data access information from Parafrase-2 internal structures and feeds it into the Omega library; when all the required sets have been obtained, the framework converts these sets back to internal Parafrase-2 structures.

3. DATA-FLOW ANALYSIS USING A LINEAR ALGEBRA FRAMEWORK

In this section, we define our data-flow framework in detail. First, we introduce some important sets and operations on them.

3.1 Definition of Sets and Operations

3.1.1 Communication Descriptors and Communication Sets. A communication descriptor can be defined as a pair <R, L>, where R is an array identifier (name), and L is the communication set associated with R. The exact definition of a communication set depends on the context in which it is used. Throughout our analysis, a communication set is defined as {d | d is owned by q and is required by (or should be transferred to, or has already been transferred to) p} except for the KILL set, which defines the set of elements written (killed) by q. In these set definitions d refers to a multidimensional array element.

3.1.2 Operations on Communication Sets. Since we define a communication set as a list of equalities and inequalities (this is how the Omega library represents a set), it can be represented as L = {d | P(d)} where P(.) is a predicate. Let {d | P(d)} and {d | I(d)} be two communication sets. We define the operations +_c, −_c, and ∩_c on communication sets as follows:

{d | P(d)} +_c {d | I(d)} = {d | P(d) or I(d)}

{d | P(d)} −_c {d | I(d)} = {d | P(d) and not (I(d))}

{d | P(d)} ∩_c {d | I(d)} = {d | P(d) and I(d)}

Note that the operations "or," "and," and "not" can be performed by using the corresponding Omega operations on sets which contain equalities and inequalities.
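Viewed operationally, a communication set is just its defining predicate, and the three operators are the corresponding logical combinations. The following tiny sketch (illustrative only; the compiler keeps Presburger formulas and lets Omega combine and simplify them) makes this concrete using the strided sets of Figure 2.

# Tiny sketch: a communication set represented by its defining predicate P(d),
# so the +_c, -_c, and intersection operators are exactly the "or",
# "and not", and "and" combinations given above.

def union_c(P, I):
    return lambda d: P(d) or I(d)

def diff_c(P, I):
    return lambda d: P(d) and not I(d)

def intersect_c(P, I):
    return lambda d: P(d) and I(d)

# Example with the strided sets of Figure 2:
S_i = lambda d: 1 <= d <= 197 and (d - 1) % 4 == 0
S_j = lambda d: 50 <= d <= 299 and (d + 1) % 3 == 0
both = intersect_c(S_i, S_j)
print([d for d in range(1, 100) if both(d)])   # 53, 65, 77, 89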

3.1.3 Operations on Sets of Communication Descriptors. Let D = <R, L> be a communication descriptor. We define two functions: a function N from the communication descriptor space to the array identifier space, and a function M from the communication descriptor space to the communication set space, such that N(D) = R and M(D) = L.

Suppose DL_1 and DL_2 are two sets of communication descriptors. Three operations, namely +_d, −_d, and ∩_d, are defined on these sets as follows:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

When there is no ambiguity, we also use ∪_c and ∪_d instead of +_c and +_d, respectively. It should be noted that although these operations are similar to those given by Gong et al. [1993], there is an important difference. Since we keep the communication sets accurately in terms of equalities and inequalities, we can optimize (e.g., coalesce) communication messages even if the messages do not have the same communication pattern (e.g., broadcast, point-to-point) or identical sender/receiver sets. Most of the previous approaches to global communication optimization cannot optimize these kinds of messages, mainly due to their representation of communication sets.
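The formal definitions of +_d, −_d, and ∩_d are not reproducible above; the sketch below therefore shows only one plausible reading, under the assumption that a descriptor set maps each array name N(D) to its communication set M(D) and that descriptors for the same array are combined element-wise with the operators of Section 3.1.2. It is meant to convey the flavor of the operations, not the paper's exact formulation.

# One plausible reading of the descriptor-set operations (a sketch, not the
# paper's definitions): a set of communication descriptors is modelled as a
# mapping from array name to its communication set, and the set operations
# pair up descriptors for the same array. Plain Python sets stand in for the
# symbolic communication sets here.

def union_d(DL1, DL2):
    out = dict(DL1)
    for name, s in DL2.items():
        out[name] = out.get(name, set()) | s          # +_c on matching arrays
    return out

def diff_d(DL1, DL2):
    return {name: s - DL2.get(name, set()) for name, s in DL1.items()}

def intersect_d(DL1, DL2):
    return {name: s & DL2[name] for name, s in DL1.items() if name in DL2}

A = {"X": {1, 2, 3}, "Y": {7}}
B = {"X": {3, 4}}
print(union_d(A, B), diff_d(A, B), intersect_d(A, B))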

It should be noted that our analysis works with sets of equalities and inequalities. As compared with the previous approaches based on RSDs, our technique may be slower. In order to alleviate this problem, we do not operate on the contents of the sets in every data-flow equation to be evaluated; instead we represent the sets with symbolic names and postpone the real computation on them until the end of the analysis, where the communication code should be generated. For example, suppose that a data-flow equation requires combining two sets S_x = {[x] : I_1(x)} and S_y = {[y] : I_2(y)} where I_1 and I_2 are predicates consisting of equalities and inequalities. Instead of forming the set {[z] : I_1(z) or I_2(z)} immediately and using it in subsequent computations, our approach represents the resulting set abstractly as S_x + S_y. When the whole process is finished, the resulting sets are rewritten in terms of equalities and inequalities, and then the simplify utility of the Omega library is used to simplify them. Our experience shows that this approach requires a manageably small amount of symbolic expression manipulation support and is fast in practice (see Section 6 for a cost analysis of the compilation time). Next we present our data-flow framework.
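The deferred evaluation just described can be pictured as building a small symbolic expression over named sets and lowering it only once, at code-generation time. The sketch below is illustrative only (in the compiler the lowering produces an Omega formula that is then simplified; all names here are ours).

# Sketch of deferred set evaluation: data-flow equations build a symbolic
# expression over named sets; only at code-generation time is the expression
# lowered to a concrete predicate.

class Sym:
    def __init__(self, name, pred=None, op=None, args=()):
        self.name, self.pred, self.op, self.args = name, pred, op, args
    def __add__(self, other):        # corresponds to +_c, but no work yet
        return Sym("(%s+%s)" % (self.name, other.name), op="or",
                   args=(self, other))
    def lower(self):
        """Turn the symbolic expression into a concrete predicate."""
        if self.op is None:
            return self.pred
        preds = [a.lower() for a in self.args]
        return lambda d: any(p(d) for p in preds)

S_x = Sym("S_x", pred=lambda d: d % 4 == 1 and 1 <= d <= 197)
S_y = Sym("S_y", pred=lambda d: d % 3 == 2 and 50 <= d <= 299)
combined = S_x + S_y                 # stays symbolic during the analysis
print(combined.name)                 # (S_x+S_y)
print(sum(1 for d in range(1, 300) if combined.lower()(d)))   # 121 elements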

3.2 Local (Intrainterval) Analysis

In order to make the data-flow analysis task easier, the CFG of the program is traversed prior to the local analysis phase, and for each LHS reference a pointer is stored in the header of all enclosing loop nests. This allows the compiler to reach a LHS reference inside a loop quickly during the data-flow analysis. The local analysis part of our framework computes KILL, GEN, and POST_GEN sets for each interval. Then the interval is reduced to a single node and annotated with this information.

Let R_L(ι) and R_R(ι) be the data elements obtained from references R_L and R_R, respectively, with a specific iteration vector ι. The computation of the KILL set proceeds in the forward direction; that is, the nodes within the interval are traversed in topological sort order. Let KILL(i, q) be the set of elements written (killed) by processor q in node i, and let Modified(i, q) be the set of elements that may be killed along any path from the beginning of the interval to node i (including node i). Then,

KILL(i, q) = {d | d ∈ Own(X, q)

and ∃ι, R_L s.t. d = R_L(ι) and i_l ≤ ι ≤ i_u},

Modified(i, q) = KILL(i, q) ∪ [ ∪_{j ∈ pred(i)} Modified(j, q) ],

assuming that Modified(pred(first(i)), q) = ∅, where first(i) is the first node in i. If last(i) is the last node in i, then

KILL(i, q) = Modified(last(i), q).

This last equation is used to reduce an interval into a node. Notice that i is used to denote a node in the CFG whereas ι is used for an iteration vector. In order to see how the computation of the KILL set proceeds, consider Figure 6. In this example there are two intervals corresponding to the j and i loops. We concentrate only on the computation of the KILL sets for array X (the computation of the KILL sets of the other arrays can be performed in a similar manner). The analysis starts with the first node of the innermost interval (the j loop), and proceeds as follows:

KILL(4,q) = {d | d ∈ Own(X, q) and ∃ι, J s.t. d = X(ι, J) and i_l ≤ ι ≤ i_u and j_l ≤ J ≤ j_u}

Modified(4,q) = KILL(4,q)

KILL(5,q) = ∅

Modified(5,q) = Modified(4,q) ∪ KILL(5,q) = Modified(4,q)

KILL(6,q) = {d | d ∈ Own(X, q) and ∃ι, J s.t. d = X(ι-1, J+2) and i_l ≤ ι ≤ i_u and j_l ≤ J ≤ j_u}

Modified(6,q) = Modified(5,q) ∪ KILL(6,q)

KILL(7,q) = ∅

Modified(7,q) = Modified(6,q) ∪ KILL(7,q) = Modified(6,q)

KILL(9,q) = {d | d ∈ Own(X, q) and ∃ι, J s.t. d = X(ι+1, J-3) and i_l ≤ ι ≤ i_u and j_l ≤ J ≤ j_u}

Modified(9,q) = Modified(5,q) ∪ KILL(9,q)

KILL(10,q) = ∅

Modified(10,q) = [Modified(7,q) ∪ Modified(9,q)] ∪ KILL(10,q) = Modified(7,q) ∪ Modified(9,q)

KILL(11,q) = ∅

Modified(11,q) = Modified(10,q) ∪ KILL(11,q) = Modified(10,q) = Modified(7,q) ∪ Modified(9,q) = KILL(4,q) ∪ KILL(6,q) ∪ KILL(9,q) = {d | d ∈ Own(X, q) and (∃ι, J s.t. d = X(ι, J) or d = X(ι-1, J+2) or d = X(ι+1, J-3)) and i_l ≤ ι ≤ i_u and j_l ≤ J ≤ j_u}.

Since last(3) = 11, at this point we can reduce the innermost interval into a single node and annotate it by its KILL set:

KILL(3,q) = Modified(11,q)

Then the analysis continues with the first node of the outer interval (the i loop):

KILL(2,q) = {d | d ∈ Own(X, q)

and ∃ι s.t. d = X(ι-2, ι) and i_l ≤ ι ≤ i_u}.

Modified(2,q) = KILL(2,q)

Modified(3,q) = Modified(2,q) ∪ KILL(3,q)

= {d | d ∈ Own(X, q)

and (∃ι, J s.t. d = X(ι, J) or d = X(ι-1, J+2)

or d = X(ι+1, J-3) or d = X(ι-2, ι))

and i_l ≤ ι ≤ i_u and j_l ≤ J ≤ j_u}.

Since last(1) = 3, at this point we can reduce the interval into a single node

KILL(1,q) = Modified(3,q).

Although for the sake of presentation we show the analysis here in terms of communication sets, the data-flow analysis is actually performed on sets of communication descriptors, since in general there may be accesses to several arrays. That is, the KILL set for a program that refers to arrays N_1, N_2, ... is as follows:

KILL(i,q) = {<N_1, KILL_N1(i,q)>, <N_2, KILL_N2(i,q)>, ...}

Since we concentrate on the computation of the KILL set for a single array, we simply use KILL(i,q). A similar simplification will be used in presenting the computation of the GEN(i,p,q) and POST_GEN(i,p,q) sets as well.

Fig. 6. An example program fragment. In this fragment, there are two intervals corresponding to the i and j loops, respectively.
1    DO i = i_l, i_u
2      X(i-2,i) = Y(i-1,i-1) + X(i,1)
3      DO j = j_l, j_u
4        X(i,j) = Y(i-2,i+2)
5        IF (cond)
6          X(i-1,j+2) = Y(i-2,j-2)
7          Y(i,j) = ...
8        ELSE
9          X(i+1,j-3) = Y(i+3,j-3)
10       END IF
11       Z(i,j) = Y(i-4,j)
12     END DO
13   END DO
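The forward KILL/Modified propagation that the walkthrough above performs on the inner interval of Figure 6 can be summarized by the following sketch, in which finite sets of element names stand in for the symbolic sets of the paper and the branch at nodes 6 and 9 joins at node 10 exactly as in the equations above. The encoding is ours and is only illustrative.

# Sketch of the forward KILL/Modified propagation over one interval:
# Modified(n) is the union of the predecessors' Modified sets plus KILL(n),
# and the interval's KILL set is Modified at the last node.

def interval_kill(nodes, kill, pred, last):
    """nodes must be in topological order; returns Modified(last)."""
    modified = {}
    for n in nodes:
        incoming = set().union(*(modified[p] for p in pred[n])) if pred[n] else set()
        modified[n] = incoming | kill[n]
    return modified[last]

# Inner (j-loop) interval of Figure 6: 4 -> 5 -> {6 -> 7, 9} -> 10 -> 11.
nodes = [4, 5, 6, 7, 9, 10, 11]
pred  = {4: [], 5: [4], 6: [5], 7: [6], 9: [5], 10: [7, 9], 11: [10]}
kill  = {4: {"X(i,j)"}, 5: set(), 6: {"X(i-1,j+2)"}, 7: set(),
         9: {"X(i+1,j-3)"}, 10: set(), 11: set()}
print(interval_kill(nodes, kill, pred, last=11))
# -> the union KILL(4) | KILL(6) | KILL(9), matching Modified(11,q) above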


GEN (i,p,q) is the set of elements required by processor p from processor q at node i with no preceding write (assignment) to them. The computation of the GEN proceeds in the backward direction, i.e., the nodes within each interval are traversed in reverse topological sort order. The elements that can be communicated at the beginning of a node are the elements required by any RHS reference within the node except the ones that are written by the owner before being referenced. Notice that this process involves considering all the LHS references within an interval for a given RHS reference; this leads to an exponential cost. However, there are two factors that make the analysis affordable. First, the scope of the analysis is a single interval (loop nest). In practice the number of distinct references in a loop nest is a small value. Second, since, as mentioned earlier, prior to analysis we keep pointers to all LHS references within a loop nest, we do not have to traverse the parse tree once more to search for the LHS references.

Assuming ι = (ι_1, ..., ι_n) and ι' = (ι'_1, ..., ι'_n), let ι' < ι mean that ι' is lexicographically less than ι; and let ι' <_k ι mean that ι'_j = ι_j for all j < k, and (ι'_k, ..., ι'_n) < (ι_k, ..., ι_n). Since a node can refer to multiple RHS references, we first define gen(i, R_R, p, q) as the set of elements to be sent by processor q to processor p at node i due to reference R_R. In that case we can compute

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

For the sake of explanation, we assume one RHS reference per node, and we use only GEN(i,p,q) in the following. The extension to multiple RHS references per node is straightforward. Let Comm(i,p,q) be the set of elements that may be communicated at the beginning of interval i to satisfy communication requirements from the beginning of i to the last node in the interval that contains i. Then, for an array X, we have

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Additionally, we use the following equation to reduce an interval into a single node:

GEN(i,p,q) = Comm(first(i),p,q)

In the definition of GEN, R_R denotes the RHS reference, and R_L denotes the LHS reference of the same statement. R'_L, on the other hand, refers to any LHS reference within the same interval. Notice that while R'_L is a reference to the same array as R_R, R_L can be a reference to any array (e.g., array Y in the formulation above). level(i) gives the nesting level of the interval (loop), with the value 1 corresponding to the outermost loop in the nest. If the dependence is loop-independent, the textual positions of the references in the nest may also need to be taken into account when computing the GEN set. In that case the formulation of the GEN set should contain terms showing the precedence relations between references. For the sake of simplicity, we assume that all the dependences that we are dealing with are loop-carried.

After the interval is reduced, the GEN set for it is recorded, and an operator F is applied to the last part of this GEN set to propagate it to the outer interval:

F(ι' <_k ι) = ι' <_(k-1) ι

For example, consider Figure 6 once more, this time concentrating on the computation of GEN sets due to array Y. Notice that array Y is written only in statement (line) 7. The analysis starts with the last statement of the innermost interval (j loop):

GEN(11,p,q) = {d | ∃ι, J s.t. i_l ≤ ι ≤ i_u and j_l ≤ J ≤ j_u and d ∈ Own(Y,q)

and d = Y(ι, J) and Z(ι, J) ∈ Own(Z,p) and not (∃ι', J' s.t. d = Y(ι', J') and i_l ≤ ι' ≤ i_u and j_l ≤ J' ≤ j_u and (ι' = ι and J' < J))}.

To keep the presentation simpler, we do not show the remaining GEN sets in this interval. The analysis proceeds as follows:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Since first(3) = 4, the innermost interval can now be reduced as follows:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

After GEN (3,p,q) is recorded, the compiler applies the F operator to GEN(3,p,q). The effect of this operator for this example is

(ι' = ι and J' < J) → (ι' < ι or (ι' = ι and J' < J)).

That is, at this point the compiler takes into account flow dependences carried by the i loop as well. Then we continue with the last statement of the outer interval (i loop):

Comm(3,p,q) = GEN(3,p, q)

Comm(2,p,q) = Comm(3,p,q) [union] GEN(2,p,q)

Since first (1) = 2, the outer interval can now be reduced:

GEN(1,p,q) = Comm(2,p,q)

Since i is the index of the outermost interval, there is no need to apply the F operator after this reduction. We should emphasize that computing the GEN sets gives us all the communication that can be vectorized or coalesced above a loop nest, i.e., our analysis easily handles message vectorization and message coalescing [Hiranandani et al. 1992]. Finally, POST_GEN(i, p, q) is the set of elements required by processor p from processor q at node i with no subsequent write to them. For an array X we have

POST_GEN(i,p,q) = {d | ∃ι s.t. i_l ≤ ι ≤ i_u and d ∈ Own(X,q)

and d = R_R(ι) and

R_L(ι) ∈ Own(Y,p) and not

(∃ J, R'_L s.t. i_l ≤ J ≤ i_u and d = R'_L(J) and ι <_level(i) J)}.

The computation of POST_GEN(i,p,q) proceeds in the forward direction. Its computation is similar to those of the KILL ( i,q) and GEN (i,p,q) sets, so we do not discuss it in detail.

3.3 Data-Flow Equations

In our framework, any communication incurred is placed at the beginning of the nodes. Here, we concentrate on the computation of a communication set called RECV. The actual send and recv sets used by the code generator are produced in a later pass of the compiler from the RECV sets discussed here using two projection functions, as explained in Section 5. Our data-flow analysis framework consists of a backward and a forward pass. In the backward pass, the compiler determines sets of data elements that can safely be communicated at specific points. The forward pass eliminates redundant communication and determines the final set of elements (if any) that should be communicated at the beginning of each node i. The data-flow equations that we present here are aggressive in the sense that a communication incurred by a nonlocal reference is hoisted to the highest point possible in the CFG. Later, in Section 4, we discuss how to refine this approach to control communication hoisting. The input for the equations consists of the GEN(i, p, q), KILL(i, q), and POST_GEN(i, p, q) sets for each i, as computed during the local analysis.

The data-flow equations for the backward analysis are given by Eqs. (1) and (2) in Figure 7. The symbol ∩ in this figure denotes ∩_d. SAFE_IN(i, p, q) and SAFE_OUT(i, p, q) are sets of communication descriptors; these denote the elements that can safely be communicated at the beginning and end of node i, respectively. Equation (1) says that an element should be communicated at a point if and only if it will be used on all of the following paths in the CFG. This is the fundamental rule that our data-flow analysis, as well as some of the previous approaches such as Kennedy and Sethi [1995], adheres to. Equation (2), on the other hand, gives the set of elements that can safely be communicated at the beginning of node i, and makes use of the GEN and KILL sets. Intuitively, an element can be communicated at the beginning of node i if and only if it is either required (generated) by node i or reaches the end of node i (in the backward analysis) and is not overwritten (killed) in it. It should be noted that if the elements contained in the SAFE_IN sets were directly communicated without any further analysis, there would be significant amounts of redundant communication. The task of the forward analysis phase is to eliminate redundant communication.

[Figure 7. Data-flow equations. MATHEMATICAL EXPRESSIONS NOT REPRODUCIBLE IN ASCII]

The data-flow equations for the forward analysis are given by Eqs. (3), (4), and (5) in Figure 7; these equations observe the following two rules:

(1) a node should not fetch data needed by a successor unless it dominates that successor and

(2) a successor should ignore what a predecessor has received so far unless that predecessor dominates it.

RECV_IN(i, p, q) and RECV_OUT(i, p, q) denote the sets of communication descriptors containing the elements that have been communicated so far (at the beginning and end of node i, respectively) from q to p. On the other hand, RECV(i, p, q) denotes the set of communication descriptors containing the elements that should be communicated from q to p at the beginning of node i; it is finally used by the communication generation portion of the compiler to generate the actual send and recv commands, as explained in Section 5. Equation (3) simply says that the communication set arriving at a join node can be found by intersecting the sets for all the joining paths. Equation (4) is used to compute the RECV set, which corresponds to the elements that can be communicated at the beginning of the node except the ones that have already been communicated (RECV_IN). The elements that have been communicated by the end of node i (that is, the RECV_OUT set) are simply the union of the elements communicated up to the beginning of i, that is, the elements communicated at the beginning of i provided that the condition in Eq. (5) is not satisfied (except the ones that have been overwritten (killed) in i), and the elements communicated within i and not written subsequently (POST_GEN), again provided that the condition in the equation is not satisfied. It should be emphasized that all these sets are communication descriptor sets, and the order of operations as indicated by the parentheses is important.
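Since the Figure 7 equations themselves are not reproducible here, the following sketch gives one simplified reading of the two passes as described in the prose: the backward pass intersects over successors (data is fetched only if it is needed on every following path) and forms SAFE_IN as GEN plus whatever survives KILL, while the forward pass intersects over predecessors and communicates only what has not already been received. Finite element sets are used, and the POST_GEN term and the dominance side conditions of Eq. (5) are omitted for brevity; all names are illustrative.

# Simplified sketch of the backward and forward data-flow passes.

def backward_pass(nodes, succ, GEN, KILL):
    """SAFE_OUT(i) = intersection of SAFE_IN over successors (needed on all
    paths); SAFE_IN(i) = GEN(i) | (SAFE_OUT(i) - KILL(i))."""
    safe_in, safe_out = {}, {}
    for i in reversed(nodes):                      # reverse topological order
        outs = [safe_in[s] for s in succ[i]]
        safe_out[i] = set.intersection(*outs) if outs else set()
        safe_in[i] = GEN[i] | (safe_out[i] - KILL[i])
    return safe_in

def forward_pass(nodes, pred, safe_in, KILL):
    """RECV_IN(i) = intersection of RECV_OUT over predecessors; RECV(i) is
    what is still missing; RECV_OUT(i) adds it and drops killed elements."""
    recv_in, recv_out, recv = {}, {}, {}
    for i in nodes:                                # topological order
        ins = [recv_out[p] for p in pred[i]]
        recv_in[i] = set.intersection(*ins) if ins else set()
        recv[i] = safe_in[i] - recv_in[i]
        recv_out[i] = (recv_in[i] | recv[i]) - KILL[i]
    return recv

nodes = ["n1", "n2", "n3"]
succ = {"n1": ["n2"], "n2": ["n3"], "n3": []}
pred = {"n1": [], "n2": ["n1"], "n3": ["n2"]}
GEN  = {"n1": {"B(1,2)"}, "n2": set(), "n3": {"B(1,2)", "B(2,2)"}}
KILL = {"n1": set(), "n2": set(), "n3": set()}
safe_in = backward_pass(nodes, succ, GEN, KILL)
print(forward_pass(nodes, pred, safe_in, KILL))
# everything is fetched once at n1; nothing is re-fetched at n2 or n3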

3.4 Global Data-Flow Analysis

Our approach starts by computing the GEN, KILL, and POST_GEN sets for each node. Then the contraction phase of the analysis reduces the intervals from innermost to outermost and annotates them with GEN, KILL, and POST_GEN sets. When a reduced CFG with no cycles is reached, the expansion phase starts, and the RECV sets for each interval are computed, this time from outermost to innermost. There is one important point to note: before starting to process the next inner graph, the RECV_IN set of the first node in this graph is set to the RECV set of the interval that contains it. More formally, in the expansion phase, we set

(6) RECV_IN(i, p, q)^(kth pass) = RECV(i, p, q)^((k-1)th pass)

This assignment then triggers the next pass in the expansion phase. Before the expansion phase starts, RECV_IN(i, p, q)^(1st pass) is set to ∅. Figure 8 shows the overall algorithm COMM-OPT followed by the compiler to generate the send and recv sets. Notice that, due to Eqs. (1) and (2) in Figure 7, a datum can only be communicated when it is safe to do so (i.e., the semantics of the program is preserved). In the forward analysis, the RECV sets contain only the elements that need to be communicated; therefore no stale data are used, and correctness is ensured.

Fig. 8. Communication optimization algorithm COMM-OPT based on data-flow analysis. This algorithm computes the send and recv sets.

INPUT: A connected CFG.

OUTPUT: A processed CFG with optimized communication calls.

Step (a) Pre-processing phase:

(a.1) The CFG is traversed and in each loop a pointer for each LHS it encloses is stored;

(a.2) The CFG is traversed to add empty else branches to "if" constructs and to eliminate the critical edges;

(a.3) The "dominance" relation for each node in the CFG is computed.

Step (b) Initialization phase: For each node in the initial CFG, KILL, GEN and POST_GEN sets are computed in terms of symbolic set names;

Step (c) Contraction phase: Until a CFG with no cycles is reached, recursively each CFG is handled by reducing its intervals and annotating each interval by its KILL, GEN and POST_GEN sets;

Step (d) Expansion phase: For each intermediate CFG, the following is repeated:

(d.1) Using data-flow Equations (1) and (2) in Figure 7, the SAFE_IN sets are computed in backward direction;

(d.2) Using data-flow Equations (3), (4) and (5) in Figure 7, the RECV sets are computed in forward direction;

(d.3) The CFG is expanded; Equation (6) is used to trigger the data-flow activity in the new CFG;

Step (e) Substitution phase: The symbolic set names in the resultant RECV sets are replaced with actual sets consisting of equalities and inequalities;

Step (f) Set generation phase: The Omega library is called to generate send and recv sets used by the code generator from the RECV sets.

3.5 Example

We use the synthetic benchmark program in Figure 9(a) to illustrate our framework. We concentrate on the communication placement at the higher-level CFG, which is acyclic. Figure 9(b) shows the message-vectorized program with communication calls before loop bounds reduction and guard insertion. The notation send{B, q} means that some elements of array B should be sent to q; recv{B, q} is defined similarly. We omitted the number of elements communicated from the figure to keep the code clear. In this example, communication arises only due to references to array B. A loop-based communication analysis places eight send and eight recv calls (in fact these are themselves loop nests) for the eight RHS references marked in bold in Figure 9(b). The communication points for these references are just above the corresponding loop nests. For example, the communication required due to reference B(i-1, j-1) in line 33 in Figure 9(b) would be performed in line 29. Notice that in this example array B is written only once (in line 38).

[Figure 9 PROGRAM LISTING NOT REPRODUCIBLE IN ASCII]

Without loss of generality, assume after local analysis that the GEN and KILL sets are obtained as shown in the second and third columns of Table I, respectively. The corresponding line numbers are shown in the first column. Notice for this example that POST_GEN(i,p,q) is ∅ for every i (column 4). The fifth column in Table I shows the SAFE_IN sets for array B after the backward analysis, corresponding to the lines given in the first column of the same table. Notice that the communication set S_25 cannot be hoisted above line 22 due to the conditional branch. The sixth column, on the other hand, shows the final RECV sets for the same array after the forward analysis and simplifications. Notice that the write to array B in line 38 kills all the communication before it.

[TABULAR DATA I NOT REPRODUCIBLE IN ASCII]

For this example, the data-flow analysis framework achieves the following:

--The communication sets due to references B(i-1, j) and B(i-1, j+1) in line 45 of Figure 9(b) are combined; that is, our approach handles message coalescing easily.

--Communication due to reference B(i-1, j-1) in line 51 is combined with the communication in line 45; and this combined communication can be performed above line 43.

--Similarly, the communication sets due to references in lines 33, 11, and 4 can be combined and performed above line 2 in Figure 9(b).

--The communication in line 22 is reduced in volume (from 4,032 elements per processor to 3,024 elements per processor).

--The communication in lines 7, 8, 29, and 48 is entirely eliminated.

--Overall, for a single processor, 16 communication calls (eight send and eight recv) are replaced by six communication calls.

The resulting optimized program is shown in Figure 9(c). It should be emphasized that the final communication sets are precise; i.e., there is no overestimation. Moreover, these communication sets can be enumerated using the Omega library [Kelly et al. 1995]. Notice that all communication sets are enumerated in terms of abstract processors p and q; in general, if desired, message aggregation can also be performed easily.

3.6 Extension for Interprocedural Analysis

It is relatively straightforward to extend our analysis to work interprocedurally. In a simple interprocedural setting, our approach can be used as follows. We first build a call graph [Aho et al. 1986] where each node corresponds to a procedure and where there is a directed edge between two nodes P_1 and P_2 if and only if P_1 calls P_2. We assume that there is no recursive procedure call. We then traverse the call graph in two steps corresponding to the backward and forward analyses. In the backward analysis, we traverse the graph in such a way that a node is visited only after all of the nodes it calls have been visited. When a node P_k is visited, the compiler runs our algorithm for the backward analysis. After the algorithm terminates, we summarize this node's communication by using three sets: GEN, KILL, and POST_GEN.

Notice that these three sets completely define the communication behavior of P_k. Subsequently, P_k is transformed into a new single node and is annotated with these sets (of course, all formal parameters are replaced with actual parameters). When the whole program is reduced to a single node, the forward analysis starts. This time we traverse the call graph in such a way that a node is visited only after all the nodes that call it have been visited. During the visit of a procedure, we compute the RECV sets for each node within it.

It should be noted that there are several interprocedural communication optimization algorithms (e.g., Creusillet and Irigoin [1995], and Hall et al. [1992; 1995]) with different degrees of sophistication, and the detailed analysis of communication optimization across procedure boundaries is beyond the scope of this article. However, we believe that for most of the algorithms found in the literature, the summarized communication information represented by GEN, KILL, and POST_GEN would be sufficient to optimize communication interprocedurally.

4. HOISTING COMMUNICATION VERSUS MINIMIZING THE NUMBER OF MESSAGES

The approach explained so far is focused on hoisting communication as far as possible, and in general it results in a reduction in communication volume as well as in the number of messages. However, as also pointed out by others, hoisting the communication too eagerly can, under some circumstances, lead to an excessive buffer requirement [Kennedy and Sethi 1995] and an increase in the number of communication calls inserted [Chakrabarti et al. 1996]. In particular, failing to take resource constraints into account may affect the correctness of the communication placement. For example, if the buffer requirements exceed the maximum available buffer, the program may stall [Kennedy and Sethi 1997]. One way to prevent these problems is to avoid hoisting communication aggressively and to reduce the breaking of messages into smaller ones. Since the optimal placement of communication is NP-hard [Garey and Johnson 1979], we present a simple heuristic that stops accumulating communication sets as soon as it encounters a node that satisfies a predicate P(i). The content of this predicate depends on the specific implementation. A few alternatives are presented in Table II. For example, Kennedy and Sethi [1995] use the third alternative. An implementation can also employ a combination of these alternatives. For example, consider the predicate obtained by the conjunction of the first and second alternatives, i.e., P(i) = [(KILL(i, q) ≠ ∅) and (GEN(i, p, q) = ∅)].

Table II. Possible P(i) Predicates to Control Communication Hoisting. Any logical combination of these predicates can also be used. Note that changing P(i) changes the behavior of the optimization algorithm completely. With appropriate P(i) predicates, most of the previous optimization algorithms can be simulated.
P(i)                                  Comment

KILL(i, q) ≠ ∅                        avoids message splitting

GEN(i, p, q) = ∅                      avoids hoisting too far--clustering

Buffer_Length(i) ≥ limit              avoids protocol delays and hot spots

Number_of_Buffers(i) ≥ limit          avoids buffer pressure
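How such a predicate interacts with hoisting can be sketched as follows: walking backward from the node that uses the data, the accumulated communication stops rising past the first node at which P(i) holds. The two predicates shown correspond to the first two rows of Table II; buffer-based predicates would additionally need machine-specific limits. The encoding and names below are illustrative only.

# Sketch of predicate-gated hoisting: the communication placement moves
# outward along the path to the use until a node satisfying P(i) is met.

def hoist_point(path_to_use, GEN, KILL, P):
    """path_to_use: nodes from an outer point down to the communication's use.
    Returns the earliest node the communication can be hoisted to."""
    place = path_to_use[-1]                  # start at the use itself
    for node in reversed(path_to_use[:-1]):  # try to move it further out
        if P(node, GEN, KILL):
            break
        place = node
    return place

def avoids_splitting(node, GEN, KILL):       # row 1: KILL(i,q) is nonempty
    return bool(KILL[node])

def avoids_clustering(node, GEN, KILL):      # row 2: GEN(i,p,q) is empty
    return not GEN[node]

path = ["header", "writes_B", "reads_B"]
GEN  = {"header": set(), "writes_B": set(), "reads_B": {"B(i-1,j)"}}
KILL = {"header": set(), "writes_B": {"B(i,j)"}, "reads_B": set()}
print(hoist_point(path, GEN, KILL, avoids_splitting))
# -> "reads_B": the write to B just above the use stops the hoisting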


The data-flow equations given in Figure 11 are very similar to those shown in Figure 7. The only difference is in the computation of the SAFE_IN(i, p, q) set, in which the predicate is taken into account. The reason for this is to prevent a communication set from breaking into smaller sets, each requiring a message of its own. This also eliminates some of the complexity of the resultant code. A possible impact of the new approach is shown in Figure 10. In this figure S_G and S_K denote the GEN and KILL sets, respectively, for the node shown. The two approaches described in this article behave similarly for the cases shown in Figures 10(a) and (b). But when a node performs only writes and no reads, the approach in Figure 7 still hoists the communication as shown in Figure 10(c), whereas the approach in Figure 11 stops hoisting as shown in Figure 10(d). That is, the new approach does not issue a communication call unless there are additional elements required by the node. This, in turn, reduces the number of communication calls.

[Figures 10-11 PROGRAM LISTING NOT REPRODUCIBLE IN ASCII]

To compare our new approach with the previous one (Figure 7), consider the example program fragment given in Figure 12, a modified version of the last part of the program shown in Figure 9. Columns two, three, and four of the top part of Table III show the GEN, KILL, and POST_GEN sets, respectively, corresponding to the line numbers given in the first column. The fifth and sixth columns of the top part of the table show the SAFE_IN and RECV sets, respectively, of the previous approach. Although we obtain some reduction in communication volume, the number of messages is three, which is larger than that of the loop-based approach (column 7) that uses message vectorization alone. The bottom part of Table III, on the other hand, presents the SAFE_IN and RECV sets obtained by our new approach. In that case the number of messages is one, and communication volume is reduced as well.

Fig. 12. An example program fragment illustrating the problem caused by aggressive hoisting; aggressive communication hoisting does not work well for this example.
!HPF$ processors PROC(0:3)
!HPF$ distribute (cyclic(4),*) onto PROC :: A, B, C, D
 real A(128,128), B(128,128), C(128,128), D(128,128)
  1   DO i = 32, 63
  2    DO j = 1, 63
  3     B(i,j)=D(i,j)+2
  4    END DO
  5   END DO
  6   DO i = 1, 31
  7    DO j = 1, 63
  8     B(i,j)=B(i,j)-C(i,j)+A(i,j)
  9    END DO
  10  END DO
  11  DO i = 2, 127
  12   DO j = 2, 127
  13    A(i,j)=(B(i-1,j)+B(i-1,j+1))/2.0
  14   END DO
  15  END DO
  16  DO i = 2, 127
  17   DO j = 2, 127
  18    C(i,j)=(B(i,j)+B(i-1,j-1))/2.0
  19   END DO
  20  END DO


[TABULAR DATA III NOT REPRODUCIBLE IN ASCII]

The main advantages of the new approach are less computation time during compilation, less complex send/recv loops, and a reduced number of communication messages. However, in real programs, when a communicated array is written by the owner processor, it is usually written entirely; therefore, the two approaches discussed behave similarly in practice.

5. COMMUNICATION GENERATION

Our communication code generator uses the Omega library from the University of Maryland [Kelly et al. 1995; Pugh 1992]. After the RECV(i, p, q) sets are obtained in terms of symbolic expressions, they are rewritten in terms of equalities and inequalities. Then the Omega library is called to generate the send and recv loops.

Let us now consider the example given in Figure 9 (and Table I) once more to show how the communication sets are generated. We first concentrate on the computation of S_4 +_c S_11 +_c S_33. The compiler keeps this set as a symbolic expression until the code generation phase, where it inserts the equalities and inequalities corresponding to S_4, S_11, and S_33, and then calls the Omega library to enumerate the elements. Figure 13 shows the communication sets for S_4, S_11, S_33, and S' = S_4 +_c S_11 +_c S_33 as represented in Omega. A set element in this figure is represented as a quadruple [q, p, d_1, d_2], meaning that the array element indexed by [d_1, d_2] should be transferred from q to p. Later in the code generation, the projection function proj_R := {[q, p, d_1, d_2] → [q, d_1, d_2]} is applied to this set to generate the recv set, and similarly the projection function proj_S := {[P, p, d_1, d_2] → [p, d_1, d_2]} is applied to generate the send set, for a particular processor P. Notice that deriving the send and recv sets from a common set ensures correctness. In Figure 13, l_1 and c_1 denote the coordinates of an element to be communicated in the source (sending) processor, whereas l_2 and c_2 denote its coordinates in the target (receiving) processor. l_3 and c_3, on the other hand, refer to the coordinates of the LHS reference in the same statement. Notice that the bounds on l_2 are adjusted in the appropriate directions to accommodate the received (nonlocal) elements, and the entire procedure works on the local address space similar to the one shown in Figure 5(c).

[Figure 13 MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
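
Since Figure 13 itself is not reproducible here, the following is a hedged illustration (a made-up set S_ex of our own, not the authors' S_4, S_11, or S_33) of how such a communication set can be written down as equalities and inequalities. It considers only the reference B(i-1,j) in lines 11-15 of Figure 12 under distribute (cyclic(4),*) onto PROC(0:3), ignores the second right-hand-side reference, and does not subtract elements that are already available locally:

    \[
    S_{ex} = \bigl\{\, [q, p, d_1, d_2] : \exists\, k_1, k_2 \text{ such that }
    0 \le p, q \le 3,\; p \ne q,\; 1 \le d_1 \le 126,\; 2 \le d_2 \le 127,
    \]
    \[
    4p + 16k_1 + 1 \le d_1 + 1 \le 4p + 16k_1 + 4,\;
    4q + 16k_2 + 1 \le d_1 \le 4q + 16k_2 + 4 \,\bigr\}
    \]

A quadruple [q, p, d_1, d_2] in S_ex means that element B(d_1, d_2), owned by processor q (the owner of row d_1), must be sent to processor p, the owner of row d_1 + 1, which executes iteration (d_1 + 1, d_2) under the owner-computes rule.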

After the projection functions are applied, the code generator part of the Omega library is called to generate the loops to enumerate the [q, d_1, d_2] and [p, d_1, d_2] triples. Finally, the loops are converted to Fortran, and the internal data structures of the compiler are updated. As an example, the code enumerating the triples for (S_25 +_c S_33) -_c (S_4 +_c S_11 +_c S_33) is shown in Figure 14(a) as C code for the send set and in Figure 14(b) for the recv set. In these codes, process(.) is an implementation-specific function that handles the resulting elements. These codes enumerate the elements, and only the elements, that should be communicated between q and p. The remaining sets are computed and enumerated similarly. Notice that redundant equalities and inequalities can be eliminated before the code generation phase by using the "simplify" utility provided by the Omega library.

[Figure 14 PROGRAM LISTING NOT REPRODUCIBLE IN ASCII]
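
Figure 14 itself is not reproducible here. As a rough, hedged illustration of the kind of enumeration loops involved, and not the actual code of Figure 14, the fragment below visits the elements of a (cyclic(4),*) distributed 128x128 array (as in Figure 12) whose rows are owned by a given processor q of PROC(0:3); process() stands in for the implementation-specific handler mentioned above.

    #include <stdio.h>

    /* Placeholder for the implementation-specific element handler. */
    static void process(int q, int d1, int d2)
    {
        printf("processor %d owns element (%d,%d)\n", q, d1, d2);
    }

    /* Under distribute (cyclic(4),*) onto PROC(0:3), processor q owns rows
     * d1 = 16*k + 4*q + j with j = 1..4 and k = 0..7.  The loops below visit
     * the owned elements that fall inside the region 2 <= d1, d2 <= 127.    */
    void enumerate_owned(int q)
    {
        for (int k = 0; k <= 7; k++)
            for (int j = 1; j <= 4; j++) {
                int d1 = 16 * k + 4 * q + j;
                if (d1 < 2 || d1 > 127)
                    continue;
                for (int d2 = 2; d2 <= 127; d2++)
                    process(q, d1, d2);
            }
    }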

As a final note, although our use of the Omega library increases the compilation time compared to previous approaches based on RSDs, this increase was not an issue for the programs we experimented with and was more than compensated for by the run-time gains due to optimized communication, as explained in the next section.

6. EXPERIMENTS

In this section we report experimental results for eight programs that exhibit regular communication behavior. The salient characteristics of these programs are given in Table IV. addx and eflux are two subprograms from the Perfect Club Benchmarks. The hydro_m code is a modified version of hydro; to obtain it, two modifications were made to the program, aimed at highlighting the difference between our two global optimization techniques. First, the second loop nest is distributed over its statements. Second, the loop bounds in the first loop nest are reduced to one-fourth of their original values. The REFS column shows the number of references in the program in question, whereas the C REFS column gives the number of references that require communication. The ITER column shows how many times the outermost timing loop has been iterated for each program. Except for some hard-coded (small) values of array dimensions, the size of each dimension of an array used in the experiments is set to the value shown in the SIZE column. In tred2 we used 60 and 120 as the SIZE parameter for 8 and 16 processors, respectively. The DISTR column shows how the highest-dimensional arrays in the program are distributed. A "D" in a dimension means that the dimension is distributed across processors, while an asterisk denotes a nondistributed dimension as in HPF [Koelbel et al. 1994].

Table IV. Programs in Our Experiment Set and Their Characteristics. The REFS column shows the number of references in the program, whereas the c REFS column gives the number of references that require communication. The ITER column shows how many times the outermost timing loop has been iterated for each program. Each dimension of an array used in the experiments is set to the value shown in the SIZE column. The DISTR column shows how the highest-dimensional arrays in a program are distributed. A "D" in a dimension means that the dimension is distributed across processors, while an asterisk denotes a nondistributed dimension.
Program      Source           Arrays                 REFS

hydro        Livermore        nine 2D                  52
hydro_m      Livermore        nine 2D                  52
adi          Livermore        three 3D, three 1D       33
tomcatv      Spec92           seven 2D, two 1D         75
swim         Spec92           fourteen 2D             196
addx         Perfect Club     five 3D, one 2D          72
eflux        Perfect Club     four 3D, one 2D          76
tred2        Eispack          two 2D, two 1D           42

Program      C REFS     DISTR          SIZE      ITER

hydro          10       (*, D)         400         20
hydro_m        10       (*, D)         400         20
adi             6       (*, D, *)      400         10
tomcatv        20       (D, *)         400         10
swim           43       (D, *)         513         20
addx           32       (D, *, *)      194          1
eflux          13       (D, *, *)      5000        10
tred2          22       (D, *)        60/120        1

Program        Brief Description

hydro          2D hydrodynamics
hydro_m        modified hydro
adi            iterative method
tomcatv        2D mesh generation
swim           water equation solver
addx           mesh related comp.
eflux          mesh related comp.
tred2          matrix reduction


The distributed dimensions shown in the table are the best distributions for these programs as far as communication is concerned. For example, selecting a (*, D) distribution for tomcatv would prevent message vectorization. For each distributed dimension we experimented with four different distributions: block (BLK), cyclic (CYC), cyclic(4) (CYC(4)), and cyclic(7) (CYC(7)). The last two distributions are included to demonstrate the effectiveness of our approach with block-cyclic distributions, where most of the previous techniques fail. The two cyclic factors, 4 and 7, were selected arbitrarily, one being a power of two and the other prime. Gupta and Banerjee [1992] note that for tred2 the block-cyclic distribution is the best choice. We also found that in addx a block-cyclic distribution performs best (depending on the number of processors used).

We have found that, except for hydro_m, our two global optimization approaches given in Figures 7 and 11 result in the same optimized code for all of these programs. For each program except hydro_m we experiment with two different versions of the code. The base version does not perform any global communication optimization but does perform message vectorization. In fact, a direct application of the owner-computes rule without any optimization results in run-time resolution, in which the ownership and communication for each reference are computed at run-time. Since each processor must execute the entire iteration space to compute ownership, this method incurs a large amount of overhead. Communication in run-time resolution programs is also very inefficient, as it involves transmission of a large number of small messages [Palermo et al. 1994]. Instead, we considered the message-vectorized version with loop bounds reduction as the base version. Since most compilers for message-passing architectures apply some kind of message vectorization, we felt that it would be unfair to compare our method against run-time resolution without loop bounds reduction. Notice, however, that even within a single loop nest our global optimization approach subsumes most local optimizations, including message vectorization, message coalescing, and message aggregation. For all the programs except hydro_m we refer to the globally optimized version as opt. In the hydro_m code, opt refers to the approach given in Figure 7, whereas opt* denotes the approach given in Figure 11. For all the programs and versions, we also applied an optimization that we call communication pattern reuse. For example, assuming a (*,D) distribution for all arrays, in a statement such as X(i,j) = Y(i,j-1) + Z(i,j-1), arrays Y and Z have the same communication structure; therefore, we can generate the communication loops only once and reuse the communication pattern with a different array name, as sketched below. This optimization has not been fully implemented yet.
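
A minimal sketch of the communication-pattern-reuse idea follows; it is an assumption-laden illustration rather than the compiler's actual output, and pack() is a hypothetical helper that copies one nonlocal element into the outgoing message buffer.

    #include <stdio.h>

    /* Hypothetical helper: append element (d1, d2) of the named array to the
     * message being assembled for the neighboring processor.                */
    static void pack(const char *array_name, int d1, int d2)
    {
        printf("pack %s(%d,%d)\n", array_name, d1, d2);
    }

    /* The enumeration loop for the shared access pattern Y(i,j-1)/Z(i,j-1)
     * is generated once; only the array name changes between uses.          */
    static void enumerate_pattern(const char *array_name, int lo, int hi, int col)
    {
        for (int d1 = lo; d1 <= hi; d1++)
            pack(array_name, d1, col);
    }

    /* Reuse the same pattern for both right-hand-side arrays of the
     * statement X(i,j) = Y(i,j-1) + Z(i,j-1).                                */
    void pack_for_statement(int lo, int hi, int col)
    {
        enumerate_pattern("Y", lo, hi, col);
        enumerate_pattern("Z", lo, hi, col);
    }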

We now briefly discuss the implementation status of our framework. We have finished implementing the local communication analysis, the Omega-Parafrase data structure interfacing, and the communication loop generation. The global communication analysis and the communication pattern reuse optimization are currently being implemented. Experimenting with different message-combining techniques (different P(i) predicates) and extending the framework to an interprocedural setting are left for future work. Below, we present the first results from our implementation.

We measure the effectiveness of our approach in terms of three different but correlated parameters: number of communication messages across all processors, data volume to be communicated across all processors, and execution time. The number of messages and the communication volume are counted dynamically during the execution. The execution times are obtained on a 16-node IBM SP-2 at the Center for Parallel and Distributed Computing at Northwestern University. Each node of this machine has 128MB memory, 2GB disk, and an IBM Power2 processor.

Tables V through XII give the number of communications, the communication volume, and the execution times (in seconds) for the base and opt versions of our programs. Table XIII summarizes the improvement in the number of messages. For hydro_m there are two rows, corresponding to our two methods (opt and opt*, from top to bottom). Overall there is a 32% reduction in the number of messages. The improvement with 16 processors is slightly higher than that with 8 processors, because with 16 processors there are, in general, more communication messages to optimize. It is also interesting to note that our optimization technique achieves a 33% improvement with block-cyclic distributions (CYC(4) and CYC(7)), where most of the previous techniques fail. As expected, for hydro_m our second approach, which controls communication hoisting, performs better than aggressive hoisting.

Table V. Results for hydro on an IBM SP-2. Top: Number of communications for different versions. Middle: Communication volume in megabytes for different versions. Bottom: Execution times in seconds for the different versions.
                               # of PROCS = 8

version       (*, BLK)    (*, CYC)    (*, CYC (4))   (*, CYC (7))

base            1,410       1,440        1,440          1,440
opt             1,120       1,280        1,280          1,280

                               # of PROCS = 8

version       (*, BLK)    (*, CYC)    (*, CYC (4))   (*, CYC (7))

base             2.1        108.9         27.1           14.8
opt              1.9        97.0          24.2           12.0

                               # of PROCS = 8

version       (*, BLK)    (*, CYC)    (*, CYC (4))   (*, CYC (7))

base            4.12        4.75          5.09           4.83
opt             3.07        3.74          3.87           3.37

                               # of PROCS = 16

version       (*, BLK)    (*, CYC)    (*, CYC (4))   (*, CYC (7))

base            3,005       3,880        3,880          3,880
opt             2,424       2,560        2,560          2,560

                               # of PROCS = 16

version       (*, BLK)    (*, CYC)    (*, CYC (4))   (*, CYC (7))

base             5.1        111.0         28.8           17.1
opt              4.3         99.1         26.0           13.9

                               # of PROCS = 16

version       (*, BLK)    (*, CYC)    (*, CYC (4))   (*, CYC (7))

base            2.81        3.33          3.83           3.06
opt             2.11        2.75          2.94           2.80


Table VI. Results for hydro_m on an IBM SP-2. Top: Number of communications for different versions. Middle: Communication volume in megabytes for different versions. Bottom: Execution times in seconds for the different versions.
                             # of PROCS = 8

version       (*, BLK)   (*, CYC)   (*, CYC(4))   (*, CYC(7))

base           1,540      1,604        1,604         1,604
opt            1,110      1,227        1,227         1,227
opt*           1,110      1,180        1,180         1,180

                             # of PROCS = 8

version       (*, BLK)   (*, CYC)   (*, CYC(4))   (*, CYC(7))

base            2.0        75.4        19.1          11.0
opt             1.8        17.0         4.2           3.1
opt*            1.8         5.9         3.2           2.8

                             # of PROCS = 8

version       (*, BLK)   (*, CYC)   (*, CYC(4))   (*, CYC(7))

base            3.65       4.96        3.97          3.99
opt             2.67       3.05        3.14          2.98
opt*            2.30       2.82        2.95          2.81

                            # of PROCS = 16

version       (*, BLK)   (*, CYC)   (*, CYC(4))   (*, CYC(7))

base           3,318      3,810        3,810         3,810
opt            2,400      2,611        2,611         2,611
opt*           2,330      2,555        2,555         2,555

                            # of PROCS = 16

version       (*, BLK)   (*, CYC)   (*, CYC(4))   (*, CYC(7))

base            4.9        85.2        19.8          11.7
opt             4.1        17.8         4.3           3.9
opt*            3.9         6.6         4.4           4.1

                            # of PROCS = 16

version       (*, BLK)   (*, CYC)   (*, CYC(4))   (*, CYC(7))

base            2.40       2.82        3.10          2.88
opt             1.90       1.99        1.97          1.98
opt*            1.73       1.80        1.87          1.78


Table VII. Results for adi on an IBM SP-2. Top: Number of communications for different versions. Middle: Communication volume in kilobytes for different versions. Bottom: Execution times in seconds for the different versions.
                             # of PROCS = 8

version       (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base            710        968          968            968
opt             288        480          480            480

                             # of PROCS = 8

version       (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base            20.1      187.2         46.8           27.0
opt             11.4       94.0         23.3           13.0

                             # of PROCS = 8

version       (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base            0.58       0.79         0.64           1.11
opt             0.43       0.52         0.47           0.58

                             # of PROCS = 16

version       (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base           1,640      1,922        1,922          1,922
opt              644        960          960            960

                             # of PROCS = 16

version       (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base            84.0      187.2         46.8           27.0
opt             53.9       94.0         23.3           13.0

                             # of PROCS = 16

version       (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base            1.08       1.13         1.10           1.98
opt             0.81       0.87         0.88           0.84


Table VIII. Results for tomcatv on an IBM SP-2. Top: Number of communications for different versions. Middle: Communication volume in megabytes for different versions. Bottom: Execution times in seconds for the different versions.
                             # of PROCS = 8

version       (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base            140       6,396        1,612           932
opt              56         124          124           124

                             # of PROCS = 8

version       (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base            0.35       19.2         4.9            2.8
opt             0.06        6.1         1.5            0.86

                             # of PROCS = 8

version       (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base            1.39       3.19         2.39           2.41
opt             1.06       1.31         1.44           1.34

                             # of PROCS = 16

version       (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base            300       6,498        1,694           998
opt             120         252          252           252

                             # of PROCS = 16

version       (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base            0.75       19.2         4.9            2.8
opt             0.11        6.1         1.5            0.86

                             # of PROCS = 16

version       (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base            1.30       2.25         2.26           2.29
opt             0.88       1.06         1.09           1.07


Table IX. Results for swim on an IBM SP-2. Top: Number of communications for different versions. Middle: Communication volume in megabytes for different versions. Bottom: Execution times in seconds for the different versions.
                              # of PROCS = 8

version        (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base            3,967     120,142       31,598         17,593
opt             3,678      84,182       22,358         13,753

                              # of PROCS = 8

version        (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base             7.40      240.20       62.20          37.90
opt              7.10      163.80       44.40          26.00

                              # of PROCS = 8

version        (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base            11.31      36.55        21.14          19.64
opt             10.48      25.47        18.47          16.13

                              # of PROCS = 16

version        (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base            8,215     125,142       34,598         21,593
opt             7,615      88,182       26,358         19,753

                              # of PROCS = 16

version        (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base            15.80      248.00       68.00          42.40
opt             14.20      168.00       48.50          30.10

                              # of PROCS = 16

version        (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base             7.03      19.22        12.20          12.24
opt              6.71      11.12        10.78          10.31


Table X. Results for addx on an IBM SP-2. Top: Number of communications for different versions. Middle: Communication volume in megabytes for different versions. Bottom: Execution times in seconds for the different versions.
                             # of PROCS = 8

version       (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base           91,470     92,014       91,462         93,646
opt            57,266     57,538       57,190         58,626

                             # of PROCS = 8

version       (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base            0.37       0.38         0.37           0.37
opt             0.23       0.23         0.23           0.24

                             # of PROCS = 8

version       (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base            5.18       6.65         5.10           5.44
opt             3.33       4.94         4.12           3.16

                             # of PROCS = 16

version       (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base           98,078     98,622       98,614        101,342
opt            61,426     61,698       61,690         63,466

                             # of PROCS = 16

version       (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base            0.40       0.40         0.40           0.41
opt             0.26       0.26         0.26           0.27

                             # of PROCS = 16

version       (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base            5.99       6.48         6.89           5.70
opt             3.08       4.79         4.07           3.36


Table XI. Results for eflux on an IBM SP-2. Top: Number of communications for different versions. Middle: Communication volume in megabytes for different versions. Bottom: Execution times in seconds for the different versions.
                           # of PROCS = 8

version     (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base          203        470          470            470
opt            84        408          408            408

                           # of PROCS = 8

version     (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base          0.07      30.30         7.40           4.30
opt           0.04      30.10         7.00           4.00

                           # of PROCS = 8

version     (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base          5.90      13.10        13.10          12.96
opt           5.76      12.90        12.97          12.01

                           # of PROCS = 16

version     (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base          435        950          950            950
opt           180        816          816            816

                           # of PROCS = 16

version     (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base          0.15      30.30         7.40           4.30
opt           0.09      30.10         7.00           4.00

                           # of PROCS = 16

version     (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

base          4.22       7.46         8.78           7.41
opt           3.99       7.28         6.98           6.72


Table XII. Results for tred2 on an IBM SP-2. Top: Number of communications for different versions. Middle: Communication volume in megabytes for different versions. Bottom: Execution times in seconds for the different versions.
                                # of PROCS = 8

version        (*, BLK)      (*, CYC)    (*, CYC(4))   (*, CYC (7))

base             1,706        1,759         1,740        1,721
opt              1,650        1,719         1,718        1,711

                               # of PROCS = 8

version        (*, BLK)      (*, CYC)    (*, CYC(4))   (*, CYC (7))

base             11.5          12.1         11.9          11.9
opt              11.1          11.8         11.4          11.3

                               # of PROCS = 8

version        (*, BLK)      (*, CYC)    (*, CYC(4))   (*, CYC (7))

base             0.83          0.90         0.80          0.76
opt              0.78          0.78         0.73          0.69

                               # of PROCS = 16

version        (*, BLK)      (*, CYC)   (*, CYC (4))   (*, CYC (7))

base             2,028        2,300         2,286         2,280
opt              1,988        2,015         2,004         1,996

                               # of PROCS = 16

version        (*, BLK)      (*, CYC)   (*, CYC (4))   (*, CYC (7))
base             14.2          15.5         15.1          15.2
opt              13.7          15.0         14.7          14.7

                               # of PROCS = 16

version        (*, BLK)      (*, CYC)   (*, CYC (4))   (*, CYC (7))

base             0.91          1.16         0.90         0.89
opt              0.90          1.05         0.82         0.79


Table XIII. Percentage Improvements in Number of Messages
                             # of PROCS = 8

program            (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

hydro                 21         11          11             11
hydro_m               28         24          24             24
hydro_m               28         26          26             26
adi                   59         50          50             50
tomcatv               60         98          92             87
swim                   7         30          29             22
addx                  37         37          37             37
eflux                 47         13          13             13
tred2                  3          2           1              1
average               33         33          32             30

                                 # of PROCS = 16

program            (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

hydro                 19         34         34             34
hydro_m               28         31         31             31
hydro_m               30         33         33             33
adi                   61         50         50             50
tomcatv               60         96         85             75
swim                   7         29         24              9
addx                  37         37         37             37
eflux                 59         14         14             14
tred2                  2         12         12             12
average               33         37         35             33


Table XIV shows the percentage improvement in communication volume across all processors. We note that in both the 8- and 16-processor cases we obtain, on average, a 37% improvement over the base version; considering block-cyclic distributions alone, the improvement is 40%. As mentioned earlier, these counts are collected dynamically at run-time using the performance analysis tools available on the SP-2. It should also be emphasized that most of the improvements on adi and tomcatv come from a single nest, meaning that an aggressive loop-level optimizer that applies a combination of vectorization, coalescing, and aggregation could also obtain similar improvements there.

Table XIV. Percentage Improvements in Communication Volume
                                # of PROCS = 8

program           (*, BLK)   (*, CYC)    (*, CYC (4))   (*, CYC (7))

hydro                10         11            11             19
hydro_m              10         77            78             72
hydro_m              10         92            83             75
adi                  43         50            50             52
tomcatv              83         68            69             69
swim                  4         32            29             31
addx                 38         39            38             35
eflux                43          1             5              7
tred2                 3          2             4              5
average              27         42            41             40

                                   # of PROCS = 16

program            (*, BLK)  (*, CYC)    (*, CYC (4))   (*, CYC (7))

hydro                16         11            10             19
hydro_m              16         79            78             67
hydro_m              20         92            77             65
adi                  36         50            50             52
tomcatv              85         68            69             69
swim                 10         32            29             29
addx                 35         35            35             34
eflux                40          1             5              7
tred2                 4          3             3              3
average              29         41            39             38


Finally, Table XV gives the improvement in execution times. We note that the performance improvement for some programs, such as hydro, adi, tomcatv, and swim, is very good, whereas for eflux and tred2 the improvement is only modest. This is because the communication in the second group of codes is either small compared to the total execution time or difficult to optimize, so there is not much opportunity for improvement. Overall we obtain a 26% improvement. Our approach improves performance in all cases; more importantly, we see a 27% improvement with block-cyclic distributions, showing that a global analysis makes it possible to optimize communication even in the presence of block-cyclic distributions.

Table XV. Percentage Improvements in Execution Time
                                # of PROCS = 8

program           (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

hydro                25         21           24            30
hydro_m              27         39           21            25
hydro_m              37         43           26            30
adi                  26         34           27            48
tomcatv              24         59           40            44
swim                  7         30           13            18
addx                 36         26           19            41
eflux                 2          2            1             7
tred2                 6         13            9             9
average              21         30           20            28

                                # of PROCS = 16

program            (*, BLK)   (*, CYC)   (*, CYC (4))   (*, CYC (7))

hydro                25         17           23              8
hydro_m              21         29           36             30
hydro_m              28         36           40             36
adi                  25         23           20             58
tomcatv              32         53           52             53
swim                  5         42           12             16
addx                 49         32           41             41
eflux                 5          2           21              9
tred2                 1          9            9             11
average              21         27           29             29


Having established the benefits of our global optimization approach, we now quantify the additional costs incurred by our approach at compile-time and run-time. The results of our cost analysis are summarized in Tables XVI and XVII. All the compilation times shown in the rest of the article have been obtained on a Model 712/60 HP workstation with a 132MHz PA RISC processor, 64KB first-level cache, 1MB second-level cache, and a 256MB memory.

Table XVI. Compilation Times in Milliseconds for Different Distributions. For a given distribution, the compilation time is divided into three components: GLO is the time it takes for our global data-flow analysis to run. OME is the time the Omega library takes to generate communication loops. And REM is the remaining time in compilation, including parsing and code generation.
                          BLK                     CYC

program           GLO     OME      REM    GLO     OME     REM

hydro             157    1,883    1,414    211   1,945   1,466
hydro_m           166    1,900    1,290    188   1,906   1,400
adi               161      955    1,100    161     970   1,134
tomcatv           167    2,308    1,100    200   2,616   1,232
swim              300    2,967    1,800    284   3,817   1,834
addx              200    1,283    1,155    254   1,367   1,184
eflux             183    2,017    1,104    184   2,082   1,106
tred2             180    2,417    1,334    183   2,584   1,334
average           189    1,966    1,245    208   2,161   1,274

                        CYC (4)                  CYC (7)

program           GLO     OME      REM    GLO     OME     REM

hydro              213   2,267    1,176    213   2,280   1,466
hydro_m            199   2,444    1,176    180   2,200   1,366
adi                174     984    1,100    176     976   1,132
tomcatv            217   3,008    1,200    217   2,867   1,186
swim               266   3,767    1,834    384   3,783   1,834
addx               198   1,417    1,184    242   1,555   1,180
eflux              187   2,300    1,134    187   2,117   1,130
tred2              184   2,466    1,366    187   2,484   1,360
average            205   2,332    1,271    223   2,283   1,269


Table XVII. Cost Analysis of Our Approach Over All Distribution Types. On average, half the compilation time is spent in generating the communication loops. All the values are in percentages of the total compilation time (except the RUN column). The run-time overhead of executing these loops is not very high.
                      Breakdown (%)

program            GLO    OME    REM     GLP    LOP   DIF    RUN

hydro               5     57      38      62     48    14     6
hydro_m             5     59      36      64     48    16     6
adi                 7     43      50      50     36    14     4
tomcatv             5     66      29      71     47    24     5
swim                5     63      32      68     58    10     9
addx                8     50      42      58     54     4     8
eflux               5     63      32      68     54    14     7
tred2               3     62      35      65     55    10     9
average             6     58      36      64     50    14     7


Table XVI shows the compilation times in milliseconds for our programs under different distributions. For each distribution the compilation time is divided into three components: GLO is the time it takes for our global data-flow analysis to run; OME is the time the Omega library takes to generate communication loops; and REM is the remaining time in compilation including parsing and code generation. The extra time required to write intermediate code into disk files is excluded from these figures.

The first three columns of Table XVII show the percentages of compilation time spent in GLO, OME, and REM, considering all the distributions used in the programs. The GLP column gives the sum of the GLO and OME columns and represents the percentage of the compilation time that our global optimization approach takes (global analysis plus generating communication loops). We can see that, on average, 64% of the compilation time is spent on our global approach. However, it is also important to observe how much compilation time the base version using the Omega library would take. For the case where we do not use any global optimization but still use an Omega-based loop-level optimization, the percentages of compilation time the Omega library takes to generate communication loops are shown under the LOP column. Even without the global framework, just using the Omega library takes 50% of the compilation time on average. The DIF column shows the difference (GLP - LOP) between the global optimization approach and the loop-nest-based optimization approach, both using the Omega library. We see that the additional burden of our framework over the existing loop-level framework is only 14%.
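
As a concrete reading of one row of Table XVII: for adi, the global approach accounts for GLP = GLO + OME = 7% + 43% = 50% of the compilation time, a loop-level Omega-based optimizer alone accounts for LOP = 36%, and the extra cost attributable to the global framework is DIF = 50% - 36% = 14%.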

We can conclude that a hypothetical global optimization approach using RSDs to represent communication sets could eliminate at most 64% of the compilation time. This is a theoretical bound, as we do not know of any zero-cost RSD-based framework that can handle block-cyclic distributions globally. Given the gains in execution time, we believe that the extra overhead our approach incurs at compile time is tolerable. In general, over several runs, the extra compilation time will be amortized. Moreover, we can expect Omega-like tools to become much faster in the future.

The RUN column shows the percentage of execution time spent on executing the communication loops (without the communication statements themselves). On average, only 7% of the execution time is spent on these loops; therefore, the run-time overhead incurred by our Omega-based approach is reasonable.

We also compared the compilation time taken by our Omega-based global approach with that of an approach based on processor-tagged descriptors (PTDs) [Su et al. 1994], an enhanced form of RSDs built on top of Parafrase-2. PTDs provide an efficient way of describing distributed sets of iterations and regions of data, and are based on a single-set representation parameterized by the processor location for each dimension of a virtual mesh. Table XVIII shows the overall compilation times of the Omega-based approach (OME), the PTD-based approach (PTD), and the percentage increase (INC) when going from PTD to OME, for pure block (BLK) and pure cyclic (CYC) distributions only, as the PTD-based approach cannot handle general block-cyclic distributions. The results show that using Omega instead of an RSD-like approach increases the compilation time by 7% to 27%, averaging 19% for both block and cyclic distributions.

Table XVIII. Total Compilation Times in Milliseconds of the Omega-Based Approach and the PTD-Based Approach. The OME column and the PTD column give the compilation times obtained using the Omega-based and the PTD-based approaches, respectively. The INC column shows the percentage increase when going from PTD to OME.
                                BLK

program              OME        PTD        INC

hydro               3,454      2,715       27
hydro_m             3,356      2,644       25
adi                 2,216      2,044        8
tomcatv             3,575      3,148       14
swim                5,067      4,426       15
addx                2,638      2,241       18
eflux               3,304      2,650       25
tred2               3,931      3,355       17
average             3,443      2,903       19

                                CYC

program              OME        PTD        INC

hydro               3,622      2,927       24
hydro_m             3,494      2,801       25
adi                 2,265      2,086        9
tomcatv             4,048      3,290       23
swim                5,935      5,015       18
addx                2,805      2,615        7
eflux               3,372      2,814       20
tred2               4,101      3,390       21
average             3,705      3,117       19


7. RELATED WORK

Several papers have addressed the problem of generating local address and communication sets for HPF programs in which arrays are distributed using general block-cyclic distributions [Ancourt et al. 1997; Chatterjee et al. 1995; Gupta et al. 1996; Kennedy et al. 1995; 1996; Thirumalai and Ramanujam 1996; Venkatachar et al. 1997]. Of these, Ancourt et al. use a linear algebra framework, which renders their approach general; the remaining approaches are very efficient, but only for a restricted class of mappings. Given this lack of generality, their use in communication optimizations of the kind discussed in this article appears to be limited.

Most of the previous efforts considered communication optimization at the loop level. Although each approach has its own unique features, the general idea has been the use of an appropriate combination of message vectorization, message coalescing, and message aggregation [Balasundaram et al. 1990; Banerjee et al. 1995; Bozkus et al. 1994; Gerndt 1990; Hiranandani et al. 1992; Zima and Chapman 1991].

More recently some researchers have proposed techniques based on data-flow analysis in order to optimize communication across multiple loop nests. Agrawal and Saltz [1997] present a framework for partial redundancy elimination for communication optimization in data-parallel programs with irregular data access patterns. Amarasinghe and Lam [1993] present several algorithms to optimize communication on machines with distributed address spaces. Their approach uses the last write tree representation to eliminate redundant messages within a single loop nest. Although their technique is also based on data-flow information, they do not allow loop nests within conditionals.

Granston and Veidenbaum [1991] propose an algorithm that applies combined flow and dependence analysis to programs with parallel constructs. Their algorithm detects partial redundancies across loop nests and in the presence of conditionals. However, their approach is not directly applicable to programs with general data distributions.

Gong et al. [1993] describe optimizations that reduce communication overhead and execution time. Their optimizations include elimination of redundant communication and combining messages. However, their approach cannot handle general types of distributions, and they offer no optimizations to eliminate the excessive number of communication calls due to split operations.

Gupta et al. [1995b] present a framework to optimize communication based on data-flow analysis and available section descriptors. Their approach is aggressive in exploiting the locally available data, but fails to support general block-cyclic distributions, and the representation that they use makes it difficult to embed alignment and distribution information. Moreover, the communication set information they compute may not be precise.

Hanxleden and Kennedy [1993; 1994] present a code placement framework for optimizing communication caused by irregular array references. Although the framework provides global data-flow analysis, it treats arrays as indivisible entities; thus, it is limited in exploiting the information available at compile time. In contrast, Kennedy and Nedeljkovic [1995] offer a global data-flow analysis technique using bit vectors. Although this approach is efficient, it is not as precise as the approach presented in this article, and it is not clear how their method can be extended to handle general block-cyclic distributions.

Kennedy and Sethi [1995; 1996; 1997] show the necessity of incorporating resource constraints into a global communication optimization framework. They take the limited-buffer-size constraint into account and illustrate how strip-mining improves the efficacy of communication placement. Their approach works with multiple nests but not with general block-cyclic distributions. Since they do not give any experimental results, a direct quantitative comparison of their work with ours is not possible. Their work defines a data-flow variable called SAFE, which can be used in a manner similar to our predicate P(i). Kennedy and Sethi [1995; 1996; 1997] do not use a linear algebra framework; later work from the dHPF project at Rice [Adve and Mellor-Crummey 1998; Adve et al. 1997] includes the use of the Omega library for message optimizations.

The IBM pHPF compiler [Chakrabarti et al. 1996; Gupta et al. 1995a] achieves both redundancy elimination and message combining globally, but message combining is feasible only if the messages have identical patterns or if one pattern is a subset of another. General block-cyclic distributions, however, can lead to complicated data access patterns and communication sets which, we believe, can be represented more precisely within a linear algebra framework.

Yuan et al. [1997a; 1997b] present a communication optimization approach based on array data-flow analysis. The cost of the analysis is managed by partitioning the optimization problem into subproblems and then solving the subproblems one at a time. Since that approach is also based on RSDs, it has difficulty in handling block-cyclic distributions.

Adve et al. [Adve and Mellor-Crummey 1998; Adve et al. 1997] describe an integer-set-based approach for analysis and code generation for data-parallel programs that uses the Omega library [Kelly et al. 1995]. They consider performing message vectorization and message coalescing for general access patterns. Their method can also work with computation decomposition schemes that are not based on the owner-computes rule. These papers do not show how their techniques handle global communication optimization for multiple loop nests in the case of block-cyclic distributions.

The interval analysis used in this article was first introduced by Allen and Cocke [1976], who used it to solve several data-flow problems; the analysis was later extended by Gross and Steenkiste [1990] to array sections. The approach proposed by Gupta et al. [1995b], mentioned above, refines the technique of Gross and Steenkiste using loop-carried dependences.

In this article we used ideas from the linear algebra framework [Ancourt et al. 1997] and from data-flow analysis [Aho et al. 1986; Allen and Cocke 1976] developed for performing optimizations on the CFG representation of programs. We have shown that these two techniques blend together well, making the global communication optimization problem tractable even in the presence of general block-cyclic distributions. We should emphasize that the data-flow equations given in Figures 7 and 11 are only two representative solutions showing how the global communication problem can be cast in the linear algebra framework. We believe most of the previous approaches can also be put into this framework by redefining their communication and ownership sets in terms of equalities and inequalities. This would not only give those approaches the capability to handle arbitrary alignments and distributions, but would also provide high accuracy in manipulating the communication sets.

8. SUMMARY

Management of accesses to nonlocal data to minimize communication costs is critical for scaling performance on distributed-memory message-passing machines. We presented here a global communication optimization scheme based on two complementary techniques: data-flow analysis and a linear algebra framework. The combination of these techniques allows us to optimize communication globally and use polyhedron scanning techniques to enumerate global communication sets effectively for HPF-like alignments and distributions including block-cyclic distributions. Our framework takes into account control flow and achieves message vectorization, message coalescing, message aggregation, and redundant communication elimination, all in a unified framework. The cost of the analysis is managed by keeping the communication sets symbolically until the end of the data-flow analysis where the Omega library is called to generate actual sets in terms of equalities and inequalities. The experimental results demonstrate the effectiveness of our approach in reducing the number of messages and the volume of the data to be communicated. Future work will address the development of performance models to provide the compiler with the ability to estimate the profitability of message aggregation and coalescing globally.

ACKNOWLEDGMENTS

The authors would like to thank Evan Rosser for his help in installing the Omega library and Omega calculator, a high-level interface to the Omega library.

REFERENCES

ADVE, V., MELLOR-CRUMMEY, J., AND SETHI, A. 1997. An integer set framework for HPF analysis and code generation. Technical Report TR97-275, Computer Science Dept., Rice University.

ADVE, V. AND MELLOR-CRUMMEY, J. 1998. Advanced code generation for High Performance Fortran. In Languages, Compilation techniques, and Run-time Systems for Scalable Parallel Systems, S. Pande and D. Agrawal (Eds.), Chapter 18, Lecture Notes in Computer Science Series, Springer-Verlag. To appear.

AGRAWAL, G. AND SALTZ, J. 1997. Inter-procedural data flow based optimizations for distributed memory compilation. Software Practice and Experience, 27, 5, 519-545.

AHO, A. V., SETHI, R., AND ULLMAN, J. 1986. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, MA, second edition.

ALLEN, F. E. AND COCKE, J. 1976. A program data flow analysis procedure. Communications of the ACM, 19, 3, 137-147, March.

AMARASINGHE, S. AND LAM, M. 1993. Communication optimization and code generation for distributed memory machines. In Proc. SIGPLAN`93 Conference on Programming Language Design and Implementation, pages 126-138, Albuquerque, NM, June.

ANCOURT, A., COELHO, F., IRIGOIN, F., AND KERYELL, R. 1997. A linear algebra framework for static HPF code distribution. Scientific Programming, 6, 1, 3-28, Spring.

BALASUNDARAM, V., FOX, G., KENNEDY, K, AND KREMER, U. 1990. An interactive environment for data partitioning and distribution. In 5th Distributed Memory Computing Conference, Charleston, SC.

BANERJEE, U. 1994. Loop Parallelization. Kluwer Academic Publishers.

BANERJEE, P., CHANDY, J. A., GUPTA, M., HODGES IV, E. W., HOLM, J. G., LAIN, A., PALERMO, D. J., RAMASWAMY, S., AND SU, E. 1995. The PARADIGM compiler for distributed-memory multicomputers. IEEE Computer, 28, 10, 37-47, October.

BOZKUS, Z., CHOUDHARY, A., Fox, G., HAUPT T., AND RANKA, S. 1994. A compilation approach for Fortran 90D/HPF compilers. Languages and Compilers for Parallel Computing, U. Banerjee et al. (Eds.), Lecture Notes in Computer Science, Volume 768, pages 200-215.

CALLAHAN, D. AND KENNEDY, K. 1988. Analysis of inter-procedural side effects in a parallel programming environment. Journal of Parallel and Distributed Computing, 5, 5, 517-550, October.

CHAKRABARTI, S., GUPTA, M., AND CHOI, J.-D. 1996. Global communication analysis and optimization. In Proc. ACM SIGPLAN`96 Conference on Programming Language Design and Implementation, pages 68-78, Philadelphia, PA, May.

CHATTERJEE, S., GILBERT, J., LONG, F., SCHREIBER, R., AND TENG, S. 1995. Generating local addresses and communication sets for data-parallel programs. Journal of Parallel and Distributed Computing, 26, 1, 72-84, April.

CREUSILLET, B. AND IRIGOIN, F. 1995. Inter-procedural array region analyses. In Proc. 8th International Workshop on Languages and Compilers for Parallel Computers, pages 46-60, Columbus, Ohio.

FOSTER, I. 1994. Designing and building parallel programs. Addison-Wesley Publishing Company, Reading, MA.

GAREY, M. AND JOHNSON, D. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company.

GERNDT, M. 1990. Updating distributed variables in local computations. Concurrency--Practice and Experience, 2, 3, 171-193, September.

GONG, C., GUPTA, R., AND MELHEM, R. 1993. Compilation techniques for optimizing communication on distributed-memory systems. In Proc. International Conference on Parallel Processing, Volume II, pages 39-46, St. Charles, IL.

GRANSTON, E. AND VEIDENBAUM, A. 1991. Detecting redundant accesses to array data. In Proc. Supercomputing`91, pages 854-865, Albuquerque, NM.

GROSS, T. AND STEENKISTE, P. 1990. Structured data-flow analysis for arrays and its use in an optimizing compiler. In Software-Practice and Experience, vol 20, no 2, pages 133-155, February.

GUPTA, M. AND BANERJEE, P. 1992. Demonstration of automatic data partitioning techniques for parallelizing compilers on multicomputers. IEEE Transactions on Parallel and Distributed Systems, 3, 2, 179-193, March.

GUPTA, M., MIDKIFF, S., SCHONBERG, E., SESHADRI, V., SHIELDS, D., WANG, K., CHING, W., AND NGO, T. 1995a. An HPF compiler for the IBM SP-2. In Proc. Supercomputing 95, San Diego, CA.

GUPTA, M., SCHONBERG, E., AND SRINIVASAN, H. 1995b. A unified data-flow framework for optimizing communication. In Languages and Compilers for Parallel Computing, K. Pingali et al. (Eds.), Lecture Notes in Computer Science, Volume 892, pages 266-282.

GUPTA, S. K. S., KAUSHIK, S. D., HUANG, C.-H., AND SADAYAPPAN, P. 1996. Compiling array expressions for efficient execution on distributed-memory machines. Journal of Distributed and Parallel Computing, 32, 2, 155-172, February.

HALL, M. W., HIRANANDANI, S., KENNEDY, K., AND TSENG, C.-W. 1992. Inter-procedural compilation of Fortran D for MIMD distributed-memory machines. In Proc. Supercomputing`92.

HALL, M. W., MURPHY, B., AMARASINGHE, S., LIAO, S., AND LAM, M. 1995. Inter-procedural analysis for parallelization. In Proc. 8th International Workshop on Languages and Compilers for Parallel Computers, pages 61-80.

HENNESSY, J. L. AND PATTERSON, D.A. 1990. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, CA.

HIRANANDANI, S., KENNEDY, K., AND TSENG, C. 1992. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35, 8, 66-80, August.

KELLY, W., MASLOV, V., PUGH, W., ROSSER, E., SHPEISMAN, T., AND WONNACOTT, D. 1995. The Omega Library interface guide. Technical Report CS-TR-3445, CS Dept., University of Maryland, College Park.

KENNEDY, K. AND NEDELJKOVIC, N. 1995. Combining dependence and data-flow analyses to optimize communication. In Proc. 9th International Parallel Processing Symposium, pages 340 -346.

KENNEDY, K. AND SETHI, A. 1995. A constrained-based communication placement framework, Technical Report CRPC-TR95515-S, CRPC, Rice University.

KENNEDY, K. AND SETHI, A. 1996. A communication placement framework with unified dependence and data-flow analysis. Proc. 3rd International Conference on High Performance Computing.

KENNEDY, K. AND SETHI, A. 1997. Resource-based communication placement analysis. In Languages and Compilers for Parallel Computing, D. Sehr et al. (Eds.), Lecture Notes in Computer Science, Volume 1239, pages 369-388, Springer-Verlag.

KENNEDY, K., NEDELJKOVIC, N., AND SETHI, A. 1995. A linear-time algorithm for computing the memory access sequence in data parallel programs. In Proc. the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Santa Barbara, CA, pages 102-111.

KENNEDY, K., NEDELJKOVIC, N., AND SETHI, A. 1996. Communication generation for cyclic(k) distributions. In Languages, Compilers, and Run-Time Systems for Scalable Computers, B. Szymanski and B. Sinharoy (Eds.), Chapter 14, Kluwer Academic Publishers.

KNOOP, J., RUTHING, O., AND STEFFEN, B. 1994. Optimal code motion: Theory and practice. ACM Transactions on Programming Languages and Systems, 16, 4, 1117-1155, July.

KOELBEL, C., LOVEMAN, D., SCHREIBER, R., STEELE, G., AND ZOSEL, M. 1994. The High Performance Fortran Handbook. The MIT Press.

PALERMO, D. J., SU, E., CHANDY, J. A., AND BANERJEE, P. 1994. Communication optimizations used in the PARADIGM compiler for distributed-memory multicomputers. In Proc. International Conference on Parallel Processing.

POLYCHRONOPOULOS, C., GIRKAR, M. B., HAGHIGHAT, M. R., LEE, C. L., LEUNG, B. P., AND SCHOUTEN, D. A. 1989. Parafrase-2: An environment for parallelizing, partitioning, synchronizing, and scheduling programs on multiprocessors. In Proc. International Conference on Parallel Processing, Volume II, pages 39-48, St. Charles, IL, August.

PUGH, W. 1992. A practical algorithm for exact array dependence analysis. Communications of the ACM, 35, 8, 102-114, August.

SU, E., LAIN, A., RAMASWAMY, S., PALERMO, D. J., HODGES IV, E. W., AND BANERJEE, P. 1995. Advanced compilation techniques in the PARADIGM compiler for distributed-memory multicomputers. In Proc. 9th ACM International Conference on Supercomputing, pages 424-433.

SU, E., PALERMO, D. J., AND BANERJEE, P. 1994. Processor tagged descriptors: A data structure for compiling for distributed-memory multicomputers. In Proc. Conference on Parallel Architectures and Compilation Techniques.

TARJAN, R. E. 1974. Testing flow graph reducibility. Journal of Computer and System Sciences, 9, 355-365.

THIRUMALAI, A. AND RAMANUJAM, J. 1996. Efficient computation of address sequences in data-parallel programs using closed forms for basis vectors. Journal of Parallel and Distributed Computing, 38, 2, 188-203, November.

VAN HANXLEDEN, R. AND KENNEDY, K. 1993. A code placement framework and its application to communication generation. Technical Report CRPC-TR93337-S, CRPC, Rice University.

VAN HANXLEDEN, R. AND KENNEDY, K. 1994. Give-n-take--a balanced code placement framework. In Proc. ACM SIGPLAN '94 Conference on Programming Language Design and Implementation.

VENKATACHAR, A., RAMANUJAM, J., AND THIRUMALAI, A. 1997. Communication generation for block-cyclic distributions. Parallel Processing Letters, 7, 2, 195-202, June.

WOLFE, M. 1996. High Performance Compilers for Parallel Computing, Addison-Wesley Publishing Company.

YUAN, X., GUPTA, R., AND MELHEM, R. 1997a. An array data flow analysis based communication optimizer. In Proc. 10th Annual Workshop on Languages and Compilers for Parallel Computing.

YUAN, X., GUPTA, R., AND MELHEM, R. 1997b. Demand-driven data flow analysis for communication optimization. Parallel Processing Letters, 7, 4, 359-370, December.

ZIMA, H. AND CHAPMAN, B. 1991. Supercompilers for Parallel and Vector Computers, ACM Press.

Received March 1998; revised July 1998; accepted November 1998

The material presented in this article is based on research supported in part by A. Choudhary's NSF Young Investigator Award CCR-9357840, NSF grant CCR-9509143, DOE AV-6193, and the Air Force Materiel Command under contract F30602-97-C-0026. The work of P. Banerjee is supported in part by the NSF under grant CCR-9526325 and in part by DARPA under contract DABT-63-97-C-0035. The work of J. Ramanujam is supported in part by NSF Young Investigator Award CCR-9457768 and NSF grant CCR-9210422. Authors' addresses: M. Kandemir, Department of Electrical Engineering and Computer Science, Syracuse University, Syracuse, NY 13244; email: mtk@ece.nwu.edu; P. Banerjee, A. Choudhary, and N. Shenoy, Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL 60208; email: {banerjee; choudhar; nagaraj}@ece.nwu.edu; J. Ramanujam, Department of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA 70803; email: jxr@ee.lsu.edu.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

M. KANDEMIR, Syracuse University

P. BANERJEE and A. CHOUDHARY, Northwestern University

J. RAMANUJAM, Louisiana State University

N. SHENOY, Northwestern University