Printer Friendly

Fusion-Based Register Allocation.

1. INTRODUCTION

The task of register allocation is to map virtual registers (e.g., variables, temporaries, constants) to a limited number of physical registers. A common approach is to model the register allocation problem as a graph-coloring problem. In the interference graph, nodes represent live ranges of virtual registers, and two live ranges are connected by an edge if they are simultaneously live. The interference graph is colored (using N colors, N the number of registers) so that two conflicting live ranges are assigned different colors (physical registers).

Finding the minimal number of colors for an arbitrary graph is NP-complete [Garey and Johnson 1979]; therefore, most register allocation approaches assign colors to live ranges in some heuristic order [Pi]. Assume that a color has been assigned to live ranges [l.sub.[Pi](1)], [l.sub.[Pi](2)], ..., [l.sub.[Pi](k-1)], but no color can be found for [l.sub.[Pi](k)]. Then we say that coloring blocks and the compiler must then somehow modify the interference graph to allow coloring to proceed. Two major techniques have been developed:

Spilling. A live range lr(x) for virtual register x is assigned a location in memory, and all references to lr(x) are done by memory accesses (loads/stores), which are referred to as spill code. A spilled live range is removed from the interference graph, since it is no longer a register assignment candidate; this removal lowers the number of edges in the interference graph and may therefore reduce the degree of this graph (the degree is the maximum number of edges for a node).

Splitting. Live-range splitting segments a long live range lr(x) into smaller live ranges [lr.sub.Ri](x) for some region [R.sub.i]. Shuffle code is then needed to move the data value in virtual register x when control passes from region [R.sub.i] to another region [R.sub.j]. Splitting may reduce the degree of the interference graph, because a smaller live range may interfere with fewer live ranges and render the revised graph colorable.

When the details of the definition of regions [R.sub.1] and [R.sub.2] do not matter, we use the notation [lr.sub.1](x) and [lr.sub.2](x) to refer to two distinct segments of lr(x). If the identity of the virtual register x is not relevant, we use the short form lr to refer to a live range.

Simplification [Chaitin et al. 1981] is a common technique to determine the coloring order, i.e., the order in which live ranges are assigned colors. (We refer to simplification as used by Chaitin [Chaitin et al. 1981] as Chaitin-style simplification.) Simplification is based on the observation that if a node lr has degree less than N (i.e., it is unconstrained), then lr can be trivially colored no matter what colors are assigned to the residual graph. Simplification proceeds by successively removing unconstrained nodes from the graph. Each time a node lr is removed from the graph, the edges that are incident upon lr are also removed, and the degrees of lr's neighbors are decremented. Once all nodes have been removed from the graph, colors are assigned to nodes in the reverse order in which they were removed. Simplification blocks when all remaining nodes have degrees greater than or equal to N, and at this time a live range is picked to be spilled based on a heuristic cost function. Once the spilled live range is removed from the graph, simplification then may proceed where it blocked because the degrees of the neighbors of the spilled live range are lowered.

1.1 Cost of Register Allocation

To choose between different register allocation decisions, the register allocator needs a cost model. Our basic machine model is a RISC processor that requires the operands of all operations to reside in registers. The register allocation cost includes all overhead operations that move operands in and out of a register; this cost includes also overhead operations that move values from one register to another. That is, we compare the results of a register allocator against a perfect allocation with unbounded registers. The higher the cost (for a fixed number of registers), the worse the register allocator has performed.

The register allocation cost is the sum of three components: (1) spill cost (to move values to and from memory), (2) shuffle cost (moving values from one location to another), (3) call cost (to free up/restore registers upon procedure entry/exit).

1.1.1 Spill Cost. When a live range lr is assigned a location in memory, all references to lr are done by memory accesses (commonly referred to as spill code). These memory accesses make up the spill cost.

1.1.2 Shuffle Cost. Shuffle code can be of three types: (1) shuffle-move: A register-to-register move instruction shuffles the value between the two live-range segments that are assigned to different registers; (2) shuffle-load: A memory-to-register load instruction moves the value from one spilled live-range segment to a register-allocated live-range segment; (3) shuffle-store: A store instruction moves the value to a spilled live-range segment.

There is no need for memory-to-memory shuffle code because the two live ranges of the same virtual register are always spilled to the same memory location.

1.1.3 Call Cost. Many compilers divide the registers into two sets, callee-save and caller-save registers, respectively. The distinction between these registers provides the register allocator with more choices when minimizing the call overhead [Chow and Hennessy 1990]. There is a distinct cost associated with each kind of register assigned to a live range. The save/restore operations for callee-save and caller-save registers at procedure boundaries determine the call cost. The register allocation cost must include the call cost; otherwise the evaluation of the register allocator may be overly optimistic [Lueh and Gross 1997].

1.2 Example of Early Binding

The spilling and splitting decisions are interrelated: if lr is split, each segment can be spilled individually. And if a live range lr' is spilled, it may no longer be necessary to split other live ranges. Many compilers address this problem by imposing an ordering and by making a decision before all parts of a compilation unit have been analyzed. To illustrate the limitations of early binding, consider a Tera-like approach [Callahan and Koblenz 1991] to register allocation. Figure 1 illustrates unnecessary constraints imposed on register allocation by premature coloring decisions. This approach constructs a tile tree for a program. The code in Figure 1(a) consists of two tiles, one for the loop (the shaded region) and one for blocks [B.sub.1] and [B.sub.6]. Recall that this compiler performs register allocation for the tiles in two phases. The first phase colors each tile in a bottom-up fashion (from the bottom of the tile tree up to the root tile). Pseudoregisters are assigned to live ranges in the first phase. Once two variables in a tile are assigned to the same pseudoregister, the parents of the tile must adhere to this decision.

[Figure 1 ILLUSTRATION OMITTED]

The loop tile is colored first. The interference graph of the loop tile is depicted in Figure 1(b). Inside the loop tile, the live range lr(y) is live in [B.sub.2] and [B.sub.5] (and therefore live across the back edge <[B.sub.5], [B.sub.2]>). There are two live ranges for x, lr{[B.sub.2],[B.sub.5]}(x) and [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (x), in the loop. Coloring the graph with two registers, [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (x) is placed into the same pseudoregister as either lr{[B.sub.2],[B.sub.5]} (x) or lr(y). Based on the information inside the loop region, both choices are reasonable. However, if [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (x) and lr(y) use the same pseudoregister, a shuffle move is inevitable on either <[B.sub.4], [B.sub.6]> or <[B.sub.5], [B.sub.6]>, since lr{[B.sub.2],[B.sub.5]}(X) and [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (x) merge in [B.sub.6]. <[B.sub.m], [B.sub.n]> represents a control-flow edge that goes from block [B.sub.m] to [B.sub.n]. If both segments of lr(x) use the same pseudoregister, no shuffle code is required, but that this mapping is beneficial can only be determined when dealing with [B.sub.6].

1.3 This Article

Fusion-based register allocation attempts to consider spilling and splitting together and uses the structure of the program to guide the register allocation decisions. The key idea is to delay the decision details (that lr(x) or lr(y) are spilled, or that lr(x) is split along an edge) while making and propagating summary decisions (e.g., that k live ranges are spilled for a region). Fusion-based register allocation has been implemented in the cmcc compiler [Adl-Tabatabai et al. 1996; Lueh et al. 1997]. To allow for a fair comparison of various approaches to register allocation, we developed a register allocation framework (described in Section 2), and this framework allows us to compare fusion-based register allocation with other approaches.

This article focuses on intraprocedure register allocation. Issues related to call cost are not discussed further in this article, since the call cost is influenced by the compiler's calling convention. However, all performance data are given for a compiler that incorporates call-cost optimizations [Lueh and Gross 1997]. Interprocedural register allocation is beyond the scope of this article.

2. THE REGISTER-ALLOCATION FRAMEWORK

A register allocator must make a number of decisions and must address these issues:

--coloring order. How does the allocator determine the order in which live ranges are assigned colors?

--spilling versus splitting: Since both spilling and splitting live ranges reduce register pressure, how does the register allocator make a trade-off between these techniques, and how does it select the live ranges to operate on?

-- splitting points: In the case of splitting live ranges, how does the allocator determine splitting points for the live ranges?

--reconciliation: How are register allocation decisions for two regions reconciled? This is a fundamental challenge for all approaches that consider the structure of a program. If the register allocator performs register allocation for a region and then does not reconcile this region with other regions, excessive shuffle code may arise at region boundaries.

--call cost: What is the heuristic of assigning a callee-save or caller-save register to a live range?

To allow us to compare different approaches, we present a framework that covers a number of current register allocation techniques. A framework partitions a solution to a design problem into abstract classes and defines their responsibilities and collaboration [Gamma et al. 1995]. A framework facilitates a fair comparison, which is essential to understand the benefits and limitations of a technique, e.g., if an improvement is added to some parts of register allocation then all register allocation approaches can benefit from the improvement. Using a framework as the basis of an evaluation avoids a bias in favor of a one particular approach.

This framework identifies seven phases in the register allocator: graph construction, live-range coalescing, color ordering, color assignment, graph reconstruction, spill-code insertion, and shuffle-code insertion (as illustrated by Figure 2). A register allocation approach customizes the register allocation framework by providing specific implementations for each of these phases. We briefly summarize those aspects of the framework that allow us to introduce and later compare fusion-based register allocation.

[Figure 2 ILLUSTRATION OMITTED]

2.1 Color Ordering

The framework introduces two data structures to model the interface between the color-ordering and color-assignment phases: (1) the color stack (C), and (2) the spill pool (S). The color-ordering phase heuristically determines the order in which live ranges are to be assigned colors and pushes live ranges onto C based on the ordering--the higher the position within C, the higher the priority. S keeps all spilled live ranges. While deciding the color ordering, this phase may make spilling decisions due to lack of colors and may add the spilled live range to S. The spill-code insertion phase later allocates space on the local heap for spilled live ranges in S, and inserts actual spill code.

2.2 Color Assignment

The color-assignment phase pops live ranges from C and finds legal colors (non-conflicting colors) for them, i.e., this phase assigns colors based on the ordering computed by the previous phase and uses the interference graph to make sure that conflicting live ranges are assigned different colors. Biased coloring [Briggs 1992] is used to find legal colors so as to eliminate copy instructions. That is, if the coalescing phase identifies for lr a partner live range lr', which prefers the same register as lr, then in the absence of conflicts, lr' is assigned the same color as lr to eliminate the copy instruction that moves the value between the partner and lr. If this phase is unsuccessful in finding a legitimate color for a live range because all colors are already taken by its neighbors, the live range is then spilled and added to S.

2.3 Graph Reconstruction

For simplicity, the color-ordering and color-assignment phases make spilling decisions under the assumption that a spilled live range does not need any register resource. However, for load/store machines, memory accesses must be done by load/store instructions, which require to use a register as one operand (destination/source) [Hennessy and Patterson 1990]. In other words, wherever the value of a spilled live range is used as an operand, a register is needed, and spill code needs to acquire registers. Reserving registers for spill code is one way of making sure that there are registers for spill code. A register allocator that reserves registers for spill code, however, works with fewer colors than one without reservations. Our register allocator reserves no registers for spill code, i.e., the physical register space is not divided into "local" and "global" registers. To ensure that there are registers for spill code, the register allocator rebuilds the interference graph (constructs live ranges for spill code) after spilling live ranges and restarts from the coalescing phase.

There are two ways to reconstruct the interference graph: (1) spill code (a load/store) is inserted into the schedule immediately before/after a use/definition, the interference graph is thrown away, and the whole coloring process restarts from the first phase; (2) the existing interference graph is modified by producing a new small live range for each occurrence of a spilled live range (no spill code is inserted; the newly produced live ranges span only one instruction), and the coloring process can then skip the graph construction phase and proceed from the coalescing phase with the modified graph. Those newly produced live ranges are called spill-code live ranges. The first approach reuses the code for constructing the interference graph. The second approach favors compilation speed, because the interference graph is not rebuilt from scratch.

Because spill-code live ranges span only one instruction (and, in our machine model, registers are freed at instruction boundaries), spilling such a live range does not lower register pressure but increases spill code instead. Therefore, they are marked nonspillable so that the register allocator does not spill them in subsequent iterations of register allocation.

2.4 Spill-Code Insertion

The spill-code insertion phase allocates space for spilled live ranges on S. We classify virtual registers into global and local virtual registers. Global virtual registers are those who appear in multiple blocks. Local virtual registers are those that appear within one single basic block. The spilled live ranges that correspond to the same virtual register are always spilled to the same location. No space is allocated to a virtual register if no live range of the virtual register is spilled. Global virtual registers have their own locations; spaces for local virtual registers are compacted using the same graph-coloring technique.

After spaces for spilled live ranges are allocated, the spill-code insertion phase traverses each instruction and inserts spill code for each reference of a spilled live range. Each such reference has its own spill-code live range, which acquires a register for loading from or storing to memory.

2.5 Shuffle-Code Insertion

Register allocation may include one or more live-range splitting approaches to split live ranges into smaller segments to reduce register pressure. In this register allocation framework, splitting can occur only on edges in the graph that represents the program; no live range is split within the region that is represented by a single node in this graph. If regions are basic blocks, then live ranges are split only along control-flow edges. The shuffle-code insertion phase goes through these edges and inserts shuffle code for those live ranges that should span across the edges but reside in different storage locations instead (details are discussed in Section 3.3).

3. FUSION-BASED REGISTER ALLOCATION

Fusion-based register allocation identifies regions, builds the interference graph for each region, and then merges all regions' interference graphs along edges that connect two regions. In this section, we concentrate on two essential phases of fusion-based register allocation--graph construction and color assignment--that make spilling, splitting, and register-binding decisions. Graph fusion takes place during the graph construction phase, which is further divided into three steps (shown in Figure 3): region formation, graph simplification, and graph fusion.

[Figure 3 ILLUSTRATION OMITTED]

Here we define some notations that are used in this article.

--<[B.sub.m], [B.sub.n]> indicates a control-flow edge that goes from block [B.sub.m] to [B.sub.n].

--[G.sub.R] represents the interference graph of region R.

--[lr.sub.i](x) denotes a segment of a virtual register x's live range lr(x); [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] denotes the segment for region [R.sub.j]; and there is at least one segment for every region R where x is live and reaching.

--[lr(x) , lr(y)] is the interference edge that connects live ranges lr(x) and lr(y).

--[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] is the combined live range of [lr.sub.i] and [lr.sub.j] (short form [lr.sub.ij]).

--neighbor(lr) is the set of all live ranges that lr interferes with.

--[Psi](lr) is the set of all interference edges of lr.

--[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] refers to the resulting graph of fusing (or merging) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

--[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] represents the resulting graph of fusing [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] along some control-flow edge E.

--ReachIn(B) for region B is the set of live ranges lr(x) so that x is not defined prior to use in B and so that a definition of x reaches the entry of B.

--LiveOut(B) for region B is the set of live ranges that are live at the exit of B.

--[Psi](C) is the number of live ranges that must (eventually) be spilled for the clique summary node C in the interference graph (Section 3.2.3).

3.1 Overview

We briefly describe the phases that are unique to fusion-based register allocation (shown in gray in Figure 3).

3.1.1 Region Formation. In this phase, regions are formed using any number of possible techniques. A region consists of multiple blocks (or one single block) and the edges that connect the blocks within the region. For example, a region can be a single basic block, a trace [Lowney et al. 1993], a superblock [Hwu et al. 1993], a region as defined by Hank et al. [1995], the blocks at a particular static loop-nesting level [Callahan and Koblenz 1991], or the blocks within a Program Dependence Graph (PDG) region node [Norris and Pollock 1994]. Control-flow edges that lie outside of regions are then ordered according to some priority functions consistent with the region formation approach, e.g., edges entering innermost loop regions are placed before those entering outermost loop regions. One particularly attractive priority function is the use of execution probabilities. These can be derived either from profile information [Fisher and Freudenberger 1992; Wall 1991], from static estimates such as loop-nesting depth [Chow and Hennessy 1990], or from static branch estimates [Lowney et al. 1993]. The choice of edge ordering is orthogonal to register allocation but of course influences the quality of the code, as our register allocation framework is edge order sensitive: shuffle code is less likely to end up on edges that are ordered first, and spilling decisions are delayed until later edges are processed. During region formation, an interference graph [G.sub.R] is built for each region R.

There are two issues related to choosing the size of the base region: spill and shuffle costs. The smaller the size of the base region, the more likely the register allocator generates shuffle code and reduces the spill cost because live ranges are allowed to be split at finer granularity (at the cost of shuffle code) to reduce the spill cost. On the other hand, the bigger the size of the base region, the more likely the register allocator generates spill code to reduce the shuffle cost. It is impossible to state that one strategy is better than the other because the actual outcome depends on characteristics of the programs.

3.1.2 Graph Simplification. The objective of the graph simplification phase is to determine how many live ranges must be spilled within each region. If an interference graph [G.sub.R] can be simplified, then no spill code is necessary within region R. But if [G.sub.R] cannot be simplified, then from R's perspective the cheapest live ranges to spill within R are those that are transparent, i.e., a live range lr(y) that spans R with no definition or use of y in R. From a global perspective, the choice of which live ranges are the best ones to spill cannot be determined at this point in the algorithm. Thus the decision on which live ranges to spill is delayed until more global knowledge about the reference patterns is available; this phase determines only how many transparent live ranges need to be spilled within each region. The next phase, graph fusion, determines which live ranges are the best ones to spill. This technique is referred to as delayed spilling.

When coloring the interference graph for a base region, delayed spilling may not remove enough live ranges to make the graph simplifiable. In this case, additional live ranges must be spilled, chosen by some heuristic. The choice (size) of the base region has a direct influence on this step; this aspect is discussed in Section 4.4.

3.1.3 Graph Fusion. The graph fusion phase takes the sequence of control-flow edges determined by the region formation phase and fuses interference graphs along each edge. Graph fusion is based on a powerful fusion operator that maintains the invariant that the resulting interference graph is simplifiable. Live-range splitting decisions are made by the fusion operator: if fusing two graphs [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] along an edge E results in an interference graph that cannot be simplified, then one or more live ranges that span E are split. Only the live ranges that span E need to be considered, because in the worst case splitting all such live ranges partitions the graph [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] back to the two original graphs, [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], both of which are simplifiable. At the end of the graph fusion phase, we are left with one simplifiable interference graph; we know how many live ranges to spill for each region and where to place the shuffle code, but no physical registers are committed to any live range. The simplifiability invariant allows us to avoid making any coloring decisions prematurely during the graph fusion phase. Assume that we choose loops as regions; there is no coloring assignment decision for any live range while graphs of the loop region are fused. All we know is that the interference graph of the loop region is colorable. The actual coloring decision is deferred till the color-assignment phase.

As more graphs are fused, the delayed spilling mechanism gradually spills live ranges. The net effect of combining splitting with delayed spilling is that the register allocator may spill only those segments of a virtual register's live range that do not contain references but span regions of high register pressure. This effect can be observed for numerous programs and is the reason for the performance improvements discussed in Section 4. Note the relationship between the choice of base region and the number of fusion steps: if the base region is small (e.g., a basic block), the fusion operation is applied for many edges, and there exist many places to split a live range. Also, the initial interference graphs for each base region may be smaller than if the base region is large (e.g., loop nests). However, working with large base regions may require less time (fewer edges for the fusion operator).

3.1.4 Color Assignment. In this phase, physical registers are assigned to live ranges. The simplifiability invariant guarantees that once all interference graphs have been fused, the resulting interference graph is colorable and spilling or splitting decisions have already been made. Maintaining the simplifiability invariant is necessary; otherwise we are back to the original problem of making splitting and spilling decisions based on a whole interference graph. There exist further opportunities for improving the code in this phase: biased coloring [Briggs 1992] may eliminate shuffle code; optimistic coloring [Briggs et al. 1989] may assign registers to live ranges that have been spilled by simplification; or call-cost optimization may spill live ranges to reduce call cost [Lueh and Gross 1997].

3.1.5 Properties of Graph Fusion. Initially, each interference graph [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] for a region [R.sub.j] of a function f is sparser (i.e., less constrained) than the function-wide interference graph [G.sub.f] because each [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] is only a portion of [G.sub.f]. As graph fusion proceeds further, more interference graphs are fused, resulting in denser interference graphs (i.e., more constrained) than the ones that are fused earlier. Thus, the earlier an edge is considered by the fusion process, the less likely it is that live ranges are split along that edge. Fusing graphs based on an edge ordering provides a nice property: if we do not want shuffle code on a particular edge, we can fuse the interference graphs along that edge first. Consequently, the decision of where to split is prioritized according to the edge ordering, and shuffle code ends up on less frequently executed edges. Register allocation now becomes a problem of edge ordering and forming regions. Compiler writers can determine the size of regions and the edge ordering to tune the quality of register allocation. Moreover, maintaining the simplifiability invariant allows register allocation to avoid committing any register-binding decision during graph fusion. As a result, no early binding is made.

3.1.6 Example. We illustrate the above steps with an example. Consider assigning registers to the program fragment shown in Figure 4(a). Assume we have only two physical registers (N = 2) and that regions are basic blocks. There are three virtual registers (x, y, and z) in this program with initial live ranges indicated by the vertical bars. The interference graph of each basic block prior to fusion is depicted in Figure 4(b). Suppose edges 1 and 3 form the most frequently executed path, and edges are fused in the order 1, 3, 2, 4. After we fuse graphs along edges 1, 3, and 2, we obtain the interference graph of Figure 4(c); interference graph nodes list live ranges that have been fused. If we now fuse the interference graphs along edge 4, [lr.sub.3](y) and [lr.sub.4](y) are combined, since the only live range spanning edge 4 is lr(y). However, graph fusion along edge 4 makes the graph unsimplifiable (clique of size 3), so the algorithm undoes combining [lr.sub.3](y) and [lr.sub.4](y). lr(y) is effectively split at the less frequently executed point in this control-flow graph. The result is that all values can be kept in registers with an additional register move instruction at the end of block [B.sub.3].

[Figure 4 ILLUSTRATION OMITTED]

3.2 Fusion Operation

This section describes the essential core of the fusion-based approach, the fusion operation. It fuses two interference graphs into one and maintains the simplifiability invariant. Figure 5 depicts the important steps of the fusion operation: finding live ranges that span the edge E, unioning nonsplitable live ranges, merging cliques, taking a snapshot of the interference graph, unioning splitable live ranges, and maintaining simplifiability.

[Figure 5 ILLUSTRATION OMITTED]

There are two places (unioning nonsplitable and splitable live ranges as shown in Figure 5) where the fusion operation combines live ranges that span edge E. A live-range segment [lr.sub.i] can be in one of these three states: it has been spilled; it is represented in [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] and transparent; or it is represented in [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] and nontransparent (i.e., there is a reference). If both segments, "from" and "to," are spilled, no shuffle code is needed. If [lr.sub.1](x) is spilled and [lr.sub.2](x) is transparent, we always spill the combined live range [lr.sub.12](x) in [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] to eliminate the shuffle code that is needed otherwise. If [lr.sub.1] (x) is spilled and [lr.sub.2](x) is nontransparent, we have a split point. Otherwise the segments are coalesced, although if the new interference graph is not simplifiable, coalescing may be suppressed (i.e., we have a split). If both [lr.sub.1](x) and [lr.sub.2](x) are transparent, then [lr.sub.12](x) in [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] is nonsplitable because spilling those live ranges does not introduce spill cost. (In this case, [lr.sub.12] is also transparent, i.e., there is no reference to x in [R.sub.1] [union] [R.sub.2], so there is no point in treating [lr.sub.1](x) and [lr.sub.2](x) differently.) Figure 6 enumerates all possible cases of combining two live ranges. Empty entries in this table are symmetric cases.

Fig. 6. Different possibilities when coalescing [lr.sub.1](x) and [lr.sub.2](x).
                                [lr.sub.2](x)

[lr.sub.1](x)    Spilled        Transparent

Spilled          Spilled        Spill (see Section 3.2,3)
                 Nonsplitable   Nonsplitable

Transparent                     Coalesced, Transparent
                                Nonsplitable

Nontransparent

                 [lr.sub.2](x)

[lr.sub.1](x)    Nontransparent

Spilled          Split (see Section 3.2.3)
                 Splitable

Transparent      Coalesced, Nontransparent
                 Splitable

Nontransparent   Coalesced, Nontransparent
                 Splitable


3.2.1 Finding Spanning-E Live Ranges. Consider two regions [B.sub.1] and [B.sub.2], and an edge E = <[B.sub.1], [B.sub.2]> connects these regions in the control-flow graph. The sets ReachIn([B.sub.2]) and LiveOut([B.sub.1]) indicate which live ranges span edge E, i.e., if [lr.sub.1](x) [element of] LiveOut([B.sub.1]) and [lr.sub.2](x) [element of] ReachIn([B.sub.2]), one definition of [lr.sub.2](x) comes from [lr.sub.1] (x) (there may be other definitions reaching [lr.sub.2](x) from other paths to [B.sub.2]). To facilitate applying the fusion operation along edge E, it is convenient to define an internal structure span_e that is composed of two live-range pointers, from and to; "from" points to [lr.sub.1](x) and "to" points to [lr.sub.2](x).

The first step of the fusion operation attempts to coalesce the live-range segments [lr.sub.1](x) and [lr.sub.2](x) in the interference graph [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] so that no shuffle code is needed for x along E.

The example of Figure 7 shows the relationship among the structures of span_e, ReachIn, and LiveOut. LiveOut([B.sub.1]) has three live ranges--[lr.sub.1](x), [lr.sub.1](y), and [lr.sub.1](z)--because the virtual registers x, y, and z are defined in [B.sub.1] and because their values are live at the exit of [B.sub.1]. ReachIn([B.sub.2]) contains [lr.sub.2](x) and [lr.sub.2](y) because x and y are referenced in [B.sub.1].

[Figure 7 ILLUSTRATION OMITTED]

3.2.2 Unioning Nonsplitable Live Ranges. After identifying span_e, we coalesce those live ranges whose combined live ranges are not splitable. Live ranges that are coalesced in this step are removed from further consideration. Combining two live ranges "from" and "to" consists of three major steps, unioning two live ranges, merging interference edges, and propagating attributes. Figure 8 shows the steps that are performed for each lr in span_e; a small diamond indicates a condition checker. For instance, the checker on the left-hand side checks if the combined live range of the current span_e is splitable or not.

[Figure 8 ILLUSTRATION OMITTED]

Unioning Two Live Ranges. How we union two live ranges depends entirely on the data structures of the live range and the instruction. Prior to register allocation, machine instructions access virtual registers. After register allocation, the code emitter wants to obtain the physical register for each operand of a machine instruction. So when two live ranges are unioned into one, the operands that refer to the two live ranges must be able to record the change. That is, the operands must access the combined live range as opposed to the two old live ranges. The algorithm disjoint-set forests [Cormen et al. 1990] with union-by-rank and path-compression heuristics provides an efficient practical solution; see Lueh [1997] for details.

Merging Interference Edges. The two live ranges have their own interference edges, [Psi]([lr.sub.1]) and [Psi](lr.sub.2]), prior to unioning. In this step, [Psi]([lr.sub.1]) and [Psi]([lr.sub.2]) are merged. If [lr.sub.12] is [lr.sub.1], then each interference edge [[lr.sub.2], X] of [lr.sub.2] is updated by substituting [lr.sub.2] with [lr.sub.12], where X [element of] neighbor([lr.sub.2]). Likewise, each edge of [Psi]([lr.sub.1]) is also updated in a similar fashion if [lr.sub.12] is [lr.sub.2]. Duplicate edges occur when [lr.sub.1] and [lr.sub.2] interfere with the same live range prior to unioning, but the duplicates are not added to the interference graph.

Propagating Attributes. Each live range has several attributes that record information if there is a definition in the region, the estimated cost of using caller-or callee-saved registers, etc. These attributes are propagated during graph fusion [Lueh 1997].

3.2.3 Merging Cliques. When a region R needs M physical registers to be colored, and M [is greater than] N, then some live ranges must be spilled. Considering only local spill costs inside R, the best spill choice is a transparent live range lr(x), since it has a high degree in the interference graph, and the cost of spilling lr(x) is zero inside R. If we assume that there are T transparent live ranges in region R, and K live ranges must be spilled, then there are three cases that must be considered:

K [is less than] T. In this case, the number of transparent live ranges is more than the number of live ranges that must be spilled. However, choosing the transparent live ranges for spilling should be delayed until the compiler obtains more global information about the reference patterns of the transparent live ranges. Searching the region's immediate neighbors does not solve the problem because transparent live ranges may be transparent across many basic blocks. Delayed spilling deals nicely with this case, as explained below. A large number of transparent live ranges is common while processing the high-priority edges.

K = T. In this case, the spill needs are satisfied by spilling all transparent live ranges.

K [is greater than] T. In this case, all transparent live ranges, as well as K - T live ranges with a reference inside R, are spilled. The spilling candidates with a reference inside R are selected by a heuristic based on spill cost, area, and the degrees in the interference graph.

Although spilling a transparent live range is very attractive (because it has no spill cost), there may be better choices from a global perspective. Figure 9 shows a flow graph for which spilling transparent live ranges is not optimal; regions are basic blocks. In Figure 9(a), each of [B.sub.1] and [B.sub.2] has three live ranges simultaneously live. The vertical lines on the right of the blocks represent the live ranges of the virtual registers x, y, and z (a lightly shaded line indicates that the live range is transparent). The number next to each control-flow edge is the execution frequency of the edge. [B.sub.1] and [B.sub.3] have the same execution frequency, 100. Three colors are needed to color the interference graph of [B.sub.1] and [B.sub.2]. Since there are only two registers, we must spill one live range from each block. Figure 9(b) depicts the case of spilling transparent live ranges (spilled live ranges are drawn with a hatched pattern). [lr.sub.1] (z) and [lr.sub.2](z) are transparent live ranges and spilled to reduce register pressure. Although the two live ranges do not have spill cost, they need shuffle code to move values between registers and memory. The shuffle cost is 370 load/store operations. The optimal result (least overhead operations) in this example is spilling y's live ranges ([lr.sub.1](y) and [lr.sub.2](y)) as depicted in Figure 9(c) (only 190 operations of spill code).

[Figure 9 ILLUSTRATION OMITTED]

We now describe in more detail the delayed spilling technique used to handle the case where K [is less than] T. Since all transparent live ranges conflict with each other, these live ranges form a clique in the interference graph. The transparent live ranges are therefore collected into a single clique summary node C in the interference graph, as depicted in Figure 10. The node C contains an edge to all other nodes in the graph, since the transparent live ranges interfere with all other live ranges in a region. The clique summary node is annotated with the number T(C) of transparent live ranges that it represents. We record that [Psi](C) = K of the T transparent live ranges must be spilled, without specifying which ones. The actual size of the clique is thus T(C) - [Psi](C). The clique is dealt with as a single unit; eventually [Psi](C) live ranges will be spilled. By keeping a summary node in the graph, we can keep more live ranges in the interference graph than there are registers, and we delay the decision on which live range(s) to spill until more information is available. The clique representation affects the graph coloring in two aspects: computing the degree and fusing graphs.

[Figure 10 ILLUSTRATION OMITTED]

Computing Degree. The degree of an interference graph node is usually computed by summing the number of interference graph edges that are incident upon the node. With clique summary nodes, however, additional care must be taken. Consider a clique node C and a regular node lr, connected by an interference edge E. The degree that edge E contributes to node lr is T(C) - [Psi](C) because lr interferes with every node inside clique C. The degree that edge E contributes to clique C is one. In addition to the interference edges of C, the degree of a clique node must add T(C) - [Psi](C) - 1 because within the clique each node interferes with T(C) - [Psi](C) - 1 other nodes.

Merging Live Ranges. Given an edge E - <[B.sub.1], [B.sub.2]>, the live ranges that span across E are merged. When a live range [lr.sub.1] is merged with [lr.sub.2], and at least one of the two is transparent, there are three cases to consider (assume that [lr.sub.2] is transparent):

(1) If [lr.sub.1] is a spilled live range, [lr.sub.2] is in a clique C: [lr.sub.2] is removed from C and spilled. Both T(C) and [Psi](C) are decremented by one because one spilling decision is made. The decision grows a spilled live range, thereby allowing the compiler to avoid shuffle code on the edge E (a contiguous spilled live range crosses the edge E).

(2) If [lr.sub.1] is a nontransparent live range, [lr.sub.2] is in a clique C: The fusion operation makes a nonspilling decision for [lr.sub.2] and removes [lr.sub.2] from C. Only T(C) is decremented. The decision enlarges a nonspilled live range; since the live range is in a register in an adjacent region, keeping [lr.sub.2] in the register (not spilled) eliminates the shuffle code needed otherwise on the edge E. Hence the compiler favors this live range over the other live ranges in the clique.

(3) If [lr.sub.1] is in clique [C.sub.1] and [lr.sub.2] is in clique [C.sub.2]: [lr.sub.1] and [lr.sub.2] are coalesced and added to a new clique, C = [C.sub.1] [intersection] [C.sub.2]. After processing all live ranges across E, there are three cliques C, [C'.sub.1], and [C'.sub.2], where [C'.sub.1] = [C.sub.1] - C and [C'.sub.2] = [C.sub.2] - C. For a live range from C that is spilled, one fewer live range must be spilled from [C'.sub.1] and [C'.sub.2]. Thus [Psi](C) establishes how many ranges of C must be spilled and is computed as a function of [Psi]([C.sub.1]) and [Psi]([C.sub.2]). There exists several possibilities how to distribute the responsibility to spill live ranges to C, [C'.sub.1], and [C'.sub.2] (i.e., to decide how many live ranges of each clique should be spilled). Once we determine how many live ranges to spill from C, [Psi]([C'.sub.1]) and [Psi]([C'.sub.2]) can be computed: [Psi]([C'.sub.1]) = [Psi]([C.sub.1]) - [Psi](C) and [Psi]([C'.sub.2]) = [Psi]([C.sub.2]) - [Psi](C). When we determine [Psi](C), we have to ensure that [Psi](C) [is less than or equal to] T(C) always holds, so there are four practical ways to set [Psi](C):

(a) [Psi](C) = min(T(C), max([Psi]([C.sub.1]), [Psi]([C.sub.2])))

(b) [Psi](C) = min(T(C), min([Psi]([C.sub.1]), [Psi]([C.sub.2])))

(c) [Psi](C) = min(T(C), max(?? T(C)/T([C.sub.1] [Psi]([C.sub.1]) ??, ?? T(C)/T(C.sub.2] [Psi]([C.sub.2]) ??))

(d) [Psi](C) = min(T(C), min(?? T(C)/T([C.sub.1]) [Psi](C.sub.1] ??, ?? T(C)/T(C.sub.2] [Psi]([C.sub.2]) ??))

The clause min(T(C), ...) in the equations is required to ensure that the condition [Psi](C) [is less than or equal to] T(C) holds for the clique C. The first approach aims to lower the [Psi] functions of [C'.sub.1] and [C'.sub.2] because [Psi](C) takes the maximum of [Psi]([C.sub.1]) and [Psi]([C.sub.2]). The second formula aims to maximize [Psi]([C'.sub.1]) and [Psi]([C'.sub.2]). The third and fourth options attempt to divide the number of spilled live ranges among cliques proportional to their sizes.

If T(C) = [Psi](C) for a clique C at any time during this phase, then all live ranges of C must be spilled. At that point, C is an empty clique and removed from the interference graph. When a live range lr is removed from clique C, a copy of [Psi](C) and new conflict edges are generated for lr (because lr interferes with every live range in C). The example of Figure 11 has three live ranges in the clique C. We want to remove lr(x) from the clique. A copy of the interference edges (edges 1 and 2) of C is generated (dashed lines). The new generated interference edge [lr(x),C] is added as well.

[Figure 11 ILLUSTRATION OMITTED]

Example of Delayed Spilling. For Figure 12(a), assume that basic blocks are regions; each basic block has three live ranges, and they all conflict with each other. Again, the vertical lines on the right of the blocks represent the live ranges of the virtual registers x, y, and z (a lightly shaded line indicates that the live range is transparent). Three colors are needed to color the interference graph of each block. Since we have only two registers, we must spill one live range from each block. The graph simplification phase (Section 3.1.2) needs to spill one live range from each interference graph of basic blocks so as to maintain the simplifiability constraint.

Without delayed spilling, spilling decisions need to be made locally, but such decisions may not be the best from a global point of view. If [lr.sub.1](y), [lr.sub.2](z), and [lr.sub.3](x) are spilled (no spill cost), six load/store operations (shuffle code) are required to move values between registers and memory, one store and one load operation for each live range as shown in Figure 12(b). The delayed-spilling technique is able to delay spilling decisions by constructing a clique for each block to collect the two transparent live ranges of the block. The [Psi] function for each clique is one (one out of two will eventually be spilled). Consider that the interference graphs of [B.sub.1] and [B.sub.2] are fused. The unioning rules listed in Figure 6 decide that [lr.sub.2](x) and [lr.sub.1](y) should not be spilled because [lr.sub.1](x) and [lr.sub.2](y) are nontransparent live ranges. Therefore, we spill [lr.sub.1](Z) and [lr.sub.2](z) (depicted in Figure 12(c)). This spilling decision results in only four load/store operations.

[Figure 12 ILLUSTRATION OMITTED]

3.2.4 Taking a Snapshot of the Interference Graph. As mentioned earlier, combined live ranges may have to be split to maintain the simplifiability invariant. As two live ranges, [lr.sub.1] and [lr.sub.2], are combined into [lr.sub.12], their interference edges, [Psi]([lr.sub.1]) and [Psi]([lr.sub.2]), are merged. Duplicate edges may arise and are thrown away. To undo combining, the register allocator must retain the original interference graph. Otherwise, it is hard and very inefficient to recompute interference information. This step takes a partial snapshot of the current interference graph before combining any splitable live ranges.

3.2.5 Unioning Splitable Live Ranges. This step traverses the list of live ranges that span edge E and combines each pair of live ranges; all these resulting live ranges are splitable. Unioning two splitable live ranges [lr.sub.1] and [lr.sub.2] requires two operations that are similar but not identical to unioning nonsplitable live ranges. First, the two live ranges are combined into one live range, [lr.sub.12], as described in Section 3.2.2. The second operation merges interference edges without suppressing duplicate edges. There is another noteworthy point--attributes of live ranges are not propagated at this point of time. All differences of unioning nonsplitable and splitable live ranges originate from the fact that the latter wants to retain the old interference information so as to split combined live ranges easily, and the former does not. Deletion of duplicate edges and propagation of attributes of splitable live ranges are deferred until the simplifiability invariant is satisfied.

3.2.6 Maintaining the Simplifiability (Colorability) Invariant. The step that maintains simplifiability ensures that the invariant holds by performing simplification on [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. The algorithm described in Nickerson [1990] is used to deal with register pairs. This step consists of four major components: compute degree, remove nodes with degree less than N, lower register pressure, and commit the live-range unioning decision. Computing the degree and removing unconstrained nodes proceed as usual. Lowering the register pressure has to undo one or more unioning operations.

Lowering Register Pressure. If simplification of [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] blocks, then fusing the interference graphs of [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] and [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] along edge E induces high register pressure. Some actions are required to lower the register pressure. In a first attempt, we find a transparent live range to spill, instead of splitting live ranges. Spilling a transparent live range incurs no spill cost (but implies shuffle cost), and the delayed-spilling technique may lower the incurred shuffle cost, i.e., push shuffle code onto less frequently executed edges. After spilling a transparent live range, degrees of its neighbors are decreased correspondingly.

If there is no transparent live range, then a coalesced (splitable) live range [lr.sub.12] is chosen from the remaining constrained nodes to split, [lr.sub.12] is split along E back into [lr.sub.1] and [lr.sub.2]. The choice of which live range to split is solely based on a node's degree, because the shuffle costs (move) at edge E for all splitable live ranges are the same. After splitting [lr.sub.12], if [lr.sub.1] (or [lr.sub.2]) becomes unconstrained, [lr.sub.1] (or [lr.sub.2]) is removed (and the degree of those nodes that conflict with [lr.sub.1] (or [lr.sub.2]) is lowered). The degree of the interference graph may therefore be lowered because of splitting, allowing simplification to proceed from the point where it blocked. If simplification blocks again, additional live ranges are split.

Experimental results show that choosing the splitable live range that has the smallest degree incurs fewer splittings than choosing the highest degree. The reason is that [lr.sub.1] and [lr.sub.2] are more likely to become unconstrained using the heuristic of choosing the smallest degree. Since the original graphs [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] and [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] are simplifiable, then at worst all live ranges that span E are split.

Committing Live-Range Unioning Decision. By the time we get to the "committing unioning decision" step, the simplifiability constraint is satisfied, and all necessary spilling (transparent live ranges) and splitting (coalesced live ranges) decisions are made. Some live ranges are coalesced, and some live-range unions are suppressed. The step now commits all coalesced live ranges. Namely, those coalesced live ranges will never be split at edge E. Thus, the attributes of the coalesced live ranges are then propagated. Duplicate interference edges can now be eliminated.

3.3 Placement of Shuffle Code

After the color-assignment phase, shuffle code is inserted as necessary along edges. At an edge E = <[B.sub.1],[B.sub.2]>, shuffle code is inserted for a virtual register x that has been split at E, i.e., if [lr.sub.1](x) and [lr.sub.2](x) have not been assigned the same storage location. Shuffle code can be of three types: shuffle move, shuffle store, and shuffle load.

One simple rule of inserting a shuffle store on E is to store the value of x back to memory when [lr.sub.1](x) is in a register, and [lr.sub.2](x) is spilled. The rule, however, is too conservative because x's memory location may already contain the up-to-date value. When fusing graphs, the fusion operation propagates an attribute (has_def) indicating if the value of a live range is defined within the live range. If this attribute of a live range lr(x) is set, then shuffle stores are needed at the edges that exit to a spilled live range lr'(x). For example, in Figure 13(a), [lr.sub.1](x) is assigned to a register, [r1.sub.x], and [lr.sub.2](x) is spilled. Because x is defined in [lr.sub.1](x) (has_def is set), the value of [lr.sub.1](x) is stored back to memory by inserting a shuffle store on edge [E.sub.1]

[Figure 13 ILLUSTRATION OMITTED]

When the has_def attribute of a live range is not set (no definition in the live range), shuffle stores may not be needed. For instance, in Figure 13(b), [lr.sub.1](x) is also assigned to register [r1.sub.x], and [lr.sub.2](x) is spilled. However, the value of [lr.sub.1](x) is not defined in [lr.sub.1](x) (has_def is not set) and comes from a spilled live range, [lr.sub.3](x). Because the value of x in its memory location is still up-to-date, no shuffle store is needed on edge [E.sub.1].

We have to pay attention if the value of [lr.sub.1](x) comes from another live range that is not spilled and whose has_def attribute is set as shown in Figure 13(c). At the exit edge [E.sub.1] of [lr.sub.1](x), the value of x in its memory location is not up-to-date so a shuffle store is required to update the memory location of x.

A simple flow-analysis technique is used to compute if the value of a live range lr(x) in its memory location is up-to-date or not. The flow analysis operates on shuffle moves unlike typical flow analyses in the global optimization phase that operate on control-flow edges. The attribute flows through shuffle moves (forward direction from sources to destinations). Therefore, in the presence of the flow analysis, we know if a shuffle move brings a value that is not yet stored back to its memory location. If the value of lr(x) comes from a shuffle move and the value in memory is not up-to-date, shuffle stores are required at the edges that exit to spilled live ranges regardless of the has_def attribute of lr(x).

3.3.1 Ordering of Shuffle Code. There is an ordering in which shuffle code is inserted. Shuffle stores are inserted first. Shuffle moves are appended after shuffle stores. Finally shuffle loads are inserted after shuffle moves. Multiple shuffle moves may exist on edge E. Special attention must be given to the ordering among the shuffle moves to prevent the source value of a move being targeted by another move. For instance, consider physical registers r1, r2, r3, and r4 (a subscript x in [r.sub.x] indicates the virtual register) that are involved in two shuffle moves for x and y: [r3.sub.x] [left arrow] [r2.sub.x] and [r4.sub.y] [left arrow] [r3.sub.y]). If the ordering for the two moves is [r3.sub.x] [left arrow] [r2.sub.x] followed by [r4.sub.y] [left arrow] [r3.sub.y], the value of y in [r3.sub.y] is destroyed by the first move of x. If we swap, however, the two moves so that [r4.sub.y] [left arrow] [r3.sub.y] comes first, the two shuffle moves are able to exchange values safely. We find the ordering for shuffle moves by constructing a directed graph whose nodes are shuffle moves. An arc from a move [mv.sub.1] to [mv.sub.2] represents that the destination of [mv.sub.1] and the source of [mv.sub.2] share the same register. Traversing from leaves to roots determines the ordering for shuffle moves. If there is a cycle in the graph, saving/restoring a register to/from a temporary memory location is necessary to replace a move in the cycle so as to break the cycle. To provide an instance, consider two moves [r2.sub.x] [left arrow] [r3.sub.x] and [r3.sub.y] [left arrow] [r2.sub.y] There is a cycle between the two moves. The code sequence for the two shuffle moves is st [r2.sub.y], temp; [r2.sub.x] [left arrow] [r3.sub.x]; 1d [r3.sub.y], temp. We count in this case three overhead operations (1 load, 1 store, and 1 move), not two.

3.3.2 Optimization of Placement. The insertion of shuffle code is simple and straightforward, but this step may result in partial redundancies in the code. Code hoisting [Knoop et al. 1992], code sinking [Knoop et al. 1994], and partial redundancy [Chow 1984] techniques can be used to optimize the placement of shuffle loads/stores [Lueh 1997].

4. EVALUATION

We first discuss how fusion-based register allocation addresses the shortcomings of other approaches to register allocation. Then we present a quantitative evaluation of the benefits of fusion-based register allocation.

4.1 Modeling Different Register Allocation Approaches

We use the framework of Section 2 to structure the discussion (the cmcc compiler used for the quantitative comparison also uses this framework).

Chaitin-style coloring [Chaitin et al. 1981; Bernstein et al. 1989] uses simplification to determine the coloring order. If simplification blocks (all nodes have degree greater or equal N), a least-cost live range is spilled (added to S) so that the simplification process can proceed. The Chaitin-style approach makes out-of-color spilling decisions in the color-ordering phase. The principal shortcoming of this approach is that the register allocator makes "all-or-nothing" decisions: all definitions and uses of a spilled live range go through memory even though some parts of the live range could have been allocated a register. Fine tuning the spilling heuristic [Bernstein et al. 1989] or delaying spilling decisions [Briggs et al. 1989] may improve individual programs but do not address the fundamental problem.

Optimistic coloring [Briggs et al. 1989] also uses simplification to determine the coloring order. When simplification blocks, however, rather than spilling, optimistic coloring pushes a least-cost live range onto C, removing the live range from the residual graph. In this manner, a live range that would have been spilled by the Chaitin-style approach is given another chance at receiving a register in the color-assignment phase. Optimistic coloring defers out-of-color spilling decisions until the color-assignment phase, which spills those live ranges that cannot be assigned legal colors (all colors are taken up by neighbors). However, optimistic coloring does not address the fundamental limitation of Chaitin-style coloring.

Priority-based coloring determines the coloring order according to a priority function, which attempts to assign registers to the most important live ranges and to spill the least important ones if necessary. Priority-based coloring as described by Chow [Chow and Hennessy 1990] splits a live range lr(x) if there is no legal color for lr(x) by forming a new live range segment [lr.sub.R](x) such that the region R contains as many live units (i.e., basic blocks where x is live) as possible yet remains colorable. R is formed in a breadth-first traversal of the control-flow graph, preferably starting from a live unit where the first reference to x is a definition. A live range is spilled (i.e., remains in its home location) when no live units that comprise the live range can be given a register. This priority-based approach neither takes execution frequency nor program structure into account when splitting live ranges, and there is no guarantee that splitting points do not end up along frequently executed edges. Shuffle code induced by a split may end up, for example, on a loop back edge. (Subsequently, code motion may optimize the placement of the shuffle code, but this phase is decoupled from register allocation.) Larus and Hilfinger [1986] implement the priority-based approach in the SPUR Lisp compiler.

The approach to live range splitting described by Chow and Hennessy [1990] cannot be modeled by our framework, but otherwise, priority-based coloring can be implemented: first, unconstrained live ranges are identified and pushed onto C. Then the priority function determines the order in which constrained live ranges are pushed onto C (from the smallest to the biggest). During the color-assignment phase, if a live range cannot be assigned a legal color (because of conflicts with the neighbors' colors), then priority-based coloring spills this live range.

This framework emphasizes what priority-based coloring [Chow and Hennessy 1990] and the Chaitin-based approaches have in common: they all try to determine the color ordering in which live ranges are assigned colors (registers). Priority-based coloring uses cost analysis as the priority to determine the ordering and guarantees that the register assignment phase assigns registers to the most important (high savings of memory accesses) live ranges. Chaitin-style, optimistic, and fusion-style simplify the interference graph to decide the ordering. Optimistic coloring is similar to priority-based coloring in the sense that both approaches delay out-of-color spilling decisions until the color-assignment phase.

Fusion-style register allocation requires the shuffle-code insertion phase to insert code to move values at splitting points. Shuffle-code insertion is just a nop for the Chaitin-style, optimistic, and priority-based approaches because these approaches do not split live ranges.

Table I summarizes how the register allocation framework can model various register-allocation approaches. Several of the approaches that take program structure into account can be viewed as variations of fusion-based register allocation.
Table I. Different Register-Allocation Approaches

                       Register allocation

Phases           Chaitin-style        Priority-based

Graph            per function         per function
construction

Graph            X                    X
reconstruction

Coalescing       X                    X

Color            simplification       priority
ordering         out-of-color spill

Color                                 out-of-color spill
assignment       call-cost spill      call-cost spill

Spill-code       X                    X
insertion

Shuffle-code
insertion

                       Register allocation

Phases           Optimistic           Fusion-style

Graph            per function         per function
construction                          out-of-color spill
                                      split

Graph            X                    X
reconstruction

Coalescing       X                    X

Color            simplification       simplification
ordering

Color            out-of-color spill
assignment       call-cost spill      call-cost spill

Spill-code       X                    X
insertion

Shuffle-code                          X
insertion


4.1.1 Tera-Like Allocation. A Tera-like approach can be modeled by using loops as regions. The edge ordering for the control-flow edges that connect regions are ordered based on the estimated execution frequency using loop depth--the edges in the inner loops appear in front of the edges in the outer loops. We thereby perform graph fusion-based on the edge ordering. The way that we model the Tera approach is not identical to what was described by Callahan and Koblenz [1991] because fusion-style coloring does not construct the tile tree and perform register allocation tile by tile. However, both approaches have the same perspective on register allocation. That is, the inner loops have more freedom in terms of assigning registers. The fusion-based approach should outperform the Tera approach for two reasons. First, no premature binding decisions have been made in the fusion-based approach. Second, the Tera approach does not take into account the edges that connect a parent and a child tile to prevent live ranges from being split on one of the edges that is executed more often than the others. The fusion-based approach can prioritize those edges based on profile information so as to avoid splitting live ranges at frequently executed edges.

4.1.2 PDG-Based Allocation. The RAP compiler [Norris and Pollock 1994] performs register allocation in a hierarchical fashion based on the PDG. A region node in the PDG is a collection of predicates and statements that are executed under the same control conditions. A region node R may consist of subregion nodes. To model the PDG-based approach, we can exclude the subregion nodes from R and consider R as a region in fusion-style coloring. We then traverse the PDG from leaves to the root like the way that the RAP compiler performs register allocation. During traversing the PDG, we order control-flow edges by assigning higher priorities to the control-flow edges that connect the current region to its subregions than to the rest of the control-flow edges that are not yet assigned priorities. The PDG-based approach modeled by fusion-style coloring should outperform the conventional PDG approach for the same two reasons as described in modeling the Tern approach. First, no premature binding decisions have been made in fusionstyle coloring. Second, the conventional PDG-based approach does not take into account the execution frequencies of the edges that connect a parent region and a subregion whereas the fusion-style approach can prioritize the edges based on their estimated execution frequencies.

4.1.3 Multiflow-Like Allocation. The Multiflow compiler integrates scheduling and register allocation. Scheduling and register allocation are done on a trace basis. That is, the compiler picks a trace and passes the trace to the code scheduler; the code scheduler then schedules the code of the trace and performs register allocation. Because the compiler used for this evaluation does not include a framework that integrates scheduling and register allocation like the Multiflow compiler, fusion-style coloring can model only the register allocation part of the Multiflow approach. One key idea of the Multiflow approach is that traces chosen first have more freedom in using registers--they are not constrained by other traces because other traces are not yet compiled. Fusion-style coloring has the flexibility to choose traces as regions, maintains the simplifiability invariant to delay color-assignment decisions, and uses the delayed-spilling technique to defer spilling decisions. Therefore, unlike the Multiflow approach, traces are never constrained by the color-binding decisions of other traces.

There are two counteractive aspects when register allocation chooses traces as regions. The code scheduler requires instructions to exploit ILP. The bigger the trace, the more instructions the code scheduler can exploit. However, the bigger the trace, the more register pressure the trace imposes on register allocation. Once a trace is picked, no live-range splitting could happen using the Multiflow approach. As a result, the spilling decisions made by the register allocator are all-or-nothing spilling. The fusion-based approach, on the other hand, has the flexibility to further divide a trace into smaller regions and applies graph fusion to combine the graphs of the smaller regions into the interference graph of the trace. Live-range splitting is therefore possible within the trace.

4.1.4 SSA. With slight modification, the fusion-based framework can model the approach that uses the Static Single Assignment (SSA) representation to determine splitting points. Constructing [G.sub.SSA] is similar to constructing [G.sub.R] where R is the region that covers the whole function. Splitting a live range at a [Phi] node is the same as splitting the live range at the entry edges of the block that contains the [Phi] node because [Phi] nodes are inserted into the entry of blocks [Cytron et al. 1986]. We apply the fusion operation on every control-flow edge. During the step of unioning nonsplitable live ranges, coalescing two live ranges is suppressed if the current edge is a [Phi]-node splitting point for the two live ranges. The later color assignment eliminates shuffle moves using biased coloring.

Briggs [1992] uses the SSA representation to determine splitting points, and splitting live ranges takes place prior to coloring. A live range is split at a [phi] node when the incoming values to the [Phi] node result from distinct assignments. All live ranges that span a loop are split immediately before and after the loop, i.e., splitting decisions are made prior to coloring.

Making splitting decisions too early (prior to coloring) has several disadvantages. First, decisions regarding which live ranges to be split and where to split them are made prematurely. Thus, live ranges may be split unnecessarily, resulting in a performance degradation due to unnecessary shuffle code. Various heuristics have been developed to eliminate shuffle code by increasing the chance that the same color is given to partner live ranges, e.g., biased-coloring or conservative coalescing [Briggs 1992], but they cannot overcome the principal limitation. Rather than determining the critical regions where splitting and spilling are beneficial, both approaches split live ranges arbitrarily and greedily, with the hope that later heuristic steps will clean up unnecessary splits. Second, the places where live ranges are split are not necessarily the places of low execution probability. Although it is possible to use execution probabilities to select splitting points, there is no guarantee that the resulting live ranges fit into registers without spilling. That is, the register allocator runs the risk of either splitting too much (leading to unnecessary shuffle code), or not enough (with the consequence that high-frequency live ranges are spilled).

4.2 Other Approaches

Probabilistic register allocation [Proebsting and Fischer 1992] is a hybrid of the priority-based and program-structure-based approaches. The approach consists of three steps, local register allocation, global register allocation, and register assignment. The global register allocation step partitions a program into regions based on the loop hierarchy and proceeds from the innermost loops to the outermost loops. Like priority-based coloring, probabilistic register allocation also assigns registers to live ranges in a priority manner. Variables initially reside in memory instead of virtual registers (with load/store for every use/definition). The priority function is the probability that a variable x will reside in a register at a given use (load) multiplied by the benefit (savings) that this use accesses x's value from a register rather than memory. The inverse of the probability roughly indicates the distance between definitions of the value and its use. The longer the distance, the more likely it is that the register allocator assigns registers to other live ranges between the definitions and the use--therefore the lower the probability that the value resides in a register. Although this probability function takes the program structure into account, this approach provides in its choice of regions less flexibility than fusion-based allocation.

The RAP compiler [Norris and Pollock 1994] colors the region nodes in a function's Program Dependence Graph (PDG), proceeding in a hierarchical manner from the leaves to the root. Chaitin's algorithm is used at each region node. This approach splits live ranges on region boundaries. A region is a collection of predicates and statements that are executed under the same control conditions.

The IMPACT compiler takes a similar approach as the Multiflow compiler [Hank 1996]. The compiler selects a region and performs classical global optimizations, ILP optimization, code scheduling, and register allocation. Like the Multiflow approach, regions that are compiled first have more freedom in using registers; a similar binding technique is used to eliminate the shuffle code on region boundaries.

Coagulation code generation [Morris 1991] integrates code generation and register allocation based on the ordering of prioritized control-flow edges. The most important (frequently executed) regions have the maximum freedom in using registers, like in the Multiflow compiler.

These approaches [Callahan and Koblenz 1991; Norris and Pollock 1994; Freudenberger and Ruttenberg 1992; Hank 1996; Morris 1991; Proebsting and Fischer 1992] attempt to make graph coloring sensitive to program structure by dividing a program into regions and prioritizing the regions according to execution probabilities. The register allocator then colors regions in the order of their priorities, and shuffle code is inserted at the boundaries of these regions. As coloring is performed hierarchically, the higher-priority regions have more freedom in using registers, and shuffle code tends to be outside the higher-priority regions. As long as the regions reflect the execution frequency, they favor the most frequently executed parts of a program.

However, those approaches all make early binding decisions when coloring is performed to a region--either assign real physical registers to live ranges or determine which live ranges should share the same registers. Early binding decisions impose unnecessary constraints to lower-priority regions because the lower-priority regions must bind to the already made coloring decisions.

Kurlander and Fischer [1994] perform live-range splitting register allocation to free up registers that can be used to improve code scheduling. Empty delay slots in the final schedule are filled with shuffle code to split and spill live ranges. Spilling frees up registers, and these additional registers are used to remove false dependencies induced by the reuse of registers.

Live range splitting can be avoided if the compiler uses minimal live ranges, i.e., live ranges have a very fine granularity. Essentially every reference generates a new live range [Kolte and Harrold 1993]. However, such live ranges may be long in some circumstances. Consider a program such that x contains only a definition and one use. Only one live range is constructed for x. The definition and use may be far apart, and the live range contains a lot of blocks with high register pressure. Splitting the live range is not possible, even though splitting the parts of the live range that have high register pressure is desirable.

Rematerialization [Briggs et al. 1992] spills live ranges that are cheaper to recompute than to store/load them back/from memory, e.g., loading constant values or computing addresses. This optimization is orthogonal to the issues discussed in this article, and rematerialization can also be included in a fusion-based register allocator [Lueh 1997].

Bergner et al. [1997] address directly the problem of spilling high-pressure regions by identifying the part of a live range that interferes with another live range. This interference region is then spilled. This approach is successful in identifying the regions of the program that benefit from spilling a live range, but the boundary between a high-pressure region and a low-pressure region may be frequently executed. In this case shuffle code is inserted into a frequently executed block. Fusion-based register allocation takes the structure of the program into account when growing the spill region.

4.3 Performance

We measured the benefit of the register allocation strategy for various SPEC92 programs (alvinn, compress, ear, eqntott, espresso, gcc, li, sc, doduc, fpppp, matrix300, nasa7, spice, and tomcatv) using the cmcc compiler [Adl-Tabatabai et al. 1996; Lueh et al. 1997]. Our evaluation employs the framework described in Section 2. We contrast the fusion-based approach with the results using an improved Chaitin-style approach. This compiler does not implement clean spilling, but uses call-cost-directed optimizations [Lueh and Gross 1997]. The spilling heuristic is Bernstein's [Bernstein et al. 1989]; it is optimistic although this aspect contributes little to overall performance (see Lueh and Gross [1997] for a detailed comparison with optimistic coloring). The compiler implements biased coloring [Briggs 1992]. To avoid confusion with register allocation as proposed by Chaitin [Chaitin et al. 1981], we use the term "improved Chaitin-style". Our data are for the MIPS architecture, using all registers and adhering to the standard calling convention. We start with the smallest region size, one basic block, and grow these until we have the interference graph for the function. All functions within the main loop of alvinn are inlined.

For both approaches, we run two experiments. First, we use only static information to guide register allocation. That is, we use loop depth to estimate the execution frequency of a block. In the fusion-based approach, when considering the basic blocks inside a loop, we use breadth-first order to select edges for fusing. In a second experiment, we use profile information from a prior execution with the same input set.

As expected, for programs with many small functions, Chaitin's algorithm works fairly well, and the fusion-based approach produces identical results. For most of these programs, use of profile information improves the result, independent of the register allocation strategy. Let us, therefore, turn our attention to programs with significant register pressure.

alvinn, nasa7, tomcatv, doduc, and fpppp are the programs with high register pressure, and here we see the limitations of the all-or-nothing approach to spilling that is the foundation of Chaitin-style register allocation. Figure 14 shows that, using profile information, register allocation based on graph fusion is able to split live ranges properly; it cuts for alvinn nearly 52% of the overhead required by Chaitin-style allocation. Using static information, Chaitin-style allocation already finds the "best" complete live ranges to put into registers; therefore, the result is not improved by using profile information. Our approach breaks live ranges, as can be seen by the lower number of spill loads. And profile information provides a better cost function for the color-assignment phase (when deciding if a caller-save or a callee-save register should be assigned to a live range, or the live range should be spilled), resulting in a further improvement. Looking at the results for fpppp using static information, we see the same story. The overhead of fpppp is reduced by 28%; the large contribution of shuffle code indicates that live ranges have been split. If we use profile information, both allocators can improve their choice of live ranges that are allocated a register and identify the most promising live ranges. Chaitin-based allocation produces a slightly better result; since both approaches include heuristics (e.g., if there are more live ranges that must be spilled than there are transparent live ranges), these heuristics can introduce minor differences. For tomcatv, our approach removes 40% of the overhead operations. Because the estimated static execution frequency provides a good estimation, the profile information does not reduce overhead operations relative to the static case.

[Figure 14 ILLUSTRATION OMITTED]

There are three big functions in doduc: subb, supp, and ddeflu. These three account for the majority of data movement operations, subb and supp consist of only one big block. For those types of functions, our register allocator produces the same amount of data movement operations as a Chaitin-style allocator, since the sole basic block is the complete compilation unit. One simple way to deal with such blocks is to partition the big block into smaller blocks so that our approach can be applied to make better spilling and splitting decisions. We show in Figure 14 the result for ddeflu, where large portions of the spill loads and stores are replaced by shuffle code.

4.4 Choice of Base Region

One of the parameters of fusion-style register allocation is the shape of the base regions (for which we compute the initial interference graphs). If we choose a complete function as the base region, then fusion-style register allocation is identical to the (function-wide) Chaitin-style register allocation. The size of the base region affects spill and shuffle costs. Lueh [1997] includes an example of a program that favors a small size of the base region, as well as an example that favors a big size of the base region. Since there is no a priori best decision, the compiler may provide the option of different base regions. Lueh et al. [1997] discuss the influence of the choice of base region in the presence of region-enhancing transformations like loop unrolling.

4.5 Sensitivity Analysis

We now report results on the effectiveness of fusion-based register allocation when we vary the number of registers or use different region models.

4.5.1 Number of Registers. The transformations of region-based compilers increase register pressure, and another way to assess how a register allocator deals with register pressure is to investigate how sensitive the allocator is to the number of registers that are available. Clearly, as we increase the number of registers, there comes the point of diminishing return. The next figures depict the number of overhead operations as we allow the register allocator to use more registers (the number of additional registers is shown on the x-axis). Recall that the MIPS architecture has separate register banks for integer and floating-point values. The notation ([R.sub.i], [R.sub.f], [E.sub.i], [E.sub.f]) of the next figures shows the combination of caller-save and callee-save registers for the two register banks. [R.sub.i] and [R.sub.f] are the number of caller-save registers for integer and floating-point values, respectively. [E.sub.i] and [E.sub.f] are the number of callee-save registers for integer and floating-point values. The standard calling convention of the MIPS machine uses 4 registers of the integer bank to pass arguments and 2 registers to pass function-return values; in addition, 2 floating-point registers pass arguments, and 2 floating-point registers can hold a function-return value. All these registers are caller-save registers. Hence, the register allocator uses at least (6,4,0,0).

Looking at the SPEC92 programs, we can classify them roughly into five classes:

(1) Fusion-style outperforms Chaitin-style: Fusion-style register allocation produces consistently better results than Chaitin-style allocation. Examples are alvinn, espresso, nasa7, and tomcatv (Figure 15).

[Figure 15 ILLUSTRATION OMITTED]

(2) Fusion-style is similar to Chaitin-style (profile sensitive): Programs sensitive to getting accurate frequency information (i.e., given the same frequency information, both fusion-style and Chaitin-style register allocation perform about the same), compress, sc, li, and gcc are four examples (Figure 16).

[Figure 16 ILLUSTRATION OMITTED]

(3) Fusion-style is better than Chaitin-style (static case): The heuristics that determine which live range to spill are sometimes helped by profile information, and in this case, a Chaitin-style allocator may obtain the same results as fusion-style allocation; see Figure 17 for an example (fpppp). The results for the Chaitin-style allocator are marginally better than the result for fusion-style allocation. Even with profile information, the register allocator must make many guesses, and results like those depicted in Figure 17 indicate that the spilling decision still depends on some heuristics.

[Figure 17 ILLUSTRATION OMITTED]

(4) Fusion-style is better than Chaitin-style (small number of registers): Fusion-style register allocation produces fewer overhead operations than Chaitin-style allocation when the register allocator uses a small number of registers. As the number of register increases, the difference of register overhead diminishes, ear and matrix300 are two examples (Figure 18).

[Figure 18 ILLUSTRATION OMITTED]

(5) Fusion-style and Chaitin-style are about the same: Finally, there are programs where neither the choice of register allocation strategy nor the use of profile information seems to make a difference. Examples of this class of programs are eqntott, doduc, and spice shown in Figure 19.

[Figure 19 ILLUSTRATION OMITTED]

4.5.2 Different Region Sizes. In this section, we evaluate the influence of different region sizes to fusion-based coloring. Three kinds of region are investigated, blocks, loops, and traces. The region-formation phase determines a trace by selecting a seed block and growing the trace along desirable paths. A trace T is expanded in the forward, as well as the backward, direction:

--forward: T is expanded forward by adding block [B.sub.k] to T if the condition is satisfied.

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

--backward: T is expanded backward by adding block [B.sub.i] to T if the condition is satisfied.

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

freq([B.sub.j]) and freq(E) are the estimated execution frequencies of block [B.sub.j] and edge E respectively. The two constants, [Beta] and [Gamma], are set to 0.5. These two conditions are the conditions used in the IMPACT compiler to form regions [Hank 1996].

Figures 20 and 21 depict the experimental results of fusion-style coloring with three different region sizes (loops, traces, and basic blocks). There are several noticeable aspects. None of the region sizes always outperforms the others because the results depend highly on program characteristics and combination of registers. The influence diminishes as the number of registers increases because the number of registers is then sufficient for all different region sizes. In general, alvinn, espresso, gcc, and doduc prefer a small size of the base region. For gcc and doduc, the results of using traces as regions are as good as using blocks, fpppp, nasa7, and tomcatv prefer a big size of the base region (loops or traces).

[Figures 20-21 ILLUSTRATION OMITTED]

5. EXECUTION TIME AND COMPILATION TIME

There are two other dimensions of interest to compiler designers: execution time related to register overhead and compilation time. Table II shows the speedup due to fusion-style register allocation (compared to an improved Chaitin-style allocator). The table shows that fusion-style coloring can speed-up execution time by up to 8.4% over Chaitin-style coloring.

Table II. Speed-Up of Execution Time over Chaitin-Style Coloring (percentage)
 program   All registers   14 int, 14 float

espresso             0.0                7.5
   nasa7             2.4                5.6
 tomcatv             8.4                7.0


Table III shows the cost of fusion-style register allocation; this table lists the contribution of register allocation to overall compilation time (seconds) for the cmcc compiler, cmcc is a research compiler and not fine-tuned for compilation speed; overall cmcc takes about twice as long as the native compiler. It is a cross-compiler and executes on various platforms; the data presented here are collected on a DEC AlphaStation (333MHz). We compile the SPEC92 programs and measure the compilation time for Chaitin-style and fusion-style register allocation, using loops as well as single basic blocks as base regions. We also stress register allocation by performing the unrolling transformation (unroll three times) to enlarge the code size for some of the programs.

The complexity of simplification is O(V + E) where V is the number of live ranges, and E is the number of interference edges. After fusing graphs along an edge, fusion-based register allocation performs simplification to maintain the simplifiability invariant. Hence, at first sight, we expect to see that the fusion-based approach has a big increase of compilation time because the complexity of maintaining the simplifiability constraint is O(F * (V + E)) where F is the number of fusion operations. From Table III, however, we see that register allocation time increases only by a small constant factor ([is less than] 3). Even when the base regions are basic blocks, and the compiler applies a fusion operation on every control-flow edge, only for gcc, sc, and spice we see a larger increase ([is less than] 4.5). Unrolling inner loops increases the work for the register allocator; for gcc (c-parse.tab.c), the size of the control-flow graph of the yylex routine is tripled by loop unrolling (unroll three times). The size grows from (415 blocks, 660 edges) to (1426 blocks, 2246 edges). Even in such a big control-flow graph, the cost of the fusion operations increases only about by a factor of 3 if the base regions are loops. To put this data into perspective, the complete code generator including register allocation is responsible for less than 20% of the total compilation time.
Table III. Compilation Time of Register Allocation (seconds)

                              Standard

                                    Region

Program               Chaitin    Loop      BB

alvinn inline             1.3     1.6     1.6
compress                  1.7     2.0     4.4
ear                       4.2     5.5     8.2
eqntott                   4.7     6.1    10.7
espresso                 29.3    39.6    71.6
li                        5.0     5.9     9.4
sc                       15.1    25.5    62.8
gcc                     148.7   211.7   669.7
gcc (c-parse.tab.c)       7.0    14.7    61.2
doduc                    51.4    78.1   128.8
fpppp                    60.6    70.8    83.6
matrix300                 1.1     1.7     2.5
nasa7                     7.4    10.9    13.8
spice                   118.2   311.0   392.4
tomcatv                   3.4     3.7     3.7

                              Unrolled

                                    Region

Program               Chaitin    Loop      BB

alvinn inline             4.8     5.2     5.7
compress                  3.9     5.2    11.9
ear                      15.7    26.6    49.5
eqntott                  18.3    29.3    74.4
espresso                104.0   175.6   417.2
li                       11.7    16.5    32.8
sc                       33.2    71.2   160.1
gcc
gcc (c-parse.tab.c)      22.1    55.0   166.7
doduc
fpppp
matrix300                 4.1     7.5    11.0
nasa7                    55.2    80.1   100.9
spice
tomcatv                  12.7    17.2    30.7


There are two major factors that keep the compilation time spent on register allocation low: partial graphs and cliques. When graphs are fused along edge E that connects two regions, [R.sub.1] and [R.sub.2], the fusion operation needs to deal with only [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] and [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] and not all the graphs in the program. Clique summary nodes also help to speed-up the compilation time spent on simplification because simplifying a clique removes all transparent live ranges in the clique from the graph. Consider that a clique C contains T transparent live ranges. Each transparent live range in C interferes with T - 1 transparent live ranges in C and neighbor(C). Therefore C represents T live ranges and T *(T-1))/2 + T* | neighbor(C) | edges, where | neighbor(C) | is the number of neighbors of C. Removing C and [Psi](C) from the interference graph is equivalent to removing T live ranges and T*(T-1))/2 + T* | neighbor(C) | interference edges. These two reasons keep the price of compilation time within a reasonable and acceptable range.

Fusion-based coloring provides a nice framework to tackle the compilation time problem if the compilation time ever becomes unacceptable or is a big concern to the compiler. The fusion operation takes longer and longer as more and more graphs are fused because the fusion operation gradually deals with bigger graphs, which cover larger regions, and makes all delayed spilling decisions (live ranges are removed from cliques). Graph fusion is based on the edge ordering, so fusing graphs on the edges that are less important (less frequently executed) is more expensive in term of compilation time, yet has less influence on the overall register allocation overhead. To curtail the time spent on graph fusion, we can set a threshold to control the number of control-flow edges along which graphs are fused. The threshold can be based on the time that has already been spent on register allocation or the number of the total control-flow edges that connect regions. As soon as register allocation passes the threshold, the graphs of the regions that are not yet fused with any other graphs are all lumped into one big region [R.sub.remaining] (splitting live ranges is not allowed within the region). Building the graph [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] is as fast as building the graph for Chaitin-style coloring. We then apply the fusion operation to the edges that connect [R.sub.remaining] and other regions.

Table IV shows experimental results to assess a trade-off between compilation time versus register overhead operations for espresso. For the experiment, fusionstyle coloring uses blocks as the base regions and profile information to guide register allocation. Data for two register combinations, 14 integer and 14 floating-point, and 8 integer and 11 floating-point registers, are collected. The "Time" columns indicate the percentage increase of compilation time of fusion-style coloring relative to Chaitin-style coloring. The "Overhead" columns are the percentage reduction of overhead operation over Chaitin-style coloring when we use fusion-style coloring. The leftmost column is the percentage k of total control-flow edges that defines a region; the compiler collects the blocks connected by the first k percentage of edges into one region. Table IV indicates that fusion-based register allocation can maintain good overhead improvement while reducing compilation time. The framework allows compiler writers that are concerned about compilation speed to make a trade-off decision that is consistent with the basis for register allocation.

Table IV. Percentage Increase of Compilation Time and Improvement of Overhead Operations (espresso)
                   Time                   Overhead

          (Fusion/Chaitin - 1)   (Fusion/Chaitin - 1)
                  * 100                  * 100

Fusion%   (14i,14f)   (11i,8f)   (14i,14f)   (11i,8f)

  100       128.0      120.6       62.6        59.5
   90       126.8      118.0       62.6        56.6
   80       117.2      110.6       59.2        42.5
   70       107.5      100.9       45.1        36.0
   60        97.0       91.7       31.5        25.4
   50        85.2       77.7       26.9        24.3
   40        72.0       65.6       24.8        18.4
   30        57.2       51.3       16.6        14.9
   20        41.9       38.7       16.6        12.7
   10        25.6       23.8       11.6        13.0


6. CONCLUSIONS

Fusion-based register allocation integrates spilling and splitting in a single framework. Therefore allocation of a live range is no longer subject to the "all-or-nothing" rule of Chaitin-style register allocation. Chaitin-style simplification either allocates a live range to a register or spills the live range; fusion-based register allocation overcomes this restriction by looking at individual live range segments, and the size of these live range segments is determined by the compiler's model of a region. A region can be, for example, a basic block, a trace, a superblock, or a function. The compiler orders the regions by a priority function and builds up a function's interference graph by processing one region after the other. As long as the most frequently executed part of a program has the highest priority, the register allocator moves overhead operations (caused by spills or splits) into the less frequently executed parts of the program.

A compiler obtains several benefits from employing fusion-based register allocation. Since this register allocation method takes regions into account, the compiler can ensure that the register allocator spends its resources (registers) on those parts of the programs that are considered important by the compiler. A consequence of this design is that the compiler can control the amount of time spent on register allocation. These properties are important for compilers that employ scope-enlarging optimizations like loop unrolling or function inlining. As the degree of instruction-level parallelism increases in processor architectures, we expect that more and more compilers have to employ such optimizations and at the same time have to pay attention to compilation time.

The effect of register allocation on total execution time depends on many aspects of the program and the target machine. There is no easy way to account for instruction-level parallelism, the number of registers, and the accuracy of cost functions that determine if a value is allocated a register. In summary, register allocation based on fusion produces good results; it either matches or improves upon the results obtained from a Chaitin-style register allocator; e.g., for the benchmarks from the SPEC92 suite this register allocator improves execution time up to 8.4% over Chaitin-style allocation.

Good performance, sensitivity to regions, and the ability to control the effort spent on the register allocation phase are three crucial properties of advanced compilers. A fusion-based register allocator has these properties and is therefore a suitable component for modern compilers.

ACKNOWLEDGMENTS

The detailed and insightful comments of the referees suggested significant improvements of this article.

REFERENCES

ADL-TABATABAI, A., GROSS, T., AND LUEH, G. 1996. Code reuse in an optimizing compiler. In Proc. OOPSLA '96. ACM, 51-68.

BERGNER, P., DAHL, P., ENGEBRETSEN, D., AND O'KEEFE, M. 1997. Spill code minimization via interference region spilling. In Proc. ACM SIGPLAN '97 Conf. on Prog. Language Design and Implementation. ACM, 287-295.

BERNSTEIN, D., GOLDIN, D. Q., GOLUMBIC, M. C., KRAWCZYK, H., MANSOUR, Y., NAHSHON, I., AND PINTER, R. Y. 1989. Spill code minimization techniques for optimizing compilers. In Proc. ACM SIGPLAN '89 Conf. on Prog. Lang. Design and Implementation. ACM, 258-263.

BRIGGS, P. 1992. Register allocation via graph coloring. Ph.D. thesis, Rice University.

BRIGGS, P., COOPER, K. D., KENNEDY, K., AND TORCZON, L. 1989. Coloring heuristics for register allocation. In Proc. ACM SIGPLAN '89 Conf. on Prog. Language Design and Implementation. ACM, 275-284.

BRIGGS, P., COOPER, K. D., AND TORCZON, L. 1992. Rematerialization. In Proc. ACM SIGPLAN '92 Conf. on Prog. Language Design and Implementation ACM, 311-321.

CALLAHAN, D. AND KOBLENZ, B. 1991. Register allocation via hierarchical graph coloring. In Proc. ACM SIGPLAN '91 Conf. on Prog. Lang. Design and Implementation ACM, Toronto, 192-203.

CHAITIN, G. J., AUSLANDER, M. A., CHANDRA, A. K., COCKE, J., HOPKINS, M. E., AND MARKSTEIN, P. W. 1981. Register allocation by coloring. Research Report 8395, IBM Watson Research Center.

CHOW, F. 1984. A portable, machine-independent global optimizer - design and measurements. Ph.D. thesis, Stanford University.

CHOW, F. C. AND HENNESSY, J. L. 1990. A priority-based coloring approach to register allocation. ACM Trans. on Prog. Lang. Syst. 12, 501-535.

CORMEN, T., LEISERSON, C., AND RIVEST, R. 1990. Introduction to Algorithms. The MIT Press.

CYTRON, R., FERRANTE, J., ROSEN, B., WEGMAN, M., AND ZADECK, F. 1986. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. on Prog. Lang. Syst. 13, 4 (December), 451-490.

FISHER, J. A. AND FREUDENBERGER, S. M. 1992. Predicting conditional branch direction from previous runs of a program. In Proc. Fifth Intl. Conf. on Architectural Support for Prog. Lang. and Operating Systems (ASPLOS V). ACM, 85-97.

FREUDENBERGER, S. AND RUTTENBERG, J. 1992. Phase ordering of register allocation and instruction scheduling. In Code Generation - Concepts, Tools, Techniques, R. Giegerich and S. L. Graham, Eds. Springer Verlag, 146-170.

GAMMA, E., HELM, R., JOHNSON, R., AND VLISSIDES, J. 1995. Design Patterns, Elements of Object-Oriented Software. Addison Wesley.

GAREY, M. AND JOHNSON, D. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, San Francisco.

HANK, R., HWU, W., AND RAU, B. 1995. Region-based compilation: An introduction and motivation. In Proc. 28th Annual ACM/IEEE Intl. Syrup. on Microarchitecture. ACM/IEEE, Ann Arbor, 158-168.

HANK, R. E. 1996. Region-based compilation. Ph.D. thesis, University of Illinois at Urbana -Champaign.

HENNESSY, J. L. AND PATTERSON, D. A. 1990. Computer Architecture A Quantitative Approach Morgan Kaufman.

HWU, W. W., MAHLKE, S. A., CHEN, W. Y., CHANG, P. P., WARTER, N. J., BRINGMANN, R. A., OUELLETTE, R. O., HANK, R. E., KIYOHARA, T., HAAB, G. E., HOLM, J. G., AND LAVERY, D. M. 1993. The superblock: An effective technique for VLIW and superscalar compilation. Journal of Supercomputing 7, 1,2 (March), 229-248.

KNOOP, J., RUTHING, O., AND STEFFEN, B. 1992. Lazy code motion. In Proc. ACM SIGPLAN '92 Conf. on Prog. Lang. Design and Implementation. ACM, 224-234.

KNOOP, J., RUTHING, O., AND STEFFEN, B. 1994. Partial dead code elimination. In Proc. ACM SIGPLAN '94 Conf. on Prog. Lang. Design and Implementation. ACM, 147-158.

KOLTE, P. AND HARROLD, M. J. 1993. Load/store range analysis for global register allocation. In Proc. ACM SIGPLAN '93 Conf. on Prog. Lang. Design and Implementation" ACM, 268-277.

KURLANDER, S. M. AND FISCHER, C. N. 1994. Zero-cost range splitting. In Proc. ACM SIGPLAN '94 Conf. on Prog. Lang. Design and Implementation. ACM, 257-265.

LARUS, J. R. AND HILFINGER, P. N. 1986. Register allocation in the SPUR Lisp compiler. In Proc. ACM SIGPLAN Sym. on Compiler Construction. ACM, 255-263.

LOWNEY, P. G., FREUDENBERGER, S. M., KARZES, T. J., LICHTENSTEIN, W. D., NIX, R. P., O'DONNELL, J., AND RUTTENBERG, J. C. 1993. The Multiflow trace scheduling compiler. Journal of Supercomputing 7, 1,2 (March), 51-142.

LUEH, G. AND GROSS, T. 1997. Call-cost directed register allocation. In Proc. ACM SIGPLAN'97 Conf. on Prog. Lang. Design and Implementation" ACM, 296-307.

LUEH, G., GROSS, T., AND ADL-TABATABAI A. 1997. Global register allocation based on graph fusion. In Languages and Compilers for Parallel Computing (9th Intl. Workshop, LCPC'96). Springer Verlag, San Jose, CA, 246-265. LNCS 1239.

LUEH, G.-Y. 1997. Fusion-based register allocation. Ph.D. thesis, Carnegie Mellon University. CMU-CS-97-135.

MORRIS, W. 1991. Ccg: A prototype coagulating code generator. In Proc. ACM SIGPLAN '91 Conf. on Prog. Lang. Design and Implementation. ACM, 45-58.

NICKERSON, B. R. 1990. Graph coloring register allocation for processors with multi-register operands. In Proc. ACM SlGPLAN'90 Conf. on Prog. Lang. Design and Implementation. ACM, 40-52.

NORRIS, C. AND POLLOCK, L. L. 1994. Register allocation over the program dependence graph. In Proc. ACM SIGPLAN '94 Conf. on Prog. Lang. Design and Implementation. ACM, 266-277.

PROEBSTING, T. A. AND FISCHER, C. N. 1992. Probablistic register allocation. In Proc. ACM SIGPLAN '92 Conf. on Prog. Lang. Design and Implementation ACM, 300-310.

WALL, D. 1991. Predicting program behavior using real or estimated profiles. In Proc. ACM SlGPLAN '91 Conf. on Prog. Lang. Design and Implementation. ACM, 59-70.

Received July 1998; revised May 1999; accepted November 1999

APPENDIX

EDGE ORDERING AND SHUFFLE CODE PLACEMENT: DETAILED EXAMPLE

In this section, we provide an example to show the interaction between edge ordering and shuffle code placement. The code fragment of Figure 22 is taken from the function update_weights of alvinn. Presumably there are three registers.

[Figure 22 ILLUSTRATION OMITTED]

Consider an edge ordering based on loop hierarchy: <[B.sub.3], [B.sub.3]>, <[B.sub.4], [B.sub.2]>, <[B.sub.3], [B.sub.4]>, <[B.sub.2], [B.sub.3]>, <[B.sub.5], [B.sub.5]>, <[B.sub.4], [B.sub.5]>.

[B.sub.3] requires all three registers for tmp74, tmp76, and tmp77, which are all referenced within the block. In addition, there are two transparent live ranges induced by tmp73 and tmp75. These two live ranges are spilled when the interference graph [G.sub.3] is simplified before graph fusion takes place. [B.sub.2] also needs all three registers; therefore, the only transparent live range (tmp74) for [B.sub.2] is spilled.

As we fuse [G.sub.3] along the first edge <[B.sub.3], [B.sub.3]>, the resulting graph does not require any other splitting or spilling decisions, since the simplifiability invariant still holds. While fusing [G.sub.2] and [G.sub.4] along edge <[B.sub.4], [B.sub.2]>, the spilled live range of tmp74 grows (to cover [B.sub.4] and [B.sub.2]). After fusing along the remaining edges, the final resulting graph requires shuffle code to move values among different locations. The shuffle code is highlighted in bold face in Figure 22(b). For tmp75, because there is no definition within the loop; the shuffle store does not have to be on edge <[B.sub.2], [B.sub.3]> and is placed right after its definition, which is outside the loops.

If we change the edge ordering so that <[B.sub.4], [B.sub.2]> and <[B.sub.3], [B.sub.4]> are switched, then the register allocator ends up with a different spilling decision, as seen in Figure 22(c). The shuffle load of tmp74 on <[B.sub.4], [B.sub.5]> in Figure 22(b) is not needed because [B.sub.4] is no longer in tmp74's spill region. Furthermore, the shuffle load of tmp75 is placed on <[B.sub.4], [B.sub.2]> instead of <[B.sub.3], [B.sub.4]>, because the spill region for tmp75, indicated by a dotted line, contains [B.sub.4]. (So instead of being in the loop, it is now on the loop back edge.)

This research was supported in part by Digital Equipment Corp., Intel Corp., as well as the Advanced Research Projects Agency and Rome Laboratory, Air Force Materiel Command, USAF, under agreement number F30602-96-1-0287.

The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Advanced Research Projects Agency, Rome Laboratory, or the U.S. Government.

Authors' addresses: T. Gross, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15123; G. Lueh, Intel Corp., MicroComputer Research Lab, RN6-18, 2200 Mission College Blvd., Santa Clara, CA 95052; A. Adl-Tabatabai, Rubric, Inc. 2121 S. El Camino Way, 10th Fl., San Mateo, CA 94403;

Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
COPYRIGHT 2000 Association for Computing Machinery, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2000 Gale, Cengage Learning. All rights reserved.

Article Details
Printer friendly Cite/link Email Feedback
Author:LUEH, GUEI-YUAN; GROSS, THOMAS; ADL-TABATABAI, ALI-REZA
Publication:ACM Transactions on Programming Languages & Systems
Geographic Code:1USA
Date:May 1, 2000
Words:17251
Previous Article:Context-Sensitive Synchronization-Sensitive Analysis Is Undecidable.
Next Article:Java Bytecode Compression for Low-End Embedded Systems.
Topics:

Terms of use | Privacy policy | Copyright © 2019 Farlex, Inc. | Feedback | For webmasters