Evidence-based static branch prediction using machine learning.
Categories and Subject Descriptors: C.4 [Computer Systems Organization]: Performance of Systems -- measurement techniques; D.3.4 [Programming Languages]: Processors -- compilers; optimization; 1.2.6 [Artificial Intelligence]: Learning -- connectionism and neural nets; parameter learning
General Terms: Algorithms, Languages, Measurement, Performance
Additional Key Words and Phrases: Branch prediction, decision trees, machine learning, neural networks, performance evaluation, program optimization
In this article, we propose a new technique for program-based branch prediction based on a general approach that we have invented, called Evidence-based Static Prediction (ESP). Our results show that using our new approach results in better branch prediction than all existing program-based techniques. In addition, our ESP approach is very general, and can be applied to a wide range of program behavior estimation problems. In this article, we describe ESP and its successful application to the problem of program-based branch prediction.
Branch prediction is the process of correctly predicting whether branches will be taken or not before they are actually executed. Branch prediction is important, both for computer architectures and compilers. Compilers rely on branch prediction and execution estimation to implement optimizations such as trace scheduling [Fisher 1981; Hank et al. 1993; Hwu and Chang 19891 and other profile-based optimizations [Chang and Hwu 1992; Chang et al. 1991].
Wide-issue computer architectures rely on predictable control flow, and failure to correctly predict a branch results in delays for fetching and decoding the instructions along the incorrect path of execution. The penalty for a mispredicted branch may be several cycles long. For example, the mispredict penalty is four cycles on the Digital Alpha AXP 21064 processor and five cycles in the Alpha AXP 21164 processor. In previous studies, we found that conditional branches in C programs were executed approximately every eight instructions on the Alpha architecture [Calder et al. 1994]. Current wide-issue architectures can execute four or more instructions per cycle. As a result, such architectures are likely to execute branch instructions every two cycles or less, and effective branch prediction on such architectures is extremely important. Many approaches have been taken toward branch prediction, some of which involve hardware [Calder and Grunwald 1994a; Yeh and Patt 1993] while others involve software [Ball and Larus 1993; Calder and Grunwald 1994b; Fisher and Freudenberger 1992]. Software methods usually work in tandem with hardware methods. For example, some architectures have a "likely" bit that can be set by a compiler if a branch is determined to be likely taken by a compiler.
Compilers typically rely on two general approaches for estimating the behavior of branches at compile time: profile-based and program-based branch prediction. Profile-based methods use program profiles to determine the frequency that branch paths are executed. Fisher and Freudenberger  showed that profile-based branch prediction can be extremely successful at predicting the future behavior of branches. The main drawback of profile-based methods is that additional work is required on the part of the programmer to generate the program profiles. Program-based branch prediction methods attempt to predict branch behavior in the absence of profile information and are based only on a program's structure. Some of these techniques use heuristics based on local knowledge that can be encoded in the architecture [McFarling and Hennessy 1986]. Other techniques rely on applying heuristics based on less local program structure in an effort to predict branch behavior [Ball and Larus 1993]. Hank et al.  showed that these program-based heuristics can be used to accurately guide profile-based compiler optimization achieving performance improvements close to what is achieved if real profiles were used.
In this article, we describe a new approach to program-based branch prediction that does not rely on such heuristics. Our branch prediction relies on a general program-based prediction framework that we call ESP. The main idea of ESP is that the behavior of a corpus of programs can be used to infer the behavior of new programs. That is, instead of using a different execution of a program to predict its own behavior (as is done with profile-based methods), we use the behavior of a large body of different programs (the training set, or corpus) to identify and infer common behavior. Then we use this knowledge to predict branches for programs that were not included in the training set. Thus, our technique has two phases. The first phase, performed once (e.g., when the compiler is constructed) involves profiling the corpus of programs. The second phase, which is entirely program based but which uses the results of the first phase, predicts the branches in a new program. In this article we use neural networks and decision trees to map static features associated with each branch to a prediction that the branch will be taken.
Branch prediction using ESP has several important advantages over existing program-based branch prediction methods. First, because the technique generates predictions automatically, the predictions can be specialized based on specific languages, compilers, and computer architectures. Existing techniques rely on heuristics defined by compiler writers that are based on intuition and empirical studies (e.g., Smith ) about common programming idioms. Second, given a large amount of static information about each branch, the technique automatically determines what parts of that information are useful. Thus, it does not rely on trial and error on the part of the compiler writer searching for good heuristics. Finally, our results show that ESP branch prediction outperforms existing heuristic program-based branch prediction techniques over a body of 43 C and Fortran programs. In particular, our heuristics have an average overall miss rate of 20%, which compares to the 25% miss rate of the best existing heuristic technique and the 8% miss rate of the perfect static predictor.
This article has the following organization. In Section 2 we discuss previous approaches to program-based branch prediction and other knowledge-based approaches to program optimization. We discuss some problems that occur with the previous program-based techniques in Section 3. In Section 4 we discuss the details of our ESP branch prediction method. Section 5 describes the methods we used to evaluate and compare ESP prediction with previous approaches, and Section 6 presents our results. We summarize our conclusions in Section 7 and discuss possible future directions to take with this research.
B. Calder was supported by an ARPA Fellowship in High Performance Computing administered by the Institute for Advanced Computer Studies, University of Maryland. This work was funded in part by NSF grant ASC-9217394, NSF grant CCR-9404669, ARPA contract ARMY DABT63-94-C-0029, a hardware and software grant from Digital Equipment Corp., and an equipment gift from Hewlett-Packard.
In this section, we discuss the existing approaches to program-based static branch prediction and other knowledge-based approaches to compiler optimization. Most of the previous work in program-based branch prediction uses the final program binary to determine the direction of branches; following those decisions, the program can be reorganized [Calder and Grunwald 1994b; Pettis and Hansen 19901 to take advantage of the program-based branch prediction information.
2.1 Program-Based Branch Prediction Methods
One of the simplest program-based methods for branch prediction is called "backward-taken/forward-not-taken" (BTFNT). This technique relies on the heuristic that backward branches are usually loop branches, and as such are likely to be taken. One of the main advantages of this technique is that it relies solely on the sign bit of the branch displacement, which is already encoded in the instruction. Our results in Section 6 show it has an overall miss rate in our experiments of 34%. While simple, BTFNT is also quite successful, since many programs spend a lot of time executing inside of loops, and the backward branch in a loop is correctly predicted as taken when using the BTFNT heuristic.
To facilitate program-based methods for branch prediction, some modern architectures provide a "branch-likely" bit in each branch instruction [Alverson et al. 1990]. In these architectures, compilers can employ either profile-based [Fisher and Freudenberger 19921 or program-based techniques to predict what branches are likely to be taken. In recent work, Ball and Larus  showed that applying a number of simple program-based heuristics can significantly improve the branch prediction miss rate over BTFNT on tests based on the conditional branch operation. A complete summary of the Ball and Larus heuristics is given in Table I (as described by Wu and Larus ). Their heuristics use information about the branch opcode, operands, and characteristics of the branch successor blocks, and the heuristics encode knowledge about common programming idioms.
Table I. Summary of the Ball/Larus Heuristics Heuristic Name Heuristic Description Loop Branch Predicting that the edge back to the loop's head is taken and that the edge exiting the loop is not taken. Pointer If a branch compares a pointer against null or compares two pointers, predict the branch on false condition as taken. OpCode If a branch checks an integer for less than zero, less than or equal to zero, or equal to a constant, predict the branch on false condition. Guard If a register is an operand of the branch comparison, the register is used before being defined in a successor block, and the successor block does not postdominate the branch, predict the successor, block as taken. Loop Exit If a comparison is inside a loop, and no successor is a loop head, predict an edge exiting the loop as not taken. Loop Header Predict the successor that does not postdominate and is a loop header or a loop preheader as taken. Call Predict the successor that contains a call and does not postdominate the branch as not taken. Store Predict the successor that contains a store instruction and does not postdominate the branch as not taken. Return Predict the successor that contains a return as not taken.
Two questions arise when employing an approach like that taken by Ball and Larus. First, an important question is "which heuristics should be used?" In their paper, they describe seven heuristics that they considered successful, but also noted that "We tried many heuristics that were unsuccessful" [Ball and Larus 1993]. A second issue that arises with heuristic methods is "how to decide what to do when more than one heuristic applies to a given branch." This problem has existed in the artificial intelligence community for many years and is commonly known as the "evidence combination" problem. Ball and Larus considered this problem in their paper and decided that the heuristics should be applied in a fixed order; thus the first heuristic that applied to a particular branch was used to determine what direction it would take. They determined the "best" fixed order by conducting an experiment in which all possible orders were considered. We call using this predetermined order for heuristic combination A Priori Heuristic Combination (APHC). Using the APHC method, Ball and Larus report an average overall miss rate on the MIPS architecture of 20%; when their technique is applied to the DEC Alpha architecture, the prediction accuracy worsens, becoming 25%.
In a related paper, Wu and Larus  refined the APHC method of Ball and Larus. In that paper, their goal was to determine branch probabilities instead of simple branch prediction. With branch prediction, the goal is to determine a single bit of information per branch (likely versus unlikely). With branch probabilities, the goal is to determine the numeric probability that a branch is taken or not taken. Wu and Larus abandoned the simplistic evidence combination function of APHC in favor of an evidence combination function borrowed from Dempster-Shafer theory [Dempster 1968; Shafer 1976]. We call this form of evidence combination Dempster-Shafer Heuristic Combination (DSHC). By making some fairly strong assumptions concerning the independence of different attributes, the Dempster-Shafer evidence combination function can produce an estimate of the branch probability from any number of sources of evidence. For example, if one heuristic indicates that a branch is likely to be taken with probability X%, while another says it is likely to be taken with probability Y%, then DSHC allows these two probabilities to be combined.
Unless it is shown that the data are truly independent, and that other restrictions described by Dempster  and Shafer  are observed, the Dempster-Shafer mechanism may not provide any useful information. The probabilities that Wu and Larus  use are taken directly from Ball and Larus . We refer to a DSHC algorithm based on this data as DSHC(B&L). Because the goal of Wu and Larus was to perform program-based profile estimation, they give no results about how the DSHC method works for program-based branch prediction. One of the contributions of our article is that we quantify the effectiveness of the DSHC method for branch prediction. As we show later, the DSHC method provides worse prediction than the simple APHC method.
Wagner et al.  also used heuristics similar to those of Ball and Larus to perform program-based profile estimation. They also applied the heuristics in a fixed order. They report branch prediction miss rate results similar to those of Ball and Larus.
2.2 Knowledge-Based Approaches to Optimization
Our ESP method relies on collecting data from a corpus of program behavior and using that data to perform program-based prediction. There is little other work in compiler optimization that has taken this approach. We summarize the work we are aware of here.
Balasundaram et al.  address a somewhat different program-based estimation problem. They wanted to make compile-time decisions about data partitioning across a parallel computer, and they report on the idea of using profile data to "train" an estimator. This training, an off-line step, generates code which is then incorporated into their compiler. Training only needs to be done once per compilation target and is reported to be better than using a parameterized theoretical model. While the strategy they employ is similar to ESP, their application domain is quite different. In addition, our results show that this general approach of knowledge-based "training" can be used to enhance a wide class of optimizations based on program behavior estimation.
3. PROBLEMS WITH A PRIORI PROGRAM ESTIMATION
Ball and Larus  describe a number of prediction heuristics. These heuristics were the foundation for the prediction scheme in both the study by Ball and Larus and the study by Wu and Larus. In Wu and Larus  the values given in Ball and Larus  were used for the Dempster-Shafer combination method, even though the study by Wu and Larus used a different architecture, compiler, and run-time system. In this section, we show that the a priori heuristics described by Ball and Larus are sensitive to differences in architecture, compiler, run-time system, and selection of programs. This sensitivity implies that it is important to develop automatic techniques for branch estimation and an understanding of the sensitivity of those automatic techniques.
We use the CFG, dominator, postdominator, and loop information to implement the same heuristics used by Ball and Larus, summarized in Table I. Our implementation results for these heuristics on the Alpha architecture are shown in Table II. The first column lists the miss rate for loop branches. The second column shows the percentage of nonloop branches. The third column shows the dynamic percentage of nonloop branches that can be predicted using one of the heuristics, while the fourth column shows the miss rate achieved when using those heuristics. For
[TABULAR DATA II NOT REPRODUCIBLE IN ASCII] example, 80% of the nonloop branches in compress can be predicted using some heuristic, and those heuristics have a 38% miss rate. Branches that cannot be predicted using the heuristics are predicted using a uniform random distribution. The fifth column shows the prediction miss rate for the execution of all nonloop branches, combining the predictions from the heuristics and the random predictions. Last, the sixth column lists the misprediction rate when both loop and nonloop branches are included.
This table shows detailed information about how the branch heuristics performed for each program. Some of the programs in our suite were also used in the earlier study by Ball and Larus, and the values in parentheses show the equivalent metrics recorded in that study. In general, the values are quite similar, but there are some small differences that we believe arise from different run-time libraries. For example, a binary-buddy memory allocator would not contain any loops, while a coalescing implementation may contain several loops. These library routines are part of the native operating system and are not part of the distributed benchmark suite. Note that there are considerable differences in the percentage of nonloop branches, particularly in eqntott. Some of these differences are caused by libraries and run-time systems, but others can be attributed to architectural features. For example, the Alpha has a "conditional move" that avoids the need for many short conditional branches, reducing the number of conditional branches that are executed. Table II further demonstrates that our implementation of the heuristics listed in Ball and Larus  appear to be correctly implemented. The loop miss rates are roughly the same; the heuristics cover approximately the same percentage of branches; and the overall branch prediction miss rates are similar.
Table III shows the comparison of the overall averages for the heuristics comparing the Ball and Larus results on the MIPS architecture to our results on the Alpha. We felt that the differences seen in Table III were to be expected, because the two studies used a different collection of programs with different compilers that implement different optimizations for different architectures and used different run-time libraries. Table III supports our position that at least some of Ball and Larus' heuristics are quite language dependent. First, we point out that pointers are very rare in Fortran, and as such the great success of the Pointer heuristic in Fortran is of little consequence because it applies to very few branches. Next, we see that while the Store heuristic appears successful in our Fortran programs, it performs much worse in our C programs. Conversely, the Loop Header heuristic performs well in C programs, but poorly in Fortran programs.
Table III. Comparison of Branch Miss Rates for Prediction Heuristics
Branch Prediction Miss Rates Our Implementation (ALPHA) Heuristic B&L (MIPS) C Fortran Overall Loop Branch 12% 17% 12% 15% Pointer 40% 58% 1% 55% Call 22% 23% 44% 31% Opcode 16% 33% 29% 32% Loop Exit 20% 28% 30% 29% Return 28% 29% 30% 30% Store 45% 52% 30% 42% Loop Header 25% 33% 48%, 30% Guard 38% 34% 31% 33%
These averages are for all the programs we simulated, and a program is only included in a heuristics's average if the heuristic applies to at least 1% of the dynamic branches in the program.
3.1 The Influence of Architectures
In certain cases, we had slightly different implementations of heuristics than Ball and Larus because the Alpha architecture did not allow us to implement the heuristics as originally stated. For example, consider implementing the Opcode heuristic. The Alpha architecture has two types of branch instructions; one compares floating-point numbers to zero, and the other compares integer numbers to zero. The conditional branch instructions always compare a register to zero. On the MIPS architecture, the instructions "branch if equal" (BEQ) and "branch if not-equal" (BNE) compare two registers. To accomplish the same task on the Alpha, an earlier comparison must be made between the two registers, and the resulting value is then compared to zero.
Our implementation of the heuristics took these factors into account, constructing an abstract syntax tree from the program binary and using that to determine the outcome of the conditional branch. Clearly, determining this information at compile time would simplify the analysis, since more program information would be available. Both Ball and Larus  and our study used binary instrumentation, so we felt that other factors must also contribute to the prediction differences. We examined one program for which the Ball and Larus heuristics provided good prediction accuracy, tomcatv, in more detail, since our implementation of those heuristics provided worse prediction accuracy (see Table II). On the Alpha, tomcatv spends 99% of its execution in one procedure. Furthermore, most of the basic-block transitions in that procedure involve three basic blocks, shown in Figure 1. The edge from block 32 [right arrow] 28 is a loop back edge, and our implementations identify and predict this correctly. Of the remaining three conditional branches, both of the two important ones (labeled 28 and 30 in the figure) only match the "guard" heuristic in the heuristics described by Ball and Larus (see Table I). However, their study indicated that tomcatv benefited from the "store" heuristic, which predicts that basic blocks with store instructions following a conditional branch that do not postdominate the branch are not taken. By comparison, on the Alpha, none of the successors of block 28 (blocks 29 and 30) or block 30 (blocks 31 and 32) contain store instructions. This difference may be attributed to different register-scheduling or register-saving conventions. requiring a store on the MIPS, but not on the Alpha. The "guard" heuristic still applies, but predicts both branches in blocks 28 and 30 incorrectly.
[FIGURE 1, ILLUSTRATION OMITTED]
3.2 The Influence of Compilers and Optimizations
To further validate our belief that the choice of compilers influences the prediction accuracy of the various heuristics, we compiled one program, espresso, with the following compilers: cc on OSF/1 Vl.2, cc on OSF/1 V2.0, the DEC GEM C compiler, and the Gnu C compiler. The results are shown in Table IV. The last column of the table is the miss rate for perfect static profile prediction. In terms of the overall miss rate, the compilers all show different behavior. The DEC GEM C compiler produced significantly fewer loop branches and resulted in a program approximately 15% faster than the other compilers. The GEM compiler unrolled one loop in the main routine, inserting more forward branches and reducing the dynamic frequency of loop edges.
Table IV. Comparison of Accuracy of Prediction Heuristics Using Different Compilers
Loop Branches Nonloop Branches % Nonloop Heuristic Branches Miss Program O/S Compiler Miss Rate (Dynamic) Rate Espresso 1.2 cc 17 45 26 Espresso 2.0 cc 18 46 27 Espresso 2.0 GEM C 25 57 26 Espresso 2.0 Gnu C 17 46 23 Overall Perfect Miss Miss Program Rate Rate Espresso 24 15 Espresso 25 15 Espresso 32 12 Espresso 22 15
This simple optimization changed the characteristics of the branches in the program and the efficacy of the APHC branch prediction technique. The difference caused by loop unrolling is significant if we want to use branch probabilities after traditional optimizations have been applied. However, many programmers unroll loops "by hand," and other programmers use source-to-source restructuring tools, such as KAP [Hudson et al. 1986] or VAST [Arnold 1982]. The differences evinced by these applications may render the fixed ordering of heuristics ineffective for some programs.
4. EVIDENCE-BASED STATIC BRANCH PREDICTION
In this section, we propose a general framework for program-based prediction. Our method, ESP, is generally described as follows. A body of programs and program input is gathered (the corpus). Particular static information (the static feature set) about important static elements of the corpus (e.g., instructions) are recorded. The programs in the corpus are executed, and the corresponding dynamic behavior is associated with each static element (e.g., the number of times a branch is taken and not taken is associated with each branch). At this point, we have accumulated a body of knowledge about the relationship between static program elements and dynamic behavior. This body of knowledge can then be used at a later time to predict the behavior of instructions with similar static features for programs not in the corpus.
With this broad definition of our framework in mind, we now describe how we apply this general framework to the specific problem of static branch prediction. We first describe the features we extract from the test programs for branch prediction. We then describe two machine-learning systems (based on neural networks and decision trees) that use the data to make decisions on branches seen in other programs.
4.1 ESP Branch Prediction
In applying ESP to the problem of branch prediction, we instantiate the above framework in the following way. For each branch instruction in the program text, we record a large static feature set (see Table V). Some of the features are properties of the branch instruction itself (e.g., the branch opcode); others are properties of the registers used to define the register in the branch instruction (e.g., the opcode of the instructions that defined them); and others are properties of the procedure that the branch is in (leaf versus nonleaf). Despite being a RISC processor, the Alpha instruction set contains a wealth of instructions. Some machine-learning algorithms are confused by too many choices when fitting empirical data, and this results in no apparent pattern emerging from the data. For example, the instructions subq and sub1 both perform a subtraction, but the first uses quadwords and the latter long-words. It may be that the important aspect of those instructions is that a subtraction occurs. We aggregated certain instructions into an "operand function" that captures that information. Likewise, the important aspect may be that a quadword is used, rather than the specific arithmetic function that is applied. The "operand type" field encodes the type of the operand. A conditional branch may arise from a complex instruction sequence used to produce the left and right-hand side of the comparisons. We constructed and encoded an abstract syntax tree representing those comparisons as the "RA" and "RB" features. Although not precisely correct, it is useful to think of this as encoding comparisons of the form "if (([RA.sub.1] [op.sub.1] [RA.sub.2])[op.sub.2]([RB.sub.1][op.sub.3][RB.sub.2]))..." -- the abstract syntax tree includes information establishing the "context" of the conditional branch.
The majority of the features are related to the instructions following the branch (features 15-30). It is easiest to understand those features by considering the control flow graph in Figure 2. In that figure, the vertical ordering of the nodes indicates their order in the program binary. The conditional branch is followed by the "successor not taken" -- the node in the CFG that is the "fall through" of the conditional branch. Likewise, the "successor taken" is the node representing the target of the branch. The feature "Successor not taken ends" encodes the branch type of the instruction ending the "successor not taken" node. The existence of some features is dependent on the values of other features. For example, feature 4 is only meaningful if feature 3 has an RA operand. We call such features dependent static features.
[FIGURE 2, ILLUSTRATION OMITTED]
Table V. Static Feature Set Used in the ESP Branch Prediction Study
Feature Number Feature Name 1 Branch Instruction 2 Branch Direction 3 Branch Operand Opcode 4 Branch Operand Function 5 Branch Operand Type 6 RA Opcode 7 RA Function 8 RA Type 9 RB Opcode 10 RB Function 11 RB Type 12 Loop Header 13 Language 14 Procedure Type 15-22 Features of the Taken Successor of the Branch 15 Branch Dominates 16 Branch Postdominates 17 Successor Ends 18 Successor Loop 19 Successor Backedge 20 Successor Exit 21 Successor UseDef 22 Successor Call 23-30 Features of the Not Taken Successor of the Branch Same as for features 15-22 above Feature Number Feature Description 1 The opcode of branch instruction 2 F -- Forward branch, B -- Backward branch 3 The opcode of the instruction that defines the register used in the branch instruction (or ?, if the branch operand is defined in a previous basic block) 4 The branch function (see text) 5 The operation type (see text) 6 If the instruction in (3) uses an RA register, this is the opcode of the instruction that defines that register (? otherwise) 7 The RA function (see text) 8 The RA type (see text) 9 If the instruction in (3) uses an RB register, this is the opcode of the instruction that defines that register (? otherwise) 10 (As "RA Function" above, but for RB register) 11 (As "RA Type" above, but for RB register) 12 LH -- the basic block is a loop header, NLH -- not a loop header 13 The language of the procedure is in C or Fortran 14 The branch's procedure is a Leaf, NonLeaf, or calls itself recursively (CallSelf) 15-22 Features of the Taken Successor of the Branch 15 D -- basic block dominates this successor, or ND -- does not dominate 16 PD -- the successor basic block postdominates the basic block with the branch, or NPD-does not postdominate 17 Branch type ending successor basic block, possible values (Ft -- fall through, CBR -- conditional branch, UBR -- unconditional branch, BSR -- branch subroutines, JUMP -- jump, IJUMP -- indirect jump, JSR -- jump subroutines, IJSR -- indirect jump subroutine, RETURN, or NOTHING) 18 LH -- the successor basic block is a loop header or unconditionally passes control to a basic block which is a loop header, NLH -- not a loop header 19 LB -- the edge getting to the successor is a loop backedge, NLB -- not a loop exit backedge 20 LE -- the edge getting to the successor is a loop exit edge, NLE -- not a loop exit edge 21 UBD -- the successor basic block has a use of a register before defining it and that register was used to determine the destination of the current conditional branch instruction, NU -- no use before def in successor 22 PC -- the successor basic block contains a procedure call or unconditionally passes control to a basic block with a procedure call, NPC -- no procedure call down here 23-30 Features of the Not Taken Successor of the Branch Same as for features 15-22 above
We chose the feature set shown in Table V, based on several criteria. First, we encoded information that we believed would likely be predictive of behavior, even though we conducted no initial experimentation to confirm our intuition. This information included some of the information used to define the Ball/Larus heuristics (e.g., information about whether a call appears in a successor of the branch). Second, we encoded other information that was easily available. For example, since the opcodes that define the branch instruction register are readily available, we include them as well. Similarly, information about the procedure type is readily available. Later in the article, we examine the predictive power of individual features in the feature set and combinations of a subset of those features. We will show that the machine-learning algorithms are able to effectively select and combine the useful features from the set we have chosen. We note that the feature set listed here is the only one we have tried. It may be possible that extending the number of available features would improve the resulting branch prediction.
Having defined the static feature set, we then determine the static feature set for each branch in the corpus of programs. We next run the programs in the corpus and collect information about how often each branch is taken and not taken. The goal is to associate two pieces of dynamic information with each branch instruction: how frequently the branch was executed and how often it was taken. Because execution frequency is program dependent, we normalize the branch frequency by the total number of branches executed in the program. We compute the normalized branch weight by dividing how many times the branch was executed by the total number of branches executed by the program (resulting in a number between zero and one). Finally, we associate the static feature set, the normalized branch weight, and the branch probability (percentage of the time the branch was taken) with each branch instruction in the corpus.
4.2 Prediction Using Neural Nets
Our goal is to have a system that can effectively predict that a branch will be taken based on its static feature set. This system should accurately predict not just for the programs in the corpus, but also for previously unseen programs.
One way of doing such prediction is via a feedforward neural network [Smolensky et al. 19961. A feedforward neural network maps a numerical input vector to a numerical output. Here, the input vector consists of the feature values in the static feature set, and the output is a scalar indicating whether the branch will be taken.
Figure 3 depicts the branch prediction neural network. A neural network is composed of processing units, depicted in the figure by circles. Each processing unit conveys a scalar value known as its activity. The activity pattern over the bottom row of units is the input to the network. The activity of the top unit is the output of the network. Activity in the network flows from input to output, through a layer of intermediate or hidden units. via weighted connections. These connections are depicted in the figure by links with arrows indicating the direction of activity flow.
[FIGURE 3, ILLUSTRATION OMITTED]
This is a standard neural network architecture. We also use a fairly standard neural network dynamics in which the activity of hidden unit i, denoted [h.sub.i], is computed as
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
where [x.sub.j] is the activity of input unit j; [w.sub.ij] is the connection weight from input unit j to hidden unit 1; [b.sub.i] is a bias weight associated with the unit; and tanh is the hyperbolic tangent function
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
Similarly, the output unit activity, denoted y, is computed from the hidden-unit activities:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
where [v.sub.i] is the connection weight from hidden unit i to the output unit, and a is a bias weight associated with the output unit. The tanh function is normalized to achieve an activity range of [0, 1] for the output unit.
The input-output behavior of the neural network is determined by its free parameters: the weights w and v and biases b and a. These parameters are set by an algorithm known as back propagation [Rumelhart et al. 1986]. This is a gradient descent procedure for adjusting the parameters such that performance of the network on a training corpus is optimized. The standard measure of performance is the sum of squared errors,
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
where k is an index over examples in the training corpus; [y.sub.k] is the actual output of the network when training input k is presented; and [t.sub.k] is the target output-the output indicated for that example in the training corpus.
In this application, however, we have a different criterion for good performance. We want to minimize two sorts of errors: missed branches (MB) and branches incorrectly taken (BIT). MBs occur when the predictor says that the branch will be taken with probability less than 0.5 when the branch is in reality taken; BITs occur when the predictor says that the branch will be taken with probability greater than 0.5 when the branch is in reality not taken. If the network output for example k is binary -- 1 if the predicate "the branch probability is greater than 0.5" is believed to be true, 0 otherwise -- then the relative number of errors due to MB for example k is
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
where [n.sub.k] is the normalized branch weight (e.g., the proportion of branches with that particular feature set in the corpus). The product [t.sub.k] [n.sub.k] gives the relative number of cases where the branch is taken. All of these branches are missed if [y.sub.k] = 0 (or equivalently, 1 - [Y.sub.k] = 1). Similarly, the relative number of errors due to BIT is
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
Because these two types of errors have equal cost, the overall error is simply
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
This is used as the error measure to be minimized by the neural net training procedure. That is, the free parameters in the neural net are adjusted such that the network will produce outputs [y.sub.k] such that E is minimized. Note that this does not require that the network accurately predict branch probabilities per se, as we were assuming previously.(1)
Each input unit's activity is normalized over the training set to have 0 mean and standard deviation 1. The same normalization is applied for test cases. We deal with nonmeaningful dependent static features by setting their input activity to 0 after the normalization step; this prevents the nonmeaningful features from having any effect on the computation and is equivalent to gating the flow of activity from these features by another feature that indicates the relevance of the dependent features for a particular example. We use a "batch" training procedure in which weights are updated following a sweep through the entire training corpus, and we use an adaptive learning rate procedure wherein the learning rate for the network is increased if error drops regularly or is decreased otherwise. Momentum is not used. Training of the network continues until the threshold error of the net no longer decreases. By thresholded error, we mean the error computed when the output is first thresholded to values 0 or 1. This achieves a form of early stopping and thereby helps to prevent overfitting.
4.3 Prediction Using Decision Trees
As described in the previous section, the branch prediction task can be formulated as a probability prediction task. Given a branch characterized as a feature vector, we would like to predict the percentage of times such a branch is likely to be taken. It is, however, also possible to cast the branch prediction problem as a classification problem; given a branch characterized as a feature vector, we would like to assign that branch to either the class "branch" (taken) or "no branch" (not taken) based on the classification of similar branches in the training corpus. Casting the problem in these terms permits the use of some simple, yet effective, techniques to build classifiers from training data. Decision tree induction systems represent one such approach.
Decision trees consist of internal test nodes that examine single features of objects, branches emanating from these internal nodes that correspond to each possible result of the test, and leaf nodes that denote object classifications. Operationally, decision trees are used to classify objects by first performing the test specified at the root node and then following the branch indicated by the result of the test down to the next level of the tree. The result of the tests at each successive level determines an object's unique path to a leaf node. When a leaf node is reached, the classification associated with that leaf is assigned to the object.
Note that such decision trees are directly interpretable by experts in the domain of interest, since they are expressed in terms of the features used to describe the objects. Indeed, each path in the tree from the root to a leaf can be thought of as a logical rule where a conjunction of the tests forms the antecedent and where the classification specified at the leaf is the consequent. This direct interpretability was one of our primary motivations for investigating the use of decision trees.
Fortunately, there are effective and efficient algorithms for learning decision trees directly from a corpus of data represented as feature vectors with assigned categories. (See Quinlan  for a survey of such methods.) There are two key notions underlying all of these algorithms. The first is the notion that any diverse collection of objects can be assigned a value based on the heterogeneity of the collection. The second is that a given collection can be partitioned into two or more subpartitions based on the results of testing a single feature of all the objects in the collection.
Taken together, these two notions suggest a greedy divide-and-conquer approach to tree construction given a training corpus. The algorithm first assigns as the root node the feature test that results in subpartitions that are maximally homogeneous -- in other words, the single feature that best creates a set of partitions that groups objects with the same category together and separates objects with differing classifications. Since the resulting partitions are still likely to be diverse, it next recursively builds subtrees for each partition created by the application of a test to an earlier partition. This recursion typically halts when all the partitions at the leaves of the tree have uniform membership, when there are no more features left to be tested, or when the size of the subpartition is so small that it is unlikely that any further splitting would be reliable.
As with most data-intensive machine-learning methods, it is important to avoid having the system memorize, or overfit, the training data. Two approaches have traditionally been taken to address this problem: the system is only allowed to view a fraction of the available training data with the remainder being held back as a test set, and the tree resulting from the training is "pruned" back from the leaves to avoid idiosyncratic overfitting of the data.
The decision tree induction experiments reported here were performed using the well-studied, and widely distributed, C4.5 system [Quinlan 1993]. The results reported here were achieved using C4.5's default training settings, along with its standard tree-pruning mechanism.
To be more specific, C4.5 requires a training set made up of individual feature vectors labeled with correct classifications and outputs a decision tree capable of classifying such vectors. In our experiments, the first step taken was to assign each feature vector to the category "branch" or "nobranch" based on whether its recorded branch probability was greater than or less than 0.5. The next task was to capture the fact that the branches contained in the training set displayed widely varying frequencies during the original data gathering. To reflect these differences, copies of each labeled branch were created in sufficient numbers to reflect the proportion of dynamic branches that each branch was responsible for. Due to practical limitations, we limited this copying to 1000 times the normalized frequency. This has the effect of eliminating a number of low-frequency branches from the training set while still allowing C4.5 to focus on the important frequent branches.
4.4 Discussion and Summary
The ESP method automatically extracts the important features that can be used to predict the direction that branches will take. Because the method is automatic, it can be used to specialize predictions based on specific programming languages, compilers, run-time systems, architectures, and application domains. Thus, ESP has advantages over existing methods based on heuristics. In particular, the effectiveness of heuristics may depend on the context in which they are developed and evaluated, such as the programming language or architecture (as we have shown), and given a set of heuristics, there is no clear insight that guides how they can be combined effectively.
The ESP method has disadvantages, as well. First, a corpus of programs must be available. For our results in Section 6, we initially had only eight C programs to examine. On average, ESP prediction results for these eight programs were the same as the APHC and DSHC results. After we increased the corpus of eight C programs to 23 C programs, the average misprediction rate for ESP was 5% lower than the average miss rates for the APHC and DSHC techniques. Second, our approach requires that the feature set be defined. Our results in Section 6.3.2 show that as more features are used the miss rate improves, although not monotonically. Finally, if the neural net ESP approach is to be used, it requires someone who understands neural nets fairly well, probably at the level of a person who has taken a course in neural nets. We envision that if this approach becomes sufficiently widespread, then tools that facilitate such training would be made available. In contrast, decision trees are easier to use for ESP, and the knowledge they encode can be automatically translated into relatively transparent if-then rules. The results we have obtained using decision trees are comparable to the neural net results. The C4.5 system was able to build trees with performance on average 1% worse than the corresponding neural net. Therefore, a decision tree approach may be more practical for most users of ESP.
5. EVALUATION METHODS
To perform our evaluation, we collected information from 43 C and Fortran programs. During our study, we instrumented the programs from the SPEC92 benchmark suite and other programs, including many from the Perfect Club [Berry 1989] suite. We used ATOM [Srivastava and Eustace 1994] to instrument the programs. Due to the structure of ATOM, we did not need to record traces and could trace very long-running programs. The programs were compiled on a DEC 3000-400 using the Alpha AXP-21064 processor using either the DEC C or Fortran compilers. We used versions of the SPEC benchmark suite compiled on the OSF/1 V1.2 operating system using the same compiler options used to report the official SPEC benchmark suite. During the period we conducted this study, we migrated from the OSF/1 V1.2 operating system to the OSF/1 V2.0 system, and most of the non-SPEC programs were compiled on that system. Unless otherwise noted, the programs were compiled using the normal C compiler, which was a derivative of the MIPS C compiler, or the DEC Fortran compiler, which uses a different code generator. All programs were compiled with standard optimization (-[Omicron]). Each program was run once to collect information about branch frequency and the percentage of "taken" branches. For the SPEC92 programs, we used the largest input distributed with the SPEC92 suite.
Table VI shows the basic statistics for the programs we instrumented. The first column lists the number of instructions traced, and the second column gives the percentage of instructions that are conditional branches. The third column gives the percentage of conditional branches that are taken. The Q-50, Q-75, Q-90, Q-95, Q-99, and Q-100 columns show the number of branch instruction sites that contribute 50, 75, 90, 95, 99, and 100% of all the executed conditional branches in the program. The column "Static" shows the total number of conditional branch sites in each program. Thus, in alvinn, two branch instructions constitute over 90% of all executed branches, and correctly predicting these two conditional branches is very important. The last column contains the percentage represented by Q-100/static, which indicates the fraction of total static branches exercised by the particular input data used.
[TABULAR DATA VI NOT REPRODUCIBLE IN ASCII]
The ATOM instrumentation tool has an interface that allows the elements of the program executable (including all libraries), such as instructions, basic blocks, and procedures, to be queried and manipulated. In particular, ATOM allows an "instrumentation" program to navigate through the basic blocks of a program executable and then collect information about registers used, opcodes, branch conditions, control flow, etc. During this instrumentation phase, our tools collect all the static feature information described in Table V. By gathering information about the targets of branches, we used this information to construct a control flow graph. Using the control flow graph, we computed the dominator and postdominator trees. Following this, we determined the natural loop headers and applied the same definition of natural loops used by Ball and Larus  to determine the loop bodies.
As the instrumentation program runs, it can also add procedure calls to the executable that allow information about the executing program to be collected. Thus, we also use ATOM to determine the execution frequency and branch probability for each branch in the corpus for the purpose of training and, in the test programs, for the purpose of determining the miss rate of the methods.
For ESP, we did not use the information gathered about a given program to predict the branches for that same program; rather we used a cross-validation study. We took all of the programs, except the one program for which we want to gather prediction results, and fed the corpus of programs into the neural net and decision tree. We then used the neural net's and decision tree's branch classifications to predict branches for that program not included in the corpus. This provides a conservative estimate of how well ESP will perform, since we are predicting the behavior of a program that the neural net and decision tree have not seen. For the ESP results shown in Section 6, we performed the cross-validation study breaking the programs into two groups-C programs and Fortran programs. We performed cross-validation feeding the feature sets for 22 of the C programs at a time into the neural net and decision tree, predicting branches for the 23rd C program not included in the initial 22. We did the same for Fortran programs feeding into the neural net and decision tree the feature sets for 19 of the 20 programs in order to predict branches for the 20th program.
We now compare the prediction accuracy of a priori heuristic combination (APHC) branch prediction [Ball and Larus 1993], the Dempster-Shafer heuristic combination (DSHC) proposed by Wu and Larus [19941, and our ESP techniques.
6.1 Comparison: APHC, DSHC, and ESP
Table VII shows the branch misprediction rate for the methods we implemented. The first column shows the results for the BTFNT architecture. The second column shows the results for our implementation of the Ball and Larus heuristics, and the third and fourth columns show the results when applying Dempster-Shafer to those heuristics. In implementing DSHC, we use both the original prediction rates specified by Ball and Larus, DSHC(B&L), and the prediction rates produced by our implementation, DSHC(Ours). The fifth column in Table VII shows the results for ESP using neural nets, and the sixth column shows results for ESP using decision trees. The last column shows the results for the perfect static profile prediction. By perfect profile prediction, we mean the miss rate achieved when the same input is used to both profile a program, in order to predict the program's branches, and used when recording the miss rate. Table VII reveals several interesting points. First, the overall average shows that the Dempster-Shafer method performs no better than the fixed order of heuristics. Wu and Larus said
[TABULAR DATA VII NOT REPRODUCIBLE IN ASCII]
When more than one heuristic applies to a branch, combining the probabilities
estimated by the applicable heuristics should produce an overall branch probability
that is more accurate than the individual probabilities [Wu and Larus
However, there was no comparison to the earlier results of Ball and Larus. In six cases (flex, sort, mdljsp2, CSS, NAS, TFS), the Dempster-Shafer's miss rate is more than 5% higher (worse) than the simple APHC ordering, while the APHC ordering method is 5% worse in only three cases (wdiff, SDS, LWS). The intuition of Wu and Larus was correct; however, the Dempster-Shafer theory does not combine the evidence well enough to improve branch prediction. The ESP techniques perform significantly better than the Dempster-Shafer and the APHC methods for most of the programs.
The ESP DT (self) column provides insight into the upper limit of the effectiveness of the ESP approach. In this case, the same data set was used for training as well as testing. Ideally, such a method should approach the performance of the perfect static predictor, and in many cases, but not all, it does. In most cases, self-training performs significantly better than cross-validation training, suggesting that the choice of training set and its closeness to the application being predicted can have a significant effect on the performance of the method.
Table VII also allows us to compare the effectiveness of the decision tree and neural network prediction methods. In many cases the methods produce quite similar results, and this similarity is also reflected in the values of the overall average miss rates. The largest difference seen between the methods occurs in the tomcatv application, which we have already discussed in the context of the heuristic prediction methods. Tomcatv exhibits such large differences between methods because there are a very small number of important branches that must be correctly predicted for good performance. The table shows that the neural network correctly predicts all of these important branches, whereas the decision tree method misses two of them. One reason that the neural network can perform better than the decision tree is that the decision tree method uses a greedy algorithm and selects one feature at a time in determining what features to use in categorizing branches. The neural net, on the other hand, is capable of identifying multifeature combinations that could correlate highly with the branch direction.
To better understand why the decision tree method fails to predict the branches of tomcatv, we can look in detail at the different programs used to predict its behavior. In Figures 4 and 5 (discussed in more detail in Section 6.3.1), we show the miss rates that result from using one program to predict another (i.e., using one program as the training set and another as the test set). In Figure 5, in the "to" column (3rd to last) we show how effective the other Fortran programs in the corpus are at predicting tomcatv using a decision tree. It is clear from this figure that only five of the other 19 programs contain information that the decision tree method can use to predict the branches of tomcatv. This suggests that the branch behavior in tomcatv is not highly typical of the other Fortran programs we used.
[FIGURE 4, 5 GRAPH OMITTED]
6.2 Predicting Library and Main-Program Branches
The miss rates for ESP in Table VII were gathered using a cross-validation study in order to achieve a realistic estimate of ESP's performance. A cross-validation study allows this, since the program being measured is not included in the set of programs used to create the cross-validation training set. Therefore, the code for the program being examined is not profiled in the cross-validation study. This is true for the main program branches, but not necessarily true for the library code, since the same library routines may have been used by the programs included in the cross-validation study. If a majority of the library branches are executed in the cross-validation training set, and the library routines have similar behavior between different programs, then ESP may be achieving a lot of its performance improvement by more accurately predicting the library branches than main-program branches.
Tables VIII and IX show the breakdown of the average miss rates in terms of library branches and branches executed in the main program. For ESP, results are only shown for using the decision tree approach (ESP-DT). The library branches are all the branches executed by the programs for the standard Digital-Unix libraries. The libraries used by the programs we examined are libc, libm, libFutil, libUfor, libfor, libcursors, libots, and libtermcap. In Table VIII, Lib shows the miss rates for library branches for ESP-DT when only library branches are included in the cross-validation training set, and All represents when all branches are included in the cross-validation training set. Table IX shows the miss rates for the main-program branches for ESP-DT when only Main program branches are included in the training set and when All branches are included in the cross-validation training set. The reason for using two different training sets was to examine the difference in performance when excluding either the main-program branches or the library branches from the training profiles.
[TABULAR DATA VIII & IX NOT REPRODUCIBLE IN ASCII]
Tables VIII and IX show that the miss rate difference between the B&L heuristics and ESP-DT is significantly larger for the library branches. from 35% down to 23%, than for the main-program branches, from 23% down to 22%. This difference is largest for the branches occurring in Fortran library routines. This decrease in miss rate for the library branches most likely comes from the library branches having very similar behavior between different programs. In a related study [Calder et al. 19951, we found this to be the case. In that study we found that when using a cross-validation profile of library code to predict library branches for a program not included in the profile, the miss rate for the library branches was 12%, which was close to the perfect profile miss rate of 6% and significantly lower than the B&L heuristic miss rate of 47%. This indicates that these library branches had very predictable behavior between different programs, and the results in Table VIII show that ESP was also able to provide a better automatic identification of the heuristics that identify these predictable library branches.
Table X shows the overall average miss rates from Table VII. The ESP-DT results are shown for when separate (Sep) cross-validation training sets are used to predict the library and main-program branches and when one combined cross-validation training set is used to predict all (All) branches. These results and the results in Tables VIII and IX show that it is better to use the combined training including all the branches when predicting either library or main-program branches, especially for the C programs. When using the combined training set in Table X the miss rate is lowered from 22% to 21% in comparison to using separate training sets.
[TABULAR DATA X NOT REPRODUCIBLE IN ASCII]
Table VIII shows that including the main program branch feature sets into the cross-validation training set when predicting library branches reduces the miss rate from 23% down to 21%. These results indicate that it is better to include all program branches in the training sets in order to achieve the best prediction accuracy for either library or main program branches.
Even though ESP shows only a small improvement in miss rate, 23% down to 22%, over the APHC heuristics in Table IX for main-program branches, ESP is still a much more attractive automated solution for finding program-based static branch prediction for a given architecture, programming language, and run-time system over using expert-defined heuristics.
6.3 Sensitivity of Decision Tree ESP Results
We now examine how sensitive ESP is to the number of programs used in the training set for prediction and the types of features included in the feature set when using decision tree ESP prediction.
6.3.1 Effect of Size and Content of Training Set. Figures 4 and 5 show a matrix of miss rates using the ESP training set from one program to predict branches in all the other programs. The miss rate for each box in the matrix is for the corresponding program listed at the top of the matrix when the ESP training set from the program listed on the left-hand side of the matrix is used to predict the branches. For example in Figure 4, when using the ESP training set from alvinn to predict the branches in ear(ea), ear has an 8% miss rate. The miss rates with a darkened box around them are the highest miss rates for each program listed in the columns. The miss rates that are shaded are the lowest miss rates for each program listed in the columns.
Not surprisingly, both of the matrices show that the best predictor for most programs was the own program's training set. In Figure 4, the C program training sets that provided the worst prediction came from li, wdiff, and alvinn. In Figure 5, the Fortran programs that provided the worst training set for prediction are not as concentrated as in the C programs, and they include mdljsp2, SDS, fpppp, swm256, and tomcatv. What makes these programs particularly poor training sets? The answer can be seen in the "branch quantile" values in Table VI. Each of the programs that are poorly predictive tend to have a small number of branches that dominate the branch activity in the program, as indicated by the "Q95" value in Table VI. A small number of branches provides little training information for the machine-learning algorithms.
[FIGURES 4 & 5, ILLUSTRATION OMITTED]
Figure 6 shows the average miss rates for adding one program at a time into the cross-validation training set. The "Best First" order adds one program at a time to the training set, starting with the program that achieved the lowest average miss rate in Figures 4 and 5 (e.g., the C program flex and the Fortran program ora) and ending with the program that had the highest miss rate. In Figure 6, the "Worst First" order adds the programs to the cross-validation training set in the reverse order of "Best First" (i.e., adding the program with the highest average miss rate to the training set first).
[FIGURE 6, ILLUSTRATION OMITTED]
While the curves are not monotonic, the trend for the first half of all curves in Figure 6 is that the miss rate of the training set decreases as the size of the training set increases. Interesting, for the "Best First" lines, beyond the first half, the miss rate stays relatively constant, and in the case of Fortran it actually increases. This effect is not as apparent in the "Worst First" curves. Based on the previous figures, it is clear that some of the programs in the corpus are quite atypical. Atypical programs in the corpus would have a negative effect on the ESP method, especially if they formed a majority of the programs in the corpus. The results in Figure 6 suggest that both our C and Fortran corpuses were sufficiently large and diverse that the atypical programs did not have large negative effect on their predictive ability. But the figure also indicates that care must be taken that the programs in the corpus are "typical" of the programs being predicted, an admittedly difficult quality to measure.
6.3.2 Effect of Size and Content of Feature Set. Table XI shows the miss rates for each of the features described in Table V in sorted order of the feature's performance for both the C and Fortran programs. One interesting result is that the best feature for predicting the Fortran programs, Branch Instruction, is the worst predictor for the C programs. The features with the best prediction between the Fortran and C programs are NotTaken Successor Exit, Taken Successor Exit, and Taken Successor Ends. The first two features correspond to the loop-exiting heuristic which predicts that conditional branches with edges exiting a loop will not have that edge taken.
[TABULAR DATA XI NOT REPRODUCIBLE IN ASCII]
The best single predictor for the C programs is Taken Successor Ends. This feature encodes the branch type of the instruction that ends the "taken successor" block (see Figure 2 for an explanation). Basic blocks can be terminated by a number of instructions, the most frequent of which is another conditional branch, which occurs 65% of the time. We examined the decision tree formed by the Taken Successor Ends heuristic in more detail; while the branch prediction contribution of the conditional branches was largest, the seven other branch types contributed significantly to the overall branch prediction rate -- no one feature stood out as particularly more important than the others. Simply ignoring the extra information, as a human expert may be tempted to do, would have increased the miss rate by a few percent. Using an automated expert system, we were able to benefit from the additional information with little effort.
Table XII gives the top 10 double, triple, and quadruple feature combinations, which provided the lowest miss rates. The feature numbers listed in the table correspond to the feature numbers in Table XI. The table shows that combining the top 4 features in Table XI does not necessarily achieve the lowest miss rate. For example, three of the features in the best quadruple feature combination (01,03,28,25) for the C programs have among the worst miss rate for individual feature performance in Table XI. It is also striking that using the best four-feature combination in Fortran can result in miss rates significantly lower than those achieved when using the entire feature set (15.6% versus 20.3%).
Figure 7 shows the miss rate for adding features one at a time into the cross-validation profile when using the decision tree ESP approach. Similar to that in Figure 6, the "Best First" order for the C and Fortran programs represents adding the features in the order shown in Table XI going from the features with the lowest miss rate to the highest miss rate. The "Worst First" order in Figure 7 adds the features in the reverse order of "Best First."
[FIGURES 7, GRAPH OMITTED]
For each feature there is a number, a short description of the feature, and the feature's miss rate for the C and Fortran program. Number represents the feature's number.
The figure shows a slight downward trend in the lines as more features are added, although as with Figure 6 the trends are not monotonic. One striking aspect of some of the lines is how adding a single feature (and not necessarily an important one by itself) can significantly reduce the miss rate. This is particularly true when the 13th feature (NotTaken Successor Ends, feature 25) is added in the "Best First" Fortran line. This result, in combination with the data in Table XII. strongly suggests that for Fortran, at least, a small feature set containing the most important features results in the best miss rates for the technique. Any additional features appear to have little effect and sometimes reduce the effectiveness of the technique.
[TABULAR DATA XII NOT REPRODUCIBLE IN ASCII]
Branch prediction is very important in modern computer architectures. In this article, we investigated methods for static program-based branch prediction. Such methods are important, since they can be used to estimate branch behavior at compile-time when performing compiler optimizations [Hank et al. 1993].
We proposed a new, general approach to program-based behavior estimation called evidence-based static prediction (ESP). We then showed how our general approach can be applied specifically to the problem of program-based branch prediction. The main idea of ESP is that the behavior of a corpus of programs can be used to infer the behavior of new programs. In this article, we used a neural network and decision tree to map static features associated with each branch to a prediction that the branch will be taken.
ESP has the following advantages over existing program-based branch prediction approaches. First, instead of being based on heuristics, it is based on a corpus of information about actual program behavior and structure. We have observed that the effectiveness of heuristic approaches to branch prediction can be architecture, compiler, and language dependent. Thus, ESP can be specialized easily to work with new and different programming languages, compilers, computer architectures, or run-time systems. It is our hope that it can even be customized for specific application domains or workgroups with a modest amount of effort.
Second, the ESP approach does not require careful consideration when deciding what features to include in the training data. The neural net and decision tree are capable of ignoring information that is irrelevant, and such information does not degrade the performance of the branch predictions. On the other hand, with heuristic methods, trial and error are often required to find heuristics that are effective.
Finally, we have shown that the ESP approach results in branch prediction miss rates that are better than previous program-based heuristic approaches. Over a collection of 43 C and Fortran programs, the overall miss rate of ESP branch prediction was 20%, which compares against the 25% miss rate using a fixed ordering of the Ball and Larus heuristics and the overall 8% miss rate of the perfect static-profile predictor.
We see many future directions to take with this work. Currently, we have investigated how effective the neural network and decision tree are at providing a prediction for each branch. But both methods provide additional information, and the decision tree in particular provides an estimate of the branch probability. If that probability is greater than 50% we estimate that the branch will be taken. One future goal is to incorporate this branch probability data to perform program-based profile estimation using ESP. It is simple to add more "features" into our training information. We also plan to gather large bodies of programs in other programming languages, such as C++, Java, and Scheme and evaluate how ESP branch prediction works for those languages. We are also interested in seeing how effective other classification techniques, such as memory-based reasoning, will perform for ESP prediction. Finally, we are interested in comparing the effectiveness of using ESP prediction techniques against using profile-based methods across a range of optimization problems.
In order to achieve full instruction-level parallelism on future processors, profile-based compiler optimizations need to be performed. Since not all users will take the time and effort to profile their programs, techniques that estimate program behavior at compile time, such as ESP, will become increasingly important.
(1) In the above discussion we assumed that the network output will be either 0 or 1. However, the output must be continuous valued in order to apply gradient-based training procedures. Thus, we use the continuous activation rule for v presented earlier, and we simply interpret the continuous output as the network's confidence that the true branch probability is greater than 0.5.
Alverson, R., Callahan, D., Cummings, D., Koblenz, B., Poterfield. A., and Smith, B. 1990. The Tera computer system. In the International Conference on Supercomputing. IEEE, New York. 1-6.
Arnold, C. N. 1982. Performance evaluation of three automatic vectorizer packages. In Proceedings of the 1982 International Conference on Parallel Processing. IEEE, New York, 235-242.
Balasundaram, V., Fox, G., Kennedy, K., and Kremer, U. 1991. A static performance estimator to guide data partitioning decision. In the 3rd ACM SIGPLAN Symposium on, Principles and Practice of Parallel Program ming. ACM, New York, 213-223.
Ball, T. and Larus, J. R. 1993. Branch prediction for free. In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation. ACM, New York, 300-313.
Berry, M. 1989. The Perfect Club benchmarks: Effective performance evaluation of supercomputers. Int. J. Supercomput. Appl. 3, 3 (Fall), 5-40.
Calder. B. and Grunwald, D. 1994a. Fast and accurate instruction fetch and branch prediction. In the 21st Annual International Symposium on Computer Architecture. ACM, New York, 2-11.
Calder, B. and Grunwald, D. 1994b. Reducing branch costs via branch alignment. In the 6th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, 242-251.
Calder B. Grunwald, D., and Srivastava, A. 1995. The predictability of branches in libraries, In the 28th International Symposium on Microarchitecture. IEEE New York, 24-34.
Calder, B.. Grunwald, D., and Zorn, B. 1994. Quantifying behavioral differences between C and C++ programs. J. Program. Lang. 2, 4. Also available as Tech. Rep. CU-CS-698-94.. Univ, of Colorado, Boulder, Colo.
Chang, P. P. and Hwu, W. W. 1992. Profile-guided automatic inline expansion for C programs, Softw. Pract. Exper. 22, 5, 349-376.
Chang, P. P., Mahlke, A. S., and Hwu, W. W. 1991. Using profile information to assist classic compiler code optimizations. Softw,. Pract. Exper. 21, 12. 1301-1321.
Dempster, A. P. 1968. A generalization of Bayesian inference. J. Roy. Stat. Soc. 30. 205-247.
Fisher, J. A. 1981. Trace scheduling: A technique for global microcode compaction. IEEE Trans. Comput. C-30, 7 (July), 478 490.
Fisher, J. A. and Freudenberger, S. M. 1992. Predicting conditional branch directions from previous runs of a program. In Proceedings of the 5th International conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-V). ACM, New York, 85-95.
Hank, R., Mahlke, S., Bringmann, R., Gyllenhall, J., and Hwu, W. W. 1993. Superblock formation using static program analysis. In the 26th International Symposium on Microarchitecture, IEEE, New York, 247-256.
Hudson, C., Macke, T., Davies, J., Wolfe, M., and Leasure, B. 1986. The KAP/205: An advanced source-to-source vectorizer for the Cyber 205 supercomputer. In Proceedings of the 1986 International Conference on Parallel Processing. IEEE, New York, 827-835.
Hwu, W.-M. W. and Chang, P. P. 1989. Achieving high instruction cache performance with an optimizing compiler. In the 16th Annual International Symposium on Computer Architecture. ACM, New York, 242-251.
McFarling, S. and Hennessey, J. 1986. Reducing the cost of branches, In the 13th Annual International Symposium on Computer Architecture. ACM, New York, 396-403.
Pettis, K. and Hansen, R. C. 1990. Profile guided code positioning. In Proceed/ings of the SIGPLAN '90 Conference on Programming Language Design and Implementation. ACM, New York, 16-27.
Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, Calif.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1, Foundations, D. E. Rumelhart and J. L. McClelland, Eds. MIT Press, Cambridge, Mass., 318-362.
Shafer, G. 1976. A Mathematical Theory of Evidence. Princeton University Press, Princeton, N.J.
Smith, J. E. 1981. A study of branch prediction strategies. In the 8th Annual International Symposium on Computer Architecture. ACM, New York, 135-148.
Smolensky, P., MOZER, M. C., and Rumelhart, D. E., Eds. 1996. Mathematical Perspectives on Neural, Networks. Erlbaum, Hillsdale, N.J.
Srivastava, A. and Eustace, A. 1994. ATOM: A system for building customized program analysis tools. In Proceedings of the SIGPLAN '94 Conference on Programming Language Design and Implementation. ACM, New York, 196-205.
Wagner, T. A., Maverick, V., Graham, S., and Harrison, M. 1994. Accurate static estimators for program optimization. In Proceedings of the SIGPLAN '94 Conference on Programming Language Design and Implementation. ACM, New York, 85-96.
Wu, Y. and Larus, J. R. 1994. Static branch frequency and program profile analysis. In the 27th International Symposium on Microarchitecture. IEEE, New York.
Yeh, T.-Y. and Patt, Y. N. 1993. A comparison of dynamic branch predictors that use two levels of branch history. In the 20th Annual International Symposium on Computer Architecture. ACM, New York, 257-266.
|Printer friendly Cite/link Email Feedback|
|Author:||Calder, Brad; Grunwald, Dirk; Jones, Michael; Lindsay, Donald; Martin, James; Mozer, Michael; Zorn,|
|Publication:||ACM Transactions on Programming Languages & Systems|
|Date:||Jan 1, 1997|
|Previous Article:||Implementing signatures for C++.|
|Next Article:||Pure versus impure Lisp.|