Analysis of benchmark characteristics and benchmark performance prediction.
Categories and Subject Descriptors: C.4 [Computer Systems Organization]: Performance of Systems--measurement techniques; modeling techniques; performance attributes; D.2.8 [Software Engineering]: Metrics--performance measures; I.6.4 [Simulation and Modeling]: Model Validation and Analysis
General Terms: Measurement, Performance
Additional Key Words and Phrases: Abstract machine performance model, benchmark analysis, execution time prediction, microbenchmarking
Benchmarking is the process of running a specific program or workload on a specific machine or system and measuring the resulting performance. This technique clearly provides an accurate evaluation of the performance of that machine for that workload. These benchmarks can either be complete applications [Dongarra et al. 1987; MIPS 1989; UCB 1987], the most executed parts of a program (kernels) [Bailey and Barton 1985; Doduc 1989; McMahon 1986], or synthetic programs [Curnow and Wichmann 1976; Weicker 1988]. Unfortunately, benchmarking fails to provide insight as to why those results were obtained (either in terms of machine or program characteristics) and fails to provide run-times for that program on some other machine, or some other program on that machine [Dongarra et al. 1987; Worlton 1984]. This is because benchmarking fails to characterize either the program or machine. In this article we show that these limitations can be overcome with the help of a performance model based on the concept of a high-level abstract machine.
Our machine model consists of a set of abstract operations representing, for some particular programming language, the basic operators and language constructs present in programs. A special benchmark called a machine characterizer is used to measure experimentally the time it takes to execute each abstract operation (AbOp). Frequency counts of AbOps are obtained by instrumenting and running benchmarks. The machine and program characterizations are then combined to obtain execution time predictions. Our results show that we can predict with good accuracy the execution time of arbitrary scalar programs on a large spectrum of machines, thereby demonstrating the validity of our model. As a result of our methodology, we are able to individually evaluate the machine and the benchmark, and we can explain the results of individual benchmarking experiments. Further, we can describe a machine which does not actually exist and can predict with good accuracy its performance for a given workload.
In a previous paper we discussed our methodology and gave an in-depth presentation on machine characterization [Saavedra-Barrera et al. 1989]. In this article we focus on program characterization and execution time prediction; note that this article overlaps with Saavedra-Barrera et al.  to only a small extent, and only with regard to the discussion of the necessary background and methodology. Here, we explain how programs are characterized and present extensive statistics for a large set of programs including the Perfect Club and SPEC benchmarks. We discuss what these benchmarks measure and evaluate their effectiveness; in some cases, the results are surprising.
We also use the dynamic statistics of the benchmarks to define a metric of similarity among the programs; similar programs exhibit similar relative performance across many machines.
The structure of the article is as follows. In Section 2 we present an overview of our methodology, explain the main concepts, and discuss how we do program analysis and execution time prediction. We proceed in Section 3 by describing the set of benchmarks used in this study. Section 4 deals with execution time prediction. Here, we present predictions for a large set of machine-program combinations and compare these against real execution times. In Section 5 we present an extensive analysis of the benchmarks. The concept of program similarity is considered in Section 6. Section 7 ends the article with a summary and some of our conclusions. The presentation is self-contained and does not assume familiarity with the previous article.
2. ABSTRACT MODEL AND SYSTEM DESCRIPTION
In this section we present an overview of our abstract model and briefly describe the components of the system. The machine characterizer is described in detail in Saavedra-Barrera et al. ; this article is principally concerned with the execution predictor and program analyzer.
2.1 The Abstract Machine Model
The abstract model we use is based on the Fortran language, but it equally applies to other algorithmic languages. Fortran was chosen because it is relatively simple, because the majority of standard benchmarks are written in Fortran, and because the principal agency funding this work (NASA) is most interested in that language. We consider each computer to be a Fortran machine, where the run-time of a program is the (linear) sum of the execution times of the Fortran abstraction operations (AbOps) executed. Thus, the total execution time of program A on machine M ([T.sub.A,M]) is just the linear combination of the number of times each abstract operation is executed ([C.sub.i]), which depends only on the program, multiplied by the time it takes to execute each operation ([P.sub.i]), which depends only on the machine:
(1) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
[P.sub.M] and [C.sub.A] represent the machine performance vector and program characterization vector respectively.
Equation (1) decomposes naturally into three components: the machine characterizer, program analyzer, and execution predictor. The machine characterizer runs experiments to obtain vector [P.sub.M]. The dynamic statistics of a program, represented by vector [C.sub.A], are obtained using the program analyzer. Using these two vectors, the execution predictor computes the total execution time for program A on machine M.
We assume in the rest of this article that all programs are written in Fortran, are compiled with optimization turn off, and executed in scalar mode. All our statistics reflect these assumptions. In Saavedra-Barrera  and Saavedra and Smith [1995a; 1995b] we show how our model can be extended (very successfully) to include the effects of compiler optimization and cache misses.
2.2 Linear Models
As noted above, our execution prediction is the linear sum of the execution times of the AbOps executed. Equation (1) shows this linear model. Although linear models have been used in the past to fit a k-parametric "model" to a set of benchmark results, our approach is entirely different; we never use curve fitting. All parameter values are the result of direct measurements, and none are inferred as the solution of some fitted model. We make a specific point of this because this aspect of our methodology has been misunderstood in the past.
2.3 Machine Characterizer
The machine characterizer is a program which uses narrow spectrum benchmarking or microbenchmarking to measure the execution time of each abstract operation. It does this by, in most cases, timing a loop both with and without the AbOp of interest; the change in the run-time is due to that operation. Some AbOps cannot be so easily isolated, and more complicated methods are used to isolate the parameter of interest. There are 109 operations in the abstract model, up from 102 in Saavedra-Barrera et al. ; the benchmark set has been expanded since that time when additional AbOps were found to be needed.
The number and type of operations are directly related to the kind of language constructs present in Fortran. Most of these are associated with arithmetic operations and trigonometric functions. In addition, there are parameters for procedure call, array index calculation, logical operations, branches, and do loops [Saavedra-Barrera 1992; Saavedra-Barrera et al. 1989].
We note that obtaining accurate measurements of the AbOps is very tricky because the operations take nanoseconds, and the clocks on most machines run at 60 or 100 hertz. To get accurate measurements, we run our loops large numbers of times and then repeat each such loop measurement several times. There are residual errors, however, due to clock resolution, external events like interrupts, multiprogramming and I/O activity, unreproducible variations in the hit ratio of the cache, and paging [Clapp et al. 1986]. These issues are discussed in more detail [Saavedra-Barrera et al. 1989].
2.4 The Program Analyzer
The analysis of programs consists of two phases: the static analysis and the dynamic analysis. In the static phase, we count the number of occurrences of each AbOp in each line of source code. In the dynamic phase, we instrument the source code to give us counts for the number of executions of each line of source code, and then compile and run the instrumented version. The instrumented version tends to run about 15% slower than the uninstrumented version.
Let A be a program with input data I. Let us number each of the basic blocks of the program j = 1, 2, ... , m, and let [s.sub.i,j] (i = 1, 2, ... , n) designate the number of static occurrences of operation [P.sub.i] in block [B.sub.j]. Matrix [S.sub.A] = [[s.sub.i,j]] of size n X m represents the complete static statistics of the program. Let [[Mu].sub.A] = <[[Mu].sub.1], [[Mu].sub.2], ... , [[Mu].sub.j]> be the number of times each basic block is executed; then matrix [D.sub.A] = [[d.sub.j,j]] = [[[Mu].sub.j] [multiplied by] [s.sub.i,j]] gives us the dynamic statistics by basic block. Vector [C.sub.A] and matrix [D.sub.A] are related by the following equation:
(2) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
Obtaining the dynamic statistics in this way makes it possible to compute execution time predictions for each of the basic blocks, not only for the whole program.
The methodology described above permits us to measure M machines and N programs and then compute run-time predictions for N [multiplied by] M combinations. Note that our methodology will not apply in two cases. First, if the execution history of a program is precision dependent (as is the case with some numerical analysis programs), then the number of AbOps will vary from machine to machine. Second, the number of AbOps may vary if the execution history is real-time dependent. All programs that we consider in this article have execution histories that are precision and time independent.(1)
2.5 Execution Prediction
The execution predictor is a program that computes the expected execution time of program A on machine M from its corresponding program and machine characterizations. In addition, it can produce detailed information about the execution time of sets of basic blocks or how individual abstract operations contribute to the total time.
Figure 1 shows a sample of the output produced by the execution predictor. Each line gives the number of times that a particular AbOp is executed and the fraction of the total that it represents. Next to it is the expected execution time contributed by the AbOp and the fraction of the total. The last line reports the expected execution time for the whole program.
[Figure 1 ILLUSTRATION OMITTED]
The statistics from the execution predictor provide information about what factors contribute to the execution time, either at the level of the abstract operations or individual basic blocks. For example, Figure 1 shows that 57% of the time is spent computing the address of a two-dimensional array element (arr2). This operation, however, represents only 33% of all operations in the program (column six). By comparing the execution predictor outputs of different machines for the same program, we can see if there is some kind of imbalance in any of the machines that makes its overall execution time larger than expected [Saavedra-Barrera and Smith 1990].
2.6 The Effects of the Memory Hierarchy, Compiler Optimization, and Vectorization
The goals of this article are (1) to show that good execution time predictions can be obtained for non-optimized programs using the basic abstract machine model and (2) to show that the same model can be used to enhance our understanding of the purpose and effectiveness of benchmarks. Because the effects of compiler optimization and the memory hierarchy require making some changes to the basic model, we eliminate the effects of the former by considering here only non-optimized code, and at this time we also ignore memory hierarchy delays. Ignoring the memory hierarchy is acceptable because almost all of the benchmarks in the SPEC and Perfect suite have low cache and TLB miss rates. For completeness purposes, however, in this section we briefly explain how the model can be (and is) extended to account for the effects of the memory hierarchy, compiler optimization, and vectorization. See Saavedra and Smith [1995b] for this extension. We also indicate, when appropriate, how these factors affect our ability to predict the execution time of programs. More details are given in the appropriate references.
Memory Hierarchy. To account for the effects of the memory hierarchy we can add an explicit term to Eq. (1) and get
(3)[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
where [M.sub.A,M] and [S.sub.M] represent vectors of misses and stall time penalties affecting all relevant levels of the memory hierarchy [Saavedra and Smith 1995b].
We note that the miss rates will be different for each machine, depending on the configurations and sizes of the caches and TLBs. Our extension in Saavedra and Smith [1995b] relies on miss ratios available from Gee and Smith .(2)
We can afford to ignore the effects of the memory hierarchy because most of the SPEC and Perfect benchmarks have low miss ratios for the cache and TLB. In Saavedra and Smith [1995b] we show that, on the SPEC benchmarks, including the effects of the cache and TLB misses accounts, on the average, for no more than 7.09% and 1.03%, respectively, of the total execution time. With respect to cache misses, the maximum delay contribution occurs on NASA7 with 8.55%, and with respect to TLB misses, the maximum contribution is for MATRIX300 with 3.09%. Overall, the effect of the memory hierarchy on our prediction errors reduces the root mean square error by 2.36%. Therefore, ignoring memory hierarchy effects does not prevent us from validating the predictions of the simpler model.
Optimization. In this article we consider only nonoptimized execution in both our predictions and run-time measurements. Ideally, a proper characterization of the effect of optimization would determine which optimizations a given compiler would perform on a given program; unfortunately, this is impractical, since if we could do that, we could write a better optimizing compiler than any of the ones that we would be analyzing.
The optimization issue is dealt with explicitly in Saavedra and Smith [1995a] and Saavedra-Barrera . There we do several things: we evaluate the performance improvement provided by various optimizing compilers; we determine explicitly which machine-independent optimizations are implemented by each compiler; we evaluate the extent to which different compilers do the same optimizations (and obtain similar speedups); and we successfully predict run-times for optimized code. We do the latter by defining an "optimized machine," which is the set of timings we obtain for the AbOps for optimized code. The performance of this optimized abstract machine is characterized by a new performance vector [P.sub.M,O]. This approach works because low-level optimizations, which appear to dominate, tend to improve the sequence of low-level instructions implementing high-level ones, without eliminating the semantics of the original instructions. In Saavedra-Barrera [1995a] we formalize this notion by introducing the concept of invariant optimization.
Once the effects of high-level, low-level, and microoptimizations have been captured with vectors [C.sub.A,O] and [P.sub.M,O], the execution time of an optimized program can be predicted using Eq. (1). In Saavedra and Smith [1995a] we show that the RMS (root mean square) predictions errors for nonoptimized and optimized programs are 20.59% and 31.89%, respectively. The error increase is quite small considering that; the amount of improvement due to optimization range from less to 40% to as much as 80%.
Vectorization. Extending our model to include vector machines is straightforward and consists of two steps. First, a set of vector AbOps which characterizes the performance of individual vector operations is added to the original set. In contrast with scalar AbOps, which are characterized with a single number, vector AbOps require three numbers: the startup delay, the maximum vector length, and the asymptotic latency. Second, vectorizing a program is similar to applying a high-level optimization, in which some statements inside the innermost loops are transformed into a sequence vector statements? The resulting vectorized program represents a valid Fortran program written in vector notation. Hence, computing the dynamic statistics for a vectorized program is not more difficult than doing it for a scalar one. Finally, an equation similar to (1) can be used to make execution time predictions. Some work along these lines appears in Von Worley and Smith .
2.7 Related Work
Several papers have proposed different approaches to execution time prediction, with significant differences in their degrees of accuracy and applicability. These attempts have ranged from using simple Markov Chain models [Beizer 1978; Ramamoorthy 1965] to more complex approaches that involve solving a set of recursive performance equations [Hickey and Cohen 1988]. Here we mention three proposals that are somewhat related to our concept of an abstract machine model and to the use of static and dynamic program statistics.
One way to compare machines is to do an analysis similar to ours, but at the level of the machine instruction set [Pueto and Shustek 1977]. This approach only permits comparisons between machines which implement the same instruction set.
In the context of the PTRAN project [Allen et al. 1987], execution time prediction has been proposed as a technique to help in the automatic partitioning of parallel programs into tasks. In Sarkar , execution profiles are obtained indirectly by collecting statistics on all the loops of a possible unstructured program and then combining that with analysis of the control dependence graph.
In Balasundaram et al.  a prototype of a static performance estimator which could be used by a parallel compiler to guide data-partitioning decisions is presented. These performance estimates are computed from machine measurements obtained using a set of routines called the training set. The training set is similar to our machine characterizer. In addition to the basic CPU measurements, the training set also contains tests to measure the performance of communication primitives in a loosely synchronous distributed-memory machine. The compiler then makes a static analysis of the program and combines this information with data produced by the training set. A prototype of the performance estimator has been implemented in the ParaScope interactive parallel programming environment [Balasundaram et al. 1989]. In contrast to our execution time predictions, the compiler does not incorporate dynamic program information; the user must supply the lower and upper bounds of symbolic variables used for do loops, and branching probabilities for if-then-else statements (or use the default probabilities provided by the compiler.)
3. THE BENCHMARK PROGRAMS
For this study, we have assembled and analyzed a large number of scientific programs, all written in Fortran, representing different application domains. These programs can be classified in the following three groups: SPEC benchmarks, Perfect Club benchmarks, and small or generic benchmarks. Table I gives a short description of each program. In the list for the Perfect benchmarks we have omitted the program SPICE, because it is included in the SPEC benchmarks as SPICE2G6. For each benchmark except SPICE2G6, we use only one input data set. In the case of SPICE2G6, the Perfect Club and SPEC versions use different data sets, and we characterize both executions and include other relevant examples. A more in-depth description of the benchmarks can be found in Cybenko et al. , SPEC [1989; 1990a; 1990b], and Saavedra-Barrera .
[TABULAR DATA I NOT REPRODUCIBLE IN ASCII]
3.1 Floating-Point Precision
In Fortran, the precision of a floating-point variable can be specified either absolutely (by the number of bytes used, e.g., real*4) or relatively (by using the words "single" and "double"). The interpretation of the latter terms is compiler and machine dependent. Most of the benchmarks we consider (see Table I) use relative declarations; this means that the measurements taken on the Cray machines (see Table II) are not directly comparable with those taken on the other machines. We chose not to modify any of the source code to avoid this problem.
Table II. Characteristics of the Machines Operating Machine Name/Location System CRAY Y-MP/8128 reynolds.nas.nasa.gov UNICOS 5.0.13 CRAY-2 navier.nas.nasa.com UNICOS 6.1 CRAY X-MP/48 NASA Ames COS 1.16 NEX SX-2 harc.edu VM/CMS Convex C-1 convex.riacs.edu UNIX C-1 v6 IBM 3090/200 cmsa.berkeley.edu VM/CMS r.4 IBM RS/6000 530 coyote.berkeley.edu AIX V.3 IBM RT-PC/125 loki.berkeley.edu ACIS 4.3 MIPS M/2000 mammoth.berkeley.edu RISC/os 4.50B1 MIPS M/1000 cassatt.berkeley.edu UMIPS-BSD 2.1 Decstation 3100 ylem.berkeley.edu Ultrix 2.1 Sparcstation I genesis.berkeley.edu SunOS R4.1 Sun 3/50 (68881) venus.berkeley.edu UNIX 4.2 r.3.2 Sun 3/50 baal.berkeley.edu UNIX 4.2 r.3.2 VAX 8600 vangogh.berkeley.edu UNIX 4.3 BSD VAX 3200 atlas.berkeley.edu Ultrix 2.3 VAX-11/785 pioneer.arc.nasa.gov Ultrix 3.0 VAX-11/780 wilbur.arc.nasa.gov UNIX 4.3 BSD Motorola M88K rumble.berkeley.edu UNIX R32.V1.1 Amdahl 5840 prandtl.nas.nasa.gov UTS V Compiler Machine Version Memory CRAY Y-MP/8128 CFT77 184.108.40.206 128 Mw CRAY-2 CFT 220.127.116.11 256 Mw CRAY X-MP/48 CFT 1.14 8 Mw NEX SX-2 FORT77SX 32 Mw Convex C-1 FC v2.2 100 MB IBM 3090/200 Fortran v2.3 32 MB IBM RS/6000 530 XL Fortran v1.1 16 MB IBM RT-PC/125 F77 v1 4 MB MIPS M/2000 F77 v2.0 128 MB MIPS M/1000 F77 v1.21 16 MB Decstation 3100 F77 v2.1 16 MB Sparcstation I F77 v1.3 8 MB Sun 3/50 (68881) F77 v1 4 MB Sun 3/50 F77 v1 4 MB VAX 8600 F77 v1.1 28 MB VAX 3200 F77 v1.1 8 MB VAX-11/785 F77 v1.1 16 MB VAX-11/780 F77 v2 4 MB Motorola M88K F77 v2.0b3 32 MB Amdahl 5840 F77 v2.0 32 MB Integer Real Machine Single Single Double CRAY Y-MP/8128 46 64 128 CRAY-2 46 64 128 CRAY X-MP/48 46 64 128 NEX SX-2 64 64 128 Convex C-1 32 32 64 IBM 3090/200 32 32 64 IBM RS/6000 530 32 32 64 IBM RT-PC/125 32 32 64 MIPS M/2000 32 32 64 MIPS M/1000 32 32 64 Decstation 3100 32 32 64 Sparcstation I 32 32 64 Sun 3/50 (68881) 32 32 64 Sun 3/50 32 32 64 VAX 8600 32 32 64 VAX 3200 32 32 64 VAX-11/785 32 32 64 VAX-11/780 32 32 64 Motorola M88K 32 32 64 Amdahl 5840 32 32 64
The size of the data type implementations are in number of bits.
4. PREDICTING EXECUTION TIMES
We have used the execution predictor to obtain estimates for the programs in Table I and for the machines shown in Table II. These results are presented in Figure 2. (For the complete set of numerical results see Saavedra and Smith .)
[Figure 2 ILLUSTRATION OMITTED]
A subset of programs did not execute correctly on all machines at the time of this research; some of these problems may have been corrected since that time. Some of the reasons for this were internal compiler errors, run-time errors, or invalid results. Livermore Loops is an example of a program which executed in all machines except the IBM RS/6000 530, where it gave a run-time error. A careful analysis of the program reveals that the compiler generated incorrect code. For three programs in the Perfect suite, the problems were mainly shortcomings in the programs. For example, TRACK gave invalid results in most of the workstations even after fixing a bug involving passing of a parameter; MG3D needed 95MB of disk space for a temporary file that few of the workstations had; SPEC77 gave an internal compiler error on machines using MIPS Co. processors; and on the Motorola 88000 the program never terminated.
Our results show not only accurate predictions in general but also reproduce apparent "anomalies," such as the fact that the CRAY Y-MP is 35% faster than the IBM RS/6000 for QCD but is slower for MDG. (Note that because of the relative declarations used for precision, the Cray is actually computing results at twice the precision of the RS/6000.)
In Table III we summarize the accuracy of our run-time predictions. The results show that 51% of all predictions fall within 10% of the real execution times, and almost 79% are within 20%. Only 15 out of 244 predictions (6.15%) have an error of more than 30%. The results represent 244 program-machine combinations encompassing 18 machines and 28 programs. These results are very good if we consider that the characterization of machines and programs is done using a high-level model.
Table III. Error Distribution for the Predicted Execution Times <5% < 10% < 15% <20% 68 (27.9) 124 (55.5) 171 (70.1) 192 (78.7) <30% >30% 229 (93.9) 15(6.15)
For each error interval, we indicate the number of predictions, from a total of 244, having errors that fall inside the interval (percentages inside parenthesis). The error is computed as the relative distance to the real execution time.
The average error and root mean square (RMS) error measured over all machine-program combinations are 0.57% and 15.07%, while the maximum discrepancy occurs on MATRIX300, with an average error of -24.51% and an RMS of 26.36%. Our predictions for this program consistently underestimate the execution time on all machines because, for this program, the number of cache and TLB misses is significant; the model used for this article does not consider this factor. As noted earlier, memory hierarchy delays are considered in Saavedra-Barrera  and Saavedra and Smith [1995a]. Because most of the benchmarks in the SPEC and Perfect suite tend to have low cache and TLB miss ratios [Gee and Smith 1993], our other prediction errors do not have the same problem as for MATRIX300. With respect to machines, the maximum errors occur on the IBM RS/6000 530, where the average error and RMS are 6.41% and 19.98%, respectively. Furthermore, the RS/6000 also gives the maximum positive and negative errors (-35.9% and 44.0%).
4.1 Single-Number Performance
Although it may be misleading, it is frequently necessary or desirable to describe the performance of a given machine by a single number. In Table IV we present (1) both the actual and predicted geometric means of the normalized (with respect to the VAX 11/780) execution times and (2) the error percentages. We can clearly see from the results that our estimates are very accurate; in all cases the difference is less than 8%. In those cases for which they are available, we also show the SPECmark numbers; note that our results are for unoptimized code, and the SPEC figures are for the best-optimized results.(4)
Table IV. Real and Predicted Geometric Means for Normalized Benchmark Results
SPECmark Actual Mean Prediction Difference Cray X-MP/48 N.A. 26.25 26.07 +0.69% IBM 3090/200 N.A. 33.79 32.27 -4.50% Amdahl 5840 N.A. 6.47 6.71 +3.71% Convex C-1 N.A. 7.36 6.99 -5.03% IBM 6000 530 28.90 16.29 15.69 -3.68% Sparc I 11.80 11.13 1.0.58 -4.94% Motorola 88k 15.80 14.24 15.34 +7.72 MIPS M.2000 17.60 13.88 1.3.70 -1.30% DEC 3100 11.30 9.01 8.43 -6.44% VAX 8600 N.A. 5.87 5.63 -4.09% VAX-11/785 N.A. 2.01 2.12 +5.47% VAX-11/780 1.00 1.00 1.00 N.A. Sun 3/50 N.A. 0.69 0.72 +4.35% Average N.A. 12.25 12.02 -1.88%
Execution times are normalized with respect to the VAX-ii/780. For some machines we also show their published SPEC ratios. The reason why some of the SPECmark numbers are higher than either the real or the predicted geometric means is because in contrast to our measurements the SPEC results are for optimized codes.
The significance of these results requires some clarification. The SPEC-mark is a relative metric of performance that uses as its baseline the results of the VAX-11/780. Under this metric, the correct interpretation of the results is not that optimization accounts for 8% of the time improvement, but that the performance of a machine relative to the VAX-11/780 is not significantly affected by the use of nonoptimized or optimized execution times. In other words, the amount of improvement produced by most commercial compilers for the same program exhibits little variability between compilers. Note that this cannot be interpreted as saying that all programs tend to be optimized by the same amount, but only that most compilers are, more or less, equally effective in optimizing the same program. Therefore, in ranking machines, nonoptimized and optimized executions will tend to produce similar results.(5)
5. PROGRAM CHARACTERIZATION
There are several reasons why it is important to know in what way a given benchmark "uses" a machine, i.e., which abstract operations the benchmark performs most frequently. That information allows us to understand the extent to which the benchmark may be considered representative: it shows how the program may be tuned and indicates the goodness of the fit between the program and the machine. With our methodology, this information is provided by the dynamic statistics of the program.
5.1 Normalized Dynamic Distributions
The complete normalized dynamic statistics for all benchmarks can be found in Saavedra and Smith . Here we present only a summary of the results.
5.2 Basic-Block and Statement Statistics
Figure 3 shows the distribution of statements, classified into assignments, procedure calls, if statements, branches, and do loop iterations. On this and similar figures we cluster the benchmarks according to the similarity of their distributions. The cluster to which each benchmark belongs is indicated by a roman numeral at the top of the bar.
[Figure 3 ILLUSTRATION OMITTED]
The results show that there are several programs in the Perfect suite whose distributions differ significantly from those of other benchmarks in the suite. In particular, programs QCD, MDG, and BDNA execute an unusually large fraction of procedure calls. A similar observation can be made in the case of if statements for programs QCD, MDG, and TRACK. TRACK executes an unusually large number of branches.
The SPEC and Perfect suites have similar distributions. SPICE2G6 using model GREYCODE and DODUC are two programs which execute a large fraction of if statements and branches. In GREYCODE, 35% of all its statements are branches, and DODUC has a large number of if statements. The distribution of statements also provides additional data. The distributions for programs FPPPP and BDNA are similar in the sense that both show a large fraction of assignments and a small fraction of do loops. Consistent with this is the observation that the most important basic block in FPPPP contains more than 500 assignments.
In Table V we give the average distributions of statements for the SPEC, Perfect Club, and small benchmarks. We also indicate the average over all programs. These numbers correspond to the average dynamic distributions shown in Figure 3. It is worth observing from this data that although the Perfect Club methodology counts only FLOPS (floating-point operations per second), not all of the benchmarks are dominated by floating-point operations.
Table V. Average Dynamic Distributions of Statements for Each of the Suites and for All Benchmarks
SPEC Perfect Various All Programs Assignments 66.4% 64.5% 53.9% 61.4% Procedure Calls 1.1% 2.7% 1.2% 1.8% IF Statements 5.5% 2.9% 7.6% 5.3% Branches 7.2% 2.8% 7.3% 5.0% DO Loops 19.8% 27.1% 30.0% 26.4%
5.3 Arithmetic and Logical Operations
Figures 4 and 5 depict the distribution of operations according to their type and what they compute. As it is clear from the graphs, for each program, operations on one or two data types are dominant. In this respect the Perfect benchmarks can be classified in the following way: ADM, DYFESM, FL052, and SPEC77 execute mainly floating-point single-precision operators; MDG, BDNA, ARC2D, and TRFD floating-point double-precision operators; QCD and MG3D floating-point single-precision and integer operators; TRACK floating-point double-precision and integer; and OCEAN integer and complex operators. These results further suggest the inadequacy of counting FLOPS as a performance measure. A similar classification can be obtained for the SPEC and the other benchmarks.
[Figures 4 & 5 ILLUSTRATION OMITTED]
With respect to the distribution of arithmetic operators, Figure 5 shows that the largest fraction correspond to addition and subtraction, followed by multiplication. Other operations like division, exponentiation, and comparison are relatively infrequent; see Table VI.
Table VI. Average Dynamic Distributions of Arithmetic and Logical Operations for Each of the Suites and for All Benchmarks
Operations SPEC Perfect Various All Programs Real (single) 2.0% 39.5% 51.4% 35.1% Real (double) 78.0% 40.1% 0.9% 35.5% Integer 17.9% 18.2% 44.8% 27.0% Complex 1.7% 1.8% 0.1% 1.2% Logical 0.4% 0.4% 2.8% 1.2% Arithmetic Operators SPEC Perfect Various All Programs Add/Subtract 52.6% 52.4% 50.0% 51.7% Multiply 38.7% 38.4% 22.4% 33.1% Quotient 1.9% 2.4% 1.3% 1.9% Exponentiation 0.1% 0.6% 0.2% 0.3% Comparison 6.7% 6.2% 25.9% 12.9%
5.4 References to Array and Scalar Variables
Run-time is affected by the need to compute the addresses of array data; no extra time is needed to reference scalar data. The frequencies of references to scalar and N-dimensional arrays are shown in Figure 6. We can see that for most of the Perfect benchmarks, the proportion of array references is larger than for scalar references. The Perfect benchmark with the highest fraction of scalar operands is BDNA, and on the SPEC benchmarks, DODUC, FPPPP, and all models of SPICE2G6 lean toward scalar processing. The distribution of the number of dimensions shows that on most programs a large portion of the references are to one-dimensional arrays with a smaller fraction in the case of two dimensions. However, programs ADM, ARC2D, and FL052 Contain a large number of references to arrays with three dimensions. NASA7 is the only program which contains four-dimensional array references.
[Figure 6 ILLUSTRATION OMITTED]
Most compilers compute array addresses by calculating, from the indices and array dimensions, the offset relative to a base element; the base element (such as X(0, 0, ..., 0)) may not actually be a member of the array. When optimization is disabled, the computation requires n - 1 adds and n - 1 multiplies. In scientific programs, array address computation can be a significant fraction of the total execution time. For example, in benchmark MATRIX300 this can account, on some machines, for more than 60% of the unoptimized execution time. When using optimization, most array address computations are strength-reduced to simple additions; see Saavedra-Barrera  and Saavedra and Smith  for how we handle that case.
The results in Figure 6 show that the average number of dimensions in an array reference for the Perfect and SPEC benchmarks are 1.616 and 1.842, respectively. However, the probability that an operand is an array reference is greater in the Perfect benchmarks (0.5437 versus 0.4568); see Table VII.
Table VII. Average Dynamic Distributions of Operands in Arithmetic Expressions for Each of the Suites and for All Benchmarks
SPEC Perfect Various All Programs Scalar 54.0% 45.7% 52.5% 49.8% Array 1D 13.4% 29.6% 42.6% 30.3% Array 2D 28.1% 15.5% 4.8% 14.7% Array 3D 3.3% 9.2% 0.1% 4.9% Array 4D 1.2% 0.0% 0.0% 0.2%
5.5 Execution Time Distribution
One of our most interesting measurements is the fraction of run-time consumed by the various types of operations; this figure is a function of the program and the machine. As examples, Figures 7 and 8 show the distribution of execution time for the IBM RS/6000 530 and CRAY Y-MP/832. We decompose the execution time into four classes: floating-point arithmetic, array access computation, integer and logical arithmetic, and other operations. All distributions were obtained using our abstract execution model, the dynamic statistics of the programs, and the machine characterizations.
[Figures 7 & 8 ILLUSTRATION OMITTED]
Our previous assertion that scientific programs do more than floating-point computation is evident from Figures 7 and 8. For example, programs QCD, OCEAN, and DYFESM spend more than 60% of their time executing operations that are not floating-point arithmetic or array address computation. This is even more evident for GREYCODE. Here less than 10% of the total time on the RS/6000 530 is spent doing floating-point arithmetic. The numerical values for each benchmark suite are given in Table VIII.
Table VIII. Average Dynamic Distributions of Execution Time for Each of the Suites and for All Benchmarks on the IBM RS/6000 530 and the CRAY Y-MP/832
IBM RS/6000 530 SPEC Perfect Various All Programs Floating Point 26.64% 21.33% 16.61% 20.94% Array Access 47.50% 51.40% 31.19% 43.80% Integer 10.30% 8.26% 20.51% 12.80 Other Operations 15.55% 19.01% 31.69% 22.47% Cray Y-MP/832 SPEC Perfect VariousAll Programs Floating Point 65.59% 56.15% 18.77% 45.79% Array Access 9.36% 10.42% 9.10% 9.75% Integer 5.98% 5.45% 24.26% 11.84% Other Operations 19.07% 27.98% 47.87% 32.63%
From the figures, it is evident that the time distributions for the RS/6000 530 and the CRAY Y-MP are very different even when all programs are executed in scalar mode on both machines. On the average, the fraction of time that the CRAY Y-MP spends executing floating-point operations is 46%, which is significantly more than the 21% on the RS/6000. These results are very surprising, as the CRAY Y-MP has been designed for high-performance floating point. As noted above, however, most of the benchmarks are double precision, which on the CRAY is 128 bits, and double precision on the CRAY is about 10 times slower than 64-bit single precision, since it is implemented in software. This effect is seen clearly in programs: DODUC, SPICE2G6, MDG, TRACK, BDNA, ARC2D, and TRFD. Using our program statistics, however, we can easily compute the performance when all programs execute using 64-bit quantities on all machines. In this case, we compute that the fraction of time represented by floating-point operations on the CRAY Y-MP decreases to 29%, still higher than for the RS/6000. Note that this is an example of the power of our methodology--we are able to compute the performance of a machine which does not exist.
The results also show the large fraction of time spent by the IBM RS/6000 in array address computation. One example is program FL052, which makes extensive use of three-dimensional arrays. In contrast, the distributions of MANDELBROT and WHETSTONE clearly show that these are scalar codes completely dominated by floating-point computation. Remember, however, that our statistics correspond to unoptimized programs. With optimization, the fraction of time spent computing array references is smaller, as optimizers in most cases replace most array address computations with a simple add by precomputing the offset between two consecutive elements of the array. This corresponds to applying strength reduction and backward code motion.
In Figure 9 we show the overall average time distribution for several of the machines. In the case of the supercomputers (CRAY Y-MP, NEC SX-2, and CRAY-2), single and double precision correspond to 64 and 128 bits. The results show that on the VAX 9000, HP-9000/720, RS/6000 530, and machines based on the R3000/R3010 processors, the floating-point contribution is less than 30%.
[Figure 9 ILLUSTRATION OMITTED]
5.6 Dynamic Distribution of Basic Blocks
Figure 10 shows the fraction of basic-block executions accounted for by the 5, 10, 15, 20, and 25 most frequently executed basic blocks. (A basic block is a segment of code executed sequentially with only one entry and one exit point.) There is an implicit assumption among benchmark users that a large program with a long execution time represents a more difficult and "interesting" benchmark. This argument has been used to criticize the use of synthetic and kernel-based benchmarks and has been one of the motivations for using real applications in the Perfect and SPEC suites. However, as the results of Figure 10 show, many of the programs in the Perfect and SPEC suites have very simple execution patterns, where only a small number of basic blocks determines the total execution time. The Perfect benchmark results show that on programs BDNA and TRFD the five most popular blocks account for 95% of all operations, from a total of 883 and 202 blocks respectively. Moreover, on seven of the Perfect benchmarks, more than 50% of all operations are found in only five blocks. The same observation can be made for some of the original SPEC benchmarks. As shown in Table IX, on the average, the five most popular blocks account for 55.45% and 71.85% of the total time on the Perfect and SPEC benchmarks.
Table IX. Portion of Basic-Block Executions Accounted for by 5 Most Frequent, 6-10 Most Frequent, etc. for Each of the Suites and for All Benchmarks
SPEC Perfect Various All Programs 1-5 blocks 72.10% 55.00% 76.80% 66.10% 6-10 blocks 9.10% 14.60% 10.80% 12.10% 11-15 blocks 3.90% 8.30% 5.10% 6.30% 16-20 blocks 2.70% 4.90% 2.90% 3.70% 21-25 blocks 1.90% 4.50% 1.80% 3.00% >25 blocks 10.30% 12.70% 2.60% 8.80%
[Figure 10 ILLUSTRATION OMITTED]
5.6.1 Quantifying Benchmark Instability Using Skewness. When a large fraction of the execution time of a benchmark is accounted for by a small amount of code, the relative running time of that benchmark may vary widely between machines, depending on the execution time of the relevant AbOps on each machine, i.e., the benchmark results may be "unstable." We describe the extent to which the execution time is concentrated among a small number of basic blocks or AbOps as the degree of skewness of the benchmark. (This is not the same as the statistical coefficient of skewness, but the concepts are similar.) We define our skewness metric for basic blocks as 1/X, where
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
where p(j) is the frequency of the jth most frequently executed basic block.
To see that the metric for skewness is, in fact, adequate, we have to remember that vector P = ([P.sub.1], ... , [P.sub.N]) represents an ordered probability distribution (statistic), i.e., i [is less than or equal to] j implies
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
Moreover, the notion of skewness we want to capture says that one ordered statistic is more skewed than another one, when in the former a larger fraction of the cumulative distribution is concentrated in a smaller number of parameters (in this case basic blocks). From this it is easy to see that the amount of skewness should get its maximum value on vector [P.sub.o] = (1, O, ... ,O) (the whole execution is concentrated in a single basic block) and should be minimum on [P.sub.[infinity]] = (1/N, ... , 1/N). It is also clear that given an arbitrary statistic [P.sub.k] = ([P.sub.1], ... , [P.sub.N]), we can always construct a sequence of ordered statistics starting with [P.sub.o], ending on [P.sub.[infinity]], and passing through [P.sub.k], such that each statistic in the sequence is less "skewed" than its predecessor.
Table X gives the amount of skewness of the basic blocks for all programs. The results show that MATRIX300, MANDELBROT, and LINPACK are the ones with the largest skewness.
Table X. Skewness of Ordered Basic-Block Distribution for the SPEC, Perfect, and Small Benchmarks
Program Skewness Program Skewness 01 Matrix300 0.98 15 Nasa7 0.16 02 Mandelbrot 0.79 16 MDG 0.15 03 Linpack 0.64 17 Smith 0.14 04 BDNA 0.57 18 QCD 0.13 05 Tomcatv 0.54 19 Livermore 0.13 06 Baskett 0.47 20 MG3D 0.11 07 Erathostenes 0.45 21 SPICE2G6 0.08 08 TRFD 0.41 22 FLO52 0.08 09 Shell 0.39 23 ARC2D 0.07 10 DYFESM 0.25 24 TRACK 0.07 11 Whetstone 0.23 25 SPEC77 0.07 12 Fpppp 0.20 26 ADM 0.06 13 OCEAN 0.17 27 Doduc 0.05 14 Alamos 0.16
The skewness is defined to be the inverse of the mean of the ordered distribution.
MATRIX300 clearly illustrates the risks of using unstable, or highly skewed, programs as benchmarks. Soon after the SPEC benchmarks were released, computer manufacturers discovered that 99.9% of the total operations executed by MATRIX300 were in one basic block containing a single statement. By using preprocessors that inlined three levels of routines, they were able to expose the original matrix-multiply algorithm, which suffered from poor locality, and then replace it with a highly hand-optimized version that exploited as much locality of reference as possible by tiling the iteration space of the loop with respect to both the cache and registers. By doing this, for example, the SPECratio performance of the CDC 4330 (a machine based on the MIPS 3000 processor)jumped from 15.7 to 63.9 [SPEC 1990b]. The widespread use of this technique ended up exposing the ineffectiveness of MATRIX300 as a benchmark; MATRIX300 has been eliminated from later versions of the SPEC benchmark suite.
5.7 Distribution of AbOps
Figure 11 shows the cumulative distribution of AbOps for the different benchmark suites. Each bar indicates at the bottom the number of different AbOps operations executed by the benchmark. The results show that most programs execute only a small number of different operations, with MATRIX300 as an extreme example. The averages for the three suites and for all programs are presented in Table XI.
Table XI. Portion of AbOp Executions Accounted for by 2 Most Frequent, 5 Most Frequent, etc. for Each of the Suites and for All Benchmarks
SPEC Perfect Various All Programs 2 params 46.30% 41.30% 44.00% 43.30% 5 params 31.20% 31.70% 32.50% 31.90% 10 params 12.30% 17.70% 18.10% 16.70% 15 params 6.50% 5.70% 3.00% 5.00% 20 params 2.50% 2.30% 1.90% 2.00% >20 params 1.20% 1.30% 0.50% 1.00%
[Figure 11 ILLUSTRATION OMITTED]
We can also compute the skewness of the ordered distribution of AbOps in the same way as we did with basic blocks, i.e., as the inverse of the expected value of the distribution; the results are shown in Table XII. The programs with the largest values of skewness are MATRIX300, ALAMOS, and ERATHOSTENES. The results also show that DODUC is the SPEC benchmark with the lowest amount of skewness both in the distribution of basic blocks and AbOps.
Table XII. Skewness of Ordered Abstract Operation Distribution for the SPEC, Perfect, and Small Benchmarks
Program Skewness Program Skewness 01 Matrix300 0.405 15 Smith 0.254 02 Alamos 0.400 16 BDNA 0.251 03 Erathostenes 0.367 17 SPICE2G6 0.248 04 Shell 0.353 18 FLO52 0.243 05 Tomcatv 0.341 19 OCEAN 0.217 06 TRFD 0.325 20 SPEC77 0.215 07 Fpppp 0.315 21 Livermore 0.213 08 Linpack 0.309 22 ADM 0.210 09 DYFESM 0.296 23 QCD 0.200 10 ARC2D 0.286 24 TRACK 0.180 11 Mandelbrot 0.279 25 Nasa7 0.169 12 Baskett 0.263 26 Whetstone 0.155 13 MDG 0.256 27 Doduc 0.139 14 MG3D 0.255
The skewness is defined to be the inverse of the mean of the distribution.
5.8 The Amount of Skewness in Programs and the Distribution of Errors
As we have observed, many of the benchmarks concentrate their execution on a small number of AbOps. We would expect that our predictions of running time for benchmarks with highly skewed distributions of AbOp execution would show greater errors than those with less skewed distributions. This follows directly from the assumption that our errors in measuring AbOp times are random; there will be less cancellation of errors when summing over a small number of large values than a larger number of small values. (This can be explained more rigorously by considering the formula for the variance of a sum of random variables.)
We tested the hypothesis that prediction errors for programs with a skewed distribution of either basic blocks or abstract operations will tend to be larger than for those with less skewed distributions. An examination of those results showed that there is no correlation between prediction error and the skewness of the frequency of basic-block execution. There is a small amount of correlation between the skewness of the AbOp execution distribution and the prediction error. This lack of correlation seems to be due to two factors: (a) those programs with the most highly skewed distributions emphasize AbOps such as floating point, for which measurement errors are small, and (b) prediction errors are mostly due to other factors (e.g., cache misses), rather than errors in the measurement of AbOp execution times.
5.9 The SPICE2G6 Benchmark
In this section, we discuss in more detail the differences between the seven data sets used for the SPICE2G6 benchmark. SPICE2G6 is normally considered, for performance purposes, to be a good example of a large CPU-bound, scalar, double-precision floating-point benchmark, with a small fraction of complex arithmetic and negligible vectorization. Given its large size (its code and data sizes on a VAX-11/785 running ULTRIX are 325KB and 8MB, respectively), it might be expected to be a good test for the instruction and data caches. The SPEC suite uses, as input, a time-consuming bipolar circuit model called GREYCODE, while the Perfect Club uses a PLA circuit called PERFECT. GREYCODE was selected mainly because of its long execution time; we shall see however that its execution behavior is not typical, nor does it measure what SPICE2G6 is believed to measure.
The general statistics for the seven data models of SPICE2G6 show that the number of abstract operations executed by GREYCODE (2.005 x [10.sup.10]) is almost two orders of magnitude larger than the maximum on any of the other models (3.184 x [10.sup.8]) [Saavedra and Smith 1992]. For GREYCODE, however, only 33% of all basic blocks are executed. In contrast, the number of basic blocks touched by the BENCHMARK input is 52%. Another abnormal feature of GREYCODE is that it has the lowest fraction of assignments executed (60%), and of these only 19% are arithmetic expressions; the rest represent simple memory-to-memory operations. In the other models, assignments amount, on the average, to 70% of all statements, with arithmetic expressions being more than 35% of the total. Another distinctive feature of GREYCODE is the small fraction of procedure calls (2.8%) and the very large number of branches (36%).
More significant are the results shown in Figure 4. The distribution of arithmetic and logical operations shows that GREYCODE is mainly an integer benchmark; almost 87% of the operations involve addition and comparison between integers. On the other models the percentage of floating-point operations is never less than 26%, and it reaches 60% for MOSAMP2. These statistics suggest that GREYCODE is not an adequate benchmark for testing scalar double-precision arithmetic. Much better input models for SPICE2G6 are BENCHMARK, DIGSR, or PERFECT.
6. MEASURING SIMILARITY BETWEEN BENCHMARKS
A good benchmark suite is representative of the "real" workload, but there is little point to filling a benchmark suite with several programs which provide similar loads on the machine. In this section we address the problem of measuring benchmark similarity by presenting two different metrics and then comparing them. The first one is based on the dynamic statistics presented earlier. The rationale behind this metric is that we expect programs which execute similar operations to produce similar run-time results; by results, we mean estimates of processing speed for a given machine. The second metric works from the other end; benchmarks which yield proportional performance on a variety of machines should be considered to be similar. Here performance refers to the real execution time as opposed to the predicted execution time.
Our results show that the two metrics are highly correlated; what is similar by one measure is generally similar by the other. Note that the first metric is easier to compute (we only have to measure each benchmark, rather than run it on each machine) and would thus be preferred.
6.1 Program Similarity Metric Based on Dynamic Statistics
To simplify the benchmark characterization and clustering, we have grouped the 109 AbOps into 13 "reduced parameters," each of which represents some aspect of machine implementation; these parameters are listed in Table XIII. Note that the reduced parameters presented here are not the same as those used in Saavedra-Barrera et al. ; the ones presented here better represent the various aspects of machine architecture. As we would expect for a language like Fortran, most of the parameters correspond to floating-point operations. Others are integer arithmetic, logical arithmetic, procedure calls, memory bandwidth, and intrinsic functions. Integer and floating-point division are assigned to a single parameter. AbOps that change the flow of execution--branches and do loop instructions--are also assigned to a single parameter. Note also that the 13 parameters are not orthogonal, and for some machines, two or more items may be identical because they use the same functional unit in the same way.
Table XIII. The 13 Reduced Parameters Used in the Definition of Program Similarity
1 Memory Bandwidth 2 Integer Division 3 Integer Multiplication 4 Single-Precision Addition 5 Single-Precision Multiplication 6 Double-Precision Addition 7 Double-Precision Multiplication 8 Division 9 Logical Operations 10 Intrinsic Functions 11 Procedure Calls 12 Address Computation 13 Branches and Iteration
The formula we use as the metric for program similarity is the squared Euclidean distance, where every dimension is weighted according to the average run-time contributed by that parameter, averaged over the set of all programs. Let A = ([A.sub.1], ... , [A.sub.n]) and B = ([B.sub.1], ... , [B.sub.n]) be two vectors containing the reduced statistics for programs A and B; then the distance between the two programs (d(A, B)) is given by
(4) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
where [W.sub.i] is the value of parameter i averaged over all machines.
We computed the similarity distance between all program pairs, except for SPICE2G6, for which only we only included the GREYCODE and PERFECT input data sets. The average distance between all programs is 1.1990 with a standard deviation of 0.8169. Figure 12 shows the clustering of programs according to their distances. Pairs of programs having distance less than 0.4500 are joined by a bidirectional arrow. The thickness of the arrow is related to the magnitude of the distance. The most similar programs are TRFD and MATRIX300 with a distance of only 0.0172. In the next five distances we find the pairwise relations between programs DYFESM, LINPACK, and ALAMOS. Programs TRFD, MATRIX300, DYFESM, and LINPACK have similarities that go beyond their dynamic distributions. These four programs have the property that their most executed basic blocks are syntactic variations of the same code (SAXPY), which consists in adding a vector to the product between a constant and a vector, as shown in the following statement:
X(I, J) = X(I,J) + A* Y(K, I).
[Figure 12 ILLUSTRATION OMITTED]
Note that IBM RS/6000 has a special instruction to speed up the execution of these types of statements. In that machine, a multiply-add instruction takes four arguments and performs a multiply on two of them, adds that product to the third argument, and leaves the result in the fourth argument. By eliminating the normalization and round operations between the multiply and add, the execution time of this operation is significantly reduced compared to a multiply followed by an add [Olsson et al. 1990].
Three clusters are present in Figure 12. One, with eight programs and containing LINPACK as a member, includes those programs that are dominated by single-precision floating-point arithmetic. Another cluster, also having eight programs, contains those benchmarks dominated by double-precision floating-point arithmetic. There is a subset of programs in this cluster containing programs TRFD, MATRIX300, NASA7, ARC2D, and TOMCATV, which form a five-node complete subgraph, in which all distances between pairs of elements are smaller than 0.4500. The smallest cluster, with three elements, contains those programs with both significant-integer and floating-point arithmetic. We also include in the diagram those programs whose smallest distance to any other program is larger than 0.4500. These are represented as isolated nodes with the value of their smallest distance indicated below the name.
6.1.1 Minimizing the Benchmark Set. The purpose of a suite of benchmarks is to represent the target workload. Within that constraint, we would like to minimize the number of actual benchmarks. Our results thus far show that most individual benchmarks are highly skewed with respect to their generation of abstract operations, but the clusters shown in Figure 12 suggest that subsets of the suites test essentially the same aspects of performance. Thus, an acceptable variety of benchmark measurements could be obtained with only a subset of the programs analyzed earlier. A still better approach would be to run only one benchmark, our machine characterizer. Note that since the machine characterizer measures the run-time for all AbOps, it is possible to accurately estimate the performance of any characterized machine for any AbOp distribution, without having to run any benchmarks. Such an AbOp distribution can be chosen as the weighted sum of some set of existing benchmarks, as an estimate of some target or existing workload or in any other manner.
6.2 Program Similarity and Benchmark Results
Our motivation in proposing a metric for program similarity in Section 6.1 was to identify groups of programs having similar characteristics; such similar programs should show proportional run-times on a number of different machines. In this section, we examine this hypothesis. First, we introduce the concept of benchmark equivalence.
Definition. If [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] is the execution time of program A on machine [M.sub.i], then two programs are benchmark equivalent if, for any pair of machines [M.sub.i] and [M.sub.j] the following condition is true:
(5) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
that is, the execution times obtained using program A differ from the execution times using program B, on all machines, by a multiplicative factor k:
(6) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
It is unlikely that two different programs will exactly satisfy our definition of benchmark equivalence. Therefore, we define a weaker concept, that of execution time similarity, to measure how far two programs are from full equivalence. Given two sets of benchmark results, we define the execution time similarity of two benchmarks by computing the coefficient of variation of the variable [Z.sub.A,B,i] = [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].(6) The coefficient of variation measures how well the execution times of one program can be inferred from the execution times of the other program.
As we did in Section 6.1, we can compute the similarity distance between programs, by using here the coefficient of variation as a metric computed from the execution times. In Figure 13 we show a clustering diagram similar to the one presented in Figure 12. The diagram shows three well-defined clusters. One contains basically the integer programs: SHELL, ERATHOSTENES, BASKETT, and SMITH. Another cluster is formed by MATRIX300, ALAMOS, LIVERMORE, and LINPACK. The largest cluster is centered around programs TOMCATV, ADM, DODUC, FL057, and NASA7, with most of the other programs connected to these clusters in an unstructured way.
[Figure 13 ILLUSTRATION OMITTED]
Now that we have defined two different metrics for benchmark similarity, one based on program characteristics (see Section 6.1), and the other based on execution time results, we can compare the two metrics to see if there exists a good correlation in the way they rank pairs of programs. We measure the level of significance using the Spearman's rank correlation coefficient ([p.sub.s]), which is defined as
(7) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
where [d.sub.i] is the difference of ranking of a particular pair on the two metrics. For our two similarity metrics the coefficient [p.sub.s] indicates that there is a correlation at a level of significance which is better than 0.00001.(7)
A scattergram of the two metrics is given in Figure 14; each point on the horizontal axis corresponds to the metric based on the dispersion of the execution time results while the vertical axis corresponds to the metric based on dynamic program statistics. Each "+" on the graph represents a pair of benchmark programs. The results indicate that there is a significant positive correlation between the two metrics at the level of 0.0001. Visually, we can notice that the two metrics correlate reasonable well, although clearly not perfectly. What this means is that if two benchmarks differ widely in the AbOps that they use most frequently, the chances are that they will give inconsistent performance comparisons between pairs of machines (relative to other benchmarks), and conversely. That is, if benchmarks A and B are quite different, benchmark A may rate machine X faster than Y, and benchmark B may rate Y faster than X. This suggests that our measure of program similarity is sufficiently valid that we can use it to eliminate redundant benchmarks from a large set.
[Figure 14 ILLUSTRATION OMITTED]
6.3 Limitations of Our Model
There are some limitations when using linear high-level models and software experiments to characterize machine performance. Here we briefly mention the most important of them. For a more in-depth discussion see Saavedra and Smith [1995a; 1995b], Saavedra-Barrera [1988; 1992], and Saavedra et al. .
The main sources of error in the results from our model can be grouped in two classes. The first corresponds to elements of the machine architecture which have not been captured by our model. The model described here does not account for cache or TLB misses; an extension to our model is presented in Saavedra-Barrera  and Saavedra and Smith [1995b] which adds this factor. We do not successfully capture aspects of machine architecture which are manifested only by the performance of certain sequences of AbOps, and not by a single AbOp in isolation--e.g., the IBM RS/6000 multiply-add instruction; we discuss this further below. We are not able to account for hardware or software interlocks, nonlinear interactions between consecutive machine instructions [Clapp et al. 1986], the effectiveness of branch prediction, and the effect on timing of branch distance and direction. We have also not accounted for specialized architectural features such as vector operations and vector registers.
Another source of errors corresponds to limitations in our measuring tools and factors independent from the programs measured: resolution and intrusiveness of the clock, random noise, and external events (interrupts, page faults, and multiprogramming) [Currah 1975].
It is also important to mention that the model and the results presented here reflect only unoptimized code. As shown in Saavedra and Smith [1995a], our model can be extended with surprising success to the prediction of the running times of optimized codes.
It is worth making specific mention of recent trends in high-performance microprocessor computer architecture. The newer machines, such as the IBM RS/6000 [Groves and Oehler 1990], can issue more than one instruction per cycle; such machines are called either Superscalar or VLIW (very long instruction word), depending on their design. The observed level of performance of such machines is a function of the actual amount of concurrency that is achieved. The level of concurrency is itself a function of which operations are available to be executed in parallel and whether those operations conflict in their use of operands or functional units. Our model considers abstract operations individually and is not currently able to determine the achieved level of concurrency. Much of this concurrency will also be manifested in the execution of our machine characterizer--i.e., on a machine with concurrency, we will measure faster AbOp times. Thus on the average we should be able to predict the overall level of speedup. Unfortunately, this accuracy on the average need not apply to predictions for the running times of individual programs. In fact this is what we observed in the case on the IBM RS/6000 530. In this machine the standard deviation of the errors is 21%, which is the largest among all machines measured.
The other "new" technique, superpipelining, does not introduce any new difficulties. Superpipelining is a specific type of pipelining in which one or more individual functional units are pipelined; for example, more than one multiply can be in execution at the same time. Superpipelining introduces the same problems as ordinary pipelining, in terms of pipeline interlocks, and functional unit and operand conflicts. Such interlocks and conflicts can only be analyzed accurately at the level of a model of the CPU pipeline.
7. SUMMARY AND CONCLUSIONS
In this article we have discussed program characterization and execution time prediction in the context of our abstract machine model. These two aspects of our methodology allow us to investigate the characteristics of benchmarks and compute accurate execution time estimates for arbitrary Fortran programs. The same approach could be used for other algebraic languages. In most cases, however, a larger number of parameters will be needed, and some special care should be taken in the characterization of library functions whose execution is input dependent, e.g., string library functions in C.
There are a number of results from and applications of our research: (1) our methodology allows us to analyze the behavior of individual machines and identify their strong and weak points, (2) we can analyze individual benchmark programs, determine what operations they execute most frequently, and accurately predict their running time on those machines which we have characterized, (3) we can determine "where the time goes," which aids greatly in tuning programs to run faster on specific machines, (4) we can evaluate the suitability of individual benchmarks, and of sets of benchmarks, as tools for evaluation, and we can identify redundant benchmarks in a set, and (5) we can estimate the performance of anticipated workloads on real machines, of real workloads on proposed machines, and of anticipated workloads on proposed machines.
As part of our research, we have presented extensive statistics on the SPEC and Perfect Club benchmark suites and have illustrated how these can be used to identify deficiencies in the benchmarks.
Related work appears in Saavedra and Smith [1995a], in which we extend our methodology to the analysis of optimized code, and in Saavedra and Smith [1995b], in which we extend our methodology to consider cache and TLB misses. See also Saavedra-Barrera et al. , which concentrates on machine characterization, and Saavedra-Barrera , which contains more extensive measurements than space permits here.
We would like to thank Ken Stevens, Jr. and Eugene Miya for providing access to facilities at NASA Ames, as well as David E. Culler and Luis Miguel who let us run our programs in their machines. We also thank Vicki Scott from MIPS Co., who assisted us with the SPEC benchmarks, and Oscar Loureiro and Barbara Tockey, who made useful suggestions.
(1) The original version of TRACK found in the Perfect Club benchmarks exhibited several execution histories due to an inconsistency in the passing of constant parameters. The version that we used in this article does not have this problem.
(2) The TLB is considered also a level in the hierarchy, even when in reality it is not.
(3) If the length of the loop is larger than the vector length, extra code for strip mining the original loop is also added.
(4) This is the only table in the article where optimized results are used.
(5) It is clear that the same argument cannot be applied to vectorization or parallelization.
(6) Programs that are benchmark equivalent will have zero as their coefficient of variation.
(7) In computing the rank correlation coefficient we use the same set of program pairs for both metrics. The number of pairs for which there are enough benchmark results to compute the coefficient of variation is only half the total number of pairs.
ALLEN, F., BURKE, M., CHARLES, P., CYTRON, R., AND FERRANTE, J. 1987. An overview of the PTRAN analysis system for multiprocessing. In Proceedings of the Supercomputing '87 Conference. ACM, New York.
BACON, D. F., GRAHAM, S. L., AND SHARP, O. L. 1994. Compiler transformations for high-performance computing. ACM Comput. Surv. 26, 4 (Dec.), 345-420.
BAILEY, D. H. AND BARTON, J. T. 1985. The NAS kernel benchmark program. NASA Tech. Memo. 86711, NASA, Ames, Iowa. Aug.
BALASUNDARAM, V., Fox, G., KENNEDY, K., AND KREMER, U. 1991. A static performance estimator to guide data partitioning decisions. In the 3rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, New York, 213-223.
BALASUNDARAM, V., KENNEDY, K., KREMER, U., MCKINLEY, K., AND SUBHLOK, J. 1989. The ParaScope Editor: An interactive parallel programming tool. In Proceedings of the Supercomputing '89 Conference. ACM, New York.
BEIZER, B. 1978. Micro Analysis of Computer System Performance. Van Nostrand, New York.
CLAPP, R. M., DUCHESNEAU, L., VOLZ, R. A., MUDGE, T. N., AND SCHULTZE, T. 1986. Toward real-time performance benchmarks for ADA. Commun. ACM 29, 8 (Aug.), 760-778.
CURNOW, H. J. AND WICHMANN, B.A. 1976. A synthetic benchmark. Comput. J. 19, 1 (Feb.), 43-49.
CURRAH, B. 1975. Some causes of variability in CPU time. Comput. Meas. Eval. 3, 389-392.
CYBENKO, G., KIPP, L., POINTER, L., AND KUCK, D. 1990. Supercomputer performance evaluation and the Perfect benchmarks. Tech. Rep. 965, Center for Supercomputing Research and Development, Univ. of Illinois, Urbana-Champaign, Ill. Mar.
DODUC, N. 1989. Fortran execution time benchmark. Version 29. Departement Informatique, Framantec, France. Unpublished manuscript. Mar.
DONGARRA, J. J. 1988. Performance of various computers using standard linear equations software in a Fortran environment. Comput. Arch. News 16, 1 (Mar.), 47-69.
DONGARRA, J. J., MARTIN, J., AND WORLTON, J. 1987. Computer benchmarking: Paths and pitfalls. Computer 24, 7 (July), 38-43.
GEE, J. AND SMITH, A.J. 1993. TLB performance of the SPEC benchmark suite. Univ. of California, Berkeley, Calif. Unpublished manuscript.
GEE, J., HILL, M. D., PNEVMATIKATOS, D. N., AND SMITH, A.J. 1991. Cache performance of the SPEC benchmark suite. IEEE Micro 13, 4 (Aug.), 17-27. Early version appears as Tech. Rep. UCB/CSD 91/648, Dept. of Computer Science, Univ. of California, Berkeley, Calif., Sept. 1991.
GROVES, R. D. AND OEHLER, R. 1990. RISC System/6000 processor architecture. In IBM RISC System/6000 Technology. SA23-2619, IBM Corp., Armonk, N.Y., 16-23.
HICKEY, T. AND COHEN, J. 1988. Automatic program analysis. J. ACM 35, 1 (Jan.), 185-220.
KNUTH, D. E. 1971. An empirical study of Fortran programs. Softw. Pract. Exper. 1, 105-133.
MCMAHON, F. H. 1986. The Livermore Fortran kernels: A computer test of the floating-point performance range. Rep. UCRL-53745, Lawrence Livermore National Laboratories, Livermore, Calif. Dec.
MIPS. 1989. MIPS UNIX Benchmarks. Perf. Brief CPU Benchmarks 3.8 (June).
OLSSON, B., MONTOYE, R., MARKSTEIN, P., AND NGUYENPHU, M. 1990. RISC System/6000 floating-point unit. In RISC System/60000 Technology. SA23-2619, IBM Corp., Armonk, N.Y., 34-43.
PUETO, B. L. AND SHUSTEK, L. J. 1977. An instruction timing model of CPU performance. In the 4th Annual Symposium on Computer Architecture. ACM, New York, 165-178.
PONDER, C. G. 1990. An analytical look at linear performance models. Tech. Rep. UCRL-JC-106105, Lawrence Livermore National Laboratories, Livermore, Calif. Sept.
RAMAMOORTHY, C. V. 1965. Discrete Markov analysis of computer programs. In Proceedings of the ACM National Conference. ACM, New York, 386-392.
SAAVEDRA, R. H. AND SMITH, A.J. 1992. Analysis of benchmark characteristics and benchmark performance prediction. Tech. Rep. USC-CS-92-524, Univ. of Southern California, Los Angeles. Extended version appears as Tech. Rep. UCB/CSD 92/715, Univ. of California, Berkeley, Calif., 1992, Dec.
SAAVEDRA, R. H. AND SMITH, A.J. 1995a. Benchmarking optimizing compilers. IEEE Trans. Softw. Eng. 21, 7 (July), 615-628.
SAAVEDRA, R. H. AND SMITH, A. J. 1995b. Measuring cache and TLB performance. IEEE Trans. Comput. 44, 10 (Oct.), 1223-1235.
SAAVEDRA-BARRERA, R. H. 1988. Machine characterization and benchmark performance prediction. Tech. Rep. UCB/CSD 88/437, Univ. of California, Berkeley, Calif. June.
SAAVEDRA-BARRERA, R. H. 1992. CPU performance and evaluation time prediction using narrow spectrum benchmarking. Ph.D. thesis, Tech. Rep. UCB/CSD 92/684, Univ. of California, Berkeley, Calif. Feb.
SAAVEDRA-BARRERA, R. H. AND SMITH, A.J. 1990. Benchmarking and the abstract machine characterization model. Tech. Rep. UCB/CSD-90-607, Univ. of California, Berkeley, Calif. Nov.
SAAVEDRA-BARRERA, R. H., SMITH, A. J., AND MIYA, E. 1989. Machine characterization based on an abstract high-level language machine. IEEE Trans. Comput. 38, 12 (Dec.), 1659-1679.
SARKAR, V. 1989. Determining average program execution times and their variance. In Proceedings of the SIGPLAN '89 Conference on Programming Language Design and Implementation. ACM, New York, 298-312.
SPEC. 1989a. SPEC Newslett. Benchmark Results 2, 1 (Winter).
SPEC. 1989b. SPEC Newslett. Benchmark Results 2, 2 (Spring).
SPEC. 1990a. SPEC Newslett. Benchmark Results 3, 1 (Winter).
SPEC. 1990b. SPEC Newslett. Benchmark Results 3, 2 (Spring).
UCB. 1987. SPICE2G. 6. EECS/ERL Industrial Liaison Program, Univ. of California, Berkeley, Calif. Mar. Software.
VON WORLEY, S. AND SMITH, A.J. 1995. Microbenchmarking and performance prediction for parallel computers. Tech. Rep. UCB/CSD-95-873, Univ. of California, Berkeley, Calif. May.
WEICKER, R. P. 1988. Dhrystone benchmark: Rationale for version 2 and measurement rules. SIGPLAN Not. 23, 8 (Aug.).
WORLTON, J. 1984. Understanding supercomputer benchmarks. Datamation 30, 14 (Sept. 1), 121-130.
The material presented here is based on research supported principally by NASA under grant NCC2-550, and in part by the National Science Foundation under grants MIP-8713274, MIP-9116578, and CCR-9117028, by the State of California under the MICRO program, and by IBM Corp., Philips Laboratories/Signetics, Apple Computer Corp., Intel Corp., Mitsubishi Electric Research Laboratories, Sun Microsystems, and Digital Equipment Corporation. Authors' addresses: R. H. Saavedra, Computer Science Department, Henry Salvatori Computer Science Center, University of Southern California, Los Angeles, CA 90089-0781; email: saavedra@cs, usc.edu; A. J. Smith, Computer Science Division, EECS Department, University of California, Berkeley, CA 94720-1776.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
|Printer friendly Cite/link Email Feedback|
|Author:||Saavedra, Rafael H.; Smith, Alan J.|
|Publication:||ACM Transactions on Computer Systems|
|Date:||Nov 1, 1996|
|Previous Article:||Implementation and performance of integrated application-controlled file caching, prefetching, and disk scheduling.|
|Next Article:||Diffracting trees.|