Does your workstation computation belong on a vector supercomputer?
We devote one section of this article to each of our five questions: 1. Would you pay 10 times as much to solve your problem 30 times faster? 2. Can your code be easily vectorized? 3. Will your program be used enough to warrant hand-optimization? 4. Can you afford to retune your software's floating-point arithmetic? 5. Does your application require more than 200MB but less than 16GB of primary memory?
A final section summarizes our framework for the "porting decision problem," and indicates how it might be extended to the problem of deciding whether to port workstation code to a massively parallel supercomputer such as the CM-5.
We developed our list of questions to save others from wasting time, as we did, porting unsuitable code from a workstation to a supercomputer.
As indicated in the first three sidebars of this article ("Speedups on a Dusty Deck," "Further Adventures with a Dusty Deck," and "Speedups on CPlex"), none of our experiments or projections show more than a 30-fold speed advantage in moving our code to a Cray. This relatively small improvement in speed justifies neither the cost of the port, nor the relative expense of a Cray computation. Although we get our Cray cycles "for free" (via a straightforward grant-writing process), we are interested in proving the economic feasibility of our wire-routing method. Thus we view the 10-fold increase in actual cost of solving a wire-routing problem rapidly on a Cray, rather than 30 times more slowly on a workstation, as a deciding factor in our comparison.
In contrast, Robert Bixby of Rice University is often quite satisfied to use a Cray, instead of his Sun-4 workstation, to solve his linear programs. Economic feasibility is not of particular concern to him, but the 30- to 40-fold speedup he typically observes is enough to make a dramatic difference in the quantity and quality of his research results.
The preceding discussion has centered on speedup and cost ratios. These are indeed important considerations, but they are not the only ones. For example, Bixby cites the limited memory capacity of a Sun-4 as a compelling reason for doing most of his research on a Cray. We treat memory capacity and several other important issues in our five-question framework for deciding "to port or not to port?"
Would You Pay 10 Times as Much to Solve Your Problem 30 Times Faster?
We have never observed more than a 20-fold speedup in solving our linear programs when comparing a workstation to a Cray. Your code may be more amenable to speedup than ours. Still, we doubt you will see as much as a 30-fold speedup without an expensive and time-consuming recoding. In the following section, we argue for a 30-fold limit on speedups; if you accept this value for now, we can make some broad-brush economic arguments.
A vector supercomputer costs between 1 and 25 million dollars, depending on the number of processors. A uniprocessor workstation costs between 5,000 and 50,000 dollars. To narrow this range somewhat, and to make the discussion more concrete, let's compare a $10,000 (list price) SPARCstation IPX with a $3 million (list price) single-processor Cray Y-MP system. The SPARC is configured with a 16-in. color monitor, a 400MB disk, 16MB of dynamic RAM (DRAM), and a 40MHz CPU. The Cray Y-MP has a 167MHz CPU, 40GB of disk, and 128MB of static RAM (SRAM). Computer list prices and discounting practices change at least as rapidly as the hardware and software being sold, which is to say these late-1992 price/performance datapoints will be obsolete by the time this article is printed. Still, they indicate that the ratio between the purchase costs of single-CPU vector supercomputer systems and "low-end" workstations is approximately 300.
This 300-fold difference in purchase cost can be used to infer a 300-fold difference in operating cost, by reasoning along the following lines. Both workstations and supercomputers are functionally obsolete within five years, so that yearly depreciation and finance charges alone are 20- to 30% of the purchase cost. Yearly budgets for operating personnel, software, and hardware maintenance each add another 5- to 20% of the hardware purchase cost to the yearly operating expense, depending on the level of service provided. As a rough estimate, then, yearly computer costs are 35- to 90% of the purchase price. An hour of CPU time on a supercomputer, then, costs between $120 and $350. Workstation time costs about 300 times less: $10 to $30 per day.
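The reasoning above reduces to a back-of-the-envelope calculation. The sketch below reproduces the hourly and daily figures, taking the article's rough 35-to-90% yearly-cost range as its assumption:

```python
# Operating-cost estimate from the text's rough figures: yearly cost is
# 35% to 90% of purchase price (depreciation, staff, software, maintenance);
# dividing by the hours in a year gives an hourly rate.
HOURS_PER_YEAR = 365 * 24

def hourly_cost(purchase_price, yearly_fraction):
    return purchase_price * yearly_fraction / HOURS_PER_YEAR

# $3 million single-processor Cray Y-MP system
cray_low = hourly_cost(3_000_000, 0.35)    # ~ $120 per hour
cray_high = hourly_cost(3_000_000, 0.90)   # ~ $310 per hour

# $10,000 SPARCstation IPX, expressed per day
ipx_low = hourly_cost(10_000, 0.35) * 24   # ~ $10 per day
ipx_high = hourly_cost(10_000, 0.90) * 24  # ~ $25 per day
```

Any reader with better local figures can substitute them directly; the 300-fold ratio is what carries through.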
Our economic argument is irrelevant for certain users, who for one reason or another have "free" or, rather, subsidized access to workstations and/or Crays. Some organizations have excess computing capacity in workstations or supercomputers or both. Furthermore, some researchers and supercomputer centers are subsidized by current U.S. R&D policy. (One rationale for this policy is that uneconomical use of supercomputers today may lead to greater economies in the future.)
We therefore encourage you to insert a more accurate value, if you have one, for our nominal ratio r = 300 of the cost of an hour of CPU time on a supercomputer divided by the cost of an hour on a workstation. The question for this section should then be revised to read "Would you pay r/30 times as much to solve your problem 30 times faster?"
Our question is by no means rhetorical. In some applications it is perfectly appropriate to pay a very large premium for fast answers. Consider the problem of weather prediction. If you developed a code that enables a Sun workstation to predict tomorrow's weather in just 24 hours ($10 to $30) of run time, your code would have no commercial value. However, if your code ran just 8 times faster (i.e., in 3 hours on a Cray), you could sell your weather predictions for a tidy profit, even after paying $360 or even $1,000 per prediction for your Cray run time.
In our case, we are able to justify the increased cost of a Cray computation only if we make the most optimistic assumptions for Cray cost and problem size. We believe our end-user population will be paying essentially full cost for their computations. They would not be willing to spend more than about $1,000 for a solution to a wire-routing problem, and would not be willing to wait more than a week for the results. One week ($70 to $210) of Sun-4 time, assuming a 30-fold speedup, is equivalent to about six CPU-hours of supercomputer run time. If we paid $120 per CPU-hour for Cray time, then a 6-hour supercomputer computation would cost about $720. If, however, our supercomputer costs more than this, or if our speedup is less than 30, then we will surely exceed our $1,000-per-computation limit even on the largest workstation-feasible problems. Smaller problems will, of course, run more economically on a workstation than on a supercomputer. We therefore answer our first question with a qualified "no, but it may be a close call for our largest problems, if we can find inexpensive Cray cycles and if we indeed get a 30-fold speedup."
Bixby, in contrast, is often content to spend 150 seconds of Cray run time, instead of 4,500 seconds of Sun-4 run time, to solve a difficult problem. When solving a large problem with integrality constraints, for example, he will solve a series of associated linear programming (LP) problems. The Cray turnaround is short enough that he can monitor the solution process: if the current batch of LPs is not being solved rapidly, he can abort it, adjust some algorithmic parameters, and start another batch. This adjustment can yield significant savings in total Cray run time, compared with running the same series of LPs without adjustment. On a slower machine, he says, "what usually happens is that nothing at all gets done in terms of experimenting with the LPs." In other words, the native 30- to 40-fold speedup of the Cray on his large-scale LPs can be magnified significantly by on-line algorithmic adjustments to a batched series of problems. The same adjustments are, of course, available to workstation users, but it is much more disruptive to make once-per-hour adjustments to workstation codes throughout a working day than it is to spend a concentrated block of time making similar adjustments to supercomputer codes. Thus Bixby answers our first question with an emphatic "yes, for much of my work."
Can Your Code be Easily Vectorized?
By following the advice in this section, you should be able to determine if you would obtain more than a 30-fold speedup if you moved your workstation code to a supercomputer. If you decide you will obtain an s-fold speedup, then you should rewrite the question asked in the preceding section to read: "Would I pay 300/s times as much to solve my problem s times faster?"
In a nutshell, the reason it is difficult to obtain large speedups is that code written for use on scalar machines, such as a workstation or a mainframe, will not take full advantage of the specialized functional units of a supercomputer. To gain the potential 1,000-fold run-time advantage of a parallel supercomputer, programs must be vectorized [14, p. 81], and then parallelized. (See the sidebars on "Vectorizable Code" and "Parallelizable Code" for more detail.)
Parallelization is another technique for code acceleration. If a program is perfectly parallelizable, one could run it on k processors in about 1/kth the time, at about the same cost. Thus users sensitive to both cost and run time should try to modify their code to run on many workstations in parallel. Users who are more worried about run time than cost should rewrite their code so it will run efficiently on a massively parallel supercomputer. It is sometimes possible to write code that will run efficiently on both types of architectures, using, for example, the C-language macros developed at Argonne National Laboratory. The latest version of these macros is available by anonymous ftp from info.mcs.anl.gov, in directory pub/p4.
Although theory and practice in the field of parallel algorithms are developing rapidly, we believe it is generally more difficult to achieve efficient parallelism than efficient vectorization. Indeed, supercomputer parallelism is generally attempted only after efficient vectorization is obtained. This returns us to the main question of this article: Can your computation make efficient use of a single processor on a vector supercomputer?
Vectorization can be accomplished in two ways: The simplest approach is to compile your unmodified source code on a supercomputer's vectorizing compiler, and hope for good results. The second, and much more expensive, method of vectorization is to employ a skilled programmer to rewrite portions of the source code. We discuss the limitations of vectorizing compilers in this section, deferring our discussion of code rewriting to the next section.
Despite major advances in compiler technology, any of the following programming constructs hampers the vectorization of loops: recursion, subroutine calls, I/O statements, assigned GOTO's, certain nested "if" statements, GOTO's that exit the loop, backward transfers within the loop, and references to nonvectorized external functions [14, p. 90]. Such constructs are common, and relatively innocuous, outside of inner loops. When they appear in an inner loop, however, that loop will not be vectorizable.
Code that is not vectorizable will not run much faster on a supercomputer than on a workstation. For example, John McCalpin of the University of Delaware has developed a benchmark that is typical of numerical models in meteorology/oceanography/geophysical fluid dynamics. Running it on dozens of machines, he finds that a Cray Y-MP executes his code about three times faster than does an IBM RS-6000 Model 530, about seven times faster than a DECstation 3100, and about 20 times faster than a Sun-4/260. The code he wrote is completely vectorizable; however, it calls a library routine (for solving an elliptic equation on a 101 x 41 finite difference grid) that is not vectorizable. The nonvectorizable library routine accounts for about 30% of the total CPU operations. As will be argued, a code that is 30%, or even 20%, nonvectorizable will not see much vector speedup.
It may surprise some Communications readers to learn that a modern workstation can execute scalar code about as fast as does a supercomputer, which is to say at 10 to 100 million instructions per second. Vector supercomputers are limited to this speed whenever the operands and the machine instructions cannot be prefetched. On nonvectorizable code, the limiting factor is the 10 nanoseconds (ns) or so that it takes to fetch an arbitrary operand from the fastest layer of the random-access memory hierarchy, universally implemented in SRAM in today's architectures.
One way to decide if your code is a candidate for vectorization is to profile its execution time on your workstation. The assumption here is that code outside your inner loops is very unlikely to be vectorizable; as we will see, some inner loops are not vectorizable either. In any event, if your workstation spends less than 80% of its time executing the inner loops of your code, then it is very unlikely to be sped up appreciably by a vectorizing compiler. This prediction is based on classic work by G. Amdahl. A simplified version of Amdahl's law states that, if vectorizable code is executed instantaneously by an "ideal" vector supercomputer, and if nonvectorized code is not sped up at all, then the total speedup is 1/x. Here, x is the fraction of time spent by the workstation in nonvectorized code. For McCalpin's code, described previously, x is at least 0.30, so the maximum speedup obtainable on any vector supercomputer over a fast workstation (such as an RS-6000) is not much more than three. Indeed, this is what he observed.
A slightly more detailed analysis will allow one to predict speedups when moving from a relatively slow workstation, such as a Sun SPARCstation 1, to a vector supercomputer. In this case, we can assume the code outside your inner loops might be accelerated by at most a factor of six. On an idealized supercomputer, your inner loops would be executed instantaneously, but your total speedup can be no more than 6/(20%) = 30, if 20% of your workstation run time is due to computations outside your inner loops. This is a good explanation for the 20- to 30-fold speedups we have observed for some workstation codes.
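Both versions of this Amdahl-style estimate fit in a few lines. In the sketch below, the factor of 6 for scalar code on a slow workstation is the article's rough assumption, not a measured constant:

```python
# Simplified Amdahl's-law speedup bounds.
# x = fraction of workstation run time spent in nonvectorizable code.

def ideal_speedup(x):
    """Vectorizable code runs instantaneously; the rest is not sped up."""
    return 1.0 / x

def refined_speedup(x, scalar_gain=6.0):
    """Inner loops run instantaneously; code outside them is sped up by
    scalar_gain (about 6 when starting from a slow workstation such as a
    SPARCstation 1)."""
    return scalar_gain / x

# McCalpin's code: 30% of CPU operations in a nonvectorizable routine.
print(ideal_speedup(0.30))    # ~3.3, matching the observed factor of three

# A slow workstation spending 20% of its time outside the inner loops:
print(refined_speedup(0.20))  # 30, the article's rough upper bound
```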
As evidence of the predictive power of this simple model, consider Mohan Ramamurthy's experience with running six floating-point-intensive Fortran codes for solving problems in geophysical fluid dynamics. One code is 95% scalar, yielding only a 3-fold, 5-fold, and 12-fold speedup of a single-processor Cray Y-MP over an IBM RS-6000/320, a DECstation 3100, and a SPARCstation 1, respectively. The geometric average of the speedups across the six codes was 10, 20, and 50, respectively. The most vectorizable code yielded Cray speedups of 25-, 100-, and 200-fold over these three workstations. This data indicates that large speedups are indeed available for some workstation codes, especially if they are run on relatively slow workstations. However, note that the largest Cray/RS-6000 speedup is 25.
Another user who has found only small speedups is W. David Higgins of Harris Computer Systems. He measured the speedup of a single Cray-2 processor, in comparison to an IBM RS-6000/540 and several other workstations, on a finite element analysis code named Ansys. This code is a product of Swanson Analysis Systems Inc. of Houston, Pennsylvania. Higgins observed a speedup of 2.5 on a benchmark named SP-3 containing 100 8-node solid elements. The speedup increased to 6.4 on LS4, a "large problem requiring 600MB of disk." (Some readers may be curious to know Higgins's data for several high-end PCs on the SP-3 benchmark. He found that an ALR 486/33MHz ran 12 times slower than the Cray-2 on his SP-3 benchmark; an HP 486/25MHz ran 19 times slower; and a Gateway 386/33MHz ran 28 times slower. These larger speedups are a reflection of the lower serial execution rate of an Intel 486 processor, in comparison to an RS-6000 or a Cray.)
Our experience with the linear program solver XMP, detailed in the two sidebars on "Dusty Decks," is perhaps typical of users with "dusty deck" Fortran codes. As indicated in the sidebar "Speedups on CPlex," even after switching to a faster solver named CPlex, we still see no reason to run our application on a supercomputer.
You can use a workstation utility known as a "profiler" to estimate the percentage of CPU time spent in the inner loop of your code. If this percentage is less than 80%, you might consider rewriting your code, as discussed in the next section of this article. If, on the other hand, this percentage is high, you are still not assured of having vectorizable code. From the point of view of vectorization, an ideal inner loop is one that
* only references a few different items of data in each iteration
* performs simple array indexing
* has no data-conditional branches
If your inner loop passes all these tests, it will probably be vectorizable (the sidebar on "Vectorizable Code" provides further information).
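To illustrate the checklist, here is a sketch with Python standing in for Fortran loop structure; the loops themselves, not the language, are the point:

```python
# The first loop meets the checklist: simple array indexing, only a few
# operands per iteration, no data-conditional branches, and each iteration
# independent of the others. A vectorizing compiler could issue the
# equivalent Fortran loop as vector operations.
n = 1000
a = [0.0] * n
b = [float(i) for i in range(n)]
c = [2.0] * n

for i in range(n):         # vectorizable: a(i) = b(i) * c(i) + 1
    a[i] = b[i] * c[i] + 1.0

# This loop carries a dependence from one iteration to the next (a
# "recursion," in vectorization jargon), so it cannot be vectorized as is.
s = [0.0] * n
for i in range(1, n):      # not vectorizable: s(i) = s(i-1) + b(i)
    s[i] = s[i - 1] + b[i]
```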
In summary, we assert that most codes will not be sped up by more than a factor of 30 when moved from a workstation to a single-processor vector supercomputer. When run on a parallel supercomputer, larger speedups are possible on some very simple inner loops with huge iteration counts. The easiest way to determine the speedup of your code is to compile it on a supercomputer, then run it on a representative set of test data. Most supercomputer centers are willing to let a potential client make a test run of this form, at nominal cost.
Will Your Program be Used Enough to Warrant Hand-Optimization?
As noted in the preceding section, global compiler directives are often not sufficient to give much (if any) speedup. Hand-optimization is necessary to make efficient use of a supercomputer's functional units, and thereby approach its full speed advantage over a scalar processor.
The time required to hand-optimize a program will depend on the size of the utilized code [14, pp. 6-7], the style in which it was written, and the programmer's familiarity with it. By style we mean, "Does the program contain good comments and descriptive variables?"
Of course, the expense of programmer-mediated optimization must be offset by the savings gained from the decrease in execution time. In the worst case, it will take months or years to hire and train programmers to rewrite, document, and test a few thousand lines of code. In the best case, one's supercomputer center consultants may be able to point out a simple modification to your inner loop, achieving a factor-of-10 speedup in a matter of days.
In our case, examination of the XMP source code showed its inner loop to be a prime example of "dusty deck" Fortran. No obvious fixes suggested themselves. The development of an efficient code for solving linear programs on a supercomputer is a major undertaking, one that very few organizations can or should attempt, in our opinion.
This brings us to our final bit of advice: Before rewriting your own code, we strongly recommend you make a thorough search for similar code that has already been vectorized and/or parallelized. Should this search be successful, you stand to save a lot of time, money, and aggravation.
Can You Afford to Retune Your Software's Floating-Point Arithmetic?
Our fourth question addresses the profound differences between, and general inferiority of, floating-point arithmetic on a Cray vector supercomputer compared with that on a workstation. All modern workstations follow the IEEE 754 floating-point standard. All currently marketed Cray vector supercomputers (i.e., the Cray-2, the Cray X-MP, and the Cray Y-MP) use non-IEEE floating-point formats. Significantly, Cray's recently announced massively parallel architecture, based on the Alpha workstation processor chip, will support IEEE-standard floating-point arithmetic.
In general, you should expect to encounter little or no difficulty with floating-point computations when porting your code from one IEEE-standard workstation to another. If, however, your code is sensitive to floating-point roundoff errors, you may have trouble porting it between workstations, and major headaches when porting it to a vector supercomputer.
The Cray floating-point storage format has 1 sign bit, 15 exponent bits, and 48 mantissa bits. The most significant bit is stored explicitly in the leftmost mantissa bit, giving 48 bits of precision. The IEEE-standard double-precision storage format has 1 sign bit, 11 exponent bits, and 52 mantissa bits. Here, the most significant mantissa bit is implied, giving the user 53 bits of precision, fully 5 bits more storage precision than is available on the Cray-2.
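A portable way to see this difference in practice is to probe the machine's unit roundoff at run time. On an IEEE-standard machine the sketch below finds 2^-52; on a 48-bit Cray format we would expect it to stop near 2^-47 instead (an expectation, not a measurement, since Cray arithmetic is not correctly rounded):

```python
# Probe the precision of the host's floating-point arithmetic: find the
# smallest power of two that still changes 1.0 when added to it.
def machine_epsilon():
    eps = 1.0
    while 1.0 + eps / 2.0 != 1.0:
        eps /= 2.0
    return eps

print(machine_epsilon())  # 2**-52 on an IEEE-754 double-precision machine
```

A probe of this kind is also a quick first check of whether a "ported" numerical code is seeing the arithmetic its tolerances were tuned for.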
The IEEE standard specifies that the result of an arithmetic operation should be the same as if it were computed exactly and then rounded using the current rounding mode. Irritatingly, the standard allows several rounding and truncation modes to be in effect when storing a floating-point number. Caveat porter!
The definitional situation is more complex on a Cray. One set of definitions is employed on the Cray X-MP, Y-MP, and Y-MP C90. A second set is used on the Cray-2. For example, the Cray Y-MP floating-point multiplication algorithm has "near-zero average rounding error," with "variations due to truncation and rounding" in the range -0.23 x 2^-48 to +0.57 x 2^-48. Thus the effective precision of Cray Y-MP arithmetic is slightly less than the 48 bits available in the storage format.
In computations where precision is not crucial, workstation users may employ the single-precision IEEE-standard floating-point storage format. In this format, a floating-point number occupies only 32 bits of storage, but its range and precision are limited. On most modern workstations, computations on single-precision numbers proceed more rapidly than those on double-precision numbers, especially if bandwidth to and from memory is a bottleneck.
We note, in passing, that very-high-precision floating-point arithmetic is available on some workstations (such as the DECstation 3100) as well as on most supercomputers. Such "extended-precision" computations can carry 100 bits of precision, but typically run 4 to 10 times more slowly.
In addition to the precision advantage just outlined, the IEEE standard provides a straightforward method for handling exceptional operations, such as dividing by 0 or taking the square root of a negative number. These exceptional operations yield results that, when printed, are easily understood. The alternative methods of handling exceptions seen on Crays (and on older mainframes) are an easily lost or misinterpreted error message or, possibly, an abnormal termination of your job. The latter is irritating, to say the least, when the exceptional operation would not have affected the major results of your run.
In our application, we observed that the XMP linear program package was unable to solve a 1,500-constraint problem when it was compiled and run on a Cray-2. The identical XMP source code was able to solve this problem when compiled and run on the Sun-3, the Sun-4, and the DECstation 3100. We surmise that the loop termination conditions in the XMP package were tuned for IEEE-standard arithmetic, and would require some modification to work adequately on the Cray-2. If we were still interested in porting XMP to a Cray, we would contact its supplier for advice on modifying the code.
We conclude that the peculiarities and relative imprecision of non-IEEE floating-point arithmetic might make it difficult to port a workstation code to a vector supercomputer.
Does Your Application Require More Than 200MB, but Less than 16GB, of Primary Memory?
Most workstations have 8- or 16MB of primary memory. A fully configured state-of-the-art workstation such as a DECstation 3100 or an IBM RS-6000 can have up to 256MB of primary memory. If you need more main memory than this, you should consider moving to a Cray-2, which can have up to 16GB of primary memory. However, if your code makes random access to more than 16GB, you belong on a workstation or possibly a mainframe. The reason is that an inexpensive CPU can keep its disks busy much more cost-effectively than can a Cray.
The importance of matching application size to primary memory space cannot be overstated. If your program frequently accesses a data structure that is larger than its main memory, its run time will be limited by your computer's disk transfer time. A slowdown for this reason is called "disk thrashing." In the worst case, your program might make random accesses to single words in a data structure stored on its disk. If this occurs, your workstation could run 100,000 times more slowly than if its data structure fit into its primary memory: a random access to fast semiconductor RAM takes about 100ns, while a random access to a fast disk takes about 10 milliseconds.
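The worst-case penalty is simply the ratio of the two access times; a two-line sketch makes the five orders of magnitude explicit:

```python
# Worst-case thrashing penalty: one random word fetched from disk
# versus one fetched from semiconductor RAM (nominal 1992-era figures).
ram_access_ns = 100             # ~100 ns per random RAM access
disk_access_ns = 10_000_000     # ~10 ms per random disk access

slowdown = disk_access_ns // ram_access_ns
print(slowdown)  # 100000, i.e., five orders of magnitude
```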
On a supercomputer, disk thrashing has more severe economic consequences than it does on a workstation. On most codes, a $120-per-hour Cray Y-MP CPU can do only a few more disk accesses per second than can a $10-per-day workstation. Also, the cost of a disk access on a workstation can more easily be mitigated by multiprogramming, a technique whereby another task (or another user) can use the workstation's CPU, while the first task is awaiting disk service. It takes only a millisecond or less to switch tasks. By contrast, a multiprogrammed Cray processor also takes a millisecond or so to switch tasks--but each millisecond of "wasted CPU time" on a Cray costs 300 times as much as on a workstation!
The Cray X-MP and Y-MP architectures include a large DRAM "secondary memory" designed to allow rapid block-level access to a 1,000MB data set. There is thus some hope of efficiently running codes with huge data sets on these machines, even though they lack a large primary memory. If written for a Cray-2, a KSR-1, or a CM-5, such codes might run even more cost-effectively than on a Cray X-MP or Y-MP, because the primary memory on these computers is built entirely from inexpensive DRAM.
To estimate your program's space requirements, we suggest asking your computation center's consulting staff for help. If you give them a running copy of your code and a sample dataset, they should be able to measure (or tell you how to measure) your program's use of the system's disk drives and virtual memory. A bit of analysis, and examination of the source code documentation, will also be in order, to extrapolate how your program would behave on a worst-case set of input data.
Alternatively, you might try to calculate the maximum total size of your program's data and code areas. You can then use this as a worst-case estimate of your program's memory requirements. Most programs will not thrash if limited to a small fraction of their total memory area, since at any given time they access only a small part of their data and code. The actively referenced part of a program's data and code space is called its "working set."
In our application, a linear program with 1,000 constraints uses 0.5MB of memory for data storage on the workstations, or 0.78MB of memory on the Cray-2. The difference between the Cray-2 and the workstation is a result of the use of 4-byte integers on the workstations, as opposed to the 8-byte Cray-2 integers. On both machines, the space required for the program object code quickly becomes irrelevant as the problem size increases.
Since we are using linear programming code based on sparse matrices, our working set will grow with the number of nonzeros in the constraint matrix. In our problems, we expect the number of nonzeros per row to be essentially independent of the number of constraints. We thus extrapolate linearly, finding that 40MB of working set should be enough to handle the 80,000-constraint linear program arising from using our method to route the wires in a modern gate array. Thus a memory-rich workstation should be able to solve a wire-routing problem for even the largest current gate arrays. Gate arrays are expected to become larger in the future, but so are workstation memories. We conclude that our program's working set is not large enough to require a supercomputer.
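The extrapolation is linear in the number of constraints, anchored at the 0.5MB-per-1,000-constraints figure measured on our workstations:

```python
# Linear working-set extrapolation for our sparse-matrix LP code:
# nonzeros per row are roughly constant, so memory grows linearly
# with the number of constraints.
def working_set_mb(constraints, mb_per_1000=0.5):
    return constraints / 1000 * mb_per_1000

print(working_set_mb(80_000))  # 40.0 MB for an 80,000-constraint problem
```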
Bixby's answer to this section's question is "yes." Many of his most interesting problems require 70- to 100MB of data space. The largest workstation at Bixby's immediate disposal (in 1990) had 32MB. This alone might be sufficient reason to keep him on the Cray for most of his computations. It is not always possible to predict memory usage before a computation begins, and it is very irritating to see your program abort after several hours (or several days!) with no usable output, due to lack of memory space.
We conclude that vector supercomputers are appropriate for anyone who, like Bixby, is exploring the contemporary limits of feasible floating-point computability.
We have posed five questions to help you decide whether or not to port your application to a vector supercomputer:
1. Would you pay 10 times as much to solve your problem 30 times faster?
2. Can your code be easily vectorized?
3. Will your program be used enough to warrant hand-optimization?
4. Can you afford to retune your software's floating-point arithmetic?
5. Does your application require more than 200MB but less than 16GB of primary memory?
If you answer "no" to the first question, but you think you could obtain a speedup greater than 30 with a vectorizing compiler alone (question 2), or with hand-optimization followed by vectorized compilation (question 3), then porting your code may well make economic sense. A negative answer to either question 4 or question 5, however, is a sure indication of trouble. Simplifying, we thus assert that if you answer any two of our questions "no," your application does not belong on a supercomputer.
All five questions are answered affirmatively for Bixby's research into large-scale linear and integer programs, indicating that his work does indeed belong on a Cray. Our application, however, scored four "no" answers. In hindsight, it seems odd that we even attempted to port our wire-routing application to a vector supercomputer.
Related article: Speedups on a "Dusty Deck"
In 1986 we believed we had an application that was well suited for running on a Cray supercomputer. We were developing a promising new approach for deciding how to route the microscopic wires on a semicustom integrated circuit known as a gate array. Our idea is to generate a fairly large linear program (832 constraints) for a small (12 x 15) gate array. The solution of the linear program indicates how wires might be placed into the Manhattan-like channels between the gates in the array [12, 19].
The CPU-intensive kernel of the 1986 version of our wire-routing system was Roy E. Marsten's XMP package for solving linear programs. This package of Fortran routines runs on many different computers, ranging from PCs to supercomputers [16, p. 3].
Since our Sun-3 workstation took 100 seconds to solve our 832-constraint problem using the XMP library, we feared that our system would not be feasible for large gate arrays. To give a sense of scale, a modern gate array such as Texas Instruments' 1-µm CMOS TGC118 has 18,620 sites in a 133 x 140 array. Gate arrays with more than a million sites are now being produced by some vendors.
Routing a 133 x 140 array with our method would result in a linear program approximately 100 times larger than our 12 x 15 example. Since linear programming by the simplex method (as in XMP and CPlex) usually runs in O(n^3) time, Sun-3 run time on a 133 x 140 array would be roughly 100^3 times as long as in our 12 x 15 example, that is, 10^8 seconds, or about three years!
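The projection can be checked in a few lines (10^8 seconds works out to a bit over three years):

```python
# Scale the 100-second Sun-3 run by the simplex method's rough O(n**3)
# growth: a problem 100 times larger takes about 100**3 times as long.
base_seconds = 100          # 832-constraint problem on a Sun-3
size_ratio = 100            # 133 x 140 array vs. the 12 x 15 example

projected_seconds = base_seconds * size_ratio ** 3
projected_years = projected_seconds / (365 * 24 * 3600)
print(projected_seconds)    # 100000000
print(projected_years)      # ~3.2 years
```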
Since the Cray-2 typically runs at tens of MFLOPS, and our Sun-3 with a 68881 coprocessor typically runs at tens of KFLOPS, we naively hoped to see our run time decrease by a factor of 1,000 after porting our program to a Cray-2. With the expected 1,000-fold speedup, we could route a full TGC118 array in a day of Cray-2 run time.
In retrospect, our expectation of a 1,000-fold speedup was naive. On our code, we observed only a 20-fold speedup. A Cray-2 took 5.1 seconds to solve our 832-constraint benchmark, as opposed to 101 seconds on a Sun-3/160. Later, we tried this benchmark on several other computers, obtaining run times of 42.1 seconds on our Sun-4/110, 19.9 seconds on a DECstation 3100, and 3.4 seconds on a Cray X-MP.
Related article: Further Adventures with a "Dusty Deck"
Run times for benchmarks are notoriously sensitive to some seemingly minor changes in hardware and software. For purposes of comparison, we list our test conditions below. All workstations had floating-point chips and at least 8MB of memory. The Sun-3/160 is rated at 2MIPS; it was running Sun Unix 4.2 release 3.5. The Sun-4/110 is rated at 7MIPS; it was running SunOS Unix 4.0. The DECstation 3100 is rated at 14MIPS, and it was running an early version of its operating system, Ultrix-32 ver 2.0 rev 7.0. The Cray-2 is rated at 1,800 MFLOPS (peak); it has 4 processors and 4GB of primary memory; and it was running UNICOS 4.0 at the time of our experiments. The Cray X-MP has 4 processors, 128MB of primary memory, and 1GB of semiconductor secondary memory. The X-MP was running UNICOS 5.0. We used the CFT77 compiler on the Crays, and the Fortran 77 compiler supplied by each workstation's manufacturer.
Leaving the obsolescent Sun-3 out of the comparison, the largest speedup we observed when moving XMP from our slowest workstation (a Sun-4) to our fastest Cray (an X-MP) was 12. Thus, a $10-per-day workstation will solve the same set of problems about 25 times more cheaply than a $120-per-hour supercomputer, albeit 12 times more slowly.
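The cost comparison works out as follows (a back-of-envelope sketch; the one-hour Cray workload is arbitrary and cancels out):

```python
# Cost comparison: $120/hour Cray vs. $10/day workstation that is 12x slower.
cray_rate = 120.0            # dollars per Cray hour
ws_rate = 10.0 / 24.0        # dollars per workstation hour
slowdown = 12.0              # workstation takes 12 times as long
cray_hours = 1.0             # arbitrary workload size; cancels out
cray_cost = cray_rate * cray_hours
ws_cost = ws_rate * slowdown * cray_hours
ratio = cray_cost / ws_cost  # ~24, i.e., "about 25 times more cheaply"
```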
In a vain attempt to improve performance without rewriting the XMP package, we made more than a dozen runs on the Cray-2 in order to try all combinations of the compiler flags that affect vectorization. We also tried a prerelease version of the parallelizing front-end to the compiler, but it produced an error message on our code. None of the compiler switches resulted in improvements for the XMP source code on our test data. This implies that the code is not vectorizable. On further investigation, we found the inner loop of the XMP code contained dozens of lines of source code, several data-conditional branches, and iterated, on average, just 1.8 times per entry. This is sufficient explanation for its nonvectorizability. Furthermore, we do not believe any compiler will ever be able to parallelize this "dusty deck" Fortran inner loop, apparently unmodified since it was written in the early 1970s. Please note that our findings do not imply the XMP package is useless or obsolete. Indeed, other researchers recently obtained splendid results from the ZOOM utility of XMP, in our application area of global wire routing for VLSI.
Related article: Speedups on CPlex
After modernizing both software and hardware early in 1990, we still found no reason to run our wire-routing application on a supercomputer. Our Sun-4/110 runs about 2.5 times faster than our old Sun-3/160 on the XMP code. We gained an additional factor of 15 by replacing the XMP code with the CPlex linear program solver. With CPlex, our Sun-4 workstation solves our 832-constraint benchmark in 2.5 seconds, a 40-fold improvement on our Sun-3 run time with XMP. Using our naive O(n^3) run-time extrapolation for a "typical" linear program solver, it seems possible that our present code could route the wires on a 133 x 140 gate array in two or three weeks of Sun-4 run time. Gate arrays with more than 10,000 or 20,000 sites could be handled by decomposition into smaller subproblems, along the lines suggested by Hu and Shing. Using current workstation technology, then, along with some "algorithmic tuning," our wire-routing system now shows promise of economic feasibility.
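The 40-fold improvement decomposes into its hardware and software factors (again, Python as a calculator):

```python
# Decomposing the 40-fold improvement into hardware and software factors.
sun3_xmp = 100.0      # seconds: Sun-3/160 running XMP
hw_factor = 2.5       # Sun-4/110 runs XMP about 2.5x faster
sw_factor = 15.0      # CPlex is about 15x faster than XMP
predicted = sun3_xmp / (hw_factor * sw_factor)  # ~2.7 s; we observed 2.5 s
overall = sun3_xmp / 2.5                        # the 40-fold improvement
```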
It would be interesting for us to use the CPlex code to solve our 832-constraint problem on a Cray, in order to complete the comparison. Unfortunately, our $500 CPlex academic-use license gives us access only to Sun-4 object code. Still, we can make some speedup estimates on the basis of CPlex promotional literature: their code runs only 10 to 14 times faster on a Cray Y-MP than on a Sun-4, for five linear programming problems (viz. scagr25, seba, scsd8, nesm, and ship12l in the "netlib" set of benchmarks) similar in size to our 832-constraint problem.
Bixby has kindly supplied us with timing data on a new, "far better vectorized" version of CPlex, for 10 large linear programs with 50 to 1,200 non-zeros per row. The speedups ranged from 15 to 66 on a Cray Y-MP vs. a Sun-4/390, when the same solution algorithm was used for both machines. Using the default solution algorithm of each machine's version of CPlex, the speedup range is 11 to 96, thereby illustrating how important algorithmic adjustment can be in such problems. Bixby says he typically sees speedup ratios in the range of 30 to 40.
Related article: Vectorizable Code
If your inner loops are particularly simple, and if they account for almost all the execution time of a workstation run, then it may be possible to make an "armchair estimate" of the speedup your code might obtain on a vector supercomputer. The methodology suggested here is to find published speedup data for a loop that is similar to yours, either in a book or in some other published source [7, 8, 18].
The critical features of your inner loop are its iteration count and its data access pattern, although seemingly minor variations in coding and compiler version may result in large differences in MFLOP ratings. In March 1990, John Gregory of Cray Research Inc. compiled Loops A and B (shown below) with the cf77 compiler for a Cray Y-MP in single-processor mode, clocking them at approximately 5MFLOPS when the iteration count n was 10, at 50MFLOPS when n = 100, and at 200MFLOPS when n = 1,000. Recently, Charles Grassl, also of Cray Research Inc., independently coded Loops A and B for various total sizes (in bytes) of the referenced arrays. On a single-processor Y-MP run using cf77 version 6.0, he measured 200MFLOPS for n = 22,500 on Loop A, but only 25MFLOPS for n = 210 on the doubly indexed Loop B. Loop C's indirect addressing limits it to 46MFLOPS, or about 1/6 of the peak Cray Y-MP single-processor performance, even for n = 150,000. We conclude that Loop A is definitely amenable to vectorization, Loop B may be problematic, and Loop C is vectorizable, but makes somewhat inefficient use of a Cray Y-MP's floating-point units.
Loop A:
      do 1 i = 1, n
    1 s = s + a(i)*b(i)

Loop B:
      do 2 i = 2, n
      do 2 j = 1, i-1
    2 a(i) = a(i) + b(i,j)*a(i-j)

Loop C:
      do 3 i = 1, n
    3 a(i) = b(ia(i)) + c(ib(i))
A Sun SPARCstation-1 is much less sensitive than a Cray to iteration counts and access patterns. Each of these loops executes at 1 to 3 MFLOPS for any n > 2 on this workstation. For n < 10, the SPARCstation is thus nearly as fast as a Cray Y-MP; but for large n, the speed ratio can be 100 or more.
Note: the indirect arrays ia() and ib() are random values selected at run time, uniformly in 1..n with replacement. In Grassl's code, n was a run-time variable; it may have been a compile-time constant in Gregory's code. The arrays a(), b(), and c() are stored in 64-bit floating-point format.
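For readers who prefer executable form, the three Fortran loops translate as follows (an illustrative Python sketch with zero-based indexing, not a performance benchmark; data sizes and values are ours):

```python
# Python renderings (zero-based) of the three Fortran loops.

def loop_a(a, b):
    # Loop A: dot product -- unit stride, trivially vectorizable.
    s = 0.0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s

def loop_b(a, b):
    # Loop B: linear recurrence -- a[i] depends on all earlier a[i-j],
    # which defeats straightforward vectorization.
    n = len(a)
    for i in range(1, n):
        for j in range(1, i + 1):          # Fortran j = 1 .. i-1
            a[i] += b[i][j - 1] * a[i - j]
    return a

def loop_c(a, b, c, ia, ib):
    # Loop C: gather through index arrays -- vectorizable, but the
    # indirect loads make inefficient use of the memory system.
    for i in range(len(a)):
        a[i] = b[ia[i]] + c[ib[i]]
    return a
```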
Related article: Parallelizable Code
All currently marketed vector supercomputers can be configured to run your code in parallel on multiple processors. If your code is perfectly parallelizable, it will run k times faster on k processors. In practice, even the best codes run somewhat more slowly than this, due to the overheads of interprocessor communication. Whenever these overheads can be amortized to nearly zero on a per-FLOP basis, we obtain efficient parallelization for sufficiently large problem sizes n.
Charles Grassl recently ran our Loops A, B, and C on an 8-processor Cray Y-MP C90. In a single-processor run, he measured 600MFLOPS for Loop A (n = 22,500 to 225,000), 100 to 200MFLOPS for Loop B (n = 210 to 700), and 90MFLOPS for Loop C (n = 15,000 to 150,000). In an 8-processor run, Loop A ran at 1,500MFLOPS for n = 22,500, to 3,700MFLOPS for n = 225,000. Since the single-processor code for Loop A ran at 600MFLOPS, the "ideal" 8-processor speed for large n is 4,800MFLOPS. Still, we see that Loop A is 8-way parallelizable on a Cray Y-MP C90, with 77% or more efficiency for large n.
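The efficiency figure quoted above is obtained as follows (a quick check; the variable names are ours):

```python
# Parallel efficiency of Loop A on an 8-processor Cray Y-MP C90.
single_proc = 600.0     # MFLOPS on one processor
eight_proc = 3700.0     # MFLOPS on eight processors, n = 225,000
ideal = 8 * single_proc            # 4,800 MFLOPS with perfect scaling
efficiency = eight_proc / ideal    # ~0.77, i.e., 77% efficiency
```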
Loop B ran at approximately the same speed in single-processor mode as it did in multiple-processor mode on a Cray Y-MP C90, for n between 210 and 700. Thus it is not parallelized; perhaps larger n would help. Certainly, recoding should help: with an "ideal" compiler, the Cray Y-MP C90 should execute the computation of Loop B at a speed approaching, or even exceeding, that of Loop A. The key observation is that the loop is highly parallelizable if the computation is reordered so that the large-j additive terms for a(i+1) are collected before the sum for a(i) is complete. Another important observation is that only a little more than one 8-byte memory fetch per multiply-add step is necessary, if the computation is "blocked" so that most a(i) and a(i-j) references are to values in vector registers. Thus we assert that Loop B should run at somewhat better than 4GFLOPS for sufficiently large n on an 8-processor Cray Y-MP C90.
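One way to expose the parallelism described above is to invert the loop nest: once a(k) is final, its contributions are scattered to all later elements, leaving an inner loop with no loop-carried dependence. The following Python sketch (a hypothetical recoding of ours, not Cray's compiler output) computes the same result as Loop B:

```python
def loop_b_reordered(a, b):
    # Computes the same result as Loop B, but with the dependence hoisted
    # into the outer loop: by the time the outer loop reaches k, a[k] has
    # received all of its contributions, so the inner loop over i performs
    # independent updates and can be vectorized or parallelized.
    n = len(a)
    for k in range(n - 1):            # a[k] is final at this point
        for i in range(k + 1, n):     # independent: no loop-carried dependence
            a[i] += b[i][i - k - 1] * a[k]
    return a
```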
Loop C ran at 300MFLOPS (for n = 15,000) to 600MFLOPS (for n = 150,000). This is a 3-fold to 6-fold speedup over the single-processor timing, so we have achieved effective but not completely efficient parallelization with this code for this range of n.
At press time, no authoritative benchmark data was available for CM-5 runs on our loops. Instead, we offer the following calculations of ideal speeds that might be achieved with sufficiently clever codings, for a 1024-processor CM-5. Note that recoding is absolutely necessary to get supercomputer performance out of the CM-5 on these loops: the current Fortran compiler for the CM-5 will not parallelize Fortran 77 code.
If Loop A is rewritten as the Fortran 90 statement s = sum(a*b), an ideal compiler for the CM-5 would issue code that performs each multiply-add operation in the time required for DRAM fetches of both the a(i) and b(i) operands. This is 16 bytes of DRAM traffic for every two floating-point operations, on an architecture with a maximum DRAM bandwidth of 512GB/sec. A suitably revised version of Loop A should thus be executable on a 1024-processor CM-5 at a speed approaching (512GB/sec) x (2 FLOP/16 bytes) = 64GFLOPS. Thus a large CM-5 might be more than 10 times faster than a Cray Y-MP C90 on our first loop.
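The 64GFLOPS bound follows directly from the bandwidth arithmetic (Python as a calculator):

```python
# Bandwidth-limited bound for a revised Loop A on a 1024-processor CM-5.
dram_bw = 512e9        # bytes/sec, maximum aggregate DRAM bandwidth
bytes_per_iter = 16.0  # one 8-byte fetch each for a(i) and b(i)
flops_per_iter = 2.0   # one multiply and one add
peak = dram_bw * flops_per_iter / bytes_per_iter   # 64 GFLOPS
```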
Parallelizing Loop B 1024-fold, in order to achieve GFLOP or even 10GFLOP performance on a CM-5, seems possible for sufficiently large n. (This might be a very challenging coding project.) We tentatively assert that, with sufficiently clever coding and for sufficiently large n, a 1024-processor CM-5 should be able to run Loop B somewhat faster than does an 8-processor Cray Y-MP C90.
Loop C illustrates a difficulty with all massively parallel architectures known to us: too much "random" indirect addressing precludes efficient parallelization. Reducing such indirections by source-code revision is, for this reason, a guiding principle for code design on any massively parallel architecture. Loop C does two indirect accesses for every floating-point addition. This is an unfavorable ratio for the CM-5, as we will see.
If ideally coded for a 1024-processor CM-5, Loop C's performance would be limited by the following considerations. Each cluster of 16 processors has a 160MB/sec bandwidth limit on data transmitted from other clusters. Also, in a 1024-processor computation of Loop C, almost all of the a() and b() data required for a cluster's computation will be held in other processors' DRAM. These observations imply that a 1024-processor CM-5 cannot execute Loop C at a speed greater than (160MB/sec/16 processors) x (1 FLOP/16 bytes) x (1024 processors) = 625 MFLOPS. This is about 1% of its speed on Loop A. The maximum achievable CM-5 performance on Loop C will be somewhat less than we have just estimated, since we have not made allowances for the CPU time required to handle interprocessor communication. Messages of less than a few hundred bytes are not handled efficiently by any currently available, general-purpose, message-passing software for the CM-5. Using special-purpose code, Charles Leiserson's group has measured bandwidths in excess of 4MB/sec/processor for a short-packet random permutation similar to the one defined by Loop C. If we executed Loop C's floating-point data movements at 4MB/sec/processor, a 1024-processor CM-5 would achieve 250MFLOPS. (Even this figure seems optimistic, in our limited experience with CM-5 Fortran codes.) As noted previously, an 8-processor Cray Y-MP C90 using the standard cf77 compiler runs Loop C at 600MFLOPS.
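The two Loop C estimates can be reproduced as follows (a hedged reconstruction of the arithmetic, ours; our round figures of 625 and 250 MFLOPS correspond to taking 1,000 as an approximate processor count):

```python
# Reconstruction of the Loop C bandwidth bounds; 1,000 is used as a round
# approximation to 1,024 processors, which reproduces the 625 and 250
# MFLOPS figures quoted in the text.
per_proc_bw = 160e6 / 16      # bytes/sec/processor of off-cluster traffic
bytes_per_flop = 16           # two 8-byte indirect fetches per addition
procs = 1000                  # round figure for a 1024-processor CM-5
bound = per_proc_bw / bytes_per_flop * procs        # 625 MFLOPS
measured_bw = 4e6             # bytes/sec/processor, Leiserson's measurement
realistic = measured_bw / bytes_per_flop * procs    # 250 MFLOPS
```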
We tentatively conclude that Fortran codes making extensive use of "random" indirect accesses to arrays are not efficiently executed on today's massively parallel architectures and compilers.
A full comparison of vector supercomputers with massively parallel supercomputers is well beyond the scope of this article. The preceding analyses are meant to be illustrative, not authoritative. Nonetheless, it should be clear that a parallel supercomputer, especially if it is massively parallel, is very sensitive to the differences between the computations of Loops A, B, and C, as well as to the way in which these computations are coded. We thus see no reason to expect parallelization speedups, beyond the limited speedup obtainable through vectorization, on "dusty deck" codes using present-day (or even next-generation) computers and compilers.
[1.] Amdahl, G. The validity of the single processor approach to achieving large-scale computing capabilities. In AFIPS Conference Proceedings 30 (1967).
[2.] Bixby, R.E. Private correspondence, Sept. 1990.
[3.] CPlex Optimization, Inc., 7710-T Cherry Park, Suite 124, Houston, Tex. 77095. CPlex Performance Characteristics, 1989.
[4.] Cray Research, Inc. Cray-2 computer systems functional description manual. Manual HR-02000-0D, 1989.
[5.] Cray Research, Inc. Cray Y-MP computer systems functional description manual. Manual HR-04001-0C, 1990.
[6.] Lusk, E., Overbeek, R., et al. Portable Programs for Parallel Processors. Holt, Rinehart and Winston, Inc., 1987. ISBN 0-03-014404-3.
[7.] Feo, J.T. An analysis of the computational and parallel complexity of the Livermore loops. Parallel Comput. 7 (1988), 163-185.
[8.] Hennessy, J.L. and Patterson, D.A. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Mateo, Calif, 1990. ISBN 1-55860-069-8.
[9.] Higgins, W.D. Re: IBM 6000 vs HP 9000 series 700. In comp.benchmarks, June 27, 1991.
[10.] Hockney, R.W. and Jesshope, C.R. Parallel Computers 2. Adam Hilger, ISBN 0-85274-812-4, 1988.
[11.] Hu, T.C. and Shing, M.T. A decomposition algorithm for circuit routing. Mathematical Programming Study 24 (1985), 87-103.
[12.] Karp, R.M., Leighton, F.T., Rivest, R.L., Thompson, C.D., Vazirani, U.V. and Vazirani, V.V. Global wire routing in two-dimensional arrays. In 24th Annual Symposium on Foundations of Computer Science, IEEE (Nov. 1983), pp. 453-459.
[13.] Leiserson, C.E., Abuhamdeh, Z.S. et al. The network architecture of the connection machine CM-5. In Fourth Annual ACM Symposium on Parallel Algorithms and Architectures (July 1992), pp. 272-285.
[14.] Levesque, J.M. and Williamson, J.W. A Guidebook to Fortran on Supercomputers. Academic Press, Inc., 1989. ISBN 0-12-444760-0.
[15.] Lin, M., Tang, R., Du, D.H.C., Klietz, A.E. and Saroff, S. Performance evaluation of the CM-5 interconnection network. In Compcon (Feb. 1993).
[16.] Marsten, R.E. XMP Technical reference manual. XMP Software, Inc., July 1987.
[17.] McCalpin, J.D. Benchmark summary--FP-intensive, 2D CFD model. comp.benchmarks, Nov. 20, 1990.
[18.] McMahon, F. The Livermore Fortran kernels: A computer test of numerical performance range. Tech. Rep. UCRL-53745, Lawrence Livermore National Laboratory, 1986.
[19.] Ng, A., Raghavan, P. and Thompson, C.D. Experimental results for a linear program global router. Comput. Artif. Intell. 6, 3 (1987) 229-242.
[20.] Ramamurthy, M.K. Floating-point benchmark comparison. In comp.benchmarks, Nov. 12, 1990.
[21.] Thaik, R.W., Lek, N. and Kang, S.-M. A router using zero-one integer linear programming techniques for sea-of-gates and custom logic arrays. IEEE TCAD-ICS 11, 12 (Dec. 1992) 1495-1507.
[22.] Thinking Machines Corp., Cambridge, Mass. The connection machine CM-5 Tech. Sum., Jan. 1992.
Author: Clark D. Thomborson
Publication: Communications of the ACM
Date: Nov. 1, 1993