Ultracomputers: a teraflop before its time.
In 1989 I described the situation in high-performance computers in science and engineering, including several parallel architectures that could deliver teraflop power by 1995, but with no price constraint . I predicted either of two alternatives: SIMDs with thousands of processing elements or multicomputers with 1,000+ interconnected, independent computers, could achieve this goal. A shared-memory multiprocessor looked infeasible then. Traditional, multiple vector processor supercomputers such as Crays would simply not evolve to a teraflop until 2000. Here is what happened.
1. During 1992, NEC's four-processor SX3 is the fastest computer, delivering 90% of its peak 22Gflops for the Linpeak benchmark, and Cray's 16-processor YMP C90 provides the greatest throughput for supercomputing workloads.
2. The SIMD hardware approach that enabled Thinking Machines to start up in 1983 and obtain DARPA funding was abandoned because it was only suitable for a few, very large-scale problems, barely multiprogrammed, and uneconomical for workloads. It is unclear whether large SIMDs are "generation"-scalable, and they are clearly not "size"-scalable. The main result of the CM2 computer was 10Gflop-level of performance for large-scale problems.
3. Ultracomputer-sized, scalable multicomputers (smC) were introduced by Intel and Thinking Machines, using "Killer" CMOS, 32-bit microprocessors. These product introductions join multicomputers from companies such as Alliant, AT&T, IBM, Intel, Meiko, Mercury, NCUBE, Parsytec, and Transtech. At least Convex, Cray, Fujitsu, IBM, and NEC are working on new-generation smCs that use 64-bit processors. By 1995, this score of efforts, together with the evolution of fast, LAN-connected workstations will create "commodity supercomputing." The author advocates workstation clusters formed by interconnecting high-speed workstations via new high-speed, low-overhead switches, in lieu of special-purpose multicomputers.
4. Kendall Square Research introduced their KSR 1 scalable, shared-memory multiprocessors (smP) with 1,088 64-bit microprocessors. It provides a sequentially consistent memory and programming model, proving that smPs are feasible. The KSR breakthrough that permits scalability to allow it to become an ultracomputer is based on a distributed, memory scheme, ALLCACHE[TM] that eliminates physical memory addressing. The ALLCACHE design is a confluence of cache and virtual memory concepts that exploit locality required by scalable, distributed computing. Work is not bound to a particular memory, but moves dynamically to the processors requiring the data. A multiprocessor provides the greatest and most flexible ability for workload since any processor can be deployed on either scalar or parallel (e.g., vector) applications, and is general-purpose, being equally useful for scientific and commercial processing, including transaction processing, databases, real time, and command and control. The KSR machine is most likely the blueprint for future scalable, massively parallel computers.
Figure 1 shows the evolution of supers (four- to five-year gestation) and micro-based scalable computers (three-year gestation). In 1992, petaflop ([10.sup.15] flops) ultracomputers, costing a half-billion dollars do not look feasible by 2001. Denning and Tichy  argue that significant scientific problems exist to be solved, but a new approach may be needed to build such a machine. I concur, based on results to date, technology evolution, and lack of user training.
The teraflop quest is fueled by the massive (gigabuck-level) High Performance Computing and Communications Program (HPCC, 1992) budget and DARPA's military-like, tactical focus on teraflops and massive parallelism with greater than 1,000 processing elements. The teraflops boundary is no different than advances that created electronic calculators (kiloflops), Cray computers (megaflops), and last-generation vector supercomputers (Gflops). Vector processing required new algorithms and new programs, and massively parallel systems will also require new algorithms and programs. With slogans such as "industrial competitiveness," the teraflop goal is fundable--even though competitiveness and teraflops are difficult to link. Thus, HPCC is a bureaucrat's dream. Gigabuck programs that accelerate evolution are certain to trade off efficacy, balanced computing, programmability, users, and the long term. Already, government-sponsored architectures and selected purchasing have eliminated benchmarking and utility (e.g., lacking mass storage) concerns as DARPA focus narrowed on the teraflop. Central purchase of an ultracomputer for a vocal minority wastes resources, since no economy of scale exists, and potential users are not likely to find or justify problems that effectively utilize such a machine without a few years of use on smaller machines.
Worlton describes the potential risk of massive parallelism in terms of the "bandwagon effect," where we make the biggest mistakes in managing technology . The article defines "bandwagon" as "a propaganda device by which the purported acceptance of an idea, product or the like by a large number of people is claimed in order to win further public acceptance." He describes a massively parallel bandwagon drawn by vendors, computer science researchers, and bureaucrats who gain power by increased funding. Innovators and early adopters are the riders. The bandwagon's four flat tires are caused by the lack of systems software, skilled programmers, guideposts (heuristics about design and use), and parallelizable applications.
The irony of the teraflops quest is that programming may not change very much even though virtually all programs must be rewritten to exploit the very high degree of parallelism required for efficient operation of the coarse-grained, scalable computers. Scientists and engineers will use just another dialect of Fortran that supports data parallelism.
All computers, including true supers, use basically the same, evolving, programming model for exploiting parallelism: SPMD, a single program, multiple data spread across a single address space that supports Fortran . In fact, a strong movement is directed toward the standardization of High Performance Fortran (HPF) using parallel data structures to simplify programming. With SPMD, the same program is made available to each processor in the system. Shared memory multiprocessors simply share a copy in common memory and each computer of a multicomputer is given a copy of the program. Processors are synchronized at the end of parallel work units (e.g., outermost DO loop). Multicomputers, however, have several sources of software overhead due to communication being message-passing instead of direct, memory reference. With SPMD and microprocessors with 64-bit addressing, multicomputers will evolve to be the multiprocessors they simulate by 1995. Thus, the mainline, general-purpose computer is almost certain to be the shared memory, multiprocessor after 1995.
The article will first describe supercomputing evolution and the importance of size-, generation-, and problem-scalability to break the evolutionary performance and price barriers. A discussion about measuring progress will follow. A taxonomy of alternatives will be given to explain the motivation for the multiprocessor continuing to be the mainline, followed by specific industrial options that illustrate real trade-offs. The final sections describe computer design research activities and the roles of computer and computational science, and government.
Evolution to the Ultracomputer: A Scalable Supercomputer
Machine scalability allows the $30 million price barrier to be broken for a single computer so that for several hundred million dollars' or a teraflop's worth of networked computers, the ultracomputer, can be assembled. Until 1992, a supercomputer was defined both as the most powerful central computer for a range of numerically intense computation (i.e., scalar and vector processing), with very large data sets, and costing about $30 million. The notion of machine or "size" scalability, permitting computers of arbitrary and almost unlimited size, together with finding large-scale problems that run effectively have been key to the teraflop race . This is a corollary of the Turing test: People selling computers must be smarter than their computers. No matter how difficult a computer is to use or how poorly it performs on real workloads, given enough time, someone may find a problem for which the computer performs well. The problem owner extols the machine to perpetuate government funding.
In 1988, the Cray YMP/8 delivered a peak of 2.8 Gflops. By 1991, the Intel Touchstone Delta (672 node multicomputer) and the Thinking Machines CM2 (2K processing element SIMD), both began to supply an order of magnitude more peak power (20 gigaflops) than supercomputers. In supercomputing, peak or advertising power is the maximum performance that the manufacturer guarantees no program will exceed. Benchmark kernels such as matrix operations run at near peak speed on the Cray YMP. Multicomputers require O(25,000) matrices to operate effectively (e.g., 14 Gflops from a peak of 20), and adding processors does not help. For O(1,000) matrices that are typical of supercomputer applications, smCs with several thousand processing elements deliver negligible performance.
Supers 1992 to 1995
By mid-1992 a completely new generation of computers have been introduced. Understanding a new generation and redesigning it to be less flawed takes at least three years. Understanding this generation should make it possible to build the next-generation supercomputer class machine, that would reach a teraflop of peak power for a few, large-scale applications by the end of 1995.
Table 1 shows six alternatives for high-performance computing, ranging from two traditional supers, one smP, and three "commodity supers" or smCs, including 1,000 workstations. Three metrics characterize a computer's performance and workload abilities. Linpeak is the operation rate for solving a system of linear equations and is the best case for a highly parallel application. Solving systems of linear equations is at the root of many scientific and engineering applications. Large, well-programmed applications typically run at one-fourth to one-half this rate. Linpack 1K x 1K is typical of problems solved on supercomputers in 1992. The Livermore Fortran Kernels (LFK) harmonic mean for 24 loops and 3 sizes, is used to characterize a numerical computer's ability, and is the worst-case rating for a computer as it represents an untuned workload.
New-generation, traditional or "true" multiple vector processor supercomputers have been delivered by Cray and NEC that provide one-fourth to one-eighth the peak power of the smCs to be delivered in 1992. "True" supercomputers use the Cray design formula: ECL circuits and dense packaging technology to reduce size, allow the fastest clock; one or more pipelined vector units with each processor provide peak processing for a Fortran program; and multiple vector processors communicate via a switch to a common, shared memory to handle large workloads and parallel processing. Because of the dense physical packaging of highpower chips and relatively low density of the 100,000 gate ECL chips, the inherent cost per operation for a supercomputer is roughly 500 to 1,000 peak flops/$ or 4 to 10 times greater than simply packaged, 2 million transistor "killer" CMOS microprocessors that go into leading edge workstations (5,000 peak flops/$). True supercomputers are not in the teraflops race, even though they are certain to provide most of the supercomputing capacity until 1995.
Intel has continued the pure multicomputer path by introducing its third generation, Paragon with up to 4K Intel i860 microprocessor based nodes, each with a peak of 4 x 75 Mflops for delivery in early 1993. Intel is offering a 6K node, $300 million, special ultracomputer for delivery in late 1993 that would provide 1.8 peak teraflops or 6K peak flop/$.
In October 1991, Thinking Machines Corp. (TMC) announced its first-generation multicomputer consisting of Sun servers controlling up to 16K Sparc microprocessor-based computational nodes, each with four connected vector processors to be delivered in 1992. The CM5 workload ability of a few Sun servers is small compared to a true supercomputer and to begin to balance the computer for general utility would require several disks at each node. The expected performance for both supercomputer-sized problems and a workload means that the machine is fundamentally special-purpose for highly parallel jobs. The CM5 provides 4,300 peak flops/$.
In both the Paragon and CM5 it is likely that the most cost-effective use will be with small clusters of a few (e.g., 32) processors.
1995 Supers: Vectors, Scalable Multicomputers or Multiprocessors
Traditional or "true" supercomputers have a significant advantage in being able to deliver the computational power during this decade because they have evolved for four, four-year generations for almost 20 years,(1) and have an installed software-based, programming paradigm, trained programmers, and wider applicability inherent in finer granularity. The KSR-1 scalable multiprocessor runs traditional, fine-grained supercomputer Fortran programs, and has extraordinary single-processor scalar and commercial (e.g., transaction processing) throughput.
The smCs are unlikely to be alternatives for general-purpose computing or supercomputing because they do not deliver significant power for scalar- and finer-grained applications that characterize a supercomputer workload. For example, the entire set of accounts using the Intel smC at Cal Tech is less than 200, or roughly the number of users that simultaneously use a large super. Burton Smith  defines a general-purpose computer as: 1. Reasonably fast execution of any algorithm that performs well on another machine. Any kind of parallelism should be exploitable. 2. Providing a machine-independent programming environment. Software should be no harder to transport than to any other computer. 3. A storage hierarchy performance consistent with computational capability. The computer should not be I/O bound to any greater extent than another computer.
Whether traditional supercomputers or massively parallel computers provide more computing, measured in flops/month by 1995 is the object of a bet between the author and Danny Hillis of Thinking Machines . Scalable multicomputers (smCs) are applicable to coarse-grained, highly parallel codes and someone must invent new algorithms and write new programs. Universities are rewarded with grants, papers, and the production of knowledge. Hence, they are a key to utilizing coarse-grained, parallel computers. With pressure to aid industry, the Department of Energy laboratories see massive parallelism as a way to maintain staffs. On the other hand, organizations concerned with cost-effectiveness, simply cannot afford the effort unless they obtain uniquely competitive capabilities.
Already, the shared, virtual memory has been invented to aid in the programming of multicomputers. These machines are certain to evolve to multiprocessors with the next generation. Therefore, the mainline of computing will continue to be an evolution of the shared memory multiprocessor just as it has been since the mid-1960s . In 1995, users should be able to buy a scalable parallel multiprocessor for 25K peak flops/$, and a teraflop computer would sell for about $40 million.
Supercomputer users and buyers need to be especially cautious when evaluating performance claims for supercomputers. Bailey's  twelve ways to obfuscate are:
1. Quote 32-bit performance results as 64-bit performance
2. Present inner kernel performance as application performance, neglect I/O
3. Employ assembly, micro-code and low-level code
4. Scale up problem size to avoid performance drop-off when using large numbers of processors with overhead or inter-communication delays or limits
5. Quote performance results extrapolated to a full system based on one processor
6. Compsare results against unoptimized, scalar Cray code
7. Compare direct run-time with old code on an obsolete system (e.g., Cray 1)
8. Quote additional operations that are required when using a parallel, often obsolete, algorithm
9. Quote processor utilization, speedup, or Mflops/$ and ignore performance
10. Mutilate the algorithm to match the architecture, and give meaningless results
11. Compare results using a loaded vs. dedicated system
12. If all else fails, show pictures and videos
Each year a prize administered by the ACM and IEEE Supercomputing Committees awards a prize to reward practical progress in parallelism, encourage improvements, and demonstrate the utility of parallel processors . Various prize categories recognize program speedup through parallelism, absolute performance, performance/price, and parallel compiler advances. The first four years of prizes are given in Table 2.
The 1987 prize for parallelism was won by a team at Sandia National Laboratory using a 1K node NCUBE and solving three problems. The team extrapolated that with more memory, the problem could be scaled up to reduce overhead, and a factor of 1,000 (vs. 600) speedup could be achieved. An NCAR Atmospheric Model running on the Cray XMP had the highest performance. In 1989 and 1990, a CM2 (SIMD with 2K processing elements) operated at the highest speed and the computation was done with 32-bit floating point numbers. The problems solved were 4 to 16 times larger than would ordinarily have been solved with modified problems requiring additional operations .
The benchmarking process has been a key to understanding computer performance until the teraflop started and peak performance replaced reality as a selection criterion. Computer performance for an installation can be estimated by looking at various benchmarks of similar programs, and collections of benchmarks that represent a workload . Benchmarks can be synthetic (e.g., Dhrystones and Whetstones), kernels that represent real code (e.g., Livermore Loops, National Aerodynamic Simulation), numerical libraries (e.g., Linpack for matrix solvers, FFT), or full applications (e.g., SPEC for workstations, Illinois's Perfect Club, Los Alamos Benchmarks). No matter what measure is used to understand a computer, the only way to understand how a computer will perform is to benchmark the computer with the applications to be used. This also measures the mean time before answers (mtba), a most important measure of computers productivity.
Several Livermore Loop metrics, using a range of three vector lengths for the 24 loops, are useful: the arithmetic mean typifying the best applications (.97 vector ops), the arithmetic mean for optimized applications (.89 vector ops), geometric mean for tuned workload (.74 vector ops), harmonic mean for untuned workload (.45 vector ops), and harmonic mean compiled as a scalar for an all-scalar operation (no vector ops). Three Linpack measurements are important: Linpack 100 x 100, Linpack 1,000 x 1,000 for typical supercomputing applications, and Linpeak (for an unconstrained sized matrix). Linpeak is the only benchmark that is run effectively on a large multicomputer. Massive multicomputers can rarely run an existing supercomputer program (i.e., the dusty deck programs) without a new algorithm or new program. In fact, the best benchmark for any computer is whether a manufacturer can and is willing to benchmark a user's programs.
Two factors make benchmarking parallel systems difficult: problem scalability (or size) and the number of processors or job streams. The maximum output is the perfectly parallel workload case in which every processor is allowed to run a given-size program independently and the uniprocessor work-rate is multiplied by the number of processors. Similarly, the minimum wall clock time should be when all processors are used in parallel. Thus, performance is a surface of varying problem size (scale) and the number of processors in a cluster.
The Commercial Alternatives
The quest generated by the HPCC Program and the challenge of parallelism has attracted almost every computer company lashing together microprocessors (e.g., Inmos Transputers, Intel i860s, and Sparcs) to provide commodity, multicomputer supercomputing in every price range from PCs to ultracomputers. In addition, traditional supercomputer evolution will continue well into the twenty-first century. The main fuel for growth is the continued evolution of "killer" CMOS microprocessors and the resulting workstations.
Traditional or "True" Supercomputer Manufacturer's Response
In 1989 the author estimated that several traditional supers would be announced this year by U.S. companies. Cray Research and NEC have announced products, and Fujitsu has announced its intent to enter the U.S. supercomputer market. Seymour Cray formed Cray Computer. Supercomputer Systems Inc. lacks a product, and DARPA's Tera Computer Company is in the design phase. Germany's Suprenum project was stopped. The French Advanced Computer Research Institute (ACRI) was started. Numerous Japanese multicomputer projects are underway. Supercomputers are on a purely evolutionary path driven by increasing clock speed, pipelines, and processors. Clock increases in speed do not exceed a factor of two every five years (about 14%). In 1984, a committee projected the Cray 3 would operate in 1986. In 1989, the 128-processor Cray 4 was projected to operate at 1GHz. in 1992. In 1992, the 16-processor Cray 3, projected to operate at 500MHz, was stopped. NEC has pioneered exceptional vector speeds, retaining the title of the world's fastest vector processor. The NEC vector processor uses 16 parallel pipelines and performs operations at a 6.4Gflop rate. Fujitsu's supercomputer provides two, 2.5Gflop vector units shared by four scalar processors. Every fouror five years the number of processors is doubled, providing a gain of 18% per year. The C90 increased the clock frequency by 50% to 250MHz, doubled the number of pipelines and the number of processors over the YMP. Figure 1 projects a 1995 Cray supercomputer that will operate at 100Gflops using a double-speed clock, twice the number of processors, and twice the number of pipelines.
In most production environments, throughput is the measure and the C-90 clearly wins, since it has a factor of four times the number of processors of either Fujitsu or NEC. In an environment where a small number of production programs are run, additional processors with scalar capability may be of little use. Environments that run only a few coarse-grained codes can potentially use smCs for large-grained problems. The traditional supercomputer market does not look toward high growth because it provides neither the most cost-effective solution for simpler scalar programs, nor the peak power for massively parallel applications. Scalar codes run most cost-effectively on workstations, while very parallel code may be run on massively parallel computers, provided the granularity is high and the cost of writing the new code is low. Despite these factors, I believe traditional supercomputers will be introduced in 2000.
"Killer" CMOS Micros for Building Scalable Computers
Progress toward the affordable teraflop using "killer" CMOS micros is determined by advances in microprocessor speeds. The projection  that microprocessors would improve at a 60%-per-year rate, providing a quadrupling of performance each three years still appears to be possible for the next few years (Table 3). The quadrupling has its basis in Moore's Law stating that semiconductor density would quadruple every three years. This explains memory-chip-size-evolution. Memory size can grow proportionally with processor performance, even though the memory band-width is not keeping up. Since clock speed only improves at 25% per year (a doubling in three years), the additional speed must come from architectural features (e.g., superscalar or wider words, larger cache memories, and vector processing).
The leading edge microprocessors described at the 1992 International Solid State Circuits Conference included: a microprocessor based on Digital's Alpha architecture with a 150- or 200MHz clock rate; and the Fujitsu 108 (64) 216 (32-bit) Mflop Vector Processor chip that works with Sparc chips. Using the Fujitsu chip with a microprocessor would provide the best performance for traditional supercomputer-oriented problems. Perhaps the most important improvement to enhance massive parallelism is the 64-bit address enabling a computer to have a large global address space. With 64-bit addresses and substantially faster networks, some of the limitations of message-passing multicomputers can be overcome.
In 1995, $20,000 distributed computing node microprocessors with peak speeds of 400 to 800 Mflops can provide 20,000 to 40,000 flops/$. For example, such chips are a factor of 12 to 25 times faster than the vector processor chips used in the CM5 and would be 4.5 to 9 times most cost-effective. Both ECL and GaAs are unlikely runners in the teraflop race since CMOS improves so constantly in speed and density. Given the need for large on-chip cache memories and the additional time and cost penalties for external caches, it is likely that CMOS will be the semi-conductor technology for scalable computers.
Not just Another Workstaion: A Proposal for Having Both High-Performance Workstations and Massive Parallelism
Workstations are the purest and simplest computer structure able to exploit microprocessors since they contain little more than a processor, memory, CRT, network connection, and i/o logic. Furthermore, their inherent CRTs solve a significant part of the i/o problem. A given workstation or server node (usually just a workstation without a CRT, but with large memory and a large collection of disks) can also become a multicomputer.
Nielsen of Lawrence Livermore National Laboratory (LLNL) has outlined a strategy for transitioning to massively parallel computing . LLNL has made the observation that is spends about three times as much on workstations that are only 15% utilized, as it does on supercomputers. By 1995, microprocessor-based workstations could reach a peak of 500Mflops, providing 25,000 flops per dollar or 10 times the projected cost-effectiveness of a super. This would mean that inherent in its spending, LLNL would have about 25 times more unused peak power in its workstations than it has in its central supercomputer or specialized massively parallel computer.
The difficult part of using workstations as a scalable multicomputer (smC) is the low-bandwidth communication links that limit their applicability to long-grained problems. Given that every workstation environment is likely to have far greater power than a central super, however, the result should clearly justify the effort. An IEEE standard, the Scalable Coherent Interface or SCI, is being implemented to interconnect computers as a single, shared-memory multiprocessor. SCI uses a ring, such as KSR, to interconnect the computers. A distributed directory tracks data as copies migrate to the appropriate computer node. Companies such as Convex are exploring the SCI for interconnecting HP's micros as an alternative and preferred mini-super that can also address the supercomputing market.
A cluster of workstations interconnected at speeds comparable to Thinking Machines's CM5, would be advantageous in terms of power, cost-effectiveness, and administration compared with LAN-connected workstations and supercomputers. Such a computer would have to be centralized in order to have low latency. Unlike traditional timeshared facilities, however, processors could be dedicated to individuals to provide guaranteed service. With the advent of HDTV, low-cost video can be distributed directly to the desktop, and as a byproduct users would have video conferencing.
smPs: Scalable Multiprocessors
The Kendall Square Research KSR 1. The Kendall Square Research KSR 1 is a size-and-generation-scalable, shared-memory multiprocessor computer. It is formed as a hierarchy of interconnected "ring multis." Scalability is achieved by connecting 32 processors to form a "ring multi" operating at oneGB/sec (128 million accesses per sec). Interconnection bandwidth within a ring scales linearly, since every ring slot may contain a transaction. Thus, a ring has roughly the capacity of a typical cross-point switch found in a supercomputer room that interconnects 8 to 16, 100MB/sec HIPPI channels. The KSR 1 uses a two-level hierarchy to interconnect 34 rings (1,088 processors), and is therefore massive. The ring design supports an arbitrary number of levels, permitting ultras to be built.
Each node is comprised of a primary cache, acting as a 32MB primary memory, and a 64-bit superscalar processor with roughly the same performance as an IBM RS6000 operating at the same clock-rate. The superscalar processors containing 64 floating-point and 32 fixed-point registers of 64 bits is designed for both scalar and vector operations. For example, 16 elements can be pre-fetched at one time. A processor also has a 0.5MB sub-cache supplying 20 million accesses per sec to the processor (computational efficiency of 0.5). A processor operates at 20MHz. and is fabricated in 1.2 micron CMOS. The processor, sans caches, contains 3.9 million transistors in 6 types of 12 custom chips. Three-quarters of each processor consists of the Search Engine responsible for migrating data to and from other nodes, for maintaining memory coherence throughout the system, using distributed directories, and ring control.
The KSR 1 is significant because it provides size- (including I/O) and generation-scalable smP in which every node is identical; an efficient environment for both arbitrary workloads (from transaction processing to timesharing and batch) and sequential to parallel processing through a large, hardware-supported address space with an unlimited number of processors; a strictly sequential consistent programming model; and dynamic management of memory through hardware migration and replication of data throughout the distributed, processor memory nodes, using its Allcache mechanism.
With sequential consistency, every processor returns the latest value of a written value, and results of an execution on multiple processors appear as some interleaving of operations of individual nodes when executed on a multithreaded machine. With Allcache, an address becomes a name and this name automatically migrates throughout the system and is associated with a processor in a cache-like fashion as needed. Copies of a given cell are made by the hardware and sent to other nodes to reduce access time. A processor can pre-fetch data into a local cache and post-store data for other cells. The hardware is designed to exploit spatial and temporal locality. For example, in the SPMD programming model, copies of the program move dynamically and are cached in each of the operating nodes' primary and processor caches. Data such as elements of a matrix move to the nodes as required simply by accessing the data, and the processor has instructions that pre-fetch data to the processor's registers. When a processor writes to an address, all cells are updated and memory coherence is maintained. Data movement occurs in sub-pages of 128 bytes (16 words) of its 16K pages.
Every known form of parallelism is supported via KSR's Mach-based operating system. Multiple users may run multiple sessions, comprising multiple applications, comprising multiple processes (each with independent address spaces), each of which may comprise multiple threads of control running simultaneously sharing a common address space. Message-passing is supported by pointer-passing in the shared memory to avoid data copying and enhance performance.
KSR also provides a commercial programming environment for transaction processing that accesses relational databases in parallel with unlimited scalability, as an alternative to multicomputers formed from multiprocessor mainframes. A 1K-node system provides almost two orders of magnitude more processing power, primary memory, I/O bandwidth, and mass storage capacity than a multiprocessor mainframe. For example, unlike the typical tera-candidates, a 1,088-node system can be configured with 15.3 terabytes of disk memory, providing 500 times the capacity of its main memory. The 32- and 320-node systems are projected to deliver over 1,000 and 10,000 transactions per sec, respectively, giving it over a hundred times the throughput of a multiprocessor mainframe.
smCs: Scalable Multicomputers for "Commodity Supercomputing"
Multicomputer performance and applicability are determined by the number of nodes and concurrent job streams, the node and system performance, I/O bandwidth, and the communication network bandwidth, delay, and overhead time. Table 1 gives the computational and workload parameters, but for a multicomputer operated as a SPMD, the communications network is quite likely the determinant for application performance.
Intel Paragon: A Homogeneous Multicomputer. This is shown in Figure 8. A given node consists of five i860 microprocessors: four carry out computation as a shared-memory multiprocessor operating at a peak of 300Mflops rate, and the fifth handles communication with the message-passing network. Each processor has a small cache, and the data-rate to primary memory is 50 million accesses per sec, supporting a computational intensity of 0.67 for highly select problems. The message-passing processor and the fast 2D mesh topology provide the very high, full-duplex data-rate among the nodes of 200MB/sec. The mesh provides primitives to support synchronization and broadcasting.
Paragon is formed as a collection of nodes controlled by the OSFI (Mach) operating system with micro kernels that support message-passing among the nodes. Each node can be dynamically configured to be a service processor for general-purpose timesharing, or part of a parallel-processing partition, or an I/O computer because it attaches to a particular device. A variety of programming models are provided, corresponding to the evolution toward a multiprocessor. Two basic forms of parallelism are supported: SPMD using a shared virtual memory and MIMD. With SPMD, a single, partition-wide, shared virtual memory space is created across a number of computers, using a layer of software. Memory consistency is maintained on a page basis. With MIMD a program is optimized to provide the highest performance within a node using vector processing, for example. Messages are explicitly passed among the nodes. Each node can have its own virtual memory.
CM5: A Multicomputer Designed to Operate as a Collection of SIMDs. The CM5 is shown in Figure 9 consisting of 1 to 32 Sun server control computers, Cc, (for a 1K-node system), on which user programs run; the computational computers, Cv, with vector units; Sun-based I/O server nodes, Cio; and a switch to interconnect the elements. The system is divided into a number of independent partitions with at least 32 Cv's that are managed by one Cc. A given partitioning is likely to be static for a relatively long time (e.g., work shift to days). The Sun servers and I/O computers run variants of Sun O/S, providing a familiar user-operating environment together with all the networking, file systems, and graphical user interfaces. Both the SPMD and message-passing programming models are supported. Each of the computational nodes, Cv, can send messages directly to Cio's, but other system calls must be processed in the Cc.
The computational nodes are Sparc micros that control four independent vector processors that operate independently on 8MB memories. A node is a message-passing multicomputer in which the Sparc moves data within the four 8MB memories. Memory data is accessed by the four vector units at 16Maccess per sec (Maps) each, providing memory bandwidth for a computational intensity of 0.5. Conceptually, the machine is treated as an evolution of the SIMD CM2 that had 2K floating-point processing elements connected by a message-passing hypercube. Thus, a 1K-node CM5 has 4K processing element, and message-passing among the 4 vector units and other nodes is controlled by the Sparc processors. The common program resides in each of the nodes. Note that using Fujitsu Vector Processing chips instead of the four CM5 vector chips would increase peak performance by a factor of 3.3, making the 1995 teraflop peak achievable at the expense of a well-balanced machine.
Computational intensity is the number of memory accesses per flop required for an operation(s) of a program. Thus depending on the computational intensity of the operations, speed will vary greatly. For example, the computational intensity of the expression A = B + C is 3, since 3 accesses are required for every flop, giving a peak rate of 21Mflops from a peak of 128. A C90 provides 1.5 Maps per 1Mflops, or a 16-processor system is capable of operating at 8Gflops.
The switch has three parts: diagnosis and reconfiguration; data message-passing; and control. The data network operates at 5MB/sec full duplex. A number of control messages are possible, and all processors use the network. Control network messages include broadcasting (e.g., sending a scalar or vector) to all nodes unless it abstains, results recombining (network carries out arithmetic and logical operations on data supplied by each node), and global signaling, synchronization for controlling parallel programs. The switch is wired into each cabinet that holds the 256 vector computers.
A Score of Multicomputers. Cray Research has a DARPA contract to supply a machine capable of peak teraflop operation by 1995, and a sustained teraflop by 1997 using DEC Alpha microprocessors. Convex has announced it is working on a massively parallel computer, using HP's microprocessor as its base. IBM has several multicomputer systems that it may productize based on RS6000 workstations. Japanese manufacturers are building multicomputers, for example, using a comparatively small number (100s) of fast computers (i.e., 1Gflop) interconnected via very high-speed networks in a small space.
A number of multicomputers have been built using the Inmos Transputer and Intel i860 (e.g., Transtech Parallel Systems). Mercury couples 32, 40MHz. i860s and rates the configuration at 2.5Gflops for signal processing, simulation, imaging, and seismic analysis. Meiko's 62-node multicomputer has a peak of 2.5Gflops and delivered 1.3Gflops for Linpeak, or approximately half its peak on a O(8500) matrix . Parsytec GC consists of 64 16K nodes, delivering a peak of 400Glops. The nCUBE 2 system has up to 8K nodes with up to 64MB per node.
Multicomputers are also built for specific tasks. IBM's Power Visualizer uses several i860s to do visualization transformations and rendering. AT&T DSP3 Parallel Processor provides up to 2.56Gflops, formed with 128 signal-processing nodes. The DSP3 is used for such tasks as signal- and image-processing, and speech recognition. The DARPA-Intel iWarp developed with CMU is being used for a variety of signal-and image-processing applications that are typically connected to workstations. An iWarp node provides only 10Mflops (or 20Mflops for 32-bit precision) and 0.5 to 16MB of memory per node. Each node can communicate at up to 320MB per sec on 8 links.
The Teradata/NCR/AT&T systems are used for database retrieval and transaction-processing in a commercial environment. The system is connected by a tree-structured switch, and the hundreds of Intel 486 leaf nodes processors handle applications, communications, or disk access. A system can process over 1,000 transactions per sec working with a single database, or roughly four times the performance of a multiprocessor mainframe.
Programming Environments to Support Parallelism
Although the spectacular increases in performance derived from microprocessors are noteworthy, perhaps the greatest breakthrough has come in software environments such as Linda, the Parallel Virtual Machine (PVM), and Parasoft's Express that permit users to structure and control a collection of processes to operate in parallel on independent computers using message passing. Of course, user interface software, debuggers, performance-monitoring, and many other tools are part of these basic parallel environments.
Several programming models and environments are used to control parallelism. For multiprocessors, small degrees of parallelism are supported through such mechanisms as multitasking and Unix pipes in an explicit or direct user control fashion. Linda extends this model to manage the creation and distribution of independent processes for parallel execution in a shared address space. Medium (10 to 100) and high degrees of parallelism (1,000) for a single job can be carried out in either an explicit message-passing or implicit fashion. The most straightforward implicit method is the SPMD model for hosting Fortran across a number of computers. A Fortran 90 translator should enable multiple workstations to be used in parallel on a single program in a language evolutionary fashion. Furthermore, a program written in this fashion can be used equally effectively across a number of different environments from supercomputers to workstation networks. Alternatively, a new language, having more inherent parallelsim, such as dataflow may evolve. Fortran will adopt it.
Much of university computer architecture research is aimed at scalable, shared-memory multiprocessors (e.g., ) and supported by DARPA. In 1991, MITI sponsored the first conference on shared-memory multiprocessors in Tokyo, to increase international understanding. It brought together research results from 10 universities (eight U.S, two Japanese), and four industrial labs (three U.S., one Japanese). This work includes, directory schemes to efficiently "track" cached data as it is moved among the distributed processor-memory pairs, performance analysis, interconnection schemes, multithreaded processors, and compilers.
Researchers at the University of California, Berkeley are using a 64-node CM5 to explore various programming models and languages including dataflow. Early work includes a library to allow the computer to simulate a shared memory multiprocessor. An equally important part of Berkeley's research is the Sequoia 2000 project being done in collaboration with NASA and DEC that focuses on real-time data acquistion of 2 terabytes of data per day, secondary and tertiary memories, and very large data-bases requiring multiple accesses.
Seitz at Cal Tech, developed the first multicomputer (interconnected via a hypercube network) and went on to develop high-bandwidth grid networks that use wormhole routing. The basic switch technology is being used in a variety of multiprocessor and multicomputers including Intel and Cray.
The CEDAR project at the University of Illinois is in the completion phase, and scores of papers describe the design, measurements, and the problem of compiling for scalable multiprocessors. Unfortunately, CEDAR was built on the now defunct Alliant Multiprocessor.
MIT has continued researching multiple machine designs. The Monsoon dataflow computer became operational with 16, 5Mflop nodes and demonstrates scalability and implicit parallelism using a dataflow language. The next dataflow processor design, T* is multithreaded to absorb network latency. The J-machine is a simple multicomputer designed for message-passing and low-overhead system calls. The J-machine, like the original RISC designs, places hardware functions in software to simplify the processor design. A J-machine, with sufficient software, carries out message-passing functions to enable shared-memory multiprocessing. Alewife, like Stanford's DASH is a distributed multiple multithreaded processor that is interconnected via a grid switch. Additional efforts are aimed at switches and packaging, including a 3D interconnection scheme.
Rice University continues to lead compiler research in universities and was responsible for the HPF Forum. HPF is a successor to Fortran D (data parallelism) that was initially posited for all high-performance machines (SIMDs, vector multiprocessors, and multicomputers). The challenge in multicomputers is initial data allocation, control of memory copies, and avoiding latency and overhead when passing messages.
Stanford's DASH is a scalable multiprocessor with up to 64 processors arranged in a grid of 4 x 4 nodes. Each node consists of a four-processor Silicon Graphics multiprocessor that is roughly equivalent to a uniprocessor with an attached vector unit. DASH demonstrated linear speedups for a wide range of algorithms, and is used for compiler research. Some applications have reached over 100Mflops for 16 processors, which is about the speed a four-vector processor system of comparable speed achieves. Since the system is relatively slow, it is unclear which principles have applicability for competitive computers.
DARPA has funded Tera Computer to start-up. Tera, a second-generation HEP, is to have 256 128-instruction stream processors or 32K processors and operate in 1995. With a multiple instruction stream or multithreaded processor, any time a processor has to wait for a memory access, the next instruction stream is started. Each processor is built using very fast gate arrays (e.g., GaAs) to operate at 400MHz. The expected latency to memory is between 40 and 100 ticks, but since each processor can issue multiple requests, a single physical processor appears to support 16 threads (or virtual processors). Thus, a processor appears to have access to a constant, zero latency memory. Since a processor is time-shared, it is comparatively slow and likely to be unusable for scalar tasks, and is hardly a general-purpose computer according to Smith's definition . The physical processors connect to 512 memory units, 256 I/O cache units (i.e., slower memories used for buffering I/O), and I/O processors through a 4K-node interconnection network. In order to avoid the "dance hall" label, the network has four times more nodes than are required by the components. Tera's approach has been to design a computer that supports a parallelizing compiler.
Summary and Conclusions
In 1989 I projected that supercomputers would reach a teraflop by 1995 using either a SIMD or multicomputer approach, but neglected to mention the price. By mid-1992, scalable multicomputers have improved by a factor of 5 to 10 to reach 100 [+ or -] 20 Gflops at a supercomputer price level, and the SIMD approach was dropped. Scalable multicomputers also break through the $30 million supercomputer barrier to create a teraflop ultracomputer ($50 to 400 million), and are not recommended buys. In 1995 semiconductor gains should increase performance by a factor of four and competition should reduce the price a factor of two to supercomputer levels as projected in Figure 1. Given the number of applications and state of training, waiting for the teraflop at the supercomputer price level is recommended.
Multiprocessors have been the mainline of computing because they are general-purpose and can handle a variety of applications from transaction-processing to time-sharing, that are highly sequential to highly parallel. KSR demonstrated that massive, scalable distributed, shared-memory multiprocessors (smPs) are feasible. Multicomputers from the score of companies combining computers will evolve to become multiprocessors.
Important gains in parallelism have come from software environments that allow networks of computers to be applied to a single job. Thus every laboratory containing fast workstations has or will have its own supercomputer for highly parallel applications. The rapid increase in microprocessor power ensures that the workstation will perform at near super speed for sequential applications. LAN environments can provide significant supercomputing for highly parallel applications by 1995. It is critical for companies to provide fast, low-overhead switches that will allow users to configure multicomputers composed of less than 100 high-performance workstations, because these are likely to provide the most cost-effective and useful computing environments.
A petaflop (10(15) floating-point operations per sec) at less than $500 million level is likely to be beyond 2001. By 2001, semiconductor gains provide an increase factor of 16 over 1995 computers. Better packaging and lower price margins through competition could provide another increase factor of two or three. The extra increase factor of 20 for a petaflop is unclear. Based on today's results and rationales, a petaflop before its time is as bad an investment as the teraflop before its time. Evolution and market forces are just fine ... if we will just let them work.
The author would like to thank Peter Denning and other reviewers for their helpful suggestions in the editing process. These individuals provided performance data and helpful insight: David Culler (University of California, Berkeley), Charles Grassl (Cray Research), Justin Rattner (Intel), Ron Jonas and Ruby Lee (HP), Chani Pangali (Kendall Square Research), Frank MacMahon (Lawrence Livermore National Laboratories), John Mashey (MIPS), Tadashi Watanabe (NEC), Jack Dongarra (Oak Ridge National Laboratory and University of Tenn.), Mike Humphrey (Silicon Graphics Inc.), Burton Smith (Tera Computer), David Douglas and John Mucci (Thinking Machines Corp.), and Jack Worlton (Worlton & Assoc.)
(1) Cray 1 (1975), Cray IS (1978), Cray XMP-2, 4 (1982, 1984), Cray YMP-8 (1988), and Cray C-90 (1992)
1. Bailey, D.H. Twelve ways to fool the masses when giving performance result on parallel computers. Supercomput. Rev. (Aug. 1991), 54-55.
2. Bell, G. The future of high performance computers in science and engineering. Commun. ACM 32, 9 (Sept. 1989), 1091-1101.
3. Bell, G. 11 rules of supercomputer design. University Video Communications Videotape, vol. III, Stanford Calif., 1989.
4. Bell, G. Three decades of multiprocessors. CMU Computer Science: 25th Anniversary Commemorative R. Rashid, Ed. ACM Press, Addison-Wesley, Reading, Mass., 1991, pp. 3-27.
5. Berry, M., Cybenko, G., and Larson, J. Scientific benchmark characterizations. Parallel Comput. 17, (1991), 1173-1194.
6. Denning, P.J. Working sets past and present. IEEE Trans. Softw. Eng. SE-6 (Jan. 1980), 64-84.
7. Denning, P.J. and Tichy, W. Highly parallel computation. Science 250, xx (Nov. 30, 1990), 1217-1222.
8. Dongarra, J.J. Performance of various computers using standard linear equation software. University of Tennessee and Oak Ridge National Laboratory, CS-89-85, Jan. 13, 1992.
9. Dongarra, J.J., Karp, Miura, K. and Simon, H.D. Gordon Bell Prize Lectures. In Proceedings Supercomputing 91 (1991), pp 328-337.
10. Gustafson, J.L., Montry, G.R. and Benner, R.E. Development of parallel methods for a 1024 processor hypercube. SIAM J. Sci. Stat. Comput. 9, 4 (July 1988), 609-638.
11. Hennessy, J.L. and Patterson, D.A. Computer Architecture: A Quantitative Approach. Morgan Kaufman, San Mateo, Calif., 1990.
12. Hill, M.D. What is scalability? Computer Architecture News 18, 4 (Dec. 1990), 18-21.
13. Hillis, D. and Steele, G. The CM5, University Video Communications, Two videotapes, vol. IV, Stanford Calif., 1992.
14. Hockney R.W. and Jesshope, C.R. Parallel Computers 2, Adam Hilger, Bristol, 1988.
15. IEEE Computer Society Technical Committee on Supercomputing Applications. NSF Supercomputer Centers Study, Feb. 1992.
16. Li, K. and Schafer, R.A hypercube shared virtual memory system. 1989 International Conference on Parallel Systems,
17. Office of Science and Technology Policy (OSTP). Grand challenges 1993: High performance computing and communications, A report by the Committee on Physical, Mathematical, and Engineering Sciences of the Federal Coordinating Council for Science, Engineering, and Technology in the Office of Science and Technology Policy, National Science Foundation, 1992.
18. Nielsen, D.E. A stategy for smoothly transitioning to massively parallel computing. Energy and Tech. Rev. (Nov. 1991), 20-31.
19. Pancake, C.M. Software support for parallel computing: Where are we headed? Commun. ACM 34., 11 (Nov. 1991), 52-66.
20. Scott, S.L. A cache coherence mechanism for scalable, shared-memory multiprocessors. In Proceedings of the International Symposium of Shared Memory Multiprocessing Information Processing Society of Japan, (Tokyo, Apr., 1991) 49-59.
21. Smith, B. The end of architecture. Comput. Architecture News 18., 4 (Dec. 1990), 10-17.
22. Watanabe, The NEC SX3 Supercomputer, University Video Communications, Videotape, Stanford Calif., 1992.
23. Worlton, J.A critique of "massively" parallel computing. Worlton & Associates Tech. Rep. 41, Salt Lake City Utah, May 1992.
24. Worlton, J. The MPP Bandwagon. Supercomput. Rev., To be published
CR Categories and Subject Descriptors: B.0 [Hardware]: General; C.0 [Computer Systems Organization]: General--instruction set design (e.g., RISC, CISC); J.1 [Computer Applications]: Administrative Data Processing--government, system architectures
General Terms: Design, Management
Additional Key Words and Phrases: Government policy, parallel processing, scientific programming
About the Author:
GORDON BELL is a computer industry consultant. He has been vice president of R&D at Digital Equipment Corp., a founder of various companies, and was the first assistant director of the NSF Computing Directorate. From 1966 to 1972 he was a professor of computer science at Carnegie Mellon University. In 1991 Bell won the National Medal of Technology and in 1992 was awarded the IEEE's Von Neumann Medal. Author's Present Address: 450 Old Oak Court, Los Altos, CA 94022
|Printer friendly Cite/link Email Feedback|
|Title Annotation:||includes related article on Tera Taxonomy|
|Author:||Bell, C. Gordon|
|Publication:||Communications of the ACM|
|Date:||Aug 1, 1992|
|Previous Article:||Computing in the Middle East.|