Can parallel algorithms enhance serial implementation?
Performance improvement is a fundamental concern in computer science and engineering. Observing the history of the field, one would expect that any improvement in the ability of computer systems would be quickly met by applications utilizing it. Over the last few decades, much effort has been invested in seeking gains in performance by using parallel computers. However, the state of the art is that the programming models for many available parallel computers do not make them very, easy to program. In contrast, this article presents a software-centric approach, in which ease of programming is a first priority for both uniprocessors and multiprocessors.
Here, I outline two concrete reasons and one general reason why parallel programs could give a (modest) gain in performance over serial code on uniprocessors, especially with the current trends in uniprocessor architecture.
Hiding latency to memory through prefetching. In a parallel program with p-fold parallelism it is possible to prefetch data p steps ahead of when it needed, since there are p tasks to execute while waiting for the result. If the available bandwidth permits these p prefetches, then this will make all references appear as if they are in fast cache memory, rather than in slower main memory. The response rate of main memory then becomes limited by total bandwidth to that memory, but it is no longer strictly limited by latency, to main memory. Increased bandwidth can be bought, and the price is dropping. Decreasing latency is much more difficult, and large improvements appear to be problematic.
Instruction-level parallelism. A parallel program should provide a which higher degree of instruction level parallelism (ILP) than a serial program. The extensive use of ILP is one of the most important advances made in commodity processors over the past decade. However, current designs are approaching, if not hitting, the limit in terms of how much inherent parallelism appears in standard serial programs. To show the increasing role that ILP plays in computer architecture, we need only to note that nearly 20% of  is devoted to that issue. See references therein to quite a few processors using ILP that were announced in 1994-95 by many major computer vendors.
State changes. Consider an instantaneous description of a computer system that provides the state of all communication channels and memory locations at some point in time. The length of this description can reach billions of values. Let one clock cycle pass and compare the new instantaneous description with the old one. The number of values that change clearly upper-bounds the parallelism the system may provide, and the fundamental challenge is to make the most effective use of this changing-values parallelism. An important trend in computer architecture is the continuing increase in changing-values parallelism, and we suggest using parallel algorithms for these fundamental challenges.
This leads us to the rather broad thesis of this article: A parallel program is potentially an invaluable asset for efficient implementation purposes. Since a parallel program allows concurrent tasks to be processed in any order, or even simultaneously, it also gives much more freedom for optimization to the hardware, operating system, and compiler (namely, gives more freedom to lower-level implementations; such as cache, prefetch, pipelining, code rearrangement, and estimators of working set and code behavior in operating systems). This flexibility is very, structured and might be more useful than information about execution that is gathered from:
* Statistical analysis of execution,
* Static analysis of flow diagrams,
* Ad-hoc optimizations in run-time, and
* Various heuristics.
Further, the parallel paradigm allows more than just prediction of future memory requests. It allows the lower levels to alter the execution so that it will be best suited to the internal organization of the lower level. For instance, loading several independent tasks into the registers (an instance of the general prefetch idea) may result in better utilization of the processing unit. These examples suggest that much greater advantage could be taken of the trend toward increasing changing-values parallelism using parallel programs, well beyond what is suggested in this article.
The thesis implies a grand goal, which is still relatively more modest than the longer-term computer-science-and-beyond challenge of making parallel computer systems and parallel programming ubiquitous for general-purpose computing. The longer-term challenge should respond to the main driving force in the quest for increased performance of computer systems. These are application domains (within computer science and beyond) as articulated, for instance, in the U.S. Federal High Performance Computing program. The so-called "Grand Challenge" scientific computing applications and "National Challenge" applications, which include the realms of health care, electronic commerce, education, digital libraries, and crisis management are a few key examples.
The grand goal is to make parallel programming attractive option for standard sequential procesors. I suggest this grand goal deserves careful engineering prototyping. For parallel computing, accomplishing the goal would be a key step in meeting the longer-term challenge. In turn, much of the research outlined here is already being executed in the context of parallel machines. This research could be applied to uniprocessors, perhaps with some modification.
The following areas within computer science and engineering should take part either in such prototyping or in an actual implementation of the grand goal.
Programming languages. Current programming languages need to be modified to incorporate ways the programmer can identify parallelism that may be hard for the compiler to find automatically. Some languages, such as High-performance Fortran and NESL , already enable expressing parallelism.
Compilers. A considerable effort is needed to modify current compilers to enable them to take advantage of parallelism identified by the programmer.
Hardware configuration needs to be adapted to take advantage of this parallelism. Modifications in hardware configuration range from relatively modest suggestions with respect to current serial computers to move toward our grand goal to the much more ambitious suggestion of eventually developing a full-fledged parallel computer to move toward the long-term challenge. In a modified serial computer, the memory management system should, given a list of future memory requests, enable them to be prefetched (i.e., get the required data into the cache) efficiently. In other words, the available parallelism is translated into predictability of reference.
Having computer system designers rely on predictability of reference as much as they have been relying on locality, of reference will involve quite a revolution for computer architecture, since it would profoundly affect a fundamental methodology of computer systems design, even if locality of reference continues to play a crucial role. Improved multi-threading is one way to take advantage of parallelism, since we may want the system to concurrently support several threads of comptutation. I have already mentioned increased instruction-level parallelism; increased use of pipelining is yet another way to take advantage of parallelism.
A processor that will function well in such a serial system is likely to be better suited to a multiprocessor environment, streamlining the transition into parallel systems from the architecture side as well. In a modified parallel computer, some of this parallelism will be directed toward the changing-values parallelism of each of the processors and some toward effective utilization of the number of processors.
Programming and design of algorithms. In order to prepare available algorithms for parallel machines, try to rewrite the program to express as much parallelism as the algorithm permits -- or come up with a new algorithm and program. The goal is that the rewritten (or new) program will run in the smallest possible number of parallel rounds without increasing its total number of operations, or with only a modest increase in the number of operations. How much of an increase in the total number of operations is acceptable depends on how advantageous the implementation of the parallel program (as advocated here) will be, taking constant factors into account. Note that getting the absolute fastest possible parallel running time may not be as significant, since beyond a certain level of parallelism the incremental advantage of additional parallelism is very small.
It appears to be easy in many cases to rewrite a program to incorporate a modest amount of parallelism, and for the applications described here even a modest level may already be helpful. Of course, additional parallelism may be even better. For example, a serial emulation of a parallel program whose parallel running time (i.e., number of parallel cycles, where each instruction is performed in the earliest possible cycle while theoretically assuming no shortage in the number of available processors) is ten times faster than its serial counterpart may, theoretically, already yield better performance.
We do not expect the possible need for development of a new parallel algorithm to be a significant disadvantage. Parallel algorithmics has undergone major developments in the last few ears, and parallel algorithms and techniques for many problems already exist. For brevity, we reference only  and , which describe how to use the PRAM model of parallel computation as a programmer's model. In any case, the development of a new parallel algorithm is of independent interest, since the general long-term trend in computer science is in the direction of parallelism and a new parallel algorithm will be added to a long-lasting knowledge base. But how will people know to design parallel algorithms and write parallel programs?
Education. Thinking in parallel is an alien culture and therefore has to be taught in school. So an interesting outcome of our approach is that parallel algorithms would have to be taught as part of the standard computer science and engineering undergraduate curriculum irrespective of whether (or when) parallel processing becomes ubiquitous in the general-purpose computing world. This curriculum update should occur in anticipation of future changes, since one of the objectives of such a curriculum is to prepare young graduates for a lifetime professional career. These thoughts led me to work on .
I see this grand goal of making parallel programming an attractive option for sequential machines as important for the following reasons:
* It unifies third-party software for sequential machines with parallel machines, enabling parallel computing to compete to become a platform for software commodity products.
* It supplies the education system with a means of preparing programmers for a changing environment without the criticism that parallel programming is limited to a small part of the market.
* It influences the commodity processor market to evolve toward being part of a multiprocessor assemblage in a parallel computer system.
* It places a stronger emphasis on the ease of programming for parallel programming. This is because ease of programming is a necessary condition for the programmer's model to become an attractive option.
* It supplies a novel path for bringing many of the ideas developed in the parallel computing community to a wider community.
* I envision it as providing a strong impetus in the long term for the use of parallel machines.
Strategic Plan for Transition into
The long-term challenge of bringing about a transition into ubiquitous parallel computer systems and parallel programming for general-purpose computing should become an engineering project once the basic technology is available, and the implementation of any engineering project must follow a plan.
I believe the ideas presented here can actually be a basis for a two-step plan. The plan is gradual, advancing toward the revolutionary long-term challenge in an evolutionary, step-by-step manner, separating system evolution from application programming evolution. The primary purpose of the first step is accomplishing the grand goal of making parallel programming an attractive option for standard sequential processors. By the time the grand goal is accomplished, which should take several years, the application programmer will have extended his or her programming model to include parallel programming, and computer systems will have been updated to support such programming efficiently. The first step will also use the time it takes to accomplish the grand goal to phase out and replace serial code, which will not run efficiently. on parallel systems. The second step is exclusively in the domain of computer systems and should begin in parallel with the first step: Plan a series of prototype parallel systems whose goal is a parallel system that efficiently, supports the extended programmer's model and will be available by the time the first step is accomplished.
In some ways parts of the plan are already being funded by, the government (e.g., the Tera computer for multithreading) and industry (e.g., many vendors are adding prefetching). Government support appears to be necessary for any plan whose main benefits are expected in more than two to three years. As we have witnessed in the last few years, the understanding of which research and development plan should be supported is not entirely consistent across different administrations and congressional committees. However, following the multi-billion dollar government investment in high-performance computing, it would certainly be counterproductive and entirely inconsistent to halt now, at a point where we begin to derive concrete benefits. Finally, we believe some sort of consortium to guide the effort will be needed.
The project which resulted in  asked explicitly for such a plan. Even in view of the many outstanding position papers in that volume which address numerous cutting-edge issues in parallel computing research, we still feel our plan is unique -- probably a first of its kind, offering a new approach for coping with the problem of having so little software for existing parallel systems. This problem is often referred to as "the parallel software crisis" . We hope the parallel computing community will find the plan thought-provoking. Some colleagues may want to follow it, while others, who may feel challenged to come up with a plan of their own, can use it as a strawman.
Our basic approach is software-centric, since our plan provides a bridgehead into parallel computer systems by letting the programmer input the forms of parallelism with which to work comfortably, rather than the more common hardware-centric approach, which is based on building parallel systems first and only then determining how useful they end up being. A software-centric approach makes sense, since over the life of a commercially successful computer, the cost of application software products and development typically dwarfs direct computer systems costs. It is for this reason I believe it is important not to design a programming model that overemphasizes the programming for locality: It has become clear that programming for locality is a significant burden on a programmer for all but a limited set of regular applications.
Furthermore, although it is true that many current parallel systems have quite different bandwidths for local and remote memory, the trend among processors is toward higher bandwidths. Burton Smith's keynote address at Frontiers'95 provides strong support for our approach. He concluded his talk by saying "The parallel software crisis is the hardware fault." Another argument is as follows: Latency-tolerating mechanisms in nodes and the question of whether future microprocessors should contain one very large processor or move to multiple processors (and, if the latter, in what style) are longer-term issues whose resolution will be extremely difficult and of critical importance for multiprocessors and multicomputers as the number of transistors per chip continues to increase. These points were made by John Hennessy, in  and suggest that we should expect considerable pluralism among hardware-centric approaches in the longer term. The anticipated debate among those approaches can only be enriched by software-centric approaches.
The experimental results reported in  were accomplished with Suleyman Sahinalp. The impact of e prefetching idea was empirically evaluated for a uniprocessor that, instead of disks, uses the fast memories of other computers on the same local area network as secondary memory. This help and numerous comments from Guy Blelloch, Mike Franklin, Yossi Gil, Allan Gottlieb, Yossi Matias, and Amy Weinberg are gratefully acknowledged.
References [1.] Blelloch, G.E., Chatterjee, S., Harwick, J.C., Sipelstein, J., and Zagha, M. Implementation of a portable nested data-parallel language. In Proceedings of the 4th ACM PPOPP. 1993, pp. 102-111.
[2.] Hennessy, J.L. and Patterson, D.A. Computer Architecture: A Qualitative Approach. 2d ed. Morgan Kaufmann, San Mateo, Calif., 1996.
[3.] Ja'Ja', J. An Introduction to Parallel Algorithms. Addison-Wesley, Reading, Mass., 1992.
[4.] Metcalf, M., and Reid, J. Fortran 90 Explained. Oxford University, Press, New York, 1990.
[5.] Pancake, C.M. Software support for parallel computing: Where are we headed? Commun. ACM 34, 11 (199 1), pp. 53-64.
[6.] Vishkin, U. Can parallel algorithms enhance serial implementation? In Proceedings of the 8th IEEE International Parallel Processing Symposium. 1994, pp. 376-385. Univ. of Maryland UMIACS-TR-91-145.1. To ftp: (a) ftp to ftp.cs.umd.edu; (b) login as "anonymous" with your electronic address as the password; (c) cd pub/papers/papers/2784.1; (d) to get a compressed postscript file: get 2784.1.ps.Z.
[7.] Vishkin, U., Ed. Developing a Computer Scienre Agenda far High-Performance Computing. ACM Press, New York, 1994.
[8.] Vishkin, U. Thinking in Parallel: Some Basic Data-Parallel Algorithms and Techniques. Monograph. In preparation.
UZI VISHKIN is Professor at the Institute for Advanced Computer Studies and Electrical Engineering Department at the University of Maryland. He is also Professor in the Department of Computer Science at Tel Aviv University and an ACM Fellow. Author's Present Address: The University of Maryland Institute for Advanced Computer Studies (UMIACS), College Park, MD 20742-3251; email: email@example.com
For a more technical version of this article, readers are referred to , where theoretical modeling and experimental evidence of the value of prefetching in parallel algorithms presented, as well as a review of related directions and literature.
This work has been partially supported by NSF grants CCR-9111348 and 9416890.
|Printer friendly Cite/link Email Feedback|
|Title Annotation:||parallelism in computer programs|
|Publication:||Communications of the ACM|
|Date:||Sep 1, 1996|
|Previous Article:||Madefast: collaborative engineering over the Internet.|
|Next Article:||Markets and privacy.|