Scalable software for successful research: keeping computation-dependent scientific research competitive.
The trend to massive parallelism appears to be inescapable, as current high-end commodity processors, the workhorse for most scientific computing applications, now support eight or more simultaneous threads of execution per CPU socket. Most computational nodes and scientific workstations contain several of these multi-core processors. Multi-threaded applications that utilize all this hardware capability can deliver an order of magnitude increase in computational throughput and corresponding decrease in time-to-solution. Future systems will support even more processors and threads per processor.
Unfortunately, these new multi-core workstations only deliver increased performance when the number of software tasks equals or exceeds the number of hardware processing units. This makes sense: the software must break a computational problem into enough tasks to give each hardware computational unit something to do, and hardware units sit idle when there are fewer software tasks than hardware units. In a nutshell, this is the argument behind Amdahl's law, which notes that the speedup of a program using multiple processors is limited by the time spent in any sequential portions of the code.
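To make the Amdahl's law argument concrete, here is a minimal sketch in Python. The function name and the sample parallel fractions are illustrative choices, not figures from any particular application. It computes the standard bound, speedup = 1 / ((1 - p) + p / n), where p is the fraction of run time that can be parallelized and n is the number of hardware processing units.

```python
def amdahl_speedup(parallel_fraction: float, n_units: int) -> float:
    """Upper bound on speedup predicted by Amdahl's law.

    parallel_fraction: share of the single-unit run time that can be
                       spread across hardware units (0.0 to 1.0).
    n_units:           number of hardware processing units (>= 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part never shrinks; only the parallel part is divided.
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


if __name__ == "__main__":
    # Illustrative (hypothetical) parallel fractions and unit counts.
    for p in (0.50, 0.90, 0.99):
        for n in (8, 64, 1024):
            print(f"p={p:.2f}  units={n:5d}  speedup<={amdahl_speedup(p, n):8.2f}")
```

Even with an unlimited number of processing units, the bound approaches 1 / (1 - p), so a code that is 90 percent parallel can never run more than 10x faster.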
As a result, scientists planning future research efforts cannot just assume that existing software will run faster on newer hardware. If the software used in the project does not scale, the extra processing cores of newer commodity workstations will not help speed computational research. Effectively, processing throughput will plateau at or near current levels, which can have serious consequences for computation-based research because it defines the limits of what work can be performed and how competitive a project might be relative to other computation-based projects. In effect, the mass adoption of parallel architectures in the computer industry is forcing researchers to evaluate the scaling behavior of their existing software and consider investing in multi-threaded and highly-parallel software.
So, what sort of performance can be achieved with multithreaded software, with what benefits, and should hybrid general purpose graphics processing unit (GPGPU) technology be considered? As mentioned previously, current high-end multi-core processors can deliver a 10x performance increase and, in a few years, even low-end systems will have this same capability. An order of magnitude is certainly a significant advance, but such performance does not necessarily represent a fundamental change for computation-dependent science. Machines that are 10x faster make the computational workflow more interactive, because tasks that previously took hours only take minutes, and extended computational work that previously took days can occur overnight.
Extraordinary performance improvements for many problems can be achieved by plugging GPGPU technology into commodity workstations and computational nodes. The scientific and technical literature published over the past two years demonstrates the success many researchers have had in achieving one to two orders of magnitude speedup (10x to 100x) in performance over conventional processors. In very rare cases, some researchers have even reported three orders of magnitude, or 1000x, greater performance by heavily utilizing the highly optimized transcendental function units on NVIDIA GPGPU hardware.
Those applications that achieve high performance are massively-threaded so they can fully utilize the many hundreds to thousands of simultaneous hardware threads of execution that become available when one or several GPGPU boards are plugged into a conventional processor motherboard. The very highest performing applications also exhibit relatively high data reuse within the graphics processors.
As discussed in my January 2008 column, "GPGPUs: Neat Idea or Disruptive Technology," applications that deliver 100x or faster performance are disruptive and have the potential to fundamentally affect scientific research by removing time-to-discovery barriers. Computational tasks that previously would have required a year to complete can finish in days. Better scientific insight becomes possible, because researchers can work with more data and have the ability to utilize more accurate, albeit computationally expensive, approximations and numerical methods.
For the experimentalist in particular, the results of newer high-throughput instruments (or collections of many instruments) can be utilized to create higher-resolution and more informative pictures of what is occurring in nature, potentially in real time. One example is the Harvard Connectome project, which is an effort to map all the connections in the brain and peripheral nervous system of model animals, such as the lab mouse, utilizing massively-parallel computational techniques to assemble and identify features based on 3 nm/pixel high-resolution microscopy images. The project homepage notes, "The critical challenges are computational, as the total number of voxels needed to establish the Connectome is ~10^14." A voxel is essentially a 3D pixel.
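A rough back-of-the-envelope calculation suggests why roughly 10^14 voxels is a computational challenge. The sketch below assumes one byte per voxel and a sustained read rate of 1 GB/s; both numbers are hypothetical values chosen for illustration and are not taken from the project.

```python
def dataset_size_bytes(n_voxels: int, bytes_per_voxel: int = 1) -> int:
    """Raw storage needed for a voxel volume (no compression, no metadata)."""
    return n_voxels * bytes_per_voxel


def single_pass_hours(size_bytes: int, bytes_per_second: float) -> float:
    """Time to stream the whole data set once at a sustained rate."""
    return size_bytes / bytes_per_second / 3600.0


if __name__ == "__main__":
    n_voxels = 10**14          # order of magnitude quoted by the project
    size = dataset_size_bytes(n_voxels, bytes_per_voxel=1)   # assumed depth
    print(f"raw size: {size / 1e12:.0f} TB")
    # Assumed 1 GB/s sustained read rate on a single node.
    print(f"one pass: {single_pass_hours(size, 1e9):.1f} hours")
```

Under these assumptions the raw data occupies about 100 TB, and simply reading it once on a single node takes more than a day, before any feature identification is performed.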
Some single-threaded distributed applications that utilize MPI or other distributed software frameworks can use multicore hardware by assigning a separate process to each computational core. However, it is important to understand the distinction between distributed and multi-threaded programming models to interpret the costs associated with each approach.
MPI, for example, is a well-established and commonly used programming framework. Many applications that utilize MPI--even those that are single threaded--can run on conventional multi-core processors. Essentially, each processor core is utilized as a separate computational process distinct from all other MPI processes. Using MPI to run on multi-core processors can work, but it might also waste significant memory and introduce needless communications overhead, because all shared and modified data must be communicated and stored separately amongst the processes.
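The memory cost of running one process per core can be estimated with simple arithmetic. The sketch below is a worst-case model under a stated assumption: every process keeps its own full copy of the data that threads would share. Real MPI applications often partition their data, so the replicated figure is an upper bound. The function names and sizes are hypothetical.

```python
MIB = 2**20  # bytes in one mebibyte


def replicated_bytes(shared_bytes: int, private_bytes: int, n_workers: int) -> int:
    """Process-per-core model: each process holds its own copy of everything."""
    return n_workers * (shared_bytes + private_bytes)


def shared_memory_bytes(shared_bytes: int, private_bytes: int, n_workers: int) -> int:
    """Threaded model: one copy of the common data plus per-thread scratch space."""
    return shared_bytes + n_workers * private_bytes


if __name__ == "__main__":
    # Hypothetical workload: 2 GiB of common data, 64 MiB of scratch per worker.
    shared, private, cores = 2048 * MIB, 64 * MIB, 8
    print(f"one process per core: {replicated_bytes(shared, private, cores) // MIB} MiB")
    print(f"one threaded process: {shared_memory_bytes(shared, private, cores) // MIB} MiB")
```

The gap widens as the core count grows, because the replicated model multiplies the common data by the number of workers while the threaded model stores it once.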
In contrast, threaded programs generally share a common memory space, which can eliminate redundant storage and needless communications overhead. GPGPUs in particular utilize a very efficient form of massive-threading. It is worth noting that MPI can be utilized in a hybrid computational approach to distribute an application across many devices, where each distinct MPI process internally utilizes a thread-based model to run efficiently within the multi-core computer or on a GPGPU. Many researchers, including myself, have achieved excellent performance and scaling behavior with this hybrid approach.
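The shared-memory idea can be illustrated with a small Python sketch in which several threads read disjoint index ranges of one list without copying it or exchanging messages. This shows the programming model only: in standard CPython the global interpreter lock prevents pure-Python threads from running computation in parallel, so this example is not a performance demonstration. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(data, start, stop):
    """Sum data[start:stop] by index, reading the shared list in place."""
    total = 0
    for i in range(start, stop):
        total += data[i]
    return total


def threaded_sum(data, n_threads=4):
    """Split the index range across threads; all threads see the same list."""
    n = len(data)
    chunk = max(1, (n + n_threads - 1) // n_threads)  # ceiling division
    bounds = [(s, min(s + chunk, n)) for s in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(partial_sum, data, s, e) for s, e in bounds]
        # Only the small partial results are combined; the data never moves.
        return sum(f.result() for f in futures)


if __name__ == "__main__":
    values = list(range(1_000_000))
    print(threaded_sum(values, n_threads=8))
```

In a hybrid design, each MPI process would run a threaded routine of this kind on its own portion of the problem, and only the partial results would be communicated between processes.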
Be aware that distributed programming frameworks like MPI do not run on the graphics processors but, rather, on the conventional processors of the host computer. Any parts of the program that utilize graphics processors must be written (or compiled) to utilize large numbers of concurrent threads of execution within the internal threaded environment of the GPU.
For many projects, it will be necessary to commit resources to rewriting some software to effectively utilize a large number of simultaneous threads so that it can exploit newer computers and the ever more massively parallel systems that will be generally available in the next few years. Such a commitment appears to be necessary to keep computation-dependent research competitive as massively parallel hardware becomes ever more inexpensive, capable and ubiquitous in the world-wide scientific community.
(1.) The Harvard Connectome Initiative: iic.harvard.edu/research/connectome
(2.) GPGPUs: Neat Idea or Disruptive Technology: www.scientificcomputing.com/gpgpus-neat-idea-or-disruptive.aspx
(3.) NVIDIA Community Showcase: www.nvidia.com/cuda
Rob Farber is a senior research scientist at Pacific Northwest National Laboratory. He may be reached at editor@ScientificComputing.com
|Title Annotation:||High Performance Computing|
|Date:||Mar 1, 2010|