Realizing the benefits of affordable teraflop-capable hardware: exciting things are happening with this technology in the hands of the masses.
The scientific community is generally very accepting of new computational hardware that can be leveraged to provide exceptional new performance and capability for research. NVIDIA first publicly released Compute Unified Device Architecture (CUDA) in February 2007 and, roughly three years later, a large number of scientists and research organizations around the world are now using CUDA. Figure 1 illustrates performance increases reported by 438 of these research efforts (along with some commercial companies) that decided to showcase their GPGPU performance on the NVIDIA Web site. All of these applications are written in high-level languages, such as C, and utilize the NVIDIA CUDA development tools and programming environments.
This graph demonstrates that current commodity GPGPU hardware can deliver speedups of one to two orders of magnitude for applications ranging from economics and computational finance to modeling the infinitesimal with quantum chemistry. By exploiting multiple GPGPUs and/or the very high performance on transcendental functions that GPUs provide, some applications can achieve three orders of magnitude (1000x) the performance of a single-processor system.
Harvard Connectome is a favorite project of mine that combines GPGPU computational ability with some excellent robotic technology in an effort that will create 3-D wiring diagrams of the brains of various model animals, such as the cat and lab mouse. This project exemplifies early recognition of the potential inherent in inexpensive parallel computing technology. It might provide the basis for seminal and disruptive scientific research. Instead of guessing, vision researchers and neurologists might finally have a Galilean first-opportunity ability to see, study and model using extraordinarily detailed data created by the Connectome project. Skeptics will surely disagree, but I cannot help but get excited by the potential, and my sense is that a future scientific and technology revolution could start based on observations and data produced by this project.
GPGPU technology enables orders-of-magnitude performance increases that researchers in numerous other scientific areas are exploiting as well. The Metropolis algorithm, for example, is among 10 algorithms acknowledged as having the greatest influence on the development and practice of science and engineering in the 20th century. (1) Widely used in searching for the minimum energy state in statistical physics, it is an instance of a large class of sampling algorithms known as Markov chain Monte Carlo (MCMC) methods that have played a significant role in statistics, econometrics, physics and computing science. For some applications, MCMC simulation is the only known general approach for providing a solution within a reasonable time. Results in the literature indicate GPGPUs can increase performance for Metropolis and Monte Carlo algorithms from 300x to 1000x that of a single-core processor. (2,3,4) Based on their own interests and inclinations, others will certainly point out that generally available and affordable parallel processing, of which GPGPU technology is but a subset, is having a dramatic effect on a multitude of other computational methods and research efforts. From a software perspective, they will point out that CUDA is but one programming environment and that there are others, including OpenCL and distributed frameworks such as MPI and Hadoop, that support the development of scalable and efficient software.
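For readers unfamiliar with the Metropolis algorithm mentioned above, the following is a minimal single-threaded sketch in Python, not the implementation used in any of the cited studies. The function names (`metropolis`, `standard_normal_log_density`), the Gaussian random-walk proposal, the step size and the toy target distribution are all assumptions chosen for illustration.

```python
import math
import random


def metropolis(log_density, x0, n_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis sampler for a 1-D unnormalized log density.

    Returns the list of visited states and the fraction of accepted moves.
    """
    rng = random.Random(seed)
    x = x0
    log_p = log_density(x)
    samples = []
    accepted = 0
    for _ in range(n_steps):
        # Propose a symmetric random-walk move around the current state.
        candidate = x + rng.gauss(0.0, step_size)
        log_p_candidate = log_density(candidate)
        # Accept with probability min(1, p(candidate) / p(x)).
        accept_prob = math.exp(min(0.0, log_p_candidate - log_p))
        if rng.random() < accept_prob:
            x, log_p = candidate, log_p_candidate
            accepted += 1
        # A rejected move repeats the current state in the chain.
        samples.append(x)
    return samples, accepted / n_steps


def standard_normal_log_density(x):
    """Toy target (an assumption): energy E(x) = x^2 / 2, log density -E(x)."""
    return -0.5 * x * x


if __name__ == "__main__":
    samples, rate = metropolis(standard_normal_log_density, x0=0.0, n_steps=50_000)
    kept = samples[5_000:]  # discard an initial burn-in segment
    mean = sum(kept) / len(kept)
    var = sum((s - mean) ** 2 for s in kept) / len(kept)
    print(f"acceptance rate={rate:.2f} mean={mean:.2f} variance={var:.2f}")
```

Run as a script, this prints the acceptance rate along with a sample mean and variance that should land close to 0 and 1 for this toy target.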
I most certainly agree and note that applications that provide two orders of magnitude (100x) increased computational capability--regardless of the technology platform--are disruptive and have the potential to fundamentally affect scientific research by removing time-to-discovery barriers. With such a performance boost, previously unrealistic computational tasks that would have taken one or more years can now finish in a day or week. The fact that GPGPU technology can be purchased by anyone for a low cost just expands the impact of the technology.
Even the 10x speedup provided by modern low-end multicore workstations and laptops containing eight or more cores, or by "poorly" performing GPGPU applications, can make computational workflows more interactive. Effectively using this commodity technology means people only need to wait minutes for tasks to complete that previously would have taken hours. Similarly, overnight runs can now process workloads that would have taken days on single-core hardware.
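The arithmetic behind these 100x and 10x claims is easy to check. The short Python sketch below divides a baseline runtime by a uniform speedup factor; the helper names (`accelerated_runtime`, `describe`) and the three example workloads are hypothetical and chosen only to mirror the year-to-days, days-to-overnight and hours-to-minutes cases described above.

```python
def accelerated_runtime(baseline_hours, speedup):
    """Return the runtime in hours after applying a uniform speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_hours / speedup


def describe(hours):
    """Format a duration in hours using the largest convenient unit."""
    if hours >= 24:
        return f"{hours / 24:.1f} days"
    if hours >= 1:
        return f"{hours:.1f} hours"
    return f"{hours * 60:.0f} minutes"


if __name__ == "__main__":
    one_year = 365 * 24  # hours
    # Illustrative workloads (assumptions, not measurements).
    scenarios = [
        ("one-year run at 100x", one_year, 100),
        ("three-day run at 10x", 72, 10),
        ("five-hour task at 10x", 5, 10),
    ]
    for label, baseline, speedup in scenarios:
        print(f"{label}: {describe(accelerated_runtime(baseline, speedup))}")
```

A one-year job at 100x drops to 87.6 hours, or between three and four days, and a five-hour task at 10x drops to half an hour.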
Observation shows that there is a very large and established body of scientists and decision-makers who are reluctant to invest in software that can exploit these parallel performance gains. I believe they do not understand the implications and potential that low-cost, easily accessible parallel hardware has for computation-based research. Both simulation and data-driven projects that depend on sensors or high-throughput instruments are affected. Very low hardware cost, coupled with the fact that everyone (scientists, students, parents, hobbyists, etcetera) has access to this technology, is causing it to be rapidly adopted and accepted.
NVIDIA reports that over 100 million CUDA-enabled GPGPUs have been sold and states on its Web site that "CUDA-based GPU computing is now part of the curriculum at more than 200 universities, including MIT, Harvard, Cambridge, Oxford, the Indian Institutes of Technology, National Taiwan University, and the Chinese Academy of Sciences." Due to low power consumption and small form factor, this technology can be installed in a multitude of locations: within a sensor, next to an instrument, or on a student's or scientist's desk.
Teaching about the potential of this technology has been an important aspect of my columns. Scalability, the ability of parallel hardware to efficiently address the computational problem, and the lack of consensus on programming models all complicate the cost analysis that should precede any significant software or technology investment. In some cases, intuition will have to fill in portions of the cost analysis.
[FIGURE 1 OMITTED]
The alternative is to accept that current application performance will essentially plateau at or near current levels on both current and future hardware. Legacy software written for single-core processors represents a huge base of currently installed software that will, in general, not benefit from new parallel and multi-core hardware. Multi-threaded or distributed applications that scale poorly face the same plateau. Regardless of the cause, a failure to invest in software that can run and scale well on current and future hardware has profound implications for computation-dependent projects. Effectively, the software then defines the limits of what computational work can be performed and how competitive a product or project might be relative to other computation-based approaches.
With care, appropriate and simple programming models can support both longevity and high performance. For example, a general SIMD mapping for a broad class of machine-learning and optimization problems, created in the 1980s for the CM-2 Connection Machine, has been utilized on modern GPGPU and supercomputer technology to achieve both high performance and near-linear scalability. The simplicity of the SIMD (and MIMD) computation model has allowed this computational approach (and software) to run efficiently on many machines during the past 30 years.
Scalability is also an essential characteristic for parallel project longevity. This same mapping for machine-learning and optimization problems scales very well on a variety of machine architectures to support analysis of large data sets. In combination with GPGPU technology, many problems (neural networks, SVM, PCA, etcetera) can achieve very high performance and near-linear scalability across arbitrary numbers of GPGPUs. Similarly, high-end supercomputers can be used efficiently, as demonstrated by running a PCA problem on the Texas Advanced Computing Center's (TACC) Ranger supercomputer. Using this same 1980s computational framework, Ranger delivered near-linear scaling to 60,000 processing cores, near-peak performance per core, and 363 trillion floating-point operations per second, even when taking into account all communications overhead. (5)
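The column does not spell out the details of this mapping. One common way to structure a data-parallel objective function, assumed here purely for illustration, is to broadcast the model parameters to every processing element, let each element sum the error over its own slice of the data, and then combine the partial sums with a global reduction. The Python sketch below simulates that pattern sequentially; the function names (`partition`, `partial_error`, `objective`), the straight-line model and the synthetic data are all hypothetical.

```python
def partition(data, n_workers):
    """Split the data into n_workers nearly equal contiguous chunks."""
    chunk, extra = divmod(len(data), n_workers)
    parts, start = [], 0
    for w in range(n_workers):
        end = start + chunk + (1 if w < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts


def partial_error(params, chunk):
    """Per-worker step: squared error of the line y = a*x + b on one chunk."""
    a, b = params
    return sum((a * x + b - y) ** 2 for x, y in chunk)


def objective(params, parts):
    """Broadcast params, evaluate every chunk independently, then reduce."""
    # On real parallel hardware each call would run on its own core or device;
    # here the workers are simulated one after another.
    partials = [partial_error(params, chunk) for chunk in parts]
    return sum(partials)  # global reduction to a single scalar


if __name__ == "__main__":
    # Synthetic data lying exactly on y = 2x + 1 (an assumption for the demo).
    data = [(float(x), 2.0 * x + 1.0) for x in range(1000)]
    for n_workers in (1, 4, 16):
        parts = partition(data, n_workers)
        print(n_workers, objective((1.5, 0.0), parts))
```

In a scheme like this, only the parameters and one scalar per worker need to be communicated for each objective evaluation, so the per-worker computation can dominate as the data set grows. That property is consistent with, though not proof of, the near-linear scaling described above.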
Looking ahead, new supercomputers, such as the National Center for Supercomputing Applications' (NCSA) Blue Waters, will deliver a petaflop (10^15 floating-point operations per second) of sustained performance (10 petaflops peak performance) in a petabyte shared-memory environment that supports over a million concurrent hardware threads of execution.
Considering that we are standing at the dawn of the petascale computer era, it is fun to speculate about the performance of future low-cost hardware. As noted in the beginning of this column, the 1996 ASCI Red supercomputer was the first to break the teraflop computing barrier. Now, in 2010, anyone can purchase teraflop-capable hardware for a few hundred dollars. In 2011, the Blue Waters project will start to deliver a world-record breaking 1- to 10-petaflops of performance. So, what sort of performance can we expect from hardware that anyone will be able to purchase 15 years from now?
(1.) Beichl, I., Sullivan, F., 2000. "The Metropolis algorithm." Computing in Science and Engineering 2 (1), 65-69.
(2.) Khiripet, "GPGPU Case Studies": http://rbdweb.nstda.or.th/rbdweb/download/Dr.Noppadon.pdf
(3.) Suchard, et al., "Understanding GPU Programming for Statistical Computation: Studies in Massively Parallel Massive Mixtures": http://ftp.stat.duke.edu/WorkingPapers/10-02.pdf
(4.) NVIDIA, "CUDA Showcase": http://nvidia.com/cuda
(5.) Farber: http://www.gpucomputing.net/?q=node/176
Rob Farber is a senior research scientist at PNNL. He may be reached at editor@ScientificComputing.com
Title Annotation: High Performance Computing
Date: May 1, 2010