Opening minds: the greatest architectural challenge: several computer architectural trends provide significant performance benefits.
In recent years, the emphasis on multi- and many-core architectures has changed and fragmented the landscape of computer architectures, since increasing processor clock rates no longer is a viable way for manufacturers to increase performance. This article will map out several architectural trends in the computer industry that can provide significant performance benefits for a broad spectrum of applications and will include the surprising observation that comparisons made by vendors and scientists alike against the current generation of x86 architecture processors may be inadvertently biased due to lack of use of valuable performance enhancing capabilities within this venerable architecture.
Please don't get me wrong, as I strongly support many of the new architectures. Both graphics processing unit (GPU) computing and the Cray XMT open new vistas of performance for a wide spectrum of important computing problems. As a recognized scientist and technologist at a U.S. national laboratory, I have the benefit of early access to many of the latest (and largest) supercomputer technologies, which provides me a wide "horizon view" of the industry. Conversely, I also must face the challenge of evangelizing to management and justifying these technologies so they will be accepted by my colleagues and utilized in both production and leadership computing environments. For this latter reason, I try to ensure that any new architecture I evangelize delivers the expected performance over conventional x86-based computing clusters--especially after our technical staff really digs in and tries to get all the performance possible from each and every one of our machines, including our mainstream processing clusters.
A very clear and obvious trend in the current generation is massive threading for both multi-core and many-core processing. Several of my previous Scientific Computing columns have discussed the reasons for this trend including "HPC's Future" and "Back to the Future: The Return of Massively Parallel Systems." Succinctly, massive multithreading forces both hardware architectures and software programming models to focus on latency, utilization and scheduling issues in order to effectively support very large numbers of simultaneous threads of execution.
Multi-core designs in the mainstream x86 architecture markets must support ever-increasing numbers of threads out of necessity, as this enables greater and more efficient use of their multi-core processors. With 8 to 16 processing cores (and twice that many hardware threads if hyper-threading is enabled) coupled to large shared-memory subsystems measured in the hundreds of gigabytes of RAM, many commodity x86 machines have specifications competitive with some of the large shared-memory supercomputers of just a few years ago. Without doubt, the number of cores in the x86 processors will continue increasing.
Massive multithreading (coupled with other architectural features of the hardware) allows graphics processors to achieve extremely high floating-point performance and also allows architectures such as the Cray XMT to achieve near-linear scaling on irregular memory access problems. Roughly speaking, graphics processors can be considered "streaming processors" because they like coalesced memory operations that simultaneously stream data from all of the on-board graphics memory banks. Conversely, the Cray XMT architecture was designed to run algorithms that effectively have random memory access behavior. In fact, the memory subsystem of the XMT actually randomizes access to physical memory addresses to reduce hotspot behavior. Performing a sequential fetch of consecutive memory addresses on these machines actually causes data to be read from random hardware memory addresses.
Latency hiding (coupled with other hardware features) is the motivation for massive threading models in the Cray XMT (formerly Eldorado and follow-on to the MTA-2) architecture. With its latency tolerant ThreadStorm processors, high bandwidth network, global shared memory and fine-grained synchronization, the XMT architecture is showing that it can scale extraordinarily well on sparse graph problems that hit scaling bottlenecks on other architectures.
Without question, it is extremely important to find scalable solutions for algorithms that require irregular memory accesses. These are algorithms that do not permit locality of reference or data reuse, so hardware caches are rendered useless because each memory access almost always causes a cache miss. Conventional cache-based architectures fail to perform well on this general class of problems that arise when solving graph, finite automata and a myriad of other algorithms associated with large-scale data analysis and data-mining.
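To make the access pattern concrete, here is a minimal C sketch of a "pointer chase" over a randomly permuted array, a common stand-in for graph-style irregular access. The function names `build_random_cycle` and `chase` are hypothetical and chosen only for illustration; the shuffle used is Sattolo's algorithm, which guarantees that the chain visits every slot exactly once before returning to the start.

```c
#include <stddef.h>
#include <stdlib.h>

/* Fill next[0..n-1] with a random permutation that forms one single cycle
   (Sattolo's algorithm), so that following next[] visits every slot once,
   in an unpredictable order, before returning to the start. */
void build_random_cycle(size_t *next, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        next[i] = i;
    for (size_t i = n; i > 1; --i) {
        size_t k = i - 1;                 /* slot being placed          */
        size_t j = (size_t)rand() % k;    /* j in [0, k-1], never k     */
        size_t tmp = next[k];
        next[k] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` hops starting at slot 0. Each load depends
   on the result of the previous one, so a single thread cannot overlap
   the resulting cache misses. */
size_t chase(const size_t *next, size_t steps)
{
    size_t p = 0;
    for (size_t s = 0; s < steps; ++s)
        p = next[p];
    return p;
}
```

When the array is much larger than the caches, nearly every hop in `chase` is likely to miss, which is the behavior described above. Timing it against a simple sequential sweep of the same array is one way to see the gap on a given machine.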
By facilitating the use of large numbers of simultaneous threads, the Cray hardware and software work together to "hide"--in a scalable fashion--the latency caused by accessing random memory locations. David Bader and Kamesh Madduri, for example, have been able to use the Cray architecture to achieve near-linear speedup on important problems such as the single source shortest path graph problem with non-negative weights (the NSSP problem). In their paper "An Experimental Study of a Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances" the authors indicate they are presenting "first results to demonstrate near-linear speedup for such large-scale unstructured graph instances."
Massive multi-threading programming models also benefit the very powerful, yet exceedingly low-cost and low-power, graphics processor products now available from NVIDIA. The scientific and technical literature demonstrates an explosion of compute unified device architecture (CUDA)-enabled applications and algorithms in an astounding number of algorithmic and application areas. Essentially, CUDA-enabled graphics hardware and software can work together to support large numbers of simultaneously running threads to deliver one to two orders of magnitude, 10x to 100x, speedups over conventional hardware. (Some researchers have even reported three orders of magnitude increase when running algorithms that heavily utilize the GPU special processing units for transcendental functions.) Several of my earlier Scientific Computing articles discuss the graphics processor revolution for HPC including "GPGPUs: Neat Idea or Disruptive Technology," "The Future Looks Bright for Teraflop Computing" and "Back to the Future: The Return of Massively Parallel Computing."
It is worth noting that the OpenCL framework (which is still in the early stages of adoption) promises to deliver many of the advantages of GPU multithreaded programming for vendor-independent GPU platforms from ATI and NVIDIA, as well as general-purpose computers. This will be a technology to watch.
Common to both the GPU and Cray architectural advances are hardware schedulers that enable very low-latency context switching between threads. With supported thread counts of thousands to millions of simultaneously executing threads, both architectures are able to hide latency and achieve linear or near-linear speedups on an enormous number of important floating-point and irregular memory access problems.
Now comes the hard part: convincing people to rewrite portions of their software to exploit these remarkable capabilities. Both NVIDIA and Cray have C-language compilers augmented with a few simple additions (i.e. keywords or pragmas) plus some libraries that ease the transition into massively multi-threaded programming. I have programmed in both environments and am very excited by how straightforward they have been to use and the superb scaling behavior both architectures exhibit on large real-world problems.
Part of the challenge facing the adoption of graphics processors in the HPC and scientific communities is that the technologically disruptive high floating-point rates can currently be achieved with only single-precision (32-bit) arithmetic. I addressed this question in my previous Scientific Computing column, "Numerical Precision: How Much is Enough?" In addition, I suspect future generations of graphics processors will ease or eliminate this restriction.
A fear also has been expressed that graphics processors are not designed with enough internal data protections to prevent silent errors from occurring in long-running or production environments. This is a valid concern but, to my knowledge, there have been no reported instances of these silent errors. Yet, some of my colleagues still express concern about moving their software to graphics processors because they fear a silent error might introduce some non-physical artifact into their results. This concern also appears to be eroding with time as more software and scientific publications contribute ever more stories about the wonderful performance and game-changing nature of CUDA and GPU hardware.
A challenge with the Cray XMT is that it is still evolving and has not yet seen large-scale deployment. This is changing, as more supercomputing facilities are purchasing these systems, but these machines are certainly not a commodity off-the-shelf (COTS) product available to anybody at commodity prices.
As mentioned at the beginning of this article, benchmarks comparing new architectures to mainstream x86 processors are an essential part of demonstrating the value of these new products. Without some large performance benefit, why should customers risk adopting a new technology (with associated uncertainty, learning and transition costs) if mainstream processor technology can perform nearly as well?
Part of the challenge is that the x86 computer architects have not been standing still. For example, memory bandwidth has been a significant challenge facing the commodity x86 HPC vendors and customers. (For more information, please see my articles "Avoid that Bus!" and "The Future of HPC" in the archive section of the Scientific Computing website.) Happily, the newest Core i7 architecture demonstrates that Intel has listened and is addressing the memory bandwidth issue in its processors.
While the x86 architecture does not have the excellent ultra low-latency schedulers that both NVIDIA and Cray have in their products, it does have a few not-so-heavily utilized capabilities that can greatly increase x86 performance for both floating-point limited and random memory access limited applications. Unfortunately, scientists and programmers need to learn about these capabilities and to use them in order to provide fair "best performance" comparisons.
Most people are aware that x86 processors contain streaming SIMD extensions (SSE) that allow multiple operations to occur in parallel within the processor. Use of these instructions can greatly increase the performance--especially floating-point performance--of x86 processors. Relying on compilers and even excellent libraries such as the Intel Math Kernel Library (MKL) and AMD Core Math Library (ACML) will not necessarily achieve the best performance. Experience has shown that using the SSE compiler intrinsic operations available with most compilers can achieve greater than 2x increases in floating-point performance over compiler-generated code or an SSE-optimized library because local register reuse and processor pipelining can be better exploited.
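As a rough illustration of what programming with SSE intrinsics looks like, the following C sketch computes a single-precision dot product four floats at a time. It assumes an x86 compiler that provides `<xmmintrin.h>`; the function name `dot_sse` is hypothetical, and the code is a teaching example rather than a tuned kernel. Any speedup over a compiler or library version will depend on the processor, compiler and problem.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics: __m128 and the _mm_* functions */

/* Dot product of two float arrays using 4-wide SSE registers.
   Two independent accumulators are kept in registers so that consecutive
   multiply-add pairs do not depend on each other, which gives the
   processor pipeline more independent work. */
float dot_sse(const float *a, const float *b, size_t n)
{
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    size_t i = 0;

    /* Main loop: 8 floats per iteration, unaligned loads for generality. */
    for (; i + 8 <= n; i += 8) {
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_loadu_ps(a + i),
                                           _mm_loadu_ps(b + i)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_loadu_ps(a + i + 4),
                                           _mm_loadu_ps(b + i + 4)));
    }

    /* Combine the two accumulators and sum the four lanes. */
    float lanes[4];
    _mm_storeu_ps(lanes, _mm_add_ps(acc0, acc1));
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Scalar tail for the remaining 0 to 7 elements. */
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Keeping the accumulators in registers across the whole loop is a small example of the register reuse mentioned above. Note that the vectorized version adds the products in a different order than a plain scalar loop, so results may differ in the last bits for general inputs.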
Large page table support is another capability that can significantly increase application performance for both floating-point and random memory access graph-type algorithms. The reason is that using large-pages avoids translation lookaside buffer (TLB) misses. In virtual memory systems like the x86 architecture, a TLB provides an on-chip cache to improve the speed of virtual address translation. As a result, application addresses can be translated to physical RAM addresses with minimal overhead and no additional RAM accesses. While TLB caches are fast, they are also quite small and the overhead and performance penalty incurred by a TLB miss is significant.
So, why do large pages benefit floating-point applications? The simplest answer can be understood by considering some array operation--however trivial--that requires stepping through memory in strides that are greater than the standard page size used by the system. These are common scenarios that frequently occur when working with two- or higher-dimensional matrices. Because of the stride size, each memory access requires looking up a new page in the TLB. If the array is sufficiently large, then each memory access will cause a TLB miss and corresponding performance drop. Using large pages in these cases will result in fewer TLB misses and a corresponding increase in performance because the processor does not have to wait (or wait as long) for data. Avoiding TLB misses is one of the motivations for the newer AMD and Intel processors' support of 1 GB pages.
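The stride argument can be checked with simple arithmetic. The C sketch below counts how many distinct pages a strided walk touches for a given page size; `pages_touched` is a hypothetical helper written for this illustration, and it assumes the array starts on a page boundary.

```c
#include <stddef.h>

/* Number of distinct pages touched by n accesses that start at byte
   offset 0 and advance stride_bytes each time, for a given page size.
   Assumes the buffer starts on a page boundary and stride_bytes > 0.
   Because the offsets only increase, counting page changes is the same
   as counting distinct pages. */
size_t pages_touched(size_t n, size_t stride_bytes, size_t page_bytes)
{
    if (n == 0)
        return 0;
    size_t count = 1;        /* the first access always touches page 0 */
    size_t prev_page = 0;
    for (size_t i = 1; i < n; ++i) {
        size_t page = (i * stride_bytes) / page_bytes;
        if (page != prev_page) {
            ++count;
            prev_page = page;
        }
    }
    return count;
}
```

As a worked case, walking down one column of a row-major matrix of doubles with 1,024 columns gives a stride of 8,192 bytes. For 4,096 rows, `pages_touched(4096, 8192, 4096)` is 4,096 (a new 4 KB page on every access), while `pages_touched(4096, 8192, 2097152)` is 16, since 256 consecutive accesses fall inside each 2 MB page.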
Graph algorithms can also benefit from the use of large-page sizes because irregular memory access patterns can cause a TLB miss on nearly every memory access. For example, I have demonstration programs that run 4x to 7x faster (depending on problem size) when run in large 2 MB pages on an AMD Opteron computer.
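For readers who want to experiment, here is a hedged C sketch of one way to request large pages on Linux, using `mmap` with the `MAP_HUGETLB` flag. This is not the method used for the demonstration programs mentioned above; the helper names `alloc_large` and `free_large` are hypothetical, a 2 MB huge-page size is assumed, and the call only succeeds with large pages if the system administrator has reserved some. The sketch falls back to ordinary pages otherwise.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE      /* exposes MAP_ANONYMOUS and MAP_HUGETLB in glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Try to map `len` bytes backed by huge pages; fall back to ordinary pages
   if the huge-page request fails (for example, none are reserved).
   *used_huge is set to 1 if huge pages were obtained, 0 otherwise.
   `len` should be a multiple of the huge-page size (2 MB assumed).
   Returns NULL if both attempts fail. */
void *alloc_large(size_t len, int *used_huge)
{
    void *p = MAP_FAILED;
    *used_huge = 0;
#ifdef MAP_HUGETLB
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        *used_huge = 1;
#endif
    if (p == MAP_FAILED)
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Release a mapping obtained from alloc_large. */
void free_large(void *p, size_t len)
{
    if (p != NULL)
        munmap(p, len);
}
```

Checking `used_huge` before timing anything matters, because a silent fallback to standard pages would make a benchmark look as if large pages gave no benefit.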
Unfortunately, the literature on application performance benefits is confusing. Part of the reason is that the large-page capability available in recent processors varies widely between manufacturers, processor families and among different generations of processors within individual processor families. As a result (and depending on the processor benchmarked) the use of large-pages can actually cause a performance decrease.
The popular AMD Opteron provides a clear example of a mainstream processor that provides very limited support for large-page sizes. Collin McCurdy et al. noted in their paper "Investigating the TLB Behavior of High-end Scientific Applications on Commodity Microprocessors" that the AMD Opteron processors only support large 2 MB pages with a TLB cache that has only eight entries, which is only 1/32 the size of the TLB cache for the commonly used 4k page size! Further, the 2 MB TLB cache has other performance-limiting characteristics. (Specifically, the large-page TLB cache is not backed by a second-level cache and is only four-way set associative, while the 4k TLB cache is both fully associative and backed by a second-level cache to further reduce off-chip memory accesses.)
Please note that large-page support is moving ever more into the mainstream because important commercial applications such as Oracle run significantly faster with large pages. Both Microsoft Windows and Linux support the use of large pages. With the release of the Barcelona processors, AMD now provides reasonable TLB cache capability that allows the unrestricted mixing of TLB entries that reference multiple page sizes.
Unfortunately, the confusion over the possible benefits of large-pages has limited or delayed the acceptance of this commodity processor capability for scientific and HPC computing. For example, it has taken me several years to convince my colleagues and management at PNNL to fund an investigation into the performance benefit of large-pages for our applications.
Acronyms ACML AMD Core Math Library | COTS Commodity off the Shelf | CUDA Compute Unified Device Architecture | GPGPU General-Purpose Computation on Graphics Processing Units | GPU Graphics Processing Unit | MKL Math Kernel Library | NSSP Single Source Shortest Path Problem with Non-negative Weights | SIMD Single Instruction, Multiple Data | SSE Streaming SIMD Extensions | TLB Translation Lookaside Buffer
Rob Farber is a senior PNNL research scientist working with the William R. Wiley Environmental Molecular Sciences Laboratory, a Department of Energy national scientific user facility located in Richland, Washington. He may be reached at editor@ScientificComputing.com.
|Title Annotation:||High Performance Computing|
|Date:||Sep 1, 2009|