Keeping "performance" in HPC; a look at the impact of virtualization and many-core processors.
Let's assume we have a balanced supercomputer system, as described in my February 2007 column. This means no obvious resource bottleneck (e.g., memory bandwidth and capacity, network bandwidth, storage bandwidth, and so forth) prevents an application from running well. So, what other factors can affect application performance, and how do virtualization plus many-core processors make the situation worse? In a nutshell, the culprits are monolithic design, OS jitter, latency, cache pollution and TLB trashing, as discussed below.
Whew! That certainly is a lot of material to cover in the short space of this column, and the list is not exhaustive. These are some common issues that challenge many HPC applications--especially tightly coupled and multi-threaded scientific codes. The good news is these are current topics of discussion in the HPC community and there are many sources of in-depth information available in the literature and on the Internet.
Resilience is a key motivator in considering virtualization for petascale and beyond supercomputers, as the virtual machine can literally be migrated off failing hardware without the application's (or user's) knowledge. In the near term, there are some immediate technical challenges to overcome:
* Transparent virtual machine migration does not currently work on some types of supercomputer network interconnects.
* As last month's column discussed, current virtualization technology imposes some performance limitations on applications that require native network and/or storage I/O performance. Next-generation technology will probably correct these.
Looking further into the future, Gavrilovska et al. observe that current hypervisor designs used for virtualization are monolithic, with all cores in the system executing the same functionality. (1) This will introduce significant performance overheads for many-core architectures. (2)
Some types of HPC problems require that many or all nodes in the supercomputer be tightly coupled. In essence, this means that each computational node frequently depends upon input from another node or nodes. The problem with tightly coupled designs is that a delay in moving information from any one node to any other can delay all the nodes. In other words, small delays can quickly add up to big drops in performance. Communications latency is generally the principal limitation for tightly coupled software. However, OS jitter, an unwanted variation in timing caused by system and other processes running on the computer, also can cause significant delays. To support tightly coupled applications, many HPC sites try to remove as many system processes as possible. Virtualization may not be a viable solution for these types of software designs, as it can introduce delays (e.g., increased latency) due to context switching between multiple virtual machines and/or the host operating system.
However, running tightly coupled applications on the native hardware host operating system (without virtual machines or other processes) should eliminate all but the communications latencies--at least as well as current technology allows. It is unclear at this time if tightly coupled software designs will be viable for petascale and beyond computing. I certainly hope they are!
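The way small per-node delays add up in a tightly coupled code can be illustrated with a toy model. The sketch below is only an assumption-laden illustration, not a measurement: it supposes every node does one unit of work per step, that each node is independently hit by a fixed-length OS interruption with a small probability, and that all nodes wait at a barrier after every step. The function name and all parameter values are hypothetical.

```python
import random

def simulate(num_nodes, num_steps, compute_time=1.0,
             jitter_prob=0.01, jitter_delay=0.5, seed=0):
    """Total runtime of a job that synchronizes all nodes after every step."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_steps):
        # Each node needs compute_time, plus jitter_delay if the OS interrupts it.
        step_times = [
            compute_time + (jitter_delay if rng.random() < jitter_prob else 0.0)
            for _ in range(num_nodes)
        ]
        # The barrier releases only when the slowest node arrives,
        # so one delayed node delays every node.
        total += max(step_times)
    return total

if __name__ == "__main__":
    steps = 1000
    ideal = steps * 1.0  # runtime with no jitter at all
    for nodes in (1, 16, 256, 4096):
        runtime = simulate(nodes, steps)
        print(f"{nodes:5d} nodes: slowdown {runtime / ideal:.3f}x")
```

In this model a delay that a single node rarely sees becomes close to a certainty on every step once thousands of nodes must wait for each other, which is why sites running such codes work to strip out system processes.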
Finally, we mentioned two other types of problems: cache pollution and TLB trashing. We can view these as problems associated with running multiple programs (or program threads) on the same processor core.
Cache memory is very high-performance memory that the CPU can use with very little delay. It is generally located on the processor and is much faster to access than external memory such as the sticks of RAM that are plugged into your computer. Cache pollution occurs when multiple programs (or multiple threads) attempt to use the same processor core cache. In this case, valuable high-performance space for one program gets "polluted" with useless data and program instructions for another thread. If the processor core cannot find what it needs in the high-performance cache, then it has to pay in performance, as it is forced to access the slower external memory. For this reason, cache "pollution" is bad.
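A small simulation can make the cost of cache pollution concrete. This is a minimal sketch under stated assumptions: the cache is modeled as a simple least-recently-used (LRU) store holding a fixed number of items, which is far simpler than real processor caches, and the two "programs" just sweep repeatedly over their own data. The class, function names and sizes are all hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    """Toy cache with a fixed capacity and least-recently-used eviction."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, key):
        if key in self.lines:
            self.lines.move_to_end(key)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1  # would go to slower external memory
            self.lines[key] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used

def hit_rate_alone(capacity, working_set, passes):
    """Hit rate of program A when it has the cache to itself."""
    cache = LRUCache(capacity)
    for _ in range(passes):
        for addr in range(working_set):
            cache.access(("A", addr))
    return cache.hits / (cache.hits + cache.misses)

def hit_rate_shared(capacity, working_set, passes):
    """Hit rate of program A when program B's accesses are interleaved."""
    cache = LRUCache(capacity)
    a_hits = a_total = 0
    for _ in range(passes):
        for addr in range(working_set):
            before = cache.hits
            cache.access(("A", addr))
            a_hits += cache.hits - before
            a_total += 1
            cache.access(("B", addr))  # B's data competes for the same space
    return a_hits / a_total

if __name__ == "__main__":
    print("A alone :", hit_rate_alone(64, 48, 10))
    print("A shared:", hit_rate_shared(64, 48, 10))
```

With these assumed sizes, program A fits in the cache on its own and hits on every pass after the first. Once program B shares the cache, the combined data no longer fits and A's entries are evicted before A returns to them.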
TLB trashing is very similar in concept to the cache pollution described above. A translation lookaside buffer (TLB) is a cache used to improve the speed of virtual address translation. This cache has a fixed number of entries containing parts of the page table that translate virtual addresses into physical addresses. A performance penalty occurs when the virtual address is not in the TLB because slower external memory must be accessed to find the needed address information. So, if multiple programs (or threads) are competing for space in the TLB, then a performance penalty occurs due to TLB trashing, as the addresses from one process "trash" the addresses for another process.
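The translation step itself can be sketched in a few lines. The model below is illustrative only and rests on several assumptions: a 4,096-byte page size, an LRU replacement policy, page tables held as plain dictionaries, and TLB entries tagged with a process ID so that entries from different processes can coexist. Real hardware differs in all of these details, and every name here is hypothetical.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes

class TLB:
    """Toy TLB: a fixed number of (process, virtual page) -> frame entries."""
    def __init__(self, entries, page_tables):
        self.entries = entries
        self.page_tables = page_tables  # {pid: {virtual page: physical frame}}
        self.cache = OrderedDict()
        self.misses = 0

    def translate(self, pid, vaddr):
        # Split the virtual address into a page number and an offset.
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        key = (pid, vpn)
        if key in self.cache:
            self.cache.move_to_end(key)  # TLB hit: fast path
            frame = self.cache[key]
        else:
            self.misses += 1  # TLB miss: consult the page table in memory
            frame = self.page_tables[pid][vpn]
            self.cache[key] = frame
            if len(self.cache) > self.entries:
                self.cache.popitem(last=False)  # evict least recently used
        return frame * PAGE_SIZE + offset

def count_misses(pids, pages_per_process, tlb_entries, rounds):
    """Let each process touch all its pages in turn; return total TLB misses."""
    tables = {pid: {vpn: 100 * pid + vpn for vpn in range(pages_per_process)}
              for pid in pids}
    tlb = TLB(tlb_entries, tables)
    for _ in range(rounds):
        for pid in pids:  # one time slice per process
            for vpn in range(pages_per_process):
                tlb.translate(pid, vpn * PAGE_SIZE)
    return tlb.misses

if __name__ == "__main__":
    print("one process  :", count_misses([0], 6, 8, 10), "misses")
    print("two processes:", count_misses([0, 1], 6, 8, 10), "misses")
```

A single process whose pages fit in the TLB misses only on its first touch of each page. Two such processes sharing the same TLB keep evicting each other's entries, so the slow page-table lookup is paid again and again.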
Changes in operating system and virtualization designs are required to eliminate these types of problems. One of the simplest solutions will be to change the current paradigm, in which any job can run on any processing core, to one in which only one job can run on a given processing core. That would eliminate problems such as cache pollution, TLB trashing and OS jitter. Happily, the future looks very bright, as hardware advances appear to be giving us these types of options and there appears to be active research on these problems. Happy "performance" computing!
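A partial version of the one-job-per-core idea is already available on some systems through processor affinity. The sketch below assumes a Linux system, where Python exposes `os.sched_setaffinity` and `os.sched_getaffinity`; the helper name `pin_to_core` is hypothetical. Note that pinning only keeps this process on one core. It does not keep other processes off that core, which would need cooperation from the operating system or the batch scheduler.

```python
import os

def pin_to_core(core):
    """Restrict the calling process to a single core.

    Returns the resulting set of allowed cores, or None on platforms
    that do not provide the Linux affinity calls.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, {core})  # pid 0 means the calling process
    return os.sched_getaffinity(0)

if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        # Choose a core this process is already allowed to use.
        core = min(os.sched_getaffinity(0))
        print("now restricted to cores:", pin_to_core(core))
    else:
        print("CPU affinity calls are not available on this platform")
```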
1. A hypervisor is software which runs on a hardware machine and manages one or more operating systems.
2. See "High-Performance Hypervisor Architectures: Virtualization in HPC Systems,"
Rob Farber is a senior research scientist in the Molecular Science Computing Facility at the William R. Wiley Environmental Molecular Sciences Laboratory, a Department of Energy national scientific user facility located at Pacific Northwest National Laboratory. He may be reached at editor@ScientificComputing.com.
Title Annotation: HIGH PERFORMANCE COMPUTING
Date: Jul 1, 2007