Implementing efficient fault containment for multiprocessors.
Unfortunately, the performance advantages of future large-scale multiprocessors may be offset by reliability problems if they run current operating systems. With current multiprocessor operating systems, the entire machine must be rebooted when a nontrivial hardware or software fault occurs. The effect is to terminate all applications even though the initial fault may have affected only a few of the applications. This design will become increasingly inappropriate on future large-scale machines, where both the frequency of faults and the amount of work lost in a reboot may be much higher than on current machines.
The key, to improving the reliability of shared-memory multiprocessors is to provide fault containment. If the system confines the effects faults to a small portion of the machine, each application will achieve better reliability since it will be vulnerable only to faults that occur in the parts of the system it uses. This strategy has been used successfully in many distributed systems, but to our knowledge has not previously been applied to a multiprocessor. The prevailing wisdom has been that the tight coupling of components in a multiprocessor makes fault containment impossible.
This article describes our experience with implementing efficient fault containment in a prototype operating system, called Hive . Hive is a version of Unix that is targeted to run on the Stanford FLASH multiprocessor (Figure 1). Novel algorithms in Hive work together with unique hardware features provided by FLASH to limit the impact of both hardware faults and operating system bugs. Hive is based on Silicon Graphics IRIX 5.2 (a version of Unix SVR4) and preserves binary compatibility with it. By working with a commercial operating system, we believe we have encountered most of the issues that will arise when applying our techniques in practice. The techniques are not specific to IRIX, but should be usable in most multiprocessor operating systems.
In this article we make three main points. First, we propose fault containment as an appropriate reliability strategy for many multiprocessor applications. Second, we describe the architecture and current implementation of Hive. Finally, we give the results of performance and fault-injection experiments that demonstrate the potential of this approach.
Fault containment is the reliability strategy characteristic of many current distributed systems. The separate machines in a distributed system can share resources; for example, a user working at one machine can use files from another. However, the failure of one machine rarely damages applications running on other machines. The distributed system limits the effects of faults so only a fraction of the overall workload is affected.
In a shared-memory multiprocessor, fault containment implies defending applications against faults that occur in parts of the machine they do not use. The situation in a multiprocessor is not exactly analogous to a distributed system, since an application writer has little control over which resources the application uses. For example, even a small application running on a multiprocessor might use every memorandomly. We account for this issue by restating the goal of fault containment in more abstract terms:
A system provides fault containment if the probability
that an application is damaged by a fault is proportional to
the amount of resources used by the application, not to the
total amount of resources in the system.
In other words, small applications should have the same probability of being affected by a fault whether they run on a small or on a large-scale shared-memory multiprocessor. Note that our definition of fault containment differs from the terminology used in dependable computing systems (see "Definitions" sidebar).
Fault containment provides significant benefits for many types of multiprocessor workloads. It improves reliability for multiprogrammed workloads consisting of independent applications, such as software development, computer-aided design, and engineering simulation. Even workloads consisting of one or a few large parallel applications can frequently be decomposed into independent smaller tasks that benefit from fault containment. Examples of such workloads include decision support and data mining, where an initial long-running query can be split into smaller independent subqueries. Another type of workload that benefits from fault containment is a server workload, such as a multimedia or Web server client or set of clients can be assigned to a largely independent server process.
Only workloads that cannot do useful work after a partial failure receive no benefit from fault containment. Such workloads usually consist of a single large parallel process; for example, applications in computational physics or protein-folding analysis. These applications have traditionally relied on regular checkpointing (perhaps provided automatically by the system) to recover from faults. This strategy, effectively complements fault containment: checkpointing helps those applications that fault containment does not help, while many of the workloads where fault containment is beneficial are too dynamic and complex to be checkpointed without high performance overheads. A system that combines fault containment with checkpointing support can efficiently increase reliability for the majority of multiprocessor workloads.
Fault Containment vs. Fault Tolerance
A more traditional approach to improve reliability would be to provide fault tolerance. Fault-tolerant systems, such as the Tandem Guardian operating system , attempt to mask all effects of faults in order to support applications that require continuous operation. A fault-tolerant system is usually implemented through automatic replication of applications. The system keeps the backup copies of each application synchronized with the primary copy, so if the primary fails, little computation is lost and there is only a brief interruption in service.
Unfortunately, fault-tolerant systems can add significant performance overheads. Those that are relatively efficient, such as Guardian, require applications to be written specifically to take advantage of their fault-tolerant features. By choosing instead to implement the fault-containment abstraction, Hive improves the reliability of unmodified applications while preserving performance competitive with existing multiprocessor operating systems. If needed, fault tolerance can still be provided through replication at the application level.
Implementing Fault Containment in Hive
Fault containment is significantly more difficult to achieve in a shared-memory multiprocessor than in the distributed systems where it has previously been implemented. Consider some of the potential faults in the shared-memory multiprocessor shown in Figure 2. The run queue, a critical operating system data structure shared by all the processors, is stored in memory module 1. In case (a), memory module 1 fails, causing the run queue data to become inaccessible; no new processes can be scheduled anywhere in the machine. In case (b), processor I acquired the run queue lock to ensure mutual exclusion, but failed before releasing the lock; any other processor trying to access the run queue will spin forever waiting for the lock to be released. In case (c), processor 2 uses an uninitialized pointer for a store instruction and happens to smash the data in the run queue; the operating system will probably crash when the next processor to access the run queue finds it in an inconsistent state. (See the "Failure Model" sidebar for a more precise statement of the failure model used in the Hive design).
These problems make it impractical to add fault containment to a standard multiprocessor operating system, which has only one copy of each internal data structure. Our solution in Hive is to partition the multiprocessor and run an internal distributed system of multiple kernels (Figure 3). Each kernel, called a cell, runs as an independent multiprocessor operating system and defends itself and its processes against faults in other cells. However, the cells cooperate to present the appearance of a single operating system to users and applications (a single-system image), and cooperate to share memory and other resources as required for performance. We implemented Hive by starting with an existing Unix kernel (IRIX 5.2) and making extensive modifications to the process management s stem, virtual memory system, file system, and I/O system.
When a cell crashes due to a hardware or software fault, processes with threads running on the cell are terminated. All other processes continue normally, although those using resources (such as memory pages) lost in the crash receive an error the next time they try to access those resources. The surviving cells reboot the crashed cell and reintegrate it into the system.
Cells interact primarily by exchanging remote procedure calls (RPCs), just as is done in distributed systems. In fact, many of the problems encountered in implementing Hive are similar to those solved by previous single-system-image distributed systems such as Sprite [91 and Locus , which run on clusters of workstations. For example, we had to modify the process creation mechanism so the new child process could be created on a separate cell, enhance the signal delivery mechanism, implement distributed process groups, and so on.
The novel problems that arise in a cell-based multiprocessor operating system are: (1) the need to defend against wild writes (accidental stores to random addresses), (2) the need to detect cell failures quickly, and (3) the need to share memory, processors, and other resources flexibly among the cells. We discuss the solution to each of these problems in turn.
Wild-write defense: Each cell must be able to defend itself against corruption by wild writes from other cells. Wild writes have been observed to occur in a significant fraction of operating system software faults .
To help solve the wild-write problem, we added firewall hardware to FLASH. The firewall provides an array of protection bits for each 4KB page of main memory. A processor is only allowed to write to a page if the bit corresponding to that processor is set. If there are more processors than bits in the protection array, multiple processors map to each protection bit.
Cells use the firewall to completely protect their internal operating system pages. Application pages are protected to the extent possible. In particular, Hive guarantees that application pages used only by processes running on the local cell are not writable from outside the cell. This provides maximum fault containment benefits to processes that fit within a single cell.
When a cell fails, there is no information that indicates whether it issued wild writes or which pages it modified (maintaining such information would require excessive hardware overhead to track writes). Hive makes the conservative assumption that all writable pages are corrupted and discards them. It is safer to discard a valid page, perhaps forcing the user to run an application again, than to preserve a corrupted page, which could cause an application to generate incorrect output.
The combination of the firewall and the preemptive discard policy provides a very strong defense against wild writes. However, the short time between a wild write and the detection of cell failure results in a small but nonzero probability that an application will read corrupt data and generate incorrect output. This problem appears fundamental to any implementation of fault containment in a shared-memory system.
In fault-injection experiments to date we have not observed any incorrect output due to wild writes. We will need extensive experience with the system to measure this probability in Hive. However, it is clear that the probability can be reduced by minimizing the delay until faults are detected. Aggressive failure detection is therefore an important component of the Hive design.
Fast failure detection and recovery: Hive detects failures by running a set of heuristic checks during normal operation. Each cell continually checks other cells. This includes the checks done in previous distributed systems: setting timeouts on RPCs and performing consistency checks on messages received from other cells. Hive also runs more frequent checks that use the ability to read directly from the memory of other cells. Examples include monitoring the clock variables of other cells, and checking type tags when reading remote kernel data structures.
The failure of a heuristic check run by a cell provides a hint that some other cell may have failed. It is only a hint since the error may have been in the cell where the check algorithm executed. For instance, in one case we observed a bug which caused the clock interrupt handier in a cell to execute too frequently. Pending operations on that cell timed out prematurely, leading the damaged cell to wrongly suspect that another cell had failed.
When a hint is raised on any cell, Hive suspends user-level processes on all cells and runs a distributed agreement protocol that allows the cells to vote on which cells have failed. Each cell independently tests all other cells to form its own idea of the new live set. This information is then exchanged with the other cells in a multiround flooding algorithm that guarantees agreement despite the potential failure of cells while the algorithm is running . If the system reaches consensus that the live set is unchanged (i.e., the hint was a false alarm), each cell immediately resumes normal operation.
If the system reaches consensus that one or more cells have failed, each cell independently runs a recovery process that discards pages writable by the failed cells, kills processes with threads on the failed cells, and performs other cleanup actions. No user-level processes are allowed to resume on any cell before all surviving cells are read for normal operation. This ensures that all potentially corrupt data has been discarded, and simplifies the implementation of the recovery algorithms.
An important characteristic of the distributed agreement protocol is that it completes quickly if no cells have failed. The system can execute the agreement protocol frequently (once every second or so) without significant performance impact. Therefore as Hive matures and we encounter types of failures that are not detected by the current checks, we can add heuristic checks specific to those failure types. The checks can be biased to give false alarms rather than false negatives. This suggests that the system can be iteratively improved to the point where it has a high probability of detecting faults quickly (perhaps within 10 milliseconds), and thus that applications will have a low probability of encountering corrupt data.
Resource sharing: Efficient resource sharing is the primary feature that justifies investing in a shared-memory multiprocessor. Therefore the cells must share memory, I/O devices, processors, and other resources flexibility. Hive requires mechanisms that support sharing and policies that ensure the system runs efficiently. We discuss the mechanisms here; policy design will be addressed in future work.
Hive provides two memory sharing mechanisms (Figure 4). One type, which we call physical-level sharing, allows free memory pages to flow to cells where demand is high. The other, called logical-level sharing, enables processes on different cells to share data.
Physical-level memory sharing is implemented by the memory allocation modules of the different cells. When a cell is short on memory, it can borrow a set of pages from some other cell. The allocation module of the original owner moves the pages to a reserved list and ignores them until the borrower returns them or crashes. This mechanism supports load balancing of memory, preventing a situation where a cell starts paging even though there is free memory elsewhere in the system.
Logical-level data sharing is implemented by the virtual memory system and the file system, which ensure that any two processes that open or map the same file can share the same cached pages of that file in memory. This supports efficient use of the file cache, a performance-critical resource for many workloads, and allows processes such as those forked from the same parent to share data pages efficiently across cell boundaries. Where necessary to improve locality, the virtual memory system can also replicate read-shared pages to multiple cells or migrate write-shared pages among cells.
Sharing of I/O devices is implemented just as in other distributed systems. A request to access a remote I/O device is forwarded to the cell that physically owns the device. That cell then executes the request and returns the result to the client cell. However, the system avoids the extra data copy which this design would normally require (to or from the memory of the cell which owns the device). The intercell memory sharing mechanisms allow devices to issue DMA accesses directly to the memory of the client cell.
Processor sharing is provided by a mechanism called spanning tasks. A spanning task is a collection of processes, one per cell, that are coordinated by the operating system to appear as a single user-level process. Changes to application-visible state in one cell (such as the current working directory) are propagated to all component processes of the spanning task.
Spanning tasks provide policy support for intercell sharing. A user-level process called Wax with a component on each cell provides hints to optimize global resource allocation (Figure 5). Moving the policy algorithms to user level allows the threads of Wax to synchronize using standard locks and nonblocking data structures, and to make decisions based on a global view of the system state. After any cell crash, Wax simply exits and restarts with clean data structures.
Hive relies on hardware features provided by the Stanford FLASH multiprocessor, which is currently being developed. Until FLASH is available, we use SimOS  for debugging, fault injection, and performance measurement. SimOS is a complete hardware simulation environment that exposes an operating system to the concurrency and resource stresses it would experience on the actual hardware.
For the experiments described here, we configured SimOS as shown in Table 1. The performance experiments use a model that is close to the FLASH machine, while the fault-injection experiments use SimOS in a faster mode that provides less timing accuracy but enables a larger number of experiments. We used a version of Hive that implements all of the features described here except spanning tasks and Wax.
[TABULAR DATA NOT REPRODUCIBLE IN ASCII]
We selected the test workloads shown in Table 2. Raytrace and ocean (taken from the Splash-2 suite ) are parallel applications that use the system in ways characteristic of scientific computing environments. Pmake (parallel make) is characteristic of the use of a multiprogrammed compute server. The applications executed in the test workloads are unmodified binaries that run on IRIX as well as Hive. In all cases the file cache was warmed up before running the workloads.
Performance experiments: Figure 6 compares Hive's performance in various configurations (numbers of processors per cell) against IRIX 5.2 running on the same hardware. As could be expected, the fault containment boundaries of Hive add little overhead to parallel scientific applications such as ocean and raytrace, since such applications place few demands on the operating system.
Under parallel compilation (pmake), which contains significant file stem and scheduler activity, the cell boundaries and increased complexity of Hive add a moderate amount of work. The single-cell Hive run is 12% slower than IRIX due primarily to extra synchronization added to the page fault handler. The two-cell run is an additional 8% slower due to RPC latency and the high cost of the checks required when a process is forked across a cell boundary. These costs are recouped as the system increases to eight cells, since partitioning the machine significantly reduces both cache miss and synchronization overheads in the kernel.
These overall times mask an important trend. As the number of cells increases, the number of useful kernel instructions executed (excluding the idle loop and spinning on locks) increases significantly. IRIX executes about 219 million useful kernel instructions to run pmake, while the eight-cell configuration of Hive requires 775 million to do the same job. Therefore future commercial Unix kernels that are tuned for lower cache miss and synchronization overheads are likely to outperform Hive at small system sizes. However, at the larger system sizes for which Hive is designed, the increased number of instructions executed by cell-based operating systems will pay off in lower overall execution times.
Fault-injection experiments: Table 3 summarizes the results of the fault-injection experiments. We injected faults while the system was executing the pmake workload, waited for the system to detect and recover from the fault, then checked all output files against reference copies to ensure that any wild writes due to the fault had not corrupted application output. In all cases, Hive successfully limited the effects of the fault to the cell where it was injected and no output files were corrupted.
For hardware faults, we halted a random node, both at random times and at times when the cells were cooperating closely. For software falts, we corrupted pointers in pathological ways. The values used to smash the pointers included: random memory addresses, an address one or two words away from the original contents of the pointer, and the address of the pointer itself.
We also measured the null recovery time (time to return to normal operation after a false alarm) with the accurate machine model used for the performance experiments. As the system increases from two to eight cells, the null recovery time decreases from 35.6msec to 11.0msec. This is due to the current implementation which performs a full preemptive discard check even in a null recovery. The preemptive discard check is faster with multiple cells because the cells run the check in parallel and each cell checks fewer pages.
With an implementation that skips the preemptive discard check in a null recovery, the dominant cost will be the live set algorithm. It takes 1.6msec on two cells, 2.0msec on eight cells, and we predict around 80msec for 64 cells (the maximum that FLASH can support). These times are fast enough to allow a false alarm every second or two without significant performance overhead, so we expect to be able to iteratively strengthen the wild write defense as described in the section on fast failure detection.
Limitations of the Hive
There are three limitations to the reliability mechanisms used in Hive. The first is the assumption that hardware faults are fail-fast. While this may appear difficult to achieve, previous experience with distributed systems indicates that it is a reasonable assumption. (In existing workstation networks, any machine can bring down the entire system by flooding the network or sending corrupt packets, yet this seldom happens.) The challenge for multiprocessor designers will be to provide sufficient checks in the hardware to assure fail-fast behavior, without impacting the performance or cost of large-scale machines.
A second limitation is the assumption that it is acceptable to temporarily pause user-level processes while the distributed agreement protocol is checking for faults. Even though these interruptions are brief, they ma-,, not be acceptable in systems with tight scheduling deadlines such as video servers. It would be possible to continue execution of some processes during fault detection, at some increased risk of data corruption due to wild writes.
A final limitation of our approach is the length of time it takes to detect kernel software faults. Even with aggressive failure detection that provides a high probability of detecting faults within 10 milliseconds or so, a modern processor will execute millions of instructions before the fault is detected. The open question is what probability remains of corrupting an application: the kernel fault has to lead to a wild write, the address of the wild write must be unprotected by the firewall, and the corrupt data must be read by an application on another cell before the page is discarded. This is clearly a small probability, and may be close to the probability of an undetected error in current systems (such as a double-bit error in a parity-protected cache). If so, the wild write defense will be strong enough for most applications. Measuring this probability will require extensive testing with a large-scale system running stressful workloads.
Our work on Hive lies on the border between three major areas of systems research: reliability, distributed systems, and operating systems for large-scale multiprocessors.
As discussed earlier, fault-tolerant systems such as the Tandem Guardian  provide a more powerful abstraction than the fault containment that Hive provides. However, they add costs (in performance or in required modifications to applications) that have prevented use of these systems for general-purpose applications.
Other approaches to software reliability include microkernels such as Mach  and object-oriented operating systems such as Spring , both of which partition the operating system into components in a way that makes it easier to test and debug. This reduces the frequency of software faults but does not help the system to recover from faults when they occur. There are potentially interesting synergies between these approaches and the fault containment strategies implemented in Hive. For example, running Hive on top of a trusted microkernel would enable containment of software faults without custom firewall hardware.
Hive is closely related to other tightly-coupled distributed systems. Systems such as Sprite  and Locus  have shown how to synthesize a single system image from multiple kernels that distrust each other. More recent distributed systems, such as Solaris MC at Sun Microsystems  and NOW at U.C. Berkeley , are designed as highly available single-system-image platforms for applications that currently run on multiprocessors. Hive differs from these systems in its focus on taking advantage of shared-memory hardware and defending against the problems caused by that hardware. Our techniques are readily applicable to a distributed operating system like Solaris MC or NOW if the platforms they run on are extended with hardware support for sharing user memory across the network.
There has also been considerable recent interest in operating systems for large multiprocessors. A partitioned kernel architecture like Hive is a promising way to improve performance scalability. Processes running on separate cells share few kernel resources, so operating system parallelism can be improved by increasing the number of cells. This approach to scalability has been explored in the Hurricane and Tornado operating systems developed at the University of Toronto .
The experiments we have performed to date show that the fault containment mechanisms of Hive are effective and add low performance overheads when implemented in a commercial operating system. However, it is difficult to predict the reliability of a complex system before it is used extensively in practice, and we have not yet been able to measure the performance of Hive on a large-scale multiprocessor.
The results are nonetheless encouraging. They suggest that fault containment can be efficiently implemented in a shared-memory multiprocessor. This is exciting because the prevailing wisdom has been that fault containment is impossible in a shared-memory system. Efficient fault containment will significantly increase the reliability of multiprocessors, and thereby enable the use of much larger systems than were practical with previous operating system designs.
[1.] Anderson, T., Culler, D., and Patterson, D. A case for NOW (networks of workstations). IEEE, Micro 15, 1 (Feb. 1995), 54-64.
[2.] Bartlett, J., Gray, J., and Horst. B. Fault tolerance in Tandem computer systems. in Evolution of Fault-Tolerant Computing, A. Avizienis, H. Kopetz, and J.C. Laprie, Eds. Springer-Verlag, New York, 1987.
[3.] Chapin, J., Rosenblum, M., Devine, S., et al. Hive: Fault containment for shared-memory multiprocessors. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles (Copper Mountain, Colo.. Dec. 9-6. 1995). ACM/SIGOPS, New York. 1995. pp. 12-25.
[4.] Johnson, B. Design and Analysis of Fault-Tolerant Digital Systems. Addison-Wesley, Reading, pp. 12-25.
[5.] Khalidi, Y., Bernabeu, J., Matena, V., et al. Solaris MC,: A multi-computer OS. In Proceedings of the USENIX 1996 Annual Technical Conference (San Diego, Calif., Jan. 22-26, 1996). USENIX Association, Berkeley, Calif, 1996, pp. 191-204.
[6.] Kuskin, J., Ofelt, D., Heinrich, M., et al. The Stanford FLASH multiprocessor. In Proceedings of the 21st International Symposium on Computer Architecture (Chicago, Ill., Apr. 18-21, 1964). IEEE Computer Society Press, 1994, pp. 302-313.
[7.] Lynch, N. Distributed Algorithms. Morgan Kaufmann, San Francisco, 1996.
[8.] Mitchell, J.G., Gibbons, J.J., Hamilton, G., et al. An overview of the Spring system. In Digest of Papers, Spring COMPCON 94 (San Francisco, Calif., Feb. 28-Mar. 4, 1994). IEEE Computer Society Press, 1994, pp. 122-131.
[9.] Ousterhout, J., Cherenson, A., Douglis, F., et al. The Sprite network operating system. Computer 21, 2 (Feb. 1988), 23-36.
[10.] Popek, G. and Walker, B., Eds. The LOCUS Distributed System Architecture. MIT Press, Cambridge, Mass., 1985.
[11.] Rosenblum, M., Herrod, S.A., Witchel, E., and Gupta, A. Complete computer system simulation: The SimOS approach. IEEE Parallel and Distributed Technology: Systems and Applications 3, 4 (Winter 1995) 34-43.
[12.] Sullivan, M. and Chillarge, R. Software defects and their impact on system availability -- A study of field failures in operating systems. In Proceedings of the 21st International Symposium on Fault-Tolerant Computing (Montreal, Canada, Jun. 25-27). IEEE Computer Society Press, 1991, pp. 2-9.
[13.] Unrau, R., Krieger, O., Gamsa, B., and Stumm, M. Hierarchical clustering: A structure for scalable multiprocessor operating system design. Journal of Supercomputing 9, 1/2 (Mar. 1995), 105-134.
[14.] Woo, S., Ohara, M., Torrie, E., et al. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22 d Annual International Symposium on Computer Architecture (Santa Margherita Ligure, Italy, Jun. 22-24). ACM/SIGARCH, 1995, pp. 24-36.
[15.] Young, M., Tevanian, A., Rashid, R., et al. The duality of memory and communication in the implementation of a multiprocessor operating system. In Proceedings of the 11th symposium on Operating Systems Principles (Austin, Tex., Nov. 8-11). ACM./SIGOPS, 1987, pp. 63-76.
|Printer friendly Cite/link Email Feedback|
|Title Annotation:||contains related article|
|Author:||Rosenblum, Mendel; Chapin, John; Teodosiu, Dan; Devine, Scott; Lahiri, Tirthankar; Gupta, Anoop|
|Publication:||Communications of the ACM|
|Date:||Sep 1, 1996|
|Previous Article:||Operating system support for high-speed communication.|
|Next Article:||Operating system support for persistant and recoverable computations.|