CACHE CONSIDERATIONS for MULTIPROCESSOR PROGRAMMERS

In their quest for faster machines, computer architects design computers to exploit the locality present in most programs. Locality takes several forms: Spatial locality arises because two contemporaneous memory references are likely to access nearby words. Temporal locality occurs because a recently referenced memory word is likely to be accessed again. A cache maintains a copy of data in a local memory that requires less time to access than the original data. Typically, a cache is a high-speed buffer adjacent to a processor that holds copies of memory locations likely to be used soon [12]. Caches are managed by hardware to increase their speed. Caches favor programs in which most references are local and permit them to execute faster than similar programs with less locality. Most uniprocessor programs exhibit considerable locality, even if their programmer was unaware of the existence of caches.

Parallel computing introduces a new type of locality, called processor locality, in which contemporaneous references to a memory word come from a single processor rather than from many different ones [1]. Many multiprocessors devote substantial resources to providing a large cache for each processor and to hiding these caches, allowing programmers to write correct programs without reasoning about them. Although programmers find it not only possible but easy to write multiprocessor programs in which each process has substantial locality, interactions among processes reduce performance and diminish the benefit of moving from a uniprocessor to a multiprocessor. This article describes the behavior of these hidden caches and presents some guidelines for programmers who wish to use them more effectively.

We are most interested in cache-coherent, shared-memory multiprocessors (multis) [5]. Many commercial multiprocessors, such as the Sequent Symmetry, Encore Multimax, and Alliant FX/8, are multis. Figure 1 shows the typical structure of these machines. Each processor contains a local cache that reduces the long expected delay of referencing main memory through an interconnection network (e.g., a shared bus). As long as a processor accesses data that is not shared with any other processor, the cache works like a uniprocessor's cache, keeping a copy of recently used locations. However, when a memory location is shared among processors, a cache-coherence protocol ensures that each processor sees a consistent view of the datum, even though it may be stored in more than one cache [4]. These protocols can reduce a program's performance by requiring expensive, nonlocal operations to invalidate or update shared data in other caches. These operations directly affect the processors referencing the shared data and indirectly slow all processors by increasing contention for the interconnection network and main memory.

Even multiprocessors that are not multis distinguish local and remote memories. On some computers, such as the BBN Butterfly, a portion of each processor's address space is local and can be accessed at low cost. On other machines, such as hypercubes, each processor's memory is local to it, and remote memory can only be referenced by sending a message to another processor. Programmers on these computers typically cache code and data in a processor's local memory. Some systems for these machines present the illusion of shared memory by caching pages in local memory [9]. Even though these local memories are not managed in hardware, many of the considerations we will discuss are applicable.

In future multiprocessors, the relative cost of coherence protocol operations will be larger than it is at the present time because technological improvements are not reducing communication costs as rapidly as computation times. Users are also demanding increasingly large systems, which require more communication. Properly exploiting all types of locality is critical to using tomorrow's multiprocessors efficiently. Eggers has presented empirical results demonstrating that today's multiprocessor programs frequently misuse the cache, thereby reducing its performance [7].

In the future, perhaps languages and compilers for parallel computers will take the following issues into consideration. In the meantime, the programmer must understand and efficiently use caches in order to take full advantage of multis.

This article restates results by Eggers and other computer architecture researchers in a manner comprehensible to programmers who may know little or nothing about caches. We will also introduce four simple models that help programmers appreciate the implications of multiprocessor caches (summarized in Table I). The Appendix contains a more detailed description of the underlying hardware.

No-Caches Model

The no-caches model assumes that all memory references go to main memory. The advantage of this model is that a programmer does not need to worry about locality, since all memory references are equally expensive. Nonlocal communication can only be reduced by eliminating memory references (i.e., keeping data within a processor's registers or recomputing results). This is the multiprocessor model which many programmers use. It is adequate for the purpose of discussing the functionality of concurrent programs. It fails, however, to capture the need for locality.

Infinite-Word-Caches Model

Our simplest cache model assumes an infinite cache of single memory locations. Once a location is referenced, it remains in a uniprocessor's cache forever. On a multiprocessor, the word disappears from a processor's cache when another processor writes into it. (1)

This model's principal software implication is that programmers should avoid unnecessary interleaving of references by more than one processor to the same memory word (unless all references are reads). To appreciate this point, consider the common programming paradigm of maintaining a central queue of tasks and having a process running on each processor remove tasks, execute them, and return new tasks to the queue. Although this arrangement is convenient, it ignores locality, since a datum may be modified by many tasks executing on different processors. Nonlocal operations will transfer the modified location between the processors' caches. If writes are frequent enough, the traffic generated by these operations will heavily load the memory system and can reduce the whole system's performance. One way to avoid this problem is to maintain a separate task queue for each processor [3] (2). In this case, repeated operations on an object will usually execute on the same processor. If a processor empties its queue, it can remove tasks from another processor's queue.
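As a concrete illustration of per-processor queues, the following C sketch (ours, not the authors'; NPROCS, task_queue, and get_task are hypothetical names) gives each processor its own locked list and lets an idle processor steal from a neighbor only when its own list is empty:

#include <pthread.h>
#include <stddef.h>

#define NPROCS 8                    /* assumed number of processors */

typedef struct task {
    struct task *next;
    void (*run)(void *);
    void *arg;
} task;

typedef struct {
    pthread_mutex_t lock;           /* protects this processor's list */
    task *head;
} task_queue;

static task_queue queues[NPROCS];

void init_queues(void) {
    for (int i = 0; i < NPROCS; i++) {
        pthread_mutex_init(&queues[i].lock, NULL);
        queues[i].head = NULL;
    }
}

/* Remove and return one task from q, or NULL if q is empty. */
static task *pop(task_queue *q) {
    pthread_mutex_lock(&q->lock);
    task *t = q->head;
    if (t != NULL)
        q->head = t->next;
    pthread_mutex_unlock(&q->lock);
    return t;
}

/* A processor works from its own queue first, so repeated operations on
   the same data tend to stay in its cache; it steals from other queues
   only when its own queue is empty. */
task *get_task(int me) {
    task *t = pop(&queues[me]);
    for (int p = 1; t == NULL && p < NPROCS; p++)
        t = pop(&queues[(me + p) % NPROCS]);
    return t;
}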

The interleaving problem can be most severe for variables used for interprocess synchronization, such as locks. A test-and-set operation that obtains a lock always modifies a memory location, regardless of whether the lock is free. After a process executes a test-and-set, the lock resides exclusively in that processor's cache. Two or more processes contending for a lock aggravate the situation by causing the lock to "ping-pong" between caches, generating large amounts of network traffic and slowing other processors. A simple solution is to test the state of the lock before performing a test-and-set instruction [11]. Only when the lock is free should the more expensive operation be used:

repeat
    /* Wait until lock is free before trying test-and-set */
    while (lock ≠ Free) do skip od;
until (test-and-set(lock) = Free);

With proper care, this solution, called test-and-test-and-set, works well for processors connected through a shared bus [3]. An equivalent technique for synchronization over more general interconnection networks is currently the subject of research [8, 13].
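For concreteness, here is one way the test-and-test-and-set loop might be written in C using the standard <stdatomic.h> operations (our illustration, not the article's code; atomic_exchange stands in for the test-and-set instruction):

#include <stdatomic.h>

/* lock value: 0 = Free, 1 = Held */
void acquire(atomic_int *lock) {
    for (;;) {
        /* Test: spin with ordinary loads; while the lock is held, this
           loop touches only the copy in the local cache and generates
           no extra network traffic. */
        while (atomic_load(lock) != 0)
            ;  /* skip */
        /* Test-and-set: the exchange writes the location and pulls it
           into this processor's cache, so attempt it only when the
           lock looked free. */
        if (atomic_exchange(lock, 1) == 0)
            return;
    }
}

void release(atomic_int *lock) {
    atomic_store(lock, 0);
}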

Infinite-Block-Caches Model

Most real caches do not hold individual memory locations. Instead, they hold groups of words surrounding the referenced locations. These word groups form what is called a block and are loaded together when any constituent location is referenced. Blocks of size B words are usually aligned, meaning that the address of the first word is a multiple of B. Typical values for B are 4, 8, or 16 words. Cache blocks exploit spatial locality. A program typically uses data in locations near the word it is currently referencing. These nearby words are brought into the cache along with the first referenced location.
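As a small illustration of block membership (ours, with an assumed block size of four 4-byte words), two byte addresses fall in the same aligned block exactly when they agree in everything but their low-order offset-within-block bits:

#include <stdbool.h>
#include <stdint.h>

#define BLOCK_BYTES 16   /* assumed: 4-word blocks of 4-byte words */

/* True if the two byte addresses lie in the same aligned cache block. */
bool same_block(uintptr_t a, uintptr_t b) {
    return (a / BLOCK_BYTES) == (b / BLOCK_BYTES);
}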

These blocks, however, may cause problems when different processors modify adjacent locations. The first write transfers the block to one processor's cache. The second write moves it to the other processor's cache. This sequence is called false sharing since no information is transferred [7]. False sharing arises when the data of two processors lie adjacent in memory. For example, in

declare integer data[100];
declare lock lock[100];

each element of the data vector is protected by the corresponding lock in the lock vector. If locks occupy a single memory word and cache blocks contain four words (typical values), a block could hold four different locks, so the block may ping-pong among as many as eight different processors even though no more than two of them ever use any one lock. A more effective way to arrange this data is to group related items together and keep unrelated items in separate cache blocks:

structure dataNlock {
    integer data;
    lock lock;
    /* Cache blocks are 4 words long */
    integer pad1, pad2;
};
declare dataNlock lockeddata[100];

The last two fields (pad1 and pad2) enlarge the structures so each lock-value pair resides in a distinct cache block (assuming that the array lockeddata is allocated starting on a four-word boundary).
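In C, the same padding idea might look like the sketch below (our illustration; the 16-byte block size, 4-byte integers, and the C11 _Alignas specifier are assumptions, and real machines and compilers vary):

#define BLOCK_BYTES 16   /* assumed: 4-word blocks of 4-byte words */

struct dataNlock {
    int data;            /* the protected datum (one word) */
    int lock;            /* its lock (one word) */
    int pad1, pad2;      /* padding to fill out one cache block */
};

/* Force the array onto a block boundary; each 16-byte element then
   occupies its own block, so no two locks share a block. */
_Alignas(BLOCK_BYTES) struct dataNlock lockeddata[100];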

Finite-Block-Caches Model

One feature not accounted for by the above models is the finite size of real caches, which often hold only 1K to 64K words. A cache of size C words with B-word blocks tends to contain the C/B blocks surrounding the most recent memory references. Finite caches limit locality on both uniprocessors and multiprocessors.

On uniprocessors, the limited cache size is the principal cause of cache misses. To reduce the number of misses, data should be organized so that common access patterns reference adjacent words; this enables the cache to hold the last C referenced words. If references are B or more addresses apart, the cache holds only the last C/B referenced words. In this case, the effective cache size is reduced by a factor of B (which is often 4 to 8). In addition, a programmer should try to reuse words before they are pushed out of the cache. For instance, consider arithmetic operations on vectors. The following two loops compute A ← B × C and E ← A × D, where A, B, C, D, and E are vectors of length N.

for i ← 1 to N do
    A[i] ← B[i] * C[i];
od;
for j ← 1 to N do
    E[j] ← A[j] * D[j];
od;

If N is large, the first locations of A may have been flushed from the cache by the time the first loop finishes. A better approach is to write these loops as a single loop and use values before they are flushed from the cache: (3)

for i ← 1 to N do
    A[i] ← B[i] * C[i];
    E[i] ← A[i] * D[i];
od;
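The same transformation in C might look like this (our sketch; the element type double and the function name are placeholders):

/* Fused pass: A[i] is consumed immediately after it is produced,
   while it is still guaranteed to be in the cache. */
void fused(int n, double A[], const double B[], const double C[],
           const double D[], double E[]) {
    for (int i = 0; i < n; i++) {
        A[i] = B[i] * C[i];
        E[i] = A[i] * D[i];
    }
}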

Optimizing programs to take finite caches into account is less important on multiprocessors than on uniprocessors. In many programs, the finite cache size will not be the dominant cause of cache misses; many misses will result from the factors discussed previously. Also, reducing cache misses is more complex on a multiprocessor because of interactions with other processors. For example, a change that keeps more items in a cache by packing them tightly may introduce false sharing between processors, degrading performance. Programmers should not optimize multiprocessor programs for finite caches unless the amount of data each processor uses is very large and the changes do not cause harmful interactions with other processors.

Conclusion

A program running on a multiprocessor no longer has a single, sequential order of execution. The temporal and spatial locality of a processor is easily disturbed by the actions of other processors. Some of these interactions are visible to a programmer, while others are artifacts of the hardware. A programmer who understands the basics of multiprocessor caches can reduce this extraneous interference and improve a program's performance.

Here are three rules-of-thumb to consider when writing a parallel program:

1. Try to perform all operations on a datum in the same processor to avoid unnecessary communication.

2. Align data to prevent locations used by different processors from occupying the same cache block.

3. Cluster work and re-use parts of the data quickly, instead of making long passes over all the data.

Programming languages do not currently facilitate this style of programming. A programmer must be aware of the underlying behavior of a multi's caches and write programs that exploit them properly.

(1) This model is most accurate for write-invalidate cache-coherence protocols (see Appendix). For the other class of protocols (write-update), this model correctly indicates that nonlocal references are caused by active sharing, but it does not reflect the exact costs of sharing.

(2) This has the additional benefit of reducing the bottleneck caused by a single queue.

(3) On vector machines with caches, programmers (or compilers) may have to compromise vectorizability to attain good cache performance [10]. In this example, however, coalescing the loops does not prevent vectorization.

References

[1] Agarwal, A. and Gupta, A. Memory-Reference Characteristics of Multiprocessor Applications under MACH. In Proceedings of ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (Santa Fe, N.M., 1988) pp. 215-226.

[2] Agarwal, A., Simoni, R., Horowitz, M., and Hennessy, J. An Evaluation of Directory Schemes for Cache Coherence. In Proceedings of the 15th Annual International Symposium on Computer Architecture (Honolulu, Hawaii, 1988), pp. 280-289.

[3] Anderson, T.E., Lazowska, E.D., and Levy, H.M. The Performance Implications of Thread Management Alternatives for Shared-Memory Multiprocessors. In Proceedings of ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (Berkeley, Calif. 1989) pp. 49-60.

[4] Archibald, J. and Baer, J.-L. Cache-Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model. ACM Trans. Comput. Syst. 4, 4 (1986) 273-298.

[5] Bell, C.G. Multis: A New Class of Multiprocessor Computers. Science, 228 (1985), 462-466.

[6] Eggers, S.J. and Katz, R.H. A Characterization of Sharing in Parallel Programs and its Application to Coherency Protocol Evaluation. In Proceedings of the 15th Annual International Symposium on Computer Architecture (Honolulu, Hawaii, 1988), pp. 373-382.

[7] Eggers, S.J. and Katz, R.H. The Effect of Sharing on the Cache and Bus Performance of Parallel Programs. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS III) (Boston, Mass., 1989), pp. 257-270.

[8] Goodman, J.R., Vernon, M.K., and Woest, P.J. Efficient Synchronization Primitives for Large-Scale Cache-Coherent Multiprocessors. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS III) (Boston, Mass., April 1989), pp. 64-77.

[9] Li, K. and Hudak, P. Memory Coherence in Shared Virtual Memory Systems. ACM Trans. Comput. Syst. 7, 4 (November 1989), 321-359.

[10] Liu, B. and Strother, N. Programming in VS Fortran on the IBM 3090 for Maximum Vector Performance. IEEE Computer, 21, 6 (June 1988), 65-76.

[11] Rudolph, L. and Segall, Z. Dynamic Decentralized Cache Schemes for MIMD Parallel Processors. In Proceedings of the 11th Annual International Symposium on Computer Architecture (Ann Arbor, Mich., 1984), pp. 340-347.

[12] Smith, A.J. Cache Memories. ACM Comput. Surv. 14, 3 (1982), 473-530.

[13] Yew, P.-C., Tzeng, N.-F., and Lawrie, D.H. Distributing Hot-Spot Addressing in Large-Scale Multiprocessors. IEEE Trans. Comput. C-36 (1987), 388-395.

MARK D. HILL is an assistant professor in the Computer Sciences Department at the University of Wisconsin, Madison. His research interests center on computer architecture, with an emphasis on performance considerations and implementation factors in memory systems.

JAMES R. LARUS is an assistant professor in the Computer Sciences Department at the University of Wisconsin, Madison. His research interests center on programming languages and compilers for parallel computers.