Garbage Collection for a Client-Server Persistent Object Store.

1. INTRODUCTION

A primary strength of object-oriented programming languages and database systems is their ability to model complex objects and the interrelationships among them. This modeling power, while allowing for a more natural representation of many real-world application domains, also raises new problems in the design of fundamental lower-level functions such as storage management. It has long been recognized that explicit storage management (using, for example, "malloc()" and "free()" statements) places a heavy burden on the developers of large programs. Manual detection and reclamation of garbage increases code complexity and is highly error-prone, raising the risk of memory leaks and dangling pointers. For these reasons, automated garbage collection (GC) has long been an active area of investigation for object-oriented programming languages.

More recently, garbage collection has also attracted the interest of designers of persistent systems. Several features of persistent systems further motivate the use of garbage collection. First, the data these systems manage are shared, and therefore the knowledge of data interconnection may be distributed among many programs, users, or both. It is thus extremely difficult, if not impossible, for programmers to determine when to explicitly reclaim storage. Second, sharing and persistence increase the potential damage caused by storage management errors, as mistakes made by one program can inadvertently affect others.

The notion of persistence (and conversely, garbage) can be tied to reachability. Reachability is a simple, implicit way to express persistence that is orthogonal to object type [Atkinson et al. 1990]. Objects that can be reached by traversing a path from some designated persistent roots can persist beyond the execution of the programs that created them. Objects that are not reachable from a root or from the transient state of an on-going program are garbage; such objects are inaccessible, and thus they can be safely reclaimed. Persistent systems pose additional problems for garbage collection, beyond those typically addressed by solutions for nonpersistent systems. These additional complications include the following:

Concurrency:

A garbage collector must coexist with an unbounded number of concurrent transactions running in parallel and must not adversely impact their ability to execute.

Atomic Transactions:

The rollback of partially completed transactions violates the fundamental assumption of garbage collection that unreachable objects always remain unreachable.

Fault Tolerance:

Persistent systems provide resilience to system failures. Garbage collection must also operate in a fault-tolerant manner.

Live Data Volume:

Modern garbage collection algorithms assume that the volume of live (i.e., nongarbage) objects is small. Persistence invalidates this assumption.

Disk-Resident Data:

The data managed by a persistent system typically reside on disk--the collector cannot ignore I/O.

Client Caching:

Many client-server systems cache and update data at clients. The dynamic, replicated, and distributed nature of these updates makes garbage collection problematic.

We have designed a concurrent garbage collection algorithm for client-server persistent stores and have implemented it in the context of an Object-Oriented Database Management System (OODBMS) [Amsaleg 1995]. Our algorithm addresses the above problems by extending a traditional garbage collection approach and integrating it closely with the mechanisms provided by the persistent object store. In particular, our algorithm is a variant of a "snapshot-at-beginning" scheme. Such schemes allow garbage collection to be interleaved with ongoing programs [Wilson 1992; Abdullahi and Ringwood 1998]. Our algorithm is tailored to cope with the problems introduced by persistence and is designed to be efficiently integrated with the buffer management, concurrency control, logging, recovery, and cache consistency mechanisms of a persistent object store. Like other recent work on persistent garbage collection [Cook et al. 1994b; Yong et al. 1994], we use a partitioned approach in order to avoid having to scan the entire (disk-resident) database before reclaiming any space. In contrast to that work, which has advocated copying or reference counting, we use a noncopying Mark and Sweep algorithm. The collector has the following characteristics:

--It is server-based, but requires no callbacks to clients and performs only minimal synchronization with client transactions.

--It works in the context of ACID transactions [Gray and Reuter 1993] with standard implementation techniques such as two-phase locking and Write-Ahead-Logging.

--It is incremental and nondisruptive; it holds no transactional locks on data and requires minimal logging. It has been designed to be efficient with respect to I/O.

--It is fault-tolerant--crashes of clients or servers during garbage collection will not corrupt the state of the persistent object space.

--It can be integrated into an existing object store, requiring few changes to the client software. It can coexist with performance enhancements such as intertransaction caching at clients and STEAL buffer management between clients and servers.

--It has been implemented, tested, and measured in client-server EXODUS [EXODUS 1993].

The remainder of this article is structured as follows: Section 2 describes some fundamental features of client-server persistent systems. Section 3 presents an overview of basic garbage collection techniques and motivates the choice for Mark and Sweep as a basis for the solution. Section 4 describes three problem scenarios that must be addressed when designing a Mark and Sweep garbage collector for a client-server persistent object store. Section 5 presents assumptions and properties that enforce the correctness of the solution. Section 6 describes the partitioned Mark and Sweep collection algorithm at a high level. Section 7 details the implementation of the algorithm in EXODUS. Section 8 presents an overview of the performance studies on the implemented system. Section 9 discusses related work. Section 10 presents conclusions and future work.

2. ARCHITECTURAL ISSUES

Many desirable features for persistent object stores, such as concurrent and correct access to shared data and resilience to failures, have been defined and studied in the context of object-oriented database management systems (OODBMS). In such systems, correctness in the presence of concurrent access and/or failures is tied to the notion of transactions. A transaction is a unit of work, possibly consisting of multiple data accesses and updates, that must commit or abort as a single atomic unit. Transaction executions respect the ACID properties of Atomicity, Consistency, Isolation, and Durability [Gray and Reuter 1993]. In a database system, the recovery component is primarily concerned with preserving Atomicity and Durability, while concurrency control is concerned with preserving Isolation. In the following sections we describe the key aspects of these components that are relevant to this article.

2.1 Buffer Management and Crash Recovery

The amount of work that a system must perform during recovery depends on how the buffer manager handles data that are updated by in-progress and/or committing transactions. The buffer manager is responsible for coordinating the transfer of data between main memory (i.e., volatile storage) and disk (i.e., nonvolatile storage). If the buffer manager allows an update made by an uncommitted transaction to overwrite the most recent committed value of a data item on nonvolatile storage, it is said to support a STEAL policy (the opposite is called NO-STEAL) [Effelsberg and Haerder 1984]. If the buffer manager ensures that all updates made by a transaction are reflected on nonvolatile storage before the transaction is allowed to commit, then it is said to support a FORCE policy (the opposite is NO-FORCE). High-performance buffer managers typically support the more flexible STEAL and NO-FORCE policies.

Support for the STEAL policy implies, in the event of a crash or abort, that uncommitted transactions may need to be rolled back (UNDO). Rolling back a transaction involves restoring the values of any nonvolatile copies of data that were overwritten by that transaction to their previous committed state. A NO-FORCE policy raises the possibility that some committed data values may be lost during a system crash because there is no guarantee that they have been placed on nonvolatile storage. This means that the system may need to reinstate (REDO) the effects of committed transactions to preserve the durability of committed updates.

UNDO and REDO typically rely on the use of a log. A log is a sequential file that contains records describing transactions and the updates they performed. One or more records are appended to the log for each update performed by a transaction. In a system with STEAL/NO-FORCE buffer management, log records contain both the old and new data values of items to enable UNDO and REDO respectively. Log records are also written for transaction management activities such as commit or abort. When a log record is created, it is assigned a Log Sequence Number (LSN), which serves to uniquely identify that record in the log. Log records from multiple transactions are interleaved in the log as they are created, but the log records pertaining to each transaction are chained together.

Database systems traditionally rely on a Write-Ahead-Logging (WAL) protocol to ensure, in the event of a crash, that the log contains sufficient information to perform the necessary UNDO and REDO work. The WAL protocol ensures (1) that all log records pertaining to an updated page are written to stable storage before the page itself is written to stable storage, and (2) that a transaction cannot commit until all of its log records have been written to stable storage. The ARIES algorithm, developed at IBM [Mohan et al. 1992], is a popular recovery approach that supports STEAL/NO-FORCE buffer management together with a WAL protocol.
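
To make the mechanics concrete, the following C++ sketch shows one plausible shape for a log record and for the WAL check performed before a page write. All type and field names here are illustrative assumptions, not the structures of any particular system.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    using LSN = uint64_t;  // Log Sequence Number: uniquely identifies a log record

    struct LogRecord {
        LSN lsn;                        // this record's position in the log
        LSN prevLsn;                    // previous record of the same transaction (chaining)
        uint64_t txnId;                 // transaction that performed the update
        uint32_t pageId;                // page to which the update applies
        std::vector<char> beforeImage;  // old value, used for UNDO
        std::vector<char> afterImage;   // new value, used for REDO
    };

    struct LogManager {
        LSN flushedLsn = 0;             // highest LSN known to be on stable storage
        void flushTo(LSN lsn) {
            // ... force the log tail out to disk ...
            flushedLsn = std::max(flushedLsn, lsn);
        }
    };

    // WAL rule (1): all log records pertaining to a page must be stable
    // before the page itself is written to stable storage.
    void writePageToDisk(LogManager& log, LSN pageLsn /* LSN of the page's last update */) {
        if (pageLsn > log.flushedLsn)
            log.flushTo(pageLsn);
        // ... now the page may safely be written ...
    }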

2.2 Concurrency Control

While the recovery component of the DBMS is responsible for preserving the Atomicity and Durability properties, the concurrency control component is responsible for preserving Isolation. The most prevalent implementation technique for concurrency control is locking. Typically, two types of locks are supported: shared and exclusive. Many concurrency control policies enforce a two-phase locking (2PL) policy in which, after a transaction has released a lock, it is not allowed to obtain any additional locks. Systems often enforce Strict 2PL, in which all locks are held until the end of the transaction. Strict 2PL avoids problems such as cascading aborts [Gray and Reuter 1993].

2.3 Client-Server Architecture

Most modern OODBMS are intended to run on groups of workstations and servers. Each server is the actual owner of some data and is ultimately responsible for preserving the ACID properties of the user transactions manipulating the data it owns. Each client workstation runs a process that manages the buffer of the client workstation, participates in the control of concurrent access, and interacts with the servers to provide database access for the application(s) running at the workstation.

Client-server OODBMS architectures typically use data-shipping, in which data items are sent from servers to clients so that query processing is performed at the client workstations. Data-shipping systems can be structured as page servers, which exchange physical units of data (pages) between servers and clients, or as object servers, in which clients and servers interact using logical units of data [DeWitt et al. 1990; Carey et al. 1994]. For concreteness, in this article we focus on page servers. Upon request, data pages required by an application are copied and brought into the client's local memory. Clients perform updates on their cached page copies. As with centralized systems, locking can be used to ensure the correctness of database access when multiple clients are running concurrently.

While many server buffer managers implement a STEAL/NO-FORCE policy, client buffer managers may choose to implement a STEAL/FORCE policy for sending updated pages to servers. A STEAL policy allows clients to send updated items back to servers at any time and in any order during the execution of a transaction, which can improve performance. The servers may therefore receive uncommitted data, which is handled by the server buffer manager. For simplicity, clients can use a FORCE policy in which all pages updated by a client are copied to the servers prior to transaction commit.(1) Copying pages to the servers instead of discarding them allows clients to cache data pages across transaction boundaries. Such caching requires that a cache consistency protocol ensure that transactions do not access stale pages (see Franklin et al. [1997] for a taxonomy of such protocols).

Like centralized systems, client-server systems can use Write-Ahead-Logging to cope with failures. When a client updates data, log records are generated and spooled at the client. When the client's buffer manager decides to send back a page to the server, it sends the log records pertaining to that page before sending the page itself. In other words, the WAL protocol is applied between the clients and the server memory and also between the server's memory and its disks.

3. CHOOSING A GARBAGE COLLECTION ALGORITHM

As stated in the introduction, garbage collection has been intensively studied in the context of traditional (nonpersistent) programming languages. A survey of algorithms appears in Wilson [1992]. In this section, we briefly describe the basic garbage collection algorithms that have been developed for programming languages (other collection techniques for persistent stores will be described in Section 9), and motivate our choice of Mark and Sweep as the basis for our algorithm.

3.1 Basic Garbage Collection Techniques

Garbage collectors reclaim objects that are no longer in use in order to reuse their storage. To distinguish live objects from garbage objects, a garbage collector uses the notion of reachability from a root. Traditionally, the root is defined as the state of running programs, i.e., stacks, variables, and registers. Objects that can be reached by traversing any pointer path from the root are live. Objects unreachable from the root are garbage.

The basic GC algorithms can be categorized as reference-counting or tracing-based. In a reference-counting system, each object has an associated count of the references to it. Each time a program creates an additional reference to an object, the object's count is incremented. When an existing reference to an object is destroyed, the corresponding count is decremented. The memory occupied by an object can be reclaimed when the object's count drops to zero.
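
In code, the bookkeeping reduces to a pair of operations on a per-object counter; a minimal illustrative sketch:

    // Reference counting reduced to its essentials (illustrative).
    struct Object;
    void reclaim(Object* o);  // returns the object's storage to the allocator

    struct Object {
        int refCount = 0;
    };

    void addReference(Object* o)  { ++o->refCount; }
    void dropReference(Object* o) {
        if (--o->refCount == 0)
            reclaim(o);  // count reached zero: the object is unreachable
                         // (note: objects on a garbage cycle never reach zero)
    }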

Tracing-based collectors are divided into Mark and Sweep and Copy-based algorithms. Mark and Sweep collectors are two-phase algorithms. The Mark phase begins at the root and marks (or "colors") every reachable object. The marks can be represented with a bit embedded in each object or in separate color maps that record, for every memory page, the colors of the objects stored in that page. Once all live objects are marked, the Sweep phase scans through the memory and reclaims any unmarked objects. Copy-based collectors divide memory into two disjoint areas called from-space and to-space. Programs manipulate from-space objects while the to-space is left empty. Instead of marking and sweeping, copying collectors copy (usually in a depth-first manner) the from-space objects reachable from the root into the to-space. Once all live objects have been copied, the contents of the from-space are discarded, and the roles of the two spaces are exchanged. The objects are typically copied linearly into the to-space, thereby compacting memory.
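
The following C++ sketch condenses the two phases of a basic (nonincremental) Mark and Sweep collector; it is illustrative only, with an explicit heap list standing in for real memory management.

    #include <vector>

    struct Obj {
        bool marked = false;
        std::vector<Obj*> children;  // outgoing references
    };

    // Mark phase: color every object reachable from the root.
    void mark(Obj* o) {
        if (o == nullptr || o->marked) return;
        o->marked = true;
        for (Obj* child : o->children)
            mark(child);
    }

    // Sweep phase: scan all objects and reclaim the unmarked ones.
    void sweep(std::vector<Obj*>& heap) {
        for (Obj*& o : heap) {
            if (!o->marked) {
                delete o;            // unreachable: reclaim its storage
                o = nullptr;
            } else {
                o->marked = false;   // live: reset its color for the next cycle
            }
        }
    }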

The basic implementations of Mark and Sweep and Copy-based algorithms are "stop-the-world," i.e., user programs are suspended during the entire collection process. For many applications, however, stop-the-world algorithms cannot be used because of their disruptive behavior. Preserving the response time of user applications requires the use of incremental techniques. Incremental garbage collection raises the possibility that while the collector is tracing the object graph, program activity may change other parts of the object graph. If care is not taken, the collector could miss tracing some reachable objects and thus may erroneously reclaim them. Properties that incremental GC algorithms must respect to guarantee correctness were first described in Dijkstra et al. [1978] and Baker [1978], and a classification of incremental GC techniques has been proposed by Wilson [1992].

3.2 The Case for Incremental Partitioned Mark and Sweep

Our persistent garbage collector is based on a Mark and Sweep approach. Before describing our algorithm, we first briefly describe the motivation behind this choice. Using reference counting to collect garbage in a persistent system is problematic for two reasons. First, reference counting cannot reclaim circular structures [McBeth 1963]; unreclaimed cycles of garbage waste storage space and cause fragmentation. Second, in a persistent store, the management of reference counts must be done in a fault-tolerant manner, which can result in high overhead.

Of the tracing-based techniques, we believe that a Mark and Sweep approach is more efficient and easier to integrate into an existing system than a copying approach, for the following reasons:

--Copying causes objects to move between pages, which can require locks and can complicate recovery [Kolodner and Weihl 1993; Yong et al. 1994].

--The clustering produced by a copying-based collector may be in conflict with database requirements or user-specified hints (e.g., Benzaken and Delobel [1990] and Gruber and Amsaleg [1994]). For example, generation scavenging [Ungar 1984] tends to cluster objects by their age.

--While copying is sometimes preferred to Mark and Sweep because it compacts the objects, persistent systems typically use slotted pages, which allow objects to be moved within a page, thus mitigating the potential fragmentation problems of sweeping.

In order to minimize the impact of garbage collection on the performance of concurrent transactions, particularly when dealing with a large, disk-resident object space, our Mark and Sweep algorithm has two additional properties. First, we use an incremental approach so that garbage collection and regular transaction processing can be interleaved. Second, we use a partitioned approach, so that space can be reclaimed without having to first traverse the entire object space. In the following sections, we describe the algorithm and its implementation in detail. First, however, we describe several problem scenarios that the algorithm must be designed to avoid.

4. PROBLEM SCENARIOS

We have identified three specific types of problems that can arise when implementing a server-based incremental Mark and Sweep collector in a client-server persistent store. Some of these problems arise with any type of incremental collection, while others result from specific properties of a client-server persistent store. We describe these problems in the following sections.

4.1 Problem 1: Partial Flush of Multipage Updates

The fundamental problem for incremental tracing-based collectors arises because the object graph can be changing while the collector is running. If the collector encounters partial results of updates it can erroneously reclaim live objects. This problem is exacerbated in the client-server environment because updates are performed at clients and are asynchronously sent to the server. As a result, a garbage collector running at the server may have an inconsistent view of interobject references that span page boundaries. Furthermore, it is not possible, in general, to produce a correct ordering of page flushes, as there can be circular dependencies among pages due to multiple updates.

There are two specific problems that can arise: (1) partial flush of updated objects and (2) partial flush of newly created objects. Figure 1 shows an example of the first problem. In this figure (and in the subsequent ones) the solid black squares represent the persistent root, and objects are represented by circles. Updated objects are shaded in order to distinguish between the before and after states of the update operations. In step (1), the client changes the parent of object C from object A to object B, which resides on a different page than C. The page containing C is then sent to the server, but the page containing B (the new parent) is not. In step (2) the server runs garbage collection and reclaims object C because it appears to be unreachable at the server. In step (3) the page containing B is sent to the server. As shown in step (4), the object space is now corrupted, as object B contains a dangling reference.

[Figure 1 ILLUSTRATION OMITTED]

A similar problem can arise with newly created objects (see Figure 2). In step (1), the client creates a new object C that is referenced by object A, residing on a different page. The page containing the new object C is then sent to the server, but the page containing the parent object A is not sent. In step (2), the garbage collector incorrectly determines that object C is unreachable, and reclaims it. This will result in a dangling reference when the page containing A is sent to the server.

[Figure 2 ILLUSTRATION OMITTED]

4.2 Problem 2: Transaction Rollback

A problem that does not arise in traditional, nonpersistent garbage collection is the rollback problem. Allowing clients to send updated pages to the server at any time (as allowed by the STEAL policy; see Section 2) makes it possible that a garbage collector running at the server will encounter dirty (i.e., uncommitted) pages. If a page containing uncommitted updates is garbage collected, the subsequent rollback of those updates may be difficult. An example of this problem is shown in Figure 3.

[Figure 3 ILLUSTRATION OMITTED]

In step (1), a transaction updates object A at the client by removing its reference to object B and then flushes a copy of the updated page to the server. In step (2), after receiving the updated page, the server runs the garbage collector, which reclaims object B, as it appears to be unreachable. In step (3), the transaction is rolled back. If the rollback does not take garbage collection into account (as shown in the figure), then object A will be left with a dangling reference to an object that no longer exists. This problem arises because transaction rollback violates a fundamental assumption of garbage collection, namely that unreachability is a stable property of an object. It is important to note that this is a nontrivial problem for implementing rollback, as the page containing object B was never actually updated by the aborted transaction.

4.3 Problem 3: Overwriting Collected Pages

A third set of problems involves the overwriting of a garbage-collected page. These overwriting problems do not arise for incremental garbage collection in a nonpersistent, uniprocessor environment. One instance of the overwriting problem occurs during transaction REDO. If a transaction uses space on a page that was made available by the garbage collector, and the garbage-collected page is not on stable storage at the time of a crash, then the REDO of the operation may be performed on an uncollected version of the page, and hence the REDO could fail due to lack of free space on the page. This problem is illustrated by Figure 4.

[Figure 4 ILLUSTRATION OMITTED]

In step (1), the collector cleans a page that is subsequently requested by the client. In step (2), the client creates three objects on a copy of that page. The disk-resident copy of the page, however, does not yet reflect the collection--it still contains a garbage object that consumes space (object "Y" in the figure). In step (3), the client commits, sending along the updated page and the associated log records; the log records are forced to stable storage. In step (4), the system crashes before the updated page reaches the disk. Recovery is then initiated. REDO fails because it attempts to recreate the three new objects on the uncollected version of the page, which has free space for just two of them.

A less serious instance of the overwriting problem can arise when the server cleans a page and a client updates its (uncollected) cached copy of that page. In this case, the client will overwrite the work of the collector when it sends the dirty page back to the server. Unlike the previous problem, this overwriting is a matter of efficiency rather than correctness. The probability of such overwriting, however, is increased significantly by intertransaction page caching, which allows clients to retain cached page copies over long periods of time.

5. SOLVING THE PROBLEMS

Correctness properties for incremental tracing-based garbage collectors have been identified by Dijkstra et al. [1978] and Baker [1978], and are described by Wilson [1992]. Our solution builds on these properties by extending them to account for transaction semantics and describes how to efficiently implement the extended properties in a client-server persistent environment. As such, the solution depends on certain assumptions about the system architecture. In this section, we describe these assumptions and the extended correctness properties. We also discuss how in combination, the assumptions and properties can be used to solve the correctness problems described in the preceding section.

5.1 Assumptions

Our incremental Mark and Sweep collector depends upon four assumptions about the system architecture, which follow from the architectural issues described in Section 2:

Assumption A1:

All user operations that access or modify object pointers are done using two-phase locks that are held until the end of transaction.

Assumption A2:

The validity of object identifiers (OIDs) is maintained only for those OIDs that reside in the database or in the transient state (e.g., program variables, stacks, etc.) of active transactions. For example, an OID extracted from the database and stored elsewhere by one transaction is not guaranteed to be valid for use by a subsequent transaction.

Assumption A3:

The system follows the Write-Ahead-Logging (WAL) protocol between clients and the server. That is, a client sends all log records pertaining to a page to the server before it sends a dirty copy of that page, and sends all log records pertaining to a transaction prior to the commit of that transaction.

Assumption A4:

Clients follow a FORCE policy with respect to the server. That is, all pages dirtied by a transaction are copied to the server prior to transaction commit. Note that the force can be to the server's memory (i.e., no disk I/O is required), and clients can retain copies of the pages in their local cache.

Assumptions A1 and A2 are fundamental to the algorithm's correctness; A3 and A4 are important details on which our specific implementation of the algorithm is based. A1 and A2 are supported by many systems including O2 [Bancilhon et al. 1991], ObjectStore [Lamb et al. 1991], and EXODUS. A1 is simply strict two-phase locking, as is traditionally used to implement degree-three transaction consistency [Berenson et al. 1995]. A2 is enforced by many systems in order to allow them to reorganize the database to improve performance. Furthermore, ObjectStore enforces A2 in order to be able to dynamically map large databases into a (smaller) virtual address space. A3 must be supported by any client-server DBMS that implements WAL-based recovery at servers (e.g., EXODUS [Franklin et al. 1992] and ARIES/CSA [Mohan and Narang 1994]). A4 simplifies recovery and avoids the need for client checkpoints in such systems. The trade-offs involved in relaxing A4 are discussed in Franklin et al. [1992; 1993] and Mohan and Narang [1994]. For the purposes of garbage collection, relaxing A4 would require additional coordination with the buffer manager to determine when certain page copies have arrived at the server.

5.2 Properties

Our algorithm enforces three properties in order to ensure correctness.

Property P1:

When a transaction cuts a reference to an object, the object is considered to be live until the first collection that is initiated after the completion (i.e., commit or abort) of that transaction.

Property P2:

A new object is not eligible for reclamation until the first collection that is initiated after the completion of the transaction that created it.

Property P3:

Space reclaimed by the sweeper can be reused only when the freeing of that space is reflected in the log or on stable storage.

Properties P1 and P2 are similar to the correctness properties enforced for pointer updates and object creations by snapshot-at-beginning collection algorithms [Wilson 1992]. Snapshot-at-beginning collectors ensure that no objects become inaccessible to the garbage collector as the result of operations that occur in the midst of an ongoing collection cycle. P1 and P2 extend this notion to account for transaction semantics. The extensions are somewhat subtle. First, P1 and P2 are stronger than the snapshot-at-beginning properties because they include some operations that occur before a garbage collection cycle starts. That is, any operation that is included in a transaction that overlaps with a garbage collection cycle is covered, even though the operation may have been performed long before the collection was initiated. Second, P1 and P2 do not insist that a transaction commits (nor do they insist that it aborts). They simply require that the transaction completes. This formulation of the properties is possible because of the all-or-nothing property of ACID transactions. Any garbage collection cycle that begins after the completion of a transaction is guaranteed to see either all of the changes made by the transaction (for committed transactions) or none of them (for aborted transactions). Obviously, the notion of transactions was not considered in the development of the nonpersistent garbage collection solutions documented in Wilson [1992]. Property P3 is unique to persistent garbage collection and is required in order to cope with the effects of transaction REDO recovery.

Property P1, in conjunction with assumption A1, protects against the rollback problem (problem 2) described in Section 4.2, as it ensures that only objects made unreachable by committed transactions are eligible for reclamation. It also (in conjunction with assumption A4) avoids the partial flush problems of Section 4.1 (problem 1); a reference cut will be ignored by any garbage collector that runs concurrently with the cutting transaction, and a collector that runs after the transaction commits will see only the after-images of all pages updated by the transaction (due to A4). Thus, no collection will be done based on a partially flushed update.

In a similar manner, Property P2 protects against the incorrect reclamation of newly created objects due to partial flushes. Taken together, P1 and P2 allow the collector to run incrementally and concurrently with transactions, without having to obtain locks. P1 and P2 protect objects from the problems that could arise when the collector observes uncommitted states of the database. Furthermore, because P1 and P2 protect objects for the duration of any transaction that could cause them to be reclaimed, the collector can ignore the program state (i.e., program variables, stacks, etc.) of any transactions that are concurrently executing at the clients.

It is important to note that P1 and P2, along with the assumptions listed in Section 5.1, ensure that any object that is considered to be garbage at the end of a garbage collection pass is truly unreachable and can safely be collected. The key insight is that assumption A1 requires that all accesses to pointers in the database be performed in the context of strict two-phase locking, and A2 requires that all object pointers used by a transaction be obtained from the database. Observe that an object becomes garbage only when the transaction that cuts the last reference to the object commits. Such a transaction must have obtained a write lock on the last reference, so A1 ensures that at the time the last reference is cut, no other concurrent transaction could have accessed the reference. A2 ensures that no subsequent transactions will be able to use the reference. Thus, A1 and A2 taken together ensure that the only transaction that could possibly "reattach" an object to the object graph is the transaction that caused it to become disconnected from the graph in the first place. Assumption A4 ensures that any garbage collection pass that begins after the commit of such a transaction will see all of the pages updated by that transaction, and thus will see any such reattachment.

While P1 and P2 protect against problems that could arise during normal transaction operation and rollback, P3 protects against problems that could arise during crash recovery. In particular, it ensures that REDO can, if necessary, reclaim space freed by a sweeper before attempting to REDO operations that may have used that space. Thus, P3 protects against the REDO problem raised in Section 4.3. Note that the sweeping of a page never needs to be undone because P1 and P2 ensure that any objects that are reclaimed must have been disconnected from the object graph by committed transactions.

6. AN INCREMENTAL MARK AND SWEEP FOR A PERSISTENT SYSTEM

We now present our incremental Mark and Sweep garbage collection algorithm for client-server persistent object stores. The crux of the algorithm is the efficient preservation of the properties defined in the preceding section. We first focus on the collection of a monolithic object space and then describe how to extend the approach to allow the independent collection of database partitions.

6.1 Protecting Cut References

Property P1, which protects objects that have had references to them cut from being collected prematurely, is enforced using a data structure called the Pruned Reference Table (PRT). Each entry in the PRT contains the OID of the referenced object and the identifier of the transaction that cut the reference to it. By considering the PRT as an additional root of persistence, the garbage collector is forced to be conservative with respect to uncommitted reference cuts. That is, any object that has an entry in the PRT will be marked live and will be traversed by the marker (if it is not already marked) so that its child objects will be marked as well.(2) Thus, a single PRT entry transitively protects all of the objects that are reachable from the protected object.

In order to make the necessary entries in the PRT, all updates to pointer fields in objects must be trapped. Traps of this form are typically implemented using a write barrier [Wilson 1992; Yong et al. 1994]. A write barrier detects when an assignment operation occurs and performs any bookkeeping that is required by the garbage collector. Since the garbage collector (and hence, the PRT) resides at the server while updates are performed at clients, a write barrier in the traditional sense would be highly inefficient. To solve this problem, we rely on the fact that clients follow the WAL protocol (Assumption A3). The WAL protocol ensures that log records for updates will arrive at the server prior to the arrival of the data pages that reflect the updates. At the server, every incoming log record is examined for the purpose of transaction management. When the server encounters a log record for a reference cut, an entry in the PRT is made. Note that unlike previous work that exploits logs (e.g., Kolodner and Weihl [1993] and O'Toole et al. [1993]), this algorithm processes log records as they arrive at the server--prior to their reaching stable storage.

When a transaction terminates (commits or aborts), it flags its entries in the PRT. These flagged entries are removed prior to the start of the next garbage collection (i.e., the start of the next marking phase). The PRT entries cannot be removed exactly at the time a transaction terminates even though all dirtied pages are copied to the server on commit. This is because an on-going marker may have already scanned the relevant parts of the object graph using the previous copies of the objects. The next time the marker runs, however, it is known that it will see the effects of all the updates made by the committed transaction, so the PRT entries for that transaction can be safely removed. If no collection is in progress when a transaction terminates, the PRT entries for that transaction can be removed immediately.
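
A sketch of this server-side bookkeeping is shown below; the types and the helpers beforeImageContainsOid() and txnOf() are hypothetical stand-ins for the server's log-processing code, not actual EXODUS interfaces.

    #include <cstdint>
    #include <vector>

    struct OID { uint32_t page, slot; };   // physical object identifier (illustrative)
    struct LogRecord;                      // as received from a client

    // Hypothetical helper: true if the record's before-image overwrote an OID,
    // in which case *cut is set to the destination of the cut reference.
    bool beforeImageContainsOid(const LogRecord& rec, OID* cut);
    uint64_t txnOf(const LogRecord& rec);  // hypothetical accessor

    struct PrtEntry {
        OID      prunedObject;   // destination of the cut reference
        uint64_t txnId;          // transaction that cut it
        bool     flagged = false;
    };

    // Called for every incoming log record, before the data page arrives (WAL).
    void processLogRecord(const LogRecord& rec, std::vector<PrtEntry>& prt) {
        if (OID cut; beforeImageContainsOid(rec, &cut))
            prt.push_back({cut, txnOf(rec), false});
    }

    // On commit or abort, entries are flagged only; they are removed at the
    // start of the next marking phase, since an in-progress marker may have
    // already traversed the old copies of the affected objects.
    void onTransactionEnd(uint64_t txnId, std::vector<PrtEntry>& prt) {
        for (PrtEntry& e : prt)
            if (e.txnId == txnId)
                e.flagged = true;
    }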

Figure 5 illustrates how the PRT corrects the transaction rollback problem raised in Section 4.2. In step (1), the client transaction T1 updates object A by removing its reference to B (i.e., setting it to NULL). This update automatically generates a log record containing the identifier of the updating transaction (T1), the identifier of the updated object (A), the before-image of the update (the reference to B), and the after-image (NULL). The client buffer manager later decides to flush the updated page. The Write-Ahead-Logging assumption ensures that all log records pertaining to a page will be sent back to the server before the page itself. The log record is received at the server during step (2). The server processes the record and extracts the information needed to generate a PRT entry. The entry references B and is associated with transaction T1; the entry will later force the collector to mark B during its mark phase. When the updated page is received at the server (step (3)), B is protected by the PRT. Thus, the garbage collector will not interfere with a subsequent rollback of transaction T1.

[Figure 5 ILLUSTRATION OMITTED]

In addition to solving the rollback problem, the PRT also solves the problem of the partial flush of pages containing updated objects (Section 4.1). With respect to the case illustrated by Figure 1, the PRT would store the reference to object C that was cut, thereby preventing C from being erroneously collected.

6.2 Protecting New Objects

While the PRT mechanism solves the problems that can arise when references to objects are cut, it does not solve the problems that arise with newly created objects. We therefore introduce an additional structure called the Created Object Table (COT). The COT is used to enforce P2. As with the PRT, the COT is maintained at the server and is updated based on log records received from the clients. When a log record reflecting the creation of an object arrives at the server, an entry is made in the COT. This entry contains the OID of the new object and the identifier of the transaction that created it.

In contrast to the PRT, which is used during marking, the COT is used during the sweeping phase. The sweeping phase scans the object space linearly and reclaims any objects that have not been colored by the marker. Before reclaiming an object, the sweeper checks to see if it is referenced from the COT, and, if so, the object is not reclaimed.
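
The sweeper-side check can be sketched as follows; the page and slot types are illustrative, and a real slotted-page implementation would differ.

    #include <cstdint>
    #include <set>
    #include <vector>

    struct OID {
        uint32_t page, slot;
        bool operator<(const OID& o) const {
            return page != o.page ? page < o.page : slot < o.slot;
        }
    };

    struct SlotInfo { OID oid; bool used = true; bool marked = false; };
    struct Page     { std::vector<SlotInfo> slots; };

    // Sweep one page: reclaim objects that are neither marked nor COT-protected.
    void sweepPage(Page& page, const std::set<OID>& cot) {
        for (SlotInfo& s : page.slots) {
            if (!s.used) continue;
            if (!s.marked && cot.count(s.oid) == 0)
                s.used = false;     // unmarked and not newly created: reclaim
            else
                s.marked = false;   // live or COT-protected: keep, reset color
        }
    }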

It is important to note that the marker never needs to traverse the references inside objects protected by the COT. This is because an object referenced by an object in the COT must be one of the following: (1) a new object, (2) a previously existing object that is also referenced from another existing object, or (3) a previously existing object that is referenced solely by this new object.

Figure 6 illustrates these three categories. In this figure, new objects are represented using triangles, existing objects using circles. Figure 6(a) shows objects that belong to the first category. Such objects are already protected by the COT. Figure 6(b) shows an example of an object from the second category. This object (B) will not be erroneously collected because it is also referenced by a live object A. Figure 6(c) shows an example of an object from the third category. In order for a preexisting object B to be referenced from a new object (e.g., object C), the transaction (T1) that created the new object must have followed a reference to it. Due to two-phase locking (Assumption A1 in Section 5.1) T1 must have acquired a read lock on the object containing that reference (object A in the figure). If at the end of T1 object B is no longer referenced by any preexisting objects, then T1 must have cut the reference that it originally followed. In this case, B is protected by a PRT entry associated with transaction T1.

[Figure 6 ILLUSTRATION OMITTED]

The sweeping phase of the garbage collection algorithm and the COT interact as follows. The sweeper is run incrementally, with the sweep of a single page being the finest granule of operation. The WAL rule (assumption A3) ensures that the sweeper sees all COT entries for the pages it sweeps. As with the PRT (and for similar reasons), entries in the COT are also flagged for removal when the transaction that created them terminates; the flagged entries are removed at the beginning of the next garbage collection cycle.

6.3 Fault Tolerance

We use a simple approach toward fault tolerance: in the event of a server crash during garbage collection, the interrupted collection cycle is simply discarded. A new garbage collection cycle will be restarted after the recovery process is complete. We restart the collector from scratch, that is, with an empty COT and PRT. The contents of the COT and PRT are not needed after recovery because, at that point, the server must have a transaction-consistent snapshot of the database.

The ability to discard the COT and PRT contents in the event of a failure provides two important performance benefits. First, these data structures can be managed in volatile memory. Second, we do not need to log the creation and removal of entries in these tables. If a system crash occurs during the marking phase, then no changes had yet been made to data pages, so there is no danger of corrupting the database; thus, no recovery work is required. If a system crash occurs during the sweeping phase, then some pages will have been swept, and some will not have been. Unswept pages can simply be ignored--they will still contain garbage objects, but these will be reclaimed by the next garbage collection. As described in Section 4.3, however, some pages that have been swept could cause errors during the recovery process (see problem number 3). This problem is avoided by enforcing P3.

Recall that P3 requires the effects of a page sweep to be reflected in the log or on stable storage before the space freed on the page is reused. If slotted pages are being used, then one way to enforce P3 is to have the sweeper create and log a slot map for each page that it modifies. A slot map for a page contains a single bit for every slot in the page, which indicates whether or not the corresponding slot is used. The slot map of a page serves as a logical log record for the sweep. This log record allows the page sweep to be redone in the event of a crash so that subsequent REDOs of operations that used the freed space will be able to complete successfully. Note that no before-images need to be logged for page sweeps, as page sweeps never need to be undone.
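
Under the illustrative Page and SlotInfo types from the sweep sketch above, building such a slot map might look like the following; the record layout is an assumption, not the EXODUS log format.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // One bit per slot: 1 = slot still used after the sweep, 0 = slot freed.
    // Logged as a logical (REDO-only) record before freed space may be reused (P3).
    std::vector<uint8_t> buildSlotMap(const Page& page) {
        std::vector<uint8_t> map((page.slots.size() + 7) / 8, 0);
        for (std::size_t i = 0; i < page.slots.size(); ++i)
            if (page.slots[i].used)
                map[i / 8] |= uint8_t(1u << (i % 8));
        return map;
    }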

6.4 Preventing Overwrite of Collected Pages

Logging protects the system from problems that can arise during transaction REDO. Logging, however, does not protect against the problem of sweeper work being wasted due to page overwrites (as described in Section 4.3). It is possible to reduce the likelihood of this problem in the following way. The sweeper tries to obtain a no-wait/instant read lock on a page. Such a lock is released immediately upon being granted, and the sweeper does not block if the lock is not granted. If the lock is not granted, then some transaction has a write lock on the page. In this case, the sweeper simply skips the page, as any work that it does will only be overwritten when the transaction holding the lock commits. This overwrite problem is exacerbated in systems that perform intertransaction client caching. For such systems it is possible to have the sweeper exploit the cache consistency mechanism in order to reduce the potential for clients to update unswept page copies. One such mechanism is described in Section 7.2.
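
A sketch of the lock probe appears below, with a toy lock manager standing in for the real one; the no-wait/instant semantics are modeled as a simple conflict check.

    #include <cstdint>
    #include <set>

    struct LockManager {
        std::set<uint32_t> writeLocked;  // pages currently write-locked by transactions

        // No-wait/instant read lock: granted (and conceptually released at once)
        // iff no conflicting write lock is held; the caller never blocks.
        bool tryInstantReadLock(uint32_t pageId) const {
            return writeLocked.count(pageId) == 0;
        }
    };

    // The sweeper probes before loading a page and skips it on conflict,
    // since any sweeping done would be overwritten when the writer commits.
    bool shouldSweep(const LockManager& lm, uint32_t pageId) {
        return lm.tryInstantReadLock(pageId);
    }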

6.5 Collecting Partitions

The preceding sections described the collector in the context of a monolithic object space. Such an algorithm is not appropriate for a persistent store because it requires that the entire persistent store be scanned before any space can be reclaimed. Partitioning is a well-known technique that has been used to enable the independent collection of portions of an object space. Partitioning schemes were initially introduced to cope with large (at the time) address spaces for programming languages [Bishop 1977], and have since been adapted for distributed garbage collection algorithms (e.g., [Shapiro et al. 1990; Maheshwari and Liskov 1994]) as well as in other work on persistent collection [Cook et al. 1994b; Yong et al. 1994]. In this section, we give a conceptual overview of the mechanisms required to extend the monolithic algorithm to allow independent garbage collection of disjoint partitions of the object space. Section 7.2.2 presents the actual implementation and describes in detail how the partitioned garbage collector works.

The object space is divided into disjoint partitions of objects (i.e., each object belongs to exactly one partition). A partition is the unit of object space that the garbage collector must traverse before it is allowed to reclaim any space. Small partitions allow the freeing of space to be done more incrementally, but can cause efficiency problems if they result in a large number of cross-partition object references (i.e., objects referenced by objects in a different partition).

There are two reasons why cross-partition references should be minimized. First, (as described below) cross-partition references incur additional management overhead during garbage collection and normal operation. Second, for simplicity, we have implemented a partitioned collection approach that maps easily to the monolithic scheme, but does not collect cycles of garbage that span partition boundaries (of course, cycles contained within a single partition are collected). Such a limitation will not be a problem in practice if partitions are chosen so that such cycles are rare. If the number of cross-partition cycles is kept small, then they can be collected periodically, using a separate algorithm such as the one proposed by Le Fessant [1998]. Thus, when choosing the partitioning scheme for an object space, the size of the partitions must be traded off against the number of cross-partition references that would arise. In addition, cross-partition references can often be further reduced by partitioning the object space along physical boundaries (e.g., file extents) or logical considerations (e.g., by class or relation).

Each partition of the object space has its own persistent roots, called local roots. To handle cross-partition references, each partition also has a list of incoming references that originate from other partitions. This list is called the Incoming-Reference List (IRL). Conceptually, for every reference to an object from outside the partition, the IRL contains the OID of the (local) destination object and the ID of the partition in which the (foreign) source object resides. The IRL of a partition serves as an additional root of persistence for the collector when that partition is collected. Figure 7 shows an example of three partitions containing objects with cross-partition references. Note that the objects themselves point directly to each other and do not involve the IRL.

[Figure 7 ILLUSTRATION OMITTED]

The IRL mechanism is transparent to programmers; therefore, any updates that may require IRL modifications must be trapped. This is handled in the same manner as the PRT and COT--the server examines incoming log records. When the creation of a cross-partition reference is detected, the server checks the IRL of the destination partition to determine if an entry originating from the source partition protecting the destination object already exists. If not, then such an IRL entry is created. In either case, an entry is made in the COT to protect the (existing or new) IRL entry. As usual, the COT entry is flagged for removal when the transaction that caused its creation terminates.

In addition to the IRL mechanism, a further change that is required is to augment the PRT entries with an additional field indicating the partition of the source object containing the pointer that was cut. This information is easily obtained from the log record(s) pertaining to the relevant update. Thus, for the partitioned algorithm, each PRT entry contains the ID of the transaction that cut the reference, the source partition ID, and the OID of the destination object of the cut reference.
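
Reusing the illustrative OID type from the earlier sketches, the partitioned bookkeeping might be shaped as follows; the IrlEntry layout in particular is an assumption about a structure the text describes only conceptually.

    #include <cstdint>

    struct PartitionId { uint32_t id; };

    // One IRL entry per (source partition, destination object) pair. Unlike
    // the PRT and COT, IRL entries live on database pages and their creation
    // and removal are logged, because they must survive crashes.
    struct IrlEntry {
        OID         destination;      // local object referenced from outside
        PartitionId sourcePartition;  // partition holding the foreign reference
        bool        marked = false;   // set when a marker reaches this entry
    };

    // PRT entry augmented for partitioned collection (extends the earlier sketch).
    struct PrtEntry {
        uint64_t    txnId;            // transaction that cut the reference
        PartitionId sourcePartition;  // partition of the object that held the pointer
        OID         destination;      // object the cut reference pointed to
        bool        flagged = false;
    };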

Garbage collecting a partition works as follows. The marking phase traverses the partition using the following as roots: (1) the local persistent roots; (2) the IRL of the partition; and (3) destination objects of the PRT entries whose source partition is the partition being collected. Objects in the current partition encountered during the marking phase are marked and traversed as in the nonpartitioned case. If the marker encounters a reference to an object in a different partition (either through traversing an object or directly because of a PRT entry), it searches for the corresponding entry in the remote IRL, marks the entry, and stops traversing that path.

At the end of the marking phase, any unmarked remote IRL entries whose origin is the currently collected partition can be removed (i.e., swept), provided that the entry is not protected by the COT. Note that the actual objects in the remote partition area are not collected at this point. They will be reclaimed if appropriate when garbage collection is run on that partition.

There are two drawbacks to this simple approach. First, as discussed above, it does not collect cycles of garbage that span partition boundaries. Second, in contrast to the PRT and the COT, IRL entries must be fault-tolerant. In the event of a crash, IRL entries that were created due to committed transactions may have to be restored during recovery in order to avoid erroneously collecting some remotely referenced objects. IRL entries must therefore be maintained in database pages (rather than in in-memory structures), and their creation must be logged. When an IRL entry is created, the server generates a log record reflecting that creation. This log record belongs to the set of log records associated with the transaction that created the cross-partition reference. If a crash occurs after the commit of that transaction, the log record will provide enough information to recreate the IRL entry if necessary (i.e., if the IRL page did not make it to disk before the crash). As with regular pages, a log record reflecting the sweeping of a page containing IRL entries must be created when space is freed on the page in order to avoid potential overflow problems during transaction REDO (see Section 6.3).

6.6 Summary of the Algorithm

We now briefly summarize the algorithm. During normal transaction processing, the log records received from the client(s) are processed for the purpose of garbage collection. Each time a cut reference is discovered in a before-image, that reference is stored in the PRT (Pruned Reference Table) along with the associated transaction identifier and the ID of the source partition. Each time the creation of an object is discovered in an after-image, the reference of that new object is stored in the COT (Created Object Table) along with the ID of the creating transaction. After-images in log records are also used to create IRL (Incoming-Reference List) entries if they contain remote references. Such new entries are protected by the COT, and the creation of each IRL entry is logged. When a transaction terminates (commits or aborts), the associated entries in the PRT and/or in the COT are flagged for later removal.

A garbage collection cycle starts by selecting a partition to collect and by removing all flagged entries from the PRT and COT.(3) The collector then begins its marking phase. It first marks every object reachable from the local persistent roots. It then marks objects reachable from the IRL of the current partition. Finally, it marks all objects reachable from the relevant PRT entries. During the marking phase, the collector may mark entries in the IRLs of other partitions, but it does not traverse any objects in those remote partitions.

Once the entire graph of the partition has been traversed, the sweeping phase linearly scans the partition. Before loading a given page, the sweeper checks to see if a write-lock exists on that page. If so, it skips the page and processes the next one. Otherwise it loads the page into memory and removes all unmarked objects in that page that are not referenced by the COT. When sweeping a page, the sweeper generates a logical log record that represents the new state of the page's slots. The sweeping phase also reclaims all unmarked IRL entries (but not the associated objects) in remote partitions that originate from the partition being collected and are not protected by the COT. As with sweeps of data pages, the sweeping of IRL pages is also logged. At the end of the sweeping phase, collection of the selected partition is complete.
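
Putting the pieces together, one collection cycle can be condensed into the following C++-style pseudocode. It reuses the illustrative types from the sketches above, and every helper function is a hypothetical stand-in for a mechanism described in the text.

    #include <cstdint>
    #include <vector>

    // Hypothetical helpers (stand-ins for mechanisms described in the text):
    std::vector<OID>        localRoots(PartitionId p);
    std::vector<IrlEntry>&  irlOf(PartitionId p);
    std::vector<uint32_t>   pagesOf(PartitionId p);
    void markFrom(PartitionId p, OID root);    // marks within p; on a cross-partition
                                               // reference, marks the remote IRL entry and stops
    void sweepPage(uint32_t pageId);           // frees unmarked, non-COT objects
    void logSlotMap(uint32_t pageId);          // P3: log the sweep before space is reused
    void sweepRemoteIrlEntries(PartitionId p); // drop unmarked entries originating from p
    template<class T> void removeFlagged(std::vector<T>& table);

    extern std::vector<PrtEntry> prt;
    extern LockManager lockMgr;

    void collectPartition(PartitionId p) {
        removeFlagged(prt);                           // entries of completed transactions
        // (flagged COT entries are removed the same way)

        // ---- Mark phase: three sets of roots ----
        for (OID root : localRoots(p))                markFrom(p, root);
        for (const IrlEntry& e : irlOf(p))            markFrom(p, e.destination);
        for (const PrtEntry& e : prt)
            if (e.sourcePartition.id == p.id)         markFrom(p, e.destination);

        // ---- Sweep phase: one page at a time ----
        for (uint32_t pg : pagesOf(p)) {
            if (!lockMgr.tryInstantReadLock(pg))      continue;  // writer active: skip
            sweepPage(pg);
            logSlotMap(pg);
        }
        sweepRemoteIrlEntries(p);                     // IRL entries only, not remote objects
    }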

7. IMPLEMENTING THE COLLECTOR

As stated in the introduction, the design goals for garbage collection are to impose minimal overhead on client transactions, to be efficient and effective in collecting garbage, and to be relatively straightforward to integrate in existing client-server database systems. In order to assess the algorithm in light of these requirements, we have implemented a complete version of it in the client-server version of the EXODUS storage manager [EXODUS 1993]. The implementation has been extensively tested, including its fault tolerance aspects. In this section, we describe the implementation and provide details of how the algorithm is incorporated into the system.

7.1 The EXODUS Storage Manager

Our implementation is based on the EXODUS storage manager v3.1 [EXODUS 1993]. EXODUS is a client-server, multiuser system which runs on many Unix platforms. It has been shown to have performance that is competitive with existing commercial OODBMS [Carey et al. 1993]. EXODUS supports the transactional management of untyped objects of variable length and has full support for indexing, concurrency control, recovery, multiple clients, and multiple servers. Data are locked using strict two-phase locking at the page or coarser granularity. Recovery is provided by an ARIES-based WAL protocol as described by Franklin et al. [1992].

The EXODUS server is multithreaded--every user request is assigned a thread when it arrives at the server; the server also has its own threads for log management, etc. The server supports asynchronous I/O (using multiple I/O processes), so some threads can run while other threads are waiting for I/O.

EXODUS is a page server. Updates are made by clients to their local copies of pages, and the resulting log records are grouped into pages and sent asynchronously to the server responsible for the updated pages. Dirty pages can be sent back to the server at any time and in any order during the execution of a transaction. A WAL protocol ensures that all necessary log records arrive at the server before the relevant data page. At commit time, copies of any remaining dirty pages are sent to the server. The client retains the contents of its cache across transaction boundaries, but no locks are held on those pages. Cache consistency is maintained using a check-on-access policy (based on "Caching 2PL" [Carey et al. 1991]). For recovery purposes all pages are tagged with a Log Sequence Number (LSN) which serves as a timestamp for the page. The server keeps a small list of the current LSNs for pages that have been recently requested by clients. When a client requests a lock from the server, the server checks this table, and if it cannot determine that the client has an up-to-date copy of the page, it sends a copy of the page along with the lock grant message.

As detailed by Carey et al. [1986], EXODUS extends a traditional slotted page structure to support objects of arbitrary length. "Small" data items (those that are smaller than a page) and the headers of larger ones are stored on slotted pages. Internally, objects are identified using physical OIDs that allow pages to be reorganized without changing the identifiers of objects.
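As an illustration of such physical OIDs, one plausible encoding is shown below; the field names and widths are our assumptions, not the actual EXODUS format.

    #include <cstdint>

    // A physical OID names a slot, not a byte offset; the slot holds the
    // object's current location within the page, so a page can be
    // reorganized without invalidating any identifiers.
    struct PhysicalOID {
        std::uint32_t volume;   // storage volume containing the page
        std::uint32_t page;     // page number within that volume
        std::uint16_t slot;     // index into the page's slot array
        std::uint16_t unique;   // assumed tag guarding against slot reuse
    };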

To summarize, the EXODUS storage manager supports the fundamental assumptions on which the garbage collector depends (see Section 5.1), and it has desirable properties such as slotted pages. In addition, the EXODUS server uses nonpreemptive threads and asynchronous I/O, which simplify the implementation of the garbage collector. The system, however, also provides features that present challenges for garbage collection, such as client caching, a STEAL policy between clients and servers, asynchronous interactions between clients and servers, a streamlined recovery system, and optimizations to avoid logging in certain cases.

7.2 Implementation Details

In order to add garbage collection to the EXODUS server, we created a new type of server thread called the gcThread. When a collection starts on a partition, a new gcThread is spawned. The gcThread initializes the garbage collector data structures, runs the marker phase, and then runs the sweeper phase. At the end of the sweeper phase, the gcThread terminates. The implementation required approximately 4000 lines of new or modified code on the server side, the bulk of which was for the gcThread itself. The client side required only 200-300 lines of new or modified code. The algorithm was implemented with only minor changes to the description in Section 6.

At present, the scheduling of the gcThread works as follows: when the gcThread gets the processor, it starts a timer (currently set at 50 msec.). If the timer expires during the marking phase, then the marker finishes examining the current object and gives up the processor. If it expires during the sweeping phase, then it finishes sweeping the current page and gives up the processor. The gcThread is awakened (after a specified sleep time) when there are no client requests waiting for the processor.
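This scheduling policy can be summarized with the following sketch, in which the 50 msec. timer budget and the sleep time correspond to the tunables described above; all function names are hypothetical placeholders.

    void gcThreadBody(PartitionId part) {
        initCollectorStructures(part);
        while (!markingDone(part)) {        // marking phase
            startTimer(50 /* msec */);
            while (!timerExpired() && !markingDone(part))
                markNextObject(part);       // always finishes the current object
            yieldProcessor(sleepTime);      // then gives up the processor
        }
        while (!sweepingDone(part)) {       // sweeping phase
            startTimer(50 /* msec */);
            while (!timerExpired() && !sweepingDone(part))
                sweepNextPage(part);        // always finishes the current page
            yieldProcessor(sleepTime);
        }
    }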

There are several limitations to the implementation. First, the collector collects only small-format objects, i.e., those that are smaller than 8KB. The extension to large-format objects is straightforward, but was not necessary for our purposes. Second, because the EXODUS storage manager does not know the types of the objects that it stores, we store a bitmap in the initial bytes of the data portion of each object. The bitmap indicates which OID-sized ranges of bytes in the object contain actual OIDs and is used by the marker during its traversal. These bitmaps are created automatically when objects are allocated using a C++ constructor. Finally, in the current implementation, operations that update multiple pointers are logged by the client using a separate log record for each pointer update. This limitation could easily be removed, if desired, by adding bitmap information to log records. This information would be used by the server to locate pointers in the log records.
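To illustrate how the marker might consume such a bitmap, consider the following self-contained sketch; the OID size and the convention that the bitmap occupies the first bytes of the object are assumptions for illustration, not the actual layout.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    constexpr std::size_t OID_SIZE = 8;   // assumed size of an OID, in bytes

    // Returns the byte offsets (within the object) of the OID-sized
    // ranges that the leading bitmap flags as containing actual OIDs.
    std::vector<std::size_t> oidOffsets(const std::uint8_t* obj,
                                        std::size_t objLen,
                                        std::size_t bitmapLen) {
        std::vector<std::size_t> offsets;
        std::size_t nRanges = (objLen - bitmapLen) / OID_SIZE;
        for (std::size_t i = 0; i < nRanges; ++i)
            if (obj[i / 8] & (1u << (i % 8)))  // bit i set: range i is an OID
                offsets.push_back(bitmapLen + i * OID_SIZE);
        return offsets;
    }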

We now highlight several important aspects of the implementation.

7.2.1 Building the PRT and the COT. During normal processing, the server scans incoming log records to determine if the logged updates require any entries to be made in the garbage collector data structures. In particular, the server makes an entry in the PRT for any MODIFY OBJECT log record that involves overwriting an OID. PRT entries are clustered with respect to the partition containing the source (i.e., modified) object, so it is easy to locate all the references that have been pruned in a given partition.

Likewise, the server makes an entry in the COT for any CREATE OBJECT log record. During bulk loading and other create-intensive operations, EXODUS uses an optimization that reduces logging. When a new page is allocated to hold new objects, the individual object creations are not logged; rather, the entire page is logged when it is copied back to the server. This optimization, while providing better performance for EXODUS, deprives the garbage collector of information on individual object creations. As a result, in the EXODUS implementation of the collector, we enter page IDs in the COT rather than individual OIDs. When the sweeper encounters a page that has an entry in the COT, it simply skips the entire page.
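The bookkeeping in this subsection amounts to a single dispatch over incoming log records; in the hedged sketch below, the record kinds and helper names are placeholders rather than the actual EXODUS log format.

    void processLogRecord(const LogRecord& rec) {
        switch (rec.kind()) {
        case LogKind::ModifyObject:
            if (rec.overwritesOid())           // a reference was pruned:
                prtInsert(rec.partition(), rec.oldOid());  // clustered by
            break;                             // source partition
        case LogKind::CreateObject:
            cotInsertOid(rec.newOid());        // protect the new object
            break;
        case LogKind::NewPage:                 // bulk-load optimization:
            cotInsertPage(rec.pageId());       // creations not individually
            break;                             // logged, so protect the page
        default:
            break;
        }
    }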

7.2.2 Maintaining the IRLs. The Incoming-Reference List of a partition is implemented as an indexed set of pages. All of the entries in a single IRL page correspond to cross-partition references originating from the same source partition, and each IRL page stores the identifier of that partition. If an IRL contains more than one page of entries for a given source partition, these pages are linked together. The first page of each IRL is an index that indicates the starting and ending pages of the linked list for each source partition that has entries in the IRL.
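One possible realization of this layout is sketched below; all type names, field names, and capacities are illustrative assumptions.

    struct IrlIndexEntry {        // one per source partition with entries
        PartitionId source;       // partition the references originate from
        PageId      first;        // head of that partition's page chain
        PageId      last;         // tail, where new entries are appended
    };

    struct IrlIndexPage {         // the first page of each IRL
        std::uint16_t nSources;
        IrlIndexEntry entries[MAX_SOURCES];
    };

    struct IrlPage {              // the remaining pages hold the entries
        PartitionId   source;     // every entry shares this source partition
        PageId        next;       // link to the next page of the chain
        std::uint16_t nEntries;
        OID           entries[MAX_ENTRIES];  // referenced local objects
    };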

The server makes an entry in an IRL for any CREATE OBJECT or MODIFY OBJECT log record whose after-image contains a cross-partition reference. Furthermore, when a newly allocated page arrives at the server (recall that object creations on such pages are not individually logged), the server scans all of the objects on the page to determine if any new IRL entries are required.

When a new cross-partition reference is encountered, the server checks the IRL of the destination partition to see if an entry from the source partition to the destination object already exists. To perform this check, the server scans the appropriate linked list in the IRL to find an entry referencing the destination object. If no such entry exists, then the server creates the entry in the last page of that group (allocating a new page if necessary). This creation is logged (see Section 7.2.5), and the log record is associated with the transaction that created the new cross-partition reference. An entry is made in the COT to protect this new IRL entry. If an appropriate IRL entry already exists there is no need to create a new one, but an entry still must be made in the COT to protect the IRL entry from being erroneously collected. As usual, the COT entries are flagged for removal when the transaction that caused them to be created terminates. Of course, more sophisticated management of free space in the IRL is possible, and the trade-offs are fairly straightforward.
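This check-then-create logic is summarized below; helper names are assumed, and the COT bookkeeping follows the flagging convention described above.

    void noteCrossPartitionRef(PartitionId src, OID dest, Xid xid) {
        Irl& irl = irlOf(partitionOf(dest));
        IrlEntryId e = irl.findInChain(src, dest);  // scan the linked list
        if (e == NO_ENTRY) {
            e = irl.appendEntry(src, dest);         // may allocate a page
            logIrlCreation(xid, e);                 // see Section 7.2.5
        }
        cotProtect(e, xid);   // protect the (new or existing) IRL entry;
                              // flagged for removal when xid terminates
    }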

7.2.3 Implementing the Marking Phase. When the collection of a partition is initiated, a "color map" is allocated in volatile memory for each page of the partition to be collected. The color map contains one bit for every object of that partition. Color maps are also allocated in volatile memory for remote IRL pages that contain entries from the partition being collected (recall that such pages are chained together in each IRL). All bits in the color maps are initially cleared. When an unmarked object or IRL entry is encountered during the traversal of the partition (as explained in Section 6.5), the corresponding bit is set, indicating that the object is "marked." It is important to note that colors are maintained only during garbage collection and are not part of the persistent object space. No logging or recovery is performed for color maps.
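A color map of this kind is small and purely volatile; a self-contained sketch follows (the one-bit-per-slot sizing matches the description above, while everything else is an assumption).

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // One bit per object slot; all bits start clear, and no logging or
    // recovery is ever performed for this structure.
    class ColorMap {
    public:
        explicit ColorMap(std::size_t nSlots) : bits_((nSlots + 7) / 8, 0) {}
        void mark(std::size_t slot) { bits_[slot / 8] |= (1u << (slot % 8)); }
        bool isMarked(std::size_t slot) const {
            return (bits_[slot / 8] >> (slot % 8)) & 1u;
        }
    private:
        std::vector<std::uint8_t> bits_;
    };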

Unless care is taken, a potential termination problem could arise during the marking of a partition. Recall that the PRT is used as a root of persistence. If user transactions keep pruning references while the collector is running, then it is conceivable that the collector would have to keep traversing protected objects as new PRT entries are added. This potential problem is avoided by scanning the PRT entries only after completing the scans from the local roots and the IRL. As such, at the point when the collector initiates its scan from the very first PRT slot, (1) any reference that has been cut before that time is already in the PRT and (2) any reference (from another object or an IRL entry) that could be cut after that time has already been traversed by the collector. Thus, in the implementation, any entry added to the PRT after the collector has started using the PRT is not used until the next collection cycle.
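In code, the rule amounts to snapshotting the PRT bound when its scan begins; the sketch below uses assumed names.

    void scanPrtRoots(PartitionId part) {
        std::size_t bound = prtSize(part);  // taken after the local-root and
        for (std::size_t i = 0; i < bound; ++i)  // IRL scans have completed
            markFrom(prtEntry(part, i));
        // Entries appended at index >= bound are ignored until the next
        // collection cycle, guaranteeing that marking terminates.
    }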

7.2.4 Implementing the Sweeping Phase. After the marking phase is completed, the sweeping phase scans the object space of the partition being collected and removes unmarked objects that are not protected by the COT. Any unmarked and unprotected remote IRL entries whose source is this partition are also reclaimed. IRL pages that are full of garbage entries are deallocated. The use of in-memory color maps enhances the performance of the sweeping phase by allowing unnecessary I/O to be avoided. If a color map indicates that all objects on a page are live then the sweeper simply skips the page, without loading or accessing it. Likewise, if a color map indicates that all objects on a page are garbage, then the page is deallocated from the database, again without having to load the page.
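The two I/O-avoiding fast paths can be made explicit as follows; all names here (including the page-level color-map queries) are assumed for illustration.

    void sweepOnePage(PageId pid, const ColorMap& cm) {
        if (cm.allMarked(pid))
            return;                        // every object live: skip, no I/O
        if (cm.noneMarked(pid) && !cotProtects(pid)) {
            deallocatePage(pid);           // every object garbage: free the
            return;                        // page without reading it
        }
        sweepSlots(fixPage(pid));          // mixed page: fetch and sweep
    }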

A final important detail of the sweeping phase is the way it exploits the EXODUS cache consistency mechanism to reduce the potential for clients to overwrite the effect of sweeping (see Section 6.4). The sweeper updates the sequence number (LSN) on each page that it modifies. Any transaction that subsequently tries to lock such a page will then be sent the swept copy even if it already has a cached copy of the page (because such a cached copy would appear to be out-of-date).

7.2.5 Logging for Fault-Tolerance. EXODUS has no general support for allowing nonrecovery-related server threads (in this case the gcThread) to create log records. Although there are workarounds, we chose to avoid logging the new state of the page slots (after sweeping) in the gcThread. We accomplish this by setting a flag in a page's header when the page is swept; the flag is cleared when the page is read from disk. If a client obtains a page that has the flag set, then it logs the slot allocation information (from the page header) and clears the flag, prior to performing any updates on the page. In this way, log records are generated only for those swept pages that had not been written to disk prior to being obtained by a client. Note that the log record is generated only by the first transaction to update such a swept page.
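On the client side, this protocol amounts to a small check before the first update to a page; the sketch below uses assumed names.

    void beforeUpdate(Page* page) {
        if (page->header.swept) {                // set by the sweeper; also
            logSlotAllocation(page->id(),        // cleared whenever the page
                              page->header.slotInfo);  // is read from disk
            page->header.swept = false;          // only the first updating
        }                                        // transaction logs this
    }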

Logging the creation of IRL entries does not require such a trick. The reason is that IRL entries are created as side effects of updates made by transactions and can therefore be logged by the server threads processing these updates. The server, however, must take care to preserve the ordering and the chaining of transactions' log records when inserting IRL-related records in the log. Since preserving this ordering is quite difficult, we adopted a simpler solution: we delay the logging of IRL updates until the commit of the corresponding transaction and append the IRL log records to the log at commit time, rather than inserting them into the flow of user-related log records. This delayed logging requires that the server maintain (in a volatile table) IRL log information for each transaction until the transaction terminates. This information can be removed once the logging is performed.
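A minimal sketch of this delayed logging follows, assuming a volatile per-transaction table keyed by transaction identifier; the types Xid and IrlLogRecord and the helper appendToLog are placeholders.

    #include <map>
    #include <vector>

    std::map<Xid, std::vector<IrlLogRecord>> pendingIrlLogs;  // volatile

    void bufferIrlLogRecord(Xid xid, IrlLogRecord rec) {
        pendingIrlLogs[xid].push_back(std::move(rec));
    }

    void onCommit(Xid xid) {
        for (const IrlLogRecord& rec : pendingIrlLogs[xid])
            appendToLog(rec);        // appended at commit, never interleaved
        pendingIrlLogs.erase(xid);   // the information can now be discarded
    }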

8. PERFORMANCE MEASUREMENTS

In this section we describe a study of three different performance aspects of the implementation: the overhead added to normal client processing, the standalone performance of the collector, and the performance of the collector and client transactions when running concurrently.

For all of the experiments presented here, the EXODUS server was run on a SPARCstation LX with 32MB of main memory. The log and the database were stored on separate disks, and raw partitions were used in order to avoid operating system buffering. The size for both data and log pages was set to 8KB. All times were obtained using gettimeofday() and getrusage().

The experiments were run on one or more synthetic database partitions consisting of simple linked lists of objects. We used an EXODUS "volume" as a partition. Objects in the database were always allocated in contiguous pages in an EXODUS file, each page being fully packed with objects. Each object is 80 bytes long, and along with EXODUS page and object headers, 84 objects can fit on a page. The databases used for our experiments are synthetic and do not correspond to a real application. These synthetic database partitions, however, enable us to easily vary important parameters and to clearly evaluate their impact. In our experiments, we varied two major parameters: (1) the percentage of garbage and (2) the clustering of objects. The percentage of garbage determines the amount of work done by the collector. Increasing the garbage percentage reduces the cost of the marking phase, since the marker traverses fewer objects, and increases the cost of the sweeping phase, since more objects must be reclaimed. Varying the clustering impacts the collector's locality and thereby its I/O load. We define clustering as the number of live objects on a page that the marker can scan before crossing a page boundary.

Figure 8 shows an example of four single-partition object spaces, each having different settings for the percentage of garbage and/or clustering. The upper part of Figure 8 shows two different partitions with perfect clustering, i.e., all the live objects of a page can be traversed before crossing a page boundary. With regard to the percentage of garbage, Figure 8(a) shows a partition in which there are no garbage objects. In contrast, Figure 8(b) shows a partition in which each page contains 50% garbage objects. An increasing amount of garbage per page (below 100%, however) does not decrease the number of pages read during a collection cycle, since there is at least one object in each page that is not garbage. The lower part of Figure 8 shows two different clustering configurations in the case where there are no garbage objects. Figure 8(c) illustrates the medium clustering (1/2) case in which half of the live objects in a page can be traversed before a page boundary is crossed. In contrast, Figure 8(d) depicts the worst possible clustering case--a page boundary will be traversed by the marker for each visited object. For simplicity, we have chosen to depict different clusterings with no garbage, but any combination of clustering and percentage of garbage is possible.

[Figure 8 ILLUSTRATION OMITTED]

8.1 Bookkeeping Overhead

The first experiment measures the overhead that is incurred during normal transaction operation with no garbage collection running. The overhead in this case is due to the extra work required to maintain the garbage collector data structures (e.g., the PRT and COT). This includes the extra log-related work that clients must perform and the processing of log records at the server. In this experiment, a single client process was run on a SPARCstation IPC with 32MB of memory; it was connected to the server over a 10 Mbit/sec. Ethernet.

Four different operations were tested: (1) object allocation, (2) modification of references in existing objects, (3) modification of nonreference data in existing objects, and (4) read-only access to objects. For each test, the operation was performed on every object before committing. In the tests shown here, a data partition contained 100,000 objects (1190 pages) and was fully clustered. Client and server cache sizes were 1500 pages--more than enough to hold all of the accessed pages, so all clusterings would have similar performance here. Each benchmark was run 10 times, and the results were averaged. For each experiment, we report times for both a cold and a hot server cache (except for allocate, which creates all of the pages it accesses). We first measured the bookkeeping overhead using an object space containing a single partition. We then repeated the experiments for an object space made of multiple partitions in order to investigate the overhead imposed by the management of IRLs.

8.1.1 Single-Partition Bookkeeping Overhead. The results for the first experiment in which the object space consists of a single partition are shown in Table I. For each test, times are presented for both the unmodified and modified (with GC code) EXODUS systems. In addition, results are shown for cases with and without committing the transactions. With garbage collection, transaction commit incurs the extra overhead of flagging PRT and/or COT entries.
Table I. Client Performance (msec), Single Partition

                         Cold Server Cache

Operation        Original    w/GC     Overhead
Allocate          38176      40382     5.80%
w/Commit          54989      59049     7.40%
Update Ref        52543      53165     1.10%
w/Commit          62736      63381     1.00%
Update Value      51091      51616     1.00%
w/Commit          61365      61998     1.00%
Read-only         27586      27792     0.70%
w/Commit          27622      27828     0.70%

                          Hot Server Cache

Operation         Original    w/GC    Overhead
Allocate              --        --        --
w/Commit              --        --        --
Update Ref         34367     34920     1.60%
w/Commit           44607     45160     1.20%
Update Value       33104     33438     1.00%
w/Commit           43327     43671     0.70%
Read-only          13403     13565     1.20%
w/Commit           13439     13601     1.20%


As can be seen in the table, the overhead imposed on normal operations by the garbage collection code is quite small in all of the cases tested. The highest overheads were seen for the allocation of new objects; this is due to the full-page logging for newly allocated pages in EXODUS, which causes the server to scan new pages in order to locate any cross-partition pointers. Individual object creations on already existing pages would incur a much smaller cost.

8.1.2 Multiple-Partition Bookkeeping Overhead. We also measured the impact of the management of the IRL on client operations using the same benchmark, but with remote references instead of local ones. Table II shows the overhead of managing the IRLs when we allocate objects containing a remote reference and when we update remote references. This table shows the time required to perform the measured operations on 100,000 objects for cases with and without IRL management. The case where the IRLs are not involved corresponds exactly to the one detailed in the previous experiment, i.e., where all references point to objects belonging to a single partition and no references cross partition boundaries. When IRLs are involved, there are two partitions, and all references stored in the objects of one partition point to objects stored in the other partition. The overhead of IRL bookkeeping relative to the unmodified version of EXODUS can be obtained by adding the overheads presented in Tables I and II.
Table II. Additional Overhead Due to Remote References (msec.)

                        Cold Server Cache

Operation         No IRL      w/IRL     Overhead
Allocate           40382      41271      2.20%
w/Commit           59049      62361      5.60%
Update Ref         53165      54303      2.10%
w/Commit           63381      66649      5.10%

                        Hot Server Cache

Operation         No IRL      w/IRL     Overhead
Allocate              --         --        --
w/Commit              --         --        --
Update Ref         34920      36329      4.00%
w/Commit           45160      49090      8.70%


With remote references, the server has to create IRL entries in addition to processing log records. Creating all of the IRL entries adds an overhead of 2.2% on top of the log-record processing measured in the previous experiment; the total allocation overhead is thus 5.8 + 2.2 = 8.0% in comparison to the unmodified version of EXODUS. At commit time, the IRL log records are built and appended to the log, increasing the overhead (5.6%) accordingly. Similar behavior is observed for the reference updates. The results for updating values and read-only operations are not shown in Table II because IRLs are not involved in these operations.

8.2 Off-Line Garbage Collector Performance

The second experiment examines the cost of the garbage collector when it is run in isolation (i.e., without any concurrent user transactions). All experiments presented in this section were run with a server cache of 1000 pages. The size of a partition was varied from 500 pages (4MB) to 10,000 pages (80MB). To investigate the cost of performing off-line garbage collection, we varied the clustering, size, and garbage percentage of a partition, as well as the number of partitions involved in the collection. In the following sections, we first present the results of these investigations in the case where the collector runs against an object space consisting of a single partition, and then turn to the case in which multiple partitions are used.

8.2.1 Off-Line Single-Partition Experiments. Figure 9 shows the performance of the marking phase of the collector with a single, fully clustered partition for various percentages of garbage. In this case the marker performance scales linearly with the partition size for all % garbage values except for 100%, which remains near 0 because in that case there is nothing for the marker to do. Response time improves as the garbage % is increased because the marker traverses only live objects, and there are fewer of these.

[Figure 9 ILLUSTRATION OMITTED]

The performance of the sweeping phase for the corresponding cases is shown in Figure 10. In this case all % garbage values have similar sweeper performance except for 0% and 100%. If the color map shows that a page has 0% garbage, then the sweeper does not bother to fetch the page from disk. Likewise, if a page contains only garbage, the sweeper can free the page without fetching it from disk. The difference in performance here comes from the overhead for freeing a page in EXODUS.

[Figure 10 ILLUSTRATION OMITTED]

In all other cases, the sweeper performance is virtually independent of the percentage of garbage in a page, as it must fetch each page that needs to be swept. When the partition is smaller than 1000 pages, all of the pages that the sweeper would need to fetch are already in the buffer because of the marker. Once the partition exceeds the cache size, however, this prefetching effect is completely lost: the layout of the objects causes the marker and sweeper to scan the partition in the same order, so the pages read first by the marker have been evicted by the time the sweeper needs them.

Figure 11 shows how the performance of the marking phase varies for different clustering factors. When the partition is small enough to fit in memory (1000 pages, here) then marking performance is not affected by clustering. However, once the partition exceeds the size of the buffer, then the marker begins to incur I/O due to interpage references. As can be seen in the figure, full clustering is the best case for the marking phase, as it allows marking to process pages sequentially, minimizing I/O. As the clustering factor is reduced, the I/O requirements of the marker increase. Since the sweeper processes pages linearly, it is not affected by the clustering factor.

[Figure 11 ILLUSTRATION OMITTED]

8.2.2 Off-Line Multipartition Experiments. In this section we investigate the performance of the collector when multiple partitions containing cross-partition references are used. We measure three basic overhead factors: (1) the cost of using a partition's IRL as a root of garbage collection; (2) the cost of marking remote IRLs; and (3) the costs of reclaiming IRL entries. We evaluate these three factors using a single partition (called partition P) made of 10,000 pages (i.e., 840,000 objects).

To determine the time required for using IRLs as part of the traversal root, we created IRL entries for partition P by creating a second partition of 840,000 objects, each pointing to a unique object in P (i.e., a bijective mapping), resulting in 1400 IRL pages containing 840,000 entries for partition P. Using these IRL entries during the marking phase adds an overhead of 11.4% in comparison to the case where there are no incoming cross-partition references. Basically, the additional cost is the reading of the IRL pages from disk. It is important to note that during marking the IRL of the partition being collected can be processed linearly, so only minimal additional buffer space is required.

We next evaluated the time required to mark the entries of a remote IRL. Recall that marking remote IRL entries is performed as a side effect of a local marking phase. As in the previous experiment, we used partition P to evaluate the overhead of marking remote IRL entries. We first created a new partition of 10,000 pages and then referenced each object in this new partition from an object in P (again, using a bijective mapping), resulting in the creation of 840,000 IRL entries for the new partition. When the marker is run on partition P, it marks these remote IRL entries, but the marking is performed on the remote IRL color maps, not on the IRL entries themselves. These color maps are quite small (a single bit for each remote IRL entry) and are created in memory, so there is no I/O involved. The measured overhead for marking remote IRL entries in this case was under 1%.

Finally, we evaluated the time required to reclaim useless IRL entries. For this case, we first cut most existing interpartition references.(4) We then performed a collection of partition P. The remote IRL entries for the cut cross-partition references remain unmarked after the marking traversal of P and, therefore, are reclaimed when P is swept. The remote IRL contains 1400 pages, so sweeping these pages has the same cost as sweeping 1400 data pages. In the case where the 1400 IRL pages are present in the cache at the server, it takes 791 msec. to sweep them. In the case where they are not cached, the sweeping cost jumps to 11,127 msec.

In summary, these experiments show that the costs related to IRL entries are in line with what would be expected. Using and collecting them requires additional I/O, while marking them is cheap because color maps are maintained in memory. It is important to note, however, that while these experiments focused on the costs of IRLs, in many cases the IRL mechanism can actually result in improved performance compared to a monolithic collector. Since IRL entries simply represent references from other objects, any cases where I/O is required for IRLs would involve object traversals in the absence of IRLs; these object traversals would incur I/O. The compact representation of IRLs allows them to be scanned linearly when being used as persistent roots (thereby limiting buffer requirements), and since IRL entries are smaller than objects and clustered by source partition ID, they are likely to require much less I/O for marking than following pointers to the remote objects themselves. In the likely case where memory is large enough to hold a partition and the necessary IRL entries but is not large enough to hold the entire object space, performing collection one partition at a time is likely to require substantially less I/O than collecting the entire object space as a monolithic entity.

8.3 On-Line Performance

The third set of experiments examines the performance of the system when client transactions and the collector execute concurrently. Although garbage collection can improve performance in the long run by reducing the amount of wasted space in the database, performance can suffer while the collector is running, due to synchronization with transactions and the load the collector places on the server. Given that our collector performs no synchronization with client transactions and (as shown in Section 8.1) imposes only small bookkeeping costs on clients, the main impact that it will likely have on client responsiveness is the load placed on the server when the collector is running. In this section, we first investigate the issues related to the scheduling of the collector in order to balance its execution with the execution of the transactions. We then explore the interactions between the collector and the transactions when they execute concurrently.

8.3.1 Scheduling the Garbage Collector. Because the collector is fully incremental, the negative impact on client performance can be traded off against the execution time of the collector by varying the aggressiveness with which the collector is scheduled at the server. Favoring transactions at the server will reduce the slowdown experienced by clients when the collector is running, but this slowdown will be incurred over a longer time. Furthermore, slowing down the collector can hurt performance by impacting its ability to quickly free up wasted space.

In order to find a balance between transaction and garbage collector execution, we experimented with the scheduling of the gcThread at the server. As described in the previous section, garbage collection is fairly expensive in an I/O-bound setting. As a result, we chose to investigate scheduling issues in the same environment, where the negative impact on client performance is expected to be maximal. In this experiment, the transaction workload was generated by five client processes, each running on a separate workstation and manipulating a different partition. Each client repeatedly ran transactions consisting of a read-only, full traversal of its own partition; the garbage collector was continuously run on a sixth partition.

We used partitions consisting of 1200 pages (10MB) each, with full clustering and 5% garbage.(5) None of these partitions contained any cross-partition references. The client caches were 200 pages each, and the server buffer was 1200 pages. Given the sequential nature of the transactions and the garbage collector in this workload, however, the hit rate at both client and server buffers was effectively zero.

Section 7.2 described the integration of the gcThread with the EXODUS scheduler. In order to vary the scheduling of the collector, we adjusted the value of the sleep time (i.e., minimum delay) that the scheduler imposes on the garbage collector before allocating it a new processor time slice. This delay is imposed on the collector whenever there are client requests waiting for service. Multiple client requests are allowed to run before the gcThread is given the processor.

In this experiment we varied the collector sleep time from 0 msec. to 1 second. Figure 12 shows the results for the range from 0 msec. to 100 msec. The lowest line shows the performance of the collector running alone, with a 200-page server buffer (i.e., the same amount as is allocated when running concurrently with the transactions). The other flat line shows the average response time for a client when the five clients are run without the garbage collector. The other two lines show the performance of the transactions and the collector when they are executed concurrently. As would be expected, when the collector sleep time is small, its response time is low, and the transactions suffer. As the sleep time is increased, this relationship reverses.

[Figure 12 ILLUSTRATION OMITTED]

What is striking in this figure is that at a sleep time of about 20 msec., the transactions and the collector (which have similar I/O patterns in this experiment) are roughly balanced. Beyond this point, however, the transactions quickly approach their minimal response times, while the collector jumps to a response time of between 600 and 760 seconds.(6) Despite additional investigation, we are unable to fully explain all the interactions between the collector and the EXODUS thread scheduler that cause this sudden jump in the collector response time. This experiment (and others), however, showed that a sleep time of 20 msec. provides a reasonable balance between collector and transaction scheduling, so we use that value in the following experiments.

8.3.2 On-Line Experiments. We also ran experiments to explore the interaction of collector and client transaction performance in the presence of an increasing client workload in an I/O-bound scenario. Here, we varied the number of clients from 1 to 5. We used the same partition sizes and cache sizes as in the previous experiment, so the caches are basically ineffective here as well.

Figure 13 shows the performance of an increasing number of clients each operating on their own partitions, with and without concurrent garbage collection. The response times shown are the average response times of the clients over 10 trials of the experiment. The line labeled "R only" shows the response time of the clients reading the objects of their 1200-page partition while the GC is not active. The curves labeled "R + GC, 0%" and "R + GC, 5%" show the performance of read-only clients when the collector is run concurrently on a separate partition containing 0% or 5% garbage, respectively. When the collector runs concurrently with 5 clients, the time required for each to read the objects of their partition is increased by 32% in the 0% garbage case (167 seconds, total time) and by 48% in the 5% garbage case (178 seconds, total time). The extra overhead in the 5% garbage case is caused by the additional I/O of the sweeping phase of the collector. Figure 13 also shows the performance of the clients when they update all of the objects in their partition and commit. The curve labeled "W only" shows the client performance when the collector is not active. When the collector is active ("W + GC, 0% and 5%"), the clients experience an overhead of 16%. In this case, the times for collecting the 0% and 5% partitions are roughly the same.

[Figure 13 ILLUSTRATION OMITTED]

Figure 14 shows the time the garbage collector requires to collect its partition in the same experiments shown in Figure 13. The two flat lines at the bottom of the figure show the performance of the collector when it runs alone on a partition containing 0% or 5% garbage. The four other curves show the performance of the collector when read-only clients ("GC + R, 0%" and "GC + R, 5%") and updating clients ("GC + W, 0%" and "GC + W, 5%") are run concurrently. In these cases the collector is slowed by competition for the server's disk. With read-only clients, the collector completes the collection in around 75 seconds for 0% garbage and around 150 seconds for 5% garbage. When concurrent writers execute, the collector needs 183 seconds with 0% garbage and 336 seconds with 5% garbage.

[Figure 14 ILLUSTRATION OMITTED]

We also investigated the interaction of the collector and the transactions in a more CPU-intensive context. In this case, we ran the clients and the collector concurrently on a single partition that was resident in the server memory. Table III shows the results for the CPU-bound (single partition) and I/O-bound (multiple partition) cases when 5 clients are run concurrently. Results are shown for both full (100%) and medium (50%) clustering.
Table III. Client and Collector Performance with 5 Clients (in seconds)

                          Single Partition (CPU-bound)

Experiment-5 Clients         Alone      w/GC    Overhead   Collector
0% Garb., Full Clust.        48.19     52.59      9.13%       29.27
5% Garb., Full Clust.        47.79     52.90     10.69%       30.99
0% Garb., 1/2 Clust.         99.11    105.65      6.59%       34.00
5% Garb., 1/2 Clust.         97.23    104.34      7.31%       30.90

                        Multiple Partitions (I/O-bound)

Experiment-5 Clients         Alone      w/GC    Overhead   Collector
0% Garb., Full Clust.       126.96    167.74     32.12%       75.48
5% Garb., Full Clust.       120.77    178.09     47.46%      152.58
0% Garb., 1/2 Clust.        255.29    286.28     12.13%      316.82
5% Garb., 1/2 Clust.        250.43    299.26     19.49%      461.54


8.3.3 Summary of the On-Line Results. The tests performed in this section represent a very conservative study of the garbage collector for several reasons. First, the garbage collector is run continuously--after completing the collection of a partition, it immediately begins a new collection. In practice, the garbage collector would be run only periodically, and, as discussed in Section 8.1, only a small overhead is imposed on client transactions when the collector is not running.

Second, in these tests, the client transactions read the same number of pages regardless of the amount of garbage in the pages; thus, only the performance costs, but not the benefits of garbage collection (i.e., reduced I/O), are shown here. Finally, the tests present a constant, heavy load to the server, which is an undesirable (but sometimes unavoidable) condition under which to run garbage collection. For example, in the multiple-partition case, the server is completely I/O-bound.

For these reasons, we believe that the overhead imposed upon clients while the collector is running is reasonable. It is important to stress that, unlike garbage collectors that require synchronization with client transactions, in this collector the bulk of the overhead is due to the I/O requirements of detecting and reclaiming garbage objects. Such overhead would be incurred by any tracing-based garbage collector, and because our collector is incremental, its impact can be spread over time by adjusting how aggressively the collector is scheduled.

9. RELATED WORK

Having described and analyzed the performance of our solution in detail, we now survey the related work in garbage collection for persistent environments. We first cover solutions designed for centralized systems and then turn to the work done for distributed systems.

9.1 Centralized Systems

Early work on garbage collection for persistent systems focused on algorithms for centralized repositories. Butler [1987] simulated the behavior of several kinds of collectors running against a centralized OODBMS. This work clearly identified the need for incremental approaches to garbage collection in order to preserve the responsiveness of user transactions. This study, however, did not address issues such as concurrency, fault tolerance, and replication that are crucial to the development of a practical garbage collector for persistent data.

9.1.1 Argus. The first paper to specifically address the fault tolerance issues with persistent garbage collection was by Kolodner et al. [1989]. This paper describes a copying-based collector that was integrated with the Argus language. Although Argus supports distributed transactions, the algorithm is specified only in a centralized context. The algorithm as originally described uses a stop-the-world approach; it was later extended to be incremental [Kolodner and Weihl 1993].

The incremental collector uses a destructive approach to copying live objects between the from-space and the to-space in which each from-space object is overwritten with a forwarding pointer when it is copied to the to-space. This type of copying raises two potential problems that can impede crash recovery: the loss of forwarding pointers and the loss of copied objects.

The collector is made fault-tolerant by writing a GC-specific log record each time an object is copied from from-space to to-space. Such log records contain the original address of the object in from-space, the new address of the object in to-space, and the before-image of the data that were overwritten by the forwarding pointer. Using these log records, the work done by the collector before the crash as well as the updates performed by user transactions can be redone by the system. Upon completion of recovery, both the collector and the user transactions can resume.

The need to log all copy operations raises several problems for a persistent system. First, logging every object copy is expensive and greatly increases the size of the log. Second, since the collector and the transactions share the same log, the unpredictable interleaving of those records in the log necessitates great care in the ordering of recovery actions. Such strong coordination between the collector and the recovery system is undesirable, as it requires deep changes in the algorithms that handle failures. We have avoided the need for such changes in our algorithm, as recovery is one of the most complex and idiosyncratic components of a persistent system.

9.1.2 Detlefs' Concurrent Atomic Garbage Collector. In his thesis, Detlefs [1991] described a similar approach for concurrent atomic garbage collection. His algorithm is based on a destructive copy collector and uses the page-protection scheme of Appel et al. [1988] to address the problems of concurrency between users and the collector. Instead of logging every copy operation, Detlefs' scheme uses page pinning and synchronous disk writes to ensure correctness in the presence of crashes.

The algorithm begins by copying the root objects into the to-space. It then initiates a page-by-page scan of the to-space, copying the reachable from-space objects it encounters, appending them to the to-space, and overwriting the copied from-space objects with forwarding pointers. During this process, the algorithm pins all the affected pages in memory, namely, the to-space page being filled and all the from-space pages containing the objects that have been copied to that to-space page. When the current to-space page is filled, a log record summarizing the effects of the collector for the objects on the page is written in the log. The algorithm then synchronously writes the to-space page and all the updated from-space pages to disk. After the writes complete, those pages are unpinned, and the collector continues.

The order in which the pages are written, together with the log records summarizing the collector's actions, provides enough information to enable the recovery process to preserve the consistency of the object space after a crash. Compared to the solution by Kolodner and Weihl [1993], this algorithm is likely to impose longer pauses in the execution of user programs, as each pause requires multiple synchronous random writes to the disk. Kolodner's algorithm (as well as ours) is less disruptive, does not require synchronous writes, and is better integrated with the recovery system.

9.1.3 Nondestructive Copying Collector. In contrast to these two destructive algorithms, a nondestructive copy-based garbage collector is described by O'Toole et al. [1993]. In this approach, copy operations are made nondestructive by maintaining forwarding pointers in an in-memory table instead of overwriting from-space objects. Since copying is nondestructive, user transactions can continue to operate freely on from-space objects while the collector concurrently replicates live objects into to-space, leaving the original objects untouched. Because the transactions and the collector execute concurrently, however, the replicas (to-space objects) can become inconsistent if a transaction updates an object in from-space.

The consistency of objects is eventually enforced using the REDO information stored in the transaction log. The collector replays the REDO information associated with transactions on the to-space objects before exchanging the roles of the from-space and to-space. When the to-space contains an exact replica of the entire from-space object graph (in terms of values, as the addresses of objects are different), the collector atomically exchanges (flips) the roles of to-space and from-space.

To cope with failures, the flip is performed only after the to-space has been fully written to stable storage. The collector may temporarily stop user transactions to perform the remaining collection work (finishing replication, writing to-space to disk, etc.). The collector is fault-tolerant because, in the event of a crash, user transactions can always be resumed using the current from-space. The current to-space, which is either empty or a partial replica of the from-space, can always be ignored.

Additional work must be done by the system to ensure the consistency of data when recovering from a crash. Initially, the logged identifiers of all updated objects refer to objects stored in the from-space. Once the spaces are flipped, these identifiers become invalid, since the from-space is discarded. As such, at the time of a flip it is necessary to update all of the log records associated with ongoing transactions so that their object identifiers reference the objects in the to-space.

The algorithm has been implemented in Standard ML of New Jersey. The performance measurements reported by O'Toole et al. essentially evaluate the gain from using this approach instead of a stop-and-copy collector. Not surprisingly, the incremental solution performs better. Still, the approach suffers from the overheads of forcing the to-space to disk and retraversing the log to modify object identifiers, which can significantly slow a persistent system.

9.1.4 GC-Consistent Cuts. Skubiszewski and Valduriez [1997] detail a concurrent Mark and Sweep collection scheme that uses a new synchronization mechanism called GC-consistent cuts. Their argument is that traditional barrier-like synchronization mechanisms degrade the overall performance of the system, since reference modifications must be trapped and then made available to the collector, even when the collector is not actively running. Instead, they avoid the potential for interference between the collector and user transactions by having the collector analyze copies of database pages, while the transactions execute on the original pages themselves. The copies of pages are made at specific times, and the set of copies of all pages is called a GC-consistent cut. Such a cut is built in a way that ensures that all objects that are garbage in the cut are also garbage in the real database. A cut is used by the Mark and Sweep collection scheme as follows: the Marking Phase detects garbage objects in the cut; and the Sweeping Phase sequentially scans the real database pages and deletes the objects that have been found to be garbage in the cut.

Performance measurements of this collection scheme are still in progress. However, in the worst case, one potential problem is that the size of a cut can be identical to the size of the whole database. Moreover, it is possible that several cuts coexist at the same time. The algorithm described in this study will be part of a future release of O₂, a commercial object-oriented DBMS. An early version of the collector is used in version 4.6 of O₂.

9.1.5 Transactional Cyclic Reference Counting. In contrast to the previously described approaches that use tracing, Ashwin et al. [1997] present a nonpartitioned garbage collection scheme that is based on reference counting. The algorithm uses a technique proposed by Brownbridge [1985], which copes with cycles of garbage objects. It uses the notion of strong and weak pointers together with periodic and localized Mark and Sweep phases.

References to objects are labeled as strong or weak, such that a cycle of objects cannot consist solely of strong references--it must have at least one weak reference. Therefore, each object has to maintain two separate reference counts, one for the number of strong references to the object and one for the weak references. These counts are updated as references are created and deleted. When both counts for an object drop to zero, then the object is garbage and can be collected. When only the strong counter of an object drops to zero (the weak counter is nonzero), then this object is potentially involved in a self-referential cycle of garbage. In this case, a localized Mark and Sweep is initiated from that object. In addition to marking the objects it traverses, this localized Mark and Sweep updates the strength of references to ensure that every object has at least one strong reference to it. At the end, all objects that are not marked are garbage and are deleted.(7)
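The reclamation test at the heart of this scheme can be sketched as follows; this is our rendering with assumed names, not code from Ashwin et al.

    void onReferenceRemoved(Object* obj) {
        if (obj->strongCount == 0 && obj->weakCount == 0)
            reclaim(obj);               // plainly unreachable
        else if (obj->strongCount == 0)
            localizedMarkSweep(obj);    // may sit on a garbage cycle:
    }                                   // trace locally from obj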

The algorithm is extended to work correctly in the presence of multiple concurrent transactions and system failures. Their solution depends on assumptions about the system architecture that are very similar to the ones presented in Section 5.1. They analyze log records to get information about updates of references. They maintain a temporary table that keeps the references of objects that are pruned or created by on-going transactions (this table is analogous to our PRT and COT). In addition, the algorithm must persistently maintain (1) the strong and weak reference counts for each object; (2) an extra bit per reference that indicates its strength; and (3) a table containing the identifiers of objects that have a zero strong count. All updates to the reference counts, the strength bits, or the persistent table must be logged.

In the event of a system failure, the recovery manager performs logical UNDO (i.e., increment or decrement) for reference count updates, and creates weak references to UNDO reference deletions. Measurements indicate that this technique is in certain cases cheaper than a typical partitioned Mark and Sweep, since it concentrates on local cycles of garbage. In the worst case, however, it is possible that the localized Mark and Sweep would have to scan large parts of the whole database, since the database is not partitioned.

9.2 Distributed Systems

In addition to the work on centralized systems, a number of research groups have developed persistent garbage collectors that work in distributed environments. As with garbage collectors for centralized repositories, persistent garbage collectors for distributed systems require a very careful approach to coping with system failures, rollback of transactions, and durability of committed updates. Furthermore, they must deal with the additional complications that arise due to distribution.

For scalability and efficiency reasons, a collector for a distributed system typically combines independent per-site collectors with a global intersite collector. Coordinating local and global collections requires carefully keeping track of the reference exchanges between sites. This bookkeeping is necessary to avoid erroneously collecting objects. For example, an object located at one site may be referenced from live objects at remote sites but not referenced by any local live object. Such an object must not be reclaimed by a local collector, since it is reachable from the root of a remote site. Keeping track of intersite references is similar to the problem of maintaining cross-partition references in a centralized scheme, but is more difficult in the distributed environment because of the potential for lost, duplicated, or delayed messages, individual site crashes, etc. [Abdullahi and Ringwood 1998].

We now briefly describe several persistent garbage collectors that have been designed to work in a distributed environment. We detail these particular solutions because they are incremental and support persistence.

9.2.1 Thor. Maheshwari and Liskov [1994; 1997] present garbage collection algorithms in the context of MIT's Thor client-server distributed OODBMS. Their 1994 paper focuses on distributed collection across multiple servers rather than on the local collection at a server. It presents a variant of reference counting, called reference listing, in which each server keeps track of the identities of the remote sites that refer to its local objects. Whenever an object is fetched from a server, the server records the identifier of the object along with the identity of the fetching site in a local table (called its "inlist"). Each server and each client also maintains an "outlist" consisting of the identifiers of the remote objects to which it contains references. Outlists are sent to remote servers and used to update their inlists.

Maheshwari and Liskov [1994] describe the extension of this basic scheme to cope with server and client failures as well as with the optimistic concurrency control scheme used by Thor. An optimistic concurrency control policy does not lock objects, but rather allows multiple transactions to access and update the same objects in parallel; conflicting operations are eventually detected and resolved by aborting (i.e., rolling back) one of the involved transactions. Such a concurrency control scheme adds a further complication for garbage collection, namely that a transaction at one client may cut a reference to an object X while a transaction at another client may simultaneously be holding a valid copy of that same reference. In this situation, X should not be garbage collected, as it is still live from the point of view of the second client. This problem is solved by augmenting the inlist to contain the identifiers of all objects directly reachable from a fetched object in addition to the identifier of the fetched object itself.

When a server crashes, the recovery process must rebuild the server's inlists. Inlists from other servers are kept on stable storage, but the inlists for clients are kept in volatile storage to reduce the overhead on object fetching. Thus, the server rebuilds its client inlists by contacting the clients to retrieve their outlists. Client crashes are handled using a protocol that ensures that all servers have a consistent view of the client's status.

A more recent, complementary paper describes several aspects of a partitioned garbage collection scheme that runs within a single server [Maheshwari and Liskov 1997]. This work presents solutions for two problems that arise with partitioned garbage collection. The first problem has to do with the efficient storage and maintenance of interpartition references. As stated in Section 6.5, for a persistent system, the structures that support such references (e.g., IRLs) must be maintained in a persistent manner. The scheme described by Maheshwari and Liskov [1997] provides an efficient way to maintain information about both incoming and outgoing interpartition references and includes optimizations such as delaying updates to these structures that reduce the disk accesses used to maintain them.

The second problem addressed in their paper is that of collecting cycles of garbage that span partitions. Their technique uses a global but incremental marking scheme. Once a global marking phase has been initiated, each partition trace propagates marks from its root set to the references in its outlists. The global marking phase terminates when marks are known to have propagated fully through all partitions. Enforcing termination requires a global phase counter as well as local phase counters for each partition. In addition, log records are processed such that updates to object references that are made during the execution of the global marking phase do not preclude termination. Comparing the values of counters and checking for unmarked objects enables the detection of garbage objects. Global marking is piggybacked on regular tracing, which reduces the overhead of the global marking phase. The data structures required for global marking can be kept on disk to save primary memory and to allow global marking to be resumed after a crash. This latter property is important because the global marking phase may run for a long time.

9.2.2 Larchant. Delaying insertions in the lists of incoming references is also an interesting property of the collector designed by Ferreira and Shapiro [1995] for the Larchant system. Their work addresses the issue of garbage collecting a persistent distributed shared virtual memory. In their system, objects, grouped in units called bunches, can be mapped (i.e., entirely replicated) at one or more sites.

Cross-bunch references are handled using a reference-listing scheme. An outgoing pointer is described by a stub and an incoming pointer by a scion. A stub identifies its target scion and vice-versa. New lists of stubs are created as a side effect of local garbage collections and then sent to other sites. These lists allow receivers to build new incoming lists of scions. Ferreira and Shapiro's algorithm can delay updates to the lists of incoming references by carefully keeping track of the invalidation of out-of-date copies and by using causal communication that enforces a safe order on the creation and deletion of scions.

The algorithm is able to reclaim cycles of garbage that span bunches by tracing a group of bunches mapped at a site as a single unit (i.e., ignoring bunch boundaries). Groups are formed based on heuristics that tend to maximize the amount of interbunch garbage to collect and to minimize the cost of the collection. This scheme can reduce I/O in an opportunistic fashion, by forming groups from bunches that are in memory at a site.

For local collections at each site, Ferreira and Shapiro's algorithm uses the concurrent replication-based approach of O'Toole et al. [1993] detailed above. Consequently, the paper does not deal with fault tolerance issues, assuming they are handled by the underlying layers.

9.2.3 Yong, Naughton, and Yu. Yong et al. [1994] examine the performance of several reclamation algorithms for client-server persistent object stores. Some of the results are obtained from an implementation of an incremental partitioned Mark and Sweep algorithm, although few details of this algorithm or its implementation are given. Like ours, their partitioned Mark and Sweep collector runs at the server and can execute concurrently with client transactions. Their algorithm differs in that it uses a special write barrier and in that the collector obtains (non-two-phase) locks on data items. The write barrier traps updates at the clients and adds any new object references to a local list. This list is shipped to the server when a client commits or when the client receives a callback message from the server. The server requests these lists and the contents of application process stacks from the clients before the collector enters its sweep phase. The lists and other references are used as additional roots of persistence during the marking phase. As for locks, the marking phase obtains and holds a read lock on a page while it is accessing the page. These locks force the marker to synchronize with transactions. In contrast, our partitioned Mark and Sweep algorithm holds no locks on pages, sends no callbacks to clients, and can ignore the program state of on-going transactions. Measurements of their implementation showed that the write barrier has only a minimal impact on client performance; our measurements support this result.
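
As a concrete illustration of this barrier-and-ship protocol, the sketch below models a client-side write barrier in C++; the class, its methods, and the NewRef record are illustrative assumptions of ours, not the interface of Yong et al.'s system.

    #include <mutex>
    #include <vector>

    struct NewRef { int srcOid; int dstOid; };   // a newly created reference

    class ClientWriteBarrier {
        std::vector<NewRef> newRefs_;   // references created since the last ship
        std::mutex m_;
    public:
        // Called on every pointer store the client performs.
        void onPointerStore(int srcOid, int dstOid) {
            std::lock_guard<std::mutex> g(m_);
            newRefs_.push_back({srcOid, dstOid});
        }
        // Called at transaction commit or when a server callback arrives;
        // the returned list becomes extra roots for the server's mark phase.
        std::vector<NewRef> shipToServer() {
            std::lock_guard<std::mutex> g(m_);
            std::vector<NewRef> out;
            out.swap(newRefs_);         // atomically drain the local list
            return out;
        }
    };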

Several other algorithms are examined by Yong et al. [1994], including a partitioned copy-based collection algorithm. This algorithm obtains non-two-phase exclusive transactional locks for moving objects, uses callbacks, and requires logical OIDs. No details, however, are given about how failures are handled. Based on the results of a simulation study, the copy-based algorithm is advocated over partitioned Mark and Sweep because of its ability to recluster the database. In contrast, we have chosen to let clustering be handled separately by the system in order to gain the efficiency and relative ease of implementation of Mark and Sweep.

9.3 Partitioning Issues

We briefly describe here two other papers related to tracing-based garbage collection of persistent systems. These papers assume the existence of a tracing-based collector and focus on issues related to partitioned tracing-based collections. Specifically, these papers investigate the questions of which partition to collect and at what rate.

Cook et al. [1994b] explore heuristics for selecting the partition that is the most cost-effective to collect at a given time. Six different policies are evaluated, two of which dominate the others. The first is based on counting the number of references cut inside each partition. The intuition is that, since cutting references is what turns objects into garbage, the partition with the most reference cuts is likely to contain the most garbage.

The second policy refines the first by considering some object references to be more important than others. For example, cutting the last reference to the root of a tree of objects makes the whole tree unreachable, whereas cutting a reference to a leaf object makes only that object unreachable. To approximate the volume of garbage eventually produced by the destruction of a reference, every reference is given a weight that reflects the depth of its target object relative to the root of the partition. The second policy uses the weights of destroyed references to help determine which partition to collect.
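
The two policies can be made concrete with a small sketch; the structures below and, in particular, the 1/(1+depth) weighting are stand-ins of ours for the paper's depth-based approximation.

    #include <map>

    struct PartitionStats {
        long cutRefs = 0;        // policy 1: number of references cut in the partition
        double cutWeight = 0.0;  // policy 2: total weight of the cut references
    };

    // Record the destruction of a reference whose target sat at `depth`
    // below the partition root: shallow cuts get larger weights because
    // they tend to disconnect larger subtrees.
    void onReferenceCut(PartitionStats& s, int depth) {
        s.cutRefs += 1;
        s.cutWeight += 1.0 / (1 + depth);   // illustrative weighting only
    }

    // Select the next partition to collect under either policy.
    int selectPartition(const std::map<int, PartitionStats>& stats, bool weighted) {
        int best = -1;
        double bestScore = -1.0;
        for (const auto& [id, s] : stats) {
            double score = weighted ? s.cutWeight : double(s.cutRefs);
            if (score > bestScore) { bestScore = score; best = id; }
        }
        return best;
    }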

In a complementary study, Cook et al. [1994a] evaluate the rate at which various partitions should be collected. Collecting a partition too often does not amortize the cost of the collection, since too little garbage is reclaimed; collecting it too infrequently wastes space and increases I/O. The paper concludes that collection rates should take into account the dynamic behavior of applications. The proposed solution maintains statistics about the partitions and about the cost and effectiveness of their previous collections. The algorithm uses these statistics to estimate the cost and benefit of the next collection of a partition, allowing the collector to be triggered at an advantageous time.
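
A minimal sketch of such statistics-driven triggering follows; the history record, the growth factor, and the threshold are all assumptions made for illustration, not the paper's estimators.

    // Collect a partition when its estimated benefit (bytes of reclaimable
    // garbage) outweighs the estimated cost of tracing it.
    struct CollectionHistory {
        double lastCostIO = 0;        // I/Os spent by the previous collection
        double lastReclaimed = 0;     // bytes reclaimed by the previous collection
        double bytesWrittenSince = 0; // update activity observed since then
    };

    bool shouldCollect(const CollectionHistory& h, double minBenefitPerIO) {
        // Project garbage growth as proportional to update activity, and
        // require the projected payoff to beat the observed cost per I/O.
        double projectedGarbage = h.lastReclaimed + 0.1 * h.bytesWrittenSince;
        double costEstimate = h.lastCostIO > 0 ? h.lastCostIO : 1.0;
        return projectedGarbage / costEstimate >= minBenefitPerIO;
    }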

10. CONCLUSIONS

Persistent object stores, as represented by current OODBMS technology, employ features that raise problems for incremental garbage collection algorithms. These features include client caching, STEAL management of client buffers, transactions, and fault tolerance. We described an efficient garbage collection algorithm for persistent object stores, based on a partitioned Mark and Sweep approach. The correctness of the algorithm is enforced by exploiting the flow of log records between clients and server, as well as the locking, logging, and recovery protocols of such systems. The collector is incremental, but requires very little synchronization with client transactions (e.g., it holds no transactional locks), performs minimal logging, and requires no client callbacks or special hardware. The algorithm has been implemented, tested, and analyzed in the EXODUS storage manager. The garbage collector impacted fewer than 300 lines of code in the client side of the system, and required no changes to the complex transaction management, buffer management, and caching protocols already in place.

Our performance studies showed that the collector bookkeeping mechanisms added little overhead to client operations, and that the performance impact of the collector is primarily due to the I/O load for detecting and reclaiming garbage, which is unavoidable in any tracing collector. The study also raised scheduling issues that must be addressed in order to moderate the impact of garbage collection I/O activity on overall system performance. While the current implementation is efficient, it is possible to improve it by relaxing some of the assumptions and correctness properties on which the collector is based. The implementation can also be used to investigate other related issues such as data partitioning, partition selection, and alternative collection approaches such as scavenging.

(1) It is important to note that this copy is to the server's memory only. No disk I/O is required, so the FORCE policy is cheaper here than at the server buffer.

(2) As described in Section 7.2.3, object marking is implemented using color bitmaps associated with each page of the database. These bitmaps are not stored within the objects themselves.

(3) We currently choose the partition to collect in a round-robin manner. More sophisticated techniques for choosing the partition can be used to increase the amount of garbage collected or reduce the impact of garbage collection on the server's memory contents [Cook et al. 1994b]. We discuss such techniques in Section 9.3.

(4) We did not destroy all interpartition references but actually left a single valid IRL entry per IRL page. Destroying all IRL entries would have made their sweeping free, since those pages would have contained 100% garbage.

(5) To keep the garbage percentage fixed at 5% in this experiment, the sweeper does not actually reclaim the garbage objects; however, it still performs all checking and I/O.

(6) The collector performance continues to oscillate within this range for the cases we tested (up to a one-second sleep time).

(7) Brownbridge's algorithm is known to fail when there are two mutually referential cycles. In this case, each intersecting cycle of garbage objects regards the other as an "external" reference. This bug is not mentioned by Ashwin et al. [1997].

REFERENCES

ABDULLAHI, S. E. and RINGWOOD, G.A. 1998. Garbage collecting the Internet: a survey of distributed garbage collection. ACM Comput. Surv. 30, 3, 330-373.

AMSALEG, L. 1995. Conception et réalisation d'un glaneur de cellules adapté aux SGBDO client-serveur. Ph.D. Dissertation. Université Paris 6 Pierre et Marie Curie, Paris, France.

APPEL, A. W., ELLIS, J. R., and LI, K. 1988. Real-time concurrent collection on stock multiprocessors. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '88, Atlanta, GA, June 22-24), R. L. Wexelblat, Ed. ACM Press, New York, NY, 11-20.

ASHWIN, S., ROY, P., SILBERSCHATZ, A., and SUDARSHAN, S. 1997. Garbage collection in object-oriented databases using transactional cyclic reference counting. In Proceedings of the 23rd International Conference on Very Large Databases (VLDB '97, Athens, Greece, Aug.). VLDB Endowment, Berkeley, CA, 366-375.

ATKINSON, M. P., BAILEY, P. J., CHISHOLM, K. J., COCKSHOTT, P. W., and MORRISON, R. 1990. An approach to persistent programming. In Readings in object-oriented database systems, S. B. Zdonik and D. Maier, Eds. Morgan Kaufmann series in data management systems. Morgan Kaufmann Publishers Inc., San Francisco, CA, 141-146.

BAKER, H. 1978. List processing in real time on a serial computer. Commun. ACM 21, 4 (Apr.), 280-294.

BANCILHON, F., DELOBEL, C., and KANELLAKIS, P. 1991. Building an Object-Oriented Database: The O₂ Story. Morgan Kaufmann, San Mateo, Calif.

BENZAKEN, V. and DELOBEL, C. 1990. Enhancing performance in a persistent object store: Clustering strategies in O₂. In Proceedings of the 4th International Workshop on Persistent Object Systems (Martha's Vineyard, MA, Sept.). 403-412.

BERENSON, H., BERNSTEIN, P., GRAY, J., MELTON, J., O'NEIL, E., and O'NEIL, P. 1995. A critique of ANSI SQL isolation levels. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD '95, San Jose, CA, May 23-25), M. Carey and D. Schneider, Eds. ACM Press, New York, NY, 1-10.

BISHOP, P. 1977. Computer systems with a very large address space and garbage collection. Ph.D. Dissertation. MIT Laboratory for Computer Science, Cambridge, MA.

BROWNBRIDGE, D. R. 1985. Cyclic reference counting for combinator machines. In Proceedings of the ACM Conference on Functional Programming Languages and Computer Architecture (Nancy, France, Sept.), J.-P. Jouannaud, Ed. Springer Lecture Notes in Computer Science, vol. 201. Springer-Verlag, New York, NY, 273-288.

BUTLER, M. 1987. Storage reclamation in object oriented database systems. In Proceedings of the ACM SIGMOD Annual Conference on Management of Data (SIGMOD '87, San Francisco, CA, May 27-29), U. Dayal, Ed. ACM Press, New York, NY, 410-425.

CAREY, M., DEWITT, D., and NAUGHTON, J. 1993. The OO7 benchmark. In Proceedings of the 1993 ACM SIGMOD Conference on Management of Data. ACM Press, New York, NY, 12-21.

CAREY, M., DEWITT, D., RICHARDSON, J., and SHEKITA, E. 1986. Object and file management in the Exodus extensible system. In Proceedings of the 12th International Conference on Very Large Data Bases (Kyoto, Japan, Aug.). VLDB Endowment, Berkeley, CA, 91-100.

CAREY, M., FRANKLIN, M., LIVNY, M., and SHEKITA, E. 1991. Data caching tradeoffs in client-server DBMS architectures. In Proceedings of the 1991 ACM SIGMOD International Conference on Management of Data (SIGMOD '91, Denver, CO, May 29-31), J. Clifford and R. King, Eds. ACM Press, New York, NY, 357-366.

CAREY, M., FRANKLIN, M., and ZAHARIOUDAKIS, M. 1994. Fine-grained sharing in a page server OODBMS. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data (SIGMOD '94, Minneapolis, MN, May 24-27), R. T. Snodgrass and M. Winslett, Eds. ACM Press, New York, NY.

COOK, J., KLAUSER, A., WOLF, A., and ZORN, B. 1994a. Effectively controlling garbage collection rates in object databases. Tech. Rep. CU-CS-758-94. Department of Computer Science, University of Colorado at Boulder, Boulder, CO.

COOK, J., WOLF, A., and ZORN, B. 1994b. Partition selection policies on object database garbage collection. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data (SIGMOD '94, Minneapolis, MN, May 24-27), R. T. Snodgrass and M. Winslett, Eds. ACM Press, New York, NY, 371-382.

DETLEFS, D. 1991. Concurrent, atomic garbage collection. Ph.D. Dissertation. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.

DEWITT, D. J., FUTTERSACK, P., MAIER, D., and VELEZ, F. 1990. A study of three alternative workstation server architectures for object-oriented database systems. In Proceedings of the 16th International Conference on Very Large Databases (Brisbane, Australia, Aug. 13-16, 1990), D. McLeod, R. Sacks-Davis, and H. Schek, Eds. Morgan Kaufmann Publishers Inc., San Francisco, CA, 107-121.

DIJKSTRA, E., LAMPORT, L., MARTIN, A., SCHOLTEN, C., and STEFFENS, E. 1978. On-the-fly garbage collection: An exercise in cooperation. Commun. ACM 21, 11 (Nov.), 966-975.

EFFELSBERG, W. and HAERDER, T. 1984. Principles of database buffer management. ACM Trans. Database Syst. 9, 4 (Dec. 1984), 560-595.

EXODUS PROJECT GROUP. 1993. EXODUS storage manager architectural overview. Computer Science Department, Univ. of Wisconsin at Madison, Madison, WI. ftp://ftp.cs.wisc.edu/exodus/sm/doc/arch_overview.3.0.ps.

FERREIRA, P. and SHAPIRO, M. 1995. Garbage collection in the Larchant persistent distributed shared store. In Future Trends in Distributed Computing Systems (Cheju Island, Korea, Aug.).

FRANKLIN, M. 1997. Concurrency control and recovery. In The Computer Science and Engineering Handbook, A. B. Tucker, Ed. CRC Press, Inc., Boca Raton, FL, 1058-1077.

FRANKLIN, M., CAREY, M., and LIVNY, M. 1993. Local disk caching in client-server database systems. In Proceedings of the 19th International Conference on Very Large Data Bases (VLDB '93, Dublin, Ireland, Aug.). Morgan Kaufmann Publishers Inc., San Francisco, CA, 641-654.

FRANKLIN, M. J., CAREY, M. J., and LIVNY, M. 1997. Transactional client-server cache consistency: alternatives and performance. ACM Trans. Database Syst. 22, 3, 315-363.

FRANKLIN, M., ZWILLING, M., TAN, C., CAREY, M., and DEWITT, D. 1992. Crash recovery in client-server EXODUS. In Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data (SIGMOD '92, San Diego, CA, June 2-5), M. Stonebraker, Ed. ACM Press, New York, NY, 165-174.

GRAY, J. and REUTER, A. 1993. Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA.

GRUBER, O. and AMSALEG, L. 1994. Object grouping in EON. In Distributed Object Management, T. Ozsu, U. Dayal, and P. Valduriez, Eds. Morgan Kaufmann, San Mateo, CA, 117-131.

KOLODNER, E. and WEIHL, W. 1993. Atomic incremental garbage collection and recovery for a large stable heap. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD '93, Washington, DC, May 26-28), P. Buneman and S. Jajodia, Eds. ACM Press, New York, NY, 177-185.

KOLODNER, E., LISKOV, B., and WEIHL, W. 1989. Atomic garbage collection: Managing a stable heap. In Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data (SIGMOD '89, Portland, OR, June), J. Clifford, B. Lindsay, and D. Maier, Eds. ACM Press, New York, NY, 15-25.

LAMB, C., LANDIS, G., ORENSTEIN, J., and WEINREB, D. 1991. The ObjectStore database system. Commun. ACM 34, 10 (Oct. 1991), 50-63.

LE FESSANT, F., PIUMARTA, I., and SHAPIRO, M. 1998. An implementation of complete, asynchronous, distributed garbage collection. SIGPLAN Not. 33, 5, 152-161.

MAHESHWARI, U. and LISKOV, B. 1994. Fault-tolerant distributed garbage collection in a client-server object-oriented database. In Proceedings of the 3rd International Conference on Parallel and Distributed Information Systems (PDIS, Austin, TX, Sept.).

MAHESHWARI, U. and LISKOV, B. 1997. Partitioned garbage collection of a large object store. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '97, Tucson, AZ, May 13-15), J. M. Peckham, S. Ram, and M. Franklin, Eds. ACM Press, New York, NY, 313-323.

McBETH, J. 1963. On the reference counter method. Commun. ACM 6, 9 (Sept.), 575.

MOHAN, C. and NARANG, I. 1994. ARIES/CSA: A method for database recovery in client-server architectures. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data (SIGMOD '94, Minneapolis, MN, May 24-27), R. T. Snodgrass and M. Winslett, Eds. ACM Press, New York, NY, 55-66.

MOHAN, C., HADERLE, D., LINDSAY, B., PIRAHESH, H., and SCHWARZ, P. 1992. ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Trans. Database Syst. 17, 1 (Mar. 1992), 94-162.

O'TOOLE, J., NETTLES, S., and GIFFORD, D. 1993. Concurrent compacting garbage collection of a persistent heap. ACM SIGOPS Oper. Syst. Rev. 27, 5 (Dec. 1993), 161-174.

SHAPIRO, M., GRUBER, O., and PLAINFOSSE, D. 1990. A garbage detection protocol for a realistic distributed object-support system. Tech. Rep. INRIA-1320. INRIA, Rennes, France.

SKUBISZEWSKI, M. and VALDURIEZ, P. 1993. Concurrent garbage collection in O₂. In Proceedings of the 19th International Conference on Very Large Data Bases (VLDB '93, Dublin, Ireland, Aug.). Morgan Kaufmann Publishers Inc., San Francisco, CA, 356-365.

UNGAR, D. 1984. Generation scavenging: A non-disruptive high-performance storage reclamation algorithm. In Proceedings of the SIGSOFT/SIGPLAN Workshop on Practical Software Development Environments. ACM, New York, NY.

WILSON, P. 1992. Uniprocessor garbage collection techniques. In Proceedings of the International Workshop on Memory Management (St. Malo, France, Sept.). Lecture Notes in Computer Science, vol. 637. Springer-Verlag, New York, 1-43.

YONG, V., NAUGHTON, J., and YU, J. 1994. Storage reclamation and reorganization in client-server persistent object stores. In Proceedings of the 10th International Conference on Data Engineering (ICDE '94). IEEE Press, Piscataway, NJ, 120-133.

Received: April 1997; revised: April 1998 and June 1999; accepted: June 1999

A preliminary version of this article appeared in the Proceedings of the 21st International Conference on Very Large Data Bases (Zurich, Switzerland, Sept. 1995), pp. 42-53. L. Amsaleg was supported in part by a fellowship from INRIA Rocquencourt, by NSF RIA IRI-94-09575, and by UMIACS. He performed part of this work as a Visiting Researcher at the University of Maryland. M. Franklin's work was supported by NSF RIA IRI-94-09575 and a grant from the University of Maryland General Research Board. He performed this work at the University of Maryland and as a Visiting Researcher at INRIA Rocquencourt. O. Gruber's work was performed while he was with INRIA Rocquencourt.

Authors' addresses: L. Amsaleg, Campus Universitaire de Beaulieu, IRISA/CNRS, Rennes, 35042, France; email: Laurent.Amsaleg@irisa.fr; M. J. Franklin, EECS Computer Science Division, University of California, Berkeley, 387 Soda Hall, No. 1776, Berkeley, CA 94720-1776; email: franklin@cs.berkeley.edu; O. Gruber, T. J. Watson Research Center, IBM, 30 Saw Mill River Road, Hawthorne, NY 10532; email: ogruber@us.ibm.com.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.