# Tracing the Lineage of View Data in a Warehousing Environment.

1. INTRODUCTIONIn a data warehousing system, materialized views over source data are defined, computed, and stored in the warehouse to answer queries about the source data (which may be stored in distributed and legacy systems) in an integrated and efficient way [Chaudhuri and Dayal 1997; Widom 1995]. Typically, online analytical processing and mining (OLAP and OLAM) systems operate on the data warehouse, allowing users to perform analysis and make predictions [Chaudhuri and Dayal 1997; Han et al. 1998]. In many cases, not only is the view data itself useful for analysis, but knowing the set of source data that produced specific pieces of view information can also be useful. Given a data item in a materialized view, determining the source data that produced it and the process by which it was produced is termed the data lineage problem. Some applications of view data lineage follow:

* OLAP and OLAM: Effective data analysis and mining requires facilities for data exploration at different levels. The ability to select a portion of relevant view data and "drill-through" to its origins can be very useful. In addition, an analyst may want to check the origins of suspect or anomalous view data to verify the reliability of the sources or even repair the source data.

* Scientific Databases: Scientists apply algorithms to commonly understood and accepted source data to derive their own views and perform specific studies. As in OLAP, it can be useful for the scientist to focus on specific view data, then explore how it was derived from the original raw data.

* Online Network Monitoring and Diagnosis Systems: From anomalous view data computed by the diagnosis system, the network controller can use data lineage to identify the faulty data within huge volumes of data dumped from the network monitors.

* Cleansed Data Feedback: Information centers download raw data from data sources and "cleanse" the data by performing various transformations on it. Data lineage helps locate the origins of data items, allowing the system to send reports about the cleansed data back to their sources, and even link the cleansed items to the original items.

* Materialized View Schema Evolution: In a data warehouse, users may be permitted to change view definitions (e.g., add a column to a view) under certain circumstances. View data lineage can help retrofit existing view contents to the new view definition without recomputing the entire view.

* View Update Problem: Not surprisingly, tracing the origins of a given view data item is related to the well-known view update problem [Bancilhon and Spyratos 1981; Dayal and Bernstein 1978; Stonebraker 1975]. In Section 10.2, we discuss this relationship and show how lineage tracing can be used to help translate view updates into corresponding base data updates.

In general, a view definition provides a mapping from the base data to the view data. Given a state of the base data, we can compute the corresponding view according to the view definition. However, determining the inverse mapping--from a view data item back to the base data that produced it--is not as straightforward. To determine the inverse mapping accurately, we not only need the view definition, but we also need the base data and some additional information.

The warehousing environment introduces some additional challenges to the lineage tracing problem, such as how to trace lineage when the base data is distributed among multiple sources, and what to do if the sources are inaccessible or not consistent with the warehouse views. At the same time, the warehousing environment can help the lineage tracing process by providing facilities to merge data from multiple sources and to store auxiliary information in the warehouse in a consistent fashion. In this paper we provide a complete solution for tracing the lineage of relational view data in a warehousing environment. In summary, we:

* Formulate the view data lineage problem by giving a declarative, inductive definition of lineage for arbitrarily complex relational views, including aggregation.

* Develop lineage tracing algorithms for relational views with aggregation, including proofs of correctness. We separately consider the problem under set semantics and bag (multiset) semantics.

* Address issues of lineage tracing in a warehousing environment and show how to perform lineage tracing consistently and efficiently for views defined on distributed, legacy sources.

The remainder of this paper is organized as follows. Section 2 surveys related work. Section 3 motivates the data lineage problem using detailed examples, and Section 4 formalizes the problem. Sections 5, 6, and 7 present set-semantics lineage tracing algorithms for select-project-join (SPJ) views, views with aggregation, and views with union and difference operators, respectively. Section 8 extends our results to bag semantics, and Section 9 discusses additional issues for lineage tracing in a warehousing environment. Section 10 revisits some related problems (e.g., the view update problem) to further clarify their relationship with the data lineage problem. Section 11 concludes and discusses future extensions to our work.

A selection of proofs for some theorems in this paper are provided in the Appendix. Remaining proofs are similar in spirit to the ones here, and all proofs are included in Cui et al. [1997].

2. RELATED WORK

The problem of tracing view data lineage is related to work in several areas. Deductive database techniques perform top-down recursive rule-goal unification to provide proofs for a goal proposition [Ullman 1989]. The provided proofs find the supporting facts for the goal proposition, and therefore can also be thought of as providing the proposition's lineage. In this paper we take a different approach that designs relational queries for lineage tracing. In addition, we allow views with aggregation, which are not considered by deductive databases, and we consider lineage tracing in multisource warehousing environments.

Stonebraker [1975] provides an algorithm to translate updates on SPJ views to updates on their base tables. Data cube "rolling-up" and "drilling-down" enable a user to browse a summary view and underlying detailed data at any level and any dimension of the aggregation [Gray et al. 1996]. Although both papers address problems that roughly include retrieving lineage information for specific types of relational views, neither of them formally define the view data lineage problem or tackle it in the general case.

Scientific databases support view data lineage using metadata and annotations [Hachem et al. 1993]. This approach is useful for providing schema-level lineage tracing. However, for applications that require lineage at a finer (instance-level) granularity, their techniques may introduce high storage overhead. Woodruff and Stonebraker [1997] propose a framework to compute instance-level data lineage lazily using the view's weak inverse mapping. However, the system requires the view definer to provide the view's weak inverse, an expectation that may not always be practical. Our algorithms trace instance-level lineage automatically for the user and maintain the necessary auxiliary information to ensure that views have inverses. In effect, our techniques automatically generate a weak inverse in the sense of Woodruff and Stonebraker [1997] for a wide class of relational views.

Kawaguchi et al. [1997] introduce derivation sets in the context of concurrency control for materialized views. Their definition of a derivation set and our lineage definition are related, in the sense that both definitions attempt to isolate the base tuples that affect specific view tuples. However, Kawaguchi et al. [1997] define derivation sets in a database-independent way, so their definition is much looser than ours. Furthermore, Kawaguchi et al. [1997] consider a more restricted class of views than we do. We compare the two approaches in more detail in Section 6.2.

The lineage problem also relates somewhat to the problem of reconstructing base data from summary data as in Faloutsos et al. [1997], which uses a statistical approach and some constraint knowledge. However, that approach provides only estimated lineage information and does not ensure accuracy.

We have implemented a lineage tracing system [Cui and Widom 2000a] as part of the WHIPS [Wiener et al. 1996] data warehousing prototype at Stanford. We have also studied lineage tracing performance issues in detail in Cui and Widom [1999; 2000b].

3. MOTIVATING EXAMPLES

In this section we provide examples that motivate a precise definition of data lineage and show how lineage tracing can be useful. Consider a data warehouse with retail store data over three source tables, whose schema and sample contents are shown in Figures 1-3. The store and item tables are self-explanatory. The sales table contains, for each store and item, the price of the item and the number sold at that store.

Fig. 1. store table.

s_id s_name city state 001 Target Palo Alto CA 002 Target Albany NY 003 Macy's San Francisco CA 004 Macy's New York City NY

Fig. 2. item table.

i_id i_name category 0001 binder stationery 0002 pencil stationery 0003 shirt clothing 0004 pants clothing 0005 pot kitchenware

Fig. 3. sales table

s_id i_id price num 001 0001 4 1000 001 0002 1 3000 001 0004 30 600 002 0001 5 800 002 0002 2 2000 002 0004 35 800 003 0003 45 1500 003 0004 60 600 004 0003 50 2100 004 0004 70 1200 004 0005 30 200

Example 3.1 (Lineage of SPJ View). Suppose an analyst wants to follow the selling patterns of California stores. A materialized view Calif can be defined in the data warehouse for this purpose. Figure 4 shows the SQL and relational algebra definitions of Calif. The materialized view for Calif over our sample data is shown in Figure 5.

[Figure 4 ILLUSTRATION OMITTED]

Fig. 5. View definition for clothing.

s_name i_name num Target binder 1000 Target pencil 3000 Target pants 600 Macy's shirt 1500 Macy's pants 600

The analyst browses the view table and is interested in the second tuple <Target, pencil, 3000>. He or she would like to see the relevant detailed information and asks question Q1: "Which base data produced tuple <Target, pencil, 3000> in view Calif?" Using the algorithms in Section 5, we olbtain the answer in Figure 6. The answer tells us that the Target store in Palo Alto sold 3000 pencils at a price of 1 dollar each.

Fig. 6. Lineage for for <Target, pencil, 3000>.

store s_id s_name city state 001 Target Palo Alto CA item i_id i_name category 0002 pencil stationery sales s_id i_id price num 001 0002 1 3000

Example 3.2 (Lineage of Aggregation View). Now let us consider another warehouse view, Clothing, for analyzing the total clothing sales of large stores (those that have sold more than 5000 clothing items). The SQL and relational algebra definitions of the view are shown in Figure 7. We extend traditional relational algebra with an aggregation operator, denoted [[Alpha].sub.G, aggr(B)], where G is a list of group by attributes, and aggr(B) abbreviates a list of aggregate functions over attributes (details in Section 4.1.). The materialized view contains one tuple, < 5400 >.

[Figure 7 ILLUSTRATION OMITTED]

The analyst may wish to learn more about the origins of this tuple, and asks question Q2: "Which base data produced tuple <5400> in view Clothing?" Not surprisingly, due to the more complex view definition, this question is more difficult to answer than Q1. We develop the appropriate algorithms in Section 6, and Figure 8 presents the answer. It lists all the branches of Macy's, the clothing items it sells (but not other items), and the sales information. All of this information is used to derive the tuple <5400> in Clothing.

Fig. 8. Clothing lineage for <5400>.

store s_id s_name city state 003 Macy's San Francisco CA 004 Macy's New York City NY item i_id i_name category 0003 shirt clothing 0004 pants clothing sales s_id i_id price num 003 0003 45 1500 003 0004 60 600 004 0003 50 2100 004 0004 70 1200

Questions such as Q1 and Q2 ask about the base tuples that derive a given view tuple. We call these base tuples the derivation (or lineage) of the view tuple. In the next section, we formally define the concept of derivation. Sections 5-9 then present algorithms to compute view tuple derivations for different view and warehouse scenarios.

Note that our techniques only trace the lineage of existing view tuples based on existing base tuples. As discussed in Section 11, we plan to consider how our techniques can be extended to help explain why specific (expected) tuples are not in the view, and to help explain the presence of erroneous view tuples in terms of missing base data.

4. VIEW TUPLE DERIVATIONS

In this section we define the notion of a tuple derivation, which is the set of base relation tuples that produce a given view tuple. Section 4.1 first reviews relational view semantics. Tuple derivations for operators and views are then defined in Sections 4.2 and 4.3, respectively.

We assume that a table (relation) R with schema R contains a set of tuples {[t.sub.1], ..., [t.sub.n]}. A database D contains a list of base tables <[R.sub.1], ..., [R.sub.m]>. A view V is a virtual or materialized result of a query over the base tables in D. The query (or the mapping from the base tables to the view table) is called the view definition, denoted v. We say that [R.sub.1], ..., [R.sub.m] derives V if V = v([R.sub.1], ..., [R.sub.m]). We consider set semantics (no duplicates) in this section as well as in Sections 5, 6, and 7. We adapt our work to bag semantics (duplicates permitted) in Section 8.

4.1 Views

We consider a class of views defined over base relations using the relational algebra operators selection ([Sigma]), projection ([Pi]), join (??), aggregation ([Alpha]), set union ([union]), and set difference (-). We use the standard relational semantics, included here for completeness. For an attribute list A = [A.sub.1], ..., [A.sub.n], we use the shorthand t.A to denote (t.[A.sub.1], ..., t.[A.sub.n]).

* Base case: R = {t\t [element of] R}.

* Selection: [[Sigma].sub.C](V1) = {t\t [element of] [V.sub.1] and t satisfies condition C}.

* Projection: [[Pi].sub.A]([V.sub.1]) = {t.A\t [element of] V1} Note that [[Pi].sub.A] is duplicate-eliminating.

* Join: [??.sub.[Theta]]([V.sub.1], ..., [V.sub.M]) = {<[t.sub.1], ..., [t.sub.m]> \[t.sub.i [element of] [V.sub.i] for i = 1..m, and the [t.sub.i]'s satisfy condition [Theta]).

We use infix notation and the special case of natural join ?? in most of the paper, although all results and algorithms hold directly for the general case of theta join.

* Aggregation: [[Alpha].sub.G, aggr(B)]([V.sub.1]) = {(T.G, aggr(T.B)>\T [subset or equal to] [V.sub.1], [Inverted]At, t' [element of] T, [Inverted]At" [is not an element of] T: t'.G = t.G [conjunction] t".G [is not equal to] t.G} G is a groupby attribute list from [V.sub.1], and aggr(B) abbreviates a list of aggregation functions applied to attributes of [V.sub.1].(1)

* Set Union: [V.sub.1] [union] ... [union] [V.sub.m] = {t\t [element of] [V.sub.i] for some [element of] 1..m}.

* Set Difference: [V.sub.1] - [V.sub.2] - {t\t [element of] [V.sub.1] and t [is not an element of] [V.sub.2]}.

For convenience in formulation, when a view references the same relation more than once, we consider each relation instance as a separate relation. For example, we treat the self-join R ?? R as (R as [R.sub.1]) ?? (R as [R.sub.2]), and we consider [R.sub.1] and [R.sub.2] to be two tables in D. This approach allows view definitions to be expressed using an algebra tree instead of a graph, while not limiting the views we can handle.

Any view definition in our language can be expressed using a query tree, with base tables as the leaf nodes and operators as inner nodes. Figures 4 and 7 are examples of query trees.

4.2 Tuple Derivations for Operators

To define the concept of derivation we logically assume that the view contents are computed by evaluating the view definition query tree bottom-up. Each operator in the tree generates its result table on the basis of the results of its child nodes, and passes its result table upwards. We begin by focusing on individual operators, defining derivations of the operator's result tuples based on its input tuples.

According to relational semantics, each operator can generate its result tuple-by-tuple based on its operand tables. Intuitively, given a tuple t in the result of operator Op, some subset of the input tuples produced t. We say that tuples in this subset contribute to t, and we call the entire subset the derivation of t. Input tuples not in t's derivation either contribute to nothing, or only contribute to result tuples other than t. Figure 9 illustrates the derivation of a result tuple. In the figure, operator Op is applied to tables [T.sub.1] and [T.sub.2], which may be base tables or temporary results from other operators. (In general, we use R's to denote base tables and T's to denote tables that may be base or derived.) Table T is the operation result. Given tuple t in T, only subsets [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] and [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] of [T.sub.1] and [T.sub.2] contribute to t. <[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]> is called t's derivation. The formal definition of tuple derivation for an operator is given next, followed by additional explanations.

[Figure 9 ILLUSTRATION OMITTED]

Definition 4.1 (Tuple Derivation for an Operator). Let Op be any relational operator over tables [T.sub.1], ..., [T.sub.m], and let T = Op([T.sub.1], ..., [T.sub.m]) be the table that results from applying Op to [T.sub.1], ..., [T.sub.m]. Given a tuple t [element of] T, we define t's derivation in [T.sub.1], ..., [T.sub.m] according to Op to be [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], where [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], ..., [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] are maximal subsets of [T.sub.1], ..., [T.sub.m] such that

(a) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(b) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

We also say that [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] is t's derivation in [T.sub.i], and each tuple t* in [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] contributes to t, for i = 1..m.

In Definition 4.1, requirement (a) says that the derivation tuple sets (the [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]'s) derive exactly t. From relational semantics, we know that for any result tuple t there must exist such tuple sets. Requirement (b) says that each tuple in the derivation does in fact contribute something to t. For example, with requirement (b) and given Op = [[Sigma].sub.C], base tuples that do not satisfy the selection condition C, and therefore make no contribution to any result tuple, will not appear in any result tuple's derivation. By defining the [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]'s to be the maximal subsets that satisfy requirements (a) and (b), we make sure that the derivation contains exactly all the tuples that contribute to t. Thus, the derivation fully explains why a tuple exists in the result.(2)

[Op.sup.-1] can be extended to represent the derivation of a set of tuples:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

where [union] represents the multiway union of relation lists, i.e., <[S.sub.1], ..., [S.sub.m]> = <([R.sub.1] [union] [S.sub.m]), ..., ([R.sub.m] [union] [S.sub.m])>. Theorem 1 shows that there is a unique derivation for any operator and result tuple. Note that all proofs are provided in the Appendix.

THEOREM 4.2 (DERIVATION UNIQUENESS). Given t [element of] Op([T.sub.1], ..., [T.sub.m]), where t is a tuple in the result of applying operator Op to tables [T.sub.1], ..., [T.sub.m], there exists a unique derivation of t in [T.sub.1], ..., [T.sub.m] according to Op.

Example 4.3 (Tuple Derivation for Aggregation). Given table R in Figure 10(a) and tuple t = <2, 8> [element of] [[Alpha].sub.X, sum(Y)](R) in Figure 10(b), the derivation of t is

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

shown in Figure 10(c). Notice that R's subset {<2, 3>, <2, 5>} also satisfies requirements (a) and (b) in Definition 4.1; but it is not maximal. Intuitively, <2, 0> also contributes to the result tuple, since t = <2, 8> [element of] [[Alpha].X, sum(Y)](R) is computed by adding the Y attributes of <2, 3>, <2, 5>, and <2, 0> in R.

[Figure 10 ILLUSTRATION OMITTED]

From Definition 4.1 and the semantics of the operators introduced in Section 4.1, we now specify the actual tuple derivations for each of our operators.

THEOREM 4.4 (TUPLE DERIVATIONS FOR OPERATORS). Let T, [T.sub.1], ..., [T.sub.m] be tables. Recall that [T.sub.i] denotes the schema of [T.sub.i].

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

4.3 Tuple Derivations for Views

As mentioned earlier, a view definition can be thought of as an operator tree that is evaluated bottom-up. Thus, in this section we proceed to define tuple derivations for views inductively based on tuple derivations for the operators comprising the view definition tree. Intuitively, if a base tuple t* contributes to a tuple t' in the intermediate result of a view evaluation, and t' further contributes to a view tuple t, then t* contributes to t. We define a view tuple's derivation to be the set of all base tuples that contribute to the view tuple. The specific process through which the view tuple is derived can be illustrated by applying the view definition tree to the derivation tuple sets, and presenting the intermediate results for each operator in the evaluation.

Definition 4.5 (Tuple Derivation for a View). Let D be a database with base tables [R.sub.1], ..., [R.sub.m] and let V = v(D) be a view over D. Consider a tuple t [element of] V.

(1) v = [R.sub.i]: Tuple t [element of] [R.sub.i] contributes to itself in V.

(2) v = Op([v.sub.1], ..., [v.sub.k]), where [v.sub.j] is a view definition over D, j = 1..k:

Suppose t' [element of] [v.sub.j](D) contributes to t according to the operator Op (by Definition 5.2), and t* [element of] [R.sub.i] contributes to t' according to the view [v.sub.j](recursively, by this definition). Then t* contributes to t according to v.

We define t's derivation in D according to v to be [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], where [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], ..., [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] are subsets of [R.sub.1], ..., [R.sub.m] such that t* [element of] [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] iff t* [element of] [R.sub.i] contributes to t according to v, for i = 1..m. Also, we call [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] t's derivation in [R.sub.i] according to v, denoted [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

The derivation of a view tuple set T contains all base tuples that contribute to any view tuple in the set T:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Theorem 4.2 can be applied inductively in the obvious way to show that view tuple derivations are always unique.

Example 4.6 (Tuple Derivation for a View). Given base table R in Figure 11(a), view V = [[Alpha].sub.X, sum(Y)]([[Sigma].sub.Y [is not equal to] 0](R)) in Figure 11(c), and tuple t = <2, 8> [element of] V, it is easy to see that tuples <2, 3> and <2, 5> in R contribute to <2, 3> and <2, 5> in [[Sigma].sub.Y [is not equal to]0](R) in Figure 11(b), and further contribute to <2, 8> in V. The derivation of t is [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], as shown in Figure 11(d).

[Figure 11 ILLUSTRATION OMITTED]

We now state some properties of view tuple derivations to provide the groundwork for our derivation tracing algorithms.

THEOREM 4.7 (DERIVATION TRANSITIVITY). Let D be a database with base tables [R.sub.1], ..., [R.sub.m] and let V = v(D) be a view over D. Suppose that v can also be represented as V = v'([V.sub.1], ..., [V.sub.k]), where [V.sub.j] = [v.sub.j](D) is an intermediate view over D, for j = 1..k. Given tuple t [element of] V, let [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] be t's derivation in [V.sub.j] according to v'. Then t's derivation in D according to v is the concatenation of all [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]'s derivations in D according to [v.sub.j], j = 1..k:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

where [dot in circle] represents the multiway concatenation of relation lists.(3)

Theorem 4.7 follows from Definition 4.5. It shows that given a view V with a complex definition tree, we can compute a tuple's derivation by recursively tracing through the hierarchy of intermediate views that comprise the tree.

Since we define tuple derivations inductively, based on the view query tree, an interesting question arises: Are the derivations of tuples in two equivalent views also equivalent? Two view definitions (or query trees) [v.sub.1] and [v.sub.2] are equivalent iff [inverted]AD: [v.sub.1](D) = [v.sub.2](D) [Ullman 1989]. We prove in Theorem 4 that given any two equivalent Select-Project-Join (SPJ) views, their tuple derivations are also equivalent.

THEOREM 4.8 (DERIVATION EQUIVALENCE FOR SPJ VIEWS). Tuple derivations of equivalent SPJ views are equivalent. That is, given two equivalent SPJ views [v.sub.1] and [v.sub.2], [inverted]AD: [inverted]At [element of] [v.sub.1](D) = [v.sub.2](D): [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Thus, according to Theorem 4.8, we can transform an SPJ view to a simple canonical form before tracing tuple derivations, and we will exploit this property in our algorithms. Unfortunately, views with aggregation do not have this nice property, as shown in the following example.

Example 4.9 (Derivation Inequivalence for Views with Aggregation). Let [V.sub.1] = [v.sub.1](R) = [[Alpha].sub.X, sum(Y)](R) and [V.sub.2] = [v.sub.2](R) = [[Alpha].sub.X, sum(Y)]([[Alpha].sub.Y [is not equal to] 0](R)) [multiplied by] [v.sub.1] and [v.sub.2] are equivalent, since [inverted]AR, [v.sub.1](R) = [v.sub.2](R). Given base table R in Figures 10(a) and 11(a), Figures 10(b) and 11(c) show that the contents of the two views are the same. However, the derivation of tuple t = <2, 8> [element of] [V.sub.1] according to [v.sub.1] (shown in Figure 10(c)) is different from that according to [v.sub.2] (shown in Figure 11(d)).

Given Definition 4.5, a straightforward way to compute a view tuple's derivation is to compute the intermediate results for all operators in the view definition tree, store the results as temporary tables, then trace the tuple's derivation in the temporary tables recursively until reaching the base tables. Obviously, this approach is impractical due to the computation and storage required for all the intermediate results. In the following sections, we consider SPJ views separately, views with aggregation (ASPJ views), and more general views with set operators. Section 5 shows that one relational query over the base tables suffices to compute tuple derivations for SPJ views. Sections 6 and 7 present recursive algorithms that require a modest amount of auxiliary information for ASPJ and more general view derivation tracing.

5. SPJ VIEW DERIVATION TRACING

Derivations for tuples in Select-Project-Join (SPJ) views can be computed using a single relational query over the base data. In this section, we specify derivation tracing queries for SPJ views and briefly discuss some optimization issues for these queries.

5.1 Derivation Tracing Queries

We can sometimes write a query for a specific view definition v and view tuple t such that if we apply the query to the database D it returns t's derivation in D (based on Definition 4.5). We call such a query a derivation tracing query (or. tracing query) for t and v. More formally, we have the following:

Definition 5.1 (Derivation Tracing Query). Let D be a database and let v be a view over D. Given tuple t [element of] v(D), [TQ.sub.t, v] is a derivation tracing query for t and v iff [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], where [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] is t's derivation in D according to v, and [TQ.sub.t, v] is independent of database instance D. We can similarly define the tracing query for a view tuple set T, and denote it [TQ.sub.T, v](D).

5.2 Tracing Queries for SPJ Views

All SPJ views can be transformed into the form [[Pi].sub.A]([[Sigma].sub.C]([R.sub.1] ?? ... ?? [R.sub.m])) using a sequence of algebraic transformations [Ullman 1989]. We call this form the SPJ canonical form. From Theorem 4.8, we know that SPJ transformations do not affect view tuple derivations. Thus, given an SPJ view, we first transform it into SPJ canonical form, then compute its tuple derivations systematically using a single tracing query based on the canonical form. We first introduce an additional operator used in tracing queries for SPJ views.

Definition 5.2 (Split Operator). Let T be a table with schema T. The operator Split breaks T into a list of tables; each table in the list is a projection of T onto a set of attributes [A.sub.i] [subset or equal to] T, i = 1..m.

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

THEOREM 5.3 (TRACING QUERY FOR AN SPJ VIEW). Let D be a database with base tables [R.sub.1], ..., [R.sub.m] and let V = v(D) = [[Pi].sub.A]([[Sigma].sub.A]([R.sub.1] ?? ... ?? [R.sub.m])) be an SPJ view over D. Given tuple t [element of] V, t's derivation in D according to v can be computed by applying the following query to the base tables:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Given a tuple set T [subset or equal to] V, T's derivation tracing query is

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

where ?? is the relational semijoin operator.

Example 5.4 (Tracing Query for Calif). Recall Q1 over view Calif in Example 3.1, where we asked about the derivation of tuple <Target, pencil, 3000>. Figure 12(a) shows the tracing query for <Target, pencil, 3000> in Calif according to Theorem 5.3. The reader may verify that, by applying the tracing query to the source tables in Figures 1, 2, and 3, we obtain the derivation result in Figure 6.

[Figure 12 ILLUSTRATION OMITTED]

5.3 Tracing Query Optimizations

Clearly, the derivation tracing queries in Section 5.2 can be optimized for better performance. For example, the simple technique of pushing selection conditions below the join operator is especially applicable in tracing queries, and can significantly reduce query costs. Figure 12(b) shows the optimized tracing query for the Calif tuple. If sufficient base table key information is present in the view, the tracing query can be even simpler:

THEOREM 5.5 (DERIVATION TRACING USING KEY INFORMATION). Let [R.sub.i] be a base table with key attributes [K.sub.i], i = 1..m and let view V = [[Pi].sub.A]([[Sigma].sub.C]([R.sub.1] ?? ... ?? [R.sub.m])) include all base table keys (i.e., [K.sub.i] [element of] A, i = 1..m). View tuple t's derivation is [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

According to Theorem 5.5, we can use key information to fetch the derivation of a tuple directly from the base tables, without performing a join. The worst-case query complexity is reduced from O([n.sup.m]) to O(mn), where n is the maximum size of the base tables and m is the number of base tables on which v is defined.

We have shown that tuple derivations for SPJ views can be traced in a simple manner. For more complex views with aggregations or set operators, we cannot compute tuple derivations using a single query over the base tables. In the next two sections, we present recursive tracing algorithms for these views.

6. DERIVATION TRACING FOR ASPJ VIEWS

In this section we consider SPJ views with aggregation (ASPJ views). Although we have shown that no intermediate results are required for SPJ view derivation tracing, some ASPJ views are not traceable without storing or recomputing certain intermediate results. For example, Q2 in Example 3.2 asks for the derivation of tuple t = (5400) in the view Clothing. It is not possible to compute t's derivation directly from store, item, and sales because total is the only column of view Clothing, and it is not contained in the base tables at all. Therefore, we cannot find t's derivation by only knowing that t.total = 5400. In order to trace the derivation correctly, we need tuple <Macy's, 5400> in the intermediate aggregation result to serve as a "bridge" that connects the base tables and the view table.

We introduce a canonical form for ASPJ views in Section 6.1. In Section 6.2, we specify a derivation tracing query for simple "one-level" ASPJ views. We then develop a recursive tracing algorithm for complex ASPJ views and justify its correctness in Section 6.3. As mentioned above, intermediate (aggregation) results in the view evaluation are needed for derivation tracing. Relevant portions of the intermediate results can be recomputed from the base tables when needed, or the entire results can be stored as materialized auxiliary views in the warehouse; this issue is further discussed in Section 10.1. In the remainder of this section we simply assume that all intermediate aggregation results are available.

6.1 ASPJ Canonical Form

Unlike SPJ views, ASPJ views do not have a simple canonical form, since in an ASPJ view definition some selection, projection, and join operators cannot be pushed above or below the aggregation operators [Gupta et al. 1995]. View Clothing in Figure 3 is such an example, where the selection total [is greater than] 5000 cannot be pushed below the aggregation and the selection category = "clothing" cannot be pulled above the aggregation. However, by commuting and combining some SPJ operators [Ullman 1989] it is possible to transform any ASPJ view query tree into a form composed of [Alpha]-[Pi]-[Sigma]-?? operator sequences that we call ASPJ segments. We call this segmented form the ASPJ canonical form. An ASPJ segment may omit operators, although each segment in the ASPJ canonical form except the outermost must include an aggregation operator (otherwise the segment would be merged with an adjacent segment). For purposes of definition, when a unary operator is missing we assume there is a corresponding trivial operator to take its place. The trivial aggregation on table T is [[Sigma].sub.T], the trivial projection is [[Pi].sub.T], and the trivial selection is [[Sigma].sub.true].

Definition 6.1 (ASPJ Canonical Form). Let v be an ASPJ view definition over database D.

(1) v = R is in ASPJ canonical form where R is a base table in D.

(2) v = [[Alpha].sub.G, aggr(B)]([[Pi].sub.A]([[Delta]].sub.C]([v.sub.1] ?? ... ?? [v.sub.k]))) is in ASPJ canonical form if [v.sub.j] is an ASPJ view in ASPJ canonical form with a nontrivial topmost aggregation operator, j = 1..k.

As mentioned in Section 4, although we can trace any view's derivation by tracing through one operator at a time, based on the original view definition, it requires us to store or recompute the intermediate results of every operator, which can be extremely expensive. Transforming an ASPJ view definition into canonical form allows us to store or recompute fewer intermediate results, by tracing through all operators in each segment together, as we will see in the next section.

6.2 Derivation Tracing Queries for One-Level ASPJ Views

A view defined by one ASPJ segment is called a one-level ASPJ view. We can use one query to trace tuple derivations for a one-level ASPJ view, similar to SPJ views.

THEOREM 6.2 (TRACING QUERY FOR A ONE-LEVEL ASPJ VIEW). Given a one-level ASPJ view, V = v([R.sub.1], ..., [R.sub.m]) = [Alpha].sub.G, aggr(B)]([[Pi].sub.A]([[Delta].sub.C]([R.sub.1] ?? ... ?? [R.sub.m]))), and given tuple t [element of] V, t's derivation in [R.sub.1], ..., [R.sub.m] according to v can be computed by applying the following query to the base tables:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Given tuple set T [subset or equal to] V, T's derivation tracing query is

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Here, too, evaluation of the tracing query can be optimized in various ways, as discussed in Section 5.3.

Recall from Section 2 that Kawaguchi et al. [1997] also define a notion of tuple derivations. The views they consider correspond to our one-level ASPJ views. In Kawaguchi et al. [1997] the derivation set t [element of] V is defined as <[R'.sub.1], ..., [R'.sub.m]>, where [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] ([R.sub.i]) such that [C.sub.i] [subset of equals to] C and [G.sub.i] [subset of equals to] G are the selection conditions and groupby attributes local to [R.sub.i]. Although this approach localizes derivation tracing to a single base table at a time, derivation sets as defined in Kawaguchi et al. [1997] are not as accurate as our tuple derivations: Those tuples in [R'.sub.i] that do not join with the other tables could not have participated in the derivation of t.

6.3 Derivation Tracing Algorithm for Multilevel ASPJ Views

Given a general ASPJ view definition, we first transform the view into ASPJ canonical form, divide it into a set of ASPJ segments, and define an intermediate view for each segment.

Example 6.3 (ASPJ Segments and Intermediate Views for Clothing). Recall the view Clothing in Example 3.2. We can rewrite its definition in ASPJ canonical form with two segments, and introduce an intermediate view AllClothing, as shown in Figure 13.

[Figure 13 ILLUSTRATION OMITTED]

We then trace a tuple's derivation by recursively tracing through the hierarchy of intermediate views top-down. At each level, we use the tracing query for a one-level ASPJ view to compute derivations for the current tracing tuples with respect to the views or base tables at the next level below. Details follow.

6.3.1 Algorithm. Figure 14 presents our basic recursive derivation tracing algorithm for a general ASPJ view. Given a view definition v in ASPJ canonical form and tuple t [element of] v(D), procedure TupleDerivation(t, v, D) computes the derivation of tuple t according to v over D. The main algorithm, procedure TableDerivation(t, v, D), computes the derivation of a tuple set T [subset of equals to] v(D) according to v over D. As discussed earlier, we assume that v = v'([V.sub.1], ..., [V.sub.k]), where v' is a one-level ASPJ view and [V.sub.j] = [v.sub.j](D) is available as a base table or an intermediate view, j = 1..k. The procedure first computes T's derivation [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] in <[V.sub.1], ..., [V.sub.k]> using the one-level ASPJ view tracing query TQ(T, v', <[V.sub.1], ..., [V.sub.m]>) from Theorem 7. It then calls procedure TableListDeriv [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], {[v.sub.1], ..., [v.sub.k]}, D), which computes (recursively) the derivation of each tuple set [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] according to [v.sub.j], j = 1..k, and concatenates the results to form the derivation of the entire list of view tuple sets.

Fig. 14 Derivation tracing algorithm for ASPJ views.

procedure TupleDerivation (t, v, D) in: a tracing tuple t, a view definition v, and a database D out: t's derivation in D according to v 1 return (TableDerivation({t}, v, D)); procedure TableDerivation(T, v, D) in: a tracing tuple set T, a view definition v, and a database D out: T's derivation in D according to v 1 if v = R [element of] D then return ((T)); 2 //otherwise v = v'([v.sub.1], ..., [v.sub.k]) where v' is a one-level ASPJ view 3 // [V.sub.j] = [v.sub.j](D) is an intermediate view or a base table, j - 1..k 4 ([MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]) [left arrow] TQ(T,v',{[V.sub.1], ..., [V.sub.k]}); 5 return (TableListDeriv([MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]), {[v.sub.1], ..., [v.sub.k]}, D)); procedure TableListDeriv(<[T.sub.1], ..., [T.sub.k]>, {[v.sub.1], ..., [v.sub.k]}, D) in: a list of tuple sets [T.sub.1], ..., [T.sub.k] to be traced, a list of view definitions [v.sub.1] ..., [v.sub.k], and a database D out: the concatenation of [T.sub.i]'s derivations according to [v.sub.i], i = 1..k 1 D* [left arrow] ??; 2 for j [left arrow] 1 to k do 3 D* [left arrow] D*o TableDerivation([T.sub.j], [v.sub.j], D); 4 return (D*);

Example 6.4 (Recursive Derivation Tracing). We divided the view Clothing into two segments in Example 7.3. We assumed that the contents of the intermediate view AllClothing are available (shown in Figure 15). According to our algorithm, we first compute the derivation [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] of < 5400 > in AllClothing to obtain [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] = {<Macy's, 5400>}, and then trace [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] derivation to the base tables to obtain the derivation result in Figure 8.

Fig. 15. AllClothing table.

s_name total Target 1400 Macy's 5400

Note that we do not necessarily materialize complete intermediate aggregation views such as AllClothing. In fact, there are many choices of what (if anything) to store. The issue of storing versus recomputing the intermediate information needed for derivation tracing is discussed in Section 9.1.

6.3.2 Correctness. To justify the correctness of our algorithm, we claim the following:

(1) Transforming a view into ASPJ canonical form does not affect its derivation. We can "canonicalize" an ASPJ view by transforming each segment between adjacent aggregation operators into its SPJ canonical form. The process consists of SPJ transformations only [Ullman 1989]. Theorem 4 shows that derivations are unchanged by SPJ transformations.

(2) It is correct to trace derivations recursively down the view definition tree. From Theorem 3, we know that derivations are transitive through levels of the view definition tree. Thus, when tracing tuple derivations for a canonicalized ASPJ view, we can first divide its definition into one-level ASPJ views and then compute derivations for the intermediate views in a top-down manner.

So far we have introduced a simple tracing query for SPJ views and one-level ASPJ views and a recursive tracing algorithm for general ASPJ views. In the next section we consider derivation tracing for even more general views, including the set operators union and difference.

7. DERIVATION TRACING FOR VIEWS WITH SET OPERATORS

In this section we extend our derivation tracing algorithm for general ASPJ views that allow arbitrary use of set union and difference operators in the view definition. In Section 7.1 we briefly review tuple derivations for set operators, and provide an example justifying the definition for the difference operator. In Section 7.2 we incorporate set operators into our view definition canonical form, and we present a procedure to transform any ASPJ view with set operators (referred to as a general view) into canonical form. Finally, in Section 7.3 we present a recursive algorithm that traces tuple derivations for general views.

7.1 Tuple Derivations for Set Operators

Recall from Theorem 4.4 that the tuple derivations for set operators are

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

For the union operator, given t [element of] [T.sub.1] [union] ... [union] [T.sub.m], each tuple t from any input table [T.sub.1], ..., [T.sub.m] contributes to t. For the difference operator, given t [element of] [T.sub.1] - [T.sub.2], the tuple t from [T.sub.1] and all tuples in [T.sub.2] contribute to t. Recall from Definition 4.1 that a tuple's derivation fully explains why the tuple appears in the operation result. In particular, a tuple t appears in [T.sub.1] - [T.sub.2] not only because [exists]t [element of] [T.sub.1], but also because ?? [exists]t [element of] [T.sub.2]. All tuples in [T.sub.2] contribute to t [element of] [T.sub.1] - [T.sub.2] in the sense that they ensure ?? [exists]t [element of] [T.sub.2]. Example 7.1 further explains the necessity of including [T.sub.2] in the derivation of t [T.sub.1] - [T.sub.2].

Example 7.1 (Tuple Derivation for Difference). Consider tables R, S, T in Figure 16(a) and view V = R - (S - T) in Figure 16(b). Although we have not yet given our tracing algorithm for general views, the derivation of tuple t = (2) [element of] V using our algorithm (and consistent with the tuple derivation definitions given above) is as shown in Figure 16(c). Notice that tuple (2) would not have appeared in V without tuple (2) in T. So <2> [element of] T obviously contributes in a significant way to (2) [element of] V. In general, contributions of this sort are not exhibited in lineage tracing for a tuple t unless we include in t's derivation all tuples in (or contributing to) the second operand of the set difference operator.

[Figure 16 ILLUSTRATION OMITTED]

7.2 Canonical Form for General Views

We now define an extended canonical form that accommodates views with union and difference operators, as well as aggregate, select, project, and join operators. The extended canonical form is composed of two types of segments: AUSPJ-segments, which are operator sequences in the form [Alpha]-[union]-[Pi]-[Delta]-??, and D-segments, which contain a single difference operator. An AUSPJ-segment may omit any of the operators in the operator sequence [Alpha]-[union]-[Pi]-[Delta]-??, but it must satisfy one of the following:

(1) It has an aggregate operator on the top.

(2) It is directly below a D-segment.

(3) It is below a join operator and has a union operator on the top.

(4) It is the top segment in the view definition, which means that no more segments lie above this segment.

When a [Alpha], [Pi], or [Sigma] operator is missing from an AUSPJ-segment, we assume, for purposes of definition that a trivial operator is present, as in Section 6.1. In Figure 17 we provide a procedure, called canonicalize that transforms a general view definition tree into canonical form, divides the transformed view definition into segments, and associates an intermediate view with each segment. We perform canonicalizing transformations on the basis of the following equivalences:

(1) pulling up [union]: [[Pi].sub.A] (R [union] S) [equivalent] [[Pi].sub.A](R) [union] [[Pi].sub.A](S) and [[Sigma].sub.C](R [union] S) [equivalent] [[Sigma].sub.C](R) [union] [[Sigma].sub.C](S).

(2) pulling up [Pi]: [[Sigma].sub.C]([[Pi].sub.A](R)) [equivalent] [[Pi].sub.A](R)) and [[Pi].sub.A](R) [??.sub.[Theta]][[Pi].sub.B](S) [equivalent] [[Pi].sub.A [union] B] (R [??.sub.[Theta]] S).(4)

(3) pulling up [Sigma]: [[Sigma].sub.C](R) ?? [[Sigma].sub.C'](S) [equivalent] [[Sigma].sub.C [conjunction] C'](R ?? S).

As in Figure 17, we first pull up the union operators in the view definition tree as far as possible. We then pull up projection and selection operators, and finally merge the same algebra operators when they appear adjacent to each other. Notice that in the canonicalizing procedure, we pull projection and selection operators above join operators, so that we can compress the operator tree into as few segments as possible. This helps to reduce the total tracing cost, and possibly the number of intermediate views that are stored. However, when we actually apply the tracing query for each segment to obtain a derivation, we can sometimes push selection and projection operators back below the join to reduce the cost of the tracing query, as discussed in Section 5.3.

Fig. 17. Canonicalizing general view definitions.

procedure Canonicalize([v.sub.0]) in: a general view definition tree [v.sub.0] out: a canonicalized view definition v with an intermediate view associated with each segment in v 1 copy [v.sub.0] to v; 2 pull union operators above selections and projections in v; 3 pull projection operators above joins and selections in v; 4 pull selection operators above joins in v; 5 merge the same algebra operators adjacent to each other in v; 6 divide v into segments; 7 define an intermediate view over each segment in v; 8 return (v);

Having obtained the canonicalized definition tree, we divide it into AUSPJ-segments and D-segments, based on the following rules:

(1) Every aggregation and difference operator begins a segment.

(2) Every nonleaf child of a difference or join operator begins a segment.

(3) The topmost node begins a segment.

Finally, we define an intermediate view for each segment. As with multilevel ASPJ views, when tracing tuple derivations for a general view, each segment (intermediate view) becomes a "tracing unit." That is, we recursively trace tuple derivations down the view definition tree, through one segment at a time, until we reach the base tables.

THEOREM 7.2 (EXTENDED CANONICAL FORM). Procedure Canonicalize in Figure 17 returns a view v in canonical form. v is equivalent to the given view [v.sub.0], and the two views have equivalent tuple derivations.

Example 7.3 (Canonical Form for a View with Set Operators). Consider the general view definition in Figure 18(a). Figure 18(b) shows the canonical form and its segments. Notice that ([[Sigma].sub.11] is pulled up and merged with [[Sigma].sub.1] to obtain ([[Sigma].sub.1 [conjunction] 11], [[Pi].sub.8] is subsumed by [[Pi].sub.5], and [[union].sub.6] is merged into [[union].sub.6].

[Figure 18 ILLUSTRATION OMITTED]

7.3 Derivation Tracing Algorithm for General Views

Once we have obtained a canonicalized view definition, we can trace its tuple derivations through one segment at a time. Theorem 7.4 presents the derivation tracing query for a one-level AUSPJ view (or an AUSPJ segment). Tuple derivations for a D-segment are traced based on Theorem 4.4

THEOREM 7.4 (TRACING QUERY FOR A ONE-LEVEL AUSPJ VIEW). Consider a one-level AUSPJ view [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. Given tuple t [element of] V, t's derivation according to v can be computed by applying the following query to the base tables:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Given tuple set T [subset or equal to] V, T's derivation tracing query is

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

For the special case where the traced tuple set is the entire view table V (which will appear later in our recursive tracing algorithm for general views), we use a flag "ALL" to specify that the entire view table is to be traced, and the tracing query can be simplified by removing the semijoin

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

The recursive tracing algorithm for ASPJ views (Figure 14) can now be extended to handle general views by modifying the procedure TableDerivation. The new specification for TableDerivation(T, v, D) is given in Figure 19. As we recursively trace down a canonicalized view definition tree, at each level we are tracing a view (or a base table) v that has one of three forms: (1) v = R where R is a base table; (2) v = v'([v.sub.1], ..., [v.sub.k]) where v' is an AUSPJ-segment and [v.sub.i] is an intermediate view or a base table, i = 1..k; (3) v = [v.sub.1] - [v.sub.2] where [v.sub.i] is an intermediate view or a base table, i = 1, 2. These three cases map to the case statement in Figure 9. For derivation tracing, we may need to store or recompute the contents of some intermediate views, including the contents of the [v.sub.i]'s in case (2) and v in case (3). For case (2), to trace the derivation of T according to v, we first apply the AUSPJ tracing query TQ from Theorem 7.4 to [V.sub.1], ..., [V.sub.k] in order to obtain t's derivation [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] in [V.sub.i], i = 1..k. We then recursively trace the derivations of the [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] through lower segments. For case (3), we trace the derivation of T according to [v.sub.1] and the derivation of all tuples in [v.sub.2](D) according to [v.sub.2], as discussed in Section 7.1. Recall from Theorem 7.4 that we do not need to store or recompute the actual view table [v.sub.2](D). Instead, we use a flag "ALL" to specify that the entire view table is to be traced.

Fig. 19. Derivation tracing algorithm for general views.

procedure TableDerivation(T, v, D) in: a tracing tuple set T (or the special symbol ALL), a view definition v, and a database D out: T's derivation in D according to v 1 case v = R [element of] D: 2 if T = ALL then return (<R>); else return (<T>); 3 case v - v'([v.sub.1], ..., [v.sub.k]): 4 // v' is a one-level ASPJ view 5 // [V.sub.j] = [v.sub.j](D) is an intermediate view or a base table, j = 1..k 6 [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] 7 return ([MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]; 8 case v = [v.sub.1] - [v.sub.2]: 9 // V = v(D) 10 // [V.sub.j] = [v.sub.j](D) is an intermediate view or a base table, j = 1,2 11 if T = ALL then T [left arrow] V; 12 return (TableListDeriv((T, ALL), <[v.sub.1], [v.sub.2]>, D));

8. DERIVATION TRACING WITH BAG SEMANTICS

So far we have addressed the derivation tracing problem using set semantics: the base tables are assumed to be sets, and the view is defined using operators with set semantics so that all operation results, including the final view table, are also sets. In this section we extend our definition for tuple derivations as well as our tracing algorithms to consider bag semantics. That is, we treat base tables and the results of [Sigma], [Pi] and ?? operations as bags, and we use bag union (w) and bag difference (??).(5) Duplicate elimination, if desired, can be achieved using our aggregation (a) operator.

Duplicates can occur in base tables and/or in views. Consider the four scenarios in Figure 20. Case (1) represents the scenario we have considered so far. Case (4) represents the most general scenario to be addressed in this section, and case (2) will use the same algorithms as (4). Case (3), in which views may have duplicates but base tables are known to have keys, allows us to perform certain optimizations related to the key-based example we gave earlier in Section 5.3. In fact, we may choose to transform a case (4) scenario into case (3) by attaching system-generated tuple IDs to the base data in order to apply a more efficient tracing procedure.

Fig. 20. Base table and view scenarios.

base tables no duplicates duplicates possible no duplicates (1) (2) view duplicates possible (3) (4)

In general, the derivation tracing problem for set semantics considered so far is subsumed by the problem we are about to address for bag semantics. However, as we will see, the solutions we gave for set semantics are simpler and more efficient than those for bag semantics, so the separate treatment is worthwhile.

8.1 Tuple Derivations for Bag Semantics

Consider an operator Op with bag semantics. Given N copies of a tuple t in Op([T.sub.1], ..., [T.sub.m]), there are at least N ways to derive t from the [T.sub.i]'s. Consider the following examples.

(1) Let R = {<a>, <a>}, S = {<b>} and V = R x S = {<a, b>, <a, b>}. A tuple <a, b> in V may be derived from the first <a> in R and the <b> in S, or from the second <a > in R and the <b > in S.

(2) Let R(A, B) = {<a, b>, <a, c>} and V= [[Pi].sub.A](R) = {<a>, <a>} where [Pi] preserves dulipcates. A tuple <a> in V may be derived from <a, b> or <a, c> in R.

(3) Let R = {<a>}, S = {<a>} and V = R w S = {<a>, <a>}. A tuple <a> in V may be derived from the <a> in R or the <a> in S.

(4) Let R = {<a>, <a>}, S = {<a>} and V = R ?? S = {<a>}. The (single) tuple <a> in V may be derived from the first or the second <a> in R. Thus, often, there is no longer a unique derivation of t according to Definition 4.1. Given any operator other than bag difference, if there are N copies of t in the result then there are exactly N derivations of t, each of which derives one copy of t as in examples 1-3 above. Given a bag difference operator, the number of derivations of t may exceed the number of t's in the result, as in example 4 above.

More generally, given a view v, a tuple t [element of] v(D) may have multiple derivations according to v. We may sometimes want to retrieve the derivations of a tuple one by one. At other times we may prefer a complete set of contributing tuples, without distinguishing individual derivations. We introduce the concepts of derivation set and derivation pool in Definitions 8.2 and 8.3 to cover both cases. First, we redefine tuple derivations for a view under bag semantics, since our previous definition (Definition 4.5), although clearer, produces only derivation pools under bag semantics and not individual derivations. Definition 8.1 subsumes Definition 4.5.

Definition 8.1 (Tuple Derivations for a View). Let D be a database with base tables [R.sub.1], ..., [R.sub.m] and let V = v(D) be a view over D. Consider a tuple t [element of] V.

(1) v = [R.sub.i]: {t} is a derivation of t according to v.

(2) v = Op([v.sub.1], ..., [v.sub.k]), where [v.sub.j] is a view definition over D, j = 1..k:

Suppose ([MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]) is a derivation of t according to Op (by Definition 5.2), and <[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]> is a derivation of [T.sub.i]* according to [v.sub.j] over (recursively, by this definition), j = 1..k. Then <[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]> is a derivation of t according to v.

Derivations for a tuple set T are constructed from all possible combinations of derivations for the tuples in T such that the derivations selected for any [t.sub.1] = [t.sub.2] [element of] T are different.

As we can see, the number of derivations for a tuple set T may be exponential in the number of tuples in T: if T contains n tuples, and each tuple has up to m derivations, then T may have as many as [m.sup.n] derivations. Thus, enumerating all derivations for a tuple set (Definition 8.2) can be impractical, and we may prefer to trace the tuple set's derivation pool (Definition 8.3).

Definition 8.2 (Derivation Set). Let D be a database with base tables [R.sub.1], ..., [R.sub.m] and let V = v(D) be a view over D. Given a tuple t [element of] V, the set of all of t's derivations according to v is called the derivation set of t according to v, denoted [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII](t).(6) The derivation set of a tuple set T [subset or equal to] Op([T.sub.1], ..., [T.sub.m]) is the set of all derivations of T, denoted [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (T).

Definition 8.3 (Derivation Pool). Let D be a database with base tables [R.sub.1], ..., [R.sub.m] and let V = v(D) be a view over D. Given a tuple t [element of] V, let <[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]) contain all tuples in any of t's derivations (based on Definition 4.5). ([MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]) is called the derivation pool of t according to v, denoted [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII](t). The derivation pool of a tuple set T [subset or equal to] V contains all tuples in the derivation pool of any tuple in T, and is denoted [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII](T).

Notice that a view tuple's derivation set and derivation pool contain exactly the same collection of base tuples, but organized in different ways. As we will see, their tracing procedures will differ considerably.

Example 8.4 (Multiple Derivations, Derivation Set, Derivation Pool). Consider a view Stationery defined over our original tables store, item, and sales from Figures 1-3: Stationery = [Pi]s-name, i-name ([Sigma]category="stationery"(store ?? item ?? sales)), where now [Pi] does not eliminate duplicates. Figure 21 shows Stationery's contents. A single tuple t = <Target, pencil> in Stationery has two derivations, [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] and [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], as shown in Figure 22, so t's derivation set is {[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]}. Figure 23 shows t's derivation pool.

Fig. 21. Stationery contents.

s_name i_name Target binder Target pencil Target binder Target pencil

Fig. 22. Multiple derivations of <Target, pencil>.

[D*.sub.1] = store s_id s_name city state 001 Target Palo Alto CA item i_id i_name category 0002 pencil stationery sales s_id i_id price num 001 0002 1 3000 [D*.sub.2] = store s_id s_name city state 002 Target Albany NY item i_id i_name category 0002 pencil stationery sales s_id i_id price num 002 0002 2 2000

Fig. 23. Derivation pool for (Target, pencil).

store s_id s_name city state 001 Target Palo Alto CA 002 Target Albany NY item i_id i_name category 0002 pencil stationery sales s_id i_id price num 001 0002 1 3000 002 0002 2 2000

We are now considering two modes of derivation tracing: tracing individual derivations and tracing derivation pools. We can prove that the property of view and derivation equivalence after our canonicalizing transformations (Theorem 7.2) still holds for bag semantics, derivation sets, and derivation pools. Thus, we can still transform a view into canonical form before tracing derivation sets or pools. The transitive property (Theorem 4.7) also applies to derivation sets and pools under bag semantics, which allows us to trace derivations down the view definition tree recursively. In addition, we prove the following theorems, which provide the groundwork for our later tracing algorithms.

THEOREM 8.5 (DERIVATION UNIQUENESS FOR UNIQUE VIEW TUPLES). Let v be a general view over database D with bag semantics and no difference operators. If a tuple t [element of] v(D) has no duplicates, then t has a unique derivation in v. As a result, in this case [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

THEOREM 8.6 (DERIVATION POOL OF A SET OF SAME-VALUED TUPLES). Let v be any general view over database D. If all tuples in a tuple set T [subset or equal to] v(D) have the same value t, then [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. This also implies that [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

THEOREM 8.7 (DERIVATION POOL OF SELECTED PORTION IN VIEW TABLE). Let v be any general view over database D. To trace the derivation pool of a selected portion [[Sigma].sub.C](v(D)) of the view table, we can define another view v' = [[Sigma].sub.c](V), and trace the derivation of the entire view table v' (D) according to v'. In other words:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

where v' = [[Sigma].sub.C](V).

In Sections 8.2, 8.3, and 8.4, we develop techniques for (1) tracing derivation pools; (2) tracing derivation sets; and (3) associating a unique derivation with each view tuple using existing or system-generated base table keys.

8.2 Tracing Derivation Pools

Tracing derivation pools for views with bag semantics is similar to tracing derivations for views with set semantics as specified in Sections 5-7. We begin by specifying the derivation pools for all of the relational operators, using bag semantics.

THEOREM 8.8 (DERIVATION POOLS FOR OPERATORS WITH BAG SEMANTICS). Let T, [T.sub.1], ..., [T.sub.m] be tables. Recall that [T.sub.i] denotes the schema of [T.sub.i]. The derivation pools [Op.sup.-p] for [Sigma], [Pi], ??, and [Alpha] are the same as their derivations [Op.sup.-1] as specified in Theorem 2, except that {t} is replaced by [[Sigma].sub.T=t(T)], and t.[T.sub.i] is replaced by [[Sigma].sub.[T.sub.i]=t.[T.sub.i]]([T.sub.i]).

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

The derivation pool tracing query for one-level AUSPJ views with bag semantics is obtained by replacing the Split operator in Theorem 9 with a targeted split (TSplit) operator.

Definition 8.9 (TSplit Operator). Let T be a table with schema T and let [T.sub.i] be a table with schema [T.sub.i] [subset or equal to] T, i = 1..m. We define the targeted split of T by [T.sub.1], ..., [T.sub.m] to be

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Note that we need to use semijoins with the data in [T.sub.1], ..., [T.sub.n] instead of simple projections based on schema (which we used in the original Split operator), because semijoin is the only way to produce the correct number of duplicates in the derivation result. However, since TSplit is less efficient than Split, we recommend using the original tracing query when the base tables are known to have no duplicates.

The general derivation pool tracing procedure is obtained by making only small modifications to procedure TableDerivation from Figure 19, shown as the underlined portions in Figure 24. We replace T with R ?? T in the return value for the case v = R to retrieve all tuples in R that are also in T, including duplicates. The tracing query TQ is now using TSplit in Definition 8.9 instead of using Split from Theorem 7.4. Finally, for D-segments (v = [v.sub.1] ?? [v.sub.2]), we separate the case where T contains only tuples with the same value t. When all tuples have the same value, the recursive procedure call directly follows the derivation pool equation in Theorem 8.8. However, if two tuples in T have distinct values, [t.sub.1] and [t.sub.2] say, then we still need to include any copies of [t.sub.1] that might appear in [v.sub.2] (or derivations of such tuples), since any [t.sub.1]'s in [v.sub.2] are part of [t.sub.2]'s derivation pool in v. A similar argument holds with [t.sub.1] and [t.sub.2] reversed.

Fig. 24. Derivation pool tracing algorithm.

procedure TableDerivation(T, v, D) in: a tracing tuple set T, a view definition v, and a database D out: T's derivation pool in D according to v 1 case v = R [element of] D: 2 if T = ALL then return (<R>); 3 else return (<R ?? T>); 4 case v = v'([v.sub.1], ..., [v.sub.k]): 5 // v' is a one-level AUSPJ view 6 // [V.sub.j] = [v.sub.j](D) is an intermediate view or a base table, j = 1..k 7 [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] 8 return [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]; 9 case v = [v.sub.1] ?? [v.sub.2]: 10 // V = v(D) 11 // [V.sub.j] = [v.sub.j](D) is an intermediate view or a base table, j = 1,2 12 if T = ALL then T [left arrow] V; 13 if T contains only tuples with value t 14 then return (TableListDeriv({T, ALL), [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], D)); 15 else return (TableListDeriv((T, ALL), <[v.sub.1], [v.sub.2]>, D));

8.3 Tracing Derivation Sets

Unlike the derivation pool for a tuple t, t's derivation set separates its distinct derivations. In this section we present a derivation enumeration technique that produces t's derivations one by one, thus generating t's entire derivation set.

Procedure DerivEnum(t, v, D) in Figure 25 traces the derivation set of a tuple t according to a general view v. We assume that v is already in canonical form (Section 7.2). However, when recursively tracing down the view definition tree, we treat AUSPJ segments with an omitted (i.e., trivial) aggregation operator separately. For clarity of presentation, we also separate the bag union node from the SPJ subtree in these segments. Thus, each (intermediate) view v we trace through during the derivation enumeration procedure has one of the following five forms: (1) v = R; (2) v = [[Pi].sub.A]([[Sigma].sub.C]([v.sub.1] ?? ... ?? [v.sub.k])); (3) v = [v.sub.1] ?? ... ?? [v.sub.k]; (4) v = v'([v.sub.1], ..., [v.sub.k]); and (5) v = [v.sub.1] ?? [v.sub.2], where R is a base table in database D, [v.sub.1] is an intermediate view or a base table, i = 1..k, and v' is an AUSPJ segment with a nontrivial aggregation node. These cases map to the five cases in Figure 25, in order.

Fig. 25. Derivation enumeration algorithm.

procedure DerivEnum(t, v, D) in: a tracing tuple t, a view definition v, and a database D out: the set of all t's derivations in D according to v 1 DS [left arrow] ??; 2 case v = R [element of] D: 3 for each tuple in [Sigma]R = t(R) do 4 insert ({t})into DS; 5 case v = [[Pi].sub.A]([[Sigma].sub.C]([v.sub.1] ?? ... ?? [v.sub.k])): 6 Pending [left arrow] [[Sigma].sub.C][conjunction]A=t] ([v.sub.1](D) ?? ... ?? [v.sub.k](D)); 7 while Pending [is not equal to] ?? do 8 begin 9 pick a tuple t' in Pending; 10 for j [left arrow] 1 to k do 11 [DS.sub.j] [left arrow] DerivEnum(t'.[V.sub.j], [v.sub.j], D); 12 insert all elements of [DS.sub.1] x ... x [DS.sub.k] into DS; 13 remove t' and all its duplicates from Pending; 14 end 15 case v = [v.sub.1] ?? ... ?? [v.sub.k]: 16 for j [left arrow] 1 to k do 17 insert all elements of DerivEnum(t, [v.sub.j], D) into DS; 18 case v = v'([v.sub.1], ... [v.sub.k]): // v' is an AUSPJ view with aggregation: 19 [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] 20 for j [left arrow] 1 to k do 21 [DS.sub.j] [left arrow] TableDerivEnum [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]; 22 DS [left arrow] [DS.sub.1] x ... x [DS.sub.k]; 23 case v = [v.sub.1] ?? [v.sub.2]: 24 DS [left arrow] DerivEnum(t, [v.sub.1], D) x TableDerivEnum(ALL, [[Sigma].sub.2[is not equal to]t] ([v.sub.2]), D); 25 return (DS); procedure TableDerivEnum(T, v, D) in: a tracing tuple set T, a view definition v, and a database D out: the set of all T's derivations in D according to v 1 let T = {[t.sub.1], ..., [t.sub.n]}; 2 for i [left arrow] 1 to n do 3 [DS.sub.i] [left arrow] DerivEnum([t.sub.]i, v, D); 4 DS [left arrow] ??; 5 for each element [D.sub.1], ..., [D.sub.n] in [DS.sub.1] x ... x [DS.sub.n] do 6 if [inverted]Ai [is not equal to] j: [D.sub.i] [is not equal to] [D.sub.j] then insert [D.sub.1][union] ... [union][D.sub.n] into DS; 7 return (DS);

For case (1), according to Definition 4.5 each copy of t in R forms a derivation {t} of t in V. For ease (2), we first compute t's derivations in [v.sub.1](D), ..., [v.sub.k](D) based on Theorem 8.10.

THEOREM 8.10 (DERIVATION SET FOR AN SPJ VIEW). Given an SPJ view V = v([T.sub.1], ..., [T.sub.m]) = [[Pi].sub.A]([[Sigma].sub.C]([T.sub.1] ?? ... ?? [T.sub.m])) and a tuple t [element of] V, t's derivation set according to v is

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

We initialize a Pending set to be [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. According to Theorem 8.10, [inverted]At' [element of] Pending: [D.sup.*] = <{t'.[V.sub.1]}, ..., {t'.[V.sub.k]}> forms a derivation of t in [v.sub.1](D), ..., [v.sub.k](D). We then trace the derivation set [DS.sub.j] of t'.[V.sub.j] according to [v.sub.j], for j = 1..k. Based on Definition 4.5, we then append the cross-product of the [DS.sub.j]'s to t's derivation set, where [DS.sub.1] x [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. We remove t' and its duplicates from the Pending set, and repeat the above process until the Pending set becomes empty.

For case (3), where v is a bag union of [v.sub.1], ..., [v.sub.k], each t in any [v.sub.i](D) is a derivation of t [element of] v(D) according to the bag union operator, so each derivation of t according to [v.sub.i] forms a derivation of t according to v. Therefore, we simply enumerate through t's derivations according to each [v.sub.i], adding them to the derivation set.

Consider case (4), where v is defined by an AUSPJ segment v' over [v.sub.1], ..., [v.sub.k] with a nontrivial aggregation operator. Since v has no duplicates due to the aggregation, by Theorem 8.5, t has a unique derivation in [V.sub.1], ..., [V.sub.k] according to v'. This derivation [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] is obtained by tracing t in [V.sub.1], ..., [V.sub.k] according to v'. We then trace the derivation set [DS.sub.j] for each [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] according to [v.sub.j]. (More on this step below.) The cross-product of the [DS.sub.j's] forms the derivation set of t according to v.

For case (5), where v is a bag difference of [v.sub.1] and [v.sub.2], each derivation oft according to [v.sub.1] and each derivation of [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] ([v.sub.2](D)) according to [v.sub.2] together form a derivation of t according to v. Thus, t's derivation set is the cross-product of t's derivation set according to [v.sub.1] and [T.sub.2]'s derivation set according to [v.sub.2].

Procedure TableDerivEnum(T, v, D) in Figure 25 is called to trace the derivation set of a tuple set T according to a view v. Recall from Definition 8.1 that derivations for a tuple set T are constructed from all possible combinations of derivations for the tuples in T such that the derivations selected for any [t.sub.1] = [t.sub.2] [element of] T are different. Let T have n tuples. We first trace the derivation set [DS.sub.i] for each tuple [t.sub.i] [element of] T, i = 1..n. Then, for every combination of distinct derivations [D.sub.1], ..., [D.sub.n] from [DS.sub.1], ..., [DS.sub.n], we add [D.sub.1][union] ... [union][D.sub.n] to the derivation set for T. Note that this procedure can be extremely expensive if T is large. While calling TableDerivEnum is necessary in the general case, in many cases we can replace it with the much cheaper TableDerivation (Figure 24). In case (4), if v does not contain any difference operators, then by Theorem 8.5, t has a unique derivation in D according to v. Hence, we can use t's derivation pool obtained by calling TableDerivation, since in this case the derivation pool is the same as the derivation set. Likewise, in case (5), if [v.sub.2] does not contain any difference operators, we can replace TableDerivEnum with TableDerivation to obtain the single derivation for ALL.

8.4 Using Key Information

We now propose an alternative approach to tracing view tuple derivations in the presence of duplicates for multilevel ASPJ views. Suppose for now that view v's base tables all have keys, i.e., case (3) in Figure 20. We can extend v's definition using the key information to obtain a supporting view v' such that v', as well as all of its intermediate results, has no duplicates. Then, after mapping each tuple t [element of] v(D) to a distinct tuple t' [element of] v'(D), we can retrieve a unique derivation for t by tracing the derivation of t' according to v' using the tracing algorithm for set semantics. In other words, we use base table keys to assign a unique derivation to each copy of each view tuple. If we are not interested in individual unique derivations, this technique is still useful for tracing derivation sets and pools as in previous sections in a more efficient manner.

THEOREM 8.11 (SPJ VIEW DEFINITION WITH KEYS). Let D be a database with tables [R.sub.1], ..., [R.sub.m], and let V = v(D) = [[Pi].sub.A]([[Sigma].sub.C]([R.sub.1] ?? ... ??[R.sub.m])) be an SPJ view over D. Suppose each [R.sub.i] has a set of attributes [K.sub.i] that are a key for [R.sub.i], i = 1..m. Then we can define a view V' = v' (D) = [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]([[Sigma].sub.C]([R.sub.1] ?? ... ?? [R.sub.m])) that contains no duplicates. If V contains n copies of a tuple t, then there are n tuples [t.sub.1], ..., [t.sub.n] in V' such that [t.sub.j].A = t, j = 1..n. The derivation of each [t.sub.j] according to v' is also a derivation of t according to v, and [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

According to Theorem 8.11, we can map each copy of a tuple t [element of] V to a distinct tuple [t.sub.j] [element of] V' such that [t.sub.j].A = t. Then we can associate a unique derivation with each copy of t by tracing the corresponding [t.sub.j]'s derivation. We can also retrieve the entire derivation set (or pool) for t by tracing the derivations for all the [t.sub.j]'s. Recall from Theorem 6 that derivation tracing for views that include base table keys (e.g., view v') takes time at most linear in the size of [R.sub.1], ..., [R.sub.m], which is generally much more efficient than the derivation enumeration algorithm in Section 8.3.

Example8.12 (SPJ View Extended with Keys). Consider again the problem of tracing the derivation of <Target, pencil> according to view Stationery in Example 8.4. Suppose table store has key s_id, item has key i_id, and sales has key <s_id, i_id>. We first define a view ExtStationery = [[Pi].sub.s-name, i-name, s-id, i-id]([[Sigma].sub.category="stationery"](store ?? item ?? sales)). Figure 26 shows the view contents. Let us map the traced tuple t = <Target, pencil) in Stationery to t' = <Target, pencil, 001,0002> in ExtStationery. We retrieve t"s derivation according to ExtStationery and obtain the result in Figure 27. We can similarly map another copy of <Target, pencil> in Stationery to t" = <Target, pencil, 002, 0002> in ExtStationery, and retrieve the second derivation, shown in Figure 28. Note that Figures 27 and 28 are identical to the two derivations in Figures 22, as desired.

Fig. 26. ExtStationery contents.

s_name i_name s_id i_id Target binder 001 0001 Target pencil 001 0002 Target binder 002 0001 Target pencil. 002 0002

Fig. 27. Derivation of t'.

store s-id s-name city state 001 Target Palo Alto CA item i_id i_name category 0002 pencil stationery sales s_id i_id price num 001 0002 1 3000

Fig. 28 Derivation of t".

store s_id s_name city state 002 Target Albany NY item i_id i_name category 0002 pencil stationery sales s_id i_id price num 002 0002 2 2000

Given a multilevel ASPJ view V = v(D), if v's top segment contains an aggregation operator then V has no duplicates, and each tuple in V has a unique derivation (Theorem 8.5). If v's top segment contains no aggregation operator, i.e., V = [[Pi].sub.A]([[Sigma].sub.C]([T.sub.1] ?? ... ?? [T.sub.m])) where [T.sub.i] is a base table or an intermediate aggregation view, we can first rewrite the top segment to include [T.sub.i]'s key attributes, as in Theorem 8.11. (For the cases where [T.sub.i] is an intermediate aggregation view, its key is the groupby attributes. Otherwise we use base table keys.) Then, each tuple in the extended view has a unique derivation, which can be traced using the recursive algorithm in Figure 14.

For base tables without keys, we can still apply the techniques introduced in this section by first attaching a system-generated key (e.g., the tuple ID or a surrogate) to each base tuple. We then extend the view definition to include these keys as described earlier. After tracing tuple derivations for the extended view, we project out the system-generated key attributes from the derivation results to obtain the correct derivation according to the original view.

Extending the techniques of this section to general views including the bag operators union and difference is somewhat complex, due in part to the requirement of uniform schemas in operands. We leave the extension as an exercise for the reader.

8.5 Algorithms Summary

We have presented four major derivation tracing algorithms:

(1) The basic tracing algorithm (Sections 5-7) traces view tuple derivations under set semantics for general views.

(2) The pool tracing algorithm (Section 8.2) traces view tuple derivation pools under bag semantics for general views. It retrieves all contributing tuples for the traced tuple without distinguishing individual derivations. The algorithm itself is similar to the basic tracing algorithm, although it incurs some overhead for handling duplicates in base tables and intermediate views during the tracing process.

(3) The enumerating algorithm (Section 8.3) traces view tuple derivation sets under bag semantics for general views. It lists each derivation of the traced tuple separately. This algorithm is the most expensive of the four algorithms we propose.

(4) The key-based algorithm (Section 8.4) traces view tuple derivations in the presence of duplicates in the view by using key information to associate a unique derivation with each view tuple. In general, this algorithm is more efficient than the other three algorithms, but it can only be used easily for multilevel ASPJ views and may require additional machinery to generate tuple IDs.

9. DERIVATION TRACING IN A WAREHOUSING ENVIRONMENT

In a distributed multisource data warehousing environment, querying the sources for lineage information can be difficult or impossible: sources may be inaccessible, expensive to access, expensive to transfer data from, and/or inconsistent with the views at the warehouse. In Section 9.1, we consider the tradeoffs between materializing and recomputing intermediate results in the view definition tree, and conclude that storing intermediate aggregation results improves overall lineage tracing and view maintenance performance. In Section 9.2, we discuss how we can reduce or entirely avoid source queries, and perform efficient and consistent lineage tracing, by storing modest amounts of additional auxiliary data from the source tables in the warehouse. Sections 9.3 and 9.4 briefly discuss functionality, implementation, and performance issues in a warehouse lineage tracing system based on the results and algorithms in this paper.

9.1 Materializing Intermediate Aggregate Results

In Sections 6 and 7, we saw that intermediate aggregation results are needed for derivation tracing in the general case. There are two approaches to obtaining the necessary results. One is to recompute the required intermediate data during the tracing process. This approach requires no extra storage or maintenance cost, but the tracing process takes longer, especially when the recomputation may require querying the sources. The other approach is to maintain materialized auxiliary views containing the intermediate results. In this approach, less computation is required at tracing time, but the auxiliary views must be stored and kept up-to-date. Because efficient incremental maintenance of multilevel aggregate views generally requires materializing the same intermediate views we need for lineage tracing, we take the materialized approach in this paper.

Example 9.1 (Materialized View for AllClothing). To improve the efficiency of the tracing process in Example 6.4, we materialize auxiliary view AllClothing (in Figure 13) with the contents shown in Figure 15.

Note that when materializing AllClothing, the tuple (Target, 1400) is never used when tracing tuple derivations for Clothing, since it is filtered out by the selection condition [[Sigma].sub.total [is greater than] 5000] in Clothing's definition. In this case, materializing the result of V' = [[Sigma].sub.total [is greater than] 5000](AllClothing) instead of AllClothing seems to be a better choice. However, notice that V' is not incrementally maintainable without storing AllClothing. Thus, we would either need to recompute V' for each relevant update, which would incur a high maintenance cost, or we need to store all of AllClothing in order to maintain V'. Hence, materializing V' is not actually an improvement. In general, given a view definition tree where selections are pushed down as far as possible, all selection conditions above an aggregate are selections on the aggregated attributes, and therefore are not incrementally maintainable without storing the entire aggregation results.

9.2 Derivation Views

By storing auxiliary data based on the source tables in the warehouse, we can reduce or avoid expensive source queries during lineage tracing entirely. There is a wide variety of possible auxiliary views to maintain, with different performance tradeoffs in different settings. A simple extreme solution is to store a complete copy of each source table in the warehouse. Our tracing procedures will then query these source table copies as if they were the sources. However, this solution can be costly and wasteful if a source is large, especially if much of its data does not contribute to the view. Also, computing selections and joins over large source table copies each time a tuple's derivation is traced can be expensive. Other solutions can in some cases store much less information and still enable derivation tracing without accessing the sources, but the maintenance cost is higher. We propose an intermediate scheme that achieves low tracing query cost with modest extra storage and maintenance cost. For the following discussion, we focus on general ASPJ views with set semantics.

After adding auxiliary views for intermediate aggregation results as described in Section 10.1, the views of concern are those defined by the lowest-level segments, which are directly over the source tables. Let V = v([R.sub.1], ..., [R.sub.m]) be such a view. We introduce an auxiliary view, called the derivation view for v. It contains information about the derivation of each tuple in V over V = v([R.sub.1], ..., [R.sub.m]), as specified in Definition 9.2, given next. Theorem 9.3, which follows directly from Theorem 6.2, then shows that any V tuple's derivation in [R.sub.1], ..., [R.sub.m] can be computed with a simple selection and split operation over the derivation view.

Definition 9.2 (Derivation View). Let V = v([R.sub.1], ..., [R.sub.m]) = [[Alpha].sub.G, aggr(B)]([[Pi].sub.A]([[Sigma].sub.C]([R.sub.1] ?? ... ?? [R.sub.m]))) be a one, level ASPJ view over source tables [R.sub.1], ..., [R.sub.m]. The derivation view for v, denoted DV(v), is

DV(v) = [[Sigma].sub.C]([R.sub.1] ?? ... ?? [R.sub.m]).

THEOREM 9.3 (DERIVATION TRACING USING THE DERIVATION VIEW). Let V be a one-level ASPJ view over base tables: V = v([R.sub.1], ..., [R.sub.m]) = [[Alpha].sub.G, aggr(B)]([[Sigma].sub.C]([R.sub.1] ?? ... ?? [R.sub.m]))). Let DV(v) be v's derivation view as defined in Definition 9. Given a tuple t [element of] V, t's derivation in [R.sub.1], ..., [R.sub.m] according to v can be computed by applying the following query to DV(v):

[TQ.sub.t, v] = [Split.sub.[R.sub.1], ..., [R.sub.m]]([Sigma]G=tG(DV(v))).

Given tuple set T [subset or equal to] V, T's derivation tracing query using DV(v) is

[TQ.sub.T, v] = [Split.sub.1[R.sub.1], ... [R.sub.m]](DV(v) ?? T).

Example 9.4 (Derivation View for AllClothing). The view AllClothing in Example 9.1 is defined on base tables store, item, and sales. Suppose these base tables are located in remote sources that we cannot or do not wish to access. In order to trace tuple derivations for AllClothing, we maintain a derivation view DV_AllClothing. Figure 30 shows the derivation view definition, and Figure 30 shows its contents. Then, computing derivations for tuples in view AllClothing requires accessing view DV_AllClothing only, and not the source tables.

Fig. 30. DV_AllClothing table.

s_id s_name city state i_id i_name category 001 Target Palo Alto CA 0004 pants clothing 002 Target Albany NY 0004 pants clothing 003 Macy's San Francisco CA 0003 shirt clothing 003 Macy's San Francisco CA 0004 pants clothing 004 Macy's New York City NY 0003 shirt clothing 004 Macy's New York City NY 0004 pants clothing s_name city price num Target Palo Alto 30 600 Target Albany 35 800 Macy's San Francisco 45 1500 Macy's San Francisco 60 600 Macy's New York City 50 2100 Macy's New York City 70 1200

Using previously devised techniques for data warehousing [Zhuge et al. 1996; 1997], the auxiliary intermediate views and derivation views can be maintained consistently with each other, and with other views in the warehouse. Note that in cases of warehousing environments where the sources are inaccessible, the auxiliary views themselves need to be made self-maintainable. Previously devised techniques can be used here as well [Gupta et al. 1996; Quass et al. 1996].

There are alternative derivation views to the one proposed here that trade tracing query cost for storage or maintenance cost. One simple option is to split the derivation view into separate tables that contain the base tuples of each source relation that contribute to the view. This scheme may reduce the storage requirement, but tracing queries must then recompute the join. Of course, if accessing the sources is cheap and reliable, then it may be preferable to query the sources directly. However, a compensation log [Hull and Zhou 1996] may be needed to keep the tracing result consistent with the warehouse views. Determining whether it is better to materialize the necessary information for derivation tracing or to query the sources and recompute information at tracing time in a given setting (based on query cost, update cost, storage constraints, source availability, etc.) is an optimization problem [Gupta 1997; Labio et al. 1997].

9.3 A System Supporting Derivation Tracing

Figure 31 illustrates an overall warehouse structure for our example that supports tuple derivation tracing with materialized intermediate aggregation results and derivation views. The query tree on the left side of Figure 31 is the original definition of view Clothing. In order to trace Clothing's tuple derivations, an auxiliary view AllClothing is maintained to record the intermediate aggregation results (as discussed in Section 9.1). Furthermore, to trace tuples in AllClothing, derivation view DV_AllClothing is maintained (as discussed in Section 9.2). The final set of materialized views is

Clothing = [[Pi].sub.total]([[Sigma].sub.total [is greater than] 5000](AllClothing))

AllClothing = [[Alpha].sub.s_name, sum(num) as total(DV_ALLClothing)

DV AllClothing = [[Sigma].sub.category="clothing"](store ?? item ?? sales)

[Figure 31 ILLUSTRATION OMITTED]

Each view can be computed and maintained based on the views (or base tables) directly beneath it using warehouse view maintenance techniques [Quass et al. 1996; Zhuge et al. 1997]. Bold arrows on the right side of Figure 27 show the query and answer data flow. Ordinary view queries are sent to the view Clothing, while derivation queries are sent to the Derivation Tracer module. The tracer takes a request for the derivation of a tuple t in Clothing and queries auxiliary view AllClothing for t's derivation [T.sub.1] in AllClothing, as specified in Theorem 5. The tracer then queries DV_AllClothing for the derivation [T.sub.2] of [T.sub.1] in D, as specified in Theorem 16. [T.sub.2] is t's derivation in D.

9.4 Implementation

Based on the results in this paper, we have implemented a lineage tracing system [Cui and Widom 2000al as part of the WHIPS [Wiener et al. 1996] data warehousing prototype at Stanford. The system implements our tracing algorithms for arbitrary ASPJ views with set semantics (Section 6) and an interactive Web interface. Through the interface, users can scan a materialized view at the warehouse and select view tuples of interest. Lineage of the selected tuples can be traced down one level in the WHIPS view hierarchy [Labio et al. 1999], or all the way back to the sources. The interface can also show the derivation process, illustrating step-by-step how the base tuples derived the selected view tuples. Although the system transforms the original view definition into our canonical form for lineage tracing, the derivation process is displayed according to the original view definition.

Related to the issues discussed in Sections 9.1 and 9.2, we have also conducted performance studies of various auxiliary view schemes and selection algorithms, considering both lineage tracing and view maintenance costs. Our studies focused on SPJ views in Cui and Widom [2000b] and considered arbitrary relational views with aggregation in Cui and Widom [1999].

10. RELATED WORK REVISITED

In this section we revisit top-down Datalog query processing and the view update problem, and examine the differences between those problems and ours. We also show how our lineage tracing algorithms can be applied to facilitate solutions to the view update problem.

10.1 Top-Down Datalog Query Processing

In Datalog, relations are represented as predicates, tuples are atoms (or facts), and queries or views are represented by logical rules. Each rule contains a head (or goal) and a body with some subgoals that can (possibly recursively) derive the head [Ullman 1989]. There are two modes of reasoning in Datalog: the bottom-up (or forward-chaining) mode and the top-down (or backward-chaining) mode. The top-down mode proves a goal by constructing a rule-goal graph with the goal as the top node, scanning the graph top-down, and recursively applying rule-goal unification and atom matching until finding an instantiation of all of the subgoals in the base data. Backtracking is used if a dead-end is met in the searching (proving) process.

Top-down evaluation of a Datalog goal thus provides information about the facts in the base data that yield the goal; in other words, it provides the lineage of the goal tuple. Our approach to tracing tuple lineage is obviously different from Datalog top-down processing. Instead of performing rule-goal unification and atom matching one tuple at a time, we generate a single query to retrieve all lineage tuples of a given tuple (or tuple set) in an SPJ or one-level ASPJ view. Our approach is better suited to tracing query optimization (as described in Section 5.3). Also, we support lineage tracing for aggregation views and bag semantics, which are not handled in Datalog. We do not handle recursion in this paper, although we believe our approach can be extended to recursive views while maintaining efficiency.

10.2 View Update Problem

The well-known view update problem is to transform updates on views into updates against the base tables on which the view is defined so that the new base tables will continue to derive the updated view. The problem was first formulated in Stonebraker [1975] and Dayal and Bernstein [1978], and solutions were based generally on the idea of view data lineage. In fact, Stonebraker [1975] provides a view update algorithm that is similar to our algorithm for tracing the lineage of SPJ views. However, neither paper formally defines the notion of data lineage, nor does consider complex view scenarios (e.g., views with aggregation) or the warehousing environment.

Our derivation tracing algorithm for ASPJ views can guide the view update process to find an appropriate update translation for views with aggregation in many cases. For deletions (or modifications), our derivation tracing algorithms can directly identify an appropriate set of base relation tuples to delete (or modify), as shown in the following example.

Example 10.1 (View Update: Deletion). In Example 4.6 (Figure 11), we illustrated the derivation for <2, 8> in view V = [[Alpha].sub.X, sum(Y)]([[Sigma].sub.Y[is not equal to] (R)]). When the view update command "delete <2, 8> from V" is issued, we can use the tuple derivation to determine that <2, 3> and <2, 5> should be deleted, and these should be the only changes. The updated base table will be R = {< 1, 2>, <1, 4>, <2, 0>}, which derives the updated view {< 1, 6>}. Note that without using tuple derivation tracing, a more naive algorithm might choose to delete <2, 0> also, which maintains "correctness" but deletes more than necessary.

For insertions, the problem is harder, since the view tuple being inserted as well as its derivation do not currently exist. Our derivation tracing algorithms can be adapted to identify some components of the possible derivations of a view tuple being inserted, thereby guiding the base tuple insertions. Any attribute that is not projected into the view must be guessed using extra semantics, such as user instructions or base table constraints, or left null. Even here, derivation tracing can help the "guessing" process in certain cases, as shown in the following example.

Example 10.2 (View Update: Insertion). Suppose view update "insert <3, 2> into V" is issued to the view in Example 4.6 (Figure 11). Since <3, 2> is not in view V, we cannot ask for its current derivation. We only know that after we update R, R should produce a new view with <3, 2> in it. According to the tracing query for V (as specified in Theorem 6.2), we can guess that after the update, the derivation R* of <3, 2> must satisfy the condition: [inverted]At [element of] R*: t.X = 3 [conjunction] t.Y [is not equal to] 0. Assuming a constraint that R.Y attributes are positive integers and considering the requirement sum(R*.Y) = 2, we can also assert that [inverted]At [element of] R*: t.Y [is less than or equal to] 2. By further assuming a constraint that R has no duplicates, we can assert that [inverted]A[t.sub.1], [t.sub.2] [element of] R*: [t.sub.1].Y [is not equal to] t2.Y. Putting all these assertions together, the only potential derivation of <3, 2> is <3, 2>, so an appropriate base table update is to insert <3, 2> into R.

Notice that the inserted tuple in Example 10.2 was carefully chosen. If <3, 8> were inserted instead, we would have to randomly pick a translation from the reasonable ones or ask the user to choose. Even in this case, lineage tracing techniques together with base table constraints can be very useful in reducing the number of possible translations.

11. CONCLUSIONS AND FUTURE WORK

We formulated the view data lineage problem and presented lineage tracing algorithms for relational views with aggregation using both set and bag semantics. Our algorithms identify the exact set of base data that produced a given view data item. We also presented techniques for efficient and consistent lineage tracing in a multisource warehousing system. Our results can form the basis of a tool by which an analyst can browse warehouse data and then "drill-through" to the source data that produced certain warehouse data of interest. We have implemented an initial lineage tracing system [Cui and Widom 2000a] within the WHIPS [Wiener et al. 1996] data warehousing prototype at Stanford, and we are in the process of extending our work in the following directions.

* Tuple derivations as defined in this paper explain how certain base tuples cause certain view tuples to exist. As such, derivation tracing is a useful technique for investigating the origins of potentially erroneous view data. However, in some cases a view tuple may be erroneous not (only) because some existing base tuples that derive it are erroneous, but because base tuples that should appear in the derivation are missing. For example, a base tuple may contribute to the wrong group in an aggregate view because its grouping value is incorrect. It is also possible that some tuples that should appear in a view are missing because of erroneous (or missing) base tuples. We plan to explore how both of these "missing data" problems can be addressed in our lineage framework.

* Most commercial data warehousing systems support data transformations in the process of producing warehouse views, e.g., Broadbase Software, Inc. [1981]; Microsoft [1999]; Sagent Technology [1999]. Some transformations may be defined using relational algebra (or SQL), while others are more general scripts or programs. We are extending our lineage tracing framework to handle a large class of general data transformations, which encompasses relational algebra, as well as the more complex and ad-hoc transformations supported by commercial systems.

* We believe that lineage tracing can be used to help solve related problems such as view schema evolution and view update. Some initial ideas for view update are presented in Section 10.2.

In summary, data lineage is a rich problem with many useful applications. In this paper we provide an initial practical solution for lineage tracing in data warehouses, leading to several additional interesting research issues as outlined above.

[Figure 29 ILLUSTRATION OMITTED]

ACKNOWLEDGMENTS

We are grateful to Sudarshan Chawathe, Himanshu Gupta, Jeff Ullman, Vasilis Vassalos, Yue Zhuge, and all of our WHIPS group colleagues for helpful and enlightening discussions. We also acknowledge the anonymous TODS referees for useful feedback and suggestions.

(1) For notational purposes, we always assume that G is nonempty. An empty G (indicating aggregation over all tuples) causes no problems in our framework, since it only results in dropped projection attributes and dropped selection conditions. However, we do not handle the special-case semantics of SQL aggregation that produces a single tuple (with value 0, for example) in the case of aggregating over an empty set.

(2) By Definition 5.2, if V = R - S, then t's derivation not only includes t from R, but also includes all tuples t' [is not equal to] t in S. We discuss this definition of derivation for set difference in more detail in Section 7.

(3) The concatenation of two relation lists <[R.sub.1], ..., [R.sub.m]> ?? <[S.sub.1], ..., [S.sub.n]> is <[R.sub.1], ..., [R.sub.m], [S.sub.1], ..., [S.sub.n]>. Recall that relations are renamed so that the same relation never appears twice.

(4) Although we use natural join throughout most of the paper, all results carry over directly to theta join. For this equivalence to hold in general, it is necessary to make the theta explicit.

(5) Bag operators do not eliminate duplicates. For examples, {a, b} ?? {a, c} = {a, a, b, c} and {a, a, b} ?? {a, b, c}. Bag operator ?? corresponds to SQL's UNION ALL (as opposed to UNION); bag operator ?? is not supported by SQL.

(6) Although called a set, a derivation set may well have duplicates; that is, two derivations of t may have the same value.

REFERENCES

BANCILHON, F. AND SPYRATOS, N. 1981. Update semantics of relational views. ACM Trans. Database Syst. 6, 4, 557-575.

BROADBASE SOFTWARE, INC. 1999. http://www.broadbase.com/.

CHAUDHURI, S. AND DAYAL, U. 1997. An overview of data warehousing and OLAP technology. SIGMOD Rec. 26, 1, 65-74.

CUI, Y. AND WIDOM, J. 2000a. Lineage tracing in a data warehousing system. In Proceedings of the 16th International Conference on Data Engineering (San Diego, CA, Feb.).

CUI, Y. AND WIDOM, J. 2000b. Practical lineage tracing in data warehouses. In Proceedings of the 16th International Conference on Data Engineering (San Diego, CA, Feb.).

CUI, Y. AND WIDOM, J. 1999. Storing auxiliary data for efficient maintenance and lineage tracing of complex views. Tech. Rep. Stanford University, Stanford, CA. http://www-db.stanford.edu/pub/papers/auxview.ps.

CUI, Y., WIDOM, J., AND WINNER, J. L. 1997. Tracing the lineage of view data in a warehousing environment. Tech. Rep. Stanford University, Stanford, CA. http://www-db.stanford.edu/ pub/papers/lineage-full.ps.

DAYAL, U. AND BERNSTEIN, P.A. 1978. On the updatability of relational views. In Proceedings of the 4th International Conference on Very Large Data Bases (Berlin, Germany, Sept. 13-15). 368-377.

FALOUTSOS, C., JAGADISH, H. V., AND SIDIROPOULOS, N. D. 1997. Recovering information from summary data. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB '97, Athens, Greece, Aug.). 36-45.

GRAY, J., BOSWORTH, A., LAYMAN, A., AND PIRAHESH, H. 1996. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In Proceedings of the 12th IEEE International Conference on Data Engineering (New Orleans, LA, Feb. 1997). IEEE Press, Piscataway, NJ, 152-159.

GUPTA, H. 1997. Selection of views to materialize in a data warehouse. In Proceedings of the 6th International Conference on Database Theory (ICDT '97, Delphi, Greece, Jan. 9-10). Springer-Verlag, Berlin, Germany, 98-112.

GUPTA, A., JAGADISH, H., AND MUMICK, I. S. 1996. Data integration using self-maintainable views. In Proceedings of the Fifth International Conference on Extending Database Technology (Avignon, France). 140-144.

GUPTA, A., HARINARAYAN, V., AND QUASS, D. 1995. Aggregate-query processing in data warehousing environments. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95, Zurich, Sept.). 358-369.

HACHEM, N. I., QIU, K., GENNERT, M., AND WARD, M. 1993. Managing derived data in the Gaea scientific DBMS. In Proceedings of the Conference on Very Large Data Bases (VLDB '93, Dublin, Ireland, Aug.). Morgan Kaufmann Publishers Inc., San Francisco, CA, 1-12.

HAN, J., CHEE, S., AND CHIAN, J.Y. 1998. Issues for on-line analytical mining of data warehouses. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (Seattle, WA, June '98).

HULL, R. AND ZHOU, G. 1996. A framework for supporting data integration using the materialized and virtual approaches. SIGMOD Rec. 25, 2, 481-492.

KAWAGUCHI, A., LIEUWEN, D. F., MUMICK, I. S., QUASS, D., AND Ross, Q.A. 1997. Concurrency control theory for deferred materialized views. In Proceedings of the 6th International Conference on Database Theory (ICDT '97, Delphi, Greece, Jan. 9-10). Springer-Verlag, Berlin, Germany, 306-320.

LABIO, W. J., QUASS, D., AND ADELBERG, B. 1997. Physical database design for data warehousing. In Proceedings of the 13th International Conference on Data Engineering (Birmingham, UK, Apr.). IEEE Computer Society, Washington, DC, 277-288.

LABIO, W. J., YANG, J., CUI, Y., GARCIA-MOLINA, H., AND WIDOM, J. 1999. Performance issues in incremental warehouse maintenance. Tech. Rep. Stanford University, Stanford, CA. http://www-db.stanford.edu/pub/papers/whips-wm.ps.

MICROSOFT. 1999. Microsoft SQL server: Data transformation services. Microsoft Press, Redmond, WA. MSDN Online Library, http://msdn.microsoft.com/library/psdk/sq]/dts_ovrw.htm

QUASS, D., GUPTA, A., MUMICK, I. S., AND WIDOM, J. 1996. Making views self-maintainable for data warehousing. In Proceedings of the Fourth International Conference on Parallel and Distributed Information Systems (Miami Beach, FL, Dec. '96). 158-169.

SAGENT TECHNOLOGY. 1999. Sagent Technology. http://www.sagent.com/

STONEBRAKER, M. 1975. Implementation of integrity constraints and views by query modification. In Proceedings of the ACM SIGMOD International Conference on Management of Data (San Jose, CA, May). 65-78.

ULLMAN, g. D. 1989. Database and Knowledge-Base Systems. Computer Science Press, Inc., New York, NY.

WIENER, J. L., GUPTA, H., LABIO, W. J., ZHUGE, Y., GARCIA-MOLINA, H., AND WIDOM, J. 1996. A system prototype for warehouse view maintenance. In Proceedings of the Workshop on Materialized Views: Techniques and Applications (Montreal, Canada, June). 26-33.

WIDOM, J. 1995. Research problems in data warehousing. In Proceedings of the 1995 International Conference on Information and Knowledge Management (CIKM, Baltimore, MD, Nov. 28-Dec. 2), N. Pissinou, A. Silberschatz, E. K. Park, and K. Makki, Eds. ACM Press, New York, NY, 25-30.

WOODRUFF, A. AND STONEBRAKER, M. 1997. Supporting fine-grained data lineage in a database visualization environment. In Proceedings of the 13th International Conference on Data Engineering (Birmingham, UK, Apr.). IEEE Computer Society, Washington, DC, 91-102.

ZHUGE, Y., WIENER, J. L., AND GARCIA-MOLINA, H. 1997. Multiple view consistency for data warehousing. In Proceedings of the 13th International Conference on Data Engineering (Birmingham, UK, Apr.). IEEE Computer Society, Washington, DC, 289-300.

ZHUGE, Y., GARCIA-MOLINA, H., AND WIENER, J. L. 1996. The Strobe algorithms for multisource warehouse consistency. In Proceedings of the Fourth International Conference on Parallel and Distributed Information Systems (Miami Beach, FL, Dec. '96). 146-157.

Received: July 1999; revised: January 2000; accepted: January 2000

APPENDIX

A. PROOFS

We provide proofs for selected theorems from the paper. A full set of complete proofs appears in Cui et al. [1997].

A.1 Proof of Theorem 4.4

To prove [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], we need to prove

(1) [T'.sub.i] [subset or equal to] [T.sub.i], i = 1..m.

(2) Op([T'.sub.1], ..., [T'.sub.m]): {t}..

(3) [inverted]At' [element of] [T'.sub.i]:Op([T'.sub.1], ..., {t'}, ..., [T'.sub.m]) [is not equal to] ??.

(4) [inverted]A<[T".sub.1], ..., [T".sub.m]) that satisfy 1, 2, 3: <[T".sub.1], ..., [T".sub.m]> [subset or equal to] <[T'.sub.1], ..., [T'.sub.m]>.

Here we show the proofs for ??, [Alpha], and -. Proofs for other operators can be found in Cui et al. [1997].

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] Let [T'.sub.1] = {t.[T.sub.i]}, i = 1..m. 1. t [element of] [T.sub.1] ?? ... ?? [T.sub.m] ?? t.[T.sub.i] [element of] [T.sub.i], i = i..m ?? [T'.sub.i] [subset or equal to] [T.sub.i], i = 1..m 2. {t.[T.sub.1]} ?? ... ?? {t.[T.sub.m]} = {t} according to the definition of ?? 3. Consider an arbitrary t' in an arbitrary [T'.sub.i] t' = t.[T.sub.i] ?? [T'.sub.1] ?? ... ?? {t'} ?? ... [T'.sub.m] = {t} [is not equal to] ?? 4. Consider an arbitrary <[T".sub.1], ..., [T".sub.m]> that satisfies 1, 2, 3. ?? [T".sub.1] ?? ... ?? [T".sub.m] = {t} and [inverted]At" [element of] [T".sub.1] ?? ... {t"} ?? [T".sub.m] [is not equal to] ?? ?? [inverted]At" [element of] [T".sub.i]: [T".sub.1] ?? ... (t"} ?? [T'.sub.m] = {t} ?? [inverted]At" [element of] [T".sub.i]: t" = t.[T.sub.i] [element of] [T'.sub.i] ?? [T".sub.i] [subset or equal to] [T'.sub.i] [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] Let T' = [[Sigma].sub.G = t.G](T). 1. T' [subset or equal to] T according to the definition of [Sigma] 2. [inverted]At' [element of] T':t'.G = t.G and [inverted]At" such that t" [element of] T and t" [element of] T':t" .G [is not equal to] t.G ?? t = <T'.G, aggr(T'.B)> (In other words, T' is the group from which t is computed.) ?? [[Alpha].sub.G, aggr(B)](T') = {t} 3. Consider an arbitrary t' [element of] T' t'.G = t.G ?? [[Alpha].sub.G, aggr(B)]({t'}) = {<t.G, aggr({t'.B})>} [is not equal to] ?? 4. Consider an arbitrary <[T".sub.1], ..., [T".sub.m] that satisfies 1, 2, 3. [[Alpha].sub.G, aggr(B)](T") = {t} ?? [inverted]At" [element of] T":t".G = t.G ?? [inverted]At" [element of] T":t" [element of] T" ?? [T" [subset or equal to] T' [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] Let [T'.sub.1] = {t}, [T'.sub.2] = [T.sub.2]. 1. It is obvious that [T'.sub.i] [subset or equal to] [T.sub.i]. 2. t [element of] [T.sub.1] - [T.sub.2] ?? t [is not an element of] [T.sub.1] - [T.sub.2] ?? [T'.sub.1] - [T'.sub.2] = {t} - [T.sub.2] = {t} 3. Consider an arbitrary t' [element of] [T'.sub.1], t' = t ?? {t'} - [T'.sub.2] = {t} [is not equal to] ?? Consider an arbitrary t' [element of] [T'.sub.2], t' [is not equal to] t ?? [T'.sub.1] - {t'} = {t} [is not equal to] ?? 4. Consider an arbitrary <[T".sub.1], [T".sub.2]> that satisfies 1, 2, 3. ?? [T".sub.1] - [T".sub.2] = {t} and [inverted]At" [element of] [T".sub.1]: {t"} - [T".sub.2] [is not equal to] ?? ?? [inverted]At" [element of] [T".sub.1]:t" = t. Otherwise, if t" [is not an element of] [T".sub.2] then [T".sub.1] - [T".sub.2] = {t}; if t" [element of] [T".sub.2] then {t"} - [T".sub.2] = ?? ?? [T".sub.1] [subset or equal to] [T'.sub.1] Also, [T".sub.2] [subset or equal to] [T'.sub.2] = [T.sub.2]

A.2 Proof of Theorem 4.8

To prove this theorem, we first give a new definition of tuple derivation for SPJ views, which is equivalent to Definition 5.3 when only SPJ views are considered. We then prove Theorem 4 based on the new definition.

Definition A.1 (Tuple Derivation for an SPJ View). Let D be a database with base tables [R.sub.1], ..., [R.sub.m], and let V = v(D) be an SPJ view over D. Given tuple t [element of] V, [inverted]A[R.sub.i], t* [element of] [R.sub.i] contributes to t iff [exists][R'.sub.j] [subset or equal to] [R.sub.j] for j = 1..m, such that

(a) t [element of] v([R'.sub.i], ..., [R'.sub.i-1], {[t.sub.i]}, [R'.sub.i+1], ..., [R'.sub.m]);

(b) t [element of] v([R'.sub.1], ..., [R'.sub.i-1], ??, [R'.sub.i+1], ..., [R'.sub.m]).

t's derivation according to v is [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], where each [R*.sub.i] contains all the tuples in [R.sub.i] that contribute to t.

Definition A.1 says that a base tuple t* [element of] [R.sub.i] contributes to t in the view if there exist some subsets of base tables <[R'.sub.1], ..., [R'.sub.m]), such that they produce t with [R'.sub.i] = {t*}, and cannot produce t if [R'.sub.i] = ??. We first prove that Definition A.1 is equivalent to Definition 4.5 for SPJ views.

LEMMA A.2 (DEFINITION EQUIVALENCE). Definition A.1 is equivalent to Definition 4.5 for SPJ views. In other words, the derivation of t defined by Definition 4.5 is the same as that defined by Definition A.1 for every SPJ view and every traced tuple.

PROOF OF LEMMA A.2. We first prove the equivalence of the two definitions for views with a single SPJ operator.

For V = [[Sigma].sub.C](R) or [[Pi].sub.A](R), [inverted]At [element of] V, let <R*> and <R**> be t's derivations defined by Definition 4.5 and Definition A. 1.

1. Consider an arbitrary t* [element of] R* ?? According to Definition 4.5 and the monotonicity of V, ?? [subset or equal to] v({t*}) [subset or equal to] v(R*) = {t} ?? v({t*}) = {t} ?? v(??) = ?? and v({t*}) = {t} ?? t* contributes to t according to Definition A.1 ?? t* [element of] R** 2. Consider an arbitrary t* [element of] R** ?? According to Definition A. 1, v(??) = ?? and v({t*}) = {t} ?? R* [union] {t*) also satisfies (a) and (b) in Definition 5.3 ?? v(R* [union] {t*}) = v(R*) [union] v({t*}) = {t} and [inverted]A[t.sub.1] [element of] R* [union] {t*}: v({[t.sub.1}) [is not equal to] ?? ?? From the maximality of R*, we know that t* [element of] R*

Therefore, R* = R**.

For V = R ?? S, [inverted]At [element of] V, let <R*, S*> and <R**, S**> be t's derivations defined by Definition 4.5 and Definition A.1

1. Consider an arbitrary t* [element of] R* ?? According to Definition 4.5 and the monotonicity of V, ?? [subset or equal to] {t*} ?? S* [subset or equal to] R* ?? S* = {t} ?? {t*} ?? S* = {t} and ?? ?? S* = ?? ?? t* contributes to t according to Definition A.1 ?? t* [element of] R** 2. Consider an arbitrary t* [element of] R** ?? According to Definition A. 1, [exists]S' [subset or equal to] S such that ?? ?? S' = ?? and {t*} ?? S' = {t} ?? [exists]t' [element of] S' such that {t*} ?? <{t'} = {t} ?? (R* [union] {t*}) ?? <(S* [union] {t'}) = {t}. Also, [inverted]A[t.sub.1] [element of] R* [union] {t*}: {[t.sub.1]} ?? (S* [union] {t'}) [is not equal to] ?? and [inverted]A[t.sub.2] [element of] S* [union] {t'}: (R* [union] {t*}) ?? {t.sub.2} [is not equal to] ?? ?? <R* [union] {t*}, S* [union] {t'}> also satisfies (a) and (b) in Definition 4.5 ?? From the maximality of (R*, S*}, we know that t* [element of] R*.

Therefore, R* = R**. Similarly, we can prove S* = S**.

We already know from Theorem 4.7 that tuple derivation as defined by Definition 4.5 is transitive. We can also prove this property for derivation as defined by Definition A. 1. After proving that Definition 4.5 and Definition A. 1 are equivalent for views with a single SPJ operator, due to the transitive property of tuple derivation, we can easily prove that Definition 4.5 and Definition A.1 are equivalent for any SPJ view by induction. []

PROOF OF THEOREM 4.8 Having proved that Definition A. 1 is equivalent to Definition 4.5 for SPJ views, we now prove Theorem 4.8 based on Definition A. 1 (instead of Definition 4.5, which is a more general definition originally used in the paper).

Given two equivalent SPJ views [v.sub.1] and [v.sub.2], we know that [inverted]AD = <[T.sub.1], ..., [T.sub.m]}: [v.sub.1](D) = [v.sub.2](D). Given [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]: According to Definition A.1, there exists subsets of the base tables <R'.sub.1], ..., [R'.sub.m]) such that they produce t with [R'.sub.i] = {t*}, and cannot produce t if [R'.sub.i] = ??, according to [v.sub.1]. Since these subsets produce exactly the same result using [v.sub.2], we know that t* must also contribute to t according to [v.sub.2]. Therefore, [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. We can similarly prove that [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. Therefore, [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. []

A.3 Proof of Theorem 5.3

We need to prove that

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Let [V.sub.2] = [R.sub.1] ?? ... ?? [R.sub.m], [V.sub.1] = [[Sigma].sub.C]([V.sub.2]) and V = [[Pi].sub.A]([V.sub.1). According to Theorem A.7, we have

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

A.4 Proof of Theorem 5.5

Assuming <[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] is the derivation of t, we need to prove that [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] for i = 1..m. From Theorem 5.3, we know that

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(1) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(2) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], and therefore [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. Because otherwise, [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], such that t'.[K.sub.i] = t.[K.sub.i]. Then there exist two tuples t* [is not equal to] t' in [R.sub.i] with the same [K.sub.i] value, which conflicts with the key constraint.

Therefore, [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

A.5 Proof of Theorem 8

If an operator o is the parent of another operator o' in a view definition tree, we say o [right arrow] o'. We first prove that procedure Canonicalize([v.sub.0]) returns the canonical form. In Canonicalize, after we pull up the union operators in a top-down manner, there are only the following possibilities for a union node ([union]) and the node right above it: - [right arrow] [union], [Alpha] [right arrow] [union], ?? [right arrow] [union], [union] [right arrow] [union], or nothing above [union]. For any other possibility, we can always pull the union operator further up. Pulling up the projections afterwards does not affect the above patterns. At the same time, there are only the following patterns for projections: - [right arrow] [union], [Alpha] [right arrow] [Pi], [union] [right arrow] [Pi], [Pi] [right arrow] [Pi], or nothing above [Pi]. Similarly, pulling up selections does not affect any of the above patterns, and there are only the following patterns for selections: - [right arrow] [Sigma], [Alpha] [right arrow] [Sigma], [union] [right arrow] [Sigma], [Pi] [right arrow] [Sigma], [Sigma] [right arrow] [Sigma], or nothing above [Sigma]. For join, any operator could be its parent: - [right arrow] ??, [Alpha] [right arrow] ??, [union] [right arrow] ??, [Pi] [right arrow] ??, [Sigma] [right arrow] ??, ?? [right arrow] ??, or nothing above ??. After merging the same-type operators, we are left with 18 possible two-operator sequences out of 36 total possibilities in the definition tree after applying Canonicalize. With the given 18 sequences, any two adjacent; operators either belong to different segments or are in the order consistent with the sequence order in an AUSPJ-segment. Therefore, procedure Canonicalize ([v.sub.0]) returns a canonical form.

We now prove the view equivalence after view transformations in Canonicalize. We learn from [Ullman 1989] that projection can be pushed above selections and joins while keeping a query (or view) equivalent. Here we prove that union operators can alsobe pushed above selections and projections while keeping the transformed view equivalent to the original view. In other words, we need to prove (1) [[Pi].sub.A](R [union] S) = [[Pi].sub.A](R) [union] [[Pi].sub.A](S); (2) [[Sigma].sub.A](R [union] S) = [[Sigma].sub.A](R) [union] [[Sigma].sub.A](S). For (1), [inverted]At: t [element of] [[Pi].sub.A](R [union] S) [equivalence] [exists]t' [element of] R [union] S such that t'.A = t [equivalence] [exists]t' [element of] R or [element of] S such that t'.A = t [equivalence] t [element of] [[Pi].sub.A](R) or [[Pi].sub.A](S) [equivalence] t [element of] [[Pi].sub.A](R [union] [[Pi].sub.A](S)). We can prove (2) similarly.

Finally, we can easily extend Lemma A.2 (Appendix A.2) to prove that equivalent SPJ views with set unions have equivalent derivations, which proves the derivation equivalence after applying Canonicalize.

This work was supported by DARPA and the Air Force Rome Laboratories under contracts F30602-95-C-0119 and F30602-96-1-0312.

Authors' addresses: Y. Cui and J. Widom, Computer Science Department, Stanford University, Stanford, CA 94305; email: cyw@db.stanford.edu; widom@db.stanford.edu; J. L. Wiener, Compaq Systems Research Center, 130 Lytton Ave., Palo Alto, CA 94301; email: wiener@pa.dec.com.

Printer friendly Cite/link Email Feedback | |

Author: | CUI, YINGWEI; WIDOM, JENNIFER; WIENER, JANET L. |
---|---|

Publication: | ACM Transactions on Database Systems |

Article Type: | Statistical Data Included |

Geographic Code: | 1USA |

Date: | Jun 1, 2000 |

Words: | 20737 |

Previous Article: | A Cost Model for Query Processing in High Dimensional Data Spaces. |

Next Article: | Emancipating Instances from the Tyranny of Classes in Information Modeling. |

Topics: |