# Transformation of XML data into XML normal form.

Normalization as a way of producing good database designs is a well
understood topic for relational data. In this paper we discuss the
problem of eliminating redundancies and preserving data dependencies in
XML data when an XML schema is normalized. Normalization is one of the
main tasks in relational database design, where 3NF or BCNF, is to be
reached. However, neither of them is ideal: 3NF preserves dependencies
but may not always eliminate redundancies, BCNF on the contrary--always
eliminates redundancies but may not preserve constraints. In this paper
we consider the possibility of achieving both redundancy-free and
dependency preserving form of XML schema. We show how the XML normal
form can be obtained for a class of XML schemas and a class of XML
functional dependencies. We relate our approach to the decomposition
algorithm proposed by Arenas and Libkin in [1].

Keywords: XML, XML functional dependencies, XML normal forms, XML schema design

Povzetek: Razvita je metoda pretvorbe XML podatkov v normalizirano obliko s poudarkom na odstranjevanju redundanc.

1 Introduction

Normal forms, as desired forms for schemas defining structures and properties of data collections, were first proposed and investigated by Codd in early 70s. The most important of them are 3NF (Third Normal Form) [2] and BCND (Boyce-Codd Normal Form) [3]), where BCNF is a stronger definition of 3NF. Definitions of normal forms are based on the functional dependencies defined among the attributes constituting the relational schema, and specify requirements to be satisfied by the set of these functional dependencies. Then any relation that is an instance of the schema and satisfies the given set of functional dependencies, is free of harmful redundancies and anomalies. In the normalization process, the initial poor designed relational schema is decomposed into an equivalent set of well-designed schemas, i.e. into schemas in desired normal forms (usually in 3NF and often also in BCNF).

Recently, as XML becomes popular as the standard data model for storing and interchanging data on the Web and more companies adopt XML as the primary data model for storing information, XML schema design has become an increasingly important issue. Thus, we observe attempts to extend normal forms and database design principles to XML databases. Research on normalization of XML data was reported in a number of papers. In [1], Arenas and Libkin extended the relational tuple-oriented definition of functional dependencies to so-called tree tuple-based functional dependencies and developed the first theory of XML functional dependencies (XFDs) and XML normal forms (XNFs). This approach was further studied by Kolahi [4, 5], and Kolahi and Libkin [6]. Integrity constraints for XML data (including keys) were studied extensively in Buneman et al. in [7], and Fan and Simeon in [8]. The equivalence between XFDs and relational FDs was investigated by Vincent et al. in [9, 10]. Yu and Jagadish proposed in [11] a new XML normal form, called GTT-XNF, that is based on Generalized Tree Tuples (GTT).

Central objectives of a good schema design is to avoid data redundancies and to preserve dependencies enforced by the application domain (these dependencies are formalized by means of functional dependencies). Existence of redundancy can lead not only to a higher data storage cost but also to increased costs for data transfer and data manipulation. It can also lead to update anomalies [12].

One strategy to avoid data redundancies is to design redundancy-free schema. One can start from an intuitively correct XML schema and a given set of functional dependencies reflecting some rules existing in application domain. Then the schema is normalized, i.e. restructured, in such a way that the newly obtained schema has no redundancy, preserves all data (is a lossless decomposition) and preserves all dependencies. In general, obtaining all of these three objectives is not always possible, as has been shown for relational schemas [13]. However, in the case of XML schema, especially thanks to its hierarchical structure, this goal can be more often achieved [4].

1.1 Related work

An algorithm, called the Decomposition Algorithm (DA) normalizing XML schemas was proposed in [1]. The algorithm converts any DTD, given a set of XML functional dependencies (XFDs), into DTD in XML normal form (XNF). The decomposition algorithm consists of two basic operations: moving attributes, and creating new element types. These operations are performed when an XFD violating XNF is identified. Thus, the basic idea is similar to normalization of relational data, when the second normal form (2NF), the third normal form (3NF), or the Boyce-Codd normal form (BCND) are to be achieved. In the relational counterparts during the normalization process a relational schema is decomposed into a set of its projections. Thus, we can obtain a set of separate relational schemas as the result of the normalization process performed for an initial relational schema. In the case of XML documents, the result is still one XML document restructured accordingly. In [14], an information-theoretic approach to normal forms for relational and XML data has been developed. Some other papers, e.g. [5, 15, 16, 6] study XML design and normalization for native or relational storage of XML documents. In [17, 18] approaches for obtaining well-designed XML schemas from conceptual ER (Entity-Relationship) schemas have been discussed.

1.2 Contribution

In this paper we describe a systematic approach to the normalization of XML data when so-called cyclic functional dependencies exist, i.e. dependencies, which in the relational case are functional dependencies of the form {A, B [right arrow] C, C [right arrow] A}, where A, B, C are attributes in a relational schema R(A, B, C). It is well known from relational database theory [13], that the schema R(A, B, C) is then in 3NE but is not in BCNE As a result, instances of R(A, B, C) have redundancies, but decomposition the schema into BCNF leads to two schemas RI(C, A) and R2(C, B), which are free of redundancy but do not preserve dependency A, B [right arrow] C.

We focus on normalization procedure for XML schemas with cyclic XFDs. The contributions of the paper are the following:

-- We use a language based on tree patterns [19, 20] to express XML schemas and XML functional dependencies. This notation is used in the formal analysis of properties of XML normal form as well as the base for developing transformation algorithms.

-- We propose an approach to obtain XNF starting from ER schema. In the first step the ER schema is converted into an XML schema satisfying a necessary conditions (see Theorem 7.3) which is a prerequisite to successful applying of DA algorithm [1]. We show that the presence of cyclic dependencies results in a bad behavior of the DA algorithm.

This paper is organized as follows. In Section 2 we introduce a running example and motivate the research. In Section 3 a relational form of the running example is considered and some problems with its normalization are discussed. Basic notations relevant to the discussed issue from the XML perspective, are introduced in Section 4. We define tree patterns and use them to formal definition of tree tuples and instances. In Section 5, tree patterns are used to define data dependencies: XML functional dependencies (XFDs) and keys (a subclass of XFDs). In Section 6, an XML normal form (XNF) is defined (according to [1]) and we show how this form can be obtained for our running example. We discuss different normalization alternatives--we show advantages and drawbacks of some schema choices. A method for transforming an XML schema into XNF is proposed in Section 7. First, for the XML schema the conceptual model in a form of ER schema is created. This schema and functional dependencies among its attributes are the basis for creating an initial XML schema satisfying the necessary condition formulated in Theorem 7.3. This schema is the subject for further normalization. Section 8 concludes the paper.

2 Redundancies in XML data motivation example

Our primary goal is to devise methods which will allow checking correctness of XML data and designing its expected (normalized) form. We expect such data to be devoid of the redundancies and immune to the update anomalies.

In the case of XML schemas some redundancy problems may occur because of bad design of hierarchical structure of XML document. On the other hand, the hierarchical structure of this data can sometimes help conduct the normalization of XML data.

Example 2.1. Let us consider an XML schema tree (Figure 1) that describes a fragment of a database for storing data about parts and suppliers offering these parts. Its DTD specification is given in Figure 2, and an instance of this schema is depicted in Figure 3. Each part element has identifier pId. One part may be offered by zero or more suppliers. Offers are stored in offer elements. Each offer has: offer identifier o I d, supplier identifier s I d, price price, delivery time delivTime, and delivery place delivPlace.

We assume that the following constraints must be satisfied by any instance of this schema:

-- all offer children of the same supplier must have the same values of sld; this is similar to relational functional dependencies, but now we refer to both the values (text value of sId), and to structure (children of the same supplier).

-- delivPlace functionally depends on part (pId) and supplier (sId), i.e. when a supplier has two different offers for the same part (possibly with different delivTime and/or price) the delivPlace is the same--see offers o1 and o2 in Figure 3.

-- delivPlace functionally determines supplier (sld). It means that having a delivery place (delivPlace) we exactly now which supplier is associated to this place; although one supplier can own many delivery places. For example, in Figure 3 dl is delivery place uniquely associated to the supplier sl.

[FIGURE 1 OMITTED]

[FIGURE 2 OMITTED]

[FIGURE 3 OMITTED]

It is easily seen that schema in Figure 1 leads to redundancy: sid (and also all other data describing suppliers such as e.g.: name and address) and delivPlace are stored multiple times for a supplier.

Further on we will show that a special caution should be paid to such kind of dependencies as these in which participates delivPlace. In this case we have to do with "cyclic" dependencies, i.e. delivPlace depends on pld and sld (pld, sld [right arrow] delivPlace) and sld depends on delivPlace (delivPlace [right arrow] sId).

3 Redundancies and dependencies in relational databases

3.1 Relational schemas and functional dependencies

In relational data model, a relational schema is understood as a pair R = (U, F), where U is a finite set of attributes, and F is a set of functional dependencies over F. A functional dependence (FD) as an expression of he form

X [right arrow] Y,

where X, Y [[subset].bar] U are subsets of U. If Y [[subset].bar] X, then X [right arrow] Y is a trivial FD. By [F.sup.+] we denote all dependencies which can be infer from F using, for example, Armstrong's axioms [21, ?].

A relation of type U is a finite set of tuples of type U. Let U = {[A.sub.1], ..., [A.sub.n]} and dom (A) be the domain of attribute A [member of] U. Then a tuple [[A.sub.1] : [a.sub.1], ..., [A.sub.n] : [a.sub.n]], where [a.sub.i] [member of] dom([A.sub.i]), is a tuple of type U.

A relation R conforms to a schema R = (U, F) (is an instance of this schema) if R is of type U, and all dependencies from [F.sup.+] are satisfied by R. An FD X [right arrow] Y is satisfied by R, denoted R [??] X [right arrow] Y, if for each tuples [r.sup.1] , [r.sup.2] [member of] R holds

[[pi].sub.X] ([r.sub.1]) = [[pi].sub.X] ([r.sub.2]) [??][[pi].sub.Y] ([r.sub.1]) [[pi].sub.Y] ([r.sub.2]),

where [[pi].sub.X] (r) is the projection of tuple r on the set X of attributes.

A key in R = (U, F) is such a minimal set K of attributes that K [right arrow] U is in [F.sup.+]. Then each A [member of] K is called a prime attribute.

3.2 Normalization of relational schemas

The main task in relational schema normalization is producing such a set of schemas that posses the required form, usually 3NF or BCNF. The normalization process consists in decomposition of a given input schema. The other approach consists in synthesizing 3NF from functional dependencies [22].

Ideally, a decomposition of a schema should be loss-less, i.e. should preserve data and dependencies. Let R = (U, F), [U.sub.1], [U.sub.2] [[subset].bar] U, and [U.sub.1] [union] [U.sub.2] = U, then schemas [R.sub.1] = ([U.sub.1], [F.sub.1]) and [R.sub.2] = ([U.sub.2], [F.sub.2]) are a lossless decomposition of R = (U, F), iff:

-- The decomposition preserves data, i.e. for each instance R of R the natural join of projections of R on [U.sub.1] and [U.sub.2] produces the relation equal to R, i.e.

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

- The decomposition preserves dependencies, i.e.

[F.sup.+] = [([F.sub.1] [union] [F.sub.2]).sup.+],

where [F.sub.1] = {X [right arrow] Y |X [right arrow] Y [member of] F [and] X [union] Y [[subset].bar] [U.sub.1]}, and similarly for [F.sub.2].

The decomposition (([U.sub.1], [F.sub.1]), ([U.sub.2], [F.sub.2])) of (U,F) preserves data, if [U.sub.1] [intersection] [U.sub.2] [right arrow] [U.sub.1] [member of] [F.sup.+] (or, symmetrically, [U.sub.1] [intersection] [U.sub.2] [right arrow] [U.sub.2] [member of] [F.sup.+] [23]. Then we say that the decomposition is determined by the functional dependence [U.sub.1] [intersection] [U.sub.2] [right arrow] [U.sub.1] [member of] [F.sup.+]

A schema R = (U, F) is in 3NF if for every FD X [right arrow] A [member of] [F.sup.+], holds:

-- X is a superkey, i.e. a key is a part of X, or

-- A is prime.

The second condition says that only prime attributes may be functionally dependent on a set of attributes which is not a key. A schema is in BCNF if only the first condition of the two above is allowed. It means, that if whenever a set X determines functionally an attribute A, then X is a superkey, i.e. determines the whole set U.

The aim of a normalization process is to develop normal forms by analyzing functional dependencies and successive decomposition of the input relational schema into its projections. In this way a well-designed schema can be obtained, where unnecessary redundancies and update anomalies had been eliminated. In practice, 3NF is accepted as the most desirable form of relational schemas It does not eliminate all redundancies but guaranties dependency preservation. On contrast, BCNF eliminates all redundancies but does not preserve all dependencies.

In [6] it was shown that 3NF has the least amount of redundancy among all dependency preserving normal forms. The research adopts a recently proposed informationtheoretic framework for reasoning about database designs [14].

3.3 Relational analysis of XML schema

Let us consider the relational representation of the data structure presented in Figure 1.

Then we have the following relational schema:

R = (U, F), where

U = {oId, sld, pld, price, delivTime, delivPlace},

F = {oId -+ sld, pId, price, delivTime, delivPlace, sId, pld --+ delivPlace, delivPlace --+ sId}.

In R there is only one key. The key consists of one attribute oId since all attributes in U functionally depends on old. Thus, R is in 2NF and oId is the only prime (key) attribute in R. Additionally, we assume that a given supplier delivers a given part exactly to one place (pId, sId [right arrow] delivPlace). Moreover, delivery place delivPlaee is connected with only one supplier (delivPlace [right arrow] sld).

R is not in 3NF because of the functional dependency sId, pId [right arrow] delivPlace:

-- sId, pId is not a superkey, and

-- delivPlace is not a prime attribute in U.

Similarly for delivPlace [right arrow] sId.

In this case the lack of 3NF is the source of redundancies and update anomalies. Indeed, for example, the value of delivPlace will be repeated as many times as many different tuples with the same value of the pair (sId, pId) exist in the instance of R. To eliminate this drawback, we can decompose R into two relational schemas, [R.sub.1] and [R.sub.2], which are in 3NF. The decomposition must be based on the dependency sId, pId [right arrow] delivPlace which guarantees that the decomposition preserves data. In the result we obtain:

[R.sub.1] = ([U.sub.1], [F.sub.1]), where

[U.sub.1] = {oId, sId, pId, price, delivTime},

[F.sub.1] = {oId [right arrow] sId, pId, price, delivTime}.

[R.sub.2] = ([U.sub.2], [F.sub.2]), where

[U.sub.2] = {sId, pId, delivPlace},

[F.sub.2] = {sld, pld [right arrow] delivPlace, delivPlace [right arow] sId}.

The discussed decomposition is both data and dependencies preserving, since:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

for every instance R of schema R, and F = [([F.sub.1] U [F.sub.2]).sup.+]. However, we see that [R.sub.2] is not in BCNF, since delivPlace is not a superkey in [R.sub.2].

The lack of BCNF in [R.sub.2] is the reason of redundancies. For example, in table [R.sub.2] we have as many duplicates of sId as many tuples with the same value of delivPlace exist in this table.

We can further decompose [R.sub.2] into BCNF schemas [R.sub.21] and [R.sub.22], taking delivPlace [right arrow] sId as the base for the decomposition. Then we obtain:

[R.sub.21] = ([U.sub.21],[F.sub.21]), where

[U.sub.21] = {delivPlace, sid},

[F.sub.21] = {delivPlace [right arrow] sId}.

[R.sub.22] = ([U.sub.22], [F.sub.22]), where

[U.sub.22] = {pId, delivPlace},

[F.sub.22] = [PHI].

After applying this decomposition to [R.sub.2] we obtain tables [R.sub.21] and [R.sub.22]:

This decomposition is information preserving, i.e.

[R.sub.2] = [R.sub.21] [??] [R.sub.22],

but does not preserve functional dependencies, i.e.

[F.sub.2] [not equal to] [([F.sub.21] [union] [F.sub.22]).sup.+) = [F.sub.21]

We can observe some negative consequences of the loss of functional dependencies in the result decomposition.

Assume that we insert the tuple (pl, d2) into [R.sub.22]. The tuple will be inserted because it does not violate any constrain imposed on [R.sub.22]. However, taking into account table [R.sub.21] we see that supplier s1 (determined by d2 in force of delivPlace [right arrow] sId) offers part pl in the place dl. Thus, the considered insertion violates functional dependency sId, pId [arrow] delivPlace defined in [R.sub.2].

The considered example shows that in the case of relational databases we are not able to completely eliminate redundancies and also preserve all functional dependencies. It turns out ([6]) that the best form for relation schema is 3NF, although some redundancies in tables having this form can still remain.

In next section we will show that the hierarchical structure of XML documents can be used to overcome some of the limitations of relational normal forms [24]. As it was shown in [5], there are decompositions of XML schemas that are both information and dependency preserving. In particular, we can obtain a form of XML schema that is equivalent to BCNF, i.e. eliminates all redundancies, and additionally preserves all XML functional dependencies.

4 XML schemas and instances

Schemas for XML data are usually specified by DTD or XSD [25, 26]. In this paper an XML schema (a schema for short) will be specified by means of a slightly simplified version of DTD. We will assume that both attributes and elements of type #PCDATA will be represented by so called terminal elements labeled with terminal labels and having text values. Additionally, terminal elements may have only one occurrence within children of one non-terminal element.

Definition 4.1. Let L be a set of labels, Ter [subset] L be a set of terminal labels and r [member of] L - Ter be a root label. Let p be a set of productions of the form:

l [right arrow] [alpha],

where:

--l [member of] L - Ter;

--r does not occur on the right-hand side of any production;

--a is a regular expression over L - {r} defined as follows:

[alpha] ::= [beta]|[gamma]

[beta] ::= 1/[beta]? | [beta] * | [beta]+

[gamma] ::= A | A? | [beta] | [gamma][gamma] | [gamma] | [gamma]

l [member of] L- Ter- {r}, A [member of] Ter.

Then the quadruple

S = (r, L, Ter, p),

is an XML schema:

XML schemas will be often represented as XML schema trees (Figure 1) (or as XML schema graphs when the schema is recursive). In this paper we restrict ourselves to non-recursive schemas. An example of XML schema, corresponding to the schema tree in Figure 1, was given in Figure 2, where: db is the root, part, supplier, and offer are non-terminal labels, and pid, oid, sid, price, delivTime, delivPlace are terminal labels.

The following notion of tree patterns [19] will be useful to define tree tuples, tree formulas and XML functional dependencies.

Definition 4.2. Let L be a set of labels, and Ter [subset] L be its subset of terminal labels. A tree pattern is an expression conforming to the syntax:

e ::= A | l | l/e | l[[e.sub.1], ..., [e.sub.k]] | l[[e.sub.1], ..., [e.sub.k]]/e,

where A [member of] Ter, l [member of] L -- Ter, n [less than or equal to] 1.

To denote that a tree pattern [phi] is build using the set {[A.sub.1], ..., [A.sub.n]} of terminal labels, we will write [phi]([A.sub.1], ..., [A.sub.n]).

Definition 4.3. Let [phi]([A.sub.1], ..., [A.sub.n]) be a tree pattern, and [x.sub.1], ..., [x.sub.n] be text-valued variables. Then the expression

[phi]([A.sub.1] : [x.sub.1],...,[A.sub.n] : [X.sub.n]),

is a tree formula.

Definition 4.4. Let [phi]([A.sub.1] : [x.sub.1], ..., [A.sub.n] : [X.sub.n]) be a tree formula, and [omega] be a variable valuation, i.e. a function

w : {[x.sub.1], ..., [X.sub.n]} [right arrow] Str [union] {[perpendicular to]},

where Str is a set of text values, and [perpendicular to] is a distinguished null value. Then

t = [phi]([A.sub.1]: [x.sub.1], ..., [A.sub.n]: [X.sub.n])([omega]) = [phi]([A.sub.1] : [omega]([x.sub.1]), ..., [A.sub.n]: [omega]([x.sub.n]))

is called a tree tuple of type [phi]([A.sub.1], ..., [A.sub.n]) with values ([A.sub.1]: [omega]([x.sub.1]), ..., [A.sub.n] : [omega]([X.sub.n]).

Without loss of generality, a tree tuple of type [phi]([A.sub.1], ..., [A.sub.n]) will be denoted also as:

t = [phi]([A.sub.1] : [a.sub.1], ..., [A.sub.n] : [a.sub.n]),

t = [phi]([A.sub.1], ..., [A.sub.n])([omega]),

t = [phi]([omega]([A.sub.1]), ..., [omega]([A.sub.n])),

t = [phi]([A.sub.1] : t. [A.sub.1], ..., [A.sub.n] :t.[A.sub.n]).

Definition 4.5. Let t be a tree tuple of type [phi](U), and let U' = {[A.sub.1], ..., [A.sub.n]} [subset or equal to] U. The projection of t on U', denoted [pi]U',(t), is the set of fields ([A.sub.i] : t.[A.sub.i]) occurring in t, i.e.:

[pi]U',(t) = {[A.sub.1] : t.[A.sub.1],...,[A.sub.n] : t.[A.sub.n]}.

We assume that there is a function type(t) that for a tree tuple t returns its type, it will be denoted by

type(t) = [phi]([A.sub.1], ..., [A.sub.n]).

An XML database consists of a set of XML data, and an XML data is an instance of an XML schema. It will be convenient to distinguish between two kinds of instances:

--an instance as a set of tree tuples, referred to as a canonical instance,

--an instance as a labeled tree, referred as an XML tree.

For one canonical instance there may be a number of XML trees representing it. In the set of all such XML trees we will be interested in such that have a required form, so called XML normal form ([XNF). The canonical instance as well as all corresponding XML trees must conform to the given XML schema.

Definition 4.6. The set of tree tuples

D = {[t.sub.1], ..., [t.sub.N]},

is a canonical instance of an XML schema S = (r, L, Ter, p) if the type, type(t), of any tuple t [member of] D conforms to S.

The conformance of a tree tuple type to an XML schema is defined as follows.

Definition 4.7. Let [phi]([A.sub.1], ..., [A.sub.n]) be the type of a tree tuple t. This tree pattern conforms to the XML schema S = (r, L, Ter, p), if:

1. [phi]([A.sub.1], ..., [A.sub.n]) = r[e], and

2. e conforms to p according to r. The conformance of a pattern e to the set of productions p according to a label l, is defined as follows:

--if e is [A.sub.i], then [A.sub.i] [member of] Ter, and there is such the production l [right arrow] [alpha] in p that [A.sub.i] occurs in [alpha];

--if e is l'/e', then there is such the production l [right arrow] [alpha] in p that l' occurs in a, and e' conforms to p according to l';

--if e is l'[[e.sub.1], ..., [e.sub.n]], then there is such the production l [right arrow] [alpha] in p that l' occurs in a, and every pattern [e.sub.i], 1 [less than or equal to] i [less than or equal to] n, conforms to p according to l'.

--if e is l'[[e.sub.1], ..., [e.sub.n]/e', then l'[[e.sub.1], ..., [e.sub.n]] must conform to p according to l, and e' must conform to p according to l'.

We define XML tree as an ordered rooted node-labeled tree over a set L of labels, and a set Str [union] {[perpendicular to]}, which elements are used as values of terminal nodes.

Definition 4.8. An XML tree I is a tuple (root, [N.sup.e], [N.sup.t], child, [lambda], v), where:

--root is a distinguished root node, [N.sup.e] is a finite set of non-terminal element nodes, and [N.sup.t] is a finite set of terminal nodes;

--child [subset or equal to] ({r} [union] [N.sup.e]) x ([N.sup.e] [union] [N.sup.t])--a relation introducing tree structure into the set {r} [union] [N.sup.e] [union] [N.sup.t], where r is the root, each non-terminal node has at least one child, terminal nodes are leaves;

--[lambda] : {root} [union] [N.sup.e] [union] [N.sup.t] [right arrow] L--a function labeling nodes with labels;

--v : [N.sup.t] [right arrow] Str [union] {[perpendicular to]}--a function assigning terminal nodes with values.

Definition 4.9. We say that an XML tree I = (root, [N.sup.e], [N.sup.t], child, [lambda], v) conforms to an XML schema S = (r, L, Ter, p), denoted I [??] S, if

--[lambda](root) = r; if n [member of] [N.sup.e], then [lambda](n) [member of] L - Ter; if n [member of] [N.sup.t], then [lambda](n) [member of] Ter;

--if [lambda](n) = l, l [right arrow] [alpha] [member of] p, and [n.sub.1], ..., [n.sub.k] are children of n, then the sequence [lambda] ([n.sub.1]) ... [lambda]([n.sub.k]) is a word in the language generated by [alpha].

It will be useful to perceive an XML canonical instance D with tuples of type [phi] as a pair ([phi], [OMEGA]) (called a description), where [OMEGA] is a set of valuations for [phi].

Example 4.1. For the instance [I.sub.1] in Figure 3 we have: ([phi] (pId, oId, sId, price, delivTime, delivPlace) {(p1, o1, s1, x1, t1, d1), (p1, o2, s1, x2, t2, dl), (p1, o3, s2, x3, t3, d2), (p2, o4, s1, x4, t4, d1)}).

An XML tree I satisfies a description ([phi],[OMEGA]), if the root of I satisfies [phi] for every valuation [omega] [member of] [OMEGA] , where:

1. (I, root) [??] (r[e],[omega]), iff [there exists]n [member of] [N.sup.e] child(r, n) A (I, n) [??] (e, [omega]);

2. (I, n) [??] (A, [omega]), iff [lambda](n) = A and v(n) = [omega](A).

3. (I, n) [??] (l/e, [omega]) = iff [lambda](n) = l and [there exists]n' [member of] [N.sup.e] child(n, n') [conjunction] (I, n') [??] (e, [omega])

4. (I, n) [??] (l[[e.sub.l], ..., [e.sub.k]], [omega]), iff [lambda](n) = l and for each i, 1 [less than or equal to] i [less than or equal to] k, exists [n.sub.i] such that child(n, [n.sub.i]) [conjunction] (I, [n.sub.i]) [??] ([e.sub.1], [omega]));

5. (I, n) [??] (l[[e.sub.1], ..., [e.sub.k]]/e,[omega]), iff (I,n) [??] (l[[e.sub.l], ..., [e.sub.k]], [omega]) and exists n' such that child(n, n') [conjunction] (I, n') [??] (e, [omega])).

A description ([phi],[OMEGA]) represents a class of [phi] instances with the same set of values (the same [OMEGA]), since elements in instance trees can be grouped and nested in different ways.

[FIGURE 4 OMITTED]

For example, the XML tree in Figure 4 satisfies (among others) the following two descriptions ([[phi].sub.1], [[OMEGA].sub.1]), and ([[phi].sub.2], [[OMEGA.].sub.2]), where:

[[phi].sub.1] (B, C) = /A[B,C], [[OMEGA].sub.1] = {(b, c1), (b, c2)}; [[phi].sub.2](B,C,D) = /A[B,C,D], [[OMEGA].sub.2] = {(b, c1, d1), (b, c2, d1)}.

5 XML functional dependencies and keys

Over an XML schema we can define some constraints, which specify functional dependencies between values and/or nodes in instances of the schema. These constraints are called XML functional dependencies (XFD).

Definition 5.1. A tree pattern of the form [phi]([A.sub.1], ..., [A.sub.k])/A conforming to an XML schema S = (r, L, Ter, p) is an XML functional dependency (XFD) over S. Then we say that the set of paths terminating in ([A.sub.1], ..., [A.sub.k]) and satisfying [phi], determines the path terminating in A and satisfying [phi]/A.

Satisfaction of an XFD is defined against an canonical instance of XML schema.

Definition 5.2. Let S be an XML schema, D be a canonical instance of S and f = [phi]([A.sub.1], ..., [A.sub.k])/A be an XFD on S. We say the D satisfies f, denoted D [??] f if for any two tree tuples [t.sub.1], [t.sub.2] [member of] D the following implication holds

[[pi].sub.[A.sub.1], ..., [A.sub.k]]([t.sub.1]) = [[pi].sub.[A.sub.1], ..., [A.sub.k]]([t.sub.2]) [??] [[pi].sub.A]([t.sub.1]) = [[pi]A([t.sub.2]).

Assume that [A.sub.1], ..., [A.sub.k],A terminates, respectively, paths [p.sub.1], ..., [p.sub.n],p. Then the XPath-oriented XFD [phi]([A.sub.1], ..., [A.sub.k])/A corresponds to the following path-oriented XFD defined in [1]:

[p.sub.1].[A.sub.1], ..., [p.sub.k].[A.sub.k] [right arrow] p.A..

The XPath-oriented definition has the following advantages:

--it is easy to check, using only the XPath semantics, whether an XML tree satisfies the XFD or not (see below),

--this form of XFD can be used to generate an XQury program performing some transformation operations (see the next section).

An XFD f can be interpreted as XPath expression, i.e. f([x.sub.1], ..., [x.sub.k]) [??] [phi]([A.sub.1] : [x.sub.1], ..., [A.sub.k] : [x.sub.k])/A, that for a given valuation [omega] of its variables returns a sequence of objects (nodes or text values).

An XML tree I = (S, [OMEGA]) satisfies an XFD f(x) if for each valuation [omega] [member of] [OMEGA] of its variables, f([omega](x)) evaluated on I returns an empty sequence or a singleton, i.e.

count ([??]f([omega](x))(I)[??]) [less than or equal to] 1,

where [??]expr[??] (I) is the result of evaluation expr on the instance I.

An XML tree I = (S, [OMEGA]) satisfies an XFD f if for each valuation [omega] [member of] [OMEGA], f([omega]) evaluated on I returns an empty sequence or a singleton, i.e.

count([??]f([omega])(I)[??]) [less than or equal to] 1,

where [??]expr[??] (I) is the result of evaluation expr on the instance I.

Example 5.1. Over [S.sub.1] the following XFDs can be defined:

[f.sub.1] (oid) = /db/part[supplier/offer/oId], [f.sub.2] (oid) = /db/part [supplier/offer/oId]/pId, [f.sub.3] (pid, sid) = /db/part[pId]/supplier/ offer [sId]/ delivPlace, [f.sub.4] (delivPlace) = /db/part/supplier/offer[ delivPlace]/sId, [f.sub.5] (pid, sid) = /db/part[pId]/supplier[offer/sId].

According to XPath semantics [27] the expression [f.sub.1] (oid) by a valuation [omega], is evaluated against the instance [I.sub.1] (Figure 3) as follows: (1) first, a sequence of nodes of type /db/part is chosen; (2) next, for each selected node the predicate [supplier/offer/old = [omega](oid)] is tested, this predicate is true in a node n, if there exists a path of type supplier/offer/oId in [I.sub.1] leading from n to a text node with the value [omega](oid). We see that count([??][f.sub.1]([omega](oid))[??]) equals 1 for all four valuations satisfied by [I.sub.1], i.e. for oid [??] [o.sub.1], oid [??] [o.sub.2], oid [??] [o.sub.3], and oid [??] [o.sub.4].

Similarly, execution of [f.sub.2](oid)(I) gives a singleton for any valuation of oid. These singletons are text values of the path/db/part/pId, where only nodes satisfying the predicate [supplier/offer/oId = [omega](oid)] are taken from the set of nodes determined by/db/part, So, also this XFD is satisfied by [I.sub.1].

However, none of the following XFDs is satisfied in [I.sub.1]:

[g.sub.1] ( sid) = db/part[supplier/offer/sId], [g.sub.2] (pid) =/db/part[pId]/supplier/offer/sId [g.sub.3] (pid) =/db/part[pId]/supplier/offer [g.sub.4] (pid, sid) =/db/part[pId]/supplier/offer[sId], [g.sub.5] (delivPlace) =/db/part/pId/supplier/ offer [delivPlace].

Evaluating the above XFDs against [I.sub.1], we obtain, for example:

count[??][g.sub.1](sid)([sid [??] sl])([I.sub.1])[??]) = 2, count([??][g.sub.2](pid)([pid [??] [p.sub.1]])[??]) = 2, count([??][g.sub.3](pid)([gid [??] [p.sub.l]])[??]) = 3.

An XFD can determine functional relationship between a tuple of text values of a given tuple of paths and a path denoting either a text value (e.g. [f.sub.2](oid)) or a subtree (a node being the root of the subtree) (e.g. [f.sub.1] (oid)) [??] the latter XFDs will be referred to as XML keys.

Definition 5.3. A functional dependence f = [phi]([A.sub.1], ..., [A.sub.k]), where f is of type q/l and l is a non-terminal label, is an XML key for subtrees of the type q/l.

Another notation for XML keys has been proposed in [7]. In that notation

(db-part, {pId})

is an absolute key saying that a subtree of type db/part is uniquely determined by the path db/part/pid (this constraint holds in the instance [I.sub.1] in Figure 3). In our XPath-oriented notation this key is the following XFD:

db/part[pid].

An example of a relative key ([7]) is

(db.part, (supplier, {offer.sId})),

that says that that in the context of db/part, a tree supplier is determined by the path offer/sId. In our XPath-oriented notation this key is expressed as follows:

db/part[pid]/supplier[sid].

6 Normal form for XML

To eliminate redundancies in XML documents, some normal forms (XNF) for XML schemas have been proposed [1, 24, 4, 5]. In this paper we will define XNF in the spirit of BCNF defined for relational schemas. This approach was also used in [1].

Definition 6.1. Let S be an XML schema and [summation] be a set of XFDs defined over S. (S, [summation]) is in XBF iff for any XFD [phi]([A.sub.1], ..., [A.sub.k])/A [member of] [(S, [summation]).sup.+], also [phi]([A.sub.1], ..., [A.sub.k]) [member of] [(S, [summation]).sup.+],where [(S, [summation]).sup.+] is a set of XFDs being consequences of (S, [summation]).

In other word, if [phi]([A.sub.1], ..., [A.sub.k])/A is an XFD for q/l/A, then [phi]([A.sub.1], ..., [A.sub.k]) is a key for q/l. It means that there is at most one subtree of type q/l for any different valuation of ([A.sub.1], ..., [A.sub.k]) in [phi], if the value of a child A of q/l is determined by this valuation. In this way the redundancy, i.e. repetition of values in different subtrees of type q/l, is eliminated.

Let us consider the schema [S.sub.2] in Figure 5 and its instance [I.sub.2] in Figure 6.

[FIGURE 5 OMITTED]

[FIGURE 6 OMITTED]

The following XFD over [S.sub.2]

/db/part/supplier [delivPlace]/sId

says that delivery place (delivPlace) determines the supplier (sId). However, [S.sub.2] is not in XNF, since its instance [I.sub.2] does not satisfy the key

/db/part/supplier [delivPlace].

It means that [I.sub.2] (Figure 6) is not free of redundancy (there are two different subtrees of type supplier describing the same supplier, i.e. its possible name, address, etc.).

In the case of schema [S.sub.3] (Figure 7) the corresponding XFD and the key are:

db/supplier [part/delivPlace]/sId

db/supplier [part/delivPlace].

[FIGURE 7 OMITTED]

[FIGURE 8 OMITTED]

These constraints are satisfied by [I.sub.3] (Figure 8). We see that this time, there is only one subtree of type supplier for any value of delivPlace.

The other dependency of interest is sId, pId [right arrow] delivPlace. Its specification with respect to [S.sub.2] and [S.sub.3] is as follows:

db/part[pId]/supplier [sId]/delivPlace,

and

db/supplier[sId]/part[pId]/delivPlace.

It is easy to see then if these XTDs hold, then also keys

db/part[pId]/supplier[sId],

and

db/supplier [sId]/part[pId]

are satisfied.

However, neither [S.sub.2] nor [S.sub.3] is in XNF. We have already shown that there is redundancy in instances of [S.sub.2]. Similarly, we see that also in instances of [S.sub.3] redundancies may occur. Indeed, since one part may be delivered by many suppliers then the description of one part may be multiplied under each supplier delivering this part, so such data as part name, type, manufacturer etc. will be stored many times. It results from the fact that satisfaction of db/supplier/part[pid]/pname would not imply the satisfaction of db/supplier/part[pid].

In Figure 9 there is schema [S.sub.4] that is in XNF. To make the example more illustrative, we added node name to part data. Also the instance in Figure 10 was slightly extended as compared to instances [I.sub.2] and [I.sub.3].

[FIGURE 9 OMITTED]

[FIGURE 10 OMITTED]

XSD (XML Schema Definition) for [S.sub.4] in notation proposed in [26] is shown in Figure 11.

Note that we cannot use DTD since there are two subtrees labeled part, where each of them has different type: the part subtree under supplier consists of pId and delivPlace, whereas the part subtree under parts consists of pId and name. Recall that in the case of DTD each non-terminal symbol (label) can have only one type (definition), i.e. can appear on the left-hand side of exactly one production rule [26]. This difficulty might be overcame by introducing new labels (for example partDesc) for full descriptions of a parts.

It can be shown that [S.sub.4] satisfies the condition of XNF. Thus, this schema is both redundant-free and dependency preserving. However, as we will show in the next section, schema [S.sub.4] is not the best XNF for the considered running example.

7 Transforming XML schemas to XML normal form

In the previous section we discussed an example of transforming an XML schema into XNF. We started with the schema [S.sub.1] in Figure l, and the final schema was [S.sub.4] in Figure 9. However, the final schema has been created in a rather intuitive way. Thus, although it is in XNF it is not clear whether there exists another XNF reformulation of [S.sub.1], maybe better than [X.sub.4], or not. A systematic algorithm (the Decomposition Algorithm, DA) for normalizing an XML schema was proposed in [1]. If we apply DA to [S.sub.1], we obtain a schema that is worse than [S.sub.4]. It is so due to the following reasons:

1. In DA we always obtain a normal form relative to a context path, i.e. the XNF is restricted to the subtree determined by this context path. Only if the context path is equal to the root label (db in our case), the XNF is absolute.

2. In DA it is not possible to change ordering of elements on a path, because we can only move and discard attributes or create new element types [1]. For example, the part element precedes supplier (is above it) in [S.sub.1] and in the result schema this ordering must remain unchanged. But then the absolute XNF for [S.sub.1] cannot be achieved.

3. The result of normalization strongly depends on the starting XML schema.

7.1 Transforming ER model to XNF

In this section we will discuss how to transform an ER (Entity-Relationship) schema [28] into XML schema in XML normal form (XNF). Our method is based on analyzing functional dependencies among attributes of entities involved in the ER schema.

In Figure 12 there is an ER schema corresponding to the XML schema considered in previous sections (see [S.sub.1] in Figure 1) with two additional attributes sname (supplier name) and pname (part name), delivPlace is treated as an entity with the key attribute did.

[FIGURE 12 OMITTED]

The following functional dependencies are captured by the ER schema in Figure 12:

The first three of them are of particular importance because they state constraints between entities (key attributes of entities).

We will proceed in two steps:

1. An initial XML schema is created using the entities names from the ER schema, their attributes and functional dependencies between key attributes. The initial schema must follow the necessary condition formulated in Theorem 7.3. Satisfaction of this condition is the prerequisite to obtain an XML schema in absolute XNF in the next step.

2. The DA algorithm [1] is applied to obtain the final XNF schema. In fact, only the step called Creating New Element Types from this algorithm is to be applied. In this way all violations caused by functional dependencies over non-key attributes (appearing on the right-hand sides of these dependencies) are resolved. Such functional dependencies in our example are for example the three last in (1).

Definition 7.1. Let [phi]([A.sub.1], ..., [A.sub.n])/B be an XFD of type q/l/B over an XML schema S. S is called XNF-consistent with [phi]([A.sub.1], ..., [A.sub.n])/B, if for any instance I of S holds the implication:

I [??] [phi]([A.sub.1], ..., [A.sub.n])/B [??] [merge.sub.q/l](I) [??] [phi]([A.sub.1], ..., [A.sub.n]),

where [merge.sub.q/l](I) is the result of merging subtrees of type q/l in I

Let I be an XML tree and T = (r, [tau]) be a subtree of type q/l of I, i.e. T belongs to the result of evaluation of the path expression q/1 on the instance I, T [member of] [??]q/l(I)[??]. Then r is a list of the form r := ([A.sub.1] : [a.sub.l], ..., [A.sub.n] : [a.sub.n]), and [tau] is a list of subtrees (each of these lists may be empty).

Definition 7.2. Two subtrees [T.sub.1] = ([r.sub.1], [[tau].sub.1]), and [T.sub.2] = ([r.sub.2], [[tau].sub.2]) of type q/l of an instance I are joinable if [r.sub.1] = [r.sub.2], and then:

--(r, [[tau].sub.1]) [??] (r, [[tau].sub.2]) = (r, [[tau].sub.1] [parallel] [[tau].sub.2]), where [[tau].sub.1] [parallel] [[tau].sub.2] is concatenation of lists [[tau].sub.1] and [[tau].sub.2];

--the merge operation on all subtrees of type q/l of an instance I is defined as follows:

[merge.sub.q/l] (I) = for each [T.sub.1], [T.sub.2] [member of] [??]q/l(I)[??] if [T.sub.1] and [T.sub.2] are joinable then I := I - q/l[[T.sub.1]] - q/l[[T.sub.2]] [union] q/l[[T.sub.1] [??] [T.sub.2]].

Example 7.1. Let S be defined by

l [right arrow] [l.sub.1]*

[l.sub.1] [right arrow] [A..sub.1] B [l.sub.2]*

[l.sub.2] [right arrow] [A.sub.2] C

and let terminal labels [A.sub.1], [A.sub.2] functionally determine B, i.e. [A.sub.1], [A.sub.2] [right arrow] B. Then the corresponding XFD is

[phi]([A.sub.1], [A.sub.2])/B := l/[l.sub.1][[A.sub.1], [l.sub.2][[A.sub.2]]]/B.

We see that I (Figure 13) satisfies [phi]([A.sub.1], [A.sub.2])/B, i.e. for the values of the pair of paths (l/[l.sub.1]/[A.sub.1], l/[l.sub.1]/[l.sub.2]/[A.sub.2]) the value of the path l/[l.sub.1]/B is uniquely determined. Similarly, the merged form of I, [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], satisfies [phi]([A.sub.1], [A.sub.2]), i.e. the pair (l/[l.sub.1]/[A.sub.1], l/[l.sub.1]/[l.sub.2]/[A.sub.2]) of path values uniquely determines the node of type l/[l.sub.1]. Note that I in its unmerged form does not satisfy [[phi]([A.sub.1], [A.sub.2]), because there are two nodes of type l/[l/sub.1] corresponding to the same pair of path values of the type (l/[l.sub.1]/[A.sub.1], l/[l.sub.1]/[l.sub.2]/[A.sub.2]).

[FIGURE 13 OMITTED]

The following theorem formulates the necessary condition for XML schema to be XNF-consistent with an XDF defined over this schema.

Theorem 7.3. Let f := [phi]([A.sub.1], ..., [A.sub.n])/B be an XFD of type q/l/B over an XML schema S. Then S is XNF-consistent with f if the set consisting of labels of all terminal children of q/l is functionally dependent on the set of terminal labels occurring in f .

Proof We will proof the theorem by contradiction. Let us assume that there exists a terminal child of I labeled C and C is not functionally dependent on the set {[A.sub.1], ..., [A.sub.n]} of terminal labels occurring in f. Then the tree of the form

I = [phi]([A.sub.1] : [a.sub.1], ..., [A.sub.n] : [a.sub.n]) [union] {q/l[B : b, C: [c.sub.1]], q/l[B : b, C: [c.sub.2]]},

is a subtree of an instance of S in which [phi]([A.sub.1], ..., [A.sub.n])/B is satisfied. However, this subtree violates [phi]([A.sub.1], ..., [A.sub.n]). This violation follows from the fact that there are two subtrees rooted in q/l, one with terminal children (B : b, C : [c.sub.1]), and the other with terminal children (B : b, C : [c.sub.2]), so this subtrees cannot be merged. Thus S is not XDF-consistent with f.

In the schema in the following example the necessary condition formulated in Theorem 7.3 does not hold.

Example 7.2. Let S be defined by

l [right arrow] [l.sub.1]*

[l.sub.1] [right arrow] [A.sub.1] [l.sub.2]*

[l.sub.2] [right arrow] [A.sub.2] B C

and let terminal labels [A.sub.1], [A.sub.2] functionally determine B, i.e. [A.sub.1], [A.sub.2] [right arrow] B. Then the corresponding XFD is

[phi]([A.sub.1], [A.sub.2])/B := l/[l.sub.1][[A.sub.1]]/[l.sub.2] [[A.sub.2]]/B.

Assume that C does not depend on {[A.sub.1], [A.sub.2]}. Schema S is not XNF-consistent with [phi]([A.sub.1], [A.sub.2])/B, since we have (see Figure 14):

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Thus, I violates [phi]([A.sub.1], [A.sub.2]) before and after application of the merging operation, in both cases there are two nodes of type l/[l.sub.1]/[l.sub.2] corresponding to the same pair of values ([a.sub.1], [a.sub.2]) of the pair of paths (l/[l.sub.1]/[A.sub.1], l/[l.sub.1]/[l.sub.2]/[A.sub.2]).

[FIGURE 14 OMITTED]

Now, using the Theorem 7.3 and DA, the transformation of ER into XNF is realized in the following two steps:

1. We start with functional dependencies defined over key attributes of entities modeled by ER schema. Following the requirements of Theorem 7.3 we obtain the following DTD for ER schema in Figure 12:

Now, the remaining functional dependencies specified in (1) must be taken into account.

2. The DTD obtained in the first step is the subject of the decomposition by means of the DA algorithm. It is easily seen that the XFD corresponding to pid [right arrow] pname is anomalous. The step Creating new element types of DA converts (2) into

Applying the above steps to the ER schema from Figure 12 gives the XML schema in XNF depicted in Figure 15 and its instance in Figure 16.

[FIGURE 15 OMITTED]

8 Conclusion

In this paper, we discussed how the concept of database normalization can be used in the case of XML data. Normalization is commonly used to develop a relational schema free of unnecessary redundancies and preserving all data dependencies existing in application domain. In order to apply this approach to design XML schemas, we introduced a language for expressing XML functional dependencies. In fact, this language is a class of XPath expressions, so its syntax and semantics are defined precisely. We define the notion of satisfaction of XML functional dependence by an XML tree. To define XNF we use the approach proposed in [24].

All considerations are illustrated by the running example. We discuss various issues connected with normalization and compare them with issues faced in the case of relational databases. We show how to develop redundancy-free and dependency preserving XML schema. It is worth mentioning that the relational version of the schema cannot be structured in redundancy-free and dependency preserving form. In this case, preservation of all dependencies requires 3NF but then some redundancy is present. Further normalization to BCNF eliminates redundancies but does not preserve dependencies. In the case of XML, thanks to its hierarchical nature, we can achieve both properties. However, it is not clear if this is true in all cases (see e.g. [5].

We have proposed a method for normalizing XML data in to steps. First, we build a conceptual model by means of ER schema and specify all functional dependencies among its attributes. Following the necessary condition formulated in Theorem 7.3, an initial XML schema is created. This schema is XNF-consistent with all XML functional dependencies under consideration. Such the schema can be further normalized, for example using the decomposition algorithm (DA) [1]. It was shown that in the presence of cyclic functional dependencies the procedure proposed in DA results in bad design (only a local XNF can be obtained, i.e. only subschemas of the schema can be transformed into XNF).

[FIGURE 16 OMITTED]

Acknowledgement: This work was supported in part by the Polish Ministry of Science and Higher Education under grant N N516 369536.

Received: March 4, 2009

References

[1] M. Arenas and L. Libkin, "A normal form for XML documents." ACM Trans. Database Syst., vol. 29, pp. 195-232, 2004.

[2] E. Codd, "Further normalization of the data base relational model," R. Rustin (ed.): Database Systems, Prentice Hall, and IBM Research Report RJ 909, pp. 33-64, 1972.

[3] E. F. Codd, "Recent investigations in relational data base systems," in IFIP Congress, 1974, pp. 1017-1021.

[4] S. Kolahi, "Dependency-Preserving Normalization of Relational and XML Data," Database Programming Languages, DBPL 2005, Lecture Notes in Computer Science, vol. 3774, pp. 247-261, 2005.

[5] --, "Dependency-preserving normalization of relational and XML data," Journal of Computer and System Sciences, vol. 73, no. 4, pp. 636-647, 2007.

[6] S. Kolahi and L. Libkin, "On redundancy vs dependency preservation in normalization: an information-theoretic study of 3NF," in PODS '06. ACM, New York, USA, 2006, pp. 114-123.

[7] P. Buneman, S. B. Davidson, W. Fan, C. S. Hara, and W. C. Tan, "Reasoning about keys for XML," Information Systems, vol. 28, no. 8, pp. 1037-1063, 2003.

[8] W, Fan and J. Simeon, "Integrity constraints for XML," J. Comput. Syst. Sci., vol. 66, no. 1, pp. 254-291, 2003.

[9] M. W. Vincent, J. Liu, and M. K. Mohania, "On the equivalence between FDs in XML and FDs in relations," Acta Inf., vol. 44, no. 3-4, pp. 207-247, 2007.

[10] M. W. Vincent and J. Liu, "Checking functional dependency satisfaction in XML," Database and XML Technologies--XSym 2005, Lecture Notes in Computer Science, vol. 3671, pp. 4-17, 2005.

[11] C. Yu and H. V. Jagadish, "XML schema refinement through redundancy detection and normalization," VLDB Journal, vol. 17, no. 2, pp. 203-223, 2008.

[12] R. Elmasri and S. B. Navathe, Fundamentals of Database Systems. Redwood City: The Benjamin/Cummings, 1994.

[13] S. Abiteboul, R. Hull, and V. Vianu, Foundations of Databases, Reading, Massachusetts: Addison-Wesley, 1995.

[14] M. Arenas and L. Libkin, "An information-theoretic approach to normal forms for relational and XML data," J. ACM, vol. 52, no, 2, pp. 246-283, 2005.

[15] P. A. Boncz, T. Grust, M. van Keulen, S. Manegold, J. Rittinger, and J. Teubner, "Pathfinder: XQuery-the relational way," in VLDB 2005. ACM, 2005, pp. 1322-1325.

[16] S. Pal, I. Cseri, O. Seeliger, M. Rys, G. Schaller, W. Yu, D. Tomic, A. Baras, B. Berg, D. Churin, and E. Kogan, "XQuery implementation in a relational database system" in VLDB 2005. ACM, 2005, pp, 1175-1186.

[17] E Pigozzo and E. Quintarelli, "An algorithm for generating XML Schemas from ER Schemas," in SEBD, 2005, pp. 192-199.

[18] R. Schroeder and R. dos Santos Mello, "Improving query performance on XML documents: a workload-driven design approach," in DocEng. ACM, 2008, pp. 177-186.

[19] W. Xu and Z. M. Ozsoyoglu, "Rewriting XPath queries using materialized views," in VLDB 2005, 2005, pp. 121-132.

[20] T. Pankowski, "XML data integration in SixP2P--a theoretical framework," in EDBT Workshop Data Management in P2P Systems (DAMAP 2008),ACM Digital Library, 2008, pp. 11-18.

[21] W. W. Armstrong, "Dependency structures of data base relationships," in IFIP Congress, 1974, pp. 580-583.

[22] P. A. Bernstein, "Synthesizing third normal form relations from functional dependencies" ACM Transactions on Database Systtems, vol. 1, no. 4, pp. 277-298, 1976.

[23] J. Rissanen, "Independent components of relations," ACM Transactions on Database Systems, vol. 2, no. 4, pp. 317-325, 1977.

[24] M. Arenas, "Normalization theory for XML," SIG-MOD Record, vol. 35, no. 4, pp. 57-64, 2006.

[25] XML Schema Definition Language (XSDL) 1.1, 2007, www.w3.org/TR/xmlschema 11-1.

[26] W. Martens, E Neven, and T. Schwentick, "Simple off the shelf abstractions for XML schema," SIGMOD Record, vol. 36, no. 3, pp. 15-22, 2007.

[27] XML Path Language (XPath) 2.0, 2006, www.w3.org/TR/xpath20.

[28] P. P. Chen, "The entity-relationship model - toward a unified view of data," ACM Transactions on Database Systtems, vol. 1, no. 1, pp. 9-36, 1976.

Tadeusz Pankowski

Institute of Control and Information Engineering

Poznan University of Technology

P1. M.S.-Curie 5, 60-965 Poznafi, Poland

E-mail: tadeusz.pankowski @put.poznan.pl

Tomasz Pitka

Faculty of Mathematics and Computer Science

Adam Mickiewicz University

ul. Umultowska 87, 61-614 Poznan, Poland

E-mail: tomasz.pilka@amu.edu.pl

Keywords: XML, XML functional dependencies, XML normal forms, XML schema design

Povzetek: Razvita je metoda pretvorbe XML podatkov v normalizirano obliko s poudarkom na odstranjevanju redundanc.

1 Introduction

Normal forms, as desired forms for schemas defining structures and properties of data collections, were first proposed and investigated by Codd in early 70s. The most important of them are 3NF (Third Normal Form) [2] and BCND (Boyce-Codd Normal Form) [3]), where BCNF is a stronger definition of 3NF. Definitions of normal forms are based on the functional dependencies defined among the attributes constituting the relational schema, and specify requirements to be satisfied by the set of these functional dependencies. Then any relation that is an instance of the schema and satisfies the given set of functional dependencies, is free of harmful redundancies and anomalies. In the normalization process, the initial poor designed relational schema is decomposed into an equivalent set of well-designed schemas, i.e. into schemas in desired normal forms (usually in 3NF and often also in BCNF).

Recently, as XML becomes popular as the standard data model for storing and interchanging data on the Web and more companies adopt XML as the primary data model for storing information, XML schema design has become an increasingly important issue. Thus, we observe attempts to extend normal forms and database design principles to XML databases. Research on normalization of XML data was reported in a number of papers. In [1], Arenas and Libkin extended the relational tuple-oriented definition of functional dependencies to so-called tree tuple-based functional dependencies and developed the first theory of XML functional dependencies (XFDs) and XML normal forms (XNFs). This approach was further studied by Kolahi [4, 5], and Kolahi and Libkin [6]. Integrity constraints for XML data (including keys) were studied extensively in Buneman et al. in [7], and Fan and Simeon in [8]. The equivalence between XFDs and relational FDs was investigated by Vincent et al. in [9, 10]. Yu and Jagadish proposed in [11] a new XML normal form, called GTT-XNF, that is based on Generalized Tree Tuples (GTT).

Central objectives of a good schema design is to avoid data redundancies and to preserve dependencies enforced by the application domain (these dependencies are formalized by means of functional dependencies). Existence of redundancy can lead not only to a higher data storage cost but also to increased costs for data transfer and data manipulation. It can also lead to update anomalies [12].

One strategy to avoid data redundancies is to design redundancy-free schema. One can start from an intuitively correct XML schema and a given set of functional dependencies reflecting some rules existing in application domain. Then the schema is normalized, i.e. restructured, in such a way that the newly obtained schema has no redundancy, preserves all data (is a lossless decomposition) and preserves all dependencies. In general, obtaining all of these three objectives is not always possible, as has been shown for relational schemas [13]. However, in the case of XML schema, especially thanks to its hierarchical structure, this goal can be more often achieved [4].

1.1 Related work

An algorithm, called the Decomposition Algorithm (DA) normalizing XML schemas was proposed in [1]. The algorithm converts any DTD, given a set of XML functional dependencies (XFDs), into DTD in XML normal form (XNF). The decomposition algorithm consists of two basic operations: moving attributes, and creating new element types. These operations are performed when an XFD violating XNF is identified. Thus, the basic idea is similar to normalization of relational data, when the second normal form (2NF), the third normal form (3NF), or the Boyce-Codd normal form (BCND) are to be achieved. In the relational counterparts during the normalization process a relational schema is decomposed into a set of its projections. Thus, we can obtain a set of separate relational schemas as the result of the normalization process performed for an initial relational schema. In the case of XML documents, the result is still one XML document restructured accordingly. In [14], an information-theoretic approach to normal forms for relational and XML data has been developed. Some other papers, e.g. [5, 15, 16, 6] study XML design and normalization for native or relational storage of XML documents. In [17, 18] approaches for obtaining well-designed XML schemas from conceptual ER (Entity-Relationship) schemas have been discussed.

1.2 Contribution

In this paper we describe a systematic approach to the normalization of XML data when so-called cyclic functional dependencies exist, i.e. dependencies, which in the relational case are functional dependencies of the form {A, B [right arrow] C, C [right arrow] A}, where A, B, C are attributes in a relational schema R(A, B, C). It is well known from relational database theory [13], that the schema R(A, B, C) is then in 3NE but is not in BCNE As a result, instances of R(A, B, C) have redundancies, but decomposition the schema into BCNF leads to two schemas RI(C, A) and R2(C, B), which are free of redundancy but do not preserve dependency A, B [right arrow] C.

We focus on normalization procedure for XML schemas with cyclic XFDs. The contributions of the paper are the following:

-- We use a language based on tree patterns [19, 20] to express XML schemas and XML functional dependencies. This notation is used in the formal analysis of properties of XML normal form as well as the base for developing transformation algorithms.

-- We propose an approach to obtain XNF starting from ER schema. In the first step the ER schema is converted into an XML schema satisfying a necessary conditions (see Theorem 7.3) which is a prerequisite to successful applying of DA algorithm [1]. We show that the presence of cyclic dependencies results in a bad behavior of the DA algorithm.

This paper is organized as follows. In Section 2 we introduce a running example and motivate the research. In Section 3 a relational form of the running example is considered and some problems with its normalization are discussed. Basic notations relevant to the discussed issue from the XML perspective, are introduced in Section 4. We define tree patterns and use them to formal definition of tree tuples and instances. In Section 5, tree patterns are used to define data dependencies: XML functional dependencies (XFDs) and keys (a subclass of XFDs). In Section 6, an XML normal form (XNF) is defined (according to [1]) and we show how this form can be obtained for our running example. We discuss different normalization alternatives--we show advantages and drawbacks of some schema choices. A method for transforming an XML schema into XNF is proposed in Section 7. First, for the XML schema the conceptual model in a form of ER schema is created. This schema and functional dependencies among its attributes are the basis for creating an initial XML schema satisfying the necessary condition formulated in Theorem 7.3. This schema is the subject for further normalization. Section 8 concludes the paper.

2 Redundancies in XML data motivation example

Our primary goal is to devise methods which will allow checking correctness of XML data and designing its expected (normalized) form. We expect such data to be devoid of the redundancies and immune to the update anomalies.

In the case of XML schemas some redundancy problems may occur because of bad design of hierarchical structure of XML document. On the other hand, the hierarchical structure of this data can sometimes help conduct the normalization of XML data.

Example 2.1. Let us consider an XML schema tree (Figure 1) that describes a fragment of a database for storing data about parts and suppliers offering these parts. Its DTD specification is given in Figure 2, and an instance of this schema is depicted in Figure 3. Each part element has identifier pId. One part may be offered by zero or more suppliers. Offers are stored in offer elements. Each offer has: offer identifier o I d, supplier identifier s I d, price price, delivery time delivTime, and delivery place delivPlace.

We assume that the following constraints must be satisfied by any instance of this schema:

-- all offer children of the same supplier must have the same values of sld; this is similar to relational functional dependencies, but now we refer to both the values (text value of sId), and to structure (children of the same supplier).

-- delivPlace functionally depends on part (pId) and supplier (sId), i.e. when a supplier has two different offers for the same part (possibly with different delivTime and/or price) the delivPlace is the same--see offers o1 and o2 in Figure 3.

-- delivPlace functionally determines supplier (sld). It means that having a delivery place (delivPlace) we exactly now which supplier is associated to this place; although one supplier can own many delivery places. For example, in Figure 3 dl is delivery place uniquely associated to the supplier sl.

[FIGURE 1 OMITTED]

[FIGURE 2 OMITTED]

[FIGURE 3 OMITTED]

It is easily seen that schema in Figure 1 leads to redundancy: sid (and also all other data describing suppliers such as e.g.: name and address) and delivPlace are stored multiple times for a supplier.

Further on we will show that a special caution should be paid to such kind of dependencies as these in which participates delivPlace. In this case we have to do with "cyclic" dependencies, i.e. delivPlace depends on pld and sld (pld, sld [right arrow] delivPlace) and sld depends on delivPlace (delivPlace [right arrow] sId).

3 Redundancies and dependencies in relational databases

3.1 Relational schemas and functional dependencies

In relational data model, a relational schema is understood as a pair R = (U, F), where U is a finite set of attributes, and F is a set of functional dependencies over F. A functional dependence (FD) as an expression of he form

X [right arrow] Y,

where X, Y [[subset].bar] U are subsets of U. If Y [[subset].bar] X, then X [right arrow] Y is a trivial FD. By [F.sup.+] we denote all dependencies which can be infer from F using, for example, Armstrong's axioms [21, ?].

A relation of type U is a finite set of tuples of type U. Let U = {[A.sub.1], ..., [A.sub.n]} and dom (A) be the domain of attribute A [member of] U. Then a tuple [[A.sub.1] : [a.sub.1], ..., [A.sub.n] : [a.sub.n]], where [a.sub.i] [member of] dom([A.sub.i]), is a tuple of type U.

A relation R conforms to a schema R = (U, F) (is an instance of this schema) if R is of type U, and all dependencies from [F.sup.+] are satisfied by R. An FD X [right arrow] Y is satisfied by R, denoted R [??] X [right arrow] Y, if for each tuples [r.sup.1] , [r.sup.2] [member of] R holds

[[pi].sub.X] ([r.sub.1]) = [[pi].sub.X] ([r.sub.2]) [??][[pi].sub.Y] ([r.sub.1]) [[pi].sub.Y] ([r.sub.2]),

where [[pi].sub.X] (r) is the projection of tuple r on the set X of attributes.

A key in R = (U, F) is such a minimal set K of attributes that K [right arrow] U is in [F.sup.+]. Then each A [member of] K is called a prime attribute.

3.2 Normalization of relational schemas

The main task in relational schema normalization is producing such a set of schemas that posses the required form, usually 3NF or BCNF. The normalization process consists in decomposition of a given input schema. The other approach consists in synthesizing 3NF from functional dependencies [22].

Ideally, a decomposition of a schema should be loss-less, i.e. should preserve data and dependencies. Let R = (U, F), [U.sub.1], [U.sub.2] [[subset].bar] U, and [U.sub.1] [union] [U.sub.2] = U, then schemas [R.sub.1] = ([U.sub.1], [F.sub.1]) and [R.sub.2] = ([U.sub.2], [F.sub.2]) are a lossless decomposition of R = (U, F), iff:

-- The decomposition preserves data, i.e. for each instance R of R the natural join of projections of R on [U.sub.1] and [U.sub.2] produces the relation equal to R, i.e.

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

- The decomposition preserves dependencies, i.e.

[F.sup.+] = [([F.sub.1] [union] [F.sub.2]).sup.+],

where [F.sub.1] = {X [right arrow] Y |X [right arrow] Y [member of] F [and] X [union] Y [[subset].bar] [U.sub.1]}, and similarly for [F.sub.2].

The decomposition (([U.sub.1], [F.sub.1]), ([U.sub.2], [F.sub.2])) of (U,F) preserves data, if [U.sub.1] [intersection] [U.sub.2] [right arrow] [U.sub.1] [member of] [F.sup.+] (or, symmetrically, [U.sub.1] [intersection] [U.sub.2] [right arrow] [U.sub.2] [member of] [F.sup.+] [23]. Then we say that the decomposition is determined by the functional dependence [U.sub.1] [intersection] [U.sub.2] [right arrow] [U.sub.1] [member of] [F.sup.+]

A schema R = (U, F) is in 3NF if for every FD X [right arrow] A [member of] [F.sup.+], holds:

-- X is a superkey, i.e. a key is a part of X, or

-- A is prime.

The second condition says that only prime attributes may be functionally dependent on a set of attributes which is not a key. A schema is in BCNF if only the first condition of the two above is allowed. It means, that if whenever a set X determines functionally an attribute A, then X is a superkey, i.e. determines the whole set U.

The aim of a normalization process is to develop normal forms by analyzing functional dependencies and successive decomposition of the input relational schema into its projections. In this way a well-designed schema can be obtained, where unnecessary redundancies and update anomalies had been eliminated. In practice, 3NF is accepted as the most desirable form of relational schemas It does not eliminate all redundancies but guaranties dependency preservation. On contrast, BCNF eliminates all redundancies but does not preserve all dependencies.

In [6] it was shown that 3NF has the least amount of redundancy among all dependency preserving normal forms. The research adopts a recently proposed informationtheoretic framework for reasoning about database designs [14].

3.3 Relational analysis of XML schema

Let us consider the relational representation of the data structure presented in Figure 1.

Then we have the following relational schema:

R = (U, F), where

U = {oId, sld, pld, price, delivTime, delivPlace},

F = {oId -+ sld, pId, price, delivTime, delivPlace, sId, pld --+ delivPlace, delivPlace --+ sId}.

In R there is only one key. The key consists of one attribute oId since all attributes in U functionally depends on old. Thus, R is in 2NF and oId is the only prime (key) attribute in R. Additionally, we assume that a given supplier delivers a given part exactly to one place (pId, sId [right arrow] delivPlace). Moreover, delivery place delivPlaee is connected with only one supplier (delivPlace [right arrow] sld).

R is not in 3NF because of the functional dependency sId, pId [right arrow] delivPlace:

-- sId, pId is not a superkey, and

-- delivPlace is not a prime attribute in U.

Similarly for delivPlace [right arrow] sId.

In this case the lack of 3NF is the source of redundancies and update anomalies. Indeed, for example, the value of delivPlace will be repeated as many times as many different tuples with the same value of the pair (sId, pId) exist in the instance of R. To eliminate this drawback, we can decompose R into two relational schemas, [R.sub.1] and [R.sub.2], which are in 3NF. The decomposition must be based on the dependency sId, pId [right arrow] delivPlace which guarantees that the decomposition preserves data. In the result we obtain:

[R.sub.1] = ([U.sub.1], [F.sub.1]), where

[U.sub.1] = {oId, sId, pId, price, delivTime},

[F.sub.1] = {oId [right arrow] sId, pId, price, delivTime}.

[R.sub.2] = ([U.sub.2], [F.sub.2]), where

[U.sub.2] = {sId, pId, delivPlace},

[F.sub.2] = {sld, pld [right arrow] delivPlace, delivPlace [right arow] sId}.

The discussed decomposition is both data and dependencies preserving, since:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

for every instance R of schema R, and F = [([F.sub.1] U [F.sub.2]).sup.+]. However, we see that [R.sub.2] is not in BCNF, since delivPlace is not a superkey in [R.sub.2].

The lack of BCNF in [R.sub.2] is the reason of redundancies. For example, in table [R.sub.2] we have as many duplicates of sId as many tuples with the same value of delivPlace exist in this table.

[R.sub.21] sId pId delivPlace s1 p1 d1 s1 p2 d2 s1 p3 d2 s2 p1 d3

We can further decompose [R.sub.2] into BCNF schemas [R.sub.21] and [R.sub.22], taking delivPlace [right arrow] sId as the base for the decomposition. Then we obtain:

[R.sub.21] = ([U.sub.21],[F.sub.21]), where

[U.sub.21] = {delivPlace, sid},

[F.sub.21] = {delivPlace [right arrow] sId}.

[R.sub.22] = ([U.sub.22], [F.sub.22]), where

[U.sub.22] = {pId, delivPlace},

[F.sub.22] = [PHI].

After applying this decomposition to [R.sub.2] we obtain tables [R.sub.21] and [R.sub.22]:

[R.sub.21] sId delivPlace s1 d1 s1 d1 s1 d2 s2 d3 [R.sub.22] pId delivPlace p1 d1 p2 d1 p3 d2 p1 d3

This decomposition is information preserving, i.e.

[R.sub.2] = [R.sub.21] [??] [R.sub.22],

but does not preserve functional dependencies, i.e.

[F.sub.2] [not equal to] [([F.sub.21] [union] [F.sub.22]).sup.+) = [F.sub.21]

We can observe some negative consequences of the loss of functional dependencies in the result decomposition.

Assume that we insert the tuple (pl, d2) into [R.sub.22]. The tuple will be inserted because it does not violate any constrain imposed on [R.sub.22]. However, taking into account table [R.sub.21] we see that supplier s1 (determined by d2 in force of delivPlace [right arrow] sId) offers part pl in the place dl. Thus, the considered insertion violates functional dependency sId, pId [arrow] delivPlace defined in [R.sub.2].

The considered example shows that in the case of relational databases we are not able to completely eliminate redundancies and also preserve all functional dependencies. It turns out ([6]) that the best form for relation schema is 3NF, although some redundancies in tables having this form can still remain.

In next section we will show that the hierarchical structure of XML documents can be used to overcome some of the limitations of relational normal forms [24]. As it was shown in [5], there are decompositions of XML schemas that are both information and dependency preserving. In particular, we can obtain a form of XML schema that is equivalent to BCNF, i.e. eliminates all redundancies, and additionally preserves all XML functional dependencies.

4 XML schemas and instances

Schemas for XML data are usually specified by DTD or XSD [25, 26]. In this paper an XML schema (a schema for short) will be specified by means of a slightly simplified version of DTD. We will assume that both attributes and elements of type #PCDATA will be represented by so called terminal elements labeled with terminal labels and having text values. Additionally, terminal elements may have only one occurrence within children of one non-terminal element.

Definition 4.1. Let L be a set of labels, Ter [subset] L be a set of terminal labels and r [member of] L - Ter be a root label. Let p be a set of productions of the form:

l [right arrow] [alpha],

where:

--l [member of] L - Ter;

--r does not occur on the right-hand side of any production;

--a is a regular expression over L - {r} defined as follows:

[alpha] ::= [beta]|[gamma]

[beta] ::= 1/[beta]? | [beta] * | [beta]+

[gamma] ::= A | A? | [beta] | [gamma][gamma] | [gamma] | [gamma]

l [member of] L- Ter- {r}, A [member of] Ter.

Then the quadruple

S = (r, L, Ter, p),

is an XML schema:

XML schemas will be often represented as XML schema trees (Figure 1) (or as XML schema graphs when the schema is recursive). In this paper we restrict ourselves to non-recursive schemas. An example of XML schema, corresponding to the schema tree in Figure 1, was given in Figure 2, where: db is the root, part, supplier, and offer are non-terminal labels, and pid, oid, sid, price, delivTime, delivPlace are terminal labels.

The following notion of tree patterns [19] will be useful to define tree tuples, tree formulas and XML functional dependencies.

Definition 4.2. Let L be a set of labels, and Ter [subset] L be its subset of terminal labels. A tree pattern is an expression conforming to the syntax:

e ::= A | l | l/e | l[[e.sub.1], ..., [e.sub.k]] | l[[e.sub.1], ..., [e.sub.k]]/e,

where A [member of] Ter, l [member of] L -- Ter, n [less than or equal to] 1.

To denote that a tree pattern [phi] is build using the set {[A.sub.1], ..., [A.sub.n]} of terminal labels, we will write [phi]([A.sub.1], ..., [A.sub.n]).

Definition 4.3. Let [phi]([A.sub.1], ..., [A.sub.n]) be a tree pattern, and [x.sub.1], ..., [x.sub.n] be text-valued variables. Then the expression

[phi]([A.sub.1] : [x.sub.1],...,[A.sub.n] : [X.sub.n]),

is a tree formula.

Definition 4.4. Let [phi]([A.sub.1] : [x.sub.1], ..., [A.sub.n] : [X.sub.n]) be a tree formula, and [omega] be a variable valuation, i.e. a function

w : {[x.sub.1], ..., [X.sub.n]} [right arrow] Str [union] {[perpendicular to]},

where Str is a set of text values, and [perpendicular to] is a distinguished null value. Then

t = [phi]([A.sub.1]: [x.sub.1], ..., [A.sub.n]: [X.sub.n])([omega]) = [phi]([A.sub.1] : [omega]([x.sub.1]), ..., [A.sub.n]: [omega]([x.sub.n]))

is called a tree tuple of type [phi]([A.sub.1], ..., [A.sub.n]) with values ([A.sub.1]: [omega]([x.sub.1]), ..., [A.sub.n] : [omega]([X.sub.n]).

Without loss of generality, a tree tuple of type [phi]([A.sub.1], ..., [A.sub.n]) will be denoted also as:

t = [phi]([A.sub.1] : [a.sub.1], ..., [A.sub.n] : [a.sub.n]),

t = [phi]([A.sub.1], ..., [A.sub.n])([omega]),

t = [phi]([omega]([A.sub.1]), ..., [omega]([A.sub.n])),

t = [phi]([A.sub.1] : t. [A.sub.1], ..., [A.sub.n] :t.[A.sub.n]).

Definition 4.5. Let t be a tree tuple of type [phi](U), and let U' = {[A.sub.1], ..., [A.sub.n]} [subset or equal to] U. The projection of t on U', denoted [pi]U',(t), is the set of fields ([A.sub.i] : t.[A.sub.i]) occurring in t, i.e.:

[pi]U',(t) = {[A.sub.1] : t.[A.sub.1],...,[A.sub.n] : t.[A.sub.n]}.

We assume that there is a function type(t) that for a tree tuple t returns its type, it will be denoted by

type(t) = [phi]([A.sub.1], ..., [A.sub.n]).

An XML database consists of a set of XML data, and an XML data is an instance of an XML schema. It will be convenient to distinguish between two kinds of instances:

--an instance as a set of tree tuples, referred to as a canonical instance,

--an instance as a labeled tree, referred as an XML tree.

For one canonical instance there may be a number of XML trees representing it. In the set of all such XML trees we will be interested in such that have a required form, so called XML normal form ([XNF). The canonical instance as well as all corresponding XML trees must conform to the given XML schema.

Definition 4.6. The set of tree tuples

D = {[t.sub.1], ..., [t.sub.N]},

is a canonical instance of an XML schema S = (r, L, Ter, p) if the type, type(t), of any tuple t [member of] D conforms to S.

The conformance of a tree tuple type to an XML schema is defined as follows.

Definition 4.7. Let [phi]([A.sub.1], ..., [A.sub.n]) be the type of a tree tuple t. This tree pattern conforms to the XML schema S = (r, L, Ter, p), if:

1. [phi]([A.sub.1], ..., [A.sub.n]) = r[e], and

2. e conforms to p according to r. The conformance of a pattern e to the set of productions p according to a label l, is defined as follows:

--if e is [A.sub.i], then [A.sub.i] [member of] Ter, and there is such the production l [right arrow] [alpha] in p that [A.sub.i] occurs in [alpha];

--if e is l'/e', then there is such the production l [right arrow] [alpha] in p that l' occurs in a, and e' conforms to p according to l';

--if e is l'[[e.sub.1], ..., [e.sub.n]], then there is such the production l [right arrow] [alpha] in p that l' occurs in a, and every pattern [e.sub.i], 1 [less than or equal to] i [less than or equal to] n, conforms to p according to l'.

--if e is l'[[e.sub.1], ..., [e.sub.n]/e', then l'[[e.sub.1], ..., [e.sub.n]] must conform to p according to l, and e' must conform to p according to l'.

We define XML tree as an ordered rooted node-labeled tree over a set L of labels, and a set Str [union] {[perpendicular to]}, which elements are used as values of terminal nodes.

Definition 4.8. An XML tree I is a tuple (root, [N.sup.e], [N.sup.t], child, [lambda], v), where:

--root is a distinguished root node, [N.sup.e] is a finite set of non-terminal element nodes, and [N.sup.t] is a finite set of terminal nodes;

--child [subset or equal to] ({r} [union] [N.sup.e]) x ([N.sup.e] [union] [N.sup.t])--a relation introducing tree structure into the set {r} [union] [N.sup.e] [union] [N.sup.t], where r is the root, each non-terminal node has at least one child, terminal nodes are leaves;

--[lambda] : {root} [union] [N.sup.e] [union] [N.sup.t] [right arrow] L--a function labeling nodes with labels;

--v : [N.sup.t] [right arrow] Str [union] {[perpendicular to]}--a function assigning terminal nodes with values.

Definition 4.9. We say that an XML tree I = (root, [N.sup.e], [N.sup.t], child, [lambda], v) conforms to an XML schema S = (r, L, Ter, p), denoted I [??] S, if

--[lambda](root) = r; if n [member of] [N.sup.e], then [lambda](n) [member of] L - Ter; if n [member of] [N.sup.t], then [lambda](n) [member of] Ter;

--if [lambda](n) = l, l [right arrow] [alpha] [member of] p, and [n.sub.1], ..., [n.sub.k] are children of n, then the sequence [lambda] ([n.sub.1]) ... [lambda]([n.sub.k]) is a word in the language generated by [alpha].

It will be useful to perceive an XML canonical instance D with tuples of type [phi] as a pair ([phi], [OMEGA]) (called a description), where [OMEGA] is a set of valuations for [phi].

Example 4.1. For the instance [I.sub.1] in Figure 3 we have: ([phi] (pId, oId, sId, price, delivTime, delivPlace) {(p1, o1, s1, x1, t1, d1), (p1, o2, s1, x2, t2, dl), (p1, o3, s2, x3, t3, d2), (p2, o4, s1, x4, t4, d1)}).

An XML tree I satisfies a description ([phi],[OMEGA]), if the root of I satisfies [phi] for every valuation [omega] [member of] [OMEGA] , where:

1. (I, root) [??] (r[e],[omega]), iff [there exists]n [member of] [N.sup.e] child(r, n) A (I, n) [??] (e, [omega]);

2. (I, n) [??] (A, [omega]), iff [lambda](n) = A and v(n) = [omega](A).

3. (I, n) [??] (l/e, [omega]) = iff [lambda](n) = l and [there exists]n' [member of] [N.sup.e] child(n, n') [conjunction] (I, n') [??] (e, [omega])

4. (I, n) [??] (l[[e.sub.l], ..., [e.sub.k]], [omega]), iff [lambda](n) = l and for each i, 1 [less than or equal to] i [less than or equal to] k, exists [n.sub.i] such that child(n, [n.sub.i]) [conjunction] (I, [n.sub.i]) [??] ([e.sub.1], [omega]));

5. (I, n) [??] (l[[e.sub.1], ..., [e.sub.k]]/e,[omega]), iff (I,n) [??] (l[[e.sub.l], ..., [e.sub.k]], [omega]) and exists n' such that child(n, n') [conjunction] (I, n') [??] (e, [omega])).

A description ([phi],[OMEGA]) represents a class of [phi] instances with the same set of values (the same [OMEGA]), since elements in instance trees can be grouped and nested in different ways.

[FIGURE 4 OMITTED]

For example, the XML tree in Figure 4 satisfies (among others) the following two descriptions ([[phi].sub.1], [[OMEGA].sub.1]), and ([[phi].sub.2], [[OMEGA.].sub.2]), where:

[[phi].sub.1] (B, C) = /A[B,C], [[OMEGA].sub.1] = {(b, c1), (b, c2)}; [[phi].sub.2](B,C,D) = /A[B,C,D], [[OMEGA].sub.2] = {(b, c1, d1), (b, c2, d1)}.

5 XML functional dependencies and keys

Over an XML schema we can define some constraints, which specify functional dependencies between values and/or nodes in instances of the schema. These constraints are called XML functional dependencies (XFD).

Definition 5.1. A tree pattern of the form [phi]([A.sub.1], ..., [A.sub.k])/A conforming to an XML schema S = (r, L, Ter, p) is an XML functional dependency (XFD) over S. Then we say that the set of paths terminating in ([A.sub.1], ..., [A.sub.k]) and satisfying [phi], determines the path terminating in A and satisfying [phi]/A.

Satisfaction of an XFD is defined against an canonical instance of XML schema.

Definition 5.2. Let S be an XML schema, D be a canonical instance of S and f = [phi]([A.sub.1], ..., [A.sub.k])/A be an XFD on S. We say the D satisfies f, denoted D [??] f if for any two tree tuples [t.sub.1], [t.sub.2] [member of] D the following implication holds

[[pi].sub.[A.sub.1], ..., [A.sub.k]]([t.sub.1]) = [[pi].sub.[A.sub.1], ..., [A.sub.k]]([t.sub.2]) [??] [[pi].sub.A]([t.sub.1]) = [[pi]A([t.sub.2]).

Assume that [A.sub.1], ..., [A.sub.k],A terminates, respectively, paths [p.sub.1], ..., [p.sub.n],p. Then the XPath-oriented XFD [phi]([A.sub.1], ..., [A.sub.k])/A corresponds to the following path-oriented XFD defined in [1]:

[p.sub.1].[A.sub.1], ..., [p.sub.k].[A.sub.k] [right arrow] p.A..

The XPath-oriented definition has the following advantages:

--it is easy to check, using only the XPath semantics, whether an XML tree satisfies the XFD or not (see below),

--this form of XFD can be used to generate an XQury program performing some transformation operations (see the next section).

An XFD f can be interpreted as XPath expression, i.e. f([x.sub.1], ..., [x.sub.k]) [??] [phi]([A.sub.1] : [x.sub.1], ..., [A.sub.k] : [x.sub.k])/A, that for a given valuation [omega] of its variables returns a sequence of objects (nodes or text values).

An XML tree I = (S, [OMEGA]) satisfies an XFD f(x) if for each valuation [omega] [member of] [OMEGA] of its variables, f([omega](x)) evaluated on I returns an empty sequence or a singleton, i.e.

count ([??]f([omega](x))(I)[??]) [less than or equal to] 1,

where [??]expr[??] (I) is the result of evaluation expr on the instance I.

An XML tree I = (S, [OMEGA]) satisfies an XFD f if for each valuation [omega] [member of] [OMEGA], f([omega]) evaluated on I returns an empty sequence or a singleton, i.e.

count([??]f([omega])(I)[??]) [less than or equal to] 1,

where [??]expr[??] (I) is the result of evaluation expr on the instance I.

Example 5.1. Over [S.sub.1] the following XFDs can be defined:

[f.sub.1] (oid) = /db/part[supplier/offer/oId], [f.sub.2] (oid) = /db/part [supplier/offer/oId]/pId, [f.sub.3] (pid, sid) = /db/part[pId]/supplier/ offer [sId]/ delivPlace, [f.sub.4] (delivPlace) = /db/part/supplier/offer[ delivPlace]/sId, [f.sub.5] (pid, sid) = /db/part[pId]/supplier[offer/sId].

According to XPath semantics [27] the expression [f.sub.1] (oid) by a valuation [omega], is evaluated against the instance [I.sub.1] (Figure 3) as follows: (1) first, a sequence of nodes of type /db/part is chosen; (2) next, for each selected node the predicate [supplier/offer/old = [omega](oid)] is tested, this predicate is true in a node n, if there exists a path of type supplier/offer/oId in [I.sub.1] leading from n to a text node with the value [omega](oid). We see that count([??][f.sub.1]([omega](oid))[??]) equals 1 for all four valuations satisfied by [I.sub.1], i.e. for oid [??] [o.sub.1], oid [??] [o.sub.2], oid [??] [o.sub.3], and oid [??] [o.sub.4].

Similarly, execution of [f.sub.2](oid)(I) gives a singleton for any valuation of oid. These singletons are text values of the path/db/part/pId, where only nodes satisfying the predicate [supplier/offer/oId = [omega](oid)] are taken from the set of nodes determined by/db/part, So, also this XFD is satisfied by [I.sub.1].

However, none of the following XFDs is satisfied in [I.sub.1]:

[g.sub.1] ( sid) = db/part[supplier/offer/sId], [g.sub.2] (pid) =/db/part[pId]/supplier/offer/sId [g.sub.3] (pid) =/db/part[pId]/supplier/offer [g.sub.4] (pid, sid) =/db/part[pId]/supplier/offer[sId], [g.sub.5] (delivPlace) =/db/part/pId/supplier/ offer [delivPlace].

Evaluating the above XFDs against [I.sub.1], we obtain, for example:

count[??][g.sub.1](sid)([sid [??] sl])([I.sub.1])[??]) = 2, count([??][g.sub.2](pid)([pid [??] [p.sub.1]])[??]) = 2, count([??][g.sub.3](pid)([gid [??] [p.sub.l]])[??]) = 3.

An XFD can determine functional relationship between a tuple of text values of a given tuple of paths and a path denoting either a text value (e.g. [f.sub.2](oid)) or a subtree (a node being the root of the subtree) (e.g. [f.sub.1] (oid)) [??] the latter XFDs will be referred to as XML keys.

Definition 5.3. A functional dependence f = [phi]([A.sub.1], ..., [A.sub.k]), where f is of type q/l and l is a non-terminal label, is an XML key for subtrees of the type q/l.

Another notation for XML keys has been proposed in [7]. In that notation

(db-part, {pId})

is an absolute key saying that a subtree of type db/part is uniquely determined by the path db/part/pid (this constraint holds in the instance [I.sub.1] in Figure 3). In our XPath-oriented notation this key is the following XFD:

db/part[pid].

An example of a relative key ([7]) is

(db.part, (supplier, {offer.sId})),

that says that that in the context of db/part, a tree supplier is determined by the path offer/sId. In our XPath-oriented notation this key is expressed as follows:

db/part[pid]/supplier[sid].

6 Normal form for XML

To eliminate redundancies in XML documents, some normal forms (XNF) for XML schemas have been proposed [1, 24, 4, 5]. In this paper we will define XNF in the spirit of BCNF defined for relational schemas. This approach was also used in [1].

Definition 6.1. Let S be an XML schema and [summation] be a set of XFDs defined over S. (S, [summation]) is in XBF iff for any XFD [phi]([A.sub.1], ..., [A.sub.k])/A [member of] [(S, [summation]).sup.+], also [phi]([A.sub.1], ..., [A.sub.k]) [member of] [(S, [summation]).sup.+],where [(S, [summation]).sup.+] is a set of XFDs being consequences of (S, [summation]).

In other word, if [phi]([A.sub.1], ..., [A.sub.k])/A is an XFD for q/l/A, then [phi]([A.sub.1], ..., [A.sub.k]) is a key for q/l. It means that there is at most one subtree of type q/l for any different valuation of ([A.sub.1], ..., [A.sub.k]) in [phi], if the value of a child A of q/l is determined by this valuation. In this way the redundancy, i.e. repetition of values in different subtrees of type q/l, is eliminated.

Let us consider the schema [S.sub.2] in Figure 5 and its instance [I.sub.2] in Figure 6.

[FIGURE 5 OMITTED]

[FIGURE 6 OMITTED]

The following XFD over [S.sub.2]

/db/part/supplier [delivPlace]/sId

says that delivery place (delivPlace) determines the supplier (sId). However, [S.sub.2] is not in XNF, since its instance [I.sub.2] does not satisfy the key

/db/part/supplier [delivPlace].

It means that [I.sub.2] (Figure 6) is not free of redundancy (there are two different subtrees of type supplier describing the same supplier, i.e. its possible name, address, etc.).

In the case of schema [S.sub.3] (Figure 7) the corresponding XFD and the key are:

db/supplier [part/delivPlace]/sId

db/supplier [part/delivPlace].

[FIGURE 7 OMITTED]

[FIGURE 8 OMITTED]

These constraints are satisfied by [I.sub.3] (Figure 8). We see that this time, there is only one subtree of type supplier for any value of delivPlace.

The other dependency of interest is sId, pId [right arrow] delivPlace. Its specification with respect to [S.sub.2] and [S.sub.3] is as follows:

db/part[pId]/supplier [sId]/delivPlace,

and

db/supplier[sId]/part[pId]/delivPlace.

It is easy to see then if these XTDs hold, then also keys

db/part[pId]/supplier[sId],

and

db/supplier [sId]/part[pId]

are satisfied.

However, neither [S.sub.2] nor [S.sub.3] is in XNF. We have already shown that there is redundancy in instances of [S.sub.2]. Similarly, we see that also in instances of [S.sub.3] redundancies may occur. Indeed, since one part may be delivered by many suppliers then the description of one part may be multiplied under each supplier delivering this part, so such data as part name, type, manufacturer etc. will be stored many times. It results from the fact that satisfaction of db/supplier/part[pid]/pname would not imply the satisfaction of db/supplier/part[pid].

In Figure 9 there is schema [S.sub.4] that is in XNF. To make the example more illustrative, we added node name to part data. Also the instance in Figure 10 was slightly extended as compared to instances [I.sub.2] and [I.sub.3].

[FIGURE 9 OMITTED]

[FIGURE 10 OMITTED]

XSD (XML Schema Definition) for [S.sub.4] in notation proposed in [26] is shown in Figure 11.

Note that we cannot use DTD since there are two subtrees labeled part, where each of them has different type: the part subtree under supplier consists of pId and delivPlace, whereas the part subtree under parts consists of pId and name. Recall that in the case of DTD each non-terminal symbol (label) can have only one type (definition), i.e. can appear on the left-hand side of exactly one production rule [26]. This difficulty might be overcame by introducing new labels (for example partDesc) for full descriptions of a parts.

It can be shown that [S.sub.4] satisfies the condition of XNF. Thus, this schema is both redundant-free and dependency preserving. However, as we will show in the next section, schema [S.sub.4] is not the best XNF for the considered running example.

Figure 11: An XSD describing the XML schema in Figure 9 db [right arrow] db[content] content [right arrow] suppliers[suppliers], parts[parts], offers[offers] suppliers [right arrow] supplier[supplier]* parts [right arrow] parts[[part.sub.1]]* offers [right arrow] offers[offer]* supplier [right arrow] sId, part[[part.sub.2]] [part.sub.2] [right arrow] pId, delivPlace [part.sub.1] [right arrow] pId, name offer [right arrow] oId, sId, pId, price, delivTime

7 Transforming XML schemas to XML normal form

In the previous section we discussed an example of transforming an XML schema into XNF. We started with the schema [S.sub.1] in Figure l, and the final schema was [S.sub.4] in Figure 9. However, the final schema has been created in a rather intuitive way. Thus, although it is in XNF it is not clear whether there exists another XNF reformulation of [S.sub.1], maybe better than [X.sub.4], or not. A systematic algorithm (the Decomposition Algorithm, DA) for normalizing an XML schema was proposed in [1]. If we apply DA to [S.sub.1], we obtain a schema that is worse than [S.sub.4]. It is so due to the following reasons:

1. In DA we always obtain a normal form relative to a context path, i.e. the XNF is restricted to the subtree determined by this context path. Only if the context path is equal to the root label (db in our case), the XNF is absolute.

2. In DA it is not possible to change ordering of elements on a path, because we can only move and discard attributes or create new element types [1]. For example, the part element precedes supplier (is above it) in [S.sub.1] and in the result schema this ordering must remain unchanged. But then the absolute XNF for [S.sub.1] cannot be achieved.

3. The result of normalization strongly depends on the starting XML schema.

7.1 Transforming ER model to XNF

In this section we will discuss how to transform an ER (Entity-Relationship) schema [28] into XML schema in XML normal form (XNF). Our method is based on analyzing functional dependencies among attributes of entities involved in the ER schema.

In Figure 12 there is an ER schema corresponding to the XML schema considered in previous sections (see [S.sub.1] in Figure 1) with two additional attributes sname (supplier name) and pname (part name), delivPlace is treated as an entity with the key attribute did.

[FIGURE 12 OMITTED]

The following functional dependencies are captured by the ER schema in Figure 12:

oid [right arrow] sid, pid, did sid, pid [right arrow] did did [right arrow] sid sid [right arrow] sname pid [right arrow] pname oid [right arrow] price, delivTime (1)

The first three of them are of particular importance because they state constraints between entities (key attributes of entities).

We will proceed in two steps:

1. An initial XML schema is created using the entities names from the ER schema, their attributes and functional dependencies between key attributes. The initial schema must follow the necessary condition formulated in Theorem 7.3. Satisfaction of this condition is the prerequisite to obtain an XML schema in absolute XNF in the next step.

2. The DA algorithm [1] is applied to obtain the final XNF schema. In fact, only the step called Creating New Element Types from this algorithm is to be applied. In this way all violations caused by functional dependencies over non-key attributes (appearing on the right-hand sides of these dependencies) are resolved. Such functional dependencies in our example are for example the three last in (1).

Definition 7.1. Let [phi]([A.sub.1], ..., [A.sub.n])/B be an XFD of type q/l/B over an XML schema S. S is called XNF-consistent with [phi]([A.sub.1], ..., [A.sub.n])/B, if for any instance I of S holds the implication:

I [??] [phi]([A.sub.1], ..., [A.sub.n])/B [??] [merge.sub.q/l](I) [??] [phi]([A.sub.1], ..., [A.sub.n]),

where [merge.sub.q/l](I) is the result of merging subtrees of type q/l in I

Let I be an XML tree and T = (r, [tau]) be a subtree of type q/l of I, i.e. T belongs to the result of evaluation of the path expression q/1 on the instance I, T [member of] [??]q/l(I)[??]. Then r is a list of the form r := ([A.sub.1] : [a.sub.l], ..., [A.sub.n] : [a.sub.n]), and [tau] is a list of subtrees (each of these lists may be empty).

Definition 7.2. Two subtrees [T.sub.1] = ([r.sub.1], [[tau].sub.1]), and [T.sub.2] = ([r.sub.2], [[tau].sub.2]) of type q/l of an instance I are joinable if [r.sub.1] = [r.sub.2], and then:

--(r, [[tau].sub.1]) [??] (r, [[tau].sub.2]) = (r, [[tau].sub.1] [parallel] [[tau].sub.2]), where [[tau].sub.1] [parallel] [[tau].sub.2] is concatenation of lists [[tau].sub.1] and [[tau].sub.2];

--the merge operation on all subtrees of type q/l of an instance I is defined as follows:

[merge.sub.q/l] (I) = for each [T.sub.1], [T.sub.2] [member of] [??]q/l(I)[??] if [T.sub.1] and [T.sub.2] are joinable then I := I - q/l[[T.sub.1]] - q/l[[T.sub.2]] [union] q/l[[T.sub.1] [??] [T.sub.2]].

Example 7.1. Let S be defined by

l [right arrow] [l.sub.1]*

[l.sub.1] [right arrow] [A..sub.1] B [l.sub.2]*

[l.sub.2] [right arrow] [A.sub.2] C

and let terminal labels [A.sub.1], [A.sub.2] functionally determine B, i.e. [A.sub.1], [A.sub.2] [right arrow] B. Then the corresponding XFD is

[phi]([A.sub.1], [A.sub.2])/B := l/[l.sub.1][[A.sub.1], [l.sub.2][[A.sub.2]]]/B.

We see that I (Figure 13) satisfies [phi]([A.sub.1], [A.sub.2])/B, i.e. for the values of the pair of paths (l/[l.sub.1]/[A.sub.1], l/[l.sub.1]/[l.sub.2]/[A.sub.2]) the value of the path l/[l.sub.1]/B is uniquely determined. Similarly, the merged form of I, [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], satisfies [phi]([A.sub.1], [A.sub.2]), i.e. the pair (l/[l.sub.1]/[A.sub.1], l/[l.sub.1]/[l.sub.2]/[A.sub.2]) of path values uniquely determines the node of type l/[l.sub.1]. Note that I in its unmerged form does not satisfy [[phi]([A.sub.1], [A.sub.2]), because there are two nodes of type l/[l/sub.1] corresponding to the same pair of path values of the type (l/[l.sub.1]/[A.sub.1], l/[l.sub.1]/[l.sub.2]/[A.sub.2]).

[FIGURE 13 OMITTED]

The following theorem formulates the necessary condition for XML schema to be XNF-consistent with an XDF defined over this schema.

Theorem 7.3. Let f := [phi]([A.sub.1], ..., [A.sub.n])/B be an XFD of type q/l/B over an XML schema S. Then S is XNF-consistent with f if the set consisting of labels of all terminal children of q/l is functionally dependent on the set of terminal labels occurring in f .

Proof We will proof the theorem by contradiction. Let us assume that there exists a terminal child of I labeled C and C is not functionally dependent on the set {[A.sub.1], ..., [A.sub.n]} of terminal labels occurring in f. Then the tree of the form

I = [phi]([A.sub.1] : [a.sub.1], ..., [A.sub.n] : [a.sub.n]) [union] {q/l[B : b, C: [c.sub.1]], q/l[B : b, C: [c.sub.2]]},

is a subtree of an instance of S in which [phi]([A.sub.1], ..., [A.sub.n])/B is satisfied. However, this subtree violates [phi]([A.sub.1], ..., [A.sub.n]). This violation follows from the fact that there are two subtrees rooted in q/l, one with terminal children (B : b, C : [c.sub.1]), and the other with terminal children (B : b, C : [c.sub.2]), so this subtrees cannot be merged. Thus S is not XDF-consistent with f.

In the schema in the following example the necessary condition formulated in Theorem 7.3 does not hold.

Example 7.2. Let S be defined by

l [right arrow] [l.sub.1]*

[l.sub.1] [right arrow] [A.sub.1] [l.sub.2]*

[l.sub.2] [right arrow] [A.sub.2] B C

and let terminal labels [A.sub.1], [A.sub.2] functionally determine B, i.e. [A.sub.1], [A.sub.2] [right arrow] B. Then the corresponding XFD is

[phi]([A.sub.1], [A.sub.2])/B := l/[l.sub.1][[A.sub.1]]/[l.sub.2] [[A.sub.2]]/B.

Assume that C does not depend on {[A.sub.1], [A.sub.2]}. Schema S is not XNF-consistent with [phi]([A.sub.1], [A.sub.2])/B, since we have (see Figure 14):

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Thus, I violates [phi]([A.sub.1], [A.sub.2]) before and after application of the merging operation, in both cases there are two nodes of type l/[l.sub.1]/[l.sub.2] corresponding to the same pair of values ([a.sub.1], [a.sub.2]) of the pair of paths (l/[l.sub.1]/[A.sub.1], l/[l.sub.1]/[l.sub.2]/[A.sub.2]).

[FIGURE 14 OMITTED]

Now, using the Theorem 7.3 and DA, the transformation of ER into XNF is realized in the following two steps:

1. We start with functional dependencies defined over key attributes of entities modeled by ER schema. Following the requirements of Theorem 7.3 we obtain the following DTD for ER schema in Figure 12:

db [right arrow] supplier* supplier [right arrow] sid sname part* part [right arrow] pid pname did offer* offer [right arrow] oid price delivTime (2)

Now, the remaining functional dependencies specified in (1) must be taken into account.

2. The DTD obtained in the first step is the subject of the decomposition by means of the DA algorithm. It is easily seen that the XFD corresponding to pid [right arrow] pname is anomalous. The step Creating new element types of DA converts (2) into

db [right arrow] supplier * partDese * supplier [right arrow] sid sname part * partDesc [right arrow] pid pname part [right arrow] pid did offer * offer [right arrow] oid price delivTime (3)

Applying the above steps to the ER schema from Figure 12 gives the XML schema in XNF depicted in Figure 15 and its instance in Figure 16.

[FIGURE 15 OMITTED]

8 Conclusion

In this paper, we discussed how the concept of database normalization can be used in the case of XML data. Normalization is commonly used to develop a relational schema free of unnecessary redundancies and preserving all data dependencies existing in application domain. In order to apply this approach to design XML schemas, we introduced a language for expressing XML functional dependencies. In fact, this language is a class of XPath expressions, so its syntax and semantics are defined precisely. We define the notion of satisfaction of XML functional dependence by an XML tree. To define XNF we use the approach proposed in [24].

All considerations are illustrated by the running example. We discuss various issues connected with normalization and compare them with issues faced in the case of relational databases. We show how to develop redundancy-free and dependency preserving XML schema. It is worth mentioning that the relational version of the schema cannot be structured in redundancy-free and dependency preserving form. In this case, preservation of all dependencies requires 3NF but then some redundancy is present. Further normalization to BCNF eliminates redundancies but does not preserve dependencies. In the case of XML, thanks to its hierarchical nature, we can achieve both properties. However, it is not clear if this is true in all cases (see e.g. [5].

We have proposed a method for normalizing XML data in to steps. First, we build a conceptual model by means of ER schema and specify all functional dependencies among its attributes. Following the necessary condition formulated in Theorem 7.3, an initial XML schema is created. This schema is XNF-consistent with all XML functional dependencies under consideration. Such the schema can be further normalized, for example using the decomposition algorithm (DA) [1]. It was shown that in the presence of cyclic functional dependencies the procedure proposed in DA results in bad design (only a local XNF can be obtained, i.e. only subschemas of the schema can be transformed into XNF).

[FIGURE 16 OMITTED]

Acknowledgement: This work was supported in part by the Polish Ministry of Science and Higher Education under grant N N516 369536.

Received: March 4, 2009

References

[1] M. Arenas and L. Libkin, "A normal form for XML documents." ACM Trans. Database Syst., vol. 29, pp. 195-232, 2004.

[2] E. Codd, "Further normalization of the data base relational model," R. Rustin (ed.): Database Systems, Prentice Hall, and IBM Research Report RJ 909, pp. 33-64, 1972.

[3] E. F. Codd, "Recent investigations in relational data base systems," in IFIP Congress, 1974, pp. 1017-1021.

[4] S. Kolahi, "Dependency-Preserving Normalization of Relational and XML Data," Database Programming Languages, DBPL 2005, Lecture Notes in Computer Science, vol. 3774, pp. 247-261, 2005.

[5] --, "Dependency-preserving normalization of relational and XML data," Journal of Computer and System Sciences, vol. 73, no. 4, pp. 636-647, 2007.

[6] S. Kolahi and L. Libkin, "On redundancy vs dependency preservation in normalization: an information-theoretic study of 3NF," in PODS '06. ACM, New York, USA, 2006, pp. 114-123.

[7] P. Buneman, S. B. Davidson, W. Fan, C. S. Hara, and W. C. Tan, "Reasoning about keys for XML," Information Systems, vol. 28, no. 8, pp. 1037-1063, 2003.

[8] W, Fan and J. Simeon, "Integrity constraints for XML," J. Comput. Syst. Sci., vol. 66, no. 1, pp. 254-291, 2003.

[9] M. W. Vincent, J. Liu, and M. K. Mohania, "On the equivalence between FDs in XML and FDs in relations," Acta Inf., vol. 44, no. 3-4, pp. 207-247, 2007.

[10] M. W. Vincent and J. Liu, "Checking functional dependency satisfaction in XML," Database and XML Technologies--XSym 2005, Lecture Notes in Computer Science, vol. 3671, pp. 4-17, 2005.

[11] C. Yu and H. V. Jagadish, "XML schema refinement through redundancy detection and normalization," VLDB Journal, vol. 17, no. 2, pp. 203-223, 2008.

[12] R. Elmasri and S. B. Navathe, Fundamentals of Database Systems. Redwood City: The Benjamin/Cummings, 1994.

[13] S. Abiteboul, R. Hull, and V. Vianu, Foundations of Databases, Reading, Massachusetts: Addison-Wesley, 1995.

[14] M. Arenas and L. Libkin, "An information-theoretic approach to normal forms for relational and XML data," J. ACM, vol. 52, no, 2, pp. 246-283, 2005.

[15] P. A. Boncz, T. Grust, M. van Keulen, S. Manegold, J. Rittinger, and J. Teubner, "Pathfinder: XQuery-the relational way," in VLDB 2005. ACM, 2005, pp. 1322-1325.

[16] S. Pal, I. Cseri, O. Seeliger, M. Rys, G. Schaller, W. Yu, D. Tomic, A. Baras, B. Berg, D. Churin, and E. Kogan, "XQuery implementation in a relational database system" in VLDB 2005. ACM, 2005, pp, 1175-1186.

[17] E Pigozzo and E. Quintarelli, "An algorithm for generating XML Schemas from ER Schemas," in SEBD, 2005, pp. 192-199.

[18] R. Schroeder and R. dos Santos Mello, "Improving query performance on XML documents: a workload-driven design approach," in DocEng. ACM, 2008, pp. 177-186.

[19] W. Xu and Z. M. Ozsoyoglu, "Rewriting XPath queries using materialized views," in VLDB 2005, 2005, pp. 121-132.

[20] T. Pankowski, "XML data integration in SixP2P--a theoretical framework," in EDBT Workshop Data Management in P2P Systems (DAMAP 2008),ACM Digital Library, 2008, pp. 11-18.

[21] W. W. Armstrong, "Dependency structures of data base relationships," in IFIP Congress, 1974, pp. 580-583.

[22] P. A. Bernstein, "Synthesizing third normal form relations from functional dependencies" ACM Transactions on Database Systtems, vol. 1, no. 4, pp. 277-298, 1976.

[23] J. Rissanen, "Independent components of relations," ACM Transactions on Database Systems, vol. 2, no. 4, pp. 317-325, 1977.

[24] M. Arenas, "Normalization theory for XML," SIG-MOD Record, vol. 35, no. 4, pp. 57-64, 2006.

[25] XML Schema Definition Language (XSDL) 1.1, 2007, www.w3.org/TR/xmlschema 11-1.

[26] W. Martens, E Neven, and T. Schwentick, "Simple off the shelf abstractions for XML schema," SIGMOD Record, vol. 36, no. 3, pp. 15-22, 2007.

[27] XML Path Language (XPath) 2.0, 2006, www.w3.org/TR/xpath20.

[28] P. P. Chen, "The entity-relationship model - toward a unified view of data," ACM Transactions on Database Systtems, vol. 1, no. 1, pp. 9-36, 1976.

Tadeusz Pankowski

Institute of Control and Information Engineering

Poznan University of Technology

P1. M.S.-Curie 5, 60-965 Poznafi, Poland

E-mail: tadeusz.pankowski @put.poznan.pl

Tomasz Pitka

Faculty of Mathematics and Computer Science

Adam Mickiewicz University

ul. Umultowska 87, 61-614 Poznan, Poland

E-mail: tomasz.pilka@amu.edu.pl

Printer friendly Cite/link Email Feedback | |

Title Annotation: | extensible markup language |
---|---|

Author: | Pankowski, Tadeusz; Pilka, Tomasz |

Publication: | Informatica |

Article Type: | Report |

Geographic Code: | 4EXPO |

Date: | Nov 1, 2009 |

Words: | 9710 |

Previous Article: | Simulation and performance analysis of distributed Internet systems using TCPNs. |

Next Article: | Realization of UML class and state machine models in the C# code generation and execution framework. |

Topics: |