Printer Friendly

A minimum sequence matching scheme for efficient XPath processing.

1. Introduction

As XML is gaining complete success in being adopted as universal data representation and exchange format. However, particularly on the World Wide Web, the problem of querying XML documents poses interesting challenges to database researchers. Recently, there have been many studies on the structural join methods to efficiently process the XML queries involving ancestor-descendant relationships [1][2][3]. The structural join methods can efficiently answer simple queries. However, twig queries involving branching structures usually have to be disassembled into multiple sub-queries. The results of these sub-queries are then combined by expensive join operations to produce final answers. To improve the existing structural join methods, Wang et al. have proposed a new method, called ViST, that transforms the XML data trees and queries into structure-encoded sequences [4]. ViST performs a subsequence matching on the structure-encoded sequences without breaking down the twig query into sub-queries to find twig patterns in XML documents. Also, to quickly determine the structural relationship between two nodes, each node of the structure-encoded sequences is assigned with a pair of numbers ([n.sub.x], [size.sub.x]), where [n.sub.x] is the tree traversal order by preorder, and where [size.sub.x] is the total number of descendants of a node, x, in the suffix tree [5]. Rao et al. have proposed PRIX that transforms XML data trees and queries into LPS (Labeled Prufer Sequence) and NPS (Numbered Prufer Sequence) to improve the drawbacks of query processing cost and false alarms of the ViST [6]. Tatikonda et al. have proposed LCS-TRIM that uses CPS (Consolidated Prufer Sequence) instead of LPS and NPS of PRIX to reduce the sequence matching cost of PRIX [7]. However, ViST has other imminent drawbacks, and PRIX and LCS-TRIM require much processing time for structure matching of XML data trees and queries. In addition, the LCS-TRIM that's based on the main memory index is unsuitable for searching large XML documents.

In this paper, we propose an efficient index structure to solve the problems of the ViST, and we propose a novel query processing method that's suitable for the proposed index structure. The main contributions of this paper are summarized as follows. We propose a new structure-encoded sequence with the Durable numbering scheme [1] to support dynamic data insertion and deletion. We propose a minimum sequence matching scheme to speed up the subsequence match phase and to return correct answers without false alarms when a query is processed. Our approach is directly performed on the disk-based B+Tree and R-Tree, instead of relying on specialized data structures that are not well supported by DBMSs. We compare our index structure with ViST, PRIX and LCS-TRIM in terms of query processing costs to verify the superiority of our proposed index structure for being a minimum sequence matching scheme.

The rest of this paper is organized as follows. Section 2 discusses the background and motivations of our work. Section 3 proposes the new query indexing method, and section 4 presents our experimental results. Finally, section 5 summarizes the conclusion of this paper.

2. Background and Motivations

2.1 XML Numbering Schemes

The structural relationship between two element-nodes can be quickly determined by a region encoding scheme, where each element is assigned with a pair of numbers (start, end), based on the element's position in the data tree [2][3][8], with the following held: for any two distinct elements u and v, (1) the region of u is completely before or after v, or (2) the region of u completely contains v or is contained by the region of v. Formally, element u is an ancestor of element v iff u.start < v.start and v.end < u.end. Since regions of two distinct elements never partially intersect, the formula can be simplified as u.start < v.start < u.end. Usually, region codes for element nodes can be effectively generated by a depth-first traversal of the tree and sequentially assigning a number at each visit.

There are other approaches to numbering XML element nodes. One Dietz's numbering scheme uses tree traversal orders [9]. A tree node is assigned a pair of (preorder, postorder) tree traversal orders. Element u is an ancestor of element v iff u.preorder < v.preorder and v.postorder < u.postorder. The limitation of this approach is the lack of flexibility. That is, the preorder and postorder may need to be recomputed for many tree nodes when a new node is inserted. The Durable numbering scheme is proposed to more efficiently deal with the dynamic updates of XML data. This approach assigns a pair of (order, size) to each node in tree [1]. The order is an extended preorder and size is an arbitrary integer that's larger than the total number of the current descendants of each node in a tree to gracefully accommodate future insertions. Element u is an ancestor of element v iff u.order < v.order < u.order+u.size.

2.2 Problems of ViST

Wang et al. have proposed a new method called ViST that transforms XML data trees and twig queries into structure-encoded sequences [4]. The structure-encoded sequence is a two-dimensional sequence of (symbol, prefix) pairs {([a.sub.1], [p.sub.1]), ([a.sub.2], [p.sub.2]), ..., ([a.sub.n], [p.sub.n])} where [a.sub.i] represents a node in the XML document tree, and pi represents the path from the root node to a node [a.sub.i]. The nodes [a.sub.1], [a.sub.2], ..., an are in preorder. Also, to quickly determine the structural relationship between two nodes, each node of structure-encoded sequences has a pair of numbers ([n.sub.x], [size.sub.x]), where [n.sub.x] is the tree traversal order by preorder, and [size.sub.x] is the total number of descendants of a node x in the suffix tree. If u and v are labeled ([n.sub.u], [size.sub.u]) and ([n.sub.v], [size.sub.v]) respectively, then node u is an ancestor of node v iff [n.sub.v][member of]([n.sub.u], [n.sub.u]+[size.sub.u]]. ViST has a D-Ancestor B+Tree, S-Ancestor B+Trees and a DocID B+Tree. The D-Ancestor B+Tree indexes the structure-encoded sequences with their (symbol, prefix) as keys. Each S-Ancestor B+Tree indexes the [n.sub.x] and [size.sub.x] values with the same (symbol, prefix). Moreover, a DocID B+Tree, using the [n.sub.x] values of suffix tree's leaf nodes as keys, indexes the document's IDs involving the structure-encoded sequence of each path in a suffix tree. Fig. 1 (a) shows the suffix tree in ViST for XML Doc1 and Doc2. Fig. 1 (b) shows ViST on the suffix tree in Fig. 1 (a).

Suppose a node x, labeled ([n.sub.x], [size.sub.x]), is one of the nodes matching a query q1, ..., [q.sub.i]-1. To match the next element [q.sub.i] in the query, ViST consults the D-Ancestor B+Tree using [q.sub.i] as a key. The D-Ancestor B+Tree returns the root of an S-Ancestor B+Tree. ViST then issues a range query [n.sub.x] < n [less than or equal to] [n.sub.x]+[size.sub.x] on the S-Ancestor B+Tree to immediately find the descendants of node x. For each descendant, ViST uses the same process to match symbol [q.sub.i]+1, until ViST reaches the last element of the query. If a node, y, is one of the nodes that matches the last element in the query, then ViST performs a range query ([n.sub.y], [n.sub.y] + [size.sub.y]] on the DocID B+Tree to retrieve all the documents' IDs for y or y's descendants. Fig. 2 shows the query processing phase of ViST in Fig. 1.

ViST performs the subsequence matching of the structure-encoded sequences to optimize twig query processing without breaking a twig and it merges the results of sub-queries in XML documents. However, ViST has imminent drawbacks. One of those drawbacks, called Unacceptable Accesses, is that determining the structural relationship between two nodes is incorrect because the numbering scheme of ViST is associated after all the nodes of a XML document tree are indexed on a path of a suffix tree.

[FIGURE 1 OMITTED]

Example 1 The child node of (S, P) with n=2 is only (L, PS) with n=3 in Doc1 of Fig. 1. However, when the range query (symbol, prefix) of (L, PS) involved in (S, P) is executed, (L, PS) with n=6 is also answered as the child of (S, P) according to [n.sub.(S,P)] < [n.sub.(L,PS)] [less than or equal to] [n.sub.(S,P)]+[size.sub.(S,P)].

Another drawback of ViST, called as Unnecessary Accesses, is that ViST accesses unnecessary nodes when processing a query. The prefix of the structure-encoded sequence represents the path from the root node to its parent node. If ViST utilized the characteristic of prefix, then ViST would reduce many I/O costs of determining the structural relationship between two nodes.

Example 2 With the PSL of (v1, PSL) in Fig. 1, we can determine that (v1, PSL) has (L, PS) as its parent node, and (P, [epsilon]) and (S, P) as its ancestor nodes. Therefore, the range queries of (P, [epsilon]), (S, P) and (L, PS) involving (v1, PSL) are unnecessary, as is shown in Fig. 2.

Also, the query processing strategy of ViST may result in false alarms. Fig. 3 illustrates such a case. The structure-encoded sequence of the twig query Q is a subsequence of Doc1 and Doc2. However, the twig pattern, Q, occurs only in Doc1, and the match that's detected in Doc2 is a false alarm.

2.3 The Problems of PRIX and LCS-TRIM

Rao et al. have proposed PRIX to improve drawbacks of the query processing cost and the false alarms of ViST [6]. PRIX transforms XML data trees and twig queries into LPS and NPS as in Fig. 4. In PRIX, each node of a XML document tree and a twig query tree has a unique number according to the postorder numbering scheme. LPS and NPS can be then constructed according to the node removal method. To construct a sequence from a tree [T.sub.n] with n nodes labeled from 1 to n, the node removal method works as follows. From [T.sub.n], delete a leaf with the smallest label to form a smaller tree [T.sub.n-1]. Let [a.sub.1] denote the label of the node that was the parent

[FIGURE 2 OMITTED]

[FIGURE 3 OMITTED]

Repeat this process on [T.sub.n-1] to determine [a.sub.2] (the parent of the next node to be deleted), and continue until only two nodes that are joined by an edge are left. The sequence ([a.sub.1], [a.sub.2], [a.sub.3], ..., [a.sub.n-2]) is called the Prufer sequence of [T.sub.n]. If the Prufer sequence is constructed with numbers assigned to nodes, then it is called NPS. If the Prufer sequence is constructed with labels assigned to nodes, then it is called LPS.

A twig matching of PRIX can be found by performing a subsequence matching on the set of LPS and NPS, and by performing a structure matching with gap consistent, frequency consistent and matching leaf nodes [6]. The twig matching is faster than that of ViST. However, the structure matching of PRIX needs many disk I/O and time.

Tatikonda et al. have proposed LCS-TRIM to reduce the query processing cost of PRIX [10]. LCS-TRIM uses CPS that consists of LS (Label Sequence) and NPS. The NPS of LCS-TRIM is similar to that of PRIX. However, PRIX doesn't take the sequence of the root node, but LCS-TRIM takes it. Also, LS takes the labels of the deleted nodes instead of their parent node labels. The sequence matching by using LS outperforms that by using LPS because LPS additionally performs the matching of leaf nodes. Moreover, LCS-TRIM uses LF (Label Filtering) and DM (Dominant Match) to reduce the number of unnecessary node accesses during structure matching. However, when processing queries with wild-cards, LF and DM significantly degrade the query performance. Moreover, The LCS-TRIM that's based on main memory index is unsuitable for searching large XML documents.

[FIGURE 4 OMITTED]

3. The Proposed Minimum Sequence Matching Scheme

3.1 The Proposed Index Structure

Our index structure is based on ViST. One notable difference is that we use the Durable numbering scheme [1] to avoid the Unacceptable Accesses of ViST and to more efficiently deal with dynamic updates of XML data. Fig. 5 shows our proposed index structure on XML Doc1 and Doc2 of Fig. 1.

[FIGURE 5 OMITTED]

Another difference is that each path of the suffix tree used in our index structure is composed of each of the paths of all XML data trees, whereas each path of the suffix tree used in ViST is composed of all the nodes of each XML data tree. The D-Ancestor B+Tree and the S-Ancestor B+Trees of our index structure are the same as those of ViST. However, each S-Ancestor B+Tree of our index structure indexes the orderx and [size.sub.x] values assigned by the Durable numbering scheme. Also, we use DocID R-Tree that uses the second smallest [n.sub.x] values and the largest [n.sub.x] values of each document as keys. Therefore, our index can find all the DocIDs with [n.sub.x] values of all nodes.

3.2 The Proposed Minimum Sequence Matching Schemes

If we use the characteristic of prefix as mentioned above, the query performance can be significantly improved. We propose achieving the minimum sequence matching by using the characteristic of prefix and the bottom-up query processing method. In the case of a single path query, as shown in Fig. 6, our index only executes the range query of the last node of the structure-encoded sequence of a query. Moreover, as shown in example 3, our query processing method is very efficient for wild-cards queries because the number of node-accesses by our minimum sequence matching is smaller than that by ViST.
Fig. 6. The minimum sequence matching of a single path query

Input : Q : q1, ..., qk, a query sequence
  D-Ancestor B+Tree, index of (symbol, prefix) pairs
  S-Ancestor B+Tree, index of (order, size) labels
  DocID R-Tree, mapping between the order values in node labels and
  document IDs
Output : all occurrences of Q in the XML data

LinearSearch(Q, k) /* k is the number of a query sequence's nodes */

Function LinearSearch(Q, k)

1 : while not found T do

2 : T [left arrow] retrieve the S-Ancestor B+Tree that represents
    [q.sub.k] from the D-Ancestor B+Tree;

3 : N [left arrow] retrieve all nodes with range inside (order,
    order+size) from T;

4 : for each node c [member of] N do

5 :   Perform a query with key(order) on the DocID R-tree to output
      all document

6 :   IDs in that key;

7 : end

8 : end


Example 3 Consider a query "/P/*/L/v2". The structure-encoded sequence of this query is (P, [epsilon])(L, P*)(v2, P*L) and the last node sequence of this query is (v2, P*L). Therefore, as shown in Fig. 7 (a), our method performs only a range query to match (v2, P*L) in the D-Ancestor B+Tree of Fig. 5. We find (v2, PBL) and (v2, PSL) as the result of the range query. Then, to find documents involving (v2, PBL) and (v2, PSL), range queries are executed in each S-Ancestor B+Tree of (v2, PBL) and (v2, PSL). However, the query processing method of ViST has many disk I/O and time because of the top-down query processing method, as is shown in Fig. 7 (b).

[FIGURE 7 OMITTED]

If a twig query is processed by our minimum sequence matching for a single path query, then false alarms occur because the structural relationship among multiple sub-queries disassembled from a twig query is not determined. Therefore, we also propose the efficient minimum sequence matching for a twig query; it only executes the range queries about the leaf nodes and branch nodes of a twig query tree, as is shown in example 4.

Example 4 Consider a query "/P/S[L/v1]/L/v2". The structure-encoded sequence of this query is (P, [epsilon])(S, P)(L, PS)(v1, PSL)(L, PS)(v2, PSL). The sequence nodes for the leaf nodes of this twig query tree are (v1, PSL) and (v2, PSL). And the sequence node for the branch node is (S, P), as is shown in Fig. 8 (a). Fig. 8 (b) shows the twig query processing phases by the proposed minimum sequence matching of a twig query. First, our method performs range queries to match (v1, PSL), (v2, PSL) and (S, P) in the D-Ancestor B+Tree of Fig. 5. We find the S-Ancestor B+Trees of (v1, PSL), (v2, PSL) and (S, P) as the results of the above range queries. Then, by range queries of the above S-Ancestor B+Trees, we find (30, 0) as the numbering of (v1, PSL), (41, 0) as the numbering of (v2, PSL), and (10, 20) and (31, 10) as the numbering of (S, P). Finally, to determine the structural relationship among above numbering values, we find the numbering values of branch node sequence involving the numbering values of each leaf node sequence by equation (1). For example, we compute {(30, 30+0][member of]10 & (41, 41+0][member of]10} and {(30, 30+0][member of]31 & (41, 41+0][member of]31}. However, since Doc1 and Doc2 do not have the sequences of this twig query, the result of this twig query does not exist. As a result, our minimum sequence matching has no false alarms.

[FIGURE 8 OMITTED]

4. Performance Evaluation

4.1 Experimental Environment

To determine the effectiveness of our minimum sequence matching scheme, we compare the performance of our index structure with that of ViST, PRIX and LCS-TRIM. We implemented our XML indexing in C++. The implementation uses the B+Tree API provided by GiST [11]. Also, we get the ViST algorithm, the PRIX algorithm and LCS-TRIM algorithm from [12], [13] and [10]. We run our experiments on 3.0GHz Pentium IV processor with 1GB RAM running on Linux kernel 2.6. The code is compiled using the GNU g++ compiler, version 4.0.2. A page size of 8KB is used, and 8-byte number ranges are used to label the nodes. However, the LCS-TRIM used the buffer size of 200MB because the index is based on main memory.

We provided a comparison with LCS-TRIM to verify that the structure matching of LCS-TRIM can cost more. We experiment with the datasets that are obtained from the University of Washington's XML repository [14]. We choose these three datasets since each has a different characteristic, as is shown in Table 1. The document tree in the DBLP dataset has good similarity in structure and is shallow. The document tree in the SWISSPROT dataset is bushy and shallow. The document tree in the TREEBANK dataset is skinny and has deep recursions of element names. Table 1 provides additional information such as the maximum depth, the number of elements and so on for the datasets.

The XPath queries listed in Table 2 are tested in our experiments. These queries have different characteristics in terms of selectivity, presence of values and twig structure. Since the values are encrypted, we choose queries without values (character data) for the TREEBANK dataset

4.2 Experimental Results

For evaluating various experiments, our index has two types. One, called C-index, indexes the structure-encoded sequences and the numbering values of character data on a D-Ancestor B+Tree and S-Ancestor B+Trees like ViST. The second, called NC-index, does not index the structure-encoded sequences and the numbering values of character data on a D-Ancestor B+Tree and S-Ancestor B+Trees like PRIX. Therefore, NC-index stores the values of character data in a database and each leaf node with character data in S-Ancestor B+Trees points each tuple with its value in a database. If NC-index has a query with character data, then the NC-index first executes the minimum sequence matching of this query without character data, and the NC-index then executes the matching character data between the leaf nodes of this query and the tuples of the database.

Fig. 9 shows the performance results of the total time elapsed and physical I/O (pages read from disk) to process queries [Q.sub.1], [Q.sub.2] and [Q.sub.3] of the DBLP dataset. The physical I/O of LCS-TRIM represents the size of the used main memory. [Q.sub.1] is a single path query with two element nodes and one character data node. ViST and PRIX execute range queries on all nodes of the structure-encoded sequence for [Q.sub.1]. LCS-TRIM requires scanning all of the main memory to construct R-matrix for processing queries [10]. However, C-index executes the range query of the leaf node of the structure-encoded sequence of [Q.sub.1]. Therefore, our index is better than the other indexes. C-index is better than NC-index because NC-index additionally requires the matching character data between the leaf node of [Q.sub.1] and the tuple of the database. [Q.sub.2] and [Q.sub.3] are twig queries with three element nodes and two branches, and with three element nodes, two character data nodes and two branches, respectively. Processing a single path query is better than processing a twig query because processing a twig query additionally requires determining the structural relationship between multiple sub-queries disassembled from a twig query. The performance results of NC-index and C-index to process [Q.sub.2] are the same because [Q.sub.2] has no character data. Also, the performance results of [Q.sub.2] are better than those of [Q.sub.1] because 'year' element and 'number' element as a child of 'phdthesis' element are more than 'author' element as a child of 'article' element in the DBLP dataset. The performance results of NC-index to process [Q.sub.3] are more faulty than those of the others because 'author' element and 'year' element as a child of 'inproceedings' element are plentiful in the DBLP dataset and they additionally require the matching character data to process 'Jim Gray' element and '1990' element. LCS-TRIM is better than our index because the structure matching of LCS-TRIM is executed in the main memory.

[FIGURE 9 OMITTED]

Fig. 10 shows the performance results of processing queries [Q.sub.4], [Q.sub.5] and [Q.sub.6] of the SWISSPROT dataset. [Q.sub.4] is the same single path query as [Q.sub.1]. Processing [Q.sub.4] is better than processing [Q.sub.1] because the elements that are related to [Q.sub.1] are more than those related to [Q.sub.4] of the SWISSPROT dataset. The performance results of C-index to process [Q.sub.1] and [Q.sub.4] are the same because the number of character data of [Q.sub.1] and [Q.sub.4] is same. [Q.sub.5] is the same twig query as [Q.sub.3]. Also, as shown in the performance results to process [Q.sub.1] and [Q.sub.4], processing [Q.sub.5] is better than the processing [Q.sub.3] because the elements that are related to [Q.sub.3] are more than those related to [Q.sub.5]. [Q.sub.6] is a twig query with an attribute node and a wild-card '//'. The performance results to process [Q.sub.6] are more erroneous than those to process [Q.sub.4] and [Q.sub.5] because of the many range queries to process the wild-card. But NC-index to process [Q.sub.6] is better than that to process [Q.sub.5] because the '@prim_id' attribute and 'Descr' element that are related to [Q.sub.6] are more than 'Org' and 'Author' elements that are related to [Q.sub.5] of the SWISSPROT dataset. If a query has wild-cards, then LF and DM of LCS-TRIM don't construct the small R-matrix because almost all the nodes of main memory are inserted into R-matrix to find the ancestor nodes of the nodes that are involved in the query. Therefore, the cost of structure matching is increased.

Fig. 11 shows the performance results of processing queries Q7, [Q.sub.8] and [Q.sub.9] of the TREEBANK dataset. Q7 is a single path query with two wild-cards '//'. To process Q7, ViST performs the range queries (S, //), (NP, //S//) and (SYM, //S//NP). The results of these range queries cause many other range queries to produce final answers. PRIX performs the range queries 'S', 'NP' and 'SYM'. Also, PRIX performs the subsequence matching and refinement phases on the set of LPS and NPS of the results of these range queries. Our index only performs the range query (SYM, //S//NP). Therefore, our index clearly outperforms other indexes. The TREEBANK dataset has no character data because the character data are encrypted. The performance results of NC-index and C-Index to process each query are the same. [Q.sub.8] and [Q.sub.9] are the twig queries with a wild-card '//'. As shown in Fig. 11, the performance results of our index to process a single query are better than those to process a twig query because our index additionally requires the range queries of vertex nodes to process a twig query. However, the performance results of PRIX to process [Q.sub.8] are better than those to process Q7 because the number of LPS and NPS of processing Q7 are more than those of processing [Q.sub.8]. Also, in LCS-TRIM, if the depth of the document tree is high and a query has wild-cards, then the size of R-matrix is much increased.

Fig. 12 shows the performance results of false alarms. PRIX performs a series of refinement phases with gap consistent and frequency consistent to avoid the false alarms of ViST. LCS-TRIM performs structure matching by LF and DM. Our index performs the minimum sequence matching according to the Durable numbering scheme. As shown in Fig. 12, PRIX, LCS-TRIM and our index have no false alarms. However, ViST has many false alarms. For example, the DBLP dataset has 72 'phdthesis' elements, 72 'year' elements as a child of 'number' and 3 elements as a child of 'phdthesis'. Also, there are 3 'phdthesis' elements with 'year' element and 'number' element as a child. However, ViST has 4602 'phdthesis' elements as the result of [Q.sub.2] because if a node, y, is a child or a descendant of a node x on ViST, then ViST treats node y as a child or a descendant of node x in spite of the fact that node y is not a child or a descendant of node x in a XML document.

[FIGURE 12 OMITTED]

5. Conclusion

In this paper, we have presented an efficient paradigm for the XPath query processing. We have proposed a novel index structure that uses the Durable numbering scheme and the structure-encoded sequences of the XML tree for indexing XML data. Then, to quickly process XPath queries, we have proposed the minimum sequence matching scheme using the characteristic of the prefix. We have also provided empirical performance analysis to demonstrate the efficient processing of XML queries when using our index structure. In the future, we will study how to efficiently process the delimiters of the prefix schemes, to decrease the label size and to maintain low-label update.

A preliminary version of this paper appeared in ICCS 2007, LNCS 4489, May 27-30, Beijing, China. This version includes a concrete analysis and supporting experimental results with recent sequence schemes. This work was supported by the Ministry of Education, Science and Technology Grant funded by the Korean Government (The Regional Research Universities Program/Chungbuk BIT Research-Oriented University Consortium) and by the Brain Korea 21 Project, the School of Information Technology, KAIST in 2009.

Received August 15, 2009; revised September 22, 2009; accepted September 28, 2009; published October 30, 2009

DOI: 10.3837/tiis.2009.05.005

References

[1] Q. Li and B. Moon, "Indexing and Querying XML Data for Regular Path Expressions," Proc. of 27th VLDB Conference, pp.361-370, 2001.

[2] S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel, D. Srivastava, and Y. Wu, "Structural Joins: A Primitive for Efficient XML Query Pattern Matching," Proc. of 18th IEEE International Conference on Data Engineering, pp.141-152, 2002.

[3] S. Y. Chien, Z. Vagena, D. Zhang, V. Tsotras, and C. Zaniolo, "Efficient Structural Joins on Indexed XML Documents," Proc. of 28th VLDB Conference, pp.263-274, 2002.

[4] H. Wang, S. Park, W. Fan, and P. S. Yu, "ViST: A Dynamic Index Method for Querying XML Data by Tree Structures", Proc. of 2003 ACM SIGMOD Conference, pp.110-121, 2003.

[5] E. M. McCreight, "A Space-Economical Suffix Tree Construction Algorithm," Journal of the ACM, Vol. 23, pp.262-272, 1976.

[6] P. Rao and B. Moon, "Sequencing XML Data and Query Twig for Fast Pattern Matching," ACM Transactions on Database Systems(TODS), pp.299-345, 2006.

[7] S. Tatikonda, S. Parthasarathy, and M. Goyder, "LCS-TRIM: Dynamic Programming Meets XML Indexing and Querying," Proc. of 2007 VLDB Conference, pp.63-74, 2007.

[8] C. Zhang, J. F. Naughton, D. J, DeWitt, Q. Luo, and G. M. Lohman, "On Supporting Containment Queries in Relational Database Management Systems," Proc. of 2001 ACM SIGMOD Conference, pp.425-436, 2001.

[9] P. F. Dietz, "Maintaining Order in a Linked List," Proc. of the 4th Annual ACM Symposium on Theory of Computing, pp.122-127, 1982.

[10] S. Tatikonada, "LCS-TRIM Project," http://www.cse.ohiostate.edu/~takidond, 2007.

[11] J. M. Hellerstein, J. F. Naughton, and A. Pfeffer, "Generalized Search Trees for Database Systems," Proc. of the 21th VLDB Conference, pp.562-573, 1995.

[12] H. Wang, "The ViST Algorithm," http://wis.cs.uda.edu/~hxwang/pub.html, 2003.

[13] B. Moon, "PRIX Project," http://www.cs.arizona.edu/prix, 2006.

[14] G. Miklau, "UW XML Repository," http://www.cs.washington.edu/research/xmldatasets, 2006.

Dong-Min Seo received his B.S., M.S. and Ph.D. in Information and Communication Engineering from Chungbuk National University, Korea in 2002, 2004 and 2008, respectively. He is now the postdoctoral in the Dept. of Computer Science, the Korean Advanced Institute of Science and Technology, Korea. His main research interests include MOD (Moving-Objects Database) system, WSN (Wireless Sensor Networks) and XML database system.

Myung-Ho Yeo received his B.S. and M.S. in Information and Communication Engineering from Chungbuk National University, Korea in 2004 and 2006, respectively. He is currently working towards his Ph.D. degree in the Dept. of Information and Communication Engineering from Chungbuk National University, Korea. His main research interests include main-memory database systems and WSN.

Myoung-Ho Kim received his B.S. and M.S. in Computer Engineering from Seoul National University, Korea in 1982 and 1984, respectively. He received his Ph.D. in Computer Science from Michigan State University, USA. He is now a professor in the Dept. of Computer Science, the Korean Advanced Institute of Science and Technology, Korea. His main research interests include database system, sensor data management, and storage management system.

Jae-Soo Yoo received his B.S. in Computer Engineering from Chunbuk National University, Korea in 1989, and he also received his M.S. and Ph.D. in Computer Science from the Korean Advanced Institute of Science and Technology, Korea in 1991 and 1995. He is now a professor in Information and Communication Engineering, Chungbuk National University, Korea. His main research interests include database systems, sensor data management, location based services, distributed computing and storage management system.

Dong-min Seo (1), Myung-Ho Yeo (2), Myoung-Ho Kim (1) and Jae-Soo Yoo (2)

(1) Department of Computer Science, Korean Advanced Institute of Science and Technology, Daejen, Korea [e-mail: {dmseo, mykim}@dbserver.kaist.ac.kr]

(2) Department of Computer and Communication Engineering, Chungbuk National University, Cheongju, Korea [e-mail: {mhyeo, yjs}@chungbuk.ac.kr]

* Corresponding author: Jae-Soo Yoo
Table 1. The datasets used in our experiments

Data Name          DBLP     SWISSPROT   TREEBANK

Size in MB          134        115         86
# of Elements     3332130    2977031    2437666
# of Attributes   404276     2189859        1
Max-depth            6          5          36
# of Sequences    328858      50000      56385

Table 2. The XPath queries used in our experiments

No.         Query                                  Dataset

[Q.sub.1]   //article/author="E. F. Codd"           DBLP
[Q.sub.2]   //phdthesis[/year][/number]             DBLP
[Q.sub.3]   //inproceedings[/author="Jim Gray"]     DBLP
              [/year="1990"]
[Q.sub.4]   //Ref/Author="Moss J"                 SWISSPROT
[Q.sub.5]   //Entry[/Org="Piroplasmida"]          SWISSPROT
              [/Ref/Author="Kemp D.J"]
[Q.sub.6]   //Entry[/PFAM[@prim-id="PF00304"]     SWISSPROT
              [//SIGNAL//Descr]
[Q.sub.7]   //S//NP/SYM                           TREEBANK
[Q.sub.8]   //NP[/RBR-OR-JJR]/PP                  TREEBANK
[Q.sub.9]   //NP/PP/NP[/NNS-OR-NN][/NN]           TREEBANK

Fig. 9. The performance results of [Q.sub.1], [Q.sub.2] and [Q.sub.3].

            NC-Index   C-Index      PRIX         ViST      LCS-TRIM

[Q.sub.1]    8 pages   1 pages    23 pages   2,280 pages   23.61 MB
[Q.sub.2]    2 pages   2 pages     8 pages      17 pages   23.61 MB
[Q.sub.3]   39 pages   7 pages   185 pages   3,543 pages   23.61 MB

(b) Total physical I/O of [Q.sub.1], [Q.sub.2] and [Q.sub.3]

Fig. 10. The performance results of [Q.sub.4], [Q.sub.5] and [Q.sub.6].

            NC-Index   C-Index      PRIX        ViST       LCS-TRIM

[Q.sub.4]    2 pages   1 pages     9 pages   1,657 pages   23.14 MB
[Q.sub.5]   18 pages   2 pages    49 pages   1,885 pages   23.14 MB
[Q.sub.6]    7 pages   4 pages    83 pages   4,367 pages   23.14 MB

(b) Total physical I/O of [Q.sub.4], [Q.sub.5] and [Q.sub.6].

Fig. 11. The performance results of [Q.sub.7], [Q.sub.8] and [Q.sub.9].

            NC-Index   C-Index      PRIX         ViST        LCS-TRIM

[Q.sub.7]   3 pages    3 pages    46 pages    40,827 pages   11.73 MB
[Q.sub.8]   4 pages    4 pages    35 pages    94,505 pages   11.73 MB
[Q.sub.9]   4 pages    4 pages    55 pages   121,928 pages   11.73 MB

(b) Total physical I/O of [Q.sub.7], [Q.sub.8] and [Q.sub.9]
COPYRIGHT 2009 KSII, the Korean Society for Internet Information
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2009 Gale, Cengage Learning. All rights reserved.

Article Details
Printer friendly Cite/link Email Feedback
Author:Seo, Dong-min; Yeo, Myung-Ho; Kim, Myoung-Ho; Yoo, Jae-Soo
Publication:KSII Transactions on Internet and Information Systems
Article Type:Technical report
Date:Oct 1, 2009
Words:5725
Previous Article:Associativity-based on-demand multi-path routing in Mobile Ad hoc networks.
Next Article:Detection of SIP flooding attacks based on the upper bound of the possible number of SIP messages.
Topics:

Terms of use | Privacy policy | Copyright © 2021 Farlex, Inc. | Feedback | For webmasters