# A novel graph containment query algorithm on graph databases.

1. IntroductionGraphs, as a general data structure, provide a powerful and primitive tool to model the data in a variety of applications, e.g. social or information networks, biological networks, 2D/3D objects in pattern recognition, wired or wireless interconnections, chemical compounds or protein networks. Nodes in graphs usually represent real world objects and edges indicate relationships between the objects. For example, in wireless environments, vertices represent the functional nodes and edges represent connections between those nodes [1]. In chemical informatics and bio-informatics, graphs are employed to denote compounds and proteins. In order to accurately describe the characters of data, the labeled graphs are often applied, in which nodes and edges are associated with attributes. With the tremendous amount of structured or networked data accumulated in large databases, how to efficiently support the scalable graph query processing becomes a challenging research issue in database area.

Given a graph database GD, generally, the graph queries over the database can be divided into two categories, subgraph query and containment query. The subgraph query retrieves the graphs in the database (answer set Q) that are supergraphs of the query graph q, or Q={g|gE [member of] GD^g[??]g}. For example, in chemistry, the subgraph can be used to retrieve the compounds containing the given special substructure. On the other hand, the graph containment query finds the graphs that are subgraph of the query graph, or Q={g|g [member of] GD^g[??]q}. For example, also in chemistry, a descriptor (model graph) is a set of atoms with designated bonds that has certain attributes in chemical reactions. Given a new molecule, identifying "descriptor" structures can help researchers to understand its possible properties.

Recently, the subgraph query, both exact and approximate query, has been received much more concern and been well studied [2-8]. However, the research on graph containment query is far from enough. While processing the containment queries, the methodology of filtering-and-verification approach is adopted to reduce the size of candidate graphs set, but the inclusion logic which is used in subgraph query processing can not be applied. Instead, the exclusion logic is employed. Suppose a query graph q is the supergraph of a graph g. Then, all features in g must be included in q. According to the exclusion logic, if a feature f is not a subgraph of q, the graphs that are supergraphs off can be safely pruned from candidate answer set. Chen [9] proposes an algorithm, cIndex, to process the containment query. The algorithm is built on discriminative redundancy-aware features set. The primary cost of the algorithm is composed by two parts, including the cost on performing subgraph isomorphism between the query graph and the picked features, performing subgraph isomorphism between the query graph and the candidate graphs. During query processing procedure, cIndex performs subgraph isomorphism tests between the query graph and all indexed frequent subgraps. This indiscriminate approach greatly reduces the efficiency of cIndex. Moreover, cIndex can not effectively reduce the size of candidate answer set.

In this paper, we propose an algorithm called Closed Frequent Subgraph Based Index (CFG-Index) to solve graph containment query problem. At first, a graph feature, closed frequent subgraph, is introduced. Next, a feature index is designed to facilitate the query processing algorithms. Then, the feature based query processing algorithm is proposed. Finally, theoretical analysis and experimental evaluation is presented to demonstrate the value of the proposed method. The main contributions of CFG-Index are:

1. Design a closed frequent subgraph index and corresponding algorithms to efficiently process the graph containment query;

2. Evaluate the performance of the propose algorithm theoretically and experimentally; The remainder of this paper is organized as followed. Related works is stated in section 2. Section 3 defines some preliminary concepts about graph query. Section 4 introduces the closed frequent subgraph and Index structure in CFG-Index. Algorithm is introduced in Section 4, too. Section 5 analyses the improvement of this algorithm and section 6 shows the experimental result. At the end, we draw the conclusion in section 7.

2. Related Works

Graphs are prevalently used in various domains. A basic problem among those applications is how to compare graphs and determine the subgraph relationship between graphs. This problem has been associated with different names, such as graph matching, (subggraph isomorphism testing, and so on. It has been proved that the subgraph isomorphism is a NP-complete problem [10]. Recently, the problem receives a growing attention [11]-[13]. For subgraph relationship decision, the Ullmann's algorithm [14] performs a tree search in terms of vertices, and in each step refines the future vertex pairs on the basis of the current partial matching. A recent algorithm VF2 [15] achieves significant improvement over other algorithms in many cases for more efficient refinement heuristic computing. Several years later, Haichuan proposed another subgraph isomorphism algorithm in [16]. These algorithms determine the subgraph isomorphism between two graphs. These algorithms can be use in the graph query processing algorithm for subgraph isomorphism test. For the detection of subgraph isomorphism from many graphs to one, [17] builds a decision tree in preprocessing phase and results in a quadratic time with respect to the input graph size, but with exponential space requirement and preprocessing time, which leads to its inapplicability to large-size databases. An inspiring decomposition-based method [18] results in a time sublinear w.r.t. the number of graphs in database. These detection algorithm methods only aim to induce subgraphs of the query owing to the underlying decomposition strategy. Furthermore, output of the methods is all subgraphs isomorphic to the graphs in answer set of a query graph, which is unnecessary and time consuming for the subgraph set query.

There have been a lot of studies on subgraph query processing. In order to overcome the difficulty on processing of the graph query in arbitrary structure, recently, some filtering-and-verification approaches are proposed, including both exact and approximate query processing algorithm. The exact algorithms for subgraph query processing are proposed in [28, 19]. Based on [2], approximate subgraph query algorithms are also developed, e.g. [20] and [21]. Those algorithms construct the indices of the graph fragments, such as paths, trees, and substructures. In [2], [4], [7], and [8], the data mining techniques are employed to build blocks for extracting features. In [18], [3], [5], [6], and [22] other strategy are used to construct feature set, such as graph closure, directed acyclic graph decomposition, special components in graph.

The subgraph query methods use features, which are subgraphs of database graphs, to distinguish whether a database graph is a candidate answer. Here, the inclusion logic is applied. If query graph q is a subgraph of graph g, all features in q must be included in g. On the contrary, if g without even one feature q contained, it can be safely addressed that q is not a subgraph of g. Then, all graphs with subgraph g can be pruned from the candidate answers set. The smaller size of the candidate set is, the fewer subgraph isomorphism test needs to be performed.

There are some more efficient algorithms which focus on specific queries in some professional domain. For example, in the XML context, [23] constructs a single deterministic pushdown automata to generalize and improve tree pattern matching technique (not for arbitrary in structure graphs) for the specific task of evaluating XPath queries. [24] focuses on finding XML schema embeddings by which an instance-level mapping can be automatically derived and it guarantees information preservation w.r.t. an XML query language. It does not focus on finding schema embeddings from large amounts of source DTD schemas to a single target DTD in a scalable manner.

Most methods mentioned above are designed for subgraph query processing and are not applicable to the graph containment query. To the best of our knowledge, cIndex [9] is the only algorithm that employs the filtering-and-verification approach to process graph containment query. However, the algorithm does not considered how to efficiently organize the features of the graph database to effectively improve the query processing performance, which is the emphasis of this paper. In [9], three algorithms are proposed, cIndex-Basic, cIndex-TopDown and cIndex-BottomUp. The main difference between them is the indexed features picking method. cIndex-Basic directly tests the subgraph isomorphism between query graph and every index feature. This is the most naive way. cIndex-BottomUp puts index features in several layers. At the bottom of this structure is database graph. cIndex-Basic method is repeatedly invoked until the number of features is small enough. The relationship between neighbor layers is supergraph-subgraph. In this way, feature in higher layer covers more database graphs. It is a useful method to reduce subgraph isomorphism. Though cIndex-TopDown is also a multi layer structure, it is different from cIndex-BottomUp. The structure of cIndex-TopDown is a decision tree like structure. Suppose f is a feature on layer i. If [f.sub.i] is a subgraph of q, cIndex-TopDown should check a part of features of layer (i+1) as [f.sub.(i+1)]. If [f.sub.i] is not a subgraph of q, it checks the other part of features of layer i+1 as [f.sub.(i+1)]'. This is the way cIndex-TopDown reducing the number of subgraph isomorphism in indexed features picking stage. But there are some problems. cIndex-Basic compares query graph with indexed features flatly. cIndex-BottomUp may introduce more subgraph isomorphism in feature picking stage. And cIndex-TopDown has to divide feature groups into two parts before query arrives. That introduces extra computations.

Finally, an introduction on graph mining is given in [25]. There has been many methods proposed [26], [27], [28], [29], [30] that can efficiently obtain (closed) frequent (induced) subgraphs from a database. To decrease the number of frequent subgraphs in a parameterized way, graph patterns summarization method was proposed in [31]. They play a very important role in the offline feature construction in the paper.

3. Problem Definition

In this section, we give some definition about graph mining and graph query. Then, we formalize the definition of graph containment query.

Definition 1 (Labeled Graph) A labeled graph G is a 4-tuple (V, E, 1, B), where V is the vertex set, E [??] V x V is the edge set, and I is the label function that maps a vertex in V or an edge in E to a label in the label set 8, i.e. 1: V [union] E [right arrow] 8.

Labeled graph is useful in many applications. In different situations, labels represent different attributes. For example, in chemical compounds, vertex labels represent different types of atoms, and edge labels represent different types of chemical bond between atoms. In this paper, we focus on the graph query over the undirected labeled graph. Our solution can be easily extended to process the queries over the directed labeled graph. In the rest of the paper, for clarity, we use V(g) and E(g) to denote the vertex set and the edge set of graph g, respectively. The size of a labeled graph g is the number of its edges, |g| = |E(g)|.

Definition 2 (Subgraph Isomorphism) Given two graphs [g.sub.1] = ([V.sub.1], [E.sub.1], [I.sub.2], [B.sub.2]) and [g.sub.2] = ([V.sub.2], [E.sub.2], [I.sub.2], [B.sub.2]), [g.sub.1] is subgraph isomorphic to [g.sub.2], if there exists an injective function h: [V.sub.1] [right arrow] [V.sub.2] that satisfied

1) For [for all]V [member of] [V.sub.1], h(v) [member of] [V.sub.2], [I.sub.1](v) = [I.sub.2](h(v))

2) For [for all](u, v) [member of] [E.sub.1], [there exists](h(u), h(v)) [member of] [E.sub.2], and [I.sub.1](u, v) = [I.sub.2](h(u), h(v))

Given two graphs [g.sub.1] and [g.sub.2], if [g.sub.1] is subgraph isomorphic to [g.sub.2], [g.sub.1] is a subgraph of [g.sub.2], denoted by [g.sub.1] [??][g.sub.2], and [g.sub.2] is a supergraph of [g.sub.1], denoted by [g.sub.2] [??] [g.sub.1].

[FIGURE 1 OMITTED]

Example 1. Figure 1 shows a set of labeled frequent subgraphs of graph database D. The edges of the graphs are assigned with labels, e.g. a, b and c are the labels of different edges. For simplicity, the labels of vertices are omitted. The vertices of those graphs are numbered by the canonical code. The code will be introduced in section 4.1. In this example, [g.sub.5] is a subgraph of [g.sub.6] while there exits a subgraph isomorphism from [g.sub.5] to [g.sub.6].

Definition 3 (Containment Query) Given a graph database D = {[g.sub.1], [g.sub.2], ..., [g.sub.n]}, and a query graph q, the containment query Qc returns the graphs in D that are subgraphs of q, i.e. [Q.sub.c] = {[g.sub.i]|[g.sub.i] [member of] D^gi [??]q}

While a containment query is processed, subgraph is omorphism test must be performed to verify whether a candidate graph is a subgraph of the query graph. It is a time consuming operation. In order to reduce the overall cost of subgraph isomorphism test, we design an index structure, CFG-Index, to increase the efficiency of query processing algorithm.

4. Containment Query Processing

In this section, we introduce the algorithm CFG-Index to process graph containment search problem. First, we give a canonical code of graphs. Graphs can be translated into sequences by such code. Then, we define closed frequent subgraphs to get smaller size of candidate answer set to promote the efficiency. These subgraphs are organized by a DFS tree according to the mining order. At last, the algorithm is proposed.

4.1. Canonical Code

Graph canonical code is using to translate a labeled graph into sequence by 1-1 mapping. Let f: G [right arrow] S be a canonical code function, if g is isomorphic to g', f (g) = f (g'), otherwise, f (g) [not equal to] f (g'). Here, G is a graph and S is a sequence which graph can be translated to.

One labeled graph can be represented by adjacent matrix or adjacent list, which makes the translation possible. However, given a graph g, there may be lots of sequences for different order of edge sequence. If the lexicographic order is introduced into these sequences, there will always be a unique minimum value or maximum value.

Each edge in a labeled graph is denoted by a 5-tupie code ([n.sub.1], [n.sub.2], [v.sub.1], e, [v.sub.2]), where [n.sub.1], [n.sub.2] are vertex serial numbers which are formed according to the order of mining, [v.sub.1], [v.sub.2] are vertex labels, and e is edge label. This is a transform of adjacent list. In this paper, we use the smallest 5-tupie edge sequence to represent a graph.

4.2 Closed Frequent Subgraph

Frequent subgraphs are the most common features of the graph database. They can be retrieved by graph mining algorithms. When containment query is evaluated against the frequent subgraphs, the candidate results can be obtained by filtering the graphs without the frequent subgraphs in the query graph. Nevertheless, the whole frequent subgraph set of the database is too large to manipulate and evaluate, since it contains many structural redundant features. Thus, this section proposes a set of special frequent features that brings more structural information and contains smaller redundant features. It can be indexed for efficient containment query processing.

Definition 4 (Frequent subgraph) Given a graph database D = {g1, [g.sub.2], ..., gn}, a graph g is a frequent subgraph of D, if the value of support(g, D) is not less than a given minimum threshold, where support (g, D) is the percentage of graphs in D that are supergraphs of g, i.e.

support(g,D) = [|{[g.sub.i]|[g.sub.i][member of]D^g[??][g.sub.i]}|/|D|]

The support set of a frequent subgraph g is a set of database graphs that are supergraphs of g, or {gi|gi [member of] D ^ g [??] gi}. In this way, the support of a subgraph can be represented by the ratio of its support set and graph database: support (g, D) = ssupport set of g|/|D|.

The frequent subgraphs of a graph database can be retrieved by graph mining algorithms. Most graph mining algorithms adopt the depth first search (DFS) order to generate the frequent subgraphs. This order introduces the filiations of the retrieved subgraphs naturally. All qualified frequent features form a DFS tree. Nodes of the tree denote frequent subgraphs. Two neighbor nodes with one edge difference represent that the corresponding graphs are subgraph and supergraph. The root node is a NULL graph without any nodes or edges. A node on level k denotes that it is a subgraph with k edges. The path from root to a node represents the combination of edges according to the mining order.

[FIGURE 2 OMITTED]

Example 2. Figure 2 presents DFS tree of the frequent subgraphs in figure 1. In the figure, the filiations between the nodes in the tree can be easily found. Since the graph [f.sub.5] is the subgraph of the graph [f.sub.6], in the tree, the node representing [f.sub.5] is the ancestor of the corresponding node of [f.sub.6]. The graph [f.sub.5] and [f.sub.6] have 2 and 3 edges respectively. Therefore, the corresponding nodes of the two graphs in the tree are located in level 2 and level 3. Moreover, because there is a path, which labeled as "abc" in canonical code, from root to [f.sub.6], it can be derived that [f.sub.6] contains three edges with the labels 'a', 'b' and 'c'.

Definition 5 (Closed Frequent Subgraph) Given the frequent subgraph set FG of the graph database D, the closed frequent subgraph set CFG of D is the subset of FG that only contains the frequent subgraphs which have no supergraphs in FG with the same support, i.e.

CFG = {g|g [member of] FG ^ [??]g'[member of] FG such that g [??] g' and support(g, D) = support(g', D)}

Generally, the size of a graph database's frequent subgraph set is much greater than that of the corresponding closed frequent subgraph set. In figure 2, for example, the closed frequent subgraph set is {[f.sub.2], [f.sub.4], [f.sub.6], [f.sub.8], [f.sub.10], [f.sub.11], [f.sub.12]}, which only contains 2/3 of the elements in frequent subgraph set. For clearly, the description of a node is in the form of figure [f.sub.i]: support([f.sub.i], D). Furthermore, while considered the NCI/NIH database, if the minimum threshold of support(g, D) is | NCI/ NIH | *0.05, the number of frequent subgraphs is 1,000,000, but the corresponding counterpart only contains 2000 closed frequent subgraphs. In the rest of the paper, we use FG and CFG as the abbreviation of frequent subgraph and closed frequent subgraph, respectively.

Though CFGs are a subset of corresponding FG, they have its own special properties.

Property 1 Every FG always has a supergraph in CFG set with the same support.

This property can be directly deduced from the definition of CFG. In a DFS tree of frequent subgraphs, a CFG must be the supergraph of all FGs with the same support on its path to root.

Property 2 The whole FG set can be deduced from CFG set. One can delete edges from a CFG to generate all FGs that have the same support with this CFG. When two CFGs generate the same FG with different support, the FG with the higher value should be the correct one.

Though CFG is more small and concentrated, it can substitute FG in some applications. If further analysis (such as classification and clustering) is performed on the set of closed graphs instead of on the whole set of frequent graphs, it will achieve the same accuracy with less redundancy and better efficiency [27].

Conclusion 1 ThesizeofCFGsetissmallerthan corresponding FG, and CFG covers all structural information FG brings.

Conclusion 1 shows that CFG has more discriminative character than FG. Using CFGs as indexes features can get a smaller size of candidate answers set while a smaller size of index. The size of the candidates set greatly influences the performance of the query processing procedure, since it determines the times of subgraph isomorphism test that needs to be performed during verification stage. Therefore, while processing the containment query, we can achieve the same accuracy with less redundancy and better efficiency by performing subgraph isomorphism test on closed frequent subgraph set instead of the whole frequent subgraph set.

4.3 CFG-Index

Subgraph isomorphism test must be performed between query graph and indexed feature. Thus, the size of feature index is one of the important issues that affect the efficiency of the graph containment query processing algorithm. CFGs are retrieved by graph mining. For each CFG feature, the graph mining algorithm records the identifiers of the graphs that contain this feature. While designing the CFG-Index to facilitate the query processing algorithm, without loss of accuracy, we must limit the indexed CFGs as few as possible. In practice, for building an index, there may exist some redundant features in all CFGs of a graph database. Given a CFG f, if the support set of f is similar to the intersection of all f's sub-features, i.e. [??]f'[??]f [S.sub.f], f can be considered as redundant feature, since most graphs contains ALL f 's sub-feature are also supergraph of f.

Definition 6 (Discriminative Ratio) Discriminative ratio [delta] is the proportion between support sets of a feature f and all its sub-features, i.e.

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

The support set of a feature f is always smaller than the intersection of support sets of sub-features of f, i.e. [delta] [greater than or equal to] 1. The greater [delta] is, the more discriminative f is.

Since the contribution of redundant feature can be substituted by its sub-features, in the process of index construction, the redundant features are discarded to reduce the size of index and avoid redundant subgraph isomorphism tests. The size of index can be controlled by a given discriminative ratio. When index is built, it begins with the smallest frequent CFGs, 1-edge features. The number of the indexed features in CFG-Index is increased edge by edge. In a word, the index is composed by two parts: 1) All 1-edge CFGs; 2) The CFGs does NOT satisfy

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Example 3. In figure 2, [f.sub.11] and [f.sub.12] are CFGs. |[D.sub.12]| = 160 and |[D.sub.11]| = 162. The discriminative ratio of [f.sub.12] is [delta] = 162/160 = 1.01. The feature [f.sub.2] and [f.sub.4] are CFGs. Discriminative ratio between them is [delta] = 1.30. If [[delta].sub.min] = 1.2, [f.sub.4] should be recorded in the CFG-Index while [f.sub.12] should be discarded.

For decreasing the number of subgraph isomorphism tests, CFG-Index builds a tree structure instead a flat one like cIndex does. In CFG mining process, a DFS tree is formed. This tree will be used in CFG-Index as DFS index tree T The root node of this tree is a NULL graph (a graph has no node and edge). Every leaf node of this tree is a discriminative CFG. The mid nodes of this tree are either discriminative or un-discriminative one. The un-discriminative ones are preserved for keeping the tree entirely.

Proposition 1 If a feature f is contained in a graph g, all subfeatures off are contained in g, otherwise all super-features of fare not contained in g.

Given two graphs [g.sub.1] and [g.sub.2], if [g.sub.1] [??] [g.sub.2], we can draw the relationships between [g.sub.1], [g.sub.2] and query graph q according to above proposition. If [g.sub.1] is not a subgraph of q, [g.sub.2] is not contained in q as well. On the contrary, if [g.sub.2] is a subgraph of q, [g.sub.1] is a subgraph of q([g.sub.1] [??] q).

After features are organized in a tree, proposition 1 can be used to distinguish whether a feature is contained by q more efficiently. The query processing algorithm travels CFG-Index T in depth first order. When algorithm travels T it checks whether current node [n.sub.c] is a subgraph of q. If [n.sub.c] is not a subgraph of q, CFG-Index prunes the whole subtree rooted at [n.sub.c] directly without any tests.

4.4 Query Processing Algorithm

The algorithm of CFG-Index is presented in algorithm 1. The algorithm invokes SMining at the beginning and gets set F that contains indexed features which are not included by q, i.e. F = {f|f [member of] T ^ f [??] q}. The union in line 2 is combined by support set of features in F, i.e. [??]f[member of]F {[g.sub.i]|[g.sub.i] [member of] D^[Florin][??][g.sub.i]} .

According to exclusion logic, these graphs should not be candidates which are contained by q. The complementary set of this union is the candidate answer set (line 3). To get the answer set, CFG-Index takes subgraph isomorphism tests between q and every candidate graph (line 4 to line 9).

Algorithm 1 CFG-Index

Input: query graph q, graph database D, DFS tree T Output: answer set C

1: F [left arrow] SMining(q, T);

2: [D.sub.List] [left arrow] union of supergraph ID lists of features in F;

3: C [left arrow] D - [D.sub.List].

4: for every graph c in C

5: verification(c);

6: if c is contained by q

7: continue;

8: else

9: prune c from C;

10: return C;

SMining is a procedure to implement features picking. It is important to distinguish whether an index feature is a subgraph of query graph efficiently. During query processing, subgraph isomorphism test is unavoidable. In [9], it takes subgraph isomorphism test between q and every index items. In CFG-Index, it compares q with index features according to the depth first order of tree T in CFG-Index. There are two advantages in CFG-Index. First, the times of subgraph isomorphism tests are reduced by generating subgraphs of q according to the nodes in T. Second, if a feature f is not a subgraph of q, the whole branch of f in DFS tree can be pruned by exclusion logic. All supergraphs of a feature can be pruned as soon as this feature is identified as a non-subgraph of q. This method improves the efficiency in the stage of feature distinguishing.

Definition 7 (Embedding) Given two labeled graphs [g.sub.1] and [g.sub.2], g' is a subgraph of [g.sub.1], i.e. g' [??] [g.sub.1]. If g' is isomorphic to [g.sub.2], g' is called an embedding of [g.sub.2] in [g.sub.1].

In figure 1, for example, [f.sub.2] has two embeddings of [f.sub.1], and f10 has one embedding of [f.sub.7]. In this paper, we denote an embedding of [g.sub.2] in [g.sub.1] as E([g.sub.2], [g.sub.1]).

Procedure SMining is presented in Algorithm 2. The input of SMining is a query graph q and current node in T. At the beginning of SMining, current node is root. The output is all index features which are not contained by q.

Algorithm 2 SMining

Input: query graph q, current_Node in T, embedding of edge e Output: index features not contained by q

1: F [arrow left] [phi];/* Features are not contained by q. */

2: for every son [s.sub.i] of current_Node do

3: if [s.sub.i] [??] q

4: e [arrow left] E ([s.sub.i], q);

5: current_Node [arrow left][s.sub.i];

7: else /* [s.sub.i] [??] q */

8: O [arrow left] all discriminative CFGs in the offspring of [s.sub.i];

9: F [arrow left] F [union] O;

10: prune subtree rooted at si from T;

11: endif

12: endfor

13: return F;

SMining distinguishes features as follows. When a query q arrives, SMining begins its depth first travel from root of tree T as current node. When it reaches a son s. of current node which is a subgraph of q, SMining records E(s., q) and sets si as current node. This operation repeats recursively until it meets leaf nodes or a node which is not contained by q (line 4 to line 6). If current node is not a subgraph of q, its entire offspring are not contained by q according to exclusion logic (line 8 to line 10). In this scenario, SMining puts all indexed discriminative CFGs into F and ends this recursive calling.

[FIGURE 3 OMITTED]

Example 4. When a query graph q which is shown in figure 3 arrives, SMining begins its travel from root of T and set it as current node. According to depth first order, it checks one son of the three, such as son [f.sub.1] which is a one edge graph with edge label 'a'. SMining cannot find any embeddings of [f.sub.1], it prunes the whole branch of [f.sub.1]. At the same time, it puts [f.sub.2], [f.sub.4], and [f.sub.6] into F. Then, SMining checks [f.sub.7]. Here, E([f.sub.7], q) = 3. SMining records E([f.sub.7], q) and checks sons of f.sub.1]. After SMining checks entire offspring of [f.sub.7], it only puts fro into F because f8 is contained by q. When SMining is ended, there are five features in F, {[f.sub.2], [f.sub.4], [f.sub.6], [f.sub.10], [f.sub.12]}. After F is formed, in CFG-Index, it finds out every support set of each feature in F and combines them. The candidate answer set should be the complimentary set about graph database D. After candidate answer set forms, it verifies graphs in this set one by one.

CFG-Index has three improvements to cIndex. One is that it does subgraph isomorphism test between subgraphs q and index features in a simple way through DFS tree. The second is that CFG-Index prunes a branch after it finds a feature is not contained by q immediately to avoid further computation. The last one is that it picks discriminative features as index ones to reduce the size of candidate answer set. In next section, we will illustrate the performance of CFG-Index.

5. Performance Analysis

While a containment query q is issued, the response time of the query processing algorithm can be estimated by [T.sub.response] [T.sub.search] + |F| x [T.sub.isSubQ] + |[C.sub.q]| x [T.sub.isSuperG], where [T.sub.response] is the response time, I F I is the size of indexed feature set, |[C.sub.q]| is the size of candidate graphs set, [T.sub.search] is the time for computing [C.sub.q] with the index, [T.sub.isSubQ] the time for subgraph isomorphism test between q and index features, and [T.sub.isSuperG] is the time for subgraph isomorphism test between q and candidate graphs. Comparing to the other two parts, [T.sub.search] is much smaller and can be ignored. In order to reduce the response time, the algorithm must focus on minimizing the size of F and [C.sub.q].

In CFG-Index, |[C.sub.q]| is reduced by the discriminative character of CFGs. Direct outcome from this reduction is the improvement on the efficiency of verification procedure by the subdue of subgraph isomorphism tests. In this section, the reduction of | F| in CFG-Index is analyzed.

5.1 Index Feature Filtering

Different from subgraph query, graph containment query employs exclusion logic to decrease the size of candidate answers set. Query processing algorithm must perform subgraph isomorphism test between query graph and the index features to figure out the features that are not contained by the query graph. Subgraph isomorphism is NP-complete problem. There are three algorithms for subgraph isomorphism test, Ullmann's algorithm [13], VF2 [14], and [15] algorithms. The three algorithms are based on enumeration. Though the algorithms uses the branch pruning method to accelerate the performance, the complexity of enumeration is O(n!), where n is the number of edges of the largest subgraph in indexed features contained by the query graph.

CFG-Index utilizes the DFS index tree to distinguish whether an index feature is a subgraph of the query graph. When a query q is issued, unlike other graph query algorithms, CFG-Index does not enumerate all subgraphs of q and compares with index features. The algorithm travels the index tree T according to DFS order. It identifies whether current node [n.sub.c] is a subgraph of q. If current node [n.sub.c] is not a subgraph of q, the algorithm prunes the subtree in T rooted by [n.sub.c],for all offspring of nc cannot contained by q. This step is guaranteed by proposition 1 in subsection 4.3. Then, the algorithm feeds back to upper layer of nc or a sibling of [n.sub.c] and sets that node as current node. If [n.sub.c] is a subgraph of q, the algorithm records E([n.sub.c], q) and sets every son of [n.sub.c] as current node recursively.

Theorem 1. If a feature in DFS tree is contained by the query graph, it must not be filtered by CFG-Index.

Proof: We prove the theorem by induction. Let g and q be a node in DFS tree Tand the query graph, respectively.

1. When |g| = 1, CFG-Index finds all embeddings of g in q. It guarantees that CFG-Index can filter g if it is contained by q.

2. Suppose the theorem is true, when |g| = k and q contains g.

Next, when |g'| = |g + e| = k + 1, If g' can be found by CFG-Index, the theorem is proved.

According to proposition 1 in subsection 4.3, if g' is a subgraph of q, then g is a subgraph of q, too. It can be get from this proposition that E(g', q) is formed on E(g, q), e.g. |E(g', q)| = n' and |E(g, q)| = n, then n' [less than or equal to] n. Or, E(g', q) is based on E(g, q). CFG-Index records all E(g, q) in SMining. DFS tree records the position information of edge 'e' that added into E(g, q) to form E(g', q). Thus, if g is a subgraph of q and can be certified by CFG-Index, g' can also be certified as well.

According to theorem 1, CFG-Index is complete and accurate. CFG-Index improves the query processing efficiency by reducing the subgraph isomorphism tests. According to the algorithm SMining, if the canonical code of current node [S.sub.c] is ([n.sub.0][e.sub.0][n.sub.0]') ([n.sub.1][e.sub.1][n.sub.1]')...([n.sub.k][e.sub.k][n.sub.k]') and it is contained by the query q, SMining should distinguish whether its sons [s.sub.1], [s.sub.2],..., [s.sub.r] are subgraphs of q. Every [s.sub.i] includes a distinct edge added to [S.sub.c], or [s.sub.i] = [S.sub.c]U([n.sub.i][e.sub.i][n.sub.i]'). Unlike subgraph isomorphism, in SMining, it only checks whether there exists an edge at n, with label 'ei' and another node with label 'ni". SMining records E([S.sub.c], q) and describes it as canonical code. Thus, SMining only needs to check edges connected to [n.sub.i] on E([S.sub.c], q). In chemical molecule graph database, average degree of vertex is three. Thus, if [n.sub.i] is in the middle of sequence [S.sub.c], there is only one choice on average. If [n.sub.i] is added to the end of sequence [S.sub.c],there are just two choices. Then, the response time of CFG-Index can be practically improved.

5.2 Bisearch in Feature Distinguishing

When current node is a subgraph of query q, CFG-Index should checks the entire subtree rooted at it. In this process, bisearch like method are employed in CFG-Index to accelerate the search procedure. Suppose that there is a path from root to leaf that can be described as "[Rn.sub.1] [n.sub.2] ... [n.sub.k]". Here, R is the root of DFS tree T [n.sub.i] is a node of T on the path, i is the level of [n.sub.i], and k is the height of the path. If [n.sub.i] is a subgraph of q, the algorithm does not check the path linearly but turns to check the node in half of the path. Then, [n.sub.[k/2]] is checked instead of [n.sub.2]. If [n.sub.[k/2]] is a subgraph of q, the algorithm jumps to check the path in [[n.sub.[k/2]]+1, [n.sub.k]] until meets an exception. If [n.sub.[k/2]] is not a subgraph of q, CFG-Index prunes the subtree rooted at and recursively performs this process in [R, [n.sub.[k/2]]). Thus, the approach reduces the check scan time from O(k) to O(log k).

6. Experimental Result

In order to evaluate the effectiveness and efficiency of the proposed query processing algorithm, we compare our solution with cIndex, the up-to-date approach, by comprehensive experiments. The real and synthetic datasets are chosen as the workload in the experiments. Most of the experiments are conducted on the real datasets. They are the source of real demand after all. The experiments are performed on a 3.2GHZ, 512MB-memory, Intel PC running Windows XP Professional SP3. Both cIndex and CFG-Index are compiled with gcc/[g.sup.++].

6.1 AIDS Antiviral Screen Dataset

The parameters of cIndex are set as follows. 10,000 graphs are randomly chosen from AIDS to form the dataset D of cIndex (In the dataset, there are more than 40,000 chemical compounds.). Next, D is randomly partition into 5 pieces [D.sub.1], ..., [D.sub.5] with equal size. Among the 5 pieces of dataset, one is query set and the other 4 pieces are query log sets. Contrast subgraphs are mined with the minimum support [[sigma].sub.c] = 0.05. Then, the support of the minimum average candidate answers set lies within 0.1 and 0.01. Meanwhile, in cIndex-TopDown algorithm, the smallest number of queries in each leaf node is set to 100. In CFG-Index, the minimum support, min_sup, is a variable with respect to the size of frequent subgraphs. The smaller subgraph size is, the greater min_sup is. And for variable min_sup and discriminative ability of features, we control the number of indexed features about 200. The graph dataset [D.sub.init] contains the frequent subgraphs that is retrieved by the graph mining algorithm under the support range [0.5% ~10%]. The test dataset, denoted by [D.sub.10,000], is 10,000 graphs that are randomly selected from [D.sub.init].

In the experiment, the performance of CFG-Index is compared with that of cIndex-TopDown which is more efficient on processing graph containment query than that of cIndex-Basic and cIndex-BottomUp. In order to compare the performance of the two algorithm, 2,000 queries are divided into eight bins with the range [0, 10), [10, 20), [20, 30), [30, 40), [40, 100), [100, 200), [200, 500), [500, [infinity]), which are the size of the average query answers set, i.e., the number of graphs in dataset that are contained in query.

Figure 4 compares the performance of the two algorithms. It is figured that the average query processing time of CFG-Index is about 3 times faster than that of cIndex-TopDown. The improvement is achieves by reducing the subgraph isomorphism test during feature distinguishing stage and decreasing the size of average candidate answers set.

[FIGURE 4 OMITTED]

The size of candidate answers sets is compared in figure 5. The size of the candidate answers set is one of the key parameters that influence the performance the query processing algorithm. In the figure, X axis represents the average size of the actual answers set, while Y axis denotes the average size of the candidate answers set. In the figure, the candidate answers set of cIndex is larger than that of CFG-Index. It depicts that the proposed method is more effective on filtering the useless candidates. This is one of the reasons that CFG-Index outperforms cIndex.

[FIGURE 5 OMITTED]

[FIGURE 6 OMITTED]

In order to evaluate scalability of the proposed algorithm, four datasets are generated based on randomly selecting graph from [D.sub.init]. The size of the four datasets ranges from 10,000 to 70,000. Under the varying dataset size, figure 6 demonstrates the performance of two query processing algorithms and figure 7 presents the average size of candidate answers sets. The figures show that the scalability of CFG-Index is better than that of cIndex.

[FIGURE 7 OMITTED]

These experiments present that the improvement ratio on the efficiency of the CFG-Index is not same as the decrease ratio on the size of candidate answer set. The extra gain comes from the feature distinguishing procedure. This is coincident with the performance analysis.

6.2 Synthetic Dataset

This section presents experiment result on synthetic dataset. The widely used graph generator [26] is employed to generate the synthetic dataset. Users can set parameters to decide the numbers of graphs wanted in dataset and their average size. The value of the parameters is listed in table 1.

In order to conduct experiment on datasets with different characteristics from the real one, we generate the graph dataset and query graphs separately. The naming method is shown by an example, such as a dataset which has 10,000 graphs, each graph has 15 edges on average, the potential frequent graph has 5 edges on average, 100 potential frequent graphs, and 5 available node labels, this dataset can be represented by D10kN515T15L100. The dataset we generated is D10kT15L10015V5S, and query sets are: D10kN515T40L100, D10kN515T45L100, D10kN515T50L100, D10kN515T55L100, and D10kN515T60L100. We denote these five query sets as Q1, Q2, Q3, Q4 and Q5. Each query set is divided into a query log set and a testing query set with 8,000 and 2,000 graphs, respectively.

Figure 8 reports the average query processing time of cIndex and CFG-Index on synthetic datasets and figure 9 reports the average size of candidate answer set. The performance of CFG-Index is still much efficient than that of cIndex-TopDown on synthetic datasets.

[FIGURE 8 OMITTED]

[FIGURE 9 OMITTED]

7. Conclusion

In the paper, a graph containment query processing algorithm is proposed. In the algorithm, the filtering-and-verification approach is adopted. Instead of using discriminative redundancy-aware features like cIndex, a subset of closed frequent subgraphs is chosen as the index features. Though index based on CFGs reduces the size of candidate answer set, the indexed features are organized as a DFS tree to accelerate the stage of index picking. Performance analysis and experimental results show that the proposed algorithm is more efficient than the up-to-date approach.

Received 13 November 2008; Revised 11 March 2009; Accepted 18 March 2009

References

[1] Marie Thilliez., Thierry Delot., Sylvain Lecomte (2005). An Original Solution to Evaluate Location-Dependent Queries in Wireless Environments. Journal of Digital Information Management, 3(2).

[2] Yan, X., Yu, P., Han, J (2004). Graph Indexing: A Frequent Structure based Approach. In: Proc. of the Special Interest Group on Management of Data 2004 (SIGMOD'04), Paris, France. p. 335-346

[3] He, H., Singh, A (2006). Closure-Tree: An Index Structure for Graph Queries. In: Proc. of the 22nd International Conference on Data Engineering (ICDE 2006), Atlanta, GA, USA. April, p. 38-49.

[4] Zhang, S., Hu, M., Yang, J (2007). TreePi: A Novel Graph Indexing Method. In: Proc. of the 23rd International Conference on Data Engineering (ICDE 2007), Istanbul, Turkey. October, p. 966-975.

[5] Williams, D., Huan, J., Wang, W (2007). Graph Database Indexing Using Structured Graph Decomposition. In: Proc. of the 23rd International Conference on Data Engineering (ICDE 2007), Istanbul, Turkey. October, p. 976-985.

[6] Jiang, H., Wang, H., Yu, P., Zhou, S (2007). GString: A Novel Approach for Efficient Search in Graph Databases. In. Proc. of the 23rd International Conference on Data Engineering (ICDE 2007), Istanbul, Turkey. October, p. 566-575.

[7] Cheng, J., Ke, Y., Ng, W., Lu, A (2007). FG-Index: Towards Verification Free Query Processing on Graph Databases. In. Proc. of the Special Interest Group on Management of Data 2007 (SIGMOD'07), Beijing, China. June, p. 857-872.

[8] Zhao, P., Yu, J., Yu, P (2007). Graph Indexing: Tree + Delta >= Graph. In: Proc. of the 33rd International Conference on Very Large Data Bases (VLDB'07), University of Vienna, Austria. September, p. 938-949.

[9] Chen, C., Yan, X., Yu, P., Han, J., Zhang, D., Gu, X (2007). Towards Graph Containment Search and Indexing. In: Proc. of the 33rd International Conference on Very Large Data Bases (VLDB'07), University of Vienna, Austria. September, p.926-937.

[10] Garey, M., Johnson, D (1979). Computersand Intractability: A Guide to the Theory of NP-Completeness. New York, W. H. Freeman and Company.

[11] Bunke, H (2000). Graph Matching: Theoretical Foundations, Algorithms, and Applications. In. Proc. of the Vision Interface 2000, Quebec, Canada. May, p. 82-88.

[12] Conte, D., Foggia, P., Sansone, C., Vento, M (2004).Thirty Years Of Graph Matching In Pattern Recognition. International Journal of Pattern Recognition and Artificial Intelligence 18(3) 265-298.

[13] Fortin, S (1996). The Graph Isomorphism Problem. University of Alberta.

[14] Ullmann, J (1976). An Algorithm for Subgraph Isomorphism. Journal of the ACM (JACM) 23(1) 31-42.

[15] Cordella, L., Sansone, C., Vento, M (2004). A (SubGraph Isomorphism Algorithm for Matching Large Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(10) 1367-1372.

[16] Haichuan Shang., Ying Zhang., Xuemin Lin., Jeffrey Xu Yu (2008). Taming Verification Hardness: an efficient algorithm for testing subgraph isomorphism. In: Proc. of the 34th International Conference on Very Large Data Bases (VLDB'08).

[17] Messmer, B., Bunke, H (1999). A decision tree approach to graph and subgraph isomorphism detection. Journal of the Pattern Recognition 32(12) 1979-1998.

[18] Messmer, B., Bunke, H(2000). Efficient Subgraph Isomorphism Detection: A Decomposition Approach. Journal of IEEE Transactions on Knowledge and Data Engineering 12(2) 307-323.

[19] Shasha, D., Wang, J., Giugno, R(2002). Algorithmics and Applications of Tree and Graph Searching. In: Proc. of the 21st Principles of Database Systems (PODS'02), Madison, Wisconsin. June, p. 39-52.

[20] Yan, X., Yu, P., Han, J (2005). Substructure Similarity Search in Graph Databases. In: Proc. of the Special Interest Group on Management of Data 2005 (SIGMOD'05), Baltimore, Maryland, USA. June, p. 766-777.

[21] Yan, X., Zhu, F., Han, J., Yu, P (2006). Searching Substructures with Superimposed Distance. In: Proc. of the 22nd International Conference on Data Engineering (ICDE 2006), Atlanta, GA, USA. April, p. 88-99.

[22] Zou, L., Chen, L., Yu, J., Lu, Y (2008). A novel spectral coding in a large graph database. In: Proc. of the 11th International Conference on Extending Database Technology (EDBT'08),. Nantes, France. March, p. 181-192.

[23] Gupta, A., Suciu, D (2003). Stream Processing of XPath Queries with Predicates. In: Proc. of the Speciallnterest Group on Management of Data 2003 (SIGMOD'03), California, USA. June, p. 419-430.

[24] Bohannon, P., Flaster, M., Narayan, P(2005). Information Preserving XML Schema Embedding. In: Proc. of the 31st International Conference on Very Large Data Bases (VLDB'05), Trondheim, Norway. August, p. 85-96.

[25] Washio, T., Motoda, H (2003). State of the art of graph-based data mining. Journal of SIGKDD Explorations 5(1) 5968.

[26] Kuramochi, M., Karypis, G (2001). Frequent Subgraph Discovery. In: Proc. of the IEEE International Conference on Data Mining 2001 (ICDM'01), San Jose, California, USA. November, p. 313-320.

[27] Yan, X., Han, J(2003). CIoseGraph: mining closed frequent graph patterns. In: Proc. of the 9th International Conference on Knowledge Discovery and Data Mining (SIGKDD'03), Washington, DC, USA. August, p. 286-295.

[28] Yan, X., Han, J (2002). gSpan: Graph-Based Substructure Pattern Mining. In: Proc. of the 2002 IEEE International Conference on Data Mining (ICDM'02), Maebashi City, Japan. December, p. 721-724.

[29] Borgelt, C., Berthold, M (2002). Mining Molecular Fragments: Finding Relevant Substructures of Molecules. In: Proc. of the 2002 IEEE International Conference on Data Mining (ICDM'02), Maebashi City, Japan. December, p. 51-54.

[30] Worlein, M (2006). Extension and parallelization of a graph-mining algorithm. PhD thesis, Friedrich Alexander University.

[31] Liu, Y., Li, J., Gao, H (2008). Summarizing Graph Patterns. In: Porc. of the 24th International Conference on Data Engineering (ICDE'08),. Cancun, Mexico. April, p. 903-912.

Xiantong Li, (1) Wei Zhang, (1) Jianzhong Li (1)

(1) School of Computer Science and Technology Harbin Institute of Technology Harbin 150001. P.R. China {lxt, weizhang, lijzh}@hit.edu.cn

Table 1. Parameters of the data generator Parameter Description e.g. D The total number of graphs D10K in a generated dataset N The number of possible labels N5 T The average size of graphs T15 in terms of edges I The average size of potentially 15 frequent graphs L The number of potentially L100 frequent graphs

Printer friendly Cite/link Email Feedback | |

Author: | Xiantong, Li; Wei, Zhang; Jianzhong, Li |
---|---|

Publication: | Journal of Digital Information Management |

Article Type: | Report |

Date: | Jun 1, 2009 |

Words: | 8263 |

Previous Article: | A new authentication scheme for session initiation protocol. |

Next Article: | Image recognition of occluded objects based on improved curve moment invariants. |

Topics: |