Printer Friendly
The Free Library
19,595,263 articles and books
Member login
User name  
Password 
 
Join us Forgot password?

Applying SD-tree for object-oriented query processing.


1 Introduction

The advent of internet has made the volume of data going high everyday in all computer-based applications. This has entrusted researchers to design more powerful techniques to generate and manipulate large amounts of data to derive useful information. Indexing plays a vital role in the fast recovery of required data from large databases.

Among the many indexing techniques reported the signature file approach is preferred for its efficient evaluation of set-oriented queries and easy handling of insert and update operations. Initially applied on text data [2, 3, 12, 17, 21] it has now been used in other applications like office filing [6], relational and Object-Oriented Databases [14,16,26] and hypertext hypertext, technique for organizing computer databases or documents to facilitate the nonsequential retrieval of information. Related pieces of information are connected by preestablished or user-created links that allow a user to follow associative trails across the  [9].

Signatures are hash coded abstractions of the original data. It is a binary pattern of predefined length with fixed number of ls. The attributes' signatures are superimposed su·per·im·pose  
tr.v. su·per·im·posed, su·per·im·pos·ing, su·per·im·pos·es
1. To lay or place (something) on or over something else.

2.
 to form object's signature. To resolve a query, the query signature say Sq is generated using the same hash function An algorithm that turns a variable-sized amount of text into a fixed-sized output (hash value). Hash functions are used in creating digital signatures, hash tables and short condensations of text for analysis purposes (see hash buster).  and compared with signatures in the signature file for l s sequentially and many non-qualifying objects are immediately rejected.

If all the Is of Sq matches with that of the signature in the file it is called a drop. The signature file method guarantees that all qualifying objects will pass through the filtering mechanism; however some non-qualifying objects may also pass the signature test. The drop that actually matches the Sq is called an actual drop and drop that fails the test is called false drop. The next step in the query processing is the false drop resolution. To remove false drops Search results that meet your search criteria but have nothing to do with the information you were trying to find. For example, you want biographical details about Grover Cleveland, but you get information about Cleveland, Ohio. See full-text search.  each drop is accessed and checked individually. The number of false drops can be statistically controlled by careful design of the signature extraction method [7] and by using long signatures [3,6].

1.1 Related work

Different approaches have been discussed by researchers to represent Signature file in a way conducive for evaluating queries, such as Sequential Signature File [31], Bit-Slice Signature file [31], Multilevel Signature file [25], Compressed Multi Framed Signature file [23], Parallel Signature file [20], S-Tree and its variants [13,24], Signature Graph [28] and Signature tree [27,29,30].

1.2 Motivation for the current work

The signature tree developed by Chen [30] is having the following drawbacks:

* Signatures are inserted considering both 0s and 1s whereas actual weight age is for set bits only.

* Insertion path is dictated by the existing tree structure.

* To process a query, bits appearing in the tree from root node (mathematics, data) root node - In a tree, a node with no parents, but which typically has daughters.  are compared with query signature pattern for 0s and I s and not by its set bits.

* For a 0-bit in the query both left and right sub tree is followed leading to multiple traversals.

These observations laid the foundation for the current work. We study a new indexing technique for OODBSs using the dynamic balancing of B+ tree called Signature Declustering (SD)-tree in which the positions of 1s in the signatures are distributed over a set of leaf nodes. Using this for a given query signature all the matching signatures can be retrieved cumulatively in a single node.

The rest of the paper is organized as follows. In section 2 we discuss briefly the different approaches used to represent the signature file. In section 3 the structure of the proposed SD-tree is shown. Section 4 is devoted to the algorithms for insert, search and delete operations. Section 5 proposes a sample data set and queries to validate the new structure. Section 6 reports the results of the experiments conducted with the analytical comparison of SD-tree with that of Signature tree [30]. Finally section 7 concludes the work with further outlook on the work.

2 A summary of signature file techniques

2.1 Signature files

A Signature is a bit string formed from a given value. Compared to other index structures, signature file is more efficient in handling new insertions and queries on parts of words. Other advantages include its simple implementation and the ability to support a growing file. But it introduces information loss which can be minimized by carefully selecting the signature extraction method.

Techniques for signature extraction such as Word Signature (WS) [2,3,4,6], Superimposed Coding (SC) [1,2,3,4,5,6,10], Multilevel Superimposed Coding [12], Run Length Encoding See RLE.  (RL) [3,4,6], Bit-block Compression (BC) [3,4], Variable Bit-block Compression (VBC) [4,6] have been reported. The encoding See encode.  scheme sets a constant number say m, of 1s in the range [1..F], where F is the length of the signature.

The resulting binary pattern with m number of 1s and (Fm) number of 0s is called a word signature. The signature of a text block or object can be obtained by superimposing (logical OR operation) all its constituent signatures (i.e) word signatures for text block and attributes' signatures for object. The set of all signatures form a signature file. An example of Superimposed Coding and a sample query evaluation is given below.
Information               0010 0100
Retrieval                 0100 0001
Block Signature           0110 0101

Sample queries

Matching query
  Keyword = Information   0010 0100
  Query descriptor        0010 0100
Block signature matches   (Actual Drop)

False Match query
  Keyword = Coding        0010 0001
  Query descriptor        0010 0001
Block signature matches   (False Drop)
  but keyword does not

Non-matching query
  Keyword = Information   0010 0100
  Keyword = Science       0000 0110
  Query descriptor        0010 0110
Block signature does not match


2.2 Applications of signatures

Signatures are applied mainly in database access methods useful for text retrieval such as Full text scanning, Inversion inversion /in·ver·sion/ (in-ver´zhun)
1. a turning inward, inside out, or other reversal of the normal relation of a part.

2. a term used by Freud for homosexuality.

3.
, Clustering, Multi attribute retrieval methods like hashing Creating hash totals or hash tables. See hash total and hash table.

hashing - hash coding
 and signature files. Such applications are discussed in [3,5,21]. Here the documents are stored sequentially in the "Text File".

Signatures which are abstractions of the documents are stored in the "Signature File". The latter serves as a filter on retrieval. It helps in discarding a large number of non-qualifying documents. Signatures have been applied in areas rich in text documents like telephone directory [1], office systems [3,4,6], Optical and Magnetic disk access [8], Data base Management system and Library automation.

Other applications include Access method for documents Indexing method for large text file [17, 18]. Access method for formatted data [l]. To speed up searching in editor [7]. To compress a vocabulary [7]. For a spelling checking program [7]. In differential file [7].

2.3 Physical representations of signature files

This section discusses briefly the various techniques used to represent signature files. We follow the lead of [30] for figures in this section.

2.3.1 Sequential Signature File (SSF SSF Scalable Simulation Framework
SSF Single Stock Futures
SSF Service Switching Function
SSF Small Form Factor
SSF Svenska Simförbundet (Swedish Swimming Association)
SSF Space Station Freedom
SSF Society of St.
)

The signatures are sequentially stored in a file called SSF as in Fig. 1. In single level signature methods every signature must be accessed and tested. Since signatures are abstractions of original data with smaller size, the method is faster than sequential scan See progressive scan.  of objects themselves. This method is easy to implement and requires low storage space and low update cost. The disadvantage is that more the number of objects exist, the more is the time spent on scanning signature file [8,31]. Therefore it is generally slow in retrieval. To support faster access multilevel signature methods are suggested.

[FIGURE 1 OMITTED]

2.3.2 Bit-Sliced Signature File (BSSF BSSF Bromeliad Society of South Florida
BSSF Bone-Shaped Short Fiber
)

BSSF stores signatures in a column-wise manner as illustrated in Fig. 2. Thus F bit-slice files one for each bit position of the set signatures are used. In retrieval only a part of the F bit-slice files have to be scanned and hence the search cost is lower than SSF. However update cost is higher. This is because a new signature insertion requires about F disk accesses one for each Bit-slice file [8,31].

[FIGURE 2 OMITTED]

2.3.3 Compressed Bit-Sliced Signature File (CBSSF)

By choosing a proper hashing function for signature extraction the number of 1s is forced to be one. Here, the signature length should be increased to maintain the false drop probability at minimum. This creates a sparse matrix In the mathematical subfield of numerical analysis a sparse matrix is a matrix populated primarily with zeros.

Conceptually, sparsity corresponds to systems which are loosely coupled.
 which is easy to compress [8, 23]. A simple way to compress is to replace each 1 with its corresponding physical address.

In Fig. 3 the hash table A lookup table that is designed to efficiently store non-contiguous keys (account numbers, part numbers, etc.) that may have wide gaps in their alphabetic and numeric sequences.

Hash tables are created by using a hashing function (algorithm) to hash the keys into hash buckets.
 has a list of pointers pointing to the heads of linked list [8].

For example assume that the word Text has its first bit set to 1 and it appears at the 50th byte of text file then searching the first bucket list, we find the position of the word Text. Although this approach gives some space saving, the number of false drops will definitely be increased due to sparse signature files.

[FIGURE 3 OMITTED]

2.3.4 S-Tree

S-Tree [24] is a B+ tree like structure [11,22] with leaf nodes containing a set of signatures with their Object Identifiers (OIDs). The internal nodes are formed by superimposing the lower level nodes as shown in Fig. 4. For example to retrieve a query signature Sq = 11000000, we search S-ree top down. From root node v1 we compare and move to v6. In the next level both v7 and v8 match and finally we end up with signatures o11, o12 in v7 and o13 in v8.

The advantage is simple tree searching way of obtaining signatures rather than searching the whole signature file.

[FIGURE 4 OMITTED]

The disadvantage is that due to superimposing, internal nodes in the upper level tend to have more weight which ultimately decreases selectivity selectivity /se·lec·tiv·i·ty/ (se-lek-tiv´i-te) in pharmacology, the degree to which a dose of a drug produces the desired effect in relation to adverse effects.

selectivity

1.
. The S-Tree has been further improved in [13, 14]. In [13] a number of new split methods namely Linear split, Quadratic quadratic, mathematical expression of the second degree in one or more unknowns (see polynomial). The general quadratic in one unknown has the form ax2+bx+c, where a, b, and c are constants and x is the variable.  split, Cubic split and hierarchical clustering for S-tree is proposed to improve query response time. In [14] a new hybrid scheme combining linear hashing, S-tree and parametric weighted filter is used to evaluate subset-superset queries.

2.3.5 Multilevel Signature file

The structure is similar to S-Tree. However a signature at non-leaf node is formed by superimposed coding from all text blocks indexed by the sub-tree of which the signature is the root. Fig. 5 shows multilevel signature file for the set of signature values depicted in 4 Fig 4. Though this method improves selectivity in an n internal node, it requires more space. An improved e method for multilevel signature file is discussed in [25].

2.3.6 Signature Graph

The signature file is organized as a trie like structure [28,30]. However, the path visited in the graph to find a signature that matches a given query signature corresponds to a signature identifier which is not a continuous piece of bits, differentiating the signature graph from trie.

[FIGURE 5 OMITTED]

Though signatures are represented compactly, the search path length is not same for all queries. In other words Adv. 1. in other words - otherwise stated; "in other words, we are broke"
put differently
 the graph is not balanced. In worst case it degrades to a signature file. Fig. 6. shows the signature file and the corresponding signature graph.

[FIGURE 6 OMITTED]

2.3.7 Signature Tree

The signature tree is a binary tree A data structure in which each node contains one parent and no more than two children.



Binary Tree _title>

binary tree - (btree) A tree in which each node has at most two successors or child nodes.
 like structure with nodes representing the bit positions and left and right sub tree followed for binary values The following table shows the decimal value of a binary number when all bits in the binary number are 1. Just as the largest number in a group of decimal digits is all 9's, the largest number in a group of binary digits is all 1's.  0 and 1 respectively. Each signature is identified from the root by checking the bit positions dictated by the nodes. Nevertheless for a query signature the tree is searched top to bottom according to according to
prep.
1. As stated or indicated by; on the authority of: according to historians.

2. In keeping with: according to instructions.

3.
 the bit positions dictated by the nodes rather than the Is in query signature. Also, for a match with bit 1, searching follows the right sub tree and for 0 at a node both left and right sub trees are followed.

That is for a balanced signature tree more than one path is traversed. Fig. 7. indicates a signature file and its corresponding signature tree. The thick lines in Fig 7(b) indicates the signature identifier that corresponds to $3. In the following section we discuss a quite different method called SD-tree which considers only the positions of 1 s in a given signature and decluster them over a set of leaf nodes so that query response time is improved.

[FIGURE 7 OMITTED]

3 The Structure of SD-tree

In this section we describe the structure of SD tree. There are three types of nodes:

* Internal nodes

* Leaf nodes

* Signature nodes

The internal nodes and leaf nodes are somewhat similar to the internal nodes and leaf nodes of B+ trees respectively. The internal nodes form the upper tree and leaf nodes at last but one level. The signature nodes are at the bottom level of the SD-tree. We will now explain the structure of the nodes in detail. To make discussion simple, we assume the tree order as 3 for a signature file with 8 block signatures of length 12.

3.1 Structure of Internal node

An internal node of SD-tree is illustrated in Fig. 8. Like B+ tree internal node pointers and keys alternate each other. For a tree of order 3 the internal node has two keys K1 and K2 and three pointers P1, P2, P3. These pointers are tree pointers pointing to the nodes at the lower level.

While searching, the left tree pointer is followed for values less than or equal to the node value, else right pointer is followed for values greater than the node value as in B+ tree.

[FIGURE 8 OMITTED]

3.2 Structure of a leaf node Noun 1. leaf node - (botany) the small swelling that is the part of a plant stem from which one or more leaves emerge
node

phytology, botany - the branch of biology that studies plants
 

The leaf nodes appear in the last but one level of the SD tree. Like B+ tree all the key values appear in ascending ascending /as·cend·ing/ (ah-send´ing) having an upward course.

ascending

progressing to higher levels, usually used in reference to the nervous system.
 order of their values in the leaf nodes and are connected to promote sequential search A search for data that compares each item in a list or each record in a file, one after the other. Contrast with direct search and indexed search. . But unlike B+ tree in SD-tree each value is followed by a signature node instead of data pointer. This is depicted in Fig 2. Pointers P1 and P2 point to the corresponding signature nodes for K1, K2; P3 is the sequential pointer to next leaf node and P4 is the backward pointer from next leaf node.

[FIGURE 9 OMITTED]

3.3 Structure of signature node

The structure of a signature node is shown in Fig. 10. The signature node for Ki has [2.sup.i-1] binary combinations denoting the possible prefixes. When a signature Su with 1 in the ith position is to be inserted the intermediate prefix The beginning or to add to the beginning. To prefix a header onto a packet means to place the header characters in front of the packet. "To prefix" at the beginning is the opposite of "to append" characters at the end. See prepend.

1.
 formed (explained in section 4) is compared with the binary combinations in the signature node at Ki and u is inserted in the list.

3.4 Overall structure of SD-tree

Consider the partially filled SD-tree shown in Fig 11(a) for discussion. The tree has been constructed for signature length F = 12. It is obvious from Fig. 11 (a) tree of order 3 has height 2. Consider the signature file in Fig 11 (b). To insert signature S1 with first occurrence of 1 at position 2, access the leaf node with value 2, follow its signature node and write the signature value as 1 (for S1) with the prefix 0 (for bit 1). In the same way $2 is inserted in signature node 1 with no prefix and signature node 3 with prefix 10 (for bits 1 and 2). S4 will be inserted at signature node 1 with no prefix and signature node 2 with prefix 1.

While inserting a signature there are two possible options to move to next 1s position in the leaf node. One is via the sequential pointer between leaf nodes and the other is the top-down traversal of tree from root. To ensure optimal search path the threshold (Th) is fixed at h+1 (h being the tree height plus one more level for accessing signature nodes). As long as the number of sequential pointers traversed in leaf nodes is within the specified Th value, follow the sequential pointers.

This is calculated by finding the difference (d) between two consecutive 1s in a signature divided by the number of entries per leaf node. This is given by the condition

d / p-1 [less than or equal to] Th , where p is the order of the tree.

The division here is integer integer: see number; number theory  division. When the above condition is not true, a new tree access is initiated from root. Similarly to promote optimal query processing the query signature value Sq is taken in decimal form. The occurrence of the last one (in the least significant position) is found by performing D Mod 2 operation. The remaining binary prefix In computing, binary prefixes can be used to quantify large numbers where powers of two are more useful than powers of ten (such as computer memory sizes). Each successive prefix is multiplied by 1024 (210) rather than the 1000 (103  is formed in the process of decimal to-binary conversion of D. The procedures for tree maintenance are explained in the following section.

[FIGURE 11 OMITTED]

To minimize the number of false drops the database size is selected according to [4, 27, 30] as

F [ln.sub.2] = mD ... (1)

where F is the signature length, m the number of set bits and D the average data block size.

4 Tree Search and Updates

This section lists the algorithms for signature insert, delete and search operations on SD-tree. A global flag F is set to 0 indicating the search path from the root of the tree by default. In the procedure after the first 1's insertion, depending on the d value F may be set to 1.

4.1 Insertion

The algorithm for signature insertion is outlined in this section. The call to procedure New(node) returns a new link in the signature node.
Insert ([S.sub.u])

Input : The signature to insert [S.sub.u];
    1. Let [i.sub.1], [i.sub.2], ..... [i.sub.n] be the positions
       of 1 in [S.sub.u];
       F[left arrow] 0; Th = h+1 ; //for tree of order p
    2. Move [i.sub.2] to [i.sub.n] to queue Q. B = NULL;
    3. If([i.sub.1] = 1) then begin
    Access leaf node [i.sub.1]; //from root
    New(node);
    insert u;
    B = strcat ( B, ' 1 ');//to denote bit 1 position
     end
      else begin
        for k = 1 to [i.sub.1] - 1 do
        B = strcat ( B, '0');
        Access leaf node [i.sub.i]; //from root
        New(node); write (B);
        insert u ;
        B = strcat ( B, ' 1 ');
       end
        f = [i.sub.1];//store the current bit position
        F [right arrow] 1 ;//enable sequential search in leaf
                                                         nodes

    4. While Q not empty do
       begin
         read x from Q;
         if(x = f+1) then
         begin
           Access leaf node x; //via leaf node
                                       pointers
           If (not(B)) then New(node);//create
                                          node

       Write(B);
       Write u @ prefix B;
        end
       else begin
              d=x-f,
              If(d/(p-1)) > Th then F [right arrow] 0;
              for k = f+1 to x-1 do
                  B = strcat ( B, '0');
              Access leaf node x;
              If (not(B)) then New(node);
          Write(B);
           Write u @ prefix B
             end
           B = strcat (B, '1 ');
           f = x;
        end. //until queue is empty


4.2 Searching

The following algorithm outlines the steps to search for signatures matching a given query signature Sq. In the procedure F [left arrow (character) left arrow - The graphic which the 1963 version of ASCII had in place of the underscore character, ASCII 95. ] 0 always and the algorithm lands up directly in the signature node corresponding to last 1 from root. The match here is the exact match and the optimal signature length selected in equation (1) minimizes the false drops. Search([S.sub.q])

Input : The (query) signature to search as decimal D.

Output : The list of signatures matching the given signature.

1 Repeat i = D mod 2; D = D div 2; until i = 1

2 Let n be the corresponding bit position of i.

3 Let B = toBinary(D); //convert D to binary

4 Access leaf node n //from root node

5 Search for B.

6 If Found() then read and output the list of signatures.

4.3 Deletion

The algorithm to delete a signature from SD-tree is described below.

Delete ([S.sub.u])

Input : [S.sub.u], the signature to delete.

1. Let [i.sub.1], [i.sub.2],.....[i.sub.n] be the positions of 1 in [S.sub.u].

2. For each [i.sub.k] (l [less than or equal to] k less than or equal to] n) form prefix B as in Insert ([S.sub.u).

3. Access the leaf node and follow the signature node;

4. Access prefix B and search for u.

5. If present, delete it.

6. Repeat steps (2) through (5) for all [i.sub.k]s.

5 A sample validation model

This section discusses the evaluation of sample Object-Oriented queries on a hypothetical object base. Fig 12 shows the class diagrams in UML (Unified Modeling Language) An object-oriented analysis and design language from the Object Management Group (OMG). Many design methodologies for describing object-oriented systems were developed in the late 1980s.  notation [15,19]. The classes and their relationships are listed in Table 1.

The sample hashing outputs of attributes values are listed in Table 2.

The attributes' hash coded values are superimposed to produce object's signature. The set of all object signatures of a class correspond to the signature file. The signature files created for various attribute combinations of objects are listed in Table 3.

[TABLE 3 OMITTED]

6 Experimental results

To validate the proposed structure we implement SD-tree in Java and for every test run the tree is constructed statically before signature insertion. The parameters considered in the experiments' data sets are Signature length (F), Signature weight (m) and signature weight distribution (swd).

The experiments were carried out in a standalone system with Intel Pentium IV See Pentium 4.  processor. The main memory size is 512 MB and the hard disk capacity is 80 GB.

6.1 Signature tree Versus SD-tree

In this section the parameters which are generally considered in the analysis of indexing structures like time and space complexities are reported. We compare the results of Y. Chen's Signature tree [30] with that of the SD-tree. The observed results are listed in Table 5.

6.2 Time Complexity

Like in other signature applications we use the response time as the performance measure [23]. The time complexity of the insert algorithm initially is the sum of time taken to construct the B+ tree of order p and the time taken for inserting. Here B+ tree is constructed with values 1,2, ....., F where F is the length of the signature for a given dataset. Hence compared to the use of B+ tree as index structure for large datasets the value F is small which reduces the time taken for SD-tree construction considerably. In algorithm 4.1 the time complexity for insertion is bounded by O(nm) where n is the number of signatures in the file and m is the number of 1s in the given signature as against the O(nF) in the signature tree insertion[30], n being the number of signatures in the file and F is the full length of signature including O's and 1's. Since deletion follows similar steps as insertion the time complexity is same for both. Another desirable characteristic of SD-tree is that for higher F values, by varying p, the value of h, height of the tree can be kept small to promote faster search. It is observed that the

In Table 5,

n--Number of signatures in signature file

F--Length of signature

m--Number of set bits

p--Order of SD-tree

[lambda]--Number of path traversed in query searching

a--Average no. of signatures / signature node

k--log 2 n

f--Average no. of prefix values / signature node

Similarly search time is the sum of time taken to access the leaf node (T1) and signature node search time (Tsi). This is given by

Ts = T1 + Tsi

Here, T1 is constant for all leaf nodes for a dynamic balanced structure like SD--tree and Tsi is directly dependent on the i value and the number of signatures inserted in the signature node. In the worst case the search time is bounded by O(T1 + [2.sub.i-1]).

To analyze the query response time the signature weight distribution was fixed as 100% , 70%, 50% and 30% for signature lengths 10 and 30. The values are plotted in Fig. 13. through Fig. 16. The weight of the signature was biased in upper byte(U), lower byte(L) or uniformly distributed(M) and time values noted.

For 100% swd the insertion and search time is a constant of the signature weight bias. This is depicted in Fig. 13. In the same way the swd was fixed at 70, 50 and 30 and the observed values are plotted in Fig. 14, Fig. 15 and Fig. 16 respectively. The query response time between two consecutive queries is minimized by following the backward pointers in leaf nodes when the following condition is true. That is,

[d.sub.i,j]/p-1 [less than or equal to] Th

where [d.sub.i,j] is the difference in Sqi's last 1's position and Sqj's first 1's position.

All the graphs show that the time taken for signature insertion grows linearly with the values of swd, F and the weight bias. Insertion time The term insertion time is used to describe the length of time which is required to rearrange a subcritical mass of fissile material into a critical mass. This is one of the three main requirements in a nuclear weapon design to create a working fission atomic bomb.  increases in upper nodes due to the complexity of the circuit in creating prefixes.

SD-tree maintenance and space overhead

SD-tree maintenance is quite simple that the tree is not subject to extensive node split or merge. This is because the insertions and deletions do not affect the node values or the height of the tree. Operations are reflected only in the signature node. In the experiments the binary prefix pattern nodes are created dynamically. The space consumption for insertion of a signature number at a signature node depends on the binary prefix length (1), the pointer size (p) and the space for writing the signature number(n). The prefix is created anew a·new  
adv.
1. Once more; again.

2. In a new and different way, form, or manner.



[Middle English : a, of (from Old English of; see of) + new
 each time if it does not exist. Hence, the space complexity to insert with prefix for a signature of weight (w) is given by O[w((1+2p)+(n+p))]. Similarly the space complexity of a signature for which prefix already exists is bounded by O[w(n+p)].

Fig. 17 shows the space overhead of SD-tree. The tree is created statically. Signature nodes are created dynamically and are of fixed size. It is clear from the graph that the space consumption increases linearly with F and w. It is obvious that for all combinations of values the query search time for SD-tree is lesser than signature insertion time.

For swd of 70%, 50% and 30% the signature weight was biased in lower byte, upper byte or uniformly distributed and values noted. All the outputs clearly indicate that the time taken for signature insertion and query response is slightly higher in upper levels.

[FIGURE 13 OMITTED]

[FIGURE 14 OMITTED]

[FIGURE 15 OMITTED]

[FIGURE 16 OMITTED]

[FIGURE 17 OMITTED]

7 Conclusion and research directions

In this paper we presented a novel way to represent signatures in a B+ tree like structure called SD-tree and analyzed the performance for query response time.

By varying the signature length and distribution of Is in the signature the query response time was noted and results plotted. It is clear from the graphs that considerable search time is saved.

The space overhead in SD-tree may be higher due to the presence of binary prefixes in higher order signature nodes, but the flexibility provided by the SD-tree outweighs all besides simple maintenance and faster query retrieval time.

The work is proposed to extend in the following directions. The synthetic data sets are to be replaced with, run and verified on a real Object Oriented Data Base system. Another direction is when the signature weight is more than 50%, use 0s so that number of signature nodes accessed for insertion and search is optimal. Also the structure can be modified to support point and range queries in Object Oriented Data Base system.

Received: April 14, 2008

References

[1] Charles S. Roberts Charles S. Roberts is a wargame designer. He is renowned as "The Father of Board Wargaming", having created the first modern wargame (a boardgame) in 1952, and the first wargaming company in 1954. , (1979) Partial-Match Retrieval via the Method of Superimposed Codes, Proc. of IEEE (Institute of Electrical and Electronics Engineers, New York, www.ieee.org) A membership organization that includes engineers, scientists and students in electronics and allied fields. , Vol. 67, No. 12, pp 1624 1642.

[2] Chris Faloutsos, Stavros Christodoulakis, (1984) Signature Files." An Access Method Jor Documents and Its' Analytical Performance Evaluation Performance evaluation

The assessment of a manager's results, which involves, first, determining whether the money manager added value by outperforming the established benchmark (performance measurement) and, second, determining how the money manager achieved the calculated return
, ACM (Association for Computing Machinery, New York, www.acm.org) A membership organization founded in 1947 dedicated to advancing the arts and sciences of information processing. In addition to awards and publications, ACM also maintains special interest groups (SIGs) in the computer field.  Trans on Office Information Systems, Vol. 2, No. 4, pp 267-288.

[3] Christos Faloutsos, (1985) Access Methods for Text", ACM Computing surveys, Vol. 17, No. 1, pp 49-74.

[4] Chris Faloutsos, (1985) Signature files: Design and performance comparison of some signature extraction methods" Proc. of ACM SIGMOD SIGMOD Special Interest Group On Management of Data , pp 63-82.

[5] Christos Faloutsos, Stavros Christodoulakis, (1985) Design of a Signature File Method that Accounts for Non-Uniform Occurrence and Query Frequencies, Proc. of VLDB (Very Large DataBase) An extremely large database. What constitutes a VLDB is debatable, but databases of 100GB or more are generally considered the starting point. , pp 165-170.

[6] Christos Faloutsos, Stavros Christodoulakis, (1987) Description and Performance Analysis of Signature File Methods for Office Filing, ACM Trans on Office Information Systems, Vol. 5, No. 3, pp 237-257.

[7] Christos Faloutsos, Stavros Christodoulakis, (1987) Optimal Signature Extraction and Information Loss, ACM Trans. On Database Systems, Vol. 12, No. 3, pp 395-428.

[8] Christos Faloutsos, Raphael Chan, (1988) Fast Text Access Methods for Optical and Large Magnetic Disks: Designs and Performance Comparison, Proc. of VLDB, pp 280-293.

[9] Chris Faloutsos, Raymond Lee, Catherine Plaisant, Ben Shneiderman Ben Shneiderman (born August 21, 1947) is an American computer scientist. He provided fundamental research in the field of human–computer interaction, such as with his eight rules of design. , (1990)

[10] Incorporating String Search in a Hypertext System Noun 1. hypertext system - a database management system that allows strings of text (`objects') to be processed as a complex network of nodes that are linked together in an arbitrary way : User Interface and Signature File Design Issues', Hypermedia hypermedia: see hypertext.


The use of hyperlinks, regular text, graphics, audio and video to provide an interactive, multimedia presentation. All the various elements are linked, enabling the user to move from one to another.
, Vol. 2, No. 3, pp 183-200.

[11] D. Dervos, Y. Manolopoulous, P. Linardis, (1998) Comparison of Signature File Methods with Superimposed Coding, J. Information Processing information processing: see data processing.
information processing

Acquisition, recording, organization, retrieval, display, and dissemination of information. Today the term usually refers to computer-based operations.
, 6, pp 101-106.

[12] Douglas Comer, (1979) The Ubiquitos B-Tree, Computing Surveys, Vol. 11, No. 2, pp 121-137.

[13] Dik Lun Lee, Young Man Kim, Gaurav Patel, (1995) Efficient Signature File Methods" for Text Retrieval, IEEE TKDE TKDE Transactions on Knowledge and Data Engineering , Vol. 7, No. 3, pp 423 435.

[14] Eleni Tousidou, Alex Nanopoulos, Yannis Manolopoulos, (2000) Improved Methods for Signature-Tree Construction, The Computer Journal, Vol. 43, No. 4, pp 301-314.

[15] Eleni Yousidou, Panayiotis Bozanis, Yannis Manolopoulos, (2002) Signature-based structures for objects' with set-valued attributes, Information Systems, Vol. 27, No. 2, pp 93-121.

[16] Grady Booch Grady Booch (born February 271955) is a software designer, a software methodologist and a design pattern enthusiast. He is chief scientist of Rational Software (now a part of IBM) and a series editor for Benjamin/Cummings. , James Rambaugh, Ivar Jacobson Ivar Hjalmar Jacobson (born in Ystad, Sweden, on September 2, 1939) is a Swedish computer scientist.

He got his Master of Electrical Engineering at Chalmers Institute of Technology in Göteborg in 1962 and a Ph.D.
, (2003) The Unified Modeling Language User Guide, Pearson Education Pearson Education is an international publisher of textbooks and other educational material, such as multimedia learning tools. Pearson Education is part of Pearson PLC. It is headquartered in Upper Saddle River, New Jersey.  Pte Ltd PTE LTD Private Limited .

[17] Hwan-Seung Yong, Sukho Lee, Hyoung-Joo Kim, (1994) Applying Signatures for Forward Traversal Quet3, Processing in Object-Oriented Databases, Proc. 10th Intl. conf. Data Engg, pp 518-525.

[18] Justin Zobel, Alistair Moffat, Kotagiri Ramamohanarao, (1988) Inverted inverted

reverse in position, direction or order.


inverted L block
a pattern of local filtration anesthesia commonly used in laparotomy in the ox.
 Files" Versus Signature Files for Text Indexing, ACM Trans. On Database systems, Vol. 23, No. 4, pp 453-490.

[19] A. Kent, R. Sacks-Davis, K. Ramamohanarao, (1990) A Signature File Scheme Based on Multiple Organizations for Indexing Very Large Text Databases, J. Am. Soc. Information Science, 1990, Vol. 41, No. 7, pp 508-534.

[20] Martin Fowler
This article is about the information technology author/speaker. For the soap opera character, see Martin Fowler (EastEnders).


Martin Fowler
, Kendall Scott, (2003) UML Distilled, A brief guide to the standard Object Modeling Language, II edition, Pearson Education Pte Ltd.

[21] Paolo Ciaccia, Paolo tiberio, Pavel Zezula, (1996) Declustering of Key-Based Partitioned Signature Files, ACM Trans. On Database Systems, Vol. 21, No. 3, pp 295-338.

[22] Per-Ake Larson, (1984) A Method for Speeding up Text Retrieval, ACM SIGMIS, Database Winter, Vol. 15, No. 2, pp 19-23.

[23] Rudolf Bayer Rudolf Bayer has been Professor (emeritus) of Informatics at the Technical University of Munich since 1972. He is famous for inventing two data sorting structures: the B-tree with Edward M. McCreight, and later the UB-tree with Volker Markl. , Karl Unterauer, (1977) Prefix B-Trees, ACM Trans. On Database Systems, Vol. 2, No. 1, pp 11-26.

[24] 23. Seyit Kocberber, Fazli Can, (1999) Compressed Multi-Framed Signature Files: An Index Structure for Fast Information Retrieval information retrieval

Recovery of information, especially in a database stored in a computer. Two main approaches are matching words in the query against the database index (keyword searching) and traversing the database using hypertext or hypermedia links.
, Proc. ACM Symp. Applied Computing, pp 221-226.

[25] 24. Uwe Deppisch, (1986) S-Tree: A Dynamic Balanced Signature Index for Office Retrieval, Proc. ACM SIGIR SIGIR Special Interest Group on Information Retrieval (Association for Computing Machinery)
SIGIR Special Inspector General for Iraq Reconstruction
 conf, pp 77-87.

[26] Walter W. Chang, Hans J. Schek, (1989) A Signature Access Method for the Starburst Database System, Proc. of VLDB, 145-153.

[27] Wang-chien Lee, Dik L. Lee, (1992) Signature File Methods for Indexing Object-Oriented Database Systems, Proc. 2nd Intl. Comp. Sc. Conf, pp 616 622.

[28] Yangjun Chen, (2002) Signature Files and Signature Trees, Information Processing Letters, 82, pp 213 -221.

[29] Yangjun Chen, Yibin Chen, (2004) Signature File Hierarchies and Signature Graphs: a New Index Method for Object-Oriented Databases, Proc. of ACM Symp. on Applied Computing, pp 724-728.

[30] Yangjun Chen, (2005) On the Signature Trees and Balanced Signature Trees, Proc. of ICDE ICDE International Conference on Data Engineering
ICDE International Council for Open and Distance Education
ICDE International Council for Distance Education
, pp 742-753.

[31] Yangjun Chen, Yibin Chen, (2006) On the Signature Tree Construction and Analysis, IEEE TKDE, Vol. 18, No. 9, pp 1207-1224.

[32] Yoshiharu Ishiwaka, Hiroyuki Kitagawa, Nobuo Ohbo, (1993) Evaluation of Signature Files as Set Access Facilities in OODBs, Proc. of ACM SIGMOD, pp 247-256.

I. Elizabeth Shanthi

Dept of Computer Science

Avinashilingam University History
Avinashilingam University For Women was originally a part of University of Madras and in June 1987 this institution got its separate identity. It is now the largest institution in the country for imparting Home Science Education at all levels.
 for Women, Coimbatore, India.

E-mail: shanthianto@yahoo.com

R. Nadarajan

Dept of Mathematics and Computer Applications

PSG College of Technology PSG College of Technology, also called PSG Tech, is an certified institute located in Coimbatore in the state of Tamil Nadu, India. It was founded in 1951 by the Managing Trustee Sri G.R. Govindarajulu and the founder Principal Dr. G.R. Damodaran. Under the leadership of Dr. , Coimbatore, India.

E-mail: nadarajan_psg PSG,
n polysomnograph; polygraph performed during sleep. Physiological variables such as pulse, blood pressure, and respiration are monitored and charted.
@yahoo.co.in
Table 1: Class relationships

S. No   Class 1      Class 2      Relationship

   1.   University   Dept         Composition
   2.   University   Student      Aggregation
   3.   Student      Programme    Association
   4.   Dept         Instructor   Association
   5.   Student      Male         Generalization
   6.   Student      Female       Generalization
   7.   Programme    Subject      Association

Table 2: Sample attributes

Dept-name

Mathematics--1010 0000
Computerscience--0100 1000
Physics--0001 0010

Pgm-name

M.E--0001 0100
M.Sc (S.E)--1000 1000
M.Sc (Mat)--0100 1000
M.sc (Phy)--0011 0000

Inst-name

John--1000 0010
Adams--1000 1000
James--0000 1001
Janes--0110 0000

Stud-name

David--0100 0010
Elena--1001 0000
Maria--0100 0001
Peter--0010 0001
Grace--0000 1100
Antony--0000 0011

Male
Sex--0010000

Female
Sex-00001001

Table 4: Sample queries

1. List of students       Pgm-name =           Student      1000 1000
doing a given programme   M.Se(S.E)

2. Instructors handling   M.Sc(S.E) in         Instructor   0100 0001
a given subject           subjects

3. Dept(s) offering a     1. Comp. applns in   Programme    0010 1000
given subject             subjects

                          2. Pgm-name =        Dept
                          00010100,00110000

3. All instructors of     Dept-name =          Instructor   0100 1000
a given dept              Comp.sc

4. Female students        Pgm-name =           Female       1000 1000
attending a given         M.Se(S.E)
programme

1. List of students       Pgm-name =           1100 1010 (David)
doing a given programme   M.Se(S.E)            1001 1000 (Elena)
                                               1100 1001 (Maria)

2. Instructors handling   M.Sc(S.E) in         l110 1011 (John)
a given subject           subjects             0100 1001 (James)

3. Dept(s) offering a     1. Comp. applns in   1011 1101 (M.E)
given subject             subjects             O111 1000 (M.ScPhy)

                          2. Pgm-name =        1101 1100 (Comp.sc)
                          00010100,00110000    0011 0010
                                               (Phy)

3. All instructors of     Dept-name =          1110 1011 (John)
a given dept              Comp.sc              0100 1001 (James)

4. Female students        Pgm-name =           1001 1001 (Elena)
attending a given         M.Se(S.E)            1100 1001 (Maria)
programme

Figure 10: A typical signature node entry.

B1     Signatures having prefix B1
B2     Signatures having prefix B2
...    ...
Bn     Signatures having prefix Bn

Table 5: Signature tree Vs SD-tree

Parameter     Signature tree          SD-tree         Inference

Time          O(nF)                   O(nm)           m < F;
complexity                                            Faster
                                                      insertion

Tree height   O([log.sub.2] n)        O([log.sub.p]   P>2;
                                      (F/(p-1))       Shorter tree

Search cost   O([lambda].             O([log.sub.p]   F<n;
              [log.sub.2] n)          F+a)            Cost < Sig.
                                                      tree
Space         [nlog.sub.2] F + 2      O(F(a+f))
complexity    [summation] [2.sup.i]                   < Sig. tree
              (i+1)                                   when k>F
              i=0 to k

Figure 12: Sample Object schema.

Class : Subject
Sub-name

1. Software engg--0100 0001
2. Comp. Apples-- 0010 1000
3. Comp. Engg--0100 0100
4. Applied Maths--1000 0001
5. Calculus--0010 0100
6. Nuclear physics--0101 0000

Class : Male

Stud-name   Pgm-name     Sex      Object signature

1. David    M.Sc (S.E)   Male     1101 1010
2. Peter    M.E          Male     1011 0101
3. Antony   M.E          Male     1001 0111

Class : Female

Stud-name   Pgm-name     Sex      Obj signature

l. Elena    M.Sc (S.E)   Female   1001 1001
2. Maria    M.Sc (S.E)   Female   1100 1001
3. Grace    M.Sc (Phy)   Female   0011 1101

Class : Dept

Dept-name             Programs         Obj signature

1. Computer science   M.E,M.Sc (S.E)   1101 1100
2. Mathematics        M.Sc (Mat)       1110 1000
3. Physics            M.Sc(Phy)        0011 0010

Class : Programme

Pgm-name        Subjects          Obj. signature

1. M.E          Comp.apptns,      1011 1101
                Applied Maths

2. M.Sc (S.E)   Software engg,    1100 1101
                Comp. engg

3. M. Sc(Phy)   Comp. apples,     0111 1000
                Nuclear physics

Class : Instructor

Inst-
name       Dept-name     Subjects      Obj signature

l. John    Comp.sc       Software      1110 1011
                         engg,
                         Comp.
                         applns

2. Adams   Mathematics   Applied       1010 1101
                         Mathematic
                         s, calculus

3. James   Comp.sc       Software      0100 1001
                         engg

4. Janes   Physics       Nuclear       0111 0010
                         physics

Class : Student

Stud-name   Pgm-name    Obj signature

1. David    M.Sc(S.E)   1100 1010
2. Peter    M.E         0011 0101
3. Antony   M.E         0001 0111
4. Elena    M.Sc(S.E)   1001 1000
5. Maria    M.Sc(S.E)   1100 1001
6. Grace    M.Sc(Phy)   0011 1100
COPYRIGHT 2009 Slovenian Society Informatika
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2009 Gale, Cengage Learning. All rights reserved.

 Reader Opinion

Title:

Comment:



 

Article Details
Printer friendly Cite/link Email Feedback
Title Annotation:signature declustering
Author:Shanthi, I. Elizabeth; Nadarajan, R.
Publication:Informatica
Article Type:Report
Geographic Code:9INDI
Date:May 1, 2009
Words:5924
Previous Article:Improving part-of-speech tagging accuracy for Croatian by morphological analysis.
Next Article:Improving design pattern adoption with an ontology-based repository.
Topics:



Related Articles
TogetherSoft Corporation and Valtech Lead Nationwide Seminar Series on Software Development Strategies; Thought Leaders Peter Coad and Craig Larman...
Impact of key locality on database performance.
Ipedo Releases New Version of Enterprise Information Integration (EII) Platform.
C++ Design Patterns and Derivatives Pricing. (CD-ROM included).
Object-Oriented Oracle.
UML for developing knowledge management systems.
Data structures and algorithms for game developers. (CD-ROM included).
TwigINLAB: A decomposition-matching-merging approach to improving XML query processing.
Pro VB 2008 and the .NET 3.5 platform.
An XML-based infrastructure to enhance collaborative geographic visual analytics.

Terms of use | Copyright © 2012 Farlex, Inc. | Feedback | For webmasters | Submit articles