Printer Friendly

Locality preserving scheme of text databases representative in distributed information retrieval systems.

I. Introduction

Finding relevant information to a user request with higher probability in any global information retrieval system such as P2PIR [1, 2, 3, 4, 5], requires the databases representatives (vectors of index terms) which generated by different nodes, should be mapped to a same node, or to a set of nodes that have closer identifier(s) in a P2P identifier space. To achieve this goal, the mapping method should preserve the locality of the index terms vectors key space; i.e. relevant index terms' vectors should have closer keys in the target identifier space.

Generally, mapping is the criteria used to determine images (of elements in a given space or the domain) into another target space (or the range). The type of mapping is determined by the relation between two images of two related elements; i.e. if two related elements are mapped into two related images, then the mapping is locality preserving, otherwise it is a random mapping.

In this paper we proposed a locality preserving mapping scheme that maps similar vectors of terms to closer keys (or points in the identifier space). In order to achieve this objective, a variant of the Locality Sensitive Hash (LSH) functions will be used, such that groups of hash functions generated as approximate minwise independent permutations.

In our proposed scheme, a representative is converted (encoded) into a binary code of fixed length. This binary code is slotted into equal-length chops. The resulted chops form an ordered set (referred to as subsequence) on which the "approximate independent min-wise permutations" is applied. The result is a set of keys each corresponds to a group of hash functions. Empirical examinations show that the proposed LSH scheme gives closer keys (smaller Hamming distance) for vectors of terms that have similarity greater than some threshold.

Several approaches of locality preserving mapping that are used by different P2PIR proposals are introduced, for example:

Spectral Locality Preserving Mapping [6, 7], addresses the problem of mapping closer multi-dimensional data representative vectors into closer images in a one dimensional space (mainly the field of real numbers). Points in the multidimensional space are viewed as vertices of a graph (G), and there is an edge between two nodes iff (if and only if) there is a direct connection between them in the fully connected network of nodes.

The process of this mapping begins by constructing three matrices: (1) A diagonal matrix (D) with each entry on the diagonal represents the degree of the vertex, (2) the adjacency matrix (A) where A[(G).sub.ij] = 1 iff there is an edge between vertices (i) and (j), and (3) the Laplacian matrix (L), where L(G)=D(G)-A(G).D(G). Then find the second smallest eigenvalue and its corresponding eigenvector of the matrix (L). Finally, the eigenvector entries are ascended sorted to preserve their orders; the original index of the entries (after sorting) is the locality order of points in the target space.

This type of mapping involves extensive matrix computations, and all of the points (and the distance from all their neighbors) must be known in advance. This information is not always available in P2PIR systems.

Space Filling Curves (SFC) [8], or Fractal Mapping (Peano, Gray, and Hilbert Space Filling Curves), is a continuous mapping from d-dimensional to 1-dimensional space. The d-dimensional cube is first partitioned into [n.sup.d] equal sub-cubes (or fragments), where n is the partition length. An approximation to a space-filling curve is obtained by joining the centers of these sub-cubes by line segments such that each cell is joined to two adjacent cells. A problem arises because once a fractal starts visiting a fragment no other fragment could be visited until the current session is completely exhausted. Other problem is the boundary effect problem; i.e. points that are far from borders are favored [7]. This means that two d-dimensional points, each of which belongs to a different fragment (and each is close to the border), does not have the chance to map to closer images in the 1-dimensional space.

Locality Sensitive Hashing (LSH) [9], a family of hash functions (H) is locality preserving if a function (h) is randomly selected from H, then the probability of two objects to having equal images under (h), equals the similarity between those two objects. LSH has become popular in many fields, and it has been used for mapping in P2PIR for example in [1, 2, 3, 4, 5].

Our proposed model uses a locality preserve mapping to map similar vectors to closer keys (or points in the identifier space). More specifically, to map vectors of index terms (databases representatives) to the desired identifier space, that is owned by different nodes in the overlay network. In order to achieve such objective, we will use a variant of the Locality Sensitive Hash (LSH) functions and the approximate Minwise independent permutations.

The rest of the paper is structured as follows: Section 2 presents a background about LSH and Minwise independent permutations. We present the proposed scheme for Locality Preserving Mapping in Section 3. Section 4 introduces the implementation and evaluation over real-world data sets that showed efficiency. Finally, Section 5 concludes the paper.

2. Background

In this section, more details about LSH (section 2.1) and Minwise independent permutations (section 2.2) will be presented.

2.1 Locality Sensitive Hashing (LSH)

Locality sensitive hashing functions [1-5, 9, 10] (LSH) were defined as a locality preserving hash function family used in order to maximize the probability ([p.sub.1]) of collision of two closer points (laying within the same circle (B) of radius [r.sub.1]), and minimize the probability of collision to be less than a smaller value ([p.sub.2]) when the two points are not within any circle of radius [r.sub.2].

In our context, the term 'closer' means 'similar', so we need similar points to 'collide' with higher probability. A similarity measure (D) could be used to determine the closeness of points. D could be the cosine similarity, or the distance function. The circle (B) of similarity measure (D) is defined as: B(q, r) = {p : D(p, q) [greater than or equal to] r], or all points that have similarity [greater than or equal to] r with a point q. The existence of such a locality sensitive hash function family, as defined above, admitted by a similarity measure whose distance function: D = 1-sim(A,B), satisfies the triangle inequality: D(A, B) + D(B, C) [greater than or equal to] D(A, C), for the points A, B, and C [3].

The next definition of LSH is sufficient to meet the mapping objective needed for this study. The implication of the definition 2.1 is that it uses similarity as an indicator of locality preservation; this definition has been introduced by Charikar [3] and adopted by Gupta et al [5].

Definition 2.1: Locality Sensitive Hashing (LSH)

If A, B are two sets of values from domain D then a family of hash functions H is said to be locality preserving if for all h e H we have:

[Pr.sub.h[member of]H](h(A) = h(B)) = sim(A, B) (2.1)

where sim(A, B) is a measure of similarity between the sets A and B.

2.2 Minwise Independent Permutations

Hash function in 2.1 above could be replaced by min-wise independent permutations [4].

Definition 2.2: Min-wise independent permutations

Given an integer n, let the set of all possible permutations of [n] is S where [n] is the set of integers: [0,1, ..., n-1], a family of permutations F [subset or equal to] [S.sub.n], and for a subset X [subset or equal to] [n], if a permutation [pi] [member of] F is selected uniformly at random from F, and x [member of] X, and: Pr(min{[pi](X)} = [pi](x))=1l[absolute value of X], then the family of permutations F is said to be min-wise independent. The definition suggests that all the elements of X have the same chance to be the minimum under the permutation [pi][4].

In [5], the min-wise independent permutations of two range sets (Q, and R) are equal with probability equal to the Jaccard set similarity [4] between Q and R:

[Pr.sub.[pi]][[h.sub.[pi]](Q) = [h.sub.[pi]](R)] = [absolute value of Q [intersection] R]/[absolute value of Q [union] R] (2.2)

So, min-wise independent permutations hashing scheme is a particular instance of a locality sensitive hashing [3]. We can observe (from equation 2.2) that as the similarity between objects increased, the chance for them to have closer keys became more probable.

For a subset of the universe of interest; say S [subset or equal to] U, a number of hash functions (or permutations) are each applied on the elements of S, and each time the minimum among all images is recorded. In the case of choosing k functions; i.e. [h'.sub.1], [h'.sub.2], ..., [h.sub.k] from the family F uniformly at random, the set S is then represented as a vector <[h.sub.1](S), [h.sub.2](S), ..., [h.sub.k](S)>, two subsets [S.sub.1], [S.sub.2] [subset or equal to] U are said to be similar if they share more matched elements among their vectors [3]. It was found that the (exact) min-wise independent permutations are inefficient (time consuming), and so for the purpose of this research we will use an approximate min-wise independent permutations, such that the probability for any element to be the minimum permutation is approximated by a fraction:

[absolute value of Pr(min{[pi](X)} = [pi](x) - 1/[absolute value of X]] [less than or equal to] e/[absolute value of X] (2.3)

3. The Proposed Locality Preserving Mapping Scheme

In a P2PIR system, and in order to find relevant information to a user request with higher probability, similar database representatives that are generated by different nodes, should be mapped to the same node, or to a set of nodes having close identifiers in the P2P identifier space.

Formally a function f: C [right arrow] K, maps the set of database representatives C to the set of keys K, f is locality preserving if, for any x, y, z, e [member of] C, and a distance function D defined on both C and K, where D(x, y) < D(z, e), we must have D(f(x), fy)) < D(f(z), f(e)), if K is the set of real numbers then we must have [absolute value of f(x) - f(y)] < [absolute value of f(z) - f(e)].

Locality sensitive hashing functions (LSH) introduced by Indyk and Motwani [9] were used first to solve the nearest neighbour approximation, and improved in [4, 11], it became a popular tool [2, 5, 12, 13] to map nearest points in a space to closer images in the target space with high probability. They will be used in the current proposed model as a locality preserving mapping function, where the family of approximate min-wise independent permutations [14] are used for generating random hashing schemes.

3.1 Mapping Database Representative to the Key Space

Before using LSH, the database representatives must be preprocessed in such a way to convert its terms into a stream of bits (or binary encoded). In the next section (3.2), we will present a selected method for binary encoding. The next step is to convert the generated binary code into a sequence of integers; each belongs to the targeted key space (for example the space of 32 bits integers). A general example showing the steps of the conversion process is shown in Figure 3.1.

[FIGURE 3.1 OMITTED]

The mapping process receives a list of terms' texts, and the output is an integer between 0 and [2.sup.n], where n is the number of bits (n = 32 in the example above). Figure 3.2 summarizes key generation steps as a pseudo code.
Figure 3.2. Key Generation Procedure

Generate_LSH_Key(List_Of_Terms, Number_Of_Binary_Digits)
  Let BCode is the binary code that represents the List_Of_Terms
  For each term T of the List_Of_Terms do
    Let D = UH(T) // the digest of T after applying the universal
      hash function UH
    Let BCD = a string of binary digits that equivalent to D
    Append BCD to the right of BCode
  Next term
    Divide BCode into slots each has a length equals the Number_Of_
Binary_Digits
    Create subsequence by using shingling to overlap consecutive
      slots
  Let Key be the binary code of the final output
  For each subsequent of the created subsequence do
    For each hash group of LSH scheme do
        Apply each function in the current group as Min-wise
          permutation scheme
        Let Min is the minimum among all results of all functions
        Select a number of binary digits (t) from Min
        Append the selected binary digits to Key
     Next group
   Next slot
   Return Key
End.


In this study, we will adopt the scheme used by Gopta et al. [5], but we will adapt this scheme to map a set of index terms instead of a list of numbers. In this scheme, the terms of a representative is converted into a binary code of length (L) bits. Then this binary code is slotted into equal chops each of [2.sup.x] bits length. These chops are considered as the elements of the ordered set (subsequence) on which approximate independent min-wise permutations will be applied. More details can be seen in [15].

For each subsequent, a key of the same length ([2.sup.x]) is randomly generated where half of its bits are ones distributed randomly among all of its digits. Move each bit (of the subsequent within the representative binary code) which corresponds to a one bit in the random key to the next left most bit of the subsequent binary code, and move the bit (corresponds to a zero bit) to the left most bit of the right half of the subsequent binary code (in order). If the process ends her, then, the resulted permutation is an approximate min-wise independent permutation, it is more time efficient than applying the exact min-wise independent permutations [5].

[FIGURE 3.3 OMITTED]

Proceeding into the next step of the process, where more random binary keys (each of [2.sup.x] divided by [2.sup.(stepnumber)]) are applied on the result of the previous step, the total number of steps is equal to x. Figure 3.3 presents an example of permutations applied on a subsequent binary code of 8 bits.

3.2 Binary Encoding

Encoding is to convert the set of terms forming the database representative into a sequence of binary digits. Each index term is used as an input to a universal hash function. The generated digests of all of the terms form a series of integers. Terms in this context are used as a countable sequence of tokens used to view a database representative. The tokens are used to generate ordered sets (subsequences) each of width w tokens. The overlapped fraction of two consecutive subsequents is referred to as shingle width, and the process is referred to as shingling [16].

Our encoding method divides the range of the digest space into [2.sup.y] slots, each slot is a binary number of y digits which represents the code of a term. Combine the binary codes of all terms to generate the overall binary code of the database representative. Figure 3.4 presents an example of generating a subsequence.

For more explanation, we will present a simple example: let the digest length generated by a universal hash function (h) equals 6 bits, so the digest values range from 0 to 63. Now if we have the vector of terms ('computer', 'programming', 'teacher'), and we have the following digests for each term: h('computer') = 12, h('programming') = 53, and h('teacher') = 17, if we divided the space of the digest values into 4 slots, then the four slots are: (0 ... 15), (16 ... 31), (32 ... 47), (48 ... 63), so each of the slots has a binary index as: 00, 01, 10, 11 respectively. Now to encode the vector of terms above: 'computer' is in slot number '00', 'programming' in slot number '11', and teacher in slot number '01', and so the code representing the vector is: '001101'.

3.3 Applying the Min-wise Independent Permutations

A database representative is given a binary code which is generated as a subsequence, each of its elements (an element is referred to as a subsequent) is a sequence of [2.sup.x] bits. The first subsequent of each subsequence begins at bit number zero, and the location of the first bit of the nth subsequent of a subsequence is given by:

[y.sub.n] = [2.sup.x]/w x n n = 0, 1, ..., 1 (3.1)

where x = [log.sub.2](subsequent size -Number of bits of each subsequent in the subsequence-), n is the subsequent index, w is an even natural number which represents the shingle width, w represents the ratio of bits in a subsequent that are shared by its successor.

For example if w = 2, then a subsequent shares 50% of its bits with its successor, and l = (Binary code length)/[2.sup.x]. Figure 3.4 presents an example of a subsequence formation out from a binary code of a database representative.

The next step is to generate a family of hash functions, as a family of approximate min-wise independent permutations. A function [h.sub.i] belongs to a hash function group g. is applied to all elements of a subsequence set: S = {[s.sub.1], [s.sub.2], ..., [s.sub.n]}, and the final output of the function is the minimum value among all values of the hash function application over all the subsequence, that is:

[h.sub.i](S) = min{[h.sub.i]([s.sub.1]), [h.sub.i]([s.sub.2]), ..., [h.sub.i]([s.sub.n])} (3.2)

The output of applying all of the t functions in each group is the key representing the database representative; i.e. each database representative is represented by k different keys generated by the k hash function groups. For example two similar database representatives [E.sub.1], [E.sub.2] having the two keys' sets [key.sub.1] = {[g.sub.1]([E.sub.1]), ..., [g.sub.k]([E.sub.1])}, and [key.sub.2] = {[g.sub.1]([E.sub.2]), ..., [g.sub.k]([E.sub.2])} respectively, should share more elements among their corresponding sets, Charikar [3]. When using approximate (instead of the exact) min-wise independent permutations it could be found that key1 and key2 share no elements, instead having closer element values with some error, see equation 2.3.

3.4 Building a key

In order to generate a key from the results of all t permutations in a group, each hash function can have only one bit contribution to the overall key, and the results of the t hash functions are concatenated to produce a key of t bits. For example, in [2, 3, 5] a key is generated by concatenating "1", to the overall key, if the dot product between the data vector and a random vector is positive, otherwise "0" is concatenated.

In this study we propose several schemes (referred to as Bit selection methods) to generate a key out from the output of t permutations. All of the proposed schemes depend on choosing a number of bits. Then the selected chunks from all of the t permutations are concatenated to construct the final key. The number of bits concatenated by each permutation is given by equation 3.3.

[tau] = L/t, the final key = [[tau].sub.1][[tau].sub.1] ... [[tau].sub.t] (3.3)

L is the key length (in bits), and t is the number of permutations in each group. We will examine the following bit selection methods, and then the best method is used for the key generation, a bit selection method could be one of the following options:

1. [tau] Bits are selected randomly among L bits from all the (t) permutations' results.

2. A chunk of [tau] bits selected randomly among the L/t chunks of each permutation result.

3. A fixed chunk of [tau] bits selected from the result of each permutation, it could be: the first, the second, the third, ..., chunk from left to right.

4. Select [[tau].sub.i] chunk where i = permutation index within the group: is {0,1, ..., t} , select (for example) the first chunk of bits from the first permutation (named as [[tau].sub.1]), chunk [[tau].sub.2] from the result of the second permutation, ... etc.

4. Evaluation, Experiments and Results

To implement and evaluate the proposed key generation and LSH scheme, we used a family of permutations (H) that consists of 1000 randomly generated keys, such that for each experiment, k groups (each of t permutations) are randomly selected from H. The evaluation process should be applied on two similar collections of database representatives (D1 and D2), where a database representative in the first collection is known to be similar to its adjacent database representative in the other collection. If the resulted keys of each similar pair of database representatives are closer to each other, then the locality is achieved.

In order to construct two similar collections of database representatives, documents that form the representative of a cluster, is divided into two equal subsets: the first sample (D1) and the second sample (D2). The first document in a cluster representative is selected as a member of both D1 and D2, and the remaining documents are distributed equally; one to D1 and the next to D2. Then weights of terms of each representative are descend sorted and top-n% vectors [17] of terms representing each of D1 and D2 are recalculated, Figure 4.1 gives an example of this splitting process.

Bit selection methods were applied on both (D1 and D2) as described in section 3.4. Three types of experiments were implemented; each experiment is repeated using different values for the following parameters:

* First parameter: the selection method, each time experiments are applied on binary codes that generated using one of the following three slotting criteria:

1. No slotting of the digest space; i.e. each subsequent (32 bits) represents one term.

2. Digest space is divided into 16 slots; each index term is represented by 4 bits, and so each subsequent can represent (32/4 = 8) terms.

3. 256 slots; each term is represented by 8 bits, so each subsequent represents 4 terms.

Second parameter: number of permutation groups (k), or hash function groups.

* Third parameter: number of permutations in each group (t), or number of hash functions in a group.

* Fourth parameter: shingle width (w).

Hamming distance between two keys is used to evaluate the closeness between the generated keys. It is calculated for each pair of keys generated for two similar database representatives. Finally, the average Hamming distance is calculated.

--The first experiment: examines the effect of number of hash function groups, and the number of hash functions in each group, and for each slotting criteria, such that a fixed number of hash function groups are used, versus variable number of hash functions. The average Hamming distance is plotted in Figure 4.2. We can notice the locality preserving is enhanced when using more hash groups, as shown in Figure 4.2(a), where the locality preserving degrades as the number of hash functions increased. This was the case for the entire selected slotting criteria, as shown in Figure 4.2(b).

The same observation (regarding the effect of hash group number and hash functions within a group) could be concluded by formula 4.1, which was used in [5], to calculate the probability ([beta]) of two similar objects (equivalence classes) to have a same keys.

[beta] = 1 - [(1 - [p.sup.t]).sup.k] (4.1)

Where p is the similarity between two representatives, k: the number of groups, and t: is the number of hash functions in a group. p [member of](0,1) implies that [p.sup.t] [less than or equal to] p for t > 1, and (1 - [p.sup.t]) [member of] (0, 1), consequently, as k increases [(1 - [p.sup.t]).sup.k] decreases and [beta] increased. But using larger number of hash function groups needs more calculations that increases the time cost. To approximate the number of groups that maximize p, it could be expanded by using the Taylor expansion of the binomial equation:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (4.2)

Taking the first three terms of the series of equation 4.2, and differentiate the approximated p with respect to k, so [beta] will have maxima when:

K = [(1/[p.sup.t]) + 1/2] (4.3)

According to equation 4.3, two representatives d1i [member of] D1, and d2i [member of] D2 with p = 0.5, could have same (or closer) keys when k = l7, and t = 4 hash functions for each group.

--The second experiment. examines the bit selection method where the four methods are implemented using three parameters. (no slotting, 16, and 256 slots), the results presented in Figure 4.3 show that the 'random bit selection' is the worst, and the fixed chunk selection method produces the minimum average Hamming distance for all parameters. But this result is fallacious, since the fixed chunk selection method generates closer keys for both similar and dissimilar database representatives, for example, in the case where the selected chunk is the left most chunk, the probability of having zero values for the most significant bits of a key becomes higher, since a key is the minimum of a hash function value among all generated integers. Consequently the fixed chunk bit-selection method is not significant for mapping similar database representatives to closer keys in the target space. Empirically, the best method is the random selected chunk.

--The third experiment. examines the effect of the shingle (w) on locality preservation. Using greater value of w is shown to enhance the locality of two similar samples, since the average hamming distance between the keys, which generated using more shingles, is smaller; see Figure 4.4. The enhancement of keys' locality preservation is justified as successive subsequent overlapping increases as the number of shingles increases.

[FIGURE 4.2 OMITTED]

[FIGURE 4.3 OMITTED]

[FIGURE 4.4 OMITTED]

5. Conclusion

In this paper a locality preserving mapping scheme is proposed. In such scheme a database representative vector is encoded into a binary number, split as a subsequence, and a set of hash function groups are applied on each subsequent. The key is generated by selecting a chunk of bits from the result of each function in a group. The proposed approach depends on using Locality Sensitive Hash functions (LSH), and approximate minwise independent permutations to achieve such task. Several methods (of bit selection) are tested, and empirically, the best method is the random chunk selection. Each database representative could have a number of keys equal to the number of hash function groups. The locality preservation could be enhanced using a larger number of hash groups, and wider shingle width, but this increases the time cost.

References

[1] Bawa, M., Condie, T., Ganesan, P. (2005). LSH forest. self tuning indexes for similarity search, in Proceedings of the 14th international conference on World Wide Web (WWW '05), New York, NY, USA, p. 651-660.

[2] Bhattacharya, I., Kashyap, S. R., Parthasarathy, S. (2005). Similarity Searching in Peer-to-Peer Databases, In. The 25th IEEE International Conference on Distributed Computing Systems (ICDCS05), Columbus, OH, p. 329-338.

[3] Charikar, M. S.(2002). Similarity estimation techniques from rounding algorithms, In. the 34th ACM Symposium on Theory of Computing, p. 380-388.

[4] Datar, M., Immorlica, N., Indyk, P., Mirrokni, V. S. (2004). Locality-Sensitive Hashing Scheme Based on p-Stable Distributions, In. The twentieth annual symposium on Computational geometry (SCG'04), Brooklyn, New York, USA. 253-262.

[5] Gupta, A., Agrawal, D., Abbadi, A. E. (2005). Approximate Range Selection Queries in Peer-to-Peer Systems, in the CIDR Conference, p. 254-273.

[6] Cai, D., He, X., Han, J. (2005). Document Clustering Using Locality Preserving Indexing, IEEE Transactions on Knowledge and Data Engineering, 17 (12) December 2005, p. 1624-1637.

[7] Mokbel, M. F., Aref, W. G., and Grama, A. (2003). Spectral LPM. An Optimal Locality-Preserving Mapping using the Spectral (not Fractal) Order, In. the 19th International Conference on Data Engineering (ICDE'03) p. 699-701.

[8] Sagan, H. (1994). Space-Filling Curves, Berlin, Springer-Verlag.

[9] Indyk, P., Motwani, R. (1998). Approximate Nearest Neighbors. towards Removing the Curse of Dimensionality, in the Symp. Theory of Computing, p. 604-613.

[10] Indyk, P. (2003). Nearest neighbors in High-Dimensional Spaces, In. CRC Handbook of Discrete and Computational Geometry. CRC.

[11] Motwani, R., Naor, A., Panigrahy, R. (2006). Lower bounds on Locality Sensitive Hashing, In: the ACM twenty-second annual symposium on Computational geometry SCG'06, Sedona, Arizona, USA., p. 154-157.

[12] Lv, Q., Josephson, W., Wang, Z., Charikar, M., Li, K.

(2006). Efficient Filtering with Sketches in the Ferret Toolkit, In: the 8th ACM international workshop on Multimedia information retrieval (MIR'06), Santa Barbara, California, USA, p. 279-288.

[13] Qamra, A., Meng, Y., Chang, E. Y. (2005). Enhanced Perceptual Distance Functions and Indexing for Image Replica Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27 (3) p. 379-391,

[14] Broder, A. Z., Charikar, M., Frieze, A. M., Mitzenmacher, M. (2000). Min-Wise Independent Permutations, Journal of Computer and System Sciences, 60. 630-699.

[15] Hassan, M., Hasan, Y. (2010). Locality Preserving Scheme of Text Databases Representative in Distributed Information Retrieval Systems, In: The second International Conference of Networked Digital Technologies NDT, Prague, Czech republic, July 2010, p162.

[16] Broder, A. Z. (1997). On the resemblance and containment of documents, In: Proceedings of Compression and Complexity of Sequences, p. 21.

[17] Hassan, M., Hasan, Y. (2011). Efficient Approach for Building Hierarchical Cluster Representative, International Journal of Computer Science and Network Security 11 (1) p. 178-184.

Mohammad Hassan, Yaser A. Al-Lahham

Zarqa University

Jordan

mohdzita@zu.edu.jo

Authors Biographies

Mohammad Hassan received the B.S from Yarmouk University in Jordan in 1987, the M.S. degree from Univ.

Of Jordan in 1996, and the PhD in Computer Information Systems from Bradford University (UK) in 2003. He is working as an assistant professor in the department of computer science at Zarqa University in Jordan. His research interest includes information retrieval systems and database systems.

Yaser A. Al-Lahham received the B.S degree from University of Jordan in 1985, the M.S. degree from Arab Academy (Jordan) in 2004, and the PhD in Computer Information Systems from Bradford University (UK) in 2009. He is working as an assistant professor in the Department of Computer Science at Zarqa University in Jordan. His research interest includes information retrieval systems, database systems, and data mining.
Figure 3.4. Example of subsequence set generation

Representative binary code:

0100111001010101000010001 ...

W = 2, and item size = 8 = [2.sup.3], x = 3

Item #   index of the first bit             item value
                                            (subsequent)

n=0,     [y.sub.0]=0,                      [i.sub.0]=01001110
n=l,     [y.sub.1]=([2.sup.3]/2) x l=4.    [i.sub.1]=11100101
n=2.     [y.sub.2]=([2.sup.3]/2) x 2=8.    [i.sub.2]=10101010
n=3,     [y.sub.3]=([2.sup.3]/2} x 3=12.   [i.sub.3]=10100001
...

Subsequence set = {01001110, 11100101, 10101010 ...}
COPYRIGHT 2011 Digital Information Research Foundation
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2011 Gale, Cengage Learning. All rights reserved.

Article Details
Printer friendly Cite/link Email Feedback
Author:Hassan, Mohammad; Lahham, Yaser A. Al-
Publication:Journal of Digital Information Management
Article Type:Report
Date:Oct 1, 2011
Words:5248
Previous Article:Evaluation of topic identification methods on Arabic corpora.
Next Article:Querying and ranking XML documents based on data synopses.
Topics:

Terms of use | Privacy policy | Copyright © 2019 Farlex, Inc. | Feedback | For webmasters