Printer Friendly

An improved approach to the PageRank problems.

1. Introduction

The rapid growth of the World Wide Web has created a need for search tools. One of the best-known algorithms in web search is Google's PageRank algorithm [1]. Google's PageRank algorithm is based on a random surfer model [1] which can be viewed as a stationary distribution of a Markov chain. Simultaneously with the random surfer model, a different but closely related approach, the HITS algorithm, was invented in [2]. Another model SALSA [3] incorporated ideas from both HITS and PageRank to create another ranking of webpages.

In this paper, we focus on Google's PageRank algorithm. Let us introduce some notations about Google's PageRank algorithm. We can model the web as a directed graph with the web pages as the nodes and the hyperlinks as the directed edges. In the graph, if there is a link from page [P.sub.i] to page [P.sub.j], then, for page [P.sub.i], it has an outlink to page [P.sub.j], and, for page [P.sub.j], it has an inlink from page [P.sub.i]. Then we can define the elements of a hyperlink matrix H as follows.

If the web page [P.sub.i] has outlinks [Q.sub.i] [greater than or equal to] 1, then, for each link from page [P.sub.i] to another page [P.sub.j], the element [h.sub.i,j] of the matrix H is 1/Q; .If there is no link from page [P.sub.i] to page [P.sub.j], then the element [h.sub.i,j] of H is 0. The scalar [Q.sub.i] is the number of outlinks from the page [P.sub.i]. Thus, each nonzero row of H sums to 1. If the page [P.sub.i] has no outlinks at all (such as a pdf, image, or audio file), it is called a dangling node, and all elements in the "th row of H are set to 0.

The problem is that if at least one node has zero outdegree, that is, no outlinks, then the Markov chain is absorbing, so a modification to H is needed. In order to resolve this, the founders of Google, Brin and Page suggest replacing each zero row (corresponding to a dangling node) of the sparse hyperlink matrix with a dense nonnegative vector [v.sup.T]([v.sup.T]e = 1; e is the column vector of all ones and [v.sup.T] also could be a personalized vector, see [4, 5]) and create the new stochastic matrix denoted by S, S = H + [dv.sup.T]. In the vector d, the element [d.sub.i] = 1 if the ith row of H corresponds to a dangling node, and 0 otherwise. Another problem is that there is nothing in our definition so far that guarantees the convergence of the PageRank algorithm or the uniqueness of the PageRank vector with the matrix S. In general, if the matrix S is irreducible, this problem can be settled. Thus, Brin and Page added another dense perturbation matrix [ev.sup.T] that creates direct connections between each page to force the matrix to be irreducible. Then, the stochastic, irreducible matrix is called the Google matrix G and given by

G = [alpha]S + (1 - [alpha])[ev.sup.T] = [alpha]H + [alpha][dv.sup.T] + (1 - [alpha])[ev.sup.T], (1)

where 0 < [alpha] < 1 (a typical value for a is between 0.85 and 0.95. It is shown in [6] that a controls the convergence rate of the PageRank algorithm). Mathematically, the PageRank vector [pi] is the stationary distribution of the so-called Google matrix G.

Now, we have got many methods for solving the PageRank vector [pi], such as the famous Power Method [1, 7, 8]. Due to the sheer size of the web (over 3 billion pages), this computation can take several days. In [9], Arasu et al. used values from the current iteration as they become available, rather than using only values from the previous iteration. They also suggested that exploiting the "bow-tie" structure of the web [10] would be useful in computing PageRank. In [11], Kamvar et al. presented a variety of extrapolation methods. In [12], Avrachenkov et al. showed that Monte Carlo methods already provide good estimation of the PageRank for relatively important pages after one iteration. Gleich et al. in [13] presented an inner-outer iterative algorithm for accelerating PageRank computations. To put it another way, for the existence of the dangling nodes, Lee et al. [14] partitioned the web into dangling and nondangling nodes and applied an aggregation method to this partition.

Recently, the structure of the web link graph has been noticed. Kamvar et al. in [4] brilliantly exploited the block structure of the web for computing PageRank. They also exploited the fact that pages with lower page rank tend to converge faster and propose adaptive methods in [15]. Based on the characteristics of the web link graph, research on parallelization of PageRank can be found in [16-21]. In [21], Manaskasemsak and Rungsawang discussed a parallelization of the power method. In [17], Gleich et al. introduced a method to compare the various linear system formulations in terms of parallel runtime performance. Cevahir et al. in [16] proposed the site-based partitioning and repartitioning techniques for parallel PageRank computation. Some special models for parallel PageRank were proposed in [18-20].

In our paper, we combine ideas from the existence of the dangling nodes and the block structure of the web and exploit a new structure for the hyperlink matrix H. Then some parallel computation methods are applied to speed up the computation of PageRank by using a partition of the nodes. Firstly, we present that our target is to compute the PageRank of the nondangling nodes in the linear system for the Google problem [22] (Section 2). Secondly, according to the partition of the web pages, we get a special structure of the hyperlink matrix, and then we propose an algorithm (Section 3). Finally, we make an analysis of our algorithms, and some numerical results are given (Sections 4 and 5).

2. The Problem

Generally, the Google problem is to solve the eigenvector [pi] of the matrix G in the following equation:

[[pi].sup.T] = [[pi].sup.T]G, [pi] [greater than or equal to] 0, [[parallel][[pi].sup.T][parallel].sub.1] = 1. (2)

Here, we introduce some theorems to show that the Google problem can turn out to be a linear system problem and only need to compute the unnormalized PageRank subvector of the nondangling nodes. In the following, the matrix I denotes the identity matrix.

Theorem 1 (see [22, linear system for Google problem]). Suppose that the matrix H is a hyperlink matrix. Solving the linear system,

[x.sup.T](I - [alpha]H) = [v.sup.T], (3)

and letting [[pi].sup.T] = [x.sup.T]/[[parallel]x[parallel].sub.1] produce the PageRank vector.

Since the coefficient matrix (I - [alpha]H) in (3) is an M-matrix (Theorem 8.(4.2) in [23]) as well as nonsingular and irreducible, thus, the solution of the linear system in Theorem 1 is existent and unique.

The rows in the matrix H corresponding to the dangling nodes would be zero. It is natural as well as efficient to exclude the dangling nodes from the PageRank computation. This can be done by partitioning the web nodes into nondangling nodes and dangling nodes. This is similar to the method of "lumping" all the dangling nodes into a single node [24]. Supposing that the rows and columns of H are permuted corresponding to the partition, then the rows corresponding to the dangling nodes are at the bottom of the matrix:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (4)

where ND is the set of the nondangling nodes and D is the set of the dangling nodes.

Then, the coefficient matrix (I - [alpha]H) in (3) becomes

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (5)

and the inverse of this matrix is

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (6)

Therefore, the unnormalized PageRank vector [x.sup.T] = [v.sup.T][(I - [alpha]H).sup.-1] in (4) can be written as

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (7)

Then, Langville and Meyer [22] proposed two reordered PageRank algorithms for computing the PageRank vector. One is Algorithm 1, called reordered PageRank algorithm, and the other is called reordered PageRank algorithm. However, unfortunately, the reordered PageRank algorithm is not necessarily an improvement over Algorithm 1 in some cases.

In this reordered PageRank Algorithm 1, the only system that must be solved is [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. The reordered PageRank Algorithm 2 is based on a process of locating zero rows which can be repeated recursively on smaller and smaller submatrices of [[??].sub.11], continuing until a submatrix is created that has no zero rows. For interested readers, the detail of the reordered PageRank algorithms can be found in [22]. However, this structure of the web they exploit in reordered PageRank Algorithm 2 is not practical, as reordering the web matrix according to this structure requires depth-first search, which is prohibitively costly on the web. To put it another way, even though some hyperlink matrices H can be suited to the reordered PageRank algorithm, the structure may not exist for some hyperlink matrices. Thus the reordered PageRank Algorithm 2 will have no advantage over Algorithm 1 in this worst case. Similarly, we can find the same conclusion in their experiments. Thus, we come back to (4) and reorder the structure of the matrix Hu to speed up the computation of PageRank vector. The objective function becomes

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (8)

where the coefficient matrix (I - [alpha][[??].sub.11]) is the nontrivial leading principal submatrix of (I - [alpha]H) and it is nonsingular (Theorem 6.(4.16) of [23]).

ALGORITHM 1: Reordered PageRank Algorithm [22].

(1) Partition the web nodes into dangling and nondangling
    nodes, so that the hyperlink matrix H has the
    structure of (4).
(2) Solve [[??].sup.T.sub.1] from
    [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
(3) Compute [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
(4) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

ALGORITHM 2: An algorithm based on a separation of the common nodes.

(1) Partition the web nodes which form m blocks: S = ([S.sub.1],
    [S.sub.2], ..., [S.sub.m]) into m + 2 blocks: S =
    ([S.sup.2.sub.1], [S.sup.2.sub.2], ... [S.sup..sub.2m], CN, D),
    so the hyperlink matrix H has the structure of (12).
(2) Partition the given vector [v.sup.T] = ([[??].sup.T.sub.1],
    [[??].sup.T.sub.2]) and PageRank vector [x.sup.T] =
    ([[??].sup.T.sub.1], [[??].sup.T.sub.2]) according to the size
    of the m + 2 blocks:
      [[??].sup.T.sub.1] = ([w.sup.T.sub.1], [w.sup.T.sub.2]),
        [w.sup.T.sub.1] = ([v.sup.T.sub.1], [v.sup.T.sub.2], ...,
        [v.sup.T.sub.m]), [w.sup.T.sub.2] = ([v.sup.T.sub.m+1]),
        [[??].sup.T.sub.2] = ([v.sup.T.sub.m+2])
      [[??].sup.T.sub.1] = ([y.sup.T.sub.1], [y.sup.T.sub.2]),
        [y.sup.T.sub.1] = ([x.sup.T.sub.1], [x.sup.T.sub.2], ...,
        [x.sup.T.sub.m]), [y.sup.T.sub.2] = ([x.sup.T.sub.m+1]),
        [[??].sup.T.sub.2] = ([x.sup.T.sub.m+2]).
(3) Compute the limiting vector of [y.sup.T.sub.1] by iterations
    as follow:
    (a) Compute ([r.sup.T.sub.1], [r.sup.T.sub.2], ...,
      [r.sup.T.sub.m]) = [y.sup.(k-1).sub.1]B[(I -
      [alpha]F).sup.-1]E + [w.sup.T.sub.2][(I - [alpha]F).sup.-1]E
      + [w.sup.T.sub.1];
    (b) Solve for [y.sup.(k)T.sub.1] = ([x.sup.(k)T.sub.1],
      [x.sup.(k)T.sub.2], ..., [x.sup.(k)T.sub.m]) in
      [x.sup.(k)T.sub.i](I - [alpha][H.sub.ii]) = [r.sup.T.sub.i],
      i = 1, 2, ..., m.
(4) Compute
    [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
(5) Normalize [[pi].sup.T] = ([y.sup.T.sub.1], [y.sup.T.sub.2],
   [[??].sup.T.sub.2])/[[parallel]([y.sup.T.sub.1],
   [y.sup.T.sub.2], [[??].sup.T.sub.2])[parallel].sub.1].


3. PageRank Algorithms Based on a Separation of the Common Nodes

3.1. The Block Structure of the Web. It is noted in [4] that when sorted by Uniform Resource Location (URL), the web link graph has a nested block structure: the vast majority of hyperlinks link pages on a host to other pages on the same host. This property was demonstrated by examination on realistic datasets. So in the following sections, we consider the webs that have block structure. To simplify notation, without loss of generality, we will assume that a web link graph has a block structure of m blocks: [S.sub.1], [S.sub.2], ..., [S.sub.m]. So the hyperlink matrix H is

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (9)

Then, we separate the dangling nodes from each of the blocks. Thus, we get the new blocks [S.sup.1.sub.k], k [member of] (1, ..., m), which are the original blocks [S.sub.k] with dangling nodes removed. The set of nodes S = ([S.sub.1], [S.sub.2], ..., [S.sub.m]) is S = (ND, D), where ND = ([S.sup.1.sub.1], [S.sup.1.sub.2], ..., [S.sup.1.sub.m]) and D is the set of the dangling nodes. The rows and columns of H can be permuted, making the rows corresponding to the dangling nodes at the bottom of the matrix just like (4) in Section 2:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (10)

In the above equation, the submatrix [[??].sub.11] is

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (11)

3.2. A Separation of the Common Nodes. To investigate the detail of the web structure, we can see the experiments in [4]. They used LARGEWEB link graph [25] and considered the version of LARGEWEB with dangling nodes removed, which contains roughly 70 M nodes, with over 600 M edges, and requires 3.6 GB of storage. They partitioned the links in the graph into "intrahost" links, which means links from a page to another page in the same host, and "interhost" links, which means links from a page to a page in a different host. Through counting the number of the two different links separately, Table 2 in [4] shows that 93.6% of the links in the datasets are intrahost links and 6.4% are interhost links, which means that larger majority of links are intrahost links and only a minority of links are interhost links. They also found the same result by partitioning the links according to different domains. This result leads to a deeper study of the structure of the hyperlink matrix H. That is, if the pages are grouped by domain, host, or others, the graph for the pages will appear as a block structure. Then in each subblock, a minority of nodes have links to other blocks, and in this paper we call them common nodes. The definition of common node is given as follows.

Definition 2 (common node). Assume that a web link graph with dangling nodes removed has n blocks [S.sub.1], [S.sub.2], ..., [S.sub.n]. If a node in a block [S.sub.i](1 [less than or equal to] i [less than or equal to] n) has at least one outlink to another different block [S.sub.j](j [not equal to] i, 1 [less than or equal to] j [less than or equal to] n) or inlink from another different block [S.sub.j](j [not equal to] i, 1 [less than or equal to] j [less than or equal to] n), we call it common node.

If a node in a web link graph is not a dangling node or a common node, then we call it general node. The nodes in a web link graph are divided into three classes: dangling node, common node, and general node. Specially, the common nodes and general nodes belong to the nondangling nodes.

There is no dangling node in the blocks [S.sup.1.sub.1], [S.sup.1.sub.2], ..., [S.sup.1.sub.m], so we consider separating all the common nodes from the blocks [S.sup.1.sub.1], [S.sup.1.sub.2], ..., [S.sup.1.sub.m] and form a new block denoted by CN. Hence, the set of nodes S = ([S.sup.1.sub.1], [S.sup.1.sub.2], ..., [S.sup.1.sub.m], D) is S = ([S.sup.2.sub.1], [S.sup.2.sub.2], ..., [S.sup.2.sub.m], CN, D). The new block [S.sup.2.sub.k](1 [less than or equal to] k [less than or equal to] m) is the block [S.sup.1.sub.k] with common nodes removed. Thus, any hyperlink submatrix [H.sub.i,j](i [not equal to] j) corresponding to two different blocks [S.sup.2.sub.i] and [S.sup.2.sub.j] becomes zero matrix because there are no interlinks between different blocks in [S.sup.2.sub.1], [S.sup.2.sub.2], ..., [S.sup.2.sub.m].

In Figure 1, a simple example is shown to illustrate the change after a separation of the common nodes. In Figure 1(a), there are four blocks [S.sub.1], [S.sub.2], [S.sub.3], and [S.sub.4] in a web link graph, and each of them has links to others. However, in Figure 1(b), after separating the common nodes from the four blocks and lumping the common nodes into a block denoted by CN, there are no links among the four new blocks. The links exist only between the CN and the four new blocks. Once the above is done, the hyperlink matrix H corresponding to the partition of the web nodes, S = ([S.sup.2.sub.1], [S.sup.2.sub.2], ..., [S.sup.2.sub.m], CN, D), has the following structure:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (12)

Then the submatrix Hn, corresponding to the hyperlinks among the nondangling nodes, turns out to be

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (13)

It is apparent that after the separation of the common nodes, the structure of the above matrix [[??].sub.11] seems much simpler than the former one in (11).

3.3. A PageRank Algorithm. Notice that the matrix in (13) has nonzero submatrices only in the diagonal, the last row, and the last column. This special structure can reduce the computation in every iteration. Let

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (14)

Then

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (15)

The coefficient matrix (I - [alpha][[??].sub.11]) has the following structure:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (16)

Therefore, after Gaussian elimination, [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] can be written as

[y.sup.T.sub.1](U - [alpha]A) = [alpha][y.sup.T.sub.1]B[(I - [alpha]F).sup.-1]E + [w.sup.T.sub.2], (17)

[y.sup.T.sub.2] = ([w.sup.T.sub.2] + [y.sup.T.sub.1][alpha]B)[(I - [alpha]F).sup.-1], (18)

where [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] are divided into general and common sections. The only system that must be solved is (17).

Notice that the matrix A is a block diagonal matrix. Therefore, the subvectors [x.sup.T.sub.i] of [y.sup.T.sub.1] = ([x.sup.T.sub.1], [x.sup.T.sub.2], ..., [x.sup.T.sub.3]) which are partitioned according to the number and size of the blocks can be calculated independently in each iteration. For example, in kth iteration, calculate and divide ([w.sup.T.sub.2] + [y.sup.(k-1).sub.1][alpha]B)[(I - [alpha]F).sup.-1]E + [w.sup.T.sub.1] into ([r.sup.T.sub.1], [r.sup.T.sub.2], ..., [r.sup.T.sub.m]) according to the number and size of the blocksl then, for vectors [x.sup.T.sub.i], we have the following function:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (19)

or

[x.sup.(k)T.sub.i](I - [alpha][H.sub.ii]) = [r.sup.T.sub.i], i = 1, 2, ..., m. (20)

As a result, the PageRank system in (8) can be reduced into the smaller linear system formulation in (17) inwhich the subvectors can be calculated independently in each iteration by (20). In summary, we now have an algorithm based on the separation of the common nodes. Meanwhile, this algorithm is an extension of the dangling node method in Section 2.

4. Analysis of Algorithm 2

As we know, some web link graphs appear to have a nested block structure. Then according to the definition of common node, it is not difficult to find the common nodes among the different blocks. This can be done by a process of locating nonzero entries on submatrices of [H.sub.i,j] in (10)(i [not equal to] j, 1 [less than or equal to] i [less than or equal to] m, 1 [less than or equal to] j [less than or equal to] m). For example, if the ([k.sub.1], [k.sub.2])th entry of [H.sub.i,j] is nonzero, then the [k.sub.1]th nodes and the [k.sub.2]th nodes are common nodes. This process can be repeated on different submatrices of [H.sub.i,j] at the same time by using separate computers. At the end, gather the common nodes together from different computers and get rid of the repetitive nodes, and then we get the last set of the common nodes. Since the dimension of [H.sub.i,j] is much smaller and we can use parallel searching, so the step 1 in Algorithm 2 will not take much time for separating the common nodes.

Note that there is no links among the new blocks [S.sup.2.sub.1], [S.sup.2.sub.2], ..., [S.sup.2.sub.m] after the separation of the common nodes just as the zero submatrices in the matrix H in (12). In effect, step 3 in Algorithm 2 reduces time consuming for large matrices by turning a large matrix [[??].sub.11] into many smaller submatrices [H.sub.ii]. It shows that vectors [x.sup.(k)T.sub.i], i = 1, ..., m, can be computed separately by [x.sup.(k)T.sub.i] (I - [alpha][H.sub.ii]) = [r.sub.i] and the results are used together to yield a new vector for the next iteration. The parallel computation in this step can save much time.

Since [x.sup.T.sub.i] are not required to be accurate in each iteration, we can compute [x.sup.(k)T.sub.i] by [x.sup.(k)T.sub.i] = [x.sup.(k-1)T.sub.i]([alpha][H.sub.ii]) + [r.sub.i]. Moreover, it can be solved by any appropriate direct or iterative method. Meanwhile, in [22], they have found that acceleration methods [9, 11, 15, 26], such as extrapolation and preconditioned, can be applied to the small Hi t system to achieve even greater speedups.

5. Numerical Experiments

5.1. Experiment Foundation. In this section, we give an example to present our algorithms.

Example. We consider three experiments based on three web link graphs: graph 1, graph 2, and graph 3. We assume that each of the graphs contains 200 nodes and four blocks; moreover, the size of the blocks is the same in each graph. Based on our definition about web pages, there are three classes of pages in a web: dangling nodes, common nodes, and general nodes. In order to make comparisons among the experiments, we suppose that the numbers of the dangling nodes are equivalent in these three graphs. Then we set different proportions of the general nodes and the common nodes in these three graphs. Without loss of generality, we assume that there are three kinds of proportions: they are 3: 7 in graph 1, 5: 5 in graph 2, and 7: 3 in graph 3, which indicate that the number of the common nodes relatively decreases and the number of the general nodes relatively increases. We also assume that, in each graph, the proportion between the general nodes and the common nodes in each subblock is similar to the proportion in the whole web link graph. Meanwhile, in these three web link graphs, the choosing of the common nodes and the links in and between the subblocks is random.

For the dot plot graph of these three web link graphs, if there exists a link from node i to node j, then point (i, j) is colored; otherwise, point (i, j) is white. We assure that these three web link graphs satisfy three characters in [4].

(1) There is a definite block structure to the web.

(2) The individual blocks are much smaller than entire web.

(3) There are clear nested blocks.

For example, Figure 2, it is the graph 3 which contains 200 pages and has a nested block structure of four blocks. The proportion is 7 : 3 in the whole graph.

Then, in each experiment, we separate the nodes into dangling nodes, common nodes, and the rest (general nodes). The result of this process is a decomposition of the H matrix. Figure 3 shows the change of the structure of [[??].sub.11] in (4) after this process, which is based on the dataset of Figure 2. Figure 3(a) is the web link graph of [[??].sub.11] before reordering, and Figure 3(b) is the new web link graph of [[??].sub.11] after reordering. This process amounts to a simple reordering of the indices of the Markov chain. It shows that the character of the new structure is better than the original one.

5.2. Experimental Results and Analysis. Based on the three experiment datasets, we compare Algorithm 2 to the other two algorithms: original PageRank and reordered PageRank. We assume the scaling factor [alpha] = 0.85 and the convergence tolerance [tau] = [10.sup.-10]. The experimental results are shown in Figure 4 and Table 1. Figures 4(a), 4(b), and 4(c) are the comparison among the three algorithms about the acceleration of convergence in the three separate experiments. It shows that Algorithm 2 possesses both good capability to search PageRank vector and rigid convergence speed in comparison with reordered PageRank. That is because the dimension of the linear system for Algorithm 2 is smaller than the dimension of the linear system for reordered PageRank. The result in Table 1 implies that Algorithm 2 needs more iterations than Power method. However, since the application of parallel computation in Algorithm 2, Algorithm 2 can largely accelerate the computation time of PageRank. For the next work, we will try to experiment on real data.

6. Conclusion

It has investigated that the hyperlink graphs of some web pages have nested block structure which can be found in [4]. Then we exploit a reordered block structure and present an algorithm to compute PageRank in a fast manner. Algorithm 2 has basically two stages. In Stage 1, the focus is on the partition of nodes in a web. In Stage 2, the vector of general nodes in each block for next iteration is computed independently. Then we calculate the unnormalized PageRank vectors for common nodes and dangling nodes directly. At last, normalize the vector and give the PageRank. The numerical experiments show that Algorithm 2 is guaranteed to outperform the other two algorithms, as long as an appropriate block structure of web exists. However, in real data, the common nodes may increase as the number of the blocks increases, and the dimension of the submatrix D could be larger. Then it will take much time to calculate the value of [(I - [alpha]D).sup.-1]. In this case, similar to Algorithm 2, we will consider calculating the vector for common nodes first and then calculating the vector for general nodes in each block independently. We aslo need to experiment on real data and make comparison with other more existing methods in the future work.

http://dx.doi.org/10.1155/2013/438987

Acknowledgments

The authors would like to express their great thankfulness to the referees and the editor for their much helpful suggestions for revising this paper. This research is supported by 973 Program (2013CB329404), NSFC (61370147, 61170311, and 61170309), Chinese Universities Specialized Research Fund for the Doctoral Program (20110185110020), and Sichuan Province Sci. & Tech. Research Project (2012GZX0080).

References

[1] S. Brin, L. Page, R. Motwami, and T. Winograd, "The PageRank citation ranking: bringing order to the web," Tech. Rep. 19990120, Computer Science Department, Stanford University, Stanford, Calif, USA, 1999.

[2] J. M. Kleinberg, "Authoritative sources in a hyperlinked environment," Journal of the ACM, vol. 46, no. 5, pp. 604-632, 1999.

[3] R. Lempel and S. Moran, "The stochastic approach for link-structure analysis (SALSA) and the TKC effect," in Proceedings of the 9th International Conference on the World Wide Web, pp. 387-401, ACM Press, New York, NY, USA, 2000.

[4] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub, "Exploiting the block structure of the web for computing PageRank," Tech. Rep. 2003-17, Stanford University, Stanford, Calif, USA, 2003.

[5] E. Minkov, "Adaptive graph walk based similarity measures in entity-relation graphs," type CMU-LTI-09-004, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pa, USA, 2008.

[6] T. H. Haveliwala and S. D. Kamvar, "The second eigenvalue of the Google matrix," Tech. Rep. 2003-20, Stanford University, Stanford, Calif, USA, 2003.

[7] G. H. Golub and C. F. van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, Md, USA, Third edition, 1996.

[8] R. S. Wills and I. C. F. Ipsen, "Ordinal ranking for Google's PageRank," SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 4, pp. 1677-1696, 2008/09.

[9] A. Arasu, J. Novak, A. Tomkins, and J. Tomlin, "PageRank computation and the structure of the web: experiments and algorithms," in Proceedings of the 11th International World Wide Web Conference, ACM Press, New York, NY, USA, 2002.

[10] A. Broder, R. Kumar, and F. Maghoul, "Graph structure in the web: experiments and models," in Proceedings of the 9th International World Wide Web Conference, 2000.

[11] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub, "Extrapolation methods for accelerating the computation of PageRank," in Proceedings of the 12th International World Wide Web Conference, pp. 261-270, ACM Press, New York, NY, USA, 2003.

[12] K. Avrachenkov, N. Litvak, D. Nemirovsky, and N. Osipova, "Monte Carlo methods in pagerank computation: when one iteration is sufficient," SIAM Journal on Numerical Analysis, vol. 45, no. 2, pp. 890-904, 2007.

[13] D. F. Gleich, A. P. Gray, C. Greif, and T. Lau, "An inner-outer iteration for computing PageRank," SIAM Journal on Scientific Computing, vol. 32, no. 1, pp. 349-371, 2010.

[14] C. P. Lee, G. H. Golub, and S. A. Zenios, "Partial state space aggregation based on lumpability and its application to PageRank," Tech. Rep., Stanford University, 2003.

[15] S. D. Kamvar, T. H. Haveliwala, and G. H. Golub, "Adaptive methods for the computation of PageRank," Tech. Rep. 200326, Stanford University, Stanford, Calif, USA, 2003.

[16] A. Cevahir, C. Aykanat, A. Turk, and B. B. Cambazoglu, "Site-based partitioning and repartitioning techniques for parallel pagerank computation," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 5, pp. 786-802, 2011.

[17] D. Gleich, L. Zhukov, and P. Berkhin, "Fast parallel PageRank: a linear system approach," Tech. Rep. YRL-2004-038, 2004.

[18] C. Kohlschtter, R. Chirita, and W. Nejdl, "Efficient parallel computation of PageRank," in Proceedings of the 28th European Conference on IR Research (ECIR '06), pp. 241-252, 2006.

[19] G. Kollias and E. Gallopoulos, "Asynchronous PageRank computation in an interactive multithreading environment," in Proceedings of the Seminar Web Information Retrieval and Linear Algebra Algorithms, 2007.

[20] G. Kollias, E. Gallopoulos, and D. B. Szyld, "Asynchronous iterative computations with web information retrieval structures: the PageRank case," in Proceedings of the ParCo, 2006.

[21] B. Manaskasemsak and A. Rungsawang, "An efficient partition-based parallel PageRank algorithm," in Proceedings of the 11th International Conference on Parallel and Distributed Systems Workshops (ICPADS '05), pp. 257-263, July 2005.

[22] A. N. Langville and C. D. Meyer, "A reordering for the PageRank problem," SIAM Journal on Scientific Computing, vol. 27, no. 6, pp. 2112-2120, 2006.

[23] A. Berman and R. J. Plemmons, Nonnegative Matrices in the Mathematical Sciences, vol. 9, SIAM, Philadelphia, Pa, USA, 1994.

[24] C. P. Lee, G. H. Golub, and S. A. Zenios, "A fast two-stage algorithm for computing PageRank and its extensions," Tech. Rep. SCCM-03-15, Stanford University, Stanford, Calif, usa, 2003.

[25] J. Hirai, S. Raghavan, H. Garcia-Molina, and A. Paepcke, "WebBase: a repository of Web pages," Computer Networks, vol. 33, no. 1, pp. 277-293, 2000.

[26] G. H. Golub and C. Greif, "Arnoldi-type algorithms for computing stationary distribution vectors, with application to PageRank," Tech. Rep. SCCM-2004-15, Scientific Computation and Computational Mathematics, Stanford University, Stanford, Calif, USA, 2004.

Yue Xie, (1,2) Ting-Zhu Huang, (1) Chun Wen, (1) and De-An Wu (1)

(1) School of Mathematical Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, China

(2) Department of Mathematics, The University of Hong Kong, Pokfulam, Hong Kong

Correspondence should be addressed to Ting-Zhu Huang; tingzhuhuang@126.com

Received 1 August 2013; Revised 25 November 2013; Accepted 25 November 2013

Academic Editor: Zhongxiao Jia

TABLE 1: Comparison of original PageRank, reordered
PageRank and Algorithm 2.

                            Dataset 1   Dataset 2   Dataset 3

Reordered     Iterations       81          80          72
  PageRank    Time (sec.)    0.0377      0.0366      0.0400
Original      Iterations       20          25          31
  PageRank    Time (sec.)    0.0292      0.0270      0.0305
Algorithm 2   Iterations       21          31          42
              Time (sec.)    0.0165      0.0145      0.0187
COPYRIGHT 2014 Hindawi Limited
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2014 Gale, Cengage Learning. All rights reserved.

Article Details
Printer friendly Cite/link Email Feedback
Title Annotation:Research Article
Author:Xie, Yue; Huang, Ting-Zhu; Wen, Chun; Wu, De-An
Publication:Journal of Applied Mathematics
Article Type:Report
Date:Jan 1, 2014
Words:5659
Previous Article:Day-ahead wind speed forecasting using relevance vector machine.
Next Article:Adaptive impulsive observer for outer synchronization of delayed complex dynamical networks with output coupling.
Topics:

Terms of use | Privacy policy | Copyright © 2021 Farlex, Inc. | Feedback | For webmasters