代数方法求PageRank

本文结合实例介绍如何用代数方法求PageRank。

1. PageRank

博文《网页排序算法PageRank》介绍了PageRank,计算PageRank可以用迭代的方法也可以用代数的方法,其背后的数学基本运算是一样的,即:

$$
PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in B(p_i)} \frac{PR(p_j)}{L(p_j)}
$$

下文结合图1介绍如何用代数方法求PageRank。

Wikipedia-PageRanks-Example
Fig. 1: PageRanks for a simple network (image from Wikipedia).

为了便于讨论,将图1下方的节点分别标上G, H, I, J, K,如下图所示:

wikipedia_pagerank_example
Fig. 2: Draw Fig. 1 in NetworkX.

2. 代数方法

根据1中的等式,把所有节点都放在一块,可以得到:

$$
\begin{bmatrix}
PR(p_1) \\
PR(p_2) \\
\vdots \\
PR(p_N)
\end{bmatrix} =
\begin{bmatrix}
{(1-d)/ N} \\
{(1-d) / N} \\
\vdots \\
{(1-d) / N}
\end{bmatrix}
+ d
\begin{bmatrix}
\ell(p_1,p_1) & \ell(p_1,p_2) & \cdots & \ell(p_1,p_N) \\
\ell(p_2,p_1) & \ddots & & \vdots \\
\vdots & & \ell(p_i,p_j) & \\
\ell(p_N,p_1) & \cdots & & \ell(p_N,p_N)
\end{bmatrix}
\cdot
\begin{bmatrix}
PR(p_1) \\
PR(p_2) \\
\vdots \\
PR(p_N)
\end{bmatrix}
$$

上述等式可以缩写为:

$$
\mathbf{R} = \frac{1-d}{N} \mathbf{1} + d \mathcal{M}\mathbf{R}
$$

其中,\(\mathbf{1}\)为\(N\)维的列向量,所有元素皆为1。以图1为例,该列向量为,

N = len(G.nodes())      # N = 11
column_vector = np.ones((N, 1), dtype=np.int)

[[1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]]

2.1 Adjacency function

邻接函数(adjacency function)\(\ell(p_1,p_2)\)组成了矩阵\(\mathcal{M}\),

$$
\mathcal{M}_{ij} =
\ell(p_i,p_j) =
\begin{cases}
1 /L(p_j) , & \mbox{if }j\mbox{ links to }i. L(p_j)\mbox{是指从}p_j\mbox{链出去的网页数目} \\
0, & \mbox{otherwise}
\end{cases}
$$

这样矩阵每一行乘以\(\mathbf{R}\),就得到了新的PR值,比如第二行(图1的节点B),

$$
\begin{split}
M_{2j} & = \ell(p_2,p_1) \cdot PR(p_2) + \ell(p_2,p_2) \cdot PR(p_2) + \cdots + \ell(p_2,p_N) \cdot PR(p_2) \\
&= 0 \mbox{ (‘A’)} + 0 \mbox{ (‘B’)} + 1 \mbox{ (‘C’)} + \frac{1}{2} \mbox{ (‘D’)} + \frac{1}{3} \mbox{ (‘E’)} + \frac{1}{2} \mbox{ (‘F’)} \\
& + \frac{1}{2} \mbox{ (‘G’)} + \frac{1}{2} \mbox{ (`H’)} + \frac{1}{2} \mbox{ (‘I’)} + 0 \mbox{ (‘J’)} + 0 \mbox{ (‘K’)}
\end{split}
$$

以节点G为例,GBE投票,所以B得到1/2

矩阵\(\mathcal{M}\)每一列加起来都是1(值得注意的是,对于没有出链的节点,列加起来等于0,比如图1的节点A),即\(\sum_{i = 1}^N \ell(p_i,p_j) = 1\)。事实上,\(\mathcal{M}\)是一个转移矩阵transition matrix(也叫概率矩阵probability matrix,马尔可夫矩阵Markov matrix)。因此,PageRank是eigenvector centrality的一个变体。

2.2 矩阵\(\mathcal{M}\)

事实上,\(\mathcal{M}\)可以被看成normalized的图邻接矩阵,即:

$$
\mathcal{M} = (K^{-1} A)^T
$$

其中,\(\mathcal{A}\)为图的邻接矩阵,以图1为例,

# Get adjacency matrix
nodelist = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K']  # sorted(G.nodes())
A = nx.to_numpy_matrix(G, nodelist)

  'A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K'
[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  1.  0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  1.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  1.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  1.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  1.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.]]

\(\mathcal{A}\)是对角矩阵,对角线上的元素是对应节点的出度。

nodelist = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K']  # sorted(G.nodes())
list_outdegree = map(operator.itemgetter(1), sorted(G.out_degree().items()))
K = np.diag(list_outdegree)

[[0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 2 0 0 0 0 0 0 0]
 [0 0 0 0 3 0 0 0 0 0 0]
 [0 0 0 0 0 2 0 0 0 0 0]
 [0 0 0 0 0 0 2 0 0 0 0]
 [0 0 0 0 0 0 0 2 0 0 0]
 [0 0 0 0 0 0 0 0 2 0 0]
 [0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 1]]

\(\mathbf{K}\)的逆矩阵\(\mathbf{K^{-1}}\)为,

K_inv = np.linalg.pinv(K)

[[ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    1.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    1.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.5   0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.33  0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.5   0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.5   0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.5   0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.5   0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    1.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    1.  ]]

那么,根据公式\(\mathcal{M} = (K^{-1} A)^T\)就可以求得\(\mathcal{M}\),如下,

M = (K_inv * A).transpose()

[[ 0.    0.    0.    0.5   0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    1.    0.5   0.33  0.5   0.5   0.5   0.5   0.    0.  ]
 [ 0.    1.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.33  0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.5   0.5   0.5   0.5   1.    1.  ]
 [ 0.    0.    0.    0.    0.33  0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]]

2.3 求解

\(\mathbf{R}\)是2.1等式的特征向量(eigenvector),求解等式得:

$$
\mathbf{R} = (\mathbf{I}-d \mathcal{M})^{-1} \frac{1-d}{N} \mathbf{1}
$$

其中\(\mathbf{I}\)是单位矩阵。

d = 0.85
I = np.identity(N)
R = np.linalg.pinv(I - d*M) * (1-d)/N * column_vector 

[[ 0.028]
 [ 0.324]
 [ 0.289]
 [ 0.033]
 [ 0.068]
 [ 0.033]
 [ 0.014]
 [ 0.014]
 [ 0.014]
 [ 0.014]
 [ 0.014]]

咦,结果怎么跟图1不一样。得到\(\mathbf{R}\)需要normalized,如此,所有节点的PR加起来才能等于1。

R = R/sum(R)    # normalized R, so that page ranks sum to 1.

[[ 0.033]
 [ 0.384]
 [ 0.343]
 [ 0.039]
 [ 0.081]
 [ 0.039]
 [ 0.016]
 [ 0.016]
 [ 0.016]
 [ 0.016]
 [ 0.016]]

用NetworkX作出来的图,是这样的:

wikipedia_pagerank_example_pr
Fig. 3: PageRanks for a simple network

本文涉及到的源代码已经分享到我的GitHub,pagerank目录下的pagerank_algebraic.py.

3. Python源代码

NetworkX实现了PageRank的代数计算方法nx.pagerank_numpy,源代码在这里

def pagerank_numpy(G, alpha=0.85, personalization=None, weight='weight', dangling=None):
    """Return the PageRank of the nodes in the graph.
    """

    if len(G) == 0:
        return {}

    M = google_matrix(G, alpha, personalization=personalization,
                      weight=weight, dangling=dangling)

    # use numpy LAPACK solver
    eigenvalues, eigenvectors = np.linalg.eig(M.T)
    ind = eigenvalues.argsort()

    # eigenvector of largest eigenvalue at ind[-1], normalized
    largest = np.array(eigenvectors[:, ind[-1]]).flatten().real
    norm = float(largest.sum())

    return dict(zip(G, map(float, largest / norm)))

References:
[1] StackOverflow: Incorrect PageRank calculation result

专题: 网页排序算法PageRank (3/3)

发表评论

电子邮件地址不会被公开。 必填项已用*标注