PageRank 与 SVD
Pagerank 作用于一系列页面的节点图以及它们各自向内和向外形成的有向边链接。因此,特定页面的排名大致是节点图中本地引起的影响。
另一方面,SVD 适用于整个值矩阵,并且没有方向性 - a站点 A 和站点 B 之间的链接只会在正确的矩阵元素上注册为 1。这是一个全球系统,因此排名具有全球效应。
考虑到网络衍生矩阵的极度稀疏性,我认为 SVD 在这里表现不佳,因为它需要完整的数据集,并且具有大量的内存要求。
这是真的吗? Pagerank 超越 SVD 很大程度上是因为它是基于节点图的算法吗?除了单词被提及的次数之外,Pagerank 如何从页面推断语义相关性?或者这是在 Pagerank 对页面进行排名后执行的第二步?
Pagerank works on the nodegraph of a series of pages and the directed edges formed by their respective inward and outward links. Thus the rank of a particular page is broadly a locally-induced effect in the nodegraph.
SVD, on the other hand, works on a whole matrix of values, and has no directionality - a link between site A and site B would only register as a 1 on the correct matrix element. It is a global system, so ranking is a global effect.
Given the extreme sparseness of web-derived matrices, I would expect SVD to be a bad performer here, as it requires a complete dataset, and has significant memory requirements.
Is that true? Does Pagerank outdo SVD largely because it is a nodegraph-based algorithm? How can Pagerank infer semantic relevance from a page beyond the number of times a word is mentioned? Or would that be a second step, performed after Pagerank has ranked the pages?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
这里有两个问题:哪个度量很容易计算,哪个可以产生我们正在寻找的信息?我不知道这两个问题的答案,但我也许可以给出部分答案。
首先,相关性。用网络理论中的术语来说,这两个量都是中心性度量。 PageRank 计算特征向量中心性(的变体),而 SVD 显然导致了超链接诱导主题搜索 (HITS) 算法。我从 这份讲义来自 Peter Dodds(佛蒙特大学)。它们衡量不同的事物,但我不清楚哪一个与衡量网页的重要性最相关。
其次,计算成本。从数学上讲,PageRank 是(修改后的)邻接矩阵的主要特征向量(如维基百科页面上所解释的),而 HITS 给出了邻接矩阵的主要奇异向量。两者都是由网页的全局网络及其之间的链接定义的,并且两者都可以通过仅考虑本地节点图来计算。所以乍一看,我认为计算成本大致相等。
总之,我不知道为什么PageRank比SVD更好;我什至不清楚它是否比 SVD 更好。
There are two issues here: which measure is easy to compute, and which yields the information that we're looking for? I don't know the answer to either question, but I can perhaps give a partial answer.
Firstly, the relevance. Both quantities are centrality measures, to use a term from network theory. PageRank computes (a variant of) eigenvector centrality, while the SVD apparently leads to the Hyperlink-Induced Topics Search (HITS) algorithm. I got this from this handout from Peter Dodds (University of Vermont). They measure different things, but it's not clear to me which one is the most relevant for measuring the importance of a webpage.
Secondly, the computational cost. Mathematically speaking, the PageRank is the dominant eigenvector of the (modified) adjacency matrix - as explained on the Wikipedia page - while HITS gives the dominant singular vector of the adjacency matrix. Both are defined by the global network of webpages and links between them, and both can be computed by only considering the node graph locally. So at first sight, I think that the computational cost is roughly equal.
In conclusion, I don't know why PageRank is better than SVD; it's not even clear to me that it is better than SVD.
请注意,PageRank 使用传送的随机游走矩阵。隐形传输对于避免随机游走矩阵的(低度)局部特征向量非常重要。我认为 PageRank 比 HITS 更好,因为随机游走矩阵(度归一化邻接矩阵)会抑制大度节点和循环的影响,而 HITS 则大度节点可以生成局部向量。
Note that PageRank uses a teleported random walk matrix. Teleportation is important to avoid (low-degree) localized eigenvectors of the random walk matrix. I think PageRank is better than HITS, because the random walk matrix, which is a degree-normalized adjacency matrix, depresses the effects of large degree nodes and loops, as opposed to HITS where large-degree nodes could make localized vectors.