使用 MapReduce 实施 PageRank
我正在尝试解决使用 MapReduce 实现 PageRank 的理论问题。
我有以下具有三个节点的简单场景:AB C。
邻接矩阵在这里:
A { B, C }
B { A }
例如,B 的 PageRank 等于:
(1-d)/N + d ( PR(A) / C(A) )
N = number of incoming links to B
PR(A) = PageRank of incoming link A
C(A) = number of outgoing links from page A
我对所有原理图以及映射器和减速器如何工作都很好,但我无法理解围绕如何在reducer计算时知道C(A)。当通过聚合 B 的传入链接来计算 B 的 PageRank 时,reducer 将如何知道每个页面的传出链接数量。这是否需要在某些外部数据源中查找?
I'm trying to get my head around an issue with the theory of implementing the PageRank with MapReduce.
I have the following simple scenario with three nodes: A B C.
The adjacency matrix is here:
A { B, C }
B { A }
The PageRank for B for example is equal to:
(1-d)/N + d ( PR(A) / C(A) )
N = number of incoming links to B
PR(A) = PageRank of incoming link A
C(A) = number of outgoing links from page A
I am fine with all the schematics and how the mapper and reducer would work but I cannot get my head around how at the time of calculation by the reducer, C(A) would be known. How will the reducer, when calculating the PageRank of B by aggregating the incoming links to B will know the number of outgoing links from each page. Does this require a lookup in some external data source?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
这是一个伪代码:
重要的是,在reduce中您应该输出外链而不是内链,正如intenret上的一些文章所建议的那样。这样,连续迭代也将具有外链接作为映射器的输入。
请注意,同一页面中具有相同地址的多个外链计为 1。另外,不要计算循环(链接到自身的页面)。
阻尼系数传统上为 0.85,但您也可以使用其他值。
Here is a pseudocode:
It is important that in the reduce you should output outlinks and not inlinks, as some articles on the intenret suggests. This way the consecutive iterations will also have outlinks as input of the mapper.
Pay attention that multiple outlinks with the same address from the same page count as one. Also, don't count loops (page linking to itself).
The damping factor is traditionally 0.85, although you can play around with other values, too.
我们迭代地评估 PR。 PR(x) = Sum(PR(a)*weight(a), a in_links) 因此
输出等于输入,我们可以这样做直到覆盖。
We iteratively evaluate PR. PR(x) = Sum(PR(a)*weight(a), a in in_links) by
so the output equals input and we can do this until coverage.