使用 MapReduce 实施 PageRank

发布于 2024-10-17 18:07:19 字数 476 浏览 1 评论 0原文

我正在尝试解决使用 MapReduce 实现 PageRank 的理论问题。

我有以下具有三个节点的简单场景:AB C。

邻接矩阵在这里:

A { B, C }
B { A }

例如,B 的 PageRank 等于:

(1-d)/N + d ( PR(A) / C(A) ) 

N     = number of incoming links to B
PR(A) = PageRank of incoming link A
C(A)  = number of outgoing links from page A

我对所有原理图以及映射器和减速器如何工作都很好,但我无法理解围绕如何在reducer计算时知道C(A)。当通过聚合 B 的传入链接来计算 B 的 PageRank 时,reducer 将如何知道每个页面的传出链接数量。这是否需要在某些外部数据源中查找?

I'm trying to get my head around an issue with the theory of implementing the PageRank with MapReduce.

I have the following simple scenario with three nodes: A B C.

The adjacency matrix is here:

A { B, C }
B { A }

The PageRank for B for example is equal to:

(1-d)/N + d ( PR(A) / C(A) ) 

N     = number of incoming links to B
PR(A) = PageRank of incoming link A
C(A)  = number of outgoing links from page A

I am fine with all the schematics and how the mapper and reducer would work but I cannot get my head around how at the time of calculation by the reducer, C(A) would be known. How will the reducer, when calculating the PageRank of B by aggregating the incoming links to B will know the number of outgoing links from each page. Does this require a lookup in some external data source?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

凹づ凸ル 2024-10-24 18:07:19

这是一个伪代码:

map( key: [url, pagerank], value: outlink_list )
    for each outlink in outlink_list
        emit( key: outlink, value: pagerank/size(outlink_list) )

    emit( key: url, value: outlink_list )

reducer( key: url, value: list_pr_or_urls )
    outlink_list = []
    pagerank = 0

    for each pr_or_urls in list_pr_or_urls
        if is_list( pr_or_urls )
            outlink_list = pr_or_urls
        else
            pagerank += pr_or_urls

    pagerank = 1 - DAMPING_FACTOR + ( DAMPING_FACTOR * pagerank )

    emit( key: [url, pagerank], value: outlink_list )

重要的是,在reduce中您应该输出外链而不是内链,正如intenret上的一些文章所建议的那样。这样,连续迭代也将具有外链接作为映射器的输入。

请注意,同一页面中具有相同地址的多个外链计为 1。另外,不要计算循环(链接到自身的页面)。

阻尼系数传统上为 0.85,但您也可以使用其他值。

Here is a pseudocode:

map( key: [url, pagerank], value: outlink_list )
    for each outlink in outlink_list
        emit( key: outlink, value: pagerank/size(outlink_list) )

    emit( key: url, value: outlink_list )

reducer( key: url, value: list_pr_or_urls )
    outlink_list = []
    pagerank = 0

    for each pr_or_urls in list_pr_or_urls
        if is_list( pr_or_urls )
            outlink_list = pr_or_urls
        else
            pagerank += pr_or_urls

    pagerank = 1 - DAMPING_FACTOR + ( DAMPING_FACTOR * pagerank )

    emit( key: [url, pagerank], value: outlink_list )

It is important that in the reduce you should output outlinks and not inlinks, as some articles on the intenret suggests. This way the consecutive iterations will also have outlinks as input of the mapper.

Pay attention that multiple outlinks with the same address from the same page count as one. Also, don't count loops (page linking to itself).

The damping factor is traditionally 0.85, although you can play around with other values, too.

梦情居士 2024-10-24 18:07:19

我们迭代地评估 PR。 PR(x) = Sum(PR(a)*weight(a), a in_links) 因此

map ((url,PR), out_links) //PR = random at start
for link in out_links
   emit(link, ((PR/size(out_links)), url))

reduce(url, List[(weight, url)):
   PR =0
   for v in weights
       PR = PR + v
   Set urls = all urls from list

   emit((url, PR), urls)

输出等于输入,我们可以这样做直到覆盖。

We iteratively evaluate PR. PR(x) = Sum(PR(a)*weight(a), a in in_links) by

map ((url,PR), out_links) //PR = random at start
for link in out_links
   emit(link, ((PR/size(out_links)), url))

reduce(url, List[(weight, url)):
   PR =0
   for v in weights
       PR = PR + v
   Set urls = all urls from list

   emit((url, PR), urls)

so the output equals input and we can do this until coverage.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文