当前位置：文江博客话题详情

PageRank 及其数学：需要解释

发布于 2024-08-05 04:53:37 字数 682 浏览 16 评论 0原文

我是一名学生，有兴趣开发一个搜索引擎来索引来自我的国家的页面。我已经研究了一段时间的算法，并且我认为 HITS 和 PageRank 是最好的。我决定使用 PageRank，因为它比 HITS 算法更稳定（或者我读过）。

我找到了无数与PageRank相关的文章和学术论文，但我的问题是我不理解这些论文中构成算法的大部分数学符号。具体来说，我不明白谷歌矩阵（不可约随机矩阵）是如何计算的。

我的理解基于这两篇文章：

有人可以提供基本的解释吗（示例友善一点）用更少的数学符号？

提前致谢。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

扛起拖把扫天下 2024-08-12 04:53:37

PageRank 的正式定义，如引用文档第 4 页所定义，用带有有趣的“E”符号的数学方程表示（它实际上是大写的 Sigma 希腊字母。Sigma 是字母“S”，在这里代表求和）。

简而言之，这个公式表示计算页面 X 的 PageRank...

   For all the backlinks to this page  (=all the pages that link to X)
   you need to calculate a value that is
         The PageRank of the page that links to X    [R'(v)]
         divided by 
         the number of links found on this page.    [Nv]
         to which you add
           some "source of rank",  [E(u)] normalized by c
             (we'll get to the purpose of that later.)

     And you need to make the sum of all these values [The Sigma thing]
     and finally, multiply it by a constant   [c] 
        (this constant is just to keep the range of PageRank manageable)

这个公式的关键思想是链接到给定页面 X 的所有网页都是增加其“价值”的价值。通过链接到某个页面，他们“投票”支持该页面。然而，这种“投票”的权重或多或少取决于两个因素：

链接到 X 的页面的受欢迎程度 [R'(v)]
链接到 X 的页面是否也链接到许多其他页面这一事实。 [Nv]

这两个因素反映了非常直观的想法：

通常从该领域公认的专家那里得到一封推荐信比从一个不知名的人那里得到推荐信要好。
无论是谁提供推荐，通过向其他人提供推荐，他们都会降低他们向您推荐的价值。

正如您所注意到的，这个公式使用了有点循环引用，因为要知道 X 的页面范围，您需要知道链接到 X 的所有页面的 PageRank。那么您如何计算这些PageRank 值？...这就是文档部分中解释的下一个收敛问题开始的地方。

本质上，通过从所有页面的某些“随机”（或最好是“合理的猜测”）PageRank 值开始，然后计算使用上面的公式计算 PageRank 后，新的计算值会变得“更好”，当您迭代此过程几次时，这些值会收敛，即它们会越来越接近实际/理论值。因此，通过迭代足够多的次数，我们会达到这样的时刻：额外的迭代不会为上次迭代提供的值增加任何实际精度。

从理论上讲，这很好。将此算法转换为等效的算法，但可以更快地完成。有几篇论文描述了如何完成此任务以及类似的任务。我暂时没有这样的参考资料，但稍后会添加这些参考资料。请注意，它们确实会涉及大量的线性代数。

编辑：正如所承诺的，这里有一些有关计算页面排名算法的链接。
PageRank Haveliwala 1999 的高效计算 ///
利用网络的块结构进行计算 PR Kamvar 等人 2003 ///
一种快速两阶段算法计算 PageRank Lee 等人。 2002

虽然上面提供的链接的许多作者都来自斯坦福大学，但很快我们就意识到寻求高效的类似 PageRank 的计算是一个热门的研究领域。我意识到这些材料超出了 OP 的范围，但重要的是要暗示基本算法对于大型网络来说并不实用。

为了以非常易于理解的文本结束（但有许多深入信息的链接），我想提一下维基百科的优秀文章

如果您认真对待这类事情，您可以考虑数学入门/复习课程，特别是线性代数，以及处理一般图形的计算机科学课程。顺便说一句，Michael Dorfman 在这篇文章中对 OCW 1806 讲座的视频提出了很好的建议。

我希望这会有所帮助......

The formal defintion of PageRank, as defined at page 4 of the cited document, is expressed in the mathematical equation with the funny "E" symbol (it is in fact the capital Sigma Greek letter. Sigma is the letter "S" which here stands for Summation).

In a nutshell this formula says that to calculate the PageRank of page X...

   For all the backlinks to this page  (=all the pages that link to X)
   you need to calculate a value that is
         The PageRank of the page that links to X    [R'(v)]
         divided by 
         the number of links found on this page.    [Nv]
         to which you add
           some "source of rank",  [E(u)] normalized by c
             (we'll get to the purpose of that later.)

     And you need to make the sum of all these values [The Sigma thing]
     and finally, multiply it by a constant   [c] 
        (this constant is just to keep the range of PageRank manageable)

The key idea being this formula is that all web pages that link to a given page X are adding to value to its "worth". By linking to some page they are "voting" in favor of this page. However this "vote" has more or less weight, depending on two factors:

The popularity of the page that links to X [R'(v)]
The fact that the page that links to X also links to many other pages or not. [Nv]

These two factors reflect very intuitive ideas:

It's generally better to get a letter of recommendation from a recognized expert in the field than from a unknown person.
Regardless of who gives the recommendation, by also giving recommendation to other people, they are diminishing the value of their recommendation to you.

As you notice, this formula makes use of somewhat of a circular reference, because to know the page range of X, you need to know the PageRank of all pages linking to X. Then how do you figure these PageRank values?... That's where the next issue of convergence explained in the section of the document kick in.

Essentially, by starting with some "random" (or preferably "decent guess" values of PageRank, for all pages, and by calculating the PageRank with the formula above, the new calculated values get "better", as you iterate this process a few times. The values converge, i.e. they each get closer and closer to what is the actual/theorical value. Therefore by iterating a sufficient amount of times, we reach a moment when additional iterations would not add any practical precision to the values provided by the last iteration.

Now... That is nice and dandy, in theory. The trick is to convert this algorithm to something equivalent but which can be done more quickly. There are several papers that describe how this, and similar tasks, can be done. I don't have such references off-hand, but will add these later. Beware they do will involve a healthy dose of linear algebra.

EDIT: as promised, here are a few links regarding algorithms to calculate page rank.
Efficient Computation of PageRank Haveliwala 1999 ///
Exploiting the Block Structure of the Web for Computing PR Kamvar etal 2003 ///
A fast two-stage algorithm for computing PageRank Lee et al. 2002

Although many of the authors of the links provided above are from Stanford, it doesn't take long to realize that the quest for efficient PageRank-like calculation is a hot field of research. I realize this material goes beyond the scope of the OP, but it is important to hint at the fact that the basic algorithm isn't practical for big webs.

To finish with a very accessible text (yet with many links to in-depth info), I'd like to mention Wikipedia's excellent article

If you're serious about this kind of things, you may consider an introductory/refresher class in maths, particularly linear algebra, as well a computer science class that deal with graphs in general. BTW, great suggestion from Michael Dorfman, in this post, for OCW's video of 1806's lectures.

I hope this helps a bit...

回复收藏 0 原文

梦在夏天 2024-08-12 04:53:37

如果您真的想为搜索引擎开发算法，我强烈建议您参加线性代数课程。在没有面对面课程的情况下，Gilbert Strang 的 MIT OCW 课程相当不错（视频讲座位于 http://ocw.mit.edu/OcwWeb/Mathematics/18-06Spring-2005/VideoLectures/）。

像这样的课程肯定会让您理解您提供的文档中的数学符号——该论文中的所有内容都是第一年线性代数课程中不会涵盖的。

我知道这不是您正在寻找的答案，但这确实是您的最佳选择。当你没有很好地掌握基本概念时，让别人尝试向你解释各个符号或算法并不是很好地利用任何人的时间。

回复收藏 0 原文