PageRank 及其数学:需要解释
我是一名学生,有兴趣开发一个搜索引擎来索引来自我的国家的页面。我已经研究了一段时间的算法,并且我认为 HITS 和 PageRank 是最好的。我决定使用 PageRank,因为它比 HITS 算法更稳定(或者我读过)。
我找到了无数与PageRank相关的文章和学术论文,但我的问题是我不理解这些论文中构成算法的大部分数学符号。具体来说,我不明白谷歌矩阵(不可约随机矩阵)是如何计算的。
我的理解基于这两篇文章:
- http://online.redwoods.cc.ca.us/instruct/darnold/LAPROJ/fall2005/levicob/LinAlgPaperFinal2-Screen.pdf
- http://ilpubs.stanford.edu:8090/386/1/1999-31.pdf
有人可以提供基本的解释吗(示例友善一点)用更少的数学符号?
提前致谢。
I am a student interested in developing a search engine that indexes pages from my country. I have been researching algorithms to use for sometime now and I have identified HITS and PageRank as the best out there. I have decided to go with PageRank since it is more stable than the HITS algorithm (or so I have read).
I have found countless articles and academic papers related to PageRank, but my problem is that I don't understand most of the mathematical symbols that form the algorithm in these papers. Specifically, I don't understand how the Google Matrix (the irreducible,stochastic matrix) is calculated.
My understanding is based on these two articles:
- http://online.redwoods.cc.ca.us/instruct/darnold/LAPROJ/fall2005/levicob/LinAlgPaperFinal2-Screen.pdf
- http://ilpubs.stanford.edu:8090/386/1/1999-31.pdf
Could someone provide a basic explanation (examples would be nice) with less mathematical symbols?
Thanks in advance.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
PageRank 的正式定义,如引用文档第 4 页所定义,用带有有趣的“E”符号的数学方程表示(它实际上是大写的 Sigma 希腊字母。Sigma 是字母“S”,在这里代表求和)。
简而言之,这个公式表示计算页面 X 的 PageRank...
这个公式的关键思想是链接到给定页面 X 的所有网页都是增加其“价值”的价值。通过链接到某个页面,他们“投票”支持该页面。然而,这种“投票”的权重或多或少取决于两个因素:
这两个因素反映了非常直观的想法:
正如您所注意到的,这个公式使用了有点循环引用,因为要知道 X 的页面范围,您需要知道链接到 X 的所有页面的 PageRank。那么您如何计算这些PageRank 值?...这就是文档部分中解释的下一个收敛问题开始的地方。
本质上,通过从所有页面的某些“随机”(或最好是“合理的猜测”)PageRank 值开始,然后计算使用上面的公式计算 PageRank 后,新的计算值会变得“更好”,当您迭代此过程几次时,这些值会收敛,即它们会越来越接近实际/理论值。因此,通过迭代足够多的次数,我们会达到这样的时刻:额外的迭代不会为上次迭代提供的值增加任何实际精度。
从理论上讲,这很好。将此算法转换为等效的算法,但可以更快地完成。有几篇论文描述了如何完成此任务以及类似的任务。我暂时没有这样的参考资料,但稍后会添加这些参考资料。请注意,它们确实会涉及大量的线性代数。
编辑:正如所承诺的,这里有一些有关计算页面排名算法的链接。
PageRank Haveliwala 1999 的高效计算 ///
利用网络的块结构进行计算 PR Kamvar 等人 2003 ///
一种快速两阶段算法计算 PageRank Lee 等人。 2002
虽然上面提供的链接的许多作者都来自斯坦福大学,但很快我们就意识到寻求高效的类似 PageRank 的计算是一个热门的研究领域。我意识到这些材料超出了 OP 的范围,但重要的是要暗示基本算法对于大型网络来说并不实用。
为了以非常易于理解的文本结束(但有许多深入信息的链接),我想提一下 维基百科的优秀文章
如果您认真对待这类事情,您可以考虑数学入门/复习课程,特别是线性代数,以及处理一般图形的计算机科学课程。顺便说一句,Michael Dorfman 在这篇文章中对 OCW 1806 讲座的视频提出了很好的建议。
我希望这会有所帮助......
The formal defintion of PageRank, as defined at page 4 of the cited document, is expressed in the mathematical equation with the funny "E" symbol (it is in fact the capital Sigma Greek letter. Sigma is the letter "S" which here stands for Summation).
In a nutshell this formula says that to calculate the PageRank of page X...
The key idea being this formula is that all web pages that link to a given page X are adding to value to its "worth". By linking to some page they are "voting" in favor of this page. However this "vote" has more or less weight, depending on two factors:
These two factors reflect very intuitive ideas:
As you notice, this formula makes use of somewhat of a circular reference, because to know the page range of X, you need to know the PageRank of all pages linking to X. Then how do you figure these PageRank values?... That's where the next issue of convergence explained in the section of the document kick in.
Essentially, by starting with some "random" (or preferably "decent guess" values of PageRank, for all pages, and by calculating the PageRank with the formula above, the new calculated values get "better", as you iterate this process a few times. The values converge, i.e. they each get closer and closer to what is the actual/theorical value. Therefore by iterating a sufficient amount of times, we reach a moment when additional iterations would not add any practical precision to the values provided by the last iteration.
Now... That is nice and dandy, in theory. The trick is to convert this algorithm to something equivalent but which can be done more quickly. There are several papers that describe how this, and similar tasks, can be done. I don't have such references off-hand, but will add these later. Beware they do will involve a healthy dose of linear algebra.
EDIT: as promised, here are a few links regarding algorithms to calculate page rank.
Efficient Computation of PageRank Haveliwala 1999 ///
Exploiting the Block Structure of the Web for Computing PR Kamvar etal 2003 ///
A fast two-stage algorithm for computing PageRank Lee et al. 2002
Although many of the authors of the links provided above are from Stanford, it doesn't take long to realize that the quest for efficient PageRank-like calculation is a hot field of research. I realize this material goes beyond the scope of the OP, but it is important to hint at the fact that the basic algorithm isn't practical for big webs.
To finish with a very accessible text (yet with many links to in-depth info), I'd like to mention Wikipedia's excellent article
If you're serious about this kind of things, you may consider an introductory/refresher class in maths, particularly linear algebra, as well a computer science class that deal with graphs in general. BTW, great suggestion from Michael Dorfman, in this post, for OCW's video of 1806's lectures.
I hope this helps a bit...
如果您真的想为搜索引擎开发算法,我强烈建议您参加线性代数课程。在没有面对面课程的情况下,Gilbert Strang 的 MIT OCW 课程相当不错(视频讲座位于 http://ocw.mit.edu/OcwWeb/Mathematics/18-06Spring-2005/VideoLectures/)。
像这样的课程肯定会让您理解您提供的文档中的数学符号——该论文中的所有内容都是第一年线性代数课程中不会涵盖的。
我知道这不是您正在寻找的答案,但这确实是您的最佳选择。当你没有很好地掌握基本概念时,让别人尝试向你解释各个符号或算法并不是很好地利用任何人的时间。
If you are serious about developing an algorithm for a search engine, I'd seriously recommend you take a Linear Algebra course. In the absence of an in-person course, the MIT OCW course by Gilbert Strang is quite good (video lectures at http://ocw.mit.edu/OcwWeb/Mathematics/18-06Spring-2005/VideoLectures/).
A class like this would certainly allow you to understand the mathematical symbols in the document you provide-- there's nothing in that paper that wouldn't be covered in a first-year Linear Algebra course.
I know this isn't the answer you are looking for, but it's really the best option for you. Having someone try to explain the individual symbols or algorithms to you when you don't have a good grasp of the basic concepts isn't a very good use of anybody's time.
这是您需要的论文:http://infolab.stanford.edu/~backrub/google .html(如果您不认识作者的姓名,可以在此处找到有关他们的更多信息:http://www.google.com/corporate/execs.html)。
文件中使用的符号以通俗英语描述。
谢谢你让我用谷歌搜索这个。
This is the paper that you need: http://infolab.stanford.edu/~backrub/google.html (If you do not recognise the names of the authors, you will find more information about them here: http://www.google.com/corporate/execs.html).
The symbols used in the document, are described in the document in lay English.
Thanks for making me google this.
您可能还想阅读由 David Austin 撰写的关于构建 Pagerank 矩阵背后的数学原理的介绍性教程,标题为 Google 如何在网络大海捞针中找到你的目标;它从一个简单的示例开始,然后构建完整的定义。
You might also want to read the introductory tutorial on the mathematics behind the construction of the Pagerank matrix written by David Austin's entitled How Google Finds Your Needle in the Web's Haystack; it starts with a simple example and builds to the full definition.
“价值 25,000,000,000 美元的特征向量:Google 背后的线性代数”。 来自 Rose- Hulman 有点过时了,因为现在 Page Rank 是 $491B 的线性代数问题。我认为这篇论文写得很好。
“编程集体智慧”也对页面排名进行了很好的讨论。
"The $25,000,000,000 Eigenvector: The Linear Algebra Behind Google". from Rose-Hulman is a bit out of date, because now Page Rank is the $491B linear algebra problem. I think the paper is very well written.
"Programming Collective Intelligence" has a nice discussion of Page Rank as well.
达菲莫发布了我认为最好的参考。我在本科四年级时研究了页面排名算法。页面排名正在执行以下操作:
定义从站点 u 转换到 v 的概率,其中存在从 u 到 v 的传出链接
1/u_{n},其中 u_{n} 是从 u 发出的传出链接的数量。
假设上面定义的马尔可夫链是不可约的(这可以通过仅稍微降低结果来强制执行)
Google 使用幂法的细微变化来查找平稳分布(幂法查找主导特征值)。除此之外没有什么可做的。它相当简单和优雅,可能是我能想到的最简单的马尔可夫链应用之一,但它值很多钱!
因此,PageRank 算法所做的就是考虑网络的拓扑结构,作为网站是否重要的指示。站点拥有的传入链接越多,随机粒子在无限时间内停留在站点上的概率就越大。
Duffymo posted the best refernce in my opinion. I studied the page rank algorithm in my senior undergrad year. Page rank is doing the following:
Define the probability of transitioning from site u to v where the there is an outgoing link to v from u to be
1/u_{n} where u_{n} is the number of out going links from u.
Assume the markov chain defined above is irreducible (this can be enforced with only a slight degradation of the results)
Google uses a slight variation on the power method to find the stationary distribution (the power method finds dominant eigenvalues). Other than that there is nothing to it. Its rather simple and elegant and probably one of the simplest applications of markov chains I can think of, but it is wortha lot of money!
So all the pagerank algorithm does is take into account the topology of the web as an indication of whether a website should be important. The more incoming links a site has the greater the probability of a random particle spending its time at the site over an infinite amount of time.
如果您想通过更少的数学了解有关页面排名的更多信息,那么 这是关于基本矩阵运算的非常好的教程。我向所有没有多少数学背景但想深入研究排名算法的人推荐它。
If you want to learn more about page rank with less math, then this is very good tutorial on basic matrix operations. I recommend it for everyone who has little math background but wants to dive into ranking algorithms.