Designing a web crawler
I have come across an interview question: "If you were designing a web crawler, how would you avoid getting into infinite loops?" and I am trying to answer it.
How does it all begin from the beginning?
Say Google started with some hub pages, say hundreds of them (how these hub pages were found in the first place is a different sub-question).
As Google follows links from a page and so on, does it keep building a hash table to make sure that it doesn't follow pages it has already visited?
What if the same page has two names (URLs), say in these days when we have URL shorteners etc.?
I have taken Google as an example. Though Google doesn't reveal how its web crawler algorithms, page ranking, etc. work, do you have any guesses?
If you want to get a detailed answer, take a look at section 3.8 of this paper, which describes the URL-seen test of a modern scraper:
Basically, they hash all of the URLs with a hashing function that guarantees a unique hash for each URL, and because of the locality of URLs, it becomes very easy to find URLs. Google even open-sourced their hashing function: CityHash
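For illustration, here is a minimal in-memory sketch of a URL-seen test in Python. It is not the implementation from the paper and does not use CityHash; a plain SHA-1 digest and a Python set stand in for the fingerprint store, and the `UrlSeenTest` class name is made up.

```python
import hashlib

class UrlSeenTest:
    """Minimal in-memory URL-seen test: store a fixed-size digest of each
    URL and skip any URL whose digest has been seen before."""

    def __init__(self):
        self.seen = set()

    def is_new(self, url: str) -> bool:
        # A real crawler would canonicalize the URL first (lowercase the host,
        # strip fragments, resolve relative paths) before hashing it.
        digest = hashlib.sha1(url.encode("utf-8")).digest()
        if digest in self.seen:
            return False
        self.seen.add(digest)
        return True

seen = UrlSeenTest()
print(seen.is_new("http://example.com/a"))  # True  (first time seen)
print(seen.is_new("http://example.com/a"))  # False (already visited)
```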
WARNING!
They might also be talking about bot traps!!! A bot trap is a section of a page that keeps generating new links with unique URLs and you will essentially get trapped in an "infinite loop" by following the links that are being served by that page. This is not exactly a loop, because a loop would be the result of visiting the same URL, but it's an infinite chain of URLs which you should avoid crawling.
Update 12/13/2012 - the day after the world was supposed to end :)
Per Fr0zenFyr's comment: if one uses the AOPIC algorithm for selecting pages, then it's fairly easy to avoid bot traps of the infinite-loop kind. Here is a summary of how AOPIC works: every page starts with an equal share of credit; the crawler repeatedly picks the page with the most credit, sends a small tax (say 10%) of that credit to a virtual "Lambda" page, distributes the rest equally among the pages it links to, and resets the crawled page's credit to zero.
Since the Lambda page continuously collects tax, eventually it will be the page with the largest amount of credit and we'll have to "crawl" it. I say "crawl" in quotes because we don't actually make an HTTP request to the Lambda page; we just take its credits and distribute them equally to all of the pages in our database.
Since bot traps only give internal links credit and rarely get credit from the outside, they continually leak credit (from taxation) to the Lambda page. The Lambda page distributes that credit evenly to all of the pages in the database, and upon each cycle the bot-trap page loses more and more credit, until it has so little credit that it almost never gets crawled again. This will not happen with good pages, because they often get credit from backlinks found on other pages. This also results in a dynamic page rank: any time you take a snapshot of your database and order the pages by the amount of credit they have, they will most likely be ordered roughly according to their true page rank.
This only avoids bot traps of the infinite-loop kind; there are many other bot traps which you should watch out for, and there are ways to get around them too.
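To make the credit/tax cycle described above concrete, here is a toy Python sketch. It is not the actual AOPIC implementation; the link graph, the page names, and the 10% tax rate are assumptions made for illustration.

```python
# Toy sketch of the credit/tax cycle described above (not the real AOPIC code).
TAX_RATE = 0.10

graph = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["A", "B"],
    "trap": ["trap"],   # a bot trap: it only links to itself
    "Lambda": [],       # virtual page, never actually fetched
}

credit = {page: 1.0 / (len(graph) - 1) for page in graph if page != "Lambda"}
credit["Lambda"] = 0.0

def crawl_step():
    page = max(credit, key=credit.get)      # pick the page with the most credit
    cash = credit[page]
    credit[page] = 0.0
    if page == "Lambda":
        # "Crawling" Lambda: spread its accumulated tax evenly over all real pages.
        share = cash / (len(credit) - 1)
        for p in credit:
            if p != "Lambda":
                credit[p] += share
        return
    tax = cash * TAX_RATE
    credit["Lambda"] += tax                 # Lambda collects the tax
    outlinks = graph[page]
    if outlinks:
        share = (cash - tax) / len(outlinks)
        for p in outlinks:
            credit[p] += share              # the rest goes to the linked pages

for _ in range(200):
    crawl_step()

print(sorted(credit.items(), key=lambda kv: -kv[1]))
# Expect the self-linking "trap" page to end up well below the cross-linked pages.
```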
While everybody here has already suggested how to create your web crawler, here is how Google ranks pages.
Google gives each page a rank based on the number of backlinks (how many links on other websites point to a specific website/page). This is called the relevance score. It is based on the fact that if a page has many other pages linking to it, it's probably an important page.
Each site/page is viewed as a node in a graph. Links to other pages are directed edges. The in-degree of a vertex is defined as the number of incoming edges. Nodes with a higher number of incoming edges are ranked higher.
Here's how the PageRank is determined. Suppose that page Pj has Lj links. If one of those links is to page Pi, then Pj will pass on 1/Lj of its importance to Pi. The importance ranking of Pi is then the sum of all the contributions made by pages linking to it. So if we denote the set of pages linking to Pi by Bi, then we have this formula:
I(Pi) = Σ_{Pj in Bi} I(Pj) / Lj
The ranks are placed in a matrix called the hyperlink matrix, H[i,j]. An entry in this matrix is either 0, or 1/Lj if there is a link from page Pj to page Pi (that is, if Pj is in Bi). Another property of this matrix is that if we sum all the rows in a column, we get 1.
Now we need to multiply this matrix by an eigenvector named I (with eigenvalue 1), such that I = H·I.
Now we start iterating: H·I, H^2·I, H^3·I, ..., H^k·I until the solution converges, i.e. we get pretty much the same numbers in the vector at steps k and k+1.
Now whatever is left in the I vector is the importance of each page.
For a simple class homework example see http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html
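Here is a toy power-iteration sketch in Python of the simplified PageRank described above (an illustration, not Google's implementation). The link graph and iteration count are made up, and the damping factor used in real PageRank is omitted to match the simplified description.

```python
# Toy power iteration for the simplified PageRank described above.
links = {              # page -> pages it links to (assumed toy graph)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A", "B"],
}
pages = sorted(links)
n = len(pages)
idx = {p: k for k, p in enumerate(pages)}

# Hyperlink matrix: H[i][j] = 1/Lj if page j links to page i, else 0.
# Each column sums to 1, matching the property mentioned above.
H = [[0.0] * n for _ in range(n)]
for pj in pages:
    out = links[pj]
    for pi in out:
        H[idx[pi]][idx[pj]] = 1.0 / len(out)

# Start with a uniform importance vector I and iterate I <- H * I.
I = [1.0 / n] * n
for _ in range(100):
    I = [sum(H[i][j] * I[j] for j in range(n)) for i in range(n)]

print({p: round(I[idx[p]], 4) for p in pages})
# C comes out highest here: both A and B link to it.
```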
Depends on how deep their question was intended to be. If they were just trying to avoid following the same links back and forth, then hashing the URLs would be sufficient.
What about content that has literally thousands of URLs that lead to the same content? Like a query-string parameter that doesn't affect anything but can have an infinite number of iterations. I suppose you could hash the contents of the page as well and compare them to see whether pages are similar, to catch content that is identified by multiple URLs. See, for example, the bot traps mentioned in @Lirik's post.
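As a rough illustration of that idea, here is a small Python sketch that canonicalizes URLs by dropping query-string parameters assumed to be irrelevant, and hashes page content for exact-duplicate detection. The `IGNORED_PARAMS` list and helper names are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of query parameters assumed not to change the content.
IGNORED_PARAMS = {"sessionid", "utm_source", "utm_medium", "ref"}

def canonical_url(url: str) -> str:
    """Drop ignored query parameters and sort the rest, so URLs that differ
    only in irrelevant parameters collapse to the same key."""
    parts = urlsplit(url)
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k.lower() not in IGNORED_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(sorted(params)), ""))

def content_fingerprint(html: str) -> bytes:
    """Exact-duplicate detection: hash the page body itself."""
    return hashlib.sha1(html.encode("utf-8")).digest()

print(canonical_url("http://example.com/item?id=7&sessionid=abc"))
print(canonical_url("http://example.com/item?sessionid=xyz&id=7"))
# Both print: http://example.com/item?id=7
```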
You'd have to have some sort of hash table to store the results in; you'd just have to check it before each page load.
The problem here is not crawling duplicated URLs, which is solved by an index using a hash obtained from the URLs. The problem is crawling DUPLICATED CONTENT. Each URL of a "crawler trap" is different (year, day, SessionID...).
There is no "perfect" solution... but you can use some of these strategies:
• Keep a field for the level at which the URL sits inside the website. For each cycle of getting URLs from a page, increase the level. It will be like a tree. You can stop crawling at a certain level, like 10 (I think Google uses this).
• You can try to create a kind of HASH which can be compared to find similar documents, since you can't compare against every document in your database. There is SimHash from Google, but I could not find any implementation to use, so I created my own. My hash counts low- and high-frequency characters inside the HTML code and generates a 20-byte hash, which is compared against a small cache of the last crawled pages inside an AVL tree, using a nearest-neighbour search with some tolerance (about 2). You can't use any reference to character locations in this hash. After "recognizing" the trap, you can record the URL pattern of the duplicate content and start to ignore pages with that pattern too. (A rough sketch of this idea appears after this list.)
• Like Google, you can create a ranking for each website and "trust" one more than the others.
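As a rough sketch of the fingerprint idea in the second bullet, here is a toy Python version: a short character-frequency vector compared with a small tolerance. This is neither the author's 20-byte hash nor Google's SimHash; the tracked character set and the tolerance value are assumptions.

```python
from collections import Counter

# Rough stand-in for the frequency-based fingerprint idea above.
TRACKED = "abcdefghijklmnopqrstuvwxyz<>/=\"' "   # assumed character set
TOLERANCE = 2                                     # assumed tolerance

def fingerprint(html: str) -> list:
    counts = Counter(html.lower())
    total = max(len(html), 1)
    # Quantize each tracked character's relative frequency into a 0-15 bucket.
    return [min(15, counts[ch] * 64 // total) for ch in TRACKED]

def near_duplicate(fp_a, fp_b) -> bool:
    # Two pages are treated as duplicates when their fingerprints are
    # almost identical (within the tolerance), even if their URLs differ.
    return sum(abs(a - b) for a, b in zip(fp_a, fp_b)) <= TOLERANCE

page1 = "<html><body>daily report for 2012-12-13</body></html>"
page2 = "<html><body>daily report for 2012-12-14</body></html>"
print(near_duplicate(fingerprint(page1), fingerprint(page2)))  # True
```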
A web crawler is a computer program used to collect/crawl key values (HREF links, image links, metadata, etc.) from a given website URL. It is designed to intelligently follow the HREF links that have already been fetched from the previous URL; in this way, the crawler can jump from one website to other websites. It is usually called a web spider or web bot. This mechanism acts as the backbone of a web search engine.
Please find the source code from my tech blog - http://www.algonuts.info/how-to-built-a-simple-web-crawler-in-php.html
Well, the web is basically a directed graph, so you can construct a graph out of the URLs and then do a BFS or DFS traversal while marking the visited nodes, so that you don't visit the same page twice.
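For illustration, here is a minimal BFS sketch in Python over a toy in-memory "web"; the link graph is made up, and a real crawler would replace the dictionary lookup with an HTTP fetch plus link extraction.

```python
from collections import deque

# A minimal BFS sketch over a toy in-memory "web" (page -> outgoing links).
# The visited set is what prevents infinite loops on link cycles.
toy_web = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/"],          # back-link: a cycle
    "http://c.example/": ["http://b.example/", "http://d.example/"],
    "http://d.example/": [],
}

def bfs_crawl(seed):
    visited = {seed}
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)                    # "fetch" the page here
        for link in toy_web.get(url, []):
            if link not in visited:          # the check that breaks cycles
                visited.add(link)
                queue.append(link)
    return order

print(bfs_crawl("http://a.example/"))
# ['http://a.example/', 'http://b.example/', 'http://c.example/', 'http://d.example/']
```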
This is a web crawler example, which can be used to collect MAC addresses for MAC spoofing.
Change the URL to crawl more sites... good luck.