What should a crawler's seed list contain?
I've been reading about how to implement a crawler.
I understand that we start with a list of URLs to visit (the seed list).
We visit all of those URLs and add every link found on the visited pages to another list, the frontier.
So how much should I put in this seed list? Do I just add as many URLs as I can and hope they lead me to as many URLs on the web as possible, and does that actually guarantee that I'll reach all the other URLs out there?
Or is there some convention for doing this? I mean... what does a search engine like Google do?
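For reference, here is a minimal sketch of the crawl loop as I understand it (Python, standard library only; the `example.com` seed, the page limit, and the lack of robots.txt/politeness handling are simplifications on my part, not how a real engine does it):

```python
# A minimal breadth-first crawler sketch: start from a seed list,
# fetch each page, and push discovered links onto the frontier.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)   # URLs waiting to be visited
    visited = set()           # URLs already fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
        except Exception:
            continue          # skip pages that fail to download
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)     # resolve relative links
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)     # grow the frontier
    return visited

if __name__ == "__main__":
    # Hypothetical seed list; any well-linked starting pages would do.
    pages = crawl(["https://example.com/"], max_pages=10)
    print(f"Crawled {len(pages)} pages")
```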
It's basically that: they build a big list of web sites by following the connections (links) between them. The more web sites your search engine knows about, the better. The only issue is making that list useful. That is, a huge list of candidate pages does not by itself give you a good result set for a search, so you also have to be able to tell how important each web page is.
Beyond the processing power you have available, there's no particular point where you need to stop crawling.
That does not ensure you'll reach every single URL out there, but it's basically the only practical way to crawl the web.
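If it helps, here's a toy example of the kind of link-based scoring I mean: a simplified PageRank-style calculation over a made-up link graph. The graph, damping factor, and iteration count below are illustrative assumptions, not what any real engine actually runs (real systems also handle dangling pages and scale to billions of nodes):

```python
# Toy PageRank-style scoring: pages gain importance from the pages
# that link to them. graph maps each page to the pages it links to.
def pagerank(graph, damping=0.85, iterations=20):
    pages = list(graph)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outlinks in graph.items():
            if not outlinks:
                continue  # simplification: dangling pages just drop their share
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                if target in new_rank:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: A links to B and C, B links to C, C links back to A.
example = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(example))
```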