What is a decent update interval for a web crawler?
I am currently working on my own little web crawler thingy and was wondering...
What is a decent interval for a web crawler to visit the same sites again?
Should you revisit them once a day? Once per hour? I really do not know... does anybody have experience in this matter? Perhaps someone can point me in the right direction?
3 Answers
I think your crawler's visits need to be organic.
I'd start by crawling the list once a week,
and when a site's content changes, set that one to crawl twice a week,
and then, when you see more frequent changes, you crawl more frequently.
The algorithm needs to be smart enough to know the difference between one-off edits and frequent site changes.
Also, never forget to pay attention to robots.txt... that's the first page you should hit in a crawl, and you need to respect its contents above all else.
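Roughly, that adaptive schedule could look like the sketch below (Python, since this is a small hobby crawler). The class name, the interval bounds, and the halve/back-off factors are my own assumptions rather than anything prescribed; the idea is just to keep a per-site interval that tightens while content keeps changing and relaxes again when it stops.

```python
import hashlib

# Interval bounds are assumptions, not from the answer -- tune to taste.
MIN_INTERVAL = 60 * 60            # never revisit more often than hourly
MAX_INTERVAL = 7 * 24 * 60 * 60   # start at (and never exceed) one week

class RevisitScheduler:
    """Track one site: shrink the revisit interval while its content
    keeps changing, grow it back while the content stays the same."""

    def __init__(self, url):
        self.url = url
        self.interval = MAX_INTERVAL   # "start by crawling once a week"
        self.last_hash = None

    def record_fetch(self, body: bytes) -> int:
        digest = hashlib.sha256(body).hexdigest()
        if self.last_hash is not None and digest != self.last_hash:
            # Content changed since the last visit: crawl twice as often.
            self.interval = max(MIN_INTERVAL, self.interval // 2)
        else:
            # No change (or first visit): back off again, so a one-off
            # edit does not lock in a high frequency forever.
            self.interval = min(MAX_INTERVAL, int(self.interval * 3 // 2))
        self.last_hash = digest
        return self.interval   # seconds until the next visit
```

Halving on change and growing back more slowly is only one reasonable policy; the point is that each site gets its own interval instead of a fixed one.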
It's going to depend on the sites you are crawling and what you are doing with the results.
Some will not object to a fairly frequent visitation rate, but others might restrict you to one visit every day, for example.
A lot of sites are keen to protect their content (witness Murdoch and News International railing against Google and putting the Times (UK) behind a paywall), so they view crawlers with distrust.
If you are only going to crawl a few sites, then it would be worth contacting the site owners, explaining what you want to do, and seeing what they reply. If they do reply, respect their wishes and always obey the robots.txt file.
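For the robots.txt part, Python's standard urllib.robotparser already does the parsing, including any Crawl-delay or Request-rate the owner has set. A minimal sketch, where the user-agent string and URLs are placeholders:

```python
import urllib.robotparser

USER_AGENT = "MyLittleCrawler/0.1"   # placeholder; identify your bot honestly

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                            # fetch and parse the file once per site

url = "https://example.com/some/page.html"
if not rp.can_fetch(USER_AGENT, url):
    print("robots.txt forbids this URL, skip it")
else:
    # Some owners also say how often you may come back; honour it if present.
    print("allowed; crawl-delay:", rp.crawl_delay(USER_AGENT))
    print("request-rate:", rp.request_rate(USER_AGENT))
```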
Even an hour can be impolite depending on what sites you are spidering and how intensely. I assume you are doing this as an exercise, so help save the world and limit yourself to sites that are built to handle huge loads and then only get HTTP headers first to see if you need to even get the page.
Even more polite would be to spider a limited set first with wget, store it locally and crawl against your cache. If you aren't doing this as an exercise, there is no reason to do it, as it has been done to death and the interwebz doesn't need another one.
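If it helps, "only get HTTP headers first" usually means a HEAD request or a conditional GET with If-Modified-Since / If-None-Match, so an unchanged page costs the server almost nothing. A rough standard-library sketch, with a placeholder URL and user-agent:

```python
import urllib.error
import urllib.request

def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None if the server
    answers 304 Not Modified, i.e. there is nothing new to download."""
    req = urllib.request.Request(url, headers={"User-Agent": "MyLittleCrawler/0.1"})
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return (resp.read(),
                    resp.headers.get("ETag"),
                    resp.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code == 304:          # not modified: nothing new to fetch
            return None, etag, last_modified
        raise

# First visit downloads the page; later visits reuse the validators it returned.
body, etag, modified = fetch_if_changed("https://example.com/")
body_again, etag, modified = fetch_if_changed("https://example.com/", etag, modified)
```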