Spider interval for robots.txt

Posted 2024-10-20 21:24:37


I have been reading up on web crawling and have put together a list of considerations; however, there is one concern that I have not found any discussion about yet.

How often should robots.txt be fetched for any given site?

My scenario is, for any specific site, a very slow crawl with maybe 100 pages a day.
Let's say a website adds a new section (/humans-only/) that other pages link to, and at the same time adds the appropriate line to robots.txt. A spider might find links to this section before it re-fetches robots.txt.

Funny how writing down a problem can suggest the solution.
While formulating my question above, I got an idea for a solution.

robots.txt can be re-fetched infrequently, say once a day.
But all newly found links should be held in a queue until the next robots.txt refresh. Once robots.txt has been refreshed, any pending links that still pass its rules can be crawled.
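Roughly, something like this, as a minimal sketch using Python's standard urllib.robotparser; the HoldingQueue class and its method names are just placeholders for the idea, not an existing API:

import time
from urllib.robotparser import RobotFileParser

class HoldingQueue:
    """Newly discovered URLs wait here until robots.txt is re-fetched."""

    def __init__(self, robots_url, refresh_interval=24 * 3600):
        self.refresh_interval = refresh_interval    # assume a daily refresh
        self.last_refresh = 0.0
        self.parser = RobotFileParser(robots_url)
        self.pending = []   # links found since the last refresh
        self.ready = []     # links cleared by the most recent robots.txt

    def add(self, url):
        # New links are not crawled right away; they wait for the next refresh.
        self.pending.append(url)

    def refresh_if_due(self, user_agent="MyCrawler"):
        if time.time() - self.last_refresh < self.refresh_interval:
            return
        self.parser.read()                          # re-download robots.txt
        self.last_refresh = time.time()
        # Release only the pending links that the fresh robots.txt allows.
        self.ready.extend(u for u in self.pending
                          if self.parser.can_fetch(user_agent, u))
        self.pending.clear()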

Any other ideas or practical experience with this?


Comments (1)

对你不离不弃 2024-10-27 21:24:37


All large-scale Web crawlers cache robots.txt for some period of time. One day is pretty common, and in the past I've seen times as long as a week. Our crawler has a maximum cache time of 24 hours. In practice, it's typically less than that except for sites that we crawl very often.
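For illustration, here is a minimal sketch of such a per-host cache with a 24-hour maximum age, using Python's urllib.robotparser; the RobotsCache name and its fields are only an example, not code from any real crawler:

import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

class RobotsCache:
    def __init__(self, max_age=24 * 3600):
        self.max_age = max_age      # seconds; 24 hours as the upper bound
        self._cache = {}            # host -> (fetch_time, parsed robots.txt)

    def allowed(self, user_agent, url):
        host = urlparse(url).netloc
        entry = self._cache.get(host)
        if entry is None or time.time() - entry[0] > self.max_age:
            parser = RobotFileParser("https://%s/robots.txt" % host)
            parser.read()           # re-download at most once per max_age
            entry = (time.time(), parser)
            self._cache[host] = entry
        return entry[1].can_fetch(user_agent, url)

A crawler would call allowed() right before each fetch and simply obey whatever the cached copy says at that moment.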

If you hold links to wait for a future version of robots.txt, then you're adding an artificial 24-hour latency to your crawl. That is, if you crawl my site today then you have to hold all those links for up to 24 hours before you download my robots.txt file again and verify that the links you crawled were allowed at the time. And you could be wrong as often as you're right. Let's say the following happens:

2011-03-08 06:00:00 - You download my robots.txt
2011-03-08 08:00:00 - You crawl the /humans-only/ directory on my site
2011-03-08 22:00:00 - I change my robots.txt to restrict crawlers from accessing /humans-only/
2011-03-09 06:30:00 - You download my robots.txt and throw out the /humans-only/ links.

At the time you crawled, you were allowed to access that directory, so there was no problem with you publishing the links.

You could use the last modified date returned by the Web server when you download robots.txt to determine if you were allowed to read those files at the time, but a lot of servers lie when returning the last modified date. Some large percentage (I don't remember what it is) always return the current date/time as the last modified date because all of their content, including robots.txt, is generated at access time.
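For what it's worth, reading that header is straightforward; a rough sketch (the function name is mine, and the Last-Modified value should be treated as a hint at best):

from email.utils import parsedate_to_datetime
from urllib.request import urlopen

def fetch_robots_with_mtime(host):
    # Download robots.txt and note the server-reported Last-Modified time.
    # Dynamic sites often report "now" here, so treat the value as a hint only.
    with urlopen("https://%s/robots.txt" % host) as resp:
        body = resp.read().decode("utf-8", errors="replace")
        header = resp.headers.get("Last-Modified")
        mtime = parsedate_to_datetime(header) if header else None
    return body, mtime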

Also, adding that restriction to your bot means that you'll have to visit their robots.txt file again even if you don't intend to crawl their site. Otherwise, links will languish in your cache. Your proposed technique raises a lot of issues that you can't handle gracefully. Your best bet is to operate with the information you have at hand.

Most site operators understand robots.txt caching and will look the other way if your bot hits a restricted directory on their site within 24 hours of a robots.txt change, provided, of course, that you didn't read robots.txt and then go ahead and crawl the restricted pages. For the few who question the behavior, a simple explanation of what happened is usually sufficient.

As long as you're open about what your crawler is doing, and you provide a way for site operators to contact you, most misunderstandings are easily corrected. There are a few--a very few--people who will accuse you of all kinds of nefarious activities. Your best bet with them is to apologize for causing a problem and then block your bot from ever visiting their sites.
