Do threads violate robots.txt?

Posted on 2024-11-14 02:43:05

Comments (2)

命硬 2024-11-21 02:43:05

It depends: if your threads have their own separate queues of URLs to be crawled and there is no synchronization of any kind between the queues, then you could end up violating a site's robots.txt when two (or more) threads attempt to crawl URLs for the same site in quick succession. Of course, a well-designed crawler would not do that!
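
As a minimal sketch of the kind of synchronization this alludes to: all threads share one per-host "last fetched" table, so no two of them hit the same site in quick succession. The class name, delay value, and layout here are illustrative assumptions, not part of any particular crawler.

```python
import threading
import time
from urllib.parse import urlparse

class HostThrottle:
    """Shared by every crawler thread; enforces a minimum delay per host."""

    def __init__(self, min_delay_seconds: float = 5.0):
        self.min_delay = min_delay_seconds
        self.last_fetch = {}          # host -> monotonic time of last request
        self.lock = threading.Lock()  # protects last_fetch across threads

    def wait_turn(self, url: str) -> None:
        """Block until this thread may politely fetch the given URL."""
        host = urlparse(url).netloc
        while True:
            with self.lock:
                elapsed = time.monotonic() - self.last_fetch.get(host, float("-inf"))
                if elapsed >= self.min_delay:
                    self.last_fetch[host] = time.monotonic()
                    return
                remaining = self.min_delay - elapsed
            time.sleep(remaining)  # sleep outside the lock, then re-check
```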

Very "simple" crawlers have some sort of shared priority queue where work is queued in accordance with the various Robots Exclusion Protocols, and all the threads pull URLs to be crawled from that queue. There are many problems with such an approach, especially when trying to scale up and crawl the entire World Wide Web.
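
A rough sketch of that shared-queue shape, assuming a Python crawler: one queue.Queue that every worker thread pulls from, with an admission check against the host's robots.txt (via the standard-library urllib.robotparser) before anything is queued. The user-agent string and cache layout are made up for illustration.

```python
import queue
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "MyCrawler"   # hypothetical user-agent string
frontier = queue.Queue()   # the single shared queue all worker threads pull from
robots_cache = {}          # host -> parsed robots.txt
                           # (a real multithreaded crawler would guard this with a lock)

def allowed(url: str) -> bool:
    """Consult the site's robots.txt before a URL is ever admitted to the queue."""
    host = urlparse(url).netloc
    rp = robots_cache.get(host)
    if rp is None:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"https://{host}/robots.txt")
        rp.read()          # download and parse robots.txt once per host
        robots_cache[host] = rp
    return rp.can_fetch(USER_AGENT, url)

def enqueue(url: str) -> None:
    if allowed(url):
        frontier.put(url)  # workers call frontier.get() and fetch the URL
```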

More advanced crawlers perform "budget" calculations (see the BEAST budget enforcement section), which allow them to intelligently schedule crawling based on various criteria: spam indicators, robots.txt, coverage vs. freshness, etc. Budget enforcement makes it much easier for multithreaded crawlers to crawl fast and crawl politely!
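
As a purely illustrative sketch of what a per-host "budget" might look like (this is not the actual BEAST algorithm, just a toy formula under assumed inputs): each host gets a number of fetch slots per scheduling round, shrunk by spam signals, grown by staleness, and capped by any crawl delay the site requests.

```python
from typing import Optional

def host_budget(base_budget: int,
                spam_score: float,
                stale_fraction: float,
                crawl_delay: Optional[float] = None,
                round_seconds: float = 3600.0) -> int:
    """How many URLs this host may receive in the next crawl round.

    spam_score     -- 0.0 (clean) .. 1.0 (spammy); shrinks the budget
    stale_fraction -- share of known pages that look out of date; grows it
    crawl_delay    -- Crawl-delay hint from robots.txt, if the site provides one
    """
    budget = base_budget * (1.0 - spam_score) * (1.0 + stale_fraction)
    if crawl_delay:
        # A delay of N seconds allows at most round_seconds / N polite requests.
        budget = min(budget, round_seconds / crawl_delay)
    return max(0, int(budget))
```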

顾北清歌寒 2024-11-21 02:43:05

They are unrelated. robots.txt says whether or not you are allowed to access something. It doesn't have a way to say "please send only one request at a time".
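
To make that concrete, here is the kind of check robots.txt does support, using Python's standard urllib.robotparser (example.com and the user-agent string are placeholders): it answers "may I fetch this URL at all?", not "how many requests may I send in parallel?".

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()                                     # fetch and parse the rules

# Allow/disallow is per URL and per user agent -- nothing here limits concurrency.
print(rp.can_fetch("MyCrawler", "https://example.com/some/page"))
```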
