It depends. If each thread has its own separate queue of URLs to crawl and there is no synchronization of any kind between the queues, you could end up violating a site's robots.txt when two (or more) threads crawl URLs for the same site in quick succession. Of course, a well-designed crawler would not do that!
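As a rough illustration of the kind of synchronization meant here, the sketch below shares one small "gate" object between all threads, so that requests to the same host are spaced out no matter which thread issues them. The `HostGate` class, its `wait_for_turn` method and the fixed `min_delay_seconds` are made up for this example; a real crawler would pick the delay per site.

```python
import threading
import time
from urllib.parse import urlsplit


class HostGate:
    """Shared by all crawler threads; spaces out requests to the same host.

    Hypothetical helper for illustration, not part of any library.
    """

    def __init__(self, min_delay_seconds):
        self.min_delay = min_delay_seconds
        self._lock = threading.Lock()
        self._next_allowed = {}  # host -> earliest monotonic time of next request

    def wait_for_turn(self, url):
        """Block until this thread may request `url`; return seconds waited."""
        host = urlsplit(url).netloc
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_allowed.get(host, now))
            # Reserve the slot while holding the lock, so that no other
            # thread can claim the same time window for this host.
            self._next_allowed[host] = start + self.min_delay
        delay = start - now
        if delay > 0:
            time.sleep(delay)  # sleep outside the lock; other hosts proceed
        return delay
```

Each thread would call `gate.wait_for_turn(url)` right before fetching. Threads working on different hosts never wait for each other, since the lock is only held for the bookkeeping.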
The very "simple" crawlers have some sort of shared priority queue, where work is queued in accordance with the various Robots Exclusion Protocols and all the threads pull URLs to be crawled from that queue. There are many problems with such an approach, especially when trying to scale up and crawl the entire World Wide Web.
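A minimal sketch of that "simple" design, assuming a fixed per-host delay: every URL is queued with a "not before" offset, and all worker threads pull from the same `queue.PriorityQueue`. The names `build_frontier`, `worker` and `crawl` and the `fetch` callback are placeholders invented for this example. It also hints at one of the problems: the schedule is fixed up front, and a thread that pulls a far-future URL just sleeps.

```python
import itertools
import queue
import threading
import time
from urllib.parse import urlsplit


def build_frontier(urls, per_host_delay):
    """Queue each URL with a 'not before' offset so same-host URLs are spaced."""
    frontier = queue.PriorityQueue()
    next_slot = {}             # host -> offset (seconds) of its next free slot
    order = itertools.count()  # tie-breaker so equal offsets never compare URLs
    for url in urls:
        host = urlsplit(url).netloc
        offset = next_slot.get(host, 0.0)
        next_slot[host] = offset + per_host_delay
        frontier.put((offset, next(order), url))
    return frontier


def worker(frontier, started_at, fetch, results):
    """Pull URLs from the shared queue until it is empty."""
    while True:
        try:
            offset, _, url = frontier.get_nowait()
        except queue.Empty:
            return
        wait = started_at + offset - time.monotonic()
        if wait > 0:
            time.sleep(wait)  # this thread idles until the URL's slot arrives
        results.append((url, fetch(url)))  # list.append is thread-safe in CPython


def crawl(urls, fetch, per_host_delay=1.0, num_threads=4):
    """Crawl `urls` with several threads sharing one priority queue."""
    frontier = build_frontier(urls, per_host_delay)
    results = []
    started_at = time.monotonic()
    threads = [
        threading.Thread(target=worker, args=(frontier, started_at, fetch, results))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Note that the spacing here is between scheduled start times, not between actual request completions, so a slow thread can still shrink the real gap between two requests to one host.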
The more advanced crawlers perform "budget" calculations (see the BEAST budget enforcement section), which allow them to schedule crawling intelligently based on various criteria: spam indicators, robots.txt, coverage vs. freshness, etc. Budget enforcement makes it much easier for multithreaded crawlers to crawl fast and crawl politely!
They are unrelated. robots.txt says whether or not you are allowed to access something. Its standard rules have no way to say "please send only one request at a time".
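To illustrate the "allowed or not" part, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `MyCrawler` user agent and the `allowed` helper are made up for the example. The answer you get back is a plain yes/no per URL.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (invented for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("MyCrawler", "http://example.com/public/page"))   # True
print(allowed("MyCrawler", "http://example.com/private/data"))  # False
```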