How to maintain web politeness (avoid being banned) during web crawling?

Posted on 2024-10-08 18:10:48


  • A web bot crawling your site uses bandwidth resources.

  • Bots are numerous and serve many purposes, ranging from homemade tools, university research, and scrapers to new startups and established search engines (and probably many more categories).

Apart from large search engines, which can potentially send traffic to a site, why do webmasters allow other bots whose purpose they do not immediately know?
What are the incentives for webmasters to allow these bots?

2nd question is:

Should a distributed crawler with multiple crawl-agent nodes on the internet use a different User-Agent string for each agent? If they all use the same UA, the benefit of scaling via multiple agents is greatly reduced, because a large website with a high crawl-delay set may take weeks or months to crawl fully. For example, a million-page site with a 10-second crawl-delay needs roughly 115 days to fetch with a single agent.

3rd question:
Since robots.txt (the only defined crawl-control method) works at the domain level, should a crawler apply its politeness policy per domain or per IP (many websites are sometimes hosted on the same IP)?

How to tackle such web politeness problems? Any other related things to keep in mind?


Comments (1)

终止放荡 2024-10-15 18:10:48

  1. There are many useful bots besides search engine bots, and there are a growing number of search engines. In any case, the bots you want to block are probably using incorrect user-agent strings and ignoring your robots.txt files, so how are you going to stop them? You can block some at the IP level once you detect them, but for others it's hard (a rough log-scan sketch appears after this list).

  2. The user agent string has nothing to do with crawl rate. Millions of browser users are all using the same user agent string. Web sites throttle access based on your IP address. If you want to crawl their site faster you'll need more agents, but really, you shouldn't be doing that - your crawler should be polite and should be crawling each individual site slowly whilst making progress on many other sites.

  3. Crawler should be polite per-domain. A single IP may serve many different servers, but that's no sweat for the router that's passing packets to and fro. Each individual server will likely limit your ability to maintain multiple connections and how much bandwidth you can consume. There's also the one-web-site-served-by-many-IP-addresses scenario (e.g. round-robin DNS or something smarter): sometimes bandwidth and connection limits on sites like these will happen at the router level, so once again, be polite per domain (see the per-domain scheduler sketch below).
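
For point 1 above ("you can block some at the IP level once you detect them"), here is a rough, non-authoritative sketch of how a webmaster might spot aggressive clients in a web-server access log. The log path, the Common Log Format assumption, and the 1000-request threshold are illustrative choices, not anything stated in the answer.

```python
# Sketch: count requests per client IP in a Common Log Format access log and
# list the heaviest hitters as candidates for IP-level blocking.
from collections import Counter

LOG_PATH = "access.log"   # hypothetical path to the server's access log
THRESHOLD = 1000          # arbitrary per-log-file request count to call "aggressive"

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        # In Common Log Format the client IP is the first whitespace-separated field.
        ip = line.split(" ", 1)[0]
        hits[ip] += 1

for ip, count in hits.most_common():
    if count < THRESHOLD:
        break
    print(f"{ip}\t{count} requests - candidate for IP-level blocking")
```

For points 2 and 3 (crawl each site slowly while making progress on many others, and keep politeness state per domain), the following is a minimal sketch of a per-domain politeness policy using only the Python standard library. The bot name, the 10-second default delay, and the seed URLs are made-up placeholders; a real crawler would also need link extraction, persistence, error handling, and concurrency.

```python
# Sketch: one robots.txt and one "earliest next fetch" time per domain, with a
# round-robin frontier so no single site is hit faster than its crawl-delay.
import time
import urllib.parse
import urllib.request
import urllib.robotparser
from collections import deque

USER_AGENT = "ExampleBot/0.1 (+http://example.com/bot)"  # hypothetical bot identity
DEFAULT_DELAY = 10.0  # seconds between requests to a domain when robots.txt gives no crawl-delay

robots = {}        # domain -> parsed robots.txt rules
next_allowed = {}  # domain -> earliest time.monotonic() at which we may fetch again

def rules_for(url):
    """Fetch and cache robots.txt once per domain (not per IP)."""
    parts = urllib.parse.urlsplit(url)
    domain = parts.netloc.lower()
    if domain not in robots:
        rp = urllib.robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        try:
            rp.read()
        except OSError:
            pass  # network error fetching robots.txt; can_fetch() will then err on the side of not fetching
        robots[domain] = rp
    return robots[domain]

frontier = deque([
    "http://example.com/",   # placeholder seed URLs
    "http://example.org/",
])

while frontier:
    url = frontier.popleft()
    domain = urllib.parse.urlsplit(url).netloc.lower()
    rp = rules_for(url)
    if not rp.can_fetch(USER_AGENT, url):
        continue                      # disallowed by robots.txt: drop permanently
    if time.monotonic() < next_allowed.get(domain, 0.0):
        frontier.append(url)          # this domain must still wait; move on to other domains
        time.sleep(0.05)              # avoid busy-looping when every domain is waiting
        continue
    delay = rp.crawl_delay(USER_AGENT) or DEFAULT_DELAY
    next_allowed[domain] = time.monotonic() + delay
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        body = response.read()
    # (parsing links out of `body` and scheduling newly found URLs is omitted here)
```

Because the politeness state is keyed by domain (the netloc), adding more crawl-agent nodes or User-Agent strings does not speed up any single site; the gain from distribution comes from covering more domains in parallel, which matches the answer's advice.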
