Preventing a custom web crawler from being blocked

Posted 2024-12-08 10:42:21

I am creating a new web crawler in C# to crawl some specific websites. Everything goes fine, but the problem is that some websites block my crawler's IP address after a number of requests. I tried putting delays between my crawl requests, but that did not work.

Is there any way to prevent websites from blocking my crawler? Solutions like the following would help (but I need to know how to apply them):

  • simulating Googlebot or Yahoo! Slurp (see the sketch below)
  • using multiple IP addresses (even fake IP addresses) as the crawler's client IP

Any solution would help.
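
For the first bullet, here is a minimal C# sketch of sending requests with a Googlebot-style User-Agent via HttpClient. The URL is just a placeholder, and sites that verify Googlebot against Google's IP ranges (as the second answer below does) will not be fooled by the header alone:

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class SpoofedUserAgentDemo
    {
        static async Task Main()
        {
            using var client = new HttpClient();

            // Present the crawler as Googlebot. Many sites also check the source IP
            // (see the Apache/nginx rules below), so this alone may not help.
            client.DefaultRequestHeaders.TryAddWithoutValidation(
                "User-Agent",
                "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)");

            string html = await client.GetStringAsync("http://example.com/"); // placeholder URL
            Console.WriteLine($"Fetched {html.Length} characters");
        }
    }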

Comments (2)

﹏半生如梦愿梦如真 2024-12-15 10:42:21

If speed/throughput is not a huge concern, then probably the best solution is to install Tor and Privoxy and route your crawler through that. Then your crawler will have a randomly changing IP address.
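
A rough sketch of what that routing could look like from the C# side, assuming Privoxy is listening on its default address 127.0.0.1:8118 and forwarding to the local Tor instance (the address and URL below are assumptions, not part of the original answer):

    using System;
    using System.Net;
    using System.Net.Http;
    using System.Threading.Tasks;

    class TorRoutedCrawlerDemo
    {
        static async Task Main()
        {
            // Privoxy's default listen address; Privoxy in turn forwards to Tor's SOCKS port.
            var handler = new HttpClientHandler
            {
                Proxy = new WebProxy("http://127.0.0.1:8118"),
                UseProxy = true
            };

            using var client = new HttpClient(handler);
            string body = await client.GetStringAsync("http://example.com/"); // placeholder URL
            Console.WriteLine($"Fetched {body.Length} characters via Tor");
        }
    }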

This is a very effective technique if you need to crawl sites that do not want you crawling them. It also provides a layer of protection/anonymity by making the activities of your crawler very difficult to trace back to you.

Of course, if sites are blocking your crawler because it is just going too fast, then perhaps you should just rate-limit it a bit.
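
For that last point, a minimal sketch of per-request rate limiting in C# with a randomized delay (the 2-5 second range is an arbitrary choice):

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class RateLimitedCrawlerDemo
    {
        static async Task Main()
        {
            using var client = new HttpClient();
            var rng = new Random();
            string[] urls = { "http://example.com/a", "http://example.com/b" }; // placeholder URLs

            foreach (var url in urls)
            {
                string body = await client.GetStringAsync(url);
                Console.WriteLine($"{url}: {body.Length} bytes");

                // Pause 2-5 seconds between requests so the crawl looks less like a flood.
                await Task.Delay(rng.Next(2000, 5000));
            }
        }
    }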

心如荒岛 2024-12-15 10:42:21

And this is how you block the fakers (just in case someone finds this page while searching for how to block them).

Block that trick in Apache:

# Block fake Google when it's not coming from their IP ranges
# (a fake googlebot); [F] => Forbidden
RewriteCond %{HTTP:X-FORWARDED-FOR} !^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Googlebot/2\.[01];\ \+http://www\.google\.com/bot\.html\)$ [NC]
RewriteRule .* - [F,L]

Or, for completeness' sake, the equivalent block in nginx:

   map_hash_bucket_size  1024;
   map_hash_max_size     102400;

   map $http_user_agent $is_bot {
      default 0;
      ~(crawl|Googlebot|Slurp|spider|bingbot|tracker|click|parser)$ 1;
   }

   geo $not_google {
      default     1;
      66.0.0.0/8  0;
   }

   map $http_user_agent $bots {
      default           0;
      ~(?i)googlebot       $not_google;
   }
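
The maps above only classify the request; how they get used is not shown in the original answer, but a minimal sketch of wiring $bots into a server block might look like this:

   server {
      listen 80;

      location / {
         # $bots is 1 only for a Googlebot user agent arriving from outside 66.0.0.0/8
         if ($bots) {
            return 403;
         }
         # ... normal request handling ...
      }
   }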