Preventing a custom web crawler from being blocked

Posted 2024-12-08 10:42:21

I am creating a new web crawler in C# to crawl some specific websites. Everything goes fine, but the problem is that some websites block my crawler's IP address after a number of requests. I tried putting delays between my crawl requests, but that did not work.

Is there any way to prevent websites from blocking my crawler? Solutions like the following would help (but I need to know how to apply them):

  • simulating Googlebot or Yahoo! Slurp (see the sketch below)
  • using multiple IP addresses (even fake IP addresses) as the crawler's client IP

Any solution would help.
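
For the first bullet, here is a minimal C# sketch of sending requests with a Googlebot-style User-Agent via HttpClient. The URL is just a placeholder, and sites that verify Googlebot against Google's IP ranges (as the second answer below does) will not be fooled by the header alone:

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class SpoofedUserAgentDemo
    {
        static async Task Main()
        {
            using var client = new HttpClient();

            // Present the crawler as Googlebot. Many sites also check the source IP
            // (see the Apache/nginx rules below), so this alone may not help.
            client.DefaultRequestHeaders.TryAddWithoutValidation(
                "User-Agent",
                "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)");

            string html = await client.GetStringAsync("http://example.com/"); // placeholder URL
            Console.WriteLine($"Fetched {html.Length} characters");
        }
    }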

Comments (2)

﹏半生如梦愿梦如真 2024-12-15 10:42:21

If speed/throughput is not a huge concern, then probably the best solution is to install Tor and Privoxy and route your crawler through that. Then your crawler will have a randomly changing IP address.
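
A rough sketch of what that routing could look like from the C# side, assuming Privoxy is listening on its default address 127.0.0.1:8118 and forwarding to the local Tor instance (the address and URL below are assumptions, not part of the original answer):

    using System;
    using System.Net;
    using System.Net.Http;
    using System.Threading.Tasks;

    class TorRoutedCrawlerDemo
    {
        static async Task Main()
        {
            // Privoxy's default listen address; Privoxy in turn forwards to Tor's SOCKS port.
            var handler = new HttpClientHandler
            {
                Proxy = new WebProxy("http://127.0.0.1:8118"),
                UseProxy = true
            };

            using var client = new HttpClient(handler);
            string body = await client.GetStringAsync("http://example.com/"); // placeholder URL
            Console.WriteLine($"Fetched {body.Length} characters via Tor");
        }
    }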

This is a very effective technique if you need to crawl sites that do not want you crawling them. It also provides a layer of protection/anonymity by making the activities of your crawler very difficult to trace back to you.

Of course, if sites are blocking your crawler because it is just going too fast, then perhaps you should just rate-limit it a bit.
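
For that last point, a minimal sketch of per-request rate limiting in C# with a randomized delay (the 2-5 second range is an arbitrary choice):

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class RateLimitedCrawlerDemo
    {
        static async Task Main()
        {
            using var client = new HttpClient();
            var rng = new Random();
            string[] urls = { "http://example.com/a", "http://example.com/b" }; // placeholder URLs

            foreach (var url in urls)
            {
                string body = await client.GetStringAsync(url);
                Console.WriteLine($"{url}: {body.Length} bytes");

                // Pause 2-5 seconds between requests so the crawl looks less like a flood.
                await Task.Delay(rng.Next(2000, 5000));
            }
        }
    }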

心如荒岛 2024-12-15 10:42:21

And this is how you block the fakers (just in case someone finds this page while searching for how to block them).

Block that trick in Apache:

# Block fake Google when it's not coming from their IP ranges
# (a fake googlebot); [F] => Forbidden
RewriteCond %{HTTP:X-FORWARDED-FOR} !^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Googlebot/2\.[01];\ \+http://www\.google\.com/bot\.html\)$ [NC]
RewriteRule .* - [F,L]

Or, for completeness' sake, the equivalent block in nginx:

   map_hash_bucket_size  1024;
   map_hash_max_size     102400;

   map $http_user_agent $is_bot {
      default 0;
      ~(crawl|Googlebot|Slurp|spider|bingbot|tracker|click|parser)$ 1;
   }

   geo $not_google {
      default     1;
      66.0.0.0/8  0;
   }

   map $http_user_agent $bots {
      default           0;
      ~(?i)googlebot       $not_google;
   }
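
The maps above only classify the request; how they get used is not shown in the original answer, but a minimal sketch of wiring $bots into a server block might look like this:

   server {
      listen 80;

      location / {
         # $bots is 1 only for a Googlebot user agent arriving from outside 66.0.0.0/8
         if ($bots) {
            return 403;
         }
         # ... normal request handling ...
      }
   }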