Preventing a custom web crawler from being blocked
I am creating a new web crawler using C# to crawl some specific websites. Everything goes fine, but the problem is that some websites block my crawler's IP address after a number of requests. I tried putting delays between my crawl requests, but that did not work.
Is there any way to prevent websites from blocking my crawler?
Solutions like the following would help (but I need to know how to apply them):
- simulating Googlebot or Yahoo! Slurp
- using multiple IP addresses (even fake IP addresses) as the crawler's client IP
Any solution would help.
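(For reference, "simulating Googlebot" amounts to sending Googlebot's advertised User-Agent string with each request. A minimal C# sketch, assuming HttpClient and a hypothetical target URL; the second answer below shows how servers detect exactly this kind of spoofing:)

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class UserAgentSpoofDemo
{
    static async Task Main()
    {
        using var client = new HttpClient();

        // Googlebot's advertised User-Agent string.
        client.DefaultRequestHeaders.UserAgent.ParseAdd(
            "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)");

        // Hypothetical target URL.
        string html = await client.GetStringAsync("https://example.com/");
        Console.WriteLine(html.Length);
    }
}
```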
Comments (2)
If speed/throughput is not a huge concern, then probably the best solution is to install Tor and Privoxy and route your crawler through that. Then your crawler will have a randomly changing IP address.
This is a very effective technique if you need to crawl sites that do not want you crawling them. It also provides a layer of protection/anonymity by making the activities of your crawler very difficult to trace back to you.
Of course, if sites are blocking your crawler because it is just going too fast, then perhaps you should just rate-limit it a bit.
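A minimal C# sketch of both ideas, assuming Tor and Privoxy are running locally with Privoxy on its default listen address of 127.0.0.1:8118 (the URL list is hypothetical):

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class TorCrawler
{
    static async Task Main()
    {
        // Route all requests through Privoxy, which chains into Tor.
        // 127.0.0.1:8118 is Privoxy's default listen address.
        var handler = new HttpClientHandler
        {
            Proxy = new WebProxy("http://127.0.0.1:8118"),
            UseProxy = true
        };
        using var client = new HttpClient(handler);

        // Hypothetical list of pages to fetch.
        string[] urls = { "https://example.com/a", "https://example.com/b" };

        foreach (var url in urls)
        {
            string html = await client.GetStringAsync(url);
            Console.WriteLine($"{url}: {html.Length} bytes");

            // Crude rate limit: wait a few seconds between requests,
            // which covers the "just going too fast" case as well.
            await Task.Delay(TimeSpan.FromSeconds(5));
        }
    }
}
```

Each new Tor circuit gives the crawler a different exit IP, so the per-IP counters that sites use for blocking never accumulate against a single address.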
And this is how you block fakers (just in case someone finds this page while searching for how to block them).
Block that trick in Apache:
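A representative mod_rewrite rule, assuming requests from the real Googlebot arrive from Google's published 66.249.64.0/19 range (a reverse-DNS check against googlebot.com is the more robust test):

```apache
# Requires mod_rewrite. Forbid requests that claim a Googlebot
# User-Agent but do not originate from 66.249.64.0/19 (assumed
# Googlebot range; verify against Google's published list).
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{REMOTE_ADDR} !^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
RewriteRule .* - [F,L]
```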
Or, for completeness' sake, the equivalent block in nginx:
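An equivalent sketch for nginx using the geo module, under the same address-range assumption:

```nginx
# 0 = inside Google's crawler range (assumed), 1 = everywhere else.
geo $non_google_ip {
    default        1;
    66.249.64.0/19 0;
}

server {
    listen 80;
    server_name example.com;  # hypothetical

    # Flag requests whose User-Agent claims to be Googlebot...
    if ($http_user_agent ~* googlebot) {
        set $fake_bot "ua";
    }
    # ...clear the flag when the address really is Google's...
    if ($non_google_ip = 0) {
        set $fake_bot "";
    }
    # ...and reject what remains: spoofed-UA requests.
    if ($fake_bot = "ua") {
        return 403;
    }
}
```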