Scraping / spider protection
There is a site/resource that offers some general statistical information as well as an interface to search facilities. These search operations are costly, so I want to restrict frequent and continuous (i.e. automated) search requests coming from people, as opposed to search engines.
I believe there are many existing techniques and frameworks that provide protection against automated grabbing, so I shouldn't have to reinvent the wheel. I'm using Python and Apache through mod_wsgi.
I am aware of mod_evasive (and will try to use it), but I'm also interested in any other techniques.
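Since the application already runs under mod_wsgi, one lightweight option is to do the throttling inside the WSGI stack itself. Below is a minimal sketch of per-IP rate-limiting middleware; the `/search` path prefix, the limits, and the `search_app` wiring are illustrative assumptions, and the in-memory state only works for a single-process deployment (a multi-process setup would need a shared store such as memcached or Redis):

```python
import time
from collections import defaultdict, deque


class SearchRateLimiter:
    """Reject clients that make more than `max_requests` search
    requests within a sliding window of `window` seconds."""

    def __init__(self, app, path_prefix="/search", max_requests=5, window=60):
        self.app = app                    # the wrapped WSGI application
        self.path_prefix = path_prefix    # only throttle the costly endpoint
        self.max_requests = max_requests
        self.window = window
        self.hits = defaultdict(deque)    # ip -> timestamps of recent searches

    def __call__(self, environ, start_response):
        if environ.get("PATH_INFO", "").startswith(self.path_prefix):
            ip = environ.get("REMOTE_ADDR", "unknown")
            now = time.time()
            recent = self.hits[ip]
            # Discard timestamps that have fallen out of the window.
            while recent and now - recent[0] > self.window:
                recent.popleft()
            if len(recent) >= self.max_requests:
                start_response("429 Too Many Requests",
                               [("Content-Type", "text/plain"),
                                ("Retry-After", str(self.window))])
                return [b"Too many search requests; please slow down.\n"]
            recent.append(now)
        return self.app(environ, start_response)


# Hypothetical search view standing in for the real application.
def search_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"search results\n"]

# mod_wsgi serves the module-level name `application` by default.
application = SearchRateLimiter(search_app)
```

Doing it at this layer keeps the policy next to the application, so only the genuinely expensive endpoint is throttled rather than the whole site.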
Answers (2)
If someone is hunting specifically for your website and the data there is really worth it, nothing will stop a sufficiently smart attacker.
Still, there are some things worth trying:
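For instance, the mod_evasive module the question mentions can throttle abusive clients at the Apache level, before requests ever reach the WSGI application. A hedged sketch of what enabling it can look like; the directive names come from mod_evasive's documentation, but the thresholds below are illustrative guesses that need tuning against real traffic:

```apache
# Illustrative mod_evasive configuration; tune the numbers for your site.
<IfModule mod_evasive20.c>
    # Block an IP that requests the same page more than 5 times per second.
    DOSPageCount      5
    DOSPageInterval   1
    # Block an IP that makes more than 50 requests per second site-wide.
    DOSSiteCount      50
    DOSSiteInterval   1
    # How long (in seconds) an offending IP stays blocked.
    DOSBlockingPeriod 60
</IfModule>
```

mod_evasive counts per-IP hits over fixed intervals, which is cheap but coarse; application-level rate limiting like the middleware sketched above can complement it with knowledge of which requests are actually expensive.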
You could try a robots.txt file. I believe you just put it at the root of your application; the robots.txt documentation has more details. The Disallow syntax is what you're looking for. Of course, not all robots respect it, but they all should. All the big companies (Google, Yahoo, etc.) will.
You may also be interested in this question about disallowing dynamic URLs.
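For concreteness, a robots.txt that keeps well-behaved crawlers away from the search pages could look like the sketch below; the /search path is an assumption about where the costly endpoint lives:

```
# Served from the site root as /robots.txt.
# /search is an assumed path for illustration.
User-agent: *
Disallow: /search
```

Note that robots.txt only addresses cooperative bots; it does nothing against deliberate scrapers, so it is best combined with actual rate limiting such as mod_evasive or application-level throttling.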