How do I check whether my website is being accessed by crawlers?
How can I check whether a page is being hit by a crawler or script that fires continuous requests? I need to make sure the site can only be accessed through a web browser. Thanks.
Comments (3)
This question is a great place to start:
Detecting 'stealth' web-crawlers
Original post:
This would take a bit of engineering to solve.
I can think of three things to look for right off the bat:
One, the user agent. If the spider is Google or Bing or any other legitimate crawler, it will identify itself.
Two, if the spider is malicious, it will most likely emulate the headers of a normal browser. Fingerprint it: if it claims to be IE, use JavaScript to check for an ActiveX object.
Three, take note of what it accesses and how regularly. If the content takes the average human X seconds to view, you can use that as a starting point for determining whether it is humanly possible to consume the data that fast. This is tricky; you'll most likely have to rely on cookies, since an IP can be shared by multiple users.
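The timing heuristic in point three can be sketched as a per-visitor check of the interval between requests. A minimal JavaScript illustration — the 2-second threshold, the in-memory map, and the cookie-based visitor id are assumptions, not part of the original answer:

```javascript
// Minimum milliseconds a human plausibly needs between page views (assumed value).
const MIN_HUMAN_INTERVAL_MS = 2000;

// Last-seen timestamp per visitor, keyed by a session cookie rather than by IP,
// since one IP can be shared by multiple users.
const lastSeen = new Map();

// Returns true if this visitor is requesting pages faster than a human could.
function looksAutomated(visitorId, now) {
  const previous = lastSeen.get(visitorId);
  lastSeen.set(visitorId, now);
  return previous !== undefined && now - previous < MIN_HUMAN_INTERVAL_MS;
}
```

In a real application `lastSeen` would live in a shared store, and `looksAutomated` would be called once per request with the visitor's session-cookie id and the current timestamp.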
You can use the robots.txt file to block access for crawlers, or you can use JavaScript to detect the browser agent and switch based on that. If I understood correctly, the first option is more appropriate, so:
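The file the answer refers to was not preserved in this copy; it would be the standard disallow-all robots.txt, which asks every crawler to skip the entire site:

```
User-agent: *
Disallow: /
```

Note that robots.txt is purely advisory: well-behaved crawlers honor it, but a malicious scraper can simply ignore it, so it cannot by itself enforce browser-only access.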
Save that as robots.txt at the site root, and no automated system should check your site.
I had a similar issue in my web application: it created some bulky data in the database for each user who browsed the site, and crawlers were causing loads of useless data to be created. However, I didn't want to deny access to crawlers, because I wanted my site indexed and found; I just wanted to avoid creating useless data and to reduce crawl time.
I solved the problem in the following ways:
First, I used the HttpBrowserCapabilities.Crawler property from the .NET Framework (available since 2.0), which indicates whether the browser is a search-engine web crawler. You can access it from anywhere in the code:
ASP.NET C# code behind:
ASP.NET HTML:
ASP.NET Javascript:
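The three snippets behind the labels above were stripped from this copy. A sketch of what each access would look like, assuming ASP.NET Web Forms — the exact markup is a reconstruction, not the original, and it only runs inside the ASP.NET runtime:

```
// ASP.NET C# code behind:
bool isCrawler = Request.Browser.Crawler;

<%-- ASP.NET HTML (inline expression in the .aspx page): --%>
<%= Request.Browser.Crawler %>

// ASP.NET JavaScript (the server injects the value into the page):
var isCrawler = <%= Request.Browser.Crawler.ToString().ToLower() %>;
```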
The problem with this approach is that it is not 100% reliable against unidentified or masked crawlers, but it may be useful in your case.
After that, I had to find a way to distinguish between automated robots (crawlers, screen scrapers, etc.) and humans, and I realised that the solution required some kind of interactivity, such as clicking a button. Some crawlers do process JavaScript, and it is very likely they would fire the onclick event of a button element, but not that of a non-interactive element such as a div. The following is the HTML/JavaScript code I used in my web application www.so-much-to-do.com to implement this feature:
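The HTML/JavaScript block itself was lost in the scrape; only its link text survives ("Please click here to create your own set of sample tasks"). A minimal sketch of the described technique — attaching the click handler to a non-interactive div so that the expensive per-user data is only created after a real click (the element id and target URL are assumptions):

```
<!-- A div is not an interactive element, so even crawlers that execute
     JavaScript are unlikely to fire its click handler. -->
<div id="startLink" style="cursor: pointer; text-decoration: underline;">
  Please click here to create your own set of sample tasks
</div>
<script>
  document.getElementById("startLink").onclick = function () {
    // Only a human click reaches this point and creates the user's data.
    window.location = "/CreateSampleTasks"; // hypothetical endpoint
  };
</script>
```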
This approach has worked impeccably so far, although crawlers could become even more clever, maybe after reading this article :D