How do I check whether my website is being accessed by crawlers?
How can I check whether a page is being hit by a crawler or script that fires continuous requests? I need to make sure the site can only be accessed through a web browser. Thanks.
Comments (3)
This question is a great place to start:
Detecting 'stealth' web-crawlers
Original post:
This would take a bit of engineering to solve.
I can think of three things to look for right off the bat:
One, the user agent. If the spider is Google or Bing or any other legitimate crawler, it will identify itself.
Two, if the spider is malicious, it will most likely emulate the headers of a normal browser. Fingerprint it: if it claims to be IE, use JavaScript to check for an ActiveX object.
Three, take note of what it accesses and how regularly. If the content takes the average human X seconds to view, you can use that as a starting point for determining whether it is humanly possible to consume the data that fast. This is tricky; you'll most likely have to rely on cookies, since an IP can be shared by multiple users.
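The timing heuristic in point three can be sketched as a per-visitor check of the interval between requests. A minimal JavaScript illustration — the 2-second threshold, the in-memory map, and the cookie-based visitor id are assumptions, not part of the original answer:

```javascript
// Minimum milliseconds a human plausibly needs between page views (assumed value).
const MIN_HUMAN_INTERVAL_MS = 2000;

// Last-seen timestamp per visitor, keyed by a session cookie rather than by IP,
// since one IP can be shared by multiple users.
const lastSeen = new Map();

// Returns true if this visitor is requesting pages faster than a human could.
function looksAutomated(visitorId, now) {
  const previous = lastSeen.get(visitorId);
  lastSeen.set(visitorId, now);
  return previous !== undefined && now - previous < MIN_HUMAN_INTERVAL_MS;
}
```

In a real application `lastSeen` would live in a shared store, and `looksAutomated` would be called once per request with the visitor's session-cookie id and the current timestamp.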
You can use the robots.txt file to block access for crawlers, or you can use JavaScript to detect the browser agent and switch based on that. If I understood correctly, the first option is more appropriate, so:
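The file the answer refers to was not preserved in this copy; it would be the standard disallow-all robots.txt, which asks every crawler to skip the entire site:

```
User-agent: *
Disallow: /
```

Note that robots.txt is purely advisory: well-behaved crawlers honor it, but a malicious scraper can simply ignore it, so it cannot by itself enforce browser-only access.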
Save that as robots.txt at the site root, and no automated system should check your site.
I had a similar issue in my web application: it created some bulky data in the database for each user who browsed the site, and crawlers were causing loads of useless data to be created. However, I didn't want to deny access to crawlers, because I wanted my site indexed and found; I just wanted to avoid creating useless data and to reduce crawl time.
I solved the problem in the following ways:
First, I used the HttpBrowserCapabilities.Crawler property from the .NET Framework (available since 2.0), which indicates whether the browser is a search-engine web crawler. You can access it from anywhere in the code:
ASP.NET C# code behind:
ASP.NET HTML:
ASP.NET Javascript:
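The three snippets behind the labels above were stripped from this copy. A sketch of what each access would look like, assuming ASP.NET Web Forms — the exact markup is a reconstruction, not the original, and it only runs inside the ASP.NET runtime:

```
// ASP.NET C# code behind:
bool isCrawler = Request.Browser.Crawler;

<%-- ASP.NET HTML (inline expression in the .aspx page): --%>
<%= Request.Browser.Crawler %>

// ASP.NET JavaScript (the server injects the value into the page):
var isCrawler = <%= Request.Browser.Crawler.ToString().ToLower() %>;
```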
The problem with this approach is that it is not 100% reliable against unidentified or masked crawlers, but it may be useful in your case.
After that, I had to find a way to distinguish between automated robots (crawlers, screen scrapers, etc.) and humans, and I realised that the solution required some kind of interactivity, such as clicking a button. Some crawlers do process JavaScript, and it is very likely they would fire the onclick event of a button element, but not that of a non-interactive element such as a div. The following is the HTML/JavaScript code I used in my web application www.so-much-to-do.com to implement this feature:
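The HTML/JavaScript block itself was lost in the scrape; only its link text survives ("Please click here to create your own set of sample tasks"). A minimal sketch of the described technique — attaching the click handler to a non-interactive div so that the expensive per-user data is only created after a real click (the element id and target URL are assumptions):

```
<!-- A div is not an interactive element, so even crawlers that execute
     JavaScript are unlikely to fire its click handler. -->
<div id="startLink" style="cursor: pointer; text-decoration: underline;">
  Please click here to create your own set of sample tasks
</div>
<script>
  document.getElementById("startLink").onclick = function () {
    // Only a human click reaches this point and creates the user's data.
    window.location = "/CreateSampleTasks"; // hypothetical endpoint
  };
</script>
```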
This approach has worked impeccably so far, although crawlers could become even more clever, maybe after reading this article :D