I am looking to roll my own simple web stats script.
The only major obstacle on the road, as far as I can see, is telling human visitors apart from bots. I would like to have a solution for that which I don't need to maintain on a regular basis (i.e. I don't want to update text files with bot-related User-agents).
Is there any open service that does that, like Akismet does for spam?
Or is there a PHP project that is dedicated to recognizing spiders and bots and provides frequent updates?
To clarify: I'm not looking to block bots. I do not need 100% watertight results. I just want to exclude as many as I can from my stats. I know that parsing the User-Agent is an option, but maintaining the patterns to parse for is a lot of work. My question is whether there is any project or service that does that already.
Bounty: I thought I'd push this as a reference question on the topic. The best / most original / most technically viable contribution will receive the bounty amount.
Humans and bots will do similar things, but bots will do things that humans don't. Let's try to identify those things. Before we look at behavior, let's accept RayQuang's comment as being useful. If a visitor has a bot's user-agent string, it's probably a bot. I can't imagine anybody going around with "Google Crawler" (or something similar) as a UA unless they're working on breaking something. I know you don't want to update a list manually, but auto-pulling that one should be good, and even if it stays stale for the next 10 years, it will be helpful.
Some have already mentioned Javascript and image loading, but Google will do both. We must assume there are now several bots that will do both, so those are no longer human indicators. What bots will still uniquely do, however, is follow an "invisible" link. Link to a page in a very sneaky way that I can't see as a user. If that gets followed, we've got a bot.
Bots will often, though not always, respect robots.txt. Users don't care about robots.txt, and we can probably assume that anybody retrieving robots.txt is a bot. We can go one step further, though, and link a dummy CSS page to our pages that is excluded by robots.txt. If our normal CSS is loaded but our dummy CSS isn't, it's definitely a bot. You'll have to build a (probably in-memory) table of loads by IP and do a "not contained in" match, but that should be a really solid tell.
So, to use all this: maintain a database table of bots by IP address, possibly with timestamp limitations. Add anything that follows your invisible link, and add anything that loads the "real" CSS but ignores the robots.txt CSS. Maybe add all the robots.txt downloaders as well. Filter the user-agent string as the last step, and consider using it for a quick stats analysis to see how strongly those methods appear to be working for identifying things we know are bots.
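A minimal sketch of how those signals could feed one table, assuming SQLite and invented file, table, and trap-page names (this illustrates the approach above, not the answerer's actual code):

```php
<?php
// Illustrative only: one SQLite table collects the "bot tells" above.
// The invisible-link trap page and a PHP-served robots.txt would call
// flag_as_bot(); the stats script checks the table and applies the
// user-agent filter as its last step.  All names here are made up.

$db = new PDO('sqlite:' . __DIR__ . '/stats.sqlite');
$db->exec('CREATE TABLE IF NOT EXISTS bot_ips (ip TEXT PRIMARY KEY, reason TEXT, seen_at INTEGER)');

function flag_as_bot(PDO $db, string $reason): void
{
    $stmt = $db->prepare('INSERT OR REPLACE INTO bot_ips (ip, reason, seen_at) VALUES (?, ?, ?)');
    $stmt->execute([$_SERVER['REMOTE_ADDR'], $reason, time()]);
}

function is_known_bot(PDO $db): bool
{
    $stmt = $db->prepare('SELECT 1 FROM bot_ips WHERE ip = ?');
    $stmt->execute([$_SERVER['REMOTE_ADDR']]);
    return (bool) $stmt->fetchColumn();
}

// trap.php (target of the invisible link) would call:  flag_as_bot($db, 'invisible link');
// a robots.txt served through PHP would call:          flag_as_bot($db, 'fetched robots.txt');
// The "real CSS loaded but dummy CSS skipped" check needs a per-IP comparison
// of CSS loads and is left out of this sketch.

// In the stats script itself:
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (!is_known_bot($db) && !preg_match('/bot|crawl|spider/i', $ua)) {
    // record this hit as (probably) human
}
```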
The easiest way is to check whether their user agent includes 'bot' or 'spider'. Most do.
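For illustration, a tiny PHP sketch of that check; the regex is just an example, not an exhaustive pattern:

```php
<?php
// Crude check: anything whose user agent matches these substrings is counted
// as a bot.  The pattern is an example only; extend it as needed.
function looks_like_bot(string $userAgent): bool
{
    return (bool) preg_match('/bot|spider|crawl/i', $userAgent);
}

if (!looks_like_bot($_SERVER['HTTP_USER_AGENT'] ?? '')) {
    // count this hit in your stats
}
```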
EDIT (10y later): As Lukas said in the comment box, almost all crawlers today support JavaScript, so I've removed the paragraph stating that if the site was JS-based, most bots would be automatically stripped out.
You can follow a bot list and add their user-agent to the filtering list.
Take a look at this bot list.
This user-agent list is also pretty good. Just strip out all the B's and you're set.
EDIT: eSniff has done amazing work with the above list, available here "in a form that can be queried and parsed more easily (robotstxt.org/db/all.txt). Each new bot is defined by a robot-id:XXX. You should be able to download it once a week and parse it into something your script can use", as you can read in his comment.
Hope it helps!
Consider a PHP stats script which is camouflaged as a CSS background image (give the right response headers, at least Content-Type and Cache-Control, but write out an empty image).
Some bots parse JS, but certainly no one loads CSS images. One pitfall, as with JS, is that you will exclude text-based browsers with this, but that's less than 1% of the world wide web population. Also, there are certainly fewer CSS-disabled clients than JS-disabled clients (mobiles!).
To make it more solid for the (not unlikely) case that more advanced bots (Google, Yahoo, etc.) may crawl it in the future, disallow the path to the CSS image in robots.txt (which the better bots will respect anyway).
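A rough sketch of such a script, assuming it lives at an invented path, is referenced from your stylesheet as a background image, and logs to a flat file (none of these details come from the answer itself):

```php
<?php
// css-stats.php -- sketch of a stats endpoint disguised as a CSS background
// image.  Log the hit, then return a tiny transparent GIF with the headers
// the answer mentions.  The path and log format are assumptions.

$line = sprintf(
    "%s\t%s\t%s\t%s\n",
    date('c'),
    $_SERVER['REMOTE_ADDR'],
    $_SERVER['HTTP_REFERER'] ?? '-',
    $_SERVER['HTTP_USER_AGENT'] ?? '-'
);
file_put_contents(__DIR__ . '/css-hits.log', $line, FILE_APPEND | LOCK_EX);

header('Content-Type: image/gif');
header('Cache-Control: no-cache, no-store, must-revalidate');
// Smallest transparent 1x1 GIF.
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
```

The stylesheet would then reference it with something like background-image: url(/css-stats.php), and that path is what you would disallow in robots.txt.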
I use the following for my stats/counter app:
I removed a link to the original code source, because it now redirects to a food app.
Checking the user-agent will alert you to the honest bots, but not the spammers.
To tell which requests are made by dishonest bots, your best bet (based on this guy's interesting study) is to catch a JavaScript focus event.
If the focus event fires, the page was almost certainly loaded by a human being.
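As a sketch of that idea (endpoint name and log format are invented): the page emits a small listener that pings a beacon URL the first time the window gains focus, and the beacon script records the view as human.

```php
<?php
// Sketch only: the page template emits a small listener that pings a beacon
// the first time the window gains focus; the beacon endpoint then records the
// view as human.  The endpoint name and log format are invented.
echo <<<HTML
<script>
window.addEventListener('focus', function () {
    fetch('/focus-beacon.php?page=' + encodeURIComponent(location.pathname));
}, { once: true });
</script>
HTML;

// focus-beacon.php would then do something like:
//   file_put_contents('human-hits.log',
//       date('c') . "\t" . $_SERVER['REMOTE_ADDR'] . "\t" . ($_GET['page'] ?? '-') . "\n",
//       FILE_APPEND | LOCK_EX);
```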
I currently use AWstats and Webalizer to monitor my log files for Apache2, and so far they have been doing a pretty good job of it. If you would like, you can have a look at their source code, as it is an open source project.
You can get the source at http://awstats.sourceforge.net or alternatively look at the FAQ http://awstats.sourceforge.net/docs/awstats_faq.html
Hope that helps,
RayQuang
Rather than trying to maintain an impossibly long list of spider User Agents, we look for things that suggest human behaviour. The main principle is that we split our session count into two figures: the number of single-page sessions, and the number of multi-page sessions. We drop a session cookie, and use that to determine multi-page sessions. We also drop a persistent "Machine ID" cookie; a returning user (Machine ID cookie found) is treated as a multi-page session even if they only view one page in that session. You may have other characteristics that imply a "human" visitor - the referrer being Google, for example (although I believe that the MS Search bot masquerades as a standard UserAgent referred with a realistic keyword to check that the site doesn't show different content [to that given to their bot], and that behaviour looks a lot like a human!)
Of course this is not infallible, and in particular if you have lots of people who arrive and "click off" it's not going to be a good statistic for you, nor if you have a predominance of people with cookies turned off (in our case they won't be able to use our [shopping cart] site without session cookies enabled).
Taking the data from one of our clients, we find that the daily single-session count is all over the place - an order of magnitude different from day to day; however, if we subtract 1,000 from the multi-page sessions per day we then have a damn-near-linear rate of 4 multi-page sessions per order placed / two sessions per basket. I have no real idea what the other 1,000 multi-page sessions per day are!
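A minimal sketch of the two-cookie split described above (cookie names, lifetimes, and storage are invented for illustration):

```php
<?php
// Sketch of the two-cookie split (cookie names and lifetimes are invented).
// A short-lived session marks multi-page sessions; a persistent "machine id"
// cookie marks returning visitors, who count as multi-page even on one view.
session_start();

$isReturning = isset($_COOKIE['machine_id']);
if (!$isReturning) {
    // Persistent cookie, roughly two years.
    setcookie('machine_id', bin2hex(random_bytes(16)), time() + 2 * 365 * 86400, '/');
}

$_SESSION['pages_seen'] = ($_SESSION['pages_seen'] ?? 0) + 1;
$isMultiPage = $isReturning || $_SESSION['pages_seen'] > 1;

// Tally $isMultiPage into the multi-page vs single-page buckets; most of the
// bot noise ends up in the single-page bucket.
```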
Record mouse movement and scrolling using JavaScript. You can tell from the recorded data whether it's a human or a bot, unless the bot is really, really sophisticated and mimics human mouse movements.
Now we have all kinds of headless browsers: Chrome, Firefox, or others that will execute whatever JS you have on your site. So any JS-based detection won't work.
I think the most confident way would be to track behavior on the site. If I were to write a bot and wanted to bypass checks, I would mimic scroll, mouse move, hover, browser history, etc. events just with headless Chrome. To take it to the next level, even though headless Chrome adds some hints about "headless" mode to the request, I could fork the Chrome repo, make changes, and build my own binaries that leave no trace.
I think this may be the closest thing to real detection of human versus bot with no action required from the visitor:
https://developers.google.com/recaptcha/docs/invisible
I'm not sure of the techniques behind this, but I believe Google did a good job of analyzing billions of requests with their ML algorithms to detect whether the behavior is human-ish or bot-ish.
While it's an extra HTTP request, it won't detect a quickly bounced visitor, so that's something to keep in mind.
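As a sketch of the server-side half of that approach: the reCAPTCHA widget posts a token with the page request, and you verify it against Google's siteverify endpoint (the secret key and the surrounding counting logic here are placeholders):

```php
<?php
// Sketch of the server-side half of Invisible reCAPTCHA: the widget posts a
// token (by convention in g-recaptcha-response), which is verified against
// Google's siteverify endpoint.  The secret key below is a placeholder.
const RECAPTCHA_SECRET = 'your-secret-key';

function recaptcha_says_human(string $token): bool
{
    $context = stream_context_create(['http' => [
        'method'  => 'POST',
        'header'  => 'Content-Type: application/x-www-form-urlencoded',
        'content' => http_build_query([
            'secret'   => RECAPTCHA_SECRET,
            'response' => $token,
            'remoteip' => $_SERVER['REMOTE_ADDR'],
        ]),
    ]]);
    $raw = file_get_contents('https://www.google.com/recaptcha/api/siteverify', false, $context);
    $result = json_decode((string) $raw, true);
    return !empty($result['success']);
}

if (recaptcha_says_human($_POST['g-recaptcha-response'] ?? '')) {
    // count this page view as human
}
```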
Prerequisite: the referrer is set.
At the Apache level: in the web page, embed a /human/$hashkey_of_current_url.gif. If the visitor is a bot, it is unlikely to have the referrer set (this is a grey area). Hits made directly via the browser address bar will not be included either.
At the end of each day, /human-access_log should contain all the referrers that were actually human page views. To play safe, the hash of the referrer from the Apache log should tally with the image name.
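The answer does this with Apache logging; as a rough PHP equivalent of the same idea (the script name, md5 as the hash, and the log format are all assumptions), a handler mapped to /human/*.gif could check the referrer hash and log the hit itself:

```php
<?php
// human.php -- a PHP take on the idea above; the answer does it purely with
// Apache configuration, so everything here (md5 as the hash, file names) is
// an assumption.  The page embeds <img src="/human/<md5-of-page-url>.gif">
// and this script, mapped to /human/*.gif, logs the hit only when a referrer
// is present and its hash tallies with the requested image name.

$requested = basename(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH));
$referrer  = $_SERVER['HTTP_REFERER'] ?? '';

if ($referrer !== '' && $requested === md5($referrer) . '.gif') {
    file_put_contents(
        __DIR__ . '/human-access_log',
        date('c') . "\t" . $_SERVER['REMOTE_ADDR'] . "\t" . $referrer . "\n",
        FILE_APPEND | LOCK_EX
    );
}

header('Content-Type: image/gif');
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
```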
Have a 1x1 gif in your pages that you keep track of. If it is loaded, then it's likely to be a browser. If it's not loaded, it's likely to be a script.
Sorry, misunderstood. You may try another option I have set up on my site: create a non-linked webpage with a hard/strange name and log visits to this page separately. Most, if not all, of the visitors to this page will be bots; that way you'll be able to build your bot list dynamically.
Original answer follows (getting negative ratings!)
You could exclude all requests that come from a User Agent that also requests robots.txt. All well-behaved bots will make such a request, but the bad bots will escape detection.
You'd also have problems with false positives - as a human, it's not very often that I read a robots.txt in my browser, but I certainly can. To avoid these incorrectly showing up as bots, you could whitelist some common browser User Agents and consider them to always be human. But this would just turn into maintaining a list of User Agents for browsers instead of one for bots.
So, this did-they-request-robots.txt approach certainly won't give 100% watertight results, but it may provide some heuristics to feed into a complete solution.
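A sketch of that heuristic, under the assumption that robots.txt is routed through your PHP layer so the stats script can remember which user agents fetched it (file names and the small browser whitelist are illustrative only):

```php
<?php
// Sketch of the did-they-request-robots.txt heuristic, assuming robots.txt is
// routed through your PHP layer (e.g. via a rewrite rule) so the stats code
// can remember which user agents fetched it.  File names and the small
// browser whitelist are illustrative only.

$seenFile = __DIR__ . '/robots-requesters.txt';
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';

if (($_SERVER['REQUEST_URI'] ?? '') === '/robots.txt') {
    file_put_contents($seenFile, $ua . "\n", FILE_APPEND | LOCK_EX);
    header('Content-Type: text/plain');
    readfile(__DIR__ . '/robots-source.txt');   // the real robots.txt contents
    exit;
}

$robotUAs = is_file($seenFile) ? file($seenFile, FILE_IGNORE_NEW_LINES) : [];

// Whitelist a few obvious browser UAs so the occasional curious human who
// opens robots.txt doesn't get excluded.
$looksLikeBrowser = (bool) preg_match('/Firefox|Chrome|Safari|Edg/i', $ua);

if (!$looksLikeBrowser && in_array($ua, $robotUAs, true)) {
    // treat as a bot: skip this hit
} else {
    // count this hit
}
```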
I'm surprised no one has recommended implementing a Turing test. Just have a chat box with a human on the other end.
A programmatic solution just won't do: see what happens when PARRY encounters the DOCTOR.
These two 'characters' are both "chatter" bots that were written in the course of AI research in the '70s, to see how long they could fool a real person into thinking they were also a person. The PARRY character was modeled as a paranoid schizophrenic and THE DOCTOR as a stereotypical psychotherapist.
Here's some more background