I am looking to roll my own simple web stats script.
The only major obstacle on the road, as far as I can see, is telling human visitors apart from bots. I would like to have a solution for that which I don't need to maintain on a regular basis (i.e. I don't want to update text files with bot-related User-agents).
Is there any open service that does that, like Akismet does for spam?
Or is there a PHP project that is dedicated to recognizing spiders and bots and provides frequent updates?
To clarify: I'm not looking to block bots. I do not need 100% watertight results. I just want to exclude as many as I can from my stats. I know that parsing the User-Agent is an option, but maintaining the patterns to parse for is a lot of work. My question is whether there is any project or service that does that already.
Bounty: I thought I'd push this as a reference question on the topic. The best / most original / most technically viable contribution will receive the bounty amount.
Humans and bots will do similar things, but bots will do things that humans don't. Let's try to identify those things. Before we look at behavior, let's accept RayQuang's comment as being useful. If a visitor has a bot's user-agent string, it's probably a bot. I can't imagine anybody going around with "Google Crawler" (or something similar) as a UA unless they're working on breaking something. I know you don't want to update a list manually, but auto-pulling that one should be good, and even if it stays stale for the next 10 years, it will be helpful.
Some have already mentioned Javascript and image loading, but Google will do both. We must assume there are now several bots that will do both, so those are no longer human indicators. What bots will still uniquely do, however, is follow an "invisible" link. Link to a page in a very sneaky way that I can't see as a user. If that gets followed, we've got a bot.
Bots will often, though not always, respect robots.txt. Users don't care about robots.txt, and we can probably assume that anybody retrieving robots.txt is a bot. We can go one step further, though, and link a dummy CSS page to our pages that is excluded by robots.txt. If our normal CSS is loaded but our dummy CSS isn't, it's definitely a bot. You'll have to build a (probably in-memory) table of loads by IP and do a "not contained in" match, but that should be a really solid tell.
So, to use all this: maintain a database table of bots by IP address, possibly with timestamp limitations. Add anything that follows your invisible link, and add anything that loads the "real" CSS but ignores the robots.txt CSS. Maybe add all the robots.txt downloaders as well. Filter the user-agent string as the last step, and consider using it for a quick stats analysis to see how strongly those methods appear to be working for identifying things we know are bots.
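A minimal sketch of how those signals could feed one table, assuming SQLite and invented file, table, and trap-page names (this illustrates the approach above, not the answerer's actual code):

```php
<?php
// Illustrative only: one SQLite table collects the "bot tells" above.
// The invisible-link trap page and a PHP-served robots.txt would call
// flag_as_bot(); the stats script checks the table and applies the
// user-agent filter as its last step.  All names here are made up.

$db = new PDO('sqlite:' . __DIR__ . '/stats.sqlite');
$db->exec('CREATE TABLE IF NOT EXISTS bot_ips (ip TEXT PRIMARY KEY, reason TEXT, seen_at INTEGER)');

function flag_as_bot(PDO $db, string $reason): void
{
    $stmt = $db->prepare('INSERT OR REPLACE INTO bot_ips (ip, reason, seen_at) VALUES (?, ?, ?)');
    $stmt->execute([$_SERVER['REMOTE_ADDR'], $reason, time()]);
}

function is_known_bot(PDO $db): bool
{
    $stmt = $db->prepare('SELECT 1 FROM bot_ips WHERE ip = ?');
    $stmt->execute([$_SERVER['REMOTE_ADDR']]);
    return (bool) $stmt->fetchColumn();
}

// trap.php (target of the invisible link) would call:  flag_as_bot($db, 'invisible link');
// a robots.txt served through PHP would call:          flag_as_bot($db, 'fetched robots.txt');
// The "real CSS loaded but dummy CSS skipped" check needs a per-IP comparison
// of CSS loads and is left out of this sketch.

// In the stats script itself:
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (!is_known_bot($db) && !preg_match('/bot|crawl|spider/i', $ua)) {
    // record this hit as (probably) human
}
```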
The easiest way is to check whether their user agent includes 'bot' or 'spider'. Most do.
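For illustration, a tiny PHP sketch of that check; the regex is just an example, not an exhaustive pattern:

```php
<?php
// Crude check: anything whose user agent matches these substrings is counted
// as a bot.  The pattern is an example only; extend it as needed.
function looks_like_bot(string $userAgent): bool
{
    return (bool) preg_match('/bot|spider|crawl/i', $userAgent);
}

if (!looks_like_bot($_SERVER['HTTP_USER_AGENT'] ?? '')) {
    // count this hit in your stats
}
```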
EDIT (10y later): As Lukas said in the comment box, almost all crawlers today support JavaScript, so I've removed the paragraph stating that if the site was JS-based, most bots would be automatically stripped out.
You can follow a bot list and add their user-agent to the filtering list.
Take a look at this bot list.
This user-agent list is also pretty good. Just strip out all the B's and you're set.
EDIT: eSniff has done amazing work with the above list, available here "in a form that can be queried and parsed more easily (robotstxt.org/db/all.txt). Each new bot is defined by a robot-id:XXX. You should be able to download it once a week and parse it into something your script can use", as you can read in his comment.
Hope it helps!
Consider a PHP stats script which is camouflaged as a CSS background image (give the right response headers, at least Content-Type and Cache-Control, but write out an empty image).
Some bots parse JS, but certainly no one loads CSS images. One pitfall, as with JS, is that you will exclude text-based browsers with this, but that's less than 1% of the world wide web population. Also, there are certainly fewer CSS-disabled clients than JS-disabled clients (mobiles!).
To make it more solid for the (not unlikely) case that more advanced bots (Google, Yahoo, etc.) may crawl it in the future, disallow the path to the CSS image in robots.txt (which the better bots will respect anyway).
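A rough sketch of such a script, assuming it lives at an invented path, is referenced from your stylesheet as a background image, and logs to a flat file (none of these details come from the answer itself):

```php
<?php
// css-stats.php -- sketch of a stats endpoint disguised as a CSS background
// image.  Log the hit, then return a tiny transparent GIF with the headers
// the answer mentions.  The path and log format are assumptions.

$line = sprintf(
    "%s\t%s\t%s\t%s\n",
    date('c'),
    $_SERVER['REMOTE_ADDR'],
    $_SERVER['HTTP_REFERER'] ?? '-',
    $_SERVER['HTTP_USER_AGENT'] ?? '-'
);
file_put_contents(__DIR__ . '/css-hits.log', $line, FILE_APPEND | LOCK_EX);

header('Content-Type: image/gif');
header('Cache-Control: no-cache, no-store, must-revalidate');
// Smallest transparent 1x1 GIF.
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
```

The stylesheet would then reference it with something like background-image: url(/css-stats.php), and that path is what you would disallow in robots.txt.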
I use the following for my stats/counter app:
I removed a link to the original code source, because it now redirects to a food app.
Checking the user-agent will alert you to the honest bots, but not the spammers.
To tell which requests are made by dishonest bots, your best bet (based on this guy's interesting study) is to catch a JavaScript focus event.
If the focus event fires, the page was almost certainly loaded by a human being.
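As a sketch of that idea (endpoint name and log format are invented): the page emits a small listener that pings a beacon URL the first time the window gains focus, and the beacon script records the view as human.

```php
<?php
// Sketch only: the page template emits a small listener that pings a beacon
// the first time the window gains focus; the beacon endpoint then records the
// view as human.  The endpoint name and log format are invented.
echo <<<HTML
<script>
window.addEventListener('focus', function () {
    fetch('/focus-beacon.php?page=' + encodeURIComponent(location.pathname));
}, { once: true });
</script>
HTML;

// focus-beacon.php would then do something like:
//   file_put_contents('human-hits.log',
//       date('c') . "\t" . $_SERVER['REMOTE_ADDR'] . "\t" . ($_GET['page'] ?? '-') . "\n",
//       FILE_APPEND | LOCK_EX);
```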
I currently use AWstats and Webalizer to monitor my log files for Apache2, and so far they have been doing a pretty good job of it. If you would like, you can have a look at their source code, as it is an open source project.
You can get the source at http://awstats.sourceforge.net or alternatively look at the FAQ http://awstats.sourceforge.net/docs/awstats_faq.html
Hope that helps,
RayQuang
Rather than trying to maintain an impossibly long list of spider User Agents, we look for things that suggest human behaviour. The main principle is that we split our session count into two figures: the number of single-page sessions, and the number of multi-page sessions. We drop a session cookie, and use that to determine multi-page sessions. We also drop a persistent "Machine ID" cookie; a returning user (Machine ID cookie found) is treated as a multi-page session even if they only view one page in that session. You may have other characteristics that imply a "human" visitor - the referrer being Google, for example (although I believe that the MS Search bot masquerades as a standard UserAgent referred with a realistic keyword to check that the site doesn't show different content [to that given to their bot], and that behaviour looks a lot like a human!)
Of course this is not infallible, and in particular if you have lots of people who arrive and "click off" it's not going to be a good statistic for you, nor if you have a predominance of people with cookies turned off (in our case they won't be able to use our [shopping cart] site without session cookies enabled).
Taking the data from one of our clients, we find that the daily single-session count is all over the place - an order of magnitude different from day to day; however, if we subtract 1,000 from the multi-page sessions per day we then have a damn-near-linear rate of 4 multi-page sessions per order placed / two sessions per basket. I have no real idea what the other 1,000 multi-page sessions per day are!
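A minimal sketch of the two-cookie split described above (cookie names, lifetimes, and storage are invented for illustration):

```php
<?php
// Sketch of the two-cookie split (cookie names and lifetimes are invented).
// A short-lived session marks multi-page sessions; a persistent "machine id"
// cookie marks returning visitors, who count as multi-page even on one view.
session_start();

$isReturning = isset($_COOKIE['machine_id']);
if (!$isReturning) {
    // Persistent cookie, roughly two years.
    setcookie('machine_id', bin2hex(random_bytes(16)), time() + 2 * 365 * 86400, '/');
}

$_SESSION['pages_seen'] = ($_SESSION['pages_seen'] ?? 0) + 1;
$isMultiPage = $isReturning || $_SESSION['pages_seen'] > 1;

// Tally $isMultiPage into the multi-page vs single-page buckets; most of the
// bot noise ends up in the single-page bucket.
```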
Record mouse movement and scrolling using JavaScript. You can tell from the recorded data whether it's a human or a bot, unless the bot is really, really sophisticated and mimics human mouse movements.
Now we have all kinds of headless browsers: Chrome, Firefox, or others that will execute whatever JS you have on your site. So any JS-based detection won't work.
I think the most confident way would be to track behavior on the site. If I were to write a bot and wanted to bypass checks, I would mimic scroll, mouse move, hover, browser history, etc. events just with headless Chrome. To take it to the next level, even though headless Chrome adds some hints about "headless" mode to the request, I could fork the Chrome repo, make changes, and build my own binaries that leave no trace.
I think this may be the closest thing to real detection of human versus bot with no action required from the visitor:
https://developers.google.com/recaptcha/docs/invisible
I'm not sure of the techniques behind this, but I believe Google did a good job of analyzing billions of requests with their ML algorithms to detect whether the behavior is human-ish or bot-ish.
While it's an extra HTTP request, it won't detect a quickly bounced visitor, so that's something to keep in mind.
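As a sketch of the server-side half of that approach: the reCAPTCHA widget posts a token with the page request, and you verify it against Google's siteverify endpoint (the secret key and the surrounding counting logic here are placeholders):

```php
<?php
// Sketch of the server-side half of Invisible reCAPTCHA: the widget posts a
// token (by convention in g-recaptcha-response), which is verified against
// Google's siteverify endpoint.  The secret key below is a placeholder.
const RECAPTCHA_SECRET = 'your-secret-key';

function recaptcha_says_human(string $token): bool
{
    $context = stream_context_create(['http' => [
        'method'  => 'POST',
        'header'  => 'Content-Type: application/x-www-form-urlencoded',
        'content' => http_build_query([
            'secret'   => RECAPTCHA_SECRET,
            'response' => $token,
            'remoteip' => $_SERVER['REMOTE_ADDR'],
        ]),
    ]]);
    $raw = file_get_contents('https://www.google.com/recaptcha/api/siteverify', false, $context);
    $result = json_decode((string) $raw, true);
    return !empty($result['success']);
}

if (recaptcha_says_human($_POST['g-recaptcha-response'] ?? '')) {
    // count this page view as human
}
```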
Prerequisite: the referrer is set.
At the Apache level: in the web page, embed a /human/$hashkey_of_current_url.gif. If the visitor is a bot, it is unlikely to have the referrer set (this is a grey area). Hits made directly via the browser address bar will not be included either.
At the end of each day, /human-access_log should contain all the referrers that were actually human page views. To play safe, the hash of the referrer from the Apache log should tally with the image name.
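The answer does this with Apache logging; as a rough PHP equivalent of the same idea (the script name, md5 as the hash, and the log format are all assumptions), a handler mapped to /human/*.gif could check the referrer hash and log the hit itself:

```php
<?php
// human.php -- a PHP take on the idea above; the answer does it purely with
// Apache configuration, so everything here (md5 as the hash, file names) is
// an assumption.  The page embeds <img src="/human/<md5-of-page-url>.gif">
// and this script, mapped to /human/*.gif, logs the hit only when a referrer
// is present and its hash tallies with the requested image name.

$requested = basename(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH));
$referrer  = $_SERVER['HTTP_REFERER'] ?? '';

if ($referrer !== '' && $requested === md5($referrer) . '.gif') {
    file_put_contents(
        __DIR__ . '/human-access_log',
        date('c') . "\t" . $_SERVER['REMOTE_ADDR'] . "\t" . $referrer . "\n",
        FILE_APPEND | LOCK_EX
    );
}

header('Content-Type: image/gif');
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
```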
Have a 1x1 gif in your pages that you keep track of. If it is loaded, then it's likely to be a browser. If it's not loaded, it's likely to be a script.
Sorry, misunderstood. You may try another option I have set up on my site: create a non-linked webpage with a hard/strange name and log visits to this page separately. Most, if not all, of the visitors to this page will be bots; that way you'll be able to build your bot list dynamically.
Original answer follows (getting negative ratings!)
You could exclude all requests that come from a User Agent that also requests robots.txt. All well-behaved bots will make such a request, but the bad bots will escape detection.
You'd also have problems with false positives - as a human, it's not very often that I read a robots.txt in my browser, but I certainly can. To avoid these incorrectly showing up as bots, you could whitelist some common browser User Agents and consider them to always be human. But this would just turn into maintaining a list of User Agents for browsers instead of one for bots.
So, this did-they-request-robots.txt approach certainly won't give 100% watertight results, but it may provide some heuristics to feed into a complete solution.
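A sketch of that heuristic, under the assumption that robots.txt is routed through your PHP layer so the stats script can remember which user agents fetched it (file names and the small browser whitelist are illustrative only):

```php
<?php
// Sketch of the did-they-request-robots.txt heuristic, assuming robots.txt is
// routed through your PHP layer (e.g. via a rewrite rule) so the stats code
// can remember which user agents fetched it.  File names and the small
// browser whitelist are illustrative only.

$seenFile = __DIR__ . '/robots-requesters.txt';
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';

if (($_SERVER['REQUEST_URI'] ?? '') === '/robots.txt') {
    file_put_contents($seenFile, $ua . "\n", FILE_APPEND | LOCK_EX);
    header('Content-Type: text/plain');
    readfile(__DIR__ . '/robots-source.txt');   // the real robots.txt contents
    exit;
}

$robotUAs = is_file($seenFile) ? file($seenFile, FILE_IGNORE_NEW_LINES) : [];

// Whitelist a few obvious browser UAs so the occasional curious human who
// opens robots.txt doesn't get excluded.
$looksLikeBrowser = (bool) preg_match('/Firefox|Chrome|Safari|Edg/i', $ua);

if (!$looksLikeBrowser && in_array($ua, $robotUAs, true)) {
    // treat as a bot: skip this hit
} else {
    // count this hit
}
```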
I'm surprised no one has recommended implementing a Turing test. Just have a chat box with a human on the other end.
A programmatic solution just won't do: see what happens when PARRY encounters the DOCTOR.
These two 'characters' are both "chatter" bots that were written in the course of AI research in the '70s, to see how long they could fool a real person into thinking they were also a person. The PARRY character was modeled as a paranoid schizophrenic and THE DOCTOR as a stereotypical psychotherapist.
Here's some more background