Detecting well-behaved / well-known bots

Posted 2024-07-23 12:23:45

I found this question very interesting: Programmatic Bot Detection
I have a very similar question, but I'm not bothered about 'badly behaved bots'.

I am tracking (in addition to google analytics) the following per visit :

  • Entry URL
  • Referer
  • UserAgent
  • AdWords (by means of the query string)
  • Whether or not the user made a purchase
  • etc.

The problem is that when I calculate any kind of conversion rate I end up with lots of 'bot' visits that greatly skew my results.

I'd like to ignore as many bot visits as possible, but I want a solution that I don't need to monitor too closely, that won't itself be a performance hog, and that preferably still works if someone has JavaScript disabled.

Are there good published lists of the top 100 bots or so? I did find a list at http://www.user-agents.org/ but that appears to contain hundreds if not thousands of bots. I don't want to check every referer against thousands of links.

Here is the current Googlebot UserAgent. How often does it change?

 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
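
For illustration, the kind of check I have in mind would look something like this, written against a hypothetical sessionvisit log table (the table, column, and bot names here are only illustrative):

select useragent, count(*) as bot_visits
from sessionvisit
where useragent like '%googlebot%'   -- each pattern is a substring of a well-known crawler's UA
   or useragent like '%bingbot%'
   or useragent like '%slurp%'
   or useragent like '%baiduspider%'
group by useragent
order by bot_visits desc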


Comments (2)

孤者何惧 2024-07-30 12:23:45

You could try importing the Robots Database from robotstxt.org and using that to filter out requests from those User-Agents. It might not be much different from User-agents.org, but at least the robotstxt.org list is 'owner-submitted' (supposedly).

That site also links to botsvsbrowsers.com, although I don't immediately see a downloadable version of their data.

Also, you said

I don't want to check every referer against thousands of links.

which is fair enough - but if runtime performance is a concern, just 'log' every request and filter them out as a post-process (an overnight batch, or as part of the reporting queries).
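
As a sketch of that post-process, assuming you have imported the robotstxt.org list into a known_bots table of UA substrings and added an is_bot flag to your visits table (all of those names are assumptions, and concat() is MySQL-style string concatenation):

-- flag, rather than delete, any visit whose User-Agent contains an imported bot pattern
update sessionvisit
set is_bot = 1
where exists (
    select 1
    from known_bots
    where sessionvisit.useragent like concat('%', known_bots.ua_pattern, '%')
)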

This point also confuses me a bit

preferably still works if someone has JavaScript disabled.

Are you writing your log on the server side as part of every page you serve? JavaScript should not make any difference in that case (although obviously visitors with JavaScript disabled will not get reported via Google Analytics).

P.S. Having mentioned robotstxt.org, it's worth remembering that well-behaved robots will request /robots.txt from your website root. Perhaps you could use that knowledge to your advantage: log requests for /robots.txt and have them flag possible robot User-Agents that you might want to exclude (although I wouldn't exclude a UA automatically on that basis, in case a regular web user types /robots.txt into their browser; you don't want your code ignoring real people). I don't think that would cause too much maintenance overhead over time...
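
Here is a sketch of that idea, assuming your log also records the request path (requestlog and its columns are hypothetical names):

-- surface User-Agents that fetched /robots.txt, for manual review rather than automatic exclusion
select useragent, count(*) as hits
from requestlog
where path = '/robots.txt'
group by useragent
order by hits desc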

寒冷纷飞旳雪 2024-07-30 12:23:45

I realized that it's probably easier to do the exact reverse of what I was attempting.

i.e.

select count(*) as count, useragent from sessionvisit 
where useragent not like '%firefox%' 
and useragent not like '%chrome%'
and useragent not like '%safari%'
and useragent not like '%msie%'
and useragent not like '%gecko%'
and useragent not like '%opera%'
group by useragent order by count desc

What I'm actually trying to do is get an accurate conversion rate, and it seems to make more sense to include good browsers rather than exclude bots (good or bad).
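
For the rate itself, something along these lines is what I have in mind (assuming sessionvisit carries a purchased flag; that column name is illustrative):

-- conversion rate computed only over sessions from recognized browser engines
select 100.0 * sum(case when purchased = 1 then 1 else 0 end) / count(*) as conversion_pct
from sessionvisit
where useragent like '%firefox%'
   or useragent like '%chrome%'
   or useragent like '%safari%'
   or useragent like '%msie%'
   or useragent like '%gecko%'
   or useragent like '%opera%'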

In addition, if I ever find a 'session' where a 'robot' has made a purchase, it probably means there is a new browser out there (think Chrome). Currently none of my robots have made a purchase!
