Automatically tokenizing user agent strings for statistics?

Posted on 2024-08-15 13:11:39

We keep track of user agent strings on our website. I want to run some statistics on them to see how many IE6 users we have (so we know what we have to develop against), and also how many mobile users we have.

So we have log entries like this:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0; .NET CLR 2.0.50727)

Ideally, it would be pretty neat to see all the 'meaningful' strings, which probably just means strings longer than a certain length. For instance, I might like to see how many entries contain FunWebProducts, or .NET CLR, or .NET CLR 1.0.3705, but I don't want to see how many contain a semicolon. So I'm not necessarily looking for unique strings, but for all strings, even substrings. I would want to see the count for all of Mozilla, knowing that this includes the counts for Mozilla/5.0 and Mozilla/4.0. It would be nice if there were a nested display for this, starting with the shortest strings and working its way down. Something perhaps like

4,2093 Mozilla
 1,093 Mozilla/5.0
    468 Mozilla/5.0 (Windows;
     47 Mozilla/5.0 (Windows; U 
 2,398 Mozilla/4.0

This sounds like a computer science homework problem. What would this be called? Does something like this already exist, or do I write my own?


您的好友蓝忘机已上羡 2024-08-22 13:11:39

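You're looking at the longest common substring problem or, given your specific example above, the longest common prefix variant of it, which can be solved with a trie.

However, judging from your example, you probably don't even need to be efficient about this. Instead, simply:

1. Tokenize each string on some subset of punctuation, such as [ ;/]
2. Save each unique prefix of however many tokens, with the original separators put back in
3. For each prefix, count the records it matches and save that count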
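The three steps above can be sketched in Python roughly as follows. This is only a minimal illustration, not a tuned implementation: the function names (`token_prefixes`, `count_prefixes`, `print_nested`) are made up for this example, and skipping empty tokens (such as the one between `;` and the following space) is an assumption chosen to avoid near-duplicate prefixes.

```python
import re
from collections import Counter

# Split on the separators suggested above ([ ;/]). The capture group makes
# re.split keep the separators, so prefixes can be rebuilt verbatim.
SEPARATORS = re.compile(r"([ ;/])")


def token_prefixes(user_agent):
    """Return every prefix of the string that ends on a non-empty token."""
    parts = SEPARATORS.split(user_agent)  # token, sep, token, sep, ..., token
    prefixes = []
    for end in range(1, len(parts) + 1, 2):
        if parts[end - 1]:  # assumption: skip empty tokens such as in "; "
            prefixes.append("".join(parts[:end]))
    return prefixes


def count_prefixes(user_agents):
    """Count, for each prefix, how many records start with it."""
    counts = Counter()
    for user_agent in user_agents:
        counts.update(token_prefixes(user_agent))
    return counts


def print_nested(counts, min_count=1):
    """Print prefixes in sorted order, indented by the number of tokens."""
    for prefix in sorted(counts):
        if counts[prefix] < min_count:
            continue
        depth = sum(1 for token in SEPARATORS.split(prefix)[::2] if token) - 1
        print(f"{'  ' * depth}{counts[prefix]:>6,} {prefix}")
```

Calling `print_nested(count_prefixes(lines), min_count=10)` on the logged strings would give a nested listing similar in spirit to the one in the question. Sorting works here because a prefix always sorts before its own extensions. For very large logs a trie would avoid rebuilding each prefix string, but a `Counter` is probably enough to start with.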


陌上芳菲 2024-08-22 13:11:39

If you break it up into the major name (part before the opening paren), and then store each part separated by semicolon as a child record, you could do whatever analysis you want. For example, store it in a relational database:

BrowserID   BrowserText
---------   -----------
1           Mozilla/4.0
2           Mozilla/5.0

FeatureID   FeatureText
---------   -----------
1           compatible
2           MSIE 7.0
3           Windows NT 5.1
4           FunWebProducts
5           .NET CLR 1.0.3705
6           .NET CLR 1.1.4322
7           Media Center PC 4.0
8           .NET CLR 2.0.50727

Then log references to browser and parts and you can do any type of analysis you want.
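As a rough sketch of this layout, the following uses Python's built-in `sqlite3` module with an in-memory database. The `browser` and `feature` tables mirror the two tables above; the `hit` and `hit_feature` tables, the column names, and the helper functions are assumptions added for illustration. The parser also assumes that only the first parenthesized group holds the features.

```python
import sqlite3

# Hypothetical schema: browser/feature mirror the tables above,
# hit/hit_feature are illustrative link tables for the logged references.
SCHEMA = """
CREATE TABLE browser (browser_id INTEGER PRIMARY KEY, browser_text TEXT UNIQUE);
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, feature_text TEXT UNIQUE);
CREATE TABLE hit (hit_id INTEGER PRIMARY KEY, browser_id INTEGER REFERENCES browser);
CREATE TABLE hit_feature (hit_id INTEGER REFERENCES hit, feature_id INTEGER REFERENCES feature);
"""


def parse_user_agent(user_agent):
    """Split into the part before '(' and the ';'-separated parts inside."""
    browser, _, rest = user_agent.partition("(")
    inside = rest.split(")", 1)[0]  # assumption: first parenthesized group only
    features = [part.strip() for part in inside.split(";") if part.strip()]
    return browser.strip(), features


def get_id(conn, table, text):
    """Insert the text if it is new, then return its row id."""
    conn.execute(f"INSERT OR IGNORE INTO {table} ({table}_text) VALUES (?)", (text,))
    row = conn.execute(
        f"SELECT {table}_id FROM {table} WHERE {table}_text = ?", (text,)
    ).fetchone()
    return row[0]


def log_hit(conn, user_agent):
    """Record one log entry as references to a browser and its features."""
    browser, features = parse_user_agent(user_agent)
    browser_id = get_id(conn, "browser", browser)
    hit_id = conn.execute(
        "INSERT INTO hit (browser_id) VALUES (?)", (browser_id,)
    ).lastrowid
    for feature in set(features):  # count each feature once per hit
        conn.execute(
            "INSERT INTO hit_feature VALUES (?, ?)",
            (hit_id, get_id(conn, "feature", feature)),
        )


def feature_counts(conn):
    """Return {feature_text: number of hits containing it}."""
    rows = conn.execute(
        "SELECT f.feature_text, COUNT(*) FROM hit_feature hf "
        "JOIN feature f ON f.feature_id = hf.feature_id GROUP BY f.feature_text"
    )
    return dict(rows)
```

After `conn = sqlite3.connect(":memory:")` and `conn.executescript(SCHEMA)`, each log line can be passed to `log_hit`, and questions such as "how many hits include FunWebProducts" become ordinary `GROUP BY` queries.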


傲性难收 2024-08-22 13:11:39

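How about using a regular expression to parse the user agent string into its relevant parts? The basic spec for a user agent string is "[name]/[version]" or "[name] [version]". With this information we can use a regular expression like ([^\(\)\/\\;\n]+)([ ]((?=\d*\.+\d*|\d*_ +\d*)[\d\.Xx_]+)|[/]([^\(\)\/; \n]+)) to get match sets in which the first match is the [name] and the second match is the [version]. Of course, you have to strip the whitespace and the / from the second match in each set, or modify the regex to use lookbehinds (which several regex flavors don't support, so I did not include them here).

Once you have all of these tuples, you can manipulate and count them however you like.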
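To show what counting such tuples could look like, here is a small Python sketch. It does not use the regex quoted above; the `PRODUCT` pattern is a deliberately simplified stand-in (an assumption for this example) that treats a run of letters, spaces, and dots followed by `/` or a space and a version starting with a digit as one [name]/[version] pair. The function names are likewise hypothetical.

```python
import re
from collections import Counter

# Simplified stand-in pattern (an assumption, not the regex from the answer):
# name = letters, spaces, dots; then "/" or " "; then a version starting with a digit.
PRODUCT = re.compile(r"([.A-Za-z][A-Za-z .]*?)[/ ](\d[\w.]*)")


def name_version_pairs(user_agent):
    """Return the (name, version) tuples found in one user agent string."""
    return PRODUCT.findall(user_agent)


def count_pairs(user_agents):
    """Count by (name, version) and by name alone; each string counts once."""
    by_pair = Counter()
    by_name = Counter()
    for user_agent in user_agents:
        pairs = set(name_version_pairs(user_agent))
        by_pair.update(pairs)
        by_name.update({name for name, _ in pairs})
    return by_pair, by_name
```

With this, `by_name["MSIE"]` gives the number of entries for any Internet Explorer version, and `by_pair[("MSIE", "6.0")]` narrows it to IE6. Tokens without a version, such as FunWebProducts, are not captured by this simplified pattern and would need separate handling.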