Automatically tokenize user agent strings for statistics?

Posted on 2024-08-15 13:11:39

We keep track of user agent strings on our website. I want to run some statistics on them, to see how many IE6 users we have (so we know what we have to develop against), and also how many mobile users we have.

So we have log entries like this:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0; .NET CLR 2.0.50727)

Ideally, it would be neat to see counts for all the 'meaningful' strings, which probably just means strings longer than a certain length. For instance, I might like to see how many entries contain FunWebProducts, or .NET CLR, or .NET CLR 1.0.3705, but I don't want to see how many contain a semicolon. So I'm not necessarily looking for unique strings, but for all strings, including substrings. I would want to see the count for Mozilla, knowing that it includes the counts for Mozilla/5.0 and Mozilla/4.0. It would be nice to have a nested display for this, starting with the shortest strings and working its way down, perhaps something like:

4,2093 Mozilla
 1,093 Mozilla/5.0
    468 Mozilla/5.0 (Windows;
     47 Mozilla/5.0 (Windows; U 
 2,398 Mozilla/4.0

This sounds like a computer science homework problem. What would this be called? Does something like this already exist, or do I write my own?

Comments (3)

您的好友蓝忘机已上羡 2024-08-22 13:11:39

You are looking at a longest common substring problem, or, given your specific example above, a longest common prefix problem, which can be approached with a trie.
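As a rough illustration of the trie idea, here is a minimal Python sketch. It assumes whitespace-separated tokens as trie edges, and the names `TrieNode`, `build_trie`, and `render` are made up for this example. Each node counts how many user agent strings pass through it, which gives the nested display the question asks for.

```python
class TrieNode:
    """One token in the trie; count = number of strings sharing this prefix."""
    def __init__(self):
        self.count = 0
        self.children = {}

def build_trie(user_agents):
    root = TrieNode()
    for ua in user_agents:
        node = root
        # Assumption: split on single spaces; other delimiters stay inside tokens.
        for token in ua.split(" "):
            node = node.children.setdefault(token, TrieNode())
            node.count += 1
    return root

def render(node, prefix="", depth=0):
    """Return indented 'count prefix' lines, most frequent branch first."""
    lines = []
    ordered = sorted(node.children.items(), key=lambda kv: -kv[1].count)
    for token, child in ordered:
        path = token if not prefix else prefix + " " + token
        lines.append("  " * depth + f"{child.count} {path}")
        lines.extend(render(child, path, depth + 1))
    return lines

if __name__ == "__main__":
    sample = [
        "Mozilla/5.0 (Windows; U)",
        "Mozilla/5.0 (Windows; N)",
        "Mozilla/4.0 (compatible)",
    ]
    print("\n".join(render(build_trie(sample))))
```

Splitting on spaces only means `Mozilla/5.0` and `Mozilla/4.0` are separate top-level branches; splitting on `/` as well would group them under `Mozilla`.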

However, going from your example above, you probably don't even need to be efficient about this. Instead, simply:

  1. Tokenize strings on some punctuation subset, like [ ;/]

  2. Save each unique prefix of however many tokens, replacing the original delimiters

  3. For each prefix, get a count of which records it matches and save that
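The three steps above can be sketched in Python roughly as follows. This is only one possible reading: the delimiter set `[ ;/]` comes from step 1, while the function names `token_prefixes` and `count_prefixes` are hypothetical. Each record is counted at most once per prefix.

```python
import re
from collections import Counter

# Capturing group so re.split keeps the delimiters (needed to rebuild prefixes).
DELIMS = re.compile(r"([ ;/])")

def token_prefixes(ua):
    """Return every prefix of ua that ends on a non-empty token."""
    parts = DELIMS.split(ua)  # tokens at even indices, delimiters at odd ones
    prefixes = []
    current = ""
    for i, part in enumerate(parts):
        current += part
        if i % 2 == 0 and part:  # skip delimiters and empty tokens
            prefixes.append(current)
    return prefixes

def count_prefixes(user_agents):
    """Count, for each prefix, how many records start with it."""
    counts = Counter()
    for ua in user_agents:
        counts.update(set(token_prefixes(ua)))
    return counts

if __name__ == "__main__":
    logs = [
        "Mozilla/4.0 (compatible; MSIE 7.0)",
        "Mozilla/5.0 (Windows; U)",
    ]
    # Sorting by prefix text puts each prefix directly before its extensions.
    for prefix, n in sorted(count_prefixes(logs).items()):
        print(f"{n:>6} {prefix}")
```

Because a prefix built this way always ends on a token, trailing delimiters such as the `;` in `Mozilla/5.0 (Windows;` are not part of the reported strings.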

陌上芳菲 2024-08-22 13:11:39

If you break it up into the major name (part before the opening paren), and then store each part separated by semicolon as a child record, you could do whatever analysis you want. For example, store it in a relational database:

BrowserID   BrowserText
---------   -----------
1           Mozilla/4.0
2           Mozilla/5.0

FeatureID   FeatureText
---------   -----------
1           compatible
2           MSIE 7.0
3           Windows NT 5.1
4           FunWebProducts
5           .NET CLR 1.0.3705
6           .NET CLR 1.1.4322
7           Media Center PC 4.0
8           .NET CLR 2.0.50727

Then log references to browser and parts and you can do any type of analysis you want.
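A minimal sketch of that layout, using Python's built-in `sqlite3` module with an in-memory database. The `Browser` and `Feature` tables mirror the ones shown above; the `Hit` and `HitFeature` tables, and the helper names `get_id` and `load`, are assumptions added here to record the references. Only the first parenthesized group of each string is parsed.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Browser (BrowserID INTEGER PRIMARY KEY, BrowserText TEXT UNIQUE);
CREATE TABLE Feature (FeatureID INTEGER PRIMARY KEY, FeatureText TEXT UNIQUE);
-- Hypothetical tables: one row per logged request, plus its features.
CREATE TABLE Hit (HitID INTEGER PRIMARY KEY, BrowserID INTEGER REFERENCES Browser);
CREATE TABLE HitFeature (HitID INTEGER REFERENCES Hit, FeatureID INTEGER REFERENCES Feature);
"""

def get_id(conn, table, text):
    """Insert text into a lookup table if missing and return its ID."""
    conn.execute(f"INSERT OR IGNORE INTO {table} ({table}Text) VALUES (?)", (text,))
    row = conn.execute(
        f"SELECT {table}ID FROM {table} WHERE {table}Text = ?", (text,)
    ).fetchone()
    return row[0]

def load(user_agents):
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for ua in user_agents:
        name, _, rest = ua.partition("(")       # major name before the paren
        inside = rest.partition(")")[0]          # first parenthesized group only
        features = dict.fromkeys(f.strip() for f in inside.split(";") if f.strip())
        browser_id = get_id(conn, "Browser", name.strip())
        hit_id = conn.execute(
            "INSERT INTO Hit (BrowserID) VALUES (?)", (browser_id,)
        ).lastrowid
        for feature in features:
            conn.execute(
                "INSERT INTO HitFeature (HitID, FeatureID) VALUES (?, ?)",
                (hit_id, get_id(conn, "Feature", feature)),
            )
    return conn

FEATURE_COUNTS = """
SELECT f.FeatureText, COUNT(*) FROM HitFeature hf
JOIN Feature f ON f.FeatureID = hf.FeatureID
GROUP BY f.FeatureText ORDER BY COUNT(*) DESC, f.FeatureText
"""
```

Running `load(log_lines).execute(FEATURE_COUNTS).fetchall()` then gives the number of hits per feature, and similar joins against `Browser` answer the per-browser questions.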

傲性难收 2024-08-22 13:11:39

What about using a regex to parse the user agent string into its relevant component parts? The basic spec for a user agent string is '[name]/[version]' or '[name] [version]'. With that information we can use a regex like ([^\(\)\/\\;\n]+)([ ]((?=\d*\.+\d*|\d*_+\d*)[\d\.Xx_]+)|[/]([^\(\)\/; \n]+)) to get match sets where the first match in a set is the [name] and the second match in a set is the [version]. Of course, you'll have to strip the spaces and / from the second match in the set, or modify the regex to use lookbehind (which several regex flavors don't support, so I didn't include it here).

After you get all these tuples you can manipulate and count them however you want.
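For illustration, here is how the regex above might be applied with Python's `re` module. The function names are made up for this sketch. Instead of stripping the space or `/` from the second match, it reads the inner capture groups (3 and 4), and it trims whitespace from the name, since the name pattern can pick up a leading space.

```python
import re
from collections import Counter

# The regex from the answer, unchanged.
UA_PART = re.compile(
    r"([^\(\)\/\\;\n]+)([ ]((?=\d*\.+\d*|\d*_+\d*)[\d\.Xx_]+)|[/]([^\(\)\/; \n]+))"
)

def name_version_pairs(ua):
    """Return (name, version) tuples found in one user agent string."""
    pairs = []
    for m in UA_PART.finditer(ua):
        name = m.group(1).strip()
        # Group 3 is the '[name] [version]' form, group 4 the '[name]/[version]' form.
        version = m.group(3) or m.group(4)
        pairs.append((name, version))
    return pairs

def count_pairs(user_agents):
    """Count how many records contain each (name, version) pair."""
    counts = Counter()
    for ua in user_agents:
        counts.update(set(name_version_pairs(ua)))
    return counts
```

Note that tokens without a version, such as `compatible` or `FunWebProducts`, are not matched by this pattern, so they would need separate handling if they matter for the statistics.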
