Automatically tokenizing user agent strings for statistics?
We keep track of user agent strings on our website. I want to do some statistics on them, to see how many IE6 users we have (so we know what we have to develop against), and also how many mobile users we have.
So we have log entries like this:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0; .NET CLR 2.0.50727)
Ideally, it would be pretty neat to see all the 'meaningful' strings, which would probably just mean strings longer than a certain length. For instance, I might like to see how many entries contain FunWebProducts, or .NET CLR, or .NET CLR 1.0.3705, but I don't want to see how many contain a semicolon. So I'm not necessarily looking for unique strings, but for all strings, even substrings. I would want to see the count for all Mozilla entries, knowing that this includes the counts for Mozilla/5.0 and Mozilla/4.0. It would be nice if there were a nested display for this, starting with the shortest strings and working its way down. Something perhaps like:
4,2093 Mozilla
    1,093 Mozilla/5.0
        468 Mozilla/5.0 (Windows;
            47 Mozilla/5.0 (Windows; U
    2,398 Mozilla/4.0
This sounds like a computer science homework problem. What would this be called? Does something like this already exist, or do I have to write my own?
You are looking at a longest common substring problem, or, given your specific example above, a longest common prefix problem, which can be approached with a trie.
However, going from your example above, you probably don't even need to be efficient about this. Instead, simply:
1. Tokenize the strings on some subset of punctuation, like [ ;/]
2. Save each unique prefix made of any number of leading tokens, putting the original delimiters back between them
3. For each prefix, count how many records it matches and save that count
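The three steps above can be sketched roughly as follows in Python. This is only an illustration under some assumptions: the helper names (`token_prefixes`, `count_prefixes`) are made up for this example, the delimiter set is the [ ;/] suggested above, and a prefix always ends on a token rather than on a delimiter.

```python
import re
from collections import Counter

# Capturing group so re.split keeps the delimiters in the result list.
DELIMITERS = re.compile(r"([ ;/])")


def token_prefixes(user_agent):
    """Return every prefix of the string that ends on a token boundary."""
    parts = DELIMITERS.split(user_agent)  # tokens at even indexes, delimiters at odd
    prefixes = []
    for i in range(0, len(parts), 2):
        if parts[i] == "":
            # Empty token (two delimiters in a row, e.g. "; "): no new prefix.
            continue
        # Joining the leading parts puts the original delimiters back.
        prefixes.append("".join(parts[: i + 1]))
    return prefixes


def count_prefixes(user_agents):
    """Count, for each prefix, how many records start with it."""
    counts = Counter()
    for user_agent in user_agents:
        counts.update(token_prefixes(user_agent))
    return counts
```

Sorting the resulting keys alphabetically groups each prefix with its extensions, which is one simple way to approximate the nested display asked for in the question.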
If you break it up into the major name (part before the opening paren), and then store each part separated by semicolon as a child record, you could do whatever analysis you want. For example, store it in a relational database:
Then log references to browser and parts and you can do any type of analysis you want.
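The schema from this answer did not survive on this page, so the following is only one possible layout, sketched with Python's built-in sqlite3 module. All table, column, and function names here (`browser`, `part`, `hit`, `hit_part`, `log_hit`, and so on) are assumptions made for illustration, and the parser naively treats everything before the first opening paren as the browser name.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical schema: one row per browser name, one per distinct part,
    # and one "hit" per logged request that references both.
    conn.executescript("""
        CREATE TABLE browser (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
        CREATE TABLE part (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
        CREATE TABLE hit (id INTEGER PRIMARY KEY,
                          browser_id INTEGER NOT NULL REFERENCES browser(id));
        CREATE TABLE hit_part (hit_id INTEGER NOT NULL REFERENCES hit(id),
                               part_id INTEGER NOT NULL REFERENCES part(id));
    """)


def split_user_agent(user_agent):
    """Split into the major name and the semicolon-separated parts."""
    name, _, rest = user_agent.partition("(")
    parts = [p.strip() for p in rest.strip().rstrip(")").split(";") if p.strip()]
    return name.strip(), parts


def get_id(conn, table, name):
    # table is only ever one of our own fixed table names, never user input.
    conn.execute(f"INSERT OR IGNORE INTO {table} (name) VALUES (?)", (name,))
    return conn.execute(
        f"SELECT id FROM {table} WHERE name = ?", (name,)
    ).fetchone()[0]


def log_hit(conn, user_agent):
    name, parts = split_user_agent(user_agent)
    browser_id = get_id(conn, "browser", name)
    hit_id = conn.execute(
        "INSERT INTO hit (browser_id) VALUES (?)", (browser_id,)
    ).lastrowid
    for part in set(parts):
        conn.execute(
            "INSERT INTO hit_part VALUES (?, ?)", (hit_id, get_id(conn, "part", part))
        )


def count_hits_with_part(conn, part_name):
    """How many logged hits contain the given part?"""
    return conn.execute(
        "SELECT COUNT(DISTINCT hp.hit_id) FROM hit_part hp "
        "JOIN part p ON p.id = hp.part_id WHERE p.name = ?",
        (part_name,),
    ).fetchone()[0]
```

With the hits stored this way, a question such as "how many entries contain FunWebProducts" becomes a single join and count.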
What about using a regex to parse the user agent string into its relevant component parts? The basic spec for a user agent string is '[name]/[version]' or '[name] [version]'. With that information we can use a regex like
([^\(\)\/\\;\n]+)([ ]((?=\d*\.+\d*|\d*_+\d*)[\d\.Xx_]+)|[/]([^\(\)\/; \n]+))
to get match sets where the first match in a set is the [name] and the second match in a set is the [version]. Of course, you'll have to strip the spaces and / from the second match in the set, or modify the regex to use lookbehind (which several regex flavors don't support, so I didn't include it here). After you get all these tuples you can manipulate and count them however you want.
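As a rough sketch of how that regex might be applied, here is a small Python example. The function names are made up for illustration. Instead of stripping the space or / from the second group, it reads the version from the inner groups (3 and 4), and it strips the leading space that the name group tends to pick up inside the parentheses.

```python
import re
from collections import Counter

# The regex from the answer, split over two lines for readability.
UA_TOKEN = re.compile(
    r"([^\(\)\/\\;\n]+)"
    r"([ ]((?=\d*\.+\d*|\d*_+\d*)[\d\.Xx_]+)|[/]([^\(\)\/; \n]+))"
)


def parse_user_agent(user_agent):
    """Return (name, version) tuples found in one user agent string."""
    pairs = []
    for match in UA_TOKEN.finditer(user_agent):
        name = match.group(1).strip()
        # Group 3 holds a space-separated version, group 4 a slash-separated one.
        version = match.group(3) or match.group(4)
        pairs.append((name, version))
    return pairs


def count_tokens(user_agents):
    """Count how many records contain each (name, version) tuple."""
    counts = Counter()
    for user_agent in user_agents:
        # set() so a tuple repeated within one record is counted once.
        counts.update(set(parse_user_agent(user_agent)))
    return counts
```

Note that parts without a version, such as FunWebProducts or compatible, are not matched by this regex at all, so they would need separate handling if you want to count them.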