How can I optimize this UserAgent parser for loop in C#?
I am writing a C# program to analyze the number of browsers in the UserAgent column of a web server log. I wish to output the browser type, browser major version, and the number of hits.
How can I optimize this?
I am using regex to compare the UserAgent string with predefined strings to test for Firefox, Opera, etc. I then use regex to cancel out a possible mismatch. I then use a regex to obtain the major version. I use a struct to hold this information for each browser:
private struct Browser
{
    public int ID;
    public string name;
    public string regex_match;
    public string regex_not;
    public string regex_version;
    public int regex_group;
}
I then load the browser information and loop over all of the records for the UserAgent:
Browser[] browsers = new Browser[5];
for (int i = 0; i < 5; i++)
{
    browsers[i].ID = i;
}
browsers[0].name = "Firefox";
browsers[1].name = "Opera";
browsers[2].name = "Chrome";
browsers[3].name = "Safari";
browsers[4].name = "Internet Explorer";
browsers[0].regex_match = "(?i)firefox/([\\d\\.]*)";
browsers[1].regex_match = "(?i)opera/([\\d\\.]*)";
browsers[2].regex_match = "(?i)chrome/([\\d\\.]*)";
browsers[3].regex_match = "(?i)safari/([\\d\\.]*)";
browsers[4].regex_match = "(?i)msie([+_ ]|)([\\d\\.]*)";
browsers[0].regex_not = "(?i)flock";
browsers[1].regex_not = "";
browsers[2].regex_not = "";
browsers[3].regex_not = "(?i)android|arora|chrome|shiira";
browsers[4].regex_not = "(?i)webtv|omniweb|opera";
browsers[0].regex_version = "(?i)firefox/([\\d\\.]*)";
browsers[1].regex_version = "(?i)opera/([\\d\\.]*)";
browsers[2].regex_version = "(?i)chrome/([\\d\\.]*)";
browsers[3].regex_version = "(?i)version/([\\d\\.]*)";
browsers[4].regex_version = "(?i)msie([+_ ]|)([\\d\\.]*)";
browsers[0].regex_group = 1;
browsers[1].regex_group = 1;
browsers[2].regex_group = 1;
browsers[3].regex_group = 1;
browsers[4].regex_group = 2;
Dictionary<string, int> browser_counts = new Dictionary<string, int>();
for (int i = 0; i < 65000; i++)
{
    foreach (Browser b in browsers)
    {
        if (Regex.IsMatch(csUserAgent[i], b.regex_match))
        {
            if (b.regex_not != "")
            {
                if (Regex.IsMatch(csUserAgent[i], b.regex_not))
                {
                    continue;
                }
            }
            string strBrowser = b.name;
            if (b.regex_version != "")
            {
                string strVersion = Regex.Match(csUserAgent[i], b.regex_version).Groups[b.regex_group].Value;
                int intPeriod = strVersion.IndexOf('.');
                if (intPeriod > 0)
                {
                    strBrowser += " " + strVersion.Substring(0, intPeriod);
                }
            }
            if (!browser_counts.ContainsKey(strBrowser))
            {
                browser_counts.Add(strBrowser, 1);
            }
            else
            {
                browser_counts[strBrowser]++;
            }
            break;
        }
    }
}
You could
- construct a hashtable of the most frequently matched user agents and skip the regexes for those
- store a compiled new Regex(pattern, RegexOptions.Compiled) instead of just the pattern
- combine the regexes into a single regex and take advantage of RegexOptions.Compiled, RegexOptions.CultureInvariant and RegexOptions.IgnoreCase
- instead of matching twice (once with IsMatch and once with Matches), match once (Matches) and check whether the MatchCollection is empty

This is only a starting point - I might come up with more ideas on reading the code :)
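As a rough sketch of the first, second and last suggestions together, assuming the five rules from the question: the class names BrowserRule and BrowserCounter and their members are made up for illustration, and the (?i) prefix is replaced by RegexOptions.IgnoreCase. It uses Regex.Match and checks Success rather than Matches, which gives the same single-pass effect.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical replacement for the Browser struct: holds Regex objects
// built once, instead of pattern strings that are re-resolved per call.
public sealed class BrowserRule
{
    private const RegexOptions Options =
        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant;

    public readonly string Name;
    public readonly Regex Pattern;
    public readonly Regex Exclude;   // null when the rule has no exclusion
    public readonly Regex Version;   // same object as Pattern when the patterns are equal
    public readonly int Group;

    public BrowserRule(string name, string pattern, string exclude, string version, int group)
    {
        Name = name;
        Pattern = new Regex(pattern, Options);
        Exclude = string.IsNullOrEmpty(exclude) ? null : new Regex(exclude, Options);
        Version = version == pattern ? Pattern : new Regex(version, Options);
        Group = group;
    }
}

public static class BrowserCounter
{
    private static readonly BrowserRule[] Rules =
    {
        new BrowserRule("Firefox", @"firefox/([\d.]*)", "flock", @"firefox/([\d.]*)", 1),
        new BrowserRule("Opera", @"opera/([\d.]*)", "", @"opera/([\d.]*)", 1),
        new BrowserRule("Chrome", @"chrome/([\d.]*)", "", @"chrome/([\d.]*)", 1),
        new BrowserRule("Safari", @"safari/([\d.]*)", "android|arora|chrome|shiira", @"version/([\d.]*)", 1),
        new BrowserRule("Internet Explorer", @"msie([+_ ]|)([\d.]*)", "webtv|omniweb|opera", @"msie([+_ ]|)([\d.]*)", 2),
    };

    // Cache of already classified user agents; a null value means "no rule matched".
    private static readonly Dictionary<string, string> Cache = new Dictionary<string, string>();

    public static string Classify(string userAgent)
    {
        string label;
        if (Cache.TryGetValue(userAgent, out label)) return label;
        label = ClassifyUncached(userAgent);
        Cache[userAgent] = label;
        return label;
    }

    private static string ClassifyUncached(string userAgent)
    {
        foreach (BrowserRule rule in Rules)
        {
            Match m = rule.Pattern.Match(userAgent);          // match once, keep the result
            if (!m.Success) continue;
            if (rule.Exclude != null && rule.Exclude.IsMatch(userAgent)) continue;

            // Reuse the first match when the version pattern is the same regex.
            Match v = ReferenceEquals(rule.Version, rule.Pattern) ? m : rule.Version.Match(userAgent);
            string version = v.Groups[rule.Group].Value;
            int period = version.IndexOf('.');
            return period > 0 ? rule.Name + " " + version.Substring(0, period) : rule.Name;
        }
        return null;
    }

    public static Dictionary<string, int> Count(IEnumerable<string> userAgents)
    {
        var counts = new Dictionary<string, int>();
        foreach (string ua in userAgents)
        {
            string label = Classify(ua);
            if (label == null) continue;
            int n;
            counts.TryGetValue(label, out n);   // n stays 0 for a new label
            counts[label] = n + 1;
        }
        return counts;
    }
}
```

The cache only pays off when the log contains many repeated user-agent strings, and it grows without bound here, so a real run over a large log might want to cap it.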
Edit: one more idea.
E.g. you could have a single static regex instance like this:
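The snippet that belonged here is not preserved in this copy. A minimal sketch of what such a regex might look like follows; the pattern is an assumption built from the five patterns in the question, and only the group names browserid and version are taken from the answer text. The class name UserAgentRegex and the Describe helper are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UserAgentRegex
{
    // One compiled, case-insensitive regex for all five browsers.
    // .NET allows the same group name in several alternatives, so whichever
    // branch matches ends up in "browserid". The pattern is an assumption.
    public static readonly Regex Browser = new Regex(
        @"(?:(?<browserid>firefox|opera|chrome|safari)/|(?<browserid>msie)[+_ ]?)(?<version>[\d.]*)",
        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns e.g. "chrome 120", or null when no browser token is found.
    public static string Describe(string userAgent)
    {
        Match match = Browser.Match(userAgent);
        if (!match.Success) return null;

        string id = match.Groups["browserid"].Value.ToLowerInvariant();
        string version = match.Groups["version"].Value;
        int period = version.IndexOf('.');
        return period > 0 ? id + " " + version.Substring(0, period) : id;
    }
}
```

Two differences from the question's loop are worth keeping in mind: the combined regex reports the leftmost browser token in the string rather than the first rule in priority order, and for Safari it reads the number after safari/ instead of the one after version/.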
You can conveniently access the proper subgroups by using match.Groups["browserid"] and match.Groups["version"]. This nearly eliminates all the use for your list of Browser structs. The only thing the list is still needed for is the exclusion regex (regex_not). I suggest re-profiling with the single positive regex first, though, to see whether there is still a performance problem left before frying smaller fish.
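If the exclusions do turn out to matter, one possible way to keep them next to a single combined regex is a lookup keyed by browser id. The sketch below is only an illustration: the combined pattern, the ExclusionAwareParser class and the Identify method are assumptions, while the exclusion patterns are the regex_not values from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ExclusionAwareParser
{
    private const RegexOptions Options =
        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant;

    // Assumed combined pattern using the group names from the answer.
    private static readonly Regex Browser = new Regex(
        @"(?:(?<browserid>firefox|opera|chrome|safari)/|(?<browserid>msie)[+_ ]?)(?<version>[\d.]*)",
        Options);

    // regex_not values from the question, keyed by lower-cased browser id.
    private static readonly Dictionary<string, Regex> Exclusions = new Dictionary<string, Regex>
    {
        { "firefox", new Regex("flock", Options) },
        { "safari", new Regex("android|arora|chrome|shiira", Options) },
        { "msie", new Regex("webtv|omniweb|opera", Options) },
    };

    // Returns the lower-cased id of the first browser token that is not excluded,
    // or null when every token is excluded or none is found.
    public static string Identify(string userAgent)
    {
        for (Match m = Browser.Match(userAgent); m.Success; m = m.NextMatch())
        {
            string id = m.Groups["browserid"].Value.ToLowerInvariant();
            Regex exclude;
            if (Exclusions.TryGetValue(id, out exclude) && exclude.IsMatch(userAgent))
                continue;   // excluded: try the next browser token in the string
            return id;
        }
        return null;
    }
}
```

Walking the matches with NextMatch mirrors the continue in the question's loop, although tokens are tried in the order they appear in the string rather than in rule order.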
Benchmark
I wrote a benchmark (see below). I'll be updating this incrementally until I lose interest :) (I know my dataset isn't representative. If you upload a file, I'll test it with that.)
- Replacing the separate regexes with the single statically compiled regex speeds things up from 14s to 2.1s (a 6x speedup); this is with only the outermost match replaced.
- Replacing regex_not/regex_version with precompiled regexes did not make much of a difference on my test set (but I don't have actual matching user agents, so that makes sense).
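The benchmark code itself is not included in this copy. For anyone wanting to repeat the comparison on their own log, a minimal timing harness could look like the following; it is not the benchmark that produced the numbers above, and the RegexBenchmark class and its method names are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class RegexBenchmark
{
    // Runs `work` once and returns the elapsed wall-clock time.
    public static TimeSpan Time(Action work)
    {
        Stopwatch watch = Stopwatch.StartNew();
        work();
        watch.Stop();
        return watch.Elapsed;
    }

    // Baseline: pattern strings passed to the static Regex.IsMatch, as in the question.
    public static int CountWithPatternStrings(string[] userAgents, string[] patterns)
    {
        int hits = 0;
        foreach (string ua in userAgents)
            foreach (string pattern in patterns)
                if (Regex.IsMatch(ua, pattern)) { hits++; break; }
        return hits;
    }

    // Candidate: Regex instances constructed once with RegexOptions.Compiled.
    public static int CountWithCompiled(string[] userAgents, Regex[] regexes)
    {
        int hits = 0;
        foreach (string ua in userAgents)
            foreach (Regex regex in regexes)
                if (regex.IsMatch(ua)) { hits++; break; }
        return hits;
    }

    // Builds the compiled instances up front so construction is not timed per record.
    public static Regex[] Compile(string[] patterns)
    {
        Regex[] compiled = new Regex[patterns.Length];
        for (int i = 0; i < patterns.Length; i++)
            compiled[i] = new Regex(patterns[i], RegexOptions.Compiled);
        return compiled;
    }
}
```

Wrapping each variant in RegexBenchmark.Time(() => ...) over the same array of user agents, after one untimed warm-up call, gives two durations to compare; both variants should return the same hit count, which is a cheap check that the rewrite did not change behaviour.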