How can I optimize this UserAgent parser for loop in C#?

Posted on 2024-12-04 03:54:37

I am writing a C# program to analyze the number of browsers in the UserAgent column of a web server log. I wish to output the browser type, the browser major version, and the number of hits.

How can I optimize this?

I am using regex to compare the UserAgent string with predefined strings to test for Firefox, Opera, etc. I then use regex to cancel out a possible mismatch. I then use a regex to obtain the major version. I use a struct to hold this information for each browser:

private struct Browser
{
    public int ID;
    public string name;
    public string regex_match;
    public string regex_not;
    public string regex_version;
    public int regex_group;
}

I then load the browser information and loop over all of the records for the UserAgent:

Browser[] browsers = new Browser[5];
for (int i = 0; i < 5; i++)
{
    browsers[i].ID = i;
}
browsers[0].name = "Firefox";
browsers[1].name = "Opera";
browsers[2].name = "Chrome";
browsers[3].name = "Safari";
browsers[4].name = "Internet Explorer";
browsers[0].regex_match = "(?i)firefox/([\\d\\.]*)";
browsers[1].regex_match = "(?i)opera/([\\d\\.]*)";
browsers[2].regex_match = "(?i)chrome/([\\d\\.]*)";
browsers[3].regex_match = "(?i)safari/([\\d\\.]*)";
browsers[4].regex_match = "(?i)msie([+_ ]|)([\\d\\.]*)";
browsers[0].regex_not = "(?i)flock";
browsers[1].regex_not = "";
browsers[2].regex_not = "";
browsers[3].regex_not = "(?i)android|arora|chrome|shiira";
browsers[4].regex_not = "(?i)webtv|omniweb|opera";
browsers[0].regex_version = "(?i)firefox/([\\d\\.]*)";
browsers[1].regex_version = "(?i)opera/([\\d\\.]*)";
browsers[2].regex_version = "(?i)chrome/([\\d\\.]*)";
browsers[3].regex_version = "(?i)version/([\\d\\.]*)";
browsers[4].regex_version = "(?i)msie([+_ ]|)([\\d\\.]*)";
browsers[0].regex_group = 1;
browsers[1].regex_group = 1;
browsers[2].regex_group = 1;
browsers[3].regex_group = 1;
browsers[4].regex_group = 2;
Dictionary<string, int> browser_counts = new Dictionary<string, int>();
for (int i = 0; i < 65000; i++)
{
    foreach (Browser b in browsers)
    {
        if (Regex.IsMatch(csUserAgent[i], b.regex_match))
        {
            if (b.regex_not != "")
            {
                if (Regex.IsMatch(csUserAgent[i], b.regex_not))
                {
                    continue;
                }
            }
            string strBrowser = b.name;
            if (b.regex_version != "")
            {
                string strVersion = Regex.Match(csUserAgent[i], b.regex_version).Groups[b.regex_group].Value;
                int intPeriod = strVersion.IndexOf('.');
                if (intPeriod > 0)
                {
                    strBrowser += " " + strVersion.Substring(0, intPeriod);
                }
            }
            if (!browser_counts.ContainsKey(strBrowser))
            {
                browser_counts.Add(strBrowser, 1);
            }
            else
            {
                browser_counts[strBrowser]++;
            }
            break;
        }
    }
}

伴梦长久 2024-12-11 03:54:37

You could:

- construct a hashtable of the most frequently matched user-agent strings and skip the regexes entirely for those
- store a compiled new Regex(pattern, RegexOptions.Compiled) instead of just the pattern string
- combine the regexes into a single regex and take advantage of RegexOptions.Compiled together with RegexOptions.CultureInvariant and RegexOptions.IgnoreCase
- instead of matching twice (once with IsMatch and once with Match), match once and inspect the result (check Match.Success, or call Matches and check whether the MatchCollection is empty)
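The first suggestion can be sketched roughly as follows. This is only an illustration under assumptions: the class name UserAgentCache, the RegexRuns counter, and the single Firefox pattern are hypothetical stand-ins for the real classification logic. The point is that web logs tend to repeat the same user-agent strings many times, so the regex only needs to run once per distinct string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: remembers the label computed for each distinct
// user-agent string so the regex runs once per distinct string.
public static class UserAgentCache
{
    // Stand-in classifier for this sketch; the real code would use the
    // full set of browser patterns.
    private static readonly Regex Firefox = new Regex(
        "firefox/(\\d+)",
        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    private static readonly Dictionary<string, string> Cache =
        new Dictionary<string, string>(StringComparer.Ordinal);

    // Counts how often the regex actually ran (for illustration only).
    public static int RegexRuns { get; private set; }

    public static string Classify(string userAgent)
    {
        string label;
        if (Cache.TryGetValue(userAgent, out label))
        {
            return label; // cache hit: no regex work at all
        }

        RegexRuns++;
        Match m = Firefox.Match(userAgent);
        label = m.Success ? "Firefox " + m.Groups[1].Value : "Other";
        Cache[userAgent] = label;
        return label;
    }
}
```

Whether this pays off depends on how many distinct user-agent strings the log contains; with mostly unique strings the cache only adds memory.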

This is only a starting point - I might come up with more ideas on reading the code :)

Edit: one more thing:

avoid parsing the version with another regex. Only Safari requires special treatment according to your config. Try to 'catch' the version with the same regex as the browserid. (I'd simply make an exception for Safari for now.)

E.g. you could have a single static regex instance like this:

private static readonly Regex _regex = new Regex(
    "(?i)" 
    + "(?<browserid>(?:firefox/|opera/|chrome/|safari/|msie[+_ ]?))"
    + "(?<version>[\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);

You can conveniently access the proper subgroups by using match.Groups["browserid"] and match.Groups["version"]. This removes nearly all need for your list of Browser structs.
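As a rough sketch of how the named groups could replace the per-browser lookups, something along these lines might work. The class name SinglePassClassifier and the separate SafariVersion regex are assumptions made for illustration; the Safari exception follows the note above, and the exclusion patterns (regex_not) are deliberately left out here.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical single-pass classifier built on named groups.
public static class SinglePassClassifier
{
    private static readonly Regex BrowserRegex = new Regex(
        "(?<browserid>firefox/|opera/|chrome/|safari/|msie[+_ ]?)(?<version>[\\d.]*)",
        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Safari reports its browser version after "Version/", not after "Safari/".
    private static readonly Regex SafariVersion = new Regex(
        "version/(?<version>[\\d.]*)",
        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns e.g. "Chrome 16", or null when no known browser token is found.
    public static string Classify(string userAgent)
    {
        Match m = BrowserRegex.Match(userAgent);
        if (!m.Success)
        {
            return null;
        }

        // Normalize the token: lower-case it and drop the trailing separator.
        string id = m.Groups["browserid"].Value.ToLowerInvariant()
                     .TrimEnd('/', '+', '_', ' ');
        string version = m.Groups["version"].Value;

        string name;
        switch (id)
        {
            case "firefox": name = "Firefox"; break;
            case "opera": name = "Opera"; break;
            case "chrome": name = "Chrome"; break;
            case "safari":
                name = "Safari";
                version = SafariVersion.Match(userAgent).Groups["version"].Value;
                break;
            default: name = "Internet Explorer"; break; // only "msie" remains
        }

        // Same major-version rule as the question: text before the first '.'.
        int period = version.IndexOf('.');
        return period > 0 ? name + " " + version.Substring(0, period) : name;
    }
}
```

Note that the regex reports the leftmost browser token in the string, so a Chrome user agent (which lists Chrome/ before Safari/) is labelled as Chrome without needing an exclusion pattern.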

The only thing the list is still needed for is the exclusion regex (regex_not). I suggest re-profiling with the single positive regex first, though, to see whether there is still a performance problem left before frying smaller fish.

Benchmark

I wrote a benchmark (see below). I'll be updating this incrementally until I lose interest :) (I know my dataset isn't representative. If you upload a file, I'll test with that.)

- replacing the separate regexes with the single statically compiled regex cuts the run time from 14s to 2.1s (roughly a 6x speedup); this is with only the outermost match replaced
- replacing regex_not/regex_version with precompiled regexes did not make much of a difference with my test set (but my test data contains no actual matching user agents, so that makes sense)

using System;
using System.Linq;
using System.Collections.Generic;
using System.Text.RegularExpressions;


public class Program
{
    private struct Browser
    {
        public int ID;
        public string name;
        public Regex regex_match, regex_not, regex_version;
        public int regex_group;
    }

    private static readonly Regex _regex = new Regex("(?i)" 
        + "(?<browserid>(?:firefox/|opera/|chrome/|safari/|msie[+_ ]?))"
        + "(?<version>[\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);

    public static void Main(string[] args)
    {

        Browser[] browsers = new Browser[5];
        for (int i = 0; i < 5; i++)
        {
            browsers[i].ID = i;
        }
        browsers[0].name = "Firefox";
        browsers[1].name = "Opera";
        browsers[2].name = "Chrome";
        browsers[3].name = "Safari";
        browsers[4].name = "Internet Explorer";
        browsers[0].regex_match = new Regex("(?i)firefox/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[1].regex_match = new Regex("(?i)opera/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[2].regex_match = new Regex("(?i)chrome/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[3].regex_match = new Regex("(?i)safari/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[4].regex_match = new Regex("(?i)msie([+_ ]|)([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        // OPTIMIZATION #2
        browsers[0].regex_not = new Regex("(?i)flock", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[1].regex_not = null;
        browsers[2].regex_not = null;
        browsers[3].regex_not = new Regex("(?i)android|arora|chrome|shiira", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[4].regex_not = new Regex("(?i)webtv|omniweb|opera", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        // OPTIMIZATION #2
        browsers[0].regex_version = new Regex("(?i)firefox/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[1].regex_version = new Regex("(?i)opera/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[2].regex_version = new Regex("(?i)chrome/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[3].regex_version = new Regex("(?i)version/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[4].regex_version = new Regex("(?i)msie([+_ ]|)([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[0].regex_group = 1;
        browsers[1].regex_group = 1;
        browsers[2].regex_group = 1;
        browsers[3].regex_group = 1;
        browsers[4].regex_group = 2;
        Dictionary<string, int> browser_counts = new Dictionary<string, int>();

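        // The regex matches case-insensitively, so the captured text may be
        // "Firefox/" or "MSIE "; compare keys without regard to case.
        var lookupBrowserId = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase) {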
            { "firefox/", 0 },
            { "opera/", 1 },
            { "chrome/", 2 },
            { "safari/", 3 },
            { "msie+", 4 },
            { "msie_", 4 },
            { "msie ", 4 },
            { "msie", 4 },
        };

        for (int i=1; i<20; i++)
        foreach (var line in System.IO.File.ReadAllLines("/etc/dictionaries-common/words"))
        {
            // OPTIMIZATION #1 START
            Match match = _regex.Match(line);

            {
                if (match.Success)
                {
                    Browser b = browsers[lookupBrowserId[match.Groups["browserid"].Value]];
                    // OPTIMIZATION #1 END

                    // OPTIMIZATION #2
                    if (b.regex_not != null && b.regex_not.IsMatch(line))
                            continue;

                    string strBrowser = b.name;
                    if (b.regex_version != null)
                    {
                        // OPTIMIZATION #2
                        string strVersion = b.regex_version.Match(line).Groups[b.regex_group].Value;
                        int intPeriod = strVersion.IndexOf('.');
                        if (intPeriod > 0)
                        {
                            strBrowser += " " + strVersion.Substring(0, intPeriod);
                        }
                    }
                    if (!browser_counts.ContainsKey(strBrowser))
                    {
                        browser_counts.Add(strBrowser, 1);
                    }
                    else
                    {
                        browser_counts[strBrowser]++;
                    }
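                    // No break here: the per-browser loop is gone, so a break
                    // would abandon the remaining lines of the file.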
                }
            }
        }
    }
}
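The program above contains no timing code of its own. One way to reproduce the comparison on your own log data could look like the following sketch, which uses System.Diagnostics.Stopwatch. The RegexTiming class and its method names are hypothetical and chosen only for this illustration.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Hypothetical timing helper for comparing regex configurations.
public static class RegexTiming
{
    // Runs the given action once and returns the elapsed wall-clock time.
    public static TimeSpan Time(Action action)
    {
        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed;
    }

    // Counts matching lines over several passes so the run is long
    // enough to measure.
    public static int CountMatches(Regex regex, string[] lines, int passes)
    {
        int hits = 0;
        for (int p = 0; p < passes; p++)
        {
            foreach (string line in lines)
            {
                if (regex.IsMatch(line))
                {
                    hits++;
                }
            }
        }
        return hits;
    }
}
```

To compare, build one Regex with RegexOptions.Compiled and one without, load the lines once (outside the timed section), and pass a call to CountMatches for each regex to Time. Running each variant once before timing keeps the one-off compilation cost of RegexOptions.Compiled out of the measurement.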
