Code that parses text in files slows to a halt (C#)

Posted 2024-12-05 18:27:34

    private static void BuildDictionaryOfRequires(Regex exp, Dictionary<string, string> dictionary, DirectoryInfo dir)
    {
        var i = 0;
        var total = dir.EnumerateFiles("*.*", SearchOption.AllDirectories).
                                 Where(x => x.Extension == ".aspx" || x.Extension == ".ascx").Count();
        foreach (var item in dir.EnumerateFiles("*.*", SearchOption.AllDirectories).
                                 Where(x => x.Extension == ".aspx" || x.Extension == ".ascx"))
        {
 #if DEBUG
            Stopwatch sw = Stopwatch.StartNew();
 #endif

            var text = File.ReadAllText(item.FullName);

            MatchCollection matches = exp.Matches(text);
            foreach (Match match in matches)
            {
                var matchValue = match.Groups[0].Value;

                if (dictionary.ContainsKey(matchValue))
                {
                    dictionary[matchValue] = string.Format("{0},{1}", dictionary[matchValue], item.Name);
                }
                else
                {
                    dictionary.Add(matchValue, item.Name);
                }
            }

            Console.WriteLine(string.Format("Found matches in {0}.", item.Name));

 #if DEBUG
            sw.Stop();
            Console.WriteLine("Time used (float): {0} ms", sw.Elapsed.TotalMilliseconds);
 #endif


            Console.WriteLine(string.Format("{0} of {1}", (++i).ToString(), total));
        }
    }

The lambda finds about 232 files. It rips through the first 160 just fine, then slows to a crawl. I'm profiling the code now, but I'm wondering whether there is anything obvious I'm doing wrong.

The regex is:

    Regex exp = new Regex(@"dojo\.require\([""'][\w\.]+['""]\);?", RegexOptions.IgnoreCase | RegexOptions.Compiled);

All of the files are similar in length and structure.

Most files take less than 30 ms, but some take 11251 ms.

With the updated regex the whole process now takes 1700 ms. Phew!
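
Since the per-file stopwatch above covers both the file read and the regex scan, it may help to time the two separately. The following is only a sketch: `ParseTiming` and `Measure` are hypothetical names, not part of the original code. One detail worth knowing is that `Regex.Matches` is lazily evaluated, so the sketch reads `Count` to force the whole scan inside the timed region.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text.RegularExpressions;

public static class ParseTiming
{
    // Hypothetical helper: times the file read and the regex scan separately
    // for a single file and reports how many matches were found.
    public static (double ReadMs, double MatchMs, int Count) Measure(Regex exp, string path)
    {
        var sw = Stopwatch.StartNew();
        string text = File.ReadAllText(path);
        double readMs = sw.Elapsed.TotalMilliseconds;

        sw.Restart();
        // Matches() is lazy; reading Count forces the full scan to run here.
        int count = exp.Matches(text).Count;
        double matchMs = sw.Elapsed.TotalMilliseconds;

        return (readMs, matchMs, count);
    }
}
```

Calling `ParseTiming.Measure(exp, item.FullName)` inside the loop and printing both numbers should show whether the slow files are slow to read or slow to match.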


Comments (3)

只为一人 2024-12-12 18:27:34

Try simplifying your regex:

Regex exp = new Regex(@"dojo\.require\([""'][\w\.]+[""']\)", RegexOptions.IgnoreCase | RegexOptions.Compiled);

UPDATE: Then remove the semi-colon at the end if you want to match your example.
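
A small sketch of what the simplified pattern returns, assuming the matched text is used as the dictionary key as in the question. `SimplifiedPattern` and `FirstMatch` are hypothetical names used only for illustration.

```csharp
using System.Text.RegularExpressions;

public static class SimplifiedPattern
{
    // The pattern from the answer above: it stops at the closing parenthesis.
    public static readonly Regex Exp = new Regex(
        @"dojo\.require\([""'][\w\.]+[""']\)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the matched text, or null when the input has no dojo.require call.
    public static string FirstMatch(string input)
    {
        Match m = Exp.Match(input);
        return m.Success ? m.Value : null;
    }
}
```

Because the match ends at the closing parenthesis, a line written with a trailing semicolon and one written without it yield the same matched text, so both would land under the same dictionary key.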

∞觅青森が 2024-12-12 18:27:34

I think the offending piece of the regex is this:

(\w+\.?)*

Remove the ? and add \w* and you'll match all of the same strings, but much more efficiently.

(\w+\.?)* can match asdf many different ways:

  • asdf
  • asd,f
  • as,d,f
  • a,s,d,f
  • a,sd,f
  • a,s,df
  • a,sdf
  • as,df

I'm guessing that some of your files had a bunch of lines like this:

dojo.require('asdf')  //with no ;

Your regex would fail the greediest match, and then try every other combination until it eventually didn't get any match at all. This can get very expensive as the 'asdf' string grows.

Try using:

Regex exp = new Regex(@"dojo\.require\((\""|\')((\w+\.)*\w*)(\""|\')\);");
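
To see the difference for yourself, here is a rough sketch that runs both patterns under a match timeout. The `Ambiguous` pattern is my guess at what the earlier regex looked like (the full original is not shown here), and `BacktrackingDemo` and `TryIsMatch` are hypothetical names. The `Regex` constructor overload that takes a `TimeSpan` throws `RegexMatchTimeoutException` when a match attempt runs too long, which is a cheap guard against this kind of blow-up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BacktrackingDemo
{
    // Assumed shape of the earlier pattern: a run of word characters can be
    // split between iterations of the group in many different ways.
    public const string Ambiguous = @"dojo\.require\(([""'])(\w+\.?)*([""'])\);";

    // The rewrite from this answer: every iteration of the group must end in
    // a dot, so there is only one way to split a dotted name.
    public const string Unambiguous = @"dojo\.require\((\""|\')((\w+\.)*\w*)(\""|\')\);";

    // Returns true or false for a completed attempt, or null if the attempt
    // hit the timeout (a sign of runaway backtracking).
    public static bool? TryIsMatch(string pattern, string input, TimeSpan timeout)
    {
        var exp = new Regex(pattern, RegexOptions.None, timeout);
        try
        {
            return exp.IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            return null;
        }
    }
}
```

Feeding both patterns an input such as `dojo.require('` followed by a few dozen letters and `')` with no semicolon is the interesting case: the second pattern should report no match almost immediately, while the first may take far longer or hit the timeout, depending on the runtime.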

玻璃人 2024-12-12 18:27:34

A few things:

  1. Take out the DiscardBufferedData call. You don't need it, and it's expensive.
  2. Fix the double dispose. Note that Close also calls Dispose, so you can get rid of that as well.
  3. Actually, there is a File.ReadAllText method that can be used to get rid of the StreamReaders you are constructing and disposing of.
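
As a rough illustration of points 2 and 3, the two helpers below read a whole file either with a single `using` block around a `StreamReader` or with `File.ReadAllText`. The StreamReader code being discussed is not shown here, so `ReadFileSketch` and its methods are hypothetical names, not a copy of it.

```csharp
using System.IO;

public static class ReadFileSketch
{
    // Manual version: the using block disposes the reader exactly once,
    // so no separate Close or Dispose call is needed.
    public static string ReadWithStreamReader(string path)
    {
        using (var reader = new StreamReader(path))
        {
            return reader.ReadToEnd();
        }
    }

    // Shorter version: File.ReadAllText opens, reads, and closes the file.
    public static string ReadWithReadAllText(string path)
    {
        return File.ReadAllText(path);
    }
}
```

For plain text files like the .aspx and .ascx pages here, both helpers should return the same string, so the second is the simpler choice.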