Code to parse text in files slows to a halt (C#)

Posted 2024-12-05 18:27:34

    private static void BuildDictionaryOfRequires(Regex exp, Dictionary<string, string> dictionary, DirectoryInfo dir)
    {
        var i = 0;
        var total = dir.EnumerateFiles("*.*", SearchOption.AllDirectories).
                                 Where(x => x.Extension == ".aspx" || x.Extension == ".ascx").Count();
        foreach (var item in dir.EnumerateFiles("*.*", SearchOption.AllDirectories).
                                 Where(x => x.Extension == ".aspx" || x.Extension == ".ascx"))
        {
    #if DEBUG
            Stopwatch sw = Stopwatch.StartNew();
    #endif

            var text = File.ReadAllText(item.FullName);

            MatchCollection matches = exp.Matches(text);
            foreach (Match match in matches)
            {
                var matchValue = match.Groups[0].Value;

                if (dictionary.ContainsKey(matchValue))
                {
                    dictionary[matchValue] = string.Format("{0},{1}", dictionary[matchValue], item.Name);
                }
                else
                {
                    dictionary.Add(matchValue, item.Name);
                }
            }

            Console.WriteLine(string.Format("Found matches in {0}.", item.Name));

    #if DEBUG
            sw.Stop();
            Console.WriteLine("Time used (float): {0} ms", sw.Elapsed.TotalMilliseconds);
    #endif

            Console.WriteLine(string.Format("{0} of {1}", (++i).ToString(), total));
        }
    }

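The lambda finds about 232 files. The code rips through the first 160 just fine, then slows to a crawl. I'm profiling the code now, but I'm wondering whether I'm doing anything obviously wrong.

The regex is

    Regex exp = new Regex(@"dojo\.require\([""'][\w\.]+['""]\);?", RegexOptions.IgnoreCase | RegexOptions.Compiled);

All of the files are similar in length and structure.

Most files take less than 30 ms, but some take 11251 ms.

With the updated regex, the whole process now takes 1700 ms. Phew!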



只为一人 2024-12-12 18:27:34

Try simplifying your regex:

    Regex exp = new Regex(@"dojo\.require\([""'][\w\.]+[""']\)", RegexOptions.IgnoreCase | RegexOptions.Compiled);

UPDATE: Remove the semicolon at the end if you want to match your example.
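
To make the effect of dropping the semicolon concrete, here is a minimal sketch that runs the simplified pattern against a few sample lines. The class name `SimplifiedRegexDemo` and the sample strings are made up for illustration and are not taken from the asker's files.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SimplifiedRegexDemo
{
    // The simplified pattern from the answer above: no trailing semicolon.
    public static readonly Regex Exp = new Regex(
        @"dojo\.require\([""'][\w\.]+[""']\)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static void Main()
    {
        // Hypothetical sample lines (assumptions, for illustration only).
        string[] samples =
        {
            "dojo.require('dijit.form.Button');",  // quoted, with semicolon
            "dojo.require(\"dojo.parser\")",       // quoted, without semicolon
            "dojo.require(dijit.form.Button);"     // unquoted, so it should not match
        };

        foreach (string sample in samples)
        {
            Match m = Exp.Match(sample);
            Console.WriteLine("{0} -> {1}", sample, m.Success ? m.Value : "(no match)");
        }
    }
}
```

One side effect worth noting: because the match value no longer includes the semicolon, lines written with and without it would end up under the same dictionary key in the asker's method.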


∞觅青森が 2024-12-12 18:27:34

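I think the offending piece of the regex is this:

    (\w+\.?)*

Remove the ? and add \w*, and you'll match all of the same strings, but much more efficiently.

(\w+\.?)* can match asdf in many different ways:

asdf
asd,f
as,d,f
a,s,d,f
a,sd,f
a,s,df
a,sdf
as,df

I'm guessing that some of your files have a bunch of lines like this:

    dojo.require('asdf')  //with no ;

Your regex fails on the greediest match, then tries every other combination before finally concluding that there is no match at all. This gets very expensive as the 'asdf' string grows.

Try using:

    Regex exp = new Regex(@"dojo\.require\((\""|\')((\w+\.)*\w*)(\""|\')\);");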
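
The list of ways to split asdf can be reproduced with a short sketch. It does not run the regex engine; it only enumerates the ways a run of word characters can be cut into non-empty chunks, which is what each iteration of the group in (\w+\.?)* may consume when no dot is present. The class and method names (`BacktrackingDemo`, `Splits`) are hypothetical, and the count should be read as a rough upper bound on what a backtracking engine might try, not as a measurement of .NET's actual behavior.

```csharp
using System;
using System.Collections.Generic;

public static class BacktrackingDemo
{
    // Returns every way to cut `word` into non-empty chunks, joined by commas.
    public static List<string> Splits(string word)
    {
        var results = new List<string>();
        if (word.Length == 0)
        {
            results.Add("");   // one way to split nothing: no chunks at all
            return results;
        }

        // Choose the length of the first chunk, then split the remainder recursively.
        for (int len = 1; len <= word.Length; len++)
        {
            string head = word.Substring(0, len);
            foreach (string rest in Splits(word.Substring(len)))
            {
                results.Add(rest.Length == 0 ? head : head + "," + rest);
            }
        }
        return results;
    }

    public static void Main()
    {
        foreach (string split in Splits("asdf"))
        {
            Console.WriteLine(split);
        }

        // The number of splits doubles with every extra character: 2^(n-1) for n characters.
        Console.WriteLine(Splits("asdf").Count);        // 8
        Console.WriteLine(Splits("abcdefghij").Count);  // 512
    }
}
```

With (\w+\.)*\w* every iteration of the group has to end in a literal dot, so a given name can be consumed in only one way and this doubling does not arise.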


玻璃人 2024-12-12 18:27:34

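A couple of things:

Remove the DiscardBufferedData call. You don't need it, and it's expensive.
Fix the double dispose. Note that Close also calls Dispose, so you can get rid of that call as well.
Actually, there's a File.ReadAllText method you can use to get rid of the StreamReaders you're constructing and disposing.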
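
As a rough illustration of these points, the sketch below reads a file two ways: with a StreamReader wrapped in a single using block (disposed exactly once, with no separate Close call and no DiscardBufferedData) and with File.ReadAllText. The earlier StreamReader version of the asker's code is not shown on this page, so the class name `ReadFileDemo`, both method names, and the temp-file content are assumptions made for illustration.

```csharp
using System;
using System.IO;

public static class ReadFileDemo
{
    // Manual approach: the using block disposes the reader once on exit,
    // so an explicit Close() call is unnecessary.
    public static string ReadWithReader(string path)
    {
        using (var reader = new StreamReader(path))
        {
            return reader.ReadToEnd();
        }
    }

    // One-call alternative mentioned in the answer.
    public static string ReadWithHelper(string path)
    {
        return File.ReadAllText(path);
    }

    public static void Main()
    {
        // Hypothetical temp file, used only to compare the two approaches.
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, "dojo.require('dojo.parser');");
            Console.WriteLine(ReadWithReader(path) == ReadWithHelper(path));
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

The code in the question as posted already uses File.ReadAllText, so this mainly documents what the suggestion replaces.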
