Code parsing text from files slows to a halt (C#)
private static void BuildDictionaryOfRequires(Regex exp, Dictionary<string, string> dictionary, DirectoryInfo dir)
{
    var i = 0;
    var total = dir.EnumerateFiles("*.*", SearchOption.AllDirectories)
        .Where(x => x.Extension == ".aspx" || x.Extension == ".ascx").Count();
    foreach (var item in dir.EnumerateFiles("*.*", SearchOption.AllDirectories)
        .Where(x => x.Extension == ".aspx" || x.Extension == ".ascx"))
    {
#if DEBUG
        Stopwatch sw = Stopwatch.StartNew();
#endif
        var text = File.ReadAllText(item.FullName);
        MatchCollection matches = exp.Matches(text);
        foreach (Match match in matches)
        {
            var matchValue = match.Groups[0].Value;
            if (dictionary.ContainsKey(matchValue))
            {
                dictionary[matchValue] = string.Format("{0},{1}", dictionary[matchValue], item.Name);
            }
            else
            {
                dictionary.Add(matchValue, item.Name);
            }
        }
        Console.WriteLine(string.Format("Found matches in {0}.", item.Name));
#if DEBUG
        sw.Stop();
        Console.WriteLine("Time used (float): {0} ms", sw.Elapsed.TotalMilliseconds);
#endif
        Console.WriteLine(string.Format("{0} of {1}", (++i).ToString(), total));
    }
}
There are about 232 files the lambda finds. It rips through 160 just fine, then comes to a crawl. I'm profiling the code now, but wondering if there is anything obvious I'm doing wrong.
The regex is
Regex exp = new Regex(@"dojo\.require\([""'][\w\.]+['""]\);?", RegexOptions.IgnoreCase | RegexOptions.Compiled);
All of the files are similar length and similar structure.
Most files take less than 30 ms, but some take 11251 ms.
Update: with the updated regex the whole process takes 1700 ms now. Phew!
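As a side note on the enumeration (separate from the regex issue): the directory tree above is walked twice, once for the count and once for the loop. A minimal sketch of collecting the filtered list a single time and reusing it; the class and method names here are hypothetical, not from the original code:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class RequireScanner
{
    // Walk the tree once and materialize the filtered list, so the
    // same list serves both the "total" count and the processing loop.
    public static List<FileInfo> GetTargetFiles(DirectoryInfo dir) =>
        dir.EnumerateFiles("*.*", SearchOption.AllDirectories)
           .Where(x => x.Extension == ".aspx" || x.Extension == ".ascx")
           .ToList();

    static void Main()
    {
        // Throwaway directory so the sketch is self-contained.
        var root = Directory.CreateDirectory(
            Path.Combine(Path.GetTempPath(), Path.GetRandomFileName()));
        File.WriteAllText(Path.Combine(root.FullName, "a.aspx"), "");
        File.WriteAllText(Path.Combine(root.FullName, "b.ascx"), "");
        File.WriteAllText(Path.Combine(root.FullName, "c.txt"), "");

        var files = GetTargetFiles(root);
        Console.WriteLine($"{files.Count} files to scan"); // prints "2 files to scan"
    }
}
```

This does not explain the slowdown (the regex does), but it halves the directory I/O and gives the count for free.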
3 Answers
Try simplifying your regex:
UPDATE: Then remove the semi-colon at the end if you want to match your example.
I think the current offending piece is this part of the regex: (\w+\.?)*. It can match asdf in many different ways. Remove the ? and add \w* and you'll match all of the same strings, but much more efficiently.
I'm guessing that some of your files had a bunch of lines like this, with a long string between the quotes. Your regex would fail the greediest match, and then try every other combination until it eventually didn't get any match at all. This can get very expensive as the 'asdf' string grows. Try using: (\w+\.)*\w*
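The backtracking cost can be observed directly. Here is a small sketch (the input string and timeout value are made up for illustration) comparing a nested-quantifier pattern against the character-class version from the question, guarded by .NET's regex match timeout:

```csharp
using System;
using System.Text.RegularExpressions;

static class BacktrackingDemo
{
    // Returns true if matching finishes within the timeout,
    // false if the engine blows the backtracking budget.
    public static bool FinishesQuickly(string pattern, string input)
    {
        var re = new Regex(pattern, RegexOptions.None, TimeSpan.FromMilliseconds(250));
        try { re.IsMatch(input); return true; }
        catch (RegexMatchTimeoutException) { return false; }
    }

    static void Main()
    {
        // A long word-character run with no closing quote, so the
        // overall match is forced to fail.
        string input = "dojo.require('" + new string('a', 40) + "!";

        // (\w+\.?)* can split the 'aaa...' run into pieces in
        // exponentially many ways; on failure the engine retries them all.
        Console.WriteLine(FinishesQuickly(
            @"dojo\.require\(['""](\w+\.?)*['""]\)", input));   // False (times out)

        // [\w.]+ matches the same strings, but there is only one way
        // to consume the run, so the failure is linear.
        Console.WriteLine(FinishesQuickly(
            @"dojo\.require\(['""][\w.]+['""]\)", input));      // True
    }
}
```

This is the classic catastrophic-backtracking shape: a quantifier nested inside another quantifier where the inner piece can absorb varying amounts of the same text.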
A few things:
- it's expensive.
- calls Dispose, so you can get rid of that as well.
- There is a File.ReadAllText method that can be used to get rid of the StreamReaders you are constructing and disposing of.
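For that last point, a minimal sketch of the two styles side by side (the temp file and its contents are made up; the posted code already uses the helper):

```csharp
using System;
using System.IO;

static class ReadAllTextDemo
{
    static void Main()
    {
        var path = Path.GetTempFileName();
        File.WriteAllText(path, "dojo.require('dijit.form.Button');");

        // Manual version: the using block disposes the reader for you.
        string viaReader;
        using (var reader = new StreamReader(path))
        {
            viaReader = reader.ReadToEnd();
        }

        // File.ReadAllText opens, reads, and disposes internally,
        // so the StreamReader plumbing disappears entirely.
        string viaHelper = File.ReadAllText(path);

        Console.WriteLine(viaReader == viaHelper); // True
    }
}
```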