Why is Group.Value always the last matched group string?
Recently, I found one C# Regex API really annoying.
I have the regular expression (([0-9]+)|([a-z]+))+ and I want to find all the matched strings. The code is below.
string regularExp = "(([0-9]+)|([a-z]+))+";
string str = "abc123xyz456defFOO";
Match match = Regex.Match(str, regularExp, RegexOptions.None);
int matchCount = 0;
while (match.Success)
{
    Console.WriteLine("Match" + (++matchCount));
    Console.WriteLine("Match group count = {0}", match.Groups.Count);
    for (int i = 0; i < match.Groups.Count; i++)
    {
        Group group = match.Groups[i];
        Console.WriteLine("Group" + i + "='" + group.Value + "'");
    }
    match = match.NextMatch();
    Console.WriteLine("go to next match");
    Console.WriteLine();
}
The output is:
Match1
Match group count = 4
Group0='abc123xyz456def'
Group1='def'
Group2='456'
Group3='def'
go to next match
It seems that every group.Value is the last matched string ("def" and "456"). It took me some time to figure out that I should rely on group.Captures instead of group.Value.
string regularExp = "(([0-9]+)|([a-z]+))+";
string str = "abc123xyz456def";
//Console.WriteLine(str);
Match match = Regex.Match(str, regularExp, RegexOptions.None);
int matchCount = 0;
while (match.Success)
{
    Console.WriteLine("Match" + (++matchCount));
    Console.WriteLine("Match group count = {0}", match.Groups.Count);
    for (int i = 0; i < match.Groups.Count; i++)
    {
        Group group = match.Groups[i];
        Console.WriteLine("Group" + i + "='" + group.Value + "'");
        CaptureCollection cc = group.Captures;
        for (int j = 0; j < cc.Count; j++)
        {
            Capture c = cc[j];
            System.Console.WriteLine(" Capture" + j + "='" + c + "', Position=" + c.Index);
        }
    }
    match = match.NextMatch();
    Console.WriteLine("go to next match");
    Console.WriteLine();
}
This will output:
Match1
Match group count = 4
Group0='abc123xyz456def'
 Capture0='abc123xyz456def', Position=0
Group1='def'
 Capture0='abc', Position=0
 Capture1='123', Position=3
 Capture2='xyz', Position=6
 Capture3='456', Position=9
 Capture4='def', Position=12
Group2='456'
 Capture0='123', Position=3
 Capture1='456', Position=9
Group3='def'
 Capture0='abc', Position=0
 Capture1='xyz', Position=6
 Capture2='def', Position=12
go to next match
Now I am wondering why the API is designed like this. Why does Group.Value return only the last matched string? This design doesn't look good.
Comments (1)
The primary reason is historical: regexes have always worked that way, going back to Perl and beyond. But it's not really bad design. Usually, if you want every match like that, you just leave off the outermost quantifier (+ in this case) and use the Matches() method instead of Match(). Every regex-enabled language provides a way to do that: in Perl or JavaScript you do the match in /g mode; in Ruby you use the scan method; in Java you call find() repeatedly until it returns false. Similarly, if you're doing a replace operation, you can plug the captured substrings back in as you go with placeholders ($1, $2 or \1, \2, depending on the language).

On the other hand, I know of no other Perl 5-derived regex flavor that provides the ability to retrieve intermediate capture-group matches like .NET does with its CaptureCollections. And I'm not surprised: it's actually very seldom that you really need to capture all the matches in one go like that. And think of all the storage and/or processing power it can take to keep track of all those intermediate matches. It is a nice feature though.
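To make the suggestion above concrete, here is a minimal sketch of the question's pattern with the outermost + dropped, run through Regex.Matches() and Regex.Replace(). The class name TokenDemo and the helpers Tokens and Bracket are made up for illustration and are not part of the original post or of the .NET API.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenDemo
{
    // Same alternatives as the question's pattern, minus the outer "+".
    private const string Pattern = "([0-9]+)|([a-z]+)";

    // Hypothetical helper: one Match per token instead of one Match
    // whose repeated group keeps only its last capture in Value.
    public static List<string> Tokens(string input)
    {
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            // Group 1 succeeds for digit runs, group 2 for letter runs.
            string kind = m.Groups[1].Success ? "digits" : "letters";
            tokens.Add(kind + ":" + m.Value + "@" + m.Index);
        }
        return tokens;
    }

    // Hypothetical helper: plugs captures back in with $1/$2 placeholders.
    // In .NET, a group that did not participate substitutes as an empty string.
    public static string Bracket(string input)
    {
        return Regex.Replace(input, Pattern, "[$1$2]");
    }

    public static void Main()
    {
        foreach (string t in Tokens("abc123xyz456defFOO"))
            Console.WriteLine(t);
        Console.WriteLine(Bracket("abc123xyz456defFOO"));
    }
}
```

With the question's first input this prints letters:abc@0, digits:123@3, letters:xyz@6, digits:456@9 and letters:def@12, followed by [abc][123][xyz][456][def]FOO. Each token arrives as its own Match, so Group.Value is all you need and Captures never comes into play.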