Precedence rules for matching groups with regular expressions
Consider the following .NET regular expression:
^(REF)?(.{1,10})-(\d{12})-(\d+)$
It defines four groups that I'm interested in and that I will analyse separately.
Now, consider an input string for this regexp:
REFmisc03-123456789012-213
It is possible to match it like this:
(REF)(misc03)-(123456789012)-(213)
And it is also possible to match it like this:
()(REFmisc03)-(123456789012)-(213)
Is it documented what way will be preferred by the regexp engine, or is it random?
Comments (1)
It is not random. This boils down to how the regex engine interprets quantifiers, and to potential backtracking. By quantifier I am referring to the ? in (REF)?. According to MSDN, ? is greedy and ?? is lazy. Both match zero or one time, but they affect how the matching is performed. Backtracking is also covered in the MSDN documentation.
Another useful resource to learn more about backtracking can be found here: Possessive Quantifiers.
To answer your question directly, we can compare both approaches.
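Before looking at the full pattern, the difference between ? and ?? can be seen on a tiny example. This is a minimal sketch that uses Python's re module as a stand-in for .NET's Regex (an assumption of this sketch; both are backtracking engines that treat ? as greedy and ?? as lazy). The input "REFabc" and the simplified pattern are made up for illustration. Note that Python reports a group that did not take part in the match as None, whereas .NET reports an empty string with Success = false.

```python
import re

# Python's re is used here as a stand-in for .NET's Regex (assumption).
text = "REFabc"  # hypothetical input, chosen only for illustration

# Greedy ?: the engine first tries to match the optional group once.
greedy = re.match(r"(REF)?(.*)", text)

# Lazy ??: the engine first tries to skip the optional group.
lazy = re.match(r"(REF)??(.*)", text)

print(greedy.groups())  # ('REF', 'abc')
print(lazy.groups())    # (None, 'REFabc')
```

In both cases the overall match succeeds. The quantifier only decides which alternative the engine tries first.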
Greedy approach
Original input:
REFmisc03-123456789012-213
Using (REF)? will match your text with 4 groups (excluding the initial group that holds the entire match), and all groups will be successfully matched:
Group 1: "REF"
Group 2: "misc03"
Group 3: "123456789012"
Group 4: "213"
This corresponds to your first possible match scenario (loosely defined):
(REF)(misc03)-(123456789012)-(213)
As long as the "misc..." portion is 1-10 characters long, the match will be the same, with all 1-10 characters appearing in the second group. The REF portion will always be matched in the first group.
New input:
REF-123456789012-213
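The "misc..." portion is absent. Since (REF)? is optional and (.{1,10}) isn't, the regex engine will use the "REF" input to satisfy the latter (required) portion of the pattern and disregard the former (optional) portion. More precisely, the engine first tries "REF" in the first group, finds that (.{1,10})- cannot match what follows, and backtracks. This will yield the following group values:
Group 1: "" (empty string, Success property = false)
Group 2: "REF"
Group 3: "123456789012"
Group 4: "213"
Lazy approach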
Original input:
REFmisc03-123456789012-213
By using (REF)?? and keeping the rest of your pattern the same, the quantifier becomes lazy, and this returns 4 groups with these values:
Group 1: "" (empty string, Success property = false)
Group 2: "REFmisc03"
Group 3: "123456789012"
Group 4: "213"
This corresponds to your second possible match scenario:
()(REFmisc03)-(123456789012)-(213)
Since the first group is optional with a lazy quantifier, the regex engine is able to disregard it. Since "REFmisc03" is 9 characters long, the engine proceeds to lump "REF" in with "misc03" because together they fit into the (.{1,10}) group.
New input:
REF-123456789012-213
This behaves similarly to the greedy pattern and the same reasoning applies.
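The greedy and lazy results described so far can be checked with a short script. This is a sketch that again assumes Python's re module as a stand-in for .NET's Regex. The patterns and inputs are the ones from the question. The helper name groups_for is made up for this example. A None entry corresponds to a .NET group whose Success property is false.

```python
import re

# Patterns from the question; Python's re stands in for .NET's Regex (assumption).
GREEDY = re.compile(r"^(REF)?(.{1,10})-(\d{12})-(\d+)$")
LAZY = re.compile(r"^(REF)??(.{1,10})-(\d{12})-(\d+)$")


def groups_for(pattern, text):
    """Return the captured groups as a tuple, or None if there is no match."""
    m = pattern.match(text)
    return m.groups() if m else None


for text in ("REFmisc03-123456789012-213", "REF-123456789012-213"):
    print(text)
    print("  greedy:", groups_for(GREEDY, text))
    print("  lazy:  ", groups_for(LAZY, text))

# REFmisc03-123456789012-213
#   greedy: ('REF', 'misc03', '123456789012', '213')
#   lazy:   (None, 'REFmisc03', '123456789012', '213')
# REF-123456789012-213
#   greedy: (None, 'REF', '123456789012', '213')
#   lazy:   (None, 'REF', '123456789012', '213')
```

For the second input both patterns end up with the same groups, because "REF" is the only text available to satisfy the required (.{1,10}) group.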
Another new input:
REFmisc0345-123456789012-213
In this example the "misc0345" portion is 8 characters long. Although the pattern uses a lazy quantifier, it can't fit "REFmisc0345" into the second group because its 11 characters exceed the 10-character limit. The regex engine will backtrack and match "REF" in the first group and "misc0345" in the second group:
Group 1: "REF"
Group 2: "misc0345"
Group 3: "123456789012"
Group 4: "213"
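This backtracking case can be reproduced as well. The sketch below assumes Python's re module as a stand-in for .NET's Regex. The second input, with a prefix of exactly 10 characters, is not from the original answer and is added only to show where the boundary lies.

```python
import re

# Lazy variant of the pattern; Python's re stands in for .NET's Regex (assumption).
LAZY = re.compile(r"^(REF)??(.{1,10})-(\d{12})-(\d+)$")

# "REFmisc0345" is 11 characters, so skipping group 1 fails and the engine
# backtracks to match "REF" in group 1 after all.
too_long = LAZY.match("REFmisc0345-123456789012-213")
print(too_long.groups())  # ('REF', 'misc0345', '123456789012', '213')

# Illustrative extra input: "REFmisc034" is exactly 10 characters, so the lazy
# quantifier can still skip group 1.
fits = LAZY.match("REFmisc034-123456789012-213")
print(fits.groups())  # (None, 'REFmisc034', '123456789012', '213')
```

So a lazy quantifier expresses a preference, not a guarantee: the engine still falls back to matching the optional group when that is the only way for the whole pattern to succeed.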