Precedence rules for regular expression group matching

Posted 2025-01-01 14:26:05

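Consider the following .NET regular expression:

^(REF)?(.{1,10})-(\d{12})-(\d+)$

It defines four groups that I am interested in and will analyse separately.

Now, consider this input string for the regular expression:

REFmisc03-123456789012-213

It can be matched like this:

(REF)(misc03)-(123456789012)-(213)

It can also be matched like this:

()(REFmisc03)-(123456789012)-(213)

Is it documented which way the regex engine will prefer, or is it random?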



悸初 2025-01-08 14:26:05

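It is not random. It comes down to how the regex engine interprets quantifiers, and to backtracking. By quantifier I mean the ? in (REF)?. According to MSDN:

Ordinarily, quantifiers are greedy; they cause the regular expression
engine to match as many occurrences of particular patterns as
possible. Appending the ? character to a quantifier makes it lazy; it
causes the regular expression engine to match as few occurrences as
possible.

In other words, ? is greedy and ?? is lazy. Both match zero or one time, but they differ in what the engine tries first: greedy ? tries one occurrence before zero, while lazy ?? tries zero occurrences before one.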
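The difference can be seen on a tiny pattern. This is a minimal sketch that uses Python's re module as a stand-in for the .NET engine (an assumption of this example; both are backtracking engines that support ? and ??). The pattern and input here are illustrative and are not taken from the question. Note that Python reports a group that did not participate as None, where .NET gives a Group whose Success property is false.

```python
import re

# Greedy ?: the engine tries to match "REF" first and only skips it if it must.
greedy = re.match(r"(REF)?(.*)", "REFabc")

# Lazy ??: the engine tries to skip "REF" first and only matches it if it must.
lazy = re.match(r"(REF)??(.*)", "REFabc")

print(greedy.groups())  # ('REF', 'abc')
print(lazy.groups())    # (None, 'REFabc')
```

Both patterns match the whole input; only the division of text between the groups changes.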

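With regard to backtracking, MSDN says:

the regular expression engine tries to fully match optional or
alternative subexpressions. When it advances to the next language
element in the subexpression and the match is unsuccessful, the
regular expression engine can abandon a portion of its successful
match and return to an earlier saved state in the interest of matching
the regular expression as a whole with the input string. This process
of returning to a previous saved state to find a match is known as
backtracking.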
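To make the quoted description concrete, here is a small sketch of backtracking inside the (.{1,10}) part of the pattern. It assumes Python's re module behaves like the .NET engine for this fragment (a stand-in, not the .NET API), and the shortened pattern is a hypothetical cut-down of the one in the question.

```python
import re

# .{1,10} is greedy, so it first grabs ten characters ("misc03-123").
# The literal "-" that follows cannot match there, so the engine gives
# characters back one at a time until "-" matches after "misc03".
m = re.match(r"(.{1,10})-(\d{12})", "misc03-123456789012")

print(m.group(1))  # misc03
print(m.group(2))  # 123456789012
```

The final groups show only the end state; the abandoned attempts leave no trace in the result.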

Another useful resource for learning more about backtracking: Possessive Quantifiers.

To answer your question directly, we can compare both approaches.

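Greedy approach

Original input: REFmisc03-123456789012-213

Using (REF)? matches your text with 4 groups (not counting group 0, which holds the entire match), and all four groups match successfully:

REF
misc03
123456789012
213

This corresponds to your first possible match scenario (loosely defined):

(REF)(misc03)-(123456789012)-(213)

As long as the "misc..." portion is 1-10 characters long, the result is the same, with all 1-10 characters appearing in the second group. The REF portion is always matched by the first group.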

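New input: REF-123456789012-213

The "misc..." portion is absent. Since (REF)? is optional and (.{1,10}) is not, the regex engine uses the "REF" input to satisfy the latter (required) part of the pattern and leaves the former (optional) part unmatched. It gets there by backtracking: it first matches "REF" with the first group, fails to match the rest of the pattern, and then retries with the first group skipped. This yields the following group values:

"" (empty string, Success property = false)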
REF
123456789012
213
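The two greedy cases above can be checked with a short script. This is a sketch that assumes Python's re module handles this pattern the same way as .NET (a stand-in for the Regex class, not the .NET API itself). An unmatched group shows up as None in Python, which corresponds to Success = false in .NET.

```python
import re

# The pattern from the question, with the greedy (REF)? left unchanged.
GREEDY = re.compile(r"^(REF)?(.{1,10})-(\d{12})-(\d+)$")

with_misc = GREEDY.match("REFmisc03-123456789012-213").groups()
without_misc = GREEDY.match("REF-123456789012-213").groups()

print(with_misc)     # ('REF', 'misc03', '123456789012', '213')
print(without_misc)  # (None, 'REF', '123456789012', '213')
```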

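Lazy approach

Original input: REFmisc03-123456789012-213

Using (REF)?? and keeping the rest of your pattern the same makes the quantifier lazy, and the match returns 4 groups with these values:

"" (empty string, Success property = false)
REFmisc03
123456789012
213

This corresponds to your second possible match scenario:

()(REFmisc03)-(123456789012)-(213)

Since the first group is optional with a lazy quantifier, the regex engine tries skipping it first. Since "REFmisc03" is 9 characters long, the engine lumps "REF" in with "misc03", because together they fit into the (.{1,10}) group.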

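New input: REF-123456789012-213

This yields the same groups as the greedy pattern, and the same reasoning applies: (.{1,10}) is required, so "REF" has to go to the second group.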
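The lazy variant can be checked on the same two inputs. As before, this is a sketch that assumes Python's re module mirrors .NET's handling of ?? for this pattern; None in the output corresponds to a .NET group with Success = false.

```python
import re

# Same pattern as the question, except that the first quantifier is lazy.
LAZY = re.compile(r"^(REF)??(.{1,10})-(\d{12})-(\d+)$")

lazy_with_misc = LAZY.match("REFmisc03-123456789012-213").groups()
lazy_without_misc = LAZY.match("REF-123456789012-213").groups()

print(lazy_with_misc)     # (None, 'REFmisc03', '123456789012', '213')
print(lazy_without_misc)  # (None, 'REF', '123456789012', '213')
```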

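Another new input: REFmisc0345-123456789012-213

In this example the "misc0345" portion is 8 characters long. Although the pattern uses a lazy quantifier, it cannot fit "REFmisc0345" into the second group, because those 11 characters exceed the 10-character limit. The regex engine backtracks and matches "REF" in the first group and "misc0345" in the second group:

REF
misc0345
123456789012
213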
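The 10-character limit is what decides whether the lazy pattern can leave the first group empty. The sketch below probes that boundary, again assuming Python's re module as a stand-in for .NET. The helper split_prefix and the 10-character input "REFmisc034" are hypothetical and were added for illustration only.

```python
import re

LAZY = re.compile(r"^(REF)??(.{1,10})-(\d{12})-(\d+)$")

def split_prefix(text):
    """Return (group 1, group 2) of the lazy pattern for the given input."""
    m = LAZY.match(text)
    return m.group(1), m.group(2)

# Ten characters before the first dash: everything fits into group 2.
fits = split_prefix("REFmisc034-123456789012-213")

# Eleven characters before the first dash: the engine has to backtrack
# and let group 1 take "REF" after all.
too_long = split_prefix("REFmisc0345-123456789012-213")

print(fits)      # (None, 'REFmisc034')
print(too_long)  # ('REF', 'misc0345')
```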
