为什么正则表达式默认是贪婪的?
对于编写正则表达式的初学者来说,这似乎是一个巨大的混乱来源,可能会导致隐藏的性能问题,而且典型的用例似乎是非贪婪的。
这只是出于遗留原因(这是最初完成的方式,每个实现都会复制它),还是有其他原因?
It seems that this is a huge source of confusion for beginners writing regular expressions, can cause hidden performance problems, and it would seem that a typical use case would be non-greedy.
Is this just for legacy reasons (it was how it was first done, and every implementation copies that), or is there a reason for it?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
歇斯底里的 Raisens
部分答案可能涉及 RE 在实际计算中的起源。它们最初是来自自动机理论和形式语言理论的理论概念,直到Ken Thompson 本人编写了一个真正的实现并在 qed 和 ed( 1)。
原始版本只有贪婪语法,因此甚至没有做出决定。
Hysterical Raisens
Part of the answer may involve the origins of REs in practical computing. They were originally a theoretical concept from automata theory and formal language theory until Ken Thompson himself wrote a real implementation and used them in qed and ed(1).
The original version had only the greedy syntax and so there wasn't a decision to even make.
就性能而言,由于回溯,惰性量词并不总是更快:http:// blog.stevenlevithan.com/archives/greedy-lazy-performance
至于实际设计,老实说我不能说为什么量词默认是贪婪的,但我确实想知道什么控制字符会被用来制作量词贪婪而不是懒惰。我不认为
?
会削减它:-)In the case of performance, lazy quantifiers aren't always faster because of backtracking: http://blog.stevenlevithan.com/archives/greedy-lazy-performance
As for the actual design, I honestly can't say why quantifiers are greedy by default but I do wonder what control character would have been used to make a quantifier greedy instead of lazy. I don't think
?
would have cut it :-)可能的原因:如果非贪婪,正则表达式引擎需要回溯很多。
Possible reason: The regex engine needs to backtrack a lot if it's non-greedy.
嗯,重要的是计算机尽可能地表现得可预测。因此,正确的行为应该遵循一个简单的规则,比如贪婪匹配,这样至少有经验的程序员才能预测一段代码的结果。
至于典型用例是否应该是非贪婪的,那么以下情况如何:假设我有一个包含 foo1909、bar3939、baz3331 等条目的文件,而我只想提取这些数字。将 (\d*) 写为正则表达式似乎很自然。
你可能会说,写成 (\d*)\D 或其他什么都一样容易,但基本上总是这样,程序员可以更明确、更明确。因为我们想要一个 100% 可预测的默认行为,并且在头脑中计算起来很简单,所以对我来说这似乎是合理的。
Well, it is important that computers behave predictably whenever possible. So the correct behavior should follow a simple rule, like greedy matching, so that at least experienced programmers can predict the outcome of a piece of code.
As for whether a typical use case should be non-greedy, what about the following: suppose I have a file with entries like foo1909, bar3939, baz3331, and I just want to extract these numbers. It seems natural enough to write (\d*) as the regular expression for this.
You might say that it is just as easy to write (\d*)\D or whatever, but it is basically always the case that the programmer can be more explicit and less ambiguous. Since we wanted a default behavior that was 100% predictable, and trivial to calculate in ones head, it seems reasonable to me.
这里真正的问题是 Kleene 闭包运算符(星号);对于正则表达式中的其他所有内容,最长匹配与最短匹配相同。
当您从这些角度思考时,您会意识到更现代的工具意识到您需要两者。我迟到了,所以我只能想到两个例子:
ksh
和bash
都提供“最长匹配”和“最短匹配”大多数特殊变量更改运算符的“匹配”形式。Lua正则表达式包括用于Kleene闭包最长匹配的
*
和用于Kleene闭包最短匹配的-
。当我忘记转义文字-
符号时,这个问题总是困扰着我。回顾 Kleene 的原始工作,看看这是否会影响最长匹配的早期工具,将会很有趣。
The real issue here is the Kleene closure operator (star); for everything else in a regular expression, the longest match is the same as the shortest match.
When you think about it in those terms, you realize that more modern tools realize you need both. I'm up late so I can think of only two examples:
Both
ksh
andbash
provide "longest match" and "shortest match" forms of most of the special variable-altering operators.The Lua regular expressions include
*
for Kleene closure longest match and-
for Kleene closure shortest match. This one always bites me when I forget to escape a literal-
sign.It would be interesting to go back to Kleene's original work and see if that might have influenced early tools toward longest match.
我想澄清的是,这是错误的,除非“典型用例”意味着 HTML 黑客攻击。
一个简单的例子是编程语言的词法分析器。您只是不想
被解释为 3 个变量,后跟一个等号,后跟 2 个数字。相反,通常您希望解析器考虑最长的可能匹配。
在 HTML 出现之前,我们这些年长的人已经在贪婪的正则表达式中生活了几十年,而且我们做得很好。即使在今天,我在 99% 的情况下也不会使用非贪婪的,诚然,因为我懒得查找语法,但也因为很少有场合你不能只写一个终止良好的贪婪的。例如,要匹配一个字符串:
I want to make it clear that this is wrong, unless "typical use case" means HTML-hacking.
An easy example are lexical analysers for programming languages. You simply don't want
to be interpreted as 3 variables, followed by an equal sign, followed by 2 numbers. On the contrary, typically you expect your parser to consider the longest possible matches.
Before the advent of HTML, we elder ones have lived for decades with greedy regular expressions, and we did just fine. Even today I do not use non-greedy ones in 99% of all cases, admittedly because I am too lazy to look up the syntax, but also because the occasions are seldom where you couldn't just write a well terminated greedy one. For example, to match a string: