Why bother not creating a backreference?
I understand that putting ?: at the start of the parentheses in a regular expression prevents the group from creating a backreference, which is supposed to be faster. My question is, why do this? Is the speed increase noticeable enough to warrant the consideration? Under what circumstances does it matter so much that you need to carefully skip the backreference every time you are not going to use it? Another disadvantage is that it makes the regex harder to read, edit, and update (if you end up wanting to use a backreference later).
So in summary, why bother not creating a backreference?
Comments (2)
I think you're confusing backreferences like \1 with capturing groups (...). Backreferences prevent all kinds of optimizations by making the language non-regular.
Capturing groups make the regular expression engine do a little more work to remember where a group starts and ends, but are not as bad as backreferences.
http://www.regular-expressions.info/brackets.html explains capturing groups and backreferences to them in detail.
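To make the distinction concrete, here is a minimal sketch in Python's re module (the language and the sample strings are my own choice, not something from the question). It shows a capturing group, the same group made non-capturing with ?:, and a pattern that actually uses a backreference.

```python
import re

text = "abab cdcd"

# Capturing group: the engine records where group 1 starts and ends.
capturing = re.compile(r"(ab)+")

# Non-capturing group: same grouping for the + quantifier, nothing recorded.
non_capturing = re.compile(r"(?:ab)+")

# Backreference: \1 must repeat exactly what group 1 matched.
backreference = re.compile(r"(\w\w)\1")

m1 = capturing.match(text)
m2 = non_capturing.match(text)
repeated_pairs = backreference.findall(text)

print(m1.group(0), m1.groups())  # same overall match, one recorded group
print(m2.group(0), m2.groups())  # same overall match, no recorded groups
print(repeated_pairs)            # the pairs that were immediately repeated
```

Both of the first two patterns match the same text; only the bookkeeping differs. The third pattern is the only one whose matching logic depends on what an earlier group captured.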
EDIT:
On backreferences making regular expressions non-regular, consider a regular expression that matches Lua comments.
--[[...]] is a comment, --[=[...]=] is a comment, and --[==[...]==] is a comment. You can nest comments by adding extra equals signs between the square brackets.
This cannot be matched by a strictly regular language, so a simple finite state machine cannot handle it in O(n) time; you need a counter.
Perl 5 regular expressions can handle this using backreferences. But as soon as you require non-regular pattern matching, your regular expression library has to give up the simple state-machine approach and use more complex, less-efficient code.
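The pattern itself does not appear in the text above, so the following is only an assumed pattern of the kind described, written for Python's re module rather than Perl 5. The name LUA_LONG_COMMENT and the sample source string are made up for illustration.

```python
import re

# (=*) captures the run of equals signs in the opening bracket;
# \1 then requires the closing bracket to contain exactly the same run.
# re.DOTALL lets the comment body span multiple lines.
LUA_LONG_COMMENT = re.compile(r"--\[(=*)\[.*?\]\1\]", re.DOTALL)

source = "x = 1 --[==[ outer --[[ inner ]] still outer ]==] y = 2"

match = LUA_LONG_COMMENT.search(source)
comment = match.group(0)       # the whole comment, nested part included
level = len(match.group(1))    # number of equals signs in the opener

print(comment)
print(level)
```

The inner ]] does not end the comment because the closer must carry the same two equals signs the opener captured. That dependency on earlier matched text is exactly what a plain finite state machine cannot express.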
You're right, performance is not the only reason to avoid capturing groups; in fact, it's not even the most important reason.
I look at it the other way around: if you habitually use non-capturing groups, it's easier to keep track of the group numbers on those occasions when you do choose to capture something. In the same vein, if you're using named groups (assuming your regex flavor supports them), you should always use named groups, and always refer to them (in backreferences or replacement strings) by name, not by number. Following these rules consistently will at least partially offset the readability penalty of non-capturing groups.
Yes, it is a PITA having to clutter up your regexes that way, and the people who write and maintain the regex implementations know it. In .NET you can set the ExplicitCapture option, whereby all "bare" parentheses are treated as non-capturing groups and only named groups capture. In Perl 6, parentheses (with or without names) always capture, and square brackets are used for non-capturing groups. The other flavors will probably follow suit eventually, but in the meantime we just have to rely on good habits.
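As a rough illustration of the numbering and named-group advice, here is a sketch in Python's re module. The date pattern and the variable names are hypothetical, and Python has no equivalent of the ExplicitCapture option, so the non-capturing groups are written out by hand.

```python
import re

date = "2024-01-15"

# Every parenthesis captures, so the helper group (19|20) takes number 2
# and pushes the month to group 3.
all_capturing = re.compile(r"((19|20)\d{2})-(\d{2})-(\d{2})")

# Making the helper non-capturing keeps the numbers aligned with the
# parts we actually care about: year=1, month=2, day=3.
tidy = re.compile(r"((?:19|20)\d{2})-(\d{2})-(\d{2})")

# Named groups remove the dependence on numbering altogether.
named = re.compile(r"(?P<year>(?:19|20)\d{2})-(?P<month>\d{2})-(?P<day>\d{2})")

month_all = all_capturing.match(date).group(3)
month_tidy = tidy.match(date).group(2)
month_named = named.match(date).group("month")

# Refer to groups by name in the replacement string as well.
reordered = named.sub(r"\g<day>/\g<month>/\g<year>", date)

print(month_all, month_tidy, month_named)
print(reordered)
```

If someone later wraps another part of the pattern in parentheses, the numbered versions need every group index rechecked, while the named version keeps working unchanged.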