Why does my non-greedy Perl regex still match too much?
Say, I have a line that contains the following string:
"$tom" said blah blah blash. "$dick" said "blah blah blah". "$harry" said blah blah blah.
and I want to extract
"$dick" said "blah blah blah"
I have the following code:
my ($term) = /(".+?" said ".+?")/g;
print $term;
But it gives me more than I need:
"$tom" said blah blah blash. "$dick" said "blah blah blah"
I tried grouping my pattern as a whole by using the non-capturing parens:
my ($term) = /((?:".+?" said ".+?"))/g;
But the problem persists.
I've reread the Nongreedy Quantifiers section of Learning Perl but it's got me nowhere so far.
Thanks for any guidance you can generously offer :)
Comments (4)
The problem is that, even though it's not greedy, it still keeps trying. The regex doesn't see
"$tom" said blah blah blash.
and think "Oh, the stuff following 'said' isn't quoted, so I'll skip that one." It thinks "Well, the stuff after 'said' isn't quoted, so it must still be part of our quote." So
".+?"
matches
"$tom" said blah blah blash. "$dick"
What you want is
"[^"]+"
This will match two quote marks enclosing anything that's not a quote mark. So the final solution is to use it in place of both lazy parts of your pattern.
Unfortunately
"
is a peculiar-enough character to need to be treated carefully. Explicitly match sequences of "nondoublequotes" rather than sequences of arbitrary characters, and it should work fine (it does for me...!).
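The negated-character-class fix suggested above can be sketched as follows. This is a minimal illustration rather than the answerers' exact code: the variable names $line, $too_much and $term are chosen here, and the sample line is copied from the question into a single-quoted string so that $tom, $dick and $harry stay literal.

```perl
use strict;
use warnings;

# Sample line from the question; single quotes prevent interpolation.
my $line = '"$tom" said blah blah blash. "$dick" said "blah blah blah". "$harry" said blah blah blah.';

# Original pattern: .+? is lazy, but it is still allowed to cross quote marks.
my ($too_much) = $line =~ /(".+?" said ".+?")/;

# Suggested fix: [^"]+ can never cross a quote mark.
my ($term) = $line =~ /("[^"]+" said "[^"]+")/;

print "lazy dot:      $too_much\n";
print "negated class: $term\n";
```

With the negated class, an attempt that starts at the first quote mark fails outright, so the engine has to move on to a later starting position.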
Others have mentioned how to fix this.
I'll answer how you can debug this: you can see what's happening by using more captures, one around each part of the pattern, and printing what each one matched.
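The original debugging snippet is not preserved here, so the following is only a sketch of the idea, assuming one capture group around each lazy part of the question's pattern. The names $first and $second are illustrative.

```perl
use strict;
use warnings;

# Sample line from the question; single quotes prevent interpolation.
my $line = '"$tom" said blah blah blash. "$dick" said "blah blah blah". "$harry" said blah blah blah.';

# Same pattern as in the question, but each .+? gets its own capture group.
my ($first, $second) = $line =~ /"(.+?)" said "(.+?)"/;

# Brackets make it obvious where each capture starts and ends.
print "first capture:  [$first]\n";
print "second capture: [$second]\n";
```

Printing the captures separately shows that the first lazy group is the one that swallowed the extra text, quote marks included.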
Your problem here is that there are two possible matches for your regexp: the one you want (the shorter one) and the one the regex engine chooses. The engine chooses that match because it prefers a match that starts earlier in the string, even if it is longer, to a match that starts later and is shorter. In other words, early matches win over shorter ones.
To solve this you need to make your regex more specific (as in telling the engine that the quoted parts of $term should not contain any quotes). It's a good idea to make your regexes as specific as possible anyway.
For more details and gotchas regarding regular expressions, I recommend Jeffrey Friedl's excellent book, Mastering Regular Expressions.
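To see the "early matches win" rule in action, one option is to compare where each pattern's match starts and how long it is. This is a rough sketch: match_span is a hypothetical helper written for this illustration, and it relies on Perl's built-in @- and @+ arrays, which hold the start and end offsets of the last successful match.

```perl
use strict;
use warnings;

# Sample line from the question; single quotes prevent interpolation.
my $line = '"$tom" said blah blah blash. "$dick" said "blah blah blah". "$harry" said blah blah blah.';

# Hypothetical helper: returns (start offset, length) of the first match,
# or an empty list if the pattern does not match.
sub match_span {
    my ($re, $text) = @_;
    return unless $text =~ $re;
    return ($-[0], $+[0] - $-[0]);
}

my ($lazy_start,  $lazy_len)  = match_span(qr/".+?" said ".+?"/,       $line);
my ($fixed_start, $fixed_len) = match_span(qr/"[^"]+" said "[^"]+"/,   $line);

print "lazy dot:      starts at $lazy_start, length $lazy_len\n";
print "negated class: starts at $fixed_start, length $fixed_len\n";
```

The lazy pattern reports the earlier start and the longer length, because laziness only shortens a match from a given starting position; it never makes the engine give up a starting position that can still succeed.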