Why does my non-greedy Perl regex still match too much?

Posted on 2024-08-07 22:58:16

Say, I have a line that contains the following string:

"$tom" said blah blah blash.  "$dick" said "blah blah blah". "$harry" said blah blah blah.

and I want to extract

"$dick" said "blah blah blah"

I have the following code:

my ($term) = /(".+?" said ".+?")/g;
print $term;

But it gives me more than I need:

"$tom" said blah blah blash.  "$dick" said "blah blah blah"

I tried grouping my pattern as a whole by using the non-capturing parens:

my ($term) = /((?:".+?" said ".+?"))/g;

But the problem persists.

I've reread the Nongreedy Quantifiers section of Learning Perl but it's got me nowhere so far.

Thanks for any guidance you can generously offer :)

Comments (4)

想挽留 2024-08-14 22:58:16

The problem is that, even though it's not greedy, it still keeps trying. The regex doesn't see

"$tom" said blah blah blash.

and think "Oh, the stuff following the "said" isn't quoted, so I'll skip that one." It thinks "well, the stuff after "said" isn't quoted, so it must still be part of our quote." So ".+?" matches

"$tom" said blah blah blash.  "$dick"

What you want is "[^"]+". This will match two quote marks enclosing anything that's not a quote mark. So the final solution:

("[^"]+" said "[^"]+")
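
As a minimal sketch of the difference, the script below runs both patterns against the sample line from the question. The variable names `$line`, `$lazy`, and `$strict` are made up for illustration and do not come from the original post.

```perl
use strict;
use warnings;

# Sample line from the question; single quotes keep $tom etc. literal.
my $line = '"$tom" said blah blah blash.  "$dick" said "blah blah blah". "$harry" said blah blah blah.';

# Lazy dot: the match starts at the first quote, so .+? keeps
# expanding past "$tom" until it finds a place where '" said "' fits.
my ($lazy) = $line =~ /(".+?" said ".+?")/;

# Negated class: a quoted chunk can never contain a quote character,
# so the attempt starting at "$tom" fails and the engine moves on.
my ($strict) = $line =~ /("[^"]+" said "[^"]+")/;

print "lazy:   $lazy\n";
print "strict: $strict\n";
```

With the negated character class, the attempt anchored at `"$tom"` cannot succeed at all, so the engine has to advance to `"$dick"` before it finds a match.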

尤怨 2024-08-14 22:58:16

Unfortunately " is a peculiar-enough character to need to be treated carefully. Use:

my ($term) = /("[^"]+?" said "[^"]+?")/g;

and it should work fine (it does for me...!). I.e. explicitly match sequences of "nondoublequotes" rather than sequences of arbitrary characters.

烟燃烟灭 2024-08-14 22:58:16

Others have mentioned how to fix this.

I'll answer how you can debug this: you can see what's happening by using more captures:

 bash$ cat story | perl -nle 'my ($term1, $term2, $term3) = /(".+?") (said) (".+?")/g ; 
      print "term1 = \"$term1\" term2 = \"$term2\" term3 = \"$term3\" \n"; '
 term1 = ""$tom" said blah blah blash.  "$dick"" term2 = "said" term3 = ""blah blah blah""
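
The same idea can be written as a small script instead of a one-liner. This is only a sketch: it hard-codes the sample line rather than reading the `story` file, and it uses Perl's built-in `@-` and `@+` arrays to report where each capture group starts and ends. The names `$line` and `@groups` are illustrative.

```perl
use strict;
use warnings;

# Sample line from the question, hard-coded for the sake of the example.
my $line = '"$tom" said blah blah blash.  "$dick" said "blah blah blah". "$harry" said blah blah blah.';

my @groups;
if ($line =~ /(".+?") (said) (".+?")/) {
    # $-[N] and $+[N] hold the start and end offsets of capture group N.
    for my $n (1 .. 3) {
        my $text = substr($line, $-[$n], $+[$n] - $-[$n]);
        push @groups, $text;
        printf "group %d at offsets %d..%d: %s\n", $n, $-[$n], $+[$n], $text;
    }
}
```

Seeing that group 1 begins at offset 0 makes it clear that the engine committed to the first quote in the line and stretched the lazy quantifier from there.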

空宴 2024-08-14 22:58:16

Your problem here is that there are two possible matches for your regexp: the one you want (the shorter one) and the one the regex engine chooses. The engine chooses that specific match because it always prefers a match that starts earlier in the string, even if it is longer, to a match that starts later and is shorter. A non-greedy quantifier only keeps the match as short as possible once its starting point is fixed. In other words, early matches win over shorter ones.

To solve this, you need to make your regex more specific (for example, by telling the engine that the quoted parts of $term should not contain any quotes themselves). It's a good idea to make your regexes as specific as possible anyway.
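
The "earliest start wins" rule can be seen in isolation with a toy string. This is a hedged illustration with made-up data (`$text`, `$lazy`, and `$start` are not from the question), not a description of the engine's internals.

```perl
use strict;
use warnings;

# Two candidates: "aaab" starting at offset 1, and the shorter "ab" at offset 6.
my $text = 'xaaab ab';

# The quantifier is lazy, yet the engine still takes the first position
# where a match is possible and extends a+? only as far as needed from there.
my ($lazy) = $text =~ /(a+?b)/;
my $start  = $-[0];    # offset where the overall match began

print "matched '$lazy' starting at offset $start\n";
```

Even though the shorter `ab` exists later in the string, the engine never gets that far because a match is already available from the earlier position.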

For more details and gotchas regarding regular expressions, I recommend Jeffrey Friedl's excellent book: Mastering Regular Expressions
