Why does my non-greedy Perl regex still match too much?

Posted on 2024-08-07 22:58:16


Say I have a line that contains the following string:

"$tom" said blah blah blash.  "$dick" said "blah blah blah". "$harry" said blah blah blah.

and I want to extract

"$dick" said "blah blah blah"

I have the following code:

my ($term) = /(".+?" said ".+?")/g;
print $term;

But it gives me more than I need:

"$tom" said blah blah blash.  "$dick" said "blah blah blah"

I tried grouping my pattern as a whole by using the non-capturing parens:

my ($term) = /((?:".+?" said ".+?"))/g;

But the problem persists.

I've reread the Nongreedy Quantifiers section of Learning Perl but it's got me nowhere so far.

Thanks for any guidance you can generously offer :)


想挽留 2024-08-14 22:58:16

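The problem is that, even though the quantifier is not greedy, it still keeps extending until the rest of the pattern can match. Non-greedy only means "prefer the shortest match from the position where the match began", and the match begins at the first quote mark in the line. The regex doesn't see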

"$tom" said blah blah blash.

and think, "Oh, the stuff following 'said' isn't quoted, so I'll skip that one." Instead it thinks, "Well, the stuff after 'said' isn't quoted, so it must still be part of our quote." So the first ".+?" matches

"$tom" said blah blah blash.  "$dick"
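
A minimal sketch of the behavior described above, assuming the sample line is stored in a variable named $line (that name, $lazy, and $start are illustrative and not from the question). It checks where the non-greedy match begins by reading $-[0], the start offset of the last successful match:

```perl
use strict;
use warnings;

# Sample line from the question; single quotes keep $tom etc. literal.
my $line = '"$tom" said blah blah blash.  "$dick" said "blah blah blah". "$harry" said blah blah blah.';

# The lazy pattern from the question.
my ($lazy) = $line =~ /(".+?" said ".+?")/;

# @- holds match start offsets; $-[0] is where the whole match began.
my $start = $-[0];

print "match starts at offset $start\n";
print "$lazy\n";
```

The match starts at offset 0, the very first quote mark. The lazy quantifier made the match as short as it could from that starting point, but it never gave up the starting point itself.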

What you want is "[^"]+", which matches two quote marks enclosing one or more characters that are not quote marks. So the final solution is:

("[^"]+" said "[^"]+")
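
As a rough check of that pattern, here is a small self-contained script. The variable names $line and $term are assumptions for illustration, and the sample line is copied from the question:

```perl
use strict;
use warnings;

# Sample line from the question; single quotes keep $tom etc. literal.
my $line = '"$tom" said blah blah blash.  "$dick" said "blah blah blah". "$harry" said blah blah blah.';

# [^"]+ cannot cross a quote mark, so each quoted piece stays self-contained.
my ($term) = $line =~ /("[^"]+" said "[^"]+")/;

print "$term\n";   # "$dick" said "blah blah blah"
```

Because the character class excludes the quote mark, the attempt starting at "$tom" fails outright instead of stretching across to "$dick", and the engine moves on to the next candidate starting position.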


尤怨 2024-08-14 22:58:16

Unfortunately " is a peculiar-enough character to need to be treated carefully. Use:

my ($term) = /("[^"]+?" said "[^"]+?")/g;

and it should work fine (it does for me!). That is, explicitly match sequences of non-double-quote characters rather than sequences of arbitrary characters.
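
One side note, offered as a sketch rather than a correction: once the character class excludes the quote mark, the trailing ? should make no difference to what is matched, since [^"]+ must stop at the next quote either way. The variable names below ($line, $lazy, $greedy) are assumptions for illustration:

```perl
use strict;
use warnings;

# Sample line from the question; single quotes keep $tom etc. literal.
my $line = '"$tom" said blah blah blash.  "$dick" said "blah blah blah". "$harry" said blah blah blah.';

# Same pattern with and without the lazy modifier on the character class.
my ($lazy)   = $line =~ /("[^"]+?" said "[^"]+?")/;
my ($greedy) = $line =~ /("[^"]+" said "[^"]+")/;

print $lazy eq $greedy ? "same match\n" : "different match\n";
print "$lazy\n";
```

On the sample line both forms capture the same text, so keeping or dropping the ? here is a matter of taste.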


烟燃烟灭 2024-08-14 22:58:16

Others have mentioned how to fix this.

I'll answer how you can debug this: you can see what's happening by using more captures:

 bash$ cat story | perl -nle 'my ($term1, $term2, $term3) = /(".+?") (said) (".+?")/g ; 
      print "term1 = \"$term1\" term2 = \"$term2\" term3 = \"$term3\" \n"; '
 term1 = ""$tom" said blah blah blash.  "$dick"" term2 = "said" term3 = ""blah blah blah""
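
The same debugging idea can be sketched as a standalone script, which avoids needing a file named story. The names $line, $who, $verb, and $what are illustrative assumptions, not part of the original one-liner:

```perl
use strict;
use warnings;

# Sample line from the question; single quotes keep $tom etc. literal.
my $line = '"$tom" said blah blah blash.  "$dick" said "blah blah blah". "$harry" said blah blah blah.';

# Capture each piece separately to see which part of the pattern over-matches.
my ($who, $verb, $what) = $line =~ /(".+?") (said) (".+?")/;

print "who  = $who\n";
print "verb = $verb\n";
print "what = $what\n";
```

Printing the captures one per line makes it easy to see that the first group is the one that swallowed too much, while the other two groups matched what was intended.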
