Strange Perl regex behavior with parentheses

Posted 2024-11-09 17:44:54

I'm pulling in some Wikipedia markup and I want to match the URLs in relative (on-Wikipedia) links. I don't want to match any URL containing a colon (not counting the protocol colon), to avoid special pages and the like, so I have the following code:

while ($body =~ m|<a href="(?<url>/wiki/[^:"]+)|gis) { 
  my $url = $+{url};
  print "$url\n";
}

Unfortunately, this code is not working quite as expected. Any URL that contains a parenthetical [e.g. /wiki/Eon_(geology)] is truncated just before the opening paren, so that URL matches as /wiki/Eon_. I've been looking at the code for a while and I cannot figure out what I'm doing wrong. Can anyone provide some insight?

Comments (2)

北城半夏 2024-11-16 17:44:54

There isn't anything wrong in this code as it stands, so long as your Perl is new enough to support these RE features. Tested with Perl 5.10.1.

$body = <<"__ENDHTML__";
<a href="/wiki/Eon_(geology)">Body</a> Blah blah 
<a href="/wiki/Some_other_(parenthesis)">Body</a>
__ENDHTML__

while ($body =~ m|<a href="(?<url>/wiki/[^:"]+)|gis) { 
  my $url = $+{url};
  print "$url\n";
}

Are you using an old Perl?
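
If an older interpreter is a possibility, one way to rule it out is to print the running version and retry the match with a plain numbered group, since named captures such as (?<url>...) and the %+ hash were introduced in Perl 5.10. The following is only a minimal sketch under that assumption; the sample input is copied from the test above, and the @urls array is a name introduced here for illustration.

```perl
use strict;
use warnings;

# Sample input copied from the answer above.
my $body = <<"__ENDHTML__";
<a href="/wiki/Eon_(geology)">Body</a> Blah blah
<a href="/wiki/Some_other_(parenthesis)">Body</a>
__ENDHTML__

# Report the running interpreter version; named captures need 5.10 or later.
printf "Perl version: %vd\n", $^V;

# Same pattern, but with a numbered group instead of (?<url>...),
# so it does not depend on the %+ hash.
my @urls;    # illustrative name, not from the original post
while ($body =~ m|<a href="(/wiki/[^:"]+)|gis) {
    push @urls, $1;
}
print "$_\n" for @urls;
```

If this version prints the full URLs, parentheses included, the pattern itself is not what truncates them, and the input or the interpreter is the more likely place to look.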

仙女 2024-11-16 17:44:54

You didn't anchor the RE to the end of the string. Put a " afterwards.

While that is a problem, it isn't the problem he was trying to solve. The problem he was trying to solve was that there was nothing in the RE to match the method/hostname (http://en.wiki...). Adding a .*? before the "(?" would help with that.
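
Both suggestions can be tried together in a short sketch. Requiring the closing quote matters because, without it, a link such as /wiki/Special:Random is not skipped but matched as /wiki/Special. In place of a bare .*?, the sketch uses an optional scheme-and-host prefix, which is a narrower variation chosen here rather than what the comment literally proposes. The sample links and the @urls array are hypothetical, and the named capture assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical sample input: one relative link, one absolute link,
# and one special page that should be skipped because of its colon.
my $body = <<'__ENDHTML__';
<a href="/wiki/Eon_(geology)">Eon</a>
<a href="http://en.wikipedia.org/wiki/Perl">Perl</a>
<a href="/wiki/Special:Random">Random</a>
__ENDHTML__

my @urls;    # illustrative name, not from the original post
while (
    $body =~ m{<a href="(?:https?://[^/"]+)?   # optional scheme and host
               (?<url>/wiki/[^:"]+)            # path with no colon or quote
               "                               # closing quote must follow
              }gisx
  )
{
    push @urls, $+{url};
}
print "$_\n" for @urls;
```

With the closing quote in the pattern, the path must run all the way to the end of the attribute, so any colon in it makes the whole match fail instead of cutting the URL short.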
