Advanced regex: positive lookbehind

Posted on 2024-08-29 21:07:24

This is my test-string:

<img rel="{objectid:498,newobject:1,fileid:338}" width="80" height="60" align="left" src="../../../../files/jpg1/Desert1.jpg" alt="" />

I want to get each of the JSON-style elements inside the rel attribute.
It's working for the first element (objectid).

Here is my regex, which works fine:

(?<=(rel="\{objectid:))\d+(?=[,|\}])

But I want to do something like this, which doesn't work:

(?<=(rel="\{.*objectid:))\d+(?=[,|\}])

So I can parse every element of the search string.

I'm using Java regex.

静待花开 2024-09-05 21:07:24

Java (like nearly all regex flavors except .NET and JGSoft) doesn't support infinite repetition inside lookbehinds.

You could use a capturing group instead. Also, it's better to use [^{]* instead of .*, and to ensure word boundaries with \b.

rel="\{[^{]*\bobjectid:(\d+)

should be sufficient; then look at capturing group 1 for the value of the attribute.
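
In Java, the capture-group approach might be wired up roughly like this. It is only a sketch: the class name RelCapture and the helper findNumber are made up for illustration, and the key is passed through Pattern.quote on the assumption that it is a plain literal rather than a regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RelCapture {
    // Hypothetical helper: returns the digits that follow "key:" after rel="{,
    // or null when the key is not found.
    static String findNumber(String tag, String key) {
        Pattern p = Pattern.compile(
            "rel=\"\\{[^{]*\\b" + Pattern.quote(key) + ":(\\d+)");
        Matcher m = p.matcher(tag);
        // Group 1 holds the value; the prefix is matched but not returned.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String tag = "<img rel=\"{objectid:498,newobject:1,fileid:338}\" width=\"80\" />";
        System.out.println(findNumber(tag, "objectid")); // 498
        System.out.println(findNumber(tag, "fileid"));   // 338
    }
}
```

Because the prefix is consumed by the match instead of being asserted in a lookbehind, its length no longer matters to the engine.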

白馒头 2024-09-05 21:07:24

Do you want to iterate through all the key/value pairs? You don't need lookbehind for that:

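import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Demo  // class name is arbitrary
{
  public static void main(String[] args)
  {
    String s =
        "<img rel=\"{objectid:498,newobject:1,fileid:338}\" " +
        "width=\"80\" height=\"60\" align=\"left\" " +
        "src=\"../../../../files/jpg1/Desert1.jpg\" alt=\"\" />";
    // Start at rel="{ or, via \G, at a comma right after the previous match.
    Pattern p = Pattern.compile(
        "(?:\\brel=\"\\{|\\G,)(\\w+):(\\w+)");
    Matcher m = p.matcher(s);
    while (m.find())
    {
      System.out.printf("%s = %s%n", m.group(1), m.group(2));
    }
  }
}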

The first time find() is called, the first part of the regex matches rel="{. On subsequent calls, the second alternative (\G,) takes over to match a comma, but only if it immediately follows the previous match. In either case it leaves you lined up for (\w+):(\w+) to match the next key/value pair, and it can never match anywhere outside the rel attribute.

I'm assuming you're applying the regex to an isolated IMG tag, as you posted it, not to a whole HTML file. Also, the regex may need a little tweaking to match your actual data. For example, you might want the more general ([^:]+):([^,}]+) instead of (\w+):(\w+).
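
If the pairs are needed later rather than just printed, the more general pattern could be used to collect them into a map. This is a sketch under the same assumption of an isolated IMG tag; the class name RelPairs and the method parse are invented for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RelPairs {
    // Either start at rel="{ or continue at a comma that sits exactly where
    // the previous match ended (\G). Keys and values use the looser classes.
    private static final Pattern PAIR =
        Pattern.compile("(?:\\brel=\"\\{|\\G,)([^:]+):([^,}]+)");

    // Hypothetical helper: returns the key/value pairs in document order.
    static Map<String, String> parse(String tag) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(tag);
        while (m.find()) {
            pairs.put(m.group(1), m.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String tag = "<img rel=\"{objectid:498,newobject:1,fileid:338}\" width=\"80\" />";
        System.out.println(parse(tag)); // {objectid=498, newobject=1, fileid=338}
    }
}
```

A LinkedHashMap is used here only so the pairs come back in the order they appear in the attribute.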

丶情人眼里出诗心の 2024-09-05 21:07:24

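In general, lookbehinds may not contain arbitrary regular expressions: most engines (Java's included) need to know the maximum length a lookbehind can match, so you can't use unbounded quantifiers like * inside one. Lookaheads have no such restriction.

Why are you using lookaheads and lookbehinds here, anyway? Just use a capture group instead; that's much simpler.

rel="\{.*objectid:(\d+)

Now the first capture group will contain the ID.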

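
If a lookbehind is still preferred for some reason, Java does accept one whose repetition has an explicit upper bound. The sketch below assumes the key appears within 100 characters of rel="{; that cap is arbitrary, and the class name BoundedLookbehind and the helper findNumber are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundedLookbehind {
    // Assumption: at most 100 characters separate rel="{ from the key.
    // {0,100} gives the lookbehind a known maximum length, unlike * or +.
    static String findNumber(String tag, String key) {
        Pattern p = Pattern.compile(
            "(?<=rel=\"\\{[^{}]{0,100}\\b" + Pattern.quote(key) + ":)\\d+");
        Matcher m = p.matcher(tag);
        // The whole match is the number; the prefix is only asserted.
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String tag = "<img rel=\"{objectid:498,newobject:1,fileid:338}\" width=\"80\" />";
        System.out.println(findNumber(tag, "fileid")); // 338
    }
}
```

The capture-group version is usually simpler; this variant only matters when the match itself has to be exactly the digits.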