What regex pattern do I need to match everything between {{ and }}?

Posted 2024-09-27 10:13:16 · 727 characters · 8 views · 0 comments

I'm trying to parse Wikipedia, but I'm ending up with orphan }} after running the regex code. Here's my PHP script.

<?php

$articleName = 'england';

$url = "http://en.wikipedia.org/wiki/Special:Export/" . $articleName;
ini_set('user_agent', 'custom agent'); // required so that Wikipedia allows our request

$feed = file_get_contents($url);
$xml = new SimpleXmlElement($feed);

$wikicode = $xml->page->revision->text;

$wikicode = str_replace("[[", "", $wikicode);
$wikicode = str_replace("]]", "", $wikicode);
$wikicode = preg_replace('/\{\{([^}]*(?:\}[^}]+)*)\}\}/', '', $wikicode);

print($wikicode);

?>

I think the problem is that I have nested {{ and }}, e.g.:

{{ something {{ something else {{ something new }}{{ something old }} something blue }} something green }}

梦纸 2024-10-04 10:13:16

You can use:

\{\{(.*?)\}\}

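Most regex flavors treat the brace { as a literal character unless it is part of a repetition operator like {x,y}, which is not the case here. So you do not need to escape it with a backslash, though doing so gives the same result. Escaping is still the safer habit, because some engines (recent Perl versions, for example) warn about or reject an unescaped { in certain positions.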

So you can also use:

{{(.*?)}}

Sample:

$ echo {{StackOverflow}} | perl -pe 's/{{(.*?)}}/$1/'
StackOverflow

Also note that .*?, where . matches any character other than a newline, is used here in a non-greedy way, so it tries to match as little as possible.

Example:

In the string '{{stack}}{{overflow}}' it will capture 'stack' and not 'stack}}{{overflow'.
If you want the latter behavior, you can change .*? to .*, making the match greedy.
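
The lazy/greedy difference described above can be checked with a short PHP sketch. The sample strings and variable names are made up for illustration. It also shows the newline caveat: wikitext templates often span several lines, and . only crosses a newline when the s modifier is set.

```php
<?php
// Illustrative inputs; neither comes from a real article.
$text  = '{{stack}}{{overflow}}';
$multi = "{{two\nlines}}";

// Lazy: .*? stops at the first }} it can reach.
preg_match('/\{\{(.*?)\}\}/', $text, $lazy);

// Greedy: .* runs on to the last }} in the line.
preg_match('/\{\{(.*)\}\}/', $text, $greedy);

// Without the s modifier, . cannot cross the newline, so nothing matches.
$plainHits = preg_match('/\{\{(.*?)\}\}/', $multi);

// With the s modifier, . also matches newlines.
$dotallHits = preg_match('/\{\{(.*?)\}\}/s', $multi, $dotall);

echo $lazy[1], "\n";    // stack
echo $greedy[1], "\n";  // stack}}{{overflow
var_dump($plainHits, $dotallHits); // int(0), int(1)
```

Neither variant handles nesting: both simply look for a }} without tracking how many {{ are still open.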

有木有妳兜一样 2024-10-04 10:13:16

Your edit shows that you're trying to do a recursive match, which is very different from the original question. If you weren't just deleting the matched text I would advise you not to use regexes at all, but this should do what you want:

$wikicode=preg_replace('~{{(?:(?:(?!{{|}}).)++|(?R))*+}}~s',
                       '', $wikicode);

After the first {{ matches an opening delimiter, (?:(?!{{|}}).)++ gobbles up everything until the next delimiter. If it's another opening delimiter, the (?R) takes over and applies the whole regex again, recursively.
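
To see the recursion at work, here is a small sketch that runs both the non-recursive pattern from the question and the recursive pattern on a nested sample. The sample string is invented and shortened from the one in the question; the variable names are illustrative only.

```php
<?php
// Made-up nested input, modelled on the example in the question.
$wikicode = 'before {{ a {{ b {{ c }}{{ d }} e }} f }} after';

// Pattern from the question: each match ends at the first }} it finds,
// so the outer closing delimiters are left behind as orphans.
$flat = preg_replace('/\{\{([^}]*(?:\}[^}]+)*)\}\}/', '', $wikicode);

// Recursive pattern: (?R) re-enters the whole regex at every inner {{,
// so one match covers the complete outermost template.
$nested = preg_replace('~{{(?:(?:(?!{{|}}).)++|(?R))*+}}~s', '', $wikicode);

var_dump(strpos($flat, '}}') !== false);   // bool(true): orphans remain
var_dump(strpos($nested, '}}') !== false); // bool(false): nothing left over
```

On this input the first call removes only "{{ a {{ b {{ c }}" and "{{ d }}", which is exactly the orphan }} symptom described in the question.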

(?R) is about as non-standard as regex features get. It comes from the PCRE library, which is what powers PHP's regex flavor, and Perl 5.10 and later support it too. Some other flavors have their own ways of matching recursive structures, all of them very different from each other.
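
Where recursion is unavailable, or where more than plain deletion is needed, the regex can be dropped entirely in favour of a depth counter. The following is only a sketch of that idea: stripTemplates is a hypothetical helper name, not something from the answer or from any library, and it assumes that an unclosed {{ should swallow the rest of the text.

```php
<?php
// Hypothetical helper: removes every {{ ... }} block, nested or not,
// by counting how many {{ are currently open.
function stripTemplates(string $text): string
{
    $out   = '';
    $depth = 0;
    $len   = strlen($text);

    for ($i = 0; $i < $len; $i++) {
        $pair = substr($text, $i, 2);

        if ($pair === '{{') {
            $depth++;
            $i++;                 // skip the second brace of the pair
        } elseif ($pair === '}}' && $depth > 0) {
            $depth--;
            $i++;                 // skip the second brace of the pair
        } elseif ($depth === 0) {
            $out .= $text[$i];    // outside any template: keep the byte
        }
        // Inside a template ($depth > 0) the byte is dropped.
    }

    return $out;
}

echo stripTemplates('before {{ a {{ b }} c }} after'), "\n";
```

Because the loop works byte by byte and the delimiters are ASCII, multi-byte UTF-8 text outside the templates passes through unchanged. A stray }} with no matching {{ is kept as-is.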

水染的天色ゝ 2024-10-04 10:13:16

Besides using the already mentioned non-greedy quantifier, you can also use this:

\{\{(([^}]|}[^}])*)}}

The inner ([^}]|}[^}])* is used to only match sequences of zero or more arbitrary characters that do not contain the sequence }}.
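
A quick way to check this behaviour is the sketch below, which runs the pattern on a few invented strings (the variable names are illustrative). Since [^}] is a negated character class rather than ., it also matches newlines, so no s modifier is needed.

```php
<?php
$pattern = '/\{\{(([^}]|}[^}])*)}}/';

// Stops at the first }} even though a second template follows.
preg_match($pattern, '{{stack}}{{overflow}}', $first);

// Single } characters inside the body are allowed.
preg_match($pattern, '{{fd}sdfd}sf}x{dsf}}', $braces);

// The body may span lines.
preg_match($pattern, "{{two\nlines}}", $multi);

echo $first[1], "\n";   // stack
echo $braces[1], "\n";  // fd}sdfd}sf}x{dsf
var_dump($multi[1] === "two\nlines"); // bool(true)
```

Like the lazy version, this pattern does not track nesting; it only guarantees that the captured body contains no }}.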

︶￣淡然 2024-10-04 10:13:16

A greedy version to get the shortest match is

\{\{([^}]*(?:\}[^}]+)*)\}\}

(For comparison, on the string {{fd}sdfd}sf}x{dsf}} the lazy version \{\{(.*?)\}\} takes 57 steps to match, while my version takes only 17. This assumes the debug output of RegexBuddy can be trusted.)
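
The step counts depend on the tool, but whether the two patterns capture the same text is easy to check. The sketch below compares them on a few invented samples. It says nothing about speed, and the variable names are illustrative.

```php
<?php
$lazy     = '/\{\{(.*?)\}\}/s';                 // s added so . can cross newlines
$unrolled = '/\{\{([^}]*(?:\}[^}]+)*)\}\}/';

// Made-up samples; the second is the string quoted above.
$samples = ['{{stack}}{{overflow}}', '{{fd}sdfd}sf}x{dsf}}', "{{a}\nb}}"];

$same = true;
foreach ($samples as $sample) {
    preg_match($lazy, $sample, $a);
    preg_match($unrolled, $sample, $b);
    // Compare the captured bodies of the first match from each pattern.
    $same = $same && ($a[1] === $b[1]);
}

var_dump($same); // bool(true)
```

Agreement on these samples does not prove the patterns are equivalent on every input, and neither of them handles nested templates.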
