What regular expression pattern do I need to match everything between {{ and }}?
I'm trying to parse Wikipedia, but I'm ending up with orphan }} after running the regex code. Here's my PHP script.
<?php
$articleName='england';
$url = "http://en.wikipedia.org/wiki/Special:Export/" . $articleName;
ini_set('user_agent','custom agent'); //required so that Wikipedia allows our request.
$feed = file_get_contents($url);
$xml = new SimpleXmlElement($feed);
$wikicode = $xml->page->revision->text;
$wikicode=str_replace("[[", "", $wikicode);
$wikicode=str_replace("]]", "", $wikicode);
$wikicode=preg_replace('/\{\{([^}]*(?:\}[^}]+)*)\}\}/','',$wikicode);
print($wikicode);
?>
I think the problem is that I have nested {{ and }}, e.g.
{{ something {{ something else {{ something new }}{{ something old }} something blue }} something green }}
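To see where the orphan }} comes from, the pattern from the script can be run against the nested sample above. This is only a minimal sketch: the variable names are illustrative and not part of the original script.

```php
<?php
// The pattern from the question, applied to the nested sample string.
$pattern = '/\{\{([^}]*(?:\}[^}]+)*)\}\}/';
$sample  = '{{ something {{ something else {{ something new }}{{ something old }} something blue }} something green }}';

// [^}]* runs straight across the inner "{{" openers, so each match ends at the
// first "}}" it meets instead of at the closer that pairs with its opener.
$result = preg_replace($pattern, '', $sample);

var_dump($result); // the two outer closers survive as orphans
```

The pattern has no notion of nesting depth, so every closer beyond the first one reached by each match is left behind.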
Comments (5)
You can use:
\{\{(.*?)\}\}
Most regex flavors treat the brace { as a literal character unless it is part of a repetition operator like {x,y}, which is not the case here. So you do not need to escape it with a backslash, though doing so gives the same result. So you can also use:
{{(.*?)}}
Also note that the .*, which matches any character (other than newline) zero or more times, is used here in its non-greedy form .*?, so it tries to match as little as possible. Example: in the string '{{stack}}{{overflow}}' it will match 'stack' and not 'stack}}{{overflow'. If you want the latter behavior, you can change .*? to .*, making the match greedy.
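The difference between the lazy and greedy forms can be checked with a short script. This is a sketch using the example string from the answer; the variable names are made up for illustration.

```php
<?php
$input = '{{stack}}{{overflow}}';

// Lazy quantifier: stops at the first "}}" that allows a match.
preg_match('/\{\{(.*?)\}\}/', $input, $lazy);

// Greedy quantifier: runs to the last "}}" in the string.
preg_match('/\{\{(.*)\}\}/', $input, $greedy);

// Unescaped braces behave the same in PCRE, since "{{" is not a quantifier.
preg_match('/{{(.*?)}}/', $input, $unescaped);

echo $lazy[1], "\n";      // stack
echo $greedy[1], "\n";    // stack}}{{overflow
echo $unescaped[1], "\n"; // stack
```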
Your edit shows that you're trying to do a recursive match, which is very different from the original question. If you weren't just deleting the matched text I would advise you not to use regexes at all, but this should do what you want:
After the first {{ matches an opening delimiter, (?:(?!{{|}}).)++ gobbles up everything until the next delimiter. If it's another opening delimiter, the (?R) takes over and applies the whole regex again, recursively.
(?R) is about as non-standard as regex features get. It comes from the PCRE library, which is what powers PHP's regex flavor. Some other flavors have their own ways of matching recursive structures, all of them very different from each other.
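The answer's complete pattern is not shown here, so the following is only a sketch assembled from the two pieces it describes. The surrounding alternation, the possessive *+, the ~ delimiters and the s modifier are assumptions, not necessarily what the author wrote.

```php
<?php
// Assumed reconstruction: the pieces named in the answer, wrapped in an
// alternation so that text runs and nested templates can repeat in any order.
//   (?:(?!\{\{|\}\}).)++  consumes text up to the next "{{" or "}}"
//   (?R)                  re-applies the whole pattern to a nested "{{ ... }}"
$pattern = '~\{\{(?:(?:(?!\{\{|\}\}).)++|(?R))*+\}\}~s';

$nested = '{{ something {{ something else {{ something new }}{{ something old }} something blue }} something green }}';
$mixed  = 'keep {{ a {{ b }} c }} this';

$strippedNested = preg_replace($pattern, '', $nested);
$strippedMixed  = preg_replace($pattern, '', $mixed);

var_dump($strippedNested); // string(0) ""
var_dump($strippedMixed);  // only the text outside the template remains
```

Because each opener is paired with its own closer through the recursion, no orphan }} is left behind as long as the braces in the input are balanced.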
Besides using the non-greedy quantifier already mentioned, you can also use this:
The inner ([^}]|}[^}])* matches only sequences of zero or more arbitrary characters that do not contain the sequence }}.
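As a rough illustration of the negated-class idea, the inner group can be placed between literal {{ and }}. The full pattern below is an assumption built around the inner group quoted in the answer, and the test strings are made up.

```php
<?php
// Assumed full pattern: the inner group from the answer between literal {{ and }}.
// Each step consumes either a non-"}" character, or a single "}" followed by a
// non-"}" character, so the loop can never run past a "}}".
$pattern = '/\{\{((?:[^}]|\}[^}])*)\}\}/';

preg_match($pattern, '{{stack}}{{overflow}}', $m1);
preg_match($pattern, '{{a}b}}', $m2);

echo $m1[1], "\n"; // stack
echo $m2[1], "\n"; // a}b
```

Since [^}] also matches newline characters, this form does not depend on the engine's single-line setting.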
A greedy version to get the shortest match is
(For comparison, with the string {{fd}sdfd}sf}x{dsf}}, the lazy version \{\{(.*?)\}\} takes 57 steps to match, while my version only takes 17 steps. This assumes the debug output of RegexBuddy can be trusted.)
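PHP does not expose step counts, so the figures above cannot be reproduced directly, but it is easy to confirm that a lazy pattern and a greedy negated-class pattern capture the same text on that string. The second pattern is an assumed shape for the greedy alternative, not a quote from the answer.

```php
<?php
$input = '{{fd}sdfd}sf}x{dsf}}';

// Lazy version quoted in the answer.
preg_match('/\{\{(.*?)\}\}/', $input, $lazy);

// Assumed greedy alternative: a negated-class loop that cannot run past "}}".
preg_match('/\{\{((?:[^}]|\}[^}])*)\}\}/', $input, $classBased);

var_dump($lazy[1] === $classBased[1]); // bool(true)
echo $lazy[1], "\n";                   // fd}sdfd}sf}x{dsf
```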
\{{2}(.*)\}{2}
or, cleaner, with lookarounds: (?<=\{{2}).*(?=\}{2}), but only if your regex engine supports them.
If you want your match to stop at the first }} found (i.e. non-greedy), replace .* with .*?.
You should also take into account your engine's single-line matching setting, since in some engines . does not match newline characters by default. You can either enable single-line mode or use [\s\S]* instead of .* (a class such as [.\r\n]* will not do, because inside a character class . is a literal dot).
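The newline behavior is easy to check in PHP, where single-line mode is the s modifier. This is a small sketch with made-up input; [\s\S] is used as the modifier-independent alternative because a dot inside a character class is literal.

```php
<?php
$input = "{{line one\nline two}}";

// Without the s modifier, . stops at the newline, so nothing matches.
$plain = preg_match('/\{{2}(.*?)\}{2}/', $input, $m0);

// With the s (PCRE_DOTALL) modifier, . also matches newlines.
$dotall = preg_match('/\{{2}(.*?)\}{2}/s', $input, $m1);

// [\s\S] matches any character regardless of modifiers.
$class = preg_match('/\{{2}([\s\S]*?)\}{2}/', $input, $m2);

// Lookaround variant: the match itself excludes the delimiters.
preg_match('/(?<=\{{2}).*?(?=\}{2})/', '{{stack}}', $m3);

var_dump($plain, $dotall, $class); // int(0) int(1) int(1)
echo $m3[0], "\n";                 // stack
```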