Regular expression crashes Apache because of PCRE limits

Posted 2024-09-16 10:14:59

I am currently creating a BBCode parsing engine and have run into a situation that I can't figure out on my own.

The thing is, I ran into a problem exactly like this one:
"Apache / PHP on Windows crashes with regular expression"

That means that if I run something like the example below, Apache crashes because the recursion count reaches 690 (the 1MB memory limit for PCRE):

$txt = '[b]'.str_repeat('a', 338).'[/b]';  // if I change repeat count to lower value it's ok
$regex = '#\[(?P<attributes>(?P<tag>[a-z0-9_]*?)(?:=.*?|\s.*?|))](?P<content>(?:[^[]|\[(?!/?(?P=tag)])|(?R))+?)\[/(?P=tag)]#mi';

echo preg_replace_callback($regex, function($matches) { return $matches['content']; }, $txt);
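
A crash like this can sometimes be turned into an ordinary, catchable failure. The sketch below assumes the crash comes from PCRE running out of stack: it lowers PHP's pcre.recursion_limit setting so the engine gives up early, then checks preg_last_error(). The helper name guardedReplace and the limit of 500 are made up for illustration, and a JIT-enabled PCRE build may not honor this limit.

```php
<?php
// Hypothetical helper (not from the original post): run the replacement with a
// lowered recursion limit and report PCRE failures instead of crashing.
function guardedReplace(string $regex, callable $callback, string $txt): array
{
    // Assumption: 500 is an illustrative value; tune it to your stack size.
    $old = ini_set('pcre.recursion_limit', '500');

    // preg_replace_callback() returns null when PCRE reports an error.
    $result = preg_replace_callback($regex, $callback, $txt);
    $error  = preg_last_error();

    // Restore the previous limit so other code is unaffected.
    if ($old !== false) {
        ini_set('pcre.recursion_limit', $old);
    }
    return [$result, $error];
}

$regex = '#\[(?P<attributes>(?P<tag>[a-z0-9_]*?)(?:=.*?|\s.*?|))](?P<content>(?:[^[]|\[(?!/?(?P=tag)])|(?R))+?)\[/(?P=tag)]#mi';
[$out, $err] = guardedReplace($regex, function ($matches) {
    return $matches['content'];
}, '[b]' . str_repeat('a', 338) . '[/b]');

if ($out === null) {
    echo 'PCRE error code: ', $err, PHP_EOL;   // e.g. PREG_RECURSION_LIMIT_ERROR
} else {
    echo $out, PHP_EOL;
}
```

On failure the first element is null and the second holds the error code, so the caller can fall back to another parser rather than take Apache down.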

So I need to somehow minimize the use of * and + in my regex, but that's where I'm out of ideas, so I thought maybe you could suggest something.

Other approaches to parsing BBCode (ones that can handle nested tags) are welcome.
However, I would rather not use a ready-made class or anything like that. I like to do things on my own!

I have also looked into PECL and PEAR's HTML_BBCodeParser, but I don't want my application to depend on extensions. More likely, I will write a script that checks for the extension and, if it doesn't exist, falls back to the BBCode parser I'm trying to write here.

Sorry if my descriptions are unclear; I'm not great at English ^^

EDIT. So the regex explained:

\[(?P<attributes>(?P<tag>[a-z0-9_]*?)(?:=.*?|\s.*?|))]

This is my opening tag. I have used named groups. With 'tag' I identify the tag and with 'attributes' I identify the tag's attributes. Think of the tag as an attribute too. So what is happening here? I try to match a tag; once a tag is matched, I try to match anything after an = sign or anything after \s (whitespace) until it reaches the tag's closing ].

(?P<content>(?:[^[]|\[(?!/?(?P=tag)])|(?R))+?)

Now here I am trying to match the content. This is the tricky part. I match any character that is not [, or a [ that does not start an opening or closing tag with the same name, or a recursive match of the whole pattern via (?R), and I tell the regex engine to keep doing so until...

\[/(?P=tag)]

... the ending tag is found.
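
One way to cut down the number of group iterations, assuming (as the numbers in the question suggest) that every pass through the content group costs recursion depth, is to consume plain text in possessive runs. The variant below is only an illustration and not the original author's fix; its single change is that [^[] becomes [^[]++. Input with many brackets or deeply nested tags can still recurse heavily.

```php
<?php
// Illustrative variant of the pattern above. Only change: [^[] became the
// possessive run [^[]++, so a stretch of plain text is consumed in one group
// iteration instead of one iteration per character.
$regex = '#\[(?P<attributes>(?P<tag>[a-z0-9_]*?)(?:=.*?|\s.*?|))]'
       . '(?P<content>(?:[^[]++|\[(?!/?(?P=tag)])|(?R))+?)'
       . '\[/(?P=tag)]#mi';

$txt = '[b]' . str_repeat('a', 338) . '[/b]';

// Same callback as in the question: replace each match with its content.
echo preg_replace_callback($regex, function ($matches) {
    return $matches['content'];
}, $txt);
```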

Comments (2)

撑一把青伞 2024-09-23 10:14:59

Your regex, especially the zero-width assertions (lookarounds), causes the regex engine to backtrack catastrophically. Moral of the story: regex shouldn't be used to parse languages that are not regular. If you have nested structures, that's not a regular language.

In fact, I think BBCode is evil. BBCode is a markup language invented by lazy programmers who didn't want to filter HTML the proper way. As a result, we now have a loose "standard" that's hard to implement. Filter your HTML the right way:

http://htmlpurifier.org/
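
For readers who still want a hand-written BBCode parser that copes with nesting, a common alternative to a recursive regex is to tokenize with a flat pattern and track open tags on an explicit stack. The following is a minimal sketch under stated assumptions: the function name renderBbcode, the three-tag whitelist, and the HTML mapping are all hypothetical, and tag attributes are not handled.

```php
<?php
// Hypothetical sketch: a flat (non-recursive) regex splits the input into
// tokens, and an explicit stack tracks nesting.
function renderBbcode(string $txt): string
{
    $map = ['b' => 'strong', 'i' => 'em', 'u' => 'u'];   // illustrative whitelist

    // The capture group makes preg_split() keep the tags as separate tokens.
    $tokens = preg_split('#(\[/?[a-z0-9_]+\])#i', $txt, -1,
                         PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    if ($tokens === false) {
        return htmlspecialchars($txt, ENT_QUOTES, 'UTF-8');
    }

    $stack = [];   // names of the currently open tags, innermost last
    $out = '';
    foreach ($tokens as $token) {
        if (preg_match('#^\[(/?)([a-z0-9_]+)\]$#i', $token, $m)) {
            $closing = $m[1] === '/';
            $tag = strtolower($m[2]);
            if (isset($map[$tag])) {
                if (!$closing) {
                    $stack[] = $tag;
                    $out .= '<' . $map[$tag] . '>';
                    continue;
                }
                if (end($stack) === $tag) {
                    array_pop($stack);
                    $out .= '</' . $map[$tag] . '>';
                    continue;
                }
            }
        }
        // Plain text, unknown tags, and mismatched closers are emitted escaped.
        $out .= htmlspecialchars($token, ENT_QUOTES, 'UTF-8');
    }

    // Close anything the input left open, innermost first.
    while ($stack) {
        $out .= '</' . $map[array_pop($stack)] . '>';
    }
    return $out;
}

echo renderBbcode('[b]bold [i]and italic[/i][/b]'), PHP_EOL;
```

Because nesting lives in a PHP array instead of the regex engine's own recursion, the length of the text between tags does not add to any recursion depth.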

半夏半凉 2024-09-23 10:14:59

I was going to suggest a BBCodeParser...

"I have also looked into PECL and Pear HTML_BBCodeParser. But I don't want my application to be dependent on extensions"

I find that very strange. Why reinvent the wheel? One of the principles of good software engineering is DRY (Don't Repeat Yourself). You're trying to solve a problem that has already been solved.

"I like to do things on my own!"

That's not bad in and of itself, but there are times when you are better off using a tried and true solution, one that is better tested and more robust than your own (as you're finding out). That way you will spend time on the problem you actually want to solve instead of solving a problem that has already been solved. Don't fall into the trap of reinventing the wheel. :)

My suggestion (and solution) to you is to use a BBCode parser.

EDIT

Another thing is that you're parsing something that is HTML-like. Things of that nature don't lend themselves easily to being parsed by regular expressions.
