Why does preg_match_all crap out after so many characters?

Posted on 2024-09-09 19:39:01

I'm having a problem with my preg_match_all statement. It had been working perfectly while I was typing out an article, but once the article passed a certain length it suddenly stopped working altogether. Is this a known issue with the function, that it just doesn't do anything after so many characters?

    $number = preg_match_all("/(<!-- ([\w]+):start -->)\n?(.*?)\n?(<!-- \\2:stop -->)/s", $data, $matches, PREG_SET_ORDER);

It's been working fine all this time and works fine for other pages, but once that article passed a certain length, poof, it stopped working for that article. Is there another solution I can use to make it work for longer blocks of text? The article that is being processed is about 33,000 characters in length (including spaces).

I asked a question like this before but got only one answer, which I never actually tested. That time I found another way around it for that particular scenario, but this time there is no way around it because it's all one article. I tried raising pcre.backtrack_limit and pcre.recursion_limit as high as 500,000, with absolutely no effect. Are there any other ideas on why this is occurring and what I can do to keep it working even for these massive blocks of text? A 30,000-character limit seems a bit low; that's only 5,000-6,000 words (this one is about 5,700). Breaking it apart isn't really an option here, because the match won't be found if the start and stop markers end up in two separate blocks of text.
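One way to narrow this down is to check whether preg_match_all is returning 0 (no matches) or false (PCRE gave up), and then ask preg_last_error() which limit was hit. The following is only a sketch: the `$data` string is a made-up stand-in for the article, the `$names` map is a helper introduced here for readability, and PREG_JIT_STACKLIMIT_ERROR assumes PHP 7.0 or later.

```php
<?php
// Stand-in for the article text (roughly 35,000 characters); an assumption for this sketch.
$data = "<!-- intro:start -->\n" . str_repeat("word ", 7000) . "\n<!-- intro:stop -->";

// Same pattern as in the question.
$pattern = "/(<!-- ([\w]+):start -->)\n?(.*?)\n?(<!-- \\2:stop -->)/s";
$number = preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);

// Map the PREG_* error constants to readable names (illustrative helper).
$names = [
    PREG_NO_ERROR              => 'PREG_NO_ERROR',
    PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
    PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
    PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
    PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
    PREG_JIT_STACKLIMIT_ERROR  => 'PREG_JIT_STACKLIMIT_ERROR', // PHP 7.0+
];
$code = preg_last_error();
$status = $names[$code] ?? "unknown error $code";

if ($number === false) {
    // false means the engine aborted, which is different from "no match".
    echo "PCRE gave up: $status\n";
} else {
    echo "Found $number block(s), status: $status\n";
}
```

If the status turns out to be a backtrack or recursion limit error, raising the corresponding ini setting further may help; if it is a JIT stack limit error, those two settings would not be expected to make a difference, and turning off pcre.jit is a commonly suggested thing to try.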

Comments (2)

不寐倦长更 2024-09-16 19:39:01

I bumped into this one once, and the only way I could solve it back then was by splitting the string. You could use explode() or preg_split().

Quoting literally from my source code:

    // regexps have failed miserably on very large tables...
    $parts = explode("<table",$html);

But this was two years ago.
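Applied to the start/stop markers from the question, the splitting idea might look roughly like this. It is a sketch, not the code I used back then: `split_blocks` is a hypothetical helper name, and it splits on the marker comments themselves so that a start and its stop can never land in different chunks. Unlike the original pattern, it does not strip the optional newline around each body.

```php
<?php
// Hypothetical helper: split on the marker comments, then pair them up.
function split_blocks(string $data): array
{
    // With PREG_SPLIT_DELIM_CAPTURE the result has the shape
    // [text, name, kind, text, name, kind, ..., text].
    $parts = preg_split('/<!-- (\w+):(start|stop) -->/', $data, -1, PREG_SPLIT_DELIM_CAPTURE);

    $blocks = [];
    $open = null; // name of the block currently open, if any
    $body = '';
    for ($i = 1; $i + 2 < count($parts); $i += 3) {
        $name = $parts[$i];
        $kind = $parts[$i + 1];
        $text = $parts[$i + 2]; // text that follows this marker

        if ($open === null) {
            if ($kind === 'start') {
                $open = $name;
                $body = $text;
            }
            // A stray stop marker outside any block is ignored.
        } elseif ($kind === 'stop' && $name === $open) {
            $blocks[] = ['name' => $open, 'body' => $body];
            $open = null;
        } else {
            // Nested or unrelated marker: keep it as part of the body.
            $body .= '<!-- ' . $name . ':' . $kind . ' -->' . $text;
        }
    }
    return $blocks;
}

$data = "a<!-- x:start -->\nhello\n<!-- x:stop -->b<!-- y:start -->world<!-- y:stop -->";
$blocks = split_blocks($data);
```

The regex here only ever has to match one short marker at a time, so the length of the text between markers should not matter to the engine.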

回忆躺在深渊里 2024-09-16 19:39:01

It looks like you're working with HTML. You might want to consider working with one of the various parsers. For example, DOM has a specific class for comments, so we know it can work with them. Unfortunately the DOM is a bit awkward to work with.

Another option might be to use XMLReader, which reads XML as a stream and processes it as tokens along the way. It seems to understand what comments are. I've never used it myself, so I can't tell you how well it works. (You can use DOM's loadHTML and saveXML methods to convert your HTML into XML, assuming it's not too horribly formed.)
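For the DOM route, a minimal sketch of pulling the comment nodes out could look like the following. It assumes the dom extension is loaded and that the markup is well-formed enough for loadHTML; the `$html` sample and the `$markers` variable are made up for illustration.

```php
<?php
// Sample markup (an assumption for this sketch), using the marker style from the question.
$html = '<div><!-- intro:start --><p>Hello</p><!-- intro:stop --></div>';

// Collect parser warnings internally instead of emitting them for imperfect HTML.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// The XPath node test comment() selects every comment node in the document.
$xpath = new DOMXPath($doc);
$markers = [];
foreach ($xpath->query('//comment()') as $comment) {
    // Each node is a DOMComment; ->data is the text between <!-- and -->.
    $markers[] = trim($comment->data);
}
```

From there you would still have to walk the siblings between a start comment and its stop comment to collect the content, which is where the awkwardness comes in.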

Finally, you might consider writing a tokenizer or parser for your custom comments. It shouldn't be too difficult, and may well be faster for you to hack together than learning either of the XML solutions I've pointed out.
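A hand-rolled scanner along those lines could be as small as this. It is only a sketch under assumptions: `scan_blocks` is a hypothetical name, marker names are taken to be ASCII word characters, PHP 8 is assumed, and (unlike the regex in the question) the optional newline around each body is not stripped.

```php
<?php
// Hypothetical helper: find "<!-- name:start -->", then use strpos() to find
// the first matching "<!-- name:stop -->", so no regex runs over the body.
function scan_blocks(string $data): array
{
    $wordChars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_';
    $tail = ':start -->';
    $blocks = [];
    $offset = 0;

    while (($open = strpos($data, '<!-- ', $offset)) !== false) {
        $nameStart = $open + 5; // just past "<!-- "
        $nameLen = strspn($data, $wordChars, $nameStart);

        // Not a start marker: move past this "<!-- " and keep scanning.
        if ($nameLen === 0 || substr($data, $nameStart + $nameLen, strlen($tail)) !== $tail) {
            $offset = $nameStart;
            continue;
        }

        $name = substr($data, $nameStart, $nameLen);
        $bodyStart = $nameStart + $nameLen + strlen($tail);
        $stopMarker = '<!-- ' . $name . ':stop -->';
        $close = strpos($data, $stopMarker, $bodyStart);

        if ($close === false) {
            // Unterminated block: skip the start marker and carry on.
            $offset = $bodyStart;
            continue;
        }

        $blocks[] = ['name' => $name, 'body' => substr($data, $bodyStart, $close - $bodyStart)];
        $offset = $close + strlen($stopMarker);
    }
    return $blocks;
}

$data = "a<!-- x:start -->\nhello\n<!-- x:stop -->b<!-- y:start -->world<!-- y:stop -->";
$blocks = scan_blocks($data);
```

Because it only uses plain string searches, the work grows with the length of the text and none of the pcre.* limits come into play.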
