preg_replace returns null when the input is HTML (but not always)

Posted on 2024-10-14 22:42:03

I am reading in html from a few different sources which I have to manipulate. As part of this I have a number of preg_replace() calls where I have to replace some of the information within the html received.

On 90% of the sites I have to do this on, everything works fine; on the remaining 10%, every preg_replace() call returns NULL.

I've tried increasing pcre.backtrack_limit and pcre.recursion_limit, based on other articles I've found that appear to describe the same problem, but to no avail.

I have printed preg_last_error(), which returns 4. The PHP documentation isn't proving very helpful on that value, so if anyone can shed any light on it, that might point me in the right direction. I'm stumped.

One of the offending examples is:

$html = preg_replace('@<script[^>]*?.*?</script>@siu', '', $html);

but as I said, this works 90% of the time.

爱的十字路口 2024-10-21 22:42:04

Don't parse HTML with regex. Use a real DOM parser:

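$dom = new DOMDocument;
// Parse the markup into a DOM tree.
$dom->loadHTML($html);
// getElementsByTagName() returns a live list: removing item(0) shifts the next node into its place.
$scripts = $dom->getElementsByTagName('script');
while ($el = $scripts->item(0)) {
    $el->parentNode->removeChild($el);
}
// Serialize the tree back to an HTML string.
$html = $dom->saveHTML();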

雨落星ぅ辰 2024-10-21 22:42:04

You have bad UTF-8.

/**
 * Returned by preg_last_error if the last error was
 * caused by malformed UTF-8 data (only when running a regex in UTF-8 mode). Available
 * since PHP 5.2.0.
 * @link http://php.net/manual/en/pcre.constants.php
 */
define ('PREG_BAD_UTF8_ERROR', 4);

However, you really should not use regex to parse HTML. Use DOMDocument.

EDIT: Also, I don't think this answer would be complete without including "You can't parse [X]HTML with regex."
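
A minimal sketch of how error 4 can be reproduced, in case it helps confirm the diagnosis. The sample string is made up for illustration (a lone \xE9 byte, which is "é" in ISO-8859-1 but not valid UTF-8 on its own); the pattern is the one from the question.

```php
<?php
// Made-up input: "\xE9" is "é" in ISO-8859-1 but is not valid UTF-8 on its own.
$html = "<p>caf\xE9</p><script>alert(1)</script>";

// With the /u modifier the subject must be valid UTF-8, so the call fails.
$withU = preg_replace('@<script[^>]*?.*?</script>@siu', '', $html);
$errorCode = preg_last_error(); // read it now; the next preg_* call resets it

var_dump($withU === null);                     // bool(true)
var_dump($errorCode === PREG_BAD_UTF8_ERROR);  // bool(true)

// Without /u the subject is treated as raw bytes, so the same pattern matches.
$withoutU = preg_replace('@<script[^>]*?.*?</script>@si', '', $html);
var_dump($withoutU === "<p>caf\xE9</p>");      // bool(true)
```

Dropping /u is only reasonable when the pattern needs no Unicode awareness, which appears to be the case for this one; otherwise converting the input to UTF-8 first is probably the more robust route.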

开始看清了 2024-10-21 22:42:04

Your error #4 is PREG_BAD_UTF8_ERROR. You should check the character set used on the sites that trigger this error.
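
One way to act on this, sketched under assumptions: to_utf8() is a hypothetical helper, not something from the original post, and the source charset passed to it (ISO-8859-1 here) is assumed to be known, for example from the Content-Type header or the page's meta tag. It relies on the iconv extension.

```php
<?php
/**
 * Hypothetical helper: return $html as UTF-8.
 * $sourceCharset is an assumption supplied by the caller.
 */
function to_utf8(string $html, string $sourceCharset): string
{
    // An empty pattern with /u matches (returns 1) only if the subject is valid UTF-8.
    if (preg_match('//u', $html) === 1) {
        return $html;
    }
    $converted = iconv($sourceCharset, 'UTF-8', $html);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $sourceCharset to UTF-8");
    }
    return $converted;
}

// Made-up ISO-8859-1 input; "\xE9" is "é" in that charset.
$latin1 = "<p>caf\xE9</p><script>alert(1)</script>";
$clean = preg_replace('@<script[^>]*?.*?</script>@siu', '', to_utf8($latin1, 'ISO-8859-1'));

var_dump($clean === "<p>caf\xC3\xA9</p>"); // bool(true): "é" is now two UTF-8 bytes
```

Input that is already valid UTF-8 is returned untouched, so the helper can be applied to every site rather than only the failing 10%.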

请止步禁区 2024-10-21 22:42:04

It is possible that you exceeded the backtrack and/or internal recursion limits. See http://php.net/manual/en/pcre.configuration.php

Try this before preg_replace:

ini_set('pcre.backtrack_limit', '10000000');
ini_set('pcre.recursion_limit', '10000000');
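
To check whether the limits are really the cause, it may help to translate the numeric code from preg_last_error() into its constant name. The helper below, preg_error_name(), is a hypothetical name used only for this sketch; the constants themselves are PHP's own (PREG_JIT_STACKLIMIT_ERROR needs PHP 7.0 or later).

```php
<?php
// Hypothetical helper: map a preg_last_error() code to its constant name.
function preg_error_name(int $code): string
{
    $names = [
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
        PREG_JIT_STACKLIMIT_ERROR  => 'PREG_JIT_STACKLIMIT_ERROR',
    ];
    return $names[$code] ?? "unknown ($code)";
}

// The code reported in the question:
echo preg_error_name(4), "\n"; // PREG_BAD_UTF8_ERROR
```

Raising the limits addresses codes 2 and 3 only. A code of 4, as reported in the question, points at the input encoding instead.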
