PHP DOMDocument nodeValue dumps literal UTF-8 characters instead of encoded characters

Posted 2024-10-20 11:33:57

I am experiencing an issue similar to this question:

nodeValue from DomDocument returning weird characters in PHP

The root cause I found can be reproduced with mb_convert_encoding().

In my unit tests, this finally caught the issue:

$test = mb_convert_encoding('é', "UTF-8");
$this->assertTrue(mb_check_encoding($test,'UTF-8'),'data is UTF-8');
$this->assertTrue($this->rw->checkEncoding($test,'UTF-8'),'data is UTF-8');
$this->assertIdentical($test,html_entity_decode('&eacute;',ENT_QUOTES,'UTF-8'),'values match');

The raw UTF-8 bytes appear to be coming through unchanged, and the base code page of the system PHP is running on is most likely not UTF-8.

Right up through parsing (with an HTML5lib implementation that builds a DOMDocument), the strings stay clean, valid UTF-8. Only when I pull data out using

$span->nodeValue

does the encoding break.

My guess is that the htmlentities-style handling applied when DOMDocument exports to nodeValue runs an encoding converter but disregards the inline encoding value.

Given that my issue is with HTML5, I figured it would be directly related to the newness of the implementation, but it appears to be a broader issue. I haven't been able to find any information on this issue specific to DOMDocument via searches, other than the question mentioned at the beginning.
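For reference, here is a minimal sketch of what nodeValue hands back. It assumes the dom and mbstring extensions are loaded, and it builds the tree by hand rather than through HTML5lib, so the variable names and the sample text are illustrative only. The point it shows: nodeValue returns decoded text as UTF-8 bytes, not entities, so any entity encoding has to be applied afterwards with an explicit charset.

```php
<?php
// Build a tiny DOM by hand; the text is given as explicit UTF-8 bytes
// so the encoding of this source file does not matter.
$doc = new DOMDocument('1.0', 'UTF-8');
$span = $doc->createElement('span');
$span->appendChild($doc->createTextNode("caf\xC3\xA9")); // "café"
$doc->appendChild($span);

// nodeValue is the element's text content as a UTF-8 string.
$value = $span->nodeValue;
$isUtf8 = mb_check_encoding($value, 'UTF-8');

// If entities are wanted, encode explicitly and name the charset.
$escaped = htmlentities($value, ENT_QUOTES, 'UTF-8');
$roundTrip = html_entity_decode($escaped, ENT_QUOTES, 'UTF-8');

echo bin2hex($value), "\n"; // 636166c3a9
echo $escaped, "\n";        // caf&eacute;
```

If the bytes printed by bin2hex are already valid UTF-8 at this point, the corruption is more likely happening where the string is later output or compared than inside DOMDocument itself.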

UPDATE

In the name of moving forward, I have switched from HTML5lib and DOMDocument to Simple HTML DOM, which exports cleanly escaped HTML that I can then parse back into the correct UTF-8 characters.

Also, one function I did not try was utf8_decode, so that may be a solution for anyone else experiencing this issue. It solved a related issue I was having with AJAX/PHP; I found the solution in this blog post from 2009: Overcoming AJaX UTF-8 Encoding Limitation (in PHP)

Comments (2)

内心旳酸楚 2024-10-27 11:33:58

I just used utf8_decode on a nodeValue and it did more or less work; I had been having the problem of special characters not displaying correctly.

However, some characters still remain problematic, such as the simple quote ' and a few others (œ for example)

So using $element->nodeValue will not work, but utf8_decode($element->nodeValue) will, though only PARTLY.

空城缀染半城烟沙 2024-10-27 11:33:58

The functions utf8_decode and utf8_encode are not very well named. They literally convert from utf-8 to iso-8859-1 and from iso-8859-1 to utf-8 respectively.
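A small sketch of why that conversion only partly works for characters such as œ. It uses mb_convert_encoding with explicit encodings in place of utf8_decode (which is deprecated as of PHP 8.2), assumes the mbstring extension with its default substitute character, and the helper name toLatin1 is hypothetical.

```php
<?php
// Explicit UTF-8 byte sequences, so this file's own encoding is irrelevant.
$eAcute = "\xC3\xA9";         // é, U+00E9: exists in ISO-8859-1
$oeLigature = "\xC5\x93";     // œ, U+0153: not in ISO-8859-1
$curlyQuote = "\xE2\x80\x99"; // ’, U+2019: not in ISO-8859-1

// Hypothetical helper: converts in the same direction as utf8_decode().
function toLatin1(string $utf8): string
{
    return mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
}

echo bin2hex(toLatin1($eAcute)), "\n"; // e9
echo toLatin1($oeLigature), "\n";      // ?
echo toLatin1($curlyQuote), "\n";      // ?
```

Anything outside ISO-8859-1 is replaced by a substitute character, which would explain é surviving while œ and typographic quotes do not.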

mb_convert_encoding, when called with just utf-8 as the target encoding, will normally behave like utf8_encode. ("Normally" meaning unless you changed the internal encoding, which you probably, hopefully, didn't.)
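To make the role of the internal encoding concrete, here is a sketch that sets it explicitly and compares the two-argument and three-argument forms. Setting the internal encoding to ISO-8859-1 is an assumption made purely for the demonstration; what your installation uses by default depends on its configuration and PHP version.

```php
<?php
// Remember the current setting so it can be restored afterwards.
$previous = mb_internal_encoding();

// Assumption for the demo: pretend the internal encoding is ISO-8859-1.
mb_internal_encoding('ISO-8859-1');

$latin1 = "\xE9"; // é as a single ISO-8859-1 byte

// With no source encoding given, the internal encoding is used as the source.
$implicit = mb_convert_encoding($latin1, 'UTF-8');

// Naming the source encoding removes the dependency on that setting.
$explicit = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

mb_internal_encoding($previous);

echo bin2hex($implicit), "\n"; // c3a9
echo bin2hex($explicit), "\n"; // c3a9
```

Passing the source encoding explicitly is the safer habit, since the two-argument form silently changes meaning when the internal encoding differs.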

Most of PHP's functions expect strings to be iso-8859-1 encoded. However, libxml (which is the underlying library of PHP's XML parsing libraries) expects strings to be utf-8. As such, you can easily end up with mangled encodings if you aren't cautious.

As for your test, the first line may be deceptive. Since you have a literal é in your script, the test would change depending on which encoding you have saved the file in. Check your text editor for that.
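One way to take the editor out of the equation is to write the test data as explicit byte escapes. The sketch below (variable names are illustrative, mbstring assumed) shows the two byte sequences a literal é could turn into, and which of them matches the decoded entity.

```php
<?php
// The same character as it would be stored by two different file encodings.
$asUtf8 = "\xC3\xA9"; // é saved in a UTF-8 file (two bytes)
$asLatin1 = "\xE9";   // é saved in an ISO-8859-1 file (one byte)

// Reference value: the entity decoded to UTF-8.
$expected = html_entity_decode('&eacute;', ENT_QUOTES, 'UTF-8');

var_dump(mb_check_encoding($asUtf8, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($asLatin1, 'UTF-8')); // bool(false)
var_dump($asUtf8 === $expected);                 // bool(true)
var_dump($asLatin1 === $expected);               // bool(false)
```

With byte escapes the test asserts the same thing no matter how the script file is saved.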

Hope that clarifies a bit.
