PHP DOMDocument nodeValue dumps literal UTF-8 characters instead of encoded characters

Posted 2024-10-20 11:33:57

I am experiencing an issue similar to this question:

nodeValue from DomDocument returning weird characters in PHP

The root cause I found can be reproduced with mb_convert_encoding().

In my unit tests, this finally caught the issue:

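// With no source encoding given, mb_convert_encoding() converts from mbstring's internal encoding.
// If that is not UTF-8, the UTF-8 bytes of 'é' get encoded a second time.
$test = mb_convert_encoding('é', "UTF-8");
$this->assertTrue(mb_check_encoding($test, 'UTF-8'), 'data is UTF-8');
$this->assertTrue($this->rw->checkEncoding($test, 'UTF-8'), 'data is UTF-8');
// '&Atilde;&copy;' decodes to 'Ã©', the double-encoded form of 'é'.
$this->assertIdentical($test, html_entity_decode('&Atilde;&copy;', ENT_QUOTES, 'UTF-8'), 'values match');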

The raw UTF-8 bytes appear to be coming through unchanged, and the base code page of the system PHP is running on is most likely not UTF-8.
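
To make the suspected double encoding concrete, here is a minimal sketch that forces the same effect regardless of how the machine is configured. It assumes ISO-8859-1 as the wrongly assumed source encoding (my guess for illustration, the actual code page on the affected system is unknown) and writes the bytes with escapes so the encoding of the script file itself does not matter.

```php
<?php
// 'é' as UTF-8 bytes.
$utf8 = "\xC3\xA9";

// Misread those bytes as ISO-8859-1 (assumed for illustration) and convert to UTF-8.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// The same value the unit test compares against.
$fromEntities = html_entity_decode('&Atilde;&copy;', ENT_QUOTES, 'UTF-8');

echo bin2hex($utf8), PHP_EOL;   // two bytes
echo bin2hex($double), PHP_EOL; // four bytes: each original byte became its own character
echo $double === $fromEntities ? 'match' : 'no match', PHP_EOL;
```

Note that the double-encoded string is still valid UTF-8, which is why a check like mb_check_encoding() alone does not flag it.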

All the way up until parsing (with an HTML5lib implementation that dumps to DOMDocument) the strings stay clean, UTF-8 friendly. Only at the point of pulling data using

$span->nodeValue

do I see a failure in encoding stability.

My guess is that the htmlentities-style handling DOMDocument applies when exporting to nodeValue uses an encoding converter but disregards the encoding declared inline.
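
One way to check where the encoding goes wrong is to compare nodeValue with and without a declared charset. The sketch below is only an approximation of my setup: it uses DOMDocument::loadHTML() directly instead of HTML5lib, and the meta tag prefix is a commonly used workaround, not something taken from the HTML5lib code. PHP's DOM extension hands strings back as UTF-8, so any difference between the two values comes from how the input bytes were interpreted at load time.

```php
<?php
// A span containing 'é' as UTF-8 bytes.
$html = "<span>\xC3\xA9</span>";

// No charset declared: libxml has to guess the input encoding.
$naive = new DOMDocument();
$naive->loadHTML($html);
$naiveValue = $naive->getElementsByTagName('span')->item(0)->nodeValue;

// Charset declared up front through a meta tag.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
$declared = new DOMDocument();
$declared->loadHTML($meta . $html);
$declaredValue = $declared->getElementsByTagName('span')->item(0)->nodeValue;

// Compare the raw bytes rather than the rendered characters.
echo bin2hex($naiveValue), PHP_EOL;
echo bin2hex($declaredValue), PHP_EOL;
```

If the first line is longer than the second, the undeclared input was most likely read as a single-byte encoding and re-encoded on the way in, which would point at loading rather than at nodeValue itself.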

Given that my issue is with HTML5, I figured it would be directly related to the newness of the implementation, but it appears to be a broader issue. I haven't been able to find any information on this issue specific to DOMDocument via searches, other than the question mentioned at the beginning.

UPDATE

In the name of moving forward, I have switched from HTML5lib and DOMDocument to Simple HTML DOM. It exports cleanly escaped HTML, which I can then parse back into the correct UTF-8 entities.

Also, one function I did not try was

utf8_decode

So that may be a solution for anyone else experiencing this issue. It solved a related issue I was experiencing with AJAX/PHP, solution found on this blog post from 2009: Overcoming AJaX UTF-8 Encoding Limitation (in PHP)
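
For reference, a minimal sketch of what that repair could look like. utf8_decode() is deprecated as of PHP 8.2, so this uses mb_convert_encoding() from UTF-8 to ISO-8859-1 instead, which performs the same conversion. The helper name undoDoubleEncoding is hypothetical, and the whole thing assumes the damage is exactly one extra ISO-8859-1 to UTF-8 pass, which I have not confirmed for the DOMDocument case.

```php
<?php
/**
 * Hypothetical helper: undo one extra ISO-8859-1 -> UTF-8 pass.
 * Returns the input unchanged when the result would not be valid UTF-8,
 * which is the case for text that was correctly encoded to begin with.
 */
function undoDoubleEncoding(string $value): string
{
    $candidate = mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');
    return mb_check_encoding($candidate, 'UTF-8') ? $candidate : $value;
}

$double = "\xC3\x83\xC2\xA9"; // double-encoded 'é'
$clean  = "\xC3\xA9";         // correctly encoded 'é'

echo bin2hex(undoDoubleEncoding($double)), PHP_EOL;
echo bin2hex(undoDoubleEncoding($clean)), PHP_EOL;
```

The validity check is a heuristic: characters outside ISO-8859-1 are replaced during the conversion, so this is only safe on strings known to have gone through the faulty path.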
