PHP: Whenever I try to write UTF-8 using DOMDocument, it writes the hexadecimal notation of it

Posted on 2024-09-16 12:44:32

When I try to write UTF-8 strings into an XML file using DOMDocument, it writes the hexadecimal notation of the string instead of the string itself.

For example:

&#x5D9;&#x5E8;&#x5D5;&#x5E9;&#x5DC;&#x5D9;&#x5DD;

instead of:

ירושלים

Any ideas how to resolve the issue?

陪你到最终 2024-09-23 12:44:32

Ok, here you go:

$dom = new DOMDocument('1.0', 'utf-8');
$dom->appendChild($dom->createElement('root'));
$dom->documentElement->appendChild(new DOMText('ירושלים'));
echo $dom->saveXml();

will work fine, because in this case, the document you constructed will retain the encoding specified as the second argument:

<?xml version="1.0" encoding="utf-8"?>
<root>ירושלים</root>

However, once you load XML that does not declare an encoding, the document loses whatever you declared in the constructor, which means:

$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadXml('<root/>'); // missing prolog
$dom->documentElement->appendChild(new DOMText('ירושלים'));
echo $dom->saveXml();

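will not have an encoding of utf-8, so the text is written as hexadecimal character references:

<?xml version="1.0"?>
<root>&#x5D9;&#x5E8;&#x5D5;&#x5E9;&#x5DC;&#x5D9;&#x5DD;</root>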

So if you load something with loadXML, make sure it declares its encoding:

$dom = new DOMDocument();
$dom->loadXml('<?xml version="1.0" encoding="utf-8"?><root/>');
$dom->documentElement->appendChild(new DOMText('ירושלים'));
echo $dom->saveXml();

and it will work as expected.

As an alternative, you can also specify the encoding after loading the document.
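
A minimal sketch of that alternative, assuming the input has no XML declaration: the `encoding` property of the document is set after `loadXML()` and before serializing. The variable names are illustrative only.

```php
<?php
// Load XML that has no prolog, so the document carries no encoding yet.
$dom = new DOMDocument();
$dom->loadXML('<root/>');
$dom->documentElement->appendChild(new DOMText('ירושלים'));

// Specify the encoding after loading, before serializing.
$dom->encoding = 'UTF-8';

$xml = $dom->saveXML();
echo $xml;
```

With the encoding set this way, the serialized document should carry the `encoding="UTF-8"` declaration and the literal characters, just as in the example that loads a prolog.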

ゝ偶尔ゞ 2024-09-23 12:44:32

If you want to output UTF-8 with DOMDocument, you need to specify that. Simple, isn't it? If you already smell a trick question, you're not too far off, but at first sight, it really is straightforward.

Consider the following (UTF-8 encoded) code example that outputs hexadecimal entities:

$dom = new DOMDocument();
$dom->loadXml('<root>ירושלים</root>');
$dom->save('php://output');

Output:

<?xml version="1.0"?>
<root>&#x5D9;&#x5E8;&#x5D5;&#x5E9;&#x5DC;&#x5D9;&#x5DD;</root>

As mentioned, if you want to output this as UTF-8, you need to specify it, which is straightforward:

...
$dom->encoding = 'UTF-8';
$dom->save('php://output');

The output then is in UTF-8 explicitly:

<?xml version="1.0" encoding="UTF-8"?>
<root>ירושלים</root>

So much for the straightforward part. If you are interested in the dirty little details, feel free to read on; if not, please do not ask "why?" :).

I wrote "in UTF-8 explicitly" because the output of the first example is UTF-8 encoded as well; the XML just contained hexadecimal entities, which is perfectly valid, even in UTF-8!

You may notice that I am starting to nit-pick here, but remember: UTF-8 is the default encoding of XML.

And if you now say: hey wait, if the default encoding is UTF-8 anyway, why does PHP's DOMDocument use the entities in the first place?

Well, the truth is that, contrary to the finding in the question, it does not. Not always, anyway.

See the following example, which uses an XML comment instead of a node value containing the Hebrew (Ivrit) letters:

$dom = new DOMDocument();
$dom->loadXml('<root><!-- ירושלים --></root>');
$dom->save('php://output');

Output:

<?xml version="1.0"?>
<root><!-- ירושלים --></root>

Okay, all clear? So the dirty little secret here is: whether you've got those XML entities in there or not makes no difference to the document; it is just a different way of writing the same XML character data. And you probably already feel invited: let's try CDATA instead for the first example:

$dom = new DOMDocument();
$dom->loadXML("<root><![CDATA[ירושלים]]></root>");
$dom->save('php://output');

Output:

<?xml version="1.0"?>
<root><![CDATA[ירושלים]]></root>

As with the XML comment example before, no XML entities are used here. They would not be valid anyway, just as in an XML comment.

For an overview, let's create an example that contains all of these:

$dom = new DOMDocument();
$dom->loadXML("<!-- ירושלים --><root>ירושלים <![CDATA[ירושלים]]></root>");
$dom->save('php://output');

Output:

<?xml version="1.0"?>
<!-- ירושלים -->
<root>&#x5D9;&#x5E8;&#x5D5;&#x5E9;&#x5DC;&#x5D9;&#x5DD; <![CDATA[ירושלים]]></root>
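
The claim that entities and literal characters are two spellings of the same character data can be checked with a small sketch. It parses both spellings and compares the resulting text content; the variable names are illustrative and not part of the original answer.

```php
<?php
// The same word, once as literal UTF-8 and once as hexadecimal character references.
$literal  = '<root>ירושלים</root>';
$entities = '<root>&#x5D9;&#x5E8;&#x5D5;&#x5E9;&#x5DC;&#x5D9;&#x5DD;</root>';

$a = new DOMDocument();
$a->loadXML($literal);

$b = new DOMDocument();
$b->loadXML($entities);

// After parsing, both documents hold the same character data.
$same = $a->documentElement->textContent === $b->documentElement->textContent;
var_dump($same);
```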

Lessons learned:

UTF-8 is always used. Entities are merely used for some PCDATA unless the UTF-8 encoding is specified. If an encoding other than UTF-8 is specified, different rules apply.
You cannot control whether entities are used for output merely by loading an XML document as a UTF-8 encoded string into PHP's DOMDocument. Neither libxml flags nor a BOM change this. ^[1]
You can specify that you do not want entities by setting the document's encoding to UTF-8.
If you can manipulate the input string, you can give it an XML declaration that specifies the document's encoding, as outlined in gordon's answer.

Tip: If your string has an XML declaration that does not match the string's actual encoding, or you want to change either of the two before loading the string into DOMDocument, you need to change the XML declaration and/or re-encode the string. This is covered in an answer to the question "PHP XMLReader, get the version and encoding", which shows how the XMLRecoder class works.
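
As a rough sketch of the re-encoding half of that tip (this is not the XMLRecoder class, only an illustration with `mb_convert_encoding()`, which requires the mbstring extension): the hypothetical input below declares UTF-8 but actually contains ISO-8859-1 bytes, so the bytes are converted to match the declaration before loading.

```php
<?php
// Hypothetical input: the bytes are ISO-8859-1 ("\xE9" is "é"),
// but the declaration claims UTF-8, so the two do not match.
$input = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root>caf\xE9</root>";

// Re-encode the bytes so that they match the declaration (requires ext-mbstring).
$utf8 = mb_convert_encoding($input, 'UTF-8', 'ISO-8859-1');

$dom = new DOMDocument();
$dom->loadXML($utf8);

$text = $dom->documentElement->textContent;
echo $dom->saveXML();
```

The other direction, leaving the bytes alone and rewriting the declaration to name the real encoding, should work as well; which one fits depends on where the string comes from.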

And that's it, hopefully.

^[1] It might work if you load from an HTTP request, provide a stream context, and flag the character encoding via metadata, but this should be tested first; I do not know. That the BOM does not work is something of a sign that none of these things work.

陌伤浅笑 2024-09-23 12:44:32

Apparently passing the documentElement as $node to saveXML works around this, although I can't say I understand why.

e.g.

$dom->saveXML($dom->documentElement);

rather than:

$dom->saveXML();

Source: http://www.php.net/manual/en/domdocument.savexml.php#88525
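
A small sketch for trying this workaround, with illustrative variable names: it serializes the same document both ways so the two results can be compared. Note that the node form returns only the element markup, without the XML declaration.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<root>ירושלים</root>');

// Serialize the whole document, then only the root element.
$whole = $dom->saveXML();
$node  = $dom->saveXML($dom->documentElement);

echo $whole, "\n", $node, "\n";

// Parse the node output again to confirm that the text content survives.
$check = new DOMDocument();
$check->loadXML($node);
$roundTrip = $check->documentElement->textContent;
```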

雾里花 2024-09-23 12:44:32

The to-the-point answer is:

When your function starts, right after you get the content, do this:

// Note: the 'HTML-ENTITIES' target encoding is deprecated as of PHP 8.2.
$content = mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8');

Then create the new document, and so on. For example:

if ( empty( $content ) ) {
    return false;
}
$doc = new DOMDocument('1.0', 'utf-8');
libxml_use_internal_errors(true);
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

Then do whatever you were intending to do with your code.

一身软味 2024-09-23 12:44:32

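$doc = new DOMDocument();
// Hack: the prepended XML encoding declaration makes loadHTML() read $html as UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

// Dirty fix: remove the processing instruction that the hack left in the document.
// Iterate over a copy, because removing nodes from a live node list can skip items.
foreach (iterator_to_array($doc->childNodes) as $item) {
    if ($item->nodeType == XML_PI_NODE) {
        $doc->removeChild($item); // remove hack
    }
}
$doc->encoding = 'UTF-8'; // insert proper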
