Parsing XML with special characters (UTF-8)

Posted 2024-08-23 08:56:27

I'm starting out with some XML that looks like this (simplified):

<?xml version="1.0" encoding="UTF-8"?>
<alldata>
   <data name="Forsetì" />
</alldata>

But after I've parsed it with simplexml_load_string the special character (the i) becomes: Ã¬ which is obviously pretty mangled.

Is there a way to prevent this from happening?

I know for a fact the XML is fine: when saved as .txt and viewed in the browser, the characters are fine. When I use simplexml_load_string on the XML and then save values to a text file or to the database, it's mangled.

痞味浪人 2024-08-30 08:56:27

It looks like SimpleXML is creating a UTF-8 string, which is then rendered as ISO-8859-1 (Latin-1) or something close to it, such as CP-1252.

When you save the result to a file and serve that file via a web server, the browser will use the encoding declared in the file.
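To check the first half of that claim, here is a minimal sketch that parses the question's sample and dumps the raw bytes of the attribute. The XML is written with explicit byte escapes so the result does not depend on how this script file is saved; the variable names are illustrative only.

```php
<?php
// XML source given as explicit UTF-8 bytes: "\xC3\xAC" is ì (U+00EC).
$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
     . "<alldata>\n"
     . "   <data name=\"Forset\xC3\xAC\" />\n"
     . "</alldata>";

$doc  = simplexml_load_string($xml);
$name = (string) $doc->data['name'];

// Dump the bytes SimpleXML handed back for the attribute value.
echo bin2hex($name), "\n"; // 466f72736574c3ac
```

The trailing `c3ac` is the UTF-8 encoding of ì, so the value is intact when it leaves SimpleXML; the damage happens when those bytes are later labelled as something other than UTF-8.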

Including in a web page
Since your web page encoding is not UTF-8, you need to convert the string to whatever encoding the page uses, e.g. ISO-8859-1 (Latin-1).

This is easily done with iconv():

    $xmlout = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $xmlout);
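As a rough illustration of what that conversion does at the byte level, the sketch below converts the name from the question and prints both byte sequences. The variable names are made up for the example, and the plain `ISO-8859-1` target is used here on the assumption that every character is representable in Latin-1.

```php
<?php
$utf8 = "Forset\xC3\xAC"; // "Forsetì" as UTF-8 bytes

// Plain target charset here; the //TRANSLIT suffix used above additionally asks
// iconv to approximate characters the target cannot represent, and its support
// varies between iconv implementations.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8);

echo bin2hex($utf8), "\n";   // 466f72736574c3ac
echo bin2hex($latin1), "\n"; // 466f72736574ec
```

The two-byte sequence `c3 ac` becomes the single Latin-1 byte `ec`, which a Latin-1 page then displays as ì.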

Saving to database
Your database column is not using a UTF-8 character set, so you should use iconv to convert the string to the charset that your database uses.

Assuming your database character set is the same as the encoding that you render in, you will not have to do anything when reading from the database.

Explanation
In UTF-8, the characters of the "Latin-1 Supplement" block (U+0080 to U+00FF) are encoded as two bytes. A 0xC2 lead byte covers the lower half of the block: the currency symbols, fractions, superscript 2 and 3, the copyright and registered trademark symbols, and the non-breaking space. A 0xC3 lead byte covers the upper half, which holds the accented letters, so ì (U+00EC) is stored as 0xC3 0xAC.

However, in ISO-8859-1 every byte is a character of its own: 0xC2 is Â, 0xC3 is Ã and 0xAC is ¬. So when your UTF-8 string is misinterpreted as ISO-8859-1, you get Â or Ã followed by some other nonsense character, which is exactly the Ã¬ you are seeing.
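For the ì in the question, the misreading can be reproduced directly. This is only a sketch: it takes the two UTF-8 bytes of ì, pretends each one is an ISO-8859-1 character, and re-encodes the result as UTF-8 so it can be printed.

```php
<?php
$utf8 = "\xC3\xAC"; // ì (U+00EC) encoded as UTF-8

// Treat each byte as an ISO-8859-1 character and re-encode as UTF-8,
// which is effectively what a Latin-1 page or column does to the data.
$mangled = iconv('ISO-8859-1', 'UTF-8', $utf8);

echo $mangled, "\n";          // shows as Ã¬ on a UTF-8 terminal
echo bin2hex($mangled), "\n"; // c383c2ac
```

The output is the same two characters reported in the question, which points at a Latin-1 interpretation somewhere between the parser and the display or storage layer.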

北凤男飞 2024-08-30 08:56:27

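The XML is most likely fine, but the characters get mangled when they are stored or output.

If you are outputting the data on an HTML page: make sure the page is encoded in UTF-8 as well. If your HTML page is in ISO-8859-1, you can use utf8_decode as a quick fix; in the long run, using UTF-8 is the better option.

If you are storing the data in MySQL, you need to select UTF-8 as the encoding all the way through: as the connection's encoding, in the table, and in the column(s) you insert the data into.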
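One way to cover the connection part is to request the character set in the PDO DSN. The sketch below is an assumption-laden illustration: `utf8_mysql_dsn` is a hypothetical helper, the host, database name and credentials are placeholders, and the connection itself is left commented out so the snippet runs without a database.

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that requests a UTF-8 connection.
function utf8_mysql_dsn(string $host, string $dbname): string
{
    // utf8mb4 is MySQL's full UTF-8 character set.
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

// Connection with placeholder credentials (not executed here):
// $pdo = new PDO(utf8_mysql_dsn('localhost', 'mydb'), 'user', 'secret');
// Table and column character sets are set in the DDL, for example:
// CREATE TABLE data (name VARCHAR(255)) CHARACTER SET utf8mb4;

echo utf8_mysql_dsn('localhost', 'mydb'), "\n";
```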

权谋诡计 2024-08-30 08:56:27

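I have run into this kind of problem too, and it came from the encoding of the PHP script itself. Make sure it is set to UTF-8.
If it is still wrong, try printing the variable with utf8_encode or utf8_decode.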

总攻大人 2024-08-30 08:56:27

XML is strict when it comes to entities: & should be written as &amp; and ì as &#236;

So you will need a translation table.

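function xml_entity_decode($_string) {
    // Set up XML translation table: numeric entity => literal character
    $_xml = array();
    $_xl8 = get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'UTF-8');
    // each() was removed in PHP 8, so iterate over the keys with foreach
    foreach (array_keys($_xl8) as $_key) {
        // mb_ord() (mbstring, PHP 7.2+) returns the full code point; ord() only sees the first byte
        $_xml['&#' . mb_ord($_key, 'UTF-8') . ';'] = $_key;
    }
    return strtr($_string, $_xml);
}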
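As a point of comparison, and not the function above, PHP's built-in html_entity_decode also resolves numeric character references. A minimal sketch, assuming the input is an XML-escaped string and the desired output is UTF-8; the sample string is made up for the example:

```php
<?php
$escaped = 'Forset&#236; &amp; co';

// ENT_XML1 restricts named entities to the XML ones (&amp; &lt; &gt; &quot; &apos;);
// numeric references such as &#236; are decoded into the requested charset.
$decoded = html_entity_decode($escaped, ENT_QUOTES | ENT_XML1, 'UTF-8');

echo $decoded, "\n";
echo bin2hex(html_entity_decode('&#236;', ENT_QUOTES | ENT_XML1, 'UTF-8')), "\n"; // c3ac
```

The decoded ì is again the UTF-8 byte pair `c3 ac`, so the output encoding still has to be declared as UTF-8 wherever the string ends up.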
