Parsing XML with special characters (UTF-8)
I'm starting out with some XML that looks like this (simplified):
<?xml version="1.0" encoding="UTF-8"?>
<alldata>
<data name="Forsetì" />
</alldata>
But after I've parsed it with simplexml_load_string, the special character (the ì) becomes: Ã¬
which is obviously pretty mangled.
Is there a way to prevent this from happening?
I know for a fact the XML is fine: when it is saved as .txt and viewed in the browser, the characters are fine. When I use simplexml_load_string on the XML and then save values to a text file or to the database, they are mangled.
It looks like SimpleXML is creating a UTF-8 string, which is then rendered as ISO-8859-1 (latin-1) or something close to it, like CP-1252.
When you save the result to a file and serve that file via a web server, the browser will use the encoding declared in the file.
Including in a web page
Since your web page encoding is not UTF-8, you need to convert the string to whatever encoding you are using, e.g. ISO-8859-1 (latin-1).
This is easily done with iconv():
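A minimal sketch of that conversion, assuming the XML from the question and a page served as ISO-8859-1. The variable names are illustrative, and the target charset is an assumption you would replace with your page's actual encoding.

```php
<?php
// Assumed input: the XML from the question, as a UTF-8 string.
$xmlString = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
           . "<alldata><data name=\"Forset\u{00EC}\" /></alldata>";

$xml = simplexml_load_string($xmlString);

// SimpleXML hands back UTF-8 strings.
$name = (string) $xml->data['name'];

// Convert to the page's encoding (ISO-8859-1 is an assumption here).
$latin1 = iconv('UTF-8', 'ISO-8859-1', $name);

echo bin2hex($name), "\n";   // 466f72736574c3ac (ì is two bytes in UTF-8)
echo bin2hex($latin1), "\n"; // 466f72736574ec   (ì is one byte in ISO-8859-1)
```

Printing the bytes with bin2hex sidesteps whatever encoding your terminal or browser happens to use, so you can see what the string really contains.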
Saving to database
Your database column is not using a UTF-8 character set, so you should use iconv to convert the string to the charset that your database uses.
Assuming your database charset is the same as the encoding that you render in, you will not have to do anything when reading from the database.
Explanation
In UTF-8, the lead bytes 0xC2 and 0xC3 are used to reach the "Latin-1 Supplement" block. 0xC2 covers U+0080 to U+00BF, which includes currency symbols, fractions, superscript 2 and 3, the copyright and registered trademark symbols, and the non-breaking space. 0xC3 covers U+00C0 to U+00FF, which holds the accented letters such as ì.
However, in ISO-8859-1 the byte 0xC2 represents Â and 0xC3 represents Ã. So when your UTF-8 string is misinterpreted as ISO-8859-1, you get Â or Ã followed by some other nonsense character. Here, ì (0xC3 0xAC) shows up as Ã¬.
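The misreading can be reproduced in a few lines. This is a sketch, not the asker's code: it takes the UTF-8 bytes of ì, treats them as ISO-8859-1, and re-encodes them as UTF-8, which is effectively what a Latin-1 viewer does when it displays them.

```php
<?php
$utf8 = "\u{00EC}";            // ì, stored as two bytes in UTF-8
echo bin2hex($utf8), "\n";     // c3ac

// Pretend the two bytes are ISO-8859-1 characters and re-encode each as UTF-8.
$mojibake = iconv('ISO-8859-1', 'UTF-8', $utf8);
echo bin2hex($mojibake), "\n"; // c383c2ac, i.e. U+00C3 (Ã) followed by U+00AC (¬)

// Reversing the step recovers the original character.
$restored = iconv('UTF-8', 'ISO-8859-1', $mojibake);
echo bin2hex($restored), "\n"; // c3ac
```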
It's very likely that the XML is fine, but the character gets mangled when stored or output.
If you're outputting data on an HTML page: make sure the page is encoded in UTF-8 as well. If your HTML page is in ISO-8859-1, you can use utf8_decode as a quick fix; using UTF-8 is the better option in the long run.
If you're storing the data in MySQL, you need to have UTF-8 selected as the encoding all the way through: as the connection's encoding, in the table, and in the column(s) you insert the data into.
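A rough sketch of "UTF-8 all the way through", under stated assumptions: the host, database name, and credentials are placeholders, and utf8mb4 is used because it is MySQL's full UTF-8 character set. Note that utf8_decode is deprecated as of PHP 8.2, so the quick fix is shown with iconv instead.

```php
<?php
// Tell the browser the page is UTF-8 (only possible before any output is sent).
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// Hypothetical connection details; the charset option sets the connection encoding.
$dsn = 'mysql:host=localhost;dbname=example;charset=utf8mb4';
// $pdo = new PDO($dsn, 'user', 'password'); // left commented out: needs a real server

// Quick fix for a page that must stay ISO-8859-1: convert just before output.
$name = "Forset\u{00EC}";
$forLatin1Page = iconv('UTF-8', 'ISO-8859-1', $name);
```

The table and column character sets are set on the MySQL side and are not shown here.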
I've also had some problems with this, and it came from the PHP script encoding. Make sure it's set to UTF-8.
If it's still not good, try printing the variable using utf8_encode or utf8_decode.
XML is strict when it comes to entities: & should be written as &amp;, and ì should be written as a character reference such as &#236;.
So you will need a translation table.
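As a sketch of the escaping step, the following uses PHP's built-in htmlspecialchars with ENT_XML1 rather than a hand-written translation table (a swapped-in technique, not what this answer describes). In a UTF-8 document only the markup characters such as & strictly need escaping, so ì is left as it is. The sample string is made up for illustration.

```php
<?php
$raw = "Forset\u{00EC} & friends";

// Escape XML markup characters; non-ASCII characters pass through untouched.
$escaped = htmlspecialchars($raw, ENT_XML1 | ENT_QUOTES, 'UTF-8');
echo $escaped, "\n"; // Forsetì &amp; friends

// The escaped value can be embedded in an attribute and parsed back.
$xml = simplexml_load_string('<alldata><data name="' . $escaped . '" /></alldata>');
$roundTrip = (string) $xml->data['name'];
echo $roundTrip, "\n"; // Forsetì & friends
```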
Late to the party... but I've faced this and solved it as below.
You have declared encoding in XML so if you load xml file using DOMDocument it won't cause any issue.
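A minimal sketch of the DOMDocument route, assuming the XML from the question is available as a string (the variable names are illustrative):

```php
<?php
$xmlString = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
           . "<alldata><data name=\"Forset\u{00EC}\" /></alldata>";

$doc = new DOMDocument();
$doc->loadXML($xmlString); // honours the encoding declared in the XML prolog

// DOM methods return UTF-8 strings.
$attr = $doc->getElementsByTagName('data')->item(0)->getAttribute('name');
echo bin2hex($attr), "\n"; // 466f72736574c3ac
```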
But in case it happens in another use case, you can use html_entity_decode like below:
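A sketch of what such a call might look like, assuming the value arrives with HTML entities in it. The sample string is invented for illustration.

```php
<?php
// Hypothetical value containing HTML entities instead of raw characters.
$raw = 'Forset&igrave; &amp; co';

// Decode entities into real characters, producing a UTF-8 string.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, "\n"; // Forsetì & co
```

Passing the charset explicitly avoids depending on the default_charset ini setting.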