XML read error caused by a UTF-8 encoding error
I'm trying to create a script to export my comments to Disqus and, in order to do that, I need to make a huge XML file.
I have a problem with UTF-8 encoding. The file is supposed to be in UTF-8, but I have to call utf8_decode for my Spanish characters to display properly.
The generated file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dsq="http://www.disqus.com/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.0/"
>
<channel>
<wp:comment>
<wp:comment_id>26</wp:comment_id>
<wp:comment_author>KA_DIE</wp:comment_author>
<wp:comment_author_email> </wp:comment_author_email>
<wp:comment_author_url></wp:comment_author_url>
<wp:comment_author_IP> </wp:comment_author_IP>
<wp:comment_date_gmt>2009-07-16 18:53:19</wp:comment_date_gmt>
<wp:comment_content><![CDATA[WTF TEH Gladios en español <br />tnx tnx <br />me usta mucho esa web estoy pendiente mucho se su actualziacion es buen saber ke esta en español <br />x que solo entendia el 80, 90% de la paguina jiji]]></wp:comment_content>
<wp:comment_approved>1</wp:comment_approved>
<wp:comment_parent>0</wp:comment_parent>
</wp:comment>
</channel>
</rss>
I removed some data, such as the IP and email, for security reasons. As you can see, the content contains the letter "ñ". But displaying the XML throws an error:
XML read error: bad composed
I don't know the exact translation, but it fails on the content line. The XML is generated with this code:
public function generateXmlElement (){
$xml = "<wp:comment>
<wp:comment_id>$this->id</wp:comment_id>
<wp:comment_author>$this->author</wp:comment_author>
<wp:comment_author_email>$this->author_email</wp:comment_author_email>
<wp:comment_author_url>$this->author_url</wp:comment_author_url>
<wp:comment_author_IP>$this->author_ip</wp:comment_author_IP>
<wp:comment_date_gmt>$this->date</wp:comment_date_gmt>
<wp:comment_content><![CDATA[$this->content]]></wp:comment_content>
<wp:comment_approved>$this->approved</wp:comment_approved>
<wp:comment_parent>0</wp:comment_parent>
</wp:comment>";
return $xml;
}
The result is then written to a file with fwrite.
Do you know what the problem might be?
Comments (2)
The problem is most likely that your XML isn't actually UTF-8 encoded but is something else (ISO-8859-1?). The character 'ñ' (U+00F1) is encoded in UTF-8 as the two octets 0xC3 0xB1. In both the Windows-1252 code page and ISO-8859-1, 'ñ' is the single octet 0xF1.
Does your XML file have a Unicode BOM (U+FEFF) at the beginning of the file? The BOM, if present, indicates the encoding and byte order:
0xEFBBBF: UTF-8. Byte order isn't significant.
0xFFFE: UTF-16, little-endian
0xFEFF: UTF-16, big-endian
0xFFFE0000: UTF-32, little-endian
0x0000FEFF: UTF-32, big-endian
The XML standard says that if no BOM is present and no XML declaration indicating the encoding is present, the document shall be interpreted as UTF-8 by default. I believe it's left fuzzy what happens if there is a discrepancy between the BOM (if present) and the encoding specified in the XML declaration.
It may be that your file has an incorrect XML declaration (e.g., rather than saying UTF-8, the XML declaration should say something like ISO-8859-1).
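To make the byte-level checks in this answer concrete, here is a minimal PHP sketch. The function names detectBom() and ensureUtf8() are hypothetical and not from the original post, and it assumes the mbstring extension is available. It also assumes that any text which is not valid UTF-8 is ISO-8859-1, which may well not hold for your data.

```php
<?php
// Illustrative helpers; detectBom() and ensureUtf8() are hypothetical names.

/** Return the encoding signalled by a leading BOM, or null if there is none. */
function detectBom(string $bytes): ?string
{
    // The 4-byte marks are checked first, because the UTF-32LE BOM
    // begins with the same two bytes as the UTF-16LE BOM.
    $boms = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];
    foreach ($boms as $bom => $name) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $name;
        }
    }
    return null;
}

/**
 * Return $text as UTF-8.
 * Assumption: input that is not already valid UTF-8 is ISO-8859-1.
 */
function ensureUtf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}
```

With helpers like these, each comment field could be passed through ensureUtf8() before it is written, so that the bytes in the file match the UTF-8 declaration. The alternative is to leave the bytes alone and change the XML declaration to the encoding the data really uses.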
You should be using a proper XML library to generate XML. LibXML2 comes bundled with PHP and is accessible from PHP's DOM API. That will handle the encoding issues, among other things. As is usually the case with such things, it's an upfront learning investment the benefit of which will not immediately be clear. But a benefit there is.
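As a rough idea of what this could look like with PHP's DOM API, here is a minimal sketch. The function buildCommentXml() and the keys of the $comment array are hypothetical, only three of the fields from the question are shown, and every string passed in is assumed to be valid UTF-8 already, since the DOM extension expects UTF-8 input.

```php
<?php
// Sketch only: buildCommentXml() and the $comment array keys are hypothetical.
// All string values passed in must already be valid UTF-8.
function buildCommentXml(array $comment): string
{
    $wpNs = 'http://wordpress.org/export/1.0/';

    // The second argument sets the encoding written into the XML declaration.
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;

    $rss = $doc->createElement('rss');
    $rss->setAttribute('version', '2.0');
    // Declare the wp prefix once, on the root element.
    $rss->setAttributeNS('http://www.w3.org/2000/xmlns/', 'xmlns:wp', $wpNs);
    $doc->appendChild($rss);

    $channel = $rss->appendChild($doc->createElement('channel'));
    $node = $channel->appendChild($doc->createElementNS($wpNs, 'wp:comment'));

    // Plain fields go into text nodes, which escape &, < and > on output.
    $fields = [
        'wp:comment_id'     => (string) $comment['id'],
        'wp:comment_author' => $comment['author'],
    ];
    foreach ($fields as $name => $value) {
        $el = $node->appendChild($doc->createElementNS($wpNs, $name));
        $el->appendChild($doc->createTextNode($value));
    }

    // The comment body contains markup, so it goes into a CDATA section.
    $content = $node->appendChild(
        $doc->createElementNS($wpNs, 'wp:comment_content')
    );
    $content->appendChild($doc->createCDATASection($comment['content']));

    return $doc->saveXML();
}
```

The returned string can be written with fwrite as before. The point of the design is that the library, not string interpolation, is responsible for escaping and for keeping the declared encoding and the serialized bytes consistent.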