Handling encoding errors when reading XML with PHP
I'm using XMLReader to parse XML from a 3rd party. The files are supposed to be UTF-8, but I'm getting this error:
parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0x11 0x72 0x20 0x41 in C:\file.php on line 166
Looking at the XML file in Notepad++, the cause is clear: the problematic line contains a DC1 control character (byte 0x11).
The XML file is provided by a third party, and I cannot rely on them to fix this or to make sure it doesn't happen again. Could someone recommend a good way of dealing with this? I'd like to just strip the control character (in this particular case, deleting it from the XML file is fine), but I'm concerned that always doing so could lead to unforeseen problems down the road. Thanks.
Comments (3)
Why can't the 3rd party reliably fix this issue? If they have illegal characters in their XML, I would wager that it's a valid issue.
Having said that, why not just remove the character with str_replace() before you parse the file?
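A minimal sketch of that suggestion, assuming the whole file fits in memory as a string. The helper name removeDc1() and the sample input are made up for illustration. XML 1.0 forbids every control character below 0x20 except tab, line feed and carriage return, so DC1 would be rejected by the parser even in otherwise clean UTF-8.

```php
<?php
// Hypothetical helper (not from the thread): delete the DC1 control
// character (byte 0x11) before the XML reaches the parser.
function removeDc1(string $xml): string
{
    // DC1 is a single byte below 0x80, so it never occurs inside a
    // multi-byte UTF-8 sequence; a byte-wise replace cannot split a character.
    return str_replace("\x11", '', $xml);
}

// Made-up sample input containing a stray DC1 byte.
$raw = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<item>fo\x11r A</item>";
$clean = removeDc1($raw);

// Parse the cleaned string with XMLReader and print each text node.
$reader = new XMLReader();
$reader->XML($clean);
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::TEXT) {
        echo $reader->value, PHP_EOL;
    }
}
$reader->close();
```

This only handles the one character seen so far. If other stray bytes show up later, the parser will still fail loudly, which is arguably what you want when silent data loss is the concern.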
You can use str_replace(), provided that the string is valid UTF-8. Note that str_replace() operates on bytes rather than characters, so you are effectively dealing with byte strings rather than text. And there is the rub: if your third party includes random whitespace and control characters that serve no purpose in XML, you might as well assume they will eventually break UTF-8 too. So you can't use str_replace() with confidence (only in good faith) until you have ascertained that their current dump of the day is not entirely useless.
Maybe you could take a shortcut and stuff it in a libxml DOMDocument object and suppress errors with @, leaving the libxml library to deal with errors.
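One way the DOMDocument shortcut might look, as a sketch rather than a definitive implementation. It swaps the @ operator for libxml_use_internal_errors(), so the errors are collected and can be logged instead of being thrown away. The helper name loadLenient(), its return shape and the sample input are assumptions for illustration. How much libxml manages to salvage from a damaged file depends on the libxml2 version, so inspect the result before trusting it.

```php
<?php
// Hypothetical helper: tries to load XML in libxml's recovery mode.
// Returns [DOMDocument|null, LibXMLError[]].
function loadLenient(string $xml): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->recover = true; // ask libxml to keep going after errors where it can
    $ok = $doc->loadXML($xml);

    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    return [$ok ? $doc : null, $errors];
}

// Made-up sample input with a stray DC1 byte (0x11) in the text.
$raw = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<item>fo\x11r A</item>";
[$doc, $errors] = loadLenient($raw);

foreach ($errors as $error) {
    // Log every problem so a silently repaired feed does not go unnoticed.
    error_log(sprintf('XML line %d: %s', $error->line, trim($error->message)));
}

if ($doc === null) {
    echo "Document could not be recovered", PHP_EOL;
} else {
    echo $doc->saveXML();
}
```

Logging the collected errors addresses the worry about unforeseen problems: the import keeps running, but every repair leaves a trace you can take back to the supplier.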
Why are you and the third party exchanging data in XML? Presumably both parties expect to get some benefits by using XML rather than some random proprietary format. If you allow them to get away with generating bad XML (I prefer to call it non-XML), then neither party is getting these benefits. It's in their interests to mend their ways. Try to convince them of this.