Handling encoding errors when reading XML with PHP

Posted on 2024-12-01 20:36:50

I'm using XMLReader to parse XML from a 3rd party. The files are supposed to be UTF-8, but I'm getting this error:

parser error : Input is not proper UTF-8, indicate encoding !

Bytes: 0x11 0x72 0x20 0x41 in C:\file.php on line 166

Looking at the XML file in Notepad++, it's clear what's causing this: the problematic line contains a DC1 control character (0x11).

The XML file is provided by a 3rd party, and I cannot reliably get them to fix this or ensure it doesn't happen again. Could someone recommend a good way of dealing with this? I'd like to just strip the control character (in this particular case, deleting it from the XML file is fine), but I'm concerned that always doing this could lead to unforeseen problems down the road. Thanks.

Comments (3)

丢了幸福的猪 2024-12-08 20:36:50

Why can't the 3rd party reliably fix this issue? If they have illegal characters in their XML, I would wager that it's a valid issue.

Having said that, why not just remove the character before you parse it using str_replace?
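
The str_replace idea can be generalized to every C0 control character that XML 1.0 forbids, not just DC1. What follows is only a minimal sketch: the function name stripIllegalXmlControlChars and the sample string are made up for illustration, and it assumes the rest of the input is valid UTF-8. Because bytes below 0x80 never occur inside a multi-byte UTF-8 sequence, removing them byte by byte should not damage other characters.

```php
<?php
// Hypothetical helper (not from the original post): removes the C0 control
// characters that XML 1.0 does not allow in a document.
function stripIllegalXmlControlChars(string $xml): string
{
    // Tab (0x09), LF (0x0A) and CR (0x0D) are legal in XML 1.0 and are kept.
    // No /u modifier: the pattern works on raw bytes.
    $clean = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $xml);

    // preg_replace() returns null on failure; fall back to the input then.
    return $clean ?? $xml;
}

// Made-up sample containing the DC1 byte (0x11) from the error message.
$raw   = "<note>ca\x11r A</note>";
$clean = stripIllegalXmlControlChars($raw);

// Feed the cleaned string to XMLReader and print the text nodes.
$reader = new XMLReader();
$reader->XML($clean);
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::TEXT) {
        echo $reader->value, PHP_EOL;
    }
}
```

Whether silently dropping these bytes is acceptable depends on the data; logging each removal would make later problems easier to trace.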

伤感在游骋 2024-12-08 20:36:50

You can use str_replace() provided that the string is valid UTF-8. Note that str_replace() operates on bytes, not characters, so at that point you are treating the data as a raw byte string rather than as text.

And there is the rub: if your 3rd party includes random whitespace and control characters that serve no purpose in XML, you might as well assume they eventually break UTF-8. So you can't use str_replace() with confidence (only in good faith) until you have ascertained that their current dump of the day is not entirely useless.
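
One way to "ascertain" that a dump is at least well-formed UTF-8 before touching it is a validity check up front. The sketch below is an assumption about how that could look, not something from the answer: isValidUtf8 is a hypothetical name, and it relies on PCRE's /u modifier rejecting malformed UTF-8. Note that a DC1 byte on its own is valid UTF-8, so this check catches broken encodings rather than stray control characters.

```php
<?php
// Hypothetical gatekeeper: returns true only if $data is well-formed UTF-8.
function isValidUtf8(string $data): bool
{
    // With the /u modifier PCRE validates the subject as UTF-8 first;
    // preg_match() returns false (not 0) when that validation fails.
    return preg_match('//u', $data) === 1;
}

// Made-up samples for illustration.
$samples = [
    'control char' => "ca\x11r A",  // DC1 is valid UTF-8, just not legal XML
    'broken utf-8' => "caf\xE9 A",  // lone Latin-1 byte, not valid UTF-8
];

foreach ($samples as $label => $bytes) {
    echo $label, ': ', isValidUtf8($bytes) ? 'valid UTF-8' : 'not UTF-8', PHP_EOL;
}
```

If the check fails, rejecting the file is probably safer than trying to repair it by replacing bytes.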

Maybe you could take a shortcut and stuff it in a libxml DOMDocument object and suppress errors with @, leaving the libxml library to deal with errors. Something like:

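$doc = new DOMDocument();
// @ silences the warnings libxml raises for malformed input.
if (@$doc->loadXML($raw_string)) {
    // Document is loaded; time to normalize() it.
} else {
    // loadXML() returned false: libxml could not parse the data.
    throw new Exception("This data is junk");
}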
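
A variation on the same idea, offered as a sketch rather than as what the answer proposes: instead of @, libxml_use_internal_errors() can collect the parser's messages so the exception says why the data was rejected. The function name loadXmlOrFail and the sample strings are hypothetical.

```php
<?php
// Hypothetical wrapper: loads XML or throws with libxml's own error messages.
function loadXmlOrFail(string $rawString): DOMDocument
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc    = new DOMDocument();
    $ok     = $doc->loadXML($rawString);
    $errors = libxml_get_errors();

    // Restore the previous error-handling state for other code.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$ok) {
        $messages = array_map(
            fn (LibXMLError $e) => trim($e->message) . " (line {$e->line})",
            $errors
        );
        throw new RuntimeException('This data is junk: ' . implode('; ', $messages));
    }

    return $doc;
}

// Made-up input containing the DC1 byte from the question.
try {
    loadXmlOrFail("<note>ca\x11r A</note>");
} catch (RuntimeException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```

The caller can then log the message or forward it to the 3rd party along with the rejected file.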

凯凯我们等你回来 2024-12-08 20:36:50

Why are you and the third party exchanging data in XML? Presumably both parties expect to get some benefits by using XML rather than some random proprietary format. If you allow them to get away with generating bad XML (I prefer to call it non-XML), then neither party is getting these benefits. It's in their interests to mend their ways. Try to convince them of this.
