Illegal characters cause an XML parse error

Posted 2024-09-07 13:44:36

I am asking this as a last resort, as I am completely out of ideas.

I have a Windows ASP.NET ASMX web services app that returns a serialized Person object with a name, address, email, etc.

Some attributes in the XML are encoded very strangely, for instance &#x1a; (I don't know where the encoding takes place; I assume it happens during serialization).

Googling those characters suggested that this is "Windows-1252" encoding.

The problem occurs while parsing the XML: I get an "invalid unicode character" parse error at the position of that character.

How can I parse it successfully? What solutions do you suggest?

Comments (1)

薄情伤 2024-09-14 13:44:36

The parser is correct; whatever produced the serialisation is wrong. As with most of the C0 control characters, it is invalid (actually, worse than that: not well-formed) to put a U+001A SUBSTITUTE into an XML 1.0 file (*), even if encoded as a character reference such as &#x1A;.
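The rule above can be checked mechanically. What follows is a minimal sketch in Python, used here only for illustration since the original setup is ASP.NET. `is_xml10_char` and `parses` are hypothetical helper names: the first encodes the `Char` production of the XML 1.0 specification, and the second uses the standard-library (expat-based) parser to show that a character reference to U+001A is rejected while a reference to a permitted control character such as TAB is accepted.

```python
import xml.etree.ElementTree as ET


def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 'Char' production."""
    return (
        cp in (0x09, 0x0A, 0x0D)        # TAB, LF, CR: the only allowed C0 controls
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def parses(xml_text: str) -> bool:
    """Return True if the stdlib parser accepts xml_text as well-formed."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_xml10_char(0x1A))                  # U+001A SUBSTITUTE
print(parses("<name>Bob&#x1A;</name>"))     # reference to a forbidden character
print(parses("<name>Bob&#x9;</name>"))      # reference to TAB, which is allowed
```

The same range check can be ported to whatever language produces the XML, so that bad values are caught before they reach the serialiser.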

No XML parser will read this, nor should it. Whilst you could put in some horrific hack to try to filter out &#x1A; sequences before passing them to the parser, such crude hacks wouldn't work for the general case. The serialiser should be fixed to stop producing them.

Actually, I have no idea how the character (often used to mark end-of-file in ancient horrible operating systems) would get into the dataset used by an ASP.NET app, but it would not seem to play any valid role in a name, address or e-mail. Perhaps you really need to be looking at cleaning your data.
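One way to approach the data cleaning suggested above is to strip code points that XML 1.0 cannot represent before the values reach the serialiser. The sketch below is in Python for illustration only. `strip_invalid_xml_chars` and the sample `record` are hypothetical, and whether silently dropping characters is acceptable (as opposed to rejecting or logging the record) is an assumption that depends on your data.

```python
import re
import xml.etree.ElementTree as ET

# Matches every character outside the XML 1.0 'Char' production.
_INVALID_XML10 = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(value: str) -> str:
    """Remove characters that cannot appear in an XML 1.0 document."""
    return _INVALID_XML10.sub("", value)


# Hypothetical record containing a stray U+001A at the end of the name.
record = {"name": "Bob\x1a", "email": "bob@example.com"}

person = ET.Element("Person")
for field, value in record.items():
    # Clean each value at the source, before it is serialised.
    ET.SubElement(person, field).text = strip_invalid_xml_chars(value)

xml_text = ET.tostring(person, encoding="unicode")
ET.fromstring(xml_text)  # round-trips without raising ParseError
print(xml_text)
```

Cleaning at the producer side keeps the output well-formed for every consumer, which is different from the consumer-side filtering hack discouraged above.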

(*: It would be legal if encoded as a character reference in an XML 1.1 document. If you absolutely must round-trip control characters through XML, you will have to use XML 1.1. Though that may lead to compatibility issues with older XML parsers, and you still can't use the U+0000 NULL character, so you're never going to be completely binary-safe.)
