How do I tell the Java SAX parser to ignore invalid character references?

Posted on 2024-09-04 17:22:59

When trying to parse incorrect XML with a character reference such as &#x1;, Java's SAX Parser dies a horrible death with a fatal error such as

    org.xml.sax.SAXParseException: Character reference "&#x1"
                                   is an invalid XML character.

Is there any way around this? Will I have to clean up the XML file before I hand it off to the SAX Parser? If so, is there an elegant way of going about this?

Comments (3)

淡淡绿茶香 2024-09-11 17:22:59

Use XML 1.1! skaffman is completely right, but you can just stick <?xml version="1.1"?> on the top of your files and you'll be in good shape. If you're dealing with streams, write a wrapper that rewrites or adds that XML declaration.
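
For the stream case, a minimal sketch of such a wrapper might look like the following. It assumes the document has already been decoded into a String and that the JDK's built-in parser is in use, which should accept XML 1.1. The names Xml11Wrapper, withXml11Declaration and parseText are made up for illustration. Keep in mind that XML 1.1 only allows the control characters U+0001 to U+001F when they are written as character references, and U+0000 stays forbidden, so this will not rescue every file.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class Xml11Wrapper {
    /** Returns the document with an XML 1.1 declaration, replacing any existing one. */
    static String withXml11Declaration(String xml) {
        // Drop an existing declaration such as <?xml version="1.0" encoding="UTF-8"?>.
        // The encoding hint is no longer needed because the input is already a String.
        String body = xml.replaceFirst("^\\s*<\\?xml\\s[^?]*\\?>", "");
        return "<?xml version=\"1.1\"?>" + body;
    }

    /** Parses the document with SAX and returns all character data concatenated. */
    static String parseText(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String broken = "<?xml version=\"1.0\"?><a>x&#x1;y</a>";
        String text = parseText(withXml11Declaration(broken));
        // Print code points, since U+0001 is not visible on a console.
        text.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

The handler then receives the control character itself in characters(), so the application code still has to decide what to do with it.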

甜味超标? 2024-09-11 17:22:59

You're going to have to clean up your XML, I'm afraid. Such characters are invalid according to the XML spec, and no amount of persuasion is going to convince the parser otherwise.

Valid XML characters for XML 1.0:

  • U+0009
  • U+000A
  • U+000D
  • U+0020 to U+D7FF
  • U+E000 to U+FFFD
  • U+10000 to U+10FFFF

In order to clean up, you'll have to pass the data through a lower-level processor that treats it as a Unicode character stream and removes the characters that are invalid.
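
As a rough sketch of that cleanup step, the class below checks code points against the ranges listed above and drops both raw invalid characters and numeric character references (like &#x1;) that point at them. Xml10Sanitizer and its methods are hypothetical names, not part of any library. It assumes the whole document fits in a String, that references are terminated with a semicolon, and that it is acceptable to also strip references that appear inside comments or CDATA sections, where they are technically just text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Xml10Sanitizer {
    // Numeric character references: hexadecimal (&#x1;) or decimal (&#1;).
    private static final Pattern CHAR_REF =
        Pattern.compile("&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    /** True if the code point is in one of the XML 1.0 ranges listed above. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Removes invalid character references and invalid raw characters. */
    static String sanitize(String xml) {
        // Pass 1: copy the text, skipping references to invalid code points.
        Matcher m = CHAR_REF.matcher(xml);
        StringBuilder withoutBadRefs = new StringBuilder();
        int last = 0;
        while (m.find()) {
            withoutBadRefs.append(xml, last, m.start());
            if (isValidXml10(referencedCodePoint(m))) {
                withoutBadRefs.append(m.group());
            }
            last = m.end();
        }
        withoutBadRefs.append(xml, last, xml.length());

        // Pass 2: drop raw characters (including unpaired surrogates) that are invalid.
        StringBuilder out = new StringBuilder();
        withoutBadRefs.codePoints()
            .filter(Xml10Sanitizer::isValidXml10)
            .forEach(out::appendCodePoint);
        return out.toString();
    }

    private static int referencedCodePoint(Matcher m) {
        try {
            return m.group(1) != null
                ? Integer.parseInt(m.group(1), 16)
                : Integer.parseInt(m.group(2), 10);
        } catch (NumberFormatException e) {
            return -1; // Number too large to be a code point: treat as invalid.
        }
    }
}
```

The sanitized String can then be handed to the SAX parser through an InputSource wrapping a StringReader. For large files the same check could be moved into a FilterReader, at the cost of handling references that straddle buffer boundaries.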

走过海棠暮 2024-09-11 17:22:59

This is invalid XML so no parser should parse it without error.

But you do encounter such hand-crafted invalid XML in the real world. My solution is to manually insert CDATA markers into the data. For example,

  <data><![CDATA[ garbage with &invalid characters ]]></data>

Of course, you will get the data back as is and you have to deal with the invalid characters yourself.
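
To illustrate the CDATA idea, here is a small sketch that wraps a raw string in a CDATA section and reads it back through SAX. CdataDemo, wrapInCdata and textOf are illustrative names only. Note the limits of the approach: inside CDATA the text &#x1; is just six literal characters, so the parser no longer complains about the reference, but an actual raw U+0001 character is still invalid in XML 1.0 even inside CDATA.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class CdataDemo {
    /** Wraps raw text in a CDATA section. */
    static String wrapInCdata(String raw) {
        // A literal "]]>" would end the section early, so split it across two sections.
        return "<![CDATA[" + raw.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    /** Parses the document with SAX and returns all character data concatenated. */
    static String textOf(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String raw = " garbage with &invalid characters &#x1; ";
        String xml = "<data>" + wrapInCdata(raw) + "</data>";
        // The handler gets the text back unchanged, "&#x1;" included.
        System.out.println(textOf(xml));
    }
}
```

Since the reference comes back as plain text, any decoding or filtering of it is left to the application, as the answer says.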
