How do I tell the Java SAX parser to ignore invalid character references?
When trying to parse malformed XML that contains an invalid character reference, Java's SAX parser dies a horrible death with a fatal error such as:
org.xml.sax.SAXParseException: Character reference ""
is an invalid XML character.
Is there any way around this? Will I have to clean up the XML file before I hand it off to the SAX Parser? If so, is there an elegant way of going about this?
Comments (3)
Use XML 1.1! skaffman is completely right, but you can just stick
<?xml version="1.1"?>
at the top of your files and you'll be in good shape. If you're dealing with streams, write a wrapper that rewrites or adds that XML declaration.
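A minimal sketch of the "rewrite or add" step, for a document already held in memory as a string. The class and method names are hypothetical, and it assumes your parser actually supports XML 1.1 (worth checking before relying on it). Note that XML 1.1 permits character references to U+0001 through U+001F but still forbids U+0000, so this will not rescue every bad reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class Xml11Declaration {
    // An XML declaration announcing version 1.0 (either quote style).
    private static final Pattern V10_DECL =
            Pattern.compile("^<\\?xml\\s+version\\s*=\\s*([\"'])1\\.0\\1");
    // Any XML declaration at all: "<?xml" followed by whitespace.
    private static final Pattern ANY_DECL = Pattern.compile("^<\\?xml\\s");

    static String forceXml11(String xml) {
        Matcher v10 = V10_DECL.matcher(xml);
        if (v10.find()) {
            // Swap only the version; encoding/standalone attributes are kept.
            return v10.replaceFirst("<?xml version=\"1.1\"");
        }
        if (ANY_DECL.matcher(xml).find()) {
            return xml; // declares some other version already; leave untouched
        }
        // No declaration present: prepend one.
        return "<?xml version=\"1.1\"?>" + xml;
    }
}
```

The result can then be handed to the parser as `new InputSource(new StringReader(Xml11Declaration.forceXml11(xml)))`. A leading byte order mark would defeat the `^` anchors, so strip it first if your input may carry one.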
You're going to have to clean up your XML, I'm afraid. Such characters are invalid according to the XML spec, and no amount of persuasion is going to convince the parser otherwise.
Valid XML characters for XML 1.0:
U+0009
U+000A
U+000D
U+0020–U+D7FF
U+E000–U+FFFD
U+10000–U+10FFFF
In order to clean up, you'll have to pass the data through a lower-level processor, which treats it as a Unicode character stream and removes the characters that are invalid.
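A rough sketch of such a cleanup pass, assuming the document fits in a string. The class and method names are made up for illustration. Because the question is about character references rather than raw characters, the sketch handles both: one method drops numeric references that resolve outside the ranges listed above, the other drops raw characters outside them. The regex is deliberately naive and will also touch references that appear literally inside CDATA sections or comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class XmlSanitizer {
    // Decimal (&#1;) or hexadecimal (&#x1;) numeric character reference.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    // The XML 1.0 ranges from the list above.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Removes numeric character references that resolve to an invalid code point.
    static String stripInvalidCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuilder out = new StringBuilder(xml.length());
        int last = 0;
        while (m.find()) {
            out.append(xml, last, m.start());
            if (isValidXml10(codePointOf(m))) {
                out.append(m.group()); // valid reference: keep it verbatim
            }
            last = m.end();
        }
        out.append(xml, last, xml.length());
        return out.toString();
    }

    private static int codePointOf(Matcher m) {
        try {
            return m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
        } catch (NumberFormatException e) {
            return -1; // too large to be a code point, so treated as invalid
        }
    }

    // Removes raw characters (not references) that XML 1.0 does not allow.
    static String stripInvalidChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
                .filter(XmlSanitizer::isValidXml10)
                .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

Silently dropping data may not be what you want; substituting a placeholder such as U+FFFD instead of deleting is an equally reasonable choice.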
This is invalid XML so no parser should parse it without error.
But you do encounter such hand-crafted invalid XML in the real world. My solution is to manually insert CDATA markers into the data.
Of course, you will get the data back as is and you have to deal with the invalid characters yourself.
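To illustrate the CDATA idea (this is a hypothetical sketch, not the original poster's code, and the class, method, and element names are invented): inside a CDATA section the parser does not interpret `&#...;` as a character reference, so the text comes back literally. The sketch wraps a piece of data and parses it with the JDK's SAX parser to show the reference surviving as plain text.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
final class CdataWrap {
    // Wraps data in a CDATA section; a literal "]]>" in the data is split
    // across two sections so it cannot end the section early.
    static String wrapInCdata(String data) {
        return "<![CDATA[" + data.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    // Parses the document with SAX and returns all character data it reports.
    static String parseText(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver text in several chunks, so accumulate.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }
}
```

For example, `CdataWrap.parseText("<msg>" + CdataWrap.wrapInCdata("bad &#x1; ref") + "</msg>")` hands back the string with `&#x1;` still in it, which you then have to handle yourself. This only helps with references: a raw invalid character inside CDATA is still rejected by the parser.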