How to get SAXParser to ignore escape codes
I am writing a Java program to read an XML file, actually an iTunes library, which is in XML plist format.
I have managed to get around most of the obstacles this format throws up, except when it encounters text containing the & character. The XML file represents this ampersand as an escape code, and I can only manage to read the text following that escape code in any particular section of text.
Is there a way to disable detection of escape codes? I am using SAXParser.
There is something fishy about what you are trying to do.
If the file format you are trying to parse contains bare ampersand (&) characters, then it is not well-formed XML. Ampersands are represented as character entities (e.g. &amp;) in well-formed XML.
If it is really supposed to be real XML, then there is a bug in whatever wrote or generated the file.
If it is not supposed to be real XML (i.e. those ampersands are not a mistake), then you probably shouldn't be trying to parse it using an XML parser.
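To make the well-formedness point concrete, here is a minimal sketch that checks whether a string parses at all. The class name WellFormedCheck, the helper isWellFormed, and the sample strings are hypothetical and not taken from the question; only the standard JAXP/SAX classes are assumed.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
class WellFormedCheck {

    /** Returns true if the parser accepts the input, false if it reports a parse error. */
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            // The parser rejects input that is not well-formed, e.g. a bare '&'.
            return false;
        } catch (Exception e) {
            // Configuration or I/O problems are not expected for an in-memory string.
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample values: one escaped ampersand, one bare ampersand.
        System.out.println(isWellFormed("<string>Tom &amp; Jerry</string>"));
        System.out.println(isWellFormed("<string>Tom & Jerry</string>"));
    }
}
```

If the second call reports a failure for a real excerpt of the file, the problem lies with whatever produced the file rather than with the parser.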
Ah, I see. The XML is actually correctly encoded, but you didn't get the SO markup right.
It would appear that your real problem is that your characters(...) callback is being called separately for the text before the escaped ampersand, for the (decoded) &, and finally for the text after it. You simply have to deal with this by joining the text chunks back together.
The javadoc for ContentHandler.characters() notes that SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks.
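Joining the chunks usually means appending to a buffer in characters(...) and only reading the buffer in endElement(...). The following is a minimal sketch of that idea, not the asker's actual code: the class names ChunkJoiningDemo and StringCollector, the helper readStrings, and the sample XML are all hypothetical, and it assumes the values of interest sit in <string> elements, as they do in a plist.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, for illustration only.
class ChunkJoiningDemo {

    /** Collects the complete text of every <string> element. */
    static class StringCollector extends DefaultHandler {
        final List<String> values = new ArrayList<>();
        private final StringBuilder buffer = new StringBuilder();
        private boolean inString = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("string".equals(qName)) {
                inString = true;
                buffer.setLength(0); // start a fresh value
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one element: append, never overwrite.
            if (inString) {
                buffer.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("string".equals(qName)) {
                values.add(buffer.toString()); // the text is complete only here
                inString = false;
            }
        }
    }

    /** Parses the given XML and returns the text of each <string> element. */
    static List<String> readStrings(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            StringCollector handler = new StringCollector();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.values;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample input, not an excerpt from a real iTunes library.
        String xml = "<dict><key>Name</key><string>Simon &#38; Garfunkel</string></dict>";
        System.out.println(readStrings(xml));
    }
}
```

The design choice is that the handler never assumes one characters(...) call per element, so it behaves the same whether the parser delivers the text in one chunk or in three.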
It's probably not the best general solution for escape characters, but I only had to take newlines into account, so it was easy to just check for \n.
You could check for the backslash \ alone to catch all escape characters, or in your case the &, although I think others will come up with more elegant solutions.
Do you have an excerpt for us? Is the file iTunes-generated? If so, it sounds like a bug in iTunes to me, one that forgot to encode the ampersand correctly. I would not be surprised: they clearly didn't get XML in the first place, and their schema of <name>[key]</name><string>[value]</string> must make the XML inventors puke.
You might want to use a different, more robust parser. SAX is great as long as the file is well-formed. I do not, however, know how robust dom4j and jdom are. Just give them a try. For Python, I know that I would recommend ElementTree or BeautifulSoup, which are very robust.
Also have a look at http://code.google.com/p/xmlwise/, which I found mentioned here on Stack Overflow (did you use search?).
Update: (as per the updated question) You need to understand the role of entities in XML and thus in SAX. They are by default separate nodes, just like text nodes. So you will likely need to join them with adjacent text nodes to get the full value. Do you use a DTD in your parser? Using a proper DTD, with entity definitions, can help parsing a lot, as it can contain mappings from entities such as &amp; to the characters they represent (&), and the parser may be able to do the merging for you. (At least the Python XML-pull parser I like to use for large files does so when materializing subtrees.)
I am parsing the below string using SAXParser