CharacterData ignores unescaped characters
I'm using the following method to read a line of text from an XML document fetched over the web:

    public static String getCharacterDataFromElement(Element e) {
        Node child = ((Node) e).getFirstChild();
        if (child instanceof CharacterData) {
            CharacterData cd = (CharacterData) child;
            return cd.getData();
        }
        return "";
    }

It works fine, but if it comes across a character such as an ampersand that is not written as an entity (&amp;, etc.), it completely ignores that character and the rest of the line. What can I do to rectify this?
The only proper solution is to correct the XML, so that the & is written as &amp;, or so that the text is wrapped in <![CDATA[ ... ]]>. It's not actually XML unless you escape ampersands or use CDATA.
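To illustrate the point about escaping, here is a minimal sketch that feeds three variants of the same text to the JDK's built-in DOM parser. The class name `AmpersandEscapingSketch`, the helper `rootText`, and the sample `<title>` element are all made up for this example; they are not from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class AmpersandEscapingSketch {

    // Hypothetical helper: parses the string and returns the root element's
    // text, or null if the parser rejects the input as not well-formed.
    static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (SAXException e) {
            return null; // the input was not well-formed XML
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Bare ampersand: not well-formed, so the parser should reject it.
        System.out.println(rootText("<title>Tom & Jerry</title>"));
        // Escaped ampersand: parsed back into a plain '&'.
        System.out.println(rootText("<title>Tom &amp; Jerry</title>"));
        // CDATA section: the content is taken literally.
        System.out.println(rootText("<title><![CDATA[Tom & Jerry]]></title>"));
    }
}
```

A conforming parser treats a bare ampersand as a fatal error, so if the feed cannot be fixed at the source, any workaround has to happen before the text reaches the XML parser.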
I suspect the talk of the input not being well-formed is a red herring. If the source document contains entity references then an element may contain multiple text node children, and your code is only reading the first of them. It needs to read them all.
(I think there are easier ways of getting the text content of a Node in DOM, but I'm not sure: I never use the DOM if I can avoid it, because it makes everything so difficult. You're much better off with JDOM or XOM.)
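As a rough sketch of "read them all", the method from the question could walk every child instead of only the first. The class name `AllTextChildrenSketch`, the helper `parseRoot`, and the sample `<line>` element are assumptions made for this illustration. The sample uses a CDATA section because, with the JDK's default non-coalescing parser settings, that reliably yields several child nodes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

class AllTextChildrenSketch {

    // Concatenates every direct Text or CDATASection child of the element.
    static String getCharacterDataFromElement(Element e) {
        StringBuilder sb = new StringBuilder();
        for (Node child = e.getFirstChild(); child != null; child = child.getNextSibling()) {
            // CDATASection extends Text; Comment does not, so comments are skipped.
            if (child instanceof Text) {
                sb.append(((Text) child).getData());
            }
        }
        return sb.toString();
    }

    // Hypothetical helper for the demo: parses a string and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // Text, CDATASection, Text: three children under the default settings.
        Element line = parseRoot("<line>Tom <![CDATA[& Jerry]]> forever</line>");
        System.out.println(getCharacterDataFromElement(line));
        // DOM Level 3 alternative that also descends into nested nodes.
        System.out.println(line.getTextContent());
    }
}
```

On the "easier ways" remark: `Node.getTextContent()` is part of DOM Level 3 and has been available in the JDK since Java 5. Unlike the loop above, it also collects text from nested elements and entity reference nodes, which may or may not be what you want.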