Funny characters in XML node text cause problems
I'm setting the characters inside the xml element in the following event:
public void characters(char[] ch, int start, int length) {
    elementText = new String(ch, start, length);
}
Where elementText is a String.
<client-key>#&lt;ABC::DEF::GHI:0x102548f78&gt;</client-key>
I am loading this XML data into Java objects, and my object's property ends up with this value:
'\n '
Now if I change the text in the element <client-key> above, it comes out fine in my object's property.
Is there some encoding issue that I need to handle somehow?
public void endElement(String uri, String localName, String qName) {
    if (qName.equals("client-key")) {
        client.setClientKey(elementText);
    }
}
This is probably what you would get if your xml has been tidied to look like:
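See ContentHandler: a SAX parser may report the contiguous character data of one text node in a single characters() call or split it across several calls.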
You'd be better off using something like:
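The code that originally followed this sentence is not preserved here. As a minimal sketch of the usual approach, the handler below appends every characters() chunk to a StringBuilder and only reads the result in endElement. The class name ClientKeyHandler, the parseClientKey helper, the clientKey field (standing in for client.setClientKey) and the trim() call are assumptions made for illustration, not part of the original answer.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: accumulates all characters() chunks of an element.
class ClientKeyHandler extends DefaultHandler {
    private final StringBuilder elementText = new StringBuilder();
    private String clientKey; // stands in for client.setClientKey(...)

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // Start each element with an empty buffer (sufficient for leaf elements).
        elementText.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append instead of overwrite: the parser may call this several times.
        elementText.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("client-key")) {
            // trim() is an assumption: it drops indentation left by tidied XML.
            clientKey = elementText.toString().trim();
        }
    }

    String getClientKey() {
        return clientKey;
    }

    // Hypothetical helper that parses an XML string and returns the key.
    static String parseClientKey(String xml) throws Exception {
        ClientKeyHandler handler = new ClientKeyHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getClientKey();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<client><client-key>#&lt;ABC::DEF::GHI:0x102548f78&gt;\n  </client-key></client>";
        System.out.println(parseClientKey(xml)); // prints #<ABC::DEF::GHI:0x102548f78>
    }
}
```

Resetting the buffer in startElement is enough when only leaf elements carry text; mixed content would need a buffer per nesting level.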
An XML parser typically uses two stages to process the data in a document. In the first stage, the document (which is a sequence of bytes) is decoded into a sequence of characters which are placed in an input buffer. The actual XML parsing is done in a second stage, where the different constructs such as element start and end tags are analyzed. Note that both stages are executed in parallel. More precisely, the input buffer is refilled on demand as the XML parsing progresses. Also note that if the document is already supplied as a character sequence (e.g. using a StringReader), then the decoding in the first stage is skipped, but the parser will still use an input buffer to store the characters read from the stream.
As noted by others, a SAX parser is not required to report a text node as a single chunk. It may at its own discretion decide to split the node into multiple chunks. This is called non-coalescing parsing.
What you call "funny characters" are actually character entity references (&lt; and &gt; in your case). They need to be decoded (to '<' and '>' in your case) before sending the data to the application. However, this can only be done in the second stage. The reason is that the same character sequence (e.g. '&lt;') may not need decoding if it appears in a different context, in particular in a CDATA section.
The point is that if a text node doesn't contain any entity references, then the parser can pass the character data directly from the input buffer to the application. This increases the probability that the entire text node is reported as a single chunk. However, even in that case, it is possible that the text node doesn't fit entirely into the input buffer, in which case the parser will report it in multiple chunks.
On the other hand, if the text node contains entity references, then the parser can't pass the data directly from the input buffer to the application, because part of the data needs further decoding. To avoid copying the data around multiple times, most parsers will choose to pass the parts that don't need further decoding directly to the application, while the entity references are decoded into a separate buffer first. That is the reason why you get chunks that in the original document are delimited by entity references.
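To observe this splitting directly, a small sketch like the following records each characters() call as a separate list entry. The ChunkRecorder class and the sample documents are made up for illustration. How many chunks appear, and where they are split, depends on the parser implementation, so the sketch prints the chunks rather than assuming a particular split; only the concatenation of the chunks is guaranteed to equal the decoded text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic handler: stores every characters() call separately.
class ChunkRecorder extends DefaultHandler {
    final List<String> chunks = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        chunks.add(new String(ch, start, length));
    }

    // Parses the given XML string and returns the chunks in arrival order.
    static List<String> chunksOf(String xml) throws Exception {
        ChunkRecorder recorder = new ChunkRecorder();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), recorder);
        return recorder.chunks;
    }

    public static void main(String[] args) throws Exception {
        // Text with entity references: often reported in several chunks.
        System.out.println(chunksOf("<k>#&lt;ABC&gt;</k>"));
        // Text without entity references: often reported in one chunk.
        System.out.println(chunksOf("<k>#ABC</k>"));
    }
}
```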
The parser works fine. But as he said, the content of the node arrives in multiple chunks, so you need to append them. The example below shows the output with and without using CDATA.
The output:
The last chunk that you receive for the first client-key tag is the newline character followed by some spaces. Since you don't append the chunks, you keep only that last chunk: the newline with some spaces.
It works fine with ordinary characters because nothing breaks up the content, so you may get it all in one chunk.
Same input:
Output:
So either use CDATA or append.
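The example code and output of this answer are not preserved here. As a rough stand-in, the sketch below parses one client-key written with entity references and one written as a CDATA section, and keeps both the question's overwrite behaviour and the appended text side by side. The class OverwriteVsAppend, its fields and the sample document are assumptions for illustration. The overwritten value depends on how the parser happens to split the text, so only the appended value is reliable.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical comparison of "keep last chunk" versus "append all chunks".
class OverwriteVsAppend extends DefaultHandler {
    private String lastChunk = "";
    private final StringBuilder appended = new StringBuilder();
    final List<String> overwritten = new ArrayList<>(); // the question's approach
    final List<String> accumulated = new ArrayList<>(); // the suggested fix

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        lastChunk = "";
        appended.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        lastChunk = new String(ch, start, length); // loses earlier chunks
        appended.append(ch, start, length);        // keeps every chunk
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("client-key")) {
            overwritten.add(lastChunk);
            accumulated.add(appended.toString().trim());
        }
    }

    // Parses the given XML string and returns the populated handler.
    static OverwriteVsAppend run(String xml) throws Exception {
        OverwriteVsAppend handler = new OverwriteVsAppend();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<keys>"
                + "<client-key>#&lt;ABC&gt;\n  </client-key>"
                + "<client-key><![CDATA[#<ABC>]]></client-key>"
                + "</keys>";
        OverwriteVsAppend result = run(xml);
        System.out.println("overwritten: " + result.overwritten);
        System.out.println("accumulated: " + result.accumulated);
    }
}
```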