XML node text causes problems when it contains funny characters

Posted on 2024-12-20 14:52:21


I'm setting the characters inside the xml element in the following event:

public void characters(char[] ch, int start, int length) {
    elementText = new String(ch, start, length);
}

Where elementText is a String.

<client-key>#&lt;ABC::DEF::GHI:0x102548f78&gt;</client-key>

I am loading this XML data into Java objects, and my object's property ends up with this value:

 '\n        '

Now if I change the text in the <client-key> element above, it comes out fine in my object's property.

Is there some encoding issue that I need to handle somehow?

public void endElement(String uri, String localName, String qName) {
    if (qName.equals("client-key")) {
        client.setClientKey(elementText);
    }
}


So尛奶瓶 2024-12-27 14:52:21


This is probably what you would get if your XML has been tidied to look like:

<client-key>
    #&lt;ABC::DEF::GHI:0x102548f78&gt;
</client-key>

See ContentHandler

characters
...
The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; ...

You'd be better off using something like:

public void characters(char[] ch, int start, int length) {
  // Note the +=
  elementText += new String(ch, start, length);
}

public void endElement(String uri, String localName, String qName) {

  if (qName.equals("client-key")) {
    client.setClientKey(elementText);
  }
  elementText = "";
}
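
As a fuller illustration of the same idea, here is a minimal runnable sketch. It is not the asker's actual code: the class name ClientKeyDemo, the helper parseClientKey and the clientKey field (standing in for client.setClientKey) are made up for the example. It also makes two extra choices that the answer above does not: it resets the buffer in startElement, so whitespace that precedes the element is not carried into it, and it trims the result, on the assumption that surrounding whitespace is not part of the key.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ClientKeyDemo {

    /** Accumulates character data so it does not matter how the parser chunks it. */
    static class ClientKeyHandler extends DefaultHandler {
        private final StringBuilder elementText = new StringBuilder();
        String clientKey; // hypothetical stand-in for client.setClientKey(...)

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            // Drop whatever was collected before this element started.
            elementText.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Append instead of overwrite: this may be called several times per text node.
            elementText.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("client-key")) {
                // Assumption: leading and trailing whitespace is not part of the key.
                clientKey = elementText.toString().trim();
            }
            elementText.setLength(0);
        }
    }

    static String parseClientKey(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        ClientKeyHandler handler = new ClientKeyHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.clientKey;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<client-key>\n    #&lt;ABC::DEF::GHI:0x102548f78&gt;\n</client-key>";
        System.out.println(parseClientKey(xml)); // prints #<ABC::DEF::GHI:0x102548f78>
    }
}
```

A StringBuilder is used here rather than String += only to avoid creating a new String for every chunk; the behaviour is otherwise the same as in the snippet above.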


奢望 2024-12-27 14:52:21


An XML parser typically uses two stages to process the data in a document. In the first stage, the document (which is a sequence of bytes) is decoded into a sequence of characters which are placed in an input buffer. The actual XML parsing is done in a second stage, where the different constructs such as element start and end tags are analyzed. Note that both stages are executed in parallel. More precisely, the input buffer is refilled on demand as the XML parsing progresses. Also note that if the document is already supplied as a character sequence (e.g. using a StringReader), then the decoding in the first stage is skipped, but the parser will still use an input buffer to store the characters read from the stream.

As noted by others, a SAX parser is not required to report a text node as a single chunk. It may at its own discretion decide to split the node into multiple chunks. This is called non-coalescing parsing.

What you call "funny characters" are actually character entity references (&lt; and &gt; in your case). They need to be decoded (to '<' and '>' in your case) before the data is sent to the application. However, this can only be done in the second stage. The reason is that the same character sequence (e.g. '&lt;') may not need decoding if it appears in a different context, in particular in a CDATA section.

The point is that if a text node doesn't contain any entity references, then the parser can pass the character data directly from the input buffer to the application. This increases the probability that the entire text node is reported as a single chunk. However, even in that case, it is possible that the text node doesn't fit entirely into the input buffer, in which case the parser will report it in multiple chunks.

On the other hand, if the text node contains entity references, then the parser can't pass the data directly from the input buffer to the application, because part of the data needs further decoding. To avoid copying the data around multiple times, most parsers will choose to pass the parts that don't need further decoding directly to the application, while the entity references are decoded into a separate buffer first. That is the reason why you get chunks that in the original document are delimited by entity references.
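
To see this chunking on your own parser, a small sketch like the following can help. The class name ChunkDemo and the helper chunksOf are invented for the example. How many chunks you get is up to the parser, so no particular count is promised here; the only thing you can rely on is that the chunks, joined in order, give the decoded text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ChunkDemo {

    /** Returns every chunk passed to characters(), in the order it was reported. */
    static List<String> chunksOf(String xml) throws Exception {
        List<String> chunks = new ArrayList<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                chunks.add(new String(ch, start, length));
            }
        });
        return chunks;
    }

    public static void main(String[] args) throws Exception {
        String[] documents = {
            "<client-key>testing</client-key>",
            "<client-key>#&lt;ABC::DEF::GHI:0x102548f78&gt;</client-key>"
        };
        for (String xml : documents) {
            List<String> chunks = chunksOf(xml);
            // The chunk count is parser-dependent; the joined text is not.
            System.out.println(chunks.size() + " chunk(s): " + chunks);
            System.out.println("joined: " + String.join("", chunks));
        }
    }
}
```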


遥远的她 2024-12-27 14:52:21


It works fine, but as already noted, the content of the node arrives in multiple chunks, so you need to append them. The example below shows the output with and without CDATA.

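import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class XMLTest {

    public static void main(String[] args) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();

            DefaultHandler handler = new DefaultHandler() {

                @Override
                public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
                }

                @Override
                public void endElement(String uri, String localName, String qName) throws SAXException {
                }

                // Print each chunk on its own line to show how the parser splits the text.
                @Override
                public void characters(char[] ch, int start, int length) throws SAXException {
                    System.out.println(new String(ch, start, length));
                }
            };
            // Parses the test.xml file shown below.
            saxParser.parse("test.xml", handler);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}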

<?xml version="1.0"?>
<company>
    <staff>
        <client-key>#&lt;ABC::DEF::GHI:0x102548f78&gt;</client-key>
        <client-key><![CDATA[#<ABC::DEF::GHI:0x102548f78>]]></client-key>    
    </staff>
</company>

The output:

#
<
ABC::DEF::GHI:0x102548f78
>


#<ABC::DEF::GHI:0x102548f78>

For the first client-key tag, the last chunk you receive is the newline character followed by some spaces. Since you don't append the chunks, you keep only that last one: a newline and some spaces.
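
The effect can be reproduced without a parser at all. In the sketch below the chunk list is written by hand to mimic the sequence described above; it is an assumption for illustration, not output captured from a real parser, and the names LastChunkDemo, overwrite and append are made up.

```java
import java.util.Arrays;
import java.util.List;

class LastChunkDemo {

    /** What the question's characters() does: every chunk replaces the previous one. */
    static String overwrite(List<String> chunks) {
        String elementText = "";
        for (String chunk : chunks) {
            elementText = chunk;
        }
        return elementText;
    }

    /** What the answers suggest: every chunk is appended to the ones before it. */
    static String append(List<String> chunks) {
        StringBuilder elementText = new StringBuilder();
        for (String chunk : chunks) {
            elementText.append(chunk);
        }
        return elementText.toString();
    }

    public static void main(String[] args) {
        // Hand-written chunks, ending with a newline plus indentation.
        List<String> chunks = Arrays.asList("#", "<", "ABC::DEF::GHI:0x102548f78", ">", "\n        ");
        // Show newlines as \n so the whitespace-only result is visible.
        System.out.println("overwrite: '" + overwrite(chunks).replace("\n", "\\n") + "'");
        System.out.println("append:    '" + append(chunks).replace("\n", "\\n") + "'");
    }
}
```

With these chunks, overwrite returns only the trailing whitespace, which matches the '\n        ' value from the question, while append returns the full key followed by that whitespace.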

With ordinary characters it happens to work, because nothing interrupts the content and the parser may deliver it in a single chunk.

same input :