Why does XMLEventReader report a CHARACTERS event that contains markup?

Posted on 2024-09-18 03:12:46

I have an XMLEventReader. It has been built from an XMLInputFactory with the "UTF8" encoding. I am using it to read an XML file whose "encoding" attribute is set to "UTF-8".

I have verified that the XML file views correctly under Firefox. When you view the page encoding, it says that it is UTF-8.

I have set the XMLEventReader to coalesce character events by configuring the XMLInputFactory that creates it, like this:

factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
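For reference, in the standard javax.xml.stream API the coalescing switch is a property of XMLInputFactory (XMLEventReader itself only offers getProperty), so it has to be set before the reader is created. Below is a minimal sketch of that setup; the class name CoalescingSketch, the helper readCharacterChunks, and the sample input are made up for illustration and are not from the original post.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class CoalescingSketch {

    // Hypothetical helper: returns the data of every CHARACTERS event, in order.
    static List<String> readCharacterChunks(String xml, boolean coalesce)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The property must be set on the factory before the reader is created.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.valueOf(coalesce));
        XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
        List<String> chunks = new ArrayList<String>();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isCharacters()) {
                    chunks.add(event.asCharacters().getData());
                }
            }
        } finally {
            reader.close();
        }
        return chunks;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<text>a &amp; b</text>";
        // How many chunks a non-coalescing parser produces is implementation-dependent.
        System.out.println("coalescing off: " + readCharacterChunks(xml, false).size() + " chunk(s)");
        System.out.println("coalescing on:  " + readCharacterChunks(xml, true).size() + " chunk(s)");
    }
}
```

Comparing the two chunk counts on your own runtime is a quick way to check whether the property is actually taking effect.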

The XML document does not have a DTD. It is valid.

The XMLEventReader will occasionally report that a CHARACTERS event has been received whose content is (minus the quotation marks), for example:

r problems were most severe and frequent.) Did you sleep a lot more than usual nearly every night during that period?</text>  Ð 

Note the presence of the markup tag near the end of the sample, as well as the stray Ð (a capital eth). Note also that the start of the sentence has been lopped off; presumably there was another CHARACTERS event before this one that contains the leading part of the sentence.

Why does the XMLEventReader screw up the parsing? Why are the characters not displaying correctly? Why does the XMLEventReader not coalesce CHARACTERS events, if that's what's going on? Why is StAX so unbelievably festeringly ugly and unpredictable?

I am using the XMLEventReader supplied to me by my Java runtime (Java 6) on a Mac.

Here is some sample XML, which of course I've simply copied from my editor, so who knows what character conversions occurred as a result of that, but anyhow:

<question id="BMHPD17">
  <permittedResponseCount>1</permittedResponseCount>
  <text>It’s hard for me to stay out of trouble. (Would you say this is true or false for you?)</text>
  <namedAnswerSet idref="TrueFalse"></namedAnswerSet>
</question>

Note the "smart apostrophe" on line 3.

I am reading this by reacting to a CHARACTERS event, saving its contents to a String on the stack, then reacting to an END_ELEMENT event whose name is "question". Upon receiving the END_ELEMENT event for question, I retrieve the value of the String I just mentioned, and construct a Java object that takes the string I just mentioned as input.
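The reading loop described above can be sketched roughly as follows. This is an illustration under assumptions, not the original code: the class name QuestionTextReader and the method readQuestionTexts are hypothetical, the text is accumulated in a StringBuilder so that it stays correct even if the parser delivers it in several CHARACTERS events, and the sketch returns plain strings instead of constructing the poster's domain object.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class QuestionTextReader {

    // Hypothetical helper: collects the <text> content of each <question> element.
    static List<String> readQuestionTexts(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        // Pass the bytes and name the encoding explicitly.
        XMLEventReader reader = factory.createXMLEventReader(in, "UTF-8");

        List<String> texts = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean inText = false;
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    String name = event.asStartElement().getName().getLocalPart();
                    if ("text".equals(name)) {
                        inText = true;
                        current.setLength(0);
                    }
                } else if (event.isCharacters()) {
                    // Append rather than overwrite: more than one event may arrive.
                    if (inText) {
                        current.append(event.asCharacters().getData());
                    }
                } else if (event.isEndElement()) {
                    String name = event.asEndElement().getName().getLocalPart();
                    if ("text".equals(name)) {
                        inText = false;
                    } else if ("question".equals(name)) {
                        texts.add(current.toString());
                        current.setLength(0);
                    }
                }
            }
        } finally {
            reader.close();
        }
        return texts;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<question id=\"BMHPD17\">"
                + "<permittedResponseCount>1</permittedResponseCount>"
                + "<text>It\u2019s hard for me to stay out of trouble.</text>"
                + "<namedAnswerSet idref=\"TrueFalse\"></namedAnswerSet>"
                + "</question>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        for (String text : readQuestionTexts(in)) {
            System.out.println(text);
        }
    }
}
```

Tracking whether the parser is inside the text element keeps character data from sibling elements, such as permittedResponseCount, out of the saved string.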

When I print the result with System.out.println(), I sometimes get the bogus junk I referred to earlier.

When I wrap System.out inside a PrintWriter with "UTF8" encoding set, so that I'm not simply outputting characters according to the platform's encoding, I get the same results.

Comments (2)

浪漫之都 2024-09-25 03:12:46

This turns out to be a bug in the Mac OS X JVM. The character encoding used by the console does not default to UTF-8, even though all other uses of the default character encoding are UTF-8.
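One way to see this kind of mismatch for yourself is sketched below. It prints the JVM's default charset, writes a string through a PrintStream that encodes as UTF-8, and shows what those same UTF-8 bytes look like when decoded with a different charset. The class name ConsoleEncodingSketch and the helper decodeUtf8BytesAs are made up for illustration, and ISO-8859-1 is only a stand-in, since the answer does not say which encoding the console actually used.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConsoleEncodingSketch {

    // Hypothetical helper: encodes text as UTF-8, then decodes the bytes with
    // another charset, imitating a console that assumes the wrong encoding.
    static String decodeUtf8BytesAs(String text, Charset consoleCharset) {
        return new String(text.getBytes(StandardCharsets.UTF_8), consoleCharset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        String text = "It\u2019s hard for me to stay out of trouble.";

        // Encode explicitly as UTF-8 instead of relying on the platform default.
        // This only displays correctly if the terminal also decodes UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(text);

        // What the same bytes turn into under a stand-in single-byte charset.
        System.out.println(decodeUtf8BytesAs(text, StandardCharsets.ISO_8859_1));
    }
}
```

If the explicitly encoded line still shows junk, the string is probably fine and the problem sits between the JVM's output bytes and the terminal, which matches the observation that wrapping System.out with a UTF-8 writer did not change anything.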

叹梦 2024-09-25 03:12:46

Is this even the same as the underlying SAX event, which includes a start offset and length? If so, you will probably find these specify a region of the string that excludes the markup.
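To illustrate the SAX convention being referred to: the characters callback receives a char array together with a start offset and a length, and only that region is character data; the rest of the array may hold anything the parser has buffered. Here is a minimal sketch using the JDK's SAX parser. The class name SaxRegionSketch and the helper extractText are hypothetical, and whether the StAX event in the question is backed by a buffer in the same way is not established here.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxRegionSketch {

    // Hypothetical helper: concatenates all character data in the document.
    static String extractText(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Use only ch[start .. start + length). Building a String from
                // the whole array could pull in unrelated buffer contents.
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractText("<q><text>Did you sleep a lot more than usual?</text></q>"));
    }
}
```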
