How to locate the error after the XML parser throws a MalformedByteSequenceException

Posted on 2024-12-06 11:07:50

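I'm getting a MalformedByteSequenceException when parsing an XML file.

My app allows external customers to submit XML files. They may use any supported encoding, but most specify ...encoding="UTF-8"... at the top of the file, following the examples we gave them. Some of them then encode their data as windows-1252 anyway, which causes a MalformedByteSequenceException on non-ASCII characters.

I want the XML parser to identify the file encoding and decode the file, so I don't want a preliminary step that tests the encoding or converts the InputStream to a Reader. I feel the XML parser should handle that step.

Even though I have registered a ValidationEventHandler, it is not called when a MalformedByteSequenceException occurs.

Is there any way to get the Unmarshaller to report the location in the file where the error occurs?

Here is my Java code: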

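import java.io.InputStream;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.ValidationEventHandler;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

InputStream input = ...
JAXBContext jc = JAXBContext.newInstance(MyClass.class.getPackage().getName());
Unmarshaller unmarshaller = jc.createUnmarshaller();
// Validate against my.xsd while unmarshalling
SchemaFactory sf = SchemaFactory.newInstance(javax.xml.XMLConstants.W3C_XML_SCHEMA_NS_URI);
Source source = new StreamSource(getClass().getResource("my.xsd").toExternalForm());
Schema schema = sf.newSchema(source);
unmarshaller.setSchema(schema);
ValidationEventHandler handler = new MyValidationEventHandler();
unmarshaller.setEventHandler(handler);
MyClass myClass = (MyClass) unmarshaller.unmarshal(input);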

and the resulting stack trace:

javax.xml.bind.UnmarshalException
 - with linked exception:
[com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.]
        at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:202)
        at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:173)
        at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:137)
        at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:184)
        at (my code)
Caused by: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
        at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:684)
        at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:470)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1742)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.scanContent(XMLEntityScanner.java:916)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2788)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
        at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
        at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:200)
        ... 51 more



Comments (1)

终弃我 2024-12-13 11:07:50

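I haven't tested this, but I would:

  • use a SAXSource (javax.xml.transform.sax.SAXSource) instead of a StreamSource
  • attach my own implementation of org.xml.sax.ErrorHandler to the SAXSource's reader (SAXSource.getXMLReader().setErrorHandler)

That way I would be notified of a SAXParseException, which carries the location of the parsing error.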
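
The idea can be sketched roughly as follows. This is only an illustration under some assumptions: the class and method names (EncodingErrorLocator, LocatingHandler, describeFailure) are made up for the example, and it drives the SAX XMLReader directly rather than going through JAXB so that it runs on its own. Whether the bad byte sequence arrives as a SAXParseException or as a plain IOException seems to depend on the parser version, so the sketch also keeps the Locator the parser hands to the ContentHandler and falls back to it, in which case the position is only approximate.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of JAXB or of the original question.
final class EncodingErrorLocator {

    /** Remembers the parser's Locator and rethrows fatal errors. */
    private static final class LocatingHandler extends DefaultHandler {
        Locator locator;

        @Override
        public void setDocumentLocator(Locator locator) {
            this.locator = locator;
        }

        @Override
        public void fatalError(SAXParseException e) throws SAXException {
            throw e; // carries line and column numbers
        }
    }

    /** Returns null if the stream parses cleanly, otherwise a description of where it failed. */
    static String describeFailure(InputStream input)
            throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        LocatingHandler handler = new LocatingHandler();
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler);
        try {
            // With JAXB, one would instead pass new SAXSource(reader, new InputSource(input))
            // to unmarshaller.unmarshal(...).
            reader.parse(new InputSource(input));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (IOException e) {
            // Decoding failures may surface as an IOException with no position attached;
            // the Locator then gives the parser's last known position (approximate).
            Locator loc = handler.locator;
            String where = (loc == null)
                    ? "unknown position"
                    : "near line " + loc.getLineNumber() + ", column " + loc.getColumnNumber();
            return where + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<root>\n<name>caf\u00e9</name>\n</root>";
        // The e-acute is the single byte 0xE9 in ISO-8859-1 and windows-1252, which is not valid UTF-8 here.
        byte[] mislabeled = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(describeFailure(new ByteArrayInputStream(mislabeled)));
    }
}
```

Because the parser decodes input in buffered chunks, the reported position may point somewhat before the offending byte, so it is best treated as a hint for where to start looking.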

