How to detect "An invalid character was found in text content"
I'm validating XML in Java using SAX, and I'd like to detect the following kind of error:
"An invalid character was found in text content".
At the moment the SAX validation runs, but for some documents corrupted characters are not reported as errors. When I open the resulting XML file in IE, for example, I get the error message "An invalid character was found in text content".
This is an example of the XML data:
<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<!DOCTYPE blabla SYSTEM 'blabla.dtd'>
<blabla type='type' num='num'>
<...>... corrupted character </...>
</blabla>
And this is how the parser is instantiated:
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setValidating(true);
factory.setNamespaceAware(true);
parser = factory.newSAXParser();
parser.setProperty(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
parser.setProperty(JAXP_SCHEMA_SOURCE, new File(theConfig.getRoot()
        .concat(File.separator).concat(theConfig.getXsdFileName())
        .concat("-v").concat(theConfig.getXsdFileVersion())
        .concat(XSD_EXTENSION)));
reader = parser.getXMLReader();
reader.setErrorHandler(getHandler());
reader.setEntityResolver(new MyEntityResolver(theConfig.getRoot(),
        theConfig));
InputSource is = new InputSource();
is.setCharacterStream(new StringReader(theDataToParse));
reader.parse(is);
The error handler implements the methods 'warning', 'error' and 'fatalError', but none of them is called.
The entity resolver loads a custom entity file stored in a configuration directory.
Does anyone have an idea why this kind of malformed-character error is not detected? Is it because my stream comes from a String and not from a file?
Thanks in advance for your help.
Regards.
Comments (1)
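Yes. Since you are already holding the data as a String, the byte-to-character conversion has already been done before the parser sees it. When an InputSource carries a character stream, the parser reads those characters directly and does not apply the encoding named in the XML declaration, so it has no chance to notice bytes that were invalid for that encoding. If you want to detect the invalid character, you need to parse the bytes. In general, it's not a good idea to hold XML data as string data, as you risk corrupting it through incorrect character encoding. The best way to treat XML is as binary data.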
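To illustrate the difference, here is a minimal sketch, not the asker's actual setup. The class name `ByteLevelXmlCheck`, its helper methods, and the sample document are all made up for the example, and DTD/XSD validation is left out so that it stays self-contained. It takes a document that declares UTF-8 but was really written as ISO-8859-1, then feeds it to SAX once as raw bytes and once as an already-decoded String.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class ByteLevelXmlCheck {

    // Parses from raw bytes: the parser decodes them itself, using the
    // encoding named in the XML declaration.
    static boolean parsesBytes(byte[] xmlBytes) {
        return parses(new InputSource(new ByteArrayInputStream(xmlBytes)));
    }

    // Parses from a String: the characters are already decoded, so the
    // encoding declaration is not applied by the parser.
    static boolean parsesString(String xml) {
        return parses(new InputSource(new StringReader(xml)));
    }

    // Returns true if the source parses without a fatal error.
    private static boolean parses(InputSource source) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            // DefaultHandler ignores warnings and errors and rethrows fatal errors.
            reader.setErrorHandler(new DefaultHandler());
            try {
                reader.parse(source);
                return true;
            } catch (SAXException | IOException e) {
                // Depending on the parser, a bad byte sequence surfaces either as a
                // SAXParseException or as a CharConversionException (an IOException).
                return false;
            }
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not create a SAX parser", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version='1.0' encoding='UTF-8'?><a>caf\u00e9</a>";
        // Declared as UTF-8 but encoded as ISO-8859-1: 0xE9 is not valid UTF-8 here.
        byte[] mislabeled = xml.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("from bytes:  " + parsesBytes(mislabeled));
        System.out.println("from String: "
                + parsesString(new String(mislabeled, StandardCharsets.ISO_8859_1)));
    }
}
```

With the parser bundled in the JDK, I would expect the byte-based parse to be rejected and the String-based parse to pass, which matches the behaviour described in the question. In the original code this would mean handing the reader an InputSource built from an InputStream over the original file or byte array instead of a StringReader.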