Convert document encoding when reading with dom4j
Is there any way I can convert a document being parsed by dom4j's SAXReader from the ISO-8859-2 encoding to UTF-8? I need that to happen while parsing, so that the objects created by dom4j are already Unicode/UTF-8 and running code such as:
"some text".equals(node.getText());
returns true.
Comments (2)
This is done automatically by dom4j. All String instances in Java are in a common, decoded form; once a String is created, it isn't possible to tell what the original character encoding was (or even whether the string was created from encoded bytes).
Just make sure that the XML document specifies its character encoding (which is required unless it is UTF-8 or UTF-16).
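To illustrate the point about declared encodings, here is a minimal sketch. It uses the JDK's built-in DOM parser in place of dom4j so that it runs without extra dependencies; that substitution is an assumption made for illustration, and the class name DeclaredEncodingDemo, the helper rootText, and the sample word are hypothetical. The parser is handed ISO-8859-2 bytes whose XML declaration names the encoding, and the text it returns is compared with an ordinary Java string literal.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DeclaredEncodingDemo {
    // Parses raw bytes and returns the root element's text. Because the parser
    // receives bytes, it takes the charset from the XML declaration.
    static String rootText(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample: a Polish word whose letters exist in ISO-8859-2.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-2\"?>"
                + "<word>\u017c\u00f3\u0142w</word>";
        byte[] latin2 = xml.getBytes(Charset.forName("ISO-8859-2"));

        // The parsed text is a plain Java String, so it compares equal to a literal.
        System.out.println("\u017c\u00f3\u0142w".equals(rootText(latin2)));
    }
}
```

Once the bytes have been decoded, nothing about ISO-8859-2 remains in the String, which is why the comparison in the question needs no extra conversion step.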
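The decoding happens in (or before) the InputSource (before the SAXReader). According to that class's javadocs, a parser that is given a character stream reads it directly and disregards any encoding declaration in the document; given only a byte stream, it uses the encoding set on the InputSource or, if none is set, autodetects it.
So it depends on how you are creating the InputSource. To guarantee the proper decoding, you can, for example, build the InputSource from a Reader that decodes the bytes with the ISO-8859-2 charset.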
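A minimal sketch of forcing the decoding through the InputSource, under stated assumptions: the JDK's built-in DOM parser stands in for dom4j so the example runs on its own, and the class name ExplicitDecodingDemo, the helpers decodedSource and rootText, and the sample word are all hypothetical. The bytes carry no encoding declaration, so the charset is supplied by wrapping the stream in a Reader.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class ExplicitDecodingDemo {
    // Wraps a byte stream in a Reader so the bytes are decoded with the given
    // charset before the parser sees them.
    static InputSource decodedSource(InputStream bytes, String charsetName) {
        Reader reader = new InputStreamReader(bytes, Charset.forName(charsetName));
        return new InputSource(reader);
    }

    // Parses the source and returns the root element's text.
    static String rootText(InputSource source) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(source)
                .getDocumentElement()
                .getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // No XML declaration here: the bytes alone do not say they are ISO-8859-2.
        byte[] latin2 = "<word>\u017c\u00f3\u0142w</word>"
                .getBytes(Charset.forName("ISO-8859-2"));
        InputSource source = decodedSource(new ByteArrayInputStream(latin2), "ISO-8859-2");

        System.out.println("\u017c\u00f3\u0142w".equals(rootText(source)));
    }
}
```

With dom4j, the same InputSource could be passed to SAXReader's read(InputSource) method instead of the JDK parser. Another option is to keep the byte stream and call setEncoding("ISO-8859-2") on the InputSource, which leaves the decoding to the parser.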