é not parsed correctly

Posted on 2024-07-09 04:05:42

My application reads XML from a URLConnection. The XML encoding is ISO-8859-1, and the content contains the é character. I use the Xerces SAXParser to parse the received XML. However, é is not parsed correctly when the application runs under Linux. Everything works fine on Windows. Could you please give me some hints? Thanks a lot.

Comments (5)

我很OK 2024-07-16 04:05:42

This is probably a case of a file marked as "ISO-8859-1" when in reality it is in another encoding.

This often happens with "ISO-8859-1" and "Windows-1252": they are used as if they were interchangeable, but they are not. (The comments on this answer clarified that both encodings use the same character code for "é", so Windows-1252 is probably not the culprit.)

You can use a hex editor to find the exact character code of the "é" in your file. That value is a hint as to which encoding the file is really in. If you have control over how the file is produced, a look at the responsible code/method is also advisable.
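If no hex editor is at hand, the same check can be done from Java. The following is only a sketch: the class name NonAsciiBytes and the sample data are made up for illustration, and in the real application the byte array would be the raw response body. A single byte 0xE9 points to ISO-8859-1 or Windows-1252, while the pair 0xC3 0xA9 points to UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class NonAsciiBytes {
    /** Returns "offset: 0xNN" entries for every byte outside the ASCII range. */
    static List<String> findNonAscii(byte[] data) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            int b = data[i] & 0xFF; // treat the byte as unsigned
            if (b >= 0x80) {
                hits.add(String.format("%d: 0x%02X", i, b));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for the real document bytes
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(findNonAscii(latin1)); // [3: 0xE9]
        System.out.println(findNonAscii(utf8));   // [3: 0xC3, 4: 0xA9]
    }
}
```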

一桥轻雨一伞开 2024-07-16 04:05:42

I bet this is related to file.encoding. Try running with -Dfile.encoding=iso-8859-1 as a VM parameter on Linux.

If this works, you probably need to specify the correct encoding explicitly when opening the stream (somewhere in your code).
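As a rough illustration of what "specifying the encoding when opening the stream" could look like, here is a sketch in which the charset is passed to the reader instead of being taken from the platform default. The class and method names are hypothetical, and the byte array stands in for whatever stream the application really opens. For XML handed to a parser it is usually better to pass the raw bytes, so this applies mainly to code that decodes text itself.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    /** Decodes the first line of a stream with the given charset, not the platform default. */
    static String readFirstLine(InputStream in, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Shows which default the JVM picked up on this machine
        System.out.println("Default charset: " + Charset.defaultCharset());

        // "café" encoded as ISO-8859-1: é is the single byte 0xE9
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9};
        System.out.println(readFirstLine(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1));
    }
}
```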

回眸一笑 2024-07-16 04:05:42

The first thing you should do is determine the real encoding of the XML file, as Tomalak suggests, rather than trusting the encoding stated in the header.

You can start by opening it with Internet Explorer. If the encoding is not correct, you may see an error like this:

An invalid character was found in text
content. Error processing resource
...

Or the following one:

Switch from current encoding to
specified encoding not supported.
Error processing resource ...

The next step is to use a text editor that supports several encodings. You can use Notepad++, which is free, easy to use and supports several encodings. No matter what the XML header says about the encoding, the editor tries to detect the encoding of the file and displays it on the status bar.

If you determine that the file encoding is correct, then you may not be handling the encoding correctly inside Java. Keep in mind that Java strings are UTF-16 and that, when converting to or from byte arrays without a specified encoding, Java defaults to the system encoding (typically Windows-1252 under Windows, or UTF-8 on modern Linux systems). Some mismatches only cause "strange" characters to appear, such as conversions between fixed 8-bit encodings (e.g. Windows-1252 <-> ISO-8859-1). Others raise encoding exceptions because of invalid characters (try importing Windows-1252 text as UTF-8, for example).
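The two kinds of failure can be reproduced in a few lines. This is a sketch with made-up class and method names. Note that the String constructor silently substitutes U+FFFD for malformed input, so the exception case is shown with a strict CharsetDecoder, which reports errors by default.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class MismatchDemo {
    /** Encodes text as UTF-8, then (wrongly) decodes the bytes as ISO-8859-1. */
    static String utf8ReadAsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Encodes text as ISO-8859-1, then (wrongly) decodes the bytes as UTF-8. */
    static String latin1ReadAsUtf8(String text) {
        return new String(text.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    /** Strict check: a freshly created decoder reports malformed input by throwing. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // One character becomes two "strange" characters (U+00C3 U+00A9)
        System.out.println(utf8ReadAsLatin1("\u00e9").length()); // 2
        // A lone 0xE9 byte is not valid UTF-8
        System.out.println(isValidUtf8(new byte[] {(byte) 0xE9})); // false
        // The lenient String constructor yields the replacement character instead
        System.out.println(latin1ReadAsUtf8("\u00e9").equals("\uFFFD")); // true
    }
}
```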

An example of incorrect code is the following:

// Parse the input
SAXParser saxParser = factory.newSAXParser();
// Problem: getBytes() with no argument uses the platform default charset
InputStream is = new ByteArrayInputStream(stringToParse.getBytes());
saxParser.parse(is, handler);

By default, stringToParse.getBytes() encodes the string in the platform charset: typically Windows-1252 on Windows and UTF-8 on Linux. If the XML declares ISO-8859-1, the bytes produced at this step may no longer match the declaration, and you get wrong characters. The correct approach is to read the XML as bytes, not as a String, and let SAX handle the XML encoding.
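A minimal sketch of that approach, assuming the XML carries a correct encoding declaration. The class name ParseBytes and the in-memory sample are illustrative only. In the original scenario the stream would presumably come from urlConnection.getInputStream() and be passed to the parser untouched.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class ParseBytes {
    /** Parses raw bytes; the parser reads the encoding from the XML declaration. */
    static String parseText(InputStream in) throws Exception {
        StringBuilder text = new StringBuilder();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // collect all character data
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical response body, encoded the way its declaration says
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";
        byte[] body = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseText(new ByteArrayInputStream(body)));
    }
}
```

Because no String conversion happens before parsing, the result does not depend on the platform default charset.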

望她远 2024-07-16 04:05:42

If the XML declaration doesn't specify an encoding, the SAX parser will try to use the default encoding, UTF-8.

If you know the character encoding but it isn't specified in the XML declaration, you can tell the parser to use that encoding via an InputSource:

InputSource inputSource = new InputSource(xmlInputStream);
inputSource.setEncoding("ISO-8859-1");
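Put into a complete program, the snippet above might look like the following sketch. The class name ForcedEncoding and the sample document are assumptions for illustration. The sample deliberately has no XML declaration, so the parser has only the InputSource hint to go on.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ForcedEncoding {
    /** Parses a byte stream, telling the parser which encoding to assume. */
    static String parseAs(InputStream in, String encoding) throws Exception {
        InputSource source = new InputSource(in);
        source.setEncoding(encoding);
        StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // collect all character data
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // No XML declaration here: without the hint, UTF-8 would be assumed
        byte[] body = "<name>caf\u00e9</name>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseAs(new ByteArrayInputStream(body), "ISO-8859-1"));
    }
}
```
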

浅黛梨妆こ 2024-07-16 04:05:42

Sorry for the late reply. We solved the problem. We were performing an incorrect operation on the input stream (just as Fernando Miguélez said, the conversion caused the problem).

Thanks to all of you for your help.