Java application: unable to read an ISO-8859-1 encoded file correctly

Posted on 2024-07-12 04:59:38

I have a file which is encoded as ISO-8859-1 and contains characters such as ô.

I am reading this file with Java code, something like:

File in = new File("myfile.csv");
InputStream fr = new FileInputStream(in);
byte[] buffer = new byte[4096];
while (true) {
    int byteCount = fr.read(buffer, 0, buffer.length);
    if (byteCount <= 0) {
        break;
    }

    String s = new String(buffer, 0, byteCount,"ISO-8859-1");
    System.out.println(s);
}

However the ô character is always garbled, usually printing as a ? .

I have read around the subject (and learnt a little on the way), but still cannot get this working.

Interestingly, this works on my local PC (Windows XP) but not on my Linux box.

I have checked that my JDK supports the required charsets (they are standard, so this is no surprise) using:

System.out.println(java.nio.charset.Charset.availableCharsets());

缘字诀 2024-07-19 04:59:38

I suspect that either your file isn't actually encoded as ISO-8859-1, or System.out doesn't know how to print the character.

To check for the first, examine the relevant byte in the file. To check for the second, examine the relevant character in the string by printing it out with:

 System.out.println((int) s.charAt(index));

In both cases the result should be 244 decimal; 0xf4 hex.
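
Both checks can be sketched in a few lines. This is only an illustration, not the answerer's code: the class name EncodingCheck, its two helpers, and the stand-in byte array (used here in place of bytes actually read from the file) are all assumptions.

```java
import java.nio.charset.StandardCharsets;

final class EncodingCheck {
    // Hex-dump raw bytes so a non-ASCII byte such as 0xF4 is easy to spot.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Decimal value of each char in an already decoded string.
    static String charCodes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append((int) s.charAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for bytes read from the file: "c", 0xF4, "t", "e".
        byte[] raw = { 0x63, (byte) 0xF4, 0x74, 0x65 };
        String s = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(toHex(raw));    // 63 f4 74 65
        System.out.println(charCodes(s));  // 99 244 116 101
    }
}
```

If the file shows f4 and the string shows 244, reading and decoding are fine and the problem lies in how the output is printed.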

See my article on Unicode debugging for general advice (the code presented is in C#, but it's easy to convert to Java, and the principles are the same).

In general, by the way, I'd wrap the stream with an InputStreamReader with the right encoding - it's easier than creating new strings "by hand". I realise this may just be demo code though.

EDIT: Here's a really easy way to prove whether or not the console will work:

 System.out.println("Here's the character: \u00f4");

幸福％小乖 2024-07-19 04:59:38

Parsing the file as fixed-size blocks of bytes is fragile: what if the byte representation of some character straddles two blocks? Use an InputStreamReader with the appropriate character encoding instead:

 BufferedReader br = new BufferedReader(
         new InputStreamReader(
                 new FileInputStream("myfile.csv"), "ISO-8859-1"));

 char[] buffer = new char[4096]; // character (not byte) buffer 

 while (true)
 {
      int charCount = br.read(buffer, 0, buffer.length);

      if (charCount == -1) break; // reached end-of-stream 

      String s = String.valueOf(buffer, 0, charCount);
      // alternatively, we can append to a StringBuilder

      System.out.println(s);
 }
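
The straddling concern can be reproduced with a small experiment. The sketch below is an assumption-laden illustration (ChunkBoundaryDemo and decodeInChunks are hypothetical names): it decodes each chunk independently, as the loop in the question does. With a multi-byte encoding such as UTF-8 a split character is damaged, whereas ISO-8859-1 uses exactly one byte per character, so a chunk boundary cannot split one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ChunkBoundaryDemo {
    // Decode every fixed-size chunk on its own, without carrying state across chunks.
    static String decodeInChunks(byte[] data, int chunkSize, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            sb.append(new String(data, off, len, charset));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "a\u00f4";

        // UTF-8: bytes are 61 c3 b4, so a chunk size of 2 splits the two-byte character.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String broken = decodeInChunks(utf8, 2, StandardCharsets.UTF_8);
        System.out.println("UTF-8 intact: " + text.equals(broken));

        // ISO-8859-1: bytes are 61 f4, one per character, so any chunk size is safe.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        String intact = decodeInChunks(latin1, 1, StandardCharsets.ISO_8859_1);
        System.out.println("ISO-8859-1 intact: " + text.equals(intact));
    }
}
```

A Reader avoids the issue for every encoding because the decoder keeps its state between reads.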

Btw, remember to check that the unicode character can indeed be displayed correctly. You could also redirect the program output to a file and then compare it with the original file.

As Jon Skeet suggests, the problem may also be console-related. Try System.console().printf("%s", s) to see if there is a difference. Note that System.console() may return null when no interactive console is attached.
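
Another way to test the console theory is to take the JVM default charset out of the picture and print through a PrintStream with an explicit encoding. The following is a minimal sketch under stated assumptions: the class and helper names are hypothetical, and "UTF-8" is only a guess at what the terminal expects; substitute the encoding your terminal is actually set to.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class ExplicitConsoleEncoding {
    // Wrap a stream so that text is encoded with the named charset, not the JVM default.
    static PrintStream withEncoding(OutputStream target, String charsetName) {
        try {
            return new PrintStream(target, true, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // Assumption: the terminal is configured for UTF-8.
        PrintStream out = withEncoding(System.out, "UTF-8");
        out.println("Here's the character: \u00f4");
    }
}
```

If the character appears correctly through this stream but not through plain System.out, the JVM's default encoding and the terminal's encoding disagree.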

梦里泪两行 2024-07-19 04:59:38

@Joel - your own answer confirms that the problem is a difference between the default encoding on your operating system (UTF-8, the one Java has picked up) and the encoding your terminal is using (ISO-8859-1).

Consider this code:

public static void main(String[] args) throws IOException {
    byte[] data = { (byte) 0xF4 };
    String decoded = new String(data, "ISO-8859-1");
    if (!"\u00f4".equals(decoded)) {
        throw new IllegalStateException();
    }

    // write default charset
    System.out.println(Charset.defaultCharset());

    // dump bytes to stdout
    System.out.write(data);

    // will encode to default charset when converting to bytes
    System.out.println(decoded);
}

By default, my Ubuntu (8.04) terminal uses the UTF-8 encoding. With this encoding, this is printed:

UTF-8
?ô

If I switch the terminal's encoding to ISO 8859-1, this is printed:

UTF-8
ôÃ´

In both cases, the same bytes are being emitted by the Java program:

5554 462d 380a f4c3 b40a

The only difference is in how the terminal interprets the bytes it receives. In ISO 8859-1, ô is encoded as the single byte 0xF4. In UTF-8, ô is encoded as the two bytes 0xC3 0xB4. The other characters are encoded identically in both encodings.
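
The two byte sequences, and the "Ã´" seen when UTF-8 bytes are read as ISO 8859-1, can be confirmed directly. This is a small sketch with hypothetical names (EncodedBytes, hex); it uses only standard JDK calls.

```java
import java.nio.charset.StandardCharsets;

final class EncodedBytes {
    // Render bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00f4"; // ô

        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // f4
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // c3 b4

        // Reading the UTF-8 bytes as ISO 8859-1 yields two characters, U+00C3 and U+00B4.
        String misread = new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(misread.length()); // 2
    }
}
```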

箜明 2024-07-19 04:59:38

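If you can, try running your program in a debugger to see what is inside the 's' string after it is created. Its content may well be correct, with the output becoming garbled only in the call to System.out.println(s). In that case there is probably a mismatch between what Java thinks the output encoding is and the character encoding of your terminal/console on Linux.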
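
To see what Java thinks the encoding is, the JVM can simply be asked. The sketch below is an assumption, not part of the answer: OutputEncodingReport is a hypothetical name, and LANG and LC_ALL are the locale variables a JVM typically consults on Linux. The terminal emulator's own encoding setting is configured separately and has to be checked in the emulator itself.

```java
import java.nio.charset.Charset;

final class OutputEncodingReport {
    // Name of the charset the JVM uses when none is specified explicitly.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("JVM default charset: " + defaultCharsetName());
        System.out.println("file.encoding property: " + System.getProperty("file.encoding"));
        // Either variable may be unset, in which case null is printed.
        System.out.println("LANG=" + System.getenv("LANG"));
        System.out.println("LC_ALL=" + System.getenv("LC_ALL"));
    }
}
```

Running this on both the XP machine and the Linux box should show whether the two JVMs start with different defaults.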

雨落星ぅ辰 2024-07-19 04:59:38

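Basically, if it works on your local XP PC but not on Linux, and you are parsing the exact same file (i.e. you transferred it between the boxes in binary mode), then it probably has something to do with the System.out.println call. I don't know how you verify the output, but if you do it by connecting with a remote shell from the XP machine, you will need to consider the character set of the shell (and of the client).

Additionally, what Zach Scrivena suggests is also true: you cannot assume that you can create strings from chunks of data in that way. Either use an InputStreamReader or read the complete data into an array first (which obviously will not work for a large file). However, since it does seem to work on XP, I would venture that this is probably not your problem in this specific instance.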
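
The "read the complete data into an array first" option might look like the following. It is a minimal sketch with hypothetical names (WholeFileDecode, readAll); the file name is the one from the question, and, as noted above, this approach only suits files that fit comfortably in memory.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class WholeFileDecode {
    // Read every byte first, then decode once, so no character can be split across chunks.
    static String readAll(Path file, Charset charset) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        return new String(bytes, charset);
    }

    public static void main(String[] args) throws IOException {
        String text = readAll(Paths.get("myfile.csv"), StandardCharsets.ISO_8859_1);
        System.out.println(text.length() + " chars decoded");
    }
}
```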
