I've been reading up on Unicode and UTF-8 encoding for a while and I think I understand it, so hopefully this won't be a stupid question:
I have a file which contains some CJK characters, and which has been saved as UTF-8. I have various Asian language packs installed and the characters are rendered properly by other applications, so I know that much works.
In my Java app, I read the file as follows:
    // Create objects
    FileInputStream fis = new FileInputStream(new File("xyz.sgf"));
    InputStreamReader is = new InputStreamReader(fis, Charset.forName("UTF-8"));
    BufferedReader br = new BufferedReader(is);

    // Read and display file contents
    StringBuffer sb = new StringBuffer();
    String line;
    while ((line = br.readLine()) != null) {
        sb.append(line);
    }
    System.out.println(sb);
The output shows the CJK characters as '???'. A call to is.getEncoding() confirms that it is definitely using UTF-8. What step am I missing to make the characters appear properly? If it makes a difference, I'm looking at the output using the Eclipse console.
Comments (4)
The problem is the line System.out.println(sb). This encodes character data using the default system encoding and emits the data to STDOUT. On many systems, this is a lossy process.
If you change the defaults, the encoding used by System.out and the encoding used by the console must match. The only supported mechanism to change the default system encoding is via the operating system. (Some will advise using the file.encoding system property, but this is not supported and may have unintended side effects.) You can use System.setOut to install your own custom PrintStream. You can change the Eclipse console encoding via the Run configuration.
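The setOut approach mentioned above could look roughly like the sketch below. It assumes Java 10 or later (for the PrintStream constructor that takes a Charset) and a console that is itself configured for UTF-8; the class name Utf8Console, the helper utf8Stream, and the sample text are made up for illustration.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class Utf8Console {

    // Hypothetical helper: wraps any stream in an auto-flushing
    // PrintStream that encodes characters as UTF-8.
    static PrintStream utf8Stream(OutputStream out) {
        return new PrintStream(out, true, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Replace the default System.out, which encodes with the
        // platform default encoding.
        System.setOut(utf8Stream(System.out));

        // Hiragana sample written as Unicode escapes, so the result
        // does not depend on the encoding of this source file.
        System.out.println("\u3053\u3093\u306B\u3061\u306F");
    }
}
```

This only fixes the Java side. If the console decodes the bytes with a different encoding, the output will still be wrong, just in a different way.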
You can find a number of posts about the subject on my blog - via my profile.
The following program prints CJK characters to the console using TextPad. To see the Korean Hangul and Japanese Hiragana I had to tell Java to change the print stream's encoding to EUC_KR and set the properties of TextPad's tool output window:
Tool Output is:
가다 こんにちは
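The program itself is not shown here. A rough sketch of what such a program might look like follows; it is not the original code. It assumes Java 10 or later, a JDK that ships the EUC-KR charset, and an output window set to the same encoding. The class name EucKrConsole and the constant SAMPLE are hypothetical.

```java
import java.io.PrintStream;
import java.nio.charset.Charset;

public class EucKrConsole {

    // The Hangul and Hiragana text from the tool output above,
    // written as Unicode escapes to avoid source-encoding issues.
    static final String SAMPLE = "\uAC00\uB2E4 \u3053\u3093\u306B\u3061\u306F";

    public static void main(String[] args) {
        // Encode console output as EUC-KR instead of the platform default.
        PrintStream out = new PrintStream(System.out, true, Charset.forName("EUC-KR"));
        out.println(SAMPLE);
    }
}
```

EUC-KR covers Hangul and the Japanese kana, which is why one encoding can show both here, but it is not a general solution for arbitrary CJK text.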
Yeah, you need to change the encoding of the Eclipse console, as explained in this article: how-to-display-chinese-character-in-eclipse-console.
Depending on your platform, it is highly likely that your console (or Windows CMD) does not support or use the UTF-8 character set, and therefore converts all unmappable characters to a question mark.
On Windows, for example, CMD almost always uses Windows-1252 or a similar single-byte character set.
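The question-mark substitution can be reproduced without any console at all. The sketch below (class and method names are made up for illustration) encodes two Chinese characters with windows-1252 and decodes them again, roughly what happens when a single-byte console receives text it cannot represent.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LossyEncodingDemo {

    // Encodes the text with the given charset and decodes it again,
    // mimicking what survives a trip through a console using that charset.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String cjk = "\u4F60\u597D"; // two Chinese characters

        // windows-1252 has no mapping for these characters, so each one
        // is replaced by the charset's replacement byte, '?'.
        System.out.println(roundTrip(cjk, Charset.forName("windows-1252")).equals("??")); // true

        // UTF-8 can represent them, so the text survives unchanged.
        System.out.println(roundTrip(cjk, StandardCharsets.UTF_8).equals(cjk)); // true
    }
}
```

The results are printed as booleans on purpose, so the demonstration does not itself depend on the console's encoding.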