How to find the default charset/encoding in Java?
The obvious answer is to use Charset.defaultCharset(), but we recently found out that this might not be the right answer. I was told that on several occasions its result differs from the real default charset used by the java.io classes. It looks like Java keeps two separate default charsets. Does anyone have any insight into this issue?
We were able to reproduce one failure case. It is a kind of user error, but it may still expose the root cause of the other problems. Here is the code:
    import java.io.ByteArrayOutputStream;
    import java.io.OutputStreamWriter;
    import java.nio.charset.Charset;

    public class CharSetTest {
        public static void main(String[] args) {
            System.out.println("Default Charset=" + Charset.defaultCharset());
            // Overwrite the property at runtime with a name Java does not recognize.
            System.setProperty("file.encoding", "Latin-1");
            System.out.println("file.encoding=" + System.getProperty("file.encoding"));
            System.out.println("Default Charset=" + Charset.defaultCharset());
            System.out.println("Default Charset in Use=" + getDefaultCharSet());
        }

        // Reports the encoding a java.io writer picks when none is specified.
        private static String getDefaultCharSet() {
            OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
            String enc = writer.getEncoding();
            return enc;
        }
    }
Our server requires a default charset of Latin-1 to deal with some mixed encoding (ANSI/Latin-1/UTF-8) in a legacy protocol, so all our servers run with this JVM parameter:
    -Dfile.encoding=ISO-8859-1
Here is the result on Java 5:
    Default Charset=ISO-8859-1
    file.encoding=Latin-1
    Default Charset=UTF-8
    Default Charset in Use=ISO8859_1
Someone tried to change the encoding at runtime by setting file.encoding in the code. We all know that doesn't work. However, it apparently throws off defaultCharset(), while the real default charset used by OutputStreamWriter is unaffected.
Is this a bug or a feature?
EDIT: The accepted answer shows the root cause of the issue. Basically, you can't trust defaultCharset() in Java 5, because it is not the default encoding used by the I/O classes. It looks like Java 6 corrects this issue.
Comments (6)
This is really strange... Once set, the default Charset is cached, and it isn't changed while the class is in memory. Setting the "file.encoding" property with System.setProperty("file.encoding", "Latin-1"); does nothing. Every time Charset.defaultCharset() is called, it returns the cached charset.
Here are my results:
I'm using JVM 1.6 though.
(update)
OK. I did reproduce your bug with JVM 1.5.
Looking at the source code of 1.5, the cached default charset isn't being set. I don't know whether this is a bug, but 1.6 changes this implementation and uses the cached charset:
JVM 1.5:
JVM 1.6:
When you set the file encoding to file.encoding=Latin-1, the next time you call Charset.defaultCharset() the cached default charset isn't set, so it tries to find the appropriate charset for the name Latin-1. This name isn't found, because it's incorrect, and the method returns the default UTF-8.
As for why IO classes such as OutputStreamWriter return an unexpected result, the implementation of sun.nio.cs.StreamEncoder (which is used by these IO classes) also differs between JVM 1.5 and JVM 1.6. The JVM 1.6 implementation relies on the Charset.defaultCharset() method to get the default encoding if one is not provided to the IO classes. The JVM 1.5 implementation uses a different method, Converters.getDefaultEncodingName(), to get the default charset. This method uses its own cache of the default charset, which is set upon JVM initialization:
JVM 1.6:
JVM 1.5:
But I agree with the comments. You shouldn't rely on this property. It's an implementation detail.
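The caching described above can be checked with a small experiment. This is only a sketch, assuming a Java 6 or later JVM; the class and method names (DefaultCharsetCacheDemo, defaultSurvivesPropertyChange) are made up for illustration and do not come from the JDK.

```java
import java.nio.charset.Charset;

class DefaultCharsetCacheDemo {

    // Returns true if Charset.defaultCharset() is unchanged after
    // "file.encoding" is overwritten at runtime.
    static boolean defaultSurvivesPropertyChange() {
        Charset before = Charset.defaultCharset();
        String saved = System.getProperty("file.encoding");
        try {
            // Deliberately use the unrecognized name from the question.
            System.setProperty("file.encoding", "Latin-1");
            return before.equals(Charset.defaultCharset());
        } finally {
            // Restore the property so the rest of the program is unaffected.
            if (saved == null) {
                System.clearProperty("file.encoding");
            } else {
                System.setProperty("file.encoding", saved);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("Unchanged after setProperty: "
                + defaultSurvivesPropertyChange());
    }
}
```

On a JVM that caches the default charset, the second line should report true; per the explanation above, Java 5 would be expected to report false.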
Looks like undefined behaviour. I know that, in practice, you can change the default encoding using a command-line property, but I don't think what happens when you do this is defined.
Bug ID: 4153515 on problems setting this property:
I cringe when I see people setting the encoding on the command line - you don't know what code that is going to affect.
If you do not want to use the default encoding, set the encoding you do want explicitly via the appropriate method/constructor.
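As a rough illustration of passing the encoding explicitly, the sketch below hands a Charset to the OutputStreamWriter constructor instead of relying on the platform default. It assumes Java 8 or later (for StandardCharsets and UncheckedIOException); ExplicitCharsetDemo and encode are hypothetical names chosen for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {

    // Encodes text with the given charset; the platform default is never consulted.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café", escaped to avoid source-encoding issues
        byte[] latin1 = encode(text, StandardCharsets.ISO_8859_1);
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        System.out.println("ISO-8859-1 bytes: " + latin1.length); // 4
        System.out.println("UTF-8 bytes: " + utf8.length);        // 5
        // Decoding with the same explicit charset round-trips the text.
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1).equals(text)); // true
    }
}
```

The same idea applies to InputStreamReader, String.getBytes and the String constructor, all of which have overloads that accept a charset.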
The behaviour is not really that strange. Looking into the implementation of the classes, it is caused by the following:
Charset.defaultCharset() does not cache the determined character set in Java 5.
Calling Charset.defaultCharset() again therefore causes a second evaluation of the system property; no character set with the name "Latin-1" is found, so Charset.defaultCharset() defaults to "UTF-8".
OutputStreamWriter, however, caches the default character set and is probably already used during VM initialization, so its default character set diverges from Charset.defaultCharset() if the system property "file.encoding" has been changed at runtime.
As already pointed out, it is not documented how the VM must behave in such a situation. The Charset.defaultCharset() API documentation is not very precise about how the default character set is determined; it only mentions that this is usually done at VM startup, based on factors like the OS default character set or default locale.
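The lookup-then-fallback behaviour described in this answer can be imitated with public APIs. The following is a sketch of the described logic, not the actual JDK source; resolveOrUtf8 is a hypothetical helper written for this example.

```java
import java.nio.charset.Charset;

class CharsetLookupDemo {

    // Hypothetical helper: resolve a charset name, falling back to UTF-8
    // when the name is null, illegal, or not supported by this JVM.
    static Charset resolveOrUtf8(String name) {
        try {
            if (name != null && Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException for syntactically illegal names.
        }
        return Charset.forName("UTF-8");
    }

    public static void main(String[] args) {
        System.out.println(resolveOrUtf8("ISO-8859-1")); // ISO-8859-1
        System.out.println(resolveOrUtf8("latin1"));     // ISO-8859-1 (registered alias)
        System.out.println(resolveOrUtf8("Latin-1"));    // UTF-8 (not a registered name)
    }
}
```

The point of the sketch is that "Latin-1" with a hyphen is not a name the standard charset registry knows, even though "latin1" is, which matches the UTF-8 result in the question.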
First, Latin-1 is the same as ISO-8859-1, so, the default was already OK for you. Right?
You successfully set the encoding to ISO-8859-1 with your command line parameter. You also set it programmatically to "Latin-1", but, that's not a recognized value of a file encoding for Java. See http://java.sun.com/javase/6/docs/technotes/guides/intl/encoding.doc.html
When you do that, it looks like Charset resets to UTF-8, judging from the source. That at least explains most of the behavior.
I don't know why OutputStreamWriter shows ISO8859_1. It delegates to closed-source sun.misc.* classes. I'm guessing it isn't quite dealing with encoding via the same mechanism, which is weird.
But of course you should always be specifying what encoding you mean in this code. I'd never rely on the platform default.
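On the ISO8859_1 spelling specifically: OutputStreamWriter.getEncoding() is documented to return the historical name of an encoding when one exists, so the different spelling does not by itself indicate a different charset. A small sketch to check this, with EncodingNameDemo and writerEncodingName being illustrative names only:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

class EncodingNameDemo {

    // Returns the name reported by a writer created with an explicit charset.
    // The writer is left open on purpose: getEncoding() returns null once closed.
    static String writerEncodingName(Charset charset) {
        OutputStreamWriter writer =
                new OutputStreamWriter(new ByteArrayOutputStream(), charset);
        return writer.getEncoding();
    }

    public static void main(String[] args) {
        Charset latin1 = Charset.forName("ISO-8859-1");
        System.out.println(latin1.name());                // ISO-8859-1 (canonical name)
        System.out.println(writerEncodingName(latin1));   // ISO8859_1 (historical name)
        // The historical name is also an alias of the same charset.
        System.out.println(Charset.forName("ISO8859_1").equals(latin1)); // true
    }
}
```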
I have set the VM argument in the WAS server to -Dfile.encoding=UTF-8 to change the server's default character set.
check
it seems to be the same encoding as the one used in your system's command line.