How to find the default charset/encoding in Java?

Posted on 2024-08-11 08:55:57

The obvious answer is to use Charset.defaultCharset(), but we recently found out that this might not be the right answer. I was told that on several occasions its result differs from the real default charset used by the java.io classes. It looks like Java keeps two sets of default charsets. Does anyone have any insight into this issue?

We were able to reproduce one failure case. It's a kind of user error, but it may still expose the root cause of all the other problems. Here is the code:

import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class CharSetTest {

    public static void main(String[] args) {
        System.out.println("Default Charset=" + Charset.defaultCharset());
        // Changing the property at runtime is the user error under test.
        System.setProperty("file.encoding", "Latin-1");
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));
        System.out.println("Default Charset=" + Charset.defaultCharset());
        System.out.println("Default Charset in Use=" + getDefaultCharSet());
    }

    // Reports the encoding a java.io writer picks when none is specified.
    private static String getDefaultCharSet() {
        OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
        String enc = writer.getEncoding();
        return enc;
    }
}

Our server requires Latin-1 as the default charset to deal with some mixed encoding (ANSI/Latin-1/UTF-8) in a legacy protocol, so all our servers run with this JVM parameter:

-Dfile.encoding=ISO-8859-1

Here is the result on Java 5:

Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=UTF-8
Default Charset in Use=ISO8859_1

Someone tried to change the encoding at runtime by setting file.encoding in the code. We all know that doesn't work. However, this apparently throws off defaultCharset(), even though it doesn't affect the real default charset used by OutputStreamWriter.

Is this a bug or a feature?

EDIT: The accepted answer shows the root cause of the issue. Basically, you can't trust defaultCharset() in Java 5, because it is not the default encoding used by the I/O classes. It looks like Java 6 corrects this issue.

紙鸢 2024-08-18 08:55:57

This is really strange... Once set, the default Charset is cached and doesn't change while the class is in memory. Setting the "file.encoding" property with System.setProperty("file.encoding", "Latin-1"); does nothing. Every time Charset.defaultCharset() is called, it returns the cached charset.

Here are my results:

Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=ISO-8859-1
Default Charset in Use=ISO8859_1

I'm using JVM 1.6 though.

(update)

Ok. I did reproduce your bug with JVM 1.5.

Looking at the source code of 1.5, the cached default charset is never set. I don't know whether this is a bug, but 1.6 changes this implementation and uses the cached charset:

JVM 1.5:

public static Charset defaultCharset() {
    synchronized (Charset.class) {
        if (defaultCharset == null) {
            java.security.PrivilegedAction pa =
                    new GetPropertyAction("file.encoding");
            String csn = (String) AccessController.doPrivileged(pa);
            Charset cs = lookup(csn);
            if (cs != null)
                return cs;
            return forName("UTF-8");
        }
        return defaultCharset;
    }
}

JVM 1.6:

public static Charset defaultCharset() {
    if (defaultCharset == null) {
        synchronized (Charset.class) {
            java.security.PrivilegedAction pa =
                    new GetPropertyAction("file.encoding");
            String csn = (String) AccessController.doPrivileged(pa);
            Charset cs = lookup(csn);
            if (cs != null)
                defaultCharset = cs;
            else
                defaultCharset = forName("UTF-8");
        }
    }
    return defaultCharset;
}

When you set file.encoding=Latin-1 and then call Charset.defaultCharset() on 1.5, the cached default charset isn't set, so the method looks up a charset for the name Latin-1. That name isn't found, because it isn't a name the JVM recognizes, so the method returns the fallback, UTF-8.
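The lookup failure described above can be checked directly. The following is a minimal sketch, not part of the original answer: the class name CharsetAliasCheck and the helper canonicalNameOf are made up for illustration, and only Charset.isSupported, Charset.forName and Charset.name come from the JDK. On the JDKs I am aware of, "latin1" and "ISO8859_1" are registered aliases of ISO-8859-1 while "Latin-1" (with the hyphen) is not, which is consistent with the UTF-8 fallback in the question.

```java
import java.nio.charset.Charset;

// Sketch: reports which spellings of "Latin-1" the running JVM can resolve.
class CharsetAliasCheck {

    // Returns the canonical charset name for a label, or null when the JVM
    // does not recognize the label or the label is syntactically illegal.
    static String canonicalNameOf(String label) {
        try {
            return Charset.isSupported(label) ? Charset.forName(label).name() : null;
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException is a subclass of IllegalArgumentException.
            return null;
        }
    }

    public static void main(String[] args) {
        String[] labels = {"ISO-8859-1", "latin1", "ISO8859_1", "Latin-1"};
        for (String label : labels) {
            System.out.println(label + " -> " + canonicalNameOf(label));
        }
    }
}
```

A label that resolves to null here is one that the 1.5 code path shown above would replace with UTF-8.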

As for why I/O classes such as OutputStreamWriter return an unexpected result: the implementation of sun.nio.cs.StreamEncoder (which is used by these I/O classes) also differs between JVM 1.5 and JVM 1.6. The JVM 1.6 implementation relies on Charset.defaultCharset() to get the default encoding when none is provided to the I/O classes. The JVM 1.5 implementation uses a different method, Converters.getDefaultEncodingName(), to get the default charset. That method uses its own cache of the default charset, which is set upon JVM initialization:

JVM 1.6:

public static StreamEncoder forOutputStreamWriter(OutputStream out,
        Object lock,
        String charsetName)
        throws UnsupportedEncodingException
{
    String csn = charsetName;
    if (csn == null)
        csn = Charset.defaultCharset().name();
    try {
        if (Charset.isSupported(csn))
            return new StreamEncoder(out, lock, Charset.forName(csn));
    } catch (IllegalCharsetNameException x) { }
    throw new UnsupportedEncodingException (csn);
}

JVM 1.5:

public static StreamEncoder forOutputStreamWriter(OutputStream out,
        Object lock,
        String charsetName)
        throws UnsupportedEncodingException
{
    String csn = charsetName;
    if (csn == null)
        csn = Converters.getDefaultEncodingName();
    if (!Converters.isCached(Converters.CHAR_TO_BYTE, csn)) {
        try {
            if (Charset.isSupported(csn))
                return new CharsetSE(out, lock, Charset.forName(csn));
        } catch (IllegalCharsetNameException x) { }
    }
    return new ConverterSE(out, lock, csn);
}

But I agree with the comments. You shouldn't rely on this property. It's an implementation detail.
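To see which of the two behaviors a given JVM has, something like the following could be run. It is only a sketch under assumptions: the class and method names are hypothetical, and the second check assumes the name returned by getEncoding() can be resolved by Charset.forName, which should hold for the common charsets but is not guaranteed for every platform default. Based on the source shown above, I would expect both checks to pass on 1.6 and later, and the first one to fail on 1.5 when the default is not UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

// Sketch: probes whether the default charset is cached and shared with java.io.
class RuntimePropertyCheck {

    // Returns true if changing file.encoding at runtime leaves
    // Charset.defaultCharset() unchanged. The property is restored afterwards.
    static boolean defaultSurvivesPropertyChange(String newValue) {
        Charset before = Charset.defaultCharset(); // resolve the default first
        String saved = System.getProperty("file.encoding");
        try {
            System.setProperty("file.encoding", newValue);
            return before.equals(Charset.defaultCharset());
        } finally {
            if (saved != null) {
                System.setProperty("file.encoding", saved);
            } else {
                System.clearProperty("file.encoding");
            }
        }
    }

    // Returns true if a writer created without an explicit charset uses the
    // same charset that Charset.defaultCharset() reports.
    static boolean writerAgreesWithDefault() {
        // In-memory stream, so there is nothing to release here.
        OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
        return Charset.forName(writer.getEncoding()).equals(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        System.out.println("default survives setProperty: "
                + defaultSurvivesPropertyChange("Latin-1"));
        System.out.println("writer agrees with default:   "
                + writerAgreesWithDefault());
    }
}
```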

不及他 2024-08-18 08:55:57

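Is this a bug or a feature?

Looks like undefined behavior. I know that, in practice, you can change the default encoding using a command-line property, but I don't think what happens when you do this is defined.

Bug ID 4153515, on problems setting this property:

This is not a bug. The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution.

The preferred way to change the default encoding used by the VM and the runtime system is to change the locale of the underlying platform before starting your Java program.

I cringe when I see people setting the encoding on the command line, because you don't know what code that is going to affect.

If you do not want to use the default encoding, set the encoding you do want explicitly via the appropriate method or constructor.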
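As an illustration of setting the encoding explicitly, here is a small sketch. It assumes Java 8 or later (StandardCharsets arrived in Java 7 and UncheckedIOException in Java 8, both after the Java 5/6 versions discussed in this thread), and the class and method names are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch: every conversion names its charset, so the platform default is never consulted.
class ExplicitCharsetExample {

    // Encodes text through a writer whose charset is passed to the constructor.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decodes bytes with an explicit charset instead of new String(bytes).
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "cafe" with an acute accent on the e
        byte[] latin1 = encode(text, StandardCharsets.ISO_8859_1);
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        System.out.println("ISO-8859-1 bytes: " + latin1.length);
        System.out.println("UTF-8 bytes:      " + utf8.length);
        System.out.println("round trip ok:    "
                + text.equals(decode(latin1, StandardCharsets.ISO_8859_1)));
    }
}
```

The same text produces different byte sequences under the two charsets, which is why a server that mixes Latin-1 and UTF-8 data is safer naming the charset at each conversion than relying on one process-wide default.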

仲春光 2024-08-18 08:55:57

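The behavior is not really that strange. Looking into the implementation of the classes, it is caused by the following:

Charset.defaultCharset() does not cache the determined character set in Java 5.
Setting the system property "file.encoding" and invoking Charset.defaultCharset() again causes a second evaluation of the system property. No character set with the name "Latin-1" is found, so Charset.defaultCharset() defaults to "UTF-8".
The OutputStreamWriter, however, caches the default character set and has probably already been used during VM initialization, so its default character set diverges from Charset.defaultCharset() if the system property "file.encoding" has been changed at runtime.

As already pointed out, it is not documented how the VM must behave in such a situation. The Charset.defaultCharset() API documentation is not very precise about how the default character set is determined. It only mentions that this is usually done at VM startup, based on factors such as the OS default character set or the default locale.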

甜嗑 2024-08-18 08:55:57

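First, Latin-1 is the same as ISO-8859-1, so the default was already OK for you. Right?

You successfully set the encoding to ISO-8859-1 with your command-line parameter. You also set it programmatically to "Latin-1", but that's not a recognized value of a file encoding for Java. See http://java.sun.com/javase/6/docs/technotes/guides/intl/encoding.doc.html

When you do that, it looks like Charset resets to UTF-8, judging from the source. That at least explains most of the behavior.

I don't know why OutputStreamWriter shows ISO8859_1. It delegates to closed-source sun.misc.* classes. I'm guessing it isn't dealing with encoding via quite the same mechanism, which is weird.

But of course you should always specify which encoding you mean in this code. I'd never rely on the platform default.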
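On the ISO8859_1 spelling specifically: the OutputStreamWriter.getEncoding() documentation says it returns the encoding's historical name when one exists, so the spelling alone does not indicate a different charset. The sketch below, with hypothetical class and method names and assuming Java 7 or later for StandardCharsets, prints both names side by side.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch: compares the canonical charset name with the name java.io reports.
class EncodingNameExample {

    // Returns the name that a writer built for the given charset reports.
    static String nameReportedByWriter(Charset charset) {
        return new OutputStreamWriter(new ByteArrayOutputStream(), charset).getEncoding();
    }

    public static void main(String[] args) {
        Charset[] charsets = {StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8};
        for (Charset charset : charsets) {
            System.out.println("canonical=" + charset.name()
                    + " reported by writer=" + nameReportedByWriter(charset));
        }
    }
}
```

Passing the reported name back through Charset.forName gives the canonical charset again, which is a convenient way to compare what java.io uses with Charset.defaultCharset().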
