Is a character in Java 1 byte or 2 bytes?

Posted on 2024-11-06 23:11:59

I thought characters in Java are 16 bits, as the Javadoc suggests. Isn't that also the case for strings? I have code that stores an object in a file:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public static void storeNormalObj(File outFile, Object obj) {
    // try-with-resources closes oos first, then fos, even if writeObject throws
    try (FileOutputStream fos = new FileOutputStream(outFile);
         ObjectOutputStream oos = new ObjectOutputStream(fos)) {
        oos.writeObject(obj);
        oos.flush();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Basically, I tried to store the string "abcd" in a file named "output". When I opened output in an editor and deleted the non-string part, what was left was just the string "abcd", which is 4 bytes in total. Does anyone know why? Does Java automatically save space by using ASCII instead of Unicode for strings that ASCII can represent? Thanks.

Comments (5)

2024-11-13 23:11:59

(I think by "none string part" you are referring to the bytes that ObjectOutputStream emits when you create it. It is possible you don't want to use ObjectOutputStream, but I don't know your requirements.)

Just FYI, Unicode and UTF-8 are not the same thing. Unicode is a standard that specifies, amongst other things, what characters are available. UTF-8 is a character encoding that specifies how these characters shall be physically encoded in 1s and 0s. UTF-8 can use 1 byte for ASCII (<= 127) and up to 4 bytes to represent other Unicode characters.

UTF-8 is a strict superset of ASCII. So even if you specify a UTF-8 encoding for a file and you write "abcd" to it, it will contain just those four bytes: they have the same physical encoding in ASCII as they do in UTF-8.
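
To make this concrete, here is a minimal sketch that counts how many bytes a string occupies in a few encodings. The class name `EncodedLengths` and the helper `byteLength` are made up for illustration; only `String.getBytes(Charset)` and `StandardCharsets` come from the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedLengths {
    // Hypothetical helper: number of bytes s occupies in the given charset.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String ascii = "abcd";
        String euro = "\u20AC"; // EURO SIGN, a character outside ASCII

        System.out.println(byteLength(ascii, StandardCharsets.US_ASCII)); // 4
        System.out.println(byteLength(ascii, StandardCharsets.UTF_8));    // 4
        System.out.println(byteLength(ascii, StandardCharsets.UTF_16BE)); // 8
        System.out.println(byteLength(euro, StandardCharsets.UTF_8));     // 3
    }
}
```

The ASCII and UTF-8 results match for "abcd", while a 16-bit encoding doubles the size and a non-ASCII character needs more than one byte in UTF-8.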

Your method uses ObjectOutputStream, which has its own stream format that differs significantly from plain ASCII or UTF-8 text! As the Javadoc explains, if obj is a string that has already occurred in the stream, subsequent calls to writeObject emit a reference to the previous string instead of the string itself, so repeated strings can take far fewer bytes.
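
If you want to see this for yourself, the following rough sketch serializes the same String instance once and then twice into an in-memory buffer and compares the sizes. `RepeatedStringDemo` and `serializedSize` are names invented for this example; the exact byte counts depend on the serialization stream format, so the sketch prints them rather than assuming them.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class RepeatedStringDemo {
    // Hypothetical helper: serialize the same String instance `times` times
    // and return the total number of bytes written to the stream.
    static int serializedSize(String s, int times) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            for (int i = 0; i < times; i++) {
                oos.writeObject(s);
            }
        } // closing the stream flushes any buffered bytes
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        int once = serializedSize("abcd", 1);
        int twice = serializedSize("abcd", 2);
        System.out.println("written once:  " + once + " bytes");
        System.out.println("written twice: " + twice + " bytes");
        System.out.println("cost of the repeat: " + (twice - once) + " bytes");
    }
}
```

The first write includes the stream header and type information on top of the four character bytes, which is the "non-string part" you saw in the editor. The second write only adds a short back-reference.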

If you're serious about understanding this, you really should spend a good amount of time reading about Unicode and character encoding systems. Wikipedia has an excellent article on Unicode as a start.

月棠 2024-11-13 23:11:59

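Yes, a char is a 16-bit UTF-16 code unit only within the Java runtime. If you want the file itself to use a 16-bit encoding, use a Writer with an explicit 16-bit charset such as UTF-16. A FileWriter created without a charset uses the platform default encoding, which is usually not a 16-bit one.

    // Assumes inputStream is an already-open java.io.Reader.
    // Needs java.io.* and java.nio.charset.StandardCharsets.
    Writer outputStream = null;

    try {
        // UTF_16 writes a byte-order mark, then two bytes per char
        outputStream = new OutputStreamWriter(
                new FileOutputStream("myfilename.dat"), StandardCharsets.UTF_16);

        int c;
        while ((c = inputStream.read()) != -1) {
            outputStream.write(c);
        }
    } finally {
        if (outputStream != null) {
            outputStream.close();
        }
    }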
寄风 2024-11-13 23:11:59

If you look at how ObjectOutputStream serializes a String, you will see that it uses the same format as DataOutput.writeUTF, which is "modified UTF-8". The details are lengthy, but as long as you stick to 7-bit ASCII, yes, each character takes one byte (the one exception is the null character '\u0000', which takes two). If you want the gory details, look at the extremely long Javadoc of DataOutput.writeUTF().
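
As an illustration, this small sketch measures what DataOutputStream.writeUTF produces for a few strings. The class `ModifiedUtf8Demo` and the helper `writeUtfSize` are hypothetical names for this example; note that writeUTF always prefixes the data with a 2-byte length.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    // Hypothetical helper: bytes produced by writeUTF for s,
    // including the 2-byte length prefix that writeUTF always emits.
    static int writeUtfSize(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        }
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeUtfSize("abcd"));   // 6: 2-byte length + 4 x 1 byte
        System.out.println(writeUtfSize("\u00E9")); // 4: 2-byte length + 2 bytes
        System.out.println(writeUtfSize("\u20AC")); // 5: 2-byte length + 3 bytes
    }
}
```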

绾颜 2024-11-13 23:11:59

You may be interested to know that there is a -XX:+UseCompressedStrings option in the Java 6 Update 21 performance release and later. It allows String to use a byte[] for strings that do not need a char[].

Although the Java HotSpot VM Options guide suggests it may be on by default, that may only apply to performance releases. It only appears to work for me if I turn it on explicitly.

猫性小仙女 2024-11-13 23:11:59

So do you expect a file of 16 * 4 = 64 bits = 8 bytes? That would be more than the UTF-8 or ASCII encoding needs. Once the data has been written to a file, how it is stored (in terms of space) is up to the operating system, and your code has no control over it.
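
For comparison, a file really does end up with 8 bytes if each char is written as a raw 16-bit value. The sketch below is only an illustration of that case; `TwoBytesPerChar` and `writeCharsSize` are invented names, while `DataOutputStream.writeChars` and `Character.BYTES` are standard JDK members.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class TwoBytesPerChar {
    // Hypothetical helper: bytes produced when every char of s is written
    // as a raw 16-bit value via DataOutputStream.writeChars.
    static int writeCharsSize(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeChars(s);
        }
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(Character.BYTES);        // 2: size of a char in bytes
        System.out.println(writeCharsSize("abcd")); // 8: 4 chars x 2 bytes
    }
}
```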
