What Charset does ByteBuffer.asCharBuffer() use?

Posted 2024-11-25 05:22:52


What Charset does ByteBuffer.asCharBuffer() use? It seems to convert 3 bytes to one character on my system.

On a related note, how does CharsetDecoder relate to ByteBuffer.asCharBuffer()?

UPDATE: With respect to what implementation of ByteBuffer I am using, I am invoking ByteBuffer.allocate(1024).asCharBuffer(). I can't comment on what implementation gets used under the hood.


Comments (4)

哭了丶谁疼 2024-12-02 05:22:52


For the first question - I believe it uses Java's native character encoding (UTF-16).
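
A quick sketch (my addition, not part of the original answer) illustrates this: if you hand-encode a string as UTF-16BE and wrap the bytes, asCharBuffer() gives the string back unchanged, because the view just pairs bytes into chars using the buffer's byte order rather than decoding any charset. The class name is a throwaway for the demo.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class AsCharBufferDemo {
    public static void main(String[] args) {
        // "Hi" encoded by hand as UTF-16BE: 0x00 0x48 0x00 0x69
        ByteBuffer bytes = ByteBuffer.wrap("Hi".getBytes(StandardCharsets.UTF_16BE));

        // asCharBuffer() performs no charset decoding; it simply pairs bytes
        // into UTF-16 code units using the buffer's byte order (big-endian by default).
        System.out.println(bytes.asCharBuffer());   // prints: Hi
    }
}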

紫南 2024-12-02 05:22:52


As I understand it, it doesn't use any charset at all. It simply assumes the bytes are already valid UTF-16, i.e. already in Java's native string encoding. This can be seen by looking at the source of HeapByteBuffer, where the returned CharBuffer ultimately calls (little-endian version):

static private char makeChar(byte b1, byte b0) {
    return (char)((b1 << 8) | (b0 & 0xff));
}

So the only thing handled here is the endianness; you are responsible for everything else. That also means it is usually much more useful to use a CharsetDecoder, where you can specify the encoding.
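
For comparison, here is a small sketch of the decoder route (mine, not from the original answer): a CharsetDecoder performs an actual charset conversion, so you can name the encoding of the incoming bytes explicitly.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecoderDemo {
    public static void main(String[] args) throws CharacterCodingException {
        // UTF-8 encoded input bytes, which asCharBuffer() would mangle.
        ByteBuffer utf8Bytes = ByteBuffer.wrap("héllo".getBytes(StandardCharsets.UTF_8));

        // A CharsetDecoder converts from the named charset to Java's UTF-16 chars.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        CharBuffer chars = decoder.decode(utf8Bytes);
        System.out.println(chars);   // prints: héllo
    }
}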

任谁 2024-12-02 05:22:52


Looking at the JDK 7 sources under jdk/src/share/classes/java/nio:

  1. X-Buffer.java.template maps ByteBuffer.allocate() to Heap-X-Buffer.java.template
  2. Heap-X-Buffer.java.template maps ByteBuffer.asCharBuffer() to ByteBufferAs-X-Buffer.java.template
  3. ByteBuffer.asCharBuffer().toString() invokes CharBuffer.put(CharBuffer) but I can't figure out where this leads

Eventually this probably leads to Bits.makeChar() which is defined as:

static private char makeChar(byte b1, byte b0) {
    return (char)((b1 << 8) | (b0 & 0xff));
}

but I can't figure out how.
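
One detail worth adding (a sketch of mine, not from the original answer): which way the two bytes are combined is decided by the byte order set on the ByteBuffer at the time asCharBuffer() is called, as this toy example shows.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteOrderDemo {
    public static void main(String[] args) {
        byte[] pair = {0x00, 0x41};   // one big-endian UTF-16 code unit for 'A'

        // Default byte order is BIG_ENDIAN: the pair becomes 0x0041 ('A').
        char big = ByteBuffer.wrap(pair).asCharBuffer().get();

        // With LITTLE_ENDIAN the same bytes combine the other way: 0x4100.
        char little = ByteBuffer.wrap(pair)
                .order(ByteOrder.LITTLE_ENDIAN)
                .asCharBuffer()
                .get();

        System.out.printf("big-endian:    %c (U+%04X)%n", big, (int) big);
        System.out.printf("little-endian: U+%04X%n", (int) little);
    }
}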

难理解 2024-12-02 05:22:52


I wanted to expand on the answer by @Petteri H. It is true that asCharBuffer() expects the ByteBuffer to be already UTF-16 encoded. No further encoding conversion is performed. You can run an experiment using the code below.

First, create a plain text file called test.txt with a few lines.

Hello World
Hi Moon
Howdy Jupiter

This file will be UTF-8 encoded by default. We expect this to be a problem, because the CharBuffer will read two consecutive bytes to construct each character and hand you garbage values. Later, we will fix the issue.

The following code simply dumps each character from the file. Note: it will treat every two-byte sequence as one character.

import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

public class Main {
    public static void main(String[] args) {
        try (var file = new RandomAccessFile("test.txt", "r")) {
            // Map the whole file into memory as a read-only ByteBuffer.
            var mappedMemory = file.getChannel()
                    .map(FileChannel.MapMode.READ_ONLY, 0, file.length());

            // View the raw bytes as UTF-16 code units; no charset decoding happens here.
            var buff = mappedMemory.asCharBuffer();

            for (int i = 0; i < buff.length(); ++i) {
                var ch = buff.get(i);
                System.out.print(ch);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

When you run the code, you will see unexpected characters:

䡥汬漠坯牬搊䡩⁍潯渊䡯睤礠䩵灩瑥爊

Now, let's encode the same file using UTF-16.

iconv -f utf-8 -t utf-16 test.txt > test-fixed.txt

Change the Java code to read test-fixed.txt, then run it again.

Now, you will see the right output.

It is interesting to note that the BOM at the start of test-fixed.txt does not show up in the output. asCharBuffer() does not actually strip it; the BOM character U+FEFF is simply zero-width, so printing it produces nothing visible.
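
If you wanted the dump to work regardless of which endianness iconv produced, one option (a sketch of mine, under the assumption that the file starts with a UTF-16 BOM) is to read the BOM yourself, set the buffer's byte order to match, and skip those two bytes before creating the char view:

import java.io.RandomAccessFile;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;

public class BomAwareDump {
    public static void main(String[] args) {
        try (var file = new RandomAccessFile("test-fixed.txt", "r")) {
            var mapped = file.getChannel()
                    .map(FileChannel.MapMode.READ_ONLY, 0, file.length());

            // Peek at the first two bytes to detect a UTF-16 BOM.
            int b0 = mapped.get(0) & 0xff;
            int b1 = mapped.get(1) & 0xff;
            if (b0 == 0xFF && b1 == 0xFE) {
                mapped.order(ByteOrder.LITTLE_ENDIAN);   // FF FE: little-endian
                mapped.position(2);                      // skip the BOM
            } else if (b0 == 0xFE && b1 == 0xFF) {
                mapped.order(ByteOrder.BIG_ENDIAN);      // FE FF: big-endian
                mapped.position(2);
            }

            // The char view inherits the byte order and starts at the current position.
            System.out.print(mapped.asCharBuffer());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

With that in place, the output is the same whether iconv wrote UTF-16LE or UTF-16BE.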
