How do I encode/decode UTF-16LE byte arrays with a BOM?

Posted 2024-07-20 04:58:09

I need to encode/decode UTF-16 byte arrays to and from java.lang.String. The byte arrays are given to me with a Byte Order Mark (BOM), and I need to produce encoded byte arrays with a BOM.

Also, because I'm dealing with a Microsoft client/server, I'd like to emit the encoding in little endian (along with the LE BOM) to avoid any misunderstandings. I do realize that with the BOM it should work big endian, but I don't want to swim upstream in the Windows world.

As an example, here is a method which encodes a java.lang.String as UTF-16 in little endian with a BOM:

public static byte[] encodeString(String message) {

    byte[] tmp = null;
    try {
        tmp = message.getBytes("UTF-16LE");
    } catch (UnsupportedEncodingException e) {
        // should not be possible: every Java platform must support UTF-16LE
        AssertionError ae =
            new AssertionError("Could not encode UTF-16LE");
        ae.initCause(e);
        throw ae;
    }

    // use brute force method to add BOM (FF FE marks little endian)
    byte[] utf16lemessage = new byte[2 + tmp.length];
    utf16lemessage[0] = (byte)0xFF;
    utf16lemessage[1] = (byte)0xFE;
    System.arraycopy(tmp, 0,
                     utf16lemessage, 2,
                     tmp.length);
    return utf16lemessage;
}

What is the best way to do this in Java? Ideally I'd like to avoid copying the entire byte array into a new byte array that has two extra bytes allocated at the beginning.

The same goes for decoding such a string, but that's much more straightforward by using the java.lang.String constructor:

public String(byte[] bytes,
              int offset,
              int length,
              String charsetName)
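
For the decoding side, here is a minimal sketch of how that constructor can be used to skip the two BOM bytes. The class name Utf16BomDecoder and the fallback to little endian when no BOM is present are assumptions made for illustration, not something stated in the question; the sketch also uses the Charset overload of the constructor rather than the charsetName one, to avoid the checked exception.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper name, chosen for this sketch only.
final class Utf16BomDecoder {

    /**
     * Decodes UTF-16 bytes, honoring a leading BOM if there is one.
     * Assumption: input without a BOM is treated as little endian.
     */
    static String decode(byte[] bytes) {
        if (bytes.length >= 2) {
            int b0 = bytes[0] & 0xFF;
            int b1 = bytes[1] & 0xFF;
            if (b0 == 0xFF && b1 == 0xFE) {
                // FF FE: little-endian BOM, decode the rest without it
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16LE);
            }
            if (b0 == 0xFE && b1 == 0xFF) {
                // FE FF: big-endian BOM
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
            }
        }
        // No BOM found: fall back to little endian (an assumption of this sketch)
        return new String(bytes, StandardCharsets.UTF_16LE);
    }
}
```

Using the offset/length constructor means the BOM never has to be stripped into a separate array before decoding.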

Comments (6)

洒一地阳光 2024-07-27 04:58:09

The "UTF-16" charset name always encodes with a BOM and decodes data in either byte order (big or little endian), while "UnicodeBig" and "UnicodeLittle" are useful for encoding in a specific byte order. Use UTF-16LE or UTF-16BE for no BOM - see this post for how to use "\uFEFF" to handle BOMs manually. See here for the canonical charset names or (preferably) use the Charset class. Also note that only a limited subset of encodings is guaranteed to be supported.
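
As a rough illustration of the manual "\uFEFF" approach mentioned above, the sketch below prepends U+FEFF before encoding with UTF-16LE, and decodes with the BOM-detecting "UTF-16" charset. The class and method names are hypothetical, chosen only for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper name, chosen for this sketch only.
final class ManualBom {

    /** Encodes as UTF-16LE with a leading BOM by prepending U+FEFF to the text. */
    static byte[] encodeLittleEndianWithBom(String message) {
        // U+FEFF encoded in little endian is the byte pair FF FE
        return ("\uFEFF" + message).getBytes(StandardCharsets.UTF_16LE);
    }

    /** Decodes with the "UTF-16" charset, which detects and consumes a leading BOM. */
    static String decodeDetectingBom(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }
}
```

Note that decoding the same bytes with UTF-16LE instead of UTF-16 keeps the BOM in the result as a leading '\uFEFF' character, so the caller would have to strip it.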

画▽骨i 2024-07-27 04:58:09

First off, for decoding you can use the character set "UTF-16"; that automatically detects an initial BOM. For encoding UTF-16BE, you can also use the "UTF-16" character set - that'll write a proper BOM and then output big endian stuff.

For encoding to little endian with a BOM, I don't think your current code is too bad, even with the double allocation (unless your strings are truly monstrous). If they are, you might want to work with a java.nio ByteBuffer rather than a byte array, and use the java.nio.charset.CharsetEncoder class. (You can get one from Charset.forName("UTF-16LE").newEncoder().)
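
A minimal sketch of that CharsetEncoder idea follows. It relies on the fact that UTF-16 needs exactly two bytes per Java char, so the output buffer can be sized up front and the BOM written before the encoder runs. The class name and the choice to wrap encoding errors in IllegalArgumentException are assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper name, chosen for this sketch only.
final class Utf16LeBomEncoder {

    /** Encodes message as UTF-16LE with a leading FF FE BOM into a single array. */
    static byte[] encode(String message) {
        CharsetEncoder encoder = StandardCharsets.UTF_16LE.newEncoder();
        // Two bytes for the BOM plus two bytes per char (surrogates count as two chars)
        ByteBuffer out = ByteBuffer.allocate(2 + 2 * message.length());
        out.put((byte) 0xFF).put((byte) 0xFE);
        try {
            // Encode straight into the buffer, after the BOM
            CoderResult result = encoder.encode(CharBuffer.wrap(message), out, true);
            if (result.isError()) {
                result.throwException();
            }
            result = encoder.flush(out);
            if (result.isError()) {
                result.throwException();
            }
        } catch (CharacterCodingException e) {
            // For example an unpaired surrogate in the input
            throw new IllegalArgumentException("Could not encode as UTF-16LE", e);
        }
        // The buffer was sized exactly, so its backing array is the result
        return out.array();
    }
}
```

The encoded bytes are written once into the final array, so there is no intermediate byte[] to copy; the String's own character data is still read by the encoder, of course.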

萌无敌 2024-07-27 04:58:09

This is how you do it in nio:

    // Prepend U+FEFF so the encoder itself emits the FF FE mark
    ByteBuffer buf = Charset.forName("UTF-16LE").encode("\uFEFF" + message);
    // Copy only the encoded bytes; array() may expose unused capacity
    byte[] out = new byte[buf.remaining()];
    buf.get(out);
    return out;

It is supposed to be faster, though I don't know how many arrays it allocates under the covers; my understanding is that the API is designed to minimize that.

ˇ宁静的妩媚 2024-07-27 04:58:09
    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(string.length() * 2 + 2);
    byteArrayOutputStream.write(new byte[]{(byte)0xFF,(byte)0xFE});
    byteArrayOutputStream.write(string.getBytes("UTF-16LE"));
    return byteArrayOutputStream.toByteArray();

EDIT: Rereading your question, I see you would rather avoid the double array allocation altogether. Unfortunately, as far as I know, the API doesn't give you that. (There was a method, but it is deprecated, and you can't specify an encoding with it.)

I wrote the above before I saw your comment. I think the answer that uses the nio classes is on the right track. I was looking at that, but I'm not familiar enough with the API to know offhand how to get it done.

卖梦商人 2024-07-27 04:58:09

This is an old question, but I still couldn't find an acceptable answer for my situation. Basically, Java doesn't have a built-in encoder for UTF-16LE with a BOM, so you have to roll your own implementation.

Here's what I ended up with:

private byte[] encodeUTF16LEWithBOM(final String s) {
    ByteBuffer content = Charset.forName("UTF-16LE").encode(s);
    byte[] bom = { (byte) 0xff, (byte) 0xfe };
    // size by remaining(), not capacity(): the encoder's buffer may be larger than the encoded text
    return ByteBuffer.allocate(content.remaining() + bom.length).put(bom).put(content).array();
}

绝不放开 2024-07-27 04:58:09

To convert from String to byte[] while forcing little or big endian with a byte order mark, I came up with the following one-line solution, using ArrayUtils from Apache Commons Lang. For little endian:

    tmp = ArrayUtils.addAll(new byte[] {(byte) 0xFF, (byte) 0xFE}, message.getBytes(UTF_16LE));

and for big endian:

    tmp = ArrayUtils.addAll(new byte[] {(byte) 0xFE, (byte) 0xFF}, message.getBytes(UTF_16BE));

It's rather low level, with the byte array for the order mark, but still easier than the approach in the question.
