How do I encode/decode UTF-16LE byte arrays with a BOM?

Posted 2024-07-20 04:58:09

I need to encode/decode UTF-16 byte arrays to and from java.lang.String. The byte arrays are given to me with a Byte Order Mark (BOM), and I need to produce encoded byte arrays with a BOM.

Also, because I'm dealing with a Microsoft client/server, I'd like to emit the encoding in little endian (along with the LE BOM) to avoid any misunderstandings. I do realize that with the BOM it should work big endian, but I don't want to swim upstream in the Windows world.

As an example, here is a method which encodes a java.lang.String as UTF-16 in little endian with a BOM:

public static byte[] encodeString(String message) {

    byte[] tmp = null;
    try {
        tmp = message.getBytes("UTF-16LE");
    } catch(UnsupportedEncodingException e) {
        // should not be possible: every JVM must support UTF-16LE
        AssertionError ae =
            new AssertionError("Could not encode UTF-16LE");
        ae.initCause(e);
        throw ae;
    }

    // use brute force method to add BOM
    byte[] utf16lemessage = new byte[2 + tmp.length];
    utf16lemessage[0] = (byte)0xFF;
    utf16lemessage[1] = (byte)0xFE;
    System.arraycopy(tmp, 0,
                     utf16lemessage, 2,
                     tmp.length);
    return utf16lemessage;
}

What is the best way to do this in Java? Ideally I'd like to avoid copying the entire byte array into a new byte array that has two extra bytes allocated at the beginning.

The same goes for decoding such a string, but that's much more straightforward by using the java.lang.String constructor:

public String(byte[] bytes,
              int offset,
              int length,
              String charsetName)
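
A minimal sketch of the decoding side, assuming the input may start with either BOM. The class and method names are hypothetical, and treating BOM-less input as little endian is an assumption made for the Windows setting described above. It uses the offset/length constructor to skip the BOM without copying the input array:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK.
final class Utf16Decoding {
    private Utf16Decoding() {}

    /**
     * Decodes UTF-16 bytes, honouring a leading BOM if there is one.
     * Assumption: input without a BOM is treated as little endian.
     */
    static String decode(byte[] bytes) {
        if (bytes.length >= 2) {
            int b0 = bytes[0] & 0xFF;
            int b1 = bytes[1] & 0xFF;
            if (b0 == 0xFF && b1 == 0xFE) {
                // LE BOM: skip it through the offset/length constructor
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16LE);
            }
            if (b0 == 0xFE && b1 == 0xFF) {
                // BE BOM: same idea with the big-endian charset
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
            }
        }
        // No BOM found: fall back to little endian (assumption, see above)
        return new String(bytes, StandardCharsets.UTF_16LE);
    }
}
```

If big endian is an acceptable default for BOM-less input, `new String(bytes, "UTF-16")` does the BOM detection on its own.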

洒一地阳光 2024-07-27 04:58:09

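The "UTF-16" charset name always encodes with a BOM and decodes data in either big or little endian byte order, whereas "UnicodeBig" and "UnicodeLittle" are useful for encoding in a specific byte order. For output without a BOM, use UTF-16LE or UTF-16BE; see this post for how to handle the BOM manually with "\uFEFF". See here for the canonical charset name strings or (preferably) the Charset class. Also note that only a limited set of encodings is absolutely required to be supported.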
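
The two options mentioned here could be sketched as follows. The class and method names are hypothetical. Because "UnicodeLittle" is not among the charsets every JVM is required to provide, the sketch checks for it and otherwise falls back to prepending "\uFEFF":

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK.
final class BomEncoding {
    private BomEncoding() {}

    /** Prepends U+FEFF so the BOM-less UTF-16LE encoder emits FF FE itself. */
    static byte[] encodeWithBomChar(String message) {
        // The concatenation creates a temporary String one char longer than the input
        return ("\uFEFF" + message).getBytes(StandardCharsets.UTF_16LE);
    }

    /**
     * Uses the BOM-writing little-endian charset when the runtime offers it.
     * "UnicodeLittle" is optional, so fall back to the U+FEFF approach otherwise.
     */
    static byte[] encodeWithUnicodeLittle(String message) {
        if (Charset.isSupported("UnicodeLittle")) {
            return message.getBytes(Charset.forName("UnicodeLittle"));
        }
        return encodeWithBomChar(message);
    }
}
```

The U+FEFF variant depends only on UTF-16LE, which every Java platform must support, so it is the more portable of the two.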

画▽骨i 2024-07-27 04:58:09

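First off, for decoding you can use the "UTF-16" charset; it automatically detects an initial BOM. For encoding as UTF-16BE you can also use the "UTF-16" charset, which writes a proper BOM and then outputs big-endian content.

For encoding to little endian with a BOM, I don't think your current code is too bad, even with the double allocation (unless your strings are truly monstrous). If they are, you might want to work with a java.nio ByteBuffer rather than a byte array, and use the java.nio.charset.CharsetEncoder class (which you can get from Charset.forName("UTF-16LE").newEncoder()).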
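
A sketch of the CharsetEncoder route, under the assumption that the result must still be a single byte[]. The class name is hypothetical. It allocates one buffer sized for the BOM plus the encoder's stated worst case, writes the BOM, and lets the encoder fill in the rest, so the second copy is only needed if the buffer ends up partly unused:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of the JDK.
final class Utf16LeBomEncoder {
    private Utf16LeBomEncoder() {}

    static byte[] encode(String message) {
        CharsetEncoder encoder = StandardCharsets.UTF_16LE.newEncoder()
                // Mirror String.getBytes: replace bad input instead of failing
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        // Two bytes for the BOM plus the encoder's worst case per char
        int capacity = 2 + (int) Math.ceil(message.length() * (double) encoder.maxBytesPerChar());
        ByteBuffer out = ByteBuffer.allocate(capacity);
        out.put((byte) 0xFF).put((byte) 0xFE);

        CoderResult result = encoder.encode(CharBuffer.wrap(message), out, true);
        if (result.isUnderflow()) {
            result = encoder.flush(out);
        }
        if (!result.isUnderflow()) {
            throw new IllegalStateException("Unexpected encoder result: " + result);
        }

        // allocate() returns a heap buffer, so array() is its backing array.
        // Return it directly when it is exactly full, otherwise trim a copy.
        return out.position() == out.capacity()
                ? out.array()
                : Arrays.copyOf(out.array(), out.position());
    }
}
```

Whether this beats the two-allocation version in practice is not something the sketch demonstrates; it only shows how the pieces fit together.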

萌无敌 2024-07-27 04:58:09

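This is how you do it in nio:

    // Prepend U+FEFF so the encoder itself emits the BOM bytes FF FE;
    // calling put(0, ...) and put(1, ...) on the encoded buffer would
    // overwrite the first character rather than insert a BOM.
    ByteBuffer buf = Charset.forName("UTF-16LE").encode("\uFEFF" + message);
    byte[] out = new byte[buf.remaining()];
    buf.get(out);   // copy only the encoded bytes, not any spare capacity
    return out;

It is supposed to be faster, although I don't know how many arrays it creates under the covers; my understanding of the API is that it is meant to minimize that.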

ˇ宁静的妩媚 2024-07-27 04:58:09

    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(string.length() * 2 + 2);
    byteArrayOutputStream.write(new byte[]{(byte)0xFF,(byte)0xFE});
    byteArrayOutputStream.write(string.getBytes("UTF-16LE"));
    return byteArrayOutputStream.toByteArray();

EDIT: Rereading your question, I see you would rather avoid the double array allocation altogether. Unfortunately the API doesn't give you that, as far as I know. (There was a method, but it is deprecated, and you can't specify encoding with it).

I wrote the above before I saw your comment. I think the answer that uses the nio classes is on the right track. I was looking at that, but I'm not familiar enough with the API to know offhand how to get it done.
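
If the bytes are ultimately headed for a stream (a socket or a file, say), one way to sidestep the intermediate array altogether is to write the BOM and the text through a Writer. This is only a sketch under that assumption, and the class and method names are hypothetical:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK.
final class Utf16LeStreamWriter {
    private Utf16LeStreamWriter() {}

    /** Writes the LE BOM followed by the message to the given stream. */
    static void write(String message, OutputStream destination) throws IOException {
        Writer writer = new OutputStreamWriter(destination, StandardCharsets.UTF_16LE);
        writer.write('\uFEFF');   // encoded by the writer as FF FE
        writer.write(message);
        writer.flush();           // flush rather than close: the caller owns the stream
    }
}
```

Passing a ByteArrayOutputStream as the destination reproduces the approach above, including its final toByteArray() copy, so the benefit only appears when the destination is the real output.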

卖梦商人 2024-07-27 04:58:09

This is an old question, but I still couldn't find an acceptable answer for my situation. Basically, Java doesn't have a built-in encoder for UTF-16LE with a BOM, so you have to roll your own implementation.

Here's what I ended up with:

private byte[] encodeUTF16LEWithBOM(final String s) {
    ByteBuffer content = Charset.forName("UTF-16LE").encode(s);
    byte[] bom = { (byte) 0xff, (byte) 0xfe };
    // remaining() counts only the encoded bytes; capacity() may include unused space
    return ByteBuffer.allocate(bom.length + content.remaining()).put(bom).put(content).array();
}

绝不放开 2024-07-27 04:58:09

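To convert from String to byte[] and force little or big endian with a byte order mark, I came up with the following one-line solution using ArrayUtils from Apache Commons Lang:
tmp = ArrayUtils.addAll(new byte[] {(byte) 0xFF, (byte) 0xFE}, message.getBytes(UTF_16LE))
for little endian, and:
tmp = ArrayUtils.addAll(new byte[] {(byte) 0xFE, (byte) 0xFF}, message.getBytes(UTF_16BE))
for big endian.
Building the byte order mark by hand as a byte array is fairly low level, but it is still easier than the approach in the question.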