Handling Unicode surrogate values in Java strings

Posted on 2024-07-23 18:16:53

Consider the following code:

byte aBytes[] = { (byte)0xff,0x01,0,0,
                  (byte)0xd9,(byte)0x65,
                  (byte)0x03,(byte)0x04, (byte)0x05, (byte)0x06, (byte)0x07,
                  (byte)0x17,(byte)0x33, (byte)0x74, (byte)0x6f,
                   0, 1, 2, 3, 4, 5,
                   0 };
String sCompressedBytes = new String(aBytes, "UTF-16");
for (int i=0; i<sCompressedBytes.length(); i++) {
    System.out.println(Integer.toHexString(sCompressedBytes.codePointAt(i)));
}

This produces the following incorrect output:

ff01, 0, fffd, 506, 717, 3374, 6f00, 102, 304, 500.

However, if the 0xd9 in the input data is changed to 0x9d, then the following correct output is obtained:

ff01, 0, 9d65, 304, 506, 717, 3374, 6f00, 102, 304, 500.

I realize this happens because the byte 0xd9 starts a high-surrogate code unit (0xd965).

Question: Is there a way to feed, identify and extract surrogate values (0xd800 to 0xdfff) in a Java Unicode string?
Thanks

以酷 2024-07-30 18:16:53

EDIT: This addresses the question from the comment

If you want to encode arbitrary binary data in a string, you should not use a normal text encoding. You don't have valid text in that encoding - you just have arbitrary binary data.

Base64 is the way to go here. Since Java 8 the standard library includes java.util.Base64; on older versions there is no public Base64 class, but there are various third-party libraries you can use, such as Apache Commons Codec.

Yes, base64 will increase the size of the data - but it'll allow you to decode it later without losing information.
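As a rough illustration of that round trip, here is a minimal sketch that assumes Java 8 or later, where java.util.Base64 is available. The class name Base64RoundTrip and its two helper methods are made up for this example; the input is the first eight bytes from the question.

```java
import java.util.Arrays;
import java.util.Base64;

class Base64RoundTrip {
    // Encode arbitrary bytes as an ASCII-only string (hypothetical helper).
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Recover the original bytes from the Base64 text (hypothetical helper).
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Includes 0xd9 0x65, the pair that UTF-16 decoding mangled.
        byte[] raw = { (byte) 0xff, 0x01, 0, 0, (byte) 0xd9, 0x65, 0x03, 0x04 };
        String text = encode(raw);
        byte[] back = decode(text);
        System.out.println(text);
        System.out.println(Arrays.equals(raw, back));
    }
}
```

The encoded string is roughly a third longer than the input, but every byte, including 0xd9, survives the trip unchanged.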

EDIT: This addresses the original question

I believe the problem is that you haven't specified a proper surrogate pair. You should specify bytes representing a high surrogate and then a low surrogate. After that, you should be able to extract the appropriate code point. In your case, you've given a high surrogate (0xd965) on its own.

Here's code to demonstrate this:

public class Test
{
    public static void main(String[] args)
        throws Exception // Just for simplicity
    {
        byte[] data = 
        {
            0, 0x41, // A
            (byte) 0xD8, 1, // High surrogate
            (byte) 0xDC, 2, // Low surrogate
            0, 0x42, // B
        };

        String text = new String(data, "UTF-16");

        System.out.printf("%x%n", text.codePointAt(0));
        System.out.printf("%x%n", text.codePointAt(1));
        // Index 2 holds the low surrogate of the pair, so skip to index 3
        System.out.printf("%x%n", text.codePointAt(3));
    }
}

Output:

41
10402
42

撧情箌佬 2024-07-30 18:16:53

Is there a way to feed, identify and extract surrogate bytes (0xd800 to 0xdfff) in a Java Unicode string?

Just because no one has mentioned it, I'll point out that the Character class includes methods for working with surrogate pairs, e.g. isHighSurrogate(char), codePointAt(CharSequence, int) and toChars(int). I realise that this is beside the point of the stated problem.
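To show how those Character methods might be combined, here is a small sketch that walks a string one UTF-16 code unit at a time and labels each position as an ordinary character, a valid surrogate pair, or a lone surrogate. The class SurrogateScan and its describe method are invented for illustration, and Character.isSurrogate(char) assumes Java 7 or later.

```java
class SurrogateScan {
    // Labels each code unit or surrogate pair in the string (hypothetical helper).
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean pairStart = Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1));
            if (pairStart) {
                // A valid pair: codePointAt combines both halves.
                sb.append(String.format("pair:%x ", s.codePointAt(i)));
                i++; // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                // A surrogate with no partner.
                sb.append(String.format("lone:%x ", (int) c));
            } else {
                sb.append(String.format("bmp:%x ", (int) c));
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // "A", U+10402 as a surrogate pair, a lone high surrogate, then "B".
        String s = "A" + new String(Character.toChars(0x10402)) + (char) 0xd965 + "B";
        System.out.println(describe(s));
    }
}
```

Note that a String built directly from chars can hold a lone surrogate such as 0xd965; it is the charset decoding step that rejects it.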

new String(aBytes, "UTF-16");

This is a decoding operation that will transform the input data. Java's "UTF-16" decoder treats a leading 0xfe 0xff or 0xff 0xfe as a byte order mark and defaults to big-endian when there is none, so the first two bytes here (0xff 0x01) are decoded as an ordinary character. More importantly, not every possible byte sequence can be decoded correctly: UTF-16 is a variable-width encoding, and an unpaired surrogate such as 0xd965 is malformed input that the decoder replaces with U+FFFD.
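If you would rather detect such input than have it silently replaced, one option is a CharsetDecoder configured to report errors. The following is only a sketch: the class StrictUtf16 and the method isValidUtf16 are hypothetical names, and the second test array is the start of the question's data with the 0xff 0x01 prefix swapped for a plain "A".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf16 {
    // Returns true only if the bytes decode as well-formed UTF-16 (hypothetical helper).
    static boolean isValidUtf16(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_16.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown instead of substituting U+FFFD.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] paired = { 0, 0x41, (byte) 0xD8, 1, (byte) 0xDC, 2, 0, 0x42 };
        byte[] unpaired = { 0, 0x41, (byte) 0xD9, 0x65, 0x03, 0x04 };
        System.out.println(isValidUtf16(paired));
        System.out.println(isValidUtf16(unpaired));
    }
}
```

By contrast, new String(bytes, charset) always uses the replacement behaviour, which is why the question's output shows fffd rather than an exception.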

If you wanted a symmetric transformation of arbitrary bytes to String and back, you are better off with an 8-bit, single-byte encoding because every byte value is a valid character:

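import java.nio.charset.Charset;
import java.util.Arrays;

// Build an array holding every possible byte value (-128 to 127).
Charset iso8859_15 = Charset.forName("ISO-8859-15");
byte[] data = new byte[256];
for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
  data[i - Byte.MIN_VALUE] = (byte) i;
}
// Decode to a String, then encode back and compare with the original bytes.
String asString = new String(data, iso8859_15);
byte[] encoded = asString.getBytes(iso8859_15);
System.out.println(Arrays.equals(data, encoded));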

Note: the number of characters will equal the number of bytes, which can double the size of the data in memory because a Java char is 16 bits wide. The resultant string isn't necessarily going to be printable, since it may contain a bunch of control characters.

I'm with Jon, though - putting arbitrary byte sequences into Java strings is almost always a bad idea.
