
What is a "surrogate pair" in Java?

Posted on 2024-11-05 23:50:10

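I am reading the documentation for StringBuffer, in particular the reverse() method. That documentation mentions something about surrogate pairs. What is a surrogate pair in this context? And what are low and high surrogates?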


百思不得你姐 2024-11-12 23:50:10

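The term "surrogate pair" refers to a means of encoding Unicode characters with high code points in the UTF-16 encoding scheme.

In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF.

Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can only hold values from 0x0 to 0xFFFF, some additional complexity is needed to store values above this range (0x10000 to 0x10FFFF). This is done using pairs of code units known as surrogates.

The surrogate code units fall into two ranges known as "high surrogates" and "low surrogates", depending on whether they are allowed at the start or at the end of the two-code-unit sequence.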
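As a minimal sketch of the point above, the following uses standard methods of java.lang.Character to show that a code point above 0xFFFF occupies two char values, a high surrogate followed by a low surrogate. The class name, the helper needsSurrogatePair, and the choice of U+1F309 are illustrative assumptions, not part of the answer.

```java
class SurrogateBasics {
    // Hypothetical helper: true when the code point needs two UTF-16 code units.
    static boolean needsSurrogatePair(int codePoint) {
        return Character.charCount(codePoint) == 2;
    }

    public static void main(String[] args) {
        int bridge = 0x1F309; // a code point above 0xFFFF, chosen only for illustration
        char[] units = Character.toChars(bridge);

        System.out.println(units.length);                        // 2
        System.out.println(Character.isHighSurrogate(units[0])); // true
        System.out.println(Character.isLowSurrogate(units[1]));  // true
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D83C DF09
    }
}
```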


浅忆 2024-11-12 23:50:10

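Early Java versions represented Unicode characters using the 16-bit char data type. This design made sense at the time, because all Unicode characters had values less than 65,535 (0xFFFF) and could be represented in 16 bits. Later, however, Unicode increased the maximum value to 1,114,111 (0x10FFFF). Because 16-bit values were too small to represent all of the characters in Unicode version 3.1, 32-bit values, called code points, were adopted for the UTF-32 encoding scheme.
But 16-bit values are preferred over 32-bit values for efficient memory use, so Unicode introduced a new design to allow the continued use of 16-bit values. This design, adopted in the UTF-16 encoding scheme, assigns 1,024 values to 16-bit high surrogates (in the range U+D800 to U+DBFF) and another 1,024 values to 16-bit low surrogates (in the range U+DC00 to U+DFFF). It uses a high surrogate followed by a low surrogate (a surrogate pair) to represent the 1,048,576 (0x100000, the product of 1,024 and 1,024) values between 65,536 (0x10000) and 1,114,111 (0x10FFFF).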
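The arithmetic behind the 1,024 × 1,024 split can be sketched as follows: subtract 0x10000 to get a 20-bit offset, put the top 10 bits in the high surrogate and the bottom 10 bits in the low surrogate. The class and method names are hypothetical; in real code Character.highSurrogate, Character.lowSurrogate and Character.toCodePoint do the same job.

```java
class SurrogateMath {
    // Splits a supplementary code point (0x10000..0x10FFFF) into its two UTF-16 code units.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20-bit value, 0x00000..0xFFFFF
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    // Recombines a high/low surrogate pair into the original code point.
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10FFFF);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // DBFF DFFF
        // The hand-written formula agrees with the JDK's own conversion.
        System.out.println(fromSurrogates(pair[0], pair[1])
                == Character.toCodePoint(pair[0], pair[1]));            // true
    }
}
```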


任性一次 2024-11-12 23:50:10


Adding some more info to the above answers from this post.

Tested on Java 12; it should work on Java 5 and later.

As mentioned here: https://stackoverflow.com/a/47505451/2987755,
any character whose code point is above U+FFFF is represented as a surrogate pair, which Java stores as a pair of char values. In other words, a single Unicode character is represented as two adjacent Java chars.
The following examples show this.

Length:

"🌉".length()  // 2, although one might expect 1

"🌉".codePointCount(0, "🌉".length())  // 1, the number of Unicode code points in the String

Equality:
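Write "🌉" with the escapes \ud83c\udf09 and check equality.
```
"🌉".equals("\ud83c\udf09") // true
```
A \u escape takes exactly four hex digits, so a code point above U+FFFF cannot be written as a single escape. "\u1F309" is parsed as U+1F30 followed by the character '9':
```
"🌉".equals("\u1F309") // false
```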

You can convert a Unicode code point to a Java String:

"🌉".equals(new String(Character.toChars(0x0001F309))) // true
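For comparison, here is a small sketch of three equivalent ways to build a String from a supplementary code point, plus what the five-digit escape actually produces. The class name and the helper fromCodePoint are made up for illustration; StringBuilder.appendCodePoint and Character.toChars are standard JDK methods.

```java
class SupplementaryLiterals {
    // Hypothetical helper: builds a one-code-point String.
    static String fromCodePoint(int codePoint) {
        return new StringBuilder().appendCodePoint(codePoint).toString();
    }

    public static void main(String[] args) {
        String viaEscapes = "\uD83C\uDF09";                         // explicit surrogate pair
        String viaToChars = new String(Character.toChars(0x1F309)); // from the code point
        String viaBuilder = fromCodePoint(0x1F309);

        System.out.println(viaEscapes.equals(viaToChars)); // true
        System.out.println(viaToChars.equals(viaBuilder)); // true

        // Only four hex digits belong to the escape; the trailing 9 is an ordinary character.
        String wrong = "\u1F309";
        System.out.println(wrong.length());  // 2
        System.out.println(wrong.charAt(1)); // 9
    }
}
```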

String.substring() works on char indices, so it does not take supplementary characters into account and can cut a surrogate pair in half:

"🌉🌐".substring(0,1) // "?" (a lone high surrogate)
"🌉🌐".substring(0,2) // "🌉"
"🌉🌐".substring(0,4) // "🌉🌐"

To solve this we can use String.offsetByCodePoints(int index, int codePointOffset)

"🌉🌐".substring(0,"🌉🌐".offsetByCodePoints(0,1)) // "🌉"
"🌉🌐".substring(2,"🌉🌐".offsetByCodePoints(1,2)) // "🌐"
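The offsetByCodePoints approach can be wrapped in a small helper. This is only a sketch: firstCodePoints is a hypothetical name, not a JDK method, and it throws IndexOutOfBoundsException if the string has fewer than n code points.

```java
class CodePointSubstring {
    // Hypothetical helper: returns the first n code points of s without splitting a surrogate pair.
    static String firstCodePoints(String s, int n) {
        int end = s.offsetByCodePoints(0, n); // char index just past the n-th code point
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F309)) + "ab";

        System.out.println(s.length());                      // 4 chars
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
        System.out.println(firstCodePoints(s, 2).length());  // 3 chars: the pair plus 'a'

        // codePoints() iterates whole code points rather than chars.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```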

Iterate over the user-visible characters of a Unicode string with java.text.BreakIterator.
Sort Unicode strings with java.text.Collator.
Avoid Character.toUpperCase() and Character.toLowerCase(); use String.toUpperCase(Locale) and String.toLowerCase(Locale) with the appropriate locale instead.
Character.isLetter(char ch) cannot handle supplementary characters; use Character.isLetter(int codePoint). For each methodName(char ch) in the Character class there is generally a methodName(int codePoint) overload that handles supplementary characters.
Specify the charset explicitly in String.getBytes(), when converting bytes to a String, and when creating an InputStreamReader or OutputStreamWriter.
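To illustrate the char versus int overloads mentioned above, here is a minimal sketch that counts letters by code point. The class name, the helper countLetters, and the sample code point U+20000 (a CJK ideograph outside the BMP) are assumptions made for the example.

```java
class LetterCheck {
    // Hypothetical helper: counts letters by code point, so a supplementary letter counts once.
    static long countLetters(String s) {
        return s.codePoints().filter(Character::isLetter).count();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x20000)) + "a1";

        // The char overload only sees a lone high surrogate, which is not a letter.
        System.out.println(Character.isLetter(s.charAt(0)));      // false
        // The int overload sees the whole code point.
        System.out.println(Character.isLetter(s.codePointAt(0))); // true
        System.out.println(countLetters(s));                      // 2
    }
}
```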

Java 21 added new methods such as java.lang.Character.isEmoji, along with new emoji-related regex properties based on the Unicode emoji data. These functions can be helpful if you currently rely on a third-party library for emoji detection.

import static java.lang.Character.isEmoji;

import java.util.regex.Pattern;

// Requires Java 21 or later.
public class EmojiChecks {

    public static void main(String[] args) {
        System.out.println('☺' + " isEmoji : " + isEmoji('☺')); // true
        System.out.println('❌' + " isEmoji : " + isEmoji('❌')); // true
        System.out.println('ž' + " isEmoji : " + isEmoji('ž')); // false

        emojiChecks("A");
        emojiChecks("©");
        emojiChecks("☺");
        emojiChecks("\uD83D\uDE0A");
    }

    private static void emojiChecks(String emoji) {
        // Checks the input against the emoji-related Unicode properties exposed as regex properties.
        // Ref: https://unicode.org/reports/tr51/#Emoji_Properties_and_Data_Files
        final Pattern emojiPattern = Pattern.compile("\\p{IsEmoji}");
        final Pattern emojiModifierBasePattern = Pattern.compile("\\p{IsEmoji_Modifier_Base}");
        final Pattern emojiComponentPattern = Pattern.compile("\\p{IsEmoji_Component}");
        final Pattern emojiPresentationPattern = Pattern.compile("\\p{IsEmoji_Presentation}");
        final Pattern isExtendedPictographicPattern = Pattern.compile("\\p{IsExtended_Pictographic}");
        System.out.println(emoji + " IsEmoji: " + emojiPattern.matcher(emoji).matches());
        System.out.println(emoji + " IsEmojiModifierBase: " + emojiModifierBasePattern.matcher(emoji).matches());
        System.out.println(emoji + " IsEmojiComponent: " + emojiComponentPattern.matcher(emoji).matches());
        System.out.println(emoji + " IsEmojiPresentation: " + emojiPresentationPattern.matcher(emoji).matches());
        System.out.println(emoji + " IsExtended_Pictographic: " + isExtendedPictographicPattern.matcher(emoji).matches());
        System.out.println("----------------------------------------");
    }
}

// output
☺ isEmoji : true
❌ isEmoji : true
ž isEmoji : false
A IsEmoji: false
A IsEmojiModifierBase: false
A IsEmojiComponent: false
A IsEmojiPresentation: false
A IsExtended_Pictographic: false
----------------------------------------
© IsEmoji: true
© IsEmojiModifierBase: false
© IsEmojiComponent: false
© IsEmojiPresentation: false
© IsExtended_Pictographic: true
----------------------------------------
☺ IsEmoji: true
☺ IsEmojiModifierBase: false
☺ IsEmojiComponent: false
☺ IsEmojiPresentation: false
☺ IsExtended_Pictographic: true
----------------------------------------
😊 IsEmoji: true
😊 IsEmojiModifierBase: false
😊 IsEmojiComponent: false
😊 IsEmojiPresentation: true
😊 IsExtended_Pictographic: true
----------------------------------------

Ref:
https://coolsymbol.com/emojis/emoji-for-copy-and-paste.html#objects
https://www.online-toolz.com/tools/text-unicode-entities-convertor.php
https://www.ibm.com/developerworks/library/j-unicode/index.html
https://www.oracle.com/technetwork/articles/javaee/supplementary-142654.html

Other terms worth exploring: normalization, BiDi


萤火眠眠 2024-11-12 23:50:10

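What the documentation is saying is that invalid UTF-16 strings may become valid after calling the reverse method, since they might be the reverses of valid strings. A surrogate pair is a pair of 16-bit values in UTF-16 that together encode a single Unicode code point; the low and high surrogates are the two halves of that encoding.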
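A short sketch of both effects, using StringBuilder (which documents the same surrogate handling for reverse() as StringBuffer). The naive reverseByChar helper and the class name are hypothetical and exist only to show what happens when pairs are not treated as a unit.

```java
class ReverseDemo {
    // Hypothetical naive reversal by char: swaps the two halves of every surrogate pair.
    static String reverseByChar(String s) {
        char[] in = s.toCharArray();
        char[] out = new char[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = in[in.length - 1 - i];
        }
        return new String(out);
    }

    public static void main(String[] args) {
        String bridge = new String(Character.toChars(0x1F309));
        String s = "a" + bridge;

        // reverse() keeps each surrogate pair in high-low order.
        String good = new StringBuilder(s).reverse().toString();
        System.out.println(good.codePointAt(0) == 0x1F309); // true

        // The naive version leaves a low surrogate in front of its high surrogate.
        String bad = reverseByChar(s);
        System.out.println(Character.isLowSurrogate(bad.charAt(0))); // true

        // An ill-formed low-high sequence becomes a valid pair once reversed.
        String illFormed = "\uDF09\uD83C";
        System.out.println(new StringBuilder(illFormed).reverse().toString().equals(bridge)); // true
    }
}
```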


救赎№ 2024-11-12 23:50:10


Small preface

Unicode assigns code points to characters. Each code point can be encoded in 8-, 16-, or 32-bit blocks according to the Unicode standard.
Prior to Version 3.1, the encodings mostly in use were the 8-bit encoding known as UTF-8 and the 16-bit encoding known as UCS-2, or "Universal Character Set coded in 2 octets". UTF-8 encodes a code point as a sequence of 1-byte blocks, while UCS-2 always takes 2 bytes:
A = 41 - one block of 8-bits with UTF-8
A = 0041 - one block of 16-bits with UCS-2
Ω = CE A9 - two blocks of 8-bits with UTF-8
Ω = 03A9 - one block of 16-bits with UCS-2

Problem

The consortium thought that 16 bits, which give 2^16 = 65,536 possible code values, would be enough to cover every human-readable language. This was true for Plane 0, also known as the BMP or Basic Multilingual Plane, which today assigns 55,445 of its 65,536 code points. The BMP covers almost every human language in the world, including Chinese, Japanese, and Korean (CJK) symbols.

Time passed and new Asian character sets were added; Chinese symbols alone took more than 70,000 code points. Now there are even emoji code points in the standard. Sixteen additional planes were added, and UCS-2 had no room to cover anything beyond Plane 0.

Unicode decision

Limit Unicode to 17 planes × 65,536 code points per plane = 1,114,112 code points at most.
Introduce UTF-32, formerly known as UCS-4, which uses 32 bits for each code point and covers all planes.
Continue to use UTF-8 as a variable-length encoding, limited to a maximum of 4 bytes per code point, i.e. from 1 to 4 bytes per code point.
Deprecate UCS-2.
Create UTF-16 based on UCS-2. Make UTF-16 variable-length, so that it takes 2 or 4 bytes per code point. Reserve the 1,024 code points U+D800–U+DBFF, called high surrogates, for UTF-16; reserve the 1,024 code points U+DC00–U+DFFF, called low surrogates, for UTF-16.

With those changes, the BMP is covered by one 16-bit block in UTF-16, while all "supplementary characters" are covered by surrogate pairs of two 16-bit blocks each, for a total of 1,024 × 1,024 = 1,048,576 code points.

A high surrogate precedes a low surrogate. Any deviation from this rule is considered ill-formed encoding. For example, a surrogate without its partner is incorrect, and a low surrogate standing before a high surrogate is incorrect.
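The pairing rule can be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper isWellFormed; it relies only on Character.isHighSurrogate and Character.isLowSurrogate.

```java
class Utf16Validator {
    // Hypothetical helper: true when every high surrogate is immediately followed by a
    // low surrogate and no low surrogate appears on its own.
    static boolean isWellFormed(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // high surrogate without a partner
                }
                i++; // skip the low surrogate that completes the pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate with no preceding high surrogate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("a\uD834\uDD1Eb")); // true: proper pair
        System.out.println(isWellFormed("\uD834"));         // false: unpaired high surrogate
        System.out.println(isWellFormed("\uDD1E\uD834"));   // false: low before high
    }
}
```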

𝄞, 'MUSICAL SYMBOL G CLEF' (U+1D11E), is encoded in UTF-16 as the surrogate pair 0xD834 0xDD1E (2 units of 2 bytes),
in UTF-8 as 0xF0 0x9D 0x84 0x9E (4 units of 1 byte),
in UTF-32 as 0x0001D11E (1 unit of 4 bytes).
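These byte sequences can be reproduced in Java. A small sketch follows; the class name and the hex helper are illustrative. Only the UTF-8 and UTF-16BE charsets from StandardCharsets are used, and the UTF-32 value is shown as the zero-padded code point instead of going through a UTF-32 charset, which this example does not assume is available.

```java
import java.nio.charset.StandardCharsets;

class GClefEncodings {
    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(hex(clef.getBytes(StandardCharsets.UTF_8)));    // F0 9D 84 9E
        System.out.println(hex(clef.getBytes(StandardCharsets.UTF_16BE))); // D8 34 DD 1E
        // The UTF-32 code unit is simply the code point, padded to 32 bits.
        System.out.println(String.format("%08X", clef.codePointAt(0)));    // 0001D11E
    }
}
```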

Current situation

Although the standard reserves the surrogates for UTF-16 only, some Windows and Java applications historically produced UTF-8 and UCS-2 data containing code points that now fall in the surrogate range.
To support legacy applications with incorrect UTF-8/UTF-16 encodings, a new encoding, WTF-8 (Wobbly Transformation Format), was created. It supports arbitrary surrogate code points, such as an unpaired surrogate or an incorrectly ordered sequence. Today, some products do not comply with the standard and treat UTF-8 as WTF-8.
The surrogate solution also opened up some security problems, including attempts to exploit "illegal surrogate pairs".

Many historical details were omitted to stay on topic ⚖.
The latest Unicode Standard can be found at http://www.unicode.org/versions/latest
