Why do Unicode U+202E and U+202C cause output text to have different results?

Posted 2025-01-25 02:28:31

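In Java:

If I print "123\u202e987\u202c456abc", the result is displayed as 123456789abc.
If I print "123\u202e987\u202cxyzabc", the result is displayed as 123789xyzabc.

As you can see, changing "456" to "xyz" in the string changes the order of the printed output.

How does this work?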


十二 2025-02-01 02:28:32

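Unicode is doing this, because both characters depend on the text that follows them and alter how it is displayed.

\u202e: reverses text (RIGHT-TO-LEFT OVERRIDE)
\u202c: POP DIRECTIONAL FORMATTING

In the question, 123\u202e987\u202cxyzabc is output as 123789xyzabc. The \u202e causes 987 to be output reversed, as 789, and the \u202c stops the right-to-left override.

In the other case, \u202c is followed by digits, which have weak directionality, so the Unicode algorithm displays those digits ahead of the text reversed by \u202e.

Edit: @skomisa's answer is better.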


冷情 2025-02-01 02:28:31

TLDR: The effect you are seeing arises because digits and alphabetic characters are treated differently by the Unicode algorithm that determines the rendering of text containing format control characters.

For the texts you are displaying:

\u202e is the RIGHT-TO-LEFT OVERRIDE (RLO) character.
\u202c is the POP DIRECTIONAL FORMATTING (PDF) character.
Both are formatting control characters in Unicode, and their sole effect is to impact the appearance of output text.
In your examples, the RLO character specifies that the text which follows is to be displayed from right to left, and the PDF character cancels ("pops") the effect of the RLO.

That explains why the text 123\u202e987\u202cxyzabc in your example is rendered as 123789xyzabc. The RLO (\u202e) causes the text that follows to be rendered in right-to-left order (so 987 is displayed as 789), and the PDF (\u202c) terminates the reversal for the subsequent text.

But it does not explain why 123\u202e987\u202c456abc is rendered as 123456789abc. By that argument, the expected output should be 123789456abc instead.
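It may help to confirm that nothing is reordered in memory: both strings keep their characters in the order they were typed, and only the renderer changes what you see. The following is a minimal sketch, not part of the original answer; the class name LogicalOrderDemo and the helper escape are made up for illustration.

```java
class LogicalOrderDemo {
    // Hypothetical helper: shows every character outside printable ASCII
    // as a visible escape, so the invisible control characters
    // (U+202E and U+202C) can be seen in the output.
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x20 || c > 0x7e) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String withDigits = "123\u202e987\u202c456abc";
        String withLetters = "123\u202e987\u202cxyzabc";

        // The escaped forms show the stored (logical) order, which is
        // identical in both strings apart from the 456 / xyz substitution.
        System.out.println(escape(withDigits));
        System.out.println(escape(withLetters));

        // 12 visible characters plus the 2 control characters.
        System.out.println(withDigits.length());
    }
}
```

Printing the raw strings instead of the escaped forms hands the reordering decision to whatever displays them (terminal, IDE console, browser), which is where the difference described above appears.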

The algorithm used to determine the output in scenarios like this is very complex, but one factor is the directionality of the characters being rendered. Alphabetic characters have strong directionality, but numbers (i.e. digit characters) have weak directionality. For full details see Unicode® Standard Annex #9, Unicode Bidirectional Algorithm, and especially section 3.3.4 Resolving Weak Types.

That document provides an example similar to yours, with text containing a RIGHT-TO-LEFT EMBEDDING (RLE) character (rather than an RLO), later followed by a PDF and some trailing text containing digits:

Memory: it is called "[RLE]AN INTRODUCTION TO java[PDF]" - $19.95 in hardcover.

Display: it is called "$19.95 - "java OT NOITCUDORTNI NA in hardcover.

Note that in their example it wasn't just the digits that were moved. The dollar sign and the period were as well, because all six of the characters in the text $19.95 have weak directionality.

Notes:

You can get the directionality category of any Unicode character in Java using Character.getDirectionality(int codePoint).
The Unicode document linked above is heavy reading. Basic introductions to bidirectional text include W3C's Unicode Bidirectional Algorithm basics and Unicode's Writing Direction and Bidirectional Text FAQ.
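As a rough illustration of the first note, the sketch below uses Character.getDirectionality to label the characters from the question and from the $19.95 example. The class name DirectionalityDemo, the helper describe, and the label strings are assumptions made for this example; only the bidi classes relevant here are mapped, and everything else falls through to "other".

```java
class DirectionalityDemo {
    // Hypothetical helper: maps the constant returned by
    // Character.getDirectionality to a short label. Only the classes
    // that matter for this question are covered.
    public static String describe(int codePoint) {
        switch (Character.getDirectionality(codePoint)) {
            case Character.DIRECTIONALITY_LEFT_TO_RIGHT:
                return "L (strong)";
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT:
                return "R (strong)";
            case Character.DIRECTIONALITY_EUROPEAN_NUMBER:
                return "EN (weak)";
            case Character.DIRECTIONALITY_EUROPEAN_NUMBER_TERMINATOR:
                return "ET (weak)";
            case Character.DIRECTIONALITY_COMMON_NUMBER_SEPARATOR:
                return "CS (weak)";
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT_OVERRIDE:
                return "RLO (explicit formatting)";
            case Character.DIRECTIONALITY_POP_DIRECTIONAL_FORMAT:
                return "PDF (explicit formatting)";
            default:
                return "other";
        }
    }

    public static void main(String[] args) {
        // One line per code point: digits report a weak class,
        // letters report a strong one.
        String sample = "123\u202e987\u202c456abc";
        sample.codePoints().forEach(cp ->
            System.out.printf("U+%04X %s%n", cp, describe(cp)));

        // All six characters of the price report weak classes.
        "$19.95".codePoints().forEach(cp ->
            System.out.printf("U+%04X %s%n", cp, describe(cp)));
    }
}
```

The digits come back as EN while x, y, z, a, b, c come back as L, which is the distinction the weak-type rules act on.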