Arabic: "source" Unicode to final display Unicode

Posted 2024-12-10 15:05:20, 705 words, 0 views, 0 comments

Simple question:

This is the final display string I am looking for:

لعبة ديدة

Below are the separate characters before being 'glued' together (I've put a space between each of them to stop the joining):

ل ع ب ة د ي د ة

Note how they are NOT the same characters: some magical transform melds them together and converts them to new Unicode characters.

On top of that, the characters appear right to left on screen (in memory, they are stored left to right).

So my simple question is this: where can I get a platform-independent C/C++ function that takes my source 16-bit Unicode string and transforms it into the Unicode string that produces the first string quoted above, doing both the RTL conversion and the joining?

That's all I want: one function that does that.

UPDATE:

OK, yes, I know that the 'characters' are the same in the two examples above; they are the same 'letters'. But (viewing in Chrome or the latest IE) anyone can CLEARLY see that the glyphs are different. I'm fairly confident that this transform can be done at the Unicode level, because my font file and the Unicode standard both seem to specify different glyphs for the separate and the various joined versions of the characters/letters (unicode.org/charts/PDF/UFB50.pdf, unicode.org/charts/PDF/UFE70.pdf).

So, can I just put my Unicode into a function and get the transformed Unicode out?

Comments (5)

空袭的梦i 2024-12-17 15:05:20

The joining and the RTL conversion don't happen at the level of Unicode characters.

In other words: neither the order of the characters nor the actual Unicode code points change during this process.

In fact, both the joining and the RTL/LTR transitions are handled by the text rendering engine.

This quote from the Wikipedia article on the Arabic alphabet explains it quite nicely:

Finally, the Unicode encoding of Arabic is in logical order, that is, the characters are entered, and stored in computer memory, in the order that they are written and pronounced without worrying about the direction in which they will be displayed on paper or on the screen. Again, it is left to the rendering engine to present the characters in the correct direction, using Unicode's bi-directional text features. In this regard, if the Arabic words on this page are written left to right, it is an indication that the Unicode rendering engine used to display them is out-of-date.
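
The point that the code points do not change can be checked directly. The following is a minimal sketch, not part of the original answer: it stores the two strings from the question in logical order and compares them once the spaces are ignored. The names `without_spaces`, `same_letters`, `joined` and `separated` are made up for this illustration.

```cpp
#include <string>

// Remove U+0020 spaces so two strings can be compared letter by letter.
std::u16string without_spaces(const std::u16string& s) {
    std::u16string out;
    for (char16_t c : s) {
        if (c != u' ') out.push_back(c);
    }
    return out;
}

// The joined string from the question, in logical (memory) order:
// lam, ain, beh, teh marbuta, space, dal, yeh, dal, teh marbuta.
const std::u16string joined =
    u"\u0644\u0639\u0628\u0629 \u062F\u064A\u062F\u0629";

// The same letters with a space between them, which stops the joining.
const std::u16string separated =
    u"\u0644 \u0639 \u0628 \u0629 \u062F \u064A \u062F \u0629";

// True when both strings hold the same code points once spaces are ignored.
bool same_letters(const std::u16string& a, const std::u16string& b) {
    return without_spaces(a) == without_spaces(b);
}
```

Calling `same_letters(joined, separated)` returns true: the only difference between the two strings is the extra U+0020 characters, so the different shapes on screen come entirely from the renderer.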

寂寞笑我太脆弱 2024-12-17 15:05:20

The processing you're looking for is called shaping: picking the contextual (joined) form of each letter, plus any ligatures. Unlike most Latin-based writing systems, where you can simply put one character after another to render the text, joining is fundamental in Arabic. The substitution is done in the text rendering engine, and the joining and ligature data are generally stored in the font file.

note how they are NOT the same characters

They are the same for an Arabic reader; the separated version is still readable.
There is no transform to do on your UTF-16 source text. You must pass the whole string to your text renderer. In C/C++, since you want to stay platform independent, you can use Pango for rendering.

Note: perhaps you meant to write لعبة جديدة (i.e. "new game")? What you give as an example has no meaning in Arabic.

难理解 2024-12-17 15:05:20

I realise this is an old question, but what you're looking for is FriBidi, the GNU implementation of the Unicode bidirectional algorithm (http://www.unicode.org/reports/tr9/).

This library does the glyph selection that was asked about in the question, as well as handling bidirectional text (a mixture of right-to-left and left-to-right text).
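
To give a feel for what the reordering half of that job involves, here is a deliberately naive sketch. It is not FriBidi's API and not a conforming implementation of the bidirectional algorithm: it assumes a right-to-left paragraph, treats only ASCII letters and digits as left-to-right, and ignores neutrals, embedding levels and mirroring (so a phrase of two Latin words separated by a space would come out with the words swapped). The names `is_ltr_char` and `naive_visual_order` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: treat only ASCII letters and digits as left-to-right.
bool is_ltr_char(char16_t c) {
    return (c >= u'0' && c <= u'9') || (c >= u'A' && c <= u'Z') ||
           (c >= u'a' && c <= u'z');
}

// Convert logical order to a left-to-right visual order for an RTL paragraph:
// reverse the whole line, then put each run of LTR characters back in order.
std::u16string naive_visual_order(const std::u16string& logical) {
    std::u16string visual(logical.rbegin(), logical.rend());
    std::size_t i = 0;
    while (i < visual.size()) {
        if (!is_ltr_char(visual[i])) {
            ++i;
            continue;
        }
        std::size_t j = i;
        while (j < visual.size() && is_ltr_char(visual[j])) ++j;
        std::reverse(visual.begin() + i, visual.begin() + j);
        i = j;
    }
    return visual;
}
```

For the logical string lam, ain, space, "42", the result reads "42", space, ain, lam from left to right, which is how an RTL line with an embedded number is laid out. A real implementation such as FriBidi handles the many cases this sketch skips.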

独﹏钓一江月 2024-12-17 15:05:20

What you are looking for is an Arabic script synthesis algorithm. I'm not aware of one that exists as open source. If you find one, please post it.

Some points:

At the storage level, there is no Unicode transform. The string has a single abstract representation, as pointed out by the other answers.

At the rendering level, you could choose to use the Unicode Presentation Forms, but you could also choose other forms. The Unicode Presentation Forms are not a standard for what the presentation output encoding should be; rather, they are just one example of presentation codes that a rendering engine can output using script synthesis.

To make it clearer: there is no single standard transform (i.e. synthesis algorithm) from A to B, where A is the standard Unicode Arabic block and B is the Unicode Arabic Presentation Forms. Rather, there are different transforms that vary in complexity and can target different encoding systems for B; the Unicode Presentation Forms are just one of the encodings that can be used for B.
For example, a simple typewriter style requires only a simple rendering algorithm, with no need for Presentation Forms. Indeed, there are modern writing styles (though not in common use) where A and B are actually identical, and only a different font page is used for rendering. On the other hand, the transform for typesetting or traditional calligraphic forms is more complex and requires something similar to the Unicode Presentation Forms.
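
As one illustration of such a transform from A to B, here is a minimal sketch that maps letters from the basic Arabic block to Presentation Forms-B code points based on their neighbours. It is an assumption-laden toy, not a standard algorithm: the table covers only the letters used in this thread, and it ignores diacritics, the lam-alef ligatures, ZWJ/ZWNJ and everything else a real shaper handles. The names `Forms`, `kForms`, `find_forms` and `shape_to_presentation_forms` are invented for this example. The output is still in logical order; reordering for display is a separate step.

```cpp
#include <cstddef>
#include <string>

// Presentation Forms-B code points for one letter; 0 means "no such form".
struct Forms {
    char16_t base;      // code point as stored in normal text
    char16_t isolated;  // not joined on either side
    char16_t final_;    // joined to the previous letter only
    char16_t initial;   // joined to the next letter only
    char16_t medial;    // joined on both sides
};

// Only the letters used in this thread; a real table covers the whole script.
const Forms kForms[] = {
    {0x0628, 0xFE8F, 0xFE90, 0xFE91, 0xFE92},  // beh
    {0x0629, 0xFE93, 0xFE94, 0, 0},            // teh marbuta (joins to previous only)
    {0x062C, 0xFE9D, 0xFE9E, 0xFE9F, 0xFEA0},  // jeem
    {0x062F, 0xFEA9, 0xFEAA, 0, 0},            // dal (joins to previous only)
    {0x0639, 0xFEC9, 0xFECA, 0xFECB, 0xFECC},  // ain
    {0x0644, 0xFEDD, 0xFEDE, 0xFEDF, 0xFEE0},  // lam
    {0x064A, 0xFEF1, 0xFEF2, 0xFEF3, 0xFEF4},  // yeh
};

// Look up a letter in the table; returns nullptr for anything else.
const Forms* find_forms(char16_t c) {
    for (const Forms& f : kForms) {
        if (f.base == c) return &f;
    }
    return nullptr;
}

// Replace each known letter with its contextual presentation form.
std::u16string shape_to_presentation_forms(const std::u16string& logical) {
    std::u16string out;
    for (std::size_t i = 0; i < logical.size(); ++i) {
        const Forms* cur = find_forms(logical[i]);
        if (!cur) {  // spaces, digits, unknown letters pass through unchanged
            out.push_back(logical[i]);
            continue;
        }
        const Forms* prev = i > 0 ? find_forms(logical[i - 1]) : nullptr;
        const Forms* next =
            i + 1 < logical.size() ? find_forms(logical[i + 1]) : nullptr;
        // The previous letter must be able to connect forward.
        bool join_prev = prev && prev->initial != 0;
        // This letter must be able to connect forward to a following letter.
        bool join_next = next && cur->initial != 0;
        if (join_prev && join_next) out.push_back(cur->medial);
        else if (join_prev) out.push_back(cur->final_);
        else if (join_next) out.push_back(cur->initial);
        else out.push_back(cur->isolated);
    }
    return out;
}
```

For the string in the question, this yields U+FEDF U+FECC U+FE92 U+FE94, a space, then U+FEA9 U+FEF3 U+FEAA U+FE93. Whether that output is useful depends on the target: a renderer with proper shaping support does not need it, and a font without the presentation-form glyphs cannot display it.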

Here are a couple of pointers for more information on the topic:
