How do I flip text horizontally?

Posted 2024-12-28 19:51:33

I need to write a function that will flip all the characters of a string left-to-right.

e.g.:

Thė quiçk ḇrown fox jumṕềᶁ ovểr thë lⱥzy ȡog.

should become

.goȡ yzⱥl ëht rểvo ᶁềṕmuj xof nworḇ kçiuq ėhT

I can limit the question to UTF-16 (which has the same problems as UTF-8, just less often).

Naive solution

A naive solution might flip everything one unit at a time: word-for-word, where a word is 16 bits (I would have said byte-for-byte if we could assume that a byte was 16 bits), or Char-for-Char, where Char is the data type that holds a single 16-bit UTF-16 code unit:

String original = "ɗỉf̴ḟếr̆ęnͥt";
String flipped = "";
foreach (Char c in original)
{
   // Prepend each 16-bit Char, so the units end up in reverse order
   flipped = c + flipped;
}

Results in the incorrectly flipped text:

  • ɗỉf̴ḟếr̆ęnͥt
  • ̨tͥnę̆rếḟ̴fỉɗ

This is because one "character" takes multiple "code points".

  • ɗỉf̴ḟếr̆ęnͥt
  • ɗ ỉ f ˜ ḟ ế r ˘ ę n i t ˛

and flipping each "code point" gives:

  • ˛ t i n ę ˘ r ế ḟ ˜ f ỉ ɗ

The result is not just malformed (once surrogate pairs are involved it is not even valid UTF-16), it's not the same characters.

Failure

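The problem happens in UTF-16 encoding when there are:

  • combining diacritics
  • characters in another lingual plane (outside the Basic Multilingual Plane, so encoded as a surrogate pair)

Those same issues happen in UTF-8 encoding, with the additional case: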

  • any character outside the 0..127 ASCII range
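Both UTF-16 failure cases can be reproduced in a few lines. The sketch below is in Java, chosen only because its strings are UTF-16 like those of C# and Delphi; the class and method names are made up for illustration. It reverses code unit by code unit, as the naive loop does.

```java
final class NaiveFlipDemo {
    // Reverses the string one UTF-16 code unit (Java char) at a time,
    // which is what the naive loop in the question does.
    static String naiveFlip(String original) {
        StringBuilder flipped = new StringBuilder(original.length());
        for (int i = original.length() - 1; i >= 0; i--) {
            flipped.append(original.charAt(i));
        }
        return flipped.toString();
    }

    public static void main(String[] args) {
        // Case 1: 'e' + U+0301 COMBINING ACUTE ACCENT, then 'x'.
        // After the flip the accent sits after 'x' instead of 'e'.
        String combining = "e\u0301x";
        System.out.println(naiveFlip(combining).equals("x\u0301e"));

        // Case 2: U+1D11E MUSICAL SYMBOL G CLEF is the surrogate pair D834 DD1E.
        // After the flip the low surrogate comes first: ill-formed UTF-16.
        String flipped = naiveFlip("a\uD834\uDD1E");
        System.out.println(Character.isLowSurrogate(flipped.charAt(0)));
        System.out.println(Character.isHighSurrogate(flipped.charAt(1)));
    }
}
```

Note that Java's own StringBuilder.reverse() keeps surrogate pairs together but still separates combining marks from their base, so it only fixes the second case.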

I can limit myself to the simpler UTF-16 encoding, since that's the encoding used by the languages I'm working in (e.g. C#, Delphi).

The problem, it seems to me, is discovering if a number of subsequent code points are combining characters, and need to come along with the base glyph.
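One way to make that check concrete is to ask for each code point's Unicode general category and treat Mn, Mc and Me as "combining". This is a rough sketch, assuming the runtime exposes category lookups (Java's Character.getType here; C# has CharUnicodeInfo.GetUnicodeCategory). The names isCombiningMark and endOfRun are hypothetical, and this is a simplification of the full Unicode segmentation rules.

```java
final class CombiningRuns {
    // True if the code point is a combining mark (general category Mn, Mc or Me).
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // Returns the index (in UTF-16 code units) just past the character that
    // starts at 'start' and every combining mark that directly follows it.
    static int endOfRun(String text, int start) {
        int i = start + Character.charCount(text.codePointAt(start));
        while (i < text.length() && isCombiningMark(text.codePointAt(i))) {
            i += Character.charCount(text.codePointAt(i));
        }
        return i;
    }

    public static void main(String[] args) {
        // n + U+0365 COMBINING LATIN SMALL LETTER I, t + U+0328 COMBINING OGONEK
        String text = "n\u0365t\u0328";
        System.out.println(endOfRun(text, 0)); // 2
        System.out.println(endOfRun(text, 2)); // 4
    }
}
```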

It's also fun to watch an online text reverser site fail to take this into account.

Note:

  • any solution should assume that I don't have access to a UTF-32 encoding library (mainly because I don't have access to any UTF-32 encoding library)
  • access to a UTF-32 encoding library would solve the UTF-8/UTF-16 lingual planes problem, but not the combining diacritics problem

Comments (3)

被翻牌 2025-01-04 19:51:33

The term you're looking for is “grapheme cluster”, as defined in Unicode TR29 (Unicode Text Segmentation), section “Grapheme Cluster Boundaries”.

Group the UTF-16 code units into Unicode code points (=characters) using the surrogate algorithm (easy), then group the characters into grapheme clusters using the Grapheme_Cluster_Break rules. Finally reverse the group order.

You will need a copy of the Unicode character database in order to recognise grapheme cluster boundaries. That's already going to take up a considerable amount of space, so you're probably going to want to get a library to do it. For example, in ICU you might use a character BreakIterator (BreakIterator::createCharacterInstance in C++, BreakIterator.getCharacterInstance in Java), which is misleadingly named as it works on grapheme clusters, not ‘characters’ as Unicode knows them.
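As a sketch of that approach, the code below uses the JDK's built-in java.text.BreakIterator rather than ICU itself (ICU4J's com.ibm.icu.text.BreakIterator offers the same getCharacterInstance entry point). The class name GraphemeFlip is made up for illustration, and how closely the JDK follows the current TR29 rules depends on the JDK version.

```java
import java.text.BreakIterator;

final class GraphemeFlip {
    // Reverses a string cluster by cluster, so combining marks and
    // surrogate pairs stay attached to the character they belong to.
    static String flip(String original) {
        BreakIterator clusters = BreakIterator.getCharacterInstance();
        clusters.setText(original);
        StringBuilder flipped = new StringBuilder(original.length());

        // Walk the cluster boundaries from the end of the text to the start.
        int end = clusters.last();
        for (int start = clusters.previous();
             start != BreakIterator.DONE;
             start = clusters.previous()) {
            flipped.append(original, start, end); // copy one whole cluster
            end = start;
        }
        return flipped.toString();
    }

    public static void main(String[] args) {
        // 'e' + U+0301 COMBINING ACUTE ACCENT stays together after the flip.
        System.out.println(flip("e\u0301x").equals("xe\u0301"));
        // The surrogate pair for U+1D11E stays in high-low order.
        System.out.println(flip("a\uD834\uDD1Eb").equals("b\uD834\uDD1Ea"));
    }
}
```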

终难愈 2025-01-04 19:51:33

If you work in UTF-32, you solve the non-base-plane issue. Converting from UTF-8 or UTF-16 to UTF-32 (and back) is relatively simple bit twiddling (see Wikipedia). You don't have to have a library for it.
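The bit twiddling for the UTF-16 side looks roughly like this. It is a minimal sketch with hypothetical names (Utf16Codec, decode, encode); it passes unpaired surrogates through unchanged rather than rejecting them, which a stricter converter might not want.

```java
import java.util.Arrays;

final class Utf16Codec {
    // Decodes UTF-16 code units into code points (UTF-32 values).
    static int[] decode(char[] units) {
        int[] out = new int[units.length];
        int n = 0;
        for (int i = 0; i < units.length; i++) {
            char hi = units[i];
            boolean pair = hi >= 0xD800 && hi <= 0xDBFF
                && i + 1 < units.length
                && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF;
            if (pair) {
                char lo = units[++i];
                // 10 bits from each surrogate, offset by 0x10000.
                out[n++] = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
            } else {
                out[n++] = hi; // BMP character, or an unpaired surrogate kept as-is
            }
        }
        return Arrays.copyOf(out, n);
    }

    // Encodes code points back into UTF-16 code units.
    static char[] encode(int[] codePoints) {
        char[] out = new char[codePoints.length * 2];
        int n = 0;
        for (int cp : codePoints) {
            if (cp >= 0x10000) {
                int v = cp - 0x10000;
                out[n++] = (char) (0xD800 + (v >> 10));   // high surrogate
                out[n++] = (char) (0xDC00 + (v & 0x3FF)); // low surrogate
            } else {
                out[n++] = (char) cp;
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] cps = decode("a\uD834\uDD1E".toCharArray());
        System.out.println(Integer.toHexString(cps[1])); // 1d11e
        System.out.println(new String(encode(cps)).equals("a\uD834\uDD1E"));
    }
}
```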

Most of the combining characters are in a few ranges. You could determine those ranges by scanning the Unicode database (see Unicode.org). Hardcode those ranges into your application. With that, you can determine the groups of codepoints that represent a single character. (The drawback is that new combining marks could be introduced in the future, and you'd need to update your table.)

Segment appropriately, reverse the order (segment by segment), and convert back to UTF-8 or UTF-16 (or whatever you want).
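Put together, the table-driven version might look like the sketch below. The range table is an assumption for illustration: it lists only the general-purpose combining blocks and is deliberately incomplete (many scripts, such as Hebrew, Arabic and Devanagari, have their own marks). For brevity it relies on Java 8's String.codePoints() and appendCodePoint for the UTF-16 conversion instead of doing it by hand.

```java
final class RangeTableFlip {
    // Hand-picked Unicode blocks of combining marks; NOT a complete list.
    private static final int[][] COMBINING_RANGES = {
        {0x0300, 0x036F}, // Combining Diacritical Marks
        {0x1AB0, 0x1AFF}, // Combining Diacritical Marks Extended
        {0x1DC0, 0x1DFF}, // Combining Diacritical Marks Supplement
        {0x20D0, 0x20FF}, // Combining Diacritical Marks for Symbols
        {0xFE20, 0xFE2F}, // Combining Half Marks
    };

    static boolean isCombining(int codePoint) {
        for (int[] range : COMBINING_RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    static String flip(String original) {
        int[] cps = original.codePoints().toArray(); // UTF-16 -> code points
        StringBuilder flipped = new StringBuilder(original.length());
        int end = cps.length; // exclusive end of the segment being collected
        for (int i = cps.length - 1; i >= 0; i--) {
            // A segment starts at a non-combining code point; a leading mark
            // with no base (i == 0) forms a segment of its own.
            if (!isCombining(cps[i]) || i == 0) {
                for (int j = i; j < end; j++) {
                    flipped.appendCodePoint(cps[j]); // code point -> UTF-16
                }
                end = i;
            }
        }
        return flipped.toString();
    }

    public static void main(String[] args) {
        System.out.println(flip("n\u0365t\u0328").equals("t\u0328n\u0365"));
        System.out.println(flip("a\uD834\uDD1E").equals("\uD834\uDD1Ea"));
    }
}
```

The trade-off is the one already mentioned: the table is small and needs no Unicode database at run time, but it has to be maintained by hand.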

强者自强 2025-01-04 19:51:33

Text Mechanic's Text Generator seems to do this in JavaScript. I'm sure it would be possible to translate the JS into another language after obtaining the author's consent (if you can find a 'contact' link for that site).
