How to identify RTL strings in C++

Posted on 2024-10-27 05:46:16

I need to know the direction of my text before printing.

I'm using Unicode Characters.

How can I do that in C++?

Comments (4)

不打扰别人 2024-11-03 05:46:16

If you don't want to use ICU, you can always parse the Unicode Character Database yourself (e.g., with a Python script). The relevant file, UnicodeData.txt, is a semicolon-separated text file in which each line describes a character code point. Look at the fifth field in each line: that is the character's bidirectional class. If it is R or AL, you have an RTL character; L marks an LTR character. The other classes are weak or neutral types (like numerals), which I guess you'd want to ignore. Using that information, you can generate a lookup table of all RTL characters and then use it in your C++ code. If you really care about code size, you can shrink the lookup table by storing ranges instead of one entry per character, since most characters come in blocks that share a BiDi class.
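To make the field layout concrete, here is a minimal sketch of extracting the bidirectional class from one line of the database. The names BidiRecord, ParseUnicodeDataLine and IsStrongRtl are hypothetical helpers invented for this illustration, and the sketch assumes the line has already been read from the file into a std::string.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical record type: the two fields of a database line we care about.
struct BidiRecord {
    std::uint32_t codePoint;  // field 0, written in hexadecimal
    std::string bidiClass;    // fifth field (index 4), e.g. "L", "R", "AL"
};

// Splits one semicolon-separated line and fills `out`.
// Returns false if the line is malformed.
bool ParseUnicodeDataLine(const std::string& line, BidiRecord& out) {
    std::vector<std::string> fields;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, ';')) {
        fields.push_back(field);
    }
    if (fields.size() < 5 || fields[0].empty()) {
        return false;
    }
    try {
        std::size_t used = 0;
        unsigned long cp = std::stoul(fields[0], &used, 16);
        if (used != fields[0].size()) {
            return false;  // trailing junk after the hex digits
        }
        out.codePoint = static_cast<std::uint32_t>(cp);
        out.bidiClass = fields[4];
        return true;
    } catch (const std::exception&) {
        return false;  // not a hexadecimal number, or out of range
    }
}

// True for the two strong right-to-left classes named in the answer.
bool IsStrongRtl(const std::string& bidiClass) {
    return bidiClass == "R" || bidiClass == "AL";
}
```

A generator script or tool would call ParseUnicodeDataLine on every line and merge consecutive code points with the same class into ranges before emitting the C++ table.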

Now, define a function called GetCharDirection(wchar_t ch) which returns an enum value (say: Dir_LTR, Dir_RTL or Dir_Neutral) by checking the lookup table.
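A rough sketch of such a function, using a sorted range table and a binary search, might look like this. Two things here are assumptions rather than part of the answer: the function takes char32_t instead of wchar_t (wchar_t is only 16 bits wide on Windows and cannot hold every code point), and the table kDirRanges is a tiny illustrative subset, not the generated table.

```cpp
#include <algorithm>
#include <iterator>

enum CharDirection { Dir_LTR, Dir_RTL, Dir_Neutral };

// Hypothetical table entry: an inclusive code point range and its direction.
struct DirRange {
    char32_t first;
    char32_t last;
    CharDirection dir;
};

// Illustrative subset only. A real table would be generated from the
// Unicode data. Entries must be sorted by `first` and must not overlap.
static const DirRange kDirRanges[] = {
    {0x0041, 0x005A, Dir_LTR},  // Basic Latin uppercase letters
    {0x0061, 0x007A, Dir_LTR},  // Basic Latin lowercase letters
    {0x05D0, 0x05EA, Dir_RTL},  // Hebrew letters
    {0x0620, 0x064A, Dir_RTL},  // Arabic letters
};

CharDirection GetCharDirection(char32_t ch) {
    const DirRange* begin = std::begin(kDirRanges);
    const DirRange* end = std::end(kDirRanges);
    // Find the first range that starts after ch; the candidate range,
    // if any, is the one just before it.
    const DirRange* it = std::upper_bound(
        begin, end, ch,
        [](char32_t value, const DirRange& r) { return value < r.first; });
    if (it == begin) {
        return Dir_Neutral;  // ch lies before every range in the table
    }
    --it;
    return ch <= it->last ? it->dir : Dir_Neutral;
}
```

Anything not covered by the table falls through to Dir_Neutral, so with this subset a character such as a digit or a space is reported as neutral.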

Now you can define a function GetStringDirection(const wchar_t*) which runs through all characters in the string until it encounters a character which is not Dir_Neutral. This first non-neutral character in the string should set the base direction for that string. Or at least that's how ICU seems to work.
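The first-strong-character scan could be sketched as follows. This version assumes the text is already available as UTF-32 in a std::u32string rather than a const wchar_t*, and ClassifyCodePoint is a hypothetical stand-in that only knows a few ranges; in practice it would be replaced by the table-driven lookup.

```cpp
#include <string>

enum TextDirection { Dir_LTR, Dir_RTL, Dir_Neutral };

// Stand-in classifier for illustration only: ASCII letters are LTR,
// Hebrew and Arabic letters are RTL, everything else is neutral.
TextDirection ClassifyCodePoint(char32_t ch) {
    if ((ch >= U'A' && ch <= U'Z') || (ch >= U'a' && ch <= U'z')) {
        return Dir_LTR;
    }
    if ((ch >= 0x05D0 && ch <= 0x05EA) || (ch >= 0x0620 && ch <= 0x064A)) {
        return Dir_RTL;
    }
    return Dir_Neutral;
}

// Returns the direction of the first non-neutral character, or
// Dir_Neutral if the string contains no strongly directional character.
TextDirection GetStringDirection(const std::u32string& text) {
    for (char32_t ch : text) {
        TextDirection dir = ClassifyCodePoint(ch);
        if (dir != Dir_Neutral) {
            return dir;
        }
    }
    return Dir_Neutral;
}
```

With this sketch, a string that starts with digits followed by Hebrew text is reported as Dir_RTL, because the digits are skipped as neutral. The caller still has to decide what to do with a Dir_Neutral result, for example by falling back to the direction of the surrounding UI.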

羞稚 2024-11-03 05:46:16

You could use the ICU library, which has functions for this (ubidi_getDirection and ubidi_getBaseDirection).

The size of ICU can be reduced by recompiling the data library (which is normally about 15 MB) to include only the converters and locales that the project needs.

The section "Reducing the Size of ICU's Data: Conversion Tables" at http://userguide.icu-project.org/icudata explains how to reduce the size of the data library.

If you only need support for the most common encodings (US-ASCII, ISO-8859-1, UTF-7/8/16/32, SCSU, BOCU-1, CESU-8), the data library won't be needed anyway.

来世叙缘 2024-11-03 05:46:16

Building on what Boaz Yaniv said earlier, something like this may be easier and faster than parsing the whole file:

/* Returns 1 if code point c lies in one of the hard-coded RTL ranges, 0 otherwise. */
int aft_isrtl(int c){
  if (
    (c==0x05BE)||(c==0x05C0)||(c==0x05C3)||(c==0x05C6)||
    ((c>=0x05D0)&&(c<=0x05F4))||
    (c==0x0608)||(c==0x060B)||(c==0x060D)||
    ((c>=0x061B)&&(c<=0x064A))||
    ((c>=0x066D)&&(c<=0x066F))||
    ((c>=0x0671)&&(c<=0x06D5))||
    ((c>=0x06E5)&&(c<=0x06E6))||
    ((c>=0x06EE)&&(c<=0x06EF))||
    ((c>=0x06FA)&&(c<=0x0710))||
    ((c>=0x0712)&&(c<=0x072F))||
    ((c>=0x074D)&&(c<=0x07A5))||
    ((c>=0x07B1)&&(c<=0x07EA))||
    ((c>=0x07F4)&&(c<=0x07F5))||
    ((c>=0x07FA)&&(c<=0x0815))||
    (c==0x081A)||(c==0x0824)||(c==0x0828)||
    ((c>=0x0830)&&(c<=0x0858))||
    ((c>=0x085E)&&(c<=0x08AC))||
    (c==0x200F)||(c==0xFB1D)||
    ((c>=0xFB1F)&&(c<=0xFB28))||
    ((c>=0xFB2A)&&(c<=0xFD3D))||
    ((c>=0xFD50)&&(c<=0xFDFC))||
    ((c>=0xFE70)&&(c<=0xFEFC))||
    ((c>=0x10800)&&(c<=0x1091B))||
    ((c>=0x10920)&&(c<=0x10A00))||
    ((c>=0x10A10)&&(c<=0x10A33))||
    ((c>=0x10A40)&&(c<=0x10B35))||
    ((c>=0x10B40)&&(c<=0x10C48))||
    ((c>=0x1EE00)&&(c<=0x1EEBB))
  ) return 1;
  return 0;
}

如此安好 2024-11-03 05:46:16

If you are using Windows GDI, it would seem that GetFontLanguageInfo(HDC) returns a DWORD; if GCP_REORDER is set, the language requires reordering for display, for example, Hebrew or Arabic.
