如何检测一个字符是否属于从右到左的语言?

发布于 2024-10-06 02:10:21 字数 501 浏览 7 评论 0 原文

判断字符串是否包含从右到左语言的文本的好方法是什么。

我发现了这个问题 建议采用以下方法:

public bool IsArabic(string strCompare)
{
  char[] chars = strCompare.ToCharArray();
  foreach (char ch in chars)
    if (ch >= '\u0627' && ch <= '\u0649') return true;
  return false;
}

虽然这可能适用于阿拉伯语,但这似乎不适用于其他 RTL 语言,例如希伯来语。是否有通用方法可以知道某个特定字符属于 RTL 语言?

What is a good way to tell whether a string contains text in a Right To Left language.

I have found this question which suggests the following approach:

public bool IsArabic(string strCompare)
{
  char[] chars = strCompare.ToCharArray();
  foreach (char ch in chars)
    if (ch >= '\u0627' && ch <= '\u0649') return true;
  return false;
}

While this may work for Arabic this doesn't seem to cover other RTL languages such as Hebrew. Is there a generic way to know that a particular character belongs to a RTL language?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

魂归处 2024-10-13 02:10:21

Unicode 字符具有不同的相关属性。这些属性不能从代码点导出;您需要一个表格来告诉您某个角色是否具有某种属性。

您对具有双向属性“R”或“AL”(RandALCat) 的字符感兴趣。

RandALCat 角色是具有明确从右到左方向性的角色。

以下是 Unicode 3.2 的完整列表(来自 RFC 3454):

D. Bidirectional tables

D.1 Characters with bidirectional property "R" or "AL"

----- Start Table D.1 -----
05BE
05C0
05C3
05D0-05EA
05F0-05F4
061B
061F
0621-063A
0640-064A
066D-066F
0671-06D5
06DD
06E5-06E6
06FA-06FE
0700-070D
0710
0712-072C
0780-07A5
07B1
200F
FB1D
FB1F-FB28
FB2A-FB36
FB38-FB3C
FB3E
FB40-FB41
FB43-FB44
FB46-FBB1
FBD3-FD3D
FD50-FD8F
FD92-FDC7
FDF0-FDFC
FE70-FE74
FE76-FEFC
----- End Table D.1 -----

以下是一些获取代码Unicode 6.0 的完整列表:

var url = "http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt";

var query = from record in new WebClient().DownloadString(url).Split('\n')
            where !string.IsNullOrEmpty(record)
            let properties = record.Split(';')
            where properties[4] == "R" || properties[4] == "AL"
            select int.Parse(properties[0], NumberStyles.AllowHexSpecifier);

foreach (var codepoint in query)
{
    Console.WriteLine(codepoint.ToString("X4"));
}

请注意,这些值是 Unicode 代码点。 C#/.NET 中的字符串采用 UTF-16 编码,需要首先转换为 Unicode 代码点(请参阅 Char.ConvertToUtf32)。下面是一种检查字符串是否至少包含一个 RandALCat 字符的方法:

static void IsAnyCharacterRightToLeft(string s)
{
    for (var i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
    {
        var codepoint = char.ConvertToUtf32(s, i);
        if (IsRandALCat(codepoint))
        {
            return true;
        }
    }
    return false;
}

Unicode characters have different properties associated with them. These properties cannot be derived from the code point; you need a table that tells you if a character has a certain property or not.

You are interested in characters with bidirectional property "R" or "AL" (RandALCat).

A RandALCat character is a character with unambiguously right-to-left directionality.

Here's the complete list as of Unicode 3.2 (from RFC 3454):

D. Bidirectional tables

D.1 Characters with bidirectional property "R" or "AL"

----- Start Table D.1 -----
05BE
05C0
05C3
05D0-05EA
05F0-05F4
061B
061F
0621-063A
0640-064A
066D-066F
0671-06D5
06DD
06E5-06E6
06FA-06FE
0700-070D
0710
0712-072C
0780-07A5
07B1
200F
FB1D
FB1F-FB28
FB2A-FB36
FB38-FB3C
FB3E
FB40-FB41
FB43-FB44
FB46-FBB1
FBD3-FD3D
FD50-FD8F
FD92-FDC7
FDF0-FDFC
FE70-FE74
FE76-FEFC
----- End Table D.1 -----

Here's some code to get the complete list as of Unicode 6.0:

var url = "http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt";

var query = from record in new WebClient().DownloadString(url).Split('\n')
            where !string.IsNullOrEmpty(record)
            let properties = record.Split(';')
            where properties[4] == "R" || properties[4] == "AL"
            select int.Parse(properties[0], NumberStyles.AllowHexSpecifier);

foreach (var codepoint in query)
{
    Console.WriteLine(codepoint.ToString("X4"));
}

Note that these values are Unicode code points. Strings in C#/.NET are UTF-16 encoded and need to be converted to Unicode code points first (see Char.ConvertToUtf32). Here's a method that checks if a string contains at least one RandALCat character:

static void IsAnyCharacterRightToLeft(string s)
{
    for (var i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
    {
        var codepoint = char.ConvertToUtf32(s, i);
        if (IsRandALCat(codepoint))
        {
            return true;
        }
    }
    return false;
}
触ぅ动初心 2024-10-13 02:10:21

您可以尝试在命名块中使用“命名块”。 Regular-Expressions.info/unicode.html">正则表达式。只需挑选从右到左的块,并形成正则表达式即可。例如:

\p{IsArabic}|\p{IsHebrew}

如果该正则表达式返回 true,则字符串中至少有一个希伯来语或阿拉伯语字符。

You can try using "named blocks" in regular expressions. Just pick out the blocks that are right to left, and form the regex. For example:

\p{IsArabic}|\p{IsHebrew}

If that regex returns true, then there was at least one hebrew or arabic character in the string.

相思碎 2024-10-13 02:10:21

Unicode 6.0 的所有“AL”或“R”(来自 http://www.unicode.org /Public/6.0.0/ucd/UnicodeData.txt

bool hasRandALCat = 0;
if(c >= 0x5BE && c <= 0x10B7F)
{
    if(c <= 0x85E)
    {
        if(c == 0x5BE)                        hasRandALCat = 1;
        else if(c == 0x5C0)                   hasRandALCat = 1;
        else if(c == 0x5C3)                   hasRandALCat = 1;
        else if(c == 0x5C6)                   hasRandALCat = 1;
        else if(0x5D0 <= c && c <= 0x5EA)     hasRandALCat = 1;
        else if(0x5F0 <= c && c <= 0x5F4)     hasRandALCat = 1;
        else if(c == 0x608)                   hasRandALCat = 1;
        else if(c == 0x60B)                   hasRandALCat = 1;
        else if(c == 0x60D)                   hasRandALCat = 1;
        else if(c == 0x61B)                   hasRandALCat = 1;
        else if(0x61E <= c && c <= 0x64A)     hasRandALCat = 1;
        else if(0x66D <= c && c <= 0x66F)     hasRandALCat = 1;
        else if(0x671 <= c && c <= 0x6D5)     hasRandALCat = 1;
        else if(0x6E5 <= c && c <= 0x6E6)     hasRandALCat = 1;
        else if(0x6EE <= c && c <= 0x6EF)     hasRandALCat = 1;
        else if(0x6FA <= c && c <= 0x70D)     hasRandALCat = 1;
        else if(c == 0x710)                   hasRandALCat = 1;
        else if(0x712 <= c && c <= 0x72F)     hasRandALCat = 1;
        else if(0x74D <= c && c <= 0x7A5)     hasRandALCat = 1;
        else if(c == 0x7B1)                   hasRandALCat = 1;
        else if(0x7C0 <= c && c <= 0x7EA)     hasRandALCat = 1;
        else if(0x7F4 <= c && c <= 0x7F5)     hasRandALCat = 1;
        else if(c == 0x7FA)                   hasRandALCat = 1;
        else if(0x800 <= c && c <= 0x815)     hasRandALCat = 1;
        else if(c == 0x81A)                   hasRandALCat = 1;
        else if(c == 0x824)                   hasRandALCat = 1;
        else if(c == 0x828)                   hasRandALCat = 1;
        else if(0x830 <= c && c <= 0x83E)     hasRandALCat = 1;
        else if(0x840 <= c && c <= 0x858)     hasRandALCat = 1;
        else if(c == 0x85E)                   hasRandALCat = 1;
    }
    else if(c == 0x200F)                      hasRandALCat = 1;
    else if(c >= 0xFB1D)
    {
        if(c == 0xFB1D)                       hasRandALCat = 1;
        else if(0xFB1F <= c && c <= 0xFB28)   hasRandALCat = 1;
        else if(0xFB2A <= c && c <= 0xFB36)   hasRandALCat = 1;
        else if(0xFB38 <= c && c <= 0xFB3C)   hasRandALCat = 1;
        else if(c == 0xFB3E)                  hasRandALCat = 1;
        else if(0xFB40 <= c && c <= 0xFB41)   hasRandALCat = 1;
        else if(0xFB43 <= c && c <= 0xFB44)   hasRandALCat = 1;
        else if(0xFB46 <= c && c <= 0xFBC1)   hasRandALCat = 1;
        else if(0xFBD3 <= c && c <= 0xFD3D)   hasRandALCat = 1;
        else if(0xFD50 <= c && c <= 0xFD8F)   hasRandALCat = 1;
        else if(0xFD92 <= c && c <= 0xFDC7)   hasRandALCat = 1;
        else if(0xFDF0 <= c && c <= 0xFDFC)   hasRandALCat = 1;
        else if(0xFE70 <= c && c <= 0xFE74)   hasRandALCat = 1;
        else if(0xFE76 <= c && c <= 0xFEFC)   hasRandALCat = 1;
        else if(0x10800 <= c && c <= 0x10805) hasRandALCat = 1;
        else if(c == 0x10808)                 hasRandALCat = 1;
        else if(0x1080A <= c && c <= 0x10835) hasRandALCat = 1;
        else if(0x10837 <= c && c <= 0x10838) hasRandALCat = 1;
        else if(c == 0x1083C)                 hasRandALCat = 1;
        else if(0x1083F <= c && c <= 0x10855) hasRandALCat = 1;
        else if(0x10857 <= c && c <= 0x1085F) hasRandALCat = 1;
        else if(0x10900 <= c && c <= 0x1091B) hasRandALCat = 1;
        else if(0x10920 <= c && c <= 0x10939) hasRandALCat = 1;
        else if(c == 0x1093F)                 hasRandALCat = 1;
        else if(c == 0x10A00)                 hasRandALCat = 1;
        else if(0x10A10 <= c && c <= 0x10A13) hasRandALCat = 1;
        else if(0x10A15 <= c && c <= 0x10A17) hasRandALCat = 1;
        else if(0x10A19 <= c && c <= 0x10A33) hasRandALCat = 1;
        else if(0x10A40 <= c && c <= 0x10A47) hasRandALCat = 1;
        else if(0x10A50 <= c && c <= 0x10A58) hasRandALCat = 1;
        else if(0x10A60 <= c && c <= 0x10A7F) hasRandALCat = 1;
        else if(0x10B00 <= c && c <= 0x10B35) hasRandALCat = 1;
        else if(0x10B40 <= c && c <= 0x10B55) hasRandALCat = 1;
        else if(0x10B58 <= c && c <= 0x10B72) hasRandALCat = 1;
        else if(0x10B78 <= c && c <= 0x10B7F) hasRandALCat = 1;
    }
}

All "AL" or "R" of Unicode 6.0 (from http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt)

bool hasRandALCat = 0;
if(c >= 0x5BE && c <= 0x10B7F)
{
    if(c <= 0x85E)
    {
        if(c == 0x5BE)                        hasRandALCat = 1;
        else if(c == 0x5C0)                   hasRandALCat = 1;
        else if(c == 0x5C3)                   hasRandALCat = 1;
        else if(c == 0x5C6)                   hasRandALCat = 1;
        else if(0x5D0 <= c && c <= 0x5EA)     hasRandALCat = 1;
        else if(0x5F0 <= c && c <= 0x5F4)     hasRandALCat = 1;
        else if(c == 0x608)                   hasRandALCat = 1;
        else if(c == 0x60B)                   hasRandALCat = 1;
        else if(c == 0x60D)                   hasRandALCat = 1;
        else if(c == 0x61B)                   hasRandALCat = 1;
        else if(0x61E <= c && c <= 0x64A)     hasRandALCat = 1;
        else if(0x66D <= c && c <= 0x66F)     hasRandALCat = 1;
        else if(0x671 <= c && c <= 0x6D5)     hasRandALCat = 1;
        else if(0x6E5 <= c && c <= 0x6E6)     hasRandALCat = 1;
        else if(0x6EE <= c && c <= 0x6EF)     hasRandALCat = 1;
        else if(0x6FA <= c && c <= 0x70D)     hasRandALCat = 1;
        else if(c == 0x710)                   hasRandALCat = 1;
        else if(0x712 <= c && c <= 0x72F)     hasRandALCat = 1;
        else if(0x74D <= c && c <= 0x7A5)     hasRandALCat = 1;
        else if(c == 0x7B1)                   hasRandALCat = 1;
        else if(0x7C0 <= c && c <= 0x7EA)     hasRandALCat = 1;
        else if(0x7F4 <= c && c <= 0x7F5)     hasRandALCat = 1;
        else if(c == 0x7FA)                   hasRandALCat = 1;
        else if(0x800 <= c && c <= 0x815)     hasRandALCat = 1;
        else if(c == 0x81A)                   hasRandALCat = 1;
        else if(c == 0x824)                   hasRandALCat = 1;
        else if(c == 0x828)                   hasRandALCat = 1;
        else if(0x830 <= c && c <= 0x83E)     hasRandALCat = 1;
        else if(0x840 <= c && c <= 0x858)     hasRandALCat = 1;
        else if(c == 0x85E)                   hasRandALCat = 1;
    }
    else if(c == 0x200F)                      hasRandALCat = 1;
    else if(c >= 0xFB1D)
    {
        if(c == 0xFB1D)                       hasRandALCat = 1;
        else if(0xFB1F <= c && c <= 0xFB28)   hasRandALCat = 1;
        else if(0xFB2A <= c && c <= 0xFB36)   hasRandALCat = 1;
        else if(0xFB38 <= c && c <= 0xFB3C)   hasRandALCat = 1;
        else if(c == 0xFB3E)                  hasRandALCat = 1;
        else if(0xFB40 <= c && c <= 0xFB41)   hasRandALCat = 1;
        else if(0xFB43 <= c && c <= 0xFB44)   hasRandALCat = 1;
        else if(0xFB46 <= c && c <= 0xFBC1)   hasRandALCat = 1;
        else if(0xFBD3 <= c && c <= 0xFD3D)   hasRandALCat = 1;
        else if(0xFD50 <= c && c <= 0xFD8F)   hasRandALCat = 1;
        else if(0xFD92 <= c && c <= 0xFDC7)   hasRandALCat = 1;
        else if(0xFDF0 <= c && c <= 0xFDFC)   hasRandALCat = 1;
        else if(0xFE70 <= c && c <= 0xFE74)   hasRandALCat = 1;
        else if(0xFE76 <= c && c <= 0xFEFC)   hasRandALCat = 1;
        else if(0x10800 <= c && c <= 0x10805) hasRandALCat = 1;
        else if(c == 0x10808)                 hasRandALCat = 1;
        else if(0x1080A <= c && c <= 0x10835) hasRandALCat = 1;
        else if(0x10837 <= c && c <= 0x10838) hasRandALCat = 1;
        else if(c == 0x1083C)                 hasRandALCat = 1;
        else if(0x1083F <= c && c <= 0x10855) hasRandALCat = 1;
        else if(0x10857 <= c && c <= 0x1085F) hasRandALCat = 1;
        else if(0x10900 <= c && c <= 0x1091B) hasRandALCat = 1;
        else if(0x10920 <= c && c <= 0x10939) hasRandALCat = 1;
        else if(c == 0x1093F)                 hasRandALCat = 1;
        else if(c == 0x10A00)                 hasRandALCat = 1;
        else if(0x10A10 <= c && c <= 0x10A13) hasRandALCat = 1;
        else if(0x10A15 <= c && c <= 0x10A17) hasRandALCat = 1;
        else if(0x10A19 <= c && c <= 0x10A33) hasRandALCat = 1;
        else if(0x10A40 <= c && c <= 0x10A47) hasRandALCat = 1;
        else if(0x10A50 <= c && c <= 0x10A58) hasRandALCat = 1;
        else if(0x10A60 <= c && c <= 0x10A7F) hasRandALCat = 1;
        else if(0x10B00 <= c && c <= 0x10B35) hasRandALCat = 1;
        else if(0x10B40 <= c && c <= 0x10B55) hasRandALCat = 1;
        else if(0x10B58 <= c && c <= 0x10B72) hasRandALCat = 1;
        else if(0x10B78 <= c && c <= 0x10B7F) hasRandALCat = 1;
    }
}
沐歌 2024-10-13 02:10:21

编辑:

这是我现在使用的,它包括元音字符以及希伯来语和阿拉伯语中的所有内容:

[\u0591-\u07FF]

旧答案:

如果您需要检测句子中的 RTL 语言,这简化的正则表达式可能就足够了:

[א-ת؀-ۿ]

如果想用希伯来语写一些东西,则必须使用这些字符之一,情况与阿拉伯语类似。

它不包括元音字符,因此如果您需要捕获所有整个单词或绝对所有 RTL 字符,您最好使用其他答案之一。
希伯来语中的元音字符在非诗歌文本中非常罕见。
我不知道阿拉伯语文本。

EDIT:

This is what I use now, it includes the Vowelization chars and everything in Hebrew and Arabic:

[\u0591-\u07FF]

OLD ANSWER:

If you need to detect RTL language in a sentence, this simplified RegEx will probably be enough:

[א-ת؀-ۿ]

If one wants to write something in Hebrew it will have to use one of these characters, and the case is similar with Arabic.

It does not include vowelization characters, so if you need to catch all whole words or absolutely all RTL chars you better use one of the other answers.
Vowelization chars in Hebrew are very rare in non-poetry texts.
I don't know about Arabic texts.

豆芽 2024-10-13 02:10:21

在我的正则表达式实现中,我既不能使用 \u、\x 也不能使用 {} 语言命名组。

因此,我根据 UnicodeData.txt

[־׀׃א-״؛-ي٭-ە‏ײַ-ﳝﶈ-ﷺﺂ-ﻼ]

这应该是相当全面的,到目前为止我已经在阿拉伯语和希伯来语文本上对其进行了测试。

On my implementation of regex I could not use neither \u, \x, nor {} language named groups.

So I built my own pattern programatically based on all "R" and "AL" (RandALCat) bidirectional characters as listed in UnicodeData.txt.

[־׀׃א-״؛-ي٭-ە‏ײַ-ﳝﶈ-ﷺﺂ-ﻼ]

This should be decently comprehensive and I've tested it on Arabic and Hebrew text so far.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文