Getting the Unicode characters of a given language in Java

Posted 2024-10-03 18:30:20

Is there any way in Java to get all the Unicode characters of a particular language (for example, Bengali or Arabic)?


岁月如刀 2024-10-10 18:30:20

The java.lang.Character class has a static nested class called UnicodeBlock. You can, for example, get the Arabic Unicode block like this:

Character.UnicodeBlock block = Character.UnicodeBlock.ARABIC;

By iterating over all characters (or more precisely, Unicode code points) it is possible to check each to find its Unicode Block:

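import java.util.HashSet;
import java.util.Set;

public class UnicodeBlockChars {
    public static void main(String[] args) {
        Set<Integer> arabicChars = findCharactersInUnicodeBlock(Character.UnicodeBlock.ARABIC);
        Set<Integer> bengaliChars = findCharactersInUnicodeBlock(Character.UnicodeBlock.BENGALI);
    }

    // Code points are stored as ints: casting to char would truncate anything above U+FFFF.
    // Note that unassigned code points inside the block are included as well.
    private static Set<Integer> findCharactersInUnicodeBlock(final Character.UnicodeBlock block) {
        final Set<Integer> chars = new HashSet<Integer>();
        for (int codePoint = Character.MIN_CODE_POINT; codePoint <= Character.MAX_CODE_POINT; codePoint++) {
            // UnicodeBlock.of returns null for code points that belong to no block.
            if (block == Character.UnicodeBlock.of(codePoint)) {
                chars.add(codePoint);
            }
        }
        return chars;
    }
}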


神妖 2024-10-10 18:30:20

Until 1.7, Java had no support for Unicode scripts, and its Unicode property support in general is very sketchy. It is basically stuck at antemillennial incarnations of Unicode. This is a real problem. They claim they’ll catch up to Unicode 6 with JDK7, but I haven’t seen any evidence yet that they will have proper property support.

In Unicode 6.0, there are 1,051 code points that count as Arabic overall, with 1,020 of those in the Basic Multilingual Plane:

% unichars --bmp  '\p{Script=Arabic}' | wc -l
    1020

% unichars -a '\p{Script=Arabic}' | wc -l
    1051

The reason that works is that the unichars program is written in Perl, and Perl has always had excellent Unicode property support. I’m running that against Unicode 6.0; there were somewhat fewer in previous releases of Unicode. In fact, 17 new Arabic characters were added for Unicode 6.0:

 % unichars -a '\p{Script=Arabic}' '\p{Age:6.0}' | wc -l
         17

You just cannot try to use blocks for this. Scripts are different from blocks. Not all code points in a given block are of the same script. Equally important, you often find characters of a given script scattered all over in strange blocks.
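For readers on Java 7 or later, the script-based selection can be sketched with Character.UnicodeScript, which was added in that release. This is only a minimal illustration, not the answer author's code: the class name ScriptCodePoints and the helper codePointsInScript are made up for the example, and the counts it prints depend on the Unicode version bundled with the running JDK, so they need not match the Perl figures above.

```java
import java.util.ArrayList;
import java.util.List;

public class ScriptCodePoints {
    // Hypothetical helper: returns every code point whose Script property,
    // as known to this JDK, equals the requested script.
    public static List<Integer> codePointsInScript(Character.UnicodeScript script) {
        List<Integer> result = new ArrayList<Integer>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            // Unassigned code points report UnicodeScript.UNKNOWN, so they never match.
            if (Character.UnicodeScript.of(cp) == script) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> arabic = codePointsInScript(Character.UnicodeScript.ARABIC);
        List<Integer> bengali = codePointsInScript(Character.UnicodeScript.BENGALI);
        // The sizes vary with the JDK's Unicode version.
        System.out.println("Arabic: " + arabic.size() + ", Bengali: " + bengali.size());
    }
}
```

The helper returns code points as Integer values rather than char so that characters outside the Basic Multilingual Plane are not truncated; new String(Character.toChars(cp)) turns any of them into printable text.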

For example, there are 18 non-Greek characters in the Greek block:

% unichars '\p{InGreek}' '\P{IsGreek}' | wc -l
     18

And 13 non-Arabic characters in the Arabic block:

% unichars '\p{InArabic}' '\P{IsArabic}' | wc -l
     13

Plus there are 4 Greek blocks and 4 (or 5) Arabic ones:

% uniprops -l | grep 'Block:.*Greek'
Block:Ancient_Greek_Musical_Notation
Block:Ancient_Greek_Numbers
Block:Greek
Block:Greek_And_Coptic
Block:Greek_Extended

% uniprops -l | grep 'Block:.*Arab'
Block:Arabic
Block:Arabic_Presentation_Forms_A
Block:Arabic_Presentation_Forms_B
Block:Arabic_Supplement 
Block:Old_South_Arabian

\p{Block:Greek} and \p{Greek_and_Coptic} are aliases, but the rest are all distinct.

But even if you look at all those blocks, you’ll miss some. For example:

% unichars '\p{IsGreek}' '[^\p{InAncient_Greek_Musical_Notation}\p{InAncient_Greek_Numbers}\p{InGreek}\p{InGreek_Extended}]' 
 ᴦ  7462 1D26 GREEK LETTER SMALL CAPITAL GAMMA
 ᴧ  7463 1D27 GREEK LETTER SMALL CAPITAL LAMDA
 ᴨ  7464 1D28 GREEK LETTER SMALL CAPITAL PI
 ᴩ  7465 1D29 GREEK LETTER SMALL CAPITAL RHO
 ᴪ  7466 1D2A GREEK LETTER SMALL CAPITAL PSI
 ᵝ  7517 1D5D MODIFIER LETTER SMALL BETA
 ᵞ  7518 1D5E MODIFIER LETTER SMALL GREEK GAMMA
 ᵟ  7519 1D5F MODIFIER LETTER SMALL DELTA
 ᵠ  7520 1D60 MODIFIER LETTER SMALL GREEK PHI
 ᵡ  7521 1D61 MODIFIER LETTER SMALL CHI
 ᵦ  7526 1D66 GREEK SUBSCRIPT SMALL LETTER BETA
 ᵧ  7527 1D67 GREEK SUBSCRIPT SMALL LETTER GAMMA
 ᵨ  7528 1D68 GREEK SUBSCRIPT SMALL LETTER RHO
 ᵩ  7529 1D69 GREEK SUBSCRIPT SMALL LETTER PHI
 ᵪ  7530 1D6A GREEK SUBSCRIPT SMALL LETTER CHI
 ᶿ  7615 1DBF MODIFIER LETTER SMALL THETA
 Ω  8486 2126 OHM SIGN

See the problem?
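The same mismatch can be observed from Java 7 or later by comparing Character.UnicodeBlock with Character.UnicodeScript for each code point. The sketch below is an illustration under that assumption; the class and method names are invented for the example, and the exact set of characters it reports depends on the JDK's Unicode version.

```java
public class BlockVersusScript {
    // True when the script is Greek but the code point lies outside the Greek-named blocks.
    public static boolean isGreekOutsideGreekBlocks(int cp) {
        if (Character.UnicodeScript.of(cp) != Character.UnicodeScript.GREEK) {
            return false;
        }
        Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
        return block != Character.UnicodeBlock.GREEK
            && block != Character.UnicodeBlock.GREEK_EXTENDED
            && block != Character.UnicodeBlock.ANCIENT_GREEK_NUMBERS
            && block != Character.UnicodeBlock.ANCIENT_GREEK_MUSICAL_NOTATION;
    }

    // True when an assigned code point sits in the Arabic block but has a different script.
    public static boolean isNonArabicInArabicBlock(int cp) {
        return Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.ARABIC
            && Character.isDefined(cp)
            && Character.UnicodeScript.of(cp) != Character.UnicodeScript.ARABIC;
    }

    public static void main(String[] args) {
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isGreekOutsideGreekBlocks(cp)) {
                // Character.getName is available from Java 7 onward.
                System.out.printf("U+%04X %s (block %s)%n",
                    cp, Character.getName(cp), Character.UnicodeBlock.of(cp));
            }
        }
    }
}
```

OHM SIGN (U+2126) is a convenient spot check: its script is Greek, yet it lives in the Letterlike Symbols block, so a block-based search for Greek would never find it.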

BTW, you can use uniprops for more than just listing all possible properties. It can also give you the properties of any given code point:

% uniprops -a 1dbf 9e6 NEL Greek:Omicron
U+1DBF <ᶿ> \N{ MODIFIER LETTER SMALL THETA }:
    \w \pL \p{L_} \p{Lm}
    All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek InPhoneticExtensionsSupplement Case_Ignorable CI Cased Changes_When_NFKC_Casefolded CWKCF L Lm Gr_Base Grapheme_Base Graph GrBase Grek ID_Continue IDC ID_Start IDS Letter L_ Modifier_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS
    Age:4.1 Bidi_Class:L Bidi_Class=Left_To_Right Bidi_Class:Left_To_Right Bc=L Block:Phonetic_Extensions_Supplement Canonical_Combining_Class:0 Canonical_Combining_Class=Not_Reordered Canonical_Combining_Class:Not_Reordered Ccc=NR Canonical_Combining_Class:NR Decomposition_Type:Non_Canon Decomposition_Type=Non_Canonical
       Decomposition_Type:Non_Canonical Dt=NonCanon Decomposition_Type:Sup Decomposition_Type=Super Decomposition_Type:Super Dt=Sup East_Asian_Width=Neutral East_Asian_Width:Neutral Grapheme_Cluster_Break:Other GCB=XX Grapheme_Cluster_Break:XX Grapheme_Cluster_Break=Other Script=Greek Hangul_Syllable_Type:NA
       Hangul_Syllable_Type=Not_Applicable Hangul_Syllable_Type:Not_Applicable Hst=NA Joining_Group:No_Joining_Group Jg=NoJoiningGroup Joining_Type:Non_Joining Jt=U Joining_Type:U Joining_Type=Non_Joining Line_Break:AL Line_Break=Alphabetic Line_Break:Alphabetic Lb=AL Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN Present_In:4.1
       In=4.1 Present_In:5.0 In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2 Script:Greek Sc=Grek Script:Grek Sentence_Break:LO Sentence_Break=Lower Sentence_Break:Lower SB=LO Word_Break:ALetter WB=LE Word_Break:LE Word_Break=ALetter
U+09E6 <০> \N{ BENGALI DIGIT ZERO }:
    \w \d \pN \p{Nd}
    All Any Alnum Assigned Beng Bengali InBengali Is_Bengali Decimal_Number Digit Nd N Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC Number Print Word XID_Continue XIDC
    Age:1.1 Script=Bengali Block=Bengali Bidi_Class:L Bidi_Class=Left_To_Right Bidi_Class:Left_To_Right Bc=L Block:Bengali Canonical_Combining_Class:0 Canonical_Combining_Class=Not_Reordered Canonical_Combining_Class:Not_Reordered Ccc=NR Canonical_Combining_Class:NR Decomposition_Type:None Dt=None East_Asian_Width=Neutral
       East_Asian_Width:Neutral Grapheme_Cluster_Break:Other GCB=XX Grapheme_Cluster_Break:XX Grapheme_Cluster_Break=Other Hangul_Syllable_Type:NA Hangul_Syllable_Type=Not_Applicable Hangul_Syllable_Type:Not_Applicable Hst=NA Joining_Group:No_Joining_Group Jg=NoJoiningGroup Joining_Type:Non_Joining Jt=U Joining_Type:U
       Joining_Type=Non_Joining Line_Break:NU Line_Break=Numeric Line_Break:Numeric Lb=NU Numeric_Type:De Numeric_Type=Decimal Numeric_Type:Decimal Nt=De Numeric_Value:0 Nv=0 Present_In:1.1 Age=1.1 In=1.1 Present_In:2.0 In=2.0 Present_In:2.1 In=2.1 Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2 In=3.2 Present_In:4.0 In=4.0
       Present_In:4.1 In=4.1 Present_In:5.0 In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2 Script:Beng Script:Bengali Sc=Beng Sentence_Break:NU Sentence_Break=Numeric Sentence_Break:Numeric SB=NU Word_Break:NU Word_Break=Numeric Word_Break:Numeric WB=NU
U+0085 <U+0085> \N{ NEXT LINE (NEL) }:
    \s \v \R \pC \p{Cc}
    All Any Assigned InLatin1 C Other Cc Cntrl Common Zyyy Control Pat_WS Pattern_White_Space PatWS Space SpacePerl VertSpace White_Space WSpace
    Age:1.1 Bidi_Class:B Bidi_Class=Paragraph_Separator Bidi_Class:Paragraph_Separator Bc=B Block:Latin_1 Block=Latin_1_Supplement Block:Latin_1_Supplement Blk=Latin1 Canonical_Combining_Class:0 Canonical_Combining_Class=Not_Reordered Canonical_Combining_Class:Not_Reordered Ccc=NR Canonical_Combining_Class:NR Script=Common
       Decomposition_Type:None Dt=None East_Asian_Width=Neutral East_Asian_Width:Neutral Grapheme_Cluster_Break:CN Grapheme_Cluster_Break=Control Grapheme_Cluster_Break:Control GCB=CN Hangul_Syllable_Type:NA Hangul_Syllable_Type=Not_Applicable Hangul_Syllable_Type:Not_Applicable Hst=NA Joining_Group:No_Joining_Group Jg=NoJoiningGroup
       Joining_Type:Non_Joining Jt=U Joining_Type:U Joining_Type=Non_Joining Line_Break:Next_Line Lb=NL Line_Break:NL Line_Break=Next_Line Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1 Age=1.1 In=1.1 Present_In:2.0 In=2.0 Present_In:2.1 In=2.1 Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2 In=3.2
       Present_In:4.0 In=4.0 Present_In:4.1 In=4.1 Present_In:5.0 In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2 Script:Common Sc=Zyyy Script:Zyyy Sentence_Break:SE Sentence_Break=Sep Sentence_Break:Sep SB=SE Word_Break:Newline WB=NL Word_Break:NL Word_Break=Newline
U+039F <Ο> \N{ GREEK CAPITAL LETTER OMICRON }:
    \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
    All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek InGreek Cased Cased_Letter LC Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF Lu L Gr_Base Grapheme_Base Graph GrBase Grek Greek_And_Coptic ID_Continue IDC ID_Start IDS Letter L_ Uppercase_Letter Print Upper
       Uppercase Word XID_Continue XIDC XID_Start XIDS
    Age:1.1 Bidi_Class:L Bidi_Class=Left_To_Right Bidi_Class:Left_To_Right Bc=L Block:Greek Block=Greek_And_Coptic Block:Greek_And_Coptic Blk=Greek Canonical_Combining_Class:0 Canonical_Combining_Class=Not_Reordered Canonical_Combining_Class:Not_Reordered Ccc=NR Canonical_Combining_Class:NR Decomposition_Type:None Dt=None
       East_Asian_Width:A East_Asian_Width=Ambiguous East_Asian_Width:Ambiguous Ea=A Grapheme_Cluster_Break:Other GCB=XX Grapheme_Cluster_Break:XX Grapheme_Cluster_Break=Other Script=Greek Hangul_Syllable_Type:NA Hangul_Syllable_Type=Not_Applicable Hangul_Syllable_Type:Not_Applicable Hst=NA Joining_Group:No_Joining_Group Jg=NoJoiningGroup
       Joining_Type:Non_Joining Jt=U Joining_Type:U Joining_Type=Non_Joining Line_Break:AL Line_Break=Alphabetic Line_Break:Alphabetic Lb=AL Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1 Age=1.1 In=1.1 Present_In:2.0 In=2.0 Present_In:2.1 In=2.1 Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2 In=3.2
       Present_In:4.0 In=4.0 Present_In:4.1 In=4.1 Present_In:5.0 In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2 Script:Greek Sc=Grek Script:Grek Sentence_Break:UP Sentence_Break=Upper Sentence_Break:Upper SB=UP Word_Break:ALetter WB=LE Word_Break:LE Word_Break=ALetter
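Java exposes only a small subset of these properties, but a rough per-code-point lookup can still be sketched with methods on java.lang.Character. The example assumes Java 7 or later (for Character.getName and Character.UnicodeScript); the class name CodePointInfo and its describe method are hypothetical, and the output format is made up for illustration rather than modeled on uniprops.

```java
public class CodePointInfo {
    // Builds a one-line summary from the properties java.lang.Character exposes.
    public static String describe(int cp) {
        return String.format(
            "U+%04X %s | block=%s | script=%s | letter=%b | digit=%b | numeric=%d",
            cp,
            Character.getName(cp),            // may be a synthesized name for unnamed characters
            Character.UnicodeBlock.of(cp),
            Character.UnicodeScript.of(cp),
            Character.isLetter(cp),
            Character.isDigit(cp),
            Character.getNumericValue(cp));   // -1 when the character has no numeric value
    }

    public static void main(String[] args) {
        // The same four code points queried with uniprops above.
        int[] samples = {0x1DBF, 0x09E6, 0x0085, 0x039F};
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```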

If you find them useful, you can download the source for the uniprops and unichars programs. There’s a third in the group, uninames. All come with instructions and examples.

Even if some of those properties aren’t directly available in Java yet, it’s ok to use Perl to generate Java code if you want; I do it all the time myself. :)
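For the Script and Block properties specifically, java.util.regex in Java 7 and later accepts \p{IsArabic} for the script and \p{InArabic} for the block, which is close in spirit to the Perl patterns used above. The following is a small sketch under that assumption; the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

public class ScriptRegex {
    // \p{IsArabic} tests the Script property; \p{InArabic} tests the Block property.
    private static final Pattern ARABIC_SCRIPT = Pattern.compile("\\p{IsArabic}");
    private static final Pattern ARABIC_BLOCK = Pattern.compile("\\p{InArabic}");

    public static boolean isArabicScript(int cp) {
        return ARABIC_SCRIPT.matcher(new String(Character.toChars(cp))).matches();
    }

    public static boolean isInArabicBlock(int cp) {
        return ARABIC_BLOCK.matcher(new String(Character.toChars(cp))).matches();
    }

    public static void main(String[] args) {
        // U+060C ARABIC COMMA: in the Arabic block, but its script is Common.
        System.out.println(isInArabicBlock(0x060C) + " " + isArabicScript(0x060C));
        // U+FE8D ARABIC LETTER ALEF ISOLATED FORM: Arabic script, but in a presentation-forms block.
        System.out.println(isInArabicBlock(0xFE8D) + " " + isArabicScript(0xFE8D));
    }
}
```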
