Regular expressions - Unicode properties reference and examples

Posted on 2024-08-18 00:09:29

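I'm lost with the regular expression Unicode properties that RegexBuddy provides. I can't tell the numeric properties apart, and the math symbol property seems to match only + but not, for example, -, *, /, or ^.

Is there any documentation or reference with examples of Unicode properties in regular expressions?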

栖竹 2024-08-25 00:09:29

Unicode Character Properties

The ones that you’ve listed there in your example are actually all the same Unicode character property, the General Category property. Some regex systems provide access only to this one property alone; others include access to the Block property (not very useful) or to the Script property (much more useful).
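
To see why a General Category test for math symbols accepts + but rejects the other ASCII operators, it can help to look the categories up directly. This is a minimal sketch using Python's standard unicodedata module, not RegexBuddy's own engine; the helper name general_categories is made up for illustration.

```python
import unicodedata

def general_categories(chars):
    """Map each character to its two-letter General Category code."""
    return {ch: unicodedata.category(ch) for ch in chars}

# The five ASCII operators from the question.
cats = general_categories("+-*/^")
for ch, cat in cats.items():
    print(f"U+{ord(ch):04X} {ch} {cat}")
```

Only + comes back as Sm (Math_Symbol). The hyphen-minus is Pd (Dash_Punctuation), * and / are Po (Other_Punctuation), and ^ is Sk (Modifier_Symbol), so a General Category test such as \p{Sm} cannot match them.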

A more complete explanation of the \p{Property Name} and \p{Property Name = Property Value} syntax in Perl regexes is given in the following text from page 209 of Programming Perl, 4th edition, here reproduced with the kind permission of its author:

All standard Unicode properties are actually composed of two parts, as in \p{NAME=VALUE}. All one-part properties are therefore additions to official Unicode properties. Boolean properties whose values are true can always be abbreviated as one-part properties, which allows you to write \p{Lowercase} for \p{Lowercase=True}. Other types of properties besides Boolean properties take string, numeric, or enumerated values. Perl also provides one-part aliases for all general category, script, and block properties, plus the level-one recommendations from Unicode Technical Standard #18 on Regular Expressions (version 13, from 2008-08), such as \p{Any}.

For example, \p{Armenian}, \p{IsArmenian}, and \p{Script=Armenian} all represent the same property, as do \p{Lu}, \p{GC=Lu}, \p{Uppercase_Letter}, and \p{General_Category=Uppercase_Letter}. Other examples of binary properties (those whose values are implicitly true) include \p{Whitespace}, \p{Alphabetic}, \p{Math}, and \p{Dash}. Examples of properties that aren’t binary properties include \p{Bidi_Class=Right_to_Left}, \p{Word_Break=A_Letter}, and \p{Numeric_Value=10}. The perluniprops manpage lists all properties and their aliases that Perl supports, both standard Unicode properties and the Perl specials, too.
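
The numeric side of this can be illustrated the same way. General Category splits numbers into Nd (Decimal_Number), Nl (Letter_Number), and No (Other_Number), while Numeric_Value is a separate property. The sketch below uses Python's standard unicodedata module rather than Perl; the sample characters and the helper name describe_number are chosen only for illustration.

```python
import unicodedata

# Sample characters: DIGIT FIVE, ARABIC-INDIC DIGIT THREE,
# ROMAN NUMERAL FOUR, VULGAR FRACTION ONE HALF.
samples = ["5", "\u0663", "\u2163", "\u00bd"]

def describe_number(ch):
    """Return (General Category, numeric value) for one character."""
    return (unicodedata.category(ch), unicodedata.numeric(ch))

table = {ch: describe_number(ch) for ch in samples}
for ch, (cat, value) in table.items():
    print(f"U+{ord(ch):04X} {cat} {value}")
```

The two decimal digits are Nd, the Roman numeral is Nl, and the fraction is No, which is the distinction the three numeric General Category values draw.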

The complete list of Unicode character properties, and their meanings, is documented in section 5 on Properties from UAX#44, the Unicode Character Database. Those eleven properties that must be supported to meet UTS#18’s RL 1.2 on Properties are these:

RL1.2 Properties
To meet this requirement, an implementation shall provide at least a minimal list of properties, consisting of the following:
General_Category
Script
Alphabetic
Uppercase
Lowercase
White_Space
Noncharacter_Code_Point
Default_Ignorable_Code_Point
ANY, ASCII, ASSIGNED

Note that the single-letter character class abbreviations like \w, \d, \s, \b, and their uppercase complements, as well as the POSIX-sounding names like \p{alpha}, are themselves defined in terms of Unicode character properties in UTS#18’s Annex C on Compatibility Properties.
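
The practical difference between a Unicode-aware \d and an ASCII-only one can be seen with a small sketch. It uses Python's standard re module, which is not one of the engines discussed here and does not support \p{...} at all, so treat it only as an illustration of the idea, not of UTS#18 conformance.

```python
import re

arabic_indic = "\u0663\u0664"  # ARABIC-INDIC DIGIT THREE, FOUR

# For str patterns, \d and \w are Unicode-aware by default in Python 3.
unicode_match = re.fullmatch(r"\d+", arabic_indic)

# re.ASCII restricts \d, \w, \s, and \b to the ASCII range.
ascii_match = re.fullmatch(r"\d+", arabic_indic, flags=re.ASCII)

# \w also accepts non-ASCII letters such as Greek alpha and beta.
word_match = re.fullmatch(r"\w+", "\u03b1\u03b2")

print(bool(unicode_match), bool(ascii_match), bool(word_match))
```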

To the best of my knowledge, the only regex engines currently meeting the Level 1 requirements of UTS#18 for Basic Unicode Support are Perl, ICU’s regex library for C and C++, Java 7’s Pattern class, and Matthew Barnett’s excellent regex library for Python 2 and Python 3. The regexes used in Android are actually ICU’s, not Java’s as one might otherwise imagine, and so work much better with Unicode.

For Java 7, you must use the UNICODE_CHARACTER_CLASS pattern compilation flag, or an embedded (?U), to get the RL1.2a (\w &c) stuff going. For PCRE, you seem to need to embed (*UCP) at the start of the pattern, or set PCRE_UCP as a compilation flag. This may depend on how your version of PHP was built, which can be a problem.

Russ Cox’s RE2 library, with bindings available for C and C++, plus as a Perl regex engine plugin, and now the standard regex library used by the Go programming language, supports the two most important properties, General Category and Script.

PCRE & PHP

I believe that PCRE is still a ways off from meeting RL 1.2’s requirements on properties. It handles both the General Category and the Script properties, which are the two most important and commonly used properties, but does not seem to let you get at the other nine requisite properties. Its POSIX-compatible properties like alpha, upper, lower, and space are specifically documented to be 7-bit ASCII only, in contravention to RL 1.2a. However, PCRE also offers these specials:

Xan Alphanumeric: union of properties L and N
Xps POSIX space: property Z or tab, NL, VT, FF, CR
Xsp Perl space: property Z or tab, NL, FF, CR
Xwd Perl word: property Xan or underscore

Note that PCRE’s \p{Xan} is still different from what Unicode says \p{alnum} must mean, because it’s missing combining marks, for example, and certain alphabetic symbols. The Perl \p{alnum} follows the Unicode definition. In the same way, PCRE’s \p{Xwd} differs from Unicode’s (and Perl’s), in that it is missing the extra alphabetics and the rest of the \p{GC=Connector_Punctuation} characters. The next revision to UTS#18 also adds \p{Join_Control} to the set of \p{word} characters.
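
The gaps described here can be made concrete with a rough emulation. The sketch below does not call PCRE; it only re-implements the Xan and Xwd definitions exactly as listed above on top of Python's standard unicodedata module, and the function names is_xan and is_xwd are hypothetical.

```python
import unicodedata

def is_xan(ch):
    """Xan as listed above: General Category L or N."""
    return unicodedata.category(ch)[0] in ("L", "N")

def is_xwd(ch):
    """Xwd as listed above: Xan or underscore."""
    return is_xan(ch) or ch == "_"

COMBINING_ACUTE = "\u0301"  # COMBINING ACUTE ACCENT, General Category Mn
UNDERTIE = "\u203f"         # UNDERTIE, General Category Pc

for label, ch in [("a", "a"), ("7", "7"), ("_", "_"),
                  ("U+0301", COMBINING_ACUTE), ("U+203F", UNDERTIE)]:
    print(label, unicodedata.category(ch), is_xan(ch), is_xwd(ch))
```

Under these definitions the combining mark fails Xan, and the only Connector_Punctuation character that passes Xwd is the underscore itself.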

More Properties

Of those four that meet RL 1.2 and RL 1.2a, all but Java 7 also meet (or come extremely close to meeting, sometimes under an alternate syntax like \N{…} in lieu of the \p{name=…} syntax) the new RL 2.7 on Full Properties from the proposed update to UTS#18 posted earlier this month, which reads in part:

RL2.7 Full Properties
To meet this requirement, an implementation shall support all of the properties listed below that are in the supported version of Unicode, with values that match the Unicode definitions for that version.
To meet requirement RL2.7, the implementation must satisfy the Unicode definition of the properties for the supported version of Unicode, rather than other possible definitions. However, the names used by the implementation for these properties may differ from the formal Unicode names for the properties. For example, if a regex engine already has a property called "Alphabetic", for backwards compatibility it may need to use a distinct name, such as "Unicode_Alphabetic", for the corresponding property listed in RL1.2.
[table omitted for brevity —tchrist]
The Name and Name_Alias properties are used in \p{name=…} and \N{…}. The data in NamedSequences.txt is also used in \N{…}. For more information see Section 2.5, Name Properties. The Script and Script_Extensions properties are used in \p{scx=…}. For more information, see Section 1.2.2, Script_Property.
The list excludes contributory, obsolete, and deprecated properties, most provisional properties, and the Unicode_1_Name and Unicode_Radical_Stroke properties. The properties in gray are covered by RL1.2 Properties. For more information on properties, see UAX #44, Unicode Character Database [UAX44].

Unicode Property Exploration Tools

Three standalone tools that you might want to keep handy for exploring Unicode character properties are uniprops, unichars, and uninames. They’re also available as part of the larger Unicode::Tussle suite from CPAN.

Quick demos:

$ uniprops -a 3b1
U+03B1 ‹α› \N{GREEK SMALL LETTER ALPHA}
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek InGreek Cased Cased_Letter LC
       Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L Gr_Base
       Grapheme_Base Graph GrBase Grek Greek_And_Coptic ID_Continue IDC ID_Start IDS Letter L_
       Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum
       X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
    Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Greek Block=Greek_And_Coptic BLK=Greek
       Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR
       Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=A
       East_Asian_Width=Ambiguous EA=A Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX
       Script=Greek Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA
       Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U
       Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN
       Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1
       IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0
       Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Grek Script=Grek
       Sentence_Break=LO Sentence_Break=Lower SB=LO Word_Break=ALetter WB=LE Word_Break=LE

$ unichars '\pN' '\D' '\p{Latin}'
 Ⅰ      8544  02160  ROMAN NUMERAL ONE
 Ⅱ      8545  02161  ROMAN NUMERAL TWO
 Ⅲ      8546  02162  ROMAN NUMERAL THREE
 Ⅳ      8547  02163  ROMAN NUMERAL FOUR
 Ⅴ      8548  02164  ROMAN NUMERAL FIVE
 Ⅵ      8549  02165  ROMAN NUMERAL SIX
 Ⅶ      8550  02166  ROMAN NUMERAL SEVEN
 Ⅷ      8551  02167  ROMAN NUMERAL EIGHT
 (etc)

$ uninames Old English
 æ  00E6        LATIN SMALL LETTER AE
        = latin small ligature ae (1.0)
        = ash (from Old English æsc)
        * Danish, Norwegian, Icelandic, Faroese, Old English, French, IPA
        x (latin small ligature oe - 0153)
        x (cyrillic small ligature a ie - 04D5)
 ð  00F0        LATIN SMALL LETTER ETH
        * Icelandic, Faroese, Old English, IPA
        x (latin capital letter eth - 00D0)
        x (greek small letter delta - 03B4)
        x (partial differential - 2202)
 þ  00FE        LATIN SMALL LETTER THORN
        * Icelandic, Old English, phonetics
        * Runic letter borrowed into Latin script
        x (runic letter thurisaz thurs thorn - 16A6)
 œ  0153        LATIN SMALL LIGATURE OE
        = ethel (from Old English eðel)
        * French, IPA, Old Icelandic, Old English, ...
        x (latin small letter ae - 00E6)
        x (latin letter small capital oe - 0276)
 ƿ  01BF        LATIN LETTER WYNN
        = wen
        * Runic letter borrowed into Latin script
        * replaced by "w" in modern transcriptions of Old English
        * uppercase is 01F7
        x (runic letter wunjo wynn w - 16B9)
 ǣ  01E3        LATIN SMALL LETTER AE WITH MACRON
        * Old Norse, Old English
        : 00E6 0304
 ⁊  204A        TIRONIAN SIGN ET
        * Irish Gaelic, Old English, ...
        x (ampersand - 0026)
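
A few of the values in the uniprops output for U+03B1 can be cross-checked without installing those tools. This is a small sketch using Python's standard unicodedata module, which exposes only a handful of properties (nothing like the full uniprops listing); the dictionary keys are just labels chosen to match the property names shown above.

```python
import unicodedata

ch = "\u03b1"  # GREEK SMALL LETTER ALPHA

props = {
    "Name": unicodedata.name(ch),
    "General_Category": unicodedata.category(ch),
    "Bidi_Class": unicodedata.bidirectional(ch),
    "Canonical_Combining_Class": unicodedata.combining(ch),
    "East_Asian_Width": unicodedata.east_asian_width(ch),
}

for key, value in props.items():
    print(f"{key}={value}")
```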
