Regular expressions - Unicode properties reference and examples
I feel lost with the regex Unicode properties presented by RegexBuddy. I cannot distinguish between any of the Number properties, and the Math Symbol property only seems to match + but not -, *, /, or ^, for instance.

Is there any documentation or reference with examples on regular expression Unicode properties?
Unicode Character Properties
The ones that you’ve listed there in your example are actually all the same Unicode character property, the General Category property. Some regex systems provide access only to this one property alone; others include access to the Block property (not very useful) or to the Script property (much more useful).
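To make the Number values of General Category concrete, here is a minimal sketch using Python's standard unicodedata module. The sample characters are chosen only for illustration and do not come from the answer; Nd is Decimal_Number, Nl is Letter_Number, and No is Other_Number.

```python
import unicodedata

# Illustrative sample characters (an assumption of this sketch):
# ASCII digit, ARABIC-INDIC DIGIT THREE, ROMAN NUMERAL EIGHT, VULGAR FRACTION ONE HALF
samples = ["5", "\u0663", "\u2167", "\u00bd"]


def number_category(ch):
    """Return the two-letter General Category code for a single character."""
    return unicodedata.category(ch)


for ch in samples:
    # Print the code point, the character name, and its General Category.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {number_category(ch)}")
```

Both digits report Nd, the Roman numeral reports Nl, and the fraction reports No, which is the distinction a \p{Nd}, \p{Nl}, or \p{No} test makes in an engine that supports General Category.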
A more complete explanation of the \p{Property Name} and \p{Property Name = Property Value} syntax in Perl regexes is given on page 209 of Programming Perl, 4th edition.

The complete list of Unicode character properties, and their meanings, is documented in section 5 on Properties from UAX#44, the Unicode Character Database. Eleven of those properties must be supported to meet UTS#18’s RL 1.2 on Properties.
Note that the single-letter character class abbreviations like \w, \d, \s, \b, and their uppercase complements, as well as the POSIX-sounding names like \p{alpha}, are themselves defined in terms of Unicode character properties in UTS#18’s Annex C on Compatibility Properties.

To the best of my knowledge, the only regex engines currently meeting the Level 1 requirements of UTS#18 for Basic Unicode Support are Perl, ICU’s regex library for C and C++, Java 7’s Pattern class, and Matthew Barnett’s excellent regex library for Python 2 and Python 3. The regexes used in Android are actually ICU’s, not Java’s as one might otherwise imagine, and so work much better with Unicode.

For Java 7, you must use the UNICODE_CHARACTER_CLASS pattern compilation flag, or an embedded (?U), to get the RL 1.2a (\w etc.) stuff going. For PCRE, you seem to need to embed (*UCP), or use PCRE_UCP as a compilation flag. This may depend on how your version of PHP was built, which can be a problem.

Russ Cox’s RE2 library, with bindings available for C and C++, plus as a Perl regex engine plugin, and now the standard regex library used by the Go programming language, supports the two most important properties, both General Category and Script.
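As a small illustration of a shorthand class being defined through a Unicode property, the sketch below compares \d in Python's built-in re module against General Category Nd. This is only an analogy: Python's re is not one of the engines listed above, its str patterns are Unicode-aware by default, and re.ASCII switches that off, which is the reverse of Java's opt-in flag. The sample characters are assumptions made for the example.

```python
import re
import unicodedata

UNICODE_DIGIT = re.compile(r"\d")          # str patterns are Unicode-aware by default
ASCII_DIGIT = re.compile(r"\d", re.ASCII)  # re.ASCII restricts \d to [0-9]


def is_nd(ch):
    """True when the character's General Category is Nd (Decimal_Number)."""
    return unicodedata.category(ch) == "Nd"


# Illustrative samples: ASCII digit, Arabic-Indic digit, Roman numeral, fraction, letter
samples = ["7", "\u0663", "\u2167", "\u00bd", "x"]

for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        "unicode \\d:", bool(UNICODE_DIGIT.fullmatch(ch)),
        "ascii \\d:", bool(ASCII_DIGIT.fullmatch(ch)),
        "gc=Nd:", is_nd(ch),
    )
```

For these samples the Unicode-aware \d agrees with the Nd test, while the ASCII-only \d accepts just the first character.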
PCRE & PHP
I believe that PCRE is still a ways off from meeting RL 1.2’s requirements on properties. It handles both the General Category and the Script properties, which are the two most important and commonly used properties, but does not seem to let you get at the other nine requisite properties. Its POSIX-compatible properties like alpha, upper, lower, and space are specifically documented to be 7-bit ASCII only, in contravention of RL 1.2a. However, PCRE also offers these specials:

Xan  Alphanumeric: union of properties L and N
Xps  POSIX space: property Z or tab, NL, VT, FF, CR
Xsp  Perl space: property Z or tab, NL, FF, CR
Xwd  Perl word: property Xan or underscore

Note that PCRE’s \p{Xan} is still different from what Unicode says \p{alnum} must mean, because it’s missing combining marks, for example, and certain alphabetic symbols. The Perl \p{alnum} follows the Unicode definition. In the same way, PCRE’s \p{Xwd} differs from Unicode’s (and Perl’s), in that it is missing the extra alphabetics and the rest of the \p{GC=Connector_Punctuation} characters. The next revision to UTS#18 also adds \p{Join_Control} to the set of \p{word} characters.

More Properties
Of those four that meet RL 1.2 and RL 1.2a, all but Java 7 also meet (or come extremely close to meeting, sometimes under an alternate syntax like \N{…} in lieu of the \p{name=…} syntax) the new RL 2.7 on Full Properties from the proposed update to UTS#18 posted earlier this month.

Unicode Property Exploration Tools
Three standalone tools that you might want to keep handy for exploring Unicode character properties are uniprops, unichars, and uninames. They’re also available as part of the larger Unicode::Tussle suite from CPAN.
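For a quick look at a few of the same properties without installing those Perl tools, a minimal sketch using Python's standard unicodedata module could look like the following. It is a stand-in chosen for this example, not one of the tools named above, and the describe helper is hypothetical. It also shows the \N{…} named-character syntax as it exists in Python string literals.

```python
import unicodedata


def describe(ch):
    """Return (code point, character name, General Category) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch, "<unnamed>"),  # fallback for characters without a name
        unicodedata.category(ch),
    )


print(describe("^"))
print(describe("\N{GREEK SMALL LETTER ALPHA}"))  # named-character escape in a str literal

# Reverse direction: look a character up by its Unicode name.
print(unicodedata.lookup("HYPHEN-MINUS"))
```

This covers only the name and General Category; properties such as Script are not exposed by unicodedata.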
A list of Unicode properties can be found in http://www.unicode.org/Public/UNIDATA/PropList.txt.
The properties for each character can be found in http://www.unicode.org/Public/UNIDATA/UnicodeData.txt (1.2 MB).
In your case, + (PLUS SIGN) is Sm, - (HYPHEN-MINUS) is Pd, * (ASTERISK) is Po, / (SOLIDUS) is also Po, and ^ (CIRCUMFLEX ACCENT) is Sk.

You're better off matching them with [-+*/^].
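The categories quoted above can be checked, and the suggested character class tried out, with a short sketch using only Python's standard library. The find_operators helper and the sample expression are hypothetical names introduced for this example.

```python
import re
import unicodedata

OPERATORS = "+-*/^"

# General Category of each operator character, as recorded in UnicodeData.txt.
categories = {ch: unicodedata.category(ch) for ch in OPERATORS}
print(categories)

# The explicit class from the answer. "-" comes first so it is a literal hyphen,
# and "^" is not first so it does not negate the class.
OPERATOR_CLASS = re.compile(r"[-+*/^]")


def find_operators(text):
    """Return every arithmetic operator character found in text, in order."""
    return OPERATOR_CLASS.findall(text)


print(find_operators("2^3 + 4*5 - 6/7"))
```

Only + falls in Sm, which is why a Math Symbol property test skips the other four characters, while the explicit class matches all five.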