Matching (e.g.) Unicode letters with Java regexps
There are many questions and answers here on StackOverflow that assume a "letter" can be matched in a regexp by [a-zA-Z]. However, with Unicode there are many more characters that most people would regard as letters (all the Greek letters, Cyrillic, and many more). Unicode defines many blocks, each of which may contain "letters".
Java defines POSIX classes for things like alphabetic characters, but these are specified to work only with US-ASCII. The predefined character classes define word characters as [a-zA-Z_0-9], which also excludes many letters.
So how do you properly match against Unicode strings? Is there some other library that gets this right?
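To make the problem concrete, here is a minimal sketch, assuming a standard JDK with default regex flags. The class and method names are made up for illustration, and the non-ASCII test strings are written as Unicode escapes so the source file encoding does not matter.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: shows that the ASCII-only patterns reject non-Latin letters.
public class AsciiOnlyDemo {
    static final Pattern ASCII_LETTERS = Pattern.compile("[a-zA-Z]+");
    // With default flags, \w is equivalent to [a-zA-Z_0-9].
    static final Pattern WORD_CHARS = Pattern.compile("\\w+");

    static boolean asciiLetters(String s) {
        return ASCII_LETTERS.matcher(s).matches();
    }

    static boolean wordChars(String s) {
        return WORD_CHARS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String greek = "\u03b1\u03b2\u03b3";                      // Greek alpha, beta, gamma
        String cyrillic = "\u043f\u0440\u0438\u0432\u0435\u0442"; // Cyrillic word
        for (String s : new String[] {"hello", greek, cyrillic}) {
            System.out.println(s + ": [a-zA-Z]+ = " + asciiLetters(s)
                    + ", \\w+ = " + wordChars(s));
        }
    }
}
```

Only "hello" should match either pattern; the Greek and Cyrillic strings are rejected by both.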
Comments (3)
Here you have a very nice explanation:
http://www.regular-expressions.info/unicode.html
Some hints:
Are you talking about Unicode categories, like letters? These are matched by a regex of the form \p{CAT}, where "CAT" is the category code like L for any letter, or a subcategory like Lu for uppercase or Lt for title-case.
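As a rough illustration of the \p{CAT} form in Java, here is a minimal sketch. The class and method names are hypothetical; only \p{L}, \p{Lu}, and \p{Lt} come from the answer above. Note that the backslash has to be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical helper class illustrating Unicode general-category matching.
public class UnicodeCategoryDemo {
    static final Pattern LETTERS = Pattern.compile("\\p{L}+"); // one or more letters, any script
    static final Pattern UPPER = Pattern.compile("\\p{Lu}");   // a single uppercase letter
    static final Pattern TITLE = Pattern.compile("\\p{Lt}");   // a single title-case letter

    static boolean isAllLetters(String s) {
        return LETTERS.matcher(s).matches();
    }

    static boolean isUppercaseLetter(String s) {
        return UPPER.matcher(s).matches();
    }

    static boolean isTitlecaseLetter(String s) {
        return TITLE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllLetters("\u03b1\u03b2\u03b3")); // Greek alpha, beta, gamma
        System.out.println(isAllLetters("abc1"));               // contains a digit
        System.out.println(isUppercaseLetter("\u0391"));        // Greek capital alpha
        System.out.println(isTitlecaseLetter("\u01c5"));        // Latin D with small z with caron
    }
}
```

The same category names can be combined inside a character class, for example [\p{L}\p{Nd}] for letters or decimal digits.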
Quoting from the JavaDoc of java.util.regex.Pattern.