Matching (e.g.) Unicode letters with Java regular expressions

Posted 2024-10-22 02:44:10 | 449 characters | 6 views | 0 comments

There are many questions and answers here on StackOverflow that assume a "letter" can be matched in a regexp by [a-zA-Z]. However, with Unicode there are many more characters that most people would regard as letters (all the Greek letters, Cyrillic, and many more). Unicode defines many blocks, each of which may contain "letters".

The Java regex documentation defines POSIX classes for things like alpha characters, but these are specified to work only with US-ASCII. The predefined character classes define word characters as [a-zA-Z_0-9], which also excludes many letters.

So how do you properly match against Unicode strings? Is there some other library that gets this right?


Comments (3)

何时共饮酒 2024-10-29 02:44:10

Here you have a very nice explanation:

http://www.regular-expressions.info/unicode.html

Some hints:

Java and .NET unfortunately do not support \X (yet). Use \P{M}\p{M}* as a substitute. To match any number of graphemes, use (?:\P{M}\p{M}*)+ instead of \X+. (Java 9 and later do accept \X, so the substitute is only needed on older Java versions.)

In Java, the regex token \uFFFF only matches the specified code point, even when canonical equivalence is turned on. However, the same syntax \uFFFF is also used to insert Unicode characters into string literals in Java source code. Pattern.compile("\u00E0") will match both the single-code-point and double-code-point encodings of à, while Pattern.compile("\\u00E0") matches only the single-code-point version. Remember that when writing a regex as a Java string literal, backslashes must be escaped. The former Java code compiles the regex à, while the latter compiles \u00E0. Depending on what you're doing, the difference may be significant.
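A minimal sketch of the two hints above. The class name GraphemeSketch and the helpers graphemes and matchesPrecomposed are hypothetical names chosen for illustration; they do not come from the linked page. The sketch splits a string with the \P{M}\p{M}* substitute and checks that both spellings of the escape match the precomposed à. It does not turn on canonical equivalence, so it says nothing about the two-code-point case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of any library.
final class GraphemeSketch {
    // One non-mark code point followed by any number of combining marks.
    private static final Pattern GRAPHEME = Pattern.compile("\\P{M}\\p{M}*");

    // Splits the input into approximate graphemes (base character plus marks).
    static List<String> graphemes(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = GRAPHEME.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Returns true if the regex matches the single code point U+00E0.
    static boolean matchesPrecomposed(String regex) {
        return Pattern.compile(regex).matcher("\u00E0").matches();
    }

    public static void main(String[] args) {
        // 'a' + U+0300 (combining grave accent) + 'b': three chars, two graphemes.
        String decomposed = "a\u0300b";
        System.out.println(decomposed.length());            // 3
        System.out.println(graphemes(decomposed).size());   // 2

        // The compiler turns the first literal into the character itself;
        // the second reaches the regex parser as a six-character escape.
        System.out.println(matchesPrecomposed("\u00E0"));   // true
        System.out.println(matchesPrecomposed("\\u00E0"));  // true
    }
}
```

Both patterns match the precomposed character here; per the text above, they are expected to differ only once canonical equivalence is involved.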

赢得她心 2024-10-29 02:44:10

Are you talking about Unicode categories, like letters? These are matched by a regex of the form \p{CAT}, where "CAT" is the category code, such as L for any letter, or a subcategory such as Lu for uppercase or Lt for titlecase.
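As a rough illustration of the \p{CAT} form, the sketch below compares \p{L} with the ASCII-only [a-zA-Z] on a Greek and a Cyrillic word. The class UnicodeLetterSketch and its method names are made up for this example, and the test words are written with Unicode escapes only to keep the source file ASCII.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of any library.
final class UnicodeLetterSketch {
    private static final Pattern LETTERS = Pattern.compile("\\p{L}+");
    private static final Pattern ASCII_LETTERS = Pattern.compile("[a-zA-Z]+");
    // An uppercase letter (Lu) followed by any letters.
    private static final Pattern CAPITALIZED = Pattern.compile("\\p{Lu}\\p{L}*");

    static boolean isAllLetters(String s) {
        return LETTERS.matcher(s).matches();
    }

    static boolean isAsciiLetters(String s) {
        return ASCII_LETTERS.matcher(s).matches();
    }

    static boolean isCapitalized(String s) {
        return CAPITALIZED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String greek = "\u03BB\u03CC\u03B3\u03BF\u03C2";          // Greek "logos", all lowercase
        String cyrillic = "\u041C\u043E\u0441\u043A\u0432\u0430"; // Cyrillic "Moskva", capitalized

        System.out.println(isAllLetters(greek));      // true
        System.out.println(isAsciiLetters(greek));    // false
        System.out.println(isAllLetters(cyrillic));   // true
        System.out.println(isCapitalized(cyrillic));  // true
        System.out.println(isCapitalized(greek));     // false
        System.out.println(isAllLetters("abc123"));   // false, digits are not in L
    }
}
```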

终陌 2024-10-29 02:44:10

Quoting from the JavaDoc of java.util.regex.Pattern:

Unicode support

This class is in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression Guidelines, plus RL2.1 Canonical Equivalents.

Unicode escape sequences such as \u2014 in Java source code are processed as described in §3.3 of the Java Language Specification. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.

Unicode blocks and categories are written with the \p and \P constructs as in Perl. \p{prop} matches if the input has the property prop, while \P{prop} does not match if the input has that property. Blocks are specified with the prefix In, as in InMongolian. Categories may be specified with the optional prefix Is: Both \p{L} and \p{IsL} denote the category of Unicode letters. Blocks and categories can be used both inside and outside of a character class.

The supported categories are those of The Unicode Standard in the version specified by the Character class. The category names are those defined in the Standard, both normative and informative. The block names supported by Pattern are the valid block names accepted and defined by UnicodeBlock.forName.
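The block and category syntax quoted above can be tried out with a small sketch like the following. UnicodeBlockSketch and its helpers are hypothetical names used only for illustration. The Greek block is assumed as the example because \p{InGreek} appears in the Pattern documentation's own summary table.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of any library.
final class UnicodeBlockSketch {
    // Block syntax: the In prefix followed by a block name.
    static boolean inGreekBlock(String s) {
        return Pattern.matches("\\p{InGreek}+", s);
    }

    // Category syntax with the optional Is prefix.
    static boolean lettersViaIsPrefix(String s) {
        return Pattern.matches("\\p{IsL}+", s);
    }

    // Negated property: one or more characters that are not letters.
    static boolean containsNoLetters(String s) {
        return Pattern.matches("\\P{L}+", s);
    }

    // Block used inside a character class, combined with ASCII digits.
    static boolean greekOrDigits(String s) {
        return Pattern.matches("[\\p{InGreek}0-9]+", s);
    }

    public static void main(String[] args) {
        String greek = "\u03BB\u03B3"; // two lowercase Greek letters

        System.out.println(inGreekBlock(greek));           // true
        System.out.println(inGreekBlock("ab"));            // false
        System.out.println(lettersViaIsPrefix(greek));     // true
        System.out.println(containsNoLetters("123-456"));  // true
        System.out.println(containsNoLetters("12a"));      // false
        System.out.println(greekOrDigits(greek + "42"));   // true
    }
}
```

Note that a block is a contiguous code point range rather than a script or a category, so \p{InGreek} also covers non-letter code points in that range; \p{L} is the closer fit when the goal is "any letter".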
