Unicode equivalents for \w and \b in Java regular expressions?

Posted on 2024-10-05 04:26:23

Many modern regex implementations interpret the \w character class shorthand as "any letter, digit, or connecting punctuation" (usually: underscore). That way, a regex like \w+ matches words like hello, élève, GOÄ_432 or gefräßig.

Unfortunately, Java doesn't. In Java, \w is limited to [A-Za-z0-9_]. This makes matching words like those mentioned above difficult, among other problems.

It also appears that the \b word separator matches in places where it shouldn't.

What would be the correct equivalent of a .NET-like, Unicode-aware \w or \b in Java? Which other shortcuts need "rewriting" to make them Unicode-aware?

Comments (3)

情徒 2024-10-12 04:26:23

Source code

The source code for the rewriting functions I discuss below is available here.

Update in Java 7

Sun’s updated Pattern class for JDK7 has a marvelous new flag, UNICODE_CHARACTER_CLASS, which makes everything work right again. It’s available as an embeddable (?U) for inside the pattern, so you can use it with the String class’s wrappers, too. It also sports corrected definitions for various other properties. It now tracks The Unicode Standard, in both RL1.2 and RL1.2a from UTS#18: Unicode Regular Expressions. This is an exciting and dramatic improvement, and the development team is to be commended for this important effort.
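As a rough illustration of the flag described above, here is a minimal sketch that compares the default \w with the Unicode-aware one, both through the Pattern flag and through the embedded (?U). The class and method names are made up for this example; only Pattern.UNICODE_CHARACTER_CLASS and (?U) come from the JDK.

```java
import java.util.regex.Pattern;

class UnicodeFlagDemo {
    // "élève", written with escapes so the source file encoding does not matter.
    static final String WORD = "\u00e9l\u00e8ve";

    // Default behaviour: \w is ASCII-only.
    static boolean asciiWord(String s) {
        return Pattern.compile("\\w+").matcher(s).matches();
    }

    // Same pattern, compiled with the JDK7 flag.
    static boolean unicodeWord(String s) {
        return Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS).matcher(s).matches();
    }

    // Same effect via the embedded flag, usable with String.matches and friends.
    static boolean embeddedFlag(String s) {
        return s.matches("(?U)\\w+");
    }

    public static void main(String[] args) {
        System.out.println(asciiWord(WORD));    // false
        System.out.println(unicodeWord(WORD));  // true
        System.out.println(embeddedFlag(WORD)); // true
    }
}
```

The embedded form is handy when you only control the pattern string and not the call to Pattern.compile.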


Java’s Regex Unicode Problems

The problem with Java regexes is that the Perl 1.0 charclass escapes — meaning \w, \b, \s, \d and their complements — are not in Java extended to work with Unicode. Alone amongst these, \b enjoys certain extended semantics, but these map neither to \w, nor to Unicode identifiers, nor to Unicode line-break properties.

Additionally, the POSIX properties in Java are accessed this way:

POSIX syntax    Java syntax

[[:Lower:]]     \p{Lower}
[[:Upper:]]     \p{Upper}
[[:ASCII:]]     \p{ASCII}
[[:Alpha:]]     \p{Alpha}
[[:Digit:]]     \p{Digit}
[[:Alnum:]]     \p{Alnum}
[[:Punct:]]     \p{Punct}
[[:Graph:]]     \p{Graph}
[[:Print:]]     \p{Print}
[[:Blank:]]     \p{Blank}
[[:Cntrl:]]     \p{Cntrl}
[[:XDigit:]]    \p{XDigit}
[[:Space:]]     \p{Space}

This is a real mess, because it means that things like Alpha, Lower, and Space do not in Java map to the Unicode Alphabetic, Lowercase, or Whitespace properties. This is exceedingly annoying. Java’s Unicode property support is strictly antemillennial, by which I mean it supports no Unicode property that has come out in the last decade.
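To make the mismatch concrete, here is a small sketch that tests one accented letter against the POSIX-style class and against a few alternatives. It assumes Java 7 or later for \p{IsAlphabetic} and (?U); the class name is hypothetical.

```java
class PosixVsUnicode {
    // "é" as an escape, so the source file encoding does not matter.
    static final String E_ACUTE = "\u00e9";

    static boolean matches(String regex) {
        return E_ACUTE.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\p{Alpha}"));        // false: POSIX class is ASCII-only
        System.out.println(matches("\\p{L}"));            // true: general category Letter
        System.out.println(matches("\\p{IsAlphabetic}")); // true: Unicode binary property
        System.out.println(matches("(?U)\\p{Alpha}"));    // true: POSIX class under (?U)
    }
}
```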

Not being able to talk about whitespace properly is super-annoying. Consider the following table. For each of those code points, there is both a J-results column for Java and a P-results column for Perl or any other PCRE-based regex engine:

             Regex    001A    0085    00A0    2029
                      J  P    J  P    J  P    J  P
                \s    1  1    0  1    0  1    0  1
               \pZ    0  0    0  0    1  1    1  1
            \p{Zs}    0  0    0  0    1  1    0  0
         \p{Space}    1  1    0  1    0  1    0  1
         \p{Blank}    0  0    0  0    0  1    0  0
    \p{Whitespace}    -  1    -  1    -  1    -  1
\p{javaWhitespace}    1  -    0  -    0  -    1  -
 \p{javaSpaceChar}    0  -    0  -    1  -    1  -

See that?

Virtually every one of those Java white space results is   ̲w̲r̲o̲n̲g̲  according to Unicode. It’s a really big problem. Java is just messed up, giving answers that are “wrong” according to existing practice and also according to Unicode. Plus Java doesn’t even give you access to the real Unicode properties! In fact, Java does not support any property that corresponds to Unicode whitespace.
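If you want to check a few of those cells yourself, something like the following sketch should do. It covers only three of the code points from the table (U+0085, U+00A0, U+2029), and it adds a (?U) column, which assumes Java 7 or later. The class and helper names are invented for the example.

```java
class WhitespaceDemo {
    private static String str(int cp) {
        return new String(Character.toChars(cp));
    }

    // Default \s: ASCII whitespace only.
    static boolean defaultS(int cp) {
        return str(cp).matches("\\s");
    }

    // \s under UNICODE_CHARACTER_CLASS: the Unicode White_Space property.
    static boolean unicodeS(int cp) {
        return str(cp).matches("(?U)\\s");
    }

    public static void main(String[] args) {
        int[] codePoints = {0x0085, 0x00A0, 0x2029};
        for (int cp : codePoints) {
            System.out.printf("U+%04X  default=%b  (?U)=%b  isWhitespace=%b  isSpaceChar=%b%n",
                    cp, defaultS(cp), unicodeS(cp),
                    Character.isWhitespace(cp), Character.isSpaceChar(cp));
        }
    }
}
```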


The Solution to All Those Problems, and More

To deal with this and many other related problems, yesterday I wrote a Java function that rewrites these 14 charclass escapes in a pattern string:

\w \W \s \S \v \V \h \H \d \D \b \B \X \R

by replacing them with things that actually work to match Unicode in a predictable and consistent fashion. It’s only an alpha prototype from a single hack session, but it is completely functional.

The short story is that my code rewrites those 14 as follows:

\s => [\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]
\S => [^\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]

\v => [\u000A-\u000D\u0085\u2028\u2029]
\V => [^\u000A-\u000D\u0085\u2028\u2029]

\h => [\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]
\H => [^\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]

\w => [\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]
\W => [^\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]

\b => (?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))
\B => (?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))

\d => \p{Nd}
\D => \P{Nd}

\R => (?:(?>\u000D\u000A)|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])

\X => (?>\PM\pM*)
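Two of the replacements listed above, \R and the legacy \X, can be tried directly as ordinary Java patterns. This is only a sketch: the class and method names are made up, and the patterns are copied from the list, with backslashes doubled for Java string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RewrittenEscapes {
    // \R: CRLF as one unit, or any single vertical whitespace character.
    static final Pattern LINEBREAK = Pattern.compile(
            "(?:(?>\\u000D\\u000A)|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])");

    // \X: legacy grapheme cluster, a non-mark followed by any number of marks.
    static final Pattern LEGACY_X = Pattern.compile("(?>\\PM\\pM*)");

    static String[] splitLines(String s) {
        return LINEBREAK.split(s);
    }

    static int countClusters(String s) {
        Matcher m = LEGACY_X.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // "élève" in decomposed form: 7 chars, 5 user-perceived characters.
        String decomposed = "e\u0301le\u0300ve";
        System.out.println(decomposed.length());        // 7
        System.out.println(countClusters(decomposed));  // 5
        System.out.println(splitLines("a\r\nb\u2028c").length); // 3
    }
}
```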

Some things to consider...

  • That uses for its \X definition what Unicode now refers to as a legacy grapheme cluster, not an extended grapheme cluster, as the latter is rather more complicated. Perl itself now uses the fancier version, but the old version is still perfectly workable for the most common situations. EDIT: See addendum at bottom.

  • What to do about \d depends on your intent, but the default is the Unicode definition. I can see people not always wanting \p{Nd}, but sometimes either [0-9] or \pN.

  • The two boundary definitions, \b and \B, are specifically written to use the \w definition.

  • That \w definition is overly broad, because it grabs the parenned letters not just the circled ones. The Unicode Other_Alphabetic property isn’t available until JDK7, so that’s the best you can do.
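Here is a short sketch of the \w replacement from the list above used as a plain character class. It also shows the "overly broad" point: a parenthesized letter is accepted along with a circled one. The class and method names are hypothetical.

```java
class UnicodeWordClass {
    // The \w replacement from the list, with backslashes doubled for a Java literal.
    static final String W =
            "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";

    static boolean isWord(String s) {
        return s.matches(W + "+");
    }

    public static void main(String[] args) {
        System.out.println(isWord("gefr\u00e4\u00dfig")); // true
        System.out.println(isWord("GO\u00c4_432"));       // true
        System.out.println(isWord("\u24b6"));             // true: circled capital A
        System.out.println(isWord("\u249c"));             // true: parenthesized small a
        System.out.println(isWord("a-b"));                // false: hyphen is not a word char
    }
}
```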


Exploring Boundaries

Boundaries have been a problem ever since Larry Wall first coined the \b and \B syntax for talking about them for Perl 1.0 back in 1987. The key to understanding how \b and \B both work is to dispel two pervasive myths about them:

  1. They are only ever looking for \w word characters, never for non-word characters.
  2. They do not specifically look for the edge of the string.

A \b boundary means:

    IF does follow word
        THEN doesn't precede word
    ELSIF doesn't follow word
        THEN does precede word

And those are all defined perfectly straightforwardly as:

  • follows word is (?<=\w).
  • precedes word is (?=\w).
  • doesn’t follow word is (?<!\w).
  • doesn’t precede word is (?!\w).

Therefore, since IF-THEN is encoded as an and-ed-together AB in regexes, an or is X|Y, and because the and is higher in precedence than the or, that is simply AB|CD. So every \b that means a boundary can be safely replaced with:

    (?:(?<=\w)(?!\w)|(?<!\w)(?=\w))

with the \w defined in the appropriate way.
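As a sketch of that substitution in Java, the snippet below builds the boundary out of the four lookarounds and uses it to pull words out of a string. To keep it short it uses a simplified word class (letters, marks, decimal digits, letter numbers, connector punctuation) rather than the full one; that simplification, and all the names, are assumptions of this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeBoundary {
    // Simplified Unicode word class, an assumption made for brevity.
    static final String W = "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}]";

    // AB|CD: (follows word AND doesn't precede word) OR (doesn't follow word AND precedes word).
    static final String B =
            "(?:(?<=" + W + ")(?!" + W + ")|(?<!" + W + ")(?=" + W + "))";

    static final Pattern WORD = Pattern.compile(B + W + "+" + B);

    static List<String> words(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Prints the two words of "élève, gefräßig!" without the punctuation.
        System.out.println(words("\u00e9l\u00e8ve, gefr\u00e4\u00dfig!"));
    }
}
```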

(You might think it strange that the A and C components are opposites. In a perfect world, you should be able to write that AB|D, but for a while I was chasing down mutual exclusion contradictions in Unicode properties — which I think I’ve taken care of, but I left the double condition in the boundary just in case. Plus this makes it more extensible if you get extra ideas later.)

For the \B non-boundaries, the logic is:

    IF does follow word
        THEN does precede word
    ELSIF doesn't follow word
        THEN doesn't precede word

Allowing all instances of \B to be replaced with:

    (?:(?<=\w)(?=\w)|(?<!\w)(?!\w))
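The non-boundary version can be sketched the same way. Here it marks every non-boundary position with a hyphen, which makes the zero-width matches visible. As before, the simplified word class and the names are assumptions of this example.

```java
class UnicodeNonBoundary {
    // Simplified Unicode word class, an assumption made for brevity.
    static final String W = "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}]";

    // (follows word AND precedes word) OR (doesn't follow word AND doesn't precede word).
    static final String NOT_B =
            "(?:(?<=" + W + ")(?=" + W + ")|(?<!" + W + ")(?!" + W + "))";

    static String markNonBoundaries(String s) {
        return s.replaceAll(NOT_B, "-");
    }

    public static void main(String[] args) {
        // Only the position between the two letters is a non-boundary.
        System.out.println(markNonBoundaries("\u00e9l"));
        // Every position here touches exactly one word character, so nothing is marked.
        System.out.println(markNonBoundaries("a b"));
    }
}
```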

This really is how \b and \B behave. Equivalent patterns for them are

  • \b using the ((IF)THEN|ELSE) construct is (?(?<=\w)(?!\w)|(?=\w))
  • \B using the ((IF)THEN|ELSE) construct is (?(?=\w)(?<=\w)|(?<!\w))

But the versions with just AB|CD are fine, especially if you lack conditional patterns in your regex language — like Java. ☹

I’ve already verified the behaviour of the boundaries using all three equivalent definitions with a test suite that checks 110,385,408 matches per run, and which I've run on a dozen different data configurations according to:

     0 ..     7F    the ASCII range
    80 ..     FF    the non-ASCII Latin1 range
   100 ..   FFFF    the non-Latin1 BMP (Basic Multilingual Plane) range
 10000 .. 10FFFF    the non-BMP portion of Unicode (the "astral" planes)

However, people often want a different sort of boundary. They want something that is whitespace and edge-of-string aware:

  • left edge as (?:(?<=^)|(?<=\s))
  • right edge as (?=$|\s)
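For the whitespace-aware kind of boundary, a sketch along these lines may help. It finds #tags only when they stand alone between whitespace or the string edges. The tag pattern, the sample text, and the names are all invented for the example, and \s and \w are left at Java's default ASCII meaning here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceEdges {
    static final String LEFT_EDGE = "(?:(?<=^)|(?<=\\s))";
    static final String RIGHT_EDGE = "(?=$|\\s)";

    static final Pattern TAG = Pattern.compile(LEFT_EDGE + "#\\w+" + RIGHT_EDGE);

    static List<String> standaloneTags(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // "#no" is glued to "x", so it is not at a left edge and is skipped.
        System.out.println(standaloneTags("a #tag x#no #yes")); // [#tag, #yes]
    }
}
```

An ordinary \b would not help here, because "#" is not a word character on either definition.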

Fixing Java with Java

The code I posted in my other answer provides this and quite a few other conveniences. This includes definitions for natural-language words, dashes, hyphens, and apostrophes, plus a bit more.

It also allows you to specify Unicode characters in logical code points, not in idiotic UTF-16 surrogates. It’s hard to overstress how important that is! And that’s just for the string expansion.

For the regex charclass substitution that makes the charclasses in your Java regexes finally work on Unicode, and work correctly, grab the full source from here. You may do with it as you please, of course. If you make fixes to it, I’d love to hear of it, but you don’t have to. It’s pretty short. The guts of the main regex rewriting function are simple:

switch (code_point) {

    case 'b':  newstr.append(boundary);
               break; /* switch */
    case 'B':  newstr.append(not_boundary);
               break; /* switch */

    case 'd':  newstr.append(digits_charclass);
               break; /* switch */
    case 'D':  newstr.append(not_digits_charclass);
               break; /* switch */

    case 'h':  newstr.append(horizontal_whitespace_charclass);
               break; /* switch */
    case 'H':  newstr.append(not_horizontal_whitespace_charclass);
               break; /* switch */

    case 'v':  newstr.append(vertical_whitespace_charclass);
               break; /* switch */
    case 'V':  newstr.append(not_vertical_whitespace_charclass);
               break; /* switch */

    case 'R':  newstr.append(linebreak);
               break; /* switch */

    case 's':  newstr.append(whitespace_charclass);
               break; /* switch */
    case 'S':  newstr.append(not_whitespace_charclass);
               break; /* switch */

    case 'w':  newstr.append(identifier_charclass);
               break; /* switch */
    case 'W':  newstr.append(not_identifier_charclass);
               break; /* switch */

    case 'X':  newstr.append(legacy_grapheme_cluster);
               break; /* switch */

    default:   newstr.append('\\');
               newstr.append(Character.toChars(code_point));
               break; /* switch */

}
saw_backslash = false;
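To show how a switch like that might sit inside a complete loop, here is a self-contained sketch. It is not the original source: it handles only \d, \D, \w and \W, uses a simplified word class, ignores \Q...\E quoting, and every name in it is made up for illustration.

```java
class EscapeRewriter {
    // Simplified replacements, assumptions of this sketch.
    static final String WORD = "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}]";
    static final String NOT_WORD = "[^\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}]";

    static String rewrite(String regex) {
        StringBuilder out = new StringBuilder();
        boolean sawBackslash = false;
        for (int i = 0; i < regex.length(); ) {
            int cp = regex.codePointAt(i);   // walk by code point, not by char
            i += Character.charCount(cp);
            if (!sawBackslash) {
                if (cp == '\\') {
                    sawBackslash = true;     // hold the backslash until we see what follows
                } else {
                    out.appendCodePoint(cp);
                }
                continue;
            }
            switch (cp) {
                case 'd': out.append("\\p{Nd}"); break;
                case 'D': out.append("\\P{Nd}"); break;
                case 'w': out.append(WORD);      break;
                case 'W': out.append(NOT_WORD);  break;
                default:  out.append('\\').appendCodePoint(cp); break; // leave other escapes alone
            }
            sawBackslash = false;
        }
        if (sawBackslash) {
            out.append('\\');                // keep a trailing lone backslash as it was
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String arabicIndic = "\u0661\u0662\u0663";            // Arabic-Indic digits one, two, three
        System.out.println(arabicIndic.matches("\\d+"));          // false
        System.out.println(arabicIndic.matches(rewrite("\\d+"))); // true
    }
}
```

Because an escaped backslash falls through to the default branch, a pattern such as \\d (a literal backslash followed by d) comes out unchanged.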

Anyway, that code is just an alpha release, stuff I hacked up over the weekend. It won’t stay that way.

For the beta I intend to:

  • fold together the code duplication

  • provide a clearer interface regarding unescaping string escapes versus augmenting regex escapes

  • provide some flexibility in the \d expansion, and maybe the \b

  • provide convenience methods that handle turning around and calling Pattern.compile or String.matches or whatnot for you

For production release, it should have javadoc and a JUnit test suite. I may include my gigatester, but it’s not written as JUnit tests.


Addendum

I have good news and bad news.

The good news is that I’ve now got a very close approximation to an extended grapheme cluster to use for an improved \X.

The bad news ☺ is that that pattern is:

(?:(?:\u000D\u000A)|(?:[\u0E40\u0E41\u0E42\u0E43\u0E44\u0EC0\u0EC1\u0EC2\u0EC3\u0EC4\uAAB5\uAAB6\uAAB9\uAABB\uAABC]*(?:[\u1100-\u115F\uA960-\uA97C]+|([\u1100-\u115F\uA960-\uA97C]*((?:[[\u1160-\u11A2\uD7B0-\uD7C6][\uAC00\uAC1C\uAC38]][\u1160-\u11A2\uD7B0-\uD7C6]*|[\uAC01\uAC02\uAC03\uAC04])[\u11A8-\u11F9\uD7CB-\uD7FB]*))|[\u11A8-\u11F9\uD7CB-\uD7FB]+|[^[\p{Zl}\p{Zp}\p{Cc}\p{Cf}&&[^\u000D\u000A\u200C\u200D]]\u000D\u000A])[[\p{Mn}\p{Me}\u200C\u200D\u0488\u0489\u20DD\u20DE\u20DF\u20E0\u20E2\u20E3\u20E4\uA670\uA671\uA672\uFF9E\uFF9F][\p{Mc}\u0E30\u0E32\u0E33\u0E45\u0EB0\u0EB2\u0EB3]]*)|(?s:.))

which in Java you’d write as:

String extended_grapheme_cluster = "(?:(?:\\u000D\\u000A)|(?:[\\u0E40\\u0E41\\u0E42\\u0E43\\u0E44\\u0EC0\\u0EC1\\u0EC2\\u0EC3\\u0EC4\\uAAB5\\uAAB6\\uAAB9\\uAABB\\uAABC]*(?:[\\u1100-\\u115F\\uA960-\\uA97C]+|([\\u1100-\\u115F\\uA960-\\uA97C]*((?:[[\\u1160-\\u11A2\\uD7B0-\\uD7C6][\\uAC00\\uAC1C\\uAC38]][\\u1160-\\u11A2\\uD7B0-\\uD7C6]*|[\\uAC01\\uAC02\\uAC03\\uAC04])[\\u11A8-\\u11F9\\uD7CB-\\uD7FB]*))|[\\u11A8-\\u11F9\\uD7CB-\\uD7FB]+|[^[\\p{Zl}\\p{Zp}\\p{Cc}\\p{Cf}&&[^\\u000D\\u000A\\u200C\\u200D]]\\u000D\\u000A])[[\\p{Mn}\\p{Me}\\u200C\\u200D\\u0488\\u0489\\u20DD\\u20DE\\u20DF\\u20E0\\u20E2\\u20E3\\u20E4\\uA670\\uA671\\uA672\\uFF9E\\uFF9F][\\p{Mc}\\u0E30\\u0E32\\u0E33\\u0E45\\u0EB0\\u0EB2\\u0EB3]]*)|(?s:.))";

¡Tschüß!

酒解孤独 2024-10-12 04:26:23

It's really unfortunate that \w doesn't work. The proposed solution \p{Alpha} doesn't work for me either.

It seems [\p{L}] catches all Unicode letters. So the Unicode equivalent of \w should be [\p{L}\p{Digit}_].
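A quick sketch of where that class works and where it falls short may be useful. Without (?U), \p{Digit} is ASCII-only, and \p{L} does not cover combining marks. The "broader" variant below is only an illustration, not something proposed above, and the names are hypothetical.

```java
class SimpleWordClass {
    static final String SIMPLE = "[\\p{L}\\p{Digit}_]+";
    // Hypothetical broader variant: adds marks and all decimal digits.
    static final String BROADER = "[\\p{L}\\p{M}\\p{Nd}_]+";

    public static void main(String[] args) {
        String word = "gefr\u00e4\u00dfig_42";
        String decomposed = "e\u0301";   // e followed by a combining acute accent
        String arabicThree = "\u0663";   // Arabic-Indic digit three

        System.out.println(word.matches(SIMPLE));         // true
        System.out.println(decomposed.matches(SIMPLE));   // false: combining mark not covered
        System.out.println(arabicThree.matches(SIMPLE));  // false: \p{Digit} is ASCII-only
        System.out.println(decomposed.matches(BROADER));  // true
        System.out.println(arabicThree.matches(BROADER)); // true
    }
}
```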

眼前雾蒙蒙 2024-10-12 04:26:23

In Java, \w and \d are not Unicode-aware; they only match the ASCII characters, [A-Za-z0-9_] and [0-9]. The same goes for \p{Alpha} and friends (the POSIX "character classes" they're based on are supposed to be locale-sensitive, but in Java they've only ever matched ASCII characters). If you want to match Unicode "word characters" you have to spell it out, e.g. [\pL\p{Mn}\p{Nd}\p{Pc}], for letters, non-spacing modifiers (accents), decimal digits, and connecting punctuation.

However, Java's \b is Unicode-savvy; it uses Character.isLetterOrDigit(ch) and checks for accented letters as well, but the only "connecting punctuation" character it recognizes is the underscore. EDIT: when I try your sample code, it prints "" and élève" as it should (see it on ideone.com).
