Regular expression containing Unicode words

Posted on 2024-10-31 17:53:31 · 480 characters · 0 views · 0 comments

I'd like to match all strings containing a certain word, like this:

String regex = "(?:\\P{L}|\\W|^)(ベスパ)(?:\\b|$)";

However, the Pattern class doesn't compile it:

java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 39
(?:\P{L}|\W|^)((?:ベス|ベス|ヘズ)(?:パ)|パ)|ハ)゚)(?:\b|$)

I already pass UNICODE_CASE as a compile flag, so I'm not sure what's going wrong here:

final Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.CANON_EQ);

Thanks for help! :)

Comments (4)

╰◇生如夏花灿烂 2024-11-07 17:53:31

From the error message, which looks nothing at all like the String regex shown, I infer that the original pattern was essentially as follows. I have taken the liberty of reformatting it, replacing the literal characters with their Unicode names, and prefixing line numbers so that we can inspect and discuss it more easily.

(Every non-trivial pattern should be written in (?x) mode. Java fights you on this, but you should still do it.)

  1     (?: \P{L} | \W | ^ )
  2     (
  3         (?: \N{KATAKANA LETTER BE} \N{KATAKANA LETTER SU}
  4           | \N{KATAKANA LETTER BE} \N{KATAKANA LETTER SU}
  5           | \N{KATAKANA LETTER HE} \N{KATAKANA LETTER ZU}
  6         )
  7         (?: \N{KATAKANA LETTER PA} )
  8     |
  9             \N{KATAKANA LETTER PA}
 10     )
 11 |
 12             \N{KATAKANA LETTER HA}
 13     )
 14     \N{COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK}
 15     )
 16     (?: \b | $ )

The first and last lines are wrong, but they are wrong in a semantic way related to Java’s broken regexes. They are not syntactically wrong.

As should now be apparent, the syntactic issue is that the close parentheses at lines 13 and 15 are spurious: they have no corresponding open parentheses.

The first and last lines notwithstanding, I am still trying to understand what it is you are truly trying to do here. Why the duplication of lines 3 and 4? That doesn’t do anything useful. And I can see no reason for the grouping at line 7.

Is the intent to allow the combining mark to apply to any of the preceding things?

As for the errors in the first and last lines, do I understand that a simple word boundary is all that you are looking for? Do you actually mean to include those boundary characters there as part of your match, or are you just trying to establish boundaries? Why are you saying a non-letter or a non-word?

Word characters do include letters, you know. At least they do according to the Unicode spec, even if Java gets this wrong. Alas, because of that Java regex bug, your \W has just matched a bunch of letters, so we will have to recode this once I understand what you really want.

If only you used something that was actually compliant with UTS#18, it would work ok, but as I presume you haven’t (I heard no mention of ICU), we’ll have to fix it along the lines I have previously outlined.

A lookbehind for either a non-word or the start of string would work for the first one, and a lookahead for either a non-word or the end of string would work for the last one. That is what \b is of course supposed to do when facing word characters as you have here, and it might even work out that way provided you stay clear of your non-word particle.
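The lookaround idea above could be sketched roughly as follows. This is only an illustration, assuming Java 7 or later (for Pattern.UNICODE_CHARACTER_CLASS) and assuming the goal is simply to find the word from the question as a whole word. The class and method names are made up for the example, and the negative lookarounds (?<!\w) and (?!\w) are used as a compact equivalent of "non-word or start/end of string".

```java
import java.util.regex.Pattern;

class BoundaryLookaroundDemo {
    // The three-katakana word from the question, written with source-level
    // escapes so the file compiles regardless of its encoding.
    static final String WORD = "\u30d9\u30b9\u30d1";

    // (?<!\w) : not preceded by a word character (also true at start of input)
    // (?!\w)  : not followed by a word character (also true at end of input)
    // UNICODE_CHARACTER_CLASS makes \w follow the Unicode definition,
    // so katakana letters count as word characters.
    static final Pattern WHOLE_WORD = Pattern.compile(
            "(?<!\\w)" + Pattern.quote(WORD) + "(?!\\w)",
            Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: true if the text contains WORD as a whole word.
    static boolean containsWord(String text) {
        return WHOLE_WORD.matcher(text).find();
    }

    public static void main(String[] args) {
        // Surrounded by spaces: prints true
        System.out.println(containsWord("x " + WORD + " y"));
        // Directly preceded by another katakana letter: prints false
        System.out.println(containsWord("\u30a2" + WORD));
    }
}
```

Because the lookarounds consume nothing, the match itself is just the word, without the neighbouring boundary characters.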

But until I can see more of the original intent, I don’t think I should say more.

榕城若虚 2024-11-07 17:53:31
(?:\P{L}|\W|^)((?:ベス|ベス|ヘズ)(?:パ)|パ)|ハ)゚)(?:\b|$)
(            )((              )(   )   )   )  )(      )

The pattern in your error message has two extra ')' characters.
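To locate stray parentheses like these mechanically, a small checker along the following lines may help. It is a simplified sketch, not a regex parser: the class and method names are invented for the example, a backslash is assumed to escape only the next character, and character classes such as [)] or \Q...\E quoting are not handled. The demo input is an ASCII stand-in with the same bracket structure as the pattern in the error message.

```java
import java.util.ArrayList;
import java.util.List;

class ParenCheckDemo {
    // Returns the index of every ')' that has no matching '('.
    static List<Integer> unmatchedClosers(String regex) {
        List<Integer> unmatched = new ArrayList<>();
        int depth = 0; // number of currently open groups
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (c == '(') {
                depth++;
            } else if (c == ')') {
                if (depth == 0) {
                    unmatched.add(i); // nothing left to close
                } else {
                    depth--;
                }
            }
        }
        return unmatched;
    }

    public static void main(String[] args) {
        // Same nesting as the failing pattern, with ASCII placeholders.
        String shape = "(?:x)((?:a|b)(?:c)|c)|d)e)(?:\\b|$)";
        System.out.println(unmatchedClosers(shape)); // prints [23, 25]
    }
}
```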

能怎样 2024-11-07 17:53:31

Unicode characters in regular expressions are a tricky business.

Here is a paragraph from the documentation of Pattern:

Unicode support

This class follows Unicode Technical Report #18: Unicode Regular Expression Guidelines, implementing its second level of support though with a slightly different concrete syntax.

Unicode escape sequences such as \u2014 in Java source code are processed as described in §3.3 of the Java Language Specification. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.

Thus, since we know:

  • ベ = \u30D9
  • ス = \u30B9
  • パ = \u30D1

the proper way to write the pattern you're after is:

String regex = "(?:\\P{L}|\\W|^)(\\u30d9\\u30B9\\u30D1)(?:\\b|$)";
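The point of the quoted documentation can be checked with a short sketch like the one below, which compiles the word once with escapes resolved by the Java compiler and once with escapes resolved by the regex parser. The class and helper names are made up for illustration, and the boundary groups are left out so that only the escape handling is shown.

```java
import java.util.regex.Pattern;

class UnicodeEscapeDemo {
    // Escapes resolved by the Java compiler: the pattern text is three characters.
    static final Pattern FROM_SOURCE_ESCAPES =
            Pattern.compile("\u30d9\u30b9\u30d1");

    // Escapes resolved by the regex parser: note the doubled backslashes,
    // so the pattern text is eighteen characters long.
    static final Pattern FROM_REGEX_ESCAPES =
            Pattern.compile("\\u30d9\\u30b9\\u30d1");

    // Hypothetical helper: true if both patterns match the whole input.
    static boolean bothMatch(String text) {
        return FROM_SOURCE_ESCAPES.matcher(text).matches()
                && FROM_REGEX_ESCAPES.matcher(text).matches();
    }

    public static void main(String[] args) {
        String word = "\u30d9\u30b9\u30d1";
        System.out.println(bothMatch(word));                        // prints true
        System.out.println(FROM_SOURCE_ESCAPES.pattern().length()); // prints 3
        System.out.println(FROM_REGEX_ESCAPES.pattern().length());  // prints 18
    }
}
```

The two pattern strings differ, yet they accept exactly the same input, which is what the documentation describes.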


苦行僧 2024-11-07 17:53:31

The UNICODE_CHARACTER_CLASS mode can also be enabled via the embedded flag expression (?U).

Try:

(?U)(?:\P{L}|\W|^)((?:ベス|ベス|ヘズ)(?:パ)|パ)|ハ)゚)(?:\b|$)

But fix your brackets first, as I don't know what you want in or out of the middle group.
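As a minimal illustration of what the (?U) flag changes, assuming Java 7 or later, the sketch below compares \w with and without it on the word from the question. The class and method names are invented for the example, and the pattern is deliberately reduced to \w+ so that the unbalanced brackets do not get in the way.

```java
import java.util.regex.Pattern;

class UnicodeClassFlagDemo {
    // The three-katakana word from the question, as source-level escapes.
    static final String WORD = "\u30d9\u30b9\u30d1";

    // Default mode: \w only covers ASCII letters, digits and underscore.
    static boolean isWordAscii(String s) {
        return Pattern.compile("\\w+").matcher(s).matches();
    }

    // With (?U), i.e. UNICODE_CHARACTER_CLASS, \w follows the Unicode definition.
    static boolean isWordUnicode(String s) {
        return Pattern.compile("(?U)\\w+").matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWordAscii(WORD));   // prints false
        System.out.println(isWordUnicode(WORD)); // prints true
        System.out.println(isWordAscii("abc"));  // prints true
    }
}
```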
