Regular expression containing Unicode words
I'd like to match all strings containing a certain word, like this:
String regex = "(?:\\P{L}|\\W|^)(ベスパ)(?:\\b|$)";
However, the Pattern class won't compile it:
java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 39
(?:\P{L}|\W|^)((?:ベス|ベス|ヘズ)(?:パ)|パ)|ハ)゚)(?:\b|$)
I have already passed UNICODE_CASE as a compile flag, so I'm not sure what is going wrong here:
final Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.CANON_EQ);
Thanks for help! :)
Comments (4)
From the error message given, which looks nothing at all like the String regex shown, I infer that the original pattern was essentially as follows, which I have taken the liberty to reformat, add symbolic constants to, and preface with line numbers that we might inspect and address it more easily.
(All non-trivial patterns should always be written in (?x) mode; even though Java fights against you here, you should still do it.)
The first and last lines are wrong, but they are wrong in a semantic way related to Java's broken regexes. They are not syntactically wrong.
As should now be apparent, the syntactic issue is that the close parentheses at lines 13 and 15 are spurious: they have no corresponding open parentheses.
The first and last lines notwithstanding, I am still trying to understand what it is you are truly trying to do here. Why the duplication of lines 3 and 4? That doesn’t do anything useful. And I can see no reason for the grouping at line 7.
Is the intent to allow the combining mark to apply to any of the preceding things?
As for the errors in the first and last lines, do I understand that a simple word boundary is all that you are looking for? Do you actually mean to include those boundary characters there as part of your match, or are you just trying to establish boundaries? Why are you saying a non-letter or a non-word?
Word characters do include letters, you know. At least, according to the Unicode spec they do, even if Java does get this wrong. Alas, because of that Java regex bug, you've just included a bunch of letters, so we will have to recode this once I understand what you really want.
If only you used something that was actually compliant with UTS#18, it would work ok, but as I presume you haven’t (I heard no mention of ICU), we’ll have to fix it along the lines I have previously outlined.
A lookbehind for either a non-word or the start of string would work for the first one, and a lookahead for either a non-word or the end of string would work for the last one. That is what \b is of course supposed to do when facing word characters as you have here, and it might even work out that way provided you stay clear of your non-word particle.
But until I can see more of the original intent, I don't think I should say more.
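The lookbehind/lookahead idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are made up, and "word character" is approximated here as any letter, mark, digit, or underscore, which may not be exactly the set you want. A negative lookaround on a word character is used because it is equivalent to "a non-word character or the edge of the string".

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question.
class WordBoundarySketch {
    // Assumption: a "word character" is a letter, combining mark, digit, or underscore.
    private static final String WORD_CHAR = "[\\p{L}\\p{M}\\p{N}_]";

    /** Builds a pattern that finds the literal word only when no word character touches it. */
    static Pattern wholeWord(String word) {
        return Pattern.compile(
            "(?<!" + WORD_CHAR + ")"      // preceded by start of string or a non-word character
            + Pattern.quote(word)          // the word itself, taken literally
            + "(?!" + WORD_CHAR + ")");   // followed by end of string or a non-word character
    }

    static boolean containsWord(String text, String word) {
        return wholeWord(word).matcher(text).find();
    }

    public static void main(String[] args) {
        // The three katakana of the word in the question, written as Unicode escapes.
        String vespa = "\u30D9\u30B9\u30D1";
        System.out.println(containsWord("(" + vespa + ")", vespa));
        System.out.println(containsWord(vespa + "\u30B9", vespa));
    }
}
```

Because the lookarounds consume nothing, the boundary characters are not part of the match, which is the distinction asked about above.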
The pattern in your error message has two extra ')' characters.
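A quick way to confirm this is to count parentheses yourself. The following is a rough sketch, not a real regex parser: it skips backslash-escaped characters but assumes the pattern has no parentheses inside character classes and no \Q...\E quoting. The class name is made up, the katakana are written as Unicode escapes, and the combining mark before the last stray ')' is assumed to be U+309A.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for spotting stray closing parentheses in a pattern string.
class ParenCheck {
    /** Returns the indices of ')' characters that have no matching '('. */
    static List<Integer> unmatchedClosers(String regex) {
        List<Integer> unmatched = new ArrayList<>();
        int depth = 0;
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\') {
                i++;                 // skip the escaped character
            } else if (c == '(') {
                depth++;
            } else if (c == ')') {
                if (depth == 0) {
                    unmatched.add(i); // nothing open: this closer is spurious
                } else {
                    depth--;
                }
            }
        }
        return unmatched;
    }

    public static void main(String[] args) {
        // The pattern as printed in the PatternSyntaxException message.
        String fromError = "(?:\\P{L}|\\W|^)((?:\u30D9\u30B9|\u30D9\u30B9|\u30D8\u30BA)"
                + "(?:\u30D1)|\u30D1)|\u30CF)\u309A)(?:\\b|$)";
        System.out.println(unmatchedClosers(fromError));
    }
}
```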
Unicode characters in regular expressions are a tricky business.
Here is a paragraph from the documentation of Pattern:
Thus, since we know:
ベ = \u30D9
ス = \u30B9
パ = \u30D1
the proper way to write the pattern you're after is:
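As a minimal sketch of how those escapes behave (the class and method names are illustrative only): the Pattern documentation notes that Unicode escapes are understood both by the Java compiler and by the regex parser itself, so the word can be written with single or doubled backslashes. The two strings differ, but they should compile to patterns that match the same text.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it only contrasts the two ways of escaping.
class UnicodeEscapeDemo {
    // Escapes resolved by the Java compiler: the string holds three katakana characters.
    static final String COMPILER_ESCAPED = "\u30D9\u30B9\u30D1";
    // Escapes left for the regex parser: the string holds 18 ASCII characters.
    static final String REGEX_ESCAPED = "\\u30D9\\u30B9\\u30D1";

    static boolean matchesWord(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String word = COMPILER_ESCAPED;
        System.out.println(COMPILER_ESCAPED.equals(REGEX_ESCAPED)); // the strings are not equal
        System.out.println(matchesWord(COMPILER_ESCAPED, word));
        System.out.println(matchesWord(REGEX_ESCAPED, word));
    }
}
```

The escaped form keeps the source file pure ASCII, which avoids surprises when the file encoding and the compiler's expected encoding disagree.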
The UNICODE_CHARACTER_CLASS mode can also be enabled via the embedded flag expression (?U). Try:
But fix your brackets first, as I don't know what you want in or out of the middle group.
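A small sketch of what the (?U) flag changes, assuming Java 7 or later, where UNICODE_CHARACTER_CLASS is available. The class and field names are made up for illustration, and the word is the one from the question with its middle group simplified to the plain three katakana.

```java
import java.util.regex.Pattern;

// Hypothetical demo of the embedded (?U) flag; not the original poster's final pattern.
class UnicodeClassFlagSketch {
    // Without the flag, \w covers only ASCII letters, digits, and underscore.
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");
    // With (?U), \w follows the Unicode definition of a word character.
    static final Pattern UNICODE_WORD = Pattern.compile("(?U)\\w+");
    // The word from the question between Unicode-aware word boundaries.
    static final Pattern VESPA = Pattern.compile("(?U)\\b\\u30D9\\u30B9\\u30D1\\b");

    public static void main(String[] args) {
        String vespa = "\u30D9\u30B9\u30D1";
        System.out.println(ASCII_WORD.matcher(vespa).matches());
        System.out.println(UNICODE_WORD.matcher(vespa).matches());
        System.out.println(VESPA.matcher("x " + vespa + " y").find());
        System.out.println(VESPA.matcher(vespa + "\u30B9").find());
    }
}
```

Passing Pattern.UNICODE_CHARACTER_CLASS to Pattern.compile should have the same effect as the embedded flag.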