Should I be able to quote a leading or trailing dollar sign ($) inside word boundaries in a Java regular expression?

Posted on 2024-09-10 19:26:16

I'm having trouble getting regular expressions with leading / trailing $'s to match in Java (1.6.20).

From this code:

System.out.println( "$40".matches("\\b\\Q$40\\E\\b") );
System.out.println( "$40".matches(".*\\Q$40\\E.*") );
System.out.println( "$40".matches("\\Q$40\\E") );
System.out.println( " ------ " );
System.out.println( "40$".matches("\\b\\Q40$\\E\\b") );
System.out.println( "40$".matches(".*\\Q40$\\E.*") );
System.out.println( "40$".matches("\\Q40$\\E") );
System.out.println( " ------ " );
System.out.println( "4$0".matches("\\b\\Q4$0\\E\\b") );
System.out.println( "40".matches("\\b\\Q40\\E\\b") );

I get these results:

false
true
true
 ------ 
false
true
true
 ------ 
true
true

The leading false in each of the first two blocks seems to be the problem. That is, the leading/trailing $ (dollar sign) is not picked up properly in the context of the \b (word boundary) marker.

The true results in those blocks show that the quoted dollar sign itself is not the cause, since replacing the \b with .* or removing it altogether gives the desired result.

The last two "true" results show that the issue is neither with an internally quoted $ nor with matching on word boundaries (\b) around a quoted expression "\Q ... \E".

Is this a Java bug or am I missing something?

Comments (2)

风吹雪碎 2024-09-17 19:26:18

Tomalak nailed it - it's about word boundary matching. I had figured it out and deleted the question, but Will's advice to keep it open for others is sound.

The \b was, in fact, the culprit.

One conclusion could be that for anything but the most rudimentary (i.e. ASCII) uses, the built-in convenience expressions from Java are effectively useless. E.g. \w only matches ASCII characters, \b is based on that, etc.

FWIW, my RegExp ended up being:

   (?:^|[\p{P}\p{Z}])(\QThe $earch Term\E)(?:[\p{P}\p{Z}]|$)

where The $earch Term is the text I'm trying to match.

The \p{} are the Unicode categories. Basically, I'm breaking words on any character in the Punctuation (P) or Separator (Z) Unicode character categories. As well, the start and end of the input are respected (with ^ and $), the boundary markers are wrapped in non-capturing groups (the (?:...) bits), and the actual search term is quoted with \Q and \E and placed in a capturing group.
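
For anyone who wants to try this pattern, here is a minimal sketch of how it might be assembled in code. The class and method names (QuotedTermSearch, forTerm, containsTerm) are made up for illustration, and Pattern.quote is used in place of writing \Q...\E by hand; that is a substitution on my part, not what the answer above used.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not from the original answer.
class QuotedTermSearch {
    // Builds (?:^|[\p{P}\p{Z}])(<quoted term>)(?:[\p{P}\p{Z}]|$)
    static Pattern forTerm(String term) {
        String edge = "[\\p{P}\\p{Z}]"; // punctuation or separator
        // Pattern.quote wraps the term in \Q...\E so $ is taken literally.
        return Pattern.compile(
                "(?:^|" + edge + ")(" + Pattern.quote(term) + ")(?:" + edge + "|$)");
    }

    // find() looks for the term anywhere in the text,
    // unlike matches(), which must consume the whole input.
    static boolean containsTerm(String text, String term) {
        return forTerm(term).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsTerm("$40", "$40"));           // true
        System.out.println(containsTerm("It costs $40.", "$40")); // true
        System.out.println(containsTerm("x$40", "$40"));          // false
        System.out.println(containsTerm("$400", "$40"));          // false
    }
}
```

One thing to watch: \p{Z} covers the space character but not tab or newline (those are control characters), so the character class may need widening if terms can be delimited that way.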

怕倦 2024-09-17 19:26:17

This is because \b matches word boundaries, and the position immediately before or after a $ character does not necessarily count as a word boundary.

A word boundary is a position between a \w character and a \W character (or the start or end of the string), and $ is not part of \w. For the string "bla$", the word boundaries are:

" b l a $ "
 ^----------- here

" b l a $ "
       ^----- here

" b l a $ "
         ^--- but not here
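
The positions in the diagram can be checked with a small sketch like the following. The class and method names (WordBoundaryDemo, boundaries) are invented for this illustration; it simply collects every index at which \b produces an (empty) match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class WordBoundaryDemo {
    // Returns every index in the input where \b matches.
    static List<Integer> boundaries(String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(boundaries("bla$")); // [0, 3]
        System.out.println(boundaries("$40"));  // [1, 3]
        System.out.println(boundaries("40$"));  // [0, 2]
    }
}
```

For "$40" there is no boundary at index 0, and for "40$" there is none at index 3, which is why the patterns \b\Q$40\E\b and \b\Q40$\E\b in the question cannot match.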