Regular expression for \b
I am writing regular expressions for Unicode text in Java. However, for the particular script I am using, Devanagari (U+0900 to U+097F), there is a problem with word boundaries: \b matches next to dependent vowel signs (such as U+093E to U+094C), because they are treated like space characters.
Example:
Suppose I have the string: "कमल कमाल कम्हल कम्हाल"
Note that 'मा' in the second word is formed by combining म and ा (which is recognized as a space character). The same happens in the last word.
As a result, the regular expression \b\w\b matches the 'ल' in 'कमाल', which is not correct according to the language.
I hope the example helps.
Can I write a regular expression that behaves like \b except that it doesn't match at certain characters? Any feedback would be appreciated.
Comments (2)
You should be able to accomplish what you want with the zero-width lookahead and lookbehind operators (?=X), (?!X), (?<=X) and (?<!X), as described in the Java 6 Pattern API docs.
Use (?<![foo])(?=[foo]) in place of \b before a word, and (?<=[foo])(?![foo]) in place of \b after a word, where "[foo]" is your set of "word characters".
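A minimal Java sketch of the (?<![foo])(?=[foo]) and (?<=[foo])(?![foo]) substitution, applied to the string from the question. The class name CustomWordBoundary, the helper findAll, and the choice of "word characters" (letters, combining marks, decimal digits, underscore) are assumptions made for illustration; adjust the set to whatever counts as a word character in your text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CustomWordBoundary {
    // Assumed "word character" set: letters, combining marks (this covers the
    // Devanagari dependent vowel signs and the virama), decimal digits, underscore.
    static final String WORD = "[\\p{L}\\p{M}\\p{Nd}_]";

    // Replacement for \b before a word: no word char behind, a word char ahead.
    static final String START = "(?<!" + WORD + ")(?=" + WORD + ")";

    // Replacement for \b after a word: a word char behind, no word char ahead.
    static final String END = "(?<=" + WORD + ")(?!" + WORD + ")";

    // Hypothetical helper: collects every match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "कमल कमाल कम्हल कम्हाल";
        // Whole words delimited by the custom boundaries.
        System.out.println(findAll(START + WORD + "+" + END, text));
        // Counterpart of \b\w\b: single-character words only.
        System.out.println(findAll(START + WORD + END, text));
    }
}
```

With this character set, the first search should return the four whole words and the second should return nothing, since no word in the sample is a single character and 'ल' is no longer cut off from 'कमा'.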
The equivalent of a word boundary (if the built-in boundaries are not what you expected) can be written with look-behind and look-ahead assertions.
That is because a "word boundary" means "a location where there is a word character on one side and not on the other".
So with look-behind and look-ahead expressions, you can define your own class of characters [x-y] and use it to check for a "word boundary" wherever you need one.
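The "character on one side and not on the other" definition can also be folded into a single expression that is usable on either side of a word. The sketch below is one possible way to do that, not the answerer's exact pattern: the names BoundaryFor, boundary and countWholeWord are hypothetical, and using the whole Devanagari block U+0900 to U+097F as the class is an assumption (the block also contains punctuation such as the danda, which you may want to exclude).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryFor {
    // Builds a zero-width "boundary" for a caller-supplied character class
    // such as "[a-z]": a class character on exactly one side of the position.
    static String boundary(String charClass) {
        return "(?:(?<!" + charClass + ")(?=" + charClass + ")"
             + "|(?<=" + charClass + ")(?!" + charClass + "))";
    }

    // Counts occurrences of word that are delimited by the custom boundary.
    static int countWholeWord(String word, String text, String charClass) {
        String b = boundary(charClass);
        Matcher m = Pattern.compile(b + Pattern.quote(word) + b).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Assumed class: the Devanagari block named in the question.
        String devanagari = "[\\u0900-\\u097F]";
        // Only the standalone first word should count; the longer words that
        // merely start with the same two letters should not.
        System.out.println(countWholeWord("कम", "कम कमल कमाल", devanagari));
    }
}
```

Because the boundary is built from the character class you pass in, the same helper works for any script: swap in a narrower or wider class without touching the rest of the pattern.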