Regular expression for \b

Posted 2024-08-05 21:31:41

I am writing regular expressions for Unicode text in Java. However, for the particular script I am using, Devanagari (U+0900 to U+097F), there is a problem with word boundaries: \b matches at dependent vowel signs (such as U+093E to U+094C), because they are treated like space characters.

Example:
Suppose I have the string: "कमल कमाल कम्हल कम्हाल"
Note that 'मा' in the second word is formed by combining म and ा (which is recognized as a space character). The same happens in the last word.
As a result, the regular expression \b\w\b matches the 'ल' in 'कमाल', which is not correct according to the language.

I hope the example helps.

Can I write a regular expression that behaves like \b except that it does not match at certain characters? Any feedback would be appreciated.

Comments (2)

嘿哥们儿 2024-08-12 21:31:41

You should be able to accomplish what you want with the following regex operators:

(?=X)   X, via zero-width positive lookahead
(?!X)   X, via zero-width negative lookahead
(?<=X)  X, via zero-width positive lookbehind
(?<!X)  X, via zero-width negative lookbehind

(The above is quoted from the Java 6 Pattern API docs.)

Use (?<![foo])(?=[foo]) in place of \b before a word, and (?<=[foo])(?![foo]) in place of \b after a word, where "[foo]" stands for your set of "word characters".
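As a rough sketch of the above for the string in the question, the Java snippet below builds the two replacement boundaries and uses them in place of \b\w\b. It assumes that every code point in the Devanagari block (U+0900 to U+097F) counts as a word character, which is a simplification: the block also contains digits and the danda punctuation marks. The class and method names are made up for illustration, and the text is written with \u escapes so the source stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DevanagariBoundary {
    // Assumption: the whole Devanagari block is treated as "word characters".
    static final String WORD = "[\\u0900-\\u097F]";

    // Replacement for \b before a word: no word char behind, a word char ahead.
    static final String WORD_START = "(?<!" + WORD + ")(?=" + WORD + ")";

    // Replacement for \b after a word: a word char behind, no word char ahead.
    static final String WORD_END = "(?<=" + WORD + ")(?!" + WORD + ")";

    // Custom equivalent of \b\w\b: a word made of exactly one character.
    static final Pattern SINGLE_CHAR_WORD =
            Pattern.compile(WORD_START + WORD + WORD_END);

    /** Returns every one-character word found in the text. */
    static List<String> singleCharWords(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = SINGLE_CHAR_WORD.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // The four words from the question, separated by spaces.
        String sample = "\u0915\u092E\u0932 \u0915\u092E\u093E\u0932 "
                + "\u0915\u092E\u094D\u0939\u0932 "
                + "\u0915\u092E\u094D\u0939\u093E\u0932";
        // No letter stands alone, so nothing matches: prints 0
        System.out.println(singleCharWords(sample).size());
        // A standalone KA followed by a longer word: prints 1
        System.out.println(singleCharWords("\u0915 \u0915\u092E\u093E\u0932").size());
    }
}
```

Because the vowel sign ा (U+093E) is inside the class, the position between it and 'ल' is no longer a boundary, so the 'ल' in 'कमाल' is not reported.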

下雨或天晴 2024-08-12 21:31:41

The equivalent of word boundaries (if the built-in boundaries are not what you expect) would be:

 (?<![x-y])(?=[x-y])...(?<=[x-y])(?![x-y])

That is because a "word boundary" means "a location where there is a word character on one side and not on the other".

So with look-behind and look-ahead expressions, you can define your own class of characters [x-y] to check against when you want to isolate a "word boundary".
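To make the pattern concrete, here is a minimal Java sketch that wraps a literal word in those look-behind and look-ahead boundaries and reports where it occurs as a whole word. The helper name findWholeWord and its parameters are hypothetical, and the character class passed in main (the full Devanagari block, U+0900 to U+097F) is only an assumption about what should count as a word character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WholeWordFinder {
    /**
     * Finds the start index of every whole-word occurrence of {@code word}.
     * {@code wordClass} is a regex character class such as "[\\u0900-\\u097F]"
     * that plays the role of [x-y] in the pattern above.
     */
    static List<Integer> findWholeWord(String text, String word, String wordClass) {
        String regex = "(?<!" + wordClass + ")(?=" + wordClass + ")"  // custom start boundary
                + Pattern.quote(word)                                 // the word, taken literally
                + "(?<=" + wordClass + ")(?!" + wordClass + ")";      // custom end boundary
        Matcher m = Pattern.compile(regex).matcher(text);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String devanagari = "[\\u0900-\\u097F]"; // assumed word-character class
        // The four words from the question, separated by spaces.
        String sample = "\u0915\u092E\u0932 \u0915\u092E\u093E\u0932 "
                + "\u0915\u092E\u094D\u0939\u0932 "
                + "\u0915\u092E\u094D\u0939\u093E\u0932";
        // The first word occurs once as a whole word, at index 0: prints [0]
        System.out.println(findWholeWord(sample, "\u0915\u092E\u0932", devanagari));
        // KA+MA is a prefix of all four words but never a whole word: prints []
        System.out.println(findWholeWord(sample, "\u0915\u092E", devanagari));
    }
}
```

Look-behind in Java must have a bounded length, which a single character class always satisfies, so any class can be substituted for the one used here.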
