Regex word breaks with Unicode diacritics

Posted on 2024-08-06 17:33:31

I am working on an application that searches text using regular expressions built from user input. One option the user has is to include a "match 0 or more characters" wildcard, written as an asterisk. I need this wildcard to match only between word boundaries, that is, without ever crossing one. My first attempt was to convert every asterisk to (?:(?=\B).)*, which works fine in most cases. Where it fails is that .NET apparently treats the position between a Unicode character with a diacritic and the following character as a word break. I consider this a bug, and have submitted it to the Microsoft feedback site.

In the meantime, however, I need to get the functionality implemented and the product shipped. I am considering using [\p{L}\p{M}\p{N}\p{Pc}]* as the replacement text but, frankly, I am in "I don't really understand what this is going to do" land. I can read the specifications, but I am not confident that I could test this well enough to be sure it does what I expect; I simply wouldn't know all the boundary conditions to test. The application is used by cross-cultural workers, many of whom are in tribal locations, so any and all writing systems need to be supported, including some that use zero-width word breaks.

Does anyone have a more elegant solution, or could confirm/correct the code above, or offer some pointers?

Thanks for your help.

Comments (1)

萌无敌 2024-08-13 17:33:31

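The equivalent of /(?:(?=\B).)*/ in a Unicode context would be the following (written in free-spacing form, so in .NET it needs RegexOptions.IgnorePatternWhitespace or the inline (?x) flag; the slashes are only delimiters and are not part of the pattern):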

/
(?:
  (?: (?<=[\p{L}\p{M}\p{N}\p{Pc}]) (?=[\p{L}\p{M}\p{N}\p{Pc}])
  |   (?<![\p{L}\p{M}\p{N}\p{Pc}]) (?![\p{L}\p{M}\p{N}\p{Pc}])
  )
  .
)*
/
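To see what the look-around version does across a combining mark, here is a minimal sketch. It is written in TypeScript purely for illustration; the pattern strings use only \p{...} categories and look-arounds, which the .NET regex engine also supports. The helper name consumeFrom and the sample text are made up for this example.

```typescript
// "Word" characters: letters, combining marks, numbers, connector punctuation
// (the same categories used in the pattern above).
const WORD = "[\\p{L}\\p{M}\\p{N}\\p{Pc}]";

// Unicode-aware stand-in for \B: both neighbours are word characters,
// or neither neighbour is.
const NON_BOUNDARY =
  `(?:(?<=${WORD})(?=${WORD})|(?<!${WORD})(?!${WORD}))`;

// Stand-in for (?:(?=\B).)* ; "u" enables \p{...}, "y" pins the match
// to the position stored in lastIndex.
const restOfRun = new RegExp(`(?:${NON_BOUNDARY}.)*`, "uy");

// Hypothetical helper: the text the wildcard would consume starting at `index`.
function consumeFrom(text: string, index: number): string {
  restOfRun.lastIndex = index;
  const m = restOfRun.exec(text);
  return m === null ? "" : m[0];
}

// "café" spelled with a combining acute accent (e + U+0301), then " au lait".
const sample = "cafe\u0301 au lait";

// Starting inside the word, the match runs to the end of the word and
// keeps the combining accent attached to its base letter.
console.log(consumeFrom(sample, 2) === "fe\u0301"); // true
// Starting at the beginning of the word (a boundary), nothing is consumed.
console.log(consumeFrom(sample, 0) === ""); // true
```

The second call shows a property inherited from the original (?=\B) form: at a position that is itself a boundary, the wildcard matches the empty string.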

...or somewhat simplified:

/(?:[\p{L}\p{M}\p{N}\p{Pc}]+|[^\p{L}\p{M}\p{N}\p{Pc}]+)?/

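This would match either a word or a non-word (spacing, punctuation etc.) sequence, possibly an empty one. Note that it is not strictly identical to the look-around version: it does not require the starting position to be inside a run, so when it is applied at the first character of a word it matches the whole word, where the look-around version would match the empty string.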
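As a rough sketch of how the simplified pattern could be plugged into the asterisk conversion from the question: the code below is TypeScript for illustration only, wildcardToRegExp is a hypothetical helper name, and it assumes that everything in the user's input other than the asterisk should be matched literally.

```typescript
const WORD_CLASS = "\\p{L}\\p{M}\\p{N}\\p{Pc}";

// The simplified replacement from above: one run of word characters,
// or one run of non-word characters, or nothing.
const WILDCARD = `(?:[${WORD_CLASS}]+|[^${WORD_CLASS}]+)?`;

// Hypothetical helper: escape regex syntax characters in the literal parts
// of the user's input, then put WILDCARD wherever an asterisk was.
function wildcardToRegExp(input: string): RegExp {
  const source = input
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(WILDCARD);
  return new RegExp(source, "u"); // "u" enables the \p{...} escapes
}

// "café" with a combining acute accent (e + U+0301).
const hit = wildcardToRegExp("caf*").exec("cafe\u0301 au lait");
console.log(hit ? hit[0].length : -1); // 5: "cafe" plus the combining accent

// The wildcard does not jump from one word into the next.
const inner = wildcardToRegExp("a*t").exec("cafe\u0301 au lait");
console.log(inner ? inner[0] : null); // ait
```

One consequence of this form is that an asterisk placed between two words, as in "café*au", can match the run of spacing or punctuation between them. If that is not wanted, the asker's shorter [\p{L}\p{M}\p{N}\p{Pc}]* can be used as the WILDCARD string instead.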

A normal or negated word boundary (\b or \B) is basically a double look-around. One looks behind, checking the type of the character that precedes the current position; the other looks ahead, checking the type of the character that follows it.
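By the same reasoning, a Unicode-aware stand-in for \b itself can be written as a pair of look-arounds: a word character on exactly one side of the position. The sketch below (TypeScript for illustration; boundaryPositions is a made-up helper, and the loop assumes text without surrogate pairs) lists where such boundaries fall.

```typescript
const W = "[\\p{L}\\p{M}\\p{N}\\p{Pc}]";

// \b stand-in: a word character behind but not ahead, or ahead but not behind.
const BOUNDARY = `(?:(?<=${W})(?!${W})|(?<!${W})(?=${W}))`;

// Hypothetical helper: UTF-16 offsets at which BOUNDARY holds.
// Assumption: the text contains no surrogate pairs; a production version
// would step through the string by code point instead.
function boundaryPositions(text: string): number[] {
  const re = new RegExp(BOUNDARY, "uy"); // "y" tests exactly at lastIndex
  const positions: number[] = [];
  for (let i = 0; i <= text.length; i++) {
    re.lastIndex = i;
    if (re.test(text)) positions.push(i);
  }
  return positions;
}

// "café au" with a combining acute accent at offset 4.
console.log(boundaryPositions("cafe\u0301 au")); // [ 0, 5, 6, 8 ]
```

Offset 4, between the "e" and its combining accent, is absent from the result, which is the behaviour the question was after.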

In the second regex, I removed the look-arounds and used simple character classes instead.
