Regex word breaks with Unicode diacritics
I am working on an application that searches text using regular expressions built from user input. One option the user has is to include a "match 0 or more characters" wildcard using the asterisk. I need this wildcard to match only between word boundaries. My first attempt was to convert every asterisk to (?:(?=\B).)*, which works fine in most cases. Where it fails is that .NET apparently treats the position between a Unicode character with a diacritic and the following character as a word break. I consider this a bug and have submitted it to the Microsoft feedback site.
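The symptom can be reproduced outside .NET. The following is a minimal sketch in Python, not the application's code. Python's `re` module does not count a combining mark as a word character, so it shows the same kind of failure; whether .NET fails for exactly the same reason is an assumption here. The helper name `wildcard_to_regex` is hypothetical.

```python
import re

# Replacement for '*' taken from the question.
STAR = r"(?:(?=\B).)*"

def wildcard_to_regex(query: str, star: str = STAR) -> str:
    """Escape the literal parts of `query` and join them with `star`."""
    return star.join(re.escape(part) for part in query.split("*"))

pattern = re.compile(wildcard_to_regex("c*s"))

precomposed = "caf\u00e9s"   # 'é' stored as the single code point U+00E9
decomposed = "cafe\u0301s"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(bool(pattern.search(precomposed)))  # True
print(bool(pattern.search(decomposed)))   # False: \B fails between 'e' and U+0301
print(bool(pattern.search("cat dogs")))   # False: the wildcard does not cross the space
```

The two spellings of the word look identical on screen, so a test suite for this feature probably needs both normalization forms of every accented sample.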
In the meantime, however, I need to get the functionality implemented and the product shipped. I am considering using [\p{L}\p{M}\p{N}\p{Pc}]* as the replacement text but, frankly, I am in "I don't really understand what this is going to do" land. I can read the specifications, but I am not confident that I could test this well enough to be sure it does what I expect, because I simply wouldn't know all the boundary conditions to test. The application is used by cross-cultural workers, many of whom are in tribal locations, so any and all writing systems need to be supported, including some that use zero-width word breaks.
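One way to explore what that class covers is to build it from the Unicode general categories and probe it with edge cases. The sketch below does this in Python, whose stdlib `re` has no `\p{...}` syntax, so the class is generated from `unicodedata`. It is an approximation for experimenting, not the .NET implementation; `build_word_class` and `wildcard_to_regex` are hypothetical names, and the results depend on the Unicode version bundled with the interpreter.

```python
import re
import sys
import unicodedata

def is_word_category(cat: str) -> bool:
    # L* = letters, M* = marks, N* = numbers, Pc = connector punctuation
    return cat[0] in "LMN" or cat == "Pc"

def build_word_class() -> str:
    """Build a character class equivalent to [\\p{L}\\p{M}\\p{N}\\p{Pc}]."""
    ranges, start = [], None
    # Run one step past the last code point so an open range is flushed.
    for cp in range(sys.maxunicode + 2):
        inside = (cp <= sys.maxunicode
                  and is_word_category(unicodedata.category(chr(cp))))
        if inside and start is None:
            start = cp
        elif not inside and start is not None:
            ranges.append((start, cp - 1))
            start = None
    return "[" + "".join(f"\\U{lo:08x}-\\U{hi:08x}" for lo, hi in ranges) + "]"

WORD_CLASS = build_word_class()
word_char = re.compile(WORD_CLASS)

def wildcard_to_regex(query: str) -> str:
    """Escape literal parts and replace each '*' with WORD_CLASS*."""
    return (WORD_CLASS + "*").join(re.escape(p) for p in query.split("*"))

pattern = re.compile(wildcard_to_regex("c*s"))
print(bool(pattern.search("cafe\u0301s")))  # True: the combining mark is in M
print(bool(pattern.search("cat dogs")))     # False: the space is not in the class

# Boundary conditions worth checking by hand:
for label, ch in [("combining acute", "\u0301"), ("underscore", "_"),
                  ("hyphen", "-"), ("apostrophe", "'"),
                  ("zero width space", "\u200b"),
                  ("zero width non-joiner", "\u200c"),
                  ("zero width joiner", "\u200d")]:
    print(label, unicodedata.category(ch), bool(word_char.fullmatch(ch)))
```

The zero-width characters U+200B, U+200C and U+200D are all in category Cf, so the class treats each of them as a break. That suits a zero-width space used as a word separator, but the joiner and non-joiner can occur inside words in some scripts, so whether to add them to the class is a decision for the application. Hyphens and apostrophes are excluded as well.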
Does anyone have a more elegant solution, or could confirm/correct the code above, or offer some pointers?
Thanks for your help.
Comments (1)
The equivalent of /(?:(?=\B).)*/ in a Unicode context would be: ... or, somewhat simplified: ...
This would match either a word or a non-word (spacing, punctuation etc.) sequence, possibly an empty one.
A normal or negated word boundary (\b or \B) is basically a double look-around: one looks behind to check the type of character that precedes the current position, and similarly one looks ahead. In the second regex, I removed the look-arounds and used simple character classes instead.
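The "double look-around" reading of \b and \B can be checked directly. The sketch below is in Python rather than .NET and is only an illustration: it spells both assertions out as look-arounds over a word class, confirms they agree with the built-in ones, and then swaps in a wider class. The helper names are hypothetical, and the range U+0300 to U+036F (the Combining Diacritical Marks block) is a small stand-in for a full \p{M} class, which Python's `re` cannot express.

```python
import re

def boundary(word: str) -> str:
    """\\b spelled out: a word char on exactly one side of the position."""
    return rf"(?:(?<={word})(?!{word})|(?<!{word})(?={word}))"

def not_boundary(word: str) -> str:
    """\\B spelled out: word chars on both sides, or on neither side."""
    return rf"(?:(?<={word})(?={word})|(?<!{word})(?!{word}))"

def positions(pattern: str, text: str) -> list:
    """Return every index at which the zero-width pattern matches."""
    return [m.start() for m in re.finditer(pattern, text)]

text = "ab, cd"
# With the built-in word class the look-arounds agree with \b and \B.
print(positions(r"\b", text) == positions(boundary(r"\w"), text))      # True
print(positions(r"\B", text) == positions(not_boundary(r"\w"), text))  # True

# Widen the word class so that combining diacritics count as word chars.
WORD_PLUS = r"[\w\u0300-\u036f]"
decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
print(positions(r"\B", decomposed))                     # [2]
print(positions(not_boundary(WORD_PLUS), decomposed))   # [1]
```

With the wider class, the position between the base letter and its accent is no longer a boundary, which is the behaviour the wildcard needs.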