How can I stem words into common prefixes for a regular expression?
I have an array of words that I need to find and replace with a regex, and sometimes this array can be thousands of words long. I've tested and found that factoring out the common prefixes of the words is much faster than listing each word as a separate alternative. That is, ^(where|why)$ is slower than ^wh(ere|y)$. Obviously the difference isn't noticeable in such a short example, but it's considerably faster when there are thousands of alternatives and the subject string is long.
So I'm looking for a way to do this stemming automatically, for instance to convert a string[] { "what", "why", "where", "when", "which" } into wh(at|y|e(re|n)|i(ch)).
Is there already a recognized algorithm that does this? If not, how would you go about it? It seems to need recursion, but I can't quite get my head round how to do it. I have a method I wrote that works to a limited extent, but it's inelegant, 60 lines long and uses multiple nested foreach loops, so it's a future maintenance nightmare. I'm sure there's a much better way; if anyone could point me in the right direction, that'd be much appreciated.
Comments (1)
This code should work:
Usage:
EDIT: to get reg2 = "wh(y|at|e(re|n))|a(bc|pple)", i.e. without the outer wrapping brackets, just uncomment the marked line in the BuildRexp method.
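
The code from this answer is not reproduced in this copy of the thread, so the following is only a minimal sketch of one common way to approach the problem, not the answerer's BuildRexp: insert the words into a prefix trie, then render the trie recursively, emitting a group wherever a node branches. It is written in Python for brevity; `build_trie`, `trie_to_regex` and the `wrap` flag are hypothetical names chosen for illustration, and `wrap=False` is assumed to play the role of the "uncomment the marked line" variant described in the edit.

```python
import re


def build_trie(words):
    """Build a nested-dict trie; the key "" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-word marker
    return root


def trie_to_regex(node, wrap=True):
    """Render a trie node as a regex fragment matching every suffix below it.

    wrap only affects the node it is passed for (normally the root);
    nested branches are always grouped.
    """
    is_end = "" in node
    # One alternative per child: the escaped character plus whatever follows it.
    branches = [re.escape(ch) + trie_to_regex(child)
                for ch, child in node.items() if ch != ""]
    if not branches:
        return ""  # leaf: nothing left to match
    if len(branches) == 1 and not is_end:
        return branches[0]  # single path, no group needed
    body = "|".join(branches)
    if is_end:
        return "(" + body + ")?"  # a word ends here, so the rest is optional
    return "(" + body + ")" if wrap else body


words = ["why", "what", "where", "when", "abc", "apple"]
trie = build_trie(words)
print(trie_to_regex(trie))              # (wh(y|at|e(re|n))|a(bc|pple))
print(trie_to_regex(trie, wrap=False))  # wh(y|at|e(re|n))|a(bc|pple)
```

For the question's own list { "what", "why", "where", "when", "which" } this sketch yields wh(at|y|e(re|n)|ich), which matches the same words as the pattern in the question but without the redundant group around "ch". If the result is going to be anchored with ^ and $, keeping the outer group is the safer choice, because otherwise the anchors bind only to the first and last alternatives.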