Fuzzy text search: a regex wildcard search generator?
I'm wondering if there is a way to do fuzzy string matching in PHP: looking for a word in a long string and finding a potential match even if it's misspelled, for example when it is off by one character due to an OCR error.
I was thinking a regex generator might be able to do it. So given an input of "crazy" it would generate this regex:
.*((crazy)|(.+razy)|(c.+azy)|(cr.+zy)|(cra.+y)|(craz.+)).*
It would then return all matches for that word or variations of that word.
How to build the generator:
I would probably split the search word into an array of characters, then loop over that array with foreach and, for each position in the word, emit a variant with the letter at that position replaced by ".+". The variants are then joined into one alternation.
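A minimal sketch of the generator described above, in case it helps to see it concretely. The function name `buildFuzzyRegex` and the `$wildcard` parameter are hypothetical, not from any library, and the code assumes single-byte (ASCII) search words because `str_split` works on bytes.

```php
<?php
// Hypothetical helper (not a built-in): builds the wildcard regex from the question.
// Assumes a single-byte (ASCII) search word, since str_split() splits on bytes.
function buildFuzzyRegex(string $word, string $wildcard = '.+'): string
{
    $chars = str_split($word);

    // First alternative: the exact word, escaped for use inside /.../.
    $alternatives = ['(' . preg_quote($word, '/') . ')'];

    // One alternative per position, with that position replaced by the wildcard.
    foreach ($chars as $i => $unused) {
        $variant = '';
        foreach ($chars as $j => $char) {
            $variant .= ($j === $i) ? $wildcard : preg_quote($char, '/');
        }
        $alternatives[] = '(' . $variant . ')';
    }

    return '/.*(' . implode('|', $alternatives) . ').*/';
}

echo buildFuzzyRegex('crazy'), "\n";
// Prints: /.*((crazy)|(.+razy)|(c.+azy)|(cr.+zy)|(cra.+y)|(craz.+)).*/
```

One thing to keep in mind: ".+" matches any number of characters, so an alternative like (c.+azy) also matches "cat is lazy". Passing '.' as the wildcard limits each variant to a single substituted character, which is closer to the "off by one character" case, although it then misses words where OCR dropped or added a character.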
Is this a good way to do fuzzy text search, or is there a better way? What about some kind of string comparison that gives me a score based on how close two strings are? In short, I'm trying to see whether some badly converted OCR text contains a given word.
Comments (3)
String distance functions are useless when you don't know what the right word is. I'd suggest pspell functions:
http://www.php.net/manual/en/function.pspell-suggest.php
Levenshtein is one example of a string edit distance. There are different metrics for different purposes. Familiarize yourself with them and find the one that works for you.
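As a rough illustration of the edit-distance approach, here is a sketch that scores every word in a text against the search word using PHP's built-in `levenshtein()`. The function name `findFuzzyMatches`, the `$maxDistance` threshold, and the whitespace/punctuation handling are assumptions made for this example, and it assumes ASCII text.

```php
<?php
// Hypothetical helper (not a built-in): returns the words in $text that are
// within $maxDistance edits of $needle, as [word => distance].
// Assumes ASCII text; tokenization here is a simple whitespace split.
function findFuzzyMatches(string $needle, string $text, int $maxDistance = 1): array
{
    $needle = strtolower($needle);
    $matches = [];

    $words = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        // Drop punctuation stuck to the edges of the word, e.g. "idea," or "plan."
        $word = trim($word, ".,;:!?\"'()");
        if ($word === '') {
            continue;
        }
        // The edit distance is never smaller than the length difference,
        // so words that are much longer or shorter can be skipped early.
        if (abs(strlen($word) - strlen($needle)) > $maxDistance) {
            continue;
        }
        $distance = levenshtein($needle, $word);
        if ($distance <= $maxDistance) {
            $matches[$word] = $distance;
        }
    }

    asort($matches); // smallest distance first
    return $matches;
}

print_r(findFuzzyMatches('crazy', 'The OCR output reads: a cr4zy idea, a lazy day, a crzy plan.'));
// Matches "cr4zy" (one substitution) and "crzy" (one deletion), each with distance 1;
// "lazy" is two edits away and is not returned.
```

Unlike the wildcard regex, this handles substituted, missing, and extra characters with the same threshold, and the distance doubles as the closeness score asked about in the question. PHP also ships `similar_text()`, `soundex()`, and `metaphone()`, which measure similarity differently and may suit other kinds of errors better.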