Overcoming the Bitap algorithm's search pattern length limit
I am new to the field of approximate string matching.
I am exploring uses for the Bitap algorithm, but so far its limited pattern length troubles me. I am working with Flash, where I have at my disposal 32-bit unsigned integers and an IEEE-754 double-precision floating-point Number type, which can represent integers exactly up to 53 bits. Still, I would rather have a fuzzy matching algorithm that can handle patterns longer than 50 characters.
The Wikipedia page of the Bitap algorithm mentions libbitap, which supposedly demonstrates an unlimited pattern length implementation of the algorithm, but I have trouble getting the idea from its sources.
Have you got any suggestions about how to generalise Bitap for patterns of unlimited length, or about another algorithm that can perform fuzzy string matching of a needle near a suggested location in the haystack?
Comments (2)
There's a pretty clear implementation of this algorithm available on Google Code.
Try it. However, I can't work out how to get the exact location of a fuzzy match (its beginning and ending points in the text). If you have any idea how to get both the beginning and ending points, please share.
The simplest form of fuzzy match is probably "match with mismatches". Mismatches are sometimes called substitutions. The point is we are not considering deletions or insertions.
Ricardo Baeza-Yates, the author of many versions of Bitap, also authored an algorithm for "match with mismatches" with Chris Perleberg. The algorithm uses linked lists instead of bit arrays, but the spirit of the algorithm is the same. The paper is cited in the comments.
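To make the idea concrete, here is a minimal Python sketch of counting matches per alignment, in the spirit of that algorithm. It is not the paper's code and not the C implementation: the function name `match_with_mismatches`, the dictionary of pattern offsets (standing in for the linked lists), and the circular counter array are illustrative assumptions. Because it keeps counters instead of packing the state into one machine word, the pattern length is not tied to the word size.

```python
from collections import defaultdict


def match_with_mismatches(text, pattern, k):
    """Return (start, mismatches) for every alignment of `pattern` in `text`
    that differs in at most k positions (substitutions only)."""
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return []

    # For each character, the offsets at which it occurs in the pattern.
    offsets = defaultdict(list)
    for j, ch in enumerate(pattern):
        offsets[ch].append(j)

    # counts[s % m] holds the number of matching characters seen so far
    # for the alignment that starts at text index s. At most m alignments
    # are in progress at once, so m slots are enough.
    counts = [0] * m
    results = []
    for i, ch in enumerate(text):
        for j in offsets.get(ch, ()):
            s = i - j  # alignment in which text[i] lines up with pattern[j]
            if 0 <= s <= n - m:
                counts[s % m] += 1
        s = i - m + 1  # the alignment that ends at text index i
        if s >= 0:
            mismatches = m - counts[s % m]
            if mismatches <= k:
                results.append((s, mismatches))
            counts[s % m] = 0  # recycle the slot for alignment s + m
    return results


print(match_with_mismatches("the quick brown fox", "quack", 1))  # [(4, 1)]
```

Since only substitutions are allowed, each reported match covers exactly `text[start:start + len(pattern)]`, so both the beginning and the ending point of the match are known.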
Here is a C implementation of the Baeza-Yates-Perleberg "match with mismatches" algorithm that uses GLib. It is less optimized than the original implementation, but there are no limits on the size of the pattern or the text.
https://gist.github.com/angstyloop/e4ca495542cd469790ca926ade2fc072
Output
Here is the output of the simple compiled example program:
Here is a small GTK4 application that uses this code:
https://gist.github.com/angstyloop/2281191a3e7fd7e4c615698661fbac24
By dynamically picking the max length of the pattern, you can get a full fuzzy match for free if the strings you are searching are mostly far apart in terms of Hamming distance. Even with insertions and deletions, the string that is closest in terms of Hamming distance will have a small number of mismatches compared to the other strings. The user will have to make many errors, or two of the strings will have to be very close, in order to break that nice behavior. Here is an example:
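As a rough illustration of that behavior, the Python sketch below ranks candidate strings by Hamming distance to a query that contains a deletion. This is not the code of the GTK4 application: the helper names `hamming_prefix` and `closest`, the candidate strings, and the choice to compare only over the shorter of the two lengths (one possible reading of "dynamically picking the max length") are all assumptions made for the example.

```python
def hamming_prefix(a, b):
    """Count mismatching positions over the common length of a and b."""
    n = min(len(a), len(b))  # assumed: compare only the overlapping prefix
    return sum(1 for x, y in zip(a[:n], b[:n]) if x != y)


def closest(query, candidates):
    """Return the candidate with the fewest mismatches against the query."""
    return min(candidates, key=lambda c: hamming_prefix(query, c))


candidates = ["settings", "preferences", "documents"]
query = "setings"  # "settings" with one character deleted

for c in candidates:
    print(c, hamming_prefix(query, c))
# settings 4
# preferences 7
# documents 7

print(closest(query, candidates))  # settings
```

The deletion shifts every later character, so the intended string still shows 4 mismatches, but the unrelated strings mismatch in all 7 compared positions and the intended one is still the clear winner.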