How to find the position of a substring in a string with fuzzy matching
I have come across the problem of matching a string in OCR-recognized text and finding its position, given an arbitrary tolerance for wrong, missing or extra characters. The result should be the position of the best match, possibly (but not necessarily) together with the length of the matching substring.
For example:
String: 9912, 1.What is your name?
Substring: 1. What is your name?
Tolerance: 1
Result: match on character 7
String: Where is our caat if any?
Substring: your cat
Tolerance: 2
Result: match on character 10
String: Tolerance is t0o h1gh.
Substring: Tolerance is too high;
Tolerance: 1
Result: no match
I have tried to adapt the Levenshtein algorithm, but it doesn't work properly for substrings and doesn't return a position.
An algorithm in Delphi would be preferred, but any implementation or pseudocode would do.
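One standard way to adapt Levenshtein distance to substring search is to let a match start anywhere in the text at no cost, by keeping the first row of the dynamic-programming table at zero (a variant usually attributed to Sellers). The Python sketch below illustrates that idea only; it is not a Delphi implementation. The function name `fuzzy_find`, the 0-based positions and the tie-breaking rules (prefer the later start, then the length closest to the pattern's) are assumptions made here, not something the question specifies.

```python
from typing import Optional, Tuple


def fuzzy_find(text: str, pattern: str, tolerance: int) -> Optional[Tuple[int, int, int]]:
    """Return (start, length, distance) of the best match, or None.

    `start` is 0-based; add 1 for the 1-based positions used in the question.
    """
    m = len(pattern)
    # Each cell is (distance, start): the cheapest alignment of pattern[:i]
    # with a substring of text ending at the current column, and where that
    # substring begins. This first column is for the empty text prefix.
    col = [(i, 0) for i in range(m + 1)]
    best = None
    for j in range(len(text) + 1):
        if j > 0:
            ch = text[j - 1]
            new = [(0, j)]  # row 0 stays at zero: a match may start anywhere
            for i in range(1, m + 1):
                # Wrong (or equal) character: consume one char from each side.
                sub = (col[i - 1][0] + (pattern[i - 1] != ch), col[i - 1][1])
                # Missing character: pattern char has no counterpart in text.
                missing = (new[i - 1][0] + 1, new[i - 1][1])
                # Extra character: text char has no counterpart in pattern.
                extra = (col[i][0] + 1, col[i][1])
                # Lowest distance wins; on ties, assume the later start.
                new.append(min(sub, missing, extra, key=lambda c: (c[0], -c[1])))
            col = new
        dist, start = col[m]
        if dist <= tolerance:
            length = j - start
            # Among equal distances, assume the length nearest the pattern's.
            rank = (dist, abs(length - m))
            if best is None or rank < best[0]:
                best = (rank, (start, length, dist))
    return best[1] if best else None
```

For the second example, `fuzzy_find("Where is our caat if any?", "your cat", 2)` returns `(9, 8, 2)`, which is character 10 in 1-based terms, and the third example returns `None` with a tolerance of 1. The running time is proportional to `len(text) * len(pattern)`.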
Here's a recursive implementation that works, but might not be fast enough. The worst case is when no match can be found and all but the last character of "What" match at every index in "Where". In that case the algorithm makes Length(What)-1 + Tolerance comparisons for each character in Where, plus one recursive call per unit of Tolerance. Since both Tolerance and the length of What are constants, I'd say the algorithm is O(n). Its performance will degrade linearly with the length of both "What" and "Where".
I've used the following code to test the function:
For case:
it shows a match on character 9, of length 6. For the other two examples it gives the expected result.
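As a rough illustration of the kind of recursive matcher described in this answer, here is a minimal Python sketch. It is not the answerer's Delphi code: the names `match_at` and `find_first_fuzzy` are hypothetical, and the three-way branching on a mismatch (wrong, extra or missing character) is an assumption about how such a matcher could work.

```python
from typing import Optional, Tuple


def match_at(where: str, what: str, wi: int, pi: int, tolerance: int) -> Optional[int]:
    """Try to match what[pi:] against where[wi:] with at most `tolerance` errors.

    Returns the index in `where` just past the match, or None.
    """
    # Consume characters that agree exactly; this never costs tolerance.
    while pi < len(what) and wi < len(where) and where[wi] == what[pi]:
        wi += 1
        pi += 1
    if pi == len(what):
        return wi  # whole pattern consumed
    if tolerance == 0:
        return None  # mismatch and no errors left to spend
    if wi < len(where):
        # Wrong character: skip one character on both sides.
        end = match_at(where, what, wi + 1, pi + 1, tolerance - 1)
        if end is not None:
            return end
        # Extra character in the text: skip it and retry the same pattern char.
        end = match_at(where, what, wi + 1, pi, tolerance - 1)
        if end is not None:
            return end
    # Missing character in the text: skip the pattern character instead.
    return match_at(where, what, wi, pi + 1, tolerance - 1)


def find_first_fuzzy(where: str, what: str, tolerance: int) -> Optional[Tuple[int, int]]:
    """Return (start, length) of the first acceptable match (0-based), or None."""
    for start in range(len(where)):
        end = match_at(where, what, start, 0, tolerance)
        if end is not None:
            return start, end - start
    return None
```

This sketch stops at the first position that fits within the tolerance, so it does not necessarily return the best match. Each mismatch can branch three ways, so the work per starting index grows with 3 to the power of the tolerance; with a small fixed tolerance the total stays roughly proportional to the length of "Where" times the length of "What".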
Here is a complete sample of fuzzy matching (approximate search); you can use or change the algorithm as you wish:
https://github.com/alidehban/FuzzyMatch