超过 100k+ 的快速动态模糊搜索C# 中的字符串

发布于 2024-11-06 19:45:14 字数 340 浏览 7 评论 0原文

假设它们是预先加载的股票代码,输入到文本框中。我正在寻找可以复制的代码,而不是要安装的库。

这是受到这个问题的启发:

是否有为 C# 编写的模糊搜索或字符串相似性函数库?

Levenstein 距离算法似乎运行良好,但计算需要时间。 当用户输入额外的字母时,查询需要重新运行,是否有任何优化?我有兴趣最多显示每个输入的前 10 个匹配项。

Let's say they are pre-loaded stock symbols, typed into a text box. I am looking for code that I can copy, not a library to install.

This was inspired by this question:

Are there any Fuzzy Search or String Similarity Functions libraries written for C#?

The Levenstein distance algorithm seems to work well, but it takes time to compute.
Are there any optimizations around the fact that the query will need to re-run as the user types in an extra letter? I am interested in showing at most the top 10 matches for each input.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

阪姬 2024-11-13 19:45:14

您需要确定字符串周围的匹配规则。是什么决定了“相似字符串”

  • 匹配字符数
  • 不匹配字符数
  • 相似长度
  • 拼写或语音错误
  • 业务特定缩写
  • 必须以相同的子字符串开头
  • 必须以相同的子字符串结尾

我在字符串匹配方面做了很多工作算法,并且尚未找到任何满足我的特定要求的现有库或代码。查看它们,借鉴它们的想法,但您总是必须自定义并编写自己的代码。

Levenstein 算法很好,但有点慢。我在史密斯-沃特曼和史密斯-沃特曼的帮助下都取得了一些成功。 Jaro-Winkler 算法,但我发现最适合我的目的的是 Monge(凭记忆)。然而,阅读原始研究并确定他们编写算法和目标数据集的原因是值得的。

如果您没有正确定义要匹配和衡量的内容,那么您会发现意外匹配的得分高,而预期匹配的得分低。字符串匹配非常是特定于域的。如果你没有正确定义你的领域,那么你就像一个毫无头绪的渔夫,到处扔鱼钩,希望得到最好的结果。

You need to determine the matching rules around your strings. What determines a 'similar string'

  • number of matching characters
  • number of non-matching characters
  • similar length
  • typos or phonetic errors
  • business specific abbreviations
  • must start with the same substring
  • must end with the same substring

I've done quite a lot of work with string matching algorithms, and am yet to find any existing library or code that meets my specific requirements. Review them, borrow ideas from them, but you will invariably have to customize and write your own code.

The Levenstein algorithm is good but a bit slow. I've had some success with both Smith-Waterman & Jaro-Winkler algorithms, but the best I found for my purpose was Monge (from memory). However it pays to read the original research and determine why they've written their algorithms and their target dataset.

If you don't properly define what you want to match and measure then you'll find high scores on unexpected matches and low scores on expected matches. String matching is very domain specific. If you don't properly define your domain then you are like a fisherman without a clue, throwing hooks around and hoping for the best.

灯角 2024-11-13 19:45:14

这篇博文描述了一些工作进入Lucene这方面。他们能够使用有限状态转换器(自动机)非常有效地实现 Levenshtein 距离模糊匹配,编辑距离可达 2。代码全部采用 Java 编写,虽然是开源的,但有点复杂。

但基本思想很简单:将你的字典想象成一棵巨大的字母状态树。在 state0,你没有字母。在 state1,您承认任何可能是单词第一个字母的字母。 State2 以 state1 为条件;如果第一个字母是“x”,则下一个状态只接受 x 后面的字母(位置 2)。现在

,对于 Levenshtein 匹配,您遍历字母树,同时允许一些错误:删除、插入(一个字母通配符)和可能的转置(Levenshtein 的一个很好的扩展是将转置视为单个编辑而不是 2)。您必须维护状态才能跟踪允许的编辑次数。这可以非常有效地完成 - 对于交互式“当您键入时”拼写建议器来说当然足够快。

This blog post describes some work that went into Lucene in this area. They were able to implement Levenshtein distance fuzzy matching using a Finite State Transducer (automaton) quite efficiently for up to an edit distance of 2. The code is all in Java though, and a little complex, although it is open source.

But the basic idea is simple enough: think of your dictionary as a giant tree of letter-states. At state0, you have no letters. At state1, you admit any letter that could be the first letter of a word. State2 is conditioned by state1; if the first letter was 'x', the next state admits only letters that could follow x (in position 2). etc.

Now for Levenshtein matching, you traverse the letter-tree while allowing some errors: deletions, insertions (one letter wildcard), and possibly transposal (a nice extension of Levenshtein is to consider a transposal as a single edit rather than 2). You have to maintain state in order to keep track of how many edits have been allowed. This can be done very efficiently - certainly fast enough for an interactive "while you type" spelling suggester.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文