.NET 中的编辑 DFA

发布于 2024-09-28 10:21:56 字数 464 浏览 11 评论 0原文

下午好,

有谁知道 .NET 中 Levenshtein DFA(确定性有限自动机)的“开箱即用”实现(或者很容易翻译成它)?我有一本非常大的字典,其中包含超过 160000 个不同的单词,并且我想,给定一个首字母 w,在 Levenshtein 距离最多 2 个 w 处找到所有已知单词以有效的方式。

当然,拥有一个计算给定单词的编辑距离处所有可能的编辑并将其再次应用于每个编辑的函数可以解决问题(并且以一种非常简单的方式)。问题是效率 --- 给定一个 7 个字母的单词,这可能已经需要 1 秒以上才能完成,我需要更高效的东西 --- 如果可能的话,就像 Levenshtein DFA 一样,需要 O(|w|) 步骤的解决方案。

编辑:我知道我可以通过一点点学习来构建自己的解决问题的方法,但目前我无法阅读 Schulz 和 Mihov 的 60 页长的文章。

非常感谢。

Good afternoon,

Does anyone know of an "out-of-the-box" implementation of Levenshtein DFA (deterministic finite automata) in .NET (or easily translatable to it)? I have a very big dictionary with more than 160000 different words, and I want to, given an inicial word w, find all known words at Levenshtein distance at most 2 of w in an efficient way.

Of course, having a function which computes all possible edits at edit distance one of a given word and applying it again to each of these edits solves the problem (and in a pretty straightforwad way). The problem is effiency --- given a 7 letter word, this can already take over 1 second to complete, and I need something much more efficient --- if possible, as it is with Levenshtein DFAs, a solution that takes O(|w|) steps.

Edit: I know I can construct my own approach to the problem with a little bit of studying, but at the moment I can't afford reading Schulz and Mihov's 60-page-long articles.

Thank you very much.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

我不吻晚风 2024-10-05 10:21:56

我们为 apache lucene java 实现了这个,也许您可​​以将其转换为 C# 并节省时间。

主类在这里:它只是一个使用 Schulz 和 Mihov 算法从字符串中获取 Levenshtein DFA 的构建器。

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/LevenshteinAutomata.java

参数描述(预先计算的Lev1 和 Lev2 的表)在这里:
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/Lev1ParametricDescription.java

http://svn .apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/Lev2ParametricDescription.java

你可能会注意到这些是用计算机生成的,我们生成他们使用这个脚本,使用 Jean-Phillipe Barrette 伟大的 moman 实现 (python)
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/createLevAutomata.py

我们生成参数描述为打包 long[] 数组,这样就不会让我们的 jar 文件太大。

只需修改 toAutomaton(int n) 以满足您的需求/DFA 包。在我们的例子中,我们使用的是金砖四国自动机包的修改形式,其中转换表示为 unicode 代码点范围。

高效的单元测试对于这类事情来说是很困难的,但这就是我们想出的……它似乎很彻底,甚至在 moman 实现中发现了一个错误(作者立即修复了!)。

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/util/automaton/TestLevenshteinAutomata.java

We implemented this for apache lucene java, perhaps you could convert it to C# and save yourself time.

the main class is here: its just a builder to get Levenshtein DFAs from a string, using the Schulz and Mihov algorithm.

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/LevenshteinAutomata.java

the parametric descriptions (the precomputed tables) for Lev1 and Lev2 are here:
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/Lev1ParametricDescription.java

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/Lev2ParametricDescription.java

you might notice these are generated with a computer, we generated them with this script, using Jean-Phillipe Barrette's great moman implementation (python)
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/createLevAutomata.py

we generate the parametric descriptions as packed long[] arrays so that it won't make our jar file too large.

just modify the toAutomaton(int n) to fit your needs/DFA package. in our case we are using a modified form of the brics automaton package, where transitions are represented as unicode codepoint ranges.

efficient unit tests are difficult for this sort of thing, but here is what we came up with... it seems to be thorough and even found a bug (which was fixed immediately by the author!) in the moman implementation.

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/util/automaton/TestLevenshteinAutomata.java

盗梦空间 2024-10-05 10:21:56

干得好。

/// <summary>
/// Levenshtein Distance Calculator
/// </summary>
public static int DistanceFrom(this string s, string t)
{
    int n = s.Length;
    int m = t.Length;
    int[,] d = new int[n + 1, m + 1];

    // Step 1
    if (n == 0)
        return m;

    if (m == 0)
        return n;

    // Step 2
    for(int i = 0; i <= n; d[i, 0] = i++) ;
    for(int j = 0; j <= m; d[0, j] = j++) ;

    // Step 3
    for (int i = 1; i <= n; i++)
    {
        //Step 4
        for (int j = 1; j <= m; j++)
        {
            // Step 5
            int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

            // Step 6
            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);
        }
    }
    // Step 7
    return d[n, m];
}

Here you go.

/// <summary>
/// Levenshtein Distance Calculator
/// </summary>
public static int DistanceFrom(this string s, string t)
{
    int n = s.Length;
    int m = t.Length;
    int[,] d = new int[n + 1, m + 1];

    // Step 1
    if (n == 0)
        return m;

    if (m == 0)
        return n;

    // Step 2
    for(int i = 0; i <= n; d[i, 0] = i++) ;
    for(int j = 0; j <= m; d[0, j] = j++) ;

    // Step 3
    for (int i = 1; i <= n; i++)
    {
        //Step 4
        for (int j = 1; j <= m; j++)
        {
            // Step 5
            int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

            // Step 6
            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);
        }
    }
    // Step 7
    return d[n, m];
}
滴情不沾 2024-10-05 10:21:56

我按照 Robert Muir 的建议将相关的 Lucene Java 代码移植到了 C#。就问题而言,“开箱即用”:这是一项正在进行的工作,但代码似乎可以工作,并且可能可以进一步优化,尽管它确实表现得很好。

您可以在这里找到它: https://github.com/mjvh80/LevenshteinDFA/

更新:看来 Lucene.NET 实际上并没有消亡(还?),我注意到他们现在也有这个代码的移植版本。因此,我建议查看那里( https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Core/Util/Automaton/LevenshteinAutomata.cs)的实现。


¹ 代码需要更多测试
² 因为它可能是 Java 移植到 C# 的,而且因为我编写了一些类(例如 bitset)的简单替换。

I ported the relevant Lucene Java code as suggested by Robert Muir to C#. As far as the question goes and "out of the box": it is a work in progress but the code appears¹ to work and can probably be optimized² further, although it performs very well indeed.

You can find it here: https://github.com/mjvh80/LevenshteinDFA/ .

UPDATE: It appears that Lucene.NET is not in fact dead (yet?) and I noticed they now have a ported version of this code too. I would thus recommend looking there (https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Core/Util/Automaton/LevenshteinAutomata.cs) for an implementation of this.


¹ the code needs more tests
² because it's java ported to C# perhaps and because I wrote naive replacements of some classes (e.g. bitset).

烟若柳尘 2024-10-05 10:21:56

我知道您想在一本大字典中找到相近的匹配项。
这是我的做法。 链接。

据我所知就 DFA 而言,我看不出它有什么更好的地方,甚至实际上有什么不同。
NFA 可能更快,但那是因为它们不存在。
也许我错了。

I understand you want to find near matches in a big dictionary.
Here's the way I do it. link.

From what I'm able to figure out about DFA, I can't see how it's any better, or even actually any different, under the skin.
NFAs might be faster, but that's because they don't exist.
Maybe I'm wrong.

梦萦几度 2024-10-05 10:21:56

Nick Johnson 有一篇关于构建的非常详细的博客文章 Python 中的 Levenshtein 自动机的示例,代码位于此处。这是一本很好的读物,我使用了代码的稍微修改过的版本,我发现它很有效。

迈克邓拉维的回答也很好。我想知道在这种情况下什么是最有效的,trie 搜索还是 Levenshtein DFA?

Nick Johnson has a very detailed blog post about the construction of a Levenshtein automaton in Python, and the code is here. It is a good read, and I have used a slightly modified version of the code that I found efficient.

The answer of Mike Dunlavey is good too. I wonder what is the most efficient in this case, a trie search or a Levenshtein DFA?

白况 2024-10-05 10:21:56

我只想指出,到目前为止,Lucene 和 Lucene.Net 中的 Levenshtein Automaton 实现都使用包含使用 Moman

如果您想要一个能够在内存中从头开始构建此类表的解决方案,您可能需要看看 LevenshteinAutomaton。它是用 Java 编写的,但结构良好、易于理解并且有广泛的注释,因此应该比当前的 Lucene 实现更容易移植到 C#。它也由moi维护。

* 有趣的事实:我提交了 LevenshteinAutomaton 作为 Lucene 中当前 Levenshtein Automaton 实现的替代品或替代品的参考...3 年前

I'd just like to point out that as of now, the Levenshtein Automaton implementations in both Lucene and Lucene.Net make use of files containing parametric state tables (tables of abstract states which describe the concrete states in an automaton) created using Moman.

If you want a solution capable of constructing such tables from scratch in memory, you might want to have a look at LevenshteinAutomaton. It's in Java, but it is well-structured, easy to follow, and extensively commented, and as such should be easier to port to C# than the current Lucene implementation. It is also maintained by moi.

* Fun fact: I submitted LevenshteinAutomaton as a replacement, or as a reference for a replacement, to the current Levenshthein Automaton implementation in Lucene... 3 years ago.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文