快速计算大字符串之间距离的 C# 代码或算法?

发布于 2024-12-25 17:10:39 字数 412 浏览 1 评论 0原文

您好,感谢您的浏览!

背景

我有一个包含 1900 个节点的 XML 文件,这些节点本身包含一串大约 3400 个字符的编码数据。

作为我正在开发的应用程序用例的一部分,我需要能够在运行时获取“基准”字符串,并从 XML 文件中找到最接近匹配。

请注意,XML 与应用程序没有密切关系,我可能会继续使用 SQL,但今天,我只需要一个简单的地方来存储数据并证明这个概念。

我正在使用 .NET 4.0、C#、表单应用程序、LINQ 等。

问题

如何找到最接近的匹配项?汉明?编辑?网上有很多代码示例,但大多数都是针对小字符串比较(“ant”与“aunt”)或精确匹配。我很少有完全匹配;我只需要最接近匹配。

提前致谢!

马特

Hi and thanks for looking!

Background

I have an XML file that contains 1900 nodes which themselves contain a string of encoded data of about 3400 characters.

As part of a use case for an application I am developing, I need to be able to take a "benchmark" string at runtime, and find the closest match from the XML file.

Please note that XML is not germane to the app and that I may go with SQL moving forward, but for today, I just needed an easy place to store the data and prove the concept.

I am using .NET 4.0, C#, forms app, LINQ, etc.

Question

How do I find the closest match? Hamming? Levenshtein? There are plenty of code samples online, but most are geared towards small string comparisons ("ant" vs. "aunt") or exact match. I will rarely have exact matches; I just need closest match.

Thanks in advance!

Matt

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

淤浪 2025-01-01 17:10:39

您提到使用 Levenhstein 的编辑距离,并且您的字符串长度约为 3400 个字符。

我做了一个快速尝试,并使用 Levenhstein 的编辑距离的动态编程版本,它似乎相当快并且没有引起任何问题。

我这样做了:

        final StringBuilder sb1 = new StringBuilder();
        final StringBuilder sb2 = new StringBuilder();
        final Random r = new Random(42);
        final int n = 3400;
        for (int i = 0; i < n; i++) {
            sb1.append( (char) ('a' + r.nextInt(26)) );
            sb2.append( (char) ('a' + r.nextInt(26)) );
        }
        final long t0 = System.currentTimeMillis();
        System.out.println("LED: " + getLevenshteinDistance(sb1.toString(), sb2.toString()) );
        final long te = System.currentTimeMillis() - t0;
        System.out.println("Took: " + te + " ms");

它在 2006 年左右的 Core 2 Duo 上以 215 毫秒的时间计算出距离。

这对你有用吗?

(顺便说一句,我不确定是否可以粘贴我这里的 DP LED 实现的代码,因此您可能应该在 Internet 上搜索一个 Java 实现)

You mentioned using Levenhstein's Edit Distance and that your strings were about 3400 characters long.

I did a quick try and using the dynamic programming version of Levenhstein's Edit Distance it seems to be quite fast and cause no issue.

I did this:

        final StringBuilder sb1 = new StringBuilder();
        final StringBuilder sb2 = new StringBuilder();
        final Random r = new Random(42);
        final int n = 3400;
        for (int i = 0; i < n; i++) {
            sb1.append( (char) ('a' + r.nextInt(26)) );
            sb2.append( (char) ('a' + r.nextInt(26)) );
        }
        final long t0 = System.currentTimeMillis();
        System.out.println("LED: " + getLevenshteinDistance(sb1.toString(), sb2.toString()) );
        final long te = System.currentTimeMillis() - t0;
        System.out.println("Took: " + te + " ms");

And it's finding the distance in 215 ms on a Core 2 Duo from 2006 or so.

Would that work for you?

(btw I'm not sure I can paste the code for the DP LED implementation I've got here so you probably should search the Internet for one Java implementation)

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文