快速计算大字符串之间距离的 C# 代码或算法?
您好,感谢您的浏览!
背景
我有一个包含 1900 个节点的 XML 文件,这些节点本身包含一串大约 3400 个字符的编码数据。
作为我正在开发的应用程序用例的一部分,我需要能够在运行时获取“基准”字符串,并从 XML 文件中找到最接近匹配。
请注意,XML 与应用程序没有密切关系,我可能会继续使用 SQL,但今天,我只需要一个简单的地方来存储数据并证明这个概念。
我正在使用 .NET 4.0、C#、表单应用程序、LINQ 等。
问题
如何找到最接近的匹配项?汉明?编辑?网上有很多代码示例,但大多数都是针对小字符串比较(“ant”与“aunt”)或精确匹配。我很少有完全匹配;我只需要最接近匹配。
提前致谢!
马特
Hi and thanks for looking!
Background
I have an XML file that contains 1900 nodes which themselves contain a string of encoded data of about 3400 characters.
As part of a use case for an application I am developing, I need to be able to take a "benchmark" string at runtime, and find the closest match from the XML file.
Please note that XML is not germane to the app and that I may go with SQL moving forward, but for today, I just needed an easy place to store the data and prove the concept.
I am using .NET 4.0, C#, forms app, LINQ, etc.
Question
How do I find the closest match? Hamming? Levenshtein? There are plenty of code samples online, but most are geared towards small string comparisons ("ant" vs. "aunt") or exact match. I will rarely have exact matches; I just need closest match.
Thanks in advance!
Matt
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您提到使用 Levenhstein 的编辑距离,并且您的字符串长度约为 3400 个字符。
我做了一个快速尝试,并使用 Levenhstein 的编辑距离的动态编程版本,它似乎相当快并且没有引起任何问题。
我这样做了:
它在 2006 年左右的 Core 2 Duo 上以 215 毫秒的时间计算出距离。
这对你有用吗?
(顺便说一句,我不确定是否可以粘贴我这里的 DP LED 实现的代码,因此您可能应该在 Internet 上搜索一个 Java 实现)
You mentioned using Levenhstein's Edit Distance and that your strings were about 3400 characters long.
I did a quick try and using the dynamic programming version of Levenhstein's Edit Distance it seems to be quite fast and cause no issue.
I did this:
And it's finding the distance in 215 ms on a Core 2 Duo from 2006 or so.
Would that work for you?
(btw I'm not sure I can paste the code for the DP LED implementation I've got here so you probably should search the Internet for one Java implementation)