更快的 C#(或其他 .NET)Levenshtein 距离实现
晚安,
我已经研究模糊字符串匹配有一段时间了,并且使用 C 和一些指针,我可以编写一个非常快速的(根据我的需要)实现两个字符串之间的 Levenshtein 距离。我尝试使用不安全代码和 fixed
关键字将代码移植到 C#,但性能要慢得多。因此,我选择构建一个 C++ dll 并使用 C# 中的 [DllImport]
,自动编组每个字符串。问题是,在分析之后,这仍然是我的程序中最耗时的部分,占用了程序总运行时间的 50-57%。因为我认为我需要对来自大约 300 万条数据库记录的文本字段的大量子字符串进行一些繁重的工作,所以我认为 Levenshtein 距离所花费的时间几乎是不可接受的。也就是说,我想知道您是否对下面的代码有任何算法或编程相关的建议,或者您是否知道有更好的算法来计算这个距离?
#define Inicio1 (*(BufferVar))
#define Inicio2 (*(BufferVar+1))
#define Fim1 (*(BufferVar+2))
#define Fim2 (*(BufferVar+3))
#define IndLinha (*(BufferVar+4))
#define IndCol (*(BufferVar+5))
#define CompLinha (*(BufferVar+6))
#define TamTmp (*(BufferVar+7))
int __DistanciaEdicao (char * Termo1, char * Termo2, int TamTermo1, int TamTermo2, int * BufferTab, int * BufferVar)
{
*(BufferVar) = *(BufferVar + 1) = 0;
*(BufferVar + 2) = TamTermo1 - 1;
*(BufferVar + 3) = TamTermo2 - 1;
while ((Inicio1 <= *(BufferVar + 2)) && (Inicio2 <= *(BufferVar + 3)) && *(Termo1 + Inicio1) == *(Termo2 + Inicio2))
Inicio1 = ++Inicio2;
if (Inicio2 > Fim2) return (Fim1 - Inicio1 + 1);
while ((Fim1 >= 0) && (Fim2 >= 0) && *(Termo1 + Fim1) == *(Termo2 + Fim2))
{ Fim1--; Fim2--;}
if (Inicio2 > Fim2) return (Fim1 - Inicio1 + 1);
TamTermo1 = Fim1 - Inicio1 + 1;
TamTermo2 = Fim2 - Inicio2 + 1;
CompLinha = ((TamTermo1 > TamTermo2) ? TamTermo1 : TamTermo2) + 1;
for (IndLinha = 0; IndLinha <= TamTermo2; *(BufferTab + CompLinha * IndLinha) = IndLinha++);
for (IndCol = 0; IndCol <= TamTermo1; *(BufferTab + IndCol) = IndCol++);
for (IndCol = 1; IndCol <= TamTermo1; IndCol++)
for (IndLinha = 1; IndLinha <= TamTermo2; IndLinha++)
*(BufferTab + CompLinha * IndLinha + IndCol) = ((*(Termo1 + (IndCol + Inicio1 - 1)) == *(Termo2 + (IndLinha + Inicio2 - 1))) ? *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)) : ((*(BufferTab + CompLinha * (IndLinha - 1) + IndCol) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))) ? ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + IndCol)) ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1)) : *(BufferTab + CompLinha * (IndLinha - 1) + IndCol)) : ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))) ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1)) : *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)))) + 1);
return *(BufferTab + CompLinha * TamTermo2 + TamTermo1);
}
请注意,BufferVar 和 BufferTab 是两个外部 int *
(在这种情况下,int[]
变量是从 C# 编组的),我不会在每个函数调用中实例化它们整个过程更快。尽管如此,这段代码对于我的需求来说还是相当慢。谁能给我一些建议,或者如果可能的话,提供一些更好的代码?
编辑:距离没有限制,我需要实际距离。
非常感谢,
Good night,
I have been working with fuzzy string matching for some time now, and using C with some pointers I could write a very fast (for my needs) implementation of the Levenshtein distance between two strings. I tried to port the code to C# using unsafe code and the fixed
keyword, but the performance was way slower. So I chose to build a C++ dll and use [DllImport]
from C#, automatically marshalling every string. The problem is that, after profiling, this keeps being the most time-consuming part of my program, taking between 50-57% of the total running time of the program. Since I think I will need to do some heavy work with lots of substrings of a text field coming from some 3-million database records, I think the time the Levenshtein distance is taking is almost unacceptable. That being, I would like to know if you have any suggestions, both algorithmic or programming-related, to the code below, or if you know of any better algorithm to calculate this distance?
#define Inicio1 (*(BufferVar))
#define Inicio2 (*(BufferVar+1))
#define Fim1 (*(BufferVar+2))
#define Fim2 (*(BufferVar+3))
#define IndLinha (*(BufferVar+4))
#define IndCol (*(BufferVar+5))
#define CompLinha (*(BufferVar+6))
#define TamTmp (*(BufferVar+7))
int __DistanciaEdicao (char * Termo1, char * Termo2, int TamTermo1, int TamTermo2, int * BufferTab, int * BufferVar)
{
*(BufferVar) = *(BufferVar + 1) = 0;
*(BufferVar + 2) = TamTermo1 - 1;
*(BufferVar + 3) = TamTermo2 - 1;
while ((Inicio1 <= *(BufferVar + 2)) && (Inicio2 <= *(BufferVar + 3)) && *(Termo1 + Inicio1) == *(Termo2 + Inicio2))
Inicio1 = ++Inicio2;
if (Inicio2 > Fim2) return (Fim1 - Inicio1 + 1);
while ((Fim1 >= 0) && (Fim2 >= 0) && *(Termo1 + Fim1) == *(Termo2 + Fim2))
{ Fim1--; Fim2--;}
if (Inicio2 > Fim2) return (Fim1 - Inicio1 + 1);
TamTermo1 = Fim1 - Inicio1 + 1;
TamTermo2 = Fim2 - Inicio2 + 1;
CompLinha = ((TamTermo1 > TamTermo2) ? TamTermo1 : TamTermo2) + 1;
for (IndLinha = 0; IndLinha <= TamTermo2; *(BufferTab + CompLinha * IndLinha) = IndLinha++);
for (IndCol = 0; IndCol <= TamTermo1; *(BufferTab + IndCol) = IndCol++);
for (IndCol = 1; IndCol <= TamTermo1; IndCol++)
for (IndLinha = 1; IndLinha <= TamTermo2; IndLinha++)
*(BufferTab + CompLinha * IndLinha + IndCol) = ((*(Termo1 + (IndCol + Inicio1 - 1)) == *(Termo2 + (IndLinha + Inicio2 - 1))) ? *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)) : ((*(BufferTab + CompLinha * (IndLinha - 1) + IndCol) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))) ? ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + IndCol)) ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1)) : *(BufferTab + CompLinha * (IndLinha - 1) + IndCol)) : ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))) ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1)) : *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)))) + 1);
return *(BufferTab + CompLinha * TamTermo2 + TamTermo1);
}
Please note that BufferVar and BufferTab are two external int *
(in th case, int[]
variables being marshalled from C#) which I do not instantiate in every function call to make the whole process faster. Still, this code is pretty slow for my needs. Can anyone please give me some suggestions, or, if possible, provide some better code?
Edit: The distance can't be bounded, I need the actual distance.
Thank you very much,
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
1.暴力破解
这是 Levenshtein Distance 在 Python 中的实现。
这是典型的动态规划算法,只需利用由于只需要最后一行,因此一次只保留两行就足够了。
对于 C++ 实现,您可以查看 LLVM 的一个(第 70-130 行),注意使用固定大小的堆栈分配数组,仅在必要时替换为动态分配数组。
我只是无法跟踪你的代码来尝试诊断它......所以让我们改变攻击角度。我们不会对距离进行微观优化,而是完全改变算法。
2.做得更好:使用字典
您面临的问题之一是您可以做得更好。
第一个注释是距离是对称的,尽管它不会改变整体复杂性,但会将所需时间减半。
第二个是,由于您实际上有一个已知单词的字典,因此您可以在此基础上构建:“actor”和“actual”共享一个共同的前缀(“act”),因此您无需重新计算第一阶段。
可以使用 Trie(或任何其他排序结构)来存储您的单词。接下来,您将采用一个单词,并利用前缀计算其相对于字典中存储的所有单词的距离。
举个例子,
dic = ["actor", "actual", "addict", "atchoum"]
,我们要计算word = "atchoum"
的距离(此时我们将其从字典中删除)matrix = [[0, 1, 2, 3, 4, 5, 6, 7]]
[[0, 1, 2, 3, 4, 5, 6, 7], [1, 0, 1, 2, 3, 4, 5 , 6]]
[[0, 1, 2, 3, 4, 5, 6, 7], [1, 0, 1, 2, 3, 4 , 5, 6], [2, 1, 1, 2, 3, 4, 5, 6]]
[[..], [..], [..], [..]]
[[..], [..], [..], [..], [..]]
这里重要的是倒带步骤,通过保留对前一个单词所做的计算(与该单词共享一个良好长度的前缀),您可以有效地节省大量工作。
请注意,这是通过简单的堆栈轻松实现的,不需要任何递归调用。
1. Brute Force
Here is an implementation of the Levenshtein Distance in Python.
It's the typical dynamic programming algorithm, just taking advantage that since one only need the last row, keeping only two rows at a time is sufficient.
For a C++ implementation, you might look at LLVM's one (line 70-130), note the use of a stack allocated array of fixed size, replaced only when necessary by a dynamically allocated array.
I just can't follow up your code to try and diagnose it... so let's change the angle of attack. Instead of micro-optimizing the distance, we'll change the algorithm altogether.
2. Doing better: using a Dictionary
One of the issue you face is that you could do much better.
The first remark is that the distance is symmetric, though it doesn't change the overall complexity it will halve the time necessary.
The second is that since you actually have a dictionary of known words, you can build on that: "actor" and "actual" share a common prefix ("act") and thus you need not recompute the first stages.
This can be exploited using a Trie (or any other sorted structure) to store your words. Next you will take one word, and compute its distance relatively to all of the words stored in the dictionary, taking advantage of the prefixes.
Let's take an example
dic = ["actor", "actual", "addict", "atchoum"]
and we want to compute the distance forword = "atchoum"
(we remove it from the dictionary at this point)matrix = [[0, 1, 2, 3, 4, 5, 6, 7]]
[[0, 1, 2, 3, 4, 5, 6, 7], [1, 0, 1, 2, 3, 4, 5, 6]]
[[0, 1, 2, 3, 4, 5, 6, 7], [1, 0, 1, 2, 3, 4, 5, 6], [2, 1, 1, 2, 3, 4, 5, 6]]
[[..], [..], [..], [..]]
[[..], [..], [..], [..], [..]]
What's important here is the rewind step, by preserving the computation done for the previous word, with which you share a good-length prefix, you effectively save a lot of work.
Note that this is trivially implemented with a simple stack and does not require any recursive call.
首先尝试简单的方法 - 不要使用指针和不安全代码 - 只需编写普通的 C# 代码...但使用正确的算法。
Wikipedia 上有一个简单高效的算法,它使用动态规划并运行 O(n* m) 其中 n 和 m 是输入的长度。我建议您首先尝试实现该算法,正如那里所描述的那样,只有在实现它、测量性能并发现它不足之后才开始优化它。
另请参阅可能的改进部分,其中显示:
如果我不得不猜测问题出在哪里,我可能会首先查看在 two 循环内运行的这一行:
*(BufferTab + CompLinha * IndLinha + IndCol) = ((*(Termo1 + (IndCol + Inicio1 - 1)) == *(Termo2 + (IndLinha + Inicio2 - 1))) ? *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)) : ( (*(BufferTab + CompLinha * (IndLinha - 1) + IndCol) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))) ? ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + IndCol)) ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1)) : *(BufferTab + CompLinha * (IndLinha - 1) + IndCol) ) : ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))) ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1)) : *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)))) + 1);
似乎有一个 那里有很多重复,尽管我很难准确地发现发生了什么。你能把其中一些因素排除掉吗?而且您肯定需要使其更具可读性。
Try the simple approach first - don't use pointers and unsafe code - just code plain ordinary C#... but use the correct algorithm.
There is a simple and efficient algorithm on Wikipedia that uses dynamic programming and runs O(n*m) where n and m are the lengths of the inputs. I suggest you try implementing that algorithm first, as it is described there and only start optimizing it after you've implemented it, measured the performance and found it to be insufficient.
See also the section Possible improvements where it says:
If I had to guess where the problem is I'd probably start by looking at this line that runs inside two loops:
*(BufferTab + CompLinha * IndLinha + IndCol) = ((*(Termo1 + (IndCol + Inicio1 - 1)) == *(Termo2 + (IndLinha + Inicio2 - 1))) ? *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)) : ((*(BufferTab + CompLinha * (IndLinha - 1) + IndCol) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))) ? ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + IndCol)) ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1)) : *(BufferTab + CompLinha * (IndLinha - 1) + IndCol)) : ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))) ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1)) : *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)))) + 1);
There appears to be a lot of duplication there though it's hard for me to spot exactly what's going on. Could you factor some of that out? And you definitely need to make it more readable.
您不应该使用 Levenshtein 距离算法尝试所有可能的单词。您应该使用另一个更快的指标来过滤掉可能的候选者,然后才使用编辑来消除歧义。第一个筛选可以基于 n 元语法(三元语法通常效果很好)频率直方图或哈希函数。
You shouldn't try all your possible words with the Levenshtein distance algortihm. You should use another faster metric to filter out the likely candidates and only on then use the Levenshtein to remove ambiguity. The first sieve can be based on a n-gram (trigram works often well) frequency histogram or a hash function.