关于编辑距离的问题

发布于 2024-07-19 02:18:12 字数 573 浏览 7 评论 0原文

1)为什么我们要在这一行加1?

    d[i-1, j] + 1, // deletion 
    d[i, j-1] + 1, // insertion 

该行

if s[i] = t[j] then cost := 0

        else cost := 1 

应该考虑删除/较低的字长,或者我遗漏了什么?

2)此外,注释状态为删除和插入。 我是否正确地认为它正在检查两个单词中的已删除字符(整数 j/i 代表单词的长度),因为较低的值将代表已删除的字符。

使用的代码在这里(因为它是伪代码,并且我没有特定于语言的问题,该线程不属于任何语言类别):

http://www.iterasi.net/openviewer.aspx?sqrlitid=z0cloj7xhk-ce0f72v4cjq

1) Why do we add 1 on these line?

    d[i-1, j] + 1, // deletion 
    d[i, j-1] + 1, // insertion 

The line

if s[i] = t[j] then cost := 0

        else cost := 1 

should take into account deleted/lower word lengths, or am I missing something?

2) Also, the comments state deletion and insertion. Am I right in thinking that it's checking for deleted characters in both words (the integers j/i representing the length of words), because a lower value will represent deleted characters.

The code used is here (because it is pseudo code and I have no language specific issues, this thread is not in any language category):

http://www.iterasi.net/openviewer.aspx?sqrlitid=z0cloj7xhk-ce0f72v4cjq

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

各空 2024-07-26 02:18:13

您读过 http://www.merriampark.com/ld.htm 吗?

您正在计算将一个字符串转换为另一个字符串所需的转换成本(插入和删除的次数)。

这种变换的“成本”表示两个字符串之间的距离。

交换呢? 这是 Damerau–Levenshtein 算法,这是不同的。 包括交流并没有多大改善。

本质是在两个单词之间创建一个矩阵,并逐列计算每个单词的每个字母到另一个单词的每个字母的“距离”。 该矩阵的右下角是考虑到所有字母的总距离。

问题1)

“上面”的单元格反映了更改的历史记录,并且该行的字符(通常)与此不同,因此该单元格是相对于它的删除。

“左”单元格反映了更改的历史记录,并且该列的字符(通常)与此不同,因此该单元格是相对于它的插入。

唯一一次通常会出现错误的是具有三个字母序列的单词。 英语中很少见。

行列比较的成本为 0 或 1。

“历史加一次更改”与更改的实际成本中的最小值为适用成本。

问题 2)

变量 ij 不是任何长度。 它们是比较矩阵中的位置。 “插入”和“删除”是将一个单词转换为另一个单词所需的操作。 插入/删除操作的计数是单词之间的距离。

Have you read http://www.merriampark.com/ld.htm ?

You're computing the cost of transformation -- the number of inserts and deletes -- required to make one string into another.

This "cost" to transform is indicative of the distance between the two strings.

What about exchanges? That's the Damerau–Levenshtein algorithm, which is different. Including exchanges doesn't improve things much.

The essence is to create a matrix between the two words and compute -- column by column -- the "distance" from each letter of each word to each letter of the other word. The lower right hand corner of that matrix is the total distance, taking into account all of the letters.

Question 1)

The cell "above" reflects a history of changes, and the character for that row is (usually) different from this, so this cell is a deletion relative to it.

The cell "left" reflects a history of changes, and the character for that column is is (usually) different from this, so this cell is an insertion relative to it.

The only time this usually would be way wrong is words with a triple-letter sequence. Rare in English.

The row-column comparison has a cost of 0 or 1.

The minimum of "history plus one change" and the actual cost of a change is the applicable cost.

Question 2)

The variables i and j aren't lengths of anything. They're positions in the comparison matrix. The "insertion" and "deletion" is the action required to transform one word into the other. The count of insert/delete actions is the distance between the words.

五里雾 2024-07-26 02:18:13

1)这些行计算删除情况下的距离,插入情况下的距离,以及替换情况下使用“成本”的距离...

删除和插入在距离计算中实际上算作“1”,因此+1。

我们可以相信,仅当字符不同时才存在替换,因此如果两个字符相等,则“cost=0”...

新距离就是这 3 个假设之间的最小距离,因此您不需要并不总是添加 1 ...

2)如果我计算“FooBar”和“FoBaWhatever”之间的距离,即使第二个字符串比第一个字符串长,我也会删除一些字符...

当然,如果第二个字符串较短比第二个( FooBar -> FoBa )我会发现一些删除,但无法提前知道它们在哪里......

1) These lines compute the distance in the case of deletion, in the case of insertion, and the one using "cost" in case of a substitution...

deletion and insertion effectively count as "1" in the distance calculation, hence the +1.

We can believe there was a substitution only if the characters are different hence the "cost=0" if both chars are equal...

The new distance is then the minimum distance between these 3 hypothesis so you don't always add 1 ...

2) if I compute the distance between "FooBar" and "FoBaWhatever" I have some character deletions even if the second string is longer than the first one ...

Of course if the second string is shorter than the second ( FooBar -> FoBa ) I will find some deletions but cannot know in advance where they are...

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文