在 Python 中编辑距离

发布于 2024-08-25 14:46:19 字数 361 浏览 11 评论 0原文

我正在用 Python 编写一个拼写检查程序。我有一个有效单词列表（字典），我需要从该字典中输出一个单词列表，这些单词与给定的无效单词的编辑距离为 2。

我知道我需要首先生成一个与无效单词的编辑距离为 1 的列表（然后在所有生成的单词上再次运行该列表）。我有三种方法，inserts(...)、deletions(...) 和changes(...)，它们应该输出编辑距离为1 的单词列表，其中inserts 输出所有有效单词，其中多一个字母对于给定的单词，删除输出所有少一个字母的有效单词，更改输出所有带有一个不同字母的有效单词。

我检查了很多地方，但似乎找不到描述此过程的算法。我提出的所有想法都涉及多次循环字典列表，这将非常耗时。如果有人能够提供一些见解，我将非常感激。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

绿光 2024-09-01 14:46:20

您需要最小编辑距离来完成此任务。

以下是我的 MED 版本，又名 Levenshtein Distance。

def MED_character(str1,str2):
    cost=0
    len1=len(str1)
    len2=len(str2)

    #output the length of other string in case the length of any of the string is zero
    if len1==0:
        return len2
    if len2==0:
        return len1

    accumulator = [[0 for x in range(len2)] for y in range(len1)] #initializing a zero matrix

    # initializing the base cases
    for i in range(0,len1):
        accumulator[i][0] = i;
    for i in range(0,len2):
        accumulator[0][i] = i;

    # we take the accumulator and iterate through it row by row. 
    for i in range(1,len1):
        char1=str1[i]
        for j in range(1,len2):
            char2=str2[j]
            cost1=0
            if char1!=char2:
                cost1=2 #cost for substitution
            accumulator[i][j]=min(accumulator[i-1][j]+1, accumulator[i][j-1]+1, accumulator[i-1][j-1] + cost1 )

    cost=accumulator[len1-1][len2-1]
    return cost

You need Minimum Edit Distance for this task.

Following is my version of MED a.k.a Levenshtein Distance.

def MED_character(str1,str2):
    cost=0
    len1=len(str1)
    len2=len(str2)

    #output the length of other string in case the length of any of the string is zero
    if len1==0:
        return len2
    if len2==0:
        return len1

    accumulator = [[0 for x in range(len2)] for y in range(len1)] #initializing a zero matrix

    # initializing the base cases
    for i in range(0,len1):
        accumulator[i][0] = i;
    for i in range(0,len2):
        accumulator[0][i] = i;

    # we take the accumulator and iterate through it row by row. 
    for i in range(1,len1):
        char1=str1[i]
        for j in range(1,len2):
            char2=str2[j]
            cost1=0
            if char1!=char2:
                cost1=2 #cost for substitution
            accumulator[i][j]=min(accumulator[i-1][j]+1, accumulator[i][j-1]+1, accumulator[i-1][j-1] + cost1 )

    cost=accumulator[len1-1][len2-1]
    return cost

回复收藏 0 原文

苏璃陌 2024-09-01 14:46:20

根据@Santosh 的版本微调代码，并应解决@Artur Krajewski 提出的问题；最大的区别是替换了有效的二维矩阵


def edit_distance(s1, s2):
# add a blank character for both strings
    m=len(s1)+1
    n=len(s2)+1
# launch a matrix
    tbl = [[0] * n for i in range(m)] 
    for i in range(m): tbl[i][0]=i
    for j in range(n): tbl[0][j]=j

    for i in range(1, m):
        for j in range(1, n):
#if strings have same letters, set operation cost as 0 otherwise 1
            cost = 0 if s1[i-1] == s2[j-1] else 1
#find min practice
            tbl[i][j] = min(tbl[i][j-1]+1, tbl[i-1][j]+1, tbl[i-1][j-1]+cost)
    return tbl

edit_distance("birthday", "Birthdayyy")

Fine tuned codes based on the version from @Santosh and should address the issue brought up by @Artur Krajewski; The biggest difference is replacing an effective 2d matrix


def edit_distance(s1, s2):
# add a blank character for both strings
    m=len(s1)+1
    n=len(s2)+1
# launch a matrix
    tbl = [[0] * n for i in range(m)] 
    for i in range(m): tbl[i][0]=i
    for j in range(n): tbl[0][j]=j

    for i in range(1, m):
        for j in range(1, n):
#if strings have same letters, set operation cost as 0 otherwise 1
            cost = 0 if s1[i-1] == s2[j-1] else 1
#find min practice
            tbl[i][j] = min(tbl[i][j-1]+1, tbl[i-1][j]+1, tbl[i-1][j-1]+cost)
    return tbl

edit_distance("birthday", "Birthdayyy")

回复收藏 0 原文

甜味拾荒者 2024-09-01 14:46:20

跟进@krassowski的回答

from difflib import SequenceMatcher

def sequence_matcher_edits(word_a, word_b):
  required_edits = [code for code in (
      SequenceMatcher(a=word_a, b=word_b, autojunk=False).get_opcodes()
    )
    if code[0] != 'equal'
  ]
  return len(required_edits)

print(f"sequence_matcher_edits {sequence_matcher_edits('kitten', 'sitting')}")
# -> sequence_matcher_edits 3

following up on @krassowski's answer

from difflib import SequenceMatcher

def sequence_matcher_edits(word_a, word_b):
  required_edits = [code for code in (
      SequenceMatcher(a=word_a, b=word_b, autojunk=False).get_opcodes()
    )
    if code[0] != 'equal'
  ]
  return len(required_edits)

print(f"sequence_matcher_edits {sequence_matcher_edits('kitten', 'sitting')}")
# -> sequence_matcher_edits 3

回复收藏 0 原文

燕归巢 2024-09-01 14:46:19

您所看到的东西称为编辑距离，这里有一个 wiki 上的很好的解释。有很多方法可以定义两个单词之间的距离，并且您想要的距离称为编辑距离，这里是 python 中的 DP（动态编程）实现。

def levenshteinDistance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

这里还有更多实现。

The thing you are looking at is called an edit distance and here is a nice explanation on wiki. There are a lot of ways how to define a distance between the two words and the one that you want is called Levenshtein distance and here is a DP (dynamic programming) implementation in python.

def levenshteinDistance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

And a couple of more implementations are here.

回复收藏 0 原文

岁月静好 2024-09-01 14:46:19

标准库中的 difflib 有各种用于序列的实用程序匹配，包括您可以使用的 get_close_matches 方法。它使用改编自 Ratcliff 和 Obershelp 的算法。

来自文档

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']

difflib in the standard library has various utilities for sequence matching, including the get_close_matches method that you could use. It uses an algorithm adapted from Ratcliff and Obershelp.

From the docs

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']

回复收藏 0 原文

美人骨 2024-09-01 14:46:19

我建议不要自己创建此类代码。有一些库可以做到这一点。

例如 Levenshtein 库。


In [2]: Levenshtein.distance("foo", "foobar")
Out[2]: 3

In [3]: Levenshtein.distance("barfoo", "foobar")
Out[3]: 6

In [4]: Levenshtein.distance("Buroucrazy", "Bureaucracy")
Out[4]: 3

In [5]: Levenshtein.distance("Misisipi", "Mississippi")
Out[5]: 3

In [6]: Levenshtein.distance("Misisipi", "Misty Mountains")
Out[6]: 11

In [7]: Levenshtein.distance("Buroucrazy", "Born Crazy")
Out[7]: 4

I would recommend not creating this kind of code on your own. There are libraries for that.

For instance the Levenshtein library.


In [2]: Levenshtein.distance("foo", "foobar")
Out[2]: 3

In [3]: Levenshtein.distance("barfoo", "foobar")
Out[3]: 6

In [4]: Levenshtein.distance("Buroucrazy", "Bureaucracy")
Out[4]: 3

In [5]: Levenshtein.distance("Misisipi", "Mississippi")
Out[5]: 3

In [6]: Levenshtein.distance("Misisipi", "Misty Mountains")
Out[6]: 11

In [7]: Levenshtein.distance("Buroucrazy", "Born Crazy")
Out[7]: 4

回复收藏 0 原文

柳若烟 2024-09-01 14:46:19

这是我的编辑距离版本

def edit_distance(s1, s2):
    m=len(s1)+1
    n=len(s2)+1

    tbl = {}
    for i in range(m): tbl[i,0]=i
    for j in range(n): tbl[0,j]=j
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            tbl[i,j] = min(tbl[i, j-1]+1, tbl[i-1, j]+1, tbl[i-1, j-1]+cost)

    return tbl[i,j]

print(edit_distance("Helloworld", "HalloWorld"))

Here is my version for Levenshtein distance

def edit_distance(s1, s2):
    m=len(s1)+1
    n=len(s2)+1

    tbl = {}
    for i in range(m): tbl[i,0]=i
    for j in range(n): tbl[0,j]=j
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            tbl[i,j] = min(tbl[i, j-1]+1, tbl[i-1, j]+1, tbl[i-1, j-1]+cost)

    return tbl[i,j]

print(edit_distance("Helloworld", "HalloWorld"))

回复收藏 0 原文

并安 2024-09-01 14:46:19

#this calculates edit distance not levenstein edit distance
word1="rice"

word2="ice"

len_1=len(word1)

len_2=len(word2)

x =[[0]*(len_2+1) for _ in range(len_1+1)]#the matrix whose last element ->edit distance

for i in range(0,len_1+1): #initialization of base case values

    x[i][0]=i
for j in range(0,len_2+1):

    x[0][j]=j
for i in range (1,len_1+1):

    for j in range(1,len_2+1):

        if word1[i-1]==word2[j-1]:
            x[i][j] = x[i-1][j-1] 

        else :
            x[i][j]= min(x[i][j-1],x[i-1][j],x[i-1][j-1])+1

print x[i][j]

#this calculates edit distance not levenstein edit distance
word1="rice"

word2="ice"

len_1=len(word1)

len_2=len(word2)

x =[[0]*(len_2+1) for _ in range(len_1+1)]#the matrix whose last element ->edit distance

for i in range(0,len_1+1): #initialization of base case values

    x[i][0]=i
for j in range(0,len_2+1):

    x[0][j]=j
for i in range (1,len_1+1):

    for j in range(1,len_2+1):

        if word1[i-1]==word2[j-1]:
            x[i][j] = x[i-1][j-1] 

        else :
            x[i][j]= min(x[i][j-1],x[i-1][j],x[i-1][j-1])+1

print x[i][j]

回复收藏 0 原文

和影子一齐双人舞 2024-09-01 14:46:19

使用 Python 构建的 SequenceMatcher -in difflib 是另一种方法，但是（正如注释中正确指出的那样），结果与编辑距离的定义不完全匹配。奖励：它支持忽略“垃圾”部分（例如空格或标点符号）。

from difflib import SequenceMatcher

a = 'kitten'
b = 'sitting'

required_edits = [
    code
    for code in (
        SequenceMatcher(a=a, b=b, autojunk=False)
        .get_opcodes()
    )
    if code[0] != 'equal'
]
required_edits
# [
#    # (tag, i1, i2, j1, j2)
#    ('replace', 0, 1, 0, 1), # replace a[0:1]="k" with b[0:1]="s"
#    ('replace', 4, 5, 4, 5), # replace a[4:5]="e" with b[4:5]="i"
#    ('insert', 6, 6, 6, 7),  # insert b[6:7]="g" after a[6:6]="n"
# ]


# the edit distance:
len(required_edits)  # == 3

Using the SequenceMatcher from Python built-in difflib is another way of doing it, but (as correctly pointed out in the comments), the result does not match the definition of an edit distance exactly. Bonus: it supports ignoring "junk" parts (e.g. spaces or punctuation).

from difflib import SequenceMatcher

a = 'kitten'
b = 'sitting'

required_edits = [
    code
    for code in (
        SequenceMatcher(a=a, b=b, autojunk=False)
        .get_opcodes()
    )
    if code[0] != 'equal'
]
required_edits
# [
#    # (tag, i1, i2, j1, j2)
#    ('replace', 0, 1, 0, 1), # replace a[0:1]="k" with b[0:1]="s"
#    ('replace', 4, 5, 4, 5), # replace a[4:5]="e" with b[4:5]="i"
#    ('insert', 6, 6, 6, 7),  # insert b[6:7]="g" after a[6:6]="n"
# ]


# the edit distance:
len(required_edits)  # == 3

回复收藏 0 原文

甜宝宝 2024-09-01 14:46:19

与上面 Santoshi 的解决方案类似，但我做了三处更改：

一行初始化而不是五行初始化
不需要单独定义成本（只需使用 int(boolean) 0 或 1）
而不是双 for 循环使用乘积，（最后一个只是装饰性的，双循环似乎是不可避免的）

from itertools import product

def edit_distance(s1,s2):      
   d={ **{(i,0):i for i in range(len(s1)+1)},**{(0,j):j for j in range(len(s2)+1)}}
   for i, j in product(range(1,len(s1)+1), range(1,len(s2)+1)): 
       d[i,j]=min((s1[i-1]!=s2[j-1]) + d[i-1,j-1], d[i-1,j]+1, d[i,j-1]+1)
   return d[i,j]

Similar to Santoshi's solution above but I made three changes:

One line initialization instead of five
No need to define cost alone (just use int(boolean) 0 or 1)
Instead of double for loop use product, (this last one is only cosmetic, double loop seems unavoidable)

from itertools import product

def edit_distance(s1,s2):      
   d={ **{(i,0):i for i in range(len(s1)+1)},**{(0,j):j for j in range(len(s2)+1)}}
   for i, j in product(range(1,len(s1)+1), range(1,len(s2)+1)): 
       d[i,j]=min((s1[i-1]!=s2[j-1]) + d[i-1,j-1], d[i-1,j]+1, d[i,j-1]+1)
   return d[i,j]

回复收藏 0 原文