Calculating Levenshtein distance with word lists

Posted 2024-10-30 06:40:24

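First, I should say that I am new to Python. I am trying to calculate the Levenshtein distance for lists of words. So far I have managed to write the code for a single pair of words, but I am having trouble doing it for lists. I have two lists, each with one word per line, like this:
carlos
stiv
peter

I want to use the Levenshtein distance as a similarity measure. Could somebody tell me how to load the lists and then use a function to calculate the distance?

I would appreciate it!

Here is my code, which only works for two strings: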

#!/usr/bin/env python
# -*- coding: utf-8 -*-

def lev_dist(source, target):
    if source == target:
        return 0

    # words = open('test_file.txt', 'r').read().split()

    # Prepare matrix: dist[i][j] is the distance between source[:i] and target[:j]
    slen, tlen = len(source), len(target)
    dist = [[0 for i in range(tlen+1)] for x in range(slen+1)]
    for i in range(slen+1):
        dist[i][0] = i
    for j in range(tlen+1):
        dist[0][j] = j

    # Counting distance
    for i in range(slen):
        for j in range(tlen):
            cost = 0 if source[i] == target[j] else 1
            dist[i+1][j+1] = min(
                            dist[i][j+1] + 1,   # deletion
                            dist[i+1][j] + 1,   # insertion
                            dist[i][j] + cost   # substitution
                        )
    return dist[-1][-1]

if __name__ == '__main__':
    import sys
    if len(sys.argv) != 3:
        print('Usage: You have to enter a source_word and a target_word')
        sys.exit(-1)
    source, target = sys.argv[1], sys.argv[2]
    print(lev_dist(source, target))
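
A side note on the function above: each row of the matrix depends only on the row before it, so the full (slen+1) x (tlen+1) table is not strictly needed when only the final distance matters. Below is a minimal sketch of that idea, assuming Python 3. The name `lev_dist_two_rows` is hypothetical and not part of the original code.

```python
def lev_dist_two_rows(source, target):
    """Levenshtein distance keeping only two rows of the matrix."""
    # Use the shorter string for the row so the rows stay small.
    if len(source) < len(target):
        source, target = target, source

    # previous[j] is the distance between the source prefix handled so far
    # and target[:j]; the first row is the cost of inserting j characters.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # distance between source[:i] and the empty string
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


print(lev_dist_two_rows("kitten", "sitting"))  # prints 3
```

It should return the same values as lev_dist, but it cannot be used to recover the actual sequence of edits, since the earlier rows are thrown away.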

薔薇婲 2024-11-06 06:40:24

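With some help from a friend, I finally got the code working :)
To compare the first word of list1 with every word in list2, change the last line of the script so that it calls lev_dist(list1[0], list2[i]); this computes the Levenshtein distance for each of those pairs.

Thanks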

#!/usr/bin/env python
# -*- coding: utf-8 -*-

def lev_dist(source, target):
    if source == target:
        return 0

    # Prepare a matrix: dist[i][j] is the distance between source[:i] and target[:j]
    slen, tlen = len(source), len(target)
    dist = [[0 for i in range(tlen+1)] for x in range(slen+1)]
    for i in range(slen+1):
        dist[i][0] = i
    for j in range(tlen+1):
        dist[0][j] = j

    # Counting distance, here is my function
    for i in range(slen):
        for j in range(tlen):
            cost = 0 if source[i] == target[j] else 1
            dist[i+1][j+1] = min(
                            dist[i][j+1] + 1,   # deletion
                            dist[i+1][j] + 1,   # insertion
                            dist[i][j] + cost   # substitution
                        )
    return dist[-1][-1]

# load words from a file into a list
def loadWords(path):
    words = [] # create an empty list to hold the file contents
    # open the file as UTF-8; the with block closes it again afterwards
    with open(path, "r", encoding="utf-8") as file_contents:
        for line in file_contents: # loop over the lines in the file
            line = line.strip() # strip the line breaks and any extra spaces
            words.append(line) # append the word to the list
    return words

if __name__ == '__main__':
    import sys
    if len(sys.argv) != 3:
        print('Usage: You have to enter a source file and a target file')
        sys.exit(-1)
    source, target = sys.argv[1], sys.argv[2]

    # create two lists, one for each file, by calling loadWords() on the file
    list1 = loadWords(source)
    list2 = loadWords(target)

    # now you have two lists; compare the first word of list1 with every word in list2
    # (for a line-by-line comparison, the words you are comparing have to be on
    # the same lines of both files, and you would loop over zip(list1, list2))

    for i in range(len(list2)): # loop over the indices of list2, not over file lines
        print(lev_dist(list1[0], list2[i]))
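
Since the goal in the question was a similarity measure, the distance can be turned into a score between 0 and 1 and used to pick the closest word in the second list. The following is only a sketch under assumptions: the normalisation (dividing by the length of the longer word) is one common choice rather than something the original post specifies, the helper names `similarity` and `closest_words` are hypothetical, and the second word list is made-up sample data.

```python
def lev_dist(source, target):
    """Dynamic-programming Levenshtein distance, computed row by row."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalisation: 1.0 for identical words, 0.0 when the
    distance equals the length of the longer word."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - lev_dist(a, b) / longest


def closest_words(list1, list2):
    """For each word in list1, return (word, best match in list2, score)."""
    results = []
    for word in list1:
        best = max(list2, key=lambda candidate: similarity(word, candidate))
        results.append((word, best, similarity(word, best)))
    return results


if __name__ == "__main__":
    list1 = ["carlos", "stiv", "peter"]  # sample words from the question
    list2 = ["carlo", "steve", "petra"]  # made-up second list (assumption)
    for word, best, score in closest_words(list1, list2):
        print(word, best, round(score, 2))
```

With word lists loaded from files, the two in-memory lists would simply be replaced by the results of loadWords(). Note that this compares every word in list1 with every word in list2, so the running time grows with the product of the two list lengths.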
