How can I optimize this Python code to generate all words with a word distance of 1?

Published on 2024-07-18 16:08:06

Profiling shows this is the slowest segment of my code for a little word game I wrote:

def distance(word1, word2):
    difference = 0
    for i in range(len(word1)):
        if word1[i] != word2[i]:
            difference += 1
    return difference

def getchildren(word, wordlist):
    return [ w for w in wordlist if distance(word, w) == 1 ]

Notes:

  • distance() is called over 5 million times, the majority from getchildren, which is supposed to get all words in the wordlist that differ from word by exactly 1 letter.
  • wordlist is pre-filtered to only have words containing the same number of letters as word so it's guaranteed that word1 and word2 have the same number of chars.
  • I'm fairly new to Python (started learning it 3 days ago) so comments on naming conventions or other style things also appreciated.
  • for wordlist, take the 12dict word list using the "2+2lemma.txt" file

Results:

Thanks everyone; combining the different suggestions, I now have the program running twice as fast (on top of the optimizations I did on my own before asking, so roughly a 4x speedup over my initial implementation)

I tested with 2 sets of inputs which I'll call A and B

Optimization 1: iterate over the indices of word1 and word2 ... from

for i in range(len(word1)):
    if word1[i] != word2[i]:
        difference += 1
return difference

to iterating over letter pairs using zip(word1, word2)

for x, y in zip(word1, word2):
    if x != y:
        difference += 1
return difference

Got execution time from 11.92 to 9.18 for input A, and 79.30 to 74.59 for input B

Optimization 2:
Added a separate method for differs-by-one in addition to the distance method (which I still need elsewhere for the A* heuristic)

def is_neighbors(word1,word2):
    different = False
    for c1,c2 in zip(word1,word2):
        if c1 != c2:
            if different:
                return False
            different = True
    return different

Got execution time from 9.18 to 8.83 for input A, and 74.59 to 70.14 for input B
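For readers on Python 3, where the built-in zip is already lazy (so izip is unnecessary), a self-contained sketch of the same early-exit idea looks like this; the sample words are just for illustration:

```python
def is_neighbors(word1, word2):
    """Return True iff two equal-length words differ in exactly one letter."""
    different = False
    for c1, c2 in zip(word1, word2):  # lazy in Python 3, like Py2's izip
        if c1 != c2:
            if different:   # second mismatch: bail out early
                return False
            different = True
    return different

print(is_neighbors("poets", "poete"))  # -> True  (one letter differs)
print(is_neighbors("poets", "verse"))  # -> False (stops at the second mismatch)
```

The early return is what makes this faster than distance() for this use: most candidate pairs differ in two letters within the first few characters.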

Optimization 3:
The big winner here was using izip instead of zip

Got execution time from 8.83 to 5.02 for input A, and 70.14 to 41.69 for input B

I could probably do better writing it in a lower level language, but I'm happy with this for now. Thanks everyone!

Edit again: More results
Using Mark's method of checking the case where the first letter doesn't match got it down from 5.02 -> 3.59 and 41.69 -> 29.82

Building on that and incorporating izip instead of range, I ended up with this:

def is_neighbors(word1,word2):
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    different = False
    for x,y in izip(word1[1:],word2[1:]):
        if x != y:
            if different:
                return False
            different = True
    return different

Which squeezed a little bit more, bringing the times down from 3.59 -> 3.38 and 29.82 -> 27.88

Even more results!

Trying Sumudu's suggestion of generating a list of all strings that are 1 letter off from "word" and then checking which ones are in the wordlist, instead of using the is_neighbors function, I ended up with this:

def one_letter_off_strings(word):
    import string
    dif_list = []
    for i in xrange(len(word)):
        dif_list.extend((word[:i] + l + word[i+1:] for l in string.ascii_lowercase if l != word[i]))
    return dif_list

def getchildren(word, wordlist):
    oneoff = one_letter_off_strings(word)
    return ( w for w in oneoff if w in wordlist )

This ended up being slower (3.38 -> 3.74 and 27.88 -> 34.40), but it seemed promising. At first I thought the part I'd need to optimize was one_letter_off_strings, but profiling showed otherwise; the slow part was in fact

( w for w in oneoff if w in wordlist )

I was wondering whether it would make any difference if I switched "oneoff" and "wordlist" and did the comparison the other way around, when it hit me that I was looking for the intersection of the 2 lists. I replaced that with a set intersection:

return set(oneoff) & set(wordlist)

Bam! 3.74 -> 0.23 and 34.40 -> 2.25

This is truly amazing; total speed difference from my original naive implementation:
23.79 -> 0.23 and 180.07 -> 2.25, so approx 80 to 100 times faster than the original implementation.
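Putting the winning pieces together, here is a self-contained Python 3 sketch of the candidate-generation-plus-set-intersection approach (the tiny word list is made up for illustration):

```python
import string

def one_letter_off_strings(word):
    """All strings that differ from word in exactly one position."""
    return {word[:i] + c + word[i+1:]
            for i in range(len(word))
            for c in string.ascii_lowercase
            if c != word[i]}

def getchildren(word, wordlist):
    # Intersecting the 25*len(word) candidates with a set of words is one
    # hash lookup per candidate, not a scan of the whole word list.
    return one_letter_off_strings(word) & set(wordlist)

words = ["verse", "terse", "vense", "poets", "versed"]
print(sorted(getchildren("verse", words)))  # -> ['terse', 'vense']
```

In the real program it pays to build set(wordlist) once up front rather than on every call, since the word list doesn't change between calls.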

If anyone is interested, I made a blog post describing the program and the optimizations made, including one that isn't mentioned here (because it's in a different section of code).

The Great Debate:

Ok, Unknown and I are having a big debate which you can read in the comments of his answer. He claims that the original method (using is_neighbors instead of the sets) would be faster if it were ported to C. I tried for 2 hours to get a C module I wrote to build and be linkable, without much success after trying to follow this and this example, and it looks like the process is a little different on Windows? I don't know, but I gave up on that. Anyway, here's the full code of the program; the text file comes from the 12dict word list, using the "2+2lemma.txt" file. Sorry if the code's a little messy, this was just something I hacked together. Also, I forgot to strip commas out of the wordlist, so that's actually a bug; you can leave it in for the sake of the same comparison, or fix it by adding a comma to the list of chars in cleanentries.

from itertools import izip
def unique(seq):  
    seen = {} 
    result = [] 
    for item in seq: 
        if item in seen:
            continue 
        seen[item] = 1 
        result.append(item) 
    return result
def cleanentries(li):
    return unique( [w.strip('[]') for w in li if w != "->"] )
def distance(word1, word2):
    difference = 0
    for x, y in izip(word1, word2):
        if x != y:
            difference += 1
    return difference
def is_neighbors(word1,word2):
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    different = False
    for x,y in izip(word1[1:],word2[1:]):
        if x != y:
            if different:
                return False
            different = True
    return different
def one_letter_off_strings(word):
    import string
    dif_list = []
    for i in xrange(len(word)):
        dif_list.extend((word[:i] + l + word[i+1:] for l in string.ascii_lowercase if l != word[i]))
    return dif_list

def getchildren(word, wordlist):
    oneoff = one_letter_off_strings(word)
    return set(oneoff) & set(wordlist)
def AStar(start, goal, wordlist):
    import Queue
    closedset = []
    openset = [start]
    pqueue = Queue.PriorityQueue(0)
    g_score = {start:0}         #Distance from start along optimal path.
    h_score = {start:distance(start, goal)}
    f_score = {start:h_score[start]}
    pqueue.put((f_score[start], start))
    parent_dict = {}
    while len(openset) > 0:
        x = pqueue.get(False)[1]
        if x == goal:
            return reconstruct_path(parent_dict,goal)
        openset.remove(x)
        closedset.append(x)
        sortedOpen = [(f_score[w], w, g_score[w], h_score[w]) for w in openset]
        sortedOpen.sort()
        for y in getchildren(x, wordlist):
            if y in closedset:
                continue
            temp_g_score = g_score[x] + 1
            temp_is_better = False
            appended = False
            if (not y in openset): 
                openset.append(y)
                appended = True
                h_score[y] = distance(y, goal)
                temp_is_better = True
            elif temp_g_score < g_score[y] :
                temp_is_better = True
            else :
                pass
            if temp_is_better:
                parent_dict[y] = x
                g_score[y] = temp_g_score
                f_score[y] = g_score[y] + h_score[y]
                if appended :
                    pqueue.put((f_score[y], y))
    return None


def reconstruct_path(parent_dict, node):
    if node in parent_dict:
        p = reconstruct_path(parent_dict, parent_dict[node])
        p.append(node)
        return p
    else:
        return []

wordfile = open("2+2lemma.txt")
wordlist = cleanentries(wordfile.read().split())
wordfile.close()
words = []
while True:
    userentry = raw_input("Hello, enter the 2 words to play with separated by a space:\n ")
    words = [w.lower() for w in userentry.split()]
    if(len(words) == 2 and len(words[0]) == len(words[1])):
        break
print "You selected %s and %s as your words" % (words[0], words[1])
wordlist = [ w for w in wordlist if len(words[0]) == len(w)]
answer = AStar(words[0], words[1], wordlist)
if answer != None:
    print "Minimum number of steps is %s" % (len(answer))
    reply = raw_input("Would you like the answer(y/n)? ")
    if(reply.lower() == "y"):
        answer.insert(0, words[0])
        print "\n".join(answer)
    else:
        print "Good luck!"
else:
    print "Sorry, there's no answer to yours"
reply = raw_input("Press enter to exit")

I left the is_neighbors method in even though it's not used. This is the method that is proposed to be ported to C. To use it, just replace getchildren with this:

def getchildren(word, wordlist):
    return ( w for w in wordlist if is_neighbors(word, w))

As for getting it to work as a C module I didn't get that far, but this is what I came up with:

#include "Python.h"

static PyObject *
py_is_neighbor(PyObject *self, PyObject *args)
{
    int length;
    const char *word1, *word2;
    /* "ss#" parses two strings plus the length of the second;
       the words are pre-filtered to equal length */
    if (!PyArg_ParseTuple(args, "ss#", &word1, &word2, &length))
        return NULL;

    int i;
    int different = 0;
    for (i = 0; i < length; i++)
    {
        if (word1[i] != word2[i])
        {
            if (different)
            {
                /* second mismatch: not neighbors */
                return Py_BuildValue("i", 0);
            }
            different = 1;
        }
    }
    return Py_BuildValue("i", different);
}
}

PyMethodDef methods[] = {
    {"isneighbor", py_is_neighbor, METH_VARARGS, "Returns whether words are neighbors"},
    {NULL, NULL, 0, NULL}
};

PyMODINIT_FUNC
initIsNeighbor(void)
{
    Py_InitModule("isneighbor", methods);
}
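For anyone who wants to retry the build: the conventional route in the Python 2 era was a short distutils setup script along these lines. The source filename isneighbor.c is an assumption (name the file to match), and on Windows the compiler must match the one Python itself was built with:

```python
# setup.py -- a minimal, hypothetical build script for the C module above.
# Build in place with:  python setup.py build_ext --inplace
from distutils.core import setup, Extension

setup(name="isneighbor",
      version="1.0",
      ext_modules=[Extension("isneighbor", sources=["isneighbor.c"])])
```

After a successful build, `import isneighbor` and call `isneighbor.isneighbor(word1, word2)`.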

I profiled this using:

python -m cProfile "Wordgame.py"

And the time recorded was the total time of the AStar method call. The fast input set was "verse poets" and the long input set was "poets verse". Timings will obviously vary between machines, so if anyone does end up trying this, please give a comparison of the results for the program as-is, as well as with the C module.

Comments (12)

捶死心动 2024-07-25 16:08:06

If your wordlist is very long, might it be more efficient to generate all possible 1-letter-differences from 'word', then check which ones are in the list? I don't know any Python but there should be a suitable data structure for the wordlist allowing for log-time lookups.

I suggest this because if your words are reasonable lengths (~10 letters), then you'll only be looking for 250 potential words, which is probably faster if your wordlist is larger than a few hundred words.
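Python's built-in set is the data structure being asked for here: it is hash-based, so membership tests are average O(1), even better than the log-time lookups suggested. A minimal Python 3 sketch with made-up words:

```python
wordlist = ["verse", "terse", "poets", "tense"]

word_set = set(wordlist)      # one-time O(n) build
print("terse" in word_set)    # average O(1) hash lookup -> True
print("horse" in word_set)    # -> False
```

With ~250 candidate strings per word, 250 hash lookups against the set beat scanning a list of thousands of words.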

一城柳絮吹成雪 2024-07-25 16:08:06

Your function distance is calculating the total distance, when you really only care about distance=1. The majority of cases you'll know it's >1 within a few characters, so you could return early and save a lot of time.

Beyond that, there might be a better algorithm, but I can't think of it.

Edit: Another idea.

You can make 2 cases, depending on whether the first character matches. If it doesn't match, the rest of the word has to match exactly, and you can test for that in one shot. Otherwise, do it similarly to what you were doing. You could even do it recursively, but I don't think that would be faster.

def DifferentByOne(word1, word2):
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    same = True
    for i in range(1, len(word1)):
        if word1[i] != word2[i]:
            if same:
                same = False
            else:
                return False
    return not same

Edit 2: I've deleted the check to see if the strings are the same length, since you say it's redundant. Running Ryan's tests on my own code and on the is_neighbors function provided by MizardX, I get the following:

  • Original distance(): 3.7 seconds
  • My DifferentByOne(): 1.1 seconds
  • MizardX's is_neighbors(): 3.7 seconds

Edit 3: (Probably getting into community wiki territory here, but...)

Trying your final definition of is_neighbors() with izip instead of zip: 2.9 seconds.

Here's my latest version, which still times at 1.1 seconds:

def DifferentByOne(word1, word2):
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    different = False
    for i in range(1, len(word1)):
        if word1[i] != word2[i]:
            if different:
                return False
            different = True
    return different
泪之魂 2024-07-25 16:08:06
from itertools import izip

def is_neighbors(word1,word2):
    different = False
    for c1,c2 in izip(word1,word2):
        if c1 != c2:
            if different:
                return False
            different = True
    return different

Or maybe in-lining the izip code:

def is_neighbors(word1,word2):
    different = False
    next1 = iter(word1).next
    next2 = iter(word2).next
    try:
        while 1:
            if next1() != next2():
                if different:
                    return False
                different = True
    except StopIteration:
        pass
    return different

And a rewritten getchildren:

def iterchildren(word, wordlist):
    return ( w for w in wordlist if is_neighbors(word, w) )
  • izip(a,b) returns an iterator over pairs of values from a and b.
  • zip(a,b) returns a list of pairs from a and b.
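(For readers on Python 3: izip was dropped from itertools because the built-in zip became lazy. A quick check:)

```python
pairs = zip("abc", "xyz")  # in Python 3 this is a lazy iterator, like Py2's izip
print(next(pairs))         # -> ('a', 'x')
print(list(pairs))         # -> [('b', 'y'), ('c', 'z')]
```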
最佳男配角 2024-07-25 16:08:06

People are mainly going about this by trying to write a quicker function, but there might be another way..

"distance" is called over 5 million times

Why is this? Perhaps a better way to optimise is to try and reduce the number of calls to distance, rather than shaving milliseconds of distance's execution time. It's impossible to tell without seeing the full script, but optimising a specific function is generally unnecessary.

If that is impossible, perhaps you could write it as a C module?

长不大的小祸害 2024-07-25 16:08:06

How often is the distance function called with the same arguments? A simple to implement optimization would be to use memoization.
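As a sketch of the memoization idea: the original program is Python 2, but it is easiest to show with Python 3's functools.lru_cache (in Python 2 one would hand-roll a dict keyed on the word pair). Whether this pays off depends on how often identical pairs actually recur:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def distance(word1, word2):
    # Cached: a repeated call with the same pair skips the loop entirely.
    return sum(c1 != c2 for c1, c2 in zip(word1, word2))

distance("verse", "poets")    # computed
distance("verse", "poets")    # served from the cache
print(distance.cache_info())  # hits=1, misses=1 at this point
```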

You could probably also create some sort of dictionary with frozensets of letters and lists of words that differ by one and look up values in that. This datastructure could either be stored and loaded through pickle or generated from scratch at startup.

Short circuiting the evaluation will only give you gains if the words you are using are very long, since the hamming distance algorithm you're using is basically O(n) where n is the word length.

I did some experiments with timeit for some alternative approaches that may be illustrative.

Timeit Results

Your Solution

d = """\
def distance(word1, word2):
    difference = 0
    for i in range(len(word1)):
        if word1[i] != word2[i]:
            difference += 1
    return difference
"""
t1 = timeit.Timer('distance("hello", "belko")', d)
print t1.timeit() # prints 6.502113536776391

One Liner

d = """\
from itertools import izip
def hamdist(s1, s2):
    return sum(ch1 != ch2 for ch1, ch2 in izip(s1,s2))
"""
t2 = timeit.Timer('hamdist("hello", "belko")', d)
print t2.timeit() # prints 10.985101179

Shortcut Evaluation

d = """\
def distance_is_one(word1, word2):
    diff = 0
    for i in xrange(len(word1)):
        if word1[i] != word2[i]:
            diff += 1
        if diff > 1:
            return False
    return diff == 1
"""
t3 = timeit.Timer('distance_is_one("hello", "belko")', d)
print t3.timeit() # prints 6.63337
似最初 2024-07-25 16:08:06

Well you can start by having your loop break if the difference is 2 or more.

Also you can change

for i in range(len(word1)):

to

for i in xrange(len(word1)):

Because xrange generates sequences on demand instead of generating the whole range of numbers at once.

You can also try comparing word lengths, which would be quicker. Also note that your code doesn't work if word1 is longer than word2.

There's not much else you can do algorithmically after that, which is to say you'll probably find more of a speedup by porting that section to C.

Edit 2

Attempting to explain my analysis of Sumudu's algorithm compared to verifying differences char by char.

When you have a word of length L, the number of "differs-by-one" words you will generate will be 25L. We know from implementations of sets on modern computers, that the search speed is approximately log(n) base 2, where n is the number of elements to search for.

Seeing that most of the 5 million words you test against is not in the set, most of the time, you will be traversing the entire set, which means that it really becomes log(25L) instead of only log(25L)/2. (and this is assuming best case scenario for sets that comparing string by string is equivalent to comparing char by char)

Now we take a look at the time complexity for determining a "differs-by-one". If we assume that you have to check the entire word, then the number of operations per word becomes L. We know that most words differ by 2 very quickly. And knowing that most prefixes take up a small portion of the word, we can logically assume that you will break most of the time by L/2, or half the word (and this is a conservative estimate).

So now we plot the time complexities of the two searches, L/2 and log(25L), and keeping in mind that this is even considering string matching the same speed as char matching (highly in favor of sets). You have the equation log(25*L) > L/2, which can be simplified down to log(25) > L/2 - log(L). As you can see from the graph, it should be quicker to use the char matching algorithm until you reach very large numbers of L.
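The inequality is easy to check numerically (a sketch that takes this answer's cost model at face value): under that model, char matching wins for short words and the set lookup only catches up for long ones.

```python
from math import log2

for L in (5, 10, 15, 20, 50):
    char_cost = L / 2          # expected comparisons before an early exit
    set_cost = log2(25 * L)    # modelled lookup cost against 25L candidates
    winner = "char match" if char_cost < set_cost else "sets"
    print(L, round(char_cost, 1), round(set_cost, 1), winner)
```

For 5-letter words the model gives 2.5 vs. about 7.0 in favor of char matching; the crossover sits near L = 18.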

[graph: plot of L/2 versus log(25L) showing the crossover]

Also, I don't know if you're counting breaking on a difference of 2 or more in your optimization, but from Mark's answer I already break on a difference of 2 or more, and actually, if the difference is in the first letter, it breaks after the first letter, and even in spite of all those optimizations, changing to using sets just blew them out of the water. I'm interested in trying your idea though

I was the first person in this question to suggest breaking on a difference of 2 or more. The thing is, that Mark's idea of string slicing (if word1[0] != word2[0]: return word1[1:] == word2[1:]) is simply putting what we are doing into C. How do you think word1[1:] == word2[1:] is calculated? The same way that we are doing.

I read your explanation a few times but I didn't quite follow it, would you mind explaining it a little more in depth? Also I'm not terribly familiar with C and I've been working in high-level languages for the past few years (the closest was learning C++ in high school 6 years ago)

As for producing the C code, I am a bit busy. I am sure you will be able to do it since you have written in C before. You could also try C#, which probably has similar performance characteristics.

More Explanation

Here is a more in-depth explanation for Davy8

def getchildren(word, wordlist):
    oneoff = one_letter_off_strings(word)
    return set(oneoff) & set(wordlist)

Your one_letter_off_strings function will create a set of 25L strings (where L is the number of letters).
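
That function is not shown in this excerpt; a plausible sketch (the name one_letter_off_strings is taken from the call above, and lowercase words are assumed):

```python
from string import ascii_lowercase

def one_letter_off_strings(word):
    # for each of the L positions, substitute each of the 25 other
    # letters, giving 25*L candidate strings
    result = []
    for i, original in enumerate(word):
        for c in ascii_lowercase:
            if c != original:
                result.append(word[:i] + c + word[i+1:])
    return result

print(len(one_letter_off_strings("color")))  # 25 * 5 = 125
```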

Creating a set from the wordlist will create a set of D strings (where D is the length of your dictionary). By creating an intersection from this, you MUST iterate over each oneoff and see if it exists in wordlist.

The time complexity of this operation is detailed above. It is less efficient than comparing the word you want against each word in wordlist; Sumudu's method is an optimization at the C level rather than in the algorithm.

More Explanation 2

There are only 4500 words in total (because the wordlist is pre-filtered to 5-letter words before it is even passed to the algorithm), being intersected with 125 one-letter-off words. You seemed to be saying the intersection costs log(smaller), in other words log(125, 2). Compare this, again using your assumption that comparing a word breaks after L/2 letters; I'll round that down to 2, even though for a 5-letter word it's more likely 3. That comparison is done 4500 times, so 9000 operations. log(125, 2) is about 6.9, and log(4500, 2) is about 12. Let me know if I misinterpreted your numbers.

To create the intersection of 125 one-letter-off words with a dictionary of 4500, you need to make 125 * 4500 comparisons. This is not log(125, 2); it is at best 125 * log(4500, 2), assuming the dictionary is presorted. There is no magic shortcut with sets. You are also doing a string-by-string comparison here instead of a char-by-char one.

人间☆小暴躁 2024-07-25 16:08:06

For such a simple function that has such a large performance implication, I would probably make a C library and call it using ctypes. One of reddit's founders claims they made the website 2x as fast using this technique.

You can also use psyco on this function, but beware that it can eat up a lot of memory.

风启觞 2024-07-25 16:08:06

I don't know if it will significantly affect your speed, but you could start by turning the list comprehension into a generator expression. It's still iterable so it shouldn't be much different in usage:

def getchildren(word, wordlist):
    return [ w for w in wordlist if distance(word, w) == 1 ]

to

def getchildren(word, wordlist):
    return ( w for w in wordlist if distance(word, w) == 1 )

The main problem would be that a list comprehension constructs the whole list in memory and takes up quite a bit of space, whereas the generator creates the values on the fly, so there is no need to store the whole thing.

Also, following on unknown's answer, this may be a more "pythonic" way of writing distance():

def distance(word1, word2):
    difference = 0
    for x, y in zip(word1, word2):
        if x != y:
            difference += 1
    return difference

But it's confusing what's intended when len(word1) != len(word2); in that case zip will only return as many character pairs as there are in the shortest word. (Which could turn out to be an optimization...)
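
A quick check of that truncation behavior (the example strings are mine):

```python
# zip stops at the end of the shorter argument, so the extra 't'
# in "cart" is silently ignored rather than raising an error
pairs = list(zip("cat", "cart"))
print(pairs)  # [('c', 'c'), ('a', 'a'), ('t', 'r')]
```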

半城柳色半声笛 2024-07-25 16:08:06

Try this:

def distance(word1, word2):
    return sum(c1 != c2 for c1, c2 in zip(word1, word2))

Also, do you have a link to your game? I like being destroyed by word games

我是男神闪亮亮 2024-07-25 16:08:06

First thing to occur to me:

from operator import ne

def distance(word1, word2):
    return sum(map(ne, word1, word2))

which has a decent chance of going faster than other functions people have posted, because it has no interpreted loops, just calls to Python primitives. And it's short enough that you could reasonably inline it into the caller.
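
A quick sanity check of that version (the example words are mine):

```python
from operator import ne

def distance(word1, word2):
    # map(ne, ...) yields True for each mismatched position;
    # sum() counts each True as 1
    return sum(map(ne, word1, word2))

print(distance("books", "boots"))  # 1
print(distance("hello", "hello"))  # 0
```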

For your higher-level problem, I'd look into the data structures developed for similarity search in metric spaces, e.g. this paper or this book, neither of which I've read (they came up in a search for a paper I have read but can't remember).

残月升风 2024-07-25 16:08:06

for this snippet:

for x,y in zip (word1, word2):
    if x != y:
        difference += 1
return difference

I'd use this one:

return sum(1 for i in xrange(len(word1)) if word1[i] != word2[i])

The same pattern could be applied throughout the provided code...

烟雨凡馨 2024-07-25 16:08:06

Everyone else focused just on the explicit distance calculation without doing anything about constructing the distance-1 candidates.
You can improve this by using a well-known data structure called a Trie to merge the implicit distance calculation with the task of generating all distance-1 neighbor words. A Trie is a tree where each node stands for a letter, and the 'next' field is a dict with up to 26 entries pointing to the child nodes.

Here's the pseudocode: walk the Trie iteratively for your given word; at each node add all distance-0 and distance-1 neighbors to the results; keep a counter of the remaining allowed differences and decrement it whenever you follow a mismatched letter. You don't need recursion, just a lookup function which takes an extra distance_so_far integer argument.

A minor tradeoff of extra speed for O(N) space increase can be gotten by building separate Tries for length-3, length-4, length-5 etc. words.
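
A minimal sketch of that idea, using nested dicts as trie nodes (recursive for brevity, though as the answer notes an iterative lookup also works; the '$' end-marker convention is mine):

```python
def build_trie(words):
    # nested dicts; the '$' key stores the completed word at a leaf
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = w
    return root

def neighbors(node, word, i=0, budget=1, out=None):
    # walk the trie, spending 'budget' on at most one mismatched letter;
    # collects every dictionary word within distance 1 of 'word'
    if out is None:
        out = []
    if i == len(word):
        if '$' in node and node['$'] != word:
            out.append(node['$'])
        return out
    for ch, child in node.items():
        if ch == '$':
            continue
        cost = 0 if ch == word[i] else 1
        if cost <= budget:
            neighbors(child, word, i + 1, budget - cost, out)
    return out

trie = build_trie(["books", "looks", "boots", "wordy"])
print(sorted(neighbors(trie, "books")))  # ['boots', 'looks']
```

This visits only trie branches that can still stay within the budget, instead of comparing against every word in the list.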
