如何优化此 Python 代码以生成单词距离为 1 的所有单词?
分析显示,这是我编写的一个小文字游戏的代码中最慢的部分:
def distance(word1, word2):
difference = 0
for i in range(len(word1)):
if word1[i] != word2[i]:
difference += 1
return difference
def getchildren(word, wordlist):
return [ w for w in wordlist if distance(word, w) == 1 ]
注意:
distance()
被调用超过 500 万次,其中大部分来自 getchildren,它应该获取所有内容单词列表中与word
恰好有 1 个字母不同的单词。- wordlist 经过预过滤,只包含与
word
包含相同数量字母的单词,因此保证word1
和word2
具有相同数量的单词字符。 - 我对 Python 相当陌生(三天前开始学习),所以对命名约定或其他风格事物的评论也很感激。
- 对于单词列表,请使用“2+2lemma.txt”获取12dict 单词列表文件
结果:
谢谢大家,结合不同的建议,我现在让程序运行速度提高了一倍(除了我在询问之前自己做的优化之外,所以速度比我最初的实现大约提高了 4 倍)
我用 2 进行了测试我将其称为 A 和 B 的输入集
优化 1:迭代 word1,2 的索引 ... 从
for i in range(len(word1)):
if word1[i] != word2[i]:
difference += 1
return difference
到 使用 zip(word1, word2) 迭代字母对
for x,y in zip (word1, word2):
if x != y:
difference += 1
return difference
输入 A 的执行时间从 11.92 到 9.18,输入 B 的执行时间从 79.30 到 74.59
优化2: 除了距离方法(我在其他地方仍然需要 A* 启发式方法)之外,还添加了一种单独的差一方法。
def is_neighbors(word1,word2):
different = False
for c1,c2 in zip(word1,word2):
if c1 != c2:
if different:
return False
different = True
return different
输入 A 的执行时间从 9.18 到 8.83,输入 B 的执行时间从 74.59 到 70.14
优化3: 这里的大赢家是使用 izip
而不是 zip
输入 A 的执行时间从 8.83 到 5.02,输入 B 的执行时间从 70.14 到 41.69
我可能可以做得更好一种较低级别的语言,但我现在对此感到满意。 感谢大家!
再次编辑:更多结果 使用马克的方法检查第一个字母不匹配的情况,从 5.02 -> 得到它。 3.59 和 41.69 --> 29.82
合并 izip
而不是 range
,我最终得到了这样的结果:
def is_neighbors(word1,word2):
if word1[0] != word2[0]:
return word1[1:] == word2[1:]
different = False
for x,y in izip(word1[1:],word2[1:]):
if x != y:
if different:
return False
different = True
return different
在此基础上, 3.59-> 3.38 和 29.82 --> 27.88
更多结果!
尝试 Sumudu 的建议,即我生成与“word”相差 1 个字母的所有字符串的列表,然后检查单词列表中包含哪些字符串,而不是 is_neighbor 函数,我最终得到以下结果:
def one_letter_off_strings(word):
import string
dif_list = []
for i in xrange(len(word)):
dif_list.extend((word[:i] + l + word[i+1:] for l in string.ascii_lowercase if l != word[i]))
return dif_list
def getchildren(word, wordlist):
oneoff = one_letter_off_strings(word)
return ( w for w in oneoff if w in wordlist )
最终速度较慢(3.38 -> 3.74 和 27.88 -> 34.40),但看起来很有希望。 起初,我认为我需要优化的部分是“one_letter_off_strings”,但分析显示并非如此,而且缓慢的部分实际上是
( w for w in oneoff if w in wordlist )
我认为如果我切换“oneoff”和“wordlist”并执行以下操作是否会有任何区别当我意识到我正在寻找两个列表的交集时,以另一种方式进行比较。 我将其替换为字母上的交集:
return set(oneoff) & set(wordlist)
Bam! 3.74-> 0.23和34.40→ 2.25
这确实是令人惊奇的,与我最初的天真的实现的总速度差异: 23.79 -> 23.79 0.23和180.07→ 2.25,因此比原始实现快大约 80 到 100 倍。
如果有人感兴趣,我发表了博客文章 描述程序和描述所做的优化包括此处未提及的优化(因为它位于不同的代码部分)。
大辩论:
好吧,我和 Unknown 正在进行一场大辩论,你可以在 他的答案。 他声称,如果将其移植到 C,使用原始方法(使用 is_neighbor 而不是使用集合)会更快。我尝试了 2 个小时来获取我编写的用于构建和可链接的 C 模块,但在尝试后没有取得太大成功遵循此和这个示例,看起来 Windows 中的过程有点不同? 我不知道,但我已经放弃了。 无论如何,这是程序的完整代码,文本文件来自 12dict 单词列表 使用“2+2lemma.txt”文件。 抱歉,如果代码有点混乱,这只是我拼凑的东西。 另外,我忘记从单词列表中删除逗号,因此这实际上是一个错误,您可以为了相同的比较而保留它,或者通过在 cleanentries 的字符列表中添加逗号来修复它。
from itertools import izip
def unique(seq):
seen = {}
result = []
for item in seq:
if item in seen:
continue
seen[item] = 1
result.append(item)
return result
def cleanentries(li):
pass
return unique( [w.strip('[]') for w in li if w != "->"] )
def distance(word1, word2):
difference = 0
for x,y in izip (word1, word2):
if x != y:
difference += 1
return difference
def is_neighbors(word1,word2):
if word1[0] != word2[0]:
return word1[1:] == word2[1:]
different = False
for x,y in izip(word1[1:],word2[1:]):
if x != y:
if different:
return False
different = True
return different
def one_letter_off_strings(word):
import string
dif_list = []
for i in xrange(len(word)):
dif_list.extend((word[:i] + l + word[i+1:] for l in string.ascii_lowercase if l != word[i]))
return dif_list
def getchildren(word, wordlist):
oneoff = one_letter_off_strings(word)
return set(oneoff) & set(wordlist)
def AStar(start, goal, wordlist):
import Queue
closedset = []
openset = [start]
pqueue = Queue.PriorityQueue(0)
g_score = {start:0} #Distance from start along optimal path.
h_score = {start:distance(start, goal)}
f_score = {start:h_score[start]}
pqueue.put((f_score[start], start))
parent_dict = {}
while len(openset) > 0:
x = pqueue.get(False)[1]
if x == goal:
return reconstruct_path(parent_dict,goal)
openset.remove(x)
closedset.append(x)
sortedOpen = [(f_score[w], w, g_score[w], h_score[w]) for w in openset]
sortedOpen.sort()
for y in getchildren(x, wordlist):
if y in closedset:
continue
temp_g_score = g_score[x] + 1
temp_is_better = False
appended = False
if (not y in openset):
openset.append(y)
appended = True
h_score[y] = distance(y, goal)
temp_is_better = True
elif temp_g_score < g_score[y] :
temp_is_better = True
else :
pass
if temp_is_better:
parent_dict[y] = x
g_score[y] = temp_g_score
f_score[y] = g_score[y] + h_score[y]
if appended :
pqueue.put((f_score[y], y))
return None
def reconstruct_path(parent_dict,node):
if node in parent_dict.keys():
p = reconstruct_path(parent_dict,parent_dict[node])
p.append(node)
return p
else:
return []
wordfile = open("2+2lemma.txt")
wordlist = cleanentries(wordfile.read().split())
wordfile.close()
words = []
while True:
userentry = raw_input("Hello, enter the 2 words to play with separated by a space:\n ")
words = [w.lower() for w in userentry.split()]
if(len(words) == 2 and len(words[0]) == len(words[1])):
break
print "You selected %s and %s as your words" % (words[0], words[1])
wordlist = [ w for w in wordlist if len(words[0]) == len(w)]
answer = AStar(words[0], words[1], wordlist)
if answer != None:
print "Minimum number of steps is %s" % (len(answer))
reply = raw_input("Would you like the answer(y/n)? ")
if(reply.lower() == "y"):
answer.insert(0, words[0])
print "\n".join(answer)
else:
print "Good luck!"
else:
print "Sorry, there's no answer to yours"
reply = raw_input("Press enter to exit")
我保留了 is_neighbors 方法,尽管它没有被使用。 这是建议移植到 C 的方法。要使用它,只需将 getchildren 替换为:
def getchildren(word, wordlist):
return ( w for w in wordlist if is_neighbors(word, w))
至于让它作为 C 模块工作,我还没有做到这一点,但这就是我想出的:
#include "Python.h"
static PyObject *
py_is_neighbor(PyObject *self, Pyobject *args)
{
int length;
const char *word1, *word2;
if (!PyArg_ParseTuple(args, "ss", &word1, &word2, &length))
return NULL;
int i;
int different = 0;
for (i =0; i < length; i++)
{
if (*(word1 + i) != *(word2 + i))
{
if (different)
{
return Py_BuildValue("i", different);
}
different = 1;
}
}
return Py_BuildValue("i", different);
}
PyMethodDef methods[] = {
{"isneighbor", py_is_neighbor, METH_VARARGS, "Returns whether words are neighbors"},
{NULL, NULL, 0, NULL}
};
PyMODINIT_FUNC
initIsNeighbor(void)
{
Py_InitModule("isneighbor", methods);
}
我使用以下方法对此进行了分析:
python -m cProfile“Wordgame.py”
并且记录的时间是AStar方法调用的总时间。 快速输入集是“诗句诗人”,长输入集是“诗人诗句”。 不同机器之间的时序显然会有所不同,因此如果有人最终尝试这样做,请按原样给出程序以及与 C 模块的结果比较。
Profiling shows this is the slowest segment of my code for a little word game I wrote:
def distance(word1, word2):
difference = 0
for i in range(len(word1)):
if word1[i] != word2[i]:
difference += 1
return difference
def getchildren(word, wordlist):
return [ w for w in wordlist if distance(word, w) == 1 ]
Notes:
distance()
is called over 5 million times, majority of which is from getchildren, which is supposed to get all words in the wordlist that differ fromword
by exactly 1 letter.- wordlist is pre-filtered to only have words containing the same number of letters as
word
so it's guaranteed thatword1
andword2
have the same number of chars. - I'm fairly new to Python (started learning it 3 days ago) so comments on naming conventions or other style things also appreciated.
- for wordlist, take the 12dict word list using the "2+2lemma.txt" file
Results:
Thanks everyone, with combinations of different suggestions I got the program running twice as fast now (on top of the optimizations I did on my own before asking, so 4 times speed increase approx from my initial implementation)
I tested with 2 sets of inputs which I'll call A and B
Optimization1: iterate over indices of word1,2 ... from
for i in range(len(word1)):
if word1[i] != word2[i]:
difference += 1
return difference
to iterate on letter-pairs using zip(word1, word2)
for x,y in zip (word1, word2):
if x != y:
difference += 1
return difference
Got execution time from 11.92 to 9.18 for input A, and 79.30 to 74.59 for input B
Optimization2:
Added a separate method for differs-by-one in addition to the distance-method (which I still needed elsewhere for the A* heuristics)
def is_neighbors(word1,word2):
different = False
for c1,c2 in zip(word1,word2):
if c1 != c2:
if different:
return False
different = True
return different
Got execution time from 9.18 to 8.83 for input A, and 74.59 to 70.14 for input B
Optimization3:
Big winner here was to use izip
instead of zip
Got execution time from 8.83 to 5.02 for input A, and 70.14 to 41.69 for input B
I could probably do better writing it in a lower level language, but I'm happy with this for now. Thanks everyone!
Edit again: More results
Using Mark's method of checking the case where the first letter doesn't match got it down from 5.02 -> 3.59 and 41.69 -> 29.82
Building on that and incorporating izip
instead of range
, I ended up with this:
def is_neighbors(word1,word2):
if word1[0] != word2[0]:
return word1[1:] == word2[1:]
different = False
for x,y in izip(word1[1:],word2[1:]):
if x != y:
if different:
return False
different = True
return different
Which squeezed a little bit more, bringing the times down from 3.59 -> 3.38 and 29.82 -> 27.88
Even more results!
Trying Sumudu's suggestion that I generate a list of all strings that are 1 letter off from "word" and then checking to see which ones were in the wordlist, instead of the is_neighbor function I ended up with this:
def one_letter_off_strings(word):
import string
dif_list = []
for i in xrange(len(word)):
dif_list.extend((word[:i] + l + word[i+1:] for l in string.ascii_lowercase if l != word[i]))
return dif_list
def getchildren(word, wordlist):
oneoff = one_letter_off_strings(word)
return ( w for w in oneoff if w in wordlist )
Which ended up being slower (3.38 -> 3.74 and 27.88 -> 34.40) but it seemed promising. At first I thought the part I'd need to optimize was "one_letter_off_strings" but profiling showed otherwise and that the slow part was in fact
( w for w in oneoff if w in wordlist )
I thought if there'd be any difference if I switched "oneoff" and "wordlist" and did the comparison the other way when it hit me that I was looking for the intersection of the 2 lists. I replace that with set-intersection on the letters:
return set(oneoff) & set(wordlist)
Bam! 3.74 -> 0.23 and 34.40 -> 2.25
This is truely amazing, total speed difference from my original naive implementation:
23.79 -> 0.23 and 180.07 -> 2.25, so approx 80 to 100 times faster than the original implementation.
If anyone is interested, I made blog post describing the program and describing the optimizations made including one that isn't mentioned here (because it's in a different section of code).
The Great Debate:
Ok, me and Unknown are having a big debate which you can read in the comments of his answer. He claims that it would be faster using the original method (using is_neighbor instead of using the sets) if it was ported to C. I tried for 2 hours to get a C module I wrote to build and be linkable without much success after trying to follow this and this example, and it looks like the process is a little different in Windows? I don't know, but I gave up on that. Anyway, here's the full code of the program, and the text file come from the 12dict word list using the "2+2lemma.txt" file. Sorry if the code's a little messy, this was just something I hacked together. Also I forgot to strip out commas from the wordlist so that's actually a bug that you can leave in for the sake of the same comparison or fix it by adding a comma to the list of chars in cleanentries.
from itertools import izip
def unique(seq):
seen = {}
result = []
for item in seq:
if item in seen:
continue
seen[item] = 1
result.append(item)
return result
def cleanentries(li):
pass
return unique( [w.strip('[]') for w in li if w != "->"] )
def distance(word1, word2):
difference = 0
for x,y in izip (word1, word2):
if x != y:
difference += 1
return difference
def is_neighbors(word1,word2):
if word1[0] != word2[0]:
return word1[1:] == word2[1:]
different = False
for x,y in izip(word1[1:],word2[1:]):
if x != y:
if different:
return False
different = True
return different
def one_letter_off_strings(word):
import string
dif_list = []
for i in xrange(len(word)):
dif_list.extend((word[:i] + l + word[i+1:] for l in string.ascii_lowercase if l != word[i]))
return dif_list
def getchildren(word, wordlist):
oneoff = one_letter_off_strings(word)
return set(oneoff) & set(wordlist)
def AStar(start, goal, wordlist):
import Queue
closedset = []
openset = [start]
pqueue = Queue.PriorityQueue(0)
g_score = {start:0} #Distance from start along optimal path.
h_score = {start:distance(start, goal)}
f_score = {start:h_score[start]}
pqueue.put((f_score[start], start))
parent_dict = {}
while len(openset) > 0:
x = pqueue.get(False)[1]
if x == goal:
return reconstruct_path(parent_dict,goal)
openset.remove(x)
closedset.append(x)
sortedOpen = [(f_score[w], w, g_score[w], h_score[w]) for w in openset]
sortedOpen.sort()
for y in getchildren(x, wordlist):
if y in closedset:
continue
temp_g_score = g_score[x] + 1
temp_is_better = False
appended = False
if (not y in openset):
openset.append(y)
appended = True
h_score[y] = distance(y, goal)
temp_is_better = True
elif temp_g_score < g_score[y] :
temp_is_better = True
else :
pass
if temp_is_better:
parent_dict[y] = x
g_score[y] = temp_g_score
f_score[y] = g_score[y] + h_score[y]
if appended :
pqueue.put((f_score[y], y))
return None
def reconstruct_path(parent_dict,node):
if node in parent_dict.keys():
p = reconstruct_path(parent_dict,parent_dict[node])
p.append(node)
return p
else:
return []
wordfile = open("2+2lemma.txt")
wordlist = cleanentries(wordfile.read().split())
wordfile.close()
words = []
while True:
userentry = raw_input("Hello, enter the 2 words to play with separated by a space:\n ")
words = [w.lower() for w in userentry.split()]
if(len(words) == 2 and len(words[0]) == len(words[1])):
break
print "You selected %s and %s as your words" % (words[0], words[1])
wordlist = [ w for w in wordlist if len(words[0]) == len(w)]
answer = AStar(words[0], words[1], wordlist)
if answer != None:
print "Minimum number of steps is %s" % (len(answer))
reply = raw_input("Would you like the answer(y/n)? ")
if(reply.lower() == "y"):
answer.insert(0, words[0])
print "\n".join(answer)
else:
print "Good luck!"
else:
print "Sorry, there's no answer to yours"
reply = raw_input("Press enter to exit")
I left the is_neighbors method in even though it's not used. This is the method that is proposed to be ported to C. To use it, just replace getchildren with this:
def getchildren(word, wordlist):
return ( w for w in wordlist if is_neighbors(word, w))
As for getting it to work as a C module I didn't get that far, but this is what I came up with:
#include "Python.h"
static PyObject *
py_is_neighbor(PyObject *self, Pyobject *args)
{
int length;
const char *word1, *word2;
if (!PyArg_ParseTuple(args, "ss", &word1, &word2, &length))
return NULL;
int i;
int different = 0;
for (i =0; i < length; i++)
{
if (*(word1 + i) != *(word2 + i))
{
if (different)
{
return Py_BuildValue("i", different);
}
different = 1;
}
}
return Py_BuildValue("i", different);
}
PyMethodDef methods[] = {
{"isneighbor", py_is_neighbor, METH_VARARGS, "Returns whether words are neighbors"},
{NULL, NULL, 0, NULL}
};
PyMODINIT_FUNC
initIsNeighbor(void)
{
Py_InitModule("isneighbor", methods);
}
I profiled this using:
python -m cProfile "Wordgame.py"
And the time recorded was the total time of the AStar method call. The fast input set was "verse poets" and the long input set was "poets verse". Timings will obviously vary between different machines, so if anyone does end up trying this give result comparison of the program as is, as well as with the C module.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(12)
如果您的单词列表很长,那么从“单词”生成所有可能的 1 字母差异,然后检查列表中的哪些单词是否会更有效? 我不懂Python,但应该有一个适合单词列表的数据结构,允许日志时间查找。
我建议这样做是因为,如果您的单词长度合理(~10 个字母),那么您只需寻找 250 个潜在单词,如果您的单词列表大于几百个单词,这可能会更快。
If your wordlist is very long, might it be more efficient to generate all possible 1-letter-differences from 'word', then check which ones are in the list? I don't know any Python but there should be a suitable data structure for the wordlist allowing for log-time lookups.
I suggest this because if your words are reasonable lengths (~10 letters), then you'll only be looking for 250 potential words, which is probably faster if your wordlist is larger than a few hundred words.
你的函数
distance
正在计算总距离,而你实际上只关心 distance=1。 大多数情况下,您会知道它在几个字符内 >1,因此您可以提前返回并节省大量时间。除此之外,可能还有更好的算法,但我想不到。
编辑:另一个想法。
您可以制作 2 种情况,具体取决于第一个字符是否匹配。 如果不匹配,则单词的其余部分必须完全匹配,您可以一次性测试这一点。 否则,请按照与您之前所做的类似的方式进行操作。 您甚至可以递归地执行此操作,但我认为这不会更快。
编辑2:我删除了检查字符串是否相同长度的检查,因为你说它是多余的。 在我自己的代码和is_neighbors 函数 由 MizardX 提供,我得到以下内容:
编辑3:(可能在这里进入社区维基领域,但是......)
尝试你的最终定义is_neighbors() 与 izip 而不是 zip:2.9 秒。
这是我的最新版本,时间仍然为 1.1 秒:
Your function
distance
is calculating the total distance, when you really only care about distance=1. The majority of cases you'll know it's >1 within a few characters, so you could return early and save a lot of time.Beyond that, there might be a better algorithm, but I can't think of it.
Edit: Another idea.
You can make 2 cases, depending on whether the first character matches. If it doesn't match, the rest of the word has to match exactly, and you can test for that in one shot. Otherwise, do it similarly to what you were doing. You could even do it recursively, but I don't think that would be faster.
Edit 2: I've deleted the check to see if the strings are the same length, since you say it's redundant. Running Ryan's tests on my own code and on the is_neighbors function provided by MizardX, I get the following:
Edit 3: (Probably getting into community wiki territory here, but...)
Trying your final definition of is_neighbors() with izip instead of zip: 2.9 seconds.
Here's my latest version, which still times at 1.1 seconds:
或者可能内联 izip 代码:
以及重写的 getchildren :
izip(a,b)
返回来自a
和b.
zip(a,b)
返回一个列表来自a
和b
的对。Or maybe in-lining the
izip
code:And a rewritten
getchildren
:izip(a,b)
returns an iterator over pairs of values froma
andb
.zip(a,b)
returns a list of pairs froma
andb
.人们主要通过尝试编写更快的函数来解决这个问题,但可能还有另一种方法。
这是为什么? 也许更好的优化方法是尝试减少对
distance
的调用次数,而不是减少distance
执行时间的毫秒数。 如果不看完整的脚本就无法判断,但优化特定功能通常是不必要的。如果这是不可能的,也许你可以把它写成一个 C 模块?
People are mainly going about this by trying to write a quicker function, but there might be another way..
Why is this? Perhaps a better way to optimise is to try and reduce the number of calls to
distance
, rather than shaving milliseconds ofdistance's
execution time. It's impossible to tell without seeing the full script, but optimising a specific function is generally unnecessary.If that is impossible, perhaps you could write it as a C module?
使用相同参数调用距离函数的频率是多少? 实现优化的一个简单方法是使用记忆。
您可能还可以创建某种字典,其中包含冻结的字母集和互不相同的单词列表,并在其中查找值。 该数据结构可以通过 pickle 存储和加载,也可以在启动时从头开始生成。
仅当您使用的单词非常长时,短路评估才会给您带来收益,因为您使用的汉明距离算法基本上是 O(n),其中 n 是单词长度。
我用 timeit 做了一些实验,寻找一些可能具有说明性的替代方法。
Timeit 结果
您的解决方案
One Liner
Shortcut 评估
How often is the distance function called with the same arguments? A simple to implement optimization would be to use memoization.
You could probably also create some sort of dictionary with frozensets of letters and lists of words that differ by one and look up values in that. This datastructure could either be stored and loaded through pickle or generated from scratch at startup.
Short circuiting the evaluation will only give you gains if the words you are using are very long, since the hamming distance algorithm you're using is basically O(n) where n is the word length.
I did some experiments with timeit for some alternative approaches that may be illustrative.
Timeit Results
Your Solution
One Liner
Shortcut Evaluation
如果差值是 2 或更多,你可以先让循环中断。
您也可以更改
为
因为 xrange 按需生成序列,而不是一次生成整个数字范围。
您也可以尝试比较字长,这样会更快。 另请注意,如果 word1 大于 word2,您的代码将无法工作。
此后您在算法上无法做太多其他事情,也就是说,通过将该部分移植到 C,您可能会发现更多的加速。
编辑2
尝试解释我对 Sumudu 算法与逐个字符验证差异的分析。
当你有一个长度为 L 的单词时,你将生成的“相差一”单词的数量将为 25L。 从现代计算机上集合的实现中我们知道,搜索速度大约为 log(n) base 2,其中 n 是要搜索的元素数量。
看到你测试的 500 万个单词中的大多数都不在集合中,大多数时候,你将遍历整个集合,这意味着它真的变成了 log(25L) 而不是仅 log(25L)/2。 (这是假设集合的最佳情况,逐个字符串比较相当于逐个字符比较)
现在我们看一下确定“逐个差异”的时间复杂度。 如果我们假设你必须检查整个单词,那么每个单词的操作次数就变成L。 我们知道大多数单词很快就会相差 2。 并且知道大多数前缀占据单词的一小部分,我们可以逻辑地假设您大部分时间都会中断 L/2,即单词的一半(这是一个保守的估计) 。
因此,现在我们绘制两个搜索的时间复杂度,L/2 和 log(25L),并记住这甚至考虑了字符串匹配与字符匹配相同的速度(高度支持套)。 你有方程 log(25*L) > L/2,可以简化为log(25)> L/2 - log(L)。 从图中可以看出,使用字符匹配算法应该会更快,直到达到非常大量的 L。
我是这个问题中第一个建议打破 2 或更多差异的人。 问题是,Mark 的字符串切片思想 (if word1[0] != word2[0]: return word1[1:] == word2[1:]) 只是将我们正在做的事情放入 C 中。你认为 word1[1:] == word2[1:] 是计算出来的吗? 我们正在做同样的事情。
至于生成 C 代码,我有点忙。我相信您能够做到这一点,因为您以前用 C 编写过,您也可以尝试 C#,它可能具有类似的性能特征
更多说明
的更深入的解释。
这是对 Davy8 Your one_letter_off_strings 函数将创建一组 25L 字符串(其中 L 是字母数)。
从单词列表创建一组将创建一组 D 字符串(其中 D 是字典的长度)。 必须遍历每个oneoff并查看它是否存在于单词列表中。
此操作的时间复杂度如上所述。此操作的效率较低。将你想要的单词与单词列表中的每个单词进行比较,Sumudu的方法是C语言而不是算法的优化。
更多说明2
要创建 125 个单字母单词与 4500 个字典的交集,您需要进行 125 * 4500 次比较。 这不是 log(125,2)。 假设字典已预先排序,则最多为 125 * log(4500, 2)。 集合没有神奇的捷径。您在这里还进行了字符串与字符串的比较,而不是字符与字符的比较。
Well you can start by having your loop break if the difference is 2 or more.
Also you can change
to
Because xrange generates sequences on demand instead of generating the whole range of numbers at once.
You can also try comparing word lengths which would be quicker. Also note that your code doesn't work if word1 is greater than word2
There's not much else you can do algorithmically after that, which is to say you'll probably find more of a speedup by porting that section to C.
Edit 2
Attempting to explain my analysis of Sumudu's algorithm compared to verifying differences char by char.
When you have a word of length L, the number of "differs-by-one" words you will generate will be 25L. We know from implementations of sets on modern computers, that the search speed is approximately log(n) base 2, where n is the number of elements to search for.
Seeing that most of the 5 million words you test against is not in the set, most of the time, you will be traversing the entire set, which means that it really becomes log(25L) instead of only log(25L)/2. (and this is assuming best case scenario for sets that comparing string by string is equivalent to comparing char by char)
Now we take a look at the time complexity for determining a "differs-by-one". If we assume that you have to check the entire word, then the number of operations per word becomes L. We know that most words differ by 2 very quickly. And knowing that most prefixes take up a small portion of the word, we can logically assume that you will break most of the time by L/2, or half the word (and this is a conservative estimate).
So now we plot the time complexities of the two searches, L/2 and log(25L), and keeping in mind that this is even considering string matching the same speed as char matching (highly in favor of sets). You have the equation log(25*L) > L/2, which can be simplified down to log(25) > L/2 - log(L). As you can see from the graph, it should be quicker to use the char matching algorithm until you reach very large numbers of L.
I was the first person in this question to suggest breaking on a difference of 2 or more. The thing is, that Mark's idea of string slicing (if word1[0] != word2[0]: return word1[1:] == word2[1:]) is simply putting what we are doing into C. How do you think word1[1:] == word2[1:] is calculated? The same way that we are doing.
As for producing the C code, I am a bit busy. I am sure you will be able to do it since you have written in C before. You could also try C#, which probably has similar performance characteristics.
More Explanation
Here is a more indepth explanation to Davy8
Your one_letter_off_strings function will create a set of 25L strings(where L is the number of letters).
Creating a set from the wordlist will create a set of D strings (where D is the length of your dictionary). By creating an intersection from this, you MUST iterate over each oneoff and see if it exists in wordlist.
The time complexity for this operation is detailed above. This operation is less efficient than comparing the word you want with each word in wordlist. Sumudu's method is an optimization in C rather than in algorithm.
More Explanation 2
To create the intersection of 125 one-letter-off words with a dictionary of 4500, you need to make 125 * 4500 comparisons. This is not log(125,2). It is at best 125 * log(4500, 2) assuming that the dictionary is presorted. There is no magic shortcut to sets. You are also doing a string by string instead of char by char comparison here.
对于这样一个具有如此大性能影响的简单函数,我可能会创建一个 C 库并使用 ctypes。 Reddit 的一位创始人声称,他们使用这种技术使网站速度提高了 2 倍。
您还可以在此函数上使用psyco,但要注意它会占用大量内存。
For such a simple function that has such a large performance implication, I would probably make a C library and call it using ctypes. One of reddit's founders claims they made the website 2x as fast using this technique.
You can also use psyco on this function, but beware that it can eat up a lot of memory.
我不知道它是否会显着影响您的速度,但您可以从将列表理解转换为生成器表达式开始。 它仍然是可迭代的,所以在用法上应该没有太大不同:
to
主要问题是列表理解会在内存中构建自身并占用相当多的空间,而生成器将动态创建列表,因此无需存储整个内容。
另外,根据未知的答案,这可能是一种更“Pythonic”的编写 distance() 的方式:
但是当 len (word1) != len (word2) 时,它的意图令人困惑,在 zip 的情况下,它只会返回尽可能多的值字符作为最短的单词。 (这可能是一种优化......)
I don't know if it will significantly affect your speed, but you could start by turning the list comprehension into a generator expression. It's still iterable so it shouldn't be much different in usage:
to
The main problem would be that a list comprehension would construct itself in memory and take up quite a bit of space, whereas the generator will create your list on the fly so there is no need to store the whole thing.
Also, following on unknown's answer, this may be a more "pythonic" way of writing distance():
But it's confusing what's intended when len (word1) != len (word2), in the case of zip it will only return as many characters as the shortest word. (Which could turn out to be an optimization...)
试试这个:
另外,你有你的游戏的链接吗? 我喜欢被文字游戏摧毁
Try this:
Also, do you have a link to your game? I like being destroyed by word games
我首先想到的是:
它很有可能比人们发布的其他函数运行得更快,因为它没有解释循环,只是调用 Python 原语。 而且它足够短,您可以合理地将其内联到调用者中。
对于您的更高级别的问题,我将研究为度量空间中的相似性搜索而开发的数据结构,例如 这篇论文或这本书,我都没有读过(它们是在搜索我读过但不记得的论文时出现的)。
First thing to occur to me:
which has a decent chance of going faster than other functions people have posted, because it has no interpreted loops, just calls to Python primitives. And it's short enough that you could reasonably inline it into the caller.
For your higher-level problem, I'd look into the data structures developed for similarity search in metric spaces, e.g. this paper or this book, neither of which I've read (they came up in a search for a paper I have read but can't remember).
对于这个片段:
我会使用这个:
相同的模式将遵循所提供的代码......
for this snippet:
i'd use this one:
the same pattern would follow all around the provided code...
其他人都只专注于显式距离计算,而没有做任何关于构建距离 1 候选的事情。
您可以通过使用名为 Trie 将隐式距离计算与生成所有距离1邻居词的任务合并。 Trie 是一个链表,其中每个节点代表一个字母,“下一个”字段是一个最多包含 26 个条目的字典,指向下一个节点。
这是伪代码:针对给定的单词迭代遍历 Trie; 在每个节点将所有距离 0 和距离 1 的邻居添加到结果中; 保持距离计数器并减少它。 您不需要递归,只需要一个需要额外的 distance_so_far 整数参数的查找函数。
可以通过为长度 3、长度 4、长度 5 等单词构建单独的尝试来获得 O(N) 空间增加的额外速度的较小权衡。
Everyone else focused just on explicit distance-calculation without doing anything about constructing the distance-1 candidates.
You can improve by using a well-known data-structure called a Trie to merge the implicit distance-calculation with the task of generating all distance-1 neighbor words. A Trie is a linked-list where each node stands for a letter, and the 'next' field is a dict with up to 26 entries, pointing to the next node.
Here's the pseudocode: walk the Trie iteratively for your given word; at each node add all distance-0 and distance-1 neighbors to the results; keep a counter of distance and decrement it. You don't need recursion, just a lookup function which takes an extra distance_so_far integer argument.
A minor tradeoff of extra speed for O(N) space increase can be gotten by building separate Tries for length-3, length-4, length-5 etc. words.