python 中的整数比较使一切变得缓慢

发布于 2024-08-23 04:50:16 字数 866 浏览 4 评论 0原文

下面的代码让我非常头疼：

def extract_by_letters(letters, dictionary):

    for word in dictionary:
       for letter in letters:
           if word.count(letter) != letters.count(letter):
               if word in dictionary: #I cant leave this line out
                   dictionary.remove(word)

    return dictionary

首先：“if word indictionary”行。为什么我不能忽略这个？我收到一条错误消息 ValueError: list.remove(x): x not in list

但显然是这样。

第二：字典是一个由换行符分隔的大约 50,000 个单词的文件。上面的代码大约需要 2 分钟才能运行……太长了。我玩了一下代码，发现这一行：

if word.count(letter) != letters.count(letter):

似乎导致了我所有的问题。如果我去掉该行（这会完全搞乱输出），该函数需要大约 2 秒的时间来浏览字典。

我以为是计数函数，但事实并非如此。

如果我将 if 语句更改为：

print word.count(letter) 
print letters.count(letter)

该函数运行大约需要 3 秒。

我确信这就是对比。还有其他建议吗？有更好的方法吗？

提前致谢！

原文

The following code is causing me enormous headaches:

def extract_by_letters(letters, dictionary):

    for word in dictionary:
       for letter in letters:
           if word.count(letter) != letters.count(letter):
               if word in dictionary: #I cant leave this line out
                   dictionary.remove(word)

    return dictionary

First of all: the 'if word in dictionary' line. Why can't I leave this out? I get an error saying ValueError: list.remove(x): x not in list

But it is, obviously.

Second: Dictionary is a file of about 50,000 words separated by linebreaks. The above code takes about 2 minutes to run... Wayyy too long. I played with the code for a bit, and I found that the line:

if word.count(letter) != letters.count(letter):

seems to be causing all my problems. If I take that line out (which totally screws up the output), the function takes about 2 seconds to go through the dictionary.

I thought it was the count functions, but it's not.

if I change the if statement to something like:

print word.count(letter) 
print letters.count(letter)

the function takes about 3 seconds to run.

I'm convinced it's the comparison. Any other suggestions? Is there a better way to do this?

Thanks in advance!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

懷念過去 2024-08-30 04:50:16

出现异常的原因是，如果字母计数匹配多个字母，则您将多次尝试删除该单词。速度

如此之慢的原因是循环内循环。

如果您写一两句话来说明该函数应该做什么，那么重构它就会容易得多。同时，一旦您已经删除了某个单词，这将阻止您检查是否需要删除该单词。

def extract_by_letters(letters, dictionary):
    for word in dictionary[:]:  # bad idea to change this while you iterate over it
        for letter in letters:
            if word.count(letter) != letters.count(letter):
                dictionary.remove(word)
                break
    return dictionary

如果字典是一个集合，你应该会得到一些速度的提升。如果字典是一个列表，这应该会带来巨大的加速

The reason you get the exception is that if the letter count matches for more than one letter, you are trying to remove the word more than once

The reason it is so slow is that you have loops inside loops inside loops.

If you would write a sentence or two about what the function is supposed to do, it would be much easier to refactor it. In the mean time, this would stop you checking whether a word needs to be removed once you have already removed it.

def extract_by_letters(letters, dictionary):
    for word in dictionary[:]:  # bad idea to change this while you iterate over it
        for letter in letters:
            if word.count(letter) != letters.count(letter):
                dictionary.remove(word)
                break
    return dictionary

If dictionary is a set you should get some speed up. If dictionary is a list this should give a huge speedup

回复收藏 0 原文

小…楫夜泊 2024-08-30 04:50:16

尝试构建输出而不是从中删除：

def extract_by_letters(letters, dictionary):
    d = []
    for word in dictionary:
       for letter in letters:
           if word.count(letter)>0:
               d.append(word)
               break
    return d

或者，您可以使用正则表达式：

import re
def extract_by_letters(letters, dictionary):
    regex = re.compile('['+letters+']')
    d=[]
    for word in dictionary:
       if regex.search(word):
           d.append(word)
    return d

或者，也许是最简单的方法：

import re
def extract_by_letters(letters, dictionary):
    regex = re.compile('['+letters+']')
    return [word for word in dictionary if regex.search(word)]

最后一个方法不需要花费明显的时间来扫描我的 Mac 上的 /usr/share/dict/words，这是一个列表共 234936 个字。

Try building the output instead of deleting from it:

def extract_by_letters(letters, dictionary):
    d = []
    for word in dictionary:
       for letter in letters:
           if word.count(letter)>0:
               d.append(word)
               break
    return d

Or, you could use regular expressions:

import re
def extract_by_letters(letters, dictionary):
    regex = re.compile('['+letters+']')
    d=[]
    for word in dictionary:
       if regex.search(word):
           d.append(word)
    return d

Or, perhaps the simplest way:

import re
def extract_by_letters(letters, dictionary):
    regex = re.compile('['+letters+']')
    return [word for word in dictionary if regex.search(word)]

This last one takes no noticeable time to scan /usr/share/dict/words on my Mac, which is a list of 234936 words.

回复收藏 0 原文

挽容 2024-08-30 04:50:16

这是一个应该提供主要加速的函数：

def extract_by_letters(letters,dictionary):
    letdict = zip(set(letters),[letters.count(let) for let in set(letters)])
    outarr = []
    for word in dictionary:
        goodword = True
        for letter in letdict:
            if word.count(letter) != letdict[letter]:
                goodword = False
                break
        if goodword:
            outarr.append(word)
    return outarr

这是我所做的优化：

制作了字母及其相应频率的字典。这样，当您只需要执行一次此过程并存储结果时，您就不会一遍又一遍地使用 letter.count。
我不是从字典中删除单词，而是将它们添加到从函数返回的数组中。如果你有一本很大的字典，很可能只有几个单词可以匹配。另外，如果字典变量是一个数组（我怀疑），那么每次调用remove时，它都必须首先在字典中搜索单词（从头开始线性），然后删除它。通过使用要删除的单词的索引弹出来删除速度更快。
每当发现不匹配时，就跳出循环检查字母计数。当我们已经有了答案时，这可以防止我们进行不必要的检查。

我不确定 letters 变量中是否有重复的字母，所以我确保它可以通过使用 letdict 来处理这个问题。如果您之前的字母变量中有重复的字母，那么您会重复检查单词中这些字母的计数。

Here's a function that should offer a major speed-up:

def extract_by_letters(letters,dictionary):
    letdict = zip(set(letters),[letters.count(let) for let in set(letters)])
    outarr = []
    for word in dictionary:
        goodword = True
        for letter in letdict:
            if word.count(letter) != letdict[letter]:
                goodword = False
                break
        if goodword:
            outarr.append(word)
    return outarr

Here are the optimizations I did:

Made a dictionary of the letters with their corresponding frequencies. This way, you aren't using letters.count over and over again when you only need to do this process once and store the results.
Rather than removing the words from dictionary, I add them to an array that is returned from the function. If you have a huge dictionary, chances are that only a few words will match. Also, if the dictionary variable is an array (which I suspect), then every time you called remove it had to first search for the word in dictionary (linearly starting at the beginning) and then remove it. It is faster to remove by popping using the index of the word to be removed.
Breaking out of the loop checking the letter counts whenever a mismatch is found. This prevents us from doing needless checks when we already have our answer.

I wasn't sure if you had repeated letters in the letters variable or not so I made sure that it could handle that by using letdict. If you had letters repeated in your letters variable before, then you were checking the counts of those letters in the word repeatedly.

回复收藏 0 原文

め可乐爱微笑 2024-08-30 04:50:16

import pprint
from collections import defaultdict

# This is a best approximation to what Bryan is trying to do.
# However the results are meaningless because the list is being
# mutated during iteration over it. So I haven't shown the output.

def extract_by_letters_0(letters, input_list):
    dictionary = input_list.copy()
    for word in dictionary:
       for letter in letters:
           if word.count(letter) != letters.count(letter):
               if word in dictionary: #I cant leave this line out
                   dictionary.remove(word)
    return dictionary

# This avoids the mutation.
# The results are anagrams PLUS letters that don't occur
# in the query. E.g. "same" produces "samehood" but not "sameness"
# ("sameness" has 3*"s" and 2*"e" instead of 1 of each)

def extract_by_letters_1(letters, input_list):
    dictionary = set(input_list)
    ripouts = set()
    for word in dictionary:
       for letter in letters:
           if word.count(letter) != letters.count(letter):
               ripouts.add(word)
    return dictionary - ripouts

def anagram_key(strg):
    return ''.join(sorted(list(strg)))

def check_anagrams(str1, str2):
    return sorted(list(str1)) == sorted(list(str2))

# Advice: try algorithms like this out on a SMALL data set first.
# Get it working correctly. Use different test cases. Have test code
# however primitive that check your results.
# Then if it runs slowly, helpers
# don't have to guess what you are doing.

raw_text = """
twas brillig and the slithy toves
did gyre and gimble in the wabe
same mesa seam sameness samehood
"""

lexicon = sorted(set(raw_text.split()))
print "\nlexicon:", lexicon
#
# Assuming we want anagrams:
#
# Build an anagram dictionary
#
anagram_dict = defaultdict(set)
for word in lexicon:
    anagram_dict[anagram_key(word)].add(word)

print "\nanagram_dict (len == %d):" % len(anagram_dict)
pprint.pprint(anagram_dict)

# now purge trivial entries

temp = {}
for k, v in anagram_dict.iteritems():
    if len(v) != 1:
        temp[k] = v
anagram_dict = temp
print "\nanagram_dict (len == %d):" % len(anagram_dict)
pprint.pprint(anagram_dict)

# Test cases

tests = "sam same mesa sameness samehood xsame samex".split()
default_set = frozenset()
for test in tests:
    print
    results = extract_by_letters_1(test, lexicon)
    print test, [(result, check_anagrams(test, result)) for result in results]
    # In the following statement, you can use set([test]) as the default
    # if that produces a more useful or orthogonal result.
    results = anagram_dict.get(anagram_key(test), default_set)
    print test, [(result, check_anagrams(test, result)) for result in results]

输出：

lexicon: ['and', 'brillig', 'did', 'gimble', 'gyre', 'in', 'mesa', 'same', 'samehood', 'sameness', 'seam', 'slithy', 'the', 'toves', 'twas', 'wabe']

anagram_dict (len == 14):
defaultdict(<type 'set'>, {'abew': set(['wabe']), 'eht': set(['the']), 'egry': set(['gyre']), 'begilm': set(['gimble']), 'hilsty': set(['slithy']), 'aems': set(['mesa', 'seam', 'same']), 'bgiillr': set(['brillig']), 'ddi': set(['did']), 'eostv': set(['toves']), 'adehmoos': set(['samehood']), 'in': set(['in']), 'adn': set(['and']), 'aeemnsss': set(['sameness']), 'astw': set(['twas'])})

anagram_dict (len == 1):
{'aems': set(['mesa', 'same', 'seam'])}

sam [('mesa', False), ('samehood', False), ('seam', False), ('same', False)]
sam []

same [('mesa', True), ('samehood', False), ('seam', True), ('same', True)]
same [('mesa', True), ('seam', True), ('same', True)]

mesa [('mesa', True), ('samehood', False), ('seam', True), ('same', True)]
mesa [('mesa', True), ('seam', True), ('same', True)]

sameness [('sameness', True)]
sameness []

samehood [('samehood', True)]
samehood []

xsame []
xsame []

samex []
samex []

import pprint
from collections import defaultdict

# This is a best approximation to what Bryan is trying to do.
# However the results are meaningless because the list is being
# mutated during iteration over it. So I haven't shown the output.

def extract_by_letters_0(letters, input_list):
    dictionary = input_list.copy()
    for word in dictionary:
       for letter in letters:
           if word.count(letter) != letters.count(letter):
               if word in dictionary: #I cant leave this line out
                   dictionary.remove(word)
    return dictionary

# This avoids the mutation.
# The results are anagrams PLUS letters that don't occur
# in the query. E.g. "same" produces "samehood" but not "sameness"
# ("sameness" has 3*"s" and 2*"e" instead of 1 of each)

def extract_by_letters_1(letters, input_list):
    dictionary = set(input_list)
    ripouts = set()
    for word in dictionary:
       for letter in letters:
           if word.count(letter) != letters.count(letter):
               ripouts.add(word)
    return dictionary - ripouts

def anagram_key(strg):
    return ''.join(sorted(list(strg)))

def check_anagrams(str1, str2):
    return sorted(list(str1)) == sorted(list(str2))

# Advice: try algorithms like this out on a SMALL data set first.
# Get it working correctly. Use different test cases. Have test code
# however primitive that check your results.
# Then if it runs slowly, helpers
# don't have to guess what you are doing.

raw_text = """
twas brillig and the slithy toves
did gyre and gimble in the wabe
same mesa seam sameness samehood
"""

lexicon = sorted(set(raw_text.split()))
print "\nlexicon:", lexicon
#
# Assuming we want anagrams:
#
# Build an anagram dictionary
#
anagram_dict = defaultdict(set)
for word in lexicon:
    anagram_dict[anagram_key(word)].add(word)

print "\nanagram_dict (len == %d):" % len(anagram_dict)
pprint.pprint(anagram_dict)

# now purge trivial entries

temp = {}
for k, v in anagram_dict.iteritems():
    if len(v) != 1:
        temp[k] = v
anagram_dict = temp
print "\nanagram_dict (len == %d):" % len(anagram_dict)
pprint.pprint(anagram_dict)

# Test cases

tests = "sam same mesa sameness samehood xsame samex".split()
default_set = frozenset()
for test in tests:
    print
    results = extract_by_letters_1(test, lexicon)
    print test, [(result, check_anagrams(test, result)) for result in results]
    # In the following statement, you can use set([test]) as the default
    # if that produces a more useful or orthogonal result.
    results = anagram_dict.get(anagram_key(test), default_set)
    print test, [(result, check_anagrams(test, result)) for result in results]

Output:

lexicon: ['and', 'brillig', 'did', 'gimble', 'gyre', 'in', 'mesa', 'same', 'samehood', 'sameness', 'seam', 'slithy', 'the', 'toves', 'twas', 'wabe']

anagram_dict (len == 14):
defaultdict(<type 'set'>, {'abew': set(['wabe']), 'eht': set(['the']), 'egry': set(['gyre']), 'begilm': set(['gimble']), 'hilsty': set(['slithy']), 'aems': set(['mesa', 'seam', 'same']), 'bgiillr': set(['brillig']), 'ddi': set(['did']), 'eostv': set(['toves']), 'adehmoos': set(['samehood']), 'in': set(['in']), 'adn': set(['and']), 'aeemnsss': set(['sameness']), 'astw': set(['twas'])})

anagram_dict (len == 1):
{'aems': set(['mesa', 'same', 'seam'])}

sam [('mesa', False), ('samehood', False), ('seam', False), ('same', False)]
sam []

same [('mesa', True), ('samehood', False), ('seam', True), ('same', True)]
same [('mesa', True), ('seam', True), ('same', True)]

mesa [('mesa', True), ('samehood', False), ('seam', True), ('same', True)]
mesa [('mesa', True), ('seam', True), ('same', True)]

sameness [('sameness', True)]
sameness []

samehood [('samehood', True)]
samehood []

xsame []
xsame []

samex []
samex []

回复收藏 0 原文

烟沫凡尘 2024-08-30 04:50:16

您想找到“字母”的所有字谜吗？

def anagrams(letters, words):
    letters = sorted(letters)
    result = []
    for word in words:
        if sorted(word.strip()) == letters:
            result.append(word)
    return result

You're trying to find all anagrams of 'letters'?

def anagrams(letters, words):
    letters = sorted(letters)
    result = []
    for word in words:
        if sorted(word.strip()) == letters:
            result.append(word)
    return result

回复收藏 0 原文

~没有更多了~