Python-代码优化帮助-查找单词的所有字典有效的字谜

发布于 2024-11-09 00:23:40 字数 795 浏览 4 评论 0原文

我用这个（极其低效的方法）解决了这个问题：

def createList(word, wordList):
    #Made a set, because for some reason permutations was returning duplicates.  
    #Returns all permutations if they're in the wordList

    return set([''.join(item) for item in itertools.permutations(word) if ''.join(item) in wordList])

def main():

    a = open('C:\\Python32\\megalist.txt', 'r+')
    wordList = [line.strip() for line in a]
    maximum = 0
    length = 0
    maxwords = ""

    for words in wordList:
        permList = createList(words, wordList)
        length = len(permList)
        if length > maximum:
            maximum = length
            maxwords = permList
    print (maximum, maxwords)

花了大约 10 分钟才找到具有最多字典有效字谜的五个字母单词。我想在没有字母限制的单词上运行这个，但这会花费大量的时间。有什么办法可以优化这个吗？

原文

I solved the problem using this (horribly inefficient method):

def createList(word, wordList):
    #Made a set, because for some reason permutations was returning duplicates.  
    #Returns all permutations if they're in the wordList

    return set([''.join(item) for item in itertools.permutations(word) if ''.join(item) in wordList])

def main():

    a = open('C:\\Python32\\megalist.txt', 'r+')
    wordList = [line.strip() for line in a]
    maximum = 0
    length = 0
    maxwords = ""

    for words in wordList:
        permList = createList(words, wordList)
        length = len(permList)
        if length > maximum:
            maximum = length
            maxwords = permList
    print (maximum, maxwords)

It took something like 10 minutes to find the five-letter word that has the most dictionary-valid anagrams. I want to run this on words without a letter constraint, but it would take a ludicrous amount of time. Is there anyway to optimize this?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

蹲墙角沉默 2024-11-16 00:23:40

以下内容似乎在小型词典上可以正常工作。通过对单词中的字母进行排序，可以很容易地测试两个单词是否是字谜词。从这个起点开始，只需以某种方式积累单词即可。修改它以报告所有匹配项，而不仅仅是第一个匹配项并不难。

如果您确实需要添加对字母数量的限制，那么使用迭代器是过滤掉某些单词的便捷方法。

def wordIterator(dictionaryFilename):
    with open(dictionaryFilename,'r') as f:
        for line in f:
            word = line.strip()
            yield word

def largestAnagram(words):
    import collections
    d = collections.defaultdict(list)
    for word in words:
        sortedWord = str(sorted(word))
        d[ hash(sortedWord) ].append(word)
    maxKey = max( d.keys(), key = lambda k : len(d[k]) )
    return d[maxKey]

iter = wordIterator( 'C:\\Python32\\megalist.txt' )
#iter = ( word for word in iter if len(word) == 5 )
print largestAnagram(iter)

编辑：

响应评论，hash(sortedWord)是一种节省空间的优化，在这种情况下可能为时过早，将sortedWord减少回整数，因为我们并不真正关心密钥，只要我们总能唯一地恢复所有相关的字谜词即可。仅使用 sortedWord 作为键同样有效。

max 的 key 关键字参数可让您根据谓词查找集合中的最大元素。因此，语句 maxKey = max( d.keys(), key = lambda k : len(d[k]) ) 是一个用于回答查询的简洁 Python 表达式，给定中的键字典中，哪个键具有最大长度的关联值？。在该上下文中对 max 的调用可以写成（更详细地）为 valueWithMaximumLength(d) ，其中该函数定义为：

def valueWithMaximumLength( dictionary ):
    maxKey = None
    for k, v in dictionary.items():
        if not maxKey or len(dictionary[maxKey]) < len(v):
            maxKey = k
    return maxKey

This following seems to work OK on a smallish dictionary. By sorting the letters in the word, it becomes easy to test if two words are an anagram. From that starting point, it's just a matter of accumulating the words in some way. It wouldn't be hard to modify this to report all matches, rather than just the first one

If you do need to add constraints on the number of letters, the use of iterators is a convenient way to filter out some words.

def wordIterator(dictionaryFilename):
    with open(dictionaryFilename,'r') as f:
        for line in f:
            word = line.strip()
            yield word

def largestAnagram(words):
    import collections
    d = collections.defaultdict(list)
    for word in words:
        sortedWord = str(sorted(word))
        d[ hash(sortedWord) ].append(word)
    maxKey = max( d.keys(), key = lambda k : len(d[k]) )
    return d[maxKey]

iter = wordIterator( 'C:\\Python32\\megalist.txt' )
#iter = ( word for word in iter if len(word) == 5 )
print largestAnagram(iter)

Edit:

In response to the comment, the hash(sortedWord), is a space saving optimization, possibly premature in this case, to reduce sortedWord back to an integer, because we don't really care about the key, so long as we can always uniquely recover all the relevant anagrams. It would have been equally valid to just use sortedWord as the key.

The key keyword argument to max lets you find the maximum element in a collection based on a predicate. So the statement maxKey = max( d.keys(), key = lambda k : len(d[k]) ) is a succinct python expression for answering the query, given the keys in the dictionary, which key has the associated value with maximum length?. That call to max in that context could have been written (much more verbosely) as valueWithMaximumLength(d) where that function was defined as:

def valueWithMaximumLength( dictionary ):
    maxKey = None
    for k, v in dictionary.items():
        if not maxKey or len(dictionary[maxKey]) < len(v):
            maxKey = k
    return maxKey

回复收藏 0 原文

慵挽 2024-11-16 00:23:40

wordList 应该是一个集合。

测试列表中的成员资格要求您扫描列表中的所有元素，检查它们是否是您生成的单词。测试集合中的成员资格（平均而言）并不取决于集合的大小。

下一个明显的优化是，一旦测试了一个单词，您就可以从 wordList 中删除它的所有排列，因为它们将在 createList 中给出完全相同的集合。如果一切都用set完成，那么这是一个非常简单的操作——实际上，您只需使用二进制减号即可。

回复收藏 0 原文

娇柔作态 2024-11-16 00:23:40

没有必要进行所有排列，这会浪费内存和 CPU。
因此，首先您的字典应该保存在二叉树中，如下所示：

   e.g. Dict = ['alex', 'noise', 'mother', 'xeal', 'apple', 'google', 'question']

   Step 1: find the "middle" word for the dictionary, e.g. "mother", because "m" 
           is somewhere in the middle of the English alphabet 
           (this step is not necessary, but it helps to balance the tree a bit)

   Step 2: Build the binary tree:

               mother 
              /      \
             /        \
         alex          noise
           \            /  \
            \          /    \
           apple   question xeal
             \
              \
             google

  Step 3: start looking for an anagram by permutations:
  alex: 1. "a"... (going deeper into binary tree, we find a word 'apple')
        1.1 # of, course we should omit apple, because len('apple') != len('alex')
            #  but that's enough for an example:
        2. Check if we can build word "pple" from "lex": ("a" is already reserved!)
           -- there is no "p" in "lex", so skipping, GOTO 1            
        ... 
        1. "l"... - nothing begins with "l"...
        1. "l"... - nothing begins with "e"...
        1. "x" - going deeper, "xeal" found
        2. Check if "eal" could be built from "ale" ("x" is already reserved)
           for letter in "eal":
               if not letter in "ale":
                  return False
           return True

就是这样:) 这个算法应该运行得更快。

编辑：

检查 bintrees 包以避免在二叉树实现上花费时间。

There is no need to do ALL permutations, - it's a waste of memory and CPU.
So, first of all your dictionary should be kept in a binary tree, like this:

   e.g. Dict = ['alex', 'noise', 'mother', 'xeal', 'apple', 'google', 'question']

   Step 1: find the "middle" word for the dictionary, e.g. "mother", because "m" 
           is somewhere in the middle of the English alphabet 
           (this step is not necessary, but it helps to balance the tree a bit)

   Step 2: Build the binary tree:

               mother 
              /      \
             /        \
         alex          noise
           \            /  \
            \          /    \
           apple   question xeal
             \
              \
             google

  Step 3: start looking for an anagram by permutations:
  alex: 1. "a"... (going deeper into binary tree, we find a word 'apple')
        1.1 # of, course we should omit apple, because len('apple') != len('alex')
            #  but that's enough for an example:
        2. Check if we can build word "pple" from "lex": ("a" is already reserved!)
           -- there is no "p" in "lex", so skipping, GOTO 1            
        ... 
        1. "l"... - nothing begins with "l"...
        1. "l"... - nothing begins with "e"...
        1. "x" - going deeper, "xeal" found
        2. Check if "eal" could be built from "ale" ("x" is already reserved)
           for letter in "eal":
               if not letter in "ale":
                  return False
           return True

That's it :) This algorithm should work much faster.

EDIT:

Check bintrees package to avoid spending time on binary tree implementation.

回复收藏 0 原文

~没有更多了~