I guess you could classify this as a Scrabble style problem, but it started out due to a friend mentioning the UK TV quiz show Countdown. Various rounds in the show involve the contestants being presented a scrambled set of letters and they have to come up with the longest word they can. The one my friend mentioned was "RAEPKWAEN".
In fairly short order I whipped up something in Python to handle this problem, using PyEnchant to handle the dictionary look-ups, however I'm noticing that it really can't scale all that well.
Here's what I have currently:
#!/usr/bin/python
from itertools import permutations
import enchant
from sys import argv

def find_longest(origin):
    s = enchant.Dict("en_US")
    for i in range(len(origin), 0, -1):
        print "Checking against words of length %d" % i
        pool = permutations(origin, i)
        for comb in pool:
            word = ''.join(comb)
            if s.check(word):
                return word
    return ""

if __name__ == '__main__':
    result = find_longest(argv[1])
    print result
That's fine for a 9-letter example like they use in the show: 9 factorial = 362,880 and 8 factorial = 40,320. On that scale, even if it had to check every permutation at every word length, it's not that many.
However, once you reach 14 characters there are 87,178,291,200 possible permutations, meaning you're reliant on luck that a 14-character word is found quickly.
With the example word above it's taking my machine about 12 1/2 seconds to find "reawaken". With 14 character scrambled words we could be talking on the scale of 23 days just to check all possible 14 character permutations.
Is there any more efficient way to handle this?
Implementation of Jeroen Coupé's idea from his answer, with letter counts:
Output (for my small 58,000-word dictionary):
Notes:
- It's a simple implementation without optimizations.
- words_list.txt can be /usr/share/dict/words on Linux.

UPDATE
In case we need to find the word only once, and we have a dictionary with words sorted by length, e.g. produced by this script:
We can find the longest word without loading the full dictionary into memory:
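The original code didn't survive extraction, but the letter-count idea can be sketched as follows (Python 3; the `WORDS` list is a tiny placeholder for words_list.txt, and the function names are mine):

```python
# Sketch of the letter-count approach: instead of permuting the rack,
# scan the dictionary once and keep the longest word whose letter counts
# fit inside the rack's letter counts.
from collections import Counter

WORDS = ["reawaken", "weakener", "wake", "paw", "rake"]  # placeholder dictionary

def fits(word, rack_counts):
    """True if `word` can be spelled using only the rack's letters."""
    return all(rack_counts[c] >= n for c, n in Counter(word).items())

def find_longest(rack, words=WORDS):
    rack_counts = Counter(rack.lower())
    best = ""
    for word in words:
        if len(word) > len(best) and fits(word.lower(), rack_counts):
            best = word
    return best
```

This is a single linear pass over the dictionary, so it scales with dictionary size rather than with the factorial of the rack length.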
You want to avoid doing the permutation. You could count how many times a character appears in both strings (the original string and the one from the dictionary). Dismiss all the words from the dictionary where the frequency of characters isn't the same.
So to check one word from the dictionary you will need to count the characters at most max(26, n) times.
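A minimal sketch of that per-word check (the name `can_make` is mine); it touches at most len(rack) + len(word) characters per dictionary word:

```python
# Count the rack's letters once, then walk the candidate word,
# decrementing counts. Any shortfall means the word can't be made.
def can_make(word, rack):
    counts = {}
    for c in rack:
        counts[c] = counts.get(c, 0) + 1
    for c in word:
        if counts.get(c, 0) == 0:
            return False
        counts[c] -= 1
    return True
```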
Then, when you are searching for a given set of letters:
You'd need to do this separately for each word length.
EDIT: should say that you're searching for all unique combinations of the sorted letters of the target word length (range(len(letters), 0, -1)).
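This answer's code was lost in extraction; a hedged sketch of the combination approach it describes (function names are mine, `WORDS` is a placeholder dictionary) could look like:

```python
# Index the dictionary by each word's sorted letters, then for each target
# length (longest first) look up every unique combination of the rack's
# sorted letters. Combinations ignore order, so there are far fewer of
# them than permutations.
from itertools import combinations

WORDS = ["reawaken", "wake", "rake", "paw"]  # placeholder dictionary

def build_index(words):
    index = {}
    for w in words:
        index.setdefault(''.join(sorted(w)), []).append(w)
    return index

def find_longest(rack, index):
    letters = sorted(rack.lower())
    for length in range(len(letters), 0, -1):
        seen = set()
        for combo in combinations(letters, length):
            if combo in seen:
                continue  # duplicate letters produce duplicate combos
            seen.add(combo)
            hit = index.get(''.join(combo))
            if hit:
                return hit[0]
    return ""
```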
This is similar to an anagram problem I've worked on before. I solved that by using prime numbers to represent each letter. The product of the letters for each word produces a number. To determine if a given set of input characters is sufficient to make a word, just divide the product of the input characters by the product of the word you want to check. If there is no remainder then the input characters are sufficient. I've implemented it below. The output is:
You can find more details and a thorough explanation of the anagrams case at:
http://mostlyhighperformance.blogspot.com/2012/01/generating-anagrams-efficient-and-easy.html
This algorithm takes a small amount of time to set up the dictionary, and then individual checks are as easy as a single division for every word in the dictionary. There may be faster methods that rely on closing off parts of the dictionary if it lacks a letter, but these may end up performing worse if you have a large number of input letters, so that no part of the dictionary can actually be closed off.
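The implementation referenced above was stripped in extraction; a minimal sketch of the prime-number trick (names are mine) is:

```python
# Map a..z to the first 26 primes; a word's product encodes its letter
# multiset. A candidate word can be made from the rack exactly when the
# rack's product is divisible by the word's product.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def product(word):
    p = 1
    for c in word.lower():
        p *= PRIMES[ord(c) - ord('a')]
    return p

def can_make(word, rack_product):
    return rack_product % product(word) == 0
```

Divisibility works because unique prime factorization makes the product a faithful encoding of letter counts: 'weakener' has three 'e's, so its product contains 11 cubed, which the RAEPKWAEN product (only two 'e's) cannot absorb.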
I started this last night shortly after you asked the question, but didn't get around to polishing it up until just now. This was my solution, which is basically a modified trie, which I didn't know until today!
Testing:
I think I prefer dermatoglyphics to uncopyrightable for longest word, myself. Performance-wise, utilizing a ~500k word dictionary (from here),
So, on average, 6/10ths of a second (on my i5-2500) to find all sixty-seven thousand words that contain no repeating letters.
The big difference between this implementation and a trie (which makes it even further from a DAWG in general) is that words are stored in the trie in relation to their sorted letters. So the word 'dog' is stored under the same path as 'god': d-g-o. The second bit is the find_max_word algorithm, which makes sure every possible letter combination is visited by continually lopping off its head and re-running the search.
Oh, and just for giggles:
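The author's code isn't shown here, but the sorted-letter trie described above can be reconstructed roughly like this (a sketch under my own names, not the original implementation):

```python
# Words are inserted under their sorted letters, so anagrams share a path.
# The search walks the sorted rack, at each node either consuming the next
# rack letter (if a matching child exists) or skipping it, which visits
# every distinct letter combination.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.words = []

def insert(root, word):
    node = root
    for c in sorted(word):
        node = node.children.setdefault(c, TrieNode())
    node.words.append(word)

def search(node, letters, found):
    found.extend(node.words)
    for i in range(len(letters)):
        if i > 0 and letters[i] == letters[i - 1]:
            continue  # identical letters would revisit the same branch
        child = node.children.get(letters[i])
        if child:
            search(child, letters[i + 1:], found)

def find_max_word(root, rack):
    found = []
    search(root, sorted(rack.lower()), found)
    return max(found, key=len) if found else ""
```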
Another approach, similar to @market's answer, is to precompute a 'bitmask' for each word in the dictionary. Bit 0 is set if the word contains at least one A, bit 1 is set if it contains at least one B, and so on up to bit 25 for Z.
If you want to search for all words in the dictionary that could be made up from a combination of letters, you start by forming the bitmask for the collection of letters. You can then filter out all of the words that use other letters by checking whether wordBitmask & ~lettersBitMask is zero. If this is zero, the word only uses letters available in the collection, and so could be valid. If this is non-zero, it uses a letter not available in the collection and so is not allowed.
The advantage of this approach is that the bitwise operations are fast. The vast majority of words in the dictionary will use at least one of the 17 or more letters that aren't in the given collection, and you can speedily discount them all. However, for the minority of words that make it through the filter, there is one more check that you still have to make: that the words aren't using letters more often than they appear in the collection. For example, the word 'weakener' must be disallowed because it has three 'e's, whereas there are only two in the collection of letters RAEPKWAEN. The bitwise approach alone will not filter out this word since each letter in the word appears in the collection.
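The two-stage filter described above can be sketched like this (names are mine; the exact count check uses collections.Counter as the second stage):

```python
# Stage 1: one bit per letter, a cheap AND/NOT test discards any word
# using a letter absent from the rack. Stage 2: an exact letter-count
# check for the few survivors (catches e.g. 'weakener' vs two 'e's).
from collections import Counter

def bitmask(word):
    m = 0
    for c in word.lower():
        m |= 1 << (ord(c) - ord('a'))
    return m

def candidates(rack, words):
    rack_mask = bitmask(rack)
    rack_counts = Counter(rack.lower())
    for w in words:
        if bitmask(w) & ~rack_mask:
            continue  # uses a letter the rack doesn't have at all
        if all(rack_counts[c] >= n for c, n in Counter(w.lower()).items()):
            yield w
```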
When looking for words longer than 10 letters you may try to iterate over words longer than 10 letters (I think there are not that many) and check whether you have the required letters in your set.
The problem is that you have to find all those len(word) >= 10 words first.
So, what I would do:
When reading the dictionary, split the words into two categories: shorts and longs. You can process the shorts by iterating over every possible permutation. Then you can process the longs by iterating over them and checking whether they are possible.
Of course, many optimisations are possible for both paths.
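A rough sketch of that split (the cutoff of 10 letters follows the answer; the function name and structure are my own assumptions):

```python
# Long words (>= cutoff letters) are few, so a letter-count check per word
# is cheap. Short words are found by brute-force permutations of the rack,
# longest lengths first, as in the original question.
from collections import Counter
from itertools import permutations

def find_longest(rack, words, cutoff=10):
    rack = rack.lower()
    shorts = set(w for w in words if len(w) < cutoff)
    longs = [w for w in words if len(w) >= cutoff]
    rack_counts = Counter(rack)
    best = ""
    for w in longs:
        if len(w) > len(best) and all(rack_counts[c] >= n
                                      for c, n in Counter(w).items()):
            best = w
    if best:
        return best
    for length in range(min(len(rack), cutoff - 1), 0, -1):
        for p in permutations(rack, length):
            w = ''.join(p)
            if w in shorts:
                return w
    return ""
```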
At this point, your trie is the representation of all words in your dictionary that can be constructed from your bag of letters.
Edit: you may also use a DAWG (Directed Acyclic Word Graph), which will have fewer vertices. Although I haven't read it, this Wikipedia article has a link about The World's Fastest Scrabble Program.
DAWG (Directed Acyclic Word Graph)
Mark Wutka was kind enough to provide some Pascal code here.
In case you have a text file with sorted words, this code simply does the math:
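The code itself didn't survive extraction; a minimal sketch of the idea, assuming the file's lines are words sorted by length, longest first (the function name is mine), could be:

```python
# Stream a length-sorted word list and return the first word the rack can
# spell; no need to hold the whole file in memory, and the first hit is
# guaranteed to be a longest match.
from collections import Counter

def first_match(rack, lines):
    rack_counts = Counter(rack.lower())
    for line in lines:
        word = line.strip().lower()
        if word and all(rack_counts[c] >= n for c, n in Counter(word).items()):
            return word
    return ""
```

In real use, `lines` would be an open file object rather than a list.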