Finding the maximum-length word from arbitrary letters
I have 10 arbitrary letters and need to find the longest matching word in a words file.
I started learning regular expressions (RE) only a short while ago, and can't seem to find a suitable pattern.
- My first idea was to use a character set, [10 chars], but it also matches repeated occurrences of the included chars, and I don't know how to avoid that.
- I started learning Python recently, before RE, so maybe RE is not needed and this can be solved without it.
- Using a "for this in that:" iterator seems inappropriate, but maybe itertools can do it easily (I'm not familiar with it).
I guess the solution is known even to novice programmers/scripters, but not to me.
Thanks
I'm guessing this is something like finding possible words given a set of Scrabble tiles, so that a character can be repeated only as many times as it is repeated in the original list.
The trick is to efficiently test each character of each word in your word file against a set containing your source letters. For each character, if found in the test set, remove it from the test set and proceed; otherwise, the word is not a match, and go on to the next word.
Python has a nice function, `all`, for testing a set of conditions based on elements in a sequence. `all` has the added feature that it will "short-circuit": as soon as one item fails the condition, no more tests are done. So if the first letter of your candidate word is 'z', and there is no 'z' in your source letters, there is no point in testing any more letters in the candidate word.

My first shot at writing this simply checked, with `all`, that every letter of the candidate word was present in the test set.
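A minimal sketch of such a membership-only first attempt might look like the following. The function name `naive_match` and the sample letters are illustrative assumptions, not taken from the original post:

```python
def naive_match(word, letters):
    """Return True if every letter of `word` occurs somewhere in `letters`.

    Hypothetical helper: it only tests membership and never counts
    how often a letter is used.
    """
    testset = set(letters)
    # all() stops at the first letter that is not in the test set
    return all(c in testset for c in word)


letters = "malbcdefgh"  # assumed sample of 10 source letters

print(naive_match("zebra", letters))  # False: 'z' fails immediately
print(naive_match("blame", letters))  # True: every letter is available
```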
Unfortunately, the bug here is that if the source letters contained a single 'm', a word with several 'm's would erroneously match, since each 'm' would separately match the given 'm' in the source testset. So I needed to remove each letter as it was matched.
I took advantage of the fact that `set.remove(item)` returns `None`, which Python treats as a Boolean `False`, and expanded the generator expression used in calling `all`. For each `c` in `word`, if it is found in `testset`, I want to additionally remove it from `testset`, something like testing `c in testset` and then "remove c from testset" (pseudo-code, not valid Python).

Since `set.remove` returns `None`, I can replace the quoted bit above with `not testset.remove(c)`, and now I have a valid Python expression.
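The destructive test described above can be sketched as follows. One deliberate change is assumed here: the test collection is a `list` rather than a `set`, because a set collapses repeated source letters (two 'l' tiles would become one), while `list.remove` also returns `None`, so the same trick applies. The name `destructive_match` is hypothetical:

```python
def destructive_match(word, letters):
    """Return True if `word` can be built from `letters`, using each
    source letter at most as often as it occurs in `letters`."""
    # Assumption: a list is used instead of a set so that repeated
    # source letters are kept; list.remove() returns None like set.remove().
    testset = list(letters)
    # `not testset.remove(c)` removes the matched letter and evaluates to
    # True; it only runs when `c in testset` has already succeeded.
    return all(c in testset and not testset.remove(c) for c in word)


print(destructive_match("mammal", "malbcdefgh"))  # False: only one 'm' available
print(destructive_match("well", "ewll"))          # True: both 'l's are available
```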
Now we just need to wrap that in a loop that checks each word in the list (be sure to build a fresh `testset` before checking each word, since our `all` test has now become a destructive test).

The final step is to sort the matches by descending length. We can pass a key function to `sort`. The builtin `len` would be good, but that would sort by ascending length. To change it to a descending sort, we use a lambda to give us not `len`, but `-1 * len`.

Now you can just print out the longest word, at `matches[0]`, or iterate over all matches and print them out.
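Putting the loop and the sort together, a sketch under stated assumptions could look like this. The function name `find_matches` and the small in-memory word list are illustrative; in practice the words would be read from a file such as the 2of12inf.txt list mentioned below. As an assumption of this sketch, the test collection is a list so that repeated source letters are preserved:

```python
def find_matches(letters, wordlist):
    """Return all words in `wordlist` that can be built from `letters`,
    sorted with the longest word first."""
    matches = []
    for word in wordlist:
        # Build a fresh test collection for every word, because the
        # all() test below removes letters as it matches them.
        testset = list(letters)
        if all(c in testset and not testset.remove(c) for c in word):
            matches.append(word)
    # Negating the length turns the default ascending sort into a
    # descending one.
    matches.sort(key=lambda w: -1 * len(w))
    return matches


words = ["we", "wall", "well", "wee"]  # stand-in for the words file
matches = find_matches("ewll", words)
print(matches[0])  # well
```

Passing `key=len, reverse=True` to `sort` would give the same ordering and may read more directly.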
(I was surprised that this brute force approach runs so well. I used the 2of12inf.txt word list, containing over 80,000 words, and for a list of 10 characters, I get back the list of matches in about 0.8 seconds on my little 1.99GHz laptop.)
I think this code will do what you are looking for:
If you require more sophisticated tokenising, for example if you're not using Latin text, you should use NLTK:
I assume you are trying to find out what is the longest word that can be made from your 10 arbitrary letters.
You can keep your 10 arbitrary letters in a dict along with the frequency they occur.
e.g., your 4 (using 4 instead of 10 for simplicity) arbitrary letters are: e, w, l, l. This would be in a dict as:
{'e':1, 'w':1, 'l':2}
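Then for each word in the text file, see if all of the letters for that word can be found in your dict of arbitrary letters, with no letter needed more often than its count allows. If so, then that is one of your candidate words.
So, given these words in the file:
we
wall
well
all of the letters in "well" would be found in your dict of arbitrary letters, so save it and its length for comparison against other words. "we" also qualifies but is shorter, while "wall" fails because there is no 'a' among the letters.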
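The frequency-dict idea above can be sketched with `collections.Counter`, which builds exactly this kind of letter-to-count mapping. The function name `longest_words` is a hypothetical choice for illustration, and the word list is the small example from this answer rather than a real words file:

```python
from collections import Counter


def longest_words(letters, words):
    """Return the words that can be built from `letters`, longest first."""
    available = Counter(letters)  # e.g. {'l': 2, 'e': 1, 'w': 1} for "ewll"
    candidates = []
    for word in words:
        needed = Counter(word)
        # A missing letter has count 0 in a Counter, so this also
        # rejects words that use letters we do not have at all.
        if all(available[c] >= n for c, n in needed.items()):
            candidates.append(word)
    return sorted(candidates, key=len, reverse=True)


print(longest_words("ewll", ["we", "wall", "well"]))  # ['well', 'we']
```

Unlike the remove-as-you-go approach, this version never mutates the source letters, so nothing has to be rebuilt for each word.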