
Finding the longest word that can be made from arbitrary letters

Posted 2024-09-12 22:48:54

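I have 10 arbitrary letters and need to find the longest word in a word file that can be made from them.

I started learning regular expressions (RE) a while ago and can't seem to find a suitable pattern.
- My first idea was a character set, [10 chars], but that also lets the included characters repeat, and I don't know how to avoid that. Maybe RE isn't needed at all and this can be solved without it.
- Using a "for this in that:" loop seems inappropriate, but maybe itertools can do it easily (I'm not familiar with it).

I guess even novice programmers/scripters know the solution, but I don't. Thanks.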


我是有多爱你 2024-09-19 22:48:54

I'm guessing this is something like finding possible words given a set of Scrabble tiles, so that a character can be repeated only as many times as it is repeated in the original list.

The trick is to efficiently test each character of each word in your word file against a test collection holding your source letters (duplicates included). For each character, if it is found in the test collection, remove it and proceed; otherwise, the word is not a match, so go on to the next word.

Python has a nice function all for testing a set of conditions based on elements in a sequence. all has the added feature that it will "short-circuit", that is, as soon as one item fails the condition, then no more tests are done. So if your first letter of your candidate word is 'z', and there is no 'z' in your source letters, then there is no point in testing any more letters in the candidate word.
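The short-circuit behaviour described above can be observed directly. This is a minimal sketch; the helper in_source, the list checked, and the sample letters are made up for illustration and are not part of the original answer.

```python
checked = []  # records every character that actually gets tested


def in_source(c, source="aelmnoprst"):
    """Return True if `c` is one of the (hypothetical) source letters."""
    checked.append(c)
    return c in source


# 'z' is not among the source letters, so all() stops after the first test.
result = all(in_source(c) for c in "zebra")
print(result, checked)  # False ['z']
```

Only the first character is ever examined, which is why a generator expression (rather than a prebuilt list) is the natural argument for all here.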

My first shot at writing this was simply:

matches = []
for word in wordlist:
    testset = set(letters)
    if all(c in testset for c in word):
        matches.append(word)

Unfortunately, the bug here is that if the source letters contain a single 'm', a word with several 'm's would erroneously match, since each 'm' would separately match the one 'm' in the source testset. So I needed to remove each letter as it was matched. A set also collapses repeated source letters into one, so testset has to be a list if a letter that appears twice in the source may be used twice.

I took advantage of the fact that list.remove(item) returns None, which Python treats as a Boolean False, and expanded the generator expression passed to all. For each c in word, if it is found in testset, I want to additionally remove it from testset, something like (pseudo-code, not valid Python):

all(c in testset and "remove c from testset" for c in word)

Since list.remove returns None, I can replace the quoted bit above with "not testset.remove(c)", and now I have a valid Python expression:

all(c in testset and not testset.remove(c) for c in word)

Now we just need to wrap that in a loop that checks each word in the list (be sure to build a fresh testset before checking each word, since our all test has now become a destructive test):

matches = []
for word in wordlist:
    testset = list(letters)  # a list, not a set, so duplicate source letters are kept
    if all(c in testset and not testset.remove(c) for c in word):
        matches.append(word)
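For reuse, the loop can be wrapped in a function. The sketch below is one possible packaging, not the answer's own code: the name find_matches and the sample words are assumptions, and it deliberately keeps the source letters in a list rather than a set so that a letter appearing twice in the source can be used twice.

```python
def find_matches(wordlist, letters):
    """Return the words that can be spelled from `letters`, using each letter at most once."""
    matches = []
    for word in wordlist:
        pool = list(letters)  # fresh copy per word; a list keeps duplicate letters
        # list.remove returns None, so `not pool.remove(c)` is always True
        # and only serves to consume the matched letter.
        if all(c in pool and not pool.remove(c) for c in word):
            matches.append(word)
    return matches


print(find_matches(["we", "wall", "well", "wellll"], "ewll"))  # ['we', 'well']
```

With a set instead of a list, "well" would be rejected here, because the two source 'l's would collapse into one.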

The final step is to sort the matches by descending length. We can pass a key function to sort. The builtin len would work, but it sorts by ascending length. To get a descending sort, we use a lambda that returns -len(wd) instead of len(wd):

matches.sort(key=lambda wd: -len(wd))

Now you can just print out the longest word, at matches[0], or iterate over all matches and print them out.
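As an alternative to negating the key, sorted and list.sort accept reverse=True, and max with key=len picks the longest word without sorting at all. A small sketch with made-up sample data:

```python
matches = ["we", "well", "elm", "mellow"]  # hypothetical result list

# Descending by length; ties keep their original relative order.
by_length = sorted(matches, key=len, reverse=True)

# If only the single longest word is needed, no sort is required.
longest = max(matches, key=len)

print(by_length)  # ['mellow', 'well', 'elm', 'we']
print(longest)    # mellow
```

max raises ValueError on an empty list, so pass default=None (or check the list first) when there may be no matches.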

(I was surprised that this brute force approach runs so well. I used the 2of12inf.txt word list, containing over 80,000 words, and for a list of 10 characters, I get back the list of matches in about 0.8 seconds on my little 1.99GHz laptop.)


阳光下的泡沫是彩色的 2024-09-19 22:48:54

I think this code will do what you are looking for:

>>> words = open('file.txt').read()
>>> max(len(word) for word in set(words.split()))

If you require more sophisticated tokenising, for example if you're not using Latin text, you should use NLTK:

>>> import nltk
>>> words = open('file.txt').read()
>>> max(len(word) for word in set(nltk.word_tokenize(words)))
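Outside the interactive prompt, the same measurement might be written with a with block so the file is closed promptly. This is a sketch only; the function names are invented for illustration, and like the snippet above it reports the length of the longest token in the file without filtering by the available letters.

```python
def longest_word_length(text):
    """Length of the longest whitespace-separated token in `text` (0 if there is none)."""
    return max((len(word) for word in text.split()), default=0)


def longest_word_length_in_file(path, encoding="utf-8"):
    """Read the whole file at `path` (a hypothetical word file) and measure its longest token."""
    with open(path, encoding=encoding) as fh:  # closed automatically on exit
        return longest_word_length(fh.read())


print(longest_word_length("we wall well mellow"))  # 6
```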


肩上的翅膀 2024-09-19 22:48:54

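I assume you're trying to find the longest word that can be made from your 10 arbitrary letters.

You can store the 10 arbitrary letters in a dictionary along with how often each one occurs.

For example, say your 4 (using 4 instead of 10 for simplicity) arbitrary letters are: e, w, l, l. As a dictionary this would be:
{'e':1, 'w':1, 'l':2}

Then, for each word in the text file, check whether every letter of the word can be found in your dictionary of arbitrary letters, no more often than its count there. If so, it is one of your candidate words.

So, for the words:
we
wall
well

All the letters in well can be found in your dictionary of arbitrary letters, so save it along with its length for comparison against the other words. (we also qualifies but is shorter; wall is rejected because there is no a.)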
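The frequency-dictionary idea maps naturally onto collections.Counter from the standard library. The following is a minimal sketch under that assumption; the function name longest_spellable is hypothetical, and the sample words are the ones from the example above.

```python
from collections import Counter


def longest_spellable(words, letters):
    """Return the longest word in `words` that can be built from `letters`, or None."""
    available = Counter(letters)  # e.g. Counter({'l': 2, 'e': 1, 'w': 1})
    # Counter subtraction drops non-positive counts, so an empty (falsy)
    # result means the word needs nothing beyond what is available.
    candidates = [w for w in words if not (Counter(w) - available)]
    return max(candidates, key=len, default=None)


print(longest_spellable(["we", "wall", "well"], "ewll"))  # well
```

Unlike the destructive remove approach, this builds one Counter per word and never mutates the source letters, so nothing has to be rebuilt between words.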
