Large free block of English non-pronoun text

Posted on 2024-08-28 02:44:03

As part of teaching myself Python, I've written a script that lets a user play hangman. At the moment, the word to be guessed is simply entered manually at the start of the script's code.

Instead, I want the script to choose randomly from a large list of English words. That part I know how to do; my problem is finding the list of words to work from in the first place.

Does anyone know of a source on the net for, say, 1000 common English words that can be downloaded as a block of text or something similar that I can work with?

(My initial thought was to grab a chunk of a novel from Project Gutenberg [this project is only for my own amusement and won't be available anywhere else, so copyright etc. doesn't matter hugely to me, btw], but anything like that is likely to contain too many names or non-standard words that wouldn't be suitable for hangman. Basically, I need text that only has words legal for use in Scrabble.)

It's a slightly odd question for here, I suppose, but I thought the answer might be of use not just to me but to anyone else working on a word game or similar project that needs a large seed list of words.

Many thanks for any links or suggestions :)

Comments (3)

北凤男飞 2024-09-04 02:44:03

Would this be useful?

紫轩蝶泪 2024-09-04 02:44:03

Have you tried /usr/share/dict/words?
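
For anyone who wants to try this, here is a minimal sketch of reading that file and picking a hangman word. It assumes a Unix-like system where /usr/share/dict/words exists (it is not present everywhere). The helper names `load_words` and `pick_word` and the length limits are made up for illustration. The filter drops capitalized entries (mostly proper nouns) and anything containing apostrophes or other non-letters.

```python
import random
from pathlib import Path

# Assumed location; present on many Linux/macOS systems, but not all.
DICT_PATH = Path("/usr/share/dict/words")


def load_words(lines, min_len=4, max_len=12):
    """Keep lowercase, purely alphabetic ASCII words of a sensible length."""
    words = []
    for line in lines:
        word = line.strip()
        if (min_len <= len(word) <= max_len
                and word.isascii() and word.isalpha() and word.islower()):
            words.append(word)
    return words


def pick_word(words, rng=random):
    """Return one random word from the list."""
    return rng.choice(words)


if __name__ == "__main__":
    if DICT_PATH.exists():
        with DICT_PATH.open(encoding="utf-8", errors="ignore") as f:
            print(pick_word(load_words(f)))
    else:
        print(f"{DICT_PATH} not found on this system")
```

Because `load_words` takes any iterable of lines, the same filter should work on a word list downloaded from elsewhere.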

千纸鹤带着心事 2024-09-04 02:44:03

Create text list manually

Grab text from Project Gutenberg, Wikipedia, or some other source. Go through the text and count how many times each word occurs. The most frequent words will be pronouns, conjunctions, and so on; just throw them out.
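
A rough sketch of that counting step, using only the standard library. The function names and the `n_drop` cutoff are illustrative assumptions; how many top words to discard is something you would tune by eye. The tokenizer is deliberately crude and splits contractions such as "don't" into two tokens.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercase alphabetic tokens in a block of text."""
    # Crude tokenizer: runs of a-z only, so "don't" becomes "don" and "t".
    return Counter(re.findall(r"[a-z]+", text.lower()))


def drop_most_frequent(counts, n_drop):
    """Return words ranked by frequency, with the n_drop most common removed."""
    ranked = [word for word, _ in counts.most_common()]
    return ranked[n_drop:]


sample = "the cat and the dog and the bird and the fish"
counts = word_counts(sample)
# "the" and "and" are the two most frequent tokens here, so they are dropped.
print(drop_most_frequent(counts, n_drop=2))
```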

Proper nouns will likely be among the least frequent words, unless your text is a story, in which case character names will appear quite often. Probably the best way to handle proper nouns is to use many sources and count how many sources each word appears in. Words that are common across many different sources are unlikely to be proper nouns, and words specific to a single source can be thrown out. This idea is related to tf-idf.
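
The many-sources idea could look something like the sketch below. It computes a plain document frequency (how many sources contain each word), not full tf-idf, and the function names, the `min_sources` threshold, and the three toy "sources" are all assumptions for illustration.

```python
import re
from collections import Counter


def document_frequency(sources):
    """Map each word to the number of sources it appears in."""
    df = Counter()
    for text in sources:
        # set() so a word counts once per source, however often it repeats
        df.update(set(re.findall(r"[a-z]+", text.lower())))
    return df


def words_in_at_least(sources, min_sources):
    """Words that occur in at least min_sources of the given texts."""
    df = document_frequency(sources)
    return sorted(word for word, n in df.items() if n >= min_sources)


sources = [
    "Elizabeth walked to the garden",
    "Ishmael walked to the harbour",
    "the garden by the harbour",
]
# The character names each appear in only one source, so they are filtered out.
print(words_in_at_least(sources, min_sources=2))
```

Very common function words such as "the" survive this filter; removing them is the job of the frequency cutoff described earlier.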

Once you have calculated these word frequencies, it's also easy to look over the words and tweak your list as necessary.

Use WordNet

Another idea is to download words from WordNet. WordNet gives the part of speech for a lot of words, so you could stick to nouns and verbs for your purpose.
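
One possible way to do this from Python is through the NLTK package, sketched below. This is an assumption, not the only route: it requires `pip install nltk` and a one-time `nltk.download("wordnet")`. The helper names are hypothetical. Note that WordNet's noun list also includes proper nouns, and this simple filter may not remove them, so the result probably still needs a manual look.

```python
def is_plain_word(lemma, min_len=4):
    """True for single lowercase alphabetic words (rejects 'ice_cream', 'a.m.')."""
    return (len(lemma) >= min_len and lemma.isascii()
            and lemma.isalpha() and lemma.islower())


def wordnet_nouns_and_verbs():
    """Collect noun and verb lemmas from WordNet via NLTK (assumed installed)."""
    # Imported here so the filter above works even without NLTK.
    from nltk.corpus import wordnet as wn

    words = set()
    for pos in ("n", "v"):  # WordNet part-of-speech tags for noun and verb
        words.update(l for l in wn.all_lemma_names(pos=pos) if is_plain_word(l))
    return sorted(words)


if __name__ == "__main__":
    try:
        words = wordnet_nouns_and_verbs()
        print(len(words), "candidate words")
    except (ImportError, LookupError):
        # ImportError: NLTK missing; LookupError: WordNet data not downloaded.
        print('Install nltk and run nltk.download("wordnet") first')
```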
