Separate word lists of nouns, verbs, adjectives, etc.
Usually word lists come as one file that contains everything, but are there separately downloadable noun lists, verb lists, adjective lists, etc.?
I need them for English specifically.
If you download just the database files from wordnet.princeton.edu/download/current-version, you can extract the words by running these commands:
Or if you only want single words (no underscores)
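A minimal sketch of that extraction, assuming the WordNet 3.0 data.* layout (space-indented license header lines, then one synset per line whose fifth whitespace-separated field is the first word); the sample lines below are hypothetical, not the real file:

```python
# Hypothetical lines in the style of WordNet 3.0's data.noun; the first
# line stands in for the space-indented license header.
sample = [
    "  1 This software and database is being provided to you, the LICENSEE.",
    "00001740 03 n 01 entity 0 003 ~ 00001930 n 0000 | that which is perceived",
    "00021939 03 n 01 physical_entity 0 001 @ 00001740 n 0000 | an entity that",
]

# Field 5 of each data line is the first word of the synset (like the
# original commands, this grabs only the first word of each line).
words = sorted({line.split()[4] for line in sample if not line.startswith(" ")})

# Or, if you only want single words (no underscores):
single = [w for w in words if "_" not in w]

print(words)   # -> ['entity', 'physical_entity']
print(single)  # -> ['entity']
```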
This is a highly ranked Google result, so I'm digging up this 2-year-old question to provide a far better answer than the existing one.
The "Kevin's Word Lists" page provides old lists from the year 2000, based on WordNet 1.6.
You are far better off going to https://wordnet.princeton.edu/download/current-version and downloading WordNet 3.0 (the Database-only version) or whatever the latest version is when you're reading this.
Parsing it is very simple; just apply the regex "/^(\S+?)[\s%]/" to grab every word, then replace all "_" (underscores) in the results with spaces. Finally, dump your results to whatever storage format you want. You'll be given separate lists of adjectives, adverbs, nouns, and verbs, and even a special list called "senses" (very useless/useful depending on what you're doing), which relates to our senses of smell, sight, hearing, etc., i.e. words such as "shirt" or "pungent". Enjoy! Remember to include their copyright notice if you're using it in a project.
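A minimal Python sketch of that recipe, applying the answer's regex to a hypothetical excerpt in the style of WordNet's index.* files (the real files begin with a space-indented license header, which the anchored regex skips naturally):

```python
import re

# Hypothetical sample lines in the style of WordNet's index.* files;
# the first line mimics the space-indented license header.
lines = [
    "  1 This software and database is being provided to you, the LICENSEE.",
    "abandoned_ship n 1 1 @ 1 0 04143897",
    "abbey n 3 2 @ ~ 3 0 02700258",
]

words = []
for line in lines:
    m = re.match(r"^(\S+?)[\s%]", line)             # the regex from the answer
    if m:
        words.append(m.group(1).replace("_", " "))  # underscores -> spaces

print(words)  # -> ['abandoned ship', 'abbey']
```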
As others have suggested, the WordNet database files are a great source for parts of speech. That said, the examples used to extract the words aren't entirely correct. Each line is actually a "synonym set" (synset) consisting of multiple synonyms and their definition. Around 30% of words appear only as synonyms, so simply extracting the first word misses a large amount of data.
The line format is pretty simple to parse (search.c, function parse_synset), but if all you're interested in are the words, the relevant part of the line is formatted as:

ss_type w_cnt word lex_id [word lex_id...]

These correspond to:

- ss_type: a one-character part-of-speech code (n, v, a, s, or r)
- w_cnt: a two-digit hexadecimal count of the words in the synset
- word: a word in the synset, with spaces encoded as underscores
- lex_id: a one-digit hexadecimal ID distinguishing senses of the word

For example, from data.adj:

s 02 cut 0 shortened 0

- s, corresponding to adjective (wnutil.c, function getpos)
- cut, with lexical ID 0
- shortened, with lexical ID 0

A gist of a short Perl script to simply dump the words from the data.* files can be found here. A more robust parser which stays true to the original source can be found here. Both scripts are used in a similar fashion: ./wordnet_parser.pl DATA_FILE.
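A rough Python sketch of the full extraction this answer describes (every word in the synset, not just the first), assuming the data.* synset line layout; the sample line is a hypothetical excerpt in the style of data.adj:

```python
def synset_words(line):
    """Extract every word from one synset line of a data.* file.

    Assumes the layout: synset_offset lex_filenum ss_type w_cnt
    word lex_id [word lex_id...] ...  Header lines (space-indented)
    yield no words.
    """
    if line.startswith(" "):
        return []
    fields = line.split()
    w_cnt = int(fields[3], 16)             # word count is hexadecimal
    words = fields[4:4 + 2 * w_cnt:2]      # skip the lex_id after each word
    return [w.replace("_", " ") for w in words]

# Hypothetical sample line in the style of data.adj:
sample = "00004171 00 s 02 cut 0 shortened 0 001 & 00003778 a 0000 | with parts removed"
print(synset_words(sample))  # -> ['cut', 'shortened']
```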
See Kevin's word lists. Particularly the "Part Of Speech Database." You'll have to do some minimal text-processing on your own, in order to get the database into multiple files for yourself, but that can be done very easily with a few
grep
commands. The license terms are available on the "readme" page.
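A sketch of that splitting step, shown on a purely hypothetical layout (the "|" delimiter and sample entries are illustrative only; check the readme for the database's actual format before relying on this):

```python
# Hypothetical one-entry-per-line layout: word, a delimiter, then
# one-letter part-of-speech codes.
entries = ["abandon|NV", "quickly|v", "blue|AN"]

# The "few grep commands" idea: one pass per part of speech, writing a
# separate file for each tag of interest.
for tag, filename in [("N", "nouns.txt"), ("V", "verbs.txt")]:
    with open(filename, "w") as f:
        for entry in entries:
            word, codes = entry.split("|")
            if tag in codes:               # case is significant
                f.write(word + "\n")
```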
http://icon.shef.ac.uk/Moby/mpos.html
Each part-of-speech vocabulary entry consists of a word or phrase field followed by a field delimiter of (ASCII 215) and the part-of-speech field that is coded using the following ASCII symbols (case is significant):
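A small Python sketch of splitting such entries on the ASCII 215 delimiter; the sample words are hypothetical, and the code meanings noted in the comment are only a few common ones, so consult the Moby documentation for the full table:

```python
# Parse mpos-style entries: word, the ASCII 215 delimiter, then
# one-letter POS codes (case is significant). A few illustrative codes:
# N noun, V verb, A adjective, v adverb.
DELIM = "\xd7"  # ASCII 215

sample = ["abandon\xd7VN", "quickly\xd7v", "blue\xd7AN"]

by_pos = {}
for entry in sample:
    word, codes = entry.split(DELIM)
    for code in codes:
        by_pos.setdefault(code, []).append(word)

print(by_pos["N"])  # -> ['abandon', 'blue']
```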