Separate word lists of nouns, verbs, adjectives, etc.
Usually word lists come as one file that contains everything, but are there separately downloadable noun lists, verb lists, adjective lists, etc.?
I need them for English specifically.
If you download just the database files from wordnet.princeton.edu/download/current-version, you can extract the words by running these commands:
Or if you only want single words (no underscores)
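A minimal sketch of that extraction, assuming the WordNet 3.0 data.* layout (space-indented license header lines, then one synset per line whose fifth whitespace-separated field is the first word); the sample lines below are hypothetical, not the real file:

```python
# Hypothetical lines in the style of WordNet 3.0's data.noun; the first
# line stands in for the space-indented license header.
sample = [
    "  1 This software and database is being provided to you, the LICENSEE.",
    "00001740 03 n 01 entity 0 003 ~ 00001930 n 0000 | that which is perceived",
    "00021939 03 n 01 physical_entity 0 001 @ 00001740 n 0000 | an entity that",
]

# Field 5 of each data line is the first word of the synset (like the
# original commands, this grabs only the first word of each line).
words = sorted({line.split()[4] for line in sample if not line.startswith(" ")})

# Or, if you only want single words (no underscores):
single = [w for w in words if "_" not in w]

print(words)   # -> ['entity', 'physical_entity']
print(single)  # -> ['entity']
```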
This is a highly ranked Google result, so I'm digging up this 2-year-old question to provide a far better answer than the existing one.
The "Kevin's Word Lists" page provides old lists from the year 2000, based on WordNet 1.6.
You are far better off going to https://wordnet.princeton.edu/download/current-version and downloading WordNet 3.0 (the Database-only version) or whatever the latest version is when you're reading this.
Parsing it is very simple; just apply the regex "/^(\S+?)[\s%]/" to grab every word, then replace all "_" (underscores) in the results with spaces. Finally, dump your results to whatever storage format you want. You'll be given separate lists of adjectives, adverbs, nouns, and verbs, and even a special list called "senses" (very useless/useful depending on what you're doing), which relates to our senses of smell, sight, hearing, etc., i.e. words such as "shirt" or "pungent". Enjoy! Remember to include their copyright notice if you're using it in a project.
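A minimal Python sketch of that recipe, applying the answer's regex to a hypothetical excerpt in the style of WordNet's index.* files (the real files begin with a space-indented license header, which the anchored regex skips naturally):

```python
import re

# Hypothetical sample lines in the style of WordNet's index.* files;
# the first line mimics the space-indented license header.
lines = [
    "  1 This software and database is being provided to you, the LICENSEE.",
    "abandoned_ship n 1 1 @ 1 0 04143897",
    "abbey n 3 2 @ ~ 3 0 02700258",
]

words = []
for line in lines:
    m = re.match(r"^(\S+?)[\s%]", line)             # the regex from the answer
    if m:
        words.append(m.group(1).replace("_", " "))  # underscores -> spaces

print(words)  # -> ['abandoned ship', 'abbey']
```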
As others have suggested, the WordNet database files are a great source for parts of speech. That said, the examples used to extract the words aren't entirely correct. Each line is actually a "synonym set" (synset) consisting of multiple synonyms and their definition. Around 30% of words appear only as synonyms, so simply extracting the first word misses a large amount of data.
The line format is pretty simple to parse (search.c, function parse_synset), but if all you're interested in are the words, the relevant part of the line is formatted as:

ss_type w_cnt word lex_id [word lex_id...]

These correspond to:

- ss_type: a one-character part-of-speech code (n, v, a, s, or r)
- w_cnt: a two-digit hexadecimal count of the words in the synset
- word: a word in the synset, with spaces encoded as underscores
- lex_id: a one-digit hexadecimal ID distinguishing senses of the word

For example, from data.adj:

s 02 cut 0 shortened 0

- s, corresponding to adjective (wnutil.c, function getpos)
- cut, with lexical ID 0
- shortened, with lexical ID 0

A gist of a short Perl script to simply dump the words from the data.* files can be found here. A more robust parser which stays true to the original source can be found here. Both scripts are used in a similar fashion: ./wordnet_parser.pl DATA_FILE.
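A rough Python sketch of the full extraction this answer describes (every word in the synset, not just the first), assuming the data.* synset line layout; the sample line is a hypothetical excerpt in the style of data.adj:

```python
def synset_words(line):
    """Extract every word from one synset line of a data.* file.

    Assumes the layout: synset_offset lex_filenum ss_type w_cnt
    word lex_id [word lex_id...] ...  Header lines (space-indented)
    yield no words.
    """
    if line.startswith(" "):
        return []
    fields = line.split()
    w_cnt = int(fields[3], 16)             # word count is hexadecimal
    words = fields[4:4 + 2 * w_cnt:2]      # skip the lex_id after each word
    return [w.replace("_", " ") for w in words]

# Hypothetical sample line in the style of data.adj:
sample = "00004171 00 s 02 cut 0 shortened 0 001 & 00003778 a 0000 | with parts removed"
print(synset_words(sample))  # -> ['cut', 'shortened']
```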
See Kevin's word lists. Particularly the "Part Of Speech Database." You'll have to do some minimal text-processing on your own, in order to get the database into multiple files for yourself, but that can be done very easily with a few
grep
commands. The license terms are available on the "readme" page.
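A sketch of that splitting step, shown on a purely hypothetical layout (the "|" delimiter and sample entries are illustrative only; check the readme for the database's actual format before relying on this):

```python
# Hypothetical one-entry-per-line layout: word, a delimiter, then
# one-letter part-of-speech codes.
entries = ["abandon|NV", "quickly|v", "blue|AN"]

# The "few grep commands" idea: one pass per part of speech, writing a
# separate file for each tag of interest.
for tag, filename in [("N", "nouns.txt"), ("V", "verbs.txt")]:
    with open(filename, "w") as f:
        for entry in entries:
            word, codes = entry.split("|")
            if tag in codes:               # case is significant
                f.write(word + "\n")
```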
http://icon.shef.ac.uk/Moby/mpos.html
Each part-of-speech vocabulary entry consists of a word or phrase field followed by a field delimiter of (ASCII 215) and the part-of-speech field that is coded using the following ASCII symbols (case is significant):
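A small Python sketch of splitting such entries on the ASCII 215 delimiter; the sample words are hypothetical, and the code meanings noted in the comment are only a few common ones, so consult the Moby documentation for the full table:

```python
# Parse mpos-style entries: word, the ASCII 215 delimiter, then
# one-letter POS codes (case is significant). A few illustrative codes:
# N noun, V verb, A adjective, v adverb.
DELIM = "\xd7"  # ASCII 215

sample = ["abandon\xd7VN", "quickly\xd7v", "blue\xd7AN"]

by_pos = {}
for entry in sample:
    word, codes = entry.split(DELIM)
    for code in codes:
        by_pos.setdefault(code, []).append(word)

print(by_pos["N"])  # -> ['abandon', 'blue']
```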