How can I find the frequency count of an English word using WordNet?

Posted 2024-11-06 10:23:53, 107 words, 6 views, 0 comments

Is there a way to find the frequency of the usage of a word in the English language using WordNet or NLTK using Python?

NOTE: I do not want the frequency count of a word in a given input file. I want the frequency count of a word in general based on the usage in today's time.

Comments (8)

走野 2024-11-13 10:23:53

In WordNet, every Lemma has a frequency count, which is returned by the method
lemma.count() and stored in the file nltk_data/corpora/wordnet/cntlist.rev.

Code example (Python 2 with NLTK 2.x, where name is an attribute):

from nltk.corpus import wordnet
syns = wordnet.synsets('stack')
for s in syns:
    for l in s.lemmas():
        print l.name + " " + str(l.count())

Result:

stack 2
batch 0
deal 1
flock 1
good_deal 13
great_deal 10
hatful 0
heap 2
lot 13
mass 14
mess 0
...

However, many counts are zero, and neither the source file nor the documentation says which corpus was used to create this data. According to the book Speech and Language Processing by Daniel Jurafsky and James H. Martin, the sense frequencies come from the SemCor corpus, which is a subset of the already small and outdated Brown Corpus.

So it's probably best to choose the corpus that best fits your application and create the data yourself, as Christopher suggested.

To make this compatible with Python 3.x and NLTK 3 (where name is a method), just do:

Code example:

from nltk.corpus import wordnet
syns = wordnet.synsets('stack')
for s in syns:
    for l in s.lemmas():
        print(l.name() + " " + str(l.count()))
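
Since lemma.count() is per sense, a word-level figure needs the counts of all senses added up. The following is a minimal sketch that does this directly on lines shaped like cntlist.rev, assuming each line holds a sense key, a sense number and a tag count separated by whitespace, and that the lemma is the part of the sense key before "%". The function name lemma_totals and the sample lines are made up for illustration; they are not real WordNet data.

```python
from collections import Counter


def lemma_totals(lines):
    """Sum the tagged-sense counts per lemma from cntlist.rev-style lines."""
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        sense_key, _sense_number, tag_count = parts
        lemma = sense_key.split("%", 1)[0]  # text before '%' is the lemma
        totals[lemma] += int(tag_count)
    return totals


# Hypothetical sample lines, assumed format: sense_key sense_number tag_count
sample = [
    "stack%1:14:00:: 1 2",
    "stack%2:35:00:: 1 3",
    "heap%1:14:00:: 1 2",
]
totals = lemma_totals(sample)
print(totals["stack"])
```

To run it on the real file, pass an open file object instead of the sample list. The caveat above still applies: many counts are zero, so the totals are only a rough signal.
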
咋地 2024-11-13 10:23:53

You can sort of do it using the Brown Corpus, though it's out of date (last revised in 1979), so it's missing lots of current words.

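from nltk.corpus import brown  # needs nltk.download('brown') once
from nltk.probability import FreqDist

words = FreqDist()

# Count every token in the corpus, ignoring case.
# FreqDist.inc() was removed in NLTK 3; increment the entry instead.
for sentence in brown.sents():
    for word in sentence:
        words[word.lower()] += 1

print(words["and"])       # absolute count
print(words.freq("and"))  # relative frequency (count / total tokens)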

You could then pickle the FreqDist to a file for faster loading later.

A corpus is basically just a file full of sentences, one per line, and there are lots of other corpora out there, so you could probably find one that fits your purpose. A couple of other sources of more current corpora: Google, American National Corpus.

You can also supposedly get a current list of the top 60,000 words and their frequencies from the Corpus of Contemporary American English.

撑一把青伞 2024-11-13 10:23:53

Check out this site for word frequencies:
http://corpus.byu.edu/coca/

Somebody compiled a list of words taken from opensubtitles.org (movie scripts). A free, simple text file formatted like this is available for download, in many different languages.

you 6281002
i 5685306
the 4768490
to 3453407
a 3048287
it 2879962

http://invokeit.wordpress.com/frequency-word-lists/
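
A list in that "word count" format can be loaded with a few lines of Python. This is only a sketch, assuming one word and one integer per line separated by whitespace; the function name parse_frequency_list is made up here, and the sample lines are the first three shown above.

```python
def parse_frequency_list(lines):
    """Map each word to its count from 'word count' lines."""
    freqs = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, count = parts
        freqs[word] = int(count)
    return freqs


# First three lines of the sample above; for a downloaded file, pass
# an open file object instead, e.g. open(path, encoding="utf-8").
sample = ["you 6281002", "i 5685306", "the 4768490"]
freqs = parse_frequency_list(sample)

total = sum(freqs.values())
print(freqs["the"])          # absolute count
print(freqs["the"] / total)  # share within the loaded lines only
```

Note that the relative figure is only meaningful when the whole list is loaded, since the total is the sum of the counts that were read.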

鸠魁 2024-11-13 10:23:53

You can't really do this, because it depends so much on the context. Not only that, for less frequent words the frequency will be wildly dependent on the sample.

Your best bet is probably to find a large corpus of text of the given genre (e.g. download a hundred books from Project Gutenberg) and count the words yourself.
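
Counting the words yourself could look roughly like the sketch below. It assumes the texts are already loaded as strings and uses a deliberately crude tokenizer (lowercase letters and apostrophes); the names TOKEN and count_words and the two sample sentences are made up for illustration.

```python
import re
from collections import Counter

# Crude tokenizer: runs of lowercase letters and apostrophes.
TOKEN = re.compile(r"[a-z']+")


def count_words(texts):
    """Count lowercased tokens across an iterable of text strings."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN.findall(text.lower()))
    return counts


# Stand-in texts; in practice these would be the contents of the
# downloaded books, read from disk one file at a time.
texts = ["The cat sat on the mat.", "The dog didn't sit."]
counts = count_words(texts)
print(counts.most_common(1))
```

Dividing a count by sum(counts.values()) gives the relative frequency within the chosen genre, which is exactly why the result depends so heavily on the sample.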

将军与妓 2024-11-13 10:23:53

Take a look at the Information Content section of the WordNet Similarity project at http://wn-similarity.sourceforge.net/. There you will find databases of word frequencies (or, rather, information content, which is derived from word frequency) of WordNet lemmas, calculated from several different corpora. The source code is in Perl, but the databases are provided independently and can easily be used with NLTK.
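
For orientation, information content is commonly defined as the negative logarithm of a probability, so rarer items score higher. The sketch below only illustrates that general relationship with made-up counts. It does not read the project's data files, whose layout is not described here, and the function names are hypothetical.

```python
import math


def information_content(count, total):
    """Negative natural log of the relative frequency count / total."""
    if count <= 0 or total <= 0:
        raise ValueError("count and total must be positive")
    return -math.log(count / total)


def probability_from_ic(ic):
    """Invert information_content: recover the relative frequency."""
    return math.exp(-ic)


# Made-up counts for illustration only.
ic_common = information_content(1000, 1_000_000)
ic_rare = information_content(10, 1_000_000)
print(ic_common < ic_rare)  # the rarer item carries more information
```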

や莫失莫忘 2024-11-13 10:23:53

You can download the word vectors glove.6B.zip from https://github.com/stanfordnlp/GloVe, unzip them and look at the file glove.6B.50d.txt.

There, you will find 400,000 English words, one per line (followed by 50 numbers for that word on the same line), lowercased and sorted from most frequent (the) to least frequent. You can create a ranking of words by reading this file as plain text or with pandas.

It's not perfect, but I have used it in the past. The same website provides other files with up to 2.2 million English words, cased.
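
Reading the file as plain text to build the ranking might look like this. It is a sketch that assumes the first space-separated token on each line is the word and that line order reflects frequency, as described above; build_rank is a made-up name, and the sample lines use placeholder numbers rather than real vectors.

```python
def build_rank(lines):
    """Map each word to its 1-based line position (1 = most frequent)."""
    rank = {}
    position = 0
    for line in lines:
        word = line.rstrip("\n").split(" ", 1)[0]
        if not word:
            continue  # skip empty lines
        position += 1
        rank.setdefault(word, position)  # keep the first occurrence
    return rank


# Placeholder lines shaped like the file: a word, then its numbers.
sample = ["the 0.1 0.2\n", "of 0.3 0.4\n", "and 0.5 0.6\n"]
rank = build_rank(sample)
print(rank["and"])

# With the real file (path assumed to be in the working directory):
# with open("glove.6B.50d.txt", encoding="utf-8") as f:
#     rank = build_rank(f)
```

This gives a rank only, not a count, so it can order words by commonness but cannot say how much more frequent one word is than another.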

感受沵的脚步 2024-11-13 10:23:53

The Wiktionary project has a few frequency lists based on TV scripts and Project Gutenberg, but their format is not particularly nice for parsing.

穿越时光隧道 2024-11-13 10:23:53

Python 3 version of Christopher Pickslay's solution (incl. saving frequencies to tempdir):

from pathlib import Path
from pickle import dump, load
from tempfile import gettempdir

from nltk.probability import FreqDist


def get_word_frequencies() -> FreqDist:
  tmp_path = Path(gettempdir()) / "word_freq.pkl"
  if tmp_path.exists():
    with tmp_path.open(mode="rb") as f:
      word_frequencies = load(f)
  else:
    from nltk import download
    download('brown', quiet=True)
    from nltk.corpus import brown
    word_frequencies = FreqDist(word.lower() for sentence in brown.sents()
                                for word in sentence)
    with tmp_path.open(mode="wb") as f:
      dump(word_frequencies, f)

  return word_frequencies

Usage:

word_frequencies = get_word_frequencies()

print(word_frequencies["and"])
print(word_frequencies.freq("and"))

Output:

28853
0.02484774266443448