Why is the letter "S" being counted in the cloud when using WordCloud with Python?

Posted on 2025-01-30 14:08:50


I'm getting to know the WordCloud package for Python and I'm testing it with the Moby Dick text from NLTK. A snippet of this is as follows:

Snippet of my example string

As you can see from the highlights in the image, all of the possessive apostrophes have been split off into separate "' S" tokens, and WordCloud seems to be including this in the frequency count as "S":

Frequency distribution of words

Of course this causes an issue, because "S" is counted with a high frequency and all the other words' frequencies are skewed in the cloud:

Example of my skewed cloud

In a tutorial that I'm following for the same Moby Dick string, the WordCloud doesn't seem to be counting the "S". Am I missing an attribute somewhere, or do I have to manually remove the "' s" tokens from my string?

Below is a summary of my code:

import nltk
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Tokenised Moby Dick; possessives arrive as separate tokens ("'", "S")
example_corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
word_list = ["".join(word) for word in example_corpus]
novel_as_string = " ".join(word_list)

wordcloud = WordCloud().generate(novel_as_string)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")

plt.show()
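The behaviour described above can be reproduced without the full corpus download. A minimal sketch, using a small hard-coded token list as a stand-in for NLTK's tokeniser output:

```python
# NLTK's word tokeniser splits possessives into separate tokens,
# so "RICHARDSON'S" becomes ["RICHARDSON", "'", "S"].
tokens = ["RICHARDSON", "'", "S", "DICTIONARY"]

# Joining the tokens with spaces leaves the stray "'" and "S" in the
# string, and that string is what WordCloud then sees and counts.
joined = " ".join(tokens)
print(joined)  # RICHARDSON ' S DICTIONARY
```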


Comments (2)

酸甜透明夹心 2025-02-06 14:08:50


In this kind of application, it is common to filter the word list with stopwords first, since you don't want simple words such as a, an, the, it, ... to dominate your result.

I changed the code a little; hope it helps. You can check the contents of the stopword list yourself.

import nltk
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from nltk.corpus import stopwords

example_corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
# word_list = ["".join(word) for word in example_corpus]  # this statement doesn't seem to change anything

# Build the stopword set once (membership tests on a set are fast) and
# compare case-insensitively, since the corpus tokens are mixed-case.
stop_words = set(stopwords.words('english'))
word_list = [word for word in example_corpus if word.lower() not in stop_words]
novel_as_string = " ".join(word_list)

wordcloud = WordCloud().generate(novel_as_string)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")

plt.show()

Output: see the resulting word cloud (Imgur).
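One detail worth noting: the Gutenberg tokens are mixed-case while NLTK's stopword list is all lowercase, so lowercasing before the membership test catches words like "The" as well as "the". A small sketch with hard-coded stand-ins for the NLTK downloads:

```python
# Stand-in for set(stopwords.words('english')); the real list is longer.
stop_words = {"the", "a", "of", "it"}

tokens = ["The", "WHALE", "of", "Moby", "Dick", "it"]

# Lowercasing before the membership test filters "The" as well as "the".
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['WHALE', 'Moby', 'Dick']
```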

世俗缘 2025-02-06 14:08:50


It looks like your input is part of the problem. If you inspect it, like so,

corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
words = [word for word in corpus]
print(words[215:230])

You get

['RICHARDSON', "'", 'S', 'DICTIONARY', 'KETOS', ',', 'GREEK', '.', 'CETUS', ',', 'LATIN', '.', 'WHOEL', ',', 'ANGLO']

You can do a few things to overcome this. You could just filter for strings longer than one character:

words = [word for word in corpus if len(word) > 1]

You could try a different file provided by NLTK, or you could try reading the raw input and decoding it properly.
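Applied to the token slice printed above, for example, the length filter drops the punctuation and the stray "S":

```python
# The tokens below are copied from the print output shown earlier.
tokens = ['RICHARDSON', "'", 'S', 'DICTIONARY', 'KETOS', ',', 'GREEK', '.',
          'CETUS', ',', 'LATIN', '.', 'WHOEL', ',', 'ANGLO']

# Keep only tokens longer than one character.
words = [t for t in tokens if len(t) > 1]
print(words)
# ['RICHARDSON', 'DICTIONARY', 'KETOS', 'GREEK', 'CETUS', 'LATIN', 'WHOEL', 'ANGLO']
```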
