Why is the frequency of the letter "'S" counted in the cloud when using WordCloud for Python?
I'm getting to know the WordCloud package for Python and I'm testing it with the Moby Dick text from NLTK. A snippet of this is as follows:
As you can see from the highlights in the image, all of the possessive apostrophes have been escaped to "/'S", and WordCloud seems to be including this in the frequency count as "S":
[Image: frequency distribution of words]
Of course this causes an issue, because "S" gets a high count and all the other words' frequencies are skewed in the cloud:
[Image: my skewed word cloud: https://i.sstatic.net/vz5c5.png]
In a tutorial that I'm following for the same Moby Dick string, the WordCloud doesn't seem to be counting the "S". Am I missing an attribute somewhere or do I have to manually remove "/'s" from my string?
Below is a summary of my code:
import nltk
import matplotlib.pyplot as plt
from wordcloud import WordCloud

example_corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
word_list = ["".join(word) for word in example_corpus]
novel_as_string = " ".join(word_list)
wordcloud = WordCloud().generate(novel_as_string)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
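For reference, a minimal check (my own sketch, assuming the NLTK Gutenberg data is downloaded) showing where the stray "S" comes from: the corpus tokenizer splits possessives into separate tokens before the text ever reaches WordCloud.

import nltk

# NLTK's word tokenization splits possessives into separate tokens, so
# joining with spaces leaves stray "'" and "S" pieces in the string.
tokens = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
for i, tok in enumerate(tokens[:1000]):
    if tok == "'":
        print(tokens[i - 1:i + 2])  # e.g. ['RICHARDSON', "'", 'S']
        break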
Comments (2)
In such applications, you usually use stopwords to filter the word list first, since you don't want simple words such as "a", "an", "the", "it", ... to dominate your result. I changed the code a little bit, hope it helps. You can check the contents of stopwords for yourself. Output: see the word cloud on Imgur.
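The modified code was not preserved in this copy of the answer, so the following is a minimal sketch of the stopword approach it describes, using WordCloud's built-in STOPWORDS set; adding "s" to that set is my own assumption for catching the orphaned possessive token:

import nltk
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Extend the built-in stopword set; "s" is added here (an assumption)
# so the orphaned possessive token is filtered along with common words.
custom_stopwords = set(STOPWORDS)
custom_stopwords.add("s")

example_corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
novel_as_string = " ".join(example_corpus)

wordcloud = WordCloud(stopwords=custom_stopwords).generate(novel_as_string)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()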
It looks like your input is part of the problem. If you inspect a slice of the tokenized corpus, you get:
['RICHARDSON', "'", 'S', 'DICTIONARY', 'KETOS', ',', 'GREEK', '.', 'CETUS', ',', 'LATIN', '.', 'WHOEL', ',', 'ANGLO']
You can do a few things to try and overcome this: you could simply filter for strings longer than one character, you could try a different file provided by NLTK, or you could try reading the raw input and decoding it properly.
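The answer does not include code, so here is a minimal sketch of the first and third suggestions, assuming the same Moby Dick corpus as the question:

import nltk
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# First suggestion: drop single-character tokens (the stray "'" and "S"
# pieces) before joining the corpus into one string.
tokens = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
novel_as_string = " ".join(word for word in tokens if len(word) > 1)

# Third suggestion: read the untokenized text instead, which keeps
# possessives such as "Richardson's" intact for WordCloud's tokenizer.
# novel_as_string = nltk.corpus.gutenberg.raw("melville-moby_dick.txt")

wordcloud = WordCloud().generate(novel_as_string)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()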