Using FreqDist from NLTK
I'm trying to get a frequency distribution of a set of documents using Python. My code isn't working for some reason and is producing this error:
Traceback (most recent call last):
  File "C:\Documents and Settings\aschein\Desktop\freqdist", line 32, in <module>
    fd = FreqDist(corpus_text)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 104, in __init__
    self.update(samples)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 472, in update
    self.inc(sample, count=count)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 120, in inc
    self[sample] = self.get(sample,0) + count
TypeError: unhashable type: 'list'
Can you help?
This is the code so far:
import os
import nltk
from nltk.probability import FreqDist

# The stop-words list
stopwords_doc = open("C:\\Documents and Settings\\aschein\\My Documents\\stopwords.txt").read()
stopwords_list = stopwords_doc.split()
stopwords = nltk.Text(stopwords_list)

corpus = []

# Directory of documents
directory = "C:\\Documents and Settings\\aschein\\My Documents\\comments"
listing = os.listdir(directory)

# Append all documents in directory into a single 'document' (list)
for doc in listing:
    doc_name = "C:\\Documents and Settings\\aschein\\My Documents\\comments\\" + doc
    input = open(doc_name).read()
    input = input.split()
    corpus.append(input)

# Turn list into Text form for NLTK
corpus_text = nltk.Text(corpus)

# Remove stop-words
for w in corpus_text:
    if w in stopwords:
        corpus_text.remove(w)

fd = FreqDist(corpus_text)
Two thoughts that I hope at least contribute to an answer.

First, the documentation for the nltk.text.Text() method states (emphasis mine):

So I'm not sure Text() is the way you want to handle this data. It seems to me you would do just fine with a plain list.

Second, I would caution you to think about the calculation you're asking NLTK to perform here. Removing stopwords before computing the frequency distribution means your frequencies will be skewed; I don't see why the stopwords are removed before tabulation rather than simply ignored when examining the distribution after the fact. (This second point would probably make a better query/comment than part of an answer, but I felt it worth pointing out that the proportions would be skewed.) Depending on what you intend to use the frequency distribution for, this may or may not be a problem in itself.
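To illustrate the second point, here is a minimal sketch of counting first and filtering afterwards. The token and stopword data are hypothetical, and collections.Counter is used as a stand-in for FreqDist, which counts hashable samples the same way:

```python
from collections import Counter

# Hypothetical token data for illustration.
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
stopwords = {"the", "on"}

# Count every token first, so the distribution reflects the real corpus.
fd = Counter(tokens)

# Ignore stopwords only when reading the distribution off afterwards.
content_counts = {w: c for w, c in fd.items() if w not in stopwords}
```

This way the full counts remain available if you later need honest proportions over the whole corpus.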
The error says you're trying to use a list as a hash key. Can you convert it to a tuple?
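Concretely, corpus.append(input) in the question builds a list of lists, so each "sample" handed to FreqDist is a whole document (an unhashable list). A sketch of two ways out, with hypothetical documents and collections.Counter standing in for FreqDist's counting behaviour:

```python
from collections import Counter

# Hypothetical tokenized documents.
doc1 = ["a", "b", "a"]
doc2 = ["b", "c"]

# Option 1: flatten into one token list so every sample is a hashable string.
corpus = []
for doc in (doc1, doc2):
    corpus.extend(doc)  # extend, not append
fd = Counter(corpus)

# Option 2: if each document really should be one sample, make it hashable
# by converting it to a tuple, as suggested above.
fd_docs = Counter(tuple(doc) for doc in (doc1, doc2))
```

Which option is right depends on whether you want per-token or per-document counts; the original question appears to want per-token counts, i.e. option 1.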