Using NLTK's FreqDist

Published 2024-11-15 01:43:44


I'm trying to get a frequency distribution of a set of documents using Python. My code isn't working for some reason and is producing this error:

Traceback (most recent call last):
  File "C:\Documents and Settings\aschein\Desktop\freqdist", line 32, in <module>
    fd = FreqDist(corpus_text)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 104, in __init__
    self.update(samples)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 472, in update
    self.inc(sample, count=count)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 120, in inc
    self[sample] = self.get(sample,0) + count
TypeError: unhashable type: 'list'

Can you help?

This is the code so far:

import os
import nltk
from nltk.probability import FreqDist


#The stop-words list
stopwords_doc = open("C:\\Documents and Settings\\aschein\\My Documents\\stopwords.txt").read()
stopwords_list = stopwords_doc.split()
stopwords = nltk.Text(stopwords_list)

corpus = []

#Directory of documents
directory = "C:\\Documents and Settings\\aschein\\My Documents\\comments"
listing = os.listdir(directory)

#Append all documents in directory into a single 'document' (list)
for doc in listing:
    doc_name = "C:\\Documents and Settings\\aschein\\My Documents\\comments\\" + doc
    input = open(doc_name).read() 
    input = input.split()
    corpus.append(input)

#Turn list into Text form for NLTK
corpus_text = nltk.Text(corpus)

#Remove stop-words
for w in corpus_text:
    if w in stopwords:
        corpus_text.remove(w)

fd = FreqDist(corpus_text)


2 Answers

菩提树下叶撕阳。 2024-11-22 01:43:44


Two thoughts that I hope at least contribute to an answer.

First, the documentation for the nltk.text.Text() method states (emphasis mine):

A wrapper around a sequence of simple (string) tokens, which is intended to support initial exploration of texts (via the interactive console). Its methods perform a variety of analyses on the text's contexts (e.g., counting, concordancing, collocation discovery), and display the results. If you wish to write a program which makes use of these analyses, then you should bypass the Text class, and use the appropriate analysis function or class directly instead.

So I'm not sure Text() is the way you want to handle this data. It seems to me you would do just fine to use a list.

Second, I would caution you to think about the calculation you're asking NLTK to perform here. Removing stopwords before determining a frequency distribution means that your frequencies will be skewed; I do not understand why the stopwords are removed before tabulation rather than just ignored when examining the distribution after the fact. (I suppose this second point would make a better query/comment than part of an answer, but I felt it worth pointing out that the proportions would be skewed.) Depending on what you intend to use the frequency distribution for, this may or may not be a problem in and of itself.
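A minimal sketch of both points above, using `collections.Counter` as a self-contained stand-in for `FreqDist` (which is a dict-like counter) and toy token lists in place of the files read from disk. The root cause of the `TypeError` is that `corpus.append(input)` nests each document's token list inside `corpus`, so the samples handed to `FreqDist` are lists rather than strings; `extend()` flattens them. Stopwords are then ignored only when reading the distribution off, so the counts themselves stay unskewed:

```python
from collections import Counter  # stand-in for nltk's dict-like FreqDist

# Build ONE flat list of string tokens -- what FreqDist expects as samples.
# (In the question, corpus.append(input) nested each document's token
# list inside corpus, so FreqDist received lists, not hashable strings.)
corpus_tokens = []
for doc_tokens in [["the", "cat", "sat"], ["the", "dog", "ran"]]:
    corpus_tokens.extend(doc_tokens)   # extend, not append

stopwords = {"the"}

fd = Counter(corpus_tokens)            # tabulate everything first
# Ignore stopwords only when reading the distribution off afterwards,
# so the raw frequencies are not skewed by early removal.
content = {w: c for w, c in fd.items() if w not in stopwords}

print(fd["the"])    # 2 -- the full count is still available
print(content)      # {'cat': 1, 'sat': 1, 'dog': 1, 'ran': 1}
```

With real data, `doc_tokens` would come from `open(doc_name).read().split()` exactly as in the question, and `Counter` could be swapped back for `nltk.probability.FreqDist` unchanged.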

dawn曙光 2024-11-22 01:43:44


The error says you're trying to use a list as a hash key. Can you convert it to a tuple?
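To see this answer's point concretely: `FreqDist` stores each sample as a dictionary key, and a list cannot be hashed, while a tuple of the same tokens can. A tiny illustration with made-up tokens:

```python
# FreqDist stores samples as dict keys internally; a list cannot be one:
counts = {}
try:
    counts[["a", "b"]] = 1
except TypeError as e:
    print(e)            # unhashable type: 'list'

# A tuple of the same tokens hashes fine:
counts[("a", "b")] = 1
print(counts)           # {('a', 'b'): 1}
```

That said, for this question the better fix is not converting each document's list to a tuple but flattening the corpus into a single list of string tokens, since the intended samples are individual words.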
