Using NLTK's FreqDist

Published 2024-11-15 01:43:44


I'm trying to get a frequency distribution of a set of documents using Python. My code isn't working for some reason and is producing this error:

Traceback (most recent call last):
  File "C:\Documents and Settings\aschein\Desktop\freqdist", line 32, in <module>
    fd = FreqDist(corpus_text)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 104, in __init__
    self.update(samples)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 472, in update
    self.inc(sample, count=count)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 120, in inc
    self[sample] = self.get(sample,0) + count
TypeError: unhashable type: 'list'

Can you help?

This is the code so far:

import os
import nltk
from nltk.probability import FreqDist


#The stop-words list
stopwords_doc = open("C:\\Documents and Settings\\aschein\\My Documents\\stopwords.txt").read()
stopwords_list = stopwords_doc.split()
stopwords = nltk.Text(stopwords_list)

corpus = []

#Directory of documents
directory = "C:\\Documents and Settings\\aschein\\My Documents\\comments"
listing = os.listdir(directory)

#Append all documents in directory into a single 'document' (list)
for doc in listing:
    doc_name = "C:\\Documents and Settings\\aschein\\My Documents\\comments\\" + doc
    input = open(doc_name).read() 
    input = input.split()
    corpus.append(input)

#Turn list into Text form for NLTK
corpus_text = nltk.Text(corpus)

#Remove stop-words
for w in corpus_text:
    if w in stopwords:
        corpus_text.remove(w)

fd = FreqDist(corpus_text)


2 Answers

菩提树下叶撕阳。 2024-11-22 01:43:44


Two thoughts that I hope at least contribute to an answer.

First, the documentation for the nltk.text.Text() method states (emphasis mine):

A wrapper around a sequence of simple (string) tokens, which is intended to support initial exploration of texts (via the interactive console). Its methods perform a variety of analyses on the text's contexts (e.g., counting, concordancing, collocation discovery), and display the results. If you wish to write a program which makes use of these analyses, then you should bypass the Text class, and use the appropriate analysis function or class directly instead.

So I'm not sure Text() is the way you want to handle this data. It seems to me you would do just fine to use a list.

Second, I would caution you to think about the calculation you're asking NLTK to perform here. Removing stopwords before determining a frequency distribution means that your frequencies will be skewed; I do not understand why the stopwords are removed before tabulation rather than just ignored when examining the distribution after the fact. (I suppose this second point would make a better query/comment than part of an answer, but I felt it worth pointing out that the proportions would be skewed.) Depending on what you intend to use the frequency distribution for, this may or may not be a problem in and of itself.
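A minimal sketch of both points above, using `collections.Counter` as a self-contained stand-in for `FreqDist` (which is a dict-like counter) and toy token lists in place of the files read from disk. The root cause of the `TypeError` is that `corpus.append(input)` nests each document's token list inside `corpus`, so the samples handed to `FreqDist` are lists rather than strings; `extend()` flattens them. Stopwords are then ignored only when reading the distribution off, so the counts themselves stay unskewed:

```python
from collections import Counter  # stand-in for nltk's dict-like FreqDist

# Build ONE flat list of string tokens -- what FreqDist expects as samples.
# (In the question, corpus.append(input) nested each document's token
# list inside corpus, so FreqDist received lists, not hashable strings.)
corpus_tokens = []
for doc_tokens in [["the", "cat", "sat"], ["the", "dog", "ran"]]:
    corpus_tokens.extend(doc_tokens)   # extend, not append

stopwords = {"the"}

fd = Counter(corpus_tokens)            # tabulate everything first
# Ignore stopwords only when reading the distribution off afterwards,
# so the raw frequencies are not skewed by early removal.
content = {w: c for w, c in fd.items() if w not in stopwords}

print(fd["the"])    # 2 -- the full count is still available
print(content)      # {'cat': 1, 'sat': 1, 'dog': 1, 'ran': 1}
```

With real data, `doc_tokens` would come from `open(doc_name).read().split()` exactly as in the question, and `Counter` could be swapped back for `nltk.probability.FreqDist` unchanged.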

dawn曙光 2024-11-22 01:43:44


The error says you're trying to use a list as a hash key. Can you convert it to a tuple?
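To see this answer's point concretely: `FreqDist` stores each sample as a dictionary key, and a list cannot be hashed, while a tuple of the same tokens can. A tiny illustration with made-up tokens:

```python
# FreqDist stores samples as dict keys internally; a list cannot be one:
counts = {}
try:
    counts[["a", "b"]] = 1
except TypeError as e:
    print(e)            # unhashable type: 'list'

# A tuple of the same tokens hashes fine:
counts[("a", "b")] = 1
print(counts)           # {('a', 'b'): 1}
```

That said, for this question the better fix is not converting each document's list to a tuple but flattening the corpus into a single list of string tokens, since the intended samples are individual words.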
