Using document length as a feature in NLTK's Naive Bayes classifier in Python

I am building a spam filter using NLTK in Python. I currently check for the occurrences of words and use the NaiveBayesClassifier, resulting in an accuracy of 0.98 and an F-measure of 0.92 for spam and 0.98 for non-spam. However, when inspecting the documents my program misclassifies, I notice that a lot of the spam classified as non-spam consists of very short messages.

So I want to use the length of a document as a feature for the NaiveBayesClassifier. The problem is that it currently only handles binary values. Is there any way to do this other than, for example, a single cutoff like length&lt;100 = true/false?

(p.s. I have built the spam detector analogously to the http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html example.)
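For reference, here is a minimal sketch of the ch06-style setup described above. The names labeled_docs (a list of (word_list, label) pairs) and common_words (the candidate word list) are assumed inputs, not part of the original:

import nltk

def word_occurrence_features(document, common_words):
    # Binary word-occurrence features, as in the NLTK book's ch06 example.
    words = set(document)
    return dict(('contains(%s)' % w, w in words) for w in common_words)

# labeled_docs: list of (word_list, 'spam'/'ham') pairs -- assumed input.
featuresets = [(word_occurrence_features(doc, common_words), label)
               for doc, label in labeled_docs]
classifier = nltk.NaiveBayesClassifier.train(featuresets)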

柏林苍穹下 2024-10-28 05:02:03

NLTK's implementation of Naive Bayes doesn't do that, but you could combine NaiveBayesClassifier's predictions with a distribution over document lengths. NLTK's prob_classify method will give you a conditional probability distribution over classes given the words in the document, i.e., P(cl|doc). What you want is P(cl|doc,len) -- the probability of a class given the words in the document and its length. If we additionally assume that a document's words and its length are independent, both given the class and overall, we get:

P(cl|doc,len) = (P(doc,len|cl) * P(cl)) / P(doc,len)
              = (P(doc|cl) * P(len|cl) * P(cl)) / (P(doc) * P(len))
              = [(P(doc|cl) * P(cl)) / P(doc)] * [P(len|cl) / P(len)]
              = P(cl|doc) * P(len|cl) / P(len)

You've already got the first term from prob_classify, so all that's left to do is to estimate P(len|cl) and P(len).
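As a quick illustration of that first term (here classifier is your trained nltk.NaiveBayesClassifier and features(doc) is your feature extractor; both names are assumptions):

probdist = classifier.prob_classify(features(doc))
p_spam = probdist.prob('spam')   # P(cl|doc) for the label 'spam'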

You can get as fancy as you want when it comes to modeling document lengths, but to get started you can just assume that the logs of the document lengths are normally distributed. If you know the mean and the standard deviation of the log document lengths in each class and overall, it's then easy to calculate P(len|cl) and P(len).

Here's one way of going about estimating P(len):

from math import log

from nltk.corpus import movie_reviews
from scipy import stats
import numpy as np

# Log-length of every document in the corpus.
loglens = [log(len(movie_reviews.words(f))) for f in movie_reviews.fileids()]

# Fit a normal distribution to the log-lengths; this models P(len).
mu = np.mean(loglens)
sd = np.std(loglens)

p = stats.norm(mu, sd)

The only tricky things to remember are that this is a distribution over log-lengths rather than lengths and that it's a continuous distribution. So, the probability of a document of length L will be:

p.cdf(log(L+1)) - p.cdf(log(L))

The conditional length distributions can be estimated in the same way, using the log-lengths of the documents in each class. That should give you what you need for P(cl|doc,len).
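Putting it all together, here is a minimal sketch of the combination step. It assumes classifier and features as above, a dict p_len_given_cl mapping each label to a scipy.stats.norm fitted on that class's log-lengths, and p as the overall length distribution from earlier; all of these names are placeholders, not NLTK API:

from math import log

def length_prob(dist, length):
    # P(len = L) under a continuous distribution over log-lengths.
    return dist.cdf(log(length + 1)) - dist.cdf(log(length))

def classify_with_length(doc):
    probdist = classifier.prob_classify(features(doc))
    length = len(doc)
    scores = {}
    for cl in probdist.samples():
        # P(cl|doc,len) is proportional to P(cl|doc) * P(len|cl) / P(len).
        # Dividing by P(len) is the same for every class, so it does not
        # change the argmax, but it keeps the scores interpretable.
        scores[cl] = (probdist.prob(cl)
                      * length_prob(p_len_given_cl[cl], length)
                      / length_prob(p, length))
    return max(scores, key=scores.get)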

青巷忧颜 2024-10-28 05:02:03

There are Multinomial Naive Bayes algorithms that can handle range values, but they are not implemented in NLTK. For the NLTK NaiveBayesClassifier, you could try having a couple of different length thresholds as binary features. I'd also suggest trying a MaxEnt classifier to see how it handles shorter texts.
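A minimal sketch of that threshold idea, extending a ch06-style extractor (the name word_features for the candidate word list is an assumption, and the threshold values are illustrative, not tuned):

def document_features(document, word_features):
    # Word-occurrence features as in the ch06 example.
    words = set(document)
    features = dict(('contains(%s)' % w, w in words) for w in word_features)
    # Several binary length buckets instead of a single cutoff, so the
    # classifier can distinguish very short messages from merely short ones.
    for threshold in (50, 100, 200, 500):
        features['length<%d' % threshold] = len(document) < threshold
    return features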
