Using document length in the Naive Bayes classifier of NLTK (Python)
I am building a spam filter using NLTK in Python. I currently check for the occurrences of words and use the NaiveBayesClassifier, which gives an accuracy of 0.98, an F-measure of 0.92 for spam, and an F-measure of 0.98 for non-spam. However, when checking the documents on which my program errs, I notice that much of the spam classified as non-spam consists of very short messages.
So I want to use the length of a document as a feature for the NaiveBayesClassifier. The problem is that it currently only handles binary values. Is there any way to do this other than, for example, length < 100 = true/false?
(P.S. I have built the spam detector analogous to the http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html example.)
Comments (2)
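NLTK's implementation of Naive Bayes doesn't do that, but you could combine NaiveBayesClassifier's predictions with a distribution over document lengths. NLTK's prob_classify method will give you a conditional probability distribution over classes given the words in the document, i.e., P(cl|doc). What you want is P(cl|doc,len): the probability of a class given the words in the document and its length. If we make a few more independence assumptions (that the words and the length are independent of each other, both within each class and overall), we get:

P(cl|doc,len) = P(doc,len|cl) * P(cl) / P(doc,len)
              = P(doc|cl) * P(len|cl) * P(cl) / (P(doc) * P(len))
              = P(cl|doc) * P(len|cl) / P(len)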
You've already got the first term from prob_classify, so all that's left to do is to estimate P(len|cl) and P(len).
You can get as fancy as you want when it comes to modeling document lengths, but to get started you can just assume that the logs of the document lengths are normally distributed. If you know the mean and the standard deviation of the log document lengths in each class and overall, it's then easy to calculate P(len|cl) and P(len).
Here's one way of going about estimating P(len):
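A minimal sketch of this estimate, using only the Python standard library. The `docs` list is hypothetical stand-in data (tokenized messages), and `statistics.NormalDist` is one possible choice for fitting the normal distribution, not necessarily the one the answer had in mind:

```python
import math
from statistics import NormalDist

# Hypothetical tokenized documents; in practice these would be your training messages.
docs = [
    ["win", "cash", "now"],
    ["call", "this", "number", "to", "claim", "your", "prize"],
    ["hi", "are", "we", "still", "meeting", "for", "lunch", "tomorrow", "at", "noon"],
    ["free", "entry"],
    ["please", "find", "the", "report", "attached", "and", "let", "me", "know"],
]

# Log of each document's length, measured in tokens.
log_lengths = [math.log(len(doc)) for doc in docs]

# Fit a normal distribution to the log-lengths (sample mean and sample standard deviation).
length_dist = NormalDist.from_samples(log_lengths)

print("mean of log-lengths:", length_dist.mean)
print("std dev of log-lengths:", length_dist.stdev)
```

The fitted `length_dist` object exposes `cdf` and `pdf` methods, which is what the next step relies on.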
The only tricky things to remember are that this is a distribution over log-lengths rather than lengths and that it's a continuous distribution. So, the probability of a document of length L will be:
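One plausible reading, sketched here as an assumption: since lengths are whole numbers and the fitted distribution is continuous over log-lengths, take the probability mass that falls between log(L) and log(L+1). The helper name `length_probability` and the distribution parameters are hypothetical.

```python
import math
from statistics import NormalDist

def length_probability(dist, length):
    """Mass that a normal distribution over log-lengths assigns to an integer length >= 1.

    Assumption: the length L is identified with the interval [log(L), log(L + 1)).
    """
    return dist.cdf(math.log(length + 1)) - dist.cdf(math.log(length))

# Hypothetical parameters for the distribution of log-lengths.
length_dist = NormalDist(mu=2.0, sigma=0.8)

p10 = length_probability(length_dist, 10)
print("P(len = 10) is approximately", p10)
```

Using the interval mass rather than the density at log(L) keeps the result a genuine probability between 0 and 1.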
The conditional length distributions can be estimated in the same way, using the log-lengths of the documents in each class. That should give you what you need for P(cl|doc,len).
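Putting the pieces together might look like the sketch below. Everything here is illustrative: the training lengths are made up, the helper names are hypothetical, and `word_posteriors` is a plain dict standing in for P(cl|doc). With NLTK, one would presumably fill it from `dist = classifier.prob_classify(features)` using `dist.samples()` and `dist.prob(label)`.

```python
import math
from statistics import NormalDist

def fit_log_length_dist(lengths):
    """Fit a normal distribution to the logs of the given document lengths."""
    return NormalDist.from_samples([math.log(n) for n in lengths])

def length_probability(dist, length):
    """Mass assigned to an integer length, taken as the interval [log(L), log(L + 1))."""
    return dist.cdf(math.log(length + 1)) - dist.cdf(math.log(length))

def combine_with_length(class_posteriors, class_length_dists, overall_length_dist, length):
    """Compute P(cl|doc) * P(len|cl) / P(len) per class, then renormalize to sum to 1."""
    p_len = length_probability(overall_length_dist, length)
    scores = {
        label: p_cl_doc * length_probability(class_length_dists[label], length) / p_len
        for label, p_cl_doc in class_posteriors.items()
    }
    # The independence assumptions rarely hold exactly, so the scores need not sum to 1.
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

# Hypothetical document lengths (in tokens) per class from a training set.
spam_lengths = [3, 5, 8, 12, 20, 6]
ham_lengths = [40, 75, 120, 60, 200, 90]

class_length_dists = {
    "spam": fit_log_length_dist(spam_lengths),
    "ham": fit_log_length_dist(ham_lengths),
}
overall_length_dist = fit_log_length_dist(spam_lengths + ham_lengths)

# Stand-in for the word-based posterior P(cl|doc) of one short message.
word_posteriors = {"spam": 0.4, "ham": 0.6}

adjusted = combine_with_length(word_posteriors, class_length_dists, overall_length_dist, length=5)
print(adjusted)
```

Because P(len) is the same for every class, the final renormalization makes it cancel out; it is kept here only to mirror the formula in the answer.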
Multinomial Naive Bayes algorithms can handle range values, but they are not implemented in NLTK. For the NLTK NaiveBayesClassifier, you could try using a few different length thresholds as binary features. I'd also suggest trying a Maxent classifier to see how it handles shorter texts.
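The threshold idea could be sketched as a feature extractor like the one below. The threshold values, feature names, and the `vocabulary` argument are all hypothetical choices for illustration; suitable cut-offs would depend on your own data.

```python
def length_features(tokens, thresholds=(10, 25, 50, 100)):
    """One binary feature per threshold: is the document shorter than it?"""
    n = len(tokens)
    return {f"length<{t}": n < t for t in thresholds}

def document_features(tokens, vocabulary):
    """Word-presence features plus the binary length features."""
    token_set = set(tokens)
    features = {f"contains({word})": word in token_set for word in vocabulary}
    features.update(length_features(tokens))
    return features

# Hypothetical example message and vocabulary.
message = ["free", "entry", "win", "cash"]
features = document_features(message, vocabulary=["free", "cash", "meeting"])
print(features)
```

Feature dictionaries like this, paired with labels as `(features, label)` tuples, are the input that `nltk.NaiveBayesClassifier.train` expects.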