Text classification using Naive Bayes
I am working on a text categorization problem using Naive Bayes, with each word as a feature. I have implemented it and am getting good accuracy.
Is it possible for me to use tuples of words as features?
For example, suppose there are two classes, Politics and Sports. The word "government" might appear in both. However, in Politics I could have the tuple (government, democracy), whereas in Sports I could have the tuple (government, sportsman). So if a new article about politics comes in, the tuple (government, democracy) should be more probable than the tuple (government, sportsman).
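A minimal sketch of what such tuple features could look like, assuming a "tuple" means a run of adjacent words (an n-gram) and that splitting on whitespace is good enough as a tokenizer. The function name `extract_features` and the sample sentence are made up for illustration; unordered co-occurrence pairs would be another possible reading of the idea.

```python
def extract_features(text, max_n=2):
    """Return single words plus adjacent word tuples up to length max_n.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    tokens = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            features.append(tuple(tokens[i:i + n]))
    return features


features = extract_features("Government supports democracy")
print(features)
```

Every feature is a tuple, so single words and longer tuples can share one count table in the classifier.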
I am asking because I wonder whether this violates the independence assumption of Naive Bayes, since I am also using the single words as features.
Also, I am thinking of adding weights to features. For example, a 3-tuple feature will have less weight than a 4-tuple feature.
Theoretically, do these two approaches change the independence assumptions of the Naive Bayes classifier? Also, I have not tried the approach yet, but will it improve accuracy? I suspect the accuracy might not improve, but that less training data would be needed to reach the same accuracy.
Comments (2)
Even without adding bigrams, real documents already violate the independence assumption. Conditioned on "Obama" appearing in a document, "President" is much more likely to appear. Nonetheless, Naive Bayes still does a decent job at classification, even if the probability estimates it gives are hopelessly off. So I recommend that you go ahead and add more complex features to your classifier and see whether they improve accuracy.
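To illustrate why the probability estimates drift while the decision can stay the same, here is a small sketch with made-up numbers. It assumes two classes with equal priors and one piece of evidence that favors politics 0.6 to 0.4, then counts that same evidence several times, as Naive Bayes effectively does with perfectly correlated features. The function name and the likelihood values are hypothetical.

```python
def posterior_politics(p_word_politics, p_word_sports, copies, prior_politics=0.5):
    """P(politics | evidence) when one piece of evidence is multiplied in `copies` times."""
    politics = prior_politics * p_word_politics ** copies
    sports = (1.0 - prior_politics) * p_word_sports ** copies
    return politics / (politics + sports)


once = posterior_politics(0.6, 0.4, copies=1)    # evidence counted once
thrice = posterior_politics(0.6, 0.4, copies=3)  # same evidence counted three times
print(round(once, 3), round(thrice, 3))
```

Both posteriors are above 0.5, so the predicted class is unchanged, but the second one is inflated because the same evidence was treated as three independent observations.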
If you get the same accuracy with less data, that is basically equivalent to getting better accuracy with the same amount of data.
On the other hand, using simpler, more common features works better as you decrease the amount of data. If you try to fit too many parameters to too little data, you tend to overfit badly.
But the bottom line is to try it and see.
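One way to try it is sketched below: a tiny multinomial Naive Bayes with add-one smoothing, written from scratch so that the feature set can be switched between unigrams only (`max_n=1`) and unigrams plus bigrams (`max_n=2`). The toy corpus, the helper names, and the choice to ignore features never seen in training are all assumptions made for illustration; a real experiment would compare held-out accuracy on your own data.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, max_n):
    """Unigrams plus adjacent word tuples up to length max_n."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def train(docs, max_n):
    """Count features per class; docs is a list of (text, label) pairs."""
    counts = defaultdict(Counter)
    class_docs = Counter()
    vocab = set()
    for text, label in docs:
        feats = ngrams(text, max_n)
        counts[label].update(feats)
        class_docs[label] += 1
        vocab.update(feats)
    return counts, class_docs, vocab


def predict(model, text, max_n, alpha=1.0):
    """Return the class with the highest smoothed log-posterior."""
    counts, class_docs, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        total = sum(counts[label].values())
        score = math.log(class_docs[label] / n_docs)
        for feat in ngrams(text, max_n):
            if feat in vocab:  # features unseen in training are skipped
                score += math.log((counts[label][feat] + alpha)
                                  / (total + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus, only to show the mechanics.
toy_docs = [
    ("government democracy election", "politics"),
    ("government vote parliament", "politics"),
    ("government sportsman medal", "sports"),
    ("team match sportsman", "sports"),
]

for max_n in (1, 2):
    model = train(toy_docs, max_n)
    print(max_n, predict(model, "government democracy", max_n),
          predict(model, "government sportsman", max_n))
```

On a corpus this small both feature sets agree, which is expected; the comparison only becomes informative with enough data for the bigram counts to be reliable.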
No, from a theoretical viewpoint, you are not changing the independence assumption. You are simply creating a modified (or new) sample space. In general, once you start using higher n-grams as events in your sample space, data sparsity becomes a problem. I think using tuples will lead to the same issue. You will probably need more training data, not less. You will probably also have to give a little more thought to the type of smoothing you use. Simple Laplace smoothing may not be ideal.
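The interaction between sparsity and smoothing can be sketched with the usual add-alpha (Laplace) estimate, (count + alpha) / (total + alpha * V), where V is the number of distinct features. The counts and vocabulary sizes below are invented; the point is only that adding tuples enlarges V while the observed counts stay the same.

```python
from collections import Counter


def smoothed_likelihood(feature, class_counts, vocab_size, alpha=1.0):
    """Add-alpha estimate of P(feature | class)."""
    total = sum(class_counts.values())
    # A Counter returns 0 for features never seen in this class.
    return (class_counts[feature] + alpha) / (total + alpha * vocab_size)


# Hypothetical counts for one class.
politics_counts = Counter({("government",): 3, ("democracy",): 1})

seen_small = smoothed_likelihood(("government",), politics_counts, vocab_size=4)
unseen_small = smoothed_likelihood(("sportsman",), politics_counts, vocab_size=4)
seen_large = smoothed_likelihood(("government",), politics_counts, vocab_size=40)
unseen_large = smoothed_likelihood(("sportsman",), politics_counts, vocab_size=40)
print(seen_small, unseen_small, seen_large, unseen_large)
```

With the larger feature space, the estimate for a seen feature moves much closer to that of an unseen one, because the smoothing term dominates the few real counts. That is one reason plain Laplace smoothing may not be ideal once tuples are added.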
The most important point, I think, is this: whatever classifier you are using, the best features are highly dependent on the domain (and sometimes even on the dataset). For example, if you are classifying the sentiment of movie reviews, using only unigrams may seem counterintuitive, but they perform better than using only adjectives. On the other hand, for Twitter datasets, a combination of unigrams and bigrams was found to work well, but higher-order n-grams were not useful. Based on such reports (ref. Pang and Lee, Opinion Mining and Sentiment Analysis), I think using longer tuples will show similar results, since, after all, tuples of words are simply points in a higher-dimensional space. The basic algorithm behaves the same way.