Which classifier to choose in NLTK
I want to classify text messages into several categories, such as "relation building", "coordination", "information sharing", "knowledge sharing" and "conflict resolution". I am using the NLTK library to process the data. Which classifier in NLTK is best suited to this multi-class classification problem?
I am planning to use Naive Bayes classification. Is that advisable?
Comments (2)
Naive Bayes is the simplest classifier and the easiest to understand, and for that reason it's nice to use. Decision Trees with a beam search to find the best classification are not significantly harder to understand and are usually a bit better. MaxEnt and SVM tend to be more complex, and SVM requires some tuning to get right.
Most important is the choice of features and the amount and quality of the data you provide!
With your problem, I would focus first on making sure you have a good training/testing dataset and on choosing good features. Since you are asking this question, you probably haven't had much experience with machine learning for NLP, so I'd say start off easy with Naive Bayes, as it doesn't need complex features: you can just tokenize and count word occurrences.
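To make "tokenize and count word occurrences" concrete, here is a minimal sketch using NLTK's `NaiveBayesClassifier`. The sample messages, the regex tokenizer and the `word_counts` helper are my own assumptions for illustration; they are not from the question, and a real corpus would need far more labeled data.

```python
import re

from nltk.classify import NaiveBayesClassifier, accuracy


def word_counts(message):
    """Lowercase the message, split it into word tokens, and count each one."""
    counts = {}
    for token in re.findall(r"[a-z0-9']+", message.lower()):
        counts[token] = counts.get(token, 0) + 1
    return counts


# Made-up examples for illustration only; a real corpus needs far more data.
labeled_messages = [
    ("nice to finally meet you, welcome to the team", "relation building"),
    ("thanks for your help, hope you had a good weekend", "relation building"),
    ("can we meet at 3pm tomorrow", "coordination"),
    ("please book the room for the review on friday", "coordination"),
    ("the server is down until noon", "information sharing"),
    ("the report is attached, deadline is monday", "information sharing"),
    ("you can fix that error by clearing the cache", "knowledge sharing"),
    ("the trick is to sort the list before searching", "knowledge sharing"),
    ("sorry about the misunderstanding, let's talk it over", "conflict resolution"),
    ("i understand your concern, can we find a compromise", "conflict resolution"),
]

# NLTK classifiers train on (feature dict, label) pairs.
train_set = [(word_counts(text), label) for text, label in labeled_messages]
classifier = NaiveBayesClassifier.train(train_set)

predicted = classifier.classify(word_counts("can we meet on friday"))
print(predicted)

# Accuracy on the training data is optimistic; score a held-out test set instead.
training_accuracy = accuracy(classifier, train_set)
print(training_accuracy)
```

With a real dataset you would hold back part of the labeled messages and pass that held-out part to `accuracy` rather than the training set.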
EDIT:
The question "How do you find the subject of a sentence?" and my answer to it are also worth looking at.
Yes. Training a Naive Bayes classifier for each category, and then assigning each message to the class whose classifier gives the highest score, is a standard first approach to problems like this. If you find performance inadequate, there are more sophisticated single-class classifier algorithms you could substitute for Naive Bayes, such as a Support Vector Machine (which I believe is available in NLTK via a Weka plugin, but I'm not positive). Unless you can think of something specific to this problem domain that would make Naive Bayes especially unsuitable, it's often the go-to "first try" for a lot of projects.
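A rough sketch of that one-classifier-per-category idea, assuming made-up sample messages and simple word-presence features (the `bag_of_words` and `classify` helpers are hypothetical names). Note that NLTK's `NaiveBayesClassifier` also accepts more than two labels directly, so this is only one way to set it up.

```python
import re

from nltk.classify import NaiveBayesClassifier


def bag_of_words(message):
    """Mark every word token in the message as present."""
    return {token: True for token in re.findall(r"[a-z0-9']+", message.lower())}


# Made-up examples for illustration only.
labeled_messages = [
    ("welcome to the team, great to meet you", "relation building"),
    ("can we meet at 3pm tomorrow", "coordination"),
    ("please book the room for friday", "coordination"),
    ("the server is down until noon", "information sharing"),
    ("you can fix that error by clearing the cache", "knowledge sharing"),
    ("sorry about the misunderstanding, let's talk", "conflict resolution"),
]

categories = sorted({label for _, label in labeled_messages})

# One binary classifier per category: True means "belongs to this category".
binary_classifiers = {}
for category in categories:
    binary_train = [
        (bag_of_words(text), label == category)
        for text, label in labeled_messages
    ]
    binary_classifiers[category] = NaiveBayesClassifier.train(binary_train)


def classify(message):
    """Score the message with every binary classifier and pick the best."""
    features = bag_of_words(message)
    scores = {
        category: clf.prob_classify(features).prob(True)
        for category, clf in binary_classifiers.items()
    }
    return max(scores, key=scores.get), scores


best, scores = classify("can we meet on friday")
print(best)
```

Swapping in a different binary classifier later only changes the `train` call inside the loop; the highest-score selection stays the same.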
The other NLTK classifier I would consider trying is MaxEnt, as I believe it natively handles multiclass classification (though the multiple binary classifier approach is very standard and common as well). In any case, the most important thing is to collect a very large corpus of properly tagged text messages.
If by "text messages" you mean actual cell phone text messages, these tend to be very short, and the language is very informal and varied. In that case I think feature selection may end up being a larger factor in determining accuracy than classifier choice. For example, using a stemmer or lemmatizer that understands the common abbreviations and idioms used, tagging parts of speech or chunking, entity extraction, and extracting probable relationships between terms may provide more bang than using more complex classifiers.
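As a small illustration of that kind of feature work, the sketch below expands a few texting abbreviations and then stems each word with NLTK's `PorterStemmer` before building features. The `ABBREVIATIONS` table and the `extract_features` helper are hypothetical; NLTK does not ship an abbreviation-aware stemmer that I know of, so the lookup table is an assumption you would have to build for your own data.

```python
import re

from nltk.stem import PorterStemmer

# Hypothetical lookup table; extend it from the abbreviations in your own data.
ABBREVIATIONS = {
    "u": "you",
    "c": "see",
    "tmrw": "tomorrow",
    "pls": "please",
    "thx": "thanks",
}

stemmer = PorterStemmer()


def extract_features(message):
    """Expand known abbreviations, stem each token, and mark it as present."""
    features = {}
    for token in re.findall(r"[a-z0-9']+", message.lower()):
        word = ABBREVIATIONS.get(token, token)
        features["contains(%s)" % stemmer.stem(word)] = True
    return features


# Both spellings end up with the same feature dict.
print(extract_features("c u tmrw"))
print(extract_features("see you tomorrow"))
```

The resulting dicts can be passed to any NLTK classifier in place of raw word counts.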
This paper talks about classifying Facebook status messages based on sentiment, which has some of the same issues, and may provide some insight. The link is to a Google cache because I'm having problems with the original site:
http://docs.google.com/viewer?a=v&q=cache:_AeBYp6i1ooJ:nlp.stanford.edu/courses/cs224n/2010/reports/ssoriajr-kanej.pdf+maxent+classifier+multiple+classes&hl=en&gl=us&pid=bl&srcid=ADGEESi-eZHTZCQPo7AlcnaFdUws9nSN1P6X0BVmHjtlpKYGQnj7dtyHmXLSONa9Q9ziAQjliJnR8yD1Z-0WIpOjcmYbWO2zcB6z4RzkIhYI_Dfzx2WqU4jy2Le4wrEQv0yZp_QZyHQN&sig=AHIEtbQN4J_XciVhVI60oyrPb4164u681w&pli=1