I am trying to implement a naive Bayesian approach to find the topic of a given document or stream of words. Is there a naive Bayesian approach that I might be able to look up for this?
Also, I am trying to improve my dictionary as I go along. Initially, I have a bunch of words that map to topics (hard-coded). Depending on the occurrence of words other than the ones that are already mapped, I want to add them to the mappings, thereby improving the dictionary and learning new words that map to each topic, and also updating the probabilities of the words.
How should I go about doing this? Is my approach the right one?
Which programming language would be best suited for the implementation?
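For concreteness, here is a minimal sketch (Python, with made-up topics and words) of the kind of hard-coded word-to-topic mapping and count-based updating I have in mind:

```python
from collections import defaultdict

# Hard-coded seed mapping: word -> topic (made-up examples).
seed_topics = {"goal": "sports", "match": "sports", "stock": "finance"}

# Per-topic word counts, seeded from the hard-coded mapping.
counts = defaultdict(lambda: defaultdict(int))
for word, topic in seed_topics.items():
    counts[topic][word] += 1

def observe(words, topic):
    """After a document has been assigned a topic, count its words under
    that topic, so previously unmapped words get added and the estimated
    probabilities shift."""
    for word in words:
        counts[topic][word] += 1

def word_prob(word, topic):
    """Relative frequency of `word` within `topic` (no smoothing here)."""
    total = sum(counts[topic].values())
    return counts[topic][word] / total if total else 0.0

observe(["goal", "striker", "penalty"], "sports")
print(word_prob("striker", "sports"))  # 0.2
```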
Existing Implementations of Naive Bayes
You would probably be better off just using one of the existing packages that support document classification using naive Bayes, e.g.:
Python - To do this using the Python-based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book (a minimal sketch follows this list).
Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.
Perl - Perl has the Algorithm::NaiveBayes module, complete with a sample usage snippet in the package synopsis.
C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.
Java - Java folks have Classifier4J. You can see a training and scoring code snippet here.
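If you go the Python/NLTK route, here is a minimal sketch of training and querying nltk.NaiveBayesClassifier on bag-of-words features; the topics and training documents below are invented purely for illustration, and a real dictionary would of course be much larger:

```python
import nltk

def word_features(words):
    # Reduce a document (list of words) to a bag-of-words feature dict.
    return {word.lower(): True for word in words}

# Tiny hand-labeled training set (invented examples).
train_docs = [
    (["goal", "match", "score", "league"], "sports"),
    (["election", "vote", "senate", "policy"], "politics"),
    (["stock", "market", "shares", "earnings"], "finance"),
]
train_set = [(word_features(words), topic) for words, topic in train_docs]

classifier = nltk.NaiveBayesClassifier.train(train_set)

# Classify a new stream of words.
new_doc = ["the", "match", "ended", "with", "a", "late", "goal"]
print(classifier.classify(word_features(new_doc)))  # -> "sports"
```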
Bootstrapping Classification from Keywords
It sounds like you want to start with a set of keywords that are known to cue for certain topics and then use those keywords to bootstrap a classifier.
This is a reasonably clever idea. Take a look at the paper Text Classification by Bootstrapping with Keywords, EM and Shrinkage by McCallum and Nigam (1999). By following this approach, they were able to improve classification accuracy from the 45% they got by using hard-coded keywords alone to 66% using a bootstrapped Naive Bayes classifier. For their data, the latter is close to human levels of agreement, as people agreed with each other about document labels 72% of the time.
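As a rough illustration of the bootstrapping idea (not the full EM-plus-shrinkage procedure from the paper), here is a sketch in plain Python: seed keywords pseudo-label whatever unlabeled documents they match, a naive Bayes model is trained on those pseudo-labels, and the model then classifies everything, including documents that contain none of the original keywords. All topic names, keywords, and documents are invented for the example:

```python
import math
from collections import Counter, defaultdict

# Hypothetical seed dictionary: hand-picked keywords that cue each topic.
SEED_KEYWORDS = {
    "sports": {"goal", "match", "score"},
    "finance": {"stock", "market", "earnings"},
}

def seed_label(words):
    """Assign a pseudo-label if exactly one topic's seed keywords match."""
    hits = {topic for topic, keywords in SEED_KEYWORDS.items()
            if keywords & set(words)}
    return hits.pop() if len(hits) == 1 else None

def train_naive_bayes(labeled_docs):
    """Collect the counts needed for a multinomial naive Bayes model."""
    word_counts = defaultdict(Counter)   # topic -> word frequencies
    doc_counts = Counter()               # topic -> number of documents
    vocab = set()
    for words, topic in labeled_docs:
        doc_counts[topic] += 1
        word_counts[topic].update(words)
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(words, word_counts, doc_counts, vocab):
    """Pick the topic with the highest log-posterior, using Laplace
    smoothing so unseen words do not zero out a topic's score."""
    total_docs = sum(doc_counts.values())
    best_topic, best_score = None, float("-inf")
    for topic in doc_counts:
        score = math.log(doc_counts[topic] / total_docs)
        total_words = sum(word_counts[topic].values())
        for w in words:
            score += math.log((word_counts[topic][w] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

# Bootstrap: pseudo-label unlabeled documents via the seed keywords,
# train on those, then classify everything with the learned model.
unlabeled_docs = [
    ["late", "goal", "seals", "the", "match"],
    ["shares", "slide", "after", "weak", "earnings"],
    ["coach", "praises", "young", "striker"],   # matches no seed keyword
]
pseudo = [(doc, seed_label(doc)) for doc in unlabeled_docs]
training = [(doc, topic) for doc, topic in pseudo if topic is not None]

model = train_naive_bayes(training)
for doc in unlabeled_docs:
    print(" ".join(doc), "->", classify(doc, *model))
```

In a fuller implementation you would repeat the last step: re-label the unlabeled documents with the trained model, retrain, and iterate until the labels stop changing, which is essentially what the EM procedure in the paper formalizes (shrinkage is a further refinement of the parameter estimates). Words that end up strongly associated with a topic can then be added to your dictionary, which is the incremental learning you describe.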