Python NLTK code snippet to train a classifier (Naive Bayes) using feature frequency
I was wondering if anyone could help me with a code snippet that demonstrates how to train a Naive Bayes classifier using feature frequency rather than feature presence.
I presume the code below, shown in Chapter 6 (linked), creates a featureset using feature presence (FP):
    # word_features is the list of most frequent words, defined earlier in the chapter
    def document_features(document):
        document_words = set(document)  # a set keeps presence only, not counts
        features = {}
        for word in word_features:
            features['contains(%s)' % word] = (word in document_words)
        return features
Please advise.
The link you sent says this function is a feature extractor that simply checks whether each of these words is present in a given document.
Here is the whole code with numbers for each line:
Line 1 creates a list of all the words.
Line 2 takes the 2000 most frequent words.
Line 3 is the definition of the function.
Line 4 converts the document (I think it must be a list) to a set.
Line 5 declares a dictionary.
Line 6 iterates over all of the 2000 most frequent words.
Line 7 fills the dictionary, where the key is 'contains(theword)' and the value is either true or false: true if the word is present in the document, false otherwise.
Line 8 returns the dictionary, which shows whether or not the document contains each of the 2000 most frequent words.
Does this answer your question?
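To make the walkthrough above concrete, here is a minimal sketch of the eight steps. It is not the exact listing from the chapter: the toy `documents` corpus is made up for illustration, and `collections.Counter` stands in for `nltk.FreqDist` so that the snippet runs without NLTK or its corpora installed.

```python
from collections import Counter

# Hypothetical toy corpus: (tokens, label) pairs standing in for real documents.
documents = [
    (["great", "plot", "great", "cast"], "pos"),
    (["dull", "plot", "weak", "cast"], "neg"),
]

# Line 1: count every word in the corpus (Counter is an assumed stand-in for nltk.FreqDist).
all_words = Counter(w.lower() for tokens, _ in documents for w in tokens)
# Line 2: keep the most frequent words (2000 in the answer; this toy corpus has fewer).
word_features = [w for w, _ in all_words.most_common(2000)]

# Line 3: define the feature extractor.
def document_features(document):
    # Line 4: turn the token list into a set; repeat counts are discarded here.
    document_words = set(document)
    # Line 5: start with an empty dictionary.
    features = {}
    # Line 6: visit each of the most frequent words.
    for word in word_features:
        # Line 7: True if the word occurs in the document, False otherwise.
        features['contains(%s)' % word] = (word in document_words)
    # Line 8: return the presence dictionary.
    return features

features = document_features(["great", "great", "plot"])
```

Note that "great" appears twice in the last document but its feature value is just `True`; the `set()` call at line 4 is where the frequency information is lost.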
For training, create appropriate FreqDists that you can use to create ProbDists, which can then be passed to the NaiveBayesClassifier. But classification actually works on feature sets, which use boolean values, not frequencies. So if you want to classify based on a FreqDist, you'll have to implement your own classifier that does not use the NLTK feature sets.
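If you do go the "own classifier" route, a minimal sketch could be a multinomial Naive Bayes over raw word counts with add-one smoothing. This is plain Python and not part of NLTK; the names `train_multinomial_nb`, `classify` and `alpha`, and the toy training data, are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(labeled_docs, alpha=1.0):
    """Train on (tokens, label) pairs; returns (log_priors, log_likelihoods, vocab)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)  # every occurrence counts, not just presence
        vocab.update(tokens)
    total_docs = sum(label_counts.values())
    log_priors = {l: math.log(c / total_docs) for l, c in label_counts.items()}
    log_likelihoods = {}
    for label in label_counts:
        # Additive (Laplace) smoothing so unseen word/label pairs never get probability 0.
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        log_likelihoods[label] = {
            w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab
        }
    return log_priors, log_likelihoods, vocab

def classify(tokens, model):
    log_priors, log_likelihoods, vocab = model
    counts = Counter(t for t in tokens if t in vocab)  # words outside the vocabulary are ignored
    scores = {
        label: log_priors[label]
        + sum(n * log_likelihoods[label][w] for w, n in counts.items())
        for label in log_priors
    }
    return max(scores, key=scores.get)

# Hypothetical toy training data.
train = [
    (["great", "great", "plot"], "pos"),
    (["great", "cast"], "pos"),
    (["dull", "dull", "plot"], "neg"),
    (["weak", "cast"], "neg"),
]
model = train_multinomial_nb(train)
prediction = classify(["great", "great", "dull"], model)
```

Here each repeated word multiplies its log-likelihood by its count, which is what makes this frequency-based. As far as I can tell, NLTK's NaiveBayesClassifier treats every feature value as a discrete symbol, so a count placed in a featureset would be handled as a category rather than as a frequency, which is why a separate classifier is sketched here.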
Here's a method that will help you: