Python NLTK code snippet to train a classifier (Naive Bayes) using feature frequency

I was wondering if anyone could help me with a code snippet that demonstrates how to train a Naive Bayes classifier using a feature frequency method, as opposed to feature presence.

I presume the snippet below, as shown in Chapter 6 of the NLTK book, creates a feature set using Feature Presence (FP):

def document_features(document):
    document_words = set(document)

    features = {}
    # word_features: the 2000 most frequent words in the corpus,
    # built earlier in the chapter
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)

    return features

Please advise.
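What I have in mind is something along these lines, a rough untested sketch that reuses word_features from the chapter and simply swaps the boolean test for a word count (the name document_features_freq is mine):

    def document_features_freq(document):
        # Sketch: use each word's count in the document as the feature value
        # instead of a True/False presence flag.
        counts = nltk.FreqDist(document)
        features = {}
        for word in word_features:
            features['count(%s)' % word] = counts[word]
        return features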

3 Answers

画离情绘悲伤 2024-08-26 10:46:11

In the link you sent, it says this function is a feature extractor that simply checks whether each of these words is present in a given document.

Here is the whole code with numbers for each line:

1     all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
2     word_features = [w for (w, _) in all_words.most_common(2000)]

3     def document_features(document):
4          document_words = set(document)
5          features = {}
6          for word in word_features:
7               features['contains(%s)' % word] = (word in document_words)
8          return features

In line 1 it builds a frequency distribution of every word (lower-cased) in the movie reviews corpus.

Line 2 takes the 2000 most frequent words. (The original all_words.keys()[:2000] only works on Python 2, where FreqDist.keys() returned keys sorted by decreasing frequency; most_common(2000) is the Python 3 equivalent.)

Line 3 is the definition of the function.

Line 4 converts the document (a list of words) into a set, which makes the membership tests on line 7 fast.

Line 5 declares a dictionary.

Line 6 iterates over all of the 2000 most frequent words.

Line 7 creates a dictionary entry whose key is 'contains(theword)' and whose value is either True or False: True if the word is present in the document, False otherwise.

Line 8 returns the dictionary, which shows whether the document contains each of the 2000 most frequent words.

Does this answer your question?
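To see it in action, here is a hypothetical usage (assuming lines 1-2 above have been run, plus import nltk and from nltk.corpus import movie_reviews):

    # Pick an arbitrary positive review from the corpus
    doc = movie_reviews.words(movie_reviews.fileids('pos')[0])
    featureset = document_features(doc)
    print(featureset['contains(film)'])   # True if 'film' occurs in this review

Each document becomes a dict of 2000 boolean features, which is the form nltk.NaiveBayesClassifier.train() expects as the first element of each (featureset, label) training pair.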

深府石板幽径 2024-08-26 10:46:11

For training, create appropriate FreqDists that you can use to create ProbDists, which can then be passed in to the NaiveBayesClassifier. But classification actually works on feature sets, which (in the book's example) use boolean values, not frequencies. So if you want to classify based on a FreqDist, you'll have to implement your own classifier that does not use the NLTK feature sets.
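For concreteness, here is a minimal sketch of the construction described above, on a made-up two-document corpus (the corpus, labels, and 'count(...)' feature names are all invented for illustration; NaiveBayesClassifier's constructor takes a label ProbDist and a dict mapping (label, feature name) pairs to ProbDists over feature values):

    import nltk
    from nltk.probability import FreqDist, ELEProbDist

    # Toy training data: (list of words, label) pairs -- entirely made up.
    train = [(['great', 'great', 'fun'], 'pos'),
             (['boring', 'boring', 'dull'], 'neg')]

    label_freqdist = FreqDist()
    feature_freqdists = {}  # (label, feature name) -> FreqDist over observed counts

    for words, label in train:
        label_freqdist[label] += 1
        counts = FreqDist(words)
        for word in counts:
            fname = 'count(%s)' % word
            feature_freqdists.setdefault((label, fname), FreqDist())[counts[word]] += 1

    # Smooth each FreqDist into a ProbDist and build the classifier directly.
    label_probdist = ELEProbDist(label_freqdist)
    feature_probdist = {key: ELEProbDist(fd) for key, fd in feature_freqdists.items()}
    classifier = nltk.NaiveBayesClassifier(label_probdist, feature_probdist)

    # Classify a new document by its word counts.
    test_counts = FreqDist(['great', 'fun', 'fun'])
    print(classifier.classify({'count(%s)' % w: c for w, c in test_counts.items()}))

Note that this still treats each distinct count as a separate nominal feature value rather than modelling frequencies multinomially, which is why a genuinely frequency-based classifier ends up being custom code, as the answer says.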

半衾梦 2024-08-26 10:46:11

Here's a method which will help you:

def get_freq_letters(words):
    '''Returns the frequency of each letter across a list of words.'''
    # Count every alphabetic character, lower-cased.
    fdist = nltk.FreqDist(char.lower() for word in words for char in word if char.isalpha())
    freq_letters = {}
    for key, value in fdist.items():  # iteritems() was Python 2 only
        freq_letters[key] = value
    return freq_letters
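For example, on hypothetical input (dict ordering may vary):

    >>> get_freq_letters(['Hello', 'world'])
    {'h': 1, 'e': 1, 'l': 3, 'o': 2, 'w': 1, 'r': 1, 'd': 1}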