Python NLTK code snippet to train a classifier (Naive Bayes) using feature frequency

I was wondering if anyone could help me with a code snippet that demonstrates how to train a Naive Bayes classifier using a feature frequency method, as opposed to feature presence.

I presume the snippet below, as shown in Chapter 6 of the NLTK book, creates a feature set using Feature Presence (FP):

def document_features(document):
    document_words = set(document)

    features = {}
    # word_features: the 2000 most frequent words in the corpus,
    # built earlier in the chapter
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)

    return features

Please advise.
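What I have in mind is something along these lines, a rough untested sketch that reuses word_features from the chapter and simply swaps the boolean test for a word count (the name document_features_freq is mine):

    def document_features_freq(document):
        # Sketch: use each word's count in the document as the feature value
        # instead of a True/False presence flag.
        counts = nltk.FreqDist(document)
        features = {}
        for word in word_features:
            features['count(%s)' % word] = counts[word]
        return features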

3 Answers

画离情绘悲伤 2024-08-26 10:46:11

In the link you sent, it says this function is a feature extractor that simply checks whether each of these words is present in a given document.

Here is the whole code with numbers for each line:

1     all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
2     word_features = [w for (w, _) in all_words.most_common(2000)]

3     def document_features(document):
4          document_words = set(document)
5          features = {}
6          for word in word_features:
7               features['contains(%s)' % word] = (word in document_words)
8          return features

In line 1 it builds a frequency distribution of every word (lower-cased) in the movie reviews corpus.

Line 2 takes the 2000 most frequent words. (The original all_words.keys()[:2000] only works on Python 2, where FreqDist.keys() returned keys sorted by decreasing frequency; most_common(2000) is the Python 3 equivalent.)

Line 3 is the definition of the function.

Line 4 converts the document (a list of words) into a set, which makes the membership tests on line 7 fast.

Line 5 declares a dictionary.

Line 6 iterates over all of the 2000 most frequent words.

Line 7 creates a dictionary entry whose key is 'contains(theword)' and whose value is either True or False: True if the word is present in the document, False otherwise.

Line 8 returns the dictionary, which shows whether the document contains each of the 2000 most frequent words.

Does this answer your question?
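To see it in action, here is a hypothetical usage (assuming lines 1-2 above have been run, plus import nltk and from nltk.corpus import movie_reviews):

    # Pick an arbitrary positive review from the corpus
    doc = movie_reviews.words(movie_reviews.fileids('pos')[0])
    featureset = document_features(doc)
    print(featureset['contains(film)'])   # True if 'film' occurs in this review

Each document becomes a dict of 2000 boolean features, which is the form nltk.NaiveBayesClassifier.train() expects as the first element of each (featureset, label) training pair.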

深府石板幽径 2024-08-26 10:46:11

For training, create appropriate FreqDists that you can use to create ProbDists, which can then be passed in to the NaiveBayesClassifier. But classification actually works on feature sets, which (in the book's example) use boolean values, not frequencies. So if you want to classify based on a FreqDist, you'll have to implement your own classifier that does not use the NLTK feature sets.
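For concreteness, here is a minimal sketch of the construction described above, on a made-up two-document corpus (the corpus, labels, and 'count(...)' feature names are all invented for illustration; NaiveBayesClassifier's constructor takes a label ProbDist and a dict mapping (label, feature name) pairs to ProbDists over feature values):

    import nltk
    from nltk.probability import FreqDist, ELEProbDist

    # Toy training data: (list of words, label) pairs -- entirely made up.
    train = [(['great', 'great', 'fun'], 'pos'),
             (['boring', 'boring', 'dull'], 'neg')]

    label_freqdist = FreqDist()
    feature_freqdists = {}  # (label, feature name) -> FreqDist over observed counts

    for words, label in train:
        label_freqdist[label] += 1
        counts = FreqDist(words)
        for word in counts:
            fname = 'count(%s)' % word
            feature_freqdists.setdefault((label, fname), FreqDist())[counts[word]] += 1

    # Smooth each FreqDist into a ProbDist and build the classifier directly.
    label_probdist = ELEProbDist(label_freqdist)
    feature_probdist = {key: ELEProbDist(fd) for key, fd in feature_freqdists.items()}
    classifier = nltk.NaiveBayesClassifier(label_probdist, feature_probdist)

    # Classify a new document by its word counts.
    test_counts = FreqDist(['great', 'fun', 'fun'])
    print(classifier.classify({'count(%s)' % w: c for w, c in test_counts.items()}))

Note that this still treats each distinct count as a separate nominal feature value rather than modelling frequencies multinomially, which is why a genuinely frequency-based classifier ends up being custom code, as the answer says.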

半衾梦 2024-08-26 10:46:11

Here's a method which will help you:

def get_freq_letters(words):
    '''Returns the frequency of each letter across a list of words.'''
    # Count every alphabetic character, lower-cased.
    fdist = nltk.FreqDist(char.lower() for word in words for char in word if char.isalpha())
    freq_letters = {}
    for key, value in fdist.items():  # iteritems() was Python 2 only
        freq_letters[key] = value
    return freq_letters
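For example, on hypothetical input (dict ordering may vary):

    >>> get_freq_letters(['Hello', 'world'])
    {'h': 1, 'e': 1, 'l': 3, 'o': 2, 'w': 1, 'r': 1, 'd': 1}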