Problem understanding chi-squared feature selection

Posted on 2024-10-18 15:08:11

I've been having a problem understanding chi-squared feature selection. I have two classes, positive and negative, each containing different terms and term counts. I need to perform chi-squared feature selection to extract the most representative terms for each class. The problem is that I end up getting the EXACT same terms for both my positive and negative class. Here is my Python code for selecting features:

#!/usr/bin/python

# import the necessary libraries
import math

class ChiFeatureSelector:
    def __init__(self, extCorpus, lookupCorpus):
        # store the extraction corpus and lookup corpus
        self.extCorpus = extCorpus
        self.lookupCorpus = lookupCorpus

    def select(self, outPath):
        # dictionary of chi-squared scores
        scores = {}

        # loop over the words in the extraction corpus
        for w in self.extCorpus.getTerms():
            # build the chi-squared table
            n11 = float(self.extCorpus.getTermCount(w))
            n10 = float(self.lookupCorpus.getTermCount(w))
            n01 = float(self.extCorpus.getTotalDocs() - n11)
            n00 = float(self.lookupCorpus.getTotalDocs() - n10)

            # perform the chi-squared calculation and store
            # the score in the dictionary
            a = n11 + n10 + n01 + n00
            b = ((n11 * n00) - (n10 * n01)) ** 2
            c = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
            # guard against an empty row or column in the table
            chi = (a * b) / c if c else 0.0
            scores[w] = chi

        # sort the scores in descending order and print the top 10
        ranked = sorted(((v, k) for (k, v) in scores.items()), reverse=True)

        for (v, k) in ranked[:10]:
            print(str(k) + " : " + str(v))
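
For reference, the statistic computed inside the loop can be written out as a single formula. This is only a restatement of the variables `a`, `b`, and `c` from the code above, using the same cell names; it assumes `getTermCount` returns the number of documents that contain the term.

```latex
\chi^2 = \frac{N \,(n_{11} n_{00} - n_{10} n_{01})^2}
              {(n_{11} + n_{01})\,(n_{11} + n_{10})\,(n_{10} + n_{00})\,(n_{01} + n_{00})},
\qquad N = n_{11} + n_{10} + n_{01} + n_{00}
```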

And this is how I use the class (some code omitted for brevity's sake, and yes, I have checked to ensure that the two corpora do not contain exactly the same data):

# perform positive ngram feature selection
print("positive:\n")
f = ChiFeatureSelector(posCorpus, negCorpus)
f.select(posOutputPath)

# perform negative ngram feature selection
print("\nnegative:\n")
f = ChiFeatureSelector(negCorpus, posCorpus)
f.select(negOutputPath)

I feel like the error is coming from where I calculate the term/document table, but I'm not sure. Perhaps I am not understanding something. Can someone point me in the right direction?

Comments (1)

谜兔 2024-10-25 15:08:11

In the two-class case, the chi-squared ranking of features is the same if the two
data sets are exchanged. They are the features which differ the most between
the two classes.
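
A small sketch may make the symmetry concrete. Exchanging the two corpora swaps n11 with n10 and n01 with n00, which flips the sign of the difference inside the square and only reorders the factors of the denominator, so the score is unchanged. The function names and the counts below are hypothetical and do not come from the question's corpora. The direction check is one possible way to split the ranking by class, offered here as an assumption rather than as part of the chi-squared test itself.

```python
def chi_squared(n11, n10, n01, n00):
    """2x2 chi-squared statistic using the same cell layout as the question."""
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator if denominator else 0.0


def leans_towards_extraction(n11, n10, n01, n00):
    """True when the term occurs in a larger share of extraction-corpus documents.

    n11 / (n11 + n01) > n10 / (n10 + n00) simplifies to n11 * n00 > n10 * n01.
    """
    return n11 * n00 > n10 * n01


# Hypothetical counts: the term appears in 30 of 100 positive documents
# and in 5 of 100 negative documents.
pos_view = (30.0, 5.0, 70.0, 95.0)  # positive corpus as the extraction corpus
neg_view = (5.0, 30.0, 95.0, 70.0)  # same data with the corpora exchanged

score_pos = chi_squared(*pos_view)
score_neg = chi_squared(*neg_view)

print(score_pos, score_neg)                 # the two scores are equal
print(leans_towards_extraction(*pos_view))  # True
print(leans_towards_extraction(*neg_view))  # False
```

Keeping only the terms for which the direction check holds would give each class its own list while still ranking by the same score.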
