Combining individual probabilities in Naive Bayesian spam filtering
I'm currently trying to generate a spam filter by analyzing a corpus I've amassed.
I'm using the wikipedia entry http://en.wikipedia.org/wiki/Bayesian_spam_filtering to develop my classification code.
I've implemented code to calculate the probability that a message is spam given that it contains a specific word, by implementing the following formula from the wiki:

Pr(S|W) = Pr(W|S) * Pr(S) / (Pr(W|S) * Pr(S) + Pr(W|H) * Pr(H))
My PHP code:
public function pSpaminess($word)
{
    // priors: Pr(S) and Pr(H)
    $ps = $this->pContentIsSpam();
    $ph = $this->pContentIsHam();
    // likelihoods: Pr(W|S) and Pr(W|H)
    $pws = $this->pWordInSpam($word);
    $pwh = $this->pWordInHam($word);
    // Pr(S|W) by Bayes' theorem
    $psw = ($pws * $ps) / ($pws * $ps + $pwh * $ph);
    return $psw;
}
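For reference, here is a minimal sketch of how those helper probabilities might be computed from raw counts. This is an assumption on my part, not the question's actual code: the method names (pContentIsSpam, pWordInSpam, etc.) come from the snippet above, but every count field used below (spamMessageCount, hamMessageCount, wordSpamCounts, wordHamCounts, totalSpamWords, totalHamWords, vocabularySize) is hypothetical, and the add-one (Laplace) smoothing is just one common way to keep unseen words from producing zeros or a division by zero.
// Hypothetical count fields assumed on the classifier:
//   $this->spamMessageCount, $this->hamMessageCount           - labeled messages
//   $this->wordSpamCounts[$word], $this->wordHamCounts[$word] - per-word tallies
//   $this->totalSpamWords, $this->totalHamWords               - token totals per class
//   $this->vocabularySize                                     - distinct words seen in training
public function pContentIsSpam()
{
    $total = $this->spamMessageCount + $this->hamMessageCount;
    return $this->spamMessageCount / $total; // prior Pr(S)
}
public function pContentIsHam()
{
    $total = $this->spamMessageCount + $this->hamMessageCount;
    return $this->hamMessageCount / $total; // prior Pr(H)
}
public function pWordInSpam($word)
{
    $count = isset($this->wordSpamCounts[$word]) ? $this->wordSpamCounts[$word] : 0;
    // add-one smoothing keeps unseen words away from exactly 0
    return ($count + 1) / ($this->totalSpamWords + $this->vocabularySize); // Pr(W|S)
}
public function pWordInHam($word)
{
    $count = isset($this->wordHamCounts[$word]) ? $this->wordHamCounts[$word] : 0;
    return ($count + 1) / ($this->totalHamWords + $this->vocabularySize); // Pr(W|H)
}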
In accordance with the Combining individual probabilities section, I've implemented code to combine the probabilities of all the unique words in a test message to determine spaminess.
From the wiki formula:

p = (p1 * p2 * ... * pN) / (p1 * p2 * ... * pN + (1 - p1) * (1 - p2) * ... * (1 - pN))
My PHP code:
public function predict($content)
{
    $words = $this->tokenize($content);
    $pProducts = 1; // running product of p_i
    $pSums = 1;     // running product of (1 - p_i)
    foreach ($words as $word)
    {
        $p = $this->pSpaminess($word);
        echo "$word: $p\n";
        $pProducts *= $p;
        $pSums *= (1 - $p);
    }
    return $pProducts / ($pProducts + $pSums);
}
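The driver script is not shown in the question; here is a minimal sketch of what test.php might look like, assuming the class is called Classifier and is already trained when constructed (both the class name and the file name are hypothetical):
// test.php - hypothetical driver; the class and file names are assumptions
require 'Classifier.php';
$classifier = new Classifier(); // assumed to load the corpus and learn the counts
echo "probability message is spam: ";
var_dump($classifier->predict("This isn't very bad at all."));
The per-word lines in the output below come from the echo inside predict().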
On a test string "This isn't very bad at all.", the following output is produced:
C:\projects\bayes>php test.php
this: 0.19907407407407
isn't: 0.23
very: 0.2
bad: 0.2906976744186
at: 0.17427385892116
all: 0.16098484848485
probability message is spam: float(0.00030795502523944)
Here's my question: Am I implementing the combining individual probabilities correctly? Assuming I'm generating valid individual word probabilities, is the combination method correct?
My concern is the really small resultant probability of the calculation. I've tested it on a larger test message and ended up with a resulting probability in scientific notation with more than 10 places of zeroes. I was expecting values in the tenths or hundredths place.
I'm hoping the problem lies in my PHP implementation, but when I examine the combination function from Wikipedia, the dividend of the formula is a product of fractions. I don't see how a combination of multiple probabilities could end up being more than even a 0.1% probability.
If it is the case that the longer the message, the lower the probability score will be, how do I compensate the spaminess threshold to correctly predict spam/ham for both small and large test cases?
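One way to sidestep the underflow part of this (a sketch of my own, not the question's code) is to do the same combination in the log domain; the Wikipedia article gives an equivalent rewrite of the combining formula along these lines: eta = sum over words of (ln(1 - p_i) - ln(p_i)), and then p = 1 / (1 + e^eta). Mathematically this equals pProducts / (pProducts + pSums), but it never multiplies dozens of small numbers together. The method name predictLog is hypothetical:
public function predictLog($content)
{
    $words = $this->tokenize($content);
    $eta = 0.0;
    foreach ($words as $word)
    {
        $p = $this->pSpaminess($word);
        // clamp away from exactly 0 or 1 so log() stays finite
        $p = min(max($p, 1e-10), 1 - 1e-10);
        $eta += log(1 - $p) - log($p);
    }
    // p = 1 / (1 + e^eta), equivalent to pProducts / (pProducts + pSums)
    return 1 / (1 + exp($eta));
}
Note that this only fixes the numerical issue. If most individual word probabilities sit below 0.5, as in the output above, the combined score will still legitimately head toward 0 as messages get longer; that part is a property of the data rather than of the arithmetic.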
Additional Info
My corpus is actually a collection of about 40k reddit comments, and I'm applying my "spam filter" against these comments. I rate an individual comment as spam or ham based on its down votes versus up votes: if a comment has fewer up votes than down votes it is considered Spam, otherwise Ham.
Now, because of the corpus type, it turns out there are actually few words that are used more in spam than in ham. That is, here is a top-ten list of words that appear more often in spam than in ham.
+-----------+------------+-----------+
| word | spam_count | ham_count |
+-----------+------------+-----------+
| krugman | 30 | 27 |
| fetus | 12.5 | 7.5 |
| boehner | 12 | 10 |
| hatred | 11.5 | 5.5 |
| scum | 11 | 10 |
| reserve | 11 | 10 |
| incapable | 8.5 | 6.5 |
| socalled | 8.5 | 5.5 |
| jones | 8.5 | 7.5 |
| orgasms | 8.5 | 7.5 |
+-----------+------------+-----------+
On the contrary, most words are used in far greater abundance in ham than in spam. Take, for instance, my top-10 list of words with the highest spam counts.
+------+------------+-----------+
| word | spam_count | ham_count |
+------+------------+-----------+
| the | 4884 | 17982 |
| to | 4006.5 | 14658.5 |
| a | 3770.5 | 14057.5 |
| of | 3250.5 | 12102.5 |
| and | 3130 | 11709 |
| is | 3102.5 | 11032.5 |
| i | 2987.5 | 10565.5 |
| that | 2953.5 | 10725.5 |
| it | 2633 | 9639 |
| in | 2593.5 | 9780.5 |
+------+------------+-----------+
As you can see, the frequency of spam usage is significantly lower than that of ham usage. In my corpus of 40k comments, 2100 comments are considered spam.
As suggested below, a test phrase on a post considered spam rates as follows:
Phrase
Cops are losers in general. That's why they're cops.
Analysis:
C:\projects\bayes>php test.php
cops: 0.15833333333333
are: 0.2218958611482
losers: 0.44444444444444
in: 0.20959269435914
general: 0.19565217391304
that's: 0.22080730418068
why: 0.24539170506912
they're: 0.19264544456641
float(6.0865969793861E-5)
According to this, there is an extremely low probability that this is spam. However, if I were to now analyze a ham comment:
Phrase
Bill and TED's excellent venture?
Analysis
C:\projects\bayes>php test.php
bill: 0.19534050179211
and: 0.21093065570456
ted's: 1
excellent: 0.16091954022989
venture: 0.30434782608696
float(1)
Okay, this is interesting. I'm doing these examples as I compose this update, so this is the first time I've seen the result for this specific test case. I think my prediction is inverted: it's actually picking out the probability of Ham instead of Spam. This deserves validation.
New test on known ham.
Phrase
Complain about $174,000 salary being too little for self. Complain about $50,000 a year too much for teachers.
Scumbag congressman.
Analysis
C:\projects\bayes>php test.php
complain: 0.19736842105263
about: 0.21896031561847
174: 0.044117647058824
000: 0.19665809768638
salary: 0.20786516853933
being: 0.22011494252874
too: 0.21003236245955
little: 0.21134020618557
for: 0.20980452359022
self: 0.21052631578947
50: 0.19245283018868
a: 0.21149315683195
year: 0.21035386631717
much: 0.20139771283355
teachers: 0.21969696969697
scumbag: 0.22727272727273
congressman: 0.27678571428571
float(3.9604152477223E-11)
Unfortunately, no. It turns out that was a coincidental result. I'm starting to wonder whether comments can be so easily quantified; perhaps the nature of a bad comment is too vastly different from the nature of a spam message.
Perhaps spam filtering only works when the spam messages have a distinctive word class of their own?
Final Update
As pointed out in the replies, the weird results were due to the nature of the corpus. With a comment corpus where there is no explicit definition of spam, Bayesian classification does not perform. Since it is possible (and likely) that any one comment may receive both spam-like and ham-like ratings from various users, it is not possible to generate a hard classification for spam comments.
Ultimately, I wanted to generate a comment classifier that could determine whether a comment post would garner karma, based on a Bayesian classification tuned to comment content. I may still investigate tuning the classifier for email spam messages and see whether such a classifier can guess at the karma response for comment systems. But for now, the question is answered. Thank you all for your input.
Comments (3)
Verifying with just a calculator, it seems OK for the non-spam phrase you posted. In that case you have $pProducts a couple of orders of magnitude smaller than $pSums.
Try running some real spam from your spam folder, where you'd meet probabilities like 0.8. And guess why spammers sometimes try to send a piece of newspaper in a hidden frame along with the message :)
If your filter is not biased (Pr(S)=Pr(H) = 0.5) then: "It is also advisable that the learned set of messages conforms to the 50% hypothesis about repartition between spam and ham, i.e. that the datasets of spam and ham are of same size."
This means that you should train your Bayesian filter on a similar number of spam and ham messages, say 1000 spam messages and 1000 ham messages.
I'd assume (not checked) that if your filter is biased, the learning set should conform to the hypothesis about any message being spam.
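To illustrate the unbiased case with a sketch (my own, not from the answer): when Pr(S) = Pr(H) = 0.5, the priors cancel out of pSpaminess(), so the per-word probability reduces to the relative likelihoods alone. The method name is hypothetical:
// per-word spaminess under the 50% hypothesis Pr(S) = Pr(H) = 0.5;
// the priors cancel, leaving only the word likelihoods
public function pSpaminessUnbiased($word)
{
    $pws = $this->pWordInSpam($word); // Pr(W|S)
    $pwh = $this->pWordInHam($word);  // Pr(W|H)
    return $pws / ($pws + $pwh);
}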
On the idea of compensating for message lengths, you could estimate, for each set, the probability that a word of a message is a specific word, then use a Poisson distribution to estimate the probability that a message of N words contains that specific word.
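A sketch of what that suggestion might look like (my own illustration; the function and parameter names are hypothetical): if $tokenRate is the fraction of all tokens in a class that are the word in question, then under a Poisson model the expected number of occurrences in an N-word message is N * $tokenRate, and the probability of seeing the word at least once is 1 - exp(-N * $tokenRate).
// probability that a message of $messageLength words contains the word at least once,
// given the word's per-token rate in the class (spam or ham)
function pWordInMessageOfLength($tokenRate, $messageLength)
{
    $lambda = $tokenRate * $messageLength; // expected occurrences in the message
    return 1 - exp(-$lambda);              // Poisson P(count >= 1)
}
// example: a word making up 0.1% of a class's tokens, in a 50-word message,
// gives roughly 1 - exp(-0.05), i.e. about 0.049
echo pWordInMessageOfLength(0.001, 50) . "\n";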