Naive Bayes classification (spam filtering): which calculation is correct?

Posted 2024-09-01 02:02:31


I am implementing a Naive Bayes classifier for spam filtering. I have a doubt about some of the calculations. Please clarify what I should do. Here is my question.

In this method, you have to calculate

P(S|W) = P(W|S) * P(S) / (P(W|S) * P(S) + P(W|H) * P(H))

P(S|W) -> Probability that Message is spam given word W occurs in it.

P(W|S) -> Probability that word W occurs in a spam message.

P(W|H) -> Probability that word W occurs in a Ham message.
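Once P(W|S) and P(W|H) are estimated, the formula above combines them into P(S|W). A minimal Python sketch, using made-up illustrative probabilities rather than real estimates:

```python
# Combine P(W|S) and P(W|H) into P(S|W) via Bayes' rule.
# The numbers passed in below are made-up illustrative values.
def p_spam_given_word(p_w_s, p_w_h, p_spam=0.5):
    """P(S|W) = P(W|S)P(S) / (P(W|S)P(S) + P(W|H)P(H)), with P(H) = 1 - P(S)."""
    p_ham = 1.0 - p_spam
    return (p_w_s * p_spam) / (p_w_s * p_spam + p_w_h * p_ham)

print(p_spam_given_word(0.05, 0.01))  # 0.025 / 0.030 ≈ 0.833
```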

So to calculate P(W|S), which of the following is correct:

  1. (Number of times W occurs in spam messages) / (total number of times W occurs in all messages)

  2. (Number of times W occurs in spam messages) / (total number of words in all spam messages)

So, to calculate P(W|S), should I do (1) or (2)? (I think it is (2), but I am not sure.)

I am referring to http://en.wikipedia.org/wiki/Bayesian_spam_filtering for the information, by the way.

I have to complete the implementation by this weekend :(


Shouldn't repeated occurrences of the word 'W' increase a message's spam score? With your approach they wouldn't, right?

Let's say we have 100 training messages, of which 50 are spam and 50 are ham, and say the word count of each message is 100.

And let's say that in the spam messages word W occurs 5 times in each message, while in each ham message word W occurs once.

So the total number of times W occurs in all the spam messages = 5*50 = 250.

And the total number of times W occurs in all the ham messages = 1*50 = 50.

Total occurrences of W in all of the training messages = 250 + 50 = 300.

So, in this scenario, how do you calculate P(W|S) and P(W|H) ?

Naturally we should expect, P(W|S) > P(W|H) right?
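To make the two candidate estimates concrete, here is a small Python sketch using the numbers from the scenario above, contrasting a count-based (multinomial-style) estimate with a per-message presence estimate:

```python
# Numbers from the scenario above: 50 spam and 50 ham training messages,
# 100 words each; W occurs 5 times per spam message and once per ham message.
spam_msgs, ham_msgs = 50, 50
words_per_msg = 100
w_per_spam_msg, w_per_ham_msg = 5, 1

# Count-based (multinomial-style) estimate: occurrences of W / total words in the class.
p_w_spam = (w_per_spam_msg * spam_msgs) / (words_per_msg * spam_msgs)  # 250/5000 = 0.05
p_w_ham = (w_per_ham_msg * ham_msgs) / (words_per_msg * ham_msgs)      # 50/5000 = 0.01

# Presence-based (Bernoulli-style) estimate: messages containing W / messages in the class.
# Here every message in both classes contains W at least once, so both are 1.0.
p_w_spam_presence = spam_msgs / spam_msgs  # 1.0
p_w_ham_presence = ham_msgs / ham_msgs     # 1.0

print(p_w_spam, p_w_ham)  # 0.05 0.01, so P(W|S) > P(W|H), as expected
```

Under the count-based estimate the repeated occurrences do matter (0.05 versus 0.01), while the presence-based estimate cannot separate the classes in this particular scenario.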


Comments (3)

谁把谁当真 2024-09-08 02:02:31


P(W|S) = (Number of spam messages containing W) / (Number of all spam messages)
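This is the per-message (presence) estimate. A minimal sketch, with a tiny hypothetical training set:

```python
# Estimate P(W|S) as (number of spam messages containing W) / (number of spam messages).
# The messages below are hypothetical examples for illustration only.
spam = ["win viagra now", "viagra cheap offer", "meeting about viagra"]
ham = ["meeting tomorrow", "lunch plans", "viagra study results"]

def p_word_given_class(word, messages):
    """Fraction of messages in which the word appears at least once."""
    containing = sum(1 for m in messages if word in m.split())
    return containing / len(messages)

p_w_s = p_word_given_class("viagra", spam)  # 3/3 = 1.0
p_w_h = p_word_given_class("viagra", ham)   # 1/3
```

In practice you would also smooth these estimates (e.g. Laplace smoothing) so that a word never seen in one class does not produce a zero probability.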

一身仙ぐ女味 2024-09-08 02:02:31


Though this is quite an old question, none of the answers is complete, so it is worth correcting them.

Naive Bayes is not a single algorithm, but instead a family of algorithms, based on the same Bayes rule:

P(C | x⃗) = P(x⃗ | C) * P(C) / P(x⃗)

where C is a class (ham or spam in this example) and x⃗ is a vector of attributes (words, in the simplest case).
P(C) is just the proportion of messages of class C in the whole dataset. P(x⃗) is the probability of occurrence of a message with the attributes described by vector x⃗, but since this term is the same for all classes, we can omit it for the moment. This question, however, is about P(x⃗|C): how should one calculate it given the vector x⃗ of the current message?

Actually, the answer depends on the concrete type of NB algorithm. There are several of them, including multivariate Bernoulli NB, multivariate Gaussian NB, and multinomial NB with numeric or boolean attributes, among others. For details on calculating P(x⃗|C) for each of them, as well as a comparison of NB classifiers for the task of spam filtering, see this paper.
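As one concrete member of that family, here is a hand-rolled sketch of multinomial NB with Laplace smoothing; the toy training messages are hypothetical:

```python
import math
from collections import Counter

# Hypothetical toy training data.
spam = ["win viagra now", "viagra viagra cheap"]
ham = ["meeting tomorrow", "lunch tomorrow plans"]

def word_counts(messages):
    """Per-word occurrence counts and the total word count for one class."""
    counts = Counter(w for m in messages for w in m.split())
    return counts, sum(counts.values())

spam_counts, spam_total = word_counts(spam)
ham_counts, ham_total = word_counts(ham)
vocab_size = len(set(spam_counts) | set(ham_counts))

def log_p_msg_given_class(msg, counts, total):
    # Multinomial P(x|C): product over word occurrences of P(w|C), computed in
    # log space, with Laplace smoothing: (count(w, C) + 1) / (total words in C + |V|).
    return sum(math.log((counts[w] + 1) / (total + vocab_size))
               for w in msg.split())

msg = "viagra now"
log_spam = math.log(0.5) + log_p_msg_given_class(msg, spam_counts, spam_total)  # P(spam) = 0.5
log_ham = math.log(0.5) + log_p_msg_given_class(msg, ham_counts, ham_total)
prediction = "spam" if log_spam > log_ham else "ham"
print(prediction)  # spam
```

A Bernoulli NB would instead binarize the features and also multiply in (1 - P(w|C)) for vocabulary words absent from the message.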

薄情伤 2024-09-08 02:02:31


In this Bayesian formula, W is your "feature", i.e., the thing you observe.

You must first carefully define what W is. Often you have many alternatives.

Let's say that, in a first approach, you say W is the event "message contains the word Viagra". (That is to say, W has two possible values: 0 = "message does not contain the word V...", 1 = "message contains at least one occurrence of that word".)

In that scenario, you're right: P(W|S) is "the probability that word W appears (at least once) in a spam message."
And to estimate (rather than "calculate") it, you count, as the other answer says: (number of spam messages containing at least one occurrence of the word V...) / (number of all spam messages).

An alternative approach would be: define "W = number of occurrences of the word Viagra in a message". In this case, we would estimate P(W|S) for each value of W: P(W=0|S), P(W=1|S), P(W=2|S), ...
More complicated, and more samples are needed, but (hopefully) better performance.
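A sketch of that count-valued variant, estimating P(W=k|S) separately for each observed count k; the counts below are hypothetical:

```python
from collections import Counter

# Hypothetical: number of times "Viagra" occurs in each of 10 training spam messages.
viagra_counts_in_spam = [0, 1, 2, 2, 5, 1, 0, 2, 1, 1]

# Empirical distribution of the count: P(W=k|S) = (messages with count k) / (messages).
freq = Counter(viagra_counts_in_spam)
n = len(viagra_counts_in_spam)
p_w_given_spam = {k: c / n for k, c in freq.items()}

print(p_w_given_spam[2])  # P(W=2|S) = 3/10 = 0.3
```

Note that counts never seen in training get probability zero here; a real implementation would smooth the estimates or bin the counts.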
