Effectiveness of naive Bayesian spam filtering
How effective is naive Bayesian filtering for filtering spam?
I heard that spammers easily bypass them by stuffing extra non-spam-related words. What programming techniques can you use with Bayesian filters to prevent that?
Comments (4)
Paul Graham was the guy to really introduce the idea of using Bayesian spam filtering to the web at large with his original article A Plan for Spam, back in August 2002. Then, his follow-up a year or so later introduced many of the problems that swiftly arose. These are still pretty great works on the topic.
In the second article, Graham mentions using CRM114, which works on a much wider set of patterns than just space-delimited words. CRM114 is cool, but comes without much implementation help for a spam filtering system.
There are also open-source power tools for Bayesian spam filtering, such as Death2Spam and SpamProbe.
I find nothing works quite like filtering mail through a Gmail account. Happy hunting.
I think for defeating the kind of spam attack you mention, the important thing is not the learning method but rather what features you train on. I use Fidelis Assis's OSBF-Lua which is a very successful filter: it keeps winning contests for spam filters. It uses Bayesian learning but I think the real reason for its success is three principles:
It trains not on single words but on sparse bigrams: a pair of words separated by 0 to 4 "don't care" words. The spammers have to put their message in somewhere and the sparse bigrams are very good at sussing them out. It even finds attachment spam!
It does extra training on message headers, because these are hard for spammers to disguise. Example: a message that originates on your network and never passes through an off-network relay host is probably not spam.
If the spam filter has low confidence about its classification, it requests input from a human. (In practice it adds a header field saying "please train me on this message"; the human can ignore the request.) This means that as the spammers evolve new techniques, your filter evolves to match.
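To make the sparse bigram idea concrete, here is a minimal Python sketch. It is not OSBF-Lua's actual code: the function name `sparse_bigrams`, the whitespace tokenizer, and the choice to store the gap size as part of each feature are all assumptions made for illustration.

```python
def sparse_bigrams(tokens, max_gap=4):
    """Return (first, gap, second) features for every pair of words
    separated by 0..max_gap skipped ("don't care") words.

    Keeping the gap in the feature is an assumption of this sketch.
    """
    features = []
    for i, first in enumerate(tokens):
        for gap in range(max_gap + 1):
            j = i + gap + 1  # index of the second word of the pair
            if j >= len(tokens):
                break
            features.append((first, gap, tokens[j]))
    return features


# Hypothetical message, split on whitespace for simplicity.
tokens = "buy cheap pills online now".split()
for feature in sparse_bigrams(tokens, max_gap=1):
    print(feature)
```

Because each feature ties two nearby words together, padding a message with unrelated words adds new pairs but does not remove the pairs formed by the spammer's actual pitch.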
This combination of techniques is extremely effective.
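The "please train me" request can be sketched in a few lines of Python using the standard `email` module. The header name `X-Please-Train`, the function `tag_if_uncertain`, and the 0.4 to 0.6 uncertainty band are hypothetical; the real filter's header and thresholds may differ, and the spam probability is assumed to come from whatever classifier you already run.

```python
from email import message_from_string


def tag_if_uncertain(raw_message, spam_probability, low=0.4, high=0.6):
    """Parse a raw message and add a training-request header when the
    classifier's spam probability falls inside the uncertain band.

    The header name and the band limits are illustrative assumptions.
    """
    msg = message_from_string(raw_message)
    if low <= spam_probability <= high:
        msg["X-Please-Train"] = "yes"
    return msg


raw = "From: a@example.com\nSubject: hello\n\nJust checking in.\n"
tagged = tag_if_uncertain(raw, spam_probability=0.55)
print(tagged["X-Please-Train"])
```

A mail client rule can then surface tagged messages for manual labeling, and the user remains free to ignore them.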
Disclaimer: I have worked with Fidelis on refactoring some of the software so that it can be used for other purposes such as classifying regular mail into groups or possibly one day trying to detect spam in blog comments and other places.
You're right, naive Bayesian filters are susceptible to Bayesian poisoning.
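A toy example shows why word stuffing works against a single-word naive Bayes model. The sketch below is a deliberately small classifier with add-one smoothing and equal class priors; the training messages, function names, and the decision to ignore unseen words are all assumptions for illustration, not how any particular filter is implemented.

```python
import math
from collections import Counter


def train(messages):
    """Count word occurrences across a list of messages."""
    counts = Counter()
    for text in messages:
        counts.update(text.lower().split())
    return counts


def spam_log_odds(text, spam_counts, ham_counts):
    """Sum of per-word log(P(word|spam) / P(word|ham)), add-one smoothed.

    Equal priors are assumed; words never seen in training are skipped.
    """
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values()) + len(vocab)
    ham_total = sum(ham_counts.values()) + len(vocab)
    score = 0.0
    for word in text.lower().split():
        if word not in vocab:
            continue
        p_spam = (spam_counts[word] + 1) / spam_total
        p_ham = (ham_counts[word] + 1) / ham_total
        score += math.log(p_spam / p_ham)
    return score


# Tiny hypothetical training sets.
spam_counts = train(["buy cheap pills now", "cheap pills online"])
ham_counts = train(["meeting agenda for monday", "lunch on monday with the team"])

plain = "buy cheap pills"
padded = plain + " meeting agenda monday lunch team"
print(spam_log_odds(plain, spam_counts, ham_counts))
print(spam_log_odds(padded, spam_counts, ham_counts))
```

The padded message scores lower than the plain one because every stuffed word contributes independent evidence toward "not spam", which is exactly the independence assumption that poisoning exploits.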
I use Popfile to not only sort away spam but also sort my email into categories and I find it hugely effective. It uses naive Bayesian filters.