How to write a spam filter
I have to write a simple spam filter, and I'm not really sure how to go about it.
So far I've come up with wordlist and domain filtering, which add or remove points; a message that reaches a certain threshold is treated as spam.
For example, if you're writing about "v1agr4" from a blacklisted domain, you'd get 2 spam points, but if you're writing about "v1agr4" from a hotmail.com account, you'd get only 1 spam point.
Do you have any other suggestions or resources?
This is more about learning how spam filters work than developing something enterprise-grade.
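For illustration, the point scheme described in the question could be sketched like this. It is a minimal sketch, not a recommended design: the word list, the domain list, the weights, the threshold of 2, and all function names are assumptions made up for the example.

```python
import re

# Hypothetical data: every entry and weight here is an illustrative assumption.
SPAM_WORDS = {"v1agr4": 1}
BLACKLISTED_DOMAINS = {"spam-domain.example": 1}
THRESHOLD = 2  # assumed cut-off: a score at or above this counts as spam


def sender_domain(address):
    """Return the lowercased domain part of an email address."""
    return address.rsplit("@", 1)[-1].lower()


def spam_score(sender, body):
    """Add up the points from the word list and the domain list."""
    score = 0
    # Count each distinct word once so repetition does not inflate the score.
    for word in set(re.findall(r"[a-z0-9]+", body.lower())):
        score += SPAM_WORDS.get(word, 0)
    score += BLACKLISTED_DOMAINS.get(sender_domain(sender), 0)
    return score


def is_spam(sender, body):
    """Apply the threshold to the total score."""
    return spam_score(sender, body) >= THRESHOLD
```

With these assumed lists, a message mentioning "v1agr4" from the blacklisted domain scores 2 and is flagged, while the same message from hotmail.com scores 1 and passes.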
Comments (6)
Some really good algorithm info here:
http://www.paulgraham.com/spam.html
http://www.paulgraham.com/better.html
But, seriously, why reinvent the wheel?
Just download K9: http://keir.net/k9.html
Some open-source Java projects related to Bayesian spam filtering (which LFSR Consulting mentioned):
And one extra for C++:
Look into Bayesian spam filtering.
I know Perl has a library for it, so I'd assume Java has one too.
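To give a rough idea of what Bayesian filtering involves, here is a minimal sketch of a multinomial naive Bayes classifier with Laplace smoothing. It is a textbook variant, not the exact method from the Paul Graham essays linked elsewhere in this thread, and the class name, tokenizer, and training messages are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesFilter:
    """Hypothetical minimal classifier; train on both labels before scoring."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the words of one message labelled 'spam' or 'ham'."""
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | text); needs at least one message per label."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Start from the prior: the share of training messages with this label.
            log_score = math.log(self.message_counts[label] / total_messages)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in vocabulary:
                    continue  # words never seen in training carry no evidence
                # Laplace smoothing: add 1 so a missing word is never probability 0.
                count = self.word_counts[label][word] + 1
                log_score += math.log(count / (total_words + len(vocabulary)))
            log_scores[label] = log_score
        # Normalise in log space to avoid underflow on long messages.
        highest = max(log_scores.values())
        spam = math.exp(log_scores["spam"] - highest)
        ham = math.exp(log_scores["ham"] - highest)
        return spam / (spam + ham)
```

A usage example with invented training data:

```python
f = NaiveBayesFilter()
f.train("buy cheap v1agr4 now", "spam")
f.train("cheap pills buy now", "spam")
f.train("meeting at noon tomorrow", "ham")
f.train("lunch tomorrow at noon", "ham")
print(f.spam_probability("buy cheap pills") > 0.5)   # True
print(f.spam_probability("meeting tomorrow") > 0.5)  # False
```

A real filter would need far more training data and a smarter tokenizer (headers, URLs, obfuscated words), but the scoring idea stays the same.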
I've written one with all the bells and whistles.
You can delegate that to a distributed service. Akismet is a very good solution.
How you write a spam filter depends on how scalable it needs to be.
If you want a scalable solution, content filtering is probably not the smart choice, because it is very CPU- and memory-intensive. You would rather choose reputation-based or blacklist-based filtering, which is far easier on your server's CPU and much easier to write.
I wrote a post on my blog that explains the idea behind writing a spam filter from a programmer's point of view and covers all the options, from content-based filtering to blacklist-based filtering.
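One common form of blacklist-based filtering is a DNS blacklist (DNSBL) lookup; whether that is what this answer had in mind is an assumption. By convention the sending IPv4 address is reversed octet by octet, the list's zone is appended, and the resulting name is resolved: if it resolves, the address is listed. A rough sketch follows, where `dnsbl.example.org` is a placeholder zone and the function names are invented for the example.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNSBL query name: reversed IPv4 octets followed by the zone."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    reversed_octets = ".".join(reversed(str(address).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """Return True if the query name resolves, meaning the IP is on the list.

    `resolve` is injectable so the check can be tried without network access.
    A failed lookup (typically NXDOMAIN) is treated as "not listed".
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return True
```

For example, `dnsbl_query_name("192.0.2.99", "dnsbl.example.org")` gives `99.2.0.192.dnsbl.example.org`. The check costs one DNS query per connection and never has to parse the message body, which is why it is so much cheaper than content filtering.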