Is there a free anti-spam database?
WordPress has a spam filtering plugin called Akismet that seems to be able to classify any block of text as spam or not. The only caveat is that you need to go through their interface, and their database/algorithm is not open source or otherwise readily available.
There are also commercial providers that provide a web accessible API for you to classify the emails, comments or any other text being submitted by users in your web application.
Is there any sort of open source or freely accessible database that can classify a block of text as spam/non-spam?
Edit: Here's a clearer explanation of what I want
Basically I was hoping that there was an extensive database out there with the probabilities of certain phrases being spam. Since (I'm assuming) spammers spam all email addresses equally, by pre-populating my Bayesian spam filter with this database, I could create an application that starts off by capturing most spam without any user training.
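To make the idea concrete, here is a minimal sketch of a Bayesian-style scorer that starts from a pre-populated table instead of user training. The table `SEED_PROBS`, its values, and the `UNKNOWN_PROB` prior are all made up for illustration; a real seed table would have to come from whatever shared corpus or database is available.

```python
import math
import re

# Hypothetical seed table: token -> assumed P(spam | token).
# These numbers are invented for illustration, not taken from any real database.
SEED_PROBS = {
    "viagra": 0.99,
    "winner": 0.90,
    "free": 0.80,
    "thanks": 0.15,
    "meeting": 0.10,
}

UNKNOWN_PROB = 0.4  # assumed probability for tokens missing from the table


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def spam_probability(text, probs=SEED_PROBS):
    """Combine per-token probabilities, naive-Bayes style, in log space."""
    log_spam = 0.0
    log_ham = 0.0
    for token in set(tokenize(text)):
        p = probs.get(token, UNKNOWN_PROB)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Equivalent to prod(p) / (prod(p) + prod(1 - p)); clamped to avoid overflow.
    diff = max(-700.0, min(700.0, log_ham - log_spam))
    return 1.0 / (1.0 + math.exp(diff))
```

A message would then be flagged when `spam_probability(message)` exceeds some threshold (0.9 is a common starting point), and the table could still be refined later with per-user training.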
Comments (3)
Maybe this is totally a dead question - however, check this out:
http://www.stopforumspam.com
Use their API to check the IP address or the entered username or email against their DB. I advise using cURL with its timeout parameter, though, because the service sometimes fails to respond in time.
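A rough sketch of such a lookup with a timeout, written in Python rather than cURL. The endpoint URL, the `&json` flag, and the `{"<field>": {"appears": ...}}` response shape are assumptions based on my recollection of the service's API, so check them against the current documentation. `is_listed` and `fetch_json` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; confirm against the service's own API documentation.
API_URL = "https://api.stopforumspam.org/api"


def fetch_json(url, timeout):
    """Fetch a URL and decode the JSON body, giving up after `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def is_listed(field, value, timeout=3.0, fetch=fetch_json):
    """Return True/False when the service answers, or None when it does not.

    `field` is the query parameter name, e.g. "ip", "email" or "username".
    The response shape read below is an assumption, not a guarantee.
    """
    query = urllib.parse.urlencode({field: value}) + "&json"
    try:
        data = fetch(f"{API_URL}?{query}", timeout)
    except (OSError, ValueError):
        # Timeouts and network errors are OSError subclasses;
        # a malformed JSON body raises ValueError.
        return None
    entry = data.get(field) or {}
    return bool(entry.get("appears"))
```

Returning `None` on a timeout lets the caller decide whether to accept the submission, queue it for moderation, or retry later, instead of blocking the request on a slow third-party service.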
Update based on comment:
I don't think a simple database would do the trick. Most spam is algorithmically generated (e.g. comment spam usually incorporates content from the post). Akismet does a combination of things, probably including link analysis and use of known spam signatures, but they don't publish it.
I've read about some interesting AI projects that classify good content rather than bad. You might also look at Spam Karma, which analyzes blog comments based on a variety of spammy triggers (a response posted immediately after the page loads, etc.).
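As an illustration of trigger-based scoring, here is a small sketch that combines two signals: how quickly the comment was posted after the page loaded, and how many links it contains. This is not Spam Karma's or Akismet's actual rule set; the function name, thresholds, and weights are invented for the example.

```python
import re

LINK_RE = re.compile(r"https?://", re.IGNORECASE)


def spam_score(comment, seconds_since_page_load, min_seconds=5.0, max_links=2):
    """Return a heuristic score; higher means more suspicious.

    Thresholds and weights are illustrative assumptions only.
    """
    score = 0
    # Trigger 1: a response posted almost immediately after the page loaded.
    if seconds_since_page_load < min_seconds:
        score += 2
    # Trigger 2: more links than a typical comment would contain.
    links = len(LINK_RE.findall(comment))
    if links > max_links:
        score += links - max_links
    return score
```

A site would typically hold comments above some score for moderation rather than reject them outright, since each trigger on its own produces false positives.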
Original answer (DNS blacklists):
Probably not exactly what you're looking for, but the MoinMoin Wiki maintainers keep a central list of Wiki spam regular expressions here: http://master.moinmo.in/BadContent
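A minimal sketch of how such a list could be applied, assuming it is plain text with one regular expression per line and `#` for comments (the real page's format may differ). `SAMPLE_PATTERNS` is a made-up stand-in for the downloaded list, and the helper names are hypothetical.

```python
import re

# Made-up sample standing in for a downloaded pattern list.
SAMPLE_PATTERNS = r"""
# blank lines and comment lines are skipped
cheap-pills\.example
casino-[a-z]+\.example
"""


def compile_patterns(text):
    """Compile one regex per non-empty, non-comment line; skip invalid ones."""
    compiled = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        try:
            compiled.append(re.compile(line, re.IGNORECASE))
        except re.error:
            continue  # ignore patterns Python's re module cannot parse
    return compiled


def find_bad_content(body, patterns):
    """Return the source text of every pattern that matches `body`."""
    return [p.pattern for p in patterns if p.search(body)]
```

For example, `find_bad_content(comment_text, compile_patterns(SAMPLE_PATTERNS))` returns an empty list for clean text, and a submission could be rejected or held whenever the list is non-empty.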