Lightweight Bayesian filter for larger data sets
I would like to add yet another spam detector to my CMS. I currently see three options:
- use a simple PHP class and store tokens in MySQL
- install SpamAssassin and use a PHP connector
- something big like Mahout
I do not like the MySQL approach, because I fear it will grow very large over time and degrade the performance of the whole system. The SpamAssassin approach seems more attractive, but people all over the internet write that SA's rules are focused on mail and headers, so it is not an ideal fit. Last but not least, I am aware of Mahout, but I fear it might be a bit too big and create a lot of administration overhead.
Is there something nice, small and efficient that could be run on a Linux server and accessed from PHP?
Comments (1)
The simplest approach would be to store the tokens in MySQL, but I don't know how well this works.
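To make the token idea concrete, here is a minimal sketch of a token-count naive Bayes filter in plain Java. Everything in it is an assumption for illustration: the class name `TokenBayesSketch`, the tokenizer, and the smoothing choice are mine, and the in-memory maps merely stand in for a hypothetical MySQL table with columns like (token, spam_count, ham_count).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Token-count naive Bayes sketch. The in-memory maps stand in for a
// hypothetical MySQL table such as (token, spam_count, ham_count).
class TokenBayesSketch {
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int spamTokens, hamTokens; // total token occurrences per class
    private int spamDocs, hamDocs;     // training documents per class

    // Lowercase the text and split on anything that is not a letter or digit.
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    // Add one labeled document to the counts.
    void train(String text, boolean isSpam) {
        if (isSpam) spamDocs++; else hamDocs++;
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue; // split() can yield an empty string
            vocabulary.add(token);
            if (isSpam) { spamCounts.merge(token, 1, Integer::sum); spamTokens++; }
            else { hamCounts.merge(token, 1, Integer::sum); hamTokens++; }
        }
    }

    // Estimate P(spam | text) with add-one (Laplace) smoothing, in log space.
    double spamProbability(String text) {
        int v = Math.max(vocabulary.size(), 1);
        double logOdds = Math.log((spamDocs + 1.0) / (hamDocs + 1.0));
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            double pSpam = (spamCounts.getOrDefault(token, 0) + 1.0) / (spamTokens + v);
            double pHam = (hamCounts.getOrDefault(token, 0) + 1.0) / (hamTokens + v);
            logOdds += Math.log(pSpam / pHam);
        }
        return 1.0 / (1.0 + Math.exp(-logOdds));
    }
}
```

Under that assumed table layout there is one row per distinct token, so the storage grows with the vocabulary rather than with every message that gets classified.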
If you want to classify text into spam/not-spam categories, I think Mahout is a good choice. It is built for big data and thus requires a Hadoop setup if you want map/reduce - but there is also a lightweight alternative you could probably use: the logistic regression algorithm in Mahout.
There is a ModelSerializer class with which you can store your trained model in binary format on your hard disk or somewhere else - so you don't have to set up Hadoop.
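To show the general idea of training a small model and storing it in binary form, here is a plain-Java sketch of online logistic regression over hashed token features. It deliberately does not use Mahout or its ModelSerializer: the class `HashedLogisticSketch`, its methods, and the file layout are all hypothetical stand-ins for what the library would do for you.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

// Plain-Java sketch of online logistic regression with hashed token features.
// It does NOT use Mahout; all names here are hypothetical.
class HashedLogisticSketch {
    private final double[] weights; // one weight per hash slot
    private double bias;
    private final double learningRate;

    HashedLogisticSketch(int slots, double learningRate) {
        this.weights = new double[slots];
        this.learningRate = learningRate;
    }

    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    // Hashing trick: map a token to a fixed slot, so the model size stays constant.
    private int slot(String token) {
        return Math.floorMod(token.hashCode(), weights.length);
    }

    // Predicted probability that the text is spam.
    double predict(String text) {
        double z = bias;
        for (String token : tokenize(text)) {
            if (!token.isEmpty()) z += weights[slot(token)];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // One stochastic gradient step on a single labeled example.
    void train(String text, boolean isSpam) {
        double error = (isSpam ? 1.0 : 0.0) - predict(text);
        bias += learningRate * error;
        for (String token : tokenize(text)) {
            if (!token.isEmpty()) weights[slot(token)] += learningRate * error;
        }
    }

    // Write the model to disk in a simple binary layout.
    void save(Path file) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(weights.length);
            out.writeDouble(learningRate);
            out.writeDouble(bias);
            for (double w : weights) out.writeDouble(w);
        }
    }

    // Read a model previously written by save().
    static HashedLogisticSketch load(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            HashedLogisticSketch model = new HashedLogisticSketch(in.readInt(), in.readDouble());
            model.bias = in.readDouble();
            for (int i = 0; i < model.weights.length; i++) model.weights[i] = in.readDouble();
            return model;
        }
    }
}
```

Because the weight array has a fixed number of slots, the saved file stays the same size no matter how much text the model is trained on, which is the property that makes this kind of model attractive when a growing token table is a concern.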
You could try:
There is the following class you could use as a code example for your problem:
Here are some more resources regarding Mahout on the web.
So, to access this from PHP, you could build a small RESTful web service in Java or simply a command-line interface.
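As a rough illustration of the command-line variant, the sketch below takes the text as arguments and prints a label. The class name `SpamCli` and the word list are hypothetical, and the scorer is only a placeholder for whatever trained model you would actually load.

```java
import java.util.Locale;
import java.util.Set;

// Command-line wrapper sketch. The scorer is a placeholder, not a trained model.
class SpamCli {
    // Hypothetical word list standing in for a real model loaded from disk.
    private static final Set<String> SPAM_WORDS = Set.of("casino", "pills", "lottery");

    // Placeholder score: fraction of tokens that appear in SPAM_WORDS.
    static double score(String text) {
        int total = 0, hits = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) continue;
            total++;
            if (SPAM_WORDS.contains(token)) hits++;
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    // Turn the score into the label that is printed for the caller.
    static String label(String text) {
        return score(text) >= 0.5 ? "spam" : "ham";
    }

    // Usage: java SpamCli "text to classify"
    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: java SpamCli <text>");
            return;
        }
        System.out.println(label(String.join(" ", args)));
    }
}
```

From PHP, something like `shell_exec('java SpamCli ' . escapeshellarg($text))` should return the printed label with a trailing newline, assuming the compiled class is on the classpath. Each call starts a new JVM, so a long-running web service is likely the better fit if you classify many comments.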
Hope this helps a little bit.