Bayesian filtering of forum posts
Has anyone used a Bayesian filter to let forum members classify posts, so that over time the forum displays only interesting posts? A Bayesian filter seems to work well for detecting email spam. Is implementing a Bayesian filter a viable way to filter forum posts for users?
Comments (2)
The difficulty with trying to classify interesting/good forum posts via Bayesian classifiers, or any other automated classification system, is the probable lack of correlation between the words and/or word structure of postings and their relative value or utility.
Spam filters work primarily because the word choices and structure of spam are systematically unusual overall: the spammer is trying to promote a specific product, service, etc. There are reasonable correlations and patterns that can be learned, though spammers can try to make that harder via various techniques.
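To make the word-correlation point concrete, here is a minimal sketch of a word-count naive Bayes classifier of the kind spam filters are built on. The class name `NaiveBayes`, the whitespace tokenizer, the add-one smoothing, and the toy training messages are all assumptions for illustration, not the internals of any particular filter.

```python
import math
from collections import Counter


def tokenize(text):
    """Very crude tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()           # every word seen during training

    def train(self, text, label):
        self.doc_counts[label] += 1
        counts = self.word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            self.vocab.add(word)

    def log_scores(self, text):
        """Return label -> log P(label) + sum of log P(word | label)."""
        total_docs = sum(self.doc_counts.values())
        words = tokenize(text)
        scores = {}
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(counts.values())
            for word in words:
                # Add-one smoothing so unseen words do not zero out a label.
                score += math.log((counts[word] + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy data (hypothetical): the "spam" messages share distinctive vocabulary.
nb = NaiveBayes()
nb.train("buy cheap pills now", "spam")
nb.train("cheap watches buy now", "spam")
nb.train("meeting notes attached", "ham")
nb.train("see you at lunch", "ham")
print(nb.classify("buy cheap watches"))
```

The classifier only has word frequencies to go on, so it works when a class has distinctive vocabulary, as spam does. If "interesting" and "boring" posts use much the same words, the per-label scores end up close together and the prediction carries little signal.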
Such word/structure patterns are unlikely to exist for good vs. bad forum posts. However, there is an alternative way to restructure the problem that might be useful:
How well the approach I've proposed would actually work on your problem depends a lot on the nature of the forum, how willing users are to rank content, and how much they agree on the value of posted content. The overall size of the user community could also be a factor: if it's too small, there might not be enough data to work with; if it's too large, you could run into computational scaling problems when running the classifier inference method against the ranking data.
Wouldn't collaborative filtering work better?
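For readers unfamiliar with the term, a rough sketch of user-based collaborative filtering might look like the following. It predicts how a user would rate a post from the ratings of users who have voted similarly in the past, ignoring the post's text entirely. The function names, the +1/-1 rating scheme, the cosine similarity measure, and the sample data are all assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two users' rating dicts (post_id -> rating)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[p] * b[p] for p in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict(ratings, user, post):
    """Similarity-weighted average of other users' ratings for `post`.

    Only users with positive similarity to `user` contribute.
    Returns None when no such user has rated the post.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or post not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        weighted_sum += sim * other_ratings[post]
        weight_total += sim
    return weighted_sum / weight_total if weight_total else None


# Hypothetical votes: +1 = interesting, -1 = not interesting.
ratings = {
    "alice": {"p1": 1, "p2": -1, "p3": 1},
    "bob": {"p1": 1, "p2": -1, "p4": 1},
    "carol": {"p1": -1, "p2": 1, "p4": -1},
}
print(predict(ratings, "alice", "p4"))
```

In this toy data alice has voted like bob and opposite to carol, so the prediction for "p4" follows bob's vote. A post nobody similar has rated yields None, which is the cold-start problem any such scheme has to handle somehow.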