Bayesian filtering for forum posts

Posted on 2024-08-22 01:17:40

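Has anyone used a Bayesian filter to let forum members classify posts, so that over time the forum shows only the interesting ones? Bayesian filters seem to work well for detecting spam. Is implementing a Bayesian filter a viable way to filter forum posts for users?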

冷血 2024-08-29 01:17:40

The difficulty with trying to classify interesting/good forum posts via Bayesian classifiers or any other automated classification system is the probable lack of correlation between the words and/or word structure of postings vs. their relative value or utility.

SPAM filters work primarily because the word choices and structure are systematically unusual overall: the spammer is trying to promote a specific product, service, etc. There are reasonable correlations and patterns that can be learned, though spammers can try to increase the difficulty of doing so via various techniques.
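To make the spam case concrete, here is a minimal sketch of a word-based naive Bayes classifier with add-one smoothing. The class name, the whitespace tokenizer, and the toy training messages are all made up for illustration; a real filter would need far more data and better tokenization.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()


class NaiveBayes:
    """Hypothetical minimal multinomial naive Bayes over word counts."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of words
        self.doc_counts = Counter()  # label -> number of training texts
        self.vocab = set()

    def train(self, text, label):
        self.doc_counts[label] += 1
        counts = self.word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            self.vocab.add(word)

    def log_scores(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Log prior plus sum of log likelihoods (add-one smoothing).
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocab))
                )
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy data, invented for this example.
nb = NaiveBayes()
nb.train("buy cheap pills now", "spam")
nb.train("cheap pills discount buy", "spam")
nb.train("meeting notes for the project", "ham")
nb.train("project review meeting tomorrow", "ham")

spam_guess = nb.classify("cheap pills")
ham_guess = nb.classify("project meeting")
```

This only works because the words themselves separate the two classes, which is exactly the property that good vs. bad forum posts probably lack.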

Such word/structure patterns are unlikely to exist for good vs. bad forum posts. However, there is an alternative way to restructure the problem that might be useful:

1. Allow users to classify posts as good or bad or otherwise rank them as you described.
2. Use Bayesian classifiers or some other statistical inference method to identify the forum users whose rankings correlate most strongly with the ranking behavior of the overall community, i.e., the users who have the best taste and are good predictors for how the community as a whole would view the content.
3. Use forum post rankings from the pool of good-predictor users identified in step #2 to filter forum posts. This requires that one or more such users actually rank the new content at some point, so this pool needs to be of some size and include regular users for such a filtering system to be useful.
4. This classifier system will require periodic rebuilding as the community of users is presumably dynamic, has changing interests, etc.
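The steps above could be sketched roughly as follows. Note that this swaps in a plain leave-one-out agreement rate for the statistical inference step, which is a simplification rather than a Bayesian method. The function names, the thresholds (`min_votes`, `min_agreement`), the +1/-1 vote encoding, and the sample users are all assumptions made for illustration.

```python
from collections import defaultdict


def find_predictor_users(ratings, min_votes=3, min_agreement=0.8):
    """ratings: {user: {post_id: +1 or -1}}.

    Returns {user: agreement_rate} for users whose votes usually match
    the majority of everyone else's votes on the same post.
    """
    votes_by_post = defaultdict(dict)
    for user, votes in ratings.items():
        for post_id, vote in votes.items():
            votes_by_post[post_id][user] = vote

    trusted = {}
    for user, votes in ratings.items():
        agree = total = 0
        for post_id, vote in votes.items():
            # Leave-one-out: the consensus excludes this user's own vote.
            others = sum(
                v for u, v in votes_by_post[post_id].items() if u != user
            )
            if others == 0:
                continue  # no clear consensus on this post, skip it
            total += 1
            if (others > 0) == (vote > 0):
                agree += 1
        if total >= min_votes and agree / total >= min_agreement:
            trusted[user] = agree / total
    return trusted


def filter_posts(new_votes, trusted):
    """new_votes: {post_id: {user: vote}}. Keep posts that the trusted
    pool rates positively on balance; unrated posts are not shown."""
    visible = []
    for post_id, votes in new_votes.items():
        score = sum(v for u, v in votes.items() if u in trusted)
        if score > 0:
            visible.append(post_id)
    return visible


# Invented sample data: three like-minded users, one contrarian, one mixed.
ratings = {
    "alice": {"p1": 1, "p2": -1, "p3": 1, "p4": -1},
    "bob":   {"p1": 1, "p2": -1, "p3": 1, "p4": -1},
    "carol": {"p1": 1, "p2": -1, "p3": 1, "p4": -1},
    "dave":  {"p1": -1, "p2": 1, "p3": -1, "p4": 1},
    "erin":  {"p1": 1, "p2": -1, "p3": 1, "p4": 1},
}
trusted = find_predictor_users(ratings)

new_votes = {
    "p5": {"alice": 1, "dave": -1},
    "p6": {"dave": 1, "erin": 1},
    "p7": {"bob": -1, "carol": 1, "alice": -1},
}
visible = filter_posts(new_votes, trusted)
```

Rebuilding (step 4) would amount to rerunning `find_predictor_users` on recent votes at some interval. Since each user's score touches every vote on every post they rated, this is also where the scaling concern for large communities would show up.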

How well the approach I've proposed would actually work on your problem depends a lot on the nature of the forum, how willing users are to rank content, and how much they have in common for how they perceive the value of posted content. Also, the overall size of the user community could be a factor: if it's too small, there might not be enough data to work with; if too large, you could have computational scaling problems running the classifier inference method against the ranking data.
