A good algorithm for checking whether posting frequency indicates spam

Posted 2024-12-11 03:00:59 | 270 words | 2 views | 0 comments

I have a site where people can post text. Each post is stored in a database along with the poster's IP address and the time of the post. I want to display a reCAPTCHA if I can determine that the poster is a bot, a spammer, etc.

What is a good algorithm to do this? The simplest choice is to check whether the number of posts in a predetermined time period, say one minute, is greater than a chosen limit, say 10. However, this has flaws: it misfires when multiple people post from behind the same IP, and it misses a bot that posts at random intervals longer than the time period, or that posts fewer than the limit within that period.
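
For reference, the simple check described above could be sketched roughly as follows. This is only an illustration: the class name `PostRateChecker`, the in-memory storage, and the constants are assumptions made for the example (a real site would query the posts table instead), and the window and limit are the "one minute" and "10" values mentioned above.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60  # assumed: the "one minute" period from the question
MAX_POSTS = 10       # assumed: the "say 10" limit from the question


class PostRateChecker:
    """Hypothetical helper: keeps recent post timestamps per IP in a sliding window."""

    def __init__(self, window=WINDOW_SECONDS, max_posts=MAX_POSTS):
        self.window = window
        self.max_posts = max_posts
        self._recent = defaultdict(deque)  # ip -> timestamps still inside the window

    def record_post(self, ip, timestamp):
        """Record a post and return True if this IP now exceeds the limit."""
        recent = self._recent[ip]
        recent.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while recent and recent[0] <= timestamp - self.window:
            recent.popleft()
        return len(recent) > self.max_posts
```

A `True` result would be the point at which to show the captcha. The sketch shares the flaws listed above, since it keys only on the IP and on a single fixed window.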

Obviously there is no "correct" answer. However, some algorithms are better than others, and I am just trying to find the best one.

Comments (1)

明月夜 2024-12-18 03:00:59

You can use a limit-based approach and make good use of website analytics.

There must be limits on how many times an IP posts in a single context. For example, for a StackExchange question (the context), my IP address will in most cases post a single answer (not counting comments). More than one answer is uncommon and hence suspicious. In other contexts, such as StackExchange comments, the normal frequency can be up to a few posts.
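
A minimal sketch of such per-context limits, assuming an in-memory counter. The name `ContextLimiter` and the numbers in `CONTEXT_LIMITS` are hypothetical and would need to be tuned from your own data:

```python
from collections import Counter

# Hypothetical limits per context type; the numbers are illustrative only.
CONTEXT_LIMITS = {"answer": 1, "comment": 5}


class ContextLimiter:
    """Counts posts per (IP, context type, context id) and flags excess."""

    def __init__(self, limits=CONTEXT_LIMITS):
        self.limits = limits
        self._counts = Counter()  # (ip, context_type, context_id) -> post count

    def is_suspicious(self, ip, context_type, context_id):
        """Count this post and report whether it exceeds the context's limit."""
        key = (ip, context_type, context_id)
        self._counts[key] += 1
        return self._counts[key] > self.limits[context_type]
```

With these example limits, a second answer from the same IP on the same question is flagged, while the same IP answering a different question is not.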

Then there must be limits on the time a user spends in a single visit. If you are using Google Analytics, you should know the average time a user spends on your site. Set the time limit somewhat higher than that average, or use any other criterion you can come up with, including trial and error.

Also, you can use the Blogger approach, but with a minor change. Instead of showing a captcha on every post, show it once when the user logs in or makes the first post. After that, show a captcha only after some time interval or after some number of posts by that user.
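
That captcha schedule could be sketched like this. The class `CaptchaState` and the two thresholds are assumptions made for the example, not values from any particular service:

```python
from dataclasses import dataclass
from typing import Optional

RECHECK_SECONDS = 3600  # assumed: challenge again after an hour
RECHECK_POSTS = 20      # assumed: or after this many posts


@dataclass
class CaptchaState:
    """Hypothetical per-user (or per-session) captcha bookkeeping."""

    last_solved_at: Optional[float] = None  # None means no captcha solved yet
    posts_since_solved: int = 0

    def needs_captcha(self, now: float) -> bool:
        if self.last_solved_at is None:
            return True  # login or first post
        if now - self.last_solved_at >= RECHECK_SECONDS:
            return True  # time interval elapsed
        return self.posts_since_solved >= RECHECK_POSTS  # post quota used up

    def captcha_solved(self, now: float) -> None:
        self.last_solved_at = now
        self.posts_since_solved = 0

    def post_made(self) -> None:
        self.posts_since_solved += 1
```

Call `needs_captcha` before accepting a post, `captcha_solved` when the user passes the challenge, and `post_made` after each accepted post.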
