A good algorithm for checking whether post frequency indicates spam
I have a site where people can post text. Each post is stored in a database with the IP of the poster and the time of the post. I want to be able to display a reCAPTCHA if I can determine that the poster is a bot, spammer, etc.
What is a good algorithm to do this? The simplest choice is to check whether the number of posts in a predetermined time period, say one minute, is greater than a chosen limit, say 10. However, this fails when multiple people post from behind the same IP, and it also misses a bot that posts at random intervals longer than the time period, or that posts fewer than the limit within that period.
Obviously there is no "correct" answer. Some algorithms are better than others, however, and I am just trying to find the best one.
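For concreteness, the simple check described above might look like the following minimal sketch. It uses the one-minute window and the limit of 10 from the question; the in-memory store and the function name `needs_captcha` are assumptions for illustration, and a real site would query its posts table instead.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60  # the "one minute" period from the question
MAX_POSTS = 10       # the "say 10" limit from the question

# Hypothetical in-memory store mapping each IP to its recent post times.
_recent_posts = defaultdict(deque)


def needs_captcha(ip, now):
    """Record a post from `ip` at time `now` (in seconds) and report whether
    that IP has made more than MAX_POSTS posts in the last WINDOW_SECONDS."""
    times = _recent_posts[ip]
    # Drop timestamps that have fallen out of the sliding window.
    while times and now - times[0] >= WINDOW_SECONDS:
        times.popleft()
    times.append(now)
    return len(times) > MAX_POSTS
```

A bot that posts steadily every 7 seconds never has more than 9 posts inside any 60-second window, so it is never flagged. That is the weakness the question points out.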
Comments (1)
You can take a limit-based approach and make good use of website analytics.
There must be limits on how many times an IP posts in a single context. For example, for a StackExchange question (the context), my IP address will in most cases post a single answer (not counting comments). More than one answer is uncommon, and hence suspicious. In other contexts, such as StackExchange comments, the frequency can be up to a few times.
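A minimal sketch of such per-context limits is shown below. The limit values, the `kind` labels, and the in-memory counter are all assumptions for illustration; on a real site the counts would come from the posts table, grouped by IP and context.

```python
from collections import Counter

# Hypothetical limits per kind of post within one context (e.g. one question).
CONTEXT_LIMITS = {"answer": 1, "comment": 5}

# Counts posts keyed by (ip, kind, context_id).
_post_counts = Counter()


def is_suspicious(ip, kind, context_id):
    """Record one post and return True once this IP exceeds the limit
    for this kind of post within this context."""
    key = (ip, kind, context_id)
    _post_counts[key] += 1
    return _post_counts[key] > CONTEXT_LIMITS[kind]
```

Here a second answer from the same IP on the same question is flagged, while a first answer on a different question is not.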
Then there must be limits on the time a user spends in a single visit. If you use website analytics such as Google Analytics, you should know the average time a user spends on your site. Set the time limit considerably greater than that, or use any other criterion you can come up with, including trial and error.
Also, you can use the blogger approach, but with a minor change. Instead of showing a captcha on every post, show it once when the user logs in or makes their first post. After that, show a captcha only after some time interval or after some number of posts by that user.
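The captcha schedule described above could be sketched roughly as follows. The `Session` class, the function names, and both thresholds are hypothetical; the interval would presumably be tuned from the average visit time reported by your analytics.

```python
from dataclasses import dataclass
from typing import Optional

CAPTCHA_INTERVAL_SECONDS = 30 * 60  # hypothetical; tune from analytics data
POSTS_PER_CAPTCHA = 5               # hypothetical number of posts per captcha


@dataclass
class Session:
    last_captcha_at: Optional[float] = None  # None until a captcha is solved
    posts_since_captcha: int = 0


def captcha_required(session, now):
    """Return True if the user should solve a captcha before posting."""
    if session.last_captcha_at is None:
        return True  # login or first post: always ask once
    if now - session.last_captcha_at > CAPTCHA_INTERVAL_SECONDS:
        return True  # too much time has passed since the last captcha
    return session.posts_since_captcha >= POSTS_PER_CAPTCHA


def record_captcha_passed(session, now):
    """Reset the schedule after the user solves a captcha."""
    session.last_captcha_at = now
    session.posts_since_captcha = 0


def record_post(session):
    """Count one accepted post toward the next captcha."""
    session.posts_since_captcha += 1
```

Before each post, call `captcha_required`. Call `record_captcha_passed` once the user solves the captcha and `record_post` after each accepted post.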