Heuristics to identify spammers/bots (in forums, blogs, etc.)
The ways I can think of are:
- Measure the time between actions.
- Compare the posts' content (if they're too similar to each other) or, better yet, only the posted links.
- Check the distribution of activity over a period of time (if the user is active, say, posting once every hour for a week, then we have either a superman or a bot here).
- Expect some specific activity: on Stack Overflow, for instance, I would expect users to click their user-name link (top middle) to see their new answers, comments, questions, etc.
- (added by chakrit) The number of links in a post.
- Not a heuristic: use some async JS for user login. (This just makes life a bit harder for the bot programmer.)
- (added by Alekc) Not a heuristic: user-agent values.
- And how could I forget Google's approach (mentioned below by Will Hartung): give users the ability to mark someone as a spammer; enough spam votes mean a spam user. (Figuring out how many votes is enough is the hard part.)
Any more ideas?
I might be overestimating the intelligence of bot creators, but number 6 is completely useless against any semi-decent one. Using the C# browser control to create your bot pretty much renders it useless, and from what I've seen that's a fairly common approach in this type of software.
Validating on the user agent is pretty much useless too; all of the blog spam I used to get came from bots posing as valid web browsers.
I used to get a lot of blog spam. I would literally delete hundreds of comments a day. I started using reCAPTCHA, and now I might get one a month.
If you really want to build something like this, I would try the following:
User starts off with no ability to post a URL.
After X posts have been analyzed in relation to the other posts in the thread, give them access to post URLs.
The user's activity on the site, their post quality, and whatever other factors you deem necessary become a reputation for that user's IP.
Then, based on the reputation of that IP and the other IPs on the same subnet, you can make whatever other decisions you want.
That was just the first thing that came to mind. Hope it helps.
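The per-IP reputation with a subnet roll-up described above could be sketched like this. The class name, the /24 subnet choice, and the score adjustments are all assumptions for illustration:

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network

class IpReputation:
    """Toy reputation store keyed by IP, with a /24 subnet roll-up.
    Adjustment values and thresholds are placeholders, not tuned numbers."""

    def __init__(self):
        self.scores = defaultdict(float)

    def record(self, ip, delta):
        # e.g. +1.0 for an approved post, -5.0 for a confirmed spam post
        self.scores[ip] += delta

    def subnet_score(self, ip):
        # Average reputation over every known IP in the same /24.
        net = ip_network(f"{ip}/24", strict=False)
        peers = [s for peer, s in self.scores.items()
                 if ip_address(peer) in net]
        return sum(peers) / len(peers) if peers else 0.0

    def may_post_urls(self, ip, analyzed_posts, min_posts=5):
        # Gate URL posting on both post history and neighborhood reputation.
        return analyzed_posts >= min_posts and self.subnet_score(ip) >= 0.0
```

The subnet average means one spammer can taint neighboring IPs on the same /24; whether that trade-off is acceptable depends on how often your legitimate users share subnets with spammers (e.g. large ISPs).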
I believe I've read somewhere that Akismet uses the number of links as one of its major heuristics.
And most of the spam comments on my blog contain 10+ links.
Speaking of which... you might just want to check out the Akismet API itself; it's extremely effective.
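A sketch of what calling Akismet's comment-check endpoint looks like; this builds the request rather than sending it, and while the endpoint and field names below follow Akismet's published REST API, you should verify them against the current docs before relying on this:

```python
from urllib.parse import urlencode

def akismet_request(api_key, blog_url, user_ip, user_agent, content):
    """Build (but don't send) an Akismet comment-check request.
    `api_key` is your Akismet key; it forms the hostname's subdomain."""
    endpoint = f"https://{api_key}.rest.akismet.com/1.1/comment-check"
    payload = urlencode({
        "blog": blog_url,
        "user_ip": user_ip,
        "user_agent": user_agent,
        "comment_content": content,
    })
    # POST `payload` (form-encoded) to `endpoint`; the response body is
    # the literal string "true" (spam) or "false" (ham).
    return endpoint, payload
```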
How about a search for spam related keywords in the post body?
Not a heuristic, but an effective approach: you can also keep up to date with the stats published by StopForumSpam using their API.
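A StopForumSpam lookup is a simple HTTP GET. The sketch below only builds the URL; the `ip`/`email`/`username` parameter names and the bare `&json` flag match their documented API at the time of writing, but check their docs (and usage limits) before production use:

```python
from urllib.parse import urlencode

def stopforumspam_url(ip=None, email=None, username=None):
    """Build a StopForumSpam lookup URL requesting a JSON response.
    Fetch it with any HTTP client and inspect the 'appears'/'frequency'
    fields in the result."""
    params = {k: v for k, v in
              {"ip": ip, "email": email, "username": username}.items() if v}
    # The bare `json` flag switches the response from XML to JSON.
    return "https://api.stopforumspam.org/api?" + urlencode(params) + "&json"
```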
Time between page visits is a common signal, I believe.
I need to add a comment section to my personal site and am thinking of asking people to give me their email address; I'll email them a "publish comment" link.
You might want to check if they've come from a Spam blacklist IP address (See http://www.spamhaus.org/)
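Spamhaus (and most blocklists) are queried over DNS: you reverse the IP's octets, append the list's zone, and resolve the name; an answer means "listed", NXDOMAIN means "clean". A minimal sketch, assuming Spamhaus's ZEN combined zone:

```python
import socket

def dnsbl_query_name(ip, zone="zen.spamhaus.org"):
    # DNSBLs are queried with the IP's octets reversed, e.g.
    # 203.0.113.7 -> 7.113.0.203.zen.spamhaus.org
    return ".".join(reversed(ip.split("."))) + "." + zone

def is_listed(ip):
    """True if the IP resolves in the blocklist zone (i.e. is listed).
    NXDOMAIN (a resolution failure here) means the IP is not listed."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip))
        return True
    except socket.gaierror:
        return False
```

Note that Spamhaus's terms restrict heavy or commercial use of the public mirrors, so check their policy before wiring this into a busy site.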
There is another answer that suggests using Akismet for detecting spam, which I completely endorse.
However, they are not the only player on the block.
There is TypePad AntiSpam, which uses the same heuristics as Akismet, as well as the same API (just a different URL and API key; the structure of the calls is the same). It's safe to say they take pretty much the same approach as Akismet.
You might also want to check out Project Honeypot. From what I can tell, it can do a lookup based on the IP address of the user, and if it is a known malicious IP, it will tell you (harvester or something like that).
Finally, you can check out LinkSleeve, which approaches comment spam in what it claims is a different way: basically, it checks the links in comments and makes a determination based on where those links point.
Don't forget the ultimate heuristic: The "Report Spam" button that users can click. If nothing else, this gives you as administrator a chance to update your rule base for stuff that may be slipping through. Of course, you can simply delete the offending post and user right away as well.
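The open question from the original post ("how many votes is enough?") can be softened by only counting reports from established users, which blunts vote-abuse. A sketch with illustrative thresholds:

```python
def is_spam_by_votes(reporter_reputations, min_votes=3, min_reputation=10):
    """Decide via 'Report Spam' clicks. `reporter_reputations` is a list
    of the reputation scores of the users who clicked the button; only
    established reporters count, so a gang of fresh accounts can't
    vote-bomb a post. Both thresholds are illustrative, not tuned."""
    trusted = [r for r in reporter_reputations if r >= min_reputation]
    return len(trusted) >= min_votes
```

A refinement would be to weight each vote by reputation instead of applying a hard cutoff, but the shape of the decision is the same.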
I have some doubts about point 4; anyway, I would also add the User-Agent. It's pretty easy to fake, but in my experience about 90% of bots use Perl as their UA.
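A UA check is a one-liner; the fragment list below (Perl's libwww plus a few other common scripting clients) is an assumption you would extend from your own logs, and since UAs are trivially spoofed, a hit should count as one signal, not a verdict:

```python
# Substrings commonly seen in scripted clients; extend from your own logs.
SUSPECT_UA_FRAGMENTS = ("libwww-perl", "lwp", "curl", "python-requests", "wget")

def suspicious_user_agent(ua):
    """Flag user agents from common scripting clients. Easily spoofed,
    so treat a match as weak evidence, not proof of a bot."""
    ua = (ua or "").lower()
    return any(frag in ua for frag in SUSPECT_UA_FRAGMENTS)
```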
I am sure there is a web service of some kind from which you can get a list of top SEO keywords; check the content for those keywords. If the content is too rich in keywords, suspect it as spam.
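The keyword-richness idea reduces to a density check. A minimal sketch, assuming you have already obtained a keyword set from somewhere (the set and any threshold you compare against are up to you):

```python
def keyword_density(text, keywords):
    """Fraction of words in `text` that appear in the given keyword set.
    Flag the post for review once the density passes some threshold
    you tune (e.g. 0.3); the threshold is not suggested by any data here."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,!?") in keywords)
    return hits / len(words)
```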