PHP: detecting bot-like behavior
I am attempting to build a system that only shows users a CAPTCHA when bot-like behavior is detected. Here are the behaviors that I am currently looking for when somebody is filling out a contact form...
how quickly the form is submitted after the page loads (if it's 5 seconds or less, it's almost humanly impossible to fill out)
how many contact attempts have been made in the past hour (limit 15/hour) or day (limit 25/day)
check message content for links, and cross-check them against links included in other messages over the past day
check message for spam keywords
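A rough sketch of the first two checks, in case it helps. The function name, the reason strings and the idea of passing in a list of earlier attempt timestamps are my own assumptions for illustration; only the thresholds (5 seconds, 15/hour, 25/day) come from the list above.

```php
<?php
// Thresholds taken from the question; everything else here is hypothetical.
const MIN_FILL_SECONDS = 5;
const MAX_PER_HOUR = 15;
const MAX_PER_DAY = 25;

/**
 * Collect reasons why a submission looks automated.
 *
 * @param int   $renderedAt Unix time the form was rendered (e.g. kept in the session)
 * @param int   $now        Unix time of the submission
 * @param int[] $attempts   Unix times of earlier attempts by the same client
 * @return string[]         Empty when nothing looks suspicious
 */
function suspicionReasons(int $renderedAt, int $now, array $attempts): array
{
    $reasons = [];

    // Submitted within 5 seconds of the page load.
    if ($now - $renderedAt <= MIN_FILL_SECONDS) {
        $reasons[] = 'too_fast';
    }

    // Count earlier attempts inside each time window.
    $lastHour = count(array_filter($attempts, fn($t) => $t > $now - 3600));
    $lastDay  = count(array_filter($attempts, fn($t) => $t > $now - 86400));

    if ($lastHour >= MAX_PER_HOUR) {
        $reasons[] = 'hourly_limit';
    }
    if ($lastDay >= MAX_PER_DAY) {
        $reasons[] = 'daily_limit';
    }

    return $reasons;
}
```

In use, the render time could be stored with $_SESSION['form_rendered_at'] = time() when the form is shown, and the CAPTCHA displayed only when the returned array is not empty. Where the attempt history is kept (session, database, cache) is left open here.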
I will add useful community solutions here as they come:
use a "honeypot" (info at http://haacked.com/archive/2007/09/11/honeypot-captcha.aspx)
check referring URL for an outside entrance
What other behaviors indicative of bots could PHP help detect without the help of a CAPTCHA? (I don't want to use JS because it can be switched off.)
Comments (6)
A very simple one (some more advanced bots won't fall for this, but many basic bots will) - put a bogus field in the form that isn't visible to a regular user (and as a backup, perhaps with a normally invisible label "don't type anything here"). If there's content in the field when submitted, chances are it's a bot.
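A minimal sketch of the server-side half of this. The field name "website" and the function name are made up for the example; any name a bot is likely to fill in would do.

```php
<?php
// Hypothetical markup for the decoy field, hidden from regular users via CSS:
//   <label class="hp">Don't type anything here <input type="text" name="website"></label>

/**
 * True when the decoy field came back with content, which a regular
 * user who never saw the field would not have produced.
 *
 * @param array  $post  Submitted form data, typically $_POST
 * @param string $field Name of the decoy field (assumed here to be "website")
 */
function honeypotTriggered(array $post, string $field = 'website'): bool
{
    $value = $post[$field] ?? '';

    // A browser sends a plain string for a text input; anything else is suspicious.
    if (!is_string($value)) {
        return true;
    }

    return trim($value) !== '';
}
```

Calling honeypotTriggered($_POST) before processing the message is enough to decide whether to show the CAPTCHA.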
I believe you could make use of your robots.txt file: determine whether it was requested by the client, and if so keep track of the requester's IP and timestamp. It is unlikely that a normal user would ever look at your robots.txt file, whereas most bots will check it (maybe for directory structure, etc.).
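Something along these lines might work as a sketch. It assumes requests for robots.txt are routed through a PHP script (for example with a rewrite rule) so they can be logged; the function names and the in-memory array standing in for real storage are hypothetical.

```php
<?php
/**
 * Record that $ip requested robots.txt at $time.
 *
 * @param array<string,int> $log Map of IP address to time of last robots.txt request
 * @return array<string,int>     The updated map
 */
function recordRobotsHit(array $log, string $ip, int $time): array
{
    $log[$ip] = $time;
    return $log;
}

/**
 * True if $ip requested robots.txt within $window seconds before $now.
 * The one-day default window is an arbitrary choice for this sketch.
 */
function fetchedRobotsRecently(array $log, string $ip, int $now, int $window = 86400): bool
{
    return isset($log[$ip]) && ($now - $log[$ip]) <= $window;
}
```

Since legitimate crawlers also request robots.txt, and IP addresses can be shared, this is probably best treated as one signal among several rather than proof on its own.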
An interesting factor could be typing frequency and mouse movements. They are fairly easy to catch via JavaScript. Analyzing them is a different matter, although I imagine it would be fairly easy to calculate deviations and averages that give a good idea how "organic" the movements are.
On the other hand, this is extremely expensive on the client side and can be understood as snooping / spying if detected. Maybe as advanced security for clients that are suspected to be bots?
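To illustrate the "deviations and averages" part: a small sketch of the analysis on the PHP side, assuming the client script has already collected key-press timestamps in milliseconds and posted them along with the form. The function names and the 5 ms threshold are invented for the example and would need tuning against real data.

```php
<?php
/**
 * Mean and standard deviation of the gaps between consecutive key presses.
 *
 * @param array $timestampsMs Key-press times in milliseconds, in ascending order
 * @return array|null         ['mean' => float, 'stddev' => float], or null with fewer than 3 presses
 */
function intervalStats(array $timestampsMs): ?array
{
    $times = array_values($timestampsMs);
    $n = count($times);
    if ($n < 3) {
        return null; // too little data to say anything
    }

    // Gaps between neighbouring key presses.
    $intervals = [];
    for ($i = 1; $i < $n; $i++) {
        $intervals[] = $times[$i] - $times[$i - 1];
    }

    $mean = (float) (array_sum($intervals) / count($intervals));

    // Population variance of the gaps.
    $variance = 0.0;
    foreach ($intervals as $d) {
        $variance += ($d - $mean) ** 2;
    }
    $variance /= count($intervals);

    return ['mean' => $mean, 'stddev' => sqrt($variance)];
}

/**
 * Flags input whose timing is almost perfectly regular.
 * The default threshold is a guess, not a measured value.
 */
function looksScripted(array $timestampsMs, float $minStddevMs = 5.0): bool
{
    $stats = intervalStats($timestampsMs);
    return $stats !== null && $stats['stddev'] < $minStddevMs;
}
```

Keep in mind that the timestamps come from the client, so a bot that knows about the check can fake them.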
I added a hidden field (via CSS, display:none) to the form with name="email"; when it is filled in, it was a robot ;)
Perhaps checking the referring URL? I can hardly imagine a lot of people ending up at a contact form without first going through several other pages of the website. The same goes for order forms, ...
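A possible sketch of that check. The function name is made up, and the site's own host name is passed in as a parameter; in a real script the first argument would typically be $_SERVER['HTTP_REFERER'] ?? null.

```php
<?php
/**
 * True when the Referer header points at a page on our own host.
 *
 * @param string|null $referer Raw Referer header value, or null if absent
 * @param string      $ownHost Host name of this site, e.g. "example.com"
 */
function cameFromOwnSite(?string $referer, string $ownHost): bool
{
    if ($referer === null || $referer === '') {
        return false; // header missing: direct visit, stripped header, or a bot
    }

    // parse_url() yields null or false when no host can be extracted.
    $host = parse_url($referer, PHP_URL_HOST);

    return is_string($host) && strcasecmp($host, $ownHost) === 0;
}
```

The Referer header is optional and easy to forge, and some browsers or privacy tools strip it, so a missing or foreign referrer is better used to raise suspicion than to block outright.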
I'd suggest you forget trying to guess the signs... they are always changing.
I'd tokenize every imaginable 'feature' of the behaviour and automatically score each feature as 'ok', 'spam' or 'unsure'. Then 'Train on Error' (make a record of the cases where the guess was wrong). After a bit of time you could have 99.7% accuracy.
Here is an example of the 7 most interesting features of a submission to my site that was scored at 89.9771 % spam. It is spam.
Each of these keywords found in the post are features that are 98.9% likely to be spam:
The telephone number that is '12345' is 95% likely to be spam
The total length of the message being 30 characters (after html removed) is a feature that indicates 94% spam
(There was another feature that scored Prob 0.01011, which offset the total combined score, knocking it down a bit. But I am not going to say what that feature was ;o)
It was submitted from a well-known spam IP: http://www.projecthoneypot.org/ip_84.19.186.171 but there was no need to use that particular knowledge to mark it as spam. I gather all sorts of info, like IPs, submission rates etc., but, as you can see, the most glaring signs of bot-like behavior are not what you might guess.
To build your own one of these .... read this:
http://www.paulgraham.com/spam.html
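For a feel of how per-feature probabilities turn into one score, here is a sketch of the combining rule described in the linked essay (product of the probabilities, divided by that product plus the product of their complements). The function name is hypothetical, and this is not necessarily the exact scoring used on the site mentioned above.

```php
<?php
/**
 * Combine per-feature spam probabilities into a single score.
 *
 * @param float[] $probs Probability that each observed feature indicates spam
 * @return float         Combined spam probability between 0 and 1
 */
function combineSpamProbabilities(array $probs): float
{
    $spam = 1.0; // running product of p
    $ham  = 1.0; // running product of (1 - p)

    foreach ($probs as $p) {
        // Clamp so a single feature can never force the result to exactly 0 or 1.
        $p = min(0.99, max(0.01, (float) $p));
        $spam *= $p;
        $ham  *= (1.0 - $p);
    }

    return $spam / ($spam + $ham);
}
```

Two features at 0.9 each combine to 0.81 / (0.81 + 0.01), roughly 0.988, and a single strongly "ok" feature pulls the result back down, which matches the offsetting effect described above. Learning the per-feature probabilities from labelled submissions is the part the essay covers in detail.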