PHP: detecting bot-like behavior

Posted 2024-08-12 08:30:43

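I am trying to build a system that shows a CAPTCHA to the user only when bot-like behavior is detected. Here is the behavior I currently look for when someone fills out the contact form...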

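How quickly the form is submitted after the page loads (at 5 seconds or less, it is almost humanly impossible to have filled it out)
The number of contact attempts in the past hour (limit 15/hour) or day (limit 25/day)
Check the message content for links, and cross-check them against links in other messages received over the past day
Check the message for spam keywords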
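
The first two checks in the list above can be sketched in plain PHP. This is a minimal sketch, not a finished implementation: the function names are hypothetical, and it assumes the render time of the form and the sender's earlier attempt timestamps are stored somewhere on the server (for example in the session or a database table).

```php
<?php
// Hypothetical helpers; the thresholds (5 s, 15/hour, 25/day) come from the question.

/** True when the form came back implausibly soon after it was rendered. */
function submittedTooFast(int $renderedAt, int $now, int $minSeconds = 5): bool
{
    return ($now - $renderedAt) <= $minSeconds;
}

/**
 * True when the sender has already used up the hourly or daily allowance.
 * $attempts is a list of Unix timestamps of this sender's earlier attempts.
 */
function overRateLimit(array $attempts, int $now, int $perHour = 15, int $perDay = 25): bool
{
    $lastHour = 0;
    $lastDay = 0;
    foreach ($attempts as $t) {
        if ($t > $now - 3600) {
            $lastHour++;
        }
        if ($t > $now - 86400) {
            $lastDay++;
        }
    }
    return $lastHour >= $perHour || $lastDay >= $perDay;
}
```

Either function returning true would be a reason to show the CAPTCHA rather than to reject the message outright. Keeping the render timestamp on the server side, rather than in a hidden form field, avoids letting the client choose its own value.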

I will add useful community solutions here:

Use a "honeypot" (details at http://haacked.com/archive/2007/09/11/honeypot-captcha.aspx)
Check the referrer URL for entry from an external site

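What other behaviors that indicate a bot could PHP help detect? (I don't want to rely on JS, since it can be turned off, leaving no CAPTCHA as a backstop.)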


执妄 2024-08-19 08:30:43

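A very simple approach (some more advanced bots won't fall for it, but many basic ones will): put a fake field in the form that is invisible to ordinary users (and, as a backup, perhaps give it a normally invisible label saying "Do not enter anything here"). If the field has content on submission, it is most likely a bot.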


时间你老了 2024-08-19 08:30:43

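I believe you can tie this in with your robots.txt file and determine whether the user has requested it, so you can track the requester's IP and timestamp; it seems unlikely that an ordinary user would ever look at your robots.txt file.

Most bots, on the other hand, will check your robots.txt (possibly for the directory structure, etc.).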
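
One way to try this, sketched under a loud assumption: requests for robots.txt are routed through a PHP script (for example via a server rewrite rule) that can log them. The function names, the tab-separated log format, and the one-hour window are all hypothetical.

```php
<?php
// Hypothetical helpers. Assumes robots.txt is served by a PHP script that
// calls recordRobotsHit() before printing the file's contents.

/** Append "ip<TAB>timestamp" to the log file. */
function recordRobotsHit(string $logFile, string $ip, int $now): void
{
    file_put_contents($logFile, $ip . "\t" . $now . "\n", FILE_APPEND | LOCK_EX);
}

/** True when $ip requested robots.txt within the last $window seconds. */
function fetchedRobotsRecently(string $logFile, string $ip, int $now, int $window = 3600): bool
{
    if (!is_file($logFile)) {
        return false;
    }
    foreach (file($logFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $parts = explode("\t", $line, 2);
        if (count($parts) !== 2) {
            continue; // skip malformed lines
        }
        if ($parts[0] === $ip && (int) $parts[1] > $now - $window) {
            return true;
        }
    }
    return false;
}
```

At form submission, fetchedRobotsRecently() with the client's address would be one more signal in favor of showing the CAPTCHA. Several users can share one IP address, so it is safer as a hint than as proof.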


情丝乱 2024-08-19 08:30:43

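An interesting factor might be typing frequency and mouse movement. Both are easy to capture with JavaScript. Analyzing them is another matter, although I think computing the deviation and the average is fairly easy and would give a good idea of how "organic" the movements are.

On the other hand, this is extremely expensive on the client side and, if noticed, could be taken for snooping or spying. Perhaps reserve it as a heightened security check for clients already suspected of being bots?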


梦亿 2024-08-19 08:30:43

I added a hidden field with name="email" to my form (hidden via CSS, display:none); when it gets filled in, it's a bot ;)
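
The server-side half of this honeypot might look like the following. A minimal sketch: the function name is hypothetical, and the default trap field name "email" is taken from the answer above.

```php
<?php
// Hypothetical helper: reports whether the hidden trap field was filled in.
function honeypotTripped(array $post, string $trapField = 'email'): bool
{
    $value = $post[$trapField] ?? '';
    if (is_array($value)) {
        return true; // a browser never submits the hidden field as an array
    }
    return trim((string) $value) !== '';
}
```

In a form handler this would be called as honeypotTripped($_POST). If the trap takes the name "email", the real address field needs a different, less obvious name.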


蓬勃野心 2024-08-19 08:30:43

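Maybe check the referrer URL? I find it hard to imagine many people landing on the contact form without first browsing a few other pages of the site; the same goes for order forms, ...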
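
A sketch of that check, assuming a hypothetical function name and that the site's own host name is known to the script. The Referer header is supplied by the client and can be missing or forged, so this is a weak signal at best.

```php
<?php
// Hypothetical helper: true when the Referer header points at our own host.
function cameFromOwnSite(?string $referer, string $ownHost): bool
{
    if ($referer === null || $referer === '') {
        return false;
    }
    $host = parse_url($referer, PHP_URL_HOST);
    return is_string($host) && strcasecmp($host, $ownHost) === 0;
}
```

A typical call would pass $_SERVER['HTTP_REFERER'] ?? null as the first argument. Some privacy settings strip the header for real visitors too, which is a reason to respond with a CAPTCHA rather than a block.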


羁客 2024-08-19 08:30:43

I'd suggest forgetting about trying to guess the signs... they are always changing.

I'd tokenize every imaginable 'feature' of the behaviour and automatically score each feature as 'ok', 'spam' or 'unsure'. Then 'Train on Error' (make a record of the cases where the guess was wrong). After a bit of time you could have 99.7% accuracy.

Here is an example of the 7 most interesting features of a submission to my site that was scored at 89.9771% spam. It is spam.

Each of these keywords found in the post is a feature that is 98.9% likely to be spam:

mssg txt - "tours" || Prob 0.98993 
mssg txt - "cruises" || Prob 0.98993
mssg txt - "agencies" || Prob 0.98993
mssg txt - "choice" || Prob 0.98991

The telephone number '123456' is 95% likely to be spam

tel number - "123456" || Prob 0.95440 Delta 0.45440

A total message length of 30 characters (after HTML is removed) is a feature that indicates 94% spam

mssg maxlen - "30" || Prob 0.94600

(There was another feature that scored Prob 0.01011, which offset the total combined score and knocked it down a bit. But I'm not gonna say what that feature was ;o)

It was submitted from a well-known spam IP: http://www.projecthoneypot.org/ip_84.19.186.171 but there was no need to use that particular knowledge to mark it out as spam. I gather all sorts of info, like IPs, submission rates etc... but, as you can see, the most glaring signs of bot-like behavior are not what you might guess.

To build one of these yourself, read this:
http://www.paulgraham.com/spam.html
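
The linked essay combines per-feature probabilities with a naive Bayes formula. Below is a minimal sketch of that combination step only, assuming a hypothetical function name and the essay's clamping of each probability to the range 0.01 to 0.99. It is not the poster's own scorer and should not be expected to reproduce the 89.9771% figure above, which may come from a different combination rule.

```php
<?php
// Hypothetical helper: combine per-feature spam probabilities as
// prod(p) / (prod(p) + prod(1 - p)).
function combineProbabilities(array $probs): float
{
    $spam = 1.0;
    $ham = 1.0;
    foreach ($probs as $p) {
        $p = max(0.01, min(0.99, (float) $p)); // clamp so neither product reaches zero
        $spam *= $p;
        $ham *= 1.0 - $p;
    }
    return $spam / ($spam + $ham);
}
```

A feature with probability 0.5 leaves the result unchanged, and an empty feature list gives 0.5, which is a natural 'unsure' value.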
