How do you determine whether a user visiting your website is a bot?
I know that the user agent is one indicator, but it is easy to spoof. What other reliable indicators are there that a visitor is really a bot? Inconsistent headers? Whether images and JavaScript are requested? Thanks!
Comments (6)
CVSTrac uses a honeypot page to accomplish this. It is a page linked from somewhere on the site that crawlers will reach but humans usually ignore. CVSTrac goes one step further by allowing a flagged user to prove that they are human.
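The honeypot idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not how CVSTrac implements it: the trap path, the `HoneypotTracker` class, and the use of a plain string as the client identifier are all assumptions made for the example.

```python
# Hypothetical trap URL (an assumption for this sketch). It would be linked
# from somewhere on the site where crawlers follow it but people rarely do.
HONEYPOT_PATH = "/trap/do-not-follow"


class HoneypotTracker:
    """Remembers which clients have requested the honeypot page."""

    def __init__(self, honeypot_path=HONEYPOT_PATH):
        self.honeypot_path = honeypot_path
        self.flagged = set()

    def record(self, client_id, path):
        """Record one request; return True if the client is currently flagged."""
        if path == self.honeypot_path:
            self.flagged.add(client_id)
        return client_id in self.flagged

    def clear(self, client_id):
        """Unflag a client, for example after it passes a human check."""
        self.flagged.discard(client_id)
```

A request handler would call `record()` for every request and treat a `True` result as "probably a bot", calling `clear()` once the visitor has shown they are human.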
"Whether images/javascript are requested?" I would go for this one, but Google and other crawlers request images and JavaScript files nowadays.
How about request timing? Bots read your content a lot faster than humans do.
There are 4 things that we look for:
The user agent string. It is very easy to fake, but crawlers often use their own unique user agent string.
The speed at which pages are accessed. If a client requests more than one page every half second or so, that is usually a good indication.
Whether they request just the HTML or the entire page. Some crawlers only ask for the HTML and skip the images, scripts, and stylesheets. This is usually a good tip-off.
The incoming URL.
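Two of the checks in the list above, request speed and HTML-only access, could be sketched roughly like this in Python. The half-second threshold comes from the list; the window size, the asset suffixes, and the `RequestProfile` class are assumptions chosen for the example, and a real site would need to tune them.

```python
from collections import defaultdict, deque

MIN_INTERVAL = 0.5  # seconds between pages; the "half second or so" from the list
WINDOW = 5          # recent page requests to average over (assumption)
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif")  # assumption


class RequestProfile:
    """Tracks per-client page timing and whether page assets are ever fetched."""

    def __init__(self):
        self.page_times = defaultdict(lambda: deque(maxlen=WINDOW))
        self.asset_hits = defaultdict(int)

    def record(self, client_id, path, timestamp):
        """Record one request; path is assumed to carry no query string."""
        if path.lower().endswith(ASSET_SUFFIXES):
            self.asset_hits[client_id] += 1
        else:
            self.page_times[client_id].append(timestamp)

    def too_fast(self, client_id):
        """True if the last WINDOW pages arrived under MIN_INTERVAL apart on average."""
        times = self.page_times[client_id]
        if len(times) < WINDOW:
            return False  # not enough data to judge yet
        avg_gap = (times[-1] - times[0]) / (len(times) - 1)
        return avg_gap < MIN_INTERVAL

    def html_only(self, client_id):
        """True if the client fetched at least WINDOW pages but never an asset."""
        pages = len(self.page_times[client_id])
        return pages >= WINDOW and self.asset_hits[client_id] == 0
```

Neither signal is conclusive on its own (a client with a warm cache may also skip assets), so it is probably safer to combine them with the other indicators than to block on one.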
A reverse captcha of sorts can help as well: you could create a text input field with display: none; in its style attribute (or in your stylesheet). If it comes back filled in, chances are you're dealing with a bot.
Edit: This was actually something that had been aggregated in my RSS reader. If I can find the source, I'll link a good example.
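As a rough sketch of that hidden-field check in Python: the field name `website`, the form markup, and the `looks_like_bot` helper are all made up for this example, and the body is assumed to be a standard URL-encoded form post. Browsers still submit a field hidden with CSS, just with an empty value, so the check looks for a non-empty value rather than the mere presence of the field.

```python
from urllib.parse import parse_qs

TRAP_FIELD = "website"  # hypothetical field name; humans never see this input

# Example form with the hidden trap field (markup is illustrative only).
FORM_HTML = f"""
<form method="post" action="/comment">
  <textarea name="comment"></textarea>
  <input type="text" name="{TRAP_FIELD}" style="display: none;">
  <button type="submit">Post</button>
</form>
"""


def looks_like_bot(post_body: str) -> bool:
    """Return True if the hidden trap field was submitted with a value."""
    fields = parse_qs(post_body, keep_blank_values=True)
    values = fields.get(TRAP_FIELD, [])
    return any(value.strip() for value in values)
```

For example, `looks_like_bot("comment=hello&website=")` is `False`, while a post that fills in `website` with a URL is flagged.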
Take a look at Bad Behavior, a library that employs a wide array of bot detection techniques.
Isn't that what captcha was invented for?