Allowing Google to bypass CAPTCHA verification - sensible or not?

Posted 2024-08-28 17:03:34

My web site has a database lookup; filling out a CAPTCHA gives you 5 minutes of lookup time. There is also some custom code to detect any automated scripts. I do this as I don't want someone data mining my site.

The problem is that Google does not see the lookup results when it crawls my site. If someone is searching for a string that is present in the result of a lookup, I would like them to find this page by Googling it.

The obvious solution to me is to use the PHP variable $_SERVER['HTTP_USER_AGENT'] to bypass the CAPTCHA and custom security code for the Google bots. My question is whether this is sensible or not.

People could then use Google's cache to view the lookup results without having to fill out the CAPTCHA, but would Google's own script detection methods prevent them from data mining these pages?

Or would there be some way for people to make $_SERVER['HTTP_USER_AGENT'] appear as Google to bypass the security measures?

Comments (2)

梦在深巷 2024-09-04 17:03:34

Or would there be some way for people to make $_SERVER['HTTP_USER_AGENT'] appear as Google to bypass the security measures?

Definitely. The user agent is laughably easy to forge. See e.g. User Agent Switcher for Firefox. It's also easy for a spam bot to set its user agent header to the Google bot.
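To illustrate how little effort forging takes, here is a minimal sketch in Python (the asker's site is PHP, but the client can be written in anything). The function name `build_spoofed_request` and the example URL are made up for illustration, and the user agent string is the commonly published Googlebot one, taken here as an assumption. The request is only built, not sent.

```python
import urllib.request

# Commonly published Googlebot user agent string (assumed here for illustration).
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_spoofed_request(url: str) -> urllib.request.Request:
    """Build (but do not send) a request that claims to come from Googlebot."""
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})


# Hypothetical lookup URL; any client can attach the header like this.
req = build_spoofed_request("http://example.com/lookup?q=test")
# urllib stores header names capitalized, hence "User-agent".
print(req.get_header("User-agent"))
```

On the server, `$_SERVER['HTTP_USER_AGENT']` would contain exactly that string, so a check based on it alone cannot tell this client apart from the real crawler.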

It might still be worth a shot, though. I'd say just try it out and see what the results are. If you get problems, you may have to think about another way.

An additional way to recognize the Google bot could be the IP range(s) it uses. I don't know whether the bot uses defined IP ranges; that may not be the case, so you'd have to find out.

Update: it seems to be possible to verify the Google Bot by analyzing its IP. From Google Webmaster Central: How to verify Googlebot

Telling webmasters to use DNS to verify on a case-by-case basis seems like the best way to go. I think the recommended technique would be to do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS->IP lookup using that googlebot.com name; eg:

host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

I don't think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.
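The reverse-then-forward check described above can be sketched as follows. This is a rough Python illustration rather than a definitive implementation: the name `is_verified_googlebot` and the injectable `reverse`/`forward` parameters are hypothetical (they exist so the logic can be exercised without network access), only the `googlebot.com` domain mentioned in the quote is accepted, and the default forward lookup handles IPv4 only.

```python
import socket
from typing import Callable, List


def _reverse_lookup(ip: str) -> str:
    # PTR lookup: IP address -> primary host name.
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(hostname: str) -> List[str]:
    # A lookup: host name -> list of IPv4 addresses.
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = _reverse_lookup,
    forward: Callable[[str], List[str]] = _forward_lookup,
) -> bool:
    """Return True only if reverse and forward DNS agree on a googlebot.com host."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    # Require the leading dot so "evilgooglebot.com" does not match.
    if not hostname.endswith(".googlebot.com"):
        return False
    try:
        addresses = forward(hostname)
    except OSError:
        return False
    # The forward lookup must lead back to the address that connected.
    return ip in addresses
```

A spoofer who controls the reverse DNS for their own address can make the first lookup return a googlebot.com name, but they cannot make the real googlebot.com zone resolve that name back to their address, which is why the final comparison matters. In PHP the same steps would presumably map onto `gethostbyaddr()` and `gethostbynamel()`, applied to the connecting address rather than to the user agent header. Because DNS lookups are slow, caching the result per IP is probably worthwhile.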

々眼睛长脚气 2024-09-04 17:03:34

The $_SERVER['HTTP_USER_AGENT'] parameter is not secure; people can fake it if they really want to get your results. Your decision is a business one: do you want to lower security and potentially allow people and bots to scrape your site, or do you want your results hidden from Google?
