How can I tell whether a web request comes from Google's crawler?
From the HTTP server's perspective.
Comments (6)
You can read the official Verifying Googlebot page.
Quoting the page here:
I captured Google crawler requests in my ASP.NET application, and here's what the crawler's signature looks like.
My logs show many different IPs for the Google crawler in the 66.249.71.* range. All of these IPs geolocate to Mountain View, CA, USA.
A simple way to check whether a request comes from the Google crawler is to verify that the request contains Googlebot and http://www.google.com/bot.html. Since many IPs show up for the same requesting client, I wouldn't recommend checking IPs. Maybe that's where client identity comes into the picture, so verify the client identity instead. Here's sample code in C#.
It's important to note that any HTTP client can easily fake this.
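As an illustration of the check described in this answer (this is not the author's C# code, which is not reproduced on this page), here is a minimal Python sketch. It assumes the two strings are looked for in the User-Agent header; the function name `looks_like_googlebot` is hypothetical.

```python
def looks_like_googlebot(user_agent):
    """Return True if the User-Agent value contains both Googlebot markers.

    Assumption: the check is done on the User-Agent header. Any HTTP client
    can send this header, so a True result is only a hint, not proof.
    """
    ua = (user_agent or "").lower()
    return "googlebot" in ua and "http://www.google.com/bot.html" in ua


if __name__ == "__main__":
    sample = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(looks_like_googlebot(sample))
```

Because the header is trivially spoofed, a check like this is probably best treated as a cheap first filter before a stronger test.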
You can now perform an IP address check against Googlebot's published IP address list at https://developers.google.com/search/apis/ipranges/googlebot.json
From the docs:
If you're using the Apache web server, you could have a look at the log file 'log\access.log'.
Then load Google's IPs from http://www.iplists.com/nw/google.txt and check whether any of those IPs appears in your log.
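A rough Python sketch of that log check follows. It rests on two assumptions that are not confirmed by this answer: that the IP list is plain text with one address per line and `#` comments, and that each access-log line starts with the client IP (as in Apache's common log format). The function names are hypothetical.

```python
import ipaddress


def parse_ip_list(text):
    """Collect valid IP addresses from list text, skipping blanks and '#' comments."""
    ips = set()
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue
        try:
            ips.add(str(ipaddress.ip_address(entry)))
        except ValueError:
            continue  # not a plain IP address, so ignore this line
    return ips


def google_hits(log_lines, google_ips):
    """Yield the log lines whose first field (assumed client IP) is in google_ips."""
    for line in log_lines:
        fields = line.split(None, 1)
        if fields and fields[0] in google_ips:
            yield line
```

Both functions take text that has already been read, so how the list is downloaded and how the log file is opened is left to the caller.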
Based on this. __curious_geek's solution, here's the JavaScript version:
To verify whether a web request is coming from Google's crawler, you can check whether its IP address falls within the IP ranges published by Google, which can be found here:
https://developers.google.com/search/apis/ipranges/googlebot.json
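A minimal Python sketch of the range check, assuming the JSON document has a top-level `prefixes` array whose entries carry an `ipv4Prefix` or `ipv6Prefix` CIDR string. The function names are hypothetical, and the JSON text is passed in so that downloading it stays up to the caller.

```python
import ipaddress
import json


def load_networks(ranges_json):
    """Parse the ranges JSON text into a list of ip_network objects."""
    data = json.loads(ranges_json)
    networks = []
    for prefix in data.get("prefixes", []):
        # Assumption: each entry has either an ipv4Prefix or an ipv6Prefix key.
        cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def ip_in_networks(ip, networks):
    """Return True if the address lies inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    # An IPv4 address is never contained in an IPv6 network, and vice versa.
    return any(addr in net for net in networks)
```

Since the published ranges can change, it likely makes sense to cache the parsed list and refresh it periodically rather than hard-coding it.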
Alternatively, you can also do a reverse DNS lookup and check if the domain matches one of Google's domains.
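The reverse DNS approach could be sketched in Python as below. The accepted suffixes (`.googlebot.com`, `.google.com`) and the extra forward lookup that confirms the hostname resolves back to the same IP are assumptions layered on top of this answer; the function names are hypothetical, and the resolvers are injectable so the logic can be exercised without network access.

```python
import socket

# Assumption: hostnames under these domains are treated as Google's.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip):
    """Reverse DNS: IP address to primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host):
    """Forward DNS: hostname to a list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_google_host(ip, reverse=_reverse, forward=_forward):
    """Return True if ip reverse-resolves to a Google domain and back to itself."""
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # Forward-confirm, since a reverse (PTR) record alone can be forged.
        return ip in forward(host)
    except OSError:
        return False  # lookup failed, so treat the request as unverified
```

Note that `socket.gethostbyname_ex` only returns IPv4 addresses, so this sketch does not cover IPv6 clients.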
Note: You can also check the User-Agent string, but because it can be spoofed it's wise to use one of the methods mentioned above instead.
You can use the NPM package crawl-bot-verifier to verify Google, Bing, Baidu, and many other crawlers. The library does a DNS lookup, which is reliable, and it has a very nice API. You can find the package here: https://www.npmjs.com/package/crawl-bot-verifier