How can I tell whether a web request is coming from Google's crawler?

Posted 2024-09-10 18:52:15

From the HTTP server's perspective.

Comments (6)

初雪 2024-09-17 18:52:15

You can read the official Verifying Googlebot page.

Quoting the page here:

You can verify that a bot accessing your server really is Googlebot
(or another Google user-agent) by using a reverse DNS lookup,
verifying that the name is in the googlebot.com domain, and then doing
a forward DNS lookup using that googlebot name. This is useful if
you're concerned that spammers or other troublemakers are accessing
your site while claiming to be Googlebot.

For example:

    > host 66.249.66.1
    1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

    > host crawl-66-249-66-1.googlebot.com
    crawl-66-249-66-1.googlebot.com has address 66.249.66.1

Google doesn't post a public list of IP
addresses for webmasters to whitelist. This is because these IP
address ranges can change, causing problems for any webmasters who
have hard coded them. The best way to identify accesses by Googlebot
is to use the user-agent (Googlebot).
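
The reverse-then-forward DNS check described in the quote could be sketched in Node.js roughly as follows. This is only a minimal illustration, not an official implementation: the function names `isGoogleHostname` and `verifyGooglebot` are made up for this example, only the `googlebot.com` domain named in the quote is accepted, and only IPv4 forward lookups are attempted. The `resolver` parameter is an assumption added so the logic can be tried without network access.

```javascript
const dns = require('dns').promises;

// Pure helper: is the PTR name inside the googlebot.com domain?
function isGoogleHostname(hostname) {
  const name = hostname.toLowerCase().replace(/\.$/, ''); // drop a trailing dot
  return name.endsWith('.googlebot.com');
}

// Reverse lookup, domain check, then forward lookup back to the same IP.
// `resolver` defaults to Node's dns.promises API; pass a stub to test offline.
async function verifyGooglebot(ip, resolver = dns) {
  let hostnames;
  try {
    hostnames = await resolver.reverse(ip);
  } catch {
    return false; // no PTR record, or the lookup failed
  }
  for (const hostname of hostnames.filter(isGoogleHostname)) {
    try {
      const addresses = await resolver.resolve4(hostname);
      if (addresses.includes(ip)) return true;
    } catch {
      // Forward lookup failed for this name; try the next one.
    }
  }
  return false;
}
```

A caller would typically run `await verifyGooglebot(clientIp)` only for requests whose User-Agent already claims to be Googlebot, and cache the result, since each check costs two DNS queries.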

狠疯拽 2024-09-17 18:52:15

I captured a Google crawler request in my ASP.NET application, and here is what the crawler's signature looks like.

Requesting IP: 66.249.71.113
Client: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

My logs show many different IPs for the Google crawler in the 66.249.71.* range. All of these IPs geolocate to Mountain View, CA, USA.

A simple way to check whether a request claims to come from the Google crawler is to verify that the User-Agent contains Googlebot and http://www.google.com/bot.html. Because, as noted above, many IPs show up with the same client string, I wouldn't recommend checking IPs. That may be where client identity comes into the picture, so verify the client identity instead.

Here's some sample code in C#:

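    // Request.UserAgent is null when the header is missing, so default to "".
    string userAgent = Request.UserAgent ?? string.Empty;
    if (userAgent.IndexOf("googlebot", StringComparison.OrdinalIgnoreCase) >= 0 ||
        userAgent.IndexOf("google.com/bot.html", StringComparison.OrdinalIgnoreCase) >= 0)
    {
        // The User-Agent claims to be Googlebot.
    }
    else
    {
        // The User-Agent does not claim to be Googlebot.
    }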

It's important to note that any HTTP client can easily fake this.

柏拉图鍀咏恒 2024-09-17 18:52:15

You can now perform an IP address check by comparing against Googlebot's published IP address list at https://developers.google.com/search/apis/ipranges/googlebot.json

From the docs:

you can identify Googlebot by IP address by matching the crawler's IP address to the list of Googlebot IP addresses. For all other Google crawlers, match the crawler's IP address against the complete list of Google IP addresses.
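
As a rough illustration of matching an address against such a list, the sketch below uses Node's built-in `net.BlockList` purely as a CIDR matcher (despite its name, nothing is blocked here). The shape of the JSON (`prefixes` entries carrying `ipv4Prefix` or `ipv6Prefix`) and the two sample prefixes are assumptions for the example; in practice you would download and parse the real file and refresh it periodically. The names `buildRangeList` and `isInRanges` are hypothetical.

```javascript
const net = require('net');

// Illustrative stand-in for the parsed googlebot.json file.
// Both the structure and the prefixes are assumptions for this sketch.
const sampleRanges = {
  prefixes: [
    { ipv4Prefix: '66.249.66.0/27' },
    { ipv6Prefix: '2001:4860:4801:10::/64' },
  ],
};

// Load every CIDR prefix into a BlockList so it can be queried later.
function buildRangeList(ranges) {
  const list = new net.BlockList();
  for (const entry of ranges.prefixes) {
    const cidr = entry.ipv4Prefix || entry.ipv6Prefix;
    if (!cidr) continue; // skip entries with neither field
    const [address, bits] = cidr.split('/');
    list.addSubnet(address, Number(bits), entry.ipv4Prefix ? 'ipv4' : 'ipv6');
  }
  return list;
}

// True when `ip` is a valid address that falls inside one of the prefixes.
function isInRanges(ip, list) {
  const family = net.isIP(ip); // 4, 6, or 0 when `ip` is not an IP address
  if (family === 0) return false;
  return list.check(ip, family === 4 ? 'ipv4' : 'ipv6');
}

const googlebotRanges = buildRangeList(sampleRanges);
console.log(isInRanges('66.249.66.1', googlebotRanges)); // true
console.log(isInRanges('203.0.113.9', googlebotRanges)); // false
```

Building the list once at startup and reusing it keeps the per-request cost to a single lookup.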

清眉祭 2024-09-17 18:52:15

If you're using the Apache web server, you can look at the log file 'log\access.log'.

Then load Google's IPs from http://www.iplists.com/nw/google.txt and check whether any of those IPs appears in your log.
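
A log scan along these lines might look like the sketch below. It assumes the default Apache log layout, in which the client address is the first whitespace-separated field of each line, and that the IP list has already been loaded into an array. The function names and the sample log lines are made up for the example.

```javascript
// Extract the client address (first field) from one access-log line.
function clientIp(line) {
  const match = /^(\S+)\s/.exec(line);
  return match ? match[1] : null;
}

// Return the distinct client IPs in the log text that also appear in `knownIps`.
function matchingIps(logText, knownIps) {
  const known = new Set(knownIps);
  const hits = new Set();
  for (const line of logText.split(/\r?\n/)) {
    const ip = clientIp(line);
    if (ip && known.has(ip)) hits.add(ip);
  }
  return [...hits];
}

// Hypothetical log excerpt in the common log format.
const sampleLog = [
  '66.249.66.1 - - [17/Sep/2024:18:52:15 +0000] "GET / HTTP/1.1" 200 512',
  '203.0.113.9 - - [17/Sep/2024:18:52:16 +0000] "GET / HTTP/1.1" 200 512',
  '66.249.66.1 - - [17/Sep/2024:18:52:17 +0000] "GET /about HTTP/1.1" 200 256',
].join('\n');

console.log(matchingIps(sampleLog, ['66.249.66.1'])); // [ '66.249.66.1' ]
```

For a real log you would read the file with `fs.readFileSync(path, 'utf8')` or stream it line by line if it is large.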

攀登最高峰 2024-09-17 18:52:15

Based on __curious_geek's solution, here's the JavaScript version (it runs in the browser, so it checks the user agent of whoever is rendering the page):

    if (window.navigator.userAgent.match(/googlebot|google\.com\/bot\.html/i)) {
      // Yes, it's google bot.
    }

动次打次papapa 2024-09-17 18:52:15

To verify whether a web request is coming from Google's crawler, you can check whether the IP address falls within the IP ranges published by Google, which can be found here:

https://developers.google.com/search/apis/ipranges/googlebot.json

Alternatively, you can also do a reverse DNS lookup and check if the domain matches one of Google's domains.

Note: You can also check the User-Agent string, but because it can be spoofed it's wise to use one of the methods mentioned above instead.

You can use the NPM package crawl-bot-verifier to verify Google, Bing, Baidu, and many other crawlers. The library performs a DNS lookup, which is reliable, and it has a very nice API. You can find the package here:

https://www.npmjs.com/package/crawl-bot-verifier
