HEAD请求收到“403禁止”而 GET“200 ok”？

发布于 2024-09-13 14:12:45 字数 1524 浏览 6 评论 0 原文

几个月后，该网站从各大搜索引擎的搜索结果中消失，我终于找到了可能的原因。

我使用 WebBug 来调查服务器标头。查看请求是 HEAD 或 GET 的区别。

HEAD 发送数据：

HEAD / HTTP/1.1
Host: www.attu.it
Connection: close
Accept: */*
User-Agent: WebBug/5.0

HEAD 接收数据：

HTTP/1.1 403 Forbidden
Date: Tue, 10 Aug 2010 23:01:00 GMT
Server: Apache/2.2
Connection: close
Content-Type: text/html; charset=iso-8859-1

GET 发送数据：

GET / HTTP/1.1
Host: www.attu.it
Connection: close
Accept: */*
User-Agent: WebBug/5.0

GET 接收数据：

HTTP/1.1 200 OK
Date: Tue, 10 Aug 2010 23:06:15 GMT
Server: Apache/2.2
Last-Modified: Fri, 08 Jan 2010 08:58:01 GMT
ETag: "671f91b-2d2-47ca362815840"
Accept-Ranges: bytes
Content-Length: 722
Connection: close
Content-Type: text/html

// HTML code here

现在，浏览器默认发送数据GET 请求（至少 firebug 是这么说的）。爬虫是否有可能发送 HEAD 请求？如果是这样，为什么只有这台服务器响应 403，而我维护的其他站点的其他服务器却没有响应？

如果它很重要，.htaccess 中存在的唯一行是（除非我的客户更改了它，因为他们不想让我访问他们的服务器）

AddType text/x-component .htc

更新
谢谢@Ryk。 FireBug 和 Fiddler 都发送 GET 请求，得到 200（或 300）个响应。正如预期的那样。所以我猜这要么是服务器设置错误（尽管这很奇怪，因为托管来自一家拥有数百万客户的大公司），要么是他们放入 .htaccess 中的内容。他们必须让我查看他们的帐户。

我的问题的第二部分是，这是否可能是该网站未出现在任何搜索引擎中的原因（site:www.attu.it 没有给出任何结果）。有什么想法吗？

更新2
经过一番摆弄后，发现根目录中有 phpMyAdmin robots-blocking .htaccess，这导致来自 robots 的任何请求都被发送回 403 Forbidden

原文

after several months having the site disappear from search results in every major search engine, I finally found out a possible reason.

I used WebBug to investigate server header. See the difference if the request is HEAD or GET.

HEAD Sent data:

HEAD / HTTP/1.1
Host: www.attu.it
Connection: close
Accept: */*
User-Agent: WebBug/5.0

HEAD Received data:

HTTP/1.1 403 Forbidden
Date: Tue, 10 Aug 2010 23:01:00 GMT
Server: Apache/2.2
Connection: close
Content-Type: text/html; charset=iso-8859-1

GET Sent data:

GET / HTTP/1.1
Host: www.attu.it
Connection: close
Accept: */*
User-Agent: WebBug/5.0

GET Received data:

HTTP/1.1 200 OK
Date: Tue, 10 Aug 2010 23:06:15 GMT
Server: Apache/2.2
Last-Modified: Fri, 08 Jan 2010 08:58:01 GMT
ETag: "671f91b-2d2-47ca362815840"
Accept-Ranges: bytes
Content-Length: 722
Connection: close
Content-Type: text/html

// HTML code here

Now, browsers by default send a GET request (at least this is what firebug says). Is it possible that crawlers send a HEAD request instead? If so, why only this server responds with a 403, while other servers from other sites I'm mantaining do not?

In case it's important, the only line present in .htaccess is (unless my client changed it, as they don't want to give me access to their server)

AddType text/x-component .htc

UPDATE
Thanks @Ryk. FireBug and Fiddler both send GET requests, which get 200 (or 300) responses. As expected. So I guess it's either a server bad setting (even though it's strange as the hosting is from a major company with millions of clients) or something they put in the .htaccess. They will have to let me look into their account.

The second part of my question was if that could be the cause of the website not appearing in any search engine (site:www.attu.it gives no results). Any thought?

UPDATE 2
After some fiddling around, it turns out there was the phpMyAdmin robots-blocking .htaccess in the root directory, that caused any request from robots to be sent back with a 403 Forbidden