robots.txt with "User-Agent: Googlebot, Disallow: /" yet Google still indexes the site

Posted 2024-10-14 01:39:59

Look at the robots.txt of this site:

fr2.dk/robots.txt

The content is:

User-Agent: Googlebot
Disallow: /

That ought to tell Google not to index the site, no?

If so, why does the site still appear in Google searches?

Comments (3)

携君以终年 2024-10-21 01:39:59

Besides having to wait (Google's index updates take some time), note that if other sites link to your site, robots.txt alone won't be sufficient to remove it from the index.

Quoting Google's support page "Remove a page or site from Google's search results":

If the page still exists but you don't want it to appear in search results, use robots.txt to prevent Google from crawling it. Note that in general, even if a URL is disallowed by robots.txt we may still index the page if we find its URL on another site. However, Google won't index the page if it's blocked in robots.txt and there's an active removal request for the page.

One possible alternative solution is also mentioned in the above document:

Alternatively, you can use a noindex meta tag. When we see this tag on a page, Google will completely drop the page from our search results, even if other pages link to it. This is a good solution if you don't have direct access to the site server. (You will need to be able to edit the HTML source of the page).
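
For reference, the noindex tag that the quote mentions goes in the <head> of each page you want dropped from the index. The standard form is:

<meta name="robots" content="noindex">

or, to address only Google's crawler:

<meta name="googlebot" content="noindex">

Note that Googlebot has to be able to crawl the page in order to see the tag, so a page carrying noindex should not also be blocked in robots.txt.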

太阳公公是暖光 2024-10-21 01:39:59

I can confirm Google doesn't respect the Robots Exclusion File. Here's my file, which I created before putting this origin online:

https://git.habd.as/robots.txt

And the full contents of the file:

User-agent: *
Disallow:

User-agent: Google
Disallow: /

And Google still indexed it.

I stopped using Google after cancelling my account last March, and I never added this site to any webmaster console other than Yandex's, which leaves me with two assumptions:

  1. Google is scraping Yandex
  2. Google doesn't respect the Robots Exclusion Standard

I haven't grepped my logs yet, but I will, and my assumption is that I'll find Google's spiders misbehaving in there.
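
If you want to run the same log check, here is a minimal Python sketch; the log path is a hypothetical example and will need adjusting for your server and log format:

# Minimal sketch: print access-log lines whose user-agent mentions Googlebot.
# The log path below is an assumption; point it at your own server's access log.
from pathlib import Path

LOG_PATH = Path("/var/log/nginx/access.log")

for line in LOG_PATH.read_text(errors="replace").splitlines():
    if "googlebot" in line.lower():
        print(line)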

懒猫 2024-10-21 01:39:59

If you just added this, then you'll have to wait; it's not instantaneous. Until Googlebot comes back to re-spider the site and sees the robots.txt, the site will still be in their database.

I doubt it's relevant, but you might want to change your "User-Agent" to "User-agent". Google is most likely not case-sensitive here, but it can't hurt to follow the standard exactly.
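
As a quick sanity check of what a given robots.txt actually says for a particular user-agent, Python's standard-library urllib.robotparser can be used. This is only a sketch: it follows that library's own matching rules, which may not reproduce Google's exact group-selection behaviour, and the https scheme for the question's site is assumed.

# Minimal sketch: ask Python's robots.txt parser whether a crawler may fetch a URL.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://fr2.dk/robots.txt")  # the robots.txt from the question (scheme assumed)
rp.read()

for agent in ("Googlebot", "*"):
    print(agent, "allowed:", rp.can_fetch(agent, "https://fr2.dk/"))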
