We have a development server at dev.example.com that is being indexed by Google. We are using AWS Lightsail to duplicate the development server to our production environment in totality — the same robots.txt file is used on both dev.example.com and example.com.
Google's robots.txt documentation doesn't explicitly state whether root domains can be defined. Can I implement domain-specific rules in the robots.txt file? For example, is this acceptable:
User-agent: *
Disallow: https://dev.example.com/
User-agent: *
Allow: https://example.com/
Sitemap: https://example.com/sitemap.xml
To add, this can be resolved through the .htaccess rewrite engine; my question is specifically about robots.txt.
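For reference, here is a minimal sketch of that .htaccess approach, assuming Apache with mod_rewrite enabled and both hosts sharing one document root; the robots-dev.txt filename is hypothetical:

# Hypothetical: serve a separate robots file on the dev host only
RewriteEngine On
RewriteCond %{HTTP_HOST} ^dev\.example\.com$ [NC]
RewriteRule ^robots\.txt$ /robots-dev.txt [L]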
No, you can't specify a domain in robots.txt. Disallow: https://dev.example.com/ is not valid. Page 6 of the robots.txt exclusion standard says that a disallow line should contain a "path" as opposed to a full URL including the domain.

Each host name (domain or subdomain) has its own robots.txt file. So to prevent Googlebot from crawling https://dev.example.com/ you would need to serve https://dev.example.com/robots.txt with the content:
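For example, a standard blanket block (the exact directives here are illustrative, not quoted from a specific source):

User-agent: *
Disallow: /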
At the same time you would need to serve a different file from https://example.com/, perhaps with the content:
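For example, an allow-everything file that also references the sitemap from the question (again illustrative):

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml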
If the same code base powers both your dev and production servers, you will need to conditionalize the content of robots.txt based on whether it is running in production or not.
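Sketched very roughly in TypeScript (the helper name and the boolean flag are assumptions; any server-side language can do the same):

// Hypothetical helper: choose the robots.txt body by environment.
function robotsTxt(isProduction: boolean): string {
  return isProduction
    ? "User-agent: *\nDisallow:\n\nSitemap: https://example.com/sitemap.xml\n"
    : "User-agent: *\nDisallow: /\n";
}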
Alternately, you could allow Googlebot to crawl both, but include <link rel="canonical" href="..."> tags in every page that point to the URL for the page on the live site. See How to use rel='canonical' properly.
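On a dev page, such a tag might look like this (the page path is purely illustrative):

<link rel="canonical" href="https://example.com/some-page/">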
Listing full domains in robots.txt is not allowed, according to Google's Create a robots.txt documentation.
If you are using Express (Node.js), I solved it by checking the request's Host header and responding with a disallow-all robots.txt.
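A minimal sketch of that approach, assuming Express 4+ and the hostnames from the question (the route handler below is mine, not the poster's exact code):

import express from "express";

const app = express();

// Hypothetical: block crawlers on the dev host, allow them on production.
app.get("/robots.txt", (req, res) => {
  const host = req.hostname; // derived from the Host header
  const body =
    host === "dev.example.com"
      ? "User-agent: *\nDisallow: /\n"
      : "User-agent: *\nDisallow:\n\nSitemap: https://example.com/sitemap.xml\n";
  res.type("text/plain").send(body);
});

app.listen(3000);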