We have a development server at dev.example.com that is being indexed by Google. We are using AWS Lightsail to duplicate the development server to our production environment in totality — the same robots.txt file is used on both dev.example.com and example.com.
Google's robots.txt documentation doesn't explicitly state whether root domains can be defined. Can I implement domain-specific rules in the robots.txt file? For example, is this acceptable:
User-agent: *
Disallow: https://dev.example.com/
User-agent: *
Allow: https://example.com/
Sitemap: https://example.com/sitemap.xml
To add, this can be resolved through the .htaccess rewrite engine; my question is specifically about robots.txt.
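For reference, here is a minimal sketch of that .htaccess approach, assuming Apache with mod_rewrite enabled and both hosts sharing one document root; the robots-dev.txt filename is hypothetical:

# Hypothetical: serve a separate robots file on the dev host only
RewriteEngine On
RewriteCond %{HTTP_HOST} ^dev\.example\.com$ [NC]
RewriteRule ^robots\.txt$ /robots-dev.txt [L]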
No, you can't specify a domain in robots.txt. Disallow: https://dev.example.com/ is not valid. Page 6 of the robots.txt exclusion standard says that a disallow line should contain a "path" as opposed to a full URL including the domain.

Each host name (domain or subdomain) has its own robots.txt file. So to prevent Googlebot from crawling https://dev.example.com/ you would need to serve https://dev.example.com/robots.txt with the content:
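For example, a standard blanket block (the exact directives here are illustrative, not quoted from a specific source):

User-agent: *
Disallow: /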
At the same time you would need to serve a different file from https://example.com/, perhaps with the content:
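For example, an allow-everything file that also references the sitemap from the question (again illustrative):

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml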
If the same code base powers both your dev and production servers, you will need to conditionalize the content of robots.txt based on whether it is running in production or not.
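Sketched very roughly in TypeScript (the helper name and the boolean flag are assumptions; any server-side language can do the same):

// Hypothetical helper: choose the robots.txt body by environment.
function robotsTxt(isProduction: boolean): string {
  return isProduction
    ? "User-agent: *\nDisallow:\n\nSitemap: https://example.com/sitemap.xml\n"
    : "User-agent: *\nDisallow: /\n";
}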
Alternately, you could allow Googlebot to crawl both, but include <link rel="canonical" href="..."> tags in every page that point to the URL for the page on the live site. See How to use rel='canonical' properly.
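On a dev page, such a tag might look like this (the page path is purely illustrative):

<link rel="canonical" href="https://example.com/some-page/">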
Listing full domains in robots.txt is not allowed, according to Google's Create a robots.txt documentation.
If you are using Express (Node.js), I solved it by checking the request's Host header and responding with a disallow-all robots.txt.
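A minimal sketch of that approach, assuming Express 4+ and the hostnames from the question (the route handler below is mine, not the poster's exact code):

import express from "express";

const app = express();

// Hypothetical: block crawlers on the dev host, allow them on production.
app.get("/robots.txt", (req, res) => {
  const host = req.hostname; // derived from the Host header
  const body =
    host === "dev.example.com"
      ? "User-agent: *\nDisallow: /\n"
      : "User-agent: *\nDisallow:\n\nSitemap: https://example.com/sitemap.xml\n";
  res.type("text/plain").send(body);
});

app.listen(3000);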