Should I use a trailing slash when disallowing a directory in robots.txt?
I want to disallow crawling of the directory /acct in robots.txt. Which rule should I use?
Disallow: /acct
or
Disallow: /acct/
The acct directory contains both sub-directories and files. What is the effect of the trailing slash?
Since robots.txt rules are all "starts with" rules, both of your proposed rules would disallow the following:
https://example.com/acct/
https://example.com/acct/foo
https://example.com/acct/bar
However, the following would only be disallowed by the rule without the trailing slash:
https://example.com/acct
https://example.com/acct.html
https://example.com/acctbar
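The two lists above can be checked with a short script. This is only a sketch that uses Python's standard urllib.robotparser as a stand-in for a crawler; real bots may match rules somewhat differently. The helper names build_parser and blocked are made up for this illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(rule: str) -> RobotFileParser:
    """Parse a minimal robots.txt holding a single rule for all bots."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", rule])
    return parser


def blocked(parser: RobotFileParser, url: str) -> bool:
    """True if the parsed rules disallow fetching the URL."""
    return not parser.can_fetch("*", url)


# The URLs discussed in the answer.
URLS = [
    "https://example.com/acct",
    "https://example.com/acct/",
    "https://example.com/acct/foo",
    "https://example.com/acct/bar",
    "https://example.com/acct.html",
    "https://example.com/acctbar",
]

without_slash = build_parser("Disallow: /acct")
with_slash = build_parser("Disallow: /acct/")

if __name__ == "__main__":
    for url in URLS:
        # Columns: URL, blocked by "/acct", blocked by "/acct/"
        print(url, blocked(without_slash, url), blocked(with_slash, url))
```

Running it prints, for each URL, whether it is blocked by the rule without the slash and by the rule with the slash.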
Disallow: /acct/ is usually better because there is no risk of disallowing unexpected URLs. However, it does NOT prevent crawling of /acct.
In most cases, web servers redirect directory URLs without a trailing slash to add the trailing slash. It is likely that on your server, https://example.com/acct redirects to https://example.com/acct/. If that is the case, it is usually fine to allow bots to crawl /acct with no trailing slash and see the redirect. They would be blocked from crawling the target of the redirect.
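To make the redirect case concrete, here is a small simulation. It is a sketch under stated assumptions: the REDIRECTS table is a hypothetical stand-in for your server's behavior, and is_disallowed and crawl are illustrative helpers, not part of any real crawler.

```python
# Hypothetical server behavior: /acct redirects to /acct/ (an assumption).
REDIRECTS = {"/acct": "/acct/"}


def is_disallowed(path: str, rule_prefix: str) -> bool:
    """robots.txt rules are "starts with" rules, so a prefix test is enough here."""
    return path.startswith(rule_prefix)


def crawl(path: str, rule_prefix: str) -> list[tuple[str, str]]:
    """Follow redirects from path, checking the rule before every fetch."""
    steps = []
    seen = set()
    while True:
        if is_disallowed(path, rule_prefix):
            steps.append((path, "blocked by robots.txt"))
            return steps
        if path in REDIRECTS and path not in seen:
            seen.add(path)  # guard against redirect loops
            steps.append((path, "fetched, got redirect"))
            path = REDIRECTS[path]
            continue
        steps.append((path, "fetched"))
        return steps


if __name__ == "__main__":
    print(crawl("/acct", "/acct/"))  # rule with the trailing slash
    print(crawl("/acct", "/acct"))   # rule without the trailing slash
```

With Disallow: /acct/ the simulated bot fetches /acct, sees the redirect, and stops at /acct/. With Disallow: /acct it never requests /acct at all.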