There's nothing that will work for all crawlers. There are two options that might be useful to you.
Crawlers that support wildcards should accept something like:
Disallow: /*/
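In context, a minimal robots.txt using this rule might look like the following sketch, assuming you want it to apply to all crawlers:

User-agent: *
# Block any URL whose path contains a second slash,
# i.e. anything inside a directory
Disallow: /*/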
The major search engine crawlers understand the wildcards, but unfortunately most of the smaller ones don't.
If you have relatively few files in the root and you don't often add new files, you could use Allow to grant access to just those files, and then use Disallow: / to restrict everything else. That is, with hypothetical filenames standing in for your own:
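User-agent: *
# hypothetical root files; list whatever actually lives in your root
Allow: /index.html
Allow: /about.html
Allow: /sitemap.xml
# everything else is off-limits
Disallow: /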
The order here is important. Crawlers are supposed to take the first match. So if your first rule was Disallow: /, a properly behaving crawler wouldn't get to the following Allow lines.
If a crawler doesn't support Allow, then it's going to see the Disallow: / and not crawl anything on your site. Provided, of course, that it ignores things in robots.txt that it doesn't understand.
All the major search engine crawlers support Allow, and a lot of the smaller ones do, too. It's easy to implement.
In short, no: there is no way to do this cleanly using the robots.txt standard. Remember that Disallow specifies a path prefix; wildcards and Allow are non-standard.
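To make the prefix behavior concrete, here is a sketch of how a strictly standard crawler matches a Disallow rule (the path is hypothetical):

User-agent: *
Disallow: /private
# As a prefix rule, this blocks /private, /private.html,
# and /private/anything -- but not /public/private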
So the following approach (a kludge!) will work.
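One kludge consistent with prefix-only matching is to enumerate every top-level directory by hand; a sketch with hypothetical directory names:

User-agent: *
# hypothetical directory names; every top-level directory must be listed explicitly
Disallow: /images/
Disallow: /scripts/
Disallow: /styles/
# root files such as /index.html remain crawlable because no rule prefixes them

The obvious drawback is maintenance: every new directory has to be added to the list, which is exactly what makes it a kludge.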