The best (currently accessible) source on robots.txt I could find is on Wikipedia. (The supposedly definitive source is http://www.robotstxt.org, but that site is down at the moment.)
According to the Wikipedia page, the standard defines just two fields: User-agent: and Disallow:. The Disallow: field does not allow explicit wildcards; each "disallowed" path is actually a path prefix, i.e. it matches any path that starts with the specified value.
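For example, a minimal robots.txt using only those two fields might look like this (the /app path is just an illustration, not something from your site). Because the value is treated as a prefix, this single Disallow line blocks /app, /app/search, /app?id=5, and any other URL that starts with /app:

    User-agent: *
    Disallow: /app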
The Allow: field is a non-standard extension, and any support for explicit wildcards in Disallow would be a non-standard extension. If you use these, you have no right to expect that a (legitimate) web crawler will understand them.
This is not a matter of crawlers being "smart" or "dumb": it is all about standards compliance and interoperability. For example, any web crawler that did "smart" things with explicit wildcard characters in a "Disallow:" would be bad for (hypothetical) robots.txt files where those characters were intended to be interpreted literally.
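To make that concrete, a rule like the one below relies on the non-standard wildcard extension; a strictly standard-compliant crawler would read it as the literal prefix "/*?", which matches essentially nothing, and would happily crawl your parameterised URLs anyway (the pattern here is only an illustration):

    User-agent: *
    Disallow: /*?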
As Paul said, a lot of robots.txt interpreters are not too bright and might not interpret wildcards in the path the way you intend.
That said, some crawlers try to skip dynamic pages on their own, worried that they might get caught in infinite loops on links with varying URLs. I am assuming you are asking this question because you are facing a bold crawler that is trying hard to access those dynamic paths.
If you have issues with specific crawlers, you can try to investigate how a given crawler handles robots.txt (by looking up which directives it claims to support) and then add a robots.txt section specifically for it.
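For example, assuming the troublesome crawler identifies itself as SomeBot (a made-up name here), you could give it a section of its own that is stricter than the rules for everybody else:

    User-agent: SomeBot
    Disallow: /

    User-agent: *
    Disallow: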
If you generally just want to disallow such access to your dynamic pages, you might want to rethink your robots.txt design.
More often than not, dynamic parameter-handling "pages" live under a specific directory or a specific set of directories. This is why it is usually enough to simply Disallow: /cgi-bin or /app and be done with it.
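For example (reusing the directory names from the sentence above, which may of course differ on your site):

    User-agent: *
    Disallow: /cgi-bin
    Disallow: /app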
In your case you seem to have mapped the root to an area that handles parameters. You might want to reverse the logic of robots.txt and say something like:
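Something along these lines, where /index.html and /static/ are just placeholders for the content you actually want indexed:

    User-agent: *
    Allow: /index.html
    Allow: /static/
    Disallow: /

Listing the Allow lines before the blanket Disallow: / helps crawlers that stop at the first matching rule, while crawlers that pick the most specific match behave the same either way. Keep in mind that Allow: is the non-standard extension mentioned above, so this only works with crawlers that support it.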
This way your Allow list will override your Disallow list by listing specifically what crawlers should index. Note that not all crawlers are created equal, and you may want to refine that robots.txt later, adding a specific section for any crawler that still misbehaves.
The answer to your question is to use