Regular expression for robots.txt: disallow something within a directory, but not the directory itself
I'm using WordPress with custom permalinks, and I want to disallow my posts but leave my category pages accessible to spiders. Here are some examples of what the URLs look like:
Category page: somesite.com/2010/category-name/
Post: somesite.com/2010/category-name/product-name/
So, I'm curious if there is some type of regex solution to leave the page at /category-name/ allowed while disallowing anything one level deeper (the second example).
Any ideas? Thanks! :)
2 Answers
Some information that might help.
There is no official standards body or RFC for the robots.txt protocol. It was created by consensus in June 1994 by members of the robots mailing list. The parts of a site that should not be accessed are listed in a file called robots.txt in the top-level directory of the website. robots.txt patterns are matched by simple substring comparison, so take care that patterns meant to match directories have a final '/' appended; otherwise every file whose name starts with that substring will match, not just the files in the intended directory.
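To illustrate the substring rule with the asker's paths (this shows matching behavior only, not the final goal of keeping the category page crawlable):

    User-agent: *
    # Without a trailing slash this is a plain substring match, so it would
    # also block e.g. /2010/category-names.html (hypothetical path):
    # Disallow: /2010/category-name
    # With the trailing slash, only URLs inside the directory match:
    Disallow: /2010/category-name/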
There’s no 100% sure way to exclude your pages from being found, other than not to publish them at all, of course.
See:
http://www.robotstxt.org/robotstxt.html
There is no Allow field in the consensus, and a regex option is not in the consensus either.
From the Robots Consensus:
This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:
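The consensus document's example, as given in the robotstxt.org FAQ (a user site "~joe" with a quarantine directory "stuff"):

    User-agent: *
    Disallow: /~joe/stuff/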
Alternatively you can explicitly disallow all disallowed pages:
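And the explicit variant from the same FAQ (the file names are the FAQ's own examples):

    User-agent: *
    Disallow: /~joe/junk.html
    Disallow: /~joe/foo.html
    Disallow: /~joe/bar.html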
A Possible Solution:
Use .htaccess to deny search robots access to a specific folder, while also blocking bad robots.
See: http://www.askapache.com/htaccess/setenvif.html
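A minimal sketch of that approach, assuming Apache 2.2-style access control (mod_setenvif with Order/Allow/Deny); the bot names here are placeholder examples, and the file sits inside the folder you want to protect:

    # .htaccess in the folder you want to keep robots out of
    SetEnvIfNoCase User-Agent "Googlebot|Slurp|msnbot" search_bot
    SetEnvIfNoCase User-Agent "BadBot|EvilScraper" bad_bot
    Order Allow,Deny
    Allow from all
    Deny from env=search_bot
    Deny from env=bad_bot

Unlike robots.txt, this refuses the request at the server, so it also works against robots that ignore the exclusion protocol.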
Would the following do the trick? You might need to explicitly allow certain folders under /2010/category-name/ (see the sketch below). But according to this article, the Allow field is not within the standard, so some crawlers might not support it.
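The code snippet was stripped from this answer; presumably it was along these lines. A sketch assuming a crawler that honors Allow and the $ end-of-URL anchor (Google-style extensions, not part of the 1994 consensus):

    User-agent: *
    # Allow the category index page itself; $ anchors the end of the URL
    Allow: /2010/category-name/$
    # Disallow everything beneath the category directory
    Disallow: /2010/category-name/

With Google's longest-match precedence, the more specific Allow line wins for the category URL itself, while posts one level deeper stay disallowed.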
EDIT: I just found another resource to be used within each page. This page explains it well:
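The linked page isn't preserved here, but the per-page mechanism described is presumably the robots meta tag, placed in the <head> of each post you want kept out of the index:

    <meta name="robots" content="noindex, follow">

noindex asks compliant crawlers not to index the page, while follow still lets them crawl the links it contains.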