robots.txt for a WordPress blog (disallow /blog/page/... but allow crawling all posts linked there?)
I have a very naive question that I can't find an answer to.
I have a WordPress blog.
All posts are listed across several pages, e.g.
mydomain.com/blog/
mydomain.com/blog/page/2/
...
mydomain.com/blog/page/N/
I don't want a crawler to "remember" which posts were on a particular page, but I do want it to crawl all the posts linked from each "/page/". Will it still be able to follow and crawl links on pages I disallow with
Disallow: /blog/page/ ?
Or, put another way: how do I disallow crawling of which posts are on a particular page, but still let the crawler reach all of the posts?
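For reference, the rule I have in mind would live in a robots.txt file at the site root, something like the sketch below (the User-agent: * line is my assumption that it should apply to all crawlers):

    # mydomain.com/robots.txt
    User-agent: *
    Disallow: /blog/page/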
You can't do that with robots.txt. Your sample Disallow line would tell the crawler, "don't ever request a URL that starts with /blog/page/", so it would never fetch those pages and never see the post links on them.
What you probably want to do instead is add a "noindex" robots meta tag to all of your /page/ pages. That tells Google "don't index these pages," but still allows the bot to crawl them and pick up the links to the individual blog entries.
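As a minimal sketch of one way to do that in WordPress (assuming you can edit your theme; the function name is mine, but is_paged(), add_action(), and the wp_head hook are standard WordPress APIs, and is_paged() is true on page 2 and beyond of a paginated listing):

    // Hypothetical snippet for a theme's functions.php.
    // Prints a "noindex, follow" robots meta tag on paginated
    // archive pages (/blog/page/2/, /blog/page/3/, ...), so
    // crawlers still follow the post links but don't index
    // the listing pages themselves.
    function my_noindex_paged_archives() {
        if ( is_paged() ) {
            echo '<meta name="robots" content="noindex, follow">' . "\n";
        }
    }
    add_action( 'wp_head', 'my_noindex_paged_archives' );

The "follow" token just makes the "still follow the links" behavior explicit (it is the default anyway). Depending on your setup, an SEO plugin may also expose this as a setting, so a code change might not be necessary.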