Stop Google robots from finding URLs containing specific words
My client has a load of pages which they don't want indexed by Google - they are all called
http://example.com/page-xxx
so they are /page-123 or /page-2 or /page-25 etc.
Is there a way to stop Google indexing any page that starts with /page-xxx using robots.txt?
Would something like this work?
Disallow: /page-*
Thanks
3 Answers
In the first place, a line that says
Disallow: /post-*
isn't going to do anything to prevent crawling of pages of the form "/page-xxx". Did you mean to put "page" in your Disallow line, rather than "post"? Disallow says, in essence, "disallow urls that start with this text". So your example line will disallow any url that starts with "/post-". (That is, the file is in the root directory and its name starts with "post-".) The asterisk in this case is superfluous, as it's implied.
Your question is unclear as to where the pages are. If they're all in the root directory, then a simple
Disallow: /page-
will work. If they're scattered across directories in many different places, then things are a bit more difficult. As @user728345 pointed out, the easiest way (from a robots.txt standpoint) to handle this is to gather all of the pages you don't want crawled into one directory, and disallow access to that. But I understand if you can't move all those pages.
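Put together, a minimal sketch of the root-directory case might look like this (the User-agent line is required for a group of rules to apply; "*" here means all crawlers):
User-agent: *
# prefix match: blocks /page-123, /page-2, /page-25, etc.
Disallow: /page-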
For Googlebot specifically, and other bots that support the same wildcard semantics (there are a surprising number of them, including mine), the following should work:
Disallow: /*page-
That will match anything that contains "page-" anywhere. However, that will also block something like "/test/thispage-123.html". If you want to prevent that, then I think (I'm not sure, as I haven't tried it) that this will work:
Disallow: */page-
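As a sketch of how the two approaches could be combined: a crawler picks the single most specific User-agent group that matches it and ignores the rest, so a Googlebot-specific group has to carry every rule Googlebot should follow:
User-agent: *
# plain prefix rule for crawlers that don't understand wildcards
Disallow: /page-
User-agent: Googlebot
# Googlebot obeys only this group, so the wildcard rule goes here
Disallow: /*page-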
It looks like the * will work as a Google wildcard, so your answer will keep Google from crawling; however, wildcards are not supported by other spiders. You can search Google for "robots.txt wildcards" for more info, and see http://seogadget.co.uk/wildcards-in-robots-txt/ for more information.
Then I pulled this from Google's documentation:
Note: From what I read, this is a Google-only approach. Officially, no wildcards are allowed in robots.txt for Disallow.
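One way to see the difference is to run the rules through Python's standard-library parser, which follows the classic robots.txt spec (plain prefix matching, no wildcards) and so behaves like the "other spiders" mentioned above. A minimal sketch, using the example URLs from this thread:
import urllib.robotparser

# the two rules discussed in this thread
rules = """
User-agent: *
Disallow: /page-
Disallow: /*page-
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# the plain prefix rule is honored by a classic-spec parser
print(rp.can_fetch("SomeBot", "http://example.com/page-123"))
# -> False (blocked)

# the wildcard rule is read as a literal "/*page-" prefix, so this
# URL is not blocked by a parser that ignores wildcard semantics
print(rp.can_fetch("SomeBot", "http://example.com/test/thispage-123.html"))
# -> True (allowed)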
You could put all the pages that you don't want visited in a folder and then use Disallow to tell bots not to visit pages in that folder.
Disallow: /private/
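A minimal sketch of the complete file for that approach (the folder name /private/ is just this answer's example; the pages would have to be moved into it):
User-agent: *
# blocks everything under /private/, e.g. /private/page-123
Disallow: /private/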
I don't know very much about robots.txt, so I'm not sure how to use wildcards like that.
Here, it says "you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines."
http://www.robotstxt.org/faq/robotstxt.html