Stop Google robots from finding URLs containing a specific word

Posted on 2024-11-26 13:40:25

My client has a load of pages which they don't want indexed by Google - they are all called

http://example.com/page-xxx

so they are /page-123 or /page-2 or /page-25 etc

Is there a way to stop Google indexing any page that starts with /page-xxx using robots.txt?

Would something like this work?

Disallow: /page-*

Thanks

Comments (3)

拍不死你 2024-12-03 13:40:25

In the first place, a line that says Disallow: /post-* isn't going to do anything to prevent crawling of pages of the form "/page-xxx". Did you mean to put "page" in your Disallow line, rather than "post"?

Disallow says, in essence, "disallow urls that start with this text". So your example line will disallow any url that starts with "/post-". (That is, the file is in the root directory and its name starts with "post-".) The asterisk in this case is superfluous, as it's implied.

Your question is unclear as to where the pages are. If they're all in the root directory, then a simple Disallow: /page- will work. If they're scattered across directories in many different places, then things are a bit more difficult.

As @user728345 pointed out, the easiest way (from a robots.txt standpoint) to handle this is to gather all of the pages you don't want crawled into one directory, and disallow access to that. But I understand if you can't move all those pages.

For Googlebot specifically, and other bots that support the same wildcard semantics (there are a surprising number of them, including mine), the following should work:

Disallow: /*page-

That will match anything that contains "page-" anywhere. However, that will also block something like "/test/thispage-123.html". If you want to prevent that, then I think (I'm not sure, as I haven't tried it) that this will work:

Disallow: */page-
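
To make the matching described above concrete, here is a minimal sketch (not part of the original answer) that emulates the start-anchored, '*'-wildcard matching with a regular expression. The rule patterns and sample paths are made up for illustration.

import re

def googlebot_match(pattern, path):
    # Emulate Googlebot-style matching: patterns are anchored at the start
    # of the path, and '*' matches any run of characters.
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.match(regex, path) is not None

for rule in ["/post-*", "/page-", "/*page-"]:
    for path in ["/page-123", "/test/thispage-123.html"]:
        print(rule, "vs", path, "->", googlebot_match(rule, path))

Running that shows /post-* matching neither path, /page- matching only the root-level page, and /*page- also catching /test/thispage-123.html, as described above.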

A君 2024-12-03 13:40:25

It looks like the * will work as a Google wildcard, so your answer will keep Google from crawling; however, wildcards are not supported by other spiders. You can search Google for robots.txt wildcards for more info. I would look at http://seogadget.co.uk/wildcards-in-robots-txt/ for more information.

Then I pulled this from Google's documentation:

Pattern matching

Googlebot (but not all search engines) respects some pattern matching.

To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories that begin with private:

User-agent: Googlebot
Disallow: /private*/

To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):

User-agent: Googlebot
Disallow: /*?

To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:

User-agent: Googlebot
Disallow: /*.xls$

You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:

User-agent: *
Allow: /*?$
Disallow: /*?

The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).

The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).

Save your robots.txt file by downloading the file or copying the contents to a text file and saving as robots.txt. Save the file to the highest-level directory of your site. The robots.txt file must reside in the root of the domain and must be named "robots.txt". A robots.txt file located in a subdirectory isn't valid, as bots only check for this file in the root of the domain. For instance, http://www.example.com/robots.txt is a valid location, but http://www.example.com/mysite/robots.txt is not.

Note: From what I read this is a Google-only approach. Officially, no wildcards are allowed in robots.txt Disallow lines.
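
To see how the Allow/Disallow pair quoted above plays out, here is a rough sketch (mine, not from Google's documentation) that models one common description of the precedence rules: the longest matching pattern wins, and Allow is preferred on a tie. Google does not fully specify conflict resolution for wildcard rules, so treat this as an approximation; the sample paths are invented.

import re

def to_regex(pattern):
    # '*' matches any run of characters, a trailing '$' anchors the end
    return re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")

rules = [("allow", "/*?$"), ("disallow", "/*?")]

def allowed(path):
    matches = [(len(p), kind) for kind, p in rules if re.match(to_regex(p), path)]
    if not matches:
        return True  # no rule matches, so crawling is allowed
    # longest (most specific) pattern wins; Allow wins a tie of equal length
    matches.sort(key=lambda m: (m[0], m[1] == "allow"), reverse=True)
    return matches[0][1] == "allow"

for path in ["/catalog?sessionid=123", "/catalog?", "/catalog"]:
    print(path, "->", "crawl" if allowed(path) else "blocked")

With those two rules, /catalog?sessionid=123 is blocked, while /catalog? and /catalog remain crawlable, which is the behaviour the quoted documentation describes.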

送你一个梦 2024-12-03 13:40:25

You could put all the pages that you don't want to get visited in a folder and then use disallow to tell bots not to visit pages in that folder.

Disallow: /private/

I don't know very much about robots.txt, so I'm not sure how to use wildcards like that.
Here, it says "you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines."
http://www.robotstxt.org/faq/robotstxt.html
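
If you go the folder route, one way to sanity-check the rule (not part of the original answer) is Python's standard-library robots.txt parser. Note that it follows the original spec and does not understand wildcards, which is consistent with the robotstxt.org note above; the host name and paths are just examples.

from urllib import robotparser

rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "http://example.com/private/page-123"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/page-123"))          # True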
