Stop Google robots from finding URLs containing a specific word

Posted on 2024-11-26 13:40:25

My client has a load of pages which they don't want indexed by Google - they are all called

http://example.com/page-xxx

so they are /page-123 or /page-2 or /page-25 etc

Is there a way to stop Google indexing any page that starts with /page-xxx using robots.txt?

Would something like this work?

Disallow: /page-*

Thanks

Comments (3)

拍不死你 2024-12-03 13:40:25

In the first place, a line that says Disallow: /post-* isn't going to do anything to prevent crawling of pages of the form "/page-xxx". Did you mean to put "page" in your Disallow line, rather than "post"?

Disallow says, in essence, "disallow urls that start with this text". So your example line will disallow any url that starts with "/post-". (That is, the file is in the root directory and its name starts with "post-".) The asterisk in this case is superfluous, as it's implied.

Your question is unclear as to where the pages are. If they're all in the root directory, then a simple Disallow: /page- will work. If they're scattered across directories in many different places, then things are a bit more difficult.

As @user728345 pointed out, the easiest way (from a robots.txt standpoint) to handle this is to gather all of the pages you don't want crawled into one directory, and disallow access to that. But I understand if you can't move all those pages.

For Googlebot specifically, and other bots that support the same wildcard semantics (there are a surprising number of them, including mine), the following should work:

Disallow: /*page-

That will match anything that contains "page-" anywhere. However, that will also block something like "/test/thispage-123.html". If you want to prevent that, then I think (I'm not sure, as I haven't tried it) that this will work:

Disallow: */page-
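
To make the matching described above concrete, here is a minimal sketch (not part of the original answer) that emulates the start-anchored, '*'-wildcard matching with a regular expression. The rule patterns and sample paths are made up for illustration.

import re

def googlebot_match(pattern, path):
    # Emulate Googlebot-style matching: patterns are anchored at the start
    # of the path, and '*' matches any run of characters.
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.match(regex, path) is not None

for rule in ["/post-*", "/page-", "/*page-"]:
    for path in ["/page-123", "/test/thispage-123.html"]:
        print(rule, "vs", path, "->", googlebot_match(rule, path))

Running that shows /post-* matching neither path, /page- matching only the root-level page, and /*page- also catching /test/thispage-123.html, as described above.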

A君 2024-12-03 13:40:25

It looks like the * will work as a Google wildcard, so your answer will keep Google from crawling; however, wildcards are not supported by other spiders. You can search Google for robots.txt wildcards for more info. I would look at http://seogadget.co.uk/wildcards-in-robots-txt/ for more information.

Then I pulled this from Google's documentation:

Pattern matching

Googlebot (but not all search engines) respects some pattern matching.

To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories that begin with private:

User-agent: Googlebot
Disallow: /private*/

To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):

User-agent: Googlebot
Disallow: /*?

To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:

User-agent: Googlebot
Disallow: /*.xls$

You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:

User-agent: *
Allow: /*?$
Disallow: /*?

The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).

The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).

Save your robots.txt file by downloading the file or copying the contents to a text file and saving as robots.txt. Save the file to the highest-level directory of your site. The robots.txt file must reside in the root of the domain and must be named "robots.txt". A robots.txt file located in a subdirectory isn't valid, as bots only check for this file in the root of the domain. For instance, http://www.example.com/robots.txt is a valid location, but http://www.example.com/mysite/robots.txt is not.

Note: From what I read this is a Google-only approach. Officially, no wildcards are allowed in robots.txt Disallow lines.
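
To see how the Allow/Disallow pair quoted above plays out, here is a rough sketch (mine, not from Google's documentation) that models one common description of the precedence rules: the longest matching pattern wins, and Allow is preferred on a tie. Google does not fully specify conflict resolution for wildcard rules, so treat this as an approximation; the sample paths are invented.

import re

def to_regex(pattern):
    # '*' matches any run of characters, a trailing '$' anchors the end
    return re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")

rules = [("allow", "/*?$"), ("disallow", "/*?")]

def allowed(path):
    matches = [(len(p), kind) for kind, p in rules if re.match(to_regex(p), path)]
    if not matches:
        return True  # no rule matches, so crawling is allowed
    # longest (most specific) pattern wins; Allow wins a tie of equal length
    matches.sort(key=lambda m: (m[0], m[1] == "allow"), reverse=True)
    return matches[0][1] == "allow"

for path in ["/catalog?sessionid=123", "/catalog?", "/catalog"]:
    print(path, "->", "crawl" if allowed(path) else "blocked")

With those two rules, /catalog?sessionid=123 is blocked, while /catalog? and /catalog remain crawlable, which is the behaviour the quoted documentation describes.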

送你一个梦 2024-12-03 13:40:25

You could put all the pages that you don't want to get visited in a folder and then use disallow to tell bots not to visit pages in that folder.

Disallow: /private/

I don't know very much about robots.txt, so I'm not sure how to use wildcards like that.
Here, it says "you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines."
http://www.robotstxt.org/faq/robotstxt.html
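
If you go the folder route, one way to sanity-check the rule (not part of the original answer) is Python's standard-library robots.txt parser. Note that it follows the original spec and does not understand wildcards, which is consistent with the robotstxt.org note above; the host name and paths are just examples.

from urllib import robotparser

rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "http://example.com/private/page-123"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/page-123"))          # True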
