How to stop search engines from crawling the whole website?

Posted on 2025-01-01 07:32:44

I want to stop search engines from crawling my whole website.

I have a web application for members of a company to use. This is hosted on a web server so that the employees of the company can access it. No one else (the public) would need it or find it useful.

So I want to add another layer of security (in theory) to help prevent unauthorized access by blocking all search engine bots/crawlers from the site. Having Google index our site to make it searchable is pointless from a business perspective, and it just gives a hacker another way to find the website in the first place and try to attack it.

I know in the robots.txt you can tell search engines not to crawl certain directories.

Is it possible to tell bots not to crawl the whole site without having to list all the directories not to crawl?

Is this best done with robots.txt or is it better done by .htaccess or other?

Comments (4)

星軌x 2025-01-08 07:32:44

Using robots.txt to keep a site out of search engine indexes has one minor and little-known problem: if anyone ever links to your site from any page indexed by Google (which would have to happen for Google to find your site anyway, robots.txt or not), Google may still index the link and show it as part of their search results, even if you don't allow them to fetch the page the link points to.

If this might be a problem for you, the solution is to not use robots.txt, but instead to include a robots meta tag with the value noindex,nofollow on every page on your site. You can even do this in a .htaccess file using mod_headers and the X-Robots-Tag HTTP header:

Header set X-Robots-Tag noindex,nofollow

This directive will add the header X-Robots-Tag: noindex,nofollow to every page it applies to, including non-HTML pages like images. Of course, you may want to include the corresponding HTML meta tag too, just in case (it's an older standard, and so presumably more widely supported):

<meta name="robots" content="noindex,nofollow" />

Note that if you do this, Googlebot will still try to crawl any links it finds to your site, since it needs to fetch the page before it sees the header / meta tag. Of course, some might well consider this a feature instead of a bug, since it lets you look in your access logs to see if Google has found any links to your site.
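
To confirm that the header and meta tag are actually being served, a small check along these lines may help. This is only a sketch: the names `header_blocks_indexing` and `meta_blocks_indexing` are made up for illustration, and the header check deliberately ignores the user-agent-prefixed form of X-Robots-Tag (for example `googlebot: noindex`).

```python
from html.parser import HTMLParser


def _directives(value):
    """Split a comma-separated robots value into lowercase directives."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def _is_noindex(directives):
    # "none" is shorthand for "noindex, nofollow".
    return "noindex" in directives or "none" in directives


def header_blocks_indexing(headers):
    """Return True if an X-Robots-Tag response header asks for noindex.

    `headers` is a plain dict of response header names to values
    (hypothetical input; obtain it from whatever HTTP client you use).
    """
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and _is_noindex(_directives(value)):
            return True
    return False


class _RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            self.directives |= _directives(attributes.get("content") or "")


def meta_blocks_indexing(html):
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    return _is_noindex(parser.directives)
```

Feeding it the response headers and body of a few pages (including an image URL for the header check) should show quickly whether the `.htaccess` directive applies where you expect.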

In any case, whatever you do, keep in mind that it's hard to keep a "secret" site secret for very long. As time passes, the probability that one of your users will accidentally leak a link to the site approaches 100%, and if there's any reason to think that someone would be interested in finding the site, you should assume that they will. So make sure you also put proper access controls on your site, keep the software up to date, and run regular security checks on it.

十年九夏 2025-01-08 07:32:44

This is best handled with a robots.txt file, although that only works for bots that respect the file.

To block the whole site add this to robots.txt in the root directory of your site:

User-agent: *
Disallow: /
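
One way to sanity-check these two lines before deploying them is Python's standard `urllib.robotparser`, which applies the same rules a well-behaved crawler would. The sketch below is illustrative only: `is_allowed` is a made-up helper name and `intranet.example.com` is a placeholder host.

```python
from urllib.robotparser import RobotFileParser

# The rules from the answer above: block every path for every user agent.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "bingbot", "SomeOtherBot"):
        allowed = is_allowed(ROBOTS_TXT, agent, "https://intranet.example.com/app/login")
        print(agent, "allowed" if allowed else "blocked")
```

Every agent should come back as blocked. Crawlers that ignore robots.txt are unaffected, which is why the access rules below still matter.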

To limit access to your site for everyone else, .htaccess is better, but you would need to define access rules, by IP address for example.

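The .htaccess rules below deny everyone except requests coming from your company's IP address:

# Apache 2.2 syntax (on Apache 2.4, "Require ip 203.0.113.10" is the equivalent)
# With "deny,allow", Deny is evaluated first and a matching Allow overrides it
Order deny,allow
Deny from all
# Enter your company's IP address here (203.0.113.10 is a placeholder)
Allow from 203.0.113.10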
梦初启 2025-01-08 07:32:44

In addition to the other answers, you can stop search engines from crawling a specific page on your website with robots.txt. Below is an example:

User-agent: *
Disallow: /example-page/ 

The above example is especially handy when you have dynamic pages. Otherwise, you may want to add the HTML meta tag below to the specific pages you want kept out of search engines:

<meta name="robots" content="noindex, nofollow" />
笑看君怀她人 2025-01-08 07:32:44

If security is your concern, and locking down to IP addresses isn't viable, you should look into requiring your users to authenticate in some way to access your site.

That would mean that anyone (Google, a bot, a person who stumbled upon a link) who isn't authenticated wouldn't be able to access your pages.

You could bake it into your website itself, or use HTTP Basic Authentication.

https://www.httpwatch.com/httpgallery/authentication/
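
For a sense of what HTTP Basic Authentication puts on the wire, here is a minimal sketch of how a client builds the `Authorization` header. The function name `basic_auth_header` is made up for illustration; the credentials are the well-known example pair from the HTTP authentication RFCs.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of the Authorization header for HTTP Basic auth.

    The credentials are only base64-encoded, not encrypted, so anyone who
    can read the request can decode them.
    """
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"


if __name__ == "__main__":
    print(basic_auth_header("Aladdin", "open sesame"))
```

Because the encoding is trivially reversible, Basic Authentication is generally only reasonable over HTTPS.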
