How do I stop search engines from crawling my whole website?
I want to stop search engines from crawling my whole website.
I have a web application for members of a company to use. This is hosted on a web server so that the employees of the company can access it. No one else (the public) would need it or find it useful.
So I want to add another layer of security (in theory) to try to prevent unauthorized access by completely removing all search engine bot/crawler access to it. Having Google index our site to make it searchable is pointless from a business perspective and just gives a hacker another way to find the website in the first place and try to attack it.
I know that in robots.txt you can tell search engines not to crawl certain directories.
Is it possible to tell bots not to crawl the whole site without having to list all the directories not to crawl?
Is this best done with robots.txt, or is it better done with .htaccess or something else?
4 Answers
Using robots.txt to keep a site out of search engine indexes has one minor and little-known problem: if anyone ever links to your site from any page indexed by Google (which would have to happen for Google to find your site anyway, robots.txt or not), Google may still index the link and show it as part of their search results, even if you don't allow them to fetch the page the link points to.

If this might be a problem for you, the solution is to not use robots.txt, but instead to include a robots meta tag with the value noindex,nofollow on every page on your site. You can even do this in a .htaccess file using mod_headers and the X-Robots-Tag HTTP header:
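A minimal sketch, assuming Apache with mod_headers enabled:

    # Send the X-Robots-Tag header with every response
    Header set X-Robots-Tag "noindex,nofollow"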
This directive will add the header X-Robots-Tag: noindex,nofollow to every page it applies to, including non-HTML pages like images. Of course, you may want to include the corresponding HTML meta tag too, just in case (it's an older standard, and so presumably more widely supported):
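Something like this, placed in the <head> of each page:

    <meta name="robots" content="noindex,nofollow" />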
Note that if you do this, Googlebot will still try to crawl any links it finds to your site, since it needs to fetch the page before it sees the header / meta tag. Of course, some might well consider this a feature instead of a bug, since it lets you look in your access logs to see if Google has found any links to your site.

In any case, whatever you do, keep in mind that it's hard to keep a "secret" site secret very long. As time passes, the probability that one of your users will accidentally leak a link to the site approaches 100%, and if there's any reason to assume that someone would be interested in finding the site, you should assume that they will. Thus, make sure you also put proper access controls on your site, keep the software up to date, and run regular security checks on it.
It is best handled with a robots.txt file, for just the bots that respect the file.

To block the whole site, add this to robots.txt in the root directory of your site:
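A minimal sketch of the classic "block everything" rules:

    # All crawlers: do not crawl anything
    User-agent: *
    Disallow: /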
To limit access to your site for everyone else, .htaccess is better, but you would need to define access rules, by IP address for example.

Below are .htaccess rules to restrict everyone except people coming from your company IP:
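A sketch, assuming Apache 2.4+ (203.0.113.0/24 is a placeholder; substitute your company's actual address range):

    # Allow only the company network; everyone else gets 403 Forbidden
    # 203.0.113.0/24 is a placeholder range
    Require ip 203.0.113.0/24

On the older Apache 2.2 the equivalent would use the Order / Deny from all / Allow from directives instead.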
In addition to the provided answers, you can stop search engines from crawling/indexing a specific page on your website in robots.txt. Below is an example:
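A sketch, where /members/report.php is a hypothetical path; use whatever page you want hidden:

    # All crawlers: skip this one page
    User-agent: *
    Disallow: /members/report.php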
The above example is especially handy when you have dynamic pages; otherwise, you may want to add the below HTML meta tag on the specific pages you want disallowed from search engines:
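For instance, the standard robots meta tag:

    <meta name="robots" content="noindex" />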
If security is your concern, and locking down to IP addresses isn't viable, you should look into requiring your users to authenticate in some way to access your site.

That would mean that anyone (Google, a bot, a person who stumbled upon a link) who isn't authenticated wouldn't be able to access your pages.
You could bake it into your website itself, or use HTTP Basic Authentication.
https://www.httpwatch.com/httpgallery/authentication/
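A minimal .htaccess sketch of HTTP Basic Authentication on Apache (/path/to/.htpasswd is a placeholder; create that file with the htpasswd utility):

    # Prompt for a username/password on every request
    AuthType Basic
    AuthName "Company members only"
    AuthUserFile /path/to/.htpasswd
    Require valid-user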