Can I block search crawlers for every site on an Apache web server?

Posted 2024-07-07 22:19:06


I have somewhat of a staging server on the public internet running copies of the production code for a few websites. I'd really not like it if the staging sites get indexed.

Is there a way I can modify my httpd.conf on the staging server to block search engine crawlers?

Changing the robots.txt wouldn't really work since I use scripts to copy the same code base to both servers. Also, I would rather not change the virtual host conf files either as there is a bunch of sites and I don't want to have to remember to copy over a certain setting if I make a new site.

6 Answers

滥情哥ㄟ 2024-07-14 22:19:06

Create a robots.txt file with the following contents:

User-agent: *
Disallow: /

Put that file somewhere on your staging server; your document root is a good place for it (e.g. /var/www/html/robots.txt).

Add the following to your httpd.conf file:

# Exclude all robots
<Location "/robots.txt">
    SetHandler None
</Location>
Alias /robots.txt /path/to/robots.txt

The SetHandler directive is probably not required, but it might be needed if you're using a handler like mod_python, for example.

That robots.txt file will now be served for all virtual hosts on your server, overriding any robots.txt file you might have for individual hosts.

(Note: My answer is essentially the same thing that ceejayoz's answer is suggesting you do, but I had to spend a few extra minutes figuring out all the specifics to get it to work. I decided to put this answer here for the sake of others who might stumble upon this question.)

终难愈 2024-07-14 22:19:06

You can use Apache's mod_rewrite to do it. Let's assume that your real host is www.example.com and your staging host is staging.example.com. Create a file called 'robots-staging.txt' and conditionally rewrite requests for robots.txt to go to that file.

This example would be suitable for protecting a single staging site, a bit of a simpler use case than what you are asking for, but this has worked reliably for me:

<IfModule mod_rewrite.c>
  RewriteEngine on

  # Dissuade web spiders from crawling the staging site
  RewriteCond %{HTTP_HOST}  ^staging\.example\.com$
  RewriteRule ^robots\.txt$ robots-staging.txt [L]
</IfModule>

You could try to redirect the spiders to a master robots.txt on a different server, but some of the spiders may balk after they get anything other than a "200 OK" or "404 Not Found" return code from the HTTP request, and they may not read the redirected URL.

Here's how you would do that:

<IfModule mod_rewrite.c>
  RewriteEngine on

  # Redirect web spiders to a robots.txt file elsewhere (possibly unreliable)
  RewriteRule ^robots\.txt$ http://www.example.com/robots-staging.txt [R]
</IfModule>

娇柔作态 2024-07-14 22:19:06

Could you alias robots.txt on the staging virtualhosts to a restrictive robots.txt hosted in a different location?

离线来电— 2024-07-14 22:19:06

To truly stop pages from being indexed, you'll need to hide the sites behind HTTP auth. You can do this in your global Apache config and use a simple .htpasswd file.

The only downside to this is that you now have to type in a username/password the first time you browse to any page on the staging server.
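
A minimal sketch of what that could look like in the global Apache config, assuming Apache 2.4 and made-up file paths and realm name (adjust to your layout):

# Require a login for every request on every virtual host.
# Placed in the main server config so individual vhosts inherit it.
<Location "/">
    AuthType Basic
    AuthName "Staging - authorized users only"
    AuthUserFile /etc/apache2/.htpasswd-staging
    Require valid-user
</Location>

The password file can be created with the stock htpasswd tool, e.g. htpasswd -c /etc/apache2/.htpasswd-staging someuser. Requests without credentials get a 401, so crawlers never see the page content regardless of robots.txt.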

似狗非友 2024-07-14 22:19:06

根据您的部署场景,您应该寻找将不同 robots.txt 文件部署到 dev/stage/test/prod (或您拥有的任何组合)的方法。 假设您在不同的服务器上有不同的数据库配置文件或(或类似的内容),这应该遵循类似的过程(您的数据库有不同的密码,对吧?)

如果您没有一步部署过程到位,这可能是获得一个的良好动机...有大量适用于不同环境的工具 - Capistrano 是一个非常好的工具,并且在 Rails/Django 世界中受到青睐,但它是由没有意味着唯一。

如果做不到这一切,您可能可以在 Apache 配置中设置一个全局 Alias 指令,该指令将应用于所有虚拟主机并指向限制性的 robots.txt

Depending on your deployment scenario, you should look for ways to deploy different robots.txt files to dev/stage/test/prod (or whatever combination you have). Assuming you have different database config files (or whatever's analogous) on the different servers, this should follow a similar process (you do have different passwords for your databases, right?)

If you don't have a one-step deployment process in place, this is probably good motivation to get one... there are tons of tools out there for different environments - Capistrano is a pretty good one, and favored in the Rails/Django world, but is by no means the only one.

Failing all that, you could probably set up a global Alias directive in your Apache config that would apply to all virtualhosts and point to a restrictive robots.txt.
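
As a rough sketch of that last suggestion, assuming made-up paths and Apache 2.4 (where access to unlisted directories is denied by default), something like the following could go in the main server config, outside any <VirtualHost> block:

# One restrictive robots.txt served by every virtual host on this server
Alias /robots.txt /var/www/robots-staging/robots.txt

# Allow Apache to read the aliased directory
<Directory "/var/www/robots-staging">
    Require all granted
</Directory>

The file at that path would hold the usual "User-agent: *" / "Disallow: /" pair shown in the first answer.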

赏烟花じ飞满天 2024-07-14 22:19:06

Try using Apache to stop bad robots. You can get lists of crawler user agents online, or just allow known browsers rather than trying to block every bot individually.
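
A minimal sketch of user-agent based blocking, assuming Apache 2.4 and an illustrative (by no means complete) list of crawler names:

# Tag requests from self-identified crawlers (mod_setenvif), then deny them
SetEnvIfNoCase User-Agent "Googlebot"   bad_bot
SetEnvIfNoCase User-Agent "bingbot"     bad_bot
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot

<Location "/">
    <RequireAll>
        Require all granted
        Require not env bad_bot
    </RequireAll>
</Location>

Keep in mind this only deters crawlers that identify themselves honestly; it is not a substitute for the robots.txt or HTTP auth approaches above.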
