Can I block search crawlers for every site on my Apache web server?
I have somewhat of a staging server on the public internet running copies of the production code for a few websites. I'd really not like it if the staging sites get indexed.
Is there a way I can modify my httpd.conf on the staging server to block search engine crawlers?
Changing the robots.txt wouldn't really work since I use scripts to copy the same code base to both servers. Also, I would rather not change the virtual host conf files either as there is a bunch of sites and I don't want to have to remember to copy over a certain setting if I make a new site.
6 Answers
Create a robots.txt file with the following contents:
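Presumably the standard disallow-all file:

    # Keep every crawler out of the whole site
    User-agent: *
    Disallow: /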
Put that file somewhere on your staging server; your directory root is a great place for it (e.g. /var/www/html/robots.txt). Add the following to your httpd.conf file:
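Presumably something along these lines, i.e. a Location block plus an Alias (the path matches the example location above):

    # Serve a single shared robots.txt for every virtual host
    <Location "/robots.txt">
        SetHandler None
    </Location>
    Alias /robots.txt /var/www/html/robots.txt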
The SetHandler directive is probably not required, but it might be needed if you're using a handler like mod_python, for example. That robots.txt file will now be served for all virtual hosts on your server, overriding any robots.txt file you might have for individual hosts.
(Note: My answer is essentially the same thing that ceejayoz's answer is suggesting you do, but I had to spend a few extra minutes figuring out all the specifics to get it to work. I decided to put this answer here for the sake of others who might stumble upon this question.)
You can use Apache's mod_rewrite to do it. Let's assume that your real host is www.example.com and your staging host is staging.example.com. Create a file called 'robots-staging.txt' and conditionally rewrite the request to go to that.
This example would be suitable for protecting a single staging site, a bit of a simpler use case than what you are asking for, but this has worked reliably for me:
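A minimal sketch of that kind of rule, assuming the rewrite lives in the global server config and robots-staging.txt sits in the document root:

    <IfModule mod_rewrite.c>
        RewriteEngine On
        # On the staging host only, answer robots.txt requests with the restrictive file
        RewriteCond %{HTTP_HOST} ^staging\.example\.com$ [NC]
        RewriteRule ^/robots\.txt$ /robots-staging.txt [L]
    </IfModule>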
You could try to redirect the spiders to a master robots.txt on a different server, but some of the spiders may balk after they get anything other than a "200 OK" or "404 not found" return code from the HTTP request, and they may not read the redirected URL.
Here's how you would do that:
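A sketch of that redirect variant, assuming the master file is published on the production host:

    # Redirect crawlers on the staging host to a robots.txt hosted elsewhere
    # (assumes RewriteEngine On, as in the block above)
    RewriteCond %{HTTP_HOST} ^staging\.example\.com$ [NC]
    RewriteRule ^/robots\.txt$ http://www.example.com/robots-staging.txt [R=302,L]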
Could you alias robots.txt on the staging virtualhosts to a restrictive robots.txt hosted in a different location?
To truly stop pages from being indexed, you'll need to hide the sites behind HTTP auth. You can do this in your global Apache config and use a simple .htpasswd file.
Only downside to this is you now have to type in a username/password the first time you browse to any pages on the staging server.
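A sketch of what that global config could look like; the .htpasswd path here is just an assumption:

    # Password-protect everything this Apache instance serves
    # (create the credentials first, e.g.: htpasswd -c /etc/apache2/.htpasswd someuser)
    <Location "/">
        AuthType Basic
        AuthName "Staging"
        AuthUserFile /etc/apache2/.htpasswd
        Require valid-user
    </Location>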
Depending on your deployment scenario, you should look for ways to deploy different robots.txt files to dev/stage/test/prod (or whatever combination you have). Assuming you have different database config files (or whatever's analogous) on the different servers, this should follow a similar process (you do have different passwords for your databases, right?)
If you don't have a one-step deployment process in place, this is probably good motivation to get one... there are tons of tools out there for different environments - Capistrano is a pretty good one, and favored in the Rails/Django world, but is by no means the only one.
Failing all that, you could probably set up a global Alias directive in your Apache config that would apply to all virtualhosts and point to a restrictive robots.txt.
Try using Apache itself to stop bad robots by matching on their User-Agent strings. You can find lists of crawler user agents online, or just allow known browsers rather than trying to block every bot.
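A sketch of that idea with mod_rewrite; the crawler pattern below is only an illustrative assumption:

    # Return 403 Forbidden to requests whose User-Agent matches known crawlers
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|slurp|baiduspider) [NC]
    RewriteRule .* - [F,L]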