How do I prevent crawling of my origin servers while still propagating the correct robots.txt?
I've come across a rather unique issue. If you deal with scaling large sites and work with a company like Akamai, you have origin servers that Akamai talks to. Whatever you serve to Akamai, they will propagate across their CDN.
But how do you handle robots.txt? You don't want Google to crawl your origin. That can be a HUGE security issue. Think denial of service attacks.
But if you serve a robots.txt on your origin with "disallow", then your entire site will be uncrawlable!
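For concreteness, by "disallow" I mean the standard deny-everything robots.txt, served from the origin, something like:

User-agent: *
Disallow: /

Once Akamai picks that up and propagates it verbatim, every well-behaved crawler treats the whole public site as off-limits.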
The only solution I can think of is to serve a different robots.txt to Akamai and to the world. Disallow to the world, but allow to Akamai. But this is very hacky and prone to so many issues that I cringe thinking about it.
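Roughly, that hack would be something like the following on the origin (the client range and the robots-disallow.txt filename are placeholders I made up), which is exactly the kind of conditional I don't want to maintain:

RewriteEngine On
# If the requester is not the CDN (placeholder range), serve the deny-everything file instead
RewriteCond %{REMOTE_ADDR} !^192\.0\.2\.
RewriteRule ^/?robots\.txt$ /robots-disallow.txt [L]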
(Of course, origin servers shouldn't be viewable to the public, but I'd venture to say most are for practical reasons...)
This seems like an issue the protocol should handle better. Or perhaps search engines could allow a site-specific, hidden robots.txt in their webmaster tools...
Thoughts?
1 Answer
If you really want your origins not to be public, use a firewall / access control to restrict access from any host other than Akamai - it's the best way to avoid mistakes and it's the only way to stop the bots & attackers who simply scan public IP ranges looking for webservers.
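A rough sketch of the access-control half in Apache 2.4 (the CIDR ranges below are documentation placeholders, not Akamai's real list - get the current edge ranges from Akamai and keep them updated):

# Origin vhost: only the CDN's edge ranges (placeholders here) may fetch anything
<Location "/">
    Require ip 192.0.2.0/24 198.51.100.0/24
</Location>

Doing the same thing at the network layer, with a plain firewall rule in front of the origin, is even harder to get wrong.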
That said, if all you want is to avoid non-malicious spiders, consider using a redirect on your origin server which sends any request that doesn't have a Host header specifying your public hostname to the official name. You generally want something like that anyway to avoid confusion or search-rank dilution if you have variations of the canonical hostname. With Apache this could use mod_rewrite or even a simple virtualhost setup where the default server has RedirectPermanent / http://canonicalname.example.com/ (a minimal sketch is below). If you do use this approach, you could either simply add the production name to your test systems' hosts file when necessary, or create and whitelist an internal-only hostname (e.g. cdn-bypass.mycorp.com) so you can access the origin directly when you need to.
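A minimal sketch of that virtualhost setup, assuming Apache 2.4 with name-based vhosts (the DocumentRoot and the catch-all ServerName are placeholders):

# Default vhost: any request whose Host header doesn't match a named vhost
# lands here and is bounced to the canonical public name.
<VirtualHost *:80>
    ServerName catchall.invalid
    RedirectPermanent / http://canonicalname.example.com/
</VirtualHost>

# Requests arriving with the expected Host header (via Akamai, or via the
# internal cdn-bypass name if you whitelist one) are served normally.
<VirtualHost *:80>
    ServerName canonicalname.example.com
    ServerAlias cdn-bypass.mycorp.com
    DocumentRoot /var/www/site
</VirtualHost>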