Robots.txt: Allow only major search engines
Is there a way to configure robots.txt so that the site accepts visits ONLY from the Google, Yahoo! and MSN spiders?
Comments (5)
Slurp is Yahoo's robot.
Why?
Anyone doing evil (e.g., gathering email addresses to spam) will just ignore robots.txt. Since robots.txt compliance is voluntary, you will only be blocking legitimate search engines.
But if you insist on doing it anyway, that's what the User-agent: line in robots.txt is for. Add a record for each search engine you'd like traffic from, plus a catch-all record that disallows everyone else. Robotstxt.org has a partial list of robot names.
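A minimal sketch of such a file, assuming the three crawlers from the question identify themselves as Googlebot, Slurp and msnbot. These tokens are assumptions based on commonly published names; check each engine's own documentation, since MSN's crawler was later replaced by Bingbot.

```
# One record per crawler named in the question.
# An empty Disallow value means nothing is off limits for that crawler.
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

# Catch-all record: every other compliant crawler is asked to stay out.
User-agent: *
Disallow: /
```

A crawler that finds a record matching its own name uses that record and ignores the `*` record, which is why the named crawlers stay allowed.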
There are more than three major search engines, depending on which country you are talking about. Facebook seems to do a good job of listing only legitimate ones: https://facebook.com/robots.txt
So your robots.txt can be something like:
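As a hedged sketch only: the crawler names below are assumptions picked for illustration and are not copied from Facebook's file. Several User-agent lines can share one group, which keeps a longer allow-list compact.

```
# One group shared by several crawlers (names are illustrative assumptions).
User-agent: Googlebot
User-agent: Bingbot
User-agent: Slurp
User-agent: Baiduspider
User-agent: Yandex
Disallow:

# Everyone else that honors robots.txt is asked to stay out.
User-agent: *
Disallow: /
```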
As everyone knows, robots.txt is a standard that crawlers are expected to obey, but only well-behaved agents actually do so. So adding it or not makes little difference against the badly behaved ones.
If you have data that you do not want shown on the site, you can change the permissions and improve the security instead.
Crawl-delay could also help if bandwidth is an issue.
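A small sketch of how this might look, assuming a crawler that honors the directive. Crawl-delay is a non-standard extension, its value is commonly read as a number of seconds between requests, and not every crawler supports it (Googlebot ignores it). The crawler name and the value 10 are arbitrary choices for illustration.

```
# Ask this crawler to wait between requests; nothing is disallowed.
User-agent: msnbot
Crawl-delay: 10
Disallow:
```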