The internet is a publishing mechanism. If you want to whitelist your site, you're going against the grain, but that's fine.
Do you want to whitelist your site?
Bear in mind that badly behaved bots which ignore robots.txt aren't affected anyway (obviously), and well-behaved bots are probably there for a good reason, even if that reason is opaque to you.
Whilst other sites that crawl your site might not be sending any traffic your way directly, it's possible that they themselves are being indexed by Google et al. and so adding to your PageRank. Blocking them from your site might affect this.
Do you want to be left out of something that could be including your site, that you have no knowledge of, and that is indirectly bringing a lot of traffic your way?
If some strange crawlers are hammering your site and eating your bandwidth, you may want to whitelist, but it is quite possible that such crawlers wouldn’t honour your robots.txt either.
Examine your log files and see what crawlers you have and what proportion of your bandwidth they are eating. There may be more direct ways to block traffic which is bombarding your site.
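As a rough illustration of that log check, here is a minimal Python sketch. It assumes your server writes the common "combined" access log format (request, status, bytes, referer, user agent); the function names, the sample lines and the `ExampleBot/1.0` agent are made up for the example, and a real log may need a different pattern.

```python
import re
from collections import Counter

# Assumed layout (combined log format):
#   ... "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"[^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)


def bandwidth_by_agent(lines):
    """Sum response bytes per User-Agent string."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        size = match.group("bytes")
        # A "-" in the bytes column means no body was sent.
        totals[match.group("agent")] += 0 if size == "-" else int(size)
    return totals


def bandwidth_shares(totals):
    """Return (agent, bytes, fraction of all bytes), largest first."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return [(agent, n, n / grand_total) for agent, n in totals.most_common()]


# Hypothetical sample lines, standing in for a real access log.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 3000 "-" "ExampleBot/1.0"',
    '203.0.113.5 - - [10/Oct/2023:13:55:37 +0000] "GET /a HTTP/1.1" 200 5000 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [10/Oct/2023:13:56:01 +0000] "GET / HTTP/1.1" 200 2000 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [10/Oct/2023:13:56:02 +0000] "GET /b HTTP/1.1" 304 - "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for agent, n, share in bandwidth_shares(bandwidth_by_agent(SAMPLE_LOG)):
        print(f"{agent}: {n} bytes ({share:.0%})")
```

To run it against a real file, pass an open file object instead of `SAMPLE_LOG`. User-Agent strings can be faked, so treat the result as a first look rather than proof of who is crawling you.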
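Excluding everything except one file is a bit awkward in the original robots.txt standard, as it has no “Allow” field (many crawlers do understand “Allow” as an extension, but not every parser does). The easy way is to put all files to be disallowed into a separate directory, say “stuff”, and leave the one file in the level above this directory.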
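The directory layout described above can be sanity-checked with Python's standard `urllib.robotparser` module. This is only a sketch: the `example.com` URLs, the file names and the `ExampleBot` agent are placeholders, and it shows how one well-behaved parser reads the rules, not how every crawler will behave.

```python
from urllib.robotparser import RobotFileParser

# Everything to be hidden lives under /stuff/; the one public file sits above it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /stuff/
"""


def build_parser(robots_txt):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The file left in the level above "stuff" stays crawlable.
print(parser.can_fetch("ExampleBot", "https://example.com/index.html"))

# Anything inside "stuff" is disallowed for every compliant bot.
print(parser.can_fetch("ExampleBot", "https://example.com/stuff/private.html"))
```

The first call should report the file as fetchable and the second as blocked. Only a single `Disallow` line is needed, which is why this layout works even with parsers that know nothing about “Allow”.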
My only worry is that you may miss the next big thing.
There was a long period when AltaVista was the search engine, possibly even more so than Google is now (there was no Bing or Ask, and Yahoo was a directory rather than a search engine as such). Sites that blocked all but AltaVista back then would never have seen traffic from Google, and therefore would never have known how popular it was getting unless they heard about it from another source, which might have put them at a considerable disadvantage for a while.
PageRank tends to be biased towards older sites. You don't want to appear newer than you are because you were blocking access via robots.txt for no reason. These guys, http://www.dotnetdotcom.org/, may be completely useless now, but maybe in five years' time the fact that you weren't in their index now will count against you in the next big search engine.