Have a look at Wikipedia.
You need to read about the robots.txt file, which you are supposed to place in your site's webroot: http://www.robotstxt.org/robotstxt.html
Use a robots.txt file: http://www.google.com/support/webmasters/bin/answer.py?answer=156449
Apart from password-protecting your site, you could add lines to robots.txt that tell crawlers to stay out. This doesn't hide the site; it only instructs bots not to spider the content.
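To make the "instructs bots" point concrete, here is a rough sketch of how a rule-abiding crawler would read a deny-all robots.txt, using Python's standard `urllib.robotparser`. The two-line file contents, the bot name `ExampleBot`, and the example.com URL are assumptions for illustration, not something taken from the answers here.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents: ask every crawler to stay away from every path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honours robots.txt would fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" and example.com are placeholders.
    url = "http://example.com/private/page.html"
    print(may_fetch(ROBOTS_TXT, "ExampleBot", url))
```

Nothing here is enforced by the server: the check runs on the crawler's side, so a bot that skips it can still fetch every page.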
You can somewhat reduce the chance of your site being listed by using a robots.txt. Note that this depends on the "goodwill" of the crawler, though (some spambots will explicitly look at locations that you disallow).
The only safe and reliable way of not having a site listed, sadly, is not putting it on the internet.
Simply not linking to your site will not work. Crawlers get their info from many sources, including browser referrers and domain registrars. So, in order to be "invisible", you would have to not visit your site and not register a domain (only access it via IP address).
And even then, if your webserver is reachable only by IP address, you still have all the spambots probing random addresses. It will take a while, but they will find you.
Password-protecting your site should work, effectively making it inaccessible to crawlers. Though (and it is beyond my comprehension how this happens), there are literally thousands of ACM papers listed in Google that you cannot view without logging in to an account. Yet they are there.
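As a rough idea of what password protection looks like at the HTTP level, here is a minimal sketch of HTTP Basic authentication with Python's standard `http.server`. The credentials, realm name, port, and the `serve` helper are all made up for illustration. A real site would more likely use the webserver's built-in authentication, and would need HTTPS so the password is not sent in the clear.

```python
import base64
import hmac
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical credentials, for illustration only.
USERNAME = "editor"
PASSWORD = "change-me"


def is_authorized(header):
    """Check an Authorization header value against the expected credentials."""
    if not header or not header.startswith("Basic "):
        return False
    try:
        raw = base64.b64decode(header[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return False  # not valid base64 or not valid UTF-8
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # Compare as bytes, in constant time.
    user_ok = hmac.compare_digest(user.encode(), USERNAME.encode())
    pass_ok = hmac.compare_digest(password.encode(), PASSWORD.encode())
    return user_ok and pass_ok


class ProtectedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if not is_authorized(self.headers.get("Authorization")):
            # Crawlers without credentials only ever see this 401 response.
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="private"')
            self.end_headers()
            return
        body = b"private content\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve on localhost until interrupted (port is an arbitrary choice)."""
    HTTPServer(("127.0.0.1", port), ProtectedHandler).serve_forever()
```

Calling `serve()` starts the server. A request without valid credentials gets only the 401 response, so a crawler never receives the page body.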
Use a robots.txt that denies all search engines.
They don't all respect robots.txt, so check your server logs regularly and deny access from the IP ranges of suspected robots/crawlers:
http://httpd.apache.org/docs/2.2/howto/access.html
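The log check could be sketched roughly like this. It assumes the access log is in Common Log Format, that `/private/` is a path your robots.txt disallows, and that grouping clients by IPv4 /24 is a good enough guess at a "range". The function name, the threshold, and the sample lines (which use documentation-only addresses) are all invented for the example.

```python
import ipaddress
import re
from collections import Counter

# Client address and requested path from a Common Log Format line.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD|POST) (\S+)')


def suspicious_networks(log_lines, disallowed_prefixes, threshold=2):
    """List IPv4 /24 networks with at least `threshold` hits on disallowed paths."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        address, path = match.groups()
        if not any(path.startswith(prefix) for prefix in disallowed_prefixes):
            continue
        try:
            ip = ipaddress.ip_address(address)
        except ValueError:
            continue  # the log holds a hostname rather than an address
        if ip.version != 4:
            continue  # IPv6 clients are ignored in this sketch
        hits[ipaddress.ip_network(f"{ip}/24", strict=False)] += 1
    return [str(net) for net, count in hits.most_common() if count >= threshold]


# Made-up sample lines using documentation-only address blocks.
SAMPLE = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /private/a.html HTTP/1.1" 200 512',
    '203.0.113.9 - - [10/Oct/2023:13:55:37 +0000] "GET /private/b.html HTTP/1.1" 200 498',
    '198.51.100.4 - - [10/Oct/2023:13:56:01 +0000] "GET /index.html HTTP/1.1" 200 1024',
]

if __name__ == "__main__":
    print(suspicious_networks(SAMPLE, ["/private/"]))
```

A range it reports can then be blocked in the server configuration as described on the linked Apache page, for example with a line like `Deny from 203.0.113.0/24`. Check that the range isn't a legitimate visitor before you block it.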
You use a robots.txt file. Place the file in the root of the site, with rules telling crawlers which paths they should not fetch.
Most proper search engines use bots or crawlers to visit websites and index them. You could use the robots.txt file method.