
Robots.txt: Allow Only Major Search Engines

Posted on 2024-07-16 09:52:55

Is there a way to configure robots.txt so that the site accepts visits only from the Google, Yahoo!, and MSN spiders?


無心 2024-07-23 09:52:55

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Slurp
Allow: /

User-agent: msnbot
Disallow:

Slurp is Yahoo's robot.
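One way to sanity-check rules like these before deploying them is to feed them to Python's standard `urllib.robotparser` and ask which crawlers may fetch a page. This is only a minimal sketch: the `EvilScraper` agent name and the `example.com` URL are made up for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rules from the answer above, one group per crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Slurp
Allow: /

User-agent: msnbot
Disallow:
"""


def allowed_agents(robots_txt, agents, url="https://example.com/page.html"):
    """Return {agent: bool} saying whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    # "EvilScraper" is a hypothetical crawler that falls through to the * group.
    result = allowed_agents(ROBOTS_TXT, ["Googlebot", "Slurp", "msnbot", "EvilScraper"])
    for agent, ok in result.items():
        print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Note that an empty `Disallow:` (as used for msnbot) and `Allow: /` are treated the same way here: both leave the whole site open to that crawler.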




入怼 2024-07-23 09:52:55

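Why?

Anyone doing evil (e.g., gathering email addresses to spam) will just ignore robots.txt. Because compliance with robots.txt is voluntary, you would only be blocking legitimate search engines.

But if you insist on doing it anyway, that's what the User-agent: line in robots.txt is for.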

User-agent: googlebot
Disallow: 

User-agent: *
Disallow: /

Add matching lines for each of the other search engines you'd like traffic from, of course. Robotstxt.org has a partial list.
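Whether the catch-all `*` group comes first or last should not matter to a compliant crawler, which uses the group that names it and falls back to `*` only otherwise. The sketch below checks both orderings with Python's `urllib.robotparser`. The `SomeOtherBot` name and the example URL are hypothetical, and the check only models crawlers that choose to honor the file.

```python
from urllib.robotparser import RobotFileParser

SPECIFIC_FIRST = """\
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /
"""

CATCH_ALL_FIRST = """\
User-agent: *
Disallow: /

User-agent: googlebot
Disallow:
"""


def may_fetch(robots_txt, agent, url="https://example.com/"):
    """Ask the parser whether agent is permitted to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for label, text in [("specific first", SPECIFIC_FIRST),
                        ("catch-all first", CATCH_ALL_FIRST)]:
        print(label,
              "| Googlebot:", may_fetch(text, "Googlebot"),
              "| SomeOtherBot:", may_fetch(text, "SomeOtherBot"))
```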


千笙结 2024-07-23 09:52:55

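Depending on which country you are talking about, there are more than three major search engines. Facebook seems to do a good job of listing only legitimate ones: https://facebook.com/robots.txt

So your robots.txt can be something like: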

User-agent: Applebot
Allow: /

User-agent: baiduspider
Allow: /

User-agent: Bingbot
Allow: /

User-agent: Facebot
Allow: /

User-agent: Googlebot
Allow: /

User-agent: msnbot
Allow: /

User-agent: Naverbot
Allow: /

User-agent: seznambot
Allow: /

User-agent: Slurp
Allow: /

User-agent: teoma
Allow: /

User-agent: Twitterbot
Allow: /

User-agent: Yandex
Allow: /

User-agent: Yeti
Allow: /

User-agent: *
Disallow: /
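A long allowlist like this is easy to get wrong by hand, so one option is to generate the file from a list of crawler names. The following is a minimal sketch in Python; `build_robots_txt` is a hypothetical helper written for this illustration, and the names are simply the ones listed above.

```python
# Crawler names taken from the allowlist above.
ALLOWED_CRAWLERS = [
    "Applebot", "baiduspider", "Bingbot", "Facebot", "Googlebot",
    "msnbot", "Naverbot", "seznambot", "Slurp", "teoma",
    "Twitterbot", "Yandex", "Yeti",
]


def build_robots_txt(allowed):
    """Build an allowlist robots.txt: one Allow group per crawler,
    followed by a catch-all group that blocks everyone else."""
    groups = [f"User-agent: {name}\nAllow: /" for name in allowed]
    groups.append("User-agent: *\nDisallow: /")
    # Blank lines between groups keep the file readable for humans and parsers.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(ALLOWED_CRAWLERS), end="")
```

Keeping the names in one list makes it straightforward to add or drop a crawler without touching the rest of the file.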


陈独秀 2024-07-23 09:52:55

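As everyone knows, robots.txt is a standard that crawlers are supposed to obey, so only well-behaved agents actually do. Whether you add these rules or not therefore makes little difference.

If you have data that should not be shown on the site, just change the permissions and tighten security.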
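To illustrate the difference between an advisory file and real access control, here is a minimal sketch of a server-side check written as a plain WSGI application in Python. Everything in it is an assumption made for illustration: the `/private/` prefix, the bearer-token scheme, and the `change-me` secret are hypothetical, and a real site would normally rely on its web server's or framework's own authentication.

```python
import hmac

PROTECTED_PREFIX = "/private/"  # hypothetical protected path
EXPECTED_TOKEN = "change-me"    # hypothetical secret; load from configuration in practice


def app(environ, start_response):
    """Tiny WSGI app: paths under PROTECTED_PREFIX require a bearer token."""
    path = environ.get("PATH_INFO", "/")
    supplied = environ.get("HTTP_AUTHORIZATION", "")
    expected = f"Bearer {EXPECTED_TOKEN}"
    # WSGI header strings are latin-1 safe; compare in constant time.
    authorized = hmac.compare_digest(
        supplied.encode("latin-1"), expected.encode("latin-1")
    )
    if path.startswith(PROTECTED_PREFIX) and not authorized:
        start_response(
            "401 Unauthorized",
            [("Content-Type", "text/plain"), ("WWW-Authenticate", "Bearer")],
        )
        return [b"authentication required\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]
```

Unlike a robots.txt rule, this check is enforced by the server no matter what the client claims to be. For a quick local trial the app can be served with the standard library's `wsgiref.simple_server`.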


×眷恋的温暖 2024-07-23 09:52:55

If bandwidth is a concern, Crawl-delay could also help:

User-agent: *
Disallow: /

# Crawl-delay only affects the group it appears in, so it goes with the allowed crawlers.
User-agent: Googlebot
User-agent: Slurp
User-agent: msnbot
User-agent: Applebot
User-agent: baiduspider
User-agent: Bingbot
User-agent: Facebot
User-agent: Twitterbot
Allow: /
Crawl-delay: 10

Sitemap: https://yoursite.com/sitemapindex.xml
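Which crawlers a `Crawl-delay` line applies to depends on the `User-agent` group that carries it. The sketch below uses Python's `urllib.robotparser` (`site_maps()` needs Python 3.8 or later) to read back the delay and the sitemap from a small sample file. The sample file and the `SomeOtherBot` name are assumptions made for illustration, and `Crawl-delay` is a non-standard directive that some crawlers, Googlebot included, ignore.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical sample: only the Bingbot group carries a Crawl-delay.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Bingbot
Allow: /
Crawl-delay: 10

Sitemap: https://yoursite.com/sitemapindex.xml
"""


def crawl_settings(robots_txt, agent):
    """Return (crawl_delay, sitemaps) as this parser sees them for agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent), parser.site_maps()


if __name__ == "__main__":
    for agent in ("Bingbot", "SomeOtherBot"):
        delay, sitemaps = crawl_settings(ROBOTS_TXT, agent)
        print(f"{agent}: delay={delay}, sitemaps={sitemaps}")
```

The `Sitemap` line is independent of any group, so the parser reports it for every agent, while the delay is reported only for the crawler whose group contains it.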
