Does the user-agent string have to be exactly as it appears in my server logs?
When using a robots.txt file, does the user-agent string have to be exactly as it appears in my server logs?
For example, when trying to match GoogleBot, can I just use googlebot?
Also, will a partial match work? For example, just using Google?
Comments (5)
At least for googlebot, the user-agent is case-insensitive. Read the 'Order of precedence for user-agents' section:
https://code.google.com/intl/de/web/controlcrawlindex/docs/robots_txt.html
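Assuming that case-insensitive behaviour, a group written with a lowercase token should be honoured by Googlebot just the same (a sketch; the path is a placeholder):

    User-agent: googlebot
    Disallow: /no-crawl/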
(As already answered in another question)

In the original robots.txt specification (from 1994), it says:

"The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended."

But if/which parsers work like that is another question. Your best bet would be to look for the documentation of the bots you want to add. You'll typically find the agent identifier string in it (a combined example follows the quotes below), e.g.:

Bing:
"We want webmasters to know that bingbot will still honor robots.txt directives written for msnbot, so no change is required to your robots.txt file."

DuckDuckGo:
"DuckDuckBot is the Web crawler for DuckDuckGo. It respects WWW::RobotRules [...]"

Google:
"The Google user-agent is (appropriately enough) Googlebot."

Internet Archive:
"The user agent archive.org_bot is used for our wide crawl of the web. It is designed to respect robots.txt and META robots tags."

…
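Putting those documented tokens together, a robots.txt addressing each of these bots might look like the following sketch (the Disallow paths are placeholders, not taken from the answer):

    User-agent: bingbot
    Disallow: /no-bing/

    User-agent: DuckDuckBot
    Disallow: /no-ddg/

    User-agent: Googlebot
    Disallow: /no-google/

    User-agent: archive.org_bot
    Disallow: /no-archive/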
robots.txt is case-sensitive. Although Google is more lenient than other bots and may accept its string either way, other bots may not.
In theory, yes. However, in practice it seems to be specific partial matches or "substrings" (as mentioned in @unor's answer) that match. These specific "substrings" appear to be referred to as "tokens", and often it must be an exact match for these "tokens".

With regards to the standard Googlebot, this only appears to match Googlebot (case-insensitive). Any lesser partial match, such as Google, fails to match. Any longer partial match, such as Googlebot/1.2, fails to match. And using the full user-agent string (Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)) also fails to match. (Although there is technically more than one user-agent for Googlebot anyway, so matching on the full user-agent string would not be recommended even if it did work.) These tests were performed with Google's robots.txt tester.

Reference: Google's crawler documentation (user-agent tokens, as used in robots.txt)
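As a point of comparison, other parsers may behave differently from Google's tester. The sketch below uses Python's urllib.robotparser, which does honour a lesser partial match such as Google via case-insensitive substring matching; the example.com URLs and paths are made up for the demonstration:

    # Check how Python's urllib.robotparser matches User-agent tokens.
    # This is one parser's behaviour only; Google's robots.txt tester
    # reportedly requires the exact "Googlebot" token, as described above.
    from urllib import robotparser

    rules = [
        "User-agent: Google",
        "Disallow: /private/",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(rules)

    # robotparser lowercases both sides and does a substring match on the
    # product name (the part before "/"), so the partial token "Google"
    # applies to a crawler identifying itself as "Googlebot".
    print(rp.can_fetch("Googlebot/2.1 (+http://www.google.com/bot.html)",
                       "https://example.com/private/page"))   # False (blocked)
    print(rp.can_fetch("SomeOtherBot/1.0",
                       "https://example.com/private/page"))   # True (allowed)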
Yes, the user agent has to be an exact match.

From robotstxt.org: "globbing and regular expression are not supported in either the User-agent or Disallow lines"
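So, under the original spec that robotstxt.org describes, a group has to spell out the token; something like the Goo* line below is not treated as a pattern (a sketch shown only as a counter-example, with placeholder paths):

    # Matches Google's crawler by its documented token:
    User-agent: Googlebot
    Disallow: /private/

    # "Goo*" is NOT a wildcard pattern; only the literal "*" (any bot)
    # and literal tokens are recognised.
    User-agent: Goo*
    Disallow: /private/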