User agent in a PHP scraper script
A scraper script that I bought contains this line of PHP code:
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
I am guessing this means the script identifies itself as Googlebot. Am I correct? If so, can I change it to the name of my own bot, such as Searchbox?
Comments (2)
The user agent is purely advisory. It should not have any effect on the page a site returns (a site that served different content depending on it would be violating Google's guidelines and could be thrown out of the index). It should contain a URL or email address that webmasters can use to contact the owner of a misbehaving bot.
You should not pretend to be Googlebot. Instead, include your email address or homepage in the user agent.
That depends on what the script does and what kind of sites it scrapes. The Googlebot agent string is there for a reason: possibly to trick news websites into showing paid content or, more innocently, to get a search-engine-optimized version of the content.
If you do not need to rely on these "side effects", you can choose any user agent string you want. For bots, it is customary to include the word "Bot" and a URL where webmasters can get more information.
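As a rough illustration of that convention, here is a minimal sketch of building such a string and passing it to cURL. The bot name `SearchboxBot`, the version, the `example.com` URL and email, and the helper names `buildUserAgent` and `fetchPage` are all assumptions made up for this example. The purchased script may set its user agent differently, so treat this as one possible approach rather than a drop-in change.

```php
<?php
// Build a user agent string of the form "Name/Version (info URL[; email])".
// All values passed in below are placeholders; substitute your own.
function buildUserAgent(string $name, string $version, string $infoUrl, string $email = ''): string
{
    $contact = $infoUrl;
    if ($email !== '') {
        $contact .= '; ' . $email;
    }
    return sprintf('%s/%s (%s)', $name, $version, $contact);
}

// Fetch a page while identifying as the given user agent.
// Requires the PHP cURL extension. Returns the body as a string, or false on failure.
function fetchPage(string $url, string $userAgent)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);  // sent as the User-Agent header
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
    return curl_exec($ch);
}

// Replaces the hard-coded Googlebot string from the question.
$userAgent = buildUserAgent('SearchboxBot', '1.0', 'http://www.example.com/bot.html');
```

Calling `fetchPage('http://www.example.com/', $userAgent)` would then send the request under the bot's own name, and the page at the info URL is where webmasters would expect to find out who runs it.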