Identifying search engine crawlers
I am working on a website that loads its data via AJAX. I also want the whole website to be crawlable by search engines such as Google and Yahoo.
I want to make two versions of the site:
[1] When a user visits, the hyperlinks should work just like Gmail's (#-based hyperlinks).
[2] When a crawler visits, the hyperlinks should work normally (AJAX mode off).
How can I identify a crawler?
Comments (4)
Crawlers can usually be identified with the User-Agent HTTP Header. Look at this page for a list of user agents for crawlers specifically. Some examples are:
Google:
Also, here are some examples for getting the user agent string in various languages:
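As a rough illustration of the User-Agent check described above, here is a minimal Python sketch. The names `CRAWLER_TOKENS`, `is_crawler`, and `user_agent_from_environ` are hypothetical, and the token list is a small illustrative sample rather than a complete one. It assumes a WSGI-style server, where the header is exposed as `HTTP_USER_AGENT` in the request environment.

```python
# Substrings that some well-known crawlers include in their User-Agent header.
# Illustrative sample only (an assumption of this sketch), not a complete list.
CRAWLER_TOKENS = ("googlebot", "bingbot", "slurp")


def is_crawler(user_agent):
    """Return True if the User-Agent string contains a known crawler token."""
    if not user_agent:
        # A missing or empty header tells us nothing; treat it as a normal visitor.
        return False
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def user_agent_from_environ(environ):
    """Read the User-Agent header from a WSGI environ dictionary."""
    return environ.get("HTTP_USER_AGENT", "")


if __name__ == "__main__":
    sample = {"HTTP_USER_AGENT": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}
    print(is_crawler(user_agent_from_environ(sample)))
```

Keep in mind that the header is supplied by the client and can be spoofed, so a check like this is a hint rather than proof that the visitor is a crawler.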
You should not present a different form of your website to your users and a crawler. If Google discovers you doing that, they may reduce your search ranking because of it. Also, if you have a version that's only for a crawler, it may break without you noticing, thus giving search engines bad data.
What I'd recommend is building a version of your site that doesn't require AJAX, and having prominent links on each page to the non-AJAX version. This will also help users who may not like the AJAX version, or whose browsers aren't capable of handling it properly.
The HTTP headers sent by a crawler should contain a User-Agent field. You can check this field on your server.
Here is a list of TONS of User-Agents. Some examples:
This approach just makes life difficult for you. It requires you to maintain two completely separate versions of the site and try to guess what version to serve to any given user. Search engines are not the only user agents that don't have JavaScript available and enabled.
Follow the principles of unobtrusive JavaScript and build on things that work. This avoids the need to determine which version to give to a user since the JS can gracefully fail while leaving a working HTML version.
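One way to picture this on the server side is to give every page a real URL that returns complete HTML, and to return only the content fragment when the request comes from the site's own script. The sketch below is a hedged illustration, not a prescribed implementation: `PAGES` and `render` are hypothetical names, and it assumes the client-side script sets the `X-Requested-With: XMLHttpRequest` header, which some JavaScript libraries do by convention but browsers do not send automatically.

```python
# Hypothetical content store: real URLs mapped to HTML fragments.
PAGES = {
    "/about": "<h1>About</h1><p>Placeholder content.</p>",
}


def render(path, headers):
    """Return (status, html) for a path.

    The same URL serves everyone. Plain requests (crawlers, browsers without
    JavaScript) get a complete page with ordinary links; requests marked as
    AJAX get only the fragment to insert into the current page.
    """
    fragment = PAGES.get(path)
    if fragment is None:
        return 404, "<h1>Not found</h1>"
    # Assumption: the site's script sets this header on its own requests.
    if headers.get("X-Requested-With") == "XMLHttpRequest":
        return 200, fragment
    page = (
        "<html><body>"
        '<nav><a href="/about">About</a></nav>'
        "<main>" + fragment + "</main>"
        "</body></html>"
    )
    return 200, page


if __name__ == "__main__":
    print(render("/about", {}))
    print(render("/about", {"X-Requested-With": "XMLHttpRequest"}))
```

With this layout there is no need to guess who is visiting: the links are ordinary `href`s that work everywhere, and the script can intercept clicks to load fragments when it is available.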