When Googlebot requests a "?_escaped_fragment_=" URL, how does it know the web server is not cloaking?

Posted on 2024-12-22 13:23:33

With regard to Google's AJAX crawling spec: if the server returns one thing (namely, a JavaScript-heavy file) for a #! URL and something else (namely, an "HTML snapshot" of the page) to Googlebot when the #! is replaced with ?_escaped_fragment_=, that feels like cloaking to me. After all, how can Googlebot be sure that the server is returning good-faith equivalents for both the #! and ?_escaped_fragment_= URLs? Yet this is what the AJAX crawling spec actually tells webmasters to do. Am I missing something? How is Googlebot sure that the server is returning the same content in both cases?
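For concreteness, here is a minimal sketch of the URL rewriting the question refers to: the part after `#!` is moved into an `_escaped_fragment_` query parameter. The function name `to_escaped_fragment` is hypothetical, and the set of characters left unescaped is an assumption based on my reading of the AJAX crawling scheme, so treat it as illustrative only.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Assumption: '#', '%', '&', '+', whitespace and non-ASCII bytes are
# percent-encoded; the punctuation listed here is left as-is.
_SAFE = "!\"$'()*,/:;<=>?@[\\]^`{|}~"


def to_escaped_fragment(url: str) -> str:
    """Map a '#!' URL to its '?_escaped_fragment_=' counterpart."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # not a hash-bang URL, nothing to rewrite
    value = quote(parts.fragment[1:], safe=_SAFE)
    param = "_escaped_fragment_=" + value
    # Append to an existing query string if there is one.
    query = parts.query + "&" + param if parts.query else param
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(to_escaped_fragment("http://example.com/page#!key=value"))
```

The server sees two different request URLs, so nothing in the protocol itself forces it to serve equivalent content for them.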


Comments (1)

疯到世界奔溃 2024-12-29 13:23:33

The crawler does not know. But it never knows for sites that return plain ol' HTML either: it is extremely easy to write code that cloaks a site based on the HTTP headers crawlers send or on known crawler IP addresses.
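To illustrate how little is needed, here is a minimal sketch of header-based variant selection. The names `CRAWLER_TOKENS`, `is_probable_crawler` and `select_variant` are hypothetical and the token list is only an example; the point is that the decision rests on a request header the client fully controls.

```python
# Illustrative, not exhaustive: substrings commonly seen in crawler User-Agents.
CRAWLER_TOKENS = ("googlebot", "bingbot")


def is_probable_crawler(headers):
    """Guess from the User-Agent header whether the client is a crawler."""
    user_agent = headers.get("User-Agent", "").lower()
    return any(token in user_agent for token in CRAWLER_TOKENS)


def select_variant(headers):
    """Return which variant of the page a cloaking server would serve."""
    return "crawler-variant" if is_probable_crawler(headers) else "user-variant"
```

A request that carries ordinary browser headers gets the user variant, which is why a crawler that spoofs a normal browser can see what real visitors see.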

See this related question: How does Google Know you are Cloaking?

Most of it seems like conjecture, but it seems likely that there are various checks in place, ranging from requests that spoof normal browser headers to an actual person looking at the page.

Continuing the conjecture, it certainly wouldn't be beyond the capabilities of programmers at Google to write a form of crawler that actually retrieved what the user sees - after all, they have their own browser that does just that. It would be prohibitively CPU-expensive to do that all the time, but probably makes sense for the occasional spot-check.
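Such a spot-check could, for instance, compare the visible text of the HTML snapshot with the text of the page as rendered in a browser. The sketch below is purely hypothetical and is not a description of anything Google is known to do: the names `visible_words`, `similarity` and `looks_cloaked`, and the 0.8 threshold, are assumptions. It takes both documents as strings, so fetching and rendering are out of scope.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())


def visible_words(html):
    """Return the whitespace-separated words of the visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.words


def similarity(snapshot_html, rendered_html):
    """Similarity in [0, 1] between the word sequences of two documents."""
    a = visible_words(snapshot_html)
    b = visible_words(rendered_html)
    return SequenceMatcher(None, a, b).ratio()


def looks_cloaked(snapshot_html, rendered_html, threshold=0.8):
    """Flag the pair when the texts differ more than the assumed threshold."""
    return similarity(snapshot_html, rendered_html) < threshold
```

A word-level comparison tolerates markup differences between the snapshot and the rendered DOM while still flagging pages whose text is substantially different.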
