Making AJAX applications crawlable? How to build a simple web service on Google App Engine to produce HTML snapshots?
Real World Problem:
My app is hosted on Heroku, which (to my knowledge) offers no way to run a headless (GUI-less) browser, such as HTMLUnit, to generate HTML snapshots so that Googlebot can index my AJAX content.
My Proposed Solution:
If you haven't already, I suggest reading Google's Full Specification for Making AJAX Applications Crawlable.
Imagine I have:
- a Sinatra app hosted on Heroku on the domain http://example.com
- the app has tabs along the top of the page: TabA, TabB and TabC
- under each tab are SubTab1, SubTab2 and SubTab3
- on load, if the URL is http://example.com#!tab=TabA&subtab=SubTab3, client-side JavaScript takes the location.hash and loads the TabA, SubTab3 content via AJAX.
Note: the hash bang (#!) is part of the Google spec.
I would like to build a simple "web service" hosted on Google App Engine (GAE) that:
- Accepts a URL param, e.g. http://htmlsnapshot.appspot.com?url=http://example.com#!tab=TabA&subtab=SubTab3 (the url param should be URL-encoded).
- Runs HTMLUnit to open http://example.com#!tab=TabA&subtab=SubTab3 and run the client-side JavaScript on the server.
- HTMLUnit returns the DOM once everything is complete (or after something like 45 seconds has passed).
- The returned content could be sent back via JSON/JSONP, or alternatively a URL is returned that points to a file generated and stored on the Google App Engine server (for file-based "cached" results)... open to suggestions here. If a URL to a file were returned, you could then use cURL to fetch the source code (aka an HTML snapshot).
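The URL-encoding note matters in practice: sent unencoded, everything after the `#` would be treated as a fragment and never reach the snapshot service, and `&subtab=...` would be parsed as a separate parameter. A minimal Ruby sketch of building the request (Ruby because the calling app is Sinatra), where `snapshot_request_url` and `SNAPSHOT_SERVICE` are hypothetical names and the host is only the example used above:

```ruby
require "uri"

# Assumption: the example host from the question, not a real endpoint.
SNAPSHOT_SERVICE = "http://htmlsnapshot.appspot.com/"

# Hypothetical helper: builds the request URL for the snapshot service.
def snapshot_request_url(page_url)
  # encode_www_form_component escapes ':', '/', '#', '!', '=' and '&',
  # so the whole target URL travels as one query parameter value.
  "#{SNAPSHOT_SERVICE}?url=#{URI.encode_www_form_component(page_url)}"
end

puts snapshot_request_url("http://example.com#!tab=TabA&subtab=SubTab3")
# prints http://htmlsnapshot.appspot.com/?url=http%3A%2F%2Fexample.com%23%21tab%3DTabA%26subtab%3DSubTab3
```

On the service side, decoding the `url` parameter once yields the original hash-bang URL for HTMLUnit to open.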
My http://example.com app would need to manage the call to http://htmlsnapshot.appspot.com ... basically:
- Catch Googlebot's call to http://example.com/?_escaped_fragment_=tab=TabA%26subtab=SubTab3 (the Googlebot crawler escapes certain characters, e.g. %26 = &).
- Send a request from the backend to http://htmlsnapshot.appspot.com?url=http://example.com#!tab=TabA&subtab=SubTab3 (the url param should be URL-encoded).
- Render the returned HTML snapshot to the frontend.
- Google indexes the content and we rejoice!
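The first step, turning Googlebot's `_escaped_fragment_` request back into the hash-bang URL, could look roughly like this in the Sinatra backend. This is only a sketch under simplifying assumptions: `hashbang_url` is a hypothetical helper name, and it ignores ports and any other query parameters.

```ruby
require "uri"

# Hypothetical helper: rebuilds the #! URL from the URL Googlebot requests.
# Returns nil when the request carries no _escaped_fragment_ parameter.
def hashbang_url(request_url)
  uri = URI(request_url)
  # Split the raw query by hand so the '=' characters inside the value survive.
  pairs = uri.query.to_s.split("&").map { |pair| pair.split("=", 2) }
  key, value = pairs.find { |k, _| k == "_escaped_fragment_" }
  return nil if key.nil?

  # Undo the crawler's escaping (e.g. %26 becomes &).
  fragment = URI.decode_www_form_component(value.to_s)
  "#{uri.scheme}://#{uri.host}#{uri.path}#!#{fragment}"
end

puts hashbang_url("http://example.com/?_escaped_fragment_=tab=TabA%26subtab=SubTab3")
# prints http://example.com/#!tab=TabA&subtab=SubTab3
```

A nil result means an ordinary visitor request, which the app would serve as usual; a non-nil result is the URL to pass, URL-encoded, to the snapshot service.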
I don't have any experience with Google App Engine or Java or HTMLUnit.
I might be able to figure it out... and will post my results if I do.
Otherwise, I feel this is a VERY good opportunity for someone to write a kick-ass blog post that outlines a novice's step-by-step guide to setting up a web service like this.
This will introduce more people to the excellent (and free!) Google App Engine. It will also undoubtedly encourage more people to adopt Google's spec for crawlable AJAX content... something we can all benefit from!
As Google's specification gains more acceptance, the "hurdle" of setting up a headless browser is going to send many devs Googling for answers! Get in now with an answer for fame and glory! (edit: at the very least I will sing your praises).
Hit me up on Twitter @_chrisjacob if you would like to discuss solutions.
Comments (1)
I have successfully used HTMLUnit on AppEngine. My GWT code to do this is available in the gwt-platform project; the results I got were similar to those of the HTMLunit-AppEngine test application (http://ajax-crawler.appspot.com/) by Amit Manjhi.
It should be relatively easy to use GWTP's current HTMLUnit support to do exactly what you describe, although you could likely do it in a simpler app. One problem I see is that AppEngine requests have a 30-second timeout, so you can't have a page that takes HTMLUnit longer than that to process.
UPDATE:
It's been a while, but I finally closed the long-standing issue about making GWT applications crawlable using GWTP. The documentation is not complete yet, but check out the issue:
http://code.google.com/p/gwt-platform/issues/detail?id=1