Making AJAX applications crawlable? How to build a simple web service on Google App Engine to produce HTML snapshots?

Posted 2024-09-15 04:10:35

Real World Problem:

My app is hosted on Heroku, which (to my knowledge) does not offer a way to run a headless (GUI-less) browser - such as HTMLUnit - to generate HTML snapshots so that Googlebot can index my AJAX content.

My Proposed Solution:

If you haven't already, I suggest reading Google's Full Specification for Making AJAX Applications Crawlable.

Imagine I have:

a Sinatra app hosted on Heroku on the domain http://example.com
the app has tabs along the top of the page TabA, TabB and TabC
under each tab are SubTab1, SubTab2 and SubTab3
on load, if the URL is http://example.com#!tab=TabA&subtab=SubTab3, client-side JavaScript reads location.hash and loads the TabA, SubTab3 content via AJAX.
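
To make the hash-bang state concrete, here is a minimal sketch of how the part after #! maps to the tab and subtab values. In the app this parsing happens in client-side JavaScript on location.hash; the Python function below (parse_hashbang is a hypothetical name) only illustrates the mapping and assumes the fragment uses key=value pairs joined by &, as in the example URL.

```python
from urllib.parse import parse_qsl, urlsplit


def parse_hashbang(url: str) -> dict:
    """Return the key/value state encoded after '#!' in url, or {} if there is none."""
    fragment = urlsplit(url).fragment  # text after '#', without the '#' itself
    if not fragment.startswith("!"):
        return {}  # a plain '#anchor' is not a hash-bang state
    return dict(parse_qsl(fragment[1:]))  # drop the '!' and split the pairs


state = parse_hashbang("http://example.com#!tab=TabA&subtab=SubTab3")
print(state["tab"], state["subtab"])
```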

Note: the hash bang (#!) is part of the Google spec.

I would like to build a simple "web service" hosted on Google App Engine (GAE) that:

Accepts a URL param, e.g. http://htmlsnapshot.appspot.com?url=http://example.com#!tab=TabA&subtab=SubTab3 (the url param should be URL-encoded)
Runs HTMLUnit to open http://example.com#!tab=TabA&subtab=SubTab3 and run the client-side JavaScript on the server.
HTMLUnit returns the DOM once everything is complete (or once something like 45 seconds have passed).
The returned content could be sent back via JSON/JSONP, or alternatively a URL is returned that points to a file generated and stored on the Google App Engine server (for file-based "cached" results)... open to suggestions here. If a URL to a file were returned, you could then cURL it to get the source code (aka an HTML snapshot).
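
The URL encoding in the first step matters because anything after an unencoded # is a fragment and is never sent to the server, so the service would only ever see http://example.com. A minimal sketch of building the request URL, assuming the hypothetical htmlsnapshot.appspot.com service and its url parameter from the list above (snapshot_request_url is an illustrative name):

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical snapshot service described in the question.
SNAPSHOT_SERVICE = "http://htmlsnapshot.appspot.com/"


def snapshot_request_url(target_url: str) -> str:
    """Build the service URL with target_url percent-encoded into the 'url' param."""
    return SNAPSHOT_SERVICE + "?" + urlencode({"url": target_url})


target = "http://example.com#!tab=TabA&subtab=SubTab3"
request_url = snapshot_request_url(target)
print(request_url)

# The encoded URL has no fragment of its own, and the service can recover
# the full target, including '#!' and '&', from the query string.
assert urlsplit(request_url).fragment == ""
assert parse_qs(urlsplit(request_url).query)["url"] == [target]
```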

My http://example.com app would need to manage the call to http://htmlsnapshot.appspot.com... basically:

Catch Googlebot's call to http://example.com/?_escaped_fragment_=tab=TabA%26subtab=SubTab3 (the Googlebot crawler escapes certain characters, e.g. %26 = &).
Send request from the backend to http://htmlsnapshot.appspot.com?url=http://example.com#!tab=TabA&subtab=SubTab3 (url param should be URLEncoded)
Render the returned HTML Snapshot to the frontend.
Google Indexes the content and we rejoice!
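
The first two steps above amount to turning the crawler's _escaped_fragment_ request back into the hash-bang URL that the snapshot service should render. A minimal sketch of that mapping follows; the function name is hypothetical, and in the Sinatra app the same logic would live in a Ruby route rather than in Python.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def escaped_fragment_to_hashbang(url: str):
    """Map '...?_escaped_fragment_=STATE' back to '...#!STATE'.

    Returns None when the request carries no _escaped_fragment_ parameter,
    i.e. when it is an ordinary request that needs no snapshot.
    """
    parts = urlsplit(url)
    fragment = None
    remaining = []  # other query parameters are kept as they are
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key == "_escaped_fragment_" and fragment is None:
            fragment = value  # parse_qsl has already decoded %26 to '&'
        else:
            remaining.append((key, value))
    if fragment is None:
        return None
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(remaining), "!" + fragment)
    )


crawler_url = "http://example.com/?_escaped_fragment_=tab=TabA%26subtab=SubTab3"
print(escaped_fragment_to_hashbang(crawler_url))
```

The result is the URL to pass, URL-encoded, as the url parameter of the snapshot service.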

I don't have any experience with Google App Engine or Java or HTMLUnit.

I might be able to figure it out... and will post my results if I do.

Otherwise, I feel this is a VERY good opportunity for someone to write a kick-ass blog post that outlines a novice's step-by-step guide to setting up a web service like this.

This will introduce more people to the excellent (and free!) Google App Engine. It will also undoubtedly encourage more people to adopt Google's specs for crawlable AJAX content... something we can all benefit from!

As Google's specification gains more acceptance, the "hurdle" of setting up a headless browser is going to send many devs Googling for answers! Get in now with an answer for fame and glory! (edit: at the very least I will sing your praises).

Hit me up on Twitter @_chrisjacob if you would like to discuss solutions.

梦里°也失望 2024-09-22 04:10:35

I have successfully used HTMLUnit on AppEngine. My GWT code for doing this is available in the gwt-platform project, and the results I got were comparable to those of the HTMLUnit-AppEngine test application by Amit Manjhi (http://ajax-crawler.appspot.com/).

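It should be relatively easy to use GWTP's current HTMLUnit support to do exactly what you describe, although you could probably do it in a simpler application. One problem I found is that AppEngine requests have a 30-second timeout, so HTMLUnit cannot take longer than that to process the page.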
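
Because of that 30-second limit, the 45-second wait proposed in the question would not fit in a single request, so the snapshot code needs a time budget and must give up cleanly when it runs out. A minimal, library-independent sketch of such a bounded wait follows; wait_for and its parameters are hypothetical names, and the 25-second figure is only an illustrative margin below the limit. With HTMLUnit the same idea would be written in Java around its own waiting facilities.

```python
import time


def wait_for(is_done, budget_seconds, poll_interval=0.05):
    """Poll is_done() until it returns True or the time budget is spent.

    Returns True if the work finished in time, False if the budget ran out.
    """
    deadline = time.monotonic() + budget_seconds
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Illustrative use: pretend the page's AJAX calls finish after 0.2 seconds.
started = time.monotonic()
finished = wait_for(lambda: time.monotonic() - started >= 0.2, budget_seconds=25)
print("snapshot ready" if finished else "timed out, serve what has rendered so far")
```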

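Update:
It has been a while, but I finally closed the long-standing issue about making GWT applications crawlable with GWTP. The documentation is not complete, but check out the issue:
http://code.google.com/p/gwt-platform/issues/detail?id=1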