Problems with web scraping using zombie.js
I need to do some web scraping. After playing around with different web testing frameworks, most of which were either too slow (Selenium) or too buggy for my needs (env.js), I decided that zombie.js looks most promising, as it uses a solid set of libraries for HTML parsing and DOM manipulation. However, it seems to me that it doesn't even support basic event-based JavaScript code like that in the following web page:
<html>
  <head>
    <title>test</title>
    <script type="text/javascript">
      console.log("test script executing...");
      console.log("registering callback for event DOMContentLoaded on " + document);
      document.addEventListener('DOMContentLoaded', function () {
        console.log("DOMContentLoaded triggered");
      }, false);
      function loaded() {
        console.log("onload triggered");
      }
    </script>
  </head>
  <body onload="loaded();">
    <h1>Test</h1>
  </body>
</html>
I then decided to trigger those events manually like this:
var zombie = require("zombie");

zombie.visit("http://localhost:4567/", { debug: true }, function (err, browser, status) {
  var doc = browser.document;
  console.log("firing DOMContentLoaded on " + doc);
  browser.fire("DOMContentLoaded", doc, function (err, browser, status) {
    var body = browser.querySelector("body");
    console.log("firing load on " + body);
    browser.fire("load", body, function (err, browser, status) {
      console.log(browser.html());
    });
  });
});
This works for this particular test page. My problem is a more general one, though: I want to be able to scrape more complex, AJAX-based sites like a friends list on Facebook (something like http://www.facebook.com/profile.php?id=100000028174850&sk=friends&v=friends). It is no problem to log into the site using zombie, but some content, like those lists, seems to be loaded entirely dynamically via AJAX, and I don't know how to trigger the event handlers that initiate the loading.
There are several questions I have regarding this problem:
- Has somebody already implemented a similarly complex scraper without using a browser remote-controlling solution like Selenium?
- Is there some reference on the loading process of a complex JavaScript-based page?
- Can somebody provide advice on how to debug a real browser to see what I might need to execute to trigger the Facebook event handlers?
- Any other ideas about this topic?
Again, please do not point me to solutions involving controlling a real browser like Selenium, as I know about those. What is welcome, however, are suggestions for a real in-memory renderer like WebKit accessible from the Ruby scripting language, preferably with the possibility to set cookies and preferably also to load raw HTML instead of triggering real HTTP requests.
Comments (1)
For purposes of data extraction, running a "headless browser" and triggering JavaScript events manually is not going to be the easiest thing to do. While not impossible, there are simpler ways to do it.
Most sites, even AJAX-heavy ones, can be scraped without executing a single line of their JavaScript code. In fact, it's usually easier than trying to figure out a site's JavaScript code, which is often obfuscated, minified, and difficult to debug. If you have a solid understanding of HTTP you will understand why: (almost) all interactions with the server are encoded as HTTP requests, so whether they are initiated by JavaScript, by the user clicking a link, or by custom code in a bot program, there is no difference to the server. (I say "almost" because when Flash or applets get involved there's no telling what data is flying where; they can be application-specific. But anything done in JavaScript will go over HTTP.)
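For illustration, here is the kind of raw request the server sees either way; the endpoint, parameters, and cookie value are made up for this example:

    GET /ajax/friends.php?id=123&offset=0 HTTP/1.1
    Host: www.example.com
    Cookie: session=abc123
    X-Requested-With: XMLHttpRequest

Whether these bytes were produced by the page's own JavaScript or written out by your program, the server cannot tell the difference.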
That being said, it is possible to mimic a user on any website using custom software. First you have to be able to see the raw HTTP requests being sent to the server. You can use a proxy server to record the requests made by a real browser to the target website. There are many, many tools you can use for this: Charles and Fiddler are handy, most dedicated screen-scraper tools have a basic proxy built in, the Firebug extension for Firefox and the developer tools in Chrome have similar panels for viewing AJAX requests...you get the idea.
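If you want to see what such a recording proxy boils down to, here is a minimal sketch in Node.js. It handles plain HTTP only (HTTPS would additionally require CONNECT tunneling, which the dedicated tools above take care of):

    // logging-proxy.js - minimal sketch of an HTTP logging proxy (plain HTTP only)
    var http = require("http");
    var url = require("url");

    http.createServer(function (clientReq, clientRes) {
      // Log the request line and headers exactly as the browser sent them.
      console.log(clientReq.method + " " + clientReq.url);
      console.log(JSON.stringify(clientReq.headers, null, 2));

      // Forward the request to the real server and relay the response back.
      var target = url.parse(clientReq.url);
      var proxyReq = http.request({
        host: target.hostname,
        port: target.port || 80,
        path: target.path,
        method: clientReq.method,
        headers: clientReq.headers
      }, function (proxyRes) {
        clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
        proxyRes.pipe(clientRes);
      });
      clientReq.pipe(proxyReq); // also forwards any POST body
    }).listen(8080, function () {
      console.log("logging proxy listening on port 8080");
    });

Point the browser's HTTP proxy setting at localhost:8080 and every request it makes, AJAX or otherwise, shows up in the console.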
Once you can see the HTTP requests that are made as a result of a particular action on the website, it is easy to write a program to mimic these requests; just send the same requests to the server and it will treat your program just like a browser in which that action has been performed.
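As a sketch of what such a replay can look like in Node.js; the host, path, and cookie value below are placeholders copied from the imaginary capture above, not a real Facebook endpoint:

    // replay-request.js - minimal sketch of replaying a captured AJAX request
    var http = require("http");

    http.request({
      host: "www.example.com", // placeholder: use the host from your own capture
      port: 80,
      path: "/ajax/friends.php?id=123&offset=0",
      method: "GET",
      headers: {
        // Replaying the session cookie is what makes the server treat this
        // request as part of the logged-in session.
        "Cookie": "session=abc123",
        "User-Agent": "Mozilla/5.0",
        "X-Requested-With": "XMLHttpRequest"
      }
    }, function (res) {
      res.setEncoding("utf8");
      var body = "";
      res.on("data", function (chunk) { body += chunk; });
      res.on("end", function () {
        // Usually JSON or an HTML fragment that you can parse directly,
        // without ever running the site's JavaScript.
        console.log(res.statusCode, body);
      });
    }).end();

The response to such a request is typically the raw data the page's JavaScript would have rendered into the DOM, which is exactly what you want for scraping.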
Different languages have different libraries offering different capabilities. For Ruby, I have seen a lot of people using mechanize.
If data extraction is your only goal, then you'll almost always be able to get what you need by mimicking HTTP requests this way. No JavaScript required.
Note - Since you mentioned Facebook, I should mention that scraping Facebook specifically can be exceptionally difficult (although not impossible), because Facebook has measures in place to detect automated access (they use more than just captchas); they will disable an account if they see suspicious activity coming from it. It is, after all, against their terms of service (section 3.2).