What is the most elegant way to do screen scraping in Node.js?
I'm in the process of hacking together a web app which uses extensive screen scraping in node.js. I feel like I'm fighting against the current at every corner. There must be an easier way to do this. Most notably, two things are irritating:
Cookie propagation. I can pull the 'set-cookie' array out of the response headers, but performing string operations to parse the cookies out of the array feels extremely hackish.
Redirect following. I want each request to follow through redirects when a 302 status code is returned.
I came across two things which looked useful, but which I couldn't use in the end:
http://zombie.labnotes.org/, but it doesn't have HTTPS support, so I can't use it.
http://www.phantomjs.org/, but I couldn't use it because it doesn't (appear to) integrate with node.js. It's also pretty heavyweight for what I'm doing.
Are there any JavaScript screenscraping-esque libraries which propagate cookies, follow redirects, and support HTTPS? Any pointers on how to make this easier?
Comments (3)
I actually have a scraper library now: https://github.com/mikeal/spider. It's quite nice; you can use jQuery and routes.
Feedback is welcome :)
You may want to check out https://github.com/mikeal/request from mikeal. I just spoke to him in the chatroom, and he says that it does not handle cookies at the moment, but you can write a submodule to handle these for you in the meantime.
As for redirects, it handles them beautifully :)
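For the cookie part, a small helper along these lines might be enough in the meantime. This is only a rough sketch: `SimpleCookieJar` is a made-up name and not part of request, and it assumes you only care about name/value pairs, so it ignores Domain, Path, Expires, Secure and the other attributes. It takes the 'set-cookie' array that node.js exposes on `response.headers` and builds the string for the outgoing `Cookie` header.

```javascript
// Hypothetical helper (not part of any library): a minimal cookie jar that
// keeps only name=value pairs and ignores attributes such as Domain, Path,
// Expires, and Secure.
class SimpleCookieJar {
  constructor() {
    this.cookies = new Map();
  }

  // Accepts the 'set-cookie' value from response.headers, which node.js
  // provides as an array of strings (or undefined when the header is absent).
  store(setCookieHeaders) {
    for (const header of setCookieHeaders || []) {
      // The first ';'-separated segment holds the name=value pair.
      const pair = header.split(';')[0];
      const eq = pair.indexOf('=');
      if (eq === -1) continue; // no '=' at all: skip the entry
      const name = pair.slice(0, eq).trim();
      if (!name) continue; // skip entries with an empty name
      const value = pair.slice(eq + 1).trim();
      this.cookies.set(name, value);
    }
  }

  // Builds the value for an outgoing 'Cookie' request header.
  header() {
    return Array.from(this.cookies, ([name, value]) => `${name}=${value}`).join('; ');
  }
}

// Usage: feed it the headers from one response, reuse it on the next request.
const jar = new SimpleCookieJar();
jar.store(['sid=abc123; Path=/; HttpOnly', 'theme=dark; Max-Age=3600']);
console.log(jar.header()); // sid=abc123; theme=dark
```

If you scrape more than one host, you would need to respect at least Domain and Path, at which point a proper cookie library is probably the better choice.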
It turns out someone made a phantomjs module for node.js:
https://github.com/sgentle/phantomjs-node
While phantom is fairly heavy, it also supports SSL, cookies, and everything else a typical browser supports (since it is a WebKit browser, after all).
Give it a shot; it may be exactly what you are looking for.