Are there any libraries or frameworks that provide the functionality of a browser without actually rendering to the screen?
I want to automate navigation on web pages (as Mechanize does, for example), but with the full browser experience, including JavaScript. In other words, I'd like a virtual browser of some sort that I can use to "click on links" programmatically, that renders DOM elements and runs JS scripts, and that lets me manipulate those elements.
A solution in Python is preferred, but I can manage with other languages.
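To make the gap concrete, here is a minimal sketch of what a parser-only approach (in the spirit of Mechanize) can see. It uses only the Python standard library; the page URL and markup are hypothetical and exist purely for illustration. A link present in the static HTML is found, while a link that a script would add at run time is not, which is why a real JavaScript engine is needed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical page: one static link, plus one link that would only
# exist after the inline script has been executed by a browser.
PAGE_URL = "http://example.com/docs/index.html"
PAGE_HTML = """
<html><body>
  <a href="page2.html">Next</a>
  <script>
    document.write('<a href="/hidden.html">Added by script</a>');
  </script>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects absolute URLs of <a href=...> tags in the static markup."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL,
                # as a browser would when a link is "clicked".
                self.links.append(urljoin(self.base_url, href))


def static_links(html, base_url):
    """Return the links visible without running any JavaScript."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


links = static_links(PAGE_HTML, PAGE_URL)
print(links)
```

The script body is treated as plain text by the parser, so the link it would write into the page never shows up in `links`.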
Comments (7)
HtmlUnit in Java is very good. I think only the Java implementations of headless browsers manage to provide JavaScript support.
MaxQ, which I have read about, sounds like it might be interesting: "written in Java, generates Jython scripts".
Try HtmlUnit!
PhantomJS and PyPhantomJS are what I use for tasks like these.
PhantomJS is a headless WebKit-based browser that is fully controllable via JavaScript. There is a C++ implementation (PhantomJS) and a Python one (PyPhantomJS). I prefer the Python one, though, because it has a plugin system that lets you add functionality to the core without modifying any of its code, unlike the C++ one. :)
There is an absolute ton of free software technology now available: take your pick at http://wiki.python.org/moin/WebBrowserProgramming. If you have specific questions, join pyjamas-dev on Google Groups and I'll be happy to give further details there. Brief answer: you can run pywebkitgtk "headless", or you can use xulrunner (via python-hulahop), again using pygtk, without actually calling "browserwidget.show()"; there's also pykhtml. You could also use Python COM to connect to MSHTML.DLL.
These are all "cheat" methods: they use Python bindings to a graphical web browser engine without actually firing up the graphical part. If you really wanted to put in some serious hard-core programming, you could create a "port" of WebKit that is not connected to any GUI toolkit. As an experienced WebKit programmer, I'd estimate around 2 weeks of full-time effort to make such a "headless" version of WebKit.
l.
It looks like http://watin.sourceforge.net/ might be a good way to go.
If you don't have to use pure Python, you could use IronPython, since WatiN is a C# project.
Take a look at this little doozy on Ajaxian:
http://ajaxian.com/archives/server-side-rendering-with-yui-on-node-js
It also talks about Aptana Jaxer, which I think runs on a headless Firefox, so it is basically the Mozilla browser engine in all its glory.
There is Kapow. It's pure Java and costs money:
http://kapowtech.com/
And there is Lixto. It's Eclipse-based and uses Mozilla Gecko as its rendering engine (unless they have already changed it to WebKit, as they said years ago they would). It's very nice and also costs money:
http://www.lixto.com/?page_id=50
They are both graphical tools in which you define the site navigation and what should be extracted by point and click. But you can also write XPath and regular expressions, and even JavaScript that runs in the site's context.
I used them both in the lectures "Web Data Extraction" and "Applied Web Data Extraction" at the Vienna University of Technology (Lixto was written by the professor who held the lecture).
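As a rough illustration of the XPath-plus-regular-expression style of extraction mentioned above, here is a small sketch using only the Python standard library. It does not show the Kapow or Lixto APIs; the markup, class names, and the `extract_items` helper are hypothetical. Note that `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed input.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product listing used only for illustration.
SAMPLE = """
<html><body>
  <div class="item"><span class="name">Widget</span><span class="price">EUR 12.50</span></div>
  <div class="item"><span class="name">Gadget</span><span class="price">EUR 7.00</span></div>
  <div class="ad"><span class="name">Sponsored</span></div>
</body></html>
"""

# Regular expression step: pull the numeric part out of a price string.
PRICE_RE = re.compile(r"(\d+(?:\.\d+)?)")


def extract_items(markup):
    """Return (name, price) tuples for every <div class="item">."""
    root = ET.fromstring(markup.strip())
    items = []
    # XPath step: select only the item blocks, skipping the ad block.
    for div in root.findall(".//div[@class='item']"):
        name = div.findtext("span[@class='name']")
        price_text = div.findtext("span[@class='price']", default="")
        match = PRICE_RE.search(price_text)
        price = float(match.group(1)) if match else None
        items.append((name, price))
    return items


items = extract_items(SAMPLE)
print(items)
```

The XPath expression decides which nodes are relevant, and the regular expression cleans up the text inside them; real-world HTML is rarely well-formed, so a more tolerant parser would likely be needed in practice.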