Finding the rendered positions of HTML elements using WebKit (or Gecko)
I would like to get the dimensions (coordinates) of all the HTML elements of a web page as a browser renders them, that is, the positions at which they are drawn. For example: (top-left, top-right, bottom-left, bottom-right).
I could not find this in lxml. Is there a Python library that does this? I also looked at Mechanize::Mozilla in Perl, but it seems difficult to configure and set up.
I think the best way to meet my requirement is to use a rendering engine such as WebKit or Gecko.
Are there any Perl/Python bindings available for these two rendering engines? Google searches for tutorials on how to "plug in" to the WebKit rendering engine have not been very helpful.
Comments (7)
lxml isn't going to help you at all. It is an HTML/XML parser and isn't concerned with front-end rendering.
To accurately work out how something renders, you need to render it. For that you need to hook into a browser, load the page, and run some JS on the page to find the DOM elements and get their attributes.
It's totally possible, but I think you should start by looking at how website screenshot factories work (they'll share 90% of the code you need to get a browser launching and showing the right page).
You may still want to use lxml to inject your JavaScript into the page.
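As a rough illustration of the "run some JS on the page" step, the sketch below turns one element's bounding box into the four corner points the question asks for. It assumes the standard getBoundingClientRect() method and the window.scrollX / window.scrollY offsets. The helper names cornersOf and describeElement are made up for this example, and how the script gets into the page (browser automation, an injected script tag) is left open.

```javascript
// Convert a viewport-relative rectangle into four document-relative corners.
// `rect` needs left/top/right/bottom, as returned by getBoundingClientRect().
function cornersOf(rect, scrollX, scrollY) {
  const left = rect.left + scrollX;
  const right = rect.right + scrollX;
  const top = rect.top + scrollY;
  const bottom = rect.bottom + scrollY;
  return {
    topLeft: { x: left, y: top },
    topRight: { x: right, y: top },
    bottomLeft: { x: left, y: bottom },
    bottomRight: { x: right, y: bottom },
  };
}

// Hypothetical helper: describe one element by tag name and corners.
// `win` is the page's window object; it is passed in explicitly so the
// function can also be exercised outside a browser.
function describeElement(el, win) {
  const rect = el.getBoundingClientRect();
  return {
    tag: el.tagName.toLowerCase(),
    corners: cornersOf(rect, win.scrollX || 0, win.scrollY || 0),
  };
}
```

Inside a rendered page, describeElement(document.body, window) would give the corners of the body element. Adding the scroll offsets is a design choice: it makes the coordinates relative to the document rather than to the visible viewport.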
I agree with Oli: rendering the page in question and inspecting the DOM via JavaScript is the most practical way, IMHO.
You might find jQuery very useful here.
Related documentation is here.
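A minimal sketch of how jQuery could be used for this, assuming jQuery is already loaded on the rendered page. It relies on jQuery's .offset(), .outerWidth() and .outerHeight() methods. The function name collectBoxes is hypothetical, and this is only one possible way to do it.

```javascript
// Hypothetical helper: collect the document-relative box of every element.
// `$` is the jQuery function, passed in explicitly to keep this self-contained.
function collectBoxes($) {
  return $("*")
    .map(function (index, el) {
      const $el = $(el);
      const pos = $el.offset(); // { top, left } relative to the document
      return {
        tag: el.tagName.toLowerCase(),
        left: pos.left,
        top: pos.top,
        width: $el.outerWidth(), // includes padding and border
        height: $el.outerHeight(),
      };
    })
    .get(); // unwrap the jQuery object into a plain array
}
```

On a page with jQuery available, collectBoxes(jQuery) returns one record per element. Elements that are not displayed will not have a meaningful box, so you may want to filter them out.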
Yes, JavaScript is the way to go:
var allElements = document.getElementsByTagName("*"); will select all the elements in the page.
Then you can loop through this and extract the information you need from each element. Good documentation about getting the dimensions and positions of an element is here.
getElementsByTagName returns a live collection, not an array (so if your JS changes the HTML, those changes will be reflected in the collection), so I'd be tempted to build the data into an AJAX POST and send it to a server when it's done.
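The loop and the "send it to a server" idea could be sketched as follows. Array.from takes a static snapshot of the live collection first, so later DOM changes do not affect the iteration. The names snapshotLayout and postLayout, and the /layout URL in the usage note, are placeholders; fetch stands in for whatever AJAX mechanism is available.

```javascript
// Take a static snapshot of every element and record its rendered box.
// `doc` is the page's document object.
function snapshotLayout(doc) {
  // getElementsByTagName("*") is live; Array.from copies it into a real array.
  const allElements = Array.from(doc.getElementsByTagName("*"));
  return allElements.map(function (el) {
    const r = el.getBoundingClientRect(); // viewport-relative box
    return {
      tag: el.tagName.toLowerCase(),
      left: r.left,
      top: r.top,
      width: r.width,
      height: r.height,
    };
  });
}

// Send the collected records to a server as JSON.
// `url` is a placeholder for an endpoint you control.
function postLayout(url, records) {
  return fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(records),
  });
}
```

In the page this might be called as postLayout("/layout", snapshotLayout(document)) once the page has finished loading. Collecting everything first and posting once keeps the measurement pass separate from the network call.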
I was not able to find any easy solution (i.e. Java/Perl/Python :) to hook into WebKit/Gecko to solve the above rendering problem. The best I could find was the Lobo rendering engine, written in Java, which has a very clear API that does exactly what I want: access to both the DOM and the rendering attributes of HTML elements.
JRex is a Java wrapper for the Gecko rendering engine.
You have three main options:
1) http://www.gnu.org/software/pythonwebkit, which is WebKit-based;
2) python-comtypes for accessing MSHTML (Windows only);
3) hulahop (python-xpcom), which is xulrunner-based.
You should get the pyjamas-desktop source code and look in the pyjd/ directory for the "startup" code, which will allow you to create a web browser application and, once the engine has called the "page loaded" callback, begin manipulating the DOM.
You can perform node-walking and access the properties of the DOM elements that you require. Look at the pyjamas/library/pyjamas/DOM.py module to see many of the things you will need to use in order to do what you want.
If the three options above are not enough, read the page http://wiki.python.org/moin/WebBrowserProgramming for further options, many of which have been mentioned here by other people.
l.
You might consider looking at WWW::Selenium. With it (and Selenium RC) you can puppet-string IE, Firefox, or Safari from inside Perl.
The problem is that current browsers don't render things quite the same way. If you're looking for the standards-compliant way of doing things, you could probably write something in Python to render the page, but that's going to be a hell of a lot of work.
You could use the wxHTML control from wxWidgets to render each part of a page individually to get an idea of its size.
If you have a Mac you could try WebKit. That same article has some suggestions for solutions on other platforms too.