Which layout engine can find the coordinates of HTML elements on a web page?
I am working on a web data classification task and was wondering whether I could get the coordinates of HTML elements as they would appear in a web browser, without taking into account any CSS or JavaScript referenced by the page.
My programming language is C++, and I need results for a couple of million pages, so it has to be fast. I know there is a Microsoft COM component that renders the page in a web browser control and can then be queried for the positions of different HTML tags. That is not suitable in my case, though, because it renders the whole page first, which takes a lot of time.
As far as I can tell, the open-source layout engines WebKit and Gecko could probably be used for this. But each is a huge codebase, so I need someone to point me to the right classes or modules to look into, or to any similar work someone has already done. Also, please let me know which you think is a good choice if I want to customize the existing code to run on multiple threads to make it faster.
Thanks
Comments (1)
Generally, you will find that different page rendering engines each render HTML in their own way, so the results will differ.
The thing is, if you stick to one concrete browser engine, you have to somehow bring that engine into your project and use its interface to retrieve these coordinates. That is a tough task, simply because you'll have to read a lot of documentation and crawl through thousands of files.
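To make the "bring the engine in and use its interface" idea a bit more concrete, here is a minimal C++ sketch of how the surrounding code might be shaped. Everything in it is an assumption for illustration: `ElementBox`, `LayoutEngine`, `StubEngine` and `layout_all` are hypothetical names, not part of WebKit, Gecko or any Microsoft component, and the stub does no real layout (it just stacks each opening tag in a 20-pixel row). A real adapter would wrap whichever engine you pick behind the same interface.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Position and size of one laid-out element, in pixels (hypothetical type).
struct ElementBox {
    std::string tag;
    int x, y, width, height;
};

// Hypothetical adapter interface; a real subclass would wrap an actual engine.
class LayoutEngine {
public:
    virtual ~LayoutEngine() = default;
    virtual std::vector<ElementBox> layout(const std::string& html) = 0;
};

// Placeholder only: gives every opening tag a full-width 20px row.
// This is NOT real layout, it just lets the threading code run.
class StubEngine : public LayoutEngine {
public:
    std::vector<ElementBox> layout(const std::string& html) override {
        std::vector<ElementBox> boxes;
        int y = 0;
        for (std::size_t i = 0; i < html.size(); ++i) {
            // Skip anything that is not the start of an opening tag.
            if (html[i] != '<' || i + 1 >= html.size() || html[i + 1] == '/') continue;
            std::size_t end = html.find_first_of(" >", i + 1);
            if (end == std::string::npos) break;
            boxes.push_back({html.substr(i + 1, end - i - 1), 0, y, 800, 20});
            y += 20;
        }
        return boxes;
    }
};

// Lays out all pages on `workers` threads. Each thread owns its own Engine
// instance, so nothing is assumed about an engine being shareable.
template <typename Engine>
std::vector<std::vector<ElementBox>> layout_all(const std::vector<std::string>& pages,
                                                unsigned workers) {
    std::vector<std::vector<ElementBox>> results(pages.size());
    std::atomic<std::size_t> next{0};  // index of the next unclaimed page
    auto work = [&] {
        Engine engine;
        for (std::size_t i = next.fetch_add(1); i < pages.size(); i = next.fetch_add(1)) {
            results[i] = engine.layout(pages[i]);  // each index is written by one thread only
        }
    };
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) pool.emplace_back(work);
    for (auto& t : pool) t.join();
    return results;
}
```

You would call it as something like `layout_all<StubEngine>(pages, std::thread::hardware_concurrency())`. Giving each worker its own engine instance is a deliberately cautious choice here, since whether a given engine can be used from several threads at once is something you'd have to check in that engine's own documentation.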
I think the right approach would be to post this question somewhere specific to the page rendering engine you've chosen (Gecko, WebKit, ...).
If you prefer sticking to something MS-specific, I'd guess it will be easier, but I can't help you with the class names or code chunks you want to see. Somebody else could probably guide you in that case.