Is it worth using QTextBrowser instead of QWebPage to parse and modify HTML pages in the background?

Posted 2024-12-21 21:27:12 · 1848 characters · 6 views · 0 comments

Purely to learn C++ and Qt, I'm writing a small Qt-based program that reads HTML files (up to several hundred) from a local directory, modifies them, and writes them to another local directory.

My first attempt used QWebPage and the HTML parsing functionality provided by QWebElement. However, I ran into severe memory leaks with QWebPage (very likely because I'm not using it correctly, but that is another topic and not part of this question).

So far I'm not using any GUI. I intend to add one later, but this part of the program will never be part of the GUI; it will run somewhere in the background.
So I thought of replacing QWebPage with QTextBrowser, which seems more lightweight. However, I could not find functions in the Qt API similar to the parsing functions of QWebElement. So far my code relies on QWebElement::findFirst(), QWebElement::nextSibling() and finally QWebElement::takeFromDocument().

So, is there a reasonably painless way of implementing (or using) QTextBrowser as an HTML parser? Is there perhaps even a 'best practice'?
I do not need to evaluate any JavaScript, although it is very likely inlined in the HTML pages. Nor do I need CSS for styling, although the pages in question use it heavily. I just need to retrieve certain HTML blocks (such as table rows) based on their id or CSS class.
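
To make the requirement concrete, here is a minimal sketch in plain standard C++ of the kind of operation meant: removing the first element with a given tag and attribute value from an HTML string. The function name removeFirstElement and its parameters are hypothetical and not part of Qt. It is deliberately naive: it assumes lowercase tag names, double-quoted attribute values, an exact textual match of attr="value", and no nested element of the same tag inside the match, so it is not a substitute for a real HTML parser.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Qt API): removes the first
// <tag ... attr="value" ...> ... </tag> block from html.
// Returns true if a block was removed, false otherwise.
// Assumptions: lowercase tag names, double-quoted attribute values,
// exact textual match of attr="value", no nested element of the same tag.
bool removeFirstElement(std::string &html, const std::string &tag,
                        const std::string &attr, const std::string &value)
{
    const std::string open = "<" + tag;
    const std::string close = "</" + tag + ">";
    const std::string needle = attr + "=\"" + value + "\"";

    std::size_t start = 0;
    while ((start = html.find(open, start)) != std::string::npos) {
        const std::size_t afterName = start + open.size();
        const std::size_t tagEnd = html.find('>', afterName);
        if (tagEnd == std::string::npos)
            return false; // unterminated start tag: give up

        // Accept only a complete tag name, e.g. "<tr" but not "<track".
        const char next = html[afterName];
        const bool wholeName = next == '>' || next == ' ' || next == '\t' ||
                               next == '\n' || next == '\r' || next == '/';
        if (wholeName) {
            // Text between the tag name and the closing '>' of the start tag.
            const std::string attrs = html.substr(afterName, tagEnd - afterName);
            if (attrs.find(needle) != std::string::npos) {
                const std::size_t end = html.find(close, tagEnd);
                if (end == std::string::npos)
                    return false; // no matching end tag
                html.erase(start, end + close.size() - start);
                return true;
            }
        }
        start = afterName; // continue searching after this start tag
    }
    return false;
}
```

For example, removeFirstElement(html, "tr", "class", "c2") would drop the first row whose start tag contains class="c2", which roughly corresponds to findFirst( "tr[class=c2]" ).takeFromDocument() for simple, regular markup.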

PS: I'm only willing to use an existing C++ HTML parsing library if all feasible and reasonable attempts with pure Qt fail.

PPS: Just for the sake of seeing and knowing them, I'd also like to get to know extraordinary solutions. ;-)

Here is the part of my current code that parses the HTML page and removes certain parts of it using QWebElement. reportPage is a pointer to a QWebPage object.

// Remove the first row with class c2 (QWebPage has no document(); go through the main frame).
reportPage->mainFrame()->documentElement().findFirst( "table[id=gadgettable]" ).findFirst( "tr[class=c2]" ).takeFromDocument();
// Remove the first remaining row of the table.
reportPage->mainFrame()->documentElement().findFirst( "table[id=gadgettable]" ).findFirst( "tr" ).takeFromDocument();
// Remove the left and right cells by id.
reportPage->mainFrame()->documentElement().findFirst( "table[id=gadgettable]" ).findFirst( "td[id=gadgettable-left-td]" ).takeFromDocument();
reportPage->mainFrame()->documentElement().findFirst( "table[id=gadgettable]" ).findFirst( "td[id=gadgettable-right-td]" ).takeFromDocument();
// Twice: remove the sibling that follows the first row.
reportPage->mainFrame()->documentElement().findFirst( "table[id=gadgettable]" ).findFirst( "tr" ).nextSibling().takeFromDocument();
reportPage->mainFrame()->documentElement().findFirst( "table[id=gadgettable]" ).findFirst( "tr" ).nextSibling().takeFromDocument();
