lxml or libxml2: which is better for parsing malformed HTML in Python?

Posted 2025-01-07 01:24:27

Which one is better and more useful for parsing malformed HTML?
I cannot find out how to use libxml2.

Thanks.

Comments (4)

绿光 2025-01-14 01:24:27

On the libxml2 page you can see this note:

Note that some of the Python purist dislike the default set of Python bindings, rather than complaining I suggest they have a look at lxml the more pythonic bindings for libxml2 and libxslt and check the mailing-list.

And on the lxml page there is this one:

The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API.

So essentially, with lxml you get exactly the same functionality, but with a Pythonic API compatible with the ElementTree library in the standard library (which means the standard library documentation is useful for learning how to use lxml). That's why lxml is preferred over libxml2, even though the underlying implementation is the same.
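
As a rough illustration of that compatibility, the sketch below uses the standard library's `xml.etree.ElementTree`. The sample document and variable names are made up for the example. Assuming lxml is installed, replacing the import with `from lxml import etree as ET` should leave these particular calls working unchanged.

```python
import xml.etree.ElementTree as ET  # assumption: `from lxml import etree as ET` also works here

# Hypothetical well-formed sample document, invented for this sketch.
doc = (
    "<library>"
    "<book id='1'>Dive Into Python</book>"
    "<book id='2'>Fluent Python</book>"
    "</library>"
)

root = ET.fromstring(doc)

# Element-style navigation that both libraries expose.
titles = [book.text for book in root.findall("book")]
ids = [book.get("id") for book in root.iter("book")]

print(root.tag, titles, ids)
```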

Edit: Having said that, as other answers explain, to parse malformed html your best option is to use BeautifulSoup. One interesting thing to note is that, if you have installed lxml, BeautifulSoup will use it as explained in the documentation for the new version:

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.
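
The ranking in that quote can be mimicked with a few lines of standard-library code. This is only a sketch of the idea, not how Beautiful Soup implements it; `pick_parser_name` and `PREFERENCE` are hypothetical names invented here.

```python
import importlib.util

# Preference order taken from the quoted documentation:
# lxml first, then html5lib, then Python's built-in parser.
PREFERENCE = [("lxml", "lxml"), ("html5lib", "html5lib")]


def pick_parser_name():
    """Return a parser name suitable for BeautifulSoup(markup, name)."""
    for module_name, parser_name in PREFERENCE:
        # find_spec returns None when the module is not installed.
        if importlib.util.find_spec(module_name) is not None:
            return parser_name
    return "html.parser"  # ships with the standard library


chosen = pick_parser_name()
print(chosen)
```

Passing the name explicitly, as in `bs4.BeautifulSoup(markup, chosen)`, keeps the result the same from one machine to the next, which relying on the default does not.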

Anyway, even if BeautifulSoup uses lxml under the hood, you'll be able to parse broken HTML that you can't parse with lxml directly. For example:

>>> import lxml.etree
>>> lxml.etree.fromstring('<html>')
...
XMLSyntaxError: Premature end of data in tag html line 1, line 1, column 7

However:

>>> import bs4
>>> bs4.BeautifulSoup('<html>', 'lxml')
<html></html>
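
The difference between those two results seems to come from the parser rather than from the library: `lxml.etree.fromstring` defaults to a strict XML parser, while lxml also ships a lenient HTML parser. A minimal sketch, assuming lxml is installed (the sample markup is invented for illustration):

```python
from lxml import etree

# Hypothetical broken markup: <p> and <b> are never closed.
broken = "<div><p>Hello <b>world</div>"

# The default parser is strict XML, so mismatched tags raise an error.
try:
    etree.fromstring(broken)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# Passing an HTML parser switches to libxml2's lenient HTML mode,
# which repairs the tree and wraps it in <html><body>.
root = etree.fromstring(broken, etree.HTMLParser())
bold_text = root.findtext(".//b")

print(strict_ok, root.tag, bold_text)
```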

Finally, note that lxml also provides an interface to the old version of BeautifulSoup as follows:

>>> import lxml.html.soupparser
>>> lxml.html.soupparser.fromstring('<html>')
<Element html at 0x13bd230>

So at the end of the day, you'll probably be using lxml and BeautifulSoup anyway. The only thing you have to choose is which API you like best.

看春风乍起 2025-01-14 01:24:27

Try BeautifulSoup instead. It is aimed at parsing poorly structured data.

http://pypi.python.org/pypi/BeautifulSoup

http://lxml.de/elementsoup.html

长安忆 2025-01-14 01:24:27

BeautifulSoup is good for parsing HTML. Check its examples and you will see that it compares well with the others.

屌丝范 2025-01-14 01:24:27

lxml is the one that's generally recommended. Specifically, lxml.html (if I recall correctly).

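I believe it uses libxml2 under the hood, and if the HTML is particularly nasty you can hand it to BeautifulSoup through lxml.html.soupparser (as far as I know you have to call that explicitly, it is not an automatic fallback). But don't take my word for it, check out the website! ( http://lxml.de/ )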
