Which is better for parsing malformed HTML in Python: lxml or libxml2?

Posted 2025-01-07 01:24:27

Which one is better and more useful for malformed HTML?
I can't find out how to use libxml2.

Thanks.

绿光 2025-01-14 01:24:27

On the libxml2 page you can find this note:

Note that some of the Python purist dislike the default set of Python bindings, rather than complaining I suggest they have a look at lxml the more pythonic bindings for libxml2 and libxslt and check the mailing-list.

And on the lxml page, this one:

The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API.

So essentially, with lxml you get exactly the same functionality, but with a Pythonic API compatible with the ElementTree library in the standard library (which means the standard library documentation is also useful for learning how to use lxml). That's why lxml is preferred over libxml2's own bindings, even though the underlying implementation is the same.
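To illustrate that compatibility, here is a minimal sketch that runs against either library. The try/except import and the sample catalog document are illustrative assumptions, not part of the original answer; only calls shared by lxml.etree and xml.etree.ElementTree are used.

```python
# Use lxml when it is installed, otherwise fall back to the standard library.
# The calls below exist with the same meaning in both modules.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical sample document, well-formed XML on purpose.
doc = etree.fromstring(
    "<catalog>"
    "<book id='b1'><title>XML Basics</title></book>"
    "<book id='b2'><title>Parsing HTML</title></book>"
    "</catalog>"
)

# findall() matches direct children; findtext() returns a child's text.
titles = [book.findtext("title") for book in doc.findall("book")]

# iter() walks the whole tree; get() reads an attribute.
ids = [book.get("id") for book in doc.iter("book")]

print(titles)
print(ids)
```

Code written this way does not need to change when switching between the two libraries, which is the practical benefit of the shared API.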

Edit: Having said that, as other answers explain, to parse malformed HTML your best option is to use BeautifulSoup. One interesting thing to note is that, if you have lxml installed, BeautifulSoup will use it, as explained in the documentation for the new version:

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.
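The ranking described in that quote can be sketched roughly as follows. This is not Beautiful Soup's actual implementation, and best_available_parser is a hypothetical name; the sketch only checks which modules are importable and returns the parser name that Beautiful Soup accepts as its second argument ("lxml", "html5lib" or "html.parser").

```python
import importlib.util

# Preference order taken from the quoted documentation:
# lxml first, then html5lib, then Python's built-in parser.
PREFERRED_MODULES = ["lxml", "html5lib"]


def best_available_parser():
    """Return the name of the highest-ranked parser that is installed."""
    for module_name in PREFERRED_MODULES:
        # find_spec() returns None when the top-level module is not installed.
        if importlib.util.find_spec(module_name) is not None:
            return module_name
    # The built-in parser is always available.
    return "html.parser"


parser_name = best_available_parser()
print(parser_name)
```

Passing the name explicitly, as in bs4.BeautifulSoup(markup, parser_name), keeps the result from changing silently when a different parser gets installed on the machine.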

Anyway, even if BeautifulSoup uses lxml under the hood, you'll be able to parse broken HTML that you can't parse with lxml.etree directly. For example:

>>> import lxml.etree
>>> lxml.etree.fromstring('<html>')
...
XMLSyntaxError: Premature end of data in tag html line 1, line 1, column 7

However:

>>> import bs4
>>> bs4.BeautifulSoup('<html>', 'lxml')
<html></html>
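The same contrast between a strict XML parser and a lenient HTML parser can be reproduced with the standard library alone, which may help when neither lxml nor BeautifulSoup is installed. This is only an analogy to the two calls above: xml.etree.ElementTree stands in for lxml.etree, Python's built-in html.parser stands in for the lenient side, and TagCollector is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# Malformed input: none of the three elements is ever closed.
broken = "<html><body><p>unclosed paragraph"

# The strict XML parser rejects the input outright.
try:
    ET.fromstring(broken)
    strict_ok = True
except ET.ParseError:
    strict_ok = False

# The lenient HTML parser reports what it can and never raises here.
collector = TagCollector()
collector.feed(broken)
collector.close()

print(strict_ok)       # False
print(collector.tags)  # ['html', 'body', 'p']
```

The built-in parser only emits events and does not build a tree, which is one reason to put BeautifulSoup or lxml on top of it for real work.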

Finally, note that lxml also provides an interface to the old version of BeautifulSoup, as follows:

>>> import lxml.html.soupparser
>>> lxml.html.soupparser.fromstring('<html>')
<Element html at 0x13bd230>

So at the end of the day, you'll probably be using both lxml and BeautifulSoup anyway. The only thing you have to choose is which API you like best.