How do I get the content of the body element using html5lib in Python?

Posted 2024-11-10 22:14:43

How can I get the content of the body element using html5lib in Python?

Example input: xxxyyy

Expected output: xxxyyy

It should work even if the HTML is broken (unclosed tags, ...).

反话 2024-11-17 22:14:43

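html5lib lets you parse a document into any of several standard tree formats. You can use lxml, as I have done below, or follow the instructions in the user documentation to use minidom, ElementTree or BeautifulSoup instead.

import html5lib

with open("mydocument.html", "rb") as f:
    doc = html5lib.parse(f, treebuilder="lxml", namespaceHTMLElements=False)
# The path is relative to the root html element, so "body" is enough.
content = doc.findtext("body", default=None)

Passing namespaceHTMLElements=False keeps the tag names un-namespaced so that the path "body" matches. Note that findtext returns only the text that precedes the first child element of body.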
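Since findtext gives back only the leading text of an element, here is a minimal sketch of one way to collect everything inside body, markup included. It assumes html5lib's etree tree builder and the standard-library xml.etree.ElementTree serializer. The helper name body_inner_html and the sample input are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

import html5lib


def body_inner_html(markup):
    """Return the text and serialized children of <body> as one string."""
    # namespaceHTMLElements=False keeps tag names plain ("body", not "{ns}body").
    root = html5lib.parse(markup, treebuilder="etree", namespaceHTMLElements=False)
    body = root.find("body")
    parts = [body.text or ""]
    for child in body:
        # tostring() also emits the child's tail text, so nothing is lost.
        parts.append(ET.tostring(child, encoding="unicode", method="html"))
    return "".join(parts)


# Hypothetical broken input: neither <b>, <body> nor <html> is closed.
print(body_inner_html("<html><body>xxx<b>yyy"))
```

The parser repairs the unclosed tags before the tree is serialized, so the result is well-formed even when the input is not.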

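Response to comment

It is possible to achieve this without installing any external libraries by using html5lib's own simpletree.py, but judging by the comment at the start of that file, I would guess this is not the recommended way:

# Really crappy basic implementation of a DOM-core like thing

If you still want to do this, however, you can parse the HTML document like so:

import html5lib

with open("mydocument.html", "rb") as f:
    doc = html5lib.parse(f)

and then find the element you're looking for by doing a breadth-first search of the document's child nodes. The nodes are kept in an array named childNodes, and each node has its name stored in the field name.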
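The breadth-first search described above can be sketched as follows. The childNodes and name fields belong to simpletree. As far as I know, later html5lib releases dropped simpletree and default to xml.etree.ElementTree, so this sketch assumes that default and uses element iteration and .tag instead. The function name find_first and the sample markup are hypothetical.

```python
from collections import deque

import html5lib


def find_first(root, tag):
    """Breadth-first search for the first element whose tag equals `tag`."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.tag == tag:
            return node
        # Iterating an Element yields its direct children.
        queue.extend(node)
    return None


# Hypothetical broken input with unclosed <p> and <i> tags.
root = html5lib.parse("<p>unclosed <i>markup", namespaceHTMLElements=False)
body = find_first(root, "body")
print(body.tag, [child.tag for child in body])
```

A queue visits nodes level by level, so the shallowest match is returned first. For a single known tag such as body, root.find("body") is simpler.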
