How do I get the content of the body element using html5lib in Python?
How can I get the content of the <body> element using html5lib in Python?
Example input data: <html><head></head><body>xxx<b>yyy</b></hr></body></html>
Expected output: xxx<b>yyy</b></hr>
It should work even if the HTML is broken (unclosed tags, etc.).
html5lib allows you to parse your documents using a variety of standard tree formats. You can do this using lxml, as I've done below, or you can follow the instructions in the html5lib user documentation to do it with minidom, ElementTree or BeautifulSoup.

Response to comment
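As a minimal sketch of the tree-builder approach, the following uses the ElementTree builder (from the standard library) instead of lxml, so html5lib is the only package to install. The helper name body_content is hypothetical, and the way the inner markup is reassembled is an assumption about what "content" should mean here, not necessarily what the answer originally showed.

```python
import html
import xml.etree.ElementTree as ET

import html5lib


def body_content(markup):
    """Return the inner HTML of <body> as a string (hypothetical helper)."""
    # namespaceHTMLElements=False keeps tag names plain ("body", not "{ns}body").
    root = html5lib.parse(markup, treebuilder="etree", namespaceHTMLElements=False)
    body = root.find("body")
    # Leading text lives in body.text; each child serializes with its tail text.
    parts = [html.escape(body.text or "", quote=False)]
    for child in body:
        parts.append(ET.tostring(child, encoding="unicode", method="html"))
    return "".join(parts)


print(body_content("<html><head></head><body>xxx<b>yyy</b></hr></body></html>"))
```

Because html5lib repairs the markup while parsing, unclosed tags come back closed. A stray end tag such as </hr> is dropped by the HTML5 parsing algorithm, so it should not be expected in the result.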
It is possible to achieve this without installing any external libraries by using html5lib's own simpletree.py, but judging by the comment at the start of that file, I would guess this is not the recommended way.
If you still want to do this, however, you can parse the html document like so:
and then find the element you're looking for by doing a breadth-first search of the child nodes in the document. The nodes are kept in an array named childNodes, and each node has a name stored in its name field.
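The breadth-first search described above can be sketched as follows. The Node class here is a hypothetical stand-in that only mimics the two attributes mentioned (childNodes and name); it is not html5lib's own node class, and find_first is an illustrative name.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in for a parsed tree node (assumption: only name and childNodes matter)."""
    name: str
    childNodes: list = field(default_factory=list)


def find_first(root, name):
    """Breadth-first search for the first node whose name matches."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.name == name:
            return node
        # Children are queued behind nodes already waiting, so the tree
        # is visited level by level.
        queue.extend(node.childNodes)
    return None


# Tiny hand-built tree: document > html > (head, body > b)
body = Node("body", [Node("b")])
document = Node("document", [Node("html", [Node("head"), body])])
found = find_first(document, "body")
print(found.name, [child.name for child in found.childNodes])
```

Once the body node is found, its childNodes hold the content the question asks for.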