Equivalent to InnerHTML when using lxml.html to parse HTML
I'm working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed.
I would like to know the most sensible way in the library to do the equivalent of JavaScript's innerHTML, that is, to retrieve or set the complete contents of a tag. For example, given:
<body>
<h1>A title</h1>
<p>Some text</p>
</body>
The innerHTML of the body is therefore:
<h1>A title</h1>
<p>Some text</p>
I can do it using hacks (converting to string/regexes etc) but I'm assuming that there is a correct way to do this using the library which I am missing due to unfamiliarity. Thanks for any help.
EDIT: Thanks to pobk for showing me the way on this so quickly and effectively. For anyone trying the same, here is what I ended up with:
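# Python 2 (cStringIO and the print statement)
from lxml import html
from cStringIO import StringIO

t = html.parse(StringIO(
"""<body>
<h1>A title</h1>
<p>Some text</p>
Untagged text
<p>
Unclosed p tag
</body>"""))
root = t.getroot()
body = root.body
# Leading text of the body, then each direct child serialized with its subtree and tail.
# iterchildren() is used rather than iterdescendants() so nested elements are not emitted twice.
print (body.text or '') + ''.join([html.tostring(child) for child in body.iterchildren()])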
Note that the lxml.html parser will fix up the unclosed tag, so beware if this is a problem.
Comments (5)
Sorry for bringing this up again, but I've been looking for a solution and yours contains a bug:
Text directly under the root element is ignored. I ended up doing this:
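The commenter's code did not survive on this page, so the following is only a sketch of the idea in Python 3: prepend the element's own `.text` before serializing its children. The helper name `inner_html` and the sample markup are made up for illustration, and the leading text is not escaped here.

```python
from lxml import html


def inner_html(element):
    """Serialize the content of `element`, including any leading text."""
    # element.text holds the text that appears before the first child.
    parts = [element.text or ""]
    for child in element.iterchildren():
        # tostring() includes the child's subtree and its tail text.
        parts.append(html.tostring(child, encoding="unicode"))
    return "".join(parts)


doc = html.document_fromstring(
    "<body>Leading text<h1>Title</h1><p>Some text</p></body>"
)
result = inner_html(doc.body)
print(result)
```

Without the `element.text` part, "Leading text" would be dropped, which is the bug described above.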
You can get the children of an ElementTree node using the getchildren() or iterdescendants() methods of the root node:
This can be shortened as follows:
You can also use .get('href') to read an attribute of a tag and .attrib to get its attributes. Here the tag number is hardcoded, but you can also select it dynamically.
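The code this comment refers to is not shown, so here is a small sketch under assumed sample markup: `.get()` reads one attribute, `.attrib` exposes all of them, and the index into the result list can be hardcoded or replaced by a loop.

```python
from lxml import html

doc = html.document_fromstring(
    '<body><a href="/first" class="nav">One</a><a href="/second">Two</a></body>'
)
links = doc.xpath("//a")

# Hardcoded index: look at the first link only.
first_href = links[0].get("href")
first_attrs = dict(links[0].attrib)

# Dynamic variant: collect the href of every link that has one.
all_hrefs = [a.get("href") for a in links if a.get("href") is not None]

print(first_href, first_attrs, all_hrefs)
```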
Here is a Python 3 version:
Note that this includes escaping of the initial text as recommended by andreymal -- this is needed to avoid tag injection if you're working with sanitized HTML!
I find none of the answers satisfying, and some are even in Python 2. So I am adding a one-liner solution that produces innerHTML-like output and works with Python 3:
The result will be:
What it does: the XPath delivers all child nodes (text, elements, comments). The list comprehension produces a list of the text content of the text nodes and the HTML content of the element nodes. Those are then joined into a single string. If you want to get rid of comments, use *|text() instead of node() in the XPath.