Strange lxml behavior

Posted 2024-08-07 18:55:41

Consider the following snippet:

import lxml.html

html = '<div><br />Hello text</div>'
doc = lxml.html.fromstring(html)
text = doc.xpath('//text()')[0]
print(lxml.html.tostring(text.getparent(), encoding='unicode'))
# prints <br>Hello text

I was expecting to see '<div><br />Hello text</div>', because br can't contain nested text and is "self-closed" (I mean the />). How can I make lxml handle this correctly?

Comments (2)

天荒地未老 2024-08-14 18:55:41

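HTML doesn't have self-closing tags; that is an XML thing. In HTML, br is a void element, so the HTML serializer writes it as <br>. If you parse and serialize the snippet as XML instead: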

import lxml.etree

html = '<div><br />Hello text</div>'
doc = lxml.etree.fromstring(html)
text = doc.xpath('//text()')[0]
print(lxml.etree.tostring(text.getparent(), encoding='unicode'))

prints

<br/>Hello text

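Note that the text is not inside the br tag. lxml has a "tail" concept: text that follows an element's end tag is stored in that element's tail attribute. Here getparent() returns the br element, and "Hello text" is its tail.

>>> br = text.getparent()
>>> print(br.text)
None
>>> print(br.tail)
Hello text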
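
To make the tail idea concrete, here is a minimal sketch using the question's own snippet. It assumes Python 3 and an installed lxml; the variable names br and without_tail are only illustrative. It also shows the with_tail=False option of tostring, which serializes the element without its trailing text.

```python
import lxml.html

doc = lxml.html.fromstring('<div><br />Hello text</div>')

# XPath text results are "smart strings" that remember where they came from.
text = doc.xpath('//text()')[0]
br = text.getparent()

print(br.tag)        # the element that owns the text
print(text.is_tail)  # True when the string is stored as the element's tail
print(br.text)       # text inside the element (br has none)
print(br.tail)       # text that follows the element

# Serialize only the element itself, leaving the tail out.
without_tail = lxml.html.tostring(br, with_tail=False, encoding='unicode')
print(without_tail)
```

So the serializer is not nesting the text inside br; it is just emitting the tail along with the element by default.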

错々过的事 2024-08-14 18:55:41

When you are dealing with valid XHTML, you can use lxml.etree instead of lxml.html.

import lxml.etree

html = '<div><br />Hello text</div>'
doc = lxml.etree.fromstring(html)
text = doc.xpath('//text()')[0]
print(lxml.etree.tostring(text.getparent(), encoding='unicode'))
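
The "valid XHTML" condition matters because lxml.etree uses a strict XML parser. The sketch below (Python 3, lxml installed; the names xhtml, html_only and parsed are illustrative) shows the well-formed snippet parsing fine, while the plain HTML form with an unclosed br is rejected.

```python
import lxml.etree

xhtml = '<div><br />Hello text</div>'    # well-formed XML
html_only = '<div><br>Hello text</div>'  # fine as HTML, not well-formed XML

root = lxml.etree.fromstring(xhtml)
serialized = lxml.etree.tostring(root, encoding='unicode')
print(serialized)

try:
    lxml.etree.fromstring(html_only)
    parsed = True
except lxml.etree.XMLSyntaxError as exc:
    # The XML parser refuses the unclosed <br>.
    parsed = False
    print('not well-formed XML:', exc)
```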

A fun side effect: you can usually use this to convert HTML to XHTML, by parsing with lxml.html and serializing with lxml.etree:

import lxml.etree
import lxml.html

html = '<div><br>Hello text</div>'
doc = lxml.html.fromstring(html)
text = doc.xpath('//text()')[0]
print(lxml.etree.tostring(text.getparent(), encoding='unicode'))

Output: "<br/>Hello text"
