How do I get the full contents of a node using XPath and lxml?

Posted on 2024-10-01 07:42:38

I am using lxml's xpath function to retrieve parts of a webpage. I am trying to get contents of a <font> tag, which includes html tags of its own. If I use

//td[@valign="top"]/p[1]/font[@face="verdana" and @color="#ffffff" and @size="2"]

I get the right amount of nodes, but they are returned as lxml objects (<Element font at 0x101fe5eb0>).

If I use

//td[@valign="top"]/p[1]/font[@face="verdana" and @color="#ffffff" and @size="2"]/text()

I get exactly what I want, except that I don't get any of the HTML code which is contained within the <font> nodes.

If I use

//td[@valign="top"]/p[1]/font[@face="verdana" and @color="#ffffff" and @size="2"]/node()

I get a mixture of text and lxml elements! (e.g. something something <Element a at 0x102ac2140> something)

Is there any way to use a pure XPath query to get the contents of the <font> nodes, or even to force lxml to return a string of the contents from the .xpath() method, rather than an lxml object?

Note that I'm returning a list of many nodes from the XPath query so the solution needs to support that.

Just to clarify: I want to return something something <a href="url">inside</a> something from something like...

<font face="verdana" color="#ffffff" size="2"><a href="url">inside</a> something</font>

Comments (2)

水染的天色ゝ 2024-10-08 07:42:38

I'm not sure I understand -- is this close to what you are looking for?

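import io
import lxml.etree as le

content = '''\
<font face="verdana" color="#ffffff" size="2"><a href="url">inside</a> something</font>
'''
# Parse the sample markup from an in-memory text buffer (io.StringIO replaces Python 2's cStringIO).
doc = le.parse(io.StringIO(content))

# child::* selects only the element children of each matching <font>.
xpath = '//font[@face="verdana" and @color="#ffffff" and @size="2"]/child::*'
x = doc.xpath(xpath)
# tostring() serializes each element together with its tail text.
print([le.tostring(el, encoding="unicode") for el in x])
# ['<a href="url">inside</a> something']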
棒棒糖 2024-10-08 07:42:38

Is there any way to use a pure XPath query to get the contents of the <font> nodes, or even to force lxml to return a string of the contents from the .xpath() method, rather than an lxml object?

Note that I'm returning a list of many nodes from the XPath query so the solution needs to support that.

Just to clarify: I want to return something something <a href="url">inside</a> something from something like...

<font face="verdana" color="#ffffff" size="2"><a href="url">inside</a> something</font>

Short answer: No.

XPath doesn't work on "tags" but on nodes.

The selected nodes are represented as instances of specific objects in the language that is hosting XPath.
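As a rough illustration of that point in lxml specifically, the sketch below inspects what .xpath() hands back for a node() step. The sample markup is an assumption; the behaviour shown (text nodes arriving as str subclasses, elements as element objects) reflects lxml's default settings.

```python
import lxml.etree as le

# Assumed sample markup with text, a child element, and trailing text.
root = le.fromstring('<font>something <a href="url">inside</a> tail</font>')

# node() selects both text nodes and element nodes.
mixed = root.xpath('/font/node()')

# Text results behave like ordinary Python strings; elements do not.
kinds = ["text" if isinstance(n, str) else "element" for n in mixed]
print(kinds)
# ['text', 'element', 'text']
```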

If you need the string representation of a particular node's markup, such objects typically support an outerXml property; check the documentation of the hosting environment (lxml in this case).

As @Robert-Rossney pointed out in his comment: lxml's tostring() method is equivalent to other environments' outerXml property.
