I am using lxml's xpath function to retrieve parts of a web page. I am trying to get the contents of a <font> tag, which includes HTML tags of its own. If I use
//td[@valign="top"]/p[1]/font[@face="verdana" and @color="#ffffff" and @size="2"]
I get the right number of nodes, but they are returned as lxml objects (<Element font at 0x101fe5eb0>).
If I use
//td[@valign="top"]/p[1]/font[@face="verdana" and @color="#ffffff" and @size="2"]/text()
I get exactly what I want, except that I don't get any of the HTML code contained within the <font> nodes.
If I use
//td[@valign="top"]/p[1]/font[@face="verdana" and @color="#ffffff" and @size="2"]/node()
I get a mixture of text and lxml elements! (e.g. something something <Element a at 0x102ac2140> something)
Is there any way to use a pure XPath query to get the contents of the <font> nodes, or even to force lxml to return a string of the contents from the .xpath() method, rather than an lxml object?
Note that I'm returning a list of many nodes from the XPath query so the solution needs to support that.
Just to clarify... I want to return something something <a href="url">inside</a> something
from something like...
<font face="verdana" color="#ffffff" size="2"><a href="url">inside</a> something</font>
I'm not sure I understand -- is this close to what you are looking for?
Short answer: No.
XPath doesn't work on "tags" but on nodes.
The selected nodes are represented as instances of specific objects in the language that is hosting XPath.
In case you need the string representation of a particular node's markup, such objects typically support an outerXML property -- check the documentation of the hosting language (lxml in this case).
As @Robert-Rossney pointed out in his comment: lxml's tostring() method is equivalent to other environments' outerXml property.