XPath text() does not get the text of a link node
from lxml import etree
import requests
htmlparser = etree.HTMLParser()
f = requests.get('https://rss.orf.at/news.xml')
# Without the '\ufeff' prefix this fails with: "ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration."
tree = etree.fromstring('\ufeff'+f.text, htmlparser)
print(tree.xpath('//item/title/text()'))  # <- this does produce a list of titles
print(tree.xpath('//item/link/text()'))  # <- this does NOT produce a list of links, why?
OK, this is a bit of a mystery to me, and maybe I'm just overlooking the simplest thing, but the XPath '//item/link/text()' produces only an empty list, while '//item/title/text()' works exactly as expected. Does the <link> node serve some special purpose? I can select all of them with '//item/link', but I just can't get the text() selector to work on them.
Comments (1)
You're using etree.HTMLParser to parse an XML document. I suspect this was an attempt to deal with XML namespacing, but I think it's probably the wrong solution. Treating the XML document as HTML may ultimately be the source of your problem. If we use the XML parser instead, everything works pretty much as expected.
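One plausible mechanism behind this, offered as an assumption rather than something verified against the live feed: in HTML, link is a void element that cannot have content, so an HTML parser closes it immediately and the URL ends up as text that follows the element (its tail in lxml) instead of text inside it. The sketch below uses a small made-up document (SAMPLE is not the real feed) to show how that could be checked.

```python
from lxml import etree

# Made-up stand-in for the feed; the real document is larger and namespaced.
SAMPLE = (
    b"<rss><item><title>First story</title>"
    b"<link>https://example.org/a</link></item></rss>"
)

# Parse the XML-like input with the HTML parser, as in the question.
tree = etree.fromstring(SAMPLE, etree.HTMLParser())

# <title> keeps its text, so this query returns the title.
titles = tree.xpath("//item/title/text()")

# <link> is treated as an empty (void) element, so it has no text children.
link_texts = tree.xpath("//item/link/text()")

# The URL is attached as the text following each <link> element (its tail).
link_tails = [(link.tail or "").strip() for link in tree.xpath("//item/link")]

print(titles)
print(link_texts)
print(link_tails)
```

If the last list contains the URL while the second one is empty, the HTML parser is the culprit, and switching to the XML parser is the cleaner fix than reading the tail.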
First, if we look at the root element, we see that it sets a default namespace, http://purl.org/rss/1.0/. That means when we see an item element in the document, it's actually an "item in the http://purl.org/rss/1.0/ namespace" element. We need to provide that namespace information in our XPath queries by passing in a namespaces dictionary and using a namespace prefix on the element names. With a prefix such as rss mapped to that namespace URI (the prefix name itself is arbitrary), your first XPath expression, //item/title/text(), becomes //rss:item/rss:title/text(), and your second, //item/link/text(), becomes //rss:item/rss:link/text().
Note that by passing f.content (a byte string) to the parser instead of f.text (a unicode string), we avoid the whole unicode parsing error, so the '\ufeff' workaround is no longer needed.
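The code blocks of this answer did not survive on this page, so here is a minimal sketch of the approach it describes, not the answerer's original code. It assumes a small made-up RSS 1.0 document (SAMPLE_FEED) in place of the live feed, and the rss prefix, the NAMESPACES name and the extract_titles_and_links helper are all illustrative choices.

```python
from lxml import etree

# Made-up stand-in for the feed: the root element declares
# http://purl.org/rss/1.0/ as the default namespace, as the answer describes.
SAMPLE_FEED = b"""<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="https://example.org/a">
    <title>First story</title>
    <link>https://example.org/a</link>
  </item>
  <item rdf:about="https://example.org/b">
    <title>Second story</title>
    <link>https://example.org/b</link>
  </item>
</rdf:RDF>
"""

# Map an arbitrary prefix to the default namespace URI; XPath 1.0 has no
# notion of a default namespace, so the element names need an explicit prefix.
NAMESPACES = {"rss": "http://purl.org/rss/1.0/"}


def extract_titles_and_links(xml_bytes):
    """Return (titles, links) from an RSS 1.0 document given as bytes."""
    # The default parser of etree.fromstring is an XML parser; bytes input
    # may carry an encoding declaration, unlike a unicode string.
    tree = etree.fromstring(xml_bytes)
    titles = tree.xpath("//rss:item/rss:title/text()", namespaces=NAMESPACES)
    links = tree.xpath("//rss:item/rss:link/text()", namespaces=NAMESPACES)
    return titles, links


titles, links = extract_titles_and_links(SAMPLE_FEED)
print(titles)
print(links)
```

Against the real feed, the same helper would presumably be called with the response bytes, for example extract_titles_and_links(requests.get('https://rss.orf.at/news.xml').content), provided the feed still uses that namespace.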