XPath text() does not return the text of link nodes

Posted on 2025-01-17 00:39:45

from lxml import etree
import requests
htmlparser = etree.HTMLParser()
f = requests.get('https://rss.orf.at/news.xml')
# Without the '\ufeff' prefix this fails with: "ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration."
tree = etree.fromstring('\ufeff'+f.text, htmlparser)
print(tree.xpath('//item/title/text()'))  # <- this DOES produce a list of titles
print(tree.xpath('//item/link/text()'))  # <- this does NOT produce a list of links. Why?

Okay, this is a bit of a mystery to me, and maybe I'm just overlooking the simplest thing, but the XPath '//item/link/text()' produces only an empty list, while '//item/title/text()' works exactly as expected. Does the <link> node serve any special purpose? I can select all of them with '//item/link'; I just can't get the text() selector to work on them.

Answer by 神经暖, 2025-01-24 00:39:45

You're using etree.HTMLParser to parse an XML document. I suspect this was an attempt to deal with XML namespacing, but I think it's the wrong solution. Treating the XML document as HTML is the source of your problem: HTML defines <link> as an empty (void) element, so the HTML parser never puts the URL inside it, and text() has nothing to return.
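To see where the URL goes under the HTML parser, here is a minimal offline sketch. It assumes a made-up one-item snippet (`SAMPLE`) rather than the real feed, and the variable names are only illustrative. With lxml's HTMLParser, the text that follows a void element such as <link> should land in that element's `tail` rather than in its own text:

```python
from lxml import etree

# Hypothetical, minimal stand-in for one feed entry (not the real ORF feed).
SAMPLE = (
    b"<item><title>Example title</title>"
    b"<link>https://example.org/story/1</link></item>"
)

# Parse the snippet the way the question does: as HTML.
html_tree = etree.fromstring(SAMPLE, etree.HTMLParser())

# <title> is an ordinary element, so its text is a child text node.
titles = html_tree.xpath('//item/title/text()')

# <link> is closed immediately by the HTML parser, so it has no text children.
link_texts = html_tree.xpath('//item/link/text()')

# The URL is kept as the text that FOLLOWS the <link> element (its tail).
link_tails = [link.tail for link in html_tree.xpath('//item/link')]

print(titles)
print(link_texts)
print(link_tails)
```

Reading `tail` would be a workaround at best; it depends on the HTML parser's error recovery, which is why switching parsers is the cleaner fix.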

If we use the XML parser instead, everything pretty much works as expected.

First, if we look at the root element, we see that it sets a default namespace:

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
  xmlns:orfon="http://rss.orf.at/1.0/"
  xmlns="http://purl.org/rss/1.0/"
>

That means when we see an item element in the document, it's actually an "item in the http://purl.org/rss/1.0/ namespace" element. We need to provide that namespace information in our XPath queries by passing in a namespaces dictionary and using a namespace prefix on the element names, like this:

>>> tree.xpath('//rss:item', namespaces={'rss': 'http://purl.org/rss/1.0/'})
[<Element {http://purl.org/rss/1.0/}item at 0x7f0497000e80>, ...]
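The same lookup can be tried without network access. The sketch below assumes a trimmed-down, invented RSS 1.0 document (`SAMPLE`) that reuses the feed's default namespace; the prefix `rss` and the name `NS` are arbitrary choices, since only the namespace URI has to match the document:

```python
from lxml import etree

# Hypothetical, trimmed-down RSS 1.0 document (not the real ORF feed).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="https://example.org/story/1">
    <title>Example title</title>
    <link>https://example.org/story/1</link>
  </item>
</rdf:RDF>
"""

# The prefix is our own choice; the URI must equal the document's default namespace.
NS = {'rss': 'http://purl.org/rss/1.0/'}

# Default XML parser, bytes input.
tree = etree.fromstring(SAMPLE)

# Unprefixed names in XPath mean "no namespace", so this matches nothing.
unprefixed = tree.xpath('//item')

# Prefixed names are resolved through the namespaces mapping.
prefixed = tree.xpath('//rss:item', namespaces=NS)

# The same element lookup using Clark notation ({uri}name) with findall().
clark = tree.findall('.//{http://purl.org/rss/1.0/}item')

# With the XML parser, the link text is a real child text node.
links = tree.xpath('//rss:item/rss:link/text()', namespaces=NS)

print(len(unprefixed), len(prefixed), len(clark))  # prints: 0 1 1
print(links)
```

Keeping the mapping in one constant avoids repeating the URI in every query.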

Your first XPath expression (//item/title/text()) becomes:

>>> tree.xpath('//rss:item/rss:title/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'})
['Amnesty dokumentiert Kriegsverbrechen', ..., 'Moskauer Börse startet abgeschirmten Handel']

And your second XPath expression (//item/link/text()) becomes:

>>> tree.xpath('//rss:item/rss:link/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'})
['https://orf.at/stories/3255477/', ..., 'https://orf.at/stories/3255384/']

This makes the code look like:

from lxml import etree
import requests
f = requests.get('https://rss.orf.at/news.xml')
tree = etree.fromstring(f.content)
print(tree.xpath('//rss:item/rss:title/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'}))
print(tree.xpath('//rss:item/rss:link/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'}))

Note that by using f.content (which is a byte string) instead of f.text (a unicode string), we avoid the whole unicode parsing error.
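The difference between the two inputs can be checked with a small sketch. It assumes an invented document (`BODY`) standing in for the response body: the bytes play the role of f.content and the decoded string plays the role of f.text. The names `BODY`, `title`, and `str_error` are illustrative only:

```python
from lxml import etree

# Hypothetical stand-in for the response body; note the encoding declaration.
BODY = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<feed><title>B\xc3\xb6rse</title></feed>'
)

# Bytes input (like f.content): the parser reads the declaration
# and decodes the data itself.
root = etree.fromstring(BODY)
title = root.findtext('title')

# Already-decoded str input (like f.text) that still carries the
# encoding declaration is rejected by lxml's XML parser.
try:
    etree.fromstring(BODY.decode('utf-8'))
    str_error = None
except ValueError as exc:
    str_error = exc

print(title)
print(type(str_error).__name__)
```

Passing bytes also means the parser, not requests, decides how to decode the document, which matters when the HTTP headers and the XML declaration disagree.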
