lxml - trouble parsing a stackexchange RSS feed

Posted on 2025-01-08 03:11:18


Hi,

I am having problems parsing an RSS feed from stackexchange in Python.
When I try to get the summary nodes, an empty list is returned.

I have been trying to solve this, but can't get my head around it.

Can anyone help out?
thanks
a


In [30]: import lxml.etree, urllib2

In [31]: url_cooking = 'http://cooking.stackexchange.com/feeds' 

In [32]: cooking_content = urllib2.urlopen(url_cooking)

In [33]: cooking_parsed = lxml.etree.parse(cooking_content)

In [34]: cooking_texts = cooking_parsed.xpath('.//feed/entry/summary')

In [35]: cooking_texts
Out[35]: []


Comments (3)

贱贱哒 2025-01-15 03:11:18


Take a look at these two versions:

import lxml.html, lxml.etree

url_cooking = 'http://cooking.stackexchange.com/feeds'

#lxml.etree version
data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')

#lxml.html version
data = lxml.html.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')

As you discovered, the first (lxml.etree) version returns no nodes, but the lxml.html version works fine. The etree version is not working because the feed uses namespaces, and the html version works because it ignores them. Part way down http://lxml.de/lxmlhtml.html, it says "The HTML parser notably ignores namespaces and some other XMLisms."

Note that when you print the root node of the etree version (print(data.getroot())), you get something like <Element {http://www.w3.org/2005/Atom}feed at 0x22d1620>. That means it is a feed element in the http://www.w3.org/2005/Atom namespace. Here is a corrected version of the etree code.

import lxml.etree

url_cooking = 'http://cooking.stackexchange.com/feeds'

ns = 'http://www.w3.org/2005/Atom'
ns_map = {'ns': ns}

data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('//ns:feed/ns:entry/ns:summary', namespaces=ns_map)
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
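
If you'd rather not hard-code the Atom namespace URI, XPath's local-name() function matches elements by tag name regardless of namespace. A minimal sketch of that alternative (not from the original answer):

import lxml.etree

url_cooking = 'http://cooking.stackexchange.com/feeds'

# local-name() compares only the tag name, so the Atom
# namespace never has to be spelled out in the query.
data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath("//*[local-name()='summary']")
print('Found ' + str(len(summary_nodes)) + ' summary nodes')

The trade-off is precision: local-name() would also match a summary element from any other namespace, so the explicit prefix map above is safer for mixed documents.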
人疚 2025-01-15 03:11:18


The problem is namespaces.

Run this:

 cooking_parsed.getroot().tag

And you'll see that the element is namespaced as

{http://www.w3.org/2005/Atom}feed

Similarly if you navigate to one of the feed entries.

This means the right xpath in lxml is:

print(cooking_parsed.xpath(
  "//a:feed/a:entry",
  namespaces={'a': 'http://www.w3.org/2005/Atom'}))
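
Building on that, the namespace URI can also be pulled out of the root tag at runtime instead of being hard-coded. A minimal sketch (the variable names are mine):

import lxml.etree

url_cooking = 'http://cooking.stackexchange.com/feeds'

doc = lxml.etree.parse(url_cooking)
root = doc.getroot()

# root.tag looks like '{http://www.w3.org/2005/Atom}feed';
# slice out the URI between the braces to build the prefix map.
ns_uri = root.tag[1:root.tag.index('}')]
entries = doc.xpath('//a:entry/a:summary', namespaces={'a': ns_uri})
print('Found ' + str(len(entries)) + ' summary nodes')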
﹉夏雨初晴づ 2025-01-15 03:11:18


Try using BeautifulStoneSoup from the BeautifulSoup package.
It might do the trick.
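
For reference, a minimal sketch of that approach, assuming the legacy BeautifulSoup 3 package (where BeautifulStoneSoup lives) and Python 2's urllib2:

import urllib2
from BeautifulSoup import BeautifulStoneSoup

url_cooking = 'http://cooking.stackexchange.com/feeds'

# BeautifulStoneSoup is BeautifulSoup 3's XML parser; its tag
# matching ignores namespaces, so 'summary' finds the Atom summaries.
soup = BeautifulStoneSoup(urllib2.urlopen(url_cooking).read())
summary_nodes = soup.findAll('summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')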
