Scraping RSS feeds with Python

Posted on 2025-01-04 16:27:58

I'm a newbie to Python and to programming in general, so please excuse me if the question is very dumb.

I've been following this tutorial on RSS scraping step by step, but I get a "list index out of range" error from Python when I try to gather the links that correspond to the titles of the articles being collected.

Here is my code:

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

source  = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/raw_feed').read()

title = re.compile('<title>(.*)</title>')
link = re.compile('<link>(.*)</link>')

find_title = re.findall(title, source)
find_link = re.findall(link, source)

literate = []
literate[:] = range(1, 16)

for i in literate:
    print find_title[i]
    print find_link[i]

It executes fine when I only tell it to retrieve titles, but it immediately throws an index error when I try to retrieve the titles together with their corresponding links.

Any assistance will be greatly appreciated.


Comments (2)

拍不死你 2025-01-11 16:27:58

You could use the feedparser module to parse an RSS feed from a given URL:

#!/usr/bin/env python
import feedparser # pip install feedparser

d = feedparser.parse('http://feeds.huffingtonpost.com/huffingtonpost/latestnews')
# ... handling of HTTP errors and caching is omitted here ...

for e in d.entries:
    print(e.title)
    print(e.link)
    print(e.description)
    print("\n") # 2 newlines

Output

Even Critics Of Safety Net Increasingly Depend On It
http://www.huffingtonpost.com/2012/02/12/safety-net-benefits_n_1271867.html
<p>Ki Gulbranson owns a logo apparel shop, deals in 
<!-- ... snip ... -->

Christopher Cain, Atlanta Anti-Gay Attack Suspect, Arrested And
Charged With Aggravated Assault And Robbery
http://www.huffingtonpost.com/2012/02/12/atlanta-anti-gay-suspect-christopher-cain-arrested_n_1271811.html
<p>ATLANTA -- Atlanta police have arrested a suspect 
<!-- ... snip ... -->

Using regular expressions to parse RSS (XML) is probably not a good idea.
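If installing feedparser is not an option, the standard library's XML parser is one alternative to regular expressions. The following is only a minimal sketch: it assumes the feed is Atom-style (titles and `<link ... href="..." />` elements inside `<entry>`), and `SAMPLE_FEED` and `extract_entries` are hypothetical names with made-up data standing in for the downloaded feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for the downloaded feed text.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <link rel="alternate" type="text/html" href="http://example.com/" />
  <entry>
    <title>First article</title>
    <link rel="alternate" type="text/html" href="http://example.com/first.html" />
  </entry>
  <entry>
    <title>Second article</title>
    <link rel="alternate" type="text/html" href="http://example.com/second.html" />
  </entry>
</feed>
"""

# Prefix-to-namespace mapping used in the search paths below.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def extract_entries(xml_text):
    """Return a list of (title, link) tuples, one per <entry> element."""
    root = ET.fromstring(xml_text)
    results = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        link_el = entry.find("atom:link[@rel='alternate']", ATOM)
        # The URL is stored in the href attribute, not in the element text.
        href = link_el.get("href") if link_el is not None else None
        results.append((title, href))
    return results


for title, href in extract_entries(SAMPLE_FEED):
    print(title)
    print(href)
```

Because each title and link is read from the same `<entry>` element, the two can never get out of step the way two independent lists of regex matches can.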

青萝楚歌 2025-01-11 16:27:58

I think you are using the wrong regex to extract the links from the page.

>>> link = re.compile('<link rel="alternate" type="text/html" href=(.*)')
>>> find_link = re.findall(link, source)
>>> find_link[1].strip()
'"http://www.huffingtonpost.com/andrew-brandt/the-peyton-predicament-pa_b_1271834.html" />'
>>> len(find_link)
15
>>>

If you look at the source of the page, you will find that the links are not enclosed in a
<link></link> pair.

The actual pattern is <link rel="alternate" type="text/html" href= followed by the link itself.

That's the reason why your regex is not working.
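To make the failure easier to see, here is a small sketch run against a made-up sample rather than the real feed (`SAMPLE` is hypothetical, and it assumes the feed looks like the pattern described above). The original `<link>(.*)</link>` regex finds nothing, so any index into its result raises "list index out of range". Pairing the two lists with `zip` instead of indexing through `range(1, 16)` also avoids skipping index 0 and running past the end.

```python
import re

# Hypothetical sample data; the real feed is assumed to have the same shape.
SAMPLE = """<feed>
<title>Example Feed</title>
<link rel="alternate" type="text/html" href="http://example.com/" />
<entry>
<title>First article</title>
<link rel="alternate" type="text/html" href="http://example.com/first.html" />
</entry>
<entry>
<title>Second article</title>
<link rel="alternate" type="text/html" href="http://example.com/second.html" />
</entry>
</feed>
"""

titles = re.findall(r'<title>(.*?)</title>', SAMPLE)

# The regex from the question: no <link>...</link> pairs exist, so this is empty.
old_links = re.findall(r'<link>(.*)</link>', SAMPLE)

# Capture only the URL inside the href attribute's quotes.
links = re.findall(r'<link rel="alternate" type="text/html" href="([^"]*)"', SAMPLE)

# zip stops at the shorter list, so it cannot raise IndexError.
pairs = list(zip(titles, links))

print(len(titles), len(old_links), len(links))
for title, link in pairs:
    print(title)
    print(link)
```

This still relies on titles and links appearing in the same order and in equal numbers, which is an assumption about the feed rather than something the regexes guarantee.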
