I'm a newbie to Python and programming in general, so please excuse me if the question is very dumb.
I've been following this tutorial on RSS scraping step by step, but I am getting a "list index out of range" error from Python when trying to gather the links that correspond to the titles of the articles being collected.
Here is my code:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re
source = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/raw_feed').read()
title = re.compile('<title>(.*)</title>')
link = re.compile('<link>(.*)</link>')
find_title = re.findall(title, source)
find_link = re.findall(link, source)
literate = []
literate[:] = range(1, 16)
for i in literate:
    print find_title[i]
    print find_link[i]
It executes fine when I only tell it to retrieve titles, but immediately throws an index error when I would like to retrieve titles and their corresponding links.
Any assistance will be greatly appreciated.
You could use the feedparser module to parse an RSS feed from a given URL.
It might not be a good idea to use regular expressions to parse RSS (XML).
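As a rough illustration of the parser-based approach, here is a minimal sketch that uses the standard library's xml.etree.ElementTree rather than feedparser (a swapped-in technique, chosen so the example runs without third-party packages). The sample feed, its URLs, and the helper name extract_items are made up for illustration. The sketch assumes an RSS 2.0 layout with <item> elements, and a real feed would have to be downloaded first. It is written for Python 3, unlike the Python 2 code in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed (assumption): a tiny RSS 2.0 document standing in
# for the text that would normally be downloaded from the feed URL.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First article</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second article</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>
"""


def extract_items(rss_text):
    """Return a list of (title, link) pairs, one per <item> in the feed."""
    root = ET.fromstring(rss_text)
    pairs = []
    for item in root.iter("item"):
        # findtext returns None when the child element is missing.
        title = item.findtext("title")
        link = item.findtext("link")
        pairs.append((title, link))
    return pairs


if __name__ == "__main__":
    for title, link in extract_items(SAMPLE_RSS):
        print(title)
        print(link)
```

Because each title and link is read from the same <item> element, the two values stay paired, and the channel-level <title> and <link> are not mixed in with the articles.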
I think you are using the wrong regex for extracting links from your page.
If you take a look at the HTML source of your page, you will find that the links are not enclosed in a <link></link> pattern.
Actually, the pattern is <link rel="alternate" type="text/html" href= links here.
That's the reason why your regex is not working.
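To make this concrete, here is a small sketch run against an inline sample written in the link style described above. The sample entries and URLs are invented for illustration, and the real feed may differ in detail. On such input the original link pattern finds nothing, so indexing into its result fails, while a pattern that captures the href attribute finds one link per entry. It is written for Python 3.

```python
import re

# Hypothetical sample (assumption): links are stored in an href attribute,
# not as text between <link> and </link>.
SAMPLE_FEED = """
<entry>
<title>First article</title>
<link rel="alternate" type="text/html" href="http://example.com/first" />
</entry>
<entry>
<title>Second article</title>
<link rel="alternate" type="text/html" href="http://example.com/second" />
</entry>
"""

title_re = re.compile(r'<title>(.*?)</title>')
# The pattern from the question: expects <link>...</link>.
old_link_re = re.compile(r'<link>(.*)</link>')
# Alternative pattern: capture the value of the href attribute instead.
href_link_re = re.compile(r'<link\b[^>]*\bhref="([^"]*)"')

titles = title_re.findall(SAMPLE_FEED)
old_links = old_link_re.findall(SAMPLE_FEED)
new_links = href_link_re.findall(SAMPLE_FEED)

# Compare how many matches each pattern produced.
print(len(titles), len(old_links), len(new_links))

# zip stops at the shorter list, so it cannot index out of range.
for title, link in zip(titles, new_links):
    print(title)
    print(link)
```

Iterating with zip instead of a fixed range(1, 16) also avoids assuming how many entries the feed contains.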