Feedparser - retrieving old entries from Google Reader

Posted 2024-08-10 21:26:53

I'm using the feedparser library in Python to retrieve news from a local newspaper (my intent is to do Natural Language Processing over this corpus) and would like to be able to retrieve many past entries from the RSS feed.

I'm not very acquainted with the technical details of RSS, but I think this should be possible (I can see that, e.g., Google Reader and Feedly can do this "on demand" as I move the scrollbar).

When I do the following:

import feedparser

url = 'http://feeds.folha.uol.com.br/folha/emcimadahora/rss091.xml'
feed = feedparser.parse(url)
for post in feed.entries:
    title = post.title  # each entry exposes its title (plus link, date, etc.)

I get only a dozen entries or so, but I was hoping for hundreds, ideally all the entries from the last month. Is this possible with feedparser alone?

I intend to get only the links to the news items from the RSS feed and parse each full page with BeautifulSoup to obtain the text I want. An alternative solution would be a crawler that follows all the local links on a page to collect many news items, but I want to avoid that for now.
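
A minimal sketch of that link-then-scrape step, assuming the requests library for fetching pages and grabbing all <p> text as a stand-in for a site-specific selector (the real article markup would need to be inspected):

import feedparser
import requests
from bs4 import BeautifulSoup

url = 'http://feeds.folha.uol.com.br/folha/emcimadahora/rss091.xml'
feed = feedparser.parse(url)

for post in feed.entries:
    # Fetch the full article page that the feed entry points to.
    page = requests.get(post.link)
    soup = BeautifulSoup(page.text, 'html.parser')
    # Crude extraction: join the text of every <p>; a real scraper would
    # target the article's container element instead.
    text = ' '.join(p.get_text() for p in soup.find_all('p'))
    print(post.title, len(text))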

--

One solution that has come up is to use the Google Reader RSS cache:

http://www.google.com/reader/atom/feed/http://feeds.folha.uol.com.br/folha/emcimadahora/rss091.xml?n=1000

But to access this I must be logged in to Google Reader. Does anyone know how I can do that from Python? (I really don't know a thing about the web; I usually only mess with numerical computing.)
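
For the record, scripts at the time authenticated with Google's ClientLogin protocol; both ClientLogin and Google Reader have long since been shut down, so the sketch below is illustrative only. The flow was: POST your credentials with service=reader, pull the Auth token out of the response, and send it in an Authorization header when requesting the cached feed.

import urllib.parse
import urllib.request

LOGIN_URL = 'https://www.google.com/accounts/ClientLogin'
FEED_URL = ('http://www.google.com/reader/atom/feed/'
            'http://feeds.folha.uol.com.br/folha/emcimadahora/rss091.xml?n=1000')

def fetch_reader_cache(email, password):
    # Step 1: exchange credentials for an Auth token (ClientLogin protocol).
    body = urllib.parse.urlencode({
        'accountType': 'GOOGLE',
        'Email': email,
        'Passwd': password,
        'service': 'reader',   # Reader's service name in ClientLogin
    }).encode()
    response = urllib.request.urlopen(LOGIN_URL, body).read().decode()
    tokens = dict(line.split('=', 1) for line in response.splitlines())

    # Step 2: request the cached feed, passing the token in the header.
    request = urllib.request.Request(FEED_URL)
    request.add_header('Authorization', 'GoogleLogin auth=' + tokens['Auth'])
    return urllib.request.urlopen(request).read()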

2 Answers

枕头说它不想醒 2024-08-17 21:26:53

You're only getting a dozen entries or so because that's what the feed contains. If you want historic data you will have to find a feed/database of said data.

Check out this ReadWriteWeb article for some resources on finding open data on the web.

Note that, contrary to what your title suggests, feedparser has nothing to do with this. Feedparser parses what you give it; it can't find historic data unless you find that data and pass it in. It is simply a parser. Hope that clears things up! :)

就是爱搞怪 2024-08-17 21:26:53

To expand on Bartek's answer: You could also start storing all of the entries in the feed that you've already seen, and build up your own historical archive of the feed's content. This would delay your ability to start using it as a corpus (because you'd have to do this for a month to build up a collection of a month's worth of entries), but you wouldn't be dependent on anyone else for the data.

I may be mistaken, but I'm pretty sure that's how Google Reader can go back in time: They have each feed's past entries stored somewhere.
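
A hedged sketch of that archiving idea, assuming a flat JSON file as the store (the file name and record structure below are placeholders) and a scheduled run, via cron say, to poll the feed:

import json
import os
import feedparser

ARCHIVE = 'feed_archive.json'   # placeholder path for the local store
URL = 'http://feeds.folha.uol.com.br/folha/emcimadahora/rss091.xml'

def update_archive():
    # Load whatever has been archived so far.
    seen = {}
    if os.path.exists(ARCHIVE):
        with open(ARCHIVE) as f:
            seen = json.load(f)

    # Record any entries not seen before, keyed on the entry id
    # (falling back to the link) so repeated runs don't store duplicates.
    for post in feedparser.parse(URL).entries:
        key = post.get('id', post.link)
        if key not in seen:
            seen[key] = {
                'title': post.title,
                'link': post.link,
                'published': post.get('published', ''),
            }

    with open(ARCHIVE, 'w') as f:
        json.dump(seen, f)

if __name__ == '__main__':
    update_archive()

Run it more often than the feed turns over (it only exposes a dozen entries at a time) and the archive accumulates the month of history the asker wants.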
