Feedparser - retrieving old entries from Google Reader

Posted 2024-08-10 21:26:53

I'm using the feedparser library in Python to retrieve news from a local newspaper (my intent is to do Natural Language Processing over this corpus) and would like to be able to retrieve many past entries from the RSS feed.

I'm not very acquainted with the technical details of RSS, but I think this should be possible (I can see that, e.g., Google Reader and Feedly can do this "on demand" as I move the scrollbar).

When I do the following:

import feedparser

url = 'http://feeds.folha.uol.com.br/folha/emcimadahora/rss091.xml'
feed = feedparser.parse(url)
for post in feed.entries:
    title = post.title  # each entry exposes its title (plus link, date, etc.)

I get only a dozen entries or so, but I was hoping for hundreds, ideally all the entries from the last month. Is this possible with feedparser alone?

I intend to get only the links to the news items from the RSS feed and parse each full page with BeautifulSoup to obtain the text I want. An alternative solution would be a crawler that follows all the local links on a page to collect many news items, but I want to avoid that for now.
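
A minimal sketch of that link-then-scrape step, assuming the requests library for fetching pages and grabbing all <p> text as a stand-in for a site-specific selector (the real article markup would need to be inspected):

import feedparser
import requests
from bs4 import BeautifulSoup

url = 'http://feeds.folha.uol.com.br/folha/emcimadahora/rss091.xml'
feed = feedparser.parse(url)

for post in feed.entries:
    # Fetch the full article page that the feed entry points to.
    page = requests.get(post.link)
    soup = BeautifulSoup(page.text, 'html.parser')
    # Crude extraction: join the text of every <p>; a real scraper would
    # target the article's container element instead.
    text = ' '.join(p.get_text() for p in soup.find_all('p'))
    print(post.title, len(text))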

--

One solution that has come up is to use the Google Reader RSS cache:

http://www.google.com/reader/atom/feed/http://feeds.folha.uol.com.br/folha/emcimadahora/rss091.xml?n=1000

But to access this I must be logged in to Google Reader. Does anyone know how I can do that from Python? (I really don't know a thing about the web; I usually only mess with numerical computing.)
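
For the record, scripts at the time authenticated with Google's ClientLogin protocol; both ClientLogin and Google Reader have long since been shut down, so the sketch below is illustrative only. The flow was: POST your credentials with service=reader, pull the Auth token out of the response, and send it in an Authorization header when requesting the cached feed.

import urllib.parse
import urllib.request

LOGIN_URL = 'https://www.google.com/accounts/ClientLogin'
FEED_URL = ('http://www.google.com/reader/atom/feed/'
            'http://feeds.folha.uol.com.br/folha/emcimadahora/rss091.xml?n=1000')

def fetch_reader_cache(email, password):
    # Step 1: exchange credentials for an Auth token (ClientLogin protocol).
    body = urllib.parse.urlencode({
        'accountType': 'GOOGLE',
        'Email': email,
        'Passwd': password,
        'service': 'reader',   # Reader's service name in ClientLogin
    }).encode()
    response = urllib.request.urlopen(LOGIN_URL, body).read().decode()
    tokens = dict(line.split('=', 1) for line in response.splitlines())

    # Step 2: request the cached feed, passing the token in the header.
    request = urllib.request.Request(FEED_URL)
    request.add_header('Authorization', 'GoogleLogin auth=' + tokens['Auth'])
    return urllib.request.urlopen(request).read()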

2 Answers

枕头说它不想醒 2024-08-17 21:26:53

You're only getting a dozen entries or so because that's what the feed contains. If you want historic data you will have to find a feed/database of said data.

Check out this ReadWriteWeb article for some resources on finding open data on the web.

Note that, contrary to what your title suggests, feedparser has nothing to do with this. Feedparser parses what you give it; it can't find historic data unless you find that data and pass it in. It is simply a parser. Hope that clears things up! :)

就是爱搞怪 2024-08-17 21:26:53

To expand on Bartek's answer: You could also start storing all of the entries in the feed that you've already seen, and build up your own historical archive of the feed's content. This would delay your ability to start using it as a corpus (because you'd have to do this for a month to build up a collection of a month's worth of entries), but you wouldn't be dependent on anyone else for the data.

I may be mistaken, but I'm pretty sure that's how Google Reader can go back in time: They have each feed's past entries stored somewhere.
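
A hedged sketch of that archiving idea, assuming a flat JSON file as the store (the file name and record structure below are placeholders) and a scheduled run, via cron say, to poll the feed:

import json
import os
import feedparser

ARCHIVE = 'feed_archive.json'   # placeholder path for the local store
URL = 'http://feeds.folha.uol.com.br/folha/emcimadahora/rss091.xml'

def update_archive():
    # Load whatever has been archived so far.
    seen = {}
    if os.path.exists(ARCHIVE):
        with open(ARCHIVE) as f:
            seen = json.load(f)

    # Record any entries not seen before, keyed on the entry id
    # (falling back to the link) so repeated runs don't store duplicates.
    for post in feedparser.parse(URL).entries:
        key = post.get('id', post.link)
        if key not in seen:
            seen[key] = {
                'title': post.title,
                'link': post.link,
                'published': post.get('published', ''),
            }

    with open(ARCHIVE, 'w') as f:
        json.dump(seen, f)

if __name__ == '__main__':
    update_archive()

Run it more often than the feed turns over (it only exposes a dozen entries at a time) and the archive accumulates the month of history the asker wants.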
