As an exercise in RSS, I would like to be able to search through pretty much all Unix discussions in this group:
comp.unix.shell
I know enough Python and understand basic RSS, but I am stuck: how do I grab all messages between particular dates, or at least all messages between the Nth most recent and the Mth most recent?
High-level descriptions or pseudo-code are welcome.
Thank you!
EDIT:
I would like to be able to go back more than 100 messages, but without scraping 10 messages at a time through URLs such as:
http://groups.google.com/group/comp.unix.shell/topics?hl=en&start=2000&sa=N
There must be a better way.
Crawling Google Groups violates Google's Terms of Service, specifically the phrase:
Are you sure you want to announce so openly that you're doing that? And are you blind to the consequences?
For the N most recent, it seems you could pass a parameter such as
?num=50
in the feed URL. For example, the 50 newest messages from the comp.unix.shell group:
http://groups.google.com/group/comp.unix.shell/feed/atom_v1_0_msgs.xml?num=50
Then pick up a feed-parsing library such as Universal Feed Parser. Each entry parsed by feedparser has an
.updated_parsed
attribute; you could use that to check whether a message falls within a particular date range.
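A minimal sketch of that approach, assuming the feed URL above (the helper name `entries_between` is mine, not part of feedparser):

```python
import time

def entries_between(entries, start, end):
    """Return entries whose updated_parsed time falls within [start, end].

    feedparser exposes updated_parsed as a time.struct_time, so we
    compare via time.mktime.  (Helper name is hypothetical.)
    """
    start_ts, end_ts = time.mktime(start), time.mktime(end)
    return [e for e in entries
            if e.get("updated_parsed")
            and start_ts <= time.mktime(e["updated_parsed"]) <= end_ts]

# Usage with the real feed (needs the third-party feedparser package
# and network access, so shown as a comment):
#   import feedparser
#   feed = feedparser.parse("http://groups.google.com/group/comp.unix.shell"
#                           "/feed/atom_v1_0_msgs.xml?num=50")
#   in_2010 = entries_between(feed.entries,
#                             time.strptime("2010-01-01", "%Y-%m-%d"),
#                             time.strptime("2010-12-31", "%Y-%m-%d"))
```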
As Randal mentioned, this violates Google's ToS -- however, as a hypothetical, or for use on another site without these restrictions, you could pretty easily rig something up with urllib and BeautifulSoup. Use urllib to open the page and then use BeautifulSoup to grab all the thread topics (and links, if you want to crawl deeper). You can then programmatically find the link to the next page of results and make another urllib request to go to page 2 -- then repeat the process.
At this point you should have all the raw data, then it is just a matter of manipulating the data and implementing your searching functionality.
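A rough sketch of that crawl loop. The helper name is hypothetical, and the 10-per-page step is an assumption based on the `start=` parameter in the topics URL from the question:

```python
def topic_page_urls(group, pages, per_page=10):
    """Build the paginated topic-list URLs: Google Groups pages through
    results with a ?start= offset.  (Helper name and page size are
    assumptions based on the URL in the question.)"""
    base = "http://groups.google.com/group/%s/topics?hl=en" % group
    return ["%s&start=%d" % (base, n * per_page) for n in range(pages)]

# The crawl itself would then be (needs network access and the
# third-party BeautifulSoup package, so shown as a comment):
#   import urllib.request
#   from bs4 import BeautifulSoup
#   for url in topic_page_urls("comp.unix.shell", pages=5):
#       html = urllib.request.urlopen(url).read()
#       soup = BeautifulSoup(html, "html.parser")
#       for a in soup.find_all("a"):   # thread topics and links
#           print(a.get_text(), a.get("href"))
```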
Have you thought about Yahoo's YQL? It's not too bad and can access a lot of APIs. http://developer.yahoo.com/yql/
I don't know if Groups is supported, but you can access RSS feeds. Could be helpful.
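As a sketch, YQL's public REST endpoint lets you run a query like `select title, link from feed where url="..."` against the group's Atom feed from Python. The helper name is mine, and this assumes the YQL service and its `feed` table are still reachable:

```python
import urllib.parse

def yql_feed_query_url(feed_url):
    """Build a YQL REST call that selects items out of a feed.
    (Helper name is hypothetical; endpoint and `feed` table are from
    Yahoo's YQL documentation.)"""
    query = 'select title, link from feed where url="%s"' % feed_url
    return ("https://query.yahooapis.com/v1/public/yql?"
            + urllib.parse.urlencode({"q": query, "format": "json"}))

# Fetching would then be (network access required):
#   import json, urllib.request
#   url = yql_feed_query_url("http://groups.google.com/group/comp.unix.shell"
#                            "/feed/atom_v1_0_msgs.xml?num=50")
#   data = json.load(urllib.request.urlopen(url))
```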