How can I collect tweets faster with the Twitter API in Python?
For a research project, I am collecting tweets using Python-Twitter. However, after running our program nonstop on a single computer for a week, we manage to collect only about 20 MB of data. I am only running this program on one machine so that we do not collect the same tweets twice.
Our program runs a loop that calls getPublicTimeline() every 60 seconds. I tried to improve this by calling getUserTimeline() on some of the users that appeared in the public timeline. However, this consistently got me banned from collecting tweets at all for about half an hour each time. Even without the ban, it seemed that there was very little speed-up by adding this code.
I know about Twitter's "whitelisting" that allows a user to submit more requests per hour. I applied for this about three weeks ago and have not heard back since, so I am looking for alternatives that will allow our program to collect tweets more efficiently without going over the standard rate limit. Does anyone know of a faster way to collect public tweets from Twitter? We'd like to get about 100 MB per week.
Thanks.
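For reference, a minimal sketch of the polling loop described above. It assumes python-twitter's `Api.GetPublicTimeline()` (method names vary between versions), and `store` is a hypothetical callback for whatever persistence you use. Deduplicating by status ID keeps repeat polls from re-collecting the same tweets:

```python
import time

def filter_new(statuses, seen_ids):
    """Return only the statuses whose IDs have not been collected yet,
    and record every ID so the next poll skips repeats."""
    fresh = [s for s in statuses if s.id not in seen_ids]
    seen_ids.update(s.id for s in statuses)
    return fresh

def poll_forever(api, store, interval=60):
    """Call GetPublicTimeline() once per interval and store only new tweets.
    `api` is a python-twitter Api instance; `store` is your own callback."""
    seen = set()
    while True:
        for status in filter_new(api.GetPublicTimeline(), seen):
            store(status)
        time.sleep(interval)
```

Note that `GetPublicTimeline()` returns at most 20 statuses per call regardless of how often you poll, which is why this approach tops out so quickly.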
Comments (3)
How about using the streaming API? This is exactly the use-case it was created to address. With the streaming API you will not have any problems gathering megabytes of tweets. You still won't be able to access all tweets or even a statistically significant sample without being granted access by Twitter though.
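To illustrate: the streaming API of that era delivered one JSON object per line over a long-lived HTTP connection. A rough sketch of a consumer follows; the endpoint URL and basic-auth scheme are assumptions based on the historical `statuses/sample` stream, so check the current documentation before relying on them:

```python
import json

def parse_stream_line(line):
    """The streaming API delivers one JSON object per line; blank lines
    are keep-alives. Return (id, text) for a tweet, or None to skip."""
    line = line.strip()
    if not line:
        return None
    status = json.loads(line)
    if "text" not in status:  # deletion notices etc. carry no text field
        return None
    return status["id"], status["text"]

def consume_sample_stream(username, password, handle):
    """Connect to the (historical) statuses/sample endpoint and feed each
    parsed tweet to `handle`. Written for Python 2, matching python-twitter
    of the time; URL and auth details are assumptions."""
    import base64
    import urllib2
    req = urllib2.Request("https://stream.twitter.com/1/statuses/sample.json")
    token = base64.b64encode("%s:%s" % (username, password))
    req.add_header("Authorization", "Basic " + token)
    for line in urllib2.urlopen(req):
        parsed = parse_stream_line(line)
        if parsed is not None:
            handle(*parsed)
```

Because the connection stays open and tweets are pushed to you, there is no per-request rate limit to trip over, which is what makes this a better fit for bulk collection than polling.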
I did a similar project analyzing data from tweets. If you're just going at this from a pure data collection/analysis angle, you can just scrape any of the better sites that collect these tweets for various reasons. Many sites allow you to search by hashtag, so throw in a popular enough hashtag and you've got thousands of results. I just scraped a few of these sites for popular hashtags, collected those into a large list, queried that list against the site, and scraped all of the usable information from the results. Some sites also allow you to export the data directly, making this task even easier. You'll get a lot of garbage results that you'll probably need to filter (spam, foreign-language tweets, etc.), but this was the quickest way that worked for our project. Twitter will probably not grant you whitelisted status, so I definitely wouldn't count on that.
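A small sketch of the hashtag feedback loop this answer describes: pull hashtags out of collected tweet text, then build search URLs for whichever aggregator site you scrape. The `base_url` and query format here are stand-ins, not any real site's API:

```python
import re

HASHTAG = re.compile(r"#(\w+)")

def extract_hashtags(text):
    """Pull hashtags out of a tweet's text so popular tags can be
    fed back into further searches."""
    return HASHTAG.findall(text)

def search_pages(base_url, tags):
    """Yield one search-results URL per tag. `base_url` and the query
    format are hypothetical; adapt them to the site you scrape."""
    for tag in tags:
        yield "%s/search?q=%%23%s" % (base_url, tag)
```

From there, fetching each page and parsing out tweet text (e.g. with an HTML parser) gives you the raw list to filter for spam and off-topic results.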
There is a pretty good tutorial from Ars Technica on using the streaming API in Python that might be helpful here.

Otherwise, you could try doing it via cURL.
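The cURL approach can also be driven from Python with `subprocess`, keeping the collection logic in one script. The endpoint URL and the `-u` basic-auth usage below are assumptions from the streaming API of that era; verify them against current docs:

```python
import subprocess

def curl_stream_argv(username, password,
                     url="https://stream.twitter.com/1/statuses/sample.json"):
    """Build the cURL command line: -u supplies basic auth, -s silences
    the progress meter, -N disables output buffering so lines arrive
    as the stream produces them."""
    return ["curl", "-sN", "-u", "%s:%s" % (username, password), url]

def run_stream(username, password, handle_line):
    """Spawn cURL and feed each line of the stream to handle_line."""
    proc = subprocess.Popen(curl_stream_argv(username, password),
                            stdout=subprocess.PIPE)
    for line in proc.stdout:
        handle_line(line)
```

Each line handed to `handle_line` is one JSON-encoded status, so the same per-line parsing used with any other streaming client applies here.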