How to efficiently retrieve large datasets with MongoMapper?
I am storing a large amount of Twitter data and would like to retrieve about 500k records at a time for processing. I have a TwitterTweet Mongo document that contains basic tweet data, and I try to retrieve it as follows:
weekly_tweets = TwitterTweet.all(:created_at.gt => 1.week.ago, :fields => [:created_at, :text, :from_user])
Trouble is, this takes up a LOT of time and memory. Is there any way to make it more scalable and efficient? I have thought of using map-reduce, but it looks overly complicated for what I want to do: text processing and regexp work on the tweets.
Comments (1)
Do not call all: that materializes objects for all 500k of your entries from Mongo at once and, as you noticed, uses a ton of memory and time. Use find_each instead and iterate through the results. find returns a cursor, which is far more efficient.
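A minimal sketch of the streaming pattern, in plain Ruby so it runs without a MongoDB server. The commented-out MongoMapper call is an assumption based on the question's class and field names; below it, each_slice stands in for the server-side cursor, handing the processing block one batch of records at a time instead of the whole set:

```ruby
# With MongoMapper the real query would look roughly like (assumed API,
# class/field names taken from the question):
#
#   TwitterTweet.where(:created_at.gt => 1.week.ago)
#               .fields(:created_at, :text, :from_user)
#               .find_each { |tweet| ... }   # cursor-backed, batch by batch
#
# The same streaming shape in plain Ruby: only one batch of records is
# "live" in the processing loop at any moment.
def process_in_batches(records, batch_size: 100)
  mentions = 0
  records.each_slice(batch_size) do |batch|  # stand-in for cursor batches
    batch.each do |tweet|
      mentions += 1 if tweet[:text] =~ /@\w+/  # per-tweet regexp work
    end
  end
  mentions
end

sample = Array.new(500) { |i| { text: "tweet #{i} @someone" } }
process_in_batches(sample, batch_size: 50)
# => 500
```

The key design point is that the regexp work happens per record inside the loop, so peak memory is bounded by the batch size rather than the 500k result set.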