Shuffling data from MongoDB with PyMongo
I have a MongoDB database with 1 million entries/rows, which is approximately 20 GB of data. I'd like to iterate through the data randomly in batches (using Python and PyMongo), with, say, 10 batches of 100K. If I had a small amount of data that fit in memory, I would simply load all of it, shuffle it randomly, and split it into 10 batches. But in this case I cannot fit it all into memory, so that option is out. How can I accomplish this task without loading everything into memory?
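For reference, here is a minimal sketch of the in-memory approach I mean, which only works when the whole collection fits in RAM; the database and collection names (`mydb`, `mycollection`) are placeholders:

```python
import random
from pymongo import MongoClient

# Hypothetical connection details; adjust to your deployment.
coll = MongoClient()["mydb"]["mycollection"]

docs = list(coll.find())   # load every document into memory
random.shuffle(docs)       # shuffle the documents in place

# Split the shuffled list into 10 batches of equal size.
batch_size = len(docs) // 10
batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
```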
One idea I had was to add a counter field to my MongoDB collection called "count", which labels each entry as 1, 2, 3, …, 100K. I would then use a Python algorithm to randomize those numbers, and extract each batch with a simple filter. Does this seem reasonable? It seems pretty slow to me because of all the filters, and it doesn't look like it scales efficiently.
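To make the idea concrete, here is a rough sketch of that counter-field approach. The collection name and the one-time numbering pass are assumptions on my part, not something I have tested against the real data:

```python
import random
from pymongo import MongoClient

coll = MongoClient()["mydb"]["mycollection"]

# One-time pass: stamp every document with an incrementing "count" field.
for i, doc in enumerate(coll.find({}, {"_id": 1})):
    coll.update_one({"_id": doc["_id"]}, {"$set": {"count": i}})
coll.create_index("count")  # so the batch filters below don't scan the whole collection

# Shuffle only the counter values in memory (plain integers, not documents).
n_docs = coll.count_documents({})
indices = list(range(n_docs))
random.shuffle(indices)

# Pull the data back in random batches of 100K via a filter on "count".
batch_size = 100_000
for start in range(0, n_docs, batch_size):
    batch_ids = indices[start:start + batch_size]
    for doc in coll.find({"count": {"$in": batch_ids}}):
        pass  # process each document in the batch
```

Each batch still issues a separate `$in` query with 100K values, which is exactly the slowness/scaling concern I mentioned above.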
This seems like a pretty standard problem. Does anyone have a better solution than mine?