Server-side set intersection in MongoDB

Posted on 2024-10-14 21:52:28

In an application I am working on, a requirement is to do massive set intersection, to the tune of 10-1,000,000 items or so. The items that we are intersecting are simply ObjectIds.

So for instance there is a boxes document, and inside the boxes document there is an item_ids array. This item_ids array holds 10-1,000,000 ObjectIds for each box.

The end goal here is to say, given box A with ObjectId 4d3dc3898951498107000005, and box B with ObjectId 4d3dc3898951498107000002, which item_ids do they have in common?

Here is how I'm doing it:

db.boxes.distinct("item_ids", {'_id' : {$in : [ObjectId("4d3dc3898951498107000005"), ObjectId("4d3dc3898951498107000002")]}})

Firstly, I'm just curious whether this seems like a sane approach. In my research so far, map reduce seems to be a common suggestion for large intersections, but it is not recommended for realtime queries.
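One thing worth checking before judging the approach: distinct collects every distinct value of item_ids across all matching documents, so with both boxes matched by $in it appears to return the union of the two arrays rather than their intersection. The plain JavaScript sketch below illustrates the difference on made-up data. The helper name intersectIds and the sample ids are hypothetical, and it assumes the ObjectIds have already been converted to hex strings, since a Set compares objects by reference.

// Hypothetical helper, not part of MongoDB: intersect two arrays of
// ObjectId hex strings.
function intersectIds(a, b) {
  // Build the lookup set from the smaller array to keep memory down.
  const [small, large] = a.length <= b.length ? [a, b] : [b, a];
  const seen = new Set(small);
  const out = new Set();
  for (const id of large) {
    if (seen.has(id)) out.add(id); // keep only ids present in both arrays
  }
  return [...out];
}

// Made-up sample data standing in for the item_ids of two boxes.
const boxA = [
  "4d3dc389000000000000000a",
  "4d3dc389000000000000000b",
  "4d3dc389000000000000000c",
];
const boxB = [
  "4d3dc389000000000000000b",
  "4d3dc389000000000000000c",
  "4d3dc389000000000000000d",
];

// What a distinct over both boxes amounts to: the union of the two arrays.
const union = [...new Set([...boxA, ...boxB])];
// What the question actually asks for: the ids present in both boxes.
const common = intersectIds(boxA, boxB);

console.log("union size:", union.length);
console.log("intersection:", common);

Building the set from the smaller array keeps the memory cost proportional to the smaller box, which matters when one box holds far more ids than the other.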

Secondly, I'm curious how this would behave in a sharded environment. Will mongos run a chunk of the query on the mongod instances it needs to and magically aggregate my result?

Lastly, if the above is sane, is it also sane to do:

db.items.find({'_id' : { $in : db.eval(function() {return db.boxes.distinct("item_ids", {_id:{$in:[ObjectId("4d3dc3898951498107000005"), ObjectId("4d3dc3898951498107000002")]}}); }) }})

That would basically find the items that box A and box B have in common, and then materialize them into objects, all in one server-side query. This also appears to work with .limit and .skip to effectively implement paging of the data set.

Anyhow, any feedback is valuable, thanks!

多彩岁月 2024-10-21 21:52:28

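I think you may need to reconsider your schema. If you have 1,000,000 ObjectIds in an array at 12 bytes each, that is 12MB, not even counting the BSON overhead, which can be significant for large arrays* (probably another 8MB or so). In 1.8 we are raising the maximum document size from 4MB to 16MB, but even that won't be enough for the objects you are looking to store.

*For historical reasons we store the stringified index of each element in the array, which is fine when you have <100 elements but adds up when you need 6 or 7 digits.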
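
The "another 8MB or so" figure can be reproduced with a little arithmetic. The sketch below assumes the standard BSON array layout, where each element carries a one-byte type tag plus its index written as a null-terminated decimal string key, followed by the 12-byte ObjectId value. The function name bsonArrayOverhead is hypothetical, and the array's own few bytes of length prefix and terminator are ignored.

// Hypothetical helper: per-element overhead bytes of a BSON array with
// n elements, under the layout assumption stated above.
function bsonArrayOverhead(n) {
  let bytes = 0;
  for (let i = 0; i < n; i++) {
    // 1 byte type tag + digits of the stringified index + 1 null terminator
    bytes += 1 + String(i).length + 1;
  }
  return bytes;
}

const count = 1000000;                     // ObjectIds in one item_ids array
const payload = count * 12;                // 12 bytes per ObjectId
const overhead = bsonArrayOverhead(count); // type tags and index keys

console.log("payload bytes: ", payload);
console.log("overhead bytes:", overhead);
console.log("total bytes:   ", payload + overhead);

Under these assumptions the overhead comes to 7,888,890 bytes on top of the 12,000,000 bytes of ObjectIds, so a single full box would need close to 20MB, above the 16MB limit mentioned in the answer.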
