Best data model for massive relationships in MongoDB
We're adopting MongoDB for a new solution and are currently trying to design the most effective data model for our needs as regards relationships between data items.
We've got to hold a three-way relationship between users, items and lists. A user can have many items and many lists. A list will have one user and many items. An item can belong to many users and many lists. The latter is especially important - an item can belong to potentially huge numbers of lists: thousands, certainly, and potentially tens or hundreds of thousands. Possibly even millions in the future. We need to be able to navigate these relationships in both directions: so, for example, getting all the items on a list or all the lists to which an item belongs. We also need the solution to be generic so that we can add many more types of document and relationships between them if we need to.
So it seems there are two possible solutions to this. The first is for each document in the database to have a "relationships" collection consisting of arrays of IDs. So a list document would have a relationships collection for items with the IDs of all the items and a relationships collection with a single ID for the user. In this model these arrays will become massive when an item belongs to many, many users or many, many lists.
The second model requires a new type of document, a "relationship" document that stores the IDs of each partner and the relationship name. This stores more data overall and so will impact disk space. It also looks like an "unnatural" way to approach this problem in NoSQL.
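To make the two options concrete, here is a rough sketch of what the documents might look like, written as plain Python dicts rather than real driver calls. The field names (`relationships`, `from_id`, `to_id`, `name`) and the relationship names (`contains`, `owns`) are made up for illustration; nothing here is prescribed by MongoDB.

```python
# Model 1: each document embeds arrays of related IDs (hypothetical field names).
list_doc = {
    "_id": "list1",
    "type": "list",
    "relationships": {"user": ["user1"], "items": ["item1", "item2"]},
}
item_doc = {
    "_id": "item1",
    "type": "item",
    # This is the array that could grow to millions of entries.
    "relationships": {"lists": ["list1"], "users": ["user1"]},
}

# Model 2: one small document per relationship between two partners.
relationship_docs = [
    {"from_id": "list1", "to_id": "item1", "name": "contains"},
    {"from_id": "list1", "to_id": "item2", "name": "contains"},
    {"from_id": "user1", "to_id": "list1", "name": "owns"},
]


def items_on_list(list_id):
    """Forward navigation in model 2: list -> items."""
    return [r["to_id"] for r in relationship_docs
            if r["from_id"] == list_id and r["name"] == "contains"]


def lists_containing(item_id):
    """Reverse navigation in model 2: item -> lists."""
    return [r["from_id"] for r in relationship_docs
            if r["to_id"] == item_id and r["name"] == "contains"]
```

In model 1 both directions are stored twice (once on the list, once on the item). In model 2 a single relationship document serves both directions, although a real database would presumably need an index on each ID field; the linear scans above only stand in for that.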
Performance-wise, space-wise, architecture-wise, which is better and why?
Cheers,
Matt
Comments (2)
It depends on your access patterns.
An embedded ID array is better for reading. With one quick read you get the IDs of all related objects and can then go and fetch them. But if your update rate is high you'll run into trouble, as MongoDB will have to copy the same (already big) document over and over as it outgrows its allocated space on disk.
But this solution is really bad for writes. Imagine an item that belongs to a couple of million lists. You decide to delete it. Now you have to walk all those lists and pull this item's ID from their reference arrays. Exciting, isn't it?
Storing references as separate documents is good for writes. Adding, editing and removing references is pretty fast. But this solution takes more disk space and, more importantly, precious RAM. Reads are also not as fast, especially if you have many references.
Given your numbers ("probably even millions in the future") I'd go with this solution. You can always throw in some hardware to accelerate queries. Scaling writes is traditionally the hardest part and in this solution writes are fast and shardable.
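A toy way to see the write-side difference, assuming the length of a JSON encoding is an acceptable rough proxy for document size (MongoDB actually stores BSON, so the absolute numbers will differ). The document shapes and field names are invented for the example.

```python
import json


def doc_size(doc):
    """Crude stand-in for stored size: length of the JSON encoding."""
    return len(json.dumps(doc))


# Embedded model: the item carries the ID of every list it belongs to.
item = {"_id": "item1", "lists": [f"list{i}" for i in range(100_000)]}
size_before = doc_size(item)
item["lists"].append("list100000")  # one new membership changes the big document
size_after = doc_size(item)

# Reference model: one new membership is one small document of fixed size.
ref = {"item": "item1", "list": "list100000"}
ref_size = doc_size(ref)
```

With the embedded array, every added membership modifies a document whose size keeps growing with the number of lists. With separate reference documents, each write is a small insert whose size does not depend on how many references already exist.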
I'd agree with Sergio regarding data access patterns being key here.
I'd also add another possible solution: storing a fourth document type with three properties, a reference to each of user, list, and item. That collection can be indexed for fast access on each of the three fields, given a unique index across all three fields together to prevent duplicates, and allows for fast inserts and deletes.
Ultimately you are not storing much more data this way, because if you need to look up the relationship from both sides ("What items in what lists does this user have?" and "What users have this item in their lists?") you need to duplicate references anyway.
It feels relational, but sometimes that is the best solution.
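The three-reference document described above can be sketched in memory as follows. This is only an illustration in plain Python, not driver code: the class name `Memberships` and its methods are hypothetical, a set plays the role of the unique index over all three fields, and one dict per field plays the role of the single-field indexes.

```python
from collections import defaultdict

FIELDS = ("user", "list", "item")


class Memberships:
    """In-memory stand-in for a collection of (user, list, item) documents."""

    def __init__(self):
        self._triples = set()  # acts like a unique index over all three fields
        self._by_field = {f: defaultdict(set) for f in FIELDS}  # one index per field

    def add(self, user, list_id, item):
        """Insert one membership; returns False if it already exists."""
        triple = (user, list_id, item)
        if triple in self._triples:
            return False
        self._triples.add(triple)
        for field, value in zip(FIELDS, triple):
            self._by_field[field][value].add(triple)
        return True

    def remove(self, user, list_id, item):
        """Delete one membership; returns False if it was not present."""
        triple = (user, list_id, item)
        if triple not in self._triples:
            return False
        self._triples.discard(triple)
        for field, value in zip(FIELDS, triple):
            self._by_field[field][value].discard(triple)
        return True

    def find(self, field, value):
        """All memberships whose `field` ('user', 'list' or 'item') equals `value`."""
        return sorted(self._by_field[field].get(value, ()))


m = Memberships()
m.add("user1", "list1", "item1")
m.add("user1", "list2", "item1")
duplicate_accepted = m.add("user1", "list1", "item1")  # rejected as a duplicate
lists_for_item = [t[1] for t in m.find("item", "item1")]
items_for_user = [t[2] for t in m.find("user", "user1")]
```

Each insert or delete touches one small record plus its index entries, and the same record answers lookups from the user, list, or item side.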