Too much data duplication in MongoDB?

Posted 2024-09-29 14:22:53


I'm new to this whole NOSQL stuff and have recently been intrigued by MongoDB. I'm creating a new website from scratch and decided to go with MONGODB/NORM (for C#) as my only database. I've been reading up a lot on how to properly design a document-model database, and I think for the most part I have my design worked out pretty well. I'm about 6 months into my new site and I'm starting to see data duplication/sync issues that I need to deal with over and over again. From what I've read, this is expected in the document model, and for performance it makes sense: you stick embedded objects into your document so reads are fast, with no joins. But of course you can't always embed, so MongoDB has the concept of a DbReference, which is basically analogous to a foreign key in relational DBs.

So here's an example: I have Users and Events; both get their own document. Users attend events, and Events have user attendees. I decided to embed a list of Events with limited data into the User objects, and I also embedded a list of Users into the Event objects as their "attendees". The problem is that now I have to keep each User in sync with the copies of that User embedded in the Event objects. As I read it, this seems to be the preferred approach and the NOSQL way to do things. Retrieval is fast, but the drawback is that when I update the main User document, I also need to go into the Event objects, find all references to that user, and update those as well.
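
To make the shape of the problem concrete, here is a minimal C# sketch of what such a pair of documents might look like. All names here are illustrative guesses, not taken from the asker's code, and ObjectId is assumed to come from the driver's BSON library:

    using System.Collections.Generic;
    using MongoDB.Bson;

    public class User
    {
        public ObjectId Id { get; set; }
        public string DisplayName { get; set; }
        // Denormalized: a trimmed copy of each event this user attends.
        public List<EventSummary> Events { get; set; } = new List<EventSummary>();
    }

    public class Event
    {
        public ObjectId Id { get; set; }
        public string Title { get; set; }
        // Denormalized: a trimmed copy of each attendee.
        public List<UserSummary> Attendees { get; set; } = new List<UserSummary>();
    }

    public class EventSummary
    {
        public ObjectId EventId { get; set; }
        public string Title { get; set; }       // duplicated from Event
    }

    public class UserSummary
    {
        public ObjectId UserId { get; set; }
        public string DisplayName { get; set; } // duplicated from User
    }

With this shape, renaming a user means finding every Event whose Attendees list still holds the old DisplayName and rewriting it; that is exactly the fan-out update described above.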

So the question I have is: is this a pretty common problem people need to deal with? How often does this problem have to happen before you start saying "maybe the NOSQL strategy doesn't fit what I'm trying to do here"? When does the performance advantage of not having to do joins turn into a disadvantage, because you're having a hard time keeping data in sync in embedded objects and making multiple reads to the DB to do so?


Comments (2)

狼性发作 2024-10-06 14:22:53


Well, that's the trade-off with document stores. You can store in a normalized fashion like any standard RDBMS, and you should strive for normalization as much as possible. It's only where there's a performance hit that you should break normalization and flatten your data structures. The trade-off is read efficiency vs. update cost.

Mongo has really efficient indexes, which can make normalizing easier, as in a traditional RDBMS (most document stores don't give you this for free, which is why Mongo is more of a hybrid than a pure document store). Using this, you can make a relation collection between users and events. It's analogous to a junction table in a tabular data store. Index the event and user fields, and it should be pretty quick, and it will help you normalize your data better.
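
As a sketch of that relation collection (using the modern official MongoDB.Driver API rather than NoRM, and with invented names), it could look like this:

    using MongoDB.Bson;
    using MongoDB.Driver;

    // One document per (user, event) pair -- the document-store analogue
    // of a junction table.
    public class UserEvent
    {
        public ObjectId Id { get; set; }
        public ObjectId UserId { get; set; }
        public ObjectId EventId { get; set; }
    }

    public static class UserEventIndexes
    {
        // Index both sides so "events for a user" and "attendees of an
        // event" are each a single indexed lookup.
        public static void Ensure(IMongoCollection<UserEvent> col)
        {
            col.Indexes.CreateOne(new CreateIndexModel<UserEvent>(
                Builders<UserEvent>.IndexKeys.Ascending(x => x.UserId)));
            col.Indexes.CreateOne(new CreateIndexModel<UserEvent>(
                Builders<UserEvent>.IndexKeys.Ascending(x => x.EventId)));
        }
    }

With this, a user's name lives in exactly one place; reading an event's roster costs one extra indexed query instead of an ongoing sync problem.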

I like to plot the efficiency of flattening a structure vs. keeping it normalized, in terms of the time it takes to update a record's data vs. reading out what I need in a query. You can do it in terms of big-O notation, but you don't have to be that fancy. Just put some numbers down on paper based on a few use cases with different data models and get a good gut feeling for how much work is required.

Basically, what I do first is try to predict how many updates a record will get vs. how often it's read. Then I try to predict the cost of an update vs. a read in both the normalized and the flattened layout (or some partial combination of the two... lots of optimization options). I can then judge the savings of keeping it flat vs. the cost of building up the data from normalized sources. Once I've plotted all the variables, if keeping it flat saves me a bunch, then I keep it flat. A toy version of this arithmetic is sketched below.
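
Here every number is invented, and "cost" just counts document touches:

    // Toy back-of-envelope model; all numbers are made up.
    double readsPerDay   = 10_000;  // how often the record is read
    double updatesPerDay = 50;      // how often it changes
    double fanOut        = 200;     // embedded copies touched per update

    // Flattened: each read is one fetch; each update rewrites every copy.
    double flatCost = readsPerDay * 1 + updatesPerDay * fanOut;

    // Normalized: each read needs an extra lookup; updates touch one doc.
    double normCost = readsPerDay * 2 + updatesPerDay * 1;

    System.Console.WriteLine(flatCost < normCost ? "keep it flat" : "normalize");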

A few tips:

  • If you require lookups to be fast and atomic (perfectly up to date), you may want to favor flattening over normalization and take the hit on updates.
  • If you require updates to be fast and immediately visible, favor normalization.
  • If you require fast lookups but don't require perfectly up-to-date data, consider building the flattened views from your normalized data in batch jobs (possibly using map/reduce).
  • If your queries need to be fast, updates are rare, and your updates don't necessarily need to be visible immediately or to carry transaction-level guarantees that they went through 100% of the time (i.e. that the write hit disk), you can consider writing your updates to a queue and processing them in the background; see the sketch after this list. (In this model, you will probably have to deal with conflict resolution and reconciliation later.)
  • Profile different models. Build a data-query abstraction layer (somewhat like an ORM) in your code so you can refactor your data store's structure later.
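
Here is a minimal sketch of the queued-update idea from the list above, using the BCL's BlockingCollection. All names are invented, and a real version would add persistence, retries, and the conflict resolution mentioned in that tip:

    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    public class ProfileUpdateQueue
    {
        private readonly BlockingCollection<(string UserId, string NewName)> _queue
            = new BlockingCollection<(string UserId, string NewName)>();

        public ProfileUpdateQueue()
        {
            // Single background consumer drains the queue.
            Task.Run(() =>
            {
                foreach (var (userId, newName) in _queue.GetConsumingEnumerable())
                    FanOutToEmbeddedCopies(userId, newName);
            });
        }

        // Called on the request path: cheap, returns immediately.
        public void Enqueue(string userId, string newName) =>
            _queue.Add((userId, newName));

        private void FanOutToEmbeddedCopies(string userId, string newName)
        {
            // Here you would rewrite every embedded copy of this user,
            // e.g. a multi-document update against the events collection.
        }
    }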

There are a lot of other ideas you can employ. There are many great blogs online that go into this, like highscalabilty.org, and make sure you understand the CAP theorem.

Also consider a caching layer, like Redis or memcached. I would put one of those products in front of my data layer. When I query Mongo (which stores everything normalized), I use the data to construct a flattened representation and store it in the cache. When I update the data, I invalidate anything in the cache that references what I'm updating. (Although you do have to factor the cost of invalidation, and of tracking which cached data is being updated, into your scaling considerations.) Someone once said, "The two hardest things in computer science are naming things and cache invalidation."
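
A minimal cache-aside sketch of that pattern, with an in-process dictionary standing in for Redis or memcached (all names invented):

    using System.Collections.Concurrent;

    public interface IFlatCache
    {
        bool TryGet(string key, out string json);
        void Set(string key, string json);
        void Invalidate(string key);
    }

    // Stand-in implementation; production would use Redis/memcached.
    public class InMemoryFlatCache : IFlatCache
    {
        private readonly ConcurrentDictionary<string, string> _map =
            new ConcurrentDictionary<string, string>();
        public bool TryGet(string key, out string json) => _map.TryGetValue(key, out json);
        public void Set(string key, string json) => _map[key] = json;
        public void Invalidate(string key) => _map.TryRemove(key, out _);
    }

    public class EventViewService
    {
        private readonly IFlatCache _cache;
        public EventViewService(IFlatCache cache) => _cache = cache;

        public string GetEventView(string eventId)
        {
            var key = "event-view:" + eventId;
            if (_cache.TryGet(key, out var cached)) return cached;

            // Miss: query the normalized collections, build the flattened
            // view, cache it, return it.
            var flattened = BuildFlattenedEventJson(eventId);
            _cache.Set(key, flattened);
            return flattened;
        }

        // Invalidate when any document the view embeds changes; knowing
        // *which* keys to invalidate is the hard part mentioned above.
        public void OnUnderlyingDataUpdated(string eventId) =>
            _cache.Invalidate("event-view:" + eventId);

        private string BuildFlattenedEventJson(string eventId) =>
            "{}"; // placeholder for the real query + projection
    }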

笑叹一世浮沉 2024-10-06 14:22:53


Try adding an IList<UserEvent> property to your User object. You didn't say much about how your domain model is designed. Check the NoRM group http://groups.google.com/group/norm-mongodb/topics for examples.
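
One reading of that suggestion, with a hypothetical UserEvent link type (the answer doesn't spell out its shape):

    using System;
    using System.Collections.Generic;
    using MongoDB.Bson;

    public class UserEvent
    {
        public ObjectId EventId { get; set; }
        public DateTime AttendedOn { get; set; }  // hypothetical field
    }

    public class User
    {
        public ObjectId Id { get; set; }
        public string Name { get; set; }
        public IList<UserEvent> Events { get; set; }
    }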
