Best concurrency model for a scalable database and index?
I'm interested in concurrency techniques that are relatively easy to implement and suitable for scaling out (multiple nodes).
Also, if you know some more advanced algorithms, please provide some info.
Hope this topic will be useful for others.
Thanks!
Update
I'm interested in NoSQL storages and models.
You need to think through exactly what you're looking for. Does "scalable" refer to the size of the DB? The number of readers? The number of simultaneous writers? The throughput of readers in the presence of not-inherently-conflicting writers? By "index", do you mean a traditional totally-ordered index like a B-tree, or will hashing do? Also, what degree of consistency do you want among readers and writers?
The previous answer suggests hashing, which is excellent if you don't need a B-tree-like ordered index, precisely because the hashing operation splits the keyspace into separate pieces so that simultaneous operations avoid interacting. Range queries, on the other hand, want a tree index of some sort, which has problems because a single update can lock out all other queries as it adjusts the index. The traditional solution to this is multiversion concurrency control: http://en.wikipedia.org/wiki/Multiversion_concurrency_control
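The core of MVCC can be sketched in a few lines: writers append timestamped versions instead of overwriting in place, and each reader pins a snapshot timestamp so it sees a consistent state without blocking writers. This is a minimal single-process illustration (the class name and structure are my own, not from any particular database):

```python
import threading

class MVCCStore:
    """Minimal multiversion key-value store: writers append new versions
    under a lock; readers pin a snapshot timestamp and never block."""

    def __init__(self):
        self._versions = {}   # key -> list of (commit_ts, value), ts ascending
        self._clock = 0
        self._lock = threading.Lock()

    def write(self, key, value):
        # Append a new version stamped with the next commit timestamp.
        with self._lock:
            self._clock += 1
            self._versions.setdefault(key, []).append((self._clock, value))

    def snapshot(self):
        # A reader's snapshot is just the current clock value.
        with self._lock:
            return self._clock

    def read(self, key, ts):
        # Return the newest version committed at or before the snapshot.
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= ts:
                return value
        return None

store = MVCCStore()
store.write("k", "v1")
ts = store.snapshot()
store.write("k", "v2")
store.read("k", ts)   # still sees "v1" under the old snapshot
```

A real engine would also garbage-collect versions no active snapshot can see, but the read path above is why an index update need not lock out readers: they keep reading the old versions.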
The ideal approach to maximize concurrency capabilities in a database is to use a "sparsely populated" table with a hashed key. This allows instant retrieval of records by PK (or a surrogate that can get you quickly to the PK) and pretty much eliminates collisions. There is no need to read an index to determine where the record "lives" in the table, because its location can be computed from the hashed PK. However, by maximizing concurrency capabilities in this way you are likely to suffer some other tradeoff, such as the inability to retrieve all records for, say, a particular zipcode quickly, or the inability to order the table instantly by some date value. A quick fetch of all records for a given zipcode, or ordering the rows instantly by a date column, would typically physically group those records, making them contiguous on disk, in order to avoid excessive disk thrashing. A sparsely populated table with a hashed key can involve significant disk thrashing when groups of records are fetched (e.g. all customers in New York), while it excels when an individual record is fetched (customer #123456).
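The "location computed from the hashed PK" idea reduces to a single function: hash the key and take it modulo the table capacity. A minimal sketch, with a made-up table size and SHA-1 standing in for whatever hash the storage engine would actually use:

```python
import hashlib

TABLE_SLOTS = 1_000_000  # hypothetical capacity of the sparsely populated table

def home_slot(pk: str) -> int:
    """Compute where a record lives directly from its hashed PK --
    no index read needed to locate it."""
    digest = hashlib.sha1(pk.encode("utf-8")).digest()
    # Fold the first 8 bytes of the digest into a slot number.
    return int.from_bytes(digest[:8], "big") % TABLE_SLOTS

slot = home_slot("customer:123456")  # same key always maps to the same slot
```

Because distinct keys scatter uniformly over the slots, two writers touching different keys almost never contend for the same location, which is exactly the concurrency win described above.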
EDIT: I should add that in such hashed-key sparsely populated databases, it is not unusual to find composite primary keys such as ZIPCODE*CUSTOMERNUMBER, so that all of the customers in a given zipcode end up being stored in roughly the same region of the sparsely populated table; this is done to minimize thrashing when zipcode-driven reports are run. So there are ways to mitigate the adverse effects of the approach while preserving its exceptionally low collision rates and no-index-required record fetches.
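One way to realize such a ZIPCODE*CUSTOMERNUMBER composite key is to let the zipcode select a contiguous region of the table and hash only the customer number within that region. The region size below is an arbitrary illustrative constant, not from the original answer:

```python
import hashlib

SLOTS_PER_ZIP = 10_000  # hypothetical region size reserved per zipcode

def composite_slot(zipcode: str, customer_no: int) -> int:
    """ZIPCODE*CUSTOMERNUMBER composite key: the zipcode picks a
    contiguous region of the sparse table, and the hashed customer
    number picks a slot inside it, so one zipcode's customers stay
    physically close on disk for zipcode-driven reports."""
    region_base = int(zipcode) * SLOTS_PER_ZIP
    digest = hashlib.sha1(str(customer_no).encode("utf-8")).digest()
    return region_base + int.from_bytes(digest[:4], "big") % SLOTS_PER_ZIP
```

Within a region the customer numbers still hash, so concurrent writers for the same zipcode rarely collide, while a report for that zipcode scans one contiguous stretch of the table.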
EDIT2: And let's say that you wanted to make the email address the PK, and yet did not want the records of everyone from AOL or YAHOO or GOOGLE to be clustered in the same region of the sparsely populated table, resulting in "bulges" there, like an anaconda that had swallowed a pig. You can use a left-weighted primary-key hashing algorithm to put more emphasis on the values to the left of the @. But if you were to use a numeric sequential PK, you could use a right-weighted algorithm to eliminate such bulges.
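A "left-weighted" hash can be illustrated with a simple polynomial hash in which the leftmost characters carry the largest multipliers, so the local part of an email (before the @) dominates slot choice and same-domain addresses spread out. This is purely an illustrative scheme of my own, not a standard algorithm; the constants are arbitrary:

```python
TABLE_SLOTS = 1_000_003  # hypothetical sparse-table size (prime)

def left_weighted_slot(email: str) -> int:
    """Polynomial hash where earlier (leftmost) characters get the
    largest weights, so differing local parts push @aol.com addresses
    into different regions instead of one 'bulge'."""
    h = 0
    for i, ch in enumerate(email[:32]):          # cap the length considered
        h += ord(ch) * pow(31, 32 - i, TABLE_SLOTS)  # leftmost chars weigh most
        h %= TABLE_SLOTS
    return h
```

For a numeric sequential PK the symmetric trick applies: weight the rightmost (fastest-changing) digits most heavily, since the leftmost digits are nearly constant across consecutive IDs.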