Sharding key, mostly even distribution. How to handle outliers?
I'm learning about sharding approaches and how to achieve good horizontal scalability with a large number of shards in an IO-heavy application. Below I describe a case that I expect to see in my app. I think this would be a relatively common case in the wild; however, I was unable to find much info on it.
Let's say that we need to shard a table/collection where each row is associated with a client. All queries will include a single client id (uuid). Updates and reads are mostly evenly distributed among clients.
From what I've read, in this case I would want to use a hashed sharding key on the client id. Reads would touch a single shard, providing the best performance. Writes would be evenly distributed as long as clients produce roughly the same load.
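For concreteness, here is a rough sketch of the kind of routing I have in mind; the shard count and helper name are placeholders, not any particular database's API:

```python
# Minimal sketch of routing by a hashed sharding key on the client id.
# NUM_SHARDS and shard_for_client are illustrative names only.
import hashlib
import uuid

NUM_SHARDS = 16  # hypothetical shard count

def shard_for_client(client_id: uuid.UUID) -> int:
    """Map a client id to a shard index with a stable hash."""
    digest = hashlib.sha1(client_id.bytes).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Every read and write for this client goes to exactly one shard.
client = uuid.uuid4()
print(shard_for_client(client))
```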
But what to do if there is a very small subset of clients that produce so much IO load that a single shard would have trouble handling it?
If we change the sharding key to a random record ID, then writes for all clients would be distributed across all shards. But reads would have to hit all shards, which is not efficient, especially when there are a lot of them.
How do we achieve a balance: have average clients be evenly distributed, and at the same time allow large clients to occupy multiple shards? Are there any DB solutions that would be able to do this automatically? Or do we have to write custom logic for tracking DB load and redistributing large clients between shards? What should I read on the topic?
Comments (2)
I'd suggest adding a new attribute to the client's records; for example, we could call it `part`. Assign a single value to simple clients, and store the same value in `part` for all their records. Heavy clients, however, would be assigned multiple values for `part`, up to the number of shards. Every record for that client would set its `part` to one of these values. Assign them either randomly or round-robin, however you think is most efficient. The point is to use each `part` value with approximately even frequency.

Your hashing algorithm for mapping clients to a shard would then use the client id + the `part` attribute. So each simple client would still store all their data on a single shard, but heavy clients would distribute their data over multiple shards.

This does mean that for the heavy clients, a read query would need to search multiple shards. Code your searches to loop over the `part` values for the client. For most clients, this loop will only need to execute once. For the heavy clients, the loop will execute once for each `part` value associated with that client.

To be honest, I've never seen a load so great that this would be necessary. It's more likely that the traffic for one client is too much for one database instance because the queries are not optimized well, or the application is running more queries than it should. It's important to make sure you analyze query efficiency before you make your sharding architecture more complex.
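A rough sketch of this scheme (the shard count, helper names, and the random assignment of parts are illustrative only; this isn't tied to any particular database):

```python
# Hypothetical sketch of the "part" scheme: simple clients get one part value,
# heavy clients get several, and the shard is derived from hash(client_id, part).
import hashlib
import random

NUM_SHARDS = 16  # assumed shard count

def shard_for(client_id: str, part: int) -> int:
    """Hash client id + part to a shard index."""
    key = f"{client_id}:{part}".encode()
    return int.from_bytes(hashlib.sha1(key).digest()[:8], "big") % NUM_SHARDS

def parts_for_client(client_id: str, is_heavy: bool) -> list[int]:
    """Simple clients own a single part; heavy clients own NUM_SHARDS parts in this sketch."""
    return list(range(NUM_SHARDS)) if is_heavy else [0]

def write(client_id: str, record: dict, is_heavy: bool) -> int:
    """Pick one of the client's parts at random, stamp it on the record, and return its shard."""
    part = random.choice(parts_for_client(client_id, is_heavy))
    record["part"] = part
    return shard_for(client_id, part)  # caller routes the insert to this shard

def read_shards(client_id: str, is_heavy: bool) -> set[int]:
    """Loop over the client's parts: one shard for most clients, several for heavy ones."""
    return {shard_for(client_id, p) for p in parts_for_client(client_id, is_heavy)}
```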
You've tagged your question with cockroachdb, so you probably already suspect this, but CockroachDB handles sharding transparently. If your primary key is composite and the first column is the client id, data with the same client id will all fall in a contiguous key range, and therefore generally be stored on the same node. If a range gets bigger than a configurable limit, and/or gets much more traffic, CockroachDB will automatically split the range to rebalance storage and traffic across nodes. You'll mostly not have to pay attention to this, and for your pattern you won't want to do any explicit sharding. However, if you do need to inspect or tweak the behavior, there are tools to do so, such as SHOW RANGES.
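For illustration, here is a rough sketch of that setup. The table name, columns, and connection string are made up; since CockroachDB speaks the PostgreSQL wire protocol, a standard Postgres driver such as psycopg2 should work:

```python
# Sketch: composite primary key with client_id first, then inspect range splits.
import psycopg2

# Hypothetical local insecure CockroachDB node.
conn = psycopg2.connect("postgresql://root@localhost:26257/defaultdb?sslmode=disable")
conn.autocommit = True

with conn.cursor() as cur:
    # client_id as the leading PK column keeps each client's rows in a
    # contiguous key range; CockroachDB splits and rebalances ranges on its own.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            client_id UUID NOT NULL,
            event_id  UUID NOT NULL DEFAULT gen_random_uuid(),
            payload   JSONB,
            PRIMARY KEY (client_id, event_id)
        )
    """)
    # See how the table is currently split into ranges across nodes.
    cur.execute("SHOW RANGES FROM TABLE events")
    for row in cur.fetchall():
        print(row)
```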