面向列的数据库(HBase、Cassandra)中的连续行 ID?
在 HBase 中设计行 ID 时,我看到了两条相互矛盾的建议(具体来说,但我认为它也适用于 Cassandra)。
- 将您经常聚合在一起的键分组,以利用数据局部性。 (White,《Hadoop:权威指南》,我记得在 HBase 站点上看到过它,但找不到它...)
- 分散密钥,以便可以将工作分布在多台计算机上 (Twitter、Pig 和 Twitter 上的 HBase< /a> 幻灯片 14)
我猜测哪一个是最佳的可以取决于您的用例,但是有人对这两种策略有任何经验吗?
I've seen two contradictory pieces of advice when it comes to designing row IDs in HBase, (specifically, but I think it applies to Cassandra as well.)
- Group keys that you'll be aggregating together often to take advantage of data locality. (White, Hadoop: The Definitive Guide and I recall seeing it on the HBase site, but can't find it...)
- Spread keys around so that work can be distributed across multiple machines (Twitter, Pig, and HBase at Twitter slide 14)
I'm guessing which one is optimal can depend on your use case, but does anyone have any experience with either strategy?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
在 HBase 中,表通过划分键空间来划分为区域,键空间按字典顺序排序。表的每个区域都属于单个区域服务器,因此所有读取和写入都由该服务器处理(这可以提供强一致性保证)。这意味着,如果您的所有读取或写入都集中在密钥空间的一小部分上,那么您将只能扩展到单个区域服务器可以处理的范围。例如,如果您的数据是时间序列并由时间戳作为键控,则所有写入都将写入表中的最后一个区域,并且您将被迫以单个服务器可以处理的速率进行写入。
另一方面,如果您可以选择键,使得任何给定查询只需要扫描一小部分行,但整个读写集分布在您的键空间中,那么总负载将被分配和扩展很好,但您仍然可以享受查询的本地化优势。
In HBase, a table is partitioned into regions by dividing up the key space, which is sorted lexicographically. Each region of the table belongs to a single region server, so all reads and writes are handled by that server (which allows for a strong consistency guarantee). This means that if all of your reads or writes are concentrated on a small range of your keyspace, that you will only be able to scale to what a single region server can handle. For example, if your data is a time series and keyed by the timestamp, then all writes are going to the last region in the table, and you will be constrained to writing at the rate that a single server can handle.
On the other hand, if you can choose your keys such that any given query only needs to scan a small range of rows, but that the overall set of reads and writes are spread across your keyspace, then the total load will be distributed and scale nicely, but you can still enjoy the locality benefits for your query.