Hadoop HBase: Should column families be spread across tables?
The HBase documentation makes it clear that you should group similar columns into column families, because physical storage is organized by column family.
But what does it mean to put two column families into the same table, as opposed to having separate tables per column group? Are there specific cases when "partitioning" tables this way makes more sense, and cases when one "wide" table works better?
Separate tables should result in separate "row regions", which could be beneficial when some column families (as a whole) are very sparse. Conversely, when would it be advantageous to have column families bunched together?
Comments (2)
You've got the idea of column families right on: basically it's just a hint to HBase to store and replicate these items together for faster access.
If you put two column families in the same table and always have different keys to access them, then it's really the same thing as having them in two separate tables. You only gain by having two column families in the same table that are accessed via the same keys.
For example: if I have columns for the total number of pageviews for a given web site, the number of unique views for the same site, the browser the user uses to view the site, and their internet connection, I can decide that I want the first two to be a column family and the last two to be another column family. Here all four are accessed by the same key, namely the web site in question, so I'm gaining by having them in the same table.
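To make the layout concrete, here is a minimal sketch in plain Python of how that single table could be pictured. It is only a toy in-memory model, not HBase client code, and the family names `counts` and `client`, the qualifiers, the values, and the helper `get_row` are all made up for illustration.

```python
# Toy in-memory model of one table: row key -> column family -> qualifier -> value.
# All names and values here are illustrative assumptions, not a real schema.
sites = {
    "example.com": {
        "counts": {"pageviews": 1200, "uniques": 300},
        "client": {"browser": "Firefox", "connection": "dsl"},
    },
}

def get_row(table, row_key, families=None):
    """Return the requested families of one row (all families if None)."""
    row = table.get(row_key, {})
    if families is None:
        return {fam: dict(cols) for fam, cols in row.items()}
    return {fam: dict(row[fam]) for fam in families if fam in row}

# One lookup by the shared key returns all four columns...
everything = get_row(sites, "example.com")
# ...or only the family that is actually needed.
only_counts = get_row(sites, "example.com", families=["counts"])
```

The point of the sketch is that both families hang off the same row key, so a single keyed lookup can reach either or both of them.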
If they were in different tables, I would end up having to do a join-like operation across the two. As far as I recall, HBase has no join of its own, since it's non-relational. I don't really know the numbers, though, so I can't tell you how slow the join-like operation is, or where the tipping point lies at which splitting the data into separate tables outweighs keeping it in the same table (or vice versa).
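As a rough illustration of what that join-like operation looks like from the client's side, the following toy Python sketch models the two-table variant with plain dictionaries. The table names, values, and the helper `fetch_site` are hypothetical, and the sketch says nothing about real latency; it only shows that the client has to issue one lookup per table and merge the results itself.

```python
# Toy model of the two-table layout: each table maps row key -> columns.
# Table names and contents are illustrative assumptions.
counts_table = {"example.com": {"pageviews": 1200, "uniques": 300}}
client_table = {"example.com": {"browser": "Firefox", "connection": "dsl"}}

def fetch_site(row_key):
    """Client-side 'join': one lookup per table, merged on the shared key."""
    lookups = 0
    merged = {}
    for table in (counts_table, client_table):
        lookups += 1
        merged.update(table.get(row_key, {}))
    return merged, lookups

merged, lookups = fetch_site("example.com")
```

If the two groups of columns are never read together, the merge step never happens, which is the case where separate tables cost nothing.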
Of course, this all depends on the data you're trying to store, so if you would never need to join across the tables, you would want to keep them in separate tables since you could argue they're not that related to each other in the first place.
Column families are a compromise between row-oriented vs. column-oriented access. To extend Chris's web page example, a row access would fetch all data (columns) for a single web site. An example of a column-oriented operation would be to sum the number of page views across all sites.
The latter operation does not require the browser and connection details, which are much larger than the numeric values for view counts and would significantly affect query performance. Therefore, HBase provides column families as an optimisation that supports column operations.
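A small sketch of that column-oriented sum, again as a toy Python model rather than real HBase code: each column family is kept in its own store, so the aggregation reads the numeric family and never touches the bulkier one. The family names, sites, and numbers are invented for the example.

```python
# Toy model: each column family is kept in its own store, keyed by row.
# Family names, qualifiers, and values are made up for illustration.
stores = {
    "counts": {
        "a.example": {"pageviews": 100, "uniques": 40},
        "b.example": {"pageviews": 250, "uniques": 90},
    },
    "client": {
        "a.example": {"browser": "Firefox", "connection": "dsl"},
        "b.example": {"browser": "Chrome", "connection": "fiber"},
    },
}

def total_pageviews(stores):
    """Sum page views across all rows, reading only the 'counts' store."""
    return sum(cols.get("pageviews", 0) for cols in stores["counts"].values())

total = total_pageviews(stores)
```

The `client` store is never read during the sum, which is the saving the answer describes.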
As to whether or not the columns should be in the same table... I would just follow normal data modelling guidelines and put all the columns in the same table if they are attributes of the same entity. Column families are about performance, not schema.