How does HBase distribute new regions from MapReduce across the cluster?
My situation is the following: I have a 20-node Hadoop/HBase cluster with 3 ZooKeepers. I do a lot of processing of data from HBase tables into other HBase tables via MapReduce.
Now, if I create a new table and tell a job to use that table as its output sink, all of the data goes onto the same region server. This wouldn't surprise me if there were only a few regions. But one particular table of mine has about 450 regions, and here is the problem: most of those regions (about 80%) are on the same region server!
I am wondering how HBase distributes new regions across the cluster, and whether this behaviour is normal/desired or a bug. Unfortunately, I don't know where to start looking for a bug in my code.
The reason I ask is that this makes jobs incredibly slow. Only once the jobs have completely finished does the table get balanced across the cluster, but that does not explain this behaviour. Shouldn't HBase distribute new regions to different servers at the moment they are created?
Thanks for your input!
Comments (2)
I believe that this is a known issue. Currently, HBase distributes regions across the cluster as a whole, without regard for which table they belong to.
Consult the HBase book for background:
http://hbase.apache.org/book/regions.arch.html
It could be that you are on an older version of HBase:
http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/19155
See the following for a discussion of load balancing and region moving:
http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/12549
By default, the balancer just evens out the number of regions on each region server (RS), without taking tables into account.
You can set
hbase.master.loadbalance.bytable
to true to get per-table balancing.
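As a rough sketch of what that setting could look like, the fragment below puts the hbase.master.loadbalance.bytable property mentioned above into hbase-site.xml on the master. The file location and the need to restart the master for the change to take effect are assumptions about a typical installation, and whether the property is honoured depends on your HBase version, so check the default configuration shipped with your release first.

```xml
<!-- hbase-site.xml on the HBase master (assumed location: the HBase conf directory) -->
<!-- Asks the balancer to even out regions per table rather than only cluster-wide. -->
<property>
  <name>hbase.master.loadbalance.bytable</name>
  <value>true</value>
</property>
```

Note that this only affects how the balancer spreads regions when it runs. It does not by itself change where a newly split region is first opened while a job is still writing.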