HBase bulk load spawns a large number of reducer tasks - any workaround?

Posted 2024-10-17 11:38:13

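HBase bulk load (using the configureIncrementalLoad helper method) configures the job to create as many reducer tasks as there are regions in the HBase table. So if the table has a few hundred regions, the job spawns a few hundred reducer tasks. On a small cluster this can get very slow.

Is there a workaround, perhaps using MultipleOutputFormat or something else?

Thanks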

不即不离 2024-10-24 11:38:13

Sharding the reduce stage by region gives you a lot of long-term benefit. You get data locality once the imported data is online, and you can also tell when a region has been load-balanced to another server. I wouldn't be so quick to go to a coarser granularity.
Since the reduce stage is doing a single file write, you should be able to call setNumReduceTasks(# of hard drives). That might speed it up more.
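
As a rough illustration of why a few hundred reducers hurt on a small cluster: if only a handful of reduce tasks can run at the same time, the reducers execute in sequential waves. The sketch below is plain Java with no Hadoop dependency. The class name `ReducerWaves`, the 300-region table and the 12 concurrent reduce slots are made-up values for illustration, not measurements.

```java
class ReducerWaves {
    // Number of sequential waves needed to run `reducers` tasks when at most
    // `slots` of them can run concurrently (ceiling division).
    static int waves(int reducers, int slots) {
        if (reducers < 0 || slots <= 0) {
            throw new IllegalArgumentException("reducers must be >= 0 and slots > 0");
        }
        return (reducers + slots - 1) / slots;
    }

    public static void main(String[] args) {
        int regions = 300; // hypothetical: one reducer per region
        int slots = 12;    // hypothetical: concurrent reduce slots in the cluster
        System.out.println("waves = " + waves(regions, slots));
    }
}
```

With these assumed numbers the job needs 25 waves of reducers, and each wave pays its own task startup and shuffle cost.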

It's very easy to get network bottlenecked. Make sure you're compressing both your HFiles and your intermediate MR data.

  // Compress the intermediate map output to cut shuffle traffic.
  job.getConfiguration().setBoolean("mapred.compress.map.output", true);
  job.getConfiguration().setClass("mapred.map.output.compression.codec",
      org.apache.hadoop.io.compress.GzipCodec.class,
      org.apache.hadoop.io.compress.CompressionCodec.class);
  // Compress the HFiles written by the reducers (LZO must be installed on the cluster).
  job.getConfiguration().set("hfile.compression",
      Compression.Algorithm.LZO.getName());

Your data import might be small enough that you should look at using a Put-based format instead. This calls the normal HTable.put API and skips the reducer phase. See TableMapReduceUtil.initTableReducerJob(table, null, job).
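
A minimal sketch of what such a Put-based, map-only job could look like, in the style of the read/write example in the HBase reference guide. Everything specific here is an assumption for illustration: the class names `PutBasedImport` and `PutMapper`, the tab-separated `rowkey<TAB>value` input, the column family `cf` and the qualifier `val`. It also assumes an HBase 1.0+ client (`Put.addColumn`; older clients use `Put.add`) and that the reduce phase is switched off explicitly with `setNumReduceTasks(0)`.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class PutBasedImport {

  // Hypothetical mapper: expects tab-separated "rowkey<TAB>value" lines.
  public static class PutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    private static final byte[] FAMILY = Bytes.toBytes("cf");     // assumed column family
    private static final byte[] QUALIFIER = Bytes.toBytes("val"); // assumed qualifier

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t", 2);
      if (fields.length < 2 || fields[0].isEmpty()) {
        return; // skip malformed lines
      }
      byte[] row = Bytes.toBytes(fields[0]);
      Put put = new Put(row);
      put.addColumn(FAMILY, QUALIFIER, Bytes.toBytes(fields[1]));
      context.write(new ImmutableBytesWritable(row), put);
    }
  }

  // Usage (assumed): PutBasedImport <input dir> <table name>
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "put-based-import");
    job.setJarByClass(PutBasedImport.class);
    job.setMapperClass(PutMapper.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // Null reducer class: this only wires up TableOutputFormat for the table.
    TableMapReduceUtil.initTableReducerJob(args[1], null, job);
    // Map-only job: the mappers' Puts are written straight to the table.
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The trade-off is that every row goes through the normal write path of the region servers rather than being written as ready-made HFiles, which is why this tends to suit smaller imports.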
