Ideas for balancing an HDFS -> HBase map/reduce job

Posted 12-11 09:25


For a client, I've been scoping out the short-term feasibility of running a Cloudera-flavored Hadoop cluster on AWS EC2. For the most part the results have been as expected, with the performance of the logical volumes being mostly unreliable; that said, doing what I can, I've got the cluster running reasonably well for the circumstances.

Last night I ran a full test of their importer script, which pulls data from a specified HDFS path and pushes it into HBase. Their data is somewhat unusual in that the records are less than 1 KB apiece and have been condensed together into 9 MB gzipped blocks. In total there are about 500K text records that get extracted from the gzips, sanity checked, then pushed on to the reducer phase.

The job runs within expectations for the environment (the number of spilled records is about what I expected), but one really odd problem is that the job runs with 8 reducers, yet 2 reducers do 99% of the work while the remaining 6 do only a tiny fraction of it.

My so-far-untested hypothesis is that I'm missing a crucial shuffle or block-size setting in the job configuration, which causes most of the data to be pushed into blocks that can only be consumed by 2 reducers. Unfortunately, the last time I worked on Hadoop, another client's data set was in 256 GB LZO files on a physically hosted cluster.

To clarify my question: is there a way to tweak an M/R job to actually utilize more of the available reducers, either by lowering the output size of the maps or by having each reducer cut down the amount of data it will parse? Even going from the current 2 working reducers to 4 would be a major improvement.
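For context, the knob in question lives in the job driver. Below is a minimal sketch using the standard org.apache.hadoop.mapreduce API; ImportDriver, ImporterMapper and ImporterReducer are placeholder names, not the client's actual classes, and the input/output/HBase wiring is omitted.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    // Driver-side sketch: this is where the reducer count is requested.
    // Asking for 8 reduce tasks does not by itself spread the work; the
    // partitioner still decides which keys land on which reducer.
    public class ImportDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "hdfs-to-hbase-import");
            job.setJarByClass(ImportDriver.class);
            job.setMapperClass(ImporterMapper.class);    // placeholder mapper
            job.setReducerClass(ImporterReducer.class);  // placeholder reducer
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setNumReduceTasks(8);  // 8 tasks run, but only 2 end up with real work
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }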


Comments (1)

浊酒尽余欢  2024-12-18 09:25:37


It seems like you are getting hotspots in your reducers. This is likely because a particular key is very popular. What keys does your mapper output?

You have a couple of options here:

  • Try more reducers. Sometimes, you get weird artifacts in the randomness of the hashes, so having a prime number of reducers sometimes helps. This will likely not fix it.
  • Write a custom partitioner that spreads out the work better (a minimal sketch follows this list).
  • Figure out why a bunch of your data is getting binned into two keys. Is there a way to make your keys more unique to split up the work?
  • Is there anything you can do with a combiner to reduce the amount of traffic going to the reducers?
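To illustrate the partitioner option, here is a minimal sketch assuming Text map-output keys and values; SpreadingPartitioner and the extra bit-mixing are illustrative, not the poster's code.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical partitioner that mixes the hash bits more aggressively than
    // the default HashPartitioner ((hashCode & Integer.MAX_VALUE) % numPartitions),
    // so clusters of similar keys are less likely to collapse onto two reducers.
    // Note: if a single key truly dominates, the mapper also needs to salt that
    // key (e.g. append a suffix) before any partitioner can split its load,
    // since all values for one key must still reach one reducer.
    public class SpreadingPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            int hash = key.toString().hashCode();
            hash ^= (hash >>> 16);  // fold high bits into low bits for better spread
            return (hash & Integer.MAX_VALUE) % numPartitions;
        }
    }

It would be registered in the driver with job.setPartitionerClass(SpreadingPartitioner.class); a combiner (job.setCombinerClass(...)) only helps if the reduce logic can safely be applied to partial groups of values.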