Ideas for balancing an HDFS -> HBase MapReduce job
For a client, I've been scoping out the short-term feasibility of running a Cloudera-flavored Hadoop cluster on AWS EC2. For the most part the results have been as expected, with the performance of the logical volumes being mostly unreliable; that said, doing what I can, I've gotten the cluster to run reasonably well under the circumstances.
Last night I ran a full test of their importer script, which pulls data from a specified HDFS path and pushes it into HBase. Their data is somewhat unusual in that the records are less than 1KB apiece and have been condensed into 9MB gzipped blocks. In total there are about 500K text records that get extracted from the gzips, sanity-checked, then pushed on to the reducer phase.
The job runs within expectations for the environment (the amount of spilled records is what I expected), but one really odd problem is that while the job runs with 8 reducers, 2 of them do 99% of the work and the remaining 6 do only a fraction of it.
My so-far-untested hypothesis is that I'm missing a crucial shuffle or block-size setting in the job configuration, which causes most of the data to be pushed into blocks that can only be consumed by 2 reducers. Unfortunately, the last time I worked on Hadoop, another client's data set was in 256GB LZO files on a physically hosted cluster.
To clarify my question: is there a way to tweak an M/R job to actually utilize more of the available reducers, either by lowering the output size of the maps or by having each reducer cut down the amount of data it will parse? Even an improvement of 4 reducers over the current 2 would be major.
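For context on why extra reducers can sit idle: Hadoop's default HashPartitioner routes each map output key to `hash(key) % numReduceTasks`, so when the distinct keys are few (or a handful of keys dominate), most records land on a couple of reducers regardless of how many are configured. A minimal Python sketch of that mechanism (the key names and counts are hypothetical, not from the actual job):

```python
# Simulate Hadoop's default HashPartitioner: partition = hash(key) % num_reducers.
# With only a few distinct map output keys, records pile onto a few reducers.
from collections import Counter

NUM_REDUCERS = 8

def partition(key: str, num_reducers: int = NUM_REDUCERS) -> int:
    # Hadoop computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    # Python's built-in hash() stands in for hashCode() here.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

# Hypothetical workload: 500K records but only 3 distinct keys.
records = ["source-a"] * 300_000 + ["source-b"] * 150_000 + ["source-c"] * 50_000
load = Counter(partition(k) for k in records)
print(sorted(load.items()))  # at most 3 of the 8 reducers receive any data
```

If the mapper output keys in the real job are similarly low-cardinality, raising the reducer count alone will not spread the load; the key space (or the partitioner) has to change.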
It seems like you are getting hotspots in your reducers. This is likely because a particular key is very popular. What are the keys output by the mapper?
You have a couple of options here: