Setting compressed output in Hadoop
When should I use FileOutputFormat.setCompressOutput(conf, true); and when should I not?
I heard that it compresses mapper output. Is it possible to compress the reducer-side output as well?
(If my assumption is wrong, please correct me and explain how to compress both mapper output and reducer output.)
Comments (1)
You can control compression of the reducer output with mapred.output.compress, and compression of the mapper output with mapred.compress.map.output. These configuration keys can be set (to true or false) in the site-wide configuration file, in your job setup, or as -D options passed to Hadoop when you run your job.
Compressing map output is generally a good idea. I also compress reduce output when that output is not the final result, e.g. when I am running another job over the output of the previous job.
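As a rough illustration of the "job setup" route, here is a minimal sketch using the old org.apache.hadoop.mapred API that the question's call comes from. In that API, FileOutputFormat.setCompressOutput(conf, true) sets mapred.output.compress, so it affects the job's final output rather than the intermediate map output; map output is switched on separately. The class name CompressedJobSetup, the method configure, and the flag outputFeedsAnotherJob are hypothetical names for this example, and GzipCodec is used only because it ships with Hadoop (the answer itself uses LZO, which needs the separate hadoop-lzo library).

```java
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical helper class; the names are illustrative only.
public class CompressedJobSetup {

    public static JobConf configure(JobConf conf, boolean outputFeedsAnotherJob) {
        // Intermediate map output: same effect as mapred.compress.map.output=true.
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(GzipCodec.class);

        // Final job output: same effect as mapred.output.compress=true.
        // Enabled here only when another job will consume the output,
        // following the rule of thumb in the answer above.
        if (outputFeedsAnotherJob) {
            FileOutputFormat.setCompressOutput(conf, true);
            FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
        }
        return conf;
    }
}
```

The same switches can be passed on the command line, for example -D mapred.compress.map.output=true, provided the job's driver parses generic options (typically by running through ToolRunner). On newer Hadoop releases these keys are named mapreduce.map.output.compress and mapreduce.output.fileoutputformat.compress; the older names are kept as deprecated aliases.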
Compression often helps jobs finish faster (even though it requires extra processing for compression/decompression) because it can greatly decrease the amount of I/O.
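To get a feel for the I/O saving outside of Hadoop, the following self-contained sketch gzips some repetitive, record-like text with the JDK's java.util.zip classes and compares the byte counts. The sample data and the class name CompressionIoDemo are made up for illustration; real savings depend heavily on the data and the codec.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class CompressionIoDemo {

    // Builds repetitive tab-separated records, loosely resembling text job output.
    static byte[] sampleRecords(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append("user").append(i % 100).append("\tpageview\t").append(i).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compresses a byte array with gzip and returns the compressed bytes.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decompresses gzip bytes back to the original byte array.
    static byte[] gunzip(byte[] packed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = sampleRecords(100_000);
        byte[] packed = gzip(raw);
        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + packed.length);
    }
}
```

Fewer bytes written to disk and sent across the network during the shuffle is where the time is recovered, at the price of the CPU work done in gzip and gunzip.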
You can pick compression codecs, too. We use LZO, which doesn't come with Hadoop but can be found here:
https://github.com/kevinweil/hadoop-lzo
LZO compresses pretty well with minimal CPU overhead. Bzip2 compresses very well, but with more significant overhead. Gzip compresses less well with moderate overhead. (These are generalizations.) I think LZO has the best balance of characteristics.