Hadoop: how to compress mapper output but not reducer output

Posted on 2024-10-30 15:46:13


I have a map-reduce Java program in which I am trying to compress only the mapper output, not the reducer output. I thought this would be possible by setting the following properties on the Configuration instance, as listed below. However, when I run my job, the output generated by the reducer is still compressed, since the generated file is part-r-00000.gz. Has anyone successfully compressed only the mapper data and not the reducer output? Is that even possible?

//Compress mapper output

conf.setBoolean("mapred.output.compress", true);
conf.set("mapred.output.compression.type", CompressionType.BLOCK.toString());
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);


红焚 2024-11-06 15:46:13

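mapred.compress.map.output: this is the compression of the data passed between the mapper and the reducers. If you use a fast codec such as Snappy, this will most likely improve read/write speed and reduce network overhead. Don't worry about splitting here. These files are not stored in HDFS. They are temporary files that exist only for the duration of the map-reduce job.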

mapred.map.output.compression.codec: I would use Snappy.

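mapred.output.compress: this boolean flag defines whether the whole map/reduce job outputs compressed data. I would always set this to true as well, for faster read/write speeds and less disk space used.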

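mapred.output.compression.type: I use BLOCK. This makes the compression splittable for all compression formats (gzip, Snappy, and bzip2); just make sure you are using a splittable file format such as SequenceFile, RCFile, or Avro.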

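mapred.output.compression.codec: this is the compression codec for the map/reduce job output. I mostly use one of these three: Snappy (fastest r/w, 2x-3x compression), gzip (normal r/w, 5x-8x compression), bzip2 (slow r/w, 8x-12x compression).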

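Also remember that when you compress mapred output, the compression ratio will vary with your sort order because of splitting. The closer together similar data sits, the better it compresses.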
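The point about sort order can be checked without a cluster. The following is a minimal sketch using only the JDK's gzip implementation, not Hadoop's codecs; the class name, method names, and the synthetic data (2,000 distinct 64-character strings, each repeated 10 times) are assumptions made up for illustration. It compresses the same records once shuffled and once sorted and prints both sizes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class: compares gzip size of shuffled vs. sorted records.
class SortOrderCompression {

    // Builds 20,000 synthetic records: 2,000 distinct 64-character hex strings,
    // each repeated 10 times, then shuffled with a fixed seed.
    static List<String> sampleRecords() {
        Random random = new Random(42);
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 2000; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < 64; j++) {
                sb.append(Integer.toHexString(random.nextInt(16)));
            }
            String value = sb.toString();
            for (int k = 0; k < 10; k++) {
                records.add(value);
            }
        }
        Collections.shuffle(records, random);
        return records;
    }

    // Returns the gzip-compressed size in bytes of the records, one per line.
    static int gzipSize(List<String> records) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            for (String record : records) {
                gzip.write((record + "\n").getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        List<String> shuffled = sampleRecords();
        List<String> sorted = new ArrayList<>(shuffled);
        Collections.sort(sorted); // identical records become adjacent

        System.out.println("shuffled: " + gzipSize(shuffled) + " bytes");
        System.out.println("sorted:   " + gzipSize(sorted) + " bytes");
    }
}
```

In the sorted run, repeated records sit next to each other, so the compressor can refer back to the previous copy. In the shuffled run, most repeats are too far apart for gzip's limited look-back window. Real job output will behave less dramatically than this synthetic data.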


残花月 2024-11-06 15:46:13

With MR2, we should now set:

conf.setBoolean("mapreduce.map.output.compress", true);
conf.setBoolean("mapreduce.output.fileoutputformat.compress", false);

For more details, refer: http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
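To put the older `mapred.*` names and the MR2 names from this thread side by side, here is a small sketch that builds the "compress map output only" settings as plain key/value pairs. The class and method names are hypothetical. The key `mapreduce.map.output.compress.codec` is not mentioned in the thread; it is assumed here to be the MR2 counterpart of `mapred.map.output.compression.codec` and should be checked against the mapred-default.xml of your Hadoop version. The sketch deliberately does not depend on Hadoop, so each pair would still have to be applied with `conf.set(key, value)` or passed as a `-D` option to a job that parses generic options.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: collects the properties that compress intermediate
// map output while leaving the final job output uncompressed.
class MapOnlyCompressionSettings {

    // useMr2Names selects the newer mapreduce.* names over the older mapred.* ones.
    static Map<String, String> settings(boolean useMr2Names, String codecClass) {
        Map<String, String> props = new LinkedHashMap<>();
        if (useMr2Names) {
            props.put("mapreduce.map.output.compress", "true");
            // Assumed MR2 codec key; verify against mapred-default.xml.
            props.put("mapreduce.map.output.compress.codec", codecClass);
            props.put("mapreduce.output.fileoutputformat.compress", "false");
        } else {
            props.put("mapred.compress.map.output", "true");
            props.put("mapred.map.output.compression.codec", codecClass);
            props.put("mapred.output.compress", "false");
        }
        return props;
    }

    // Renders the settings as -Dkey=value command-line arguments.
    static List<String> toDashD(Map<String, String> props) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            args.add("-D" + e.getKey() + "=" + e.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        String codec = "org.apache.hadoop.io.compress.SnappyCodec";
        System.out.println(toDashD(settings(false, codec)));
        System.out.println(toDashD(settings(true, codec)));
    }
}
```

The important detail for the original question is the last entry in each branch: the final-output flag is set to false, whereas the question's code set `mapred.output.compress` to true, which is what produced part-r-00000.gz.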


大姐，你呐 2024-11-06 15:46:13

"output compression" will compress your final output. To compress map-outputs only, use something like this:

  conf.set("mapred.compress.map.output", "true");
  conf.set("mapred.output.compression.type", "BLOCK");
  conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec");


ぶ宁プ宁ぶ 2024-11-06 15:46:13

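You need to set "mapred.compress.map.output" to true.
Optionally, you can choose the compression codec by setting "mapred.map.output.compression.codec".
Note 1: map output compression should not be BLOCK. See the following JIRA for details:
https://issues.apache.org/jira/browse/HADOOP-1194
Note 2: GZIP and BZ2 are CPU intensive. If your network is slow and GZIP or BZ2 gives a better compression ratio, the extra CPU cycles may be justified. Otherwise, consider the LZO or Snappy codecs.
Note 3: if you want to use map output compression, consider installing the native codecs, which are invoked via JNI and give you better performance.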


晨曦÷微暖 2024-11-06 15:46:13

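If you use MapR's distribution of Hadoop, you can get the benefits of compression without all the fuss of using codecs.

MapR compresses natively at the file system level, so the application does not need to know or care. Compression can be turned on or off at the directory level, so you can compress inputs but not outputs, or whatever you like. In general, the compression is very fast (by default it uses an algorithm similar to Snappy), and most applications see a performance boost when using native compression. If your files are already compressed, the system detects that very quickly and turns compression off automatically, so you won't see a penalty there either.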
