Hadoop: how to compress mapper output but not reducer output
I have a MapReduce Java program in which I am trying to compress only the mapper output, not the reducer output. I thought this would be possible by setting the following properties on the Configuration instance, as listed below. However, when I run my job, the reducer output is still compressed: the generated file is part-r-00000.gz. Has anyone successfully compressed just the mapper data but not the reducer output? Is that even possible?
//Compress mapper output
conf.setBoolean("mapred.output.compress", true);
conf.set("mapred.output.compression.type", CompressionType.BLOCK.toString());
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
mapred.compress.map.output: Controls compression of the data passed between the mapper and the reducer. If you use the Snappy codec, this will most likely increase read/write speed and reduce network overhead. Don't worry about splitting here. These files are not stored in HDFS. They are temp files that exist only for the duration of the MapReduce job.
mapred.map.output.compression.codec: I would use Snappy.
mapred.output.compress: This boolean flag defines whether the whole map/reduce job will output compressed data. I would always set this to true as well: faster read/write speeds and less disk space used.
mapred.output.compression.type: I use BLOCK. This makes the compression splittable for all compression formats (gzip, Snappy, and bzip2); just make sure you're using a splittable file format such as SequenceFile, RCFile, or Avro.
mapred.output.compression.codec: This is the compression codec for the map/reduce job output. I mostly use one of these three: Snappy (fastest read/write, 2x-3x compression), gzip (normal read, fast write, 5x-8x compression), bzip2 (slow read/write, 8x-12x compression).
Also remember, when compressing mapred output, that because of splitting, the compression ratio will differ based on your sort order. The closer together similar data is, the better the compression.
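As a rough illustration of the settings this answer describes, here is a minimal sketch that collects them as plain key/value pairs. The class name `CompressEverythingSettings` is hypothetical, and pairing Snappy for the intermediate data with gzip for the final output is just one of the combinations mentioned above, not a recommendation taken from the answer. Note that this setup compresses both the map output and the reducer output, which is not what the question asks for.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: gathers the MR1-style compression properties as strings.
class CompressEverythingSettings {

    static Map<String, String> settings() {
        Map<String, String> props = new LinkedHashMap<>();

        // Intermediate (mapper -> reducer) data: temporary files, so a fast codec is enough.
        props.put("mapred.compress.map.output", "true");
        props.put("mapred.map.output.compression.codec",
                "org.apache.hadoop.io.compress.SnappyCodec");

        // Final job output written by the reducers.
        props.put("mapred.output.compress", "true");
        props.put("mapred.output.compression.type", "BLOCK");
        // Assumption: gzip chosen here; Snappy or bzip2 would be set the same way.
        props.put("mapred.output.compression.codec",
                "org.apache.hadoop.io.compress.GzipCodec");

        return props;
    }
}
```

With Hadoop on the classpath, each pair could be applied to a job's `Configuration` by calling `conf.set(key, value)` in a loop over `settings().entrySet()`.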
With MR2, we should now set the corresponding MR2 properties.
For more details, see: http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
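The snippet that accompanied this answer is not present here, so the following is only a sketch, not the answerer's code. It assumes the MR2 property names as I understand them from mapred-default.xml and Hadoop's deprecated-properties table; check them against your Hadoop version. The class name `Mr2CompressionKeys` and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps the old mapred.* compression keys to their MR2 names.
class Mr2CompressionKeys {

    // Old (MR1) property name -> new (MR2) property name.
    static final Map<String, String> RENAMED = new LinkedHashMap<>();
    static {
        RENAMED.put("mapred.compress.map.output", "mapreduce.map.output.compress");
        RENAMED.put("mapred.map.output.compression.codec", "mapreduce.map.output.compress.codec");
        RENAMED.put("mapred.output.compress", "mapreduce.output.fileoutputformat.compress");
        RENAMED.put("mapred.output.compression.type", "mapreduce.output.fileoutputformat.compress.type");
        RENAMED.put("mapred.output.compression.codec", "mapreduce.output.fileoutputformat.compress.codec");
    }

    // Rewrites any old keys to their MR2 names; unknown keys pass through unchanged.
    static Map<String, String> toMr2(Map<String, String> old) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : old.entrySet()) {
            out.put(RENAMED.getOrDefault(e.getKey(), e.getKey()), e.getValue());
        }
        return out;
    }

    // Map output compressed, final (reducer) output left uncompressed, using MR2 names.
    static Map<String, String> mapOnlyCompression() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapreduce.map.output.compress", "true");
        // Assumption: Snappy as the intermediate codec; any installed codec class works.
        props.put("mapreduce.map.output.compress.codec",
                "org.apache.hadoop.io.compress.SnappyCodec");
        props.put("mapreduce.output.fileoutputformat.compress", "false");
        return props;
    }
}
```

Each resulting pair would be applied with `conf.set(key, value)` on the job's `Configuration`.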
"Output compression" compresses your final output. To compress map outputs only, use something like this:
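The code for this answer is missing from this copy, so here is a hedged sketch of what a map-output-only setup could look like with the old `mapred.*` property names used elsewhere in this thread. The class name `MapOnlyCompressionSettings` is hypothetical, and the Snappy codec is an assumption; substitute any codec installed on your cluster.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: compress the mapper -> reducer data only, not the final output.
class MapOnlyCompressionSettings {

    static Map<String, String> settings() {
        Map<String, String> props = new LinkedHashMap<>();

        // Turn on compression of the intermediate map output.
        props.put("mapred.compress.map.output", "true");
        // Assumption: Snappy as the intermediate codec.
        props.put("mapred.map.output.compression.codec",
                "org.apache.hadoop.io.compress.SnappyCodec");

        // Leave the reducer (final) output uncompressed. In the question this
        // flag was set to true, which is why part-r-00000.gz was produced.
        props.put("mapred.output.compress", "false");

        return props;
    }
}
```

In a job driver, each pair would be passed to `conf.set(key, value)`, replacing the three `mapred.output.*` calls shown in the question.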
NOTE1: mapred output compression should never be BLOCK. See the following JIRA for details:
https://issues.apache.org/jira/browse/HADOOP-1194
NOTE2: GZIP and BZ2 are CPU-intensive. If you have a slow network and GZIP or BZ2 gives a better compression ratio, that may justify spending the CPU cycles. Otherwise, consider the LZO or Snappy codec.
NOTE3: If you want to use map output compression, consider installing the native codecs, which are invoked via JNI and give you better performance.
If you use MapR's distribution for Hadoop, you can get the benefits of compression without all the folderol with the codecs.
MapR compresses natively at the file system level, so the application doesn't need to know or care. Compression can be turned on or off at the directory level, so you can compress inputs but not outputs, or whatever you like. Generally, the compression is so fast (it uses an algorithm similar to Snappy by default) that most applications see a performance boost when using native compression. If your files are already compressed, that is detected very quickly and compression is turned off automatically, so you don't see a penalty there either.