A very basic question about Hadoop and compressed input files
I have started to look into Hadoop. If my understanding is right, I could process a very big file and it would get split over different nodes. However, if the file is compressed, it cannot be split and would need to be processed by a single node (effectively destroying the advantage of running MapReduce over a cluster of parallel machines).

My question is, assuming the above is correct, is it possible to split a large file manually into fixed-size chunks, or daily chunks, compress them, and then pass a list of compressed input files to perform a MapReduce?
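To make the idea concrete, here is a minimal driver sketch of what I mean (the paths are hypothetical, and the mapper/reducer are left as the identity defaults): each independently gzipped chunk is passed as its own input file, so every chunk gets its own map task even though gzip itself is not splittable.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressedChunksDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "process compressed chunks");
            job.setJarByClass(CompressedChunksDriver.class);
            // job.setMapperClass(...) / job.setReducerClass(...) would go here; the
            // defaults are identity classes, so this bare driver just copies input through.

            // Each pre-split, individually compressed chunk is its own input file.
            // The default TextInputFormat decompresses .gz transparently, so every
            // file becomes (at least) one map task even though gzip is not splittable.
            FileInputFormat.addInputPath(job, new Path("/data/chunk-2012-01-01.gz"));
            FileInputFormat.addInputPath(job, new Path("/data/chunk-2012-01-02.gz"));
            // ...or all chunks at once via a glob:
            // FileInputFormat.addInputPath(job, new Path("/data/chunk-*.gz"));

            FileOutputFormat.setOutputPath(job, new Path("/output/compressed-chunks"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }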
Comments (4)
BZIP2 is splittable in Hadoop - it provides a very good compression ratio, but in terms of CPU time and performance it does not give optimal results, as compression is very CPU intensive (a short driver sketch for writing splittable bzip2 output follows this list).

LZO is splittable in Hadoop - leveraging hadoop-lzo you get splittable compressed LZO files. You need external .lzo.index files to be able to process them in parallel. The library provides the means to generate these indexes in a local or distributed manner.

LZ4 is splittable in Hadoop - leveraging hadoop-4mc you get splittable compressed 4mc files. You don't need any external indexing, and you can generate archives with the provided command line tool or from Java/C code, inside or outside Hadoop. 4mc makes LZ4 available on Hadoop at any level of speed/compression ratio: from a fast mode reaching 500 MB/s compression speed up to high/ultra modes providing an increased compression ratio, almost comparable with GZIP's.
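As a minimal sketch of the bzip2 option (the helper class and method names here are mine, not part of Hadoop): BZip2Codec ships with Hadoop, so a job can write .bz2 output that a later job will be able to split across mappers again.

    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class Bzip2OutputConfig {
        // Call this from a job driver: the job's output files are written as .bz2,
        // which Hadoop can split again when they are used as input to a later job.
        // Keep in mind the CPU cost mentioned above - bzip2 trades speed for ratio.
        static void enableSplittableOutput(Job job) {
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
        }
    }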
Consider using LZO compression. It's splittable. That means a big .lzo file can be processed by many mappers. Bzip2 can do that, but it's slow.
Cloudera had an introduction about it. For MapReduce, LZO sounds like a good balance between compression ratio and compression/decompression speed.
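If it helps, here is a hedged sketch of wiring an indexed .lzo file into a job with the hadoop-lzo library (LzoTextInputFormat and its com.hadoop.mapreduce package come from that project and may differ between versions; the path is just an example):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    // From the hadoop-lzo library (not part of core Hadoop):
    import com.hadoop.mapreduce.LzoTextInputFormat;

    public class LzoInputConfig {
        // Wire a big .lzo file into a job so that it is read in parallel.
        // A matching big.lzo.index file must already exist next to it
        // (generated with hadoop-lzo's indexer tools); without the index
        // the whole file still goes to a single mapper.
        static void useLzoInput(Job job) throws java.io.IOException {
            job.setInputFormatClass(LzoTextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path("/data/big.lzo"));
        }
    }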
Yes, you could have one large compressed file, or multiple compressed files (multiple files specified with -files or the API).

TextInputFormat and descendants should automatically handle .gz compressed files. You can also implement your own InputFormat (which splits the input file into chunks for processing) and RecordReader (which extracts one record at a time from a chunk); a minimal sketch follows below.

Another alternative for generic compression might be to use a compressed file system (such as ext3 with the compression patch, ZFS, compFUSEd, or FuseCompress...).
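Here is a minimal sketch of that InputFormat/RecordReader idea, assuming the newer org.apache.hadoop.mapreduce API; the class name is made up, and it mostly reuses the built-in LineRecordReader to show where the split decision and the record extraction plug in:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    public class WholeGzInputFormat extends FileInputFormat<LongWritable, Text> {
        // Split decision: gzip streams cannot be split, so .gz files are kept as
        // one whole-file split; any other file keeps the default block-based splits.
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return !file.getName().endsWith(".gz");
        }

        // Record extraction: one text line per record; LineRecordReader also
        // decompresses the data through the codec matching the file extension.
        @Override
        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                                   TaskAttemptContext context) {
            return new LineRecordReader();
        }
    }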
You can use bz2 as your compression codec, and this format can also be split.
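You can confirm that from code: Hadoop's BZip2Codec implements the SplittableCompressionCodec interface, which is what lets the framework split .bz2 input (the file path below is only an example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplittabilityCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            // The codec is picked by file extension; .bz2 maps to BZip2Codec.
            CompressionCodec codec = factory.getCodec(new Path("logs/events.bz2"));
            // BZip2Codec implements SplittableCompressionCodec, so this prints true;
            // the same check for a .gz file would print false.
            System.out.println(codec instanceof SplittableCompressionCodec);
        }
    }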