Hadoop gzip input file using only one mapper
Possible Duplicate:
Why can't hadoop split up a large text file and then compress the splits using gzip?
I found that when using a gzipped input file, Hadoop chooses to allocate only one map task to handle my map/reduce job.
The gzipped file is more than 1.4 GB, so I would expect many mappers to run in parallel (exactly like when using an uncompressed file).
Is there any configuration I can change to improve this?
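For context, a minimal driver of the kind that exhibits this behaviour, assuming the Hadoop 2.x org.apache.hadoop.mapreduce API; the class name GzipSplitDemo and the paths are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GzipSplitDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "gzip split demo");
        job.setJarByClass(GzipSplitDemo.class);

        // Identity map/reduce is enough to observe the split behaviour.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. a 1.4 GB .gz file
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Because GzipCodec is not splittable, the whole .gz file becomes a
        // single input split, so the job launches exactly one map task.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```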
Comments (1)
Gzip files can't be split, so all the data is processed by only one map task. You have to use another compression format whose files can be split; then the data will be processed by multiple map tasks. Here is a nice article on it. (1)
Edit: Here is another article on Snappy (2), which is from Google.
(1) http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
(2) http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/
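As a quick way to see why the gzipped input ends up in a single split, you can check whether the codec Hadoop resolves for the file is splittable. A sketch against the standard Hadoop 2.x compression API (the class name CodecCheck is illustrative; this mirrors the check that FileInputFormat-based input formats perform):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class CodecCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Resolve the codec from the file extension (.gz, .bz2, ...).
        Path input = new Path(args[0]);
        CompressionCodec codec = factory.getCodec(input);

        if (codec == null) {
            System.out.println(input + ": not compressed, splittable by default");
        } else {
            // BZip2Codec implements SplittableCompressionCodec; GzipCodec does not,
            // which is why a gzipped file is handed to a single map task.
            boolean splittable = codec instanceof SplittableCompressionCodec;
            System.out.println(input + ": codec=" + codec.getClass().getSimpleName()
                    + ", splittable=" + splittable);
        }
    }
}
```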