Problem with gzip input files larger than 64 MB
I am running a Hadoop streaming job that has only mappers and no reducers. I give the job 4 input files, all gzipped so that each input file goes to exactly one mapper. Two of the gzipped input files are smaller than 64 MB, and the other two are larger than 64 MB. The job runs for nearly 40 minutes and then fails with "Error: # of failed Map Tasks exceeded allowed limit." Normally the job should take no more than 1 minute, so I am not sure why it ran for 40.
When I check the output directory, I see output for the two gzipped input files smaller than 64 MB, but no output for the two gzipped input files larger than 64 MB.
Has anybody seen this behaviour?
I see the following messages when the job is launched (I don't see them when I pass the smaller files (< 64 MB) as input to the job):
12/02/06 10:39:10 INFO mapred.FileInputFormat: Total input paths to process : 2
12/02/06 10:39:10 INFO net.NetworkTopology: Adding a new node: /10.209.191.0/10.209.191.57:1004
12/02/06 10:39:10 INFO net.NetworkTopology: Adding a new node: /10.209.191.0/10.209.191.50:1004
12/02/06 10:39:10 INFO net.NetworkTopology: Adding a new node: /10.209.186.0/10.209.186.28:1004
12/02/06 10:39:10 INFO net.NetworkTopology: Adding a new node: /10.209.188.0/10.209.188.48:1004
12/02/06 10:39:10 INFO net.NetworkTopology: Adding a new node: /10.209.185.0/10.209.185.50:1004
12/02/06 10:39:10 INFO net.NetworkTopology: Adding a new node: /10.209.188.0/10.209.188.35:1004
Comments (1)
If you have defined your own subclass of FileInputFormat, I suspect you have run into this bug:
https://issues.apache.org/jira/browse/MAPREDUCE-2094
If so, I recommend copying the implementation of the isSplitable method from TextInputFormat into your own class.
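To see why isSplitable matters for gzipped input, here is a minimal sketch that uses only the Java standard library, not Hadoop itself. The class name GzipSplitDemo and its helper methods are made up for this illustration. It assumes that a mapper handed the second split of a gzip file would start reading somewhere in the middle of the compressed stream, and shows that a plain gzip stream cannot be decoded from such an offset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; not part of Hadoop.
public class GzipSplitDemo {

    // Compress a byte array into a single gzip stream.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Try to decompress starting at the given byte offset of the gzip data.
    // Returns false if the decoder rejects the data (any IOException).
    static boolean canDecompressFrom(byte[] gz, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(gz, offset, gz.length - offset))) {
            byte[] chunk = new byte[4096];
            while (in.read(chunk) != -1) {
                // Drain the stream; we only care whether it decodes.
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Build some line-oriented sample input, similar to text records.
        StringBuilder lines = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            lines.append("record ").append(i).append('\n');
        }
        byte[] gz = gzip(lines.toString().getBytes(StandardCharsets.UTF_8));

        System.out.println("from start:  " + canDecompressFrom(gz, 0));
        System.out.println("from middle: " + canDecompressFrom(gz, gz.length / 2));
    }
}
```

Reading from offset 0 should succeed, while reading from the middle should fail because the gzip header is missing there. An input format whose isSplitable returns true for such a file would hand out splits of the second kind, which is consistent with the failures seen only for the files larger than one 64 MB block.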