Long-running map tasks on a small HDFS file
Imagine a scenario where you have a text file with, say, 10,000 rows in it, so it is very small when you store it in HDFS. Your goal is to run a map/reduce job on this small file, with every line of the text file passed to a mapper. However, the processing time for each map (k, v) pair is long, and you want to launch as many mappers on the cluster as possible to get the maximum parallelism and finish the map phase as soon as possible.
Because the file is small, it may be stored in only one or two HDFS blocks, and I assume the number of map tasks Hadoop provisions for the job will equal the number of HDFS blocks, which happens to be one or two. This is undesirable, because you want to launch as many mappers as possible. Here are my questions:
- What are my options for controlling the number of mappers in this scenario of processing a small file?
- What are the pros and cons of each available option?
1 Answer
The best approach is to use NLineInputFormat. According to the documentation, NLineInputFormat "splits N lines of input as one split", so with N = 1 every line of the file becomes its own split and gets its own map task. The con of this approach is that data locality is lost, but since the data is small it shouldn't matter.
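For illustration, here is a minimal driver sketch using the new MapReduce API (assuming Hadoop 2.x or later). The class name SmallFileDriver, the placeholder LineMapper, and the use of the command-line arguments as input/output paths are made up for the example; only NLineInputFormat and its setNumLinesPerSplit helper are the actual API being demonstrated:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFileDriver {

    // Placeholder mapper: stands in for whatever long-running per-line work is needed.
    public static class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // ... expensive processing of one line would go here ...
            context.write(line, new LongWritable(1L));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-file-long-maps");
        job.setJarByClass(SmallFileDriver.class);

        // One line per split, so each line of the small file gets its own map task.
        // Raise this number if one mapper per line is too fine-grained.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);

        job.setMapperClass(LineMapper.class);
        job.setNumReduceTasks(0);   // map-only, to keep the sketch short
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

If the driver already uses NLineInputFormat and parses generic options (ToolRunner/GenericOptionsParser), the same effect can be had without recompiling by passing -D mapreduce.input.lineinputformat.linespermap=1 on the command line.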