Long-running map tasks on a small HDFS file

Posted 2025-01-01 13:16:25

Imagine a scenario where you have a text file with, say, 10,000 rows in it, so it is very small when you save it in HDFS. Now your goal is to run a map/reduce job on this small file, expecting every line of the text file to be passed to the mapper. However, the processing time for each map (k, v) pair is long, and you want to launch as many mappers on the cluster as possible to get the maximum parallelism and finish the mapping job as soon as possible.

Because the file is small, it can be stored in only one or two HDFS blocks, and I assume the number of map tasks Hadoop provisions for the job will equal the number of HDFS blocks, which happens to be one or two (see the sketch after the questions for the default split-size arithmetic). But this is undesirable, since you want to launch as many mappers as possible. Here are my questions:

  1. What are my options for controlling the proper number of mappers in my scenario of processing a small file?
  2. What are the pros and cons of each available option?
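
To make the assumption in the question concrete, here is a minimal sketch of the default split-size arithmetic, paraphrased from Hadoop 2.x `FileInputFormat` (the minimum and maximum split sizes are tunable via `mapreduce.input.fileinputformat.split.minsize` / `.maxsize`; the 128 MB block size and ~1 MB file size below are assumed values for illustration):

```java
public class SplitSizeSketch {
    public static void main(String[] args) {
        // Defaults paraphrased from Hadoop 2.x FileInputFormat; tunable via
        // mapreduce.input.fileinputformat.split.minsize / .maxsize.
        long minSize = 1L;                    // default minimum split size
        long maxSize = Long.MAX_VALUE;        // default maximum split size
        long blockSize = 128L * 1024 * 1024;  // assumed 128 MB HDFS block

        // splitSize = max(minSize, min(maxSize, blockSize)) = 128 MB
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));

        long fileSize = 1L * 1024 * 1024;     // assumed ~1 MB, 10,000-line file
        long numSplits = (fileSize + splitSize - 1) / splitSize; // ceiling division
        System.out.println("splits = " + numSplits);             // prints: splits = 1
    }
}
```

With the defaults, the whole file fits in a single split, so the job gets a single map task no matter how many cluster slots are free.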

Comments (1)

九局 2025-01-08 13:16:25

The best approach is to use NLineInputFormat. According to its documentation, NLineInputFormat splits N lines of input into one split, so each map task receives N lines rather than one whole block. The con of this approach is that data locality is lost, but since the data is small, it shouldn't matter.
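
A minimal driver sketch showing how NLineInputFormat might be wired up (the class name, the lines-per-split value of 100, and the map-only setup are illustrative choices, not part of the original answer; `NLineInputFormat.setNumLinesPerSplit` sets how many lines each map task receives):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFileDriver {

    // Each call to map() receives one line of the input file.
    public static class LineMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Placeholder for the long-running per-line processing.
            context.write(offset, line);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-file-parallel-map");
        job.setJarByClass(SmallFileDriver.class);

        // Hand each map task 100 lines: a 10,000-line file then yields
        // about 100 map tasks instead of one per HDFS block.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        NLineInputFormat.setNumLinesPerSplit(job, 100);

        job.setMapperClass(LineMapper.class);
        job.setNumReduceTasks(0); // map-only job for this example
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The lines-per-split value trades task-scheduling overhead against parallelism: smaller values launch more mappers but pay more per-task startup cost, so it should roughly match the number of map slots you can actually get on the cluster.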
