How do I tell MapReduce how many mappers to use?
I am trying to optimize a MapReduce job for speed.
Is there any way I can tell Hadoop to use a particular number of mapper/reducer processes? Or, at least, a minimum number of mapper processes?
The documentation specifies that you can do that with the method
public void setNumMapTasks(int n)
of the JobConf class.
That approach is obsolete, though, so I am starting the job with the Job class. What is the right way of doing this?
The number of map tasks is determined by the number of blocks in the input. If the input file is 100MB and the HDFS block size is 64MB, then the input file will take 2 blocks, so 2 map tasks will be spawned. JobConf.setNumMapTasks() (1) is only a hint to the framework.
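The block arithmetic above can be sketched in plain Java (assuming the one-map-task-per-HDFS-block behavior described; 64 MB was the default block size at the time):

```java
public class SplitCount {
    // Number of input splits (and hence map tasks) derived from a file,
    // assuming one split per HDFS block.
    static long mapTasksFor(long fileSizeBytes, long blockSizeBytes) {
        // Ceiling division: a 100 MB file on 64 MB blocks spans 2 blocks.
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(mapTasksFor(100 * mb, 64 * mb)); // prints 2
    }
}
```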
The number of reducers is set by the JobConf.setNumReduceTasks() function. This determines the total number of reduce tasks for the job. Also, the mapred.tasktracker.reduce.tasks.maximum parameter determines the number of reduce tasks that can run in parallel on a single task tracker node.
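With the newer org.apache.hadoop.mapreduce API that the question mentions, the same settings look roughly like this driver sketch (the job name is a placeholder; note that the map-task count remains only a hint, while the reduce-task count is honored exactly):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Hint only: the framework may ignore this, because the actual number
// of map tasks is driven by the number of input splits.
conf.setInt("mapred.map.tasks", 10);

Job job = new Job(conf, "my-job");
// Authoritative: the job will run with exactly 5 reduce tasks.
job.setNumReduceTasks(5);
```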
You can find more information on the number of map and reduce tasks at (2).
(1) - http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks%28int%29
(2) - http://wiki.apache.org/hadoop/HowManyMapsAndReduces