I wrote a relatively simple map-reduce program on the Hadoop platform (Cloudera distribution). Besides its regular work, each Map and Reduce task writes some diagnostic information to standard output.
However, when looking at these log files, I found that the Map tasks are distributed relatively evenly among the nodes (I have 8 nodes), but the reduce tasks' standard output logs can only be found on one single machine.
I guess that means all the reduce tasks ended up executing on a single machine, which is problematic and confusing.
Does anybody have any idea what's happening here? Is it a configuration problem?
How can I make the reduce tasks also distribute evenly?
2 Answers
If the outputs from your mappers all have the same key, they will all be sent to a single reducer.
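To illustrate, here is a minimal sketch (class and key names are made up for this example) of a mapper that emits everything under one constant key; with the default hash-based partitioning, all of its output then ends up on one reducer no matter how many reduce tasks the job has:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: every record is emitted under the same key ("total"),
// so the default hash partitioning routes all map output to a single reducer.
public class ConstantKeyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final Text KEY = new Text("total");
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    context.write(KEY, ONE);   // one key -> one reducer
  }
}
```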
If your job has multiple reducers, but they all queue up on a single machine, then you have a configuration issue.
Use the web interface (http://MACHINE_NAME:50030) to monitor the job and see which reducers it has, as well as which machines are running them. There is other information you can drill into there that should help in figuring out the issue. A couple of questions about your configuration: for instance, does the machine the reducers end up on have different hardware than the other nodes?
Hadoop decides which reducer will process which output keys by using a Partitioner; by default this is the HashPartitioner, which hashes the key modulo the number of reduce tasks.
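Roughly, the default behaviour boils down to the following (a sketch, written as a standalone class for illustration):

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Approximately what the default HashPartitioner does: the key's hash code,
// made non-negative, modulo the number of reduce tasks. The same key
// therefore always lands on the same reducer.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```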
If you are only outputting a few keys and want an even distribution across your reducers, you may be better off implementing a custom Partitioner for your output data, for example:
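A minimal sketch of such a partitioner, assuming Text keys and IntWritable values; the class name, the key names and the explicit routing scheme are only illustrations:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner that spreads a small, known set of keys
// evenly over the available reducers instead of relying on hashCode().
public class FewKeysPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    String k = key.toString();
    if (k.equals("keyA")) {
      return 0;
    } else if (k.equals("keyB")) {
      return 1 % numReduceTasks;
    } else if (k.equals("keyC")) {
      return 2 % numReduceTasks;
    }
    // Fall back to hash-style partitioning for anything unexpected.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```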
You can then set this custom partitioner in the job configuration with something like:
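For instance, in the job driver (variable names and the partitioner class from the sketch above are assumptions):

```java
// In your job driver:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);               // or "new Job(conf)" on older releases
job.setPartitionerClass(FewKeysPartitioner.class);
```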
You can also implement the Configurable interface in your custom Partitioner if you want to do any further configuration based on job settings.
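A sketch of that, assuming a made-up property name ("my.partition.offset"); the framework passes the job configuration to setConf() when it instantiates the partitioner:

```java
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partitioner that reads a custom setting from the job configuration.
public class ConfigurablePartitioner extends Partitioner<Text, IntWritable>
    implements Configurable {

  private Configuration conf;
  private int offset;

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    this.offset = conf.getInt("my.partition.offset", 0);  // hypothetical property
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return ((key.hashCode() & Integer.MAX_VALUE) + offset) % numReduceTasks;
  }
}
```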
Also, check that you haven't set the number of reduce tasks to 1 anywhere in the configuration (look for "mapred.reduce.tasks") or in code, e.g.:
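Either of the following would force a single reducer (the `job` and `conf` variables are assumed to be your Job and Configuration objects):

```java
// Forces a single reduce task, i.e. one reducer on one machine.
job.setNumReduceTasks(1);

// Equivalent, via the configuration property mentioned above.
conf.set("mapred.reduce.tasks", "1");
```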