I am trying to write a new Hadoop job for input data that is somewhat skewed. An analogy for this would be the word count example in the Hadoop tutorial, except let's say one particular word is present a lot of times.
I want to have a partition function where this one key is mapped to multiple reducers and the remaining keys are distributed according to the usual hash partitioning. Is this possible?
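Roughly what I have in mind is something like the sketch below (the hot word, the fan-out factor, and the class name are just placeholders; I have not tried this):

```java
// Hypothetical sketch only -- HOT_WORD and FAN_OUT are placeholders for the
// skewed key and the number of reducers it should be spread over.
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class SkewAwarePartitioner implements Partitioner<Text, IntWritable> {

    private static final String HOT_WORD = "the"; // the one very frequent word
    private static final int FAN_OUT = 4;         // spread it over 4 reducers
    private final Random random = new Random();

    public void configure(JobConf job) { }

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.toString().equals(HOT_WORD)) {
            // send records of the hot key to one of the first FAN_OUT reducers
            return random.nextInt(Math.min(FAN_OUT, numPartitions));
        }
        // everything else keeps the usual hash partitioning
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```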
Thanks in advance.
I don't think that in Hadoop the same key can be mapped to multiple reducers. But the keys can be partitioned so that the reducers are more or less evenly loaded. For this, the input data should be sampled and the keys partitioned appropriately. Check the Yahoo Paper for more details on the custom partitioner. The Yahoo Sort code is in the org.apache.hadoop.examples.terasort package.
Let's say key A has 10 rows, B has 20 rows, C has 30 rows, and D has 60 rows in the input. Then keys A, B, and C can be sent to reducer 1 and key D can be sent to reducer 2, so that the load on the reducers is evenly distributed. To partition the keys this way, input sampling has to be done to learn how the keys are distributed.
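As a rough sketch of such a partitioner (the key-to-reducer assignment below is hard-coded to match the example above; in a real job it would be derived from the sampling step, the way TeraSort builds its partition list):

```java
// Illustrative sketch only -- the key routing is hard-coded; in practice it
// would come from sampling the input.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class SampledPartitioner implements Partitioner<Text, IntWritable> {

    public void configure(JobConf job) { }

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        // A, B, and C together are about as heavy as D alone, so they share
        // one reducer (partitions are zero-based: "reducer 1" above is 0 here).
        if (k.equals("A") || k.equals("B") || k.equals("C")) {
            return 0;
        }
        if (k.equals("D")) {
            return numPartitions > 1 ? 1 : 0;
        }
        // any other key falls back to hash partitioning
        return (k.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```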
Here are some more suggestions to make the job complete faster.
Specify a combiner on the JobConf to reduce the number of keys sent to the reducer. This also reduces the network traffic between the mapper and reducer tasks. Note, though, that there is no guarantee the combiner will be invoked by the Hadoop framework.
Also, since the data is skewed (some of the keys are repeated again and again, let's say 'tools'), you might want to increase the number of reduce tasks to complete the job faster. This ensures that while one reducer is processing 'tools', the other data is being processed by other reducers in parallel.
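A minimal, self-contained sketch of these two settings on the old mapred API (the reduce-task count of 20 is just an illustrative value):

```java
// Word-count style job showing setCombinerClass and setNumReduceTasks.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountWithCombiner {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final IntWritable one = new IntWritable(1);
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
                out.collect(new Text(tok.nextToken()), one);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            out.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountWithCombiner.class);
        conf.setJobName("wordcount-with-combiner");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        // Map-side partial reduce; the framework may skip it, so correctness
        // must not depend on the combiner running.
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        // More reduce tasks so the reducer stuck with the hot key ('tools')
        // does not hold up the rest of the work; 20 is just an example value.
        conf.setNumReduceTasks(20);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```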
If you split your data over multiple reducers for performance reasons then you need a second reducer to aggregate the data into the final result set.
Hadoop has a feature built in that does something like that: the combiner.
The combiner is a "reducer" kind of functionality.
This ensures that a partial reduce of the data can be done within the map task, which reduces the number of records that need to be processed later on.
In the basic wordcount example the combiner is exactly the same as the reducer.
Note that for some algorithms you will need different implementations of these two.
I've also had a project where a combiner was not possible because of the algorithm.
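For the word-count analogy, a hedged sketch of that split-then-aggregate idea could look like this (the hot word 'tools', the salt format, and the class names are made up for illustration; both passes would use the usual summing reducer, with pass 2 reading the text output of pass 1):

```java
// Two-pass sketch: pass 1 salts the hot key so it spreads over several
// reducers, pass 2 strips the salt so partial counts get merged.
import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SaltedWordCount {

    /** Pass 1: append a random suffix to the hot key. */
    public static class SaltingMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static final String HOT_WORD = "tools";
        private static final int FAN_OUT = 4;
        private final Random random = new Random();
        private final IntWritable one = new IntWritable(1);

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
            for (String word : line.toString().split("\\s+")) {
                if (word.equals(HOT_WORD)) {
                    // "tools#0" .. "tools#3" hash to different reducers
                    out.collect(new Text(word + "#" + random.nextInt(FAN_OUT)), one);
                } else {
                    out.collect(new Text(word), one);
                }
            }
        }
    }

    /** Pass 2: read pass-1 output lines like "tools#2&lt;TAB&gt;57", strip the salt,
     *  and re-emit the partial count so the same summing reducer can produce
     *  one final total per word. */
    public static class UnsaltingMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
            String[] parts = line.toString().split("\t");
            String word = parts[0].replaceFirst("#\\d+$", "");
            out.collect(new Text(word), new IntWritable(Integer.parseInt(parts[1])));
        }
    }
}
```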