Why is there no mapper/reducer for Hadoop TeraSort?

Posted on 2024-11-18 10:31:45

I am planning to insert some code into the mapper of the TeraSort class in Hadoop 0.20.2. However, after reviewing the source code, I cannot find where the mapper is implemented.
Normally there is a call to job.setMapperClass() that names the mapper class. For TeraSort, however, I can only see calls such as setInputFormat and setOutputFormat. I cannot find where the map and reduce methods are called.
Can anyone give some hints about this? Thanks.
The source code looks like this:

public int run(String[] args) throws Exception {
   LOG.info("starting");
   JobConf job = (JobConf) getConf();
   Path inputDir = new Path(args[0]);
   inputDir = inputDir.makeQualified(inputDir.getFileSystem(job));
   Path partitionFile = new Path(inputDir, TeraInputFormat.PARTITION_FILENAME);
   URI partitionUri = new URI(partitionFile.toString() +
                           "#" + TeraInputFormat.PARTITION_FILENAME);
   TeraInputFormat.setInputPaths(job, new Path(args[0]));
   FileOutputFormat.setOutputPath(job, new Path(args[1]));
   job.setJobName("TeraSort");
   job.setJarByClass(TeraSort.class);
   job.setOutputKeyClass(Text.class);
   job.setOutputValueClass(Text.class);
   job.setInputFormat(TeraInputFormat.class);
   job.setOutputFormat(TeraOutputFormat.class);
   job.setPartitionerClass(TotalOrderPartitioner.class);
   TeraInputFormat.writePartitionFile(job, partitionFile);
   DistributedCache.addCacheFile(partitionUri, job);
   DistributedCache.createSymlink(job);
   job.setInt("dfs.replication", 1);
   // TeraOutputFormat.setFinalSync(job, true);
   job.setNumReduceTasks(0);
   JobClient.runJob(job);
   LOG.info("done");
   return 0;
 }

For other classes, such as TeraValidate, we can find code like this:

job.setMapperClass(ValidateMapper.class);
job.setReducerClass(ValidateReducer.class);

I cannot see such methods for TeraSort.

Thanks,

Comments (2)

过期情话 2024-11-25 10:31:46

Thomas's answer is right: the mapper and reducer are identity functions, because the shuffled data is already sorted before your reduce function is applied. What is special about TeraSort is its custom partitioner (which is not the default hash function). You can read more about it in Hadoop's implementation for Terasort. It states:

"TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N − 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i − 1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i+1."

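The range rule in that quote can be illustrated with a small stand-alone sketch. It is not Hadoop's TotalOrderPartitioner; the class name RangePartitionSketch, the method partitionFor, and the sample keys are hypothetical, and it assumes the sampled split points are sorted and distinct.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch; this is NOT Hadoop's TotalOrderPartitioner.
class RangePartitionSketch {

    // Returns the reduce index i such that sample[i - 1] <= key < sample[i],
    // where `samples` holds N - 1 sorted, distinct split points for N reduces.
    static int partitionFor(String[] samples, String key) {
        int pos = Arrays.binarySearch(samples, key);
        // Exact match at index p: key == sample[p], so it goes to reduce p + 1.
        // No match: binarySearch returns -(insertionPoint) - 1, and the
        // insertion point is already the reduce index.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        String[] samples = {"g", "p"}; // 2 split points -> 3 reduces (assumed data)
        for (String key : new String[] {"apple", "g", "kiwi", "p", "zebra"}) {
            System.out.println(key + " -> reduce " + partitionFor(samples, key));
        }
    }
}
```

Because every key sent to reduce i is smaller than every key sent to reduce i+1, concatenating the sorted reduce outputs in index order gives a globally sorted result.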
世界等同你 2024-11-25 10:31:45

Why should a sort need to set Mapper and Reducer classes at all?

The defaults are the standard Mapper (formerly the IdentityMapper) and the standard Reducer, both of which pass records through unchanged.
These are the classes you usually inherit from.

In other words, the job just emits everything from the input and lets Hadoop do its own sorting between the map and reduce phases. So the sort happens by "default".
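As a rough illustration of that idea, here is a single-process sketch that uses no Hadoop classes. The class name IdentitySortSketch and the input records are hypothetical, and it assumes a single reduce so that partitioning can be left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical single-process simulation; no Hadoop classes are used.
class IdentitySortSketch {

    // Identity map -> sort (stand-in for the shuffle) -> identity reduce.
    static List<String> run(List<String> input) {
        List<String> mapOutput = new ArrayList<>();
        for (String record : input) {
            mapOutput.add(record); // identity map: emit the record unchanged
        }

        // The framework sorts map output by key before reduce sees it.
        Collections.sort(mapOutput);

        List<String> reduceOutput = new ArrayList<>();
        for (String key : mapOutput) {
            reduceOutput.add(key); // identity reduce: emit the key unchanged
        }
        return reduceOutput;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("pear", "apple", "zebra", "kiwi", "fig");
        System.out.println(run(input));
    }
}
```

Neither the map step nor the reduce step contains any sorting logic here; the ordering comes entirely from the step in between, which is the part the framework provides.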
