Why is there no mapper/reducer for Hadoop TeraSort?

Posted on 2024-11-18 10:31:45


I am planning to insert some code into the mapper of the TeraSort class in Hadoop 0.20.2. However, after reviewing the source code, I cannot locate where the mapper is implemented.
Normally there is a call to job.setMapperClass() that names the mapper class. For TeraSort, however, I can only see calls like setInputFormat and setOutputFormat. I cannot find where the map and reduce methods are called.
Can anyone give some hints about this? Thanks.
The source code looks like this:

public int run(String[] args) throws Exception {
   LOG.info("starting");
   JobConf job = (JobConf) getConf();
   Path inputDir = new Path(args[0]);
   inputDir = inputDir.makeQualified(inputDir.getFileSystem(job));
   Path partitionFile = new Path(inputDir, TeraInputFormat.PARTITION_FILENAME);
   URI partitionUri = new URI(partitionFile.toString() +
                           "#" + TeraInputFormat.PARTITION_FILENAME);
   TeraInputFormat.setInputPaths(job, new Path(args[0]));
   FileOutputFormat.setOutputPath(job, new Path(args[1]));
   job.setJobName("TeraSort");
   job.setJarByClass(TeraSort.class);
   job.setOutputKeyClass(Text.class);
   job.setOutputValueClass(Text.class);
   job.setInputFormat(TeraInputFormat.class);
   job.setOutputFormat(TeraOutputFormat.class);
   job.setPartitionerClass(TotalOrderPartitioner.class);
   TeraInputFormat.writePartitionFile(job, partitionFile);
   DistributedCache.addCacheFile(partitionUri, job);
   DistributedCache.createSymlink(job);
   job.setInt("dfs.replication", 1);
   // TeraOutputFormat.setFinalSync(job, true);
   job.setNumReduceTasks(0);
   JobClient.runJob(job);
   LOG.info("done");
   return 0;
 }

For other classes, such as TeraValidate, we can find code like this:

job.setMapperClass(ValidateMapper.class);
job.setReducerClass(ValidateReducer.class);

I cannot see such methods for TeraSort.

Thanks,


过期情话 2024-11-25 10:31:46

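Thomas's answer is right: the mapper and reducer are the identity, because the shuffled data is sorted before the reduce function is applied. What makes TeraSort special is its custom partitioner (rather than the default hash function). You should read more about it in "Hadoop's Terasort implementation", which states:

"TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N − 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i − 1] <= key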
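The range partitioning described in the quote can be sketched in plain Java. This is a minimal illustration, not Hadoop's TotalOrderPartitioner: the class name RangePartitionSketch and the method partitionFor are hypothetical, the upper bound of each range (key < sample[i]) is my assumption since the quote above is cut off, and keys are compared with String.compareTo rather than the byte-wise comparison Hadoop's Text uses. Split points are assumed to be sorted and distinct.

```java
import java.util.Arrays;

// Hypothetical sketch of range partitioning by sampled split points.
// It is NOT Hadoop's TotalOrderPartitioner; it only illustrates the idea.
final class RangePartitionSketch {

    // splitPoints: N - 1 sorted, distinct sampled keys that define N partitions.
    // Assumed rule: partition i holds keys with splitPoints[i-1] <= key < splitPoints[i].
    static int partitionFor(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // Found at index p: key equals splitPoints[p], so it starts partition p + 1.
        // Not found: binarySearch returns -(insertionPoint) - 1, and the insertion
        // point is the number of split points smaller than the key.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        String[] splitPoints = {"g", "p"}; // 2 sampled keys define 3 partitions
        for (String key : new String[] {"apple", "g", "kiwi", "p", "zebra"}) {
            System.out.println(key + " -> reduce " + partitionFor(key, splitPoints));
        }
    }
}
```

Because every key in partition i sorts before every key in partition i + 1, concatenating the sorted reducer outputs in partition order yields a globally sorted result, which is why no sorting logic is needed in the map or reduce functions themselves.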


世界等同你 2024-11-25 10:31:45

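Why would a sort need Mapper and Reducer classes to be set?

The defaults are the standard Mapper (formerly the identity mapper) and the standard Reducer.
These are the classes you normally inherit from.

Basically, you just emit everything from the input and let Hadoop do the sorting itself. So the sorting happens "by default".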
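To make the "sorting by default" point concrete, here is a small plain-Java simulation of a job with an identity map and an identity reduce. It is only a sketch under stated assumptions: the class IdentitySortSketch and its run method are hypothetical, no Hadoop classes are involved, and a TreeMap stands in for the framework's shuffle-and-sort step. As far as I know, in the old org.apache.hadoop.mapred API a JobConf without setMapperClass/setReducerClass falls back to IdentityMapper and IdentityReducer, which is the behaviour imitated here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of identity map + framework sort + identity reduce.
final class IdentitySortSketch {

    // Each record is a {key, value} pair; the result lines are "key<TAB>value".
    static List<String> run(List<String[]> records) {
        // Map phase: identity, every (key, value) is passed through unchanged.
        // Shuffle phase: the TreeMap groups values by key in sorted key order,
        // standing in for the sort the framework performs before reduce.
        TreeMap<String, List<String>> shuffled = new TreeMap<>();
        for (String[] kv : records) {
            shuffled.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }

        // Reduce phase: identity, every value is written out under its key.
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : shuffled.entrySet()) {
            for (String value : entry.getValue()) {
                output.add(entry.getKey() + "\t" + value);
            }
        }
        return output;
    }
}
```

Neither phase contains any comparison logic, yet the output comes back ordered by key. To run custom code in the mapper of the real job, one would presumably write a Mapper class that does the extra work and then emits its input unchanged, and register it with job.setMapperClass().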


About the author

述情
