In Hadoop, where does the framework save the output of Map tasks in a normal Map-Reduce application?

Posted on 2024-12-22 17:21:27

I am trying to find out where the output of a Map task is saved to disk before it can be used by a Reduce task.

Note: the version used is Hadoop 0.20.204 with the new API.

For example, when overriding the map method in the Mapper class:

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // `word` (a Text field) and `one` (an IntWritable field) are assumed
    // to be declared on the enclosing Mapper class, as in the standard
    // WordCount example.
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
    }

    // code that starts a new Job.
}

I am interested in finding out where context.write() ends up writing the data. So far I've run into:

FileOutputFormat.getWorkOutputPath(context);

Which gives me the following location on HDFS:

hdfs://localhost:9000/tmp/outputs/1/_temporary/_attempt_201112221334_0001_m_000000_0

When I try to use it as input for another job, it gives me the following error:

org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:9000/tmp/outputs/1/_temporary/_attempt_201112221334_0001_m_000000_0

Note: the job is started in the Mapper, so technically, the temporary folder where the Mapper task is writing its output exists when the new job begins. Then again, it still says that the input path does not exist.

Any ideas as to where the temporary output is written? Or where I can find the output of a Map task during a job that has both a Map and a Reduce stage?
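
For context, a sketch of how the path above can be obtained from inside the mapper; the probe class, the placement in setup(), and the logging call are illustrative assumptions, not from the original post:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PathProbeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Resolves the task attempt's temporary directory under the job's
        // output path, e.g. <output>/_temporary/_attempt_..._m_000000_0.
        // Note: this is where FileOutputFormat stages *final* output, not
        // where the shuffle's intermediate map output is kept.
        Path attemptWorkDir = FileOutputFormat.getWorkOutputPath(context);
        System.err.println("Task attempt work dir: " + attemptWorkDir);
    }
}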


Comments (3)

回眸一笑 2024-12-29 17:21:27


The MapReduce framework stores intermediate output on the local disk rather than in HDFS, as storing it in HDFS would cause unnecessary replication of the files.
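
To make this concrete, a minimal sketch (the class name and fallback default are my assumptions, not from the answer) that prints which local directories a 0.20.x node uses for intermediate data, via the mapred.local.dir property:

import org.apache.hadoop.conf.Configuration;

public class ShowMapredLocalDir {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // In 0.20.x, map-side spill files are written under one of the
        // comma-separated directories listed in mapred.local.dir.
        System.out.println(conf.get("mapred.local.dir",
                "${hadoop.tmp.dir}/mapred/local"));
    }
}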

雨落□心尘 2024-12-29 17:21:27

So, I've figured out what is really going on.

The output of the mapper is buffered in memory until the buffer fills to about 80% of its capacity; at that point it begins spilling the contents to local disk while continuing to accept records into the buffer.

I wanted to get the intermediate output of the mapper and use it as input for another job while the mapper was still running. It turns out that this is not possible without heavily modifying the Hadoop 0.20.204 deployment. The way the system works, even after everything specified in the map lifecycle:

// Sketch of the new-API Mapper lifecycle (essentially Mapper.run()):
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}

has run, including the cleanup(context) call, there is still no dumping to the temporary folder.

After the whole Map computation, everything eventually gets merged, dumped to disk, and becomes the input for the shuffle and sort stages that precede the Reducer.

From everything I've read and looked at so far, the temporary folder where the output should eventually appear is the one I was guessing beforehand:

FileOutputFormat.getWorkOutputPath(context)

I managed to do what I wanted in a different way. Anyway, if there are any questions about this, let me know.
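
For reference, the ~80% figure corresponds to the spill threshold of the map-side sort buffer. A hedged sketch of the relevant 0.20.x properties (the values shown are the defaults; the helper class is illustrative):

import org.apache.hadoop.conf.Configuration;

public class SpillTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        // io.sort.mb: size in MB of the in-memory buffer that collects
        // map output records before they are sorted and spilled.
        conf.setInt("io.sort.mb", 100);
        // io.sort.spill.percent: buffer fill fraction at which a background
        // thread starts spilling sorted runs to local disk (the ~80% above).
        conf.setFloat("io.sort.spill.percent", 0.80f);
        return conf;
    }
}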

擦肩而过的背影 2024-12-29 17:21:27

The task tracker starts a separate JVM process for every Map or Reduce task.

Mapper output (intermediate data) is written to the local file system (not HDFS) of each mapper node. Once the data has been transferred to the Reducer, these temporary files can no longer be accessed.

If you want to see your Mapper output, I suggest using an IdentityReducer.
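
With the new (mapreduce) API used in the question there is no separate IdentityReducer class; the base Reducer already passes records through unchanged. A minimal sketch under that assumption (the helper class and job name are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InspectMapOutput {
    public static Job build(Configuration conf,
                            Class<? extends Mapper> mapperClass) throws Exception {
        Job job = new Job(conf, "inspect-map-output");
        job.setJarByClass(InspectMapOutput.class);
        job.setMapperClass(mapperClass);
        // The base Reducer is an identity: map output is written to HDFS as-is.
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Alternatively, skip the reduce stage entirely; with zero reducers
        // the map output goes straight to the job's output directory on HDFS:
        // job.setNumReduceTasks(0);
        return job;
    }
}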
