Where should a Map task's temporary files be placed when running under Hadoop?
I am running Hadoop 0.20.1 under SLES 10 (SUSE).
My Map task takes a file and generates a few more files, and I then generate my results from those files. I would like to know where I should place these files so that performance is good and there are no collisions. If Hadoop can delete the directory automatically, that would be nice.
Right now I am using the temp folder and the task ID to create a unique folder, and then working within subfolders of that folder.
String reduceTaskId = job.get("mapred.task.id");
String reduceTempDir = job.get("mapred.temp.dir");
String myTemporaryFoldername = reduceTempDir + File.separator + reduceTaskId;
File diseaseParent = new File(myTemporaryFoldername, REDUCE_WORK_FOLDER);
The problem with this approach is that I am not sure it is optimal; I also have to delete each new folder myself, or I start to run out of space.
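Since nothing in this scheme cleans the folder up automatically, one option (a sketch, not from the original post) is a small recursive-delete helper invoked from a finally block, so the scratch directory is removed even if the task throws:

```java
import java.io.File;
import java.io.IOException;

public class ScratchDirCleanup {

    // Recursively delete a directory and everything under it.
    static boolean deleteRecursively(File dir) {
        File[] children = dir.listFiles();
        if (children != null) {           // null for plain files or I/O errors
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        return dir.delete();
    }

    public static void main(String[] args) throws IOException {
        // Simulate a per-task scratch folder with nested content.
        File scratch = new File(System.getProperty("java.io.tmpdir"),
                                "task_scratch_demo");
        File sub = new File(scratch, "work");
        sub.mkdirs();
        new File(sub, "intermediate.dat").createNewFile();
        try {
            // ... map-task work that writes into the scratch folder ...
        } finally {
            deleteRecursively(scratch);   // always runs, even on failure
        }
        System.out.println(scratch.exists()); // false after cleanup
    }
}
```

The class and folder names here are made up for illustration; the point is only that the cleanup belongs in a finally block so disk space is reclaimed on both success and failure.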
Thanks
akintayo
(edit)
I found that the best place to keep files that you don't want to outlive the map task is job.get("job.local.dir"), which provides a path that will be deleted when the map task finishes. I am not sure whether the deletion is done on a per-key basis or per tasktracker.
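As a side note, the string concatenation in the snippet above produces a doubled File.separator; the two-argument File constructor avoids that. A minimal sketch (the directory and task ID values are made up, since outside a running task there is no JobConf to read them from):

```java
import java.io.File;

public class LocalDirPath {
    public static void main(String[] args) {
        // In a real task these would come from the JobConf, e.g.
        //   String localDir = job.get("job.local.dir");
        // Hypothetical values, used only to show the path construction:
        String localDir = File.separator + "tmp" + File.separator + "hadoop-local";
        String taskId = "attempt_200901010000_0001_m_000000_0";

        // File(parent, child) inserts exactly one separator between the
        // two parts, so no doubled separators can sneak in.
        File taskScratch = new File(localDir, taskId);
        File workDir = new File(taskScratch, "reduce_work");
        System.out.println(workDir.getPath());
    }
}
```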
Comments (1)
The problem with that approach is that the sort and shuffle phase is going to move your data away from where it was localized.
I do not know much about your data, but the distributed cache might work well for you:
${mapred.local.dir}/taskTracker/archive/ : the distributed cache. This directory holds the localized distributed cache; the localized distributed cache is thus shared among all tasks and jobs.
http://www.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/
"It is common for a MapReduce program to require one or more files to be read by each map or reduce task before execution. For example, you may have a lookup table that needs to be parsed before processing a set of records. To address this scenario, Hadoop’s MapReduce implementation includes a distributed file cache that will manage copying your file(s) out to the task execution nodes.
The DistributedCache was introduced in Hadoop 0.7.0; see HADOOP-288 for more detail on its origins. There is a great deal of existing documentation for the DistributedCache: see the Hadoop FAQ, the MapReduce Tutorial, the Hadoop Javadoc, and the Hadoop Streaming Tutorial. Once you’ve read the existing documentation and understand how to use the DistributedCache, come on back."
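Putting the blog's description into code, here is a minimal sketch of the 0.20-era DistributedCache API. The HDFS path and class name are hypothetical, and this requires the Hadoop jars and a cluster to actually run, so it is illustration only:

```java
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CacheExample.class);
        // Ship a file already in HDFS to every task node;
        // the path here is made up for the example.
        DistributedCache.addCacheFile(new URI("/user/akintayo/lookup.txt"), conf);
        // ... set mapper/reducer classes and submit the job as usual ...
    }
}

// Then, inside the mapper's configure(JobConf job), read back the
// node-local copy:
//   Path[] cached = DistributedCache.getLocalCacheFiles(job);
//   // cached[0] points at the local copy of lookup.txt
```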