Where should a Map task's temporary files be placed when running under Hadoop?
I am running Hadoop 0.20.1 under SLES 10 (SUSE).
My Map task takes a file and generates a few more files, and I then generate my results from those files. I would like to know where I should place these files so that performance is good and there are no collisions. If Hadoop can delete the directory automatically, that would be nice.
Right now I am using the temp folder and the task ID to create a unique folder, and then working within subfolders of that folder.
String reduceTaskId = job.get("mapred.task.id");
String reduceTempDir = job.get("mapred.temp.dir");
String myTemporaryFoldername = reduceTempDir + File.separator + reduceTaskId;
File diseaseParent = new File(myTemporaryFoldername, REDUCE_WORK_FOLDER);
The problem with this approach is that I am not sure it is optimal; I also have to delete each new folder myself, or I start to run out of space.
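Since nothing in this scheme cleans the folder up automatically, one option (a sketch, not from the original post) is a small recursive-delete helper invoked from a finally block, so the scratch directory is removed even if the task throws:

```java
import java.io.File;
import java.io.IOException;

public class ScratchDirCleanup {

    // Recursively delete a directory and everything under it.
    static boolean deleteRecursively(File dir) {
        File[] children = dir.listFiles();
        if (children != null) {           // null for plain files or I/O errors
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        return dir.delete();
    }

    public static void main(String[] args) throws IOException {
        // Simulate a per-task scratch folder with nested content.
        File scratch = new File(System.getProperty("java.io.tmpdir"),
                                "task_scratch_demo");
        File sub = new File(scratch, "work");
        sub.mkdirs();
        new File(sub, "intermediate.dat").createNewFile();
        try {
            // ... map-task work that writes into the scratch folder ...
        } finally {
            deleteRecursively(scratch);   // always runs, even on failure
        }
        System.out.println(scratch.exists()); // false after cleanup
    }
}
```

The class and folder names here are made up for illustration; the point is only that the cleanup belongs in a finally block so disk space is reclaimed on both success and failure.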
Thanks
akintayo
(edit)
I found that the best place to keep files that you don't want to outlive the map task is job.get("job.local.dir"), which provides a path that will be deleted when the map task finishes. I am not sure whether the deletion is done on a per-key basis or per tasktracker.
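As a side note, the string concatenation in the snippet above produces a doubled File.separator; the two-argument File constructor avoids that. A minimal sketch (the directory and task ID values are made up, since outside a running task there is no JobConf to read them from):

```java
import java.io.File;

public class LocalDirPath {
    public static void main(String[] args) {
        // In a real task these would come from the JobConf, e.g.
        //   String localDir = job.get("job.local.dir");
        // Hypothetical values, used only to show the path construction:
        String localDir = File.separator + "tmp" + File.separator + "hadoop-local";
        String taskId = "attempt_200901010000_0001_m_000000_0";

        // File(parent, child) inserts exactly one separator between the
        // two parts, so no doubled separators can sneak in.
        File taskScratch = new File(localDir, taskId);
        File workDir = new File(taskScratch, "reduce_work");
        System.out.println(workDir.getPath());
    }
}
```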
Comments (1)
The problem with that approach is that the sort and shuffle phase is going to move your data away from where it was localized.
I do not know much about your data, but the distributed cache might work well for you:
${mapred.local.dir}/taskTracker/archive/ : the distributed cache. This directory holds the localized distributed cache; the localized distributed cache is thus shared among all tasks and jobs.
http://www.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/
"It is common for a MapReduce program to require one or more files to be read by each map or reduce task before execution. For example, you may have a lookup table that needs to be parsed before processing a set of records. To address this scenario, Hadoop’s MapReduce implementation includes a distributed file cache that will manage copying your file(s) out to the task execution nodes.
The DistributedCache was introduced in Hadoop 0.7.0; see HADOOP-288 for more detail on its origins. There is a great deal of existing documentation for the DistributedCache: see the Hadoop FAQ, the MapReduce Tutorial, the Hadoop Javadoc, and the Hadoop Streaming Tutorial. Once you’ve read the existing documentation and understand how to use the DistributedCache, come on back."
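Putting the blog's description into code, here is a minimal sketch of the 0.20-era DistributedCache API. The HDFS path and class name are hypothetical, and this requires the Hadoop jars and a cluster to actually run, so it is illustration only:

```java
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CacheExample.class);
        // Ship a file already in HDFS to every task node;
        // the path here is made up for the example.
        DistributedCache.addCacheFile(new URI("/user/akintayo/lookup.txt"), conf);
        // ... set mapper/reducer classes and submit the job as usual ...
    }
}

// Then, inside the mapper's configure(JobConf job), read back the
// node-local copy:
//   Path[] cached = DistributedCache.getLocalCacheFiles(job);
//   // cached[0] points at the local copy of lookup.txt
```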