Is the InputFormat responsible for implementing data locality in Hadoop's MapReduce?

Posted on 2024-11-09 16:50:23

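I'm trying to understand data locality as it relates to Hadoop's Map/Reduce framework. In particular, I'm trying to understand which component handles data locality (i.e. is it the input format?).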

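Yahoo's Developer Network page states: "The Hadoop framework then schedules these processes in proximity to the location of data/records using knowledge from the distributed file system." This seems to imply that the HDFS input format probably queries the name node to determine which nodes contain the desired data, and starts the map tasks on those nodes where possible. One could imagine a similar approach for HBase, querying to determine which regions are serving certain records.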

If a developer writes their own input format, would they be responsible for implementing data locality?

扬花落满肩 2024-11-16 16:50:23

You're right. If you look at the FileInputFormat class and its getSplits() method, you'll see that it looks up the block locations:

BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);

This implies a FileSystem query. It happens inside the JobClient, and the results are written into a SequenceFile (actually it's just raw bytes).
The JobTracker reads this file later on while initializing the job and pretty much just assigns a task to each input split.

But the distribution of the data itself is the NameNode's job.

Now to your question:
Normally you extend FileInputFormat. You are then required to return a list of InputSplit objects, and when you construct each split you have to set the locations of that split. For example, FileSplit:

public FileSplit(Path file, long start, long length, String[] hosts)

So you don't actually implement data locality yourself; you just tell the framework on which hosts the split can be found. This is easy to query through the FileSystem interface.
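
To make that division of labour concrete, here is a minimal plain-Java sketch (no Hadoop classes involved) of what an input format contributes: it cuts a file into splits and tags each split with the hosts of the block containing the split's start offset. Block, Split, hostsAt and makeSplits are hypothetical names standing in for BlockLocation, FileSplit and the logic in getSplits(); the real FileInputFormat handles more cases than this.

```java
import java.util.ArrayList;
import java.util.List;

class SplitHostsSketch {
    // Hypothetical stand-in for BlockLocation: a byte range and the hosts storing it.
    static final class Block {
        final long offset;
        final long length;
        final List<String> hosts;
        Block(long offset, long length, List<String> hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    // Hypothetical stand-in for FileSplit: a byte range and the hosts it advertises.
    static final class Split {
        final long start;
        final long length;
        final List<String> hosts;
        Split(long start, long length, List<String> hosts) {
            this.start = start;
            this.length = length;
            this.hosts = hosts;
        }
    }

    // Returns the hosts of the block that contains the given file offset.
    static List<String> hostsAt(List<Block> blocks, long offset) {
        for (Block b : blocks) {
            if (offset >= b.offset && offset < b.offset + b.length) {
                return b.hosts;
            }
        }
        throw new IllegalArgumentException("offset " + offset + " is outside the file");
    }

    // Cuts the file into fixed-size splits and tags each one with the hosts
    // of the block holding its first byte. No scheduling happens here.
    static List<Split> makeSplits(List<Block> blocks, long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            long length = Math.min(splitSize, fileLength - start);
            splits.add(new Split(start, length, hostsAt(blocks, start)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Block> blocks = List.of(
            new Block(0, 128, List.of("node1", "node2")),
            new Block(128, 72, List.of("node3")));
        for (Split s : makeSplits(blocks, 200, 128)) {
            System.out.println(s.start + "+" + s.length + " -> " + s.hosts);
        }
    }
}
```

Note that nothing in the sketch decides where a task runs. The split only carries a host list as a hint, which is the same role the hosts argument plays in the FileSplit constructor.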

波浪屿的海角声 2024-11-16 16:50:23

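My understanding is that data locality is determined jointly by HDFS and the InputFormat. The former determines (via rack awareness) and stores the locations of HDFS blocks across the DataNodes, while the latter determines which blocks are associated with which split. The JobTracker then tries to optimize which split is handed to which map task (one split per map task) by making sure that the blocks associated with each split are local to the TaskTracker running it.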
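
The scheduling side can be sketched in the same spirit. The following is a simplified plain-Java model, not JobTracker code: when a host asks for work, it prefers a pending split that lists that host, then one on the same rack, then anything left. PendingSplit, pick, sameRack and the rackOf map are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class LocalityPickSketch {
    // Hypothetical pending map task: an id plus the hosts its split advertised.
    static final class PendingSplit {
        final String id;
        final List<String> hosts;
        PendingSplit(String id, List<String> hosts) {
            this.id = id;
            this.hosts = hosts;
        }
    }

    // True when any host of the split sits on the same rack as the requesting host.
    static boolean sameRack(PendingSplit s, String host, Map<String, String> rackOf) {
        String rack = rackOf.get(host);
        if (rack == null) {
            return false;
        }
        for (String h : s.hosts) {
            if (rack.equals(rackOf.get(h))) {
                return true;
            }
        }
        return false;
    }

    // Removes and returns a split for the requesting host: node-local first,
    // then rack-local, then the first remaining one. Returns null if none are pending.
    static PendingSplit pick(List<PendingSplit> pending, String host, Map<String, String> rackOf) {
        PendingSplit chosen = null;
        for (PendingSplit s : pending) {
            if (s.hosts.contains(host)) {
                chosen = s;          // node-local wins outright
                break;
            }
            if (chosen == null && sameRack(s, host, rackOf)) {
                chosen = s;          // remember the first rack-local candidate
            }
        }
        if (chosen == null && !pending.isEmpty()) {
            chosen = pending.get(0); // no locality available, take any split
        }
        if (chosen != null) {
            pending.remove(chosen);
        }
        return chosen;
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = Map.of("node1", "rackA", "node2", "rackA", "node3", "rackB");
        List<PendingSplit> pending = new ArrayList<>(List.of(
            new PendingSplit("split-0", List.of("node1")),
            new PendingSplit("split-1", List.of("node3"))));
        System.out.println(pick(pending, "node3", rackOf).id); // split-1 (node-local)
        System.out.println(pick(pending, "node2", rackOf).id); // split-0 (rack-local)
    }
}
```

The point of the model is that the preference order lives entirely in the scheduler. The input format's only contribution is the host list attached to each split.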

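Unfortunately, this approach to ensuring locality holds up in homogeneous clusters but breaks down in heterogeneous ones, i.e. clusters whose data nodes have hard disks of different sizes. If you want to dig deeper, you should read this paper (Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters), which also touches on several topics related to your question.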
