Is the input format responsible for implementing data locality in Hadoop's MapReduce?
I am trying to understand data locality as it relates to Hadoop's Map/Reduce framework. In particular I am trying to understand what component handles data locality (i.e. is it the input format?)
Yahoo's Developer Network Page states "The Hadoop framework then schedules these processes in proximity to the location of data/records using knowledge from the distributed file system." This seems to imply that the HDFS input format will perhaps query the name node to determine which nodes contain the desired data and will start the map tasks on those nodes if possible. One could imagine a similar approach could be taken with HBase by querying to determine which regions are serving certain records.
If a developer writes their own input format would they be responsible for implementing data locality?
Comments (2)
You're right. Take a look at the FileInputFormat class and its getSplits() method. It looks up the block locations:
    BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
This implies a FileSystem query. It happens inside the JobClient, and the results are written into a SequenceFile (actually it's just raw bytes). The JobTracker reads this file later while initializing the job and pretty much just assigns a task to each input split.
BUT the distribution of the data is the NameNode's job.
To your question now:
Normally you extend FileInputFormat, so you are forced to return a list of InputSplit objects, and during initialization each split has to be given its location; FileSplit is one example. So you don't actually implement data locality yourself, you just tell the framework on which hosts the split can be found. That is easy to query through the FileSystem interface.
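To make the point about split locations concrete, here is a minimal, self-contained sketch in plain Java. It deliberately does not use the Hadoop classes: Block, Split and splitsFor are hypothetical stand-ins for BlockLocation, FileSplit and getSplits(), and it assumes one split per block, which the real FileInputFormat does not guarantee. All it is meant to show is that the input format copies host names into each split and does no scheduling of its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitLocationSketch {

    /** Hypothetical stand-in for a block location: a byte range plus the hosts holding a replica. */
    static final class Block {
        final long offset;
        final long length;
        final String[] hosts;

        Block(long offset, long length, String... hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Hypothetical stand-in for a file split: the range to read plus locality hints. */
    static final class Split {
        final long start;
        final long length;
        final String[] hosts;

        Split(long start, long length, String[] hosts) {
            this.start = start;
            this.length = length;
            this.hosts = hosts;
        }

        /** Hosts where this split's data is stored; a hint for the scheduler, nothing more. */
        String[] getLocations() {
            return hosts.clone();
        }
    }

    /** Assumption: one split per block. Each split simply inherits its block's hosts. */
    static List<Split> splitsFor(List<Block> blocks) {
        List<Split> splits = new ArrayList<>();
        for (Block b : blocks) {
            splits.add(new Split(b.offset, b.length, b.hosts));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Made-up block layout for illustration only.
        List<Block> blocks = Arrays.asList(
                new Block(0, 128, "node1", "node2"),
                new Block(128, 64, "node2", "node3"));
        for (Split s : splitsFor(blocks)) {
            System.out.println(s.start + "+" + s.length + " -> " + Arrays.toString(s.getLocations()));
        }
    }
}
```

In a custom input format the block list would come from whatever storage system is being read. The responsibility stays the same: report the hosts and leave the placement of tasks to the framework.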
My understanding is that data locality is jointly determined by HDFS and the InputFormat. The former determines (via rack awareness) and stores the location of HDFS blocks across datanodes, while the latter determines which blocks are associated with which split. The jobtracker will try to optimize which splits are delivered to which map task by making sure that the blocks associated with each split (one split maps to one map task) are local to the tasktracker.
Unfortunately, this approach to locality holds up in homogeneous clusters but breaks down in heterogeneous ones, i.e. clusters whose datanodes have hard disks of different sizes. If you want to dig deeper into this, you should read the paper "Improving MapReduce performance through data placement in heterogeneous hadoop clusters", which also touches on several topics relevant to your question.
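The matching of splits to tasktrackers described in this answer can be sketched roughly as follows. This is a toy model in plain Java and not the jobtracker's actual algorithm: Split and nextSplitFor are hypothetical names, only node-level locality is modelled (no rack level), and the fallback of handing out the first pending split is an assumption made for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class LocalitySchedulerSketch {

    /** Hypothetical split: a name plus the hosts that store its blocks. */
    static final class Split {
        final String name;
        final List<String> hosts;

        Split(String name, String... hosts) {
            this.name = name;
            this.hosts = Arrays.asList(hosts);
        }
    }

    /**
     * Removes and returns a pending split for the tracker running on trackerHost:
     * a node-local split if one exists, otherwise the first pending split.
     * Returns null when nothing is pending.
     */
    static Split nextSplitFor(String trackerHost, List<Split> pending) {
        for (Iterator<Split> it = pending.iterator(); it.hasNext();) {
            Split s = it.next();
            if (s.hosts.contains(trackerHost)) {
                it.remove();
                return s;
            }
        }
        // No local data on this host: the task will have to read over the network.
        return pending.isEmpty() ? null : pending.remove(0);
    }

    public static void main(String[] args) {
        // Made-up splits and host names for illustration only.
        List<Split> pending = new ArrayList<>(Arrays.asList(
                new Split("split-0", "node1", "node2"),
                new Split("split-1", "node2", "node3"),
                new Split("split-2", "node1", "node3")));
        System.out.println(nextSplitFor("node3", pending).name); // local match: split-1
        System.out.println(nextSplitFor("node4", pending).name); // no local data: split-0
    }
}
```

The second call shows why locality is only a best effort: a tracker with no local blocks still gets work, it just reads the data remotely.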