About file splitting in Hadoop/HDFS
I just want to confirm the following. Please verify whether this is correct:
1. As per my understanding, when we copy a file into HDFS, that is the point at which the file (assuming its size > 64 MB = HDFS block size) is split into multiple chunks, and each chunk is stored on a different data node.
2. File contents are already split into chunks when the file is copied into HDFS, and that file split does not happen at the time of running a map job. Map tasks are only scheduled in such a way that each works on one chunk of max. size 64 MB, with data locality (i.e. the map task runs on the node that contains the data/chunk).
3. File splitting also happens if the file is compressed (gzipped), but MR ensures that each such file is processed by just one mapper, i.e. MR will collect all the chunks of the gzip file lying on other data nodes and give them all to the single mapper.
4. The same thing as above will happen if we define isSplitable() to return false, i.e. all the chunks of a file will be processed by one mapper running on one machine. MR will read all the chunks of the file from different data nodes and make them available to a single mapper (see the sketch after this list).
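For what it's worth, here is a minimal sketch of what point 4 would look like in code, assuming the new (mapreduce) API; the class name WholeFileTextInputFormat is made up for illustration:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical input format: by refusing to split, the whole file becomes a
// single input split and is handed to exactly one mapper. The file is still
// stored as multiple HDFS blocks; that one mapper reads any non-local blocks
// over the network.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
```

In the driver you would then register it with job.setInputFormatClass(WholeFileTextInputFormat.class).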
3 Answers
David's answer pretty much hits the nail on the head; I am just elaborating on it here.
There are two distinct concepts at work here, and each is handled by a different entity in the Hadoop framework.
Firstly --
1) Dividing a file into blocks -- When a file is written into HDFS, HDFS divides the file into blocks and takes care of their replication. This is done once (mostly), and the blocks are then available to all MR jobs running on the cluster. The block size is a cluster-wide configuration.
Secondly --
2) Splitting a file into input splits -- When an input path is passed to an MR job, the job uses that path, together with the configured input format, to divide the files specified in the input path into splits; each split is processed by one map task. The calculation of input splits is done by the input format each time a job is executed (a small configuration sketch follows below).
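To make point 2) concrete, here is a hedged sketch of how a job can bound its input split size independently of the HDFS block size; the input path and the 32/128 MB figures are arbitrary examples:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");

        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/input")); // example path

        // These settings only influence how the input format computes input
        // splits at job-submission time; they do not change how HDFS already
        // stored the file as blocks.
        FileInputFormat.setMinInputSplitSize(job, 32 * 1024 * 1024L);   // 32 MB
        FileInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);  // 128 MB

        // ... set mapper/reducer classes, output path, etc., then submit.
    }
}
```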
Now once we have this under our belt, we can understand that the isSplitable() method comes under the second category.
To really nail this down, have a look at the HDFS write data flow (Concept 1).
The second point in that diagram is probably where the split happens; note that this has nothing to do with the running of an MR job.
Now have a look at the execution steps of an MR job.
Here the first step is the calculation of the input splits via the input format configured for the job.
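For reference, the split size that the default FileInputFormat-style logic ends up using is essentially the block size clamped between the job's min and max split settings; a simplified sketch of the idea (not the exact library source):

```java
public class SplitSizeFormula {
    // Simplified sketch of how FileInputFormat-style logic derives the split
    // size from the job's min/max split settings and the file's block size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;  // 64 MB HDFS block
        long minSize = 1L;                   // default minimum
        long maxSize = Long.MAX_VALUE;       // default maximum
        // With the defaults the split size equals the block size (64 MB here),
        // which is why splits normally line up with blocks and data locality.
        System.out.println(computeSplitSize(blockSize, minSize, maxSize));
    }
}
```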
A lot of your confusion stems from the fact that you are conflating these two concepts; I hope this makes it a little clearer.
Your understanding is not ideal.
I would point out that there are two almost independent processes: splitting files into HDFS blocks, and splitting files for processing by the different mappers.
HDFS splits files into blocks based on the defined block size.
Each input format has its own logic for how files can be split into parts for independent processing by different mappers. The default logic of FileInputFormat is to split the file along HDFS block boundaries. You can implement any other logic.
Compression is usually a foe of splitting, so we employ block compression techniques to enable splitting of compressed data. This means that each logical part of the file (each block) is compressed independently.
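To make the compression point concrete, here is roughly how an input format can decide whether a compressed file may be split, based on whether its codec is splittable (gzip is not; block-oriented codecs such as bzip2 are). This is a sketch modeled on the standard codec API, with example paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    // Returns true if the file is either uncompressed or compressed with a
    // codec that supports splitting (e.g. bzip2); for gzip it returns false,
    // so the whole file ends up in a single input split handled by one mapper.
    static boolean canSplit(Configuration conf, Path file) {
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
        if (codec == null) {
            return true; // not compressed: split freely along block boundaries
        }
        return codec instanceof SplittableCompressionCodec;
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        System.out.println(canSplit(conf, new Path("/data/logs.gz")));   // expected: false
        System.out.println(canSplit(conf, new Path("/data/logs.bz2")));  // expected: true
    }
}
```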
Yes, file contents are split into chunks when the file is copied into HDFS. The block size is configurable; if it is, say, 128 MB, then the whole 128 MB would be one block, not two separate 64 MB blocks. Also, it is not necessary that each chunk of a file is stored on a separate datanode; a datanode may hold more than one chunk of a particular file. And a particular chunk may be present on more than one datanode, depending on the replication factor.
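If you want to see this for yourself, the HDFS client API can report which datanodes hold each block of a file; a small sketch, with the file path being just an example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/big.log"); // example file
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block; getHosts() lists every datanode that
        // holds a replica, so a block can appear on several nodes and a node
        // can hold several blocks of the same file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println(block.getOffset() + " len=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```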