Given Hadoop 0.21.0, what assumptions does the framework make regarding the number of open file descriptors relative to each individual map and reduce operation? Specifically, what suboperations cause Hadoop to open a new file descriptor during job execution or spill to disk?
(This is deliberately ignoring use of MultipleOutputs, as it very clearly screws with the guarantees provided by the system.)
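For anyone wondering why it breaks the bound, a rough sketch of the pattern I have in mind is below. The named output, class name, and output paths are purely illustrative, not taken from any real job.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FanOutReducer extends Reducer<Text, Text, Text, Text> {

    // Job setup: each named output registered here becomes an extra writer
    // (and an extra file descriptor) in every task that uses it.
    public static void register(Job job) {
        MultipleOutputs.addNamedOutput(job, "errors",
                TextOutputFormat.class, Text.class, Text.class);
    }

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Every distinct base output path written here keeps its own writer
            // (one more open file descriptor) for the lifetime of the task.
            mos.write("errors", key, value, "errors/" + key.toString());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
```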
My rationale here is simple: I'd like to ensure each job I write for Hadoop guarantees a finite number of required file descriptors for each mapper or reducer. Hadoop cheerfully abstracts this away from the programmer, which would normally be A Good Thing, if not for the other shoe dropping during server management.
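As a point of reference, one way I can at least observe the count from inside a task is a probe along these lines, assuming a Sun/Oracle-style JVM that exposes com.sun.management.UnixOperatingSystemMXBean; it is not something Hadoop itself provides.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdProbe {

    // Logs how many descriptors this task's JVM currently holds versus its limit.
    // Call it from a mapper/reducer's setup() and cleanup() to watch the budget.
    public static void logDescriptorUsage(String where) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unixOs = (UnixOperatingSystemMXBean) os;
            System.err.println(where + ": open fds = "
                    + unixOs.getOpenFileDescriptorCount()
                    + " of max " + unixOs.getMaxFileDescriptorCount());
        } else {
            System.err.println(where + ": descriptor counts not exposed by this JVM");
        }
    }
}
```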
I'd originally asked this question on Server Fault from the cluster management side of things. Since I'm also responsible for programming, this question is equally pertinent here.
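For reference, my current understanding of the knobs that govern map-side spilling is sketched below. The values are just the usual defaults, and the keys are the older io.sort.* names; 0.21 may also accept the renamed mapreduce.task.io.sort.* forms.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuning {

    public static Job configure() throws IOException {
        Configuration conf = new Configuration();
        // In-memory map output buffer; once it fills past the spill threshold,
        // the map task writes (spills) a sorted segment to local disk.
        conf.setInt("io.sort.mb", 100);
        // How many spill segments a single merge pass opens at once, which
        // bounds the descriptors held on those segments during the merge.
        conf.setInt("io.sort.factor", 10);
        return new Job(conf, "fd-budget-probe");
    }
}
```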
Here's a post that offers some insight into the problem:
This implies that, for normal behavior, the number of mappers is exactly equivalent to the number of open file descriptors.
MultipleOutputs obviously skews this number, pushing it toward the number of mappers multiplied by the number of available partitions. Reducers then proceed as normal, generating one file (and thus one file descriptor) per reduce operation.

The problem then becomes: during a spill operation, most of these files are held open by each mapper as output is cheerfully marshalled by split. Hence the available-file-descriptors problem.

Thus, the currently assumed maximum file descriptor limit should follow directly from those counts (a rough sketch of the arithmetic is given below).
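Reading those counts literally, a back-of-envelope version of that limit might be computed as in the sketch below; the method name, parameters, and sample numbers are illustrative rather than anything Hadoop reports.

```java
public class FdEstimate {

    // All parameters are per-node task counts you would substitute for your own
    // cluster; nothing here is read from Hadoop itself.
    public static long estimateMaxDescriptors(long concurrentMappers,
                                              long partitionsPerMapper,
                                              long concurrentReducers) {
        // Normal case: one open output per mapper. With MultipleOutputs in play,
        // the per-mapper figure is multiplied by the number of partitions it writes.
        long mapSide = concurrentMappers * partitionsPerMapper;
        // One output file (and therefore one descriptor) per reduce operation.
        long reduceSide = concurrentReducers;
        return mapSide + reduceSide;
    }

    public static void main(String[] args) {
        // e.g. 8 concurrent mappers x 12 partitions + 4 reducers = 100 descriptors
        System.out.println(estimateMaxDescriptors(8, 12, 4));
    }
}
```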
And that, as we say, is that.