Expected consumption of open file descriptors in Hadoop 0.21.0

Posted 2024-10-05 16:33:31

Given Hadoop 0.21.0, what assumptions does the framework make regarding the number of open file descriptors relative to each individual map and reduce operation? Specifically, what suboperations cause Hadoop to open a new file descriptor during job execution or spill to disk?

(This is deliberately ignoring use of MultipleOutputs, as it very clearly screws with the guarantees provided by the system.)

My rationale here is simple: I'd like to ensure each job I write for Hadoop guarantees a finite number of required file descriptors for each mapper or reducer. Hadoop cheerfully abstracts this away from the programmer, which would normally be A Good Thing, if not for the other shoe dropping during server management.
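As an aside, here is a minimal sketch of how that assumption could be checked empirically from inside a task, assuming the task JVM is a Sun/Oracle JVM on a Unix-like node; the FdProbe helper is my own illustration, and the com.sun.management bean is not guaranteed to exist on every JVM:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

import com.sun.management.UnixOperatingSystemMXBean;

// Hypothetical helper: call from a mapper/reducer's setup() and cleanup()
// to see how many descriptors the task JVM actually holds.
public class FdProbe {
    public static void logDescriptorUsage(String label) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            System.err.println(label + ": open fds=" + unix.getOpenFileDescriptorCount()
                    + ", max fds=" + unix.getMaxFileDescriptorCount());
        } else {
            System.err.println(label + ": descriptor counts unavailable on this JVM/platform");
        }
    }
}
```

Calling it from a task's setup() and cleanup() gives a rough before/after picture of how many descriptors a spill-heavy task actually holds.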

I'd originally asked this question on Server Fault from the cluster management side of things. Since I'm also responsible for programming, this question is equally pertinent here.

Comments (1)

与往事干杯 2024-10-12 16:33:31

Here's a post that offers some insight into the problem:

This happens because more small files are created when you use the MultipleOutputs class.
Say you have 50 mappers; then, assuming you don't have skewed data, Test1 will always generate exactly 50 files, but Test2 will generate somewhere between 50 and 1000 files (50 mappers x 20 total partitions possible), and this causes a performance hit in I/O. In my benchmark, 199 output files were generated for Test1 and 4569 output files were generated for Test2.

This implies that, for normal behavior, the number of mappers is exactly equivalent to the number of open file descriptors. MultipleOutputs obviously skews this number by the number of mappers multiplied by the number of available partitions. Reducers then proceed as normal, generating one file (and thus, one file descriptor) per reduce operation.
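To make that skew concrete, here is a minimal sketch of the kind of reducer that triggers it, assuming the new-API org.apache.hadoop.mapreduce.lib.output.MultipleOutputs class; the PartitionedReducer name and the tab-prefixed partition label are hypothetical illustrations, not code from the quoted post:

```java
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Hypothetical reducer: every distinct base output path it writes to forces
// MultipleOutputs to open another output file (and another descriptor).
public class PartitionedReducer extends Reducer<Text, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Assume the partition label is a tab-separated prefix of the value,
            // e.g. "part07\trecord...".
            String partition = value.toString().split("\t", 2)[0];
            // One open file per distinct base path per task: up to
            // (tasks * possible partitions) files across the job.
            mos.write(NullWritable.get(), value, partition + "/data");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
```

Each distinct base path handed to write() gets its own record writer, which is where the extra descriptors come from.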

The problem then becomes: during a spill operation, most of these files are being held open by each mapper as output is cheerfully marshalled by split. Hence the available file descriptors problem.
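The main knob over how many of those spill segments a task reads back at once is the sort factor. Here is a hedged configuration sketch, assuming the classic io.sort.* property names (0.21.0 was in the middle of renaming these to mapreduce.task.io.sort.*, so verify the exact keys against your build):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class BoundedSpillJob {
    public static Job newJob(Configuration conf) throws IOException {
        // Merge at most 10 spill segments per pass, so a merging task holds
        // on the order of 10 spill-file descriptors open at once.
        conf.setInt("io.sort.factor", 10);
        // A larger in-memory sort buffer means fewer spill files are created
        // in the first place.
        conf.setInt("io.sort.mb", 200);
        return new Job(conf, "bounded-spill-job");
    }
}
```

Note this only caps the per-task merge fan-in; it does nothing about the extra files MultipleOutputs opens, which is the multiplier in the bounds below.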

Thus, the currently assumed maximum file descriptor limit should be:

Map phase: number of mappers * total partitions possible

Reduce phase: number of reduce operations * total partitions possible
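Plugging illustrative numbers into those two bounds (the mapper and partition counts echo the quoted benchmark; the reducer count is a hypothetical of mine):

```java
public class FdBudget {
    public static void main(String[] args) {
        int mappers = 50;              // map tasks, as in the quoted benchmark
        int reducers = 10;             // reduce tasks (hypothetical figure)
        int partitionsPossible = 20;   // distinct MultipleOutputs partitions

        // Map phase:    number of mappers * total partitions possible
        long mapPhaseCeiling = (long) mappers * partitionsPossible;      // 1000
        // Reduce phase: number of reduce operations * total partitions possible
        long reducePhaseCeiling = (long) reducers * partitionsPossible;  // 200

        System.out.println("Map-phase descriptor ceiling:    " + mapPhaseCeiling);
        System.out.println("Reduce-phase descriptor ceiling: " + reducePhaseCeiling);
    }
}
```

Those ceilings are what you would compare against the node's ulimit -n (or the probe above) before raising task counts.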

And that, as we say, is that.
