Is it OK to use coalesce or repartition to merge small files in HDFS (the merged file will be quite large)?
I'm using an hdfs-sink-connector to consume Kafka's data into HDFS.
The Kafka connector writes data every 10 minutes, and sometimes the written file's size is really small; it varies from 2MB to 100MB. So, the written files actually waste my HDFS storage since each block size is 256MB.
The directory is created per date, so I thought it would be great to merge the many small files into one big file in a daily batch. (I expect HDFS will automatically divide one large file into blocks as a result.)
I know there are many answers which say we could use Spark's coalesce(1) or repartition(1), but I'm worried about OOM errors if I read the whole directory and use those functions; it could be more than 90GB~100GB if I read every file.
Will 90~100GB in HDFS be allowed? Or is it something I don't need to worry about?
Could anyone let me know if there is a best practice for merging small HDFS files? Thanks!
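For context, this is roughly the daily merge job I have in mind. It's only a sketch; the paths, the Parquet format, and the per-date directory layout are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Rough sketch of the daily merge I'm considering; paths and format are made up.
object DailyMerge {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("merge-small-hdfs-files").getOrCreate()

    val day    = args(0)                           // e.g. "2023-01-01"
    val input  = s"hdfs:///data/topic/$day"        // hypothetical per-date directory
    val output = s"hdfs:///data/topic-merged/$day"

    spark.read.parquet(input)   // or whatever format the sink connector writes
      .coalesce(1)              // <- the step I'm worried about (single task, possible OOM?)
      .write.mode("overwrite")
      .parquet(output)

    spark.stop()
  }
}
```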
1 Answer
HDFS doesn't "fill out" the unused parts of the block. So a 2MB file only uses 2MB on disk (well, 6MB if you account for 3x replication). The main concern with small files on HDFS is that billions of small files can cause problems.
Spark may be an in-memory processing framework, but it still works if the data doesn't fit into memory. In such situations processing spills over onto disk and will be a bit slower.
That is absolutely fine - this is big data after all. As you noted, the actual file will be split into smaller blocks in the background (but you won't see this unless you use hadoop fsck).
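If you do still want to compact each date directory, one way to avoid pushing everything through a single coalesce(1) task is to derive the partition count from the total input size, so each output file ends up near one block. This is only a sketch under assumed paths, format, and a 256MB target, not the only way to do it:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compact-small-files").getOrCreate()

val inputDir    = "hdfs:///data/topic/2023-01-01"        // hypothetical date directory
val outputDir   = "hdfs:///data/topic-merged/2023-01-01"
val targetBytes = 256L * 1024 * 1024                     // aim for roughly one HDFS block per output file

// Sum the bytes under the input directory via the Hadoop FileSystem API,
// then pick a partition count so each output file is about targetBytes.
val fs         = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val totalBytes = fs.getContentSummary(new Path(inputDir)).getLength
val numParts   = math.max(1, (totalBytes / targetBytes).toInt)

spark.read.parquet(inputDir)      // use the format your connector actually writes
  .repartition(numParts)          // many medium-sized tasks instead of one huge one
  .write.mode("overwrite")
  .parquet(outputDir)
```

With repartition the work is spread across executors by the shuffle, so no single task has to hold the whole 90GB~100GB at once.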