Small Files and HDFS Blocks

Posted on 2024-12-22

Does a block in Hadoop Distributed File System store multiple small files, or a block stores only 1 file?

Comments (5)

陈年往事 2024-12-29 01:31:06

Multiple files are not stored in a single block. By the way, a single file can be stored in multiple blocks. The mapping between a file and its block IDs is persisted in the NameNode.

According to Hadoop: The Definitive Guide:

Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage.

HDFS is designed to handle large files. If there are too many small files, the NameNode may become overloaded, since it stores the namespace for all of HDFS. Check this article on how to alleviate the problem with too many small files.
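As a rough illustration of that NameNode pressure, here is a back-of-the-envelope sketch, assuming the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per file or block object (the exact figure varies by Hadoop version, so treat the constant as an assumption):

```python
import math

# Assumption: a widely quoted rule of thumb is ~150 bytes of NameNode
# heap per namespace object (file or block); the exact value is
# version-dependent.
BYTES_PER_OBJECT = 150

def namenode_heap_estimate(num_files, avg_file_size, block_size=128 * 1024**2):
    """Rough NameNode heap usage in bytes for num_files files."""
    blocks_per_file = max(1, math.ceil(avg_file_size / block_size))
    # one namespace object per file plus one per block
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

# 10 million 1 KB files: each still costs one file object and one block
# object, so tiny files inflate the namespace quickly (~3 GB of heap here),
small = namenode_heap_estimate(10_000_000, 1024)
# while the same order of data volume in 80 files of 128 GB each costs ~12 MB.
large = namenode_heap_estimate(80, 128 * 1024**3)
print(small, large)
```

The point of the comparison: NameNode memory scales with the number of namespace objects, not with the bytes stored, which is exactly why many small files hurt.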

樱桃奶球 2024-12-29 01:31:06

Block size is a Hadoop storage concept. Every time you store a file in Hadoop, it is divided into blocks, which are distributed over the cluster according to the replication factor and data locality.

In detail:

  • When you push a file to HDFS, it is divided into blocks. Each block is like an individual file, with a maximum size set by the block size.

  • Every block has a .meta file alongside it, which stores the block's metadata (checksums) on the DataNode.

  • If the file is very small, the whole file fits in one block, and that block (a file on the DataNode's local disk) has the same size as the file, plus a small .meta file.

Some commands:

  • Connect to any DataNode on your cluster (if you have access ;)). Then go to that node's storage directories, where you can see the actual blocks stored on the DataNode, as below.

(Directories are as on my cluster - /data2/dfs/dn/):

Block size: 1 GB

cd /data/dfs/dn -> current -> finalized -> subdir0 -> (here is the gold)

A block uses only a few KB of storage for a small file, or for the tail block when the file size is the block size plus a few KB:

-rw-r--r-- 1 hdfs hdfs 91K Sep 13 16:19 blk_1073781504

-rw-r--r-- 1 hdfs hdfs 19K Sep 13 16:21 blk_1073781504_40923.meta

When the file is bigger than the block size, the block will look something like this:

-rw-r--r-- 1 hdfs hdfs 1.0G Aug 31 12:03 blk_1073753814

-rw-r--r-- 1 hdfs hdfs 8.1M Aug 31 12:04 blk_1073753814_12994.meta

I hope this explains how blocks are stored. If you want the details of how your files are stored in blocks, run:

hdfs fsck / -files -blocks -locations

Let me know if I missed out anything here.
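The .meta sizes in the listings above can be sanity-checked: each .meta file holds one checksum per chunk of block data. A sketch assuming the default 4-byte CRC per 512-byte chunk (`dfs.bytes-per-checksum`; your cluster's setting may differ), which reproduces the ~8 M meta file sitting next to the 1.0 G block:

```python
import math

# Assumption: default HDFS checksum parameters - a 4-byte CRC32 checksum
# for every 512 bytes of block data. These are configurable, so real
# clusters can deviate.
BYTES_PER_CHECKSUM = 512
CHECKSUM_SIZE = 4

def meta_file_size(block_bytes):
    """Approximate .meta file size for a block of block_bytes
    (checksums only, ignoring the small fixed header)."""
    chunks = math.ceil(block_bytes / BYTES_PER_CHECKSUM)
    return chunks * CHECKSUM_SIZE

# A 1 GiB block needs 2,097,152 checksums -> exactly 8 MiB of checksum data,
# matching the ~8.1M .meta file in the listing above.
print(meta_file_size(1024**3))
print(meta_file_size(91 * 1024))  # a small block needs proportionally little
```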

尬尬 2024-12-29 01:31:06

You could do that using the HAR (Hadoop Archive) filesystem, which packs multiple small files into the HDFS blocks of a special part file managed by the HAR filesystem.
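The packing idea can be shown with a toy sketch (hypothetical helper names; this is an illustration of the concept, not the real HAR file format): concatenate small files into one blob and keep an index of offsets, so thousands of tiny files occupy blocks densely and cost only one set of block metadata.

```python
# Toy illustration of HAR-style packing: many small files are
# concatenated into one large "part" blob, and a separate index maps
# each name to its (offset, length), so individual files can still be
# read back without storing each one as its own HDFS file.

def pack(files):
    """files: dict of name -> bytes. Returns (blob, index)."""
    blob = bytearray()
    index = {}
    for name, data in files.items():
        index[name] = (len(blob), len(data))
        blob.extend(data)
    return bytes(blob), index

def read_packed(blob, index, name):
    """Recover one original file from the blob via the index."""
    offset, length = index[name]
    return blob[offset:offset + length]

small_files = {f"log{i}.txt": f"entry {i}\n".encode() for i in range(3)}
blob, index = pack(small_files)
print(read_packed(blob, index, "log1.txt"))
```

The real HAR format similarly stores an index (`_index`) plus one or more part files; the design trade-off is that the archive is immutable once written.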

故人的歌 2024-12-29 01:31:06

A block will store data from only a single file. If your file is bigger than the block size (64 MB/128 MB/...), it will be partitioned into multiple blocks of that block size.
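In other words, the block count is just a ceiling division. A minimal sketch, assuming the 128 MB default block size (older Hadoop versions defaulted to 64 MB):

```python
import math

# Assumption: 128 MB block size, the default in recent Hadoop versions.
BLOCK_SIZE = 128 * 1024**2

def num_blocks(file_size):
    """Number of HDFS blocks a file of file_size bytes occupies."""
    return max(1, math.ceil(file_size / BLOCK_SIZE))

print(num_blocks(1024))           # tiny file -> 1 block
print(num_blocks(300 * 1024**2))  # 300 MB -> 3 blocks (128 + 128 + 44 MB)
```

Note that the last block is only as large as the leftover data (44 MB here), per the book quote above: a partial block does not consume a full block's worth of disk.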

羁客 2024-12-29 01:31:06

The main point to understand about HDFS is that a file is partitioned into blocks based on its size; it is not the case that some blocks sit in memory waiting for files to be stored in them (that is a misconception).

Basically, multiple files are not stored in a single block (unless it is an archive, such as a HAR file).
