HDFS Introduction and Usage
cli
For the full list of commands, see Hadoop - The File System (FS) shell
ls
List the files under a directory.
hdfs dfs -ls /path : list what is under /path
hdfs dfs -ls -t -r /path : list sorted by modification time, in reverse order
hdfs dfs -ls /path | awk '{ print $8 }' | xargs -I % echo 'hdfs dfs -du -s %' | sh : show the size of each directory under the given path
Related options:
-h : human-readable sizes
-r : reverse the sort order
-t : sort by modification time
-R : recurse into subdirectories until files are shown
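For example, these options can be combined to walk a directory tree recursively with human-readable sizes, sorted by modification time with the order reversed (the path /user/logs below is only a hypothetical example):
# recursive listing, human-readable sizes, time-sorted with the order reversed
hdfs dfs -ls -R -h -t -r /user/logs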
rm
Delete the specified file or directory.
Related options:
-r : delete directories recursively
-skipTrash : bypass the trash and delete the file immediately
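For example (the path /tmp/old_data is hypothetical), the first form moves the directory to the trash, while the second skips the trash and frees the space immediately:
# recursively delete a directory; it goes to the trash if trash is enabled
hdfs dfs -rm -r /tmp/old_data
# recursively delete and skip the trash; this cannot be undone
hdfs dfs -rm -r -skipTrash /tmp/old_data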
cat
Read a file: hdfs dfs -cat <HDFS_FILE_PATH>
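Because cat streams the whole file, it is often piped into head when you only want to peek at a large file (the path below is only an illustration):
# print just the first 20 lines of a potentially large file
hdfs dfs -cat /data/events/part-00000 | head -n 20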
du
Show the size of a file or directory: hdfs dfs -du <HDFS_PATH>
Related options:
-h : human-readable sizes
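For example (with a hypothetical /data path), -h prints each child's size in human-readable form, and adding -s collapses the output to a single summarized total:
# size of each entry under /data, human-readable
hdfs dfs -du -h /data
# one summarized total for /data
hdfs dfs -du -s -h /data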
other
hdfs dfs -expunge : empty the trash
FAQ
The difference between hadoop fs and hdfs dfs
difference between "hadoop fs" and "hdfs dfs" shell commands
- According to the official documentation: The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports, such as Local FS, WebHDFS, S3 FS, and others.
bin/hadoop fs <args>
- All FS shell commands take path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the Local FS the scheme is file. The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used. An HDFS file or directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply as /parent/child (given that your configuration is set to point to hdfs://namenodehost).
- If the file system in use is HDFS, then hadoop fs is equivalent to hdfs dfs.
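As a small illustration of the URI rules above (namenodehost is a placeholder for your actual NameNode), the first two commands should be equivalent when the configured default FS is HDFS, and the third addresses the local file system explicitly:
# rely on the default file system from the configuration
hdfs dfs -ls /parent/child
# spell out the scheme and authority explicitly
hadoop fs -ls hdfs://namenodehost/parent/child
# hadoop fs can also talk to the local file system
hadoop fs -ls file:///tmp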
Avoiding the _SUCCESS, _metadata and _common_metadata files produced by compute engines
When Hive or Spark saves files to HDFS, it also produces _SUCCESS, _metadata and _common_metadata files. If you do not want these files (they are not useful and can cause SQL queries to fail), adjust the following configuration:
mapreduce.fileoutputcommitter.marksuccessfuljobs=false : do not produce the _SUCCESS file
parquet.enable.summary-metadata=false : do not produce the _metadata and _common_metadata files
For spark-on-yarn, when running spark-thrift-server the corresponding configuration is:
spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false
spark.hadoop.parquet.enable.summary-metadata=false
(The parquet setting only needs to be configured before Spark 2.0; since 2.0, _metadata and _common_metadata are no longer produced by default, see the related JIRA.)
If you use a SparkContext, the corresponding configuration is:
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
What does Non DFS Used mean
Non DFS Used is computed as: Non DFS Used = Configured Capacity - Remaining Space - DFS Used
- Since Configured Capacity = Total Disk Space - Reserved Space, it follows that Non DFS Used = (Total Disk Space - Reserved Space) - Remaining Space - DFS Used
- The reserved space is controlled by the configuration dfs.datanode.du.reserved
Here is a concrete example:
Let's take an example. Assume I have a 100 GB disk, and I set the reserved space (dfs.datanode.du.reserved) to 30 GB. On that disk, the system and other files use up 40 GB, and DFS Used is 10 GB. If you run df -h, you will see the available space is 50 GB for that disk volume.
In the HDFS web UI, it will show:
Non DFS Used = 100 GB (Total) - 30 GB (Reserved) - 10 GB (DFS Used) - 50 GB (Remaining) = 10 GB
It can also be understood as 40 GB (actual non-DFS usage) - 30 GB (Reserved) = 10 GB.
So it actually means you initially configured 30 GB to be reserved for non-DFS usage and 70 GB for HDFS. However, it turns out that non-DFS usage exceeds the 30 GB reservation and eats up 10 GB of space that should belong to HDFS!
The term "Non DFS used" should really be renamed to something like
How much configured DFS capacity are occupied by non dfs use
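These numbers can be read straight from the NameNode with dfsadmin; the report prints Configured Capacity, DFS Used, Non DFS Used and DFS Remaining per DataNode, which makes it easy to check the formula above:
# cluster-wide and per-datanode capacity report (typically requires HDFS superuser privileges)
hdfs dfsadmin -report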
The meaning of several HDFS block states
See here for more details.
- Over-replicated blocks: These are blocks that exceed their target replication for the file they belong to. Normally, over-replication is not a problem, and HDFS will automatically delete excess replicas.
- Under-replicated blocks: These are blocks that do not meet their target replication for the file they belong to. HDFS will automatically create new replicas of under-replicated blocks until they meet the target replication. You can get information about the blocks being replicated (or waiting to be replicated) using hdfs dfsadmin -metasave .
- Misreplicated blocks: These are blocks that do not satisfy the block replica placement policy (see Replica Placement). For example, for a replication level of three in a multirack cluster, if all three replicas of a block are on the same rack, then the block is misreplicated because the replicas should be spread across at least two racks for resilience. HDFS will automatically re-replicate misreplicated blocks so that they satisfy the rack placement policy.
- Corrupt blocks: These are blocks whose replicas are all corrupt. Blocks with at least one noncorrupt replica are not reported as corrupt; the namenode will replicate the noncorrupt replica until the target replication is met.
- Missing replicas: These are blocks with no replicas anywhere in the cluster.
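A quick way to see how many blocks fall into each of these states is fsck; for example (running it against the root path can be slow on a large cluster):
# report over-/under-replicated, mis-replicated, corrupt blocks and missing replicas
hdfs fsck /
# list only the files that have missing or corrupt blocks
hdfs fsck / -list-corruptfileblocks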