HDFS Introduction and Usage


cli

For the full command-line reference, see Hadoop - The File System (FS) shell.

ls

List the files under a directory.

  • hdfs dfs -ls /path : list what is under /path
  • hdfs dfs -ls -t -r /path : list by modification time in reverse order (-t sorts newest first, -r reverses it)
  • hdfs dfs -ls /path | awk '{ print $8 }' | xargs -I % echo 'hdfs dfs -du -s %' | sh : show the total size of each entry under the given path (a cleaner form is sketched after the option list below)

Related options:

  • -h : human-readable sizes
  • -r : reverse the sort order
  • -t : sort by modification time
  • -R : recurse into subdirectories all the way down to the files
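
As a rough sketch (the paths are placeholders), the options can be combined, and the size pipeline above can be written without the intermediate echo ... | sh:

    # list /path by modification time, oldest first, with human-readable sizes
    hdfs dfs -ls -t -r -h /path

    # total size of each entry under /path; $8 is the path column of -ls output
    hdfs dfs -ls /path | awk '{ print $8 }' | xargs -I{} hdfs dfs -du -s -h {}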

rm

Delete the specified file or directory (see the example after the option list).

Related options:

  • -r : delete a directory and its contents recursively
  • -skipTrash : bypass the trash and delete immediately
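
For example, removing a whole directory tree and bypassing the trash (the path is a placeholder; with -skipTrash the data cannot be recovered from the trash):

    # recursively delete /path/old_data and do not move it to the trash
    hdfs dfs -rm -r -skipTrash /path/old_data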

cat

Read a file: hdfs dfs -cat <HDFS_FILE_PATH>
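
For large files it is usually better to pipe the output than to dump the whole file to the terminal; a small sketch (the path is a placeholder):

    # preview only the first 20 lines of a potentially large file
    hdfs dfs -cat /path/big_file.txt | head -n 20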

du

Show the size of a file or directory: hdfs dfs -du <HDFS_PATH>. Examples follow the option list below.

Related options:

  • -h : human-readable sizes
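
Two common forms, as a sketch (the path is a placeholder):

    # size of each item directly under /path, human-readable
    hdfs dfs -du -h /path

    # one summary line for the whole directory (-s)
    hdfs dfs -du -s -h /path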

other

  • hdfs dfs -expunge : empty the trash

FAQ

The difference between hadoop fs and hdfs dfs

difference between "hadoop fs" and "hdfs dfs" shell commands


  • According to the official documentation: The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports, such as Local FS, WebHDFS, S3 FS, and others.
  • bin/hadoop fs <args>
  • All FS shell commands take path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the Local FS the scheme is file. The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used. An HDFS file or directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply as /parent/child (given that your configuration is set to point to hdfs://namenodehost).
  • If the file system in use is HDFS, then hadoop fs is equivalent to hdfs dfs (see the examples below).
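
A small illustration of the URI forms described above (namenodehost and the paths are placeholders; the short form assumes fs.defaultFS points at hdfs://namenodehost):

    # the fully qualified URI and the short form refer to the same HDFS path
    hdfs dfs -ls hdfs://namenodehost/parent/child
    hdfs dfs -ls /parent/child

    # the FS shell also talks to other supported file systems, e.g. the local FS
    hadoop fs -ls file:///tmp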

Preventing compute engines from generating _SUCCESS / _metadata / _common_metadata files

When Hive or Spark saves files to HDFS, it also produces _SUCCESS, _metadata and _common_metadata files. If you do not want them (they serve no purpose here and can cause SQL errors), adjust the following configurations (a Hive session sketch follows the list):

  • mapreduce.fileoutputcommitter.marksuccessfuljobs=false : do not create the _SUCCESS file
  • parquet.enable.summary-metadata=false : do not create the _metadata and _common_metadata files
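
As a hedged sketch for the classic Hive CLI (whether the Parquet property takes effect depends on the Parquet writer in use), both properties can be overridden for a single session with --hiveconf:

    # start a Hive CLI session with both properties overridden for this session only
    hive --hiveconf mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
         --hiveconf parquet.enable.summary-metadata=false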

For Spark on YARN, when running the Spark Thrift Server the relevant configurations are (a launch sketch follows the list):

  • spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false
  • spark.hadoop.parquet.enable.summary-metadata=false (the Parquet setting is only needed before Spark 2.0; from 2.0 on, _metadata and _common_metadata are no longer written by default, see the related JIRA)
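
A sketch of how these could be passed when launching the Thrift Server (start-thriftserver.sh lives in Spark's sbin directory; master and deploy-mode options are omitted here):

    # forward the Hadoop-level properties through Spark's --conf mechanism
    $SPARK_HOME/sbin/start-thriftserver.sh \
      --conf spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
      --conf spark.hadoop.parquet.enable.summary-metadata=false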

If you set them through the SparkContext, the configuration is:

  • sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
  • sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

What does Non DFS Used mean?

Non DFS Used is computed as: Non DFS Used = Configured Capacity - Remaining Space - DFS Used

  • Since Configured Capacity = Total Disk Space - Reserved Space, this is equivalent to Non DFS Used = (Total Disk Space - Reserved Space) - Remaining Space - DFS Used
  • The reserved space is controlled by the dfs.datanode.du.reserved setting

Here is a concrete example:

Let's take an example. Assume I have a 100 GB disk, and I set the reserved space ( dfs.datanode.du.reserved ) to 30 GB.

On the disk, the system and other files take up 40 GB and DFS Used is 10 GB. If you run df -h, you will see that the available space on that disk volume is 50 GB.

In the HDFS web UI, it will show:

Non DFS Used = 100 GB (Total) - 30 GB (Reserved) - 10 GB (DFS Used) - 50 GB (Remaining) = 10 GB, which can also be read as 40 GB (non-DFS files) - 30 GB (Reserved).

So it actually means: you initially configured 30 GB to be reserved for non-DFS usage and 70 GB for HDFS. However, non-DFS usage has exceeded the 30 GB reservation and eaten up 10 GB of space that should belong to HDFS.

The term "Non DFS Used" would really be better named something like "how much of the configured DFS capacity is occupied by non-DFS use".

What the various HDFS block states mean

See here for the details. A quick way to check these states with hdfs fsck is sketched after the list.

  • Over-replicated blocks: These are blocks that exceed their target replication for the file they belong to. Normally, over-replication is not a problem, and HDFS will automatically delete excess replicas.
  • Under-replicated blocks: These are blocks that do not meet their target replication for the file they belong to. HDFS will automatically create new replicas of under-replicated blocks until they meet the target replication. You can get information about the blocks being replicated (or waiting to be replicated) using hdfs dfsadmin -metasave .
  • Misreplicated blocks: These are blocks that do not satisfy the block replica placement policy (see Replica Placement). For example, for a replication level of three in a multirack cluster, if all three replicas of a block are on the same rack, then the block is misreplicated because the replicas should be spread across at least two racks for resilience. HDFS will automatically re-replicate misreplicated blocks so that they satisfy the rack placement policy.
  • Corrupt blocks: These are blocks whose replicas are all corrupt. Blocks with at least one noncorrupt replica are not reported as corrupt; the namenode will replicate the noncorrupt replica until the target replication is met.
  • Missing replicas: These are blocks with no replicas anywhere in the cluster.
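
As a sketch, fsck reports most of these categories for a given path (the path is a placeholder; running it over the whole namespace can be heavy on large clusters):

    # block health summary for everything under /path: under-replicated,
    # mis-replicated, corrupt and missing blocks
    hdfs fsck /path

    # add per-file block and datanode location details
    hdfs fsck /path -files -blocks -locations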
