HDFS Introduction and Usage


cli

For the full command-line reference, see Hadoop - The File System (FS) shell.

ls

List the files under a directory.

  • hdfs dfs -ls /path : list what is under /path
  • hdfs dfs -ls -t -r /path : list by modification time in reverse order (-t sorts newest first, -r reverses it)
  • hdfs dfs -ls /path | awk '{ print $8 }' | xargs -I % echo 'hdfs dfs -du -s %' | sh : show the total size of each entry under the given path (a cleaner form is sketched after the option list below)

Related options:

  • -h : human-readable sizes
  • -r : reverse the sort order
  • -t : sort by modification time
  • -R : recurse into subdirectories all the way down to the files
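
As a rough sketch (the paths are placeholders), the options can be combined, and the size pipeline above can be written without the intermediate echo ... | sh:

    # list /path by modification time, oldest first, with human-readable sizes
    hdfs dfs -ls -t -r -h /path

    # total size of each entry under /path; $8 is the path column of -ls output
    hdfs dfs -ls /path | awk '{ print $8 }' | xargs -I{} hdfs dfs -du -s -h {}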

rm

Delete the specified file or directory (see the example after the option list).

Related options:

  • -r : delete a directory and its contents recursively
  • -skipTrash : bypass the trash and delete immediately
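
For example, removing a whole directory tree and bypassing the trash (the path is a placeholder; with -skipTrash the data cannot be recovered from the trash):

    # recursively delete /path/old_data and do not move it to the trash
    hdfs dfs -rm -r -skipTrash /path/old_data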

cat

Read a file: hdfs dfs -cat <HDFS_FILE_PATH>
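
For large files it is usually better to pipe the output than to dump the whole file to the terminal; a small sketch (the path is a placeholder):

    # preview only the first 20 lines of a potentially large file
    hdfs dfs -cat /path/big_file.txt | head -n 20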

du

Show the size of a file or directory: hdfs dfs -du <HDFS_PATH>. Examples follow the option list below.

Related options:

  • -h : human-readable sizes
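
Two common forms, as a sketch (the path is a placeholder):

    # size of each item directly under /path, human-readable
    hdfs dfs -du -h /path

    # one summary line for the whole directory (-s)
    hdfs dfs -du -s -h /path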

other

  • hdfs dfs -expunge : empty the trash

FAQ

The difference between hadoop fs and hdfs dfs

difference between "hadoop fs" and "hdfs dfs" shell commands


  • According to the official documentation: The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports, such as Local FS, WebHDFS, S3 FS, and others.
  • bin/hadoop fs <args>
  • All FS shell commands take path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the Local FS the scheme is file. The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used. An HDFS file or directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply as /parent/child (given that your configuration is set to point to hdfs://namenodehost).
  • If the file system in use is HDFS, then hadoop fs is equivalent to hdfs dfs (see the examples below).
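
A small illustration of the URI forms described above (namenodehost and the paths are placeholders; the short form assumes fs.defaultFS points at hdfs://namenodehost):

    # the fully qualified URI and the short form refer to the same HDFS path
    hdfs dfs -ls hdfs://namenodehost/parent/child
    hdfs dfs -ls /parent/child

    # the FS shell also talks to other supported file systems, e.g. the local FS
    hadoop fs -ls file:///tmp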

Preventing compute engines from generating _SUCCESS / _metadata / _common_metadata files

When Hive or Spark saves files to HDFS, it also produces _SUCCESS, _metadata and _common_metadata files. If you do not want them (they serve no purpose here and can cause SQL errors), adjust the following configurations (a Hive session sketch follows the list):

  • mapreduce.fileoutputcommitter.marksuccessfuljobs=false : do not create the _SUCCESS file
  • parquet.enable.summary-metadata=false : do not create the _metadata and _common_metadata files
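
As a hedged sketch for the classic Hive CLI (whether the Parquet property takes effect depends on the Parquet writer in use), both properties can be overridden for a single session with --hiveconf:

    # start a Hive CLI session with both properties overridden for this session only
    hive --hiveconf mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
         --hiveconf parquet.enable.summary-metadata=false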

For Spark on YARN, when running the Spark Thrift Server the relevant configurations are (a launch sketch follows the list):

  • spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false
  • spark.hadoop.parquet.enable.summary-metadata=false (the Parquet setting is only needed before Spark 2.0; from 2.0 on, _metadata and _common_metadata are no longer written by default, see the related JIRA)
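
A sketch of how these could be passed when launching the Thrift Server (start-thriftserver.sh lives in Spark's sbin directory; master and deploy-mode options are omitted here):

    # forward the Hadoop-level properties through Spark's --conf mechanism
    $SPARK_HOME/sbin/start-thriftserver.sh \
      --conf spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
      --conf spark.hadoop.parquet.enable.summary-metadata=false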

If you set them through the SparkContext, the configuration is:

  • sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
  • sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

What does Non DFS Used mean?

Non DFS Used is computed as: Non DFS Used = Configured Capacity - Remaining Space - DFS Used

  • Since Configured Capacity = Total Disk Space - Reserved Space, this is equivalent to Non DFS Used = (Total Disk Space - Reserved Space) - Remaining Space - DFS Used
  • The reserved space is controlled by the dfs.datanode.du.reserved setting

Here is a concrete example:

Let's take an example. Assume I have a 100 GB disk, and I set the reserved space ( dfs.datanode.du.reserved ) to 30 GB.

On the disk, the system and other files take up 40 GB and DFS Used is 10 GB. If you run df -h, you will see that the available space on that disk volume is 50 GB.

In the HDFS web UI, it will show:

Non DFS Used = 100 GB (Total) - 30 GB (Reserved) - 10 GB (DFS Used) - 50 GB (Remaining) = 10 GB, which can also be read as 40 GB (non-DFS files) - 30 GB (Reserved).

So it actually means: you initially configured 30 GB to be reserved for non-DFS usage and 70 GB for HDFS. However, non-DFS usage has exceeded the 30 GB reservation and eaten up 10 GB of space that should belong to HDFS.

The term "Non DFS Used" would really be better named something like "how much of the configured DFS capacity is occupied by non-DFS use".

What the various HDFS block states mean

See here for the details. A quick way to check these states with hdfs fsck is sketched after the list.

  • Over-replicated blocks: These are blocks that exceed their target replication for the file they belong to. Normally, over-replication is not a problem, and HDFS will automatically delete excess replicas.
  • Under-replicated blocks: These are blocks that do not meet their target replication for the file they belong to. HDFS will automatically create new replicas of under-replicated blocks until they meet the target replication. You can get information about the blocks being replicated (or waiting to be replicated) using hdfs dfsadmin -metasave .
  • Misreplicated blocks: These are blocks that do not satisfy the block replica placement policy (see Replica Placement). For example, for a replication level of three in a multirack cluster, if all three replicas of a block are on the same rack, then the block is misreplicated because the replicas should be spread across at least two racks for resilience. HDFS will automatically re-replicate misreplicated blocks so that they satisfy the rack placement policy.
  • Corrupt blocks: These are blocks whose replicas are all corrupt. Blocks with at least one noncorrupt replica are not reported as corrupt; the namenode will replicate the noncorrupt replica until the target replication is met.
  • Missing replicas: These are blocks with no replicas anywhere in the cluster.
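
As a sketch, fsck reports most of these categories for a given path (the path is a placeholder; running it over the whole namespace can be heavy on large clusters):

    # block health summary for everything under /path: under-replicated,
    # mis-replicated, corrupt and missing blocks
    hdfs fsck /path

    # add per-file block and datanode location details
    hdfs fsck /path -files -blocks -locations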
