HDFS says the file is still open, but the process writing to it was killed
I'm new to Hadoop and I've spent the past couple of hours trying to google this issue, but I couldn't find anything that helped. My problem is that HDFS says the file is still open, even though the process that was writing to it is long dead. This makes it impossible to read from the file.
I ran fsck on the directory and it reports that everything is healthy. However, when I run "hadoop fsck -fs hdfs://hadoop /logs/raw/directory_containing_file -openforwrite" I get:
Status: CORRUPT
Total size: 222506775716 B
Total dirs: 0
Total files: 630
Total blocks (validated): 3642 (avg. block size 61094666 B)
********************************
CORRUPT FILES: 1
MISSING BLOCKS: 1
MISSING SIZE: 30366208 B
********************************
Minimally replicated blocks: 3641 (99.97254 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.9991763
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 23
Number of racks: 1
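For reference, adding -files to the same command makes fsck print a line for each file it checks, so the specific file being held open can be picked out by filtering for the OPENFORWRITE marker. A minimal sketch, reusing the path from above:

hadoop fsck -fs hdfs://hadoop /logs/raw/directory_containing_file -files -openforwrite | grep OPENFORWRITE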
Running the fsck command again, this time on the file that is openforwrite, I get:
.Status: HEALTHY
Total size: 793208051 B
Total dirs: 0
Total files: 1
Total blocks (validated): 12 (avg. block size 66100670 B)
Minimally replicated blocks: 12 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 23
Number of racks: 1
Does anyone have any ideas what is going on and how I can fix it?
1 Answer
I figured out that the blocks seem to be missing because the namenode server was temporarily unavailable, which corrupted the filesystem for that file. It appeared that the part of the file without the missing blocks could still be read/copied. Some more information on dealing with corruption in HDFS is available at https://twiki.grid.iu.edu/bin/view/Storage/HadoopRecovery (mirror: http://www.webcitation.org/5xMTitU0r).
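Roughly, salvaging what is still readable and then clearing the corrupt entry comes down to something like the following (the file name and local destination are placeholders, not the actual paths involved):

# copy out whatever can still be read from the affected file
hadoop fs -copyToLocal /logs/raw/directory_containing_file/<affected_file> /tmp/salvaged_copy

# then either move the corrupt file into /lost+found on HDFS ...
hadoop fsck /logs/raw/directory_containing_file/<affected_file> -move

# ... or, once the salvaged copy has been checked, delete it outright
hadoop fsck /logs/raw/directory_containing_file/<affected_file> -delete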
Edit: It seems this problem was caused by Scribe (or, more specifically, the DFSClient used by Scribe) hanging when trying to write to HDFS. We manually patched the source of our Hadoop cluster with HADOOP-6099 and HDFS-278, rebuilt the binaries, and restarted the cluster with the new version. There have been no issues in the two months we have been running the new version.
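Note that later Hadoop releases ship a built-in way to force-release a stale lease on a file that is stuck open for write. Assuming a version that includes the hdfs debug tooling (roughly 2.7 and later), it looks something like this, again with a placeholder path:

# ask the namenode to recover the lease so the file can be read again
hdfs debug recoverLease -path /logs/raw/directory_containing_file/<affected_file> -retries 5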