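Recovery notes: yesterday's UPS maintenance ran too long, the server lost power, the disk failed its filesystem check, and the machine would not boot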

Posted on 2022-08-29 14:04:28

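Yesterday's UPS maintenance ran too long and the server lost power.

After it came back up, it ran normally for the first 10 minutes, and then problems appeared:
1) Database reads and writes were failing;
2) One of the disks in the RAID-1 array had its light off, and the alarm light was not flashing either. (We only saw this later, when someone went to the machine room; over the remote connection we had no way of knowing.)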

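After looking around for a while, I decided to reboot the machine. After the reboot it would not come up at all. -_-#!

From then on everything had to be done remotely.
I had the data center switch us onto the KVM and found the boot stuck at the point where CentOS mounts its disks: the boot-time filesystem check reported errors and failed, and it wanted the root password for a manual repair. The messages looked roughly like this: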

Your system appears to have shut down uncleanly
Press Y within 1 seconds to force file system integrity check...
Checking root filesystem

/dev/sda2:UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(I.E., without -a or -p options)

***An error occurred during the file system check.
***Dropping you to a shell; the system will reboot
***when you leave the shell.
Give root password for maintenance
(or type Control-D continue):
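
The boot scripts drop to this maintenance shell when the automatic check cannot finish the job on its own. As a rough sketch of how that outcome is reported, the snippet below decodes the exit status bits listed in the fsck(8) man page. The function names are made up for illustration, and treating bit 4 as the "manual repair needed" signal is an assumption about how a boot script might use the status, not a copy of the actual CentOS init script.

```python
from typing import List

# Exit status bits as documented in the fsck(8) man page.
FSCK_EXIT_BITS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}


def describe_fsck_status(status: int) -> List[str]:
    """Return a human-readable description for each bit set in the status."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in FSCK_EXIT_BITS.items() if status & bit]


def needs_manual_repair(status: int) -> bool:
    """Assumption: errors left uncorrected (bit 4) call for a manual fsck run."""
    return bool(status & 4)
```

For example, a status of 4 corresponds to the situation above, where the automatic pass gave up and asked for fsck to be run by hand.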

I typed the password and it was rejected, even though nobody had changed it.

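Next I searched Google and Baidu for ways to get in without the password:
1) Adding single to the kernel line in grub: single-user mode asked for the root password just the same.
2) Adding init=/bin/bash in grub: this did get into the system, but then it hung and stopped responding.
3) Booting from the CD in rescue mode, then editing /etc/passwd and running pwconv.
(Only at this point did I realize that the KVM was probably mishandling keyboard input: I typed df and got ddffffffffffffffffffffffffffff.)
In the end someone at the data center typed the root password by hand at the console, and we got into the maintenance shell.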

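Then I ran fsck /dev/sda2 to repair the root partition. After a long check that took perhaps half an hour, the machine finally rebooted successfully.
Once we were in the system, we pulled out the SCSI disk whose light was off and reseated it, and the data started syncing again. The disk repair log looked roughly like this: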

Here the console-output:
==================================================================
rescue:~# fsck /dev/hda3
fsck 1.27 (8-Mar-2002)
e2fsck 1.27 (8-Mar-2002)
/dev/hda3 was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Duplicate blocks found... invoking duplicate block passes.
Pass 1B: Rescan for duplicate/bad blocks
Duplicate/bad block(s) in inode 4784430: 9569668 9569669 9569670 9569671 9569672 9569673 9569674 9569675 9569676 9569677
Duplicate/bad block(s) in inode 4784807: 9569668 9569669 9569670 9569671 9569672 9569673 9569674 9569675 9569676 9569677
Pass 1C: Scan directories for inodes with dup blocks.
Pass 1D: Reconciling duplicate blocks
(There are 2 inodes containing duplicate/bad blocks.)

File ... (inode #4784807, mod time Wed Dec 29 23:20:59 2004)
has 10 duplicate block(s), shared with 1 file(s):
... (inode #4784430, mod time Wed Dec 29 23:20:40 2004)
Clone duplicate/bad blocks<y>? no

Delete file<y>? yes

File ... (inode #4784430, mod time Wed Dec 29 23:20:40 2004)
has 10 duplicate block(s), shared with 1 file(s):
... (inode #4784807, mod time Wed Dec 29 23:20:59 2004)
Duplicated blocks already reassigned or cloned.

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Unattached inode 4784430
Connect to /lost+found<y>? yes

Inode 4784430 ref count is 2, should be 1. Fix<y>? yes

Unattached zero-length inode 4784804. Clear<y>? yes

Unattached zero-length inode 4784806. Clear<y>? yes

Pass 5: Checking group summary information
Block bitmap differences: +5211533 +5218027 +5218036 +5219907 +5220360 +5221306 +5232339 +5232360 +5232511 +5232574 +5232990 +5234810 +(5238579--5238582) +5245529 +5245830 +5248928 +5249234 -(5355824--5355827) -(5364267--5364271) -(5366592--536659 -(5366685--5366689) -(5400399--5400405) -(5414077--5414083) -(5438621--5438626) -5440654 -(5440664--5440670) -(5440678--5440679) -(5440682--5440687) +(9569668--9569677)
Fix<y>? yes

Free blocks count wrong for group #163 (409, counted=430).
Fix<y>? yes

Free blocks count wrong for group #164 (2528, counted=2535).
Fix<y>? yes

Free blocks count wrong for group #165 (7575, counted=7589).
Fix<y>? yes

Free blocks count wrong for group #166 (31243, counted=31259).
Fix<y>? yes

Free blocks count wrong for group #292 (30999, counted=30989).
Fix<y>? yes

Free blocks count wrong (3942158, counted=3942196).
Fix<y>? yes

Inode bitmap differences: -66005 +4784430
Fix<y>? yes

Free inodes count wrong for group #4 (15249, counted=15250).
Fix<y>? yes

Free inodes count wrong for group #292 (15705, counted=15706).
Fix<y>? yes

Free inodes count wrong (4740607, counted=4740611).
Fix<y>? yes

/dev/hda3: ***** FILE SYSTEM WAS MODIFIED *****
/dev/hda3: 158205/4898816 files (2.1% non-contiguous), 5843397/9785593 blocks
==================================================================
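
A transcript like this is long, and it is easy to lose track of what fsck actually changed. The following is a minimal sketch, not a tool used during the incident, that tallies the yes/no answers in a saved transcript and lists the inodes reported as unattached, so you know what to look for in /lost+found afterwards. The function name and the SAMPLE text are illustrative; the patterns assume the prompt format shown in the log above.

```python
import re

# Matches interactive e2fsck prompts such as "Fix<y>? yes" or "Clone ...<y>? no".
PROMPT = re.compile(r"<y>\?\s*(yes|no)\b")
# Matches "Unattached inode N" but not "Unattached zero-length inode N".
UNATTACHED = re.compile(r"Unattached inode (\d+)")

SAMPLE = """\
Clone duplicate/bad blocks<y>? no

Delete file<y>? yes

Unattached inode 4784430
Connect to /lost+found<y>? yes

/dev/hda3: ***** FILE SYSTEM WAS MODIFIED *****
"""


def summarize_fsck_log(text):
    """Summarize a saved e2fsck transcript as a small dictionary."""
    answers = PROMPT.findall(text)
    return {
        "yes": answers.count("yes"),
        "no": answers.count("no"),
        "modified": "FILE SYSTEM WAS MODIFIED" in text,
        "unattached_inodes": UNATTACHED.findall(text),
    }


if __name__ == "__main__":
    print(summarize_fsck_log(SAMPLE))
```

Saving the console output to a file during the repair and running it through a summary like this makes it easier to write up afterwards which files were deleted or moved.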

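Here are the lessons learned; additions are welcome:
1) Log in with your passwords regularly to confirm they still work.
2) Find software that monitors hardware health and check the hardware regularly, the KVM included. You do not want to lose time on small problems like these in the middle of an outage.
3) If you find a problem after a power loss, do not reboot casually; it is best to fix the problem while the machine is still powered on.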
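
As one possible starting point for lesson 2, here is a small sketch that flags degraded arrays by reading /proc/mdstat. It assumes Linux software RAID (md); the server in this post most likely used a hardware RAID controller, which would need the vendor's own tool instead. The function names and the SAMPLE text are illustrative only.

```python
import re

ARRAY_LINE = re.compile(r"^(md\d+)\s*:")   # e.g. "md1 : active raid1 sda2[0]"
STATUS = re.compile(r"\[([U_]+)\]")        # e.g. "[UU]" healthy, "[U_]" degraded

SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      71577536 blocks [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose status shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_LINE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS.search(line)
        if current and status:
            if "_" in status.group(1):
                degraded.append(current)
            current = None  # status line seen; wait for the next array header
    return degraded


def check_host(path="/proc/mdstat"):
    """Read the live mdstat file (only exists on hosts using md RAID)."""
    with open(path) as handle:
        return degraded_arrays(handle.read())


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

Run from cron with an alert on a non-empty result, a check like this would have reported the dropped RAID-1 member without anyone having to visit the machine room.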
