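A Solaris 10 SMF Service Recovery Case
The SMF services on one of the customer's RAC nodes were damaged, and the node could no longer boot normally.
1. Boot the system from the ok prompt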
{2} ok
{2} ok boot
Resetting ...
Enabling system bus....... Done
Initializing CPUs......... Done
Initializing boot memory.. Done
Initializing OpenBoot
Probing system devices
Probing I/O buses
Probing system devices
Probing I/O buses
Sun Fire V490, No Keyboard
Copyright 2007 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.22.34, 8192 MB memory installed, Serial #72019670.
Ethernet address 0:14:4f:4a:ee:d6, Host ID: 844aeed6.
Rebooting with command: boot
Boot device: /pci@9,600000/SUNW,qlc@2/fp@0,0/disk@w2100001862805323,0:a File and args:
SunOS Release 5.10 Version Generic_118833-36 64-bit
Copyright 1983-2006 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Hostname: hhzdb2
Nov 3 10:39:39 svc.startd[8]: svc:/system/cluster/scmountdev:default: Method "/usr/cluster/lib/svc/method/scmountdev start" failed with exit status 1.
Nov 3 10:39:39 svc.startd[8]: svc:/system/cluster/scmountdev:default: Method "/usr/cluster/lib/svc/method/scmountdev start" failed with exit status 1.
Nov 3 10:39:39 svc.startd[8]: svc:/system/cluster/scmountdev:default: Method "/usr/cluster/lib/svc/method/scmountdev start" failed with exit status 1.
Nov 3 10:39:39 svc.startd[8]: system/cluster/scmountdev:default failed: transitioned to maintenance (see 'svcs -xv' for details)
Nov 3 10:39:43 svc.startd[8]: svc:/application/management/seaport:default: Method "/usr/lib/sma_snmp/setseaport" failed with exit status 1.
Nov 3 10:39:43 svc.startd[8]: svc:/application/management/seaport:default: Method "/usr/lib/sma_snmp/setseaport" failed with exit status 1.
Nov 3 10:39:43 svc.startd[8]: svc:/application/management/seaport:default: Method "/usr/lib/sma_snmp/setseaport" failed with exit status 1.
Nov 3 10:39:43 svc.startd[8]: application/management/seaport:default failed: transitioned to maintenance (see 'svcs -xv' for details)
Nov 3 10:39:45 svc.startd[8]: svc:/system/consadm:default: Method "/lib/svc/method/svc-consadm" failed with exit status 1.
Nov 3 10:39:45 svc.startd[8]: svc:/system/consadm:default: Method "/lib/svc/method/svc-consadm" failed with exit status 1.
Nov 3 10:39:45 svc.startd[8]: svc:/system/consadm:default: Method "/lib/svc/method/svc-consadm" failed with exit status 1.
……
Nov 3 10:39:46 svc.startd[8]: system/cluster/bootcluster:default failed: transitioned to maintenance (see 'svcs -xv' for details)
Nov 3 10:39:47 svc.startd[8]: svc:/system/cvc:default: Method "/lib/svc/method/svc-cvcd" failed with exit status 96.
Nov 3 10:39:47 svc.startd[8]: system/cvc:default misconfigured: transitioned to maintenance (see 'svcs -xv' for details)
checking ufs filesystems
/dev/md/rdsk/d240: is logging.
……
Nov 3 10:40:02 inetd[370]: Property exec for method inetd_start of instance svc:/network/rpc/rusers:default is invalid
Nov 3 10:40:02 inetd[370]: Invalid configuration for instance svc:/network/rpc/rusers:default, placing in maintenance
Nov 3 10:40:02 inetd[370]: Property exec for method inetd_start of instance svc:/network/rpc/spray:default is invalid
Nov 3 10:40:12 hhzdb2 svc.startd[8]: application/graphical-login/cde-login:default failed: transitioned to maintenance (see 'svcs -xv' for details)
Nov 3 10:40:12 hhzdb2 svc.startd[8]: svc:/application/management/common-agent-container-1:default: Method "/usr/lib/cacao/lib/tools/scripts/cacao_smf start default" failed with exit status 1.
Nov 3 10:40:12 hhzdb2 svc.startd[8]: application/management/common-agent-container-1:default failed: transitioned to maintenance (see 'svcs -xv' for details)
Nov 3 10:40:12 hhzdb2 last message repeated 2 times
Nov 3 10:40:12 hhzdb2 svc.startd[8]: application/management/common-agent-container-1:default failed: transitioned to maintenance (see 'svcs -xv' for details)
Nov 3 10:40:12 hhzdb2 svc.startd[8]: svc:/system/webconsole:console: Method "/lib/svc/method/svc-webconsole start" failed with exit status 95.
Nov 3 10:40:12 hhzdb2 svc.startd[8]: system/webconsole:console failed fatally: transitioned to maintenance (see 'svcs -xv' for details)
Nov 3 10:40:12 hhzdb2 svc.startd[8]: system/webconsole:console failed fatally: transitioned to maintenance (see 'svcs -xv' for details)
Nov 3 10:40:12 hhzdb2 svc.startd[8]: svc:/system/basicreg:default: Method "/usr/sbin/sconadm register -c -m autoreg" failed with exit status 1.
Nov 3 10:40:12 hhzdb2 svc.startd[8]: system/basicreg:default failed: transitioned to maintenance (see 'svcs -xv' for details)
Nov 3 10:40:12 hhzdb2 last message repeated 2 times
Nov 3 10:40:12 hhzdb2 svc.startd[8]: system/basicreg:default failed: transitioned to maintenance (see 'svcs -xv' for details)
INIT: Command is respawning too rapidly. Check for possible errors.
id: h1 "/etc/init.d/init.evmd run >/dev/null 2>&1 </dev/null"
INIT: hhzdb2 console login: Command is respawning too rapidly. Check for possible errors.
id: h2 "/etc/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null"
INIT: Command is respawning too rapidly. Check for possible errors.
id: h3 "/etc/init.d/init.crsd run >/dev/null 2>&1 </dev/null"
hhzdb2 console login: root
Password:
Nov 3 10:40:30 hhzdb2 login: ROOT LOGIN /dev/console
Last login: Tue Nov 3 09:18:05 on console
-sh: /bin/cat: not found
-sh: /bin/mail: not found
Sourcing //.profile-EIS.....
root@hhzdb2 #
Diagnosis showed that a large number of Solaris services failed to start and that manual intervention was required.
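To get an overview of which services are affected, the boot messages can be condensed into a list of instances. The following is a minimal sketch, assuming the console output has been saved to a file; the function name `maintenance_fmris` and the file name in the usage comment are hypothetical. It was written for a POSIX shell, so on Solaris 10 it may be safer to run it under /usr/xpg4/bin/sh or ksh than under the legacy /bin/sh.

```shell
#!/bin/sh
# Hypothetical helper: list the distinct service instances that svc.startd
# reported as "transitioned to maintenance".
# Reads log text on stdin and prints one instance name per line.
maintenance_fmris() {
    grep 'transitioned to maintenance' |
        # Drop everything up to and including the "svc.startd[PID]: " tag,
        # then drop everything after the instance name.
        sed -e 's/^.*svc\.startd\[[0-9]*\]: //' -e 's/ .*$//' |
        LC_ALL=C sort -u
}

# Usage (the file name is an assumption):
#   maintenance_fmris < console-boot.log
```

On the live system, the boot messages themselves point to `svcs -xv` for the details of each failed instance.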
III. Fault recovery
1. First attempt to restore the SMF service configuration
root@hhzdb2 # /lib/svc/bin/restore_repository
See http://sun.com/msg/SMF-8000-MY for more information on the use of
this script to restore backup copies of the smf(5) repository.
If there are any problems which need human intervention, this script will
give instructions and then exit back to your shell.
Note that upon full completion of this script, the system will be rebooted
using reboot(1M), which will interrupt any active services.
/lib/svc/bin/restore_repository: /bin/sed: not found
/lib/svc/bin/restore_repository: /bin/ls: not found
There are no available backups of /etc/svc/repository.db.
The only available repository is "-seed-". Note that restoring the seed
will lose all customizations, including those made by the system during
the installation and/or upgrade process.
Enter -seed- to restore from the seed, or -quit- to exit:
/lib/svc/bin/restore_repository: test: argument expected
root@hhzdb2 # ls /bin | grep ls
/bin: No such file or directory
root@hhzdb2 #
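On the first restore attempt, the script could not find any backups of the SMF repository. Diagnosis showed that /bin did not exist, which is why the script could not run /bin/sed and /bin/ls. On Solaris 10, /bin is a symbolic link to /usr/bin, so it can be recreated by hand.
2. Recreate the /bin link manually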
root@hhzdb2 # ln -s /usr/bin /bin
root@hhzdb2 # ls -atl | grep bin
lrwxrwxrwx 1 root root 8 Nov 3 10:52 bin -> /usr/bin
drwxr-xr-x 7 root bin 5632 Dec 4 2007 lib
drwxr-xr-x 2 root sys 1024 Dec 1 2007 sbin
root@hhzdb2 #
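The state of the link can also be checked from a script before and after the repair. This is a minimal sketch, not part of the original procedure; the function name `check_bin_link` and its output keywords are hypothetical. It uses only shell built-in tests, so it does not depend on anything under /bin. It takes the root directory as an argument so that it can be tried against an alternate root or a scratch directory.

```shell
#!/bin/sh
# Hypothetical helper: report the state of the "bin" entry under a root
# directory (default "/"). Prints one of:
#   ok          a symbolic link that resolves to a directory
#   dangling    a symbolic link whose target does not exist
#   not-a-link  present, but not a symbolic link
#   missing     nothing there at all
check_bin_link() {
    root=${1:-/}
    path=${root%/}/bin
    if [ -h "$path" ]; then
        if [ -d "$path" ]; then
            echo ok
        else
            echo dangling
        fi
    elif [ -e "$path" ]; then
        echo not-a-link
    else
        echo missing
    fi
}

# Usage:
#   check_bin_link        # checks /bin
#   check_bin_link /a     # checks /a/bin, e.g. a root file system mounted on /a
```

In the case described here the function would have printed `missing` before the `ln -s /usr/bin /bin` step and `ok` after it.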
3. Restore the SMF configuration and reboot the host
root@hhzdb2 # /lib/svc/bin/restore_repository
See http://sun.com/msg/SMF-8000-MY for more information on the use of
this script to restore backup copies of the smf(5) repository.
If there are any problems which need human intervention, this script will
give instructions and then exit back to your shell.
Note that upon full completion of this script, the system will be rebooted
using reboot(1M), which will interrupt any active services.
The following backups of /etc/svc/repository.db exist, from
oldest to newest:
manifest_import-20071130_235016
manifest_import-20071201_001428
manifest_import-20071201_012445
boot-20090816_131855
boot-20090831_094035
boot-20091103_091605
boot-20091103_103938
The backups are named based on their type and the time what they were taken.
Backups beginning with "boot" are made before the first change is made to
the repository after system boot. Backups beginning with "manifest_import"
are made after svc:/system/manifest-import:default finishes its processing.
The time of backup is given in YYYYMMDD_HHMMSS format.
Please enter either a specific backup repository from the above list to
restore it, or one of the following choices:
CHOICE ACTION
---------------- ----------------------------------------------
boot restore the most recent post-boot backup
manifest_import restore the most recent manifest_import backup
-seed- restore the initial starting repository (All
customizations will be lost, including those
made by the install/upgrade process.)
-quit- cancel script and quit
Enter response [boot]: boot-20090816_131855
After confirmation, the following steps will be taken:
svc.startd(1M) and svc.configd(1M) will be quiesced, if running.
/etc/svc/repository.db
-- renamed --> /etc/svc/repository.db_old_20091103_105407
/etc/svc/repository-boot-20090816_131855
-- copied --> /etc/svc/repository.db
and the system will be rebooted with reboot(1M).
Proceed [yes/no]? yes
Quiescing svc.startd(1M) and svc.configd(1M): done.
/etc/svc/repository.db
-- renamed --> /etc/svc/repository.db_old_20091103_105407
/etc/svc/repository-boot-20090816_131855
-- copied --> /etc/svc/repository.db
The backup repository has been successfully restored.
Rebooting in 5 seconds.
Nov 3 10:54:19 hhzdb2 reboot: rebooted by root
Nov 3 10:54:19 hhzdb2 rpcbind: rpcbind terminating on signal.
Nov 3 10:54:19 hhzdb2 syslogd: going down on signal 15
Nov 3 10:54:19 hhzdb2 rpcbind: rpcbind terminating on signal.
syncing file systems... done
rebooting...
Resetting ...
Software Reset
Enabling system bus....... Done
Initializing CPUs......... Done
Initializing boot memory.. Done
Initializing OpenBoot
Probing system devices
Probing I/O buses
Probing system devices
Probing I/O buses
Sun Fire V490, No Keyboard
Copyright 2007 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.22.34, 8192 MB memory installed, Serial #72019670.
Ethernet address 0:14:4f:4a:ee:d6, Host ID: 844aeed6.
Rebooting with command: boot
Boot device: /pci@9,600000/SUNW,qlc@2/fp@0,0/disk@w2100001862805323,0:a File and args:
SunOS Release 5.10 Version Generic_118833-36 64-bit
Copyright 1983-2006 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Hostname: hhzdb2
Nov 3 10:55:26 /usr/lib/snmp/snmpdx: can't open the file
Nov 3 10:55:26 /usr/lib/snmp/snmpdx: can't open the file
Loading smf(5) service descriptions: 1/1
Booting as part of a cluster
NOTICE: CMM: Node hhzdb1 (nodeid = 1) with votecount = 1 added.
NOTICE: CMM: Node hhzdb2 (nodeid = 2) with votecount = 1 added.
NOTICE: CMM: Quorum device 1 (/dev/did/rdsk/d4s2) added; votecount = 1, bitmask of nodes with configured paths = 0x3.
NOTICE: clcomm: Adapter ce3 constructed
NOTICE: clcomm: Path hhzdb2:ce3 - hhzdb1:ce3 being constructed
NOTICE: clcomm: Adapter ce2 constructed
NOTICE: clcomm: Path hhzdb2:ce2 - hhzdb1:ce2 being constructed
NOTICE: CMM: Node hhzdb2: attempting to join cluster.
NOTICE: clcomm: Path hhzdb2:ce3 - hhzdb1:ce3 being initiated
NOTICE: clcomm: Path hhzdb2:ce2 - hhzdb1:ce2 being initiated
NOTICE: CMM: Node hhzdb1 (nodeid: 1, incarnation #: 1253030763) has become reachable.
NOTICE: clcomm: Path hhzdb2:ce2 - hhzdb1:ce2 online
NOTICE: CMM: Cluster has reached quorum.
NOTICE: CMM: Node hhzdb1 (nodeid = 1) is up; new incarnation number = 1253030763.
NOTICE: CMM: Node hhzdb2 (nodeid = 2) is up; new incarnation number = 1257216936.
NOTICE: CMM: Cluster members: hhzdb1 hhzdb2.
NOTICE: CMM: node reconfiguration #2 completed.
NOTICE: CMM: Node hhzdb2: joined cluster.
ip: joining multicasts failed (1 on clprivnet0 - will use link layer broadcasts for multicast
checking ufs filesystems
NOTICE: clcomm: Path hhzdb2:ce3 - hhzdb1:ce3 online
/dev/md/rdsk/d240: is logging.
hhzdb2 console login: obtaining access to all attached disks
Nov 3 10:55:48 hhzdb2 sendmail[448]: My unqualified host name (hhzdb2) unknown; sleeping for retry
Nov 3 10:55:48 hhzdb2 sendmail[449]: My unqualified host name (hhzdb2) unknown; sleeping for retry
Nov 3 10:55:49 hhzdb2 Cluster.Framework: stdout: resetting scsi buses shared with non-cluster nodes
Nov 3 10:55:54 hhzdb2 xntpd[565]: xntpd 3-5.93e+sun 03/08/29 16:23:05 (1.4)
Nov 3 10:55:54 hhzdb2 xntpd[565]: tickadj = 5, tick = 10000, tvu_maxslew = 495, est. hz = 100
Nov 3 10:55:54 hhzdb2 xntpd[565]: using kernel phase-lock loop 0041, drift correction 0.00000
Nov 3 10:55:55 hhzdb2 xntpd[565]: using kernel phase-lock loop 0041, drift correction 43.32100
starting NetWorker daemons:
nsrexecd
nsrd
Nov 3 10:56:23 hhzdb2 root: Sun StorEdge(TM) Enterprise Backup server: (notice) started
Nov 3 10:56:24 hhzdb2 root: S99sneep:root: Chassis Serial not available from system eeprom
Nov 3 10:56:24 hhzdb2 root: S99sneep:root: Repair Chassis Serial with /opt/SUNWsneep/bin/sneep
Nov 3 10:56:31 hhzdb2 Cluster.scdpmd: The status of device: /dev/did/rdsk/d6s0 is set to MONITORED
Nov 3 10:56:31 hhzdb2 Cluster.scdpmd: The state of the path to device: /dev/did/rdsk/d6s0 has changed to OK
Nov 3 10:56:31 hhzdb2 Cluster.scdpmd: The status of device: /dev/did/rdsk/d7s0 is set to MONITORED
Nov 3 10:56:31 hhzdb2 Cluster.scdpmd: The state of the path to device: /dev/did/rdsk/d7s0 has changed to OK
Nov 3 10:56:31 hhzdb2 Cluster.scdpmd: The status of device: /dev/did/rdsk/d4s0 is set to MONITORED
Nov 3 10:56:31 hhzdb2 Cluster.scdpmd: The state of the path to device: /dev/did/rdsk/d4s0 has changed to OK
Nov 3 10:56:33 hhzdb2 root: Oracle Cluster Ready Services starting up automatically.
Nov 3 10:56:33 hhzdb2 Cluster.RGM.rgmd: CMM: Node hhzdb1 (nodeid: 1, incarnation #: 1253030901) has become reachable.
Nov 3 10:56:33 hhzdb2 Cluster.RGM.rgmd: CMM: Cluster has reached quorum.
Nov 3 10:56:33 hhzdb2 Cluster.RGM.rgmd: CMM: Node hhzdb1 (nodeid = 1) is up; new incarnation number = 1253030901.
Nov 3 10:56:33 hhzdb2 Cluster.RGM.rgmd: CMM: Node hhzdb2 (nodeid = 2) is up; new incarnation number = 1257216993.
Nov 3 10:56:33 hhzdb2 root: Oracle Cluster Ready Services waiting for SunCluster and UDLM to start.
Nov 3 10:56:34 hhzdb2 Cluster.RGM.rgmd: launching method <bin/rac_framework_boot> for resource <rac-framework-rs>, resource group <rac-rg>, timeout <900> seconds
Nov 3 10:56:40 hhzdb2 Cluster.RGM.rgmd: method <bin/rac_framework_boot> completed successfully for resource <rac-framework-rs>, resource group <rac-rg>, time used: 0% of timeout <900 seconds>
Nov 3 10:56:40 hhzdb2 Cluster.RGM.rgmd: launching method <bin/rac_framework_start> for resource <rac-framework-rs>, resource group <rac-rg>, timeout <600> seconds
Nov 3 10:56:40 hhzdb2 Cluster.OPS.UCMMD: CMM: Node hhzdb1 (nodeid: 1, incarnation #: 1253030963) has become reachable.
Nov 3 10:56:42 hhzdb2 ID[SUNWudlm.udlm]: Unix DLM version (2) and SUN Unix DLM library version (1): compatible.
Nov 3 10:56:42 hhzdb2 Cluster.OPS.UCMMD: CMM: Cluster has reached quorum.
Nov 3 10:56:42 hhzdb2 Cluster.OPS.UCMMD: CMM: Node hhzdb1 (nodeid = 1) is up; new incarnation number = 1253030963.
Nov 3 10:56:42 hhzdb2 Cluster.OPS.UCMMD: CMM: Node hhzdb2 (nodeid = 2) is up; new incarnation number = 1257217000.
Nov 3 10:56:52 hhzdb2 Cluster.RGM.rgmd: method <bin/rac_framework_start> completed successfully for resource <rac-framework-rs>, resource group <rac-rg>, time used: 2% of timeout <600 seconds>
Nov 3 10:56:52 hhzdb2 Cluster.RGM.rgmd: launching method <bin/rac_framework_monitor_start> for resource <rac-framework-rs>, resource group <rac-rg>, timeout <3600> seconds
Nov 3 10:56:52 hhzdb2 Cluster.RGM.rgmd: launching method <bin/rac_svm_start> for resource <rac-svm-rs>, resource group <rac-rg>, timeout <600> seconds
Nov 3 10:56:52 hhzdb2 Cluster.RGM.rgmd: launching method <bin/rac_udlm_start> for resource <rac-udlm-rs>, resource group <rac-rg>, timeout <600> seconds
Nov 3 10:56:52 hhzdb2 Cluster.RGM.rgmd: method <bin/rac_framework_monitor_start> completed successfully for resource <rac-framework-rs>, resource group <rac-rg>, time used: 0% of timeout <3600 seconds>
Nov 3 10:56:53 hhzdb2 Cluster.RGM.rgmd: method <bin/rac_svm_start> completed successfully for resource <rac-svm-rs>, resource group <rac-rg>, time used: 0% of timeout <600 seconds>
Nov 3 10:56:53 hhzdb2 Cluster.RGM.rgmd: launching method <bin/rac_svm_monitor_start> for resource <rac-svm-rs>, resource group <rac-rg>, timeout <600> seconds
Nov 3 10:56:53 hhzdb2 Cluster.RGM.rgmd: method <bin/rac_udlm_start> completed successfully for resource <rac-udlm-rs>, resource group <rac-rg>, time used: 0% of timeout <600 seconds>
Nov 3 10:56:53 hhzdb2 Cluster.RGM.rgmd: launching method <bin/rac_udlm_monitor_start> for resource <rac-udlm-rs>, resource group <rac-rg>, timeout <600> seconds
Nov 3 10:56:53 hhzdb2 Cluster.RGM.rgmd: method <bin/rac_svm_monitor_start> completed successfully for resource <rac-svm-rs>, resource group <rac-rg>, time used: 0% of timeout <600 seconds>
Nov 3 10:56:53 hhzdb2 Cluster.RGM.rgmd: method <bin/rac_udlm_monitor_start> completed successfully for resource <rac-udlm-rs>, resource group <rac-rg>, time used: 0% of timeout <600 seconds>
hhzdb2 console login: root
Password:
Nov 3 10:57:14 hhzdb2 login: ROOT LOGIN /dev/console
Last login: Tue Nov 3 10:45:52 on console
Sun Microsystems Inc. SunOS 5.10 Generic January 2005
You have new mail.
Sourcing //.profile-EIS.....
root@hhzdb2 #
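The transcript above shows that the backups live next to the repository as /etc/svc/repository-<name>, and that the timestamp in each name uses the YYYYMMDD_HHMMSS format. In this case the oldest boot backup was entered instead of accepting the default; the reason is not stated, but presumably a copy from before the fault was wanted. The following minimal sketch lists the backups and picks the newest one taken before a given day. The function names are hypothetical, and it only reads file names; the restore itself is still done with /lib/svc/bin/restore_repository. It was written for a POSIX shell (on Solaris 10, /usr/xpg4/bin/sh or ksh rather than the legacy /bin/sh).

```shell
#!/bin/sh
# Hypothetical helper: print the backups of one kind ("boot" or
# "manifest_import") found in a directory, oldest first.
# The YYYYMMDD_HHMMSS suffix sorts chronologically as plain text.
list_repo_backups() {
    dir=${1:-/etc/svc}
    kind=${2:-boot}
    for f in "$dir"/repository-"$kind"-*; do
        [ -f "$f" ] || continue          # skips the unmatched pattern itself
        name=${f##*/}                    # strip the directory part
        echo "${name#repository-}"       # e.g. boot-20090816_131855
    done | LC_ALL=C sort
}

# Hypothetical helper: print the newest backup of the given kind that was
# taken before the cutoff day (YYYYMMDD). Prints nothing if there is none.
newest_backup_before() {
    picked=
    for n in $(list_repo_backups "$1" "$2"); do
        stamp=${n#"$2"-}                 # YYYYMMDD_HHMMSS
        day=${stamp%%_*}                 # YYYYMMDD
        if [ "$day" -lt "$3" ]; then
            picked=$n                    # later entries overwrite earlier ones
        fi
    done
    if [ -n "$picked" ]; then
        echo "$picked"
    fi
}

# Usage:
#   list_repo_backups /etc/svc boot
#   newest_backup_before /etc/svc boot 20091103
```

The name printed by `newest_backup_before` is in the same form that the script accepts at its `Enter response [boot]:` prompt.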
Could the missing /bin symbolic link be what caused the SMF services to fail at startup?
Thanks.
A very good summary.
/lib/svc/bin/restore_repository is the highlight.
up !
Thanks, 风箱.
Real hands-on experience, definitely worth bookmarking.
root@hhzdb2 # scstat | more
------------------------------------------------------------------
-- Cluster Nodes --
                  Node name    Status
                  ---------    ------
  Cluster node:   hhzdb1       Online
  Cluster node:   hhzdb2       Online
------------------------------------------------------------------
-- Cluster Transport Paths --
                    Endpoint      Endpoint      Status
                    --------      --------      ------
  Transport path:   hhzdb1:ce3    hhzdb2:ce3    Path online
  Transport path:   hhzdb1:ce2    hhzdb2:ce2    Path online
------------------------------------------------------------------
-- Quorum Summary --
  Quorum votes possible: 3
  Quorum votes needed: 2
  Quorum votes present: 3
-- Quorum Votes by Node --
                Node Name    Present  Possible  Status
                ---------    -------  --------  ------
  Node votes:   hhzdb1       1        1         Online
  Node votes:   hhzdb2       1        1         Online
-- Quorum Votes by Device --
                  Device Name           Present  Possible  Status
                  -----------           -------  --------  ------
  Device votes:   /dev/did/rdsk/d4s2    1        1         Online
------------------------------------------------------------------
-- Device Group Servers --
                          Device Group  Primary  Secondary
                          ------------  -------  ---------
  Device group servers:   hhdbset       hhzdb1   hhzdb2
  Device group servers:   rmt/2         -        -
  Device group servers:   rmt/1         -        -
-- Device Group Status --
                         Device Group  Status
                         ------------  ------
  Device group status:   hhdbset       Online
  Device group status:   rmt/2         Offline
  Device group status:   rmt/1         Offline
-- Multi-owner Device Groups --
                         Device Group  Online Status
                         ------------  -------------
------------------------------------------------------------------
-- Resource Groups and Resources --
               Group Name  Resources
               ----------  ---------
  Resources:   rac-rg      rac-framework-rs rac-udlm-rs rac-svm-rs
-- Resource Groups --
           Group Name  Node Name  State
           ----------  ---------  -----
  Group:   rac-rg      hhzdb1     Online
  Group:   rac-rg      hhzdb2     Online
-- Resources --
              Resource Name      Node Name  State   Status Message
              -------------      ---------  -----   --------------
  Resource:   rac-framework-rs   hhzdb1     Online  Online
  Resource:   rac-framework-rs   hhzdb2     Online  Online
  Resource:   rac-udlm-rs        hhzdb1     Online  Online
  Resource:   rac-udlm-rs        hhzdb2     Online  Online
  Resource:   rac-svm-rs         hhzdb1     Online  Online
  Resource:   rac-svm-rs         hhzdb2     Online  Online
-- IPMP Groups --
                Node Name  Group     Status  Adapter  Status
                ---------  -----     ------  -------  ------
  IPMP Group:   hhzdb1     sc_ipmp1  Online  ce1      Online
  IPMP Group:   hhzdb1     sc_ipmp0  Online  ce0      Online
  IPMP Group:   hhzdb2     sc_ipmp1  Online  ce1      Online
  IPMP Group:   hhzdb2     sc_ipmp0  Online  ce0      Online
------------------------------------------------------------------
root@hhzdb2 #
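The node check at the top of the scstat output can be automated after a recovery like this one. The following is a minimal sketch, assuming scstat is run in the C locale and prints node lines of the form `Cluster node: <name> <status>`; the function name `offline_nodes` is hypothetical, and the exact English wording of the output is an assumption that should be verified on the system.

```shell
#!/bin/sh
# Hypothetical helper: read scstat node status text on stdin and print the
# name of every cluster node whose status is anything other than "Online".
# Assumes lines of the form "Cluster node: <name> <status>".
offline_nodes() {
    awk '$1 == "Cluster" && $2 == "node:" && $4 != "Online" { print $3 }'
}

# Usage (the C locale is forced so that the status words are in English):
#   LC_ALL=C scstat -n | offline_nodes
```

An empty result means that every listed node reported Online, which matches the state of hhzdb1 and hhzdb2 in the output above.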
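Once the SMF services were restored, the system booted normally and Sun Cluster ran normally.
The attachment is a presentation on Solaris SMF administration.
Thanks!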