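Help needed: how to handle logical volumes (LUs) on a SUN 6540 that are not on their preferred path
Last edited by xgene on 2010-10-31 21:44
The environment is a two-node cluster made up of two M4000 servers, one Sun 6540 array, and two SAN switches, running Solaris 10 + Sun Cluster 3.2 + Sybase 12.5.
Today the customer called to say that several logical volumes on the 6540 are no longer on their preferred path, controller B; they have all failed over to controller A. Moving them back to controller B by hand works, but they drift back to controller A on their own.
According to the customer's own checks, the 6540 array has no hardware fault, and the HBAs in both M4000s and the SAN switches are fine as well.
The checks below were run on M4000-db1; M4000-db2 does not show these errors.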
root@M4000-db1 # luxadm -e dump_map /devices/pci@fd,600000/SUNW,qlc@1/fp@0,0:devctl
Pos Port_ID Hard_Addr Port WWN Node WWN Type
0 20100 0 2100001b32051006 2000001b32051006 0x1f (Unknown Type)
1 20200 0 2100001b3205a005 2000001b3205a005 0x1f (Unknown Type)
2 20300 0 2100001b3205cd06 2000001b3205cd06 0x1f (Unknown Type)
3 20400 0 201500a0b8325b3e 200400a0b8325b3e Failed to get the type.
4 20500 0 500e09e00b512300 500e09e00b512300 0x8 (Medium changer device)
5 20600 0 500e09e00b512320 500110a001022156 0x1 (Tape device)
6 20700 0 500e09e00b512310 500110a001021026 0x1 (Tape device)
7 20800 0 2101001b32255e05 2001001b32255e05 0x1f (Unknown Type)
8 20900 0 2101001b32251006 2001001b32251006 0x1f (Unknown Type)
9 20a00 0 2101001b3225a005 2001001b3225a005 0x1f (Unknown Type)
10 20b00 0 2101001b3225cd06 2001001b3225cd06 0x1f (Unknown Type)
11 20f00 0 200500a0b819af36 200400a0b819af34 0x0 (Disk device)
12 20000 0 2100001b32055e05 2000001b32055e05 0x1f (Unknown Type,Host Bus Adapter)
Sep 25 16:52:45 M4000-db1 fctl: [ID 517869 kern.warning] WARNING: fp(2)::GPN_ID for D_ID=20800 failed
Sep 25 16:52:45 M4000-db1 fctl: [ID 517869 kern.warning] WARNING: fp(2)::N_x Port with D_ID=20800, PWWN=2101001b32255e05 disappeared from fabric
Sep 25 16:52:52 M4000-db1 qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(3): Link ONLINE
Sep 25 16:52:53 M4000-db1 fctl: [ID 517869 kern.warning] WARNING: fp(2)::N_x Port with D_ID=20800, PWWN=2101001b32255e05 reappeared in fabric
Sep 25 16:54:37 M4000-db1 fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to 20400 failed state=Timeout, reason=Hardware Error
Sep 25 16:54:37 M4000-db1 fctl: [ID 517869 kern.warning] WARNING: fp(1):LOGI to 20400 failed. state=c reason=1.
Sep 25 16:54:37 M4000-db1 scsi: [ID 243001 kern.info] /pci@fd,600000/SUNW,qlc@1,1/fp@0,0 (fcp1):
Sep 25 16:54:37 M4000-db1 Lun=0 for target=20f00 disappeared
Sep 25 16:54:37 M4000-db1 scsi: [ID 243001 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0 (fcp1):
Sep 25 16:54:37 M4000-db1 FCP: target=20f00 reported NO Luns
^Croot@M4000-db1 # tail -f messages
Sep 25 16:52:52 M4000-db1 qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(3): Link ONLINE
Sep 25 16:52:53 M4000-db1 fctl: [ID 517869 kern.warning] WARNING: fp(2)::N_x Port with D_ID=20800, PWWN=2101001b32255e05 reappeared in fabric
Sep 25 16:54:37 M4000-db1 fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to 20400 failed state=Timeout, reason=Hardware Error
Sep 25 16:54:37 M4000-db1 fctl: [ID 517869 kern.warning] WARNING: fp(1):LOGI to 20400 failed. state=c reason=1.
Sep 25 16:54:37 M4000-db1 scsi: [ID 243001 kern.info] /pci@fd,600000/SUNW,qlc@1,1/fp@0,0 (fcp1):
Sep 25 16:54:37 M4000-db1 Lun=0 for target=20f00 disappeared
Sep 25 16:54:37 M4000-db1 scsi: [ID 243001 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0 (fcp1):
Sep 25 16:54:37 M4000-db1 FCP: target=20f00 reported NO Luns
Sep 25 17:00:29 M4000-db1 fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to 20400 failed state=Timeout, reason=Hardware Error
Sep 25 17:00:29 M4000-db1 fctl: [ID 517869 kern.warning] WARNING: fp(1):LOGI to 20400 failed. state=c reason=1.
Sep 25 17:03:11 M4000-db1 fp: [ID 517869 kern.info] NOTICE: fp(2): PLOGI to 20400 failed state=Timeout, reason=Hardware Error
Sep 25 17:03:11 M4000-db1 fctl: [ID 517869 kern.warning] WARNING: fp(2):LOGI to 20400 failed. state=c reason=1.
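The customer says the machines have not been rebooted recently and nothing has been changed. Could it be that M4000-db1 is not seeing the array paths correctly, and would rediscovering them bring things back to normal?
If I first switch the resources over to M4000-db2, then shut down M4000-db1 and boot it with boot -r so that it rediscovers the array paths, would that work?
I would appreciate any pointers from the Solaris experts. Many thanks!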
Comments (7)
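Provided the physical links and the switch zoning are fine, go to the ok prompt and run probe-scsi-all, then boot -r. Once the operating system is up, use cfgadm -al to check whether the states look normal, and then run devfsadm -Cv.
Thanks! The cause is probably close to what you inferred.
Node 2 does have the LUNs that belong to controller B, and cfgadm -al shows unusable entries there too. It looks like the links need a thorough check. But the beginning of the month is the busiest time for this system, so the customer will only agree to the check in two or three days.
Does your node 2 have the LUNs that belong to controller B? From the errors so far, node 1 has not established a connection with the controller B port. If the host-side HBAs are fine, then the problem is on the array side: the fibre, the port, and so on.
OK, thanks.
We cannot shut down right now; we will wait until these busy days are over.
Then we will shut down and check everything thoroughly.
Thanks!
The application is currently running normally. There is light at the host HBA ports and at the SAN switch ports; the only symptom is an alarm on the array.
Below is the error captured from CAM: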
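The inference above (node 1 cannot log in to one array port) can be cross-checked by matching the D_ID in the PLOGI failures against the Port_ID column of the luxadm -e dump_map output. The following is only a minimal sketch: the sample rows are copied from the original post, while the function names are made up for illustration. It shows which port WWN the failing logins point at; it does not decide which controller owns that port.

```python
import re
from collections import Counter

# Rows copied from the luxadm -e dump_map output in the original post.
DUMP_MAP = """\
3 20400 0 201500a0b8325b3e 200400a0b8325b3e Failed to get the type.
7 20800 0 2101001b32255e05 2001001b32255e05 0x1f (Unknown Type)
11 20f00 0 200500a0b819af36 200400a0b819af34 0x0 (Disk device)
"""

# PLOGI lines copied from the messages file in the original post.
MESSAGES = """\
Sep 25 16:54:37 M4000-db1 fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to 20400 failed state=Timeout, reason=Hardware Error
Sep 25 17:00:29 M4000-db1 fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to 20400 failed state=Timeout, reason=Hardware Error
Sep 25 17:03:11 M4000-db1 fp: [ID 517869 kern.info] NOTICE: fp(2): PLOGI to 20400 failed state=Timeout, reason=Hardware Error
"""

PLOGI_RE = re.compile(r"fp\((\d+)\): PLOGI to ([0-9a-fA-F]+) failed")


def port_wwn_by_id(dump_map):
    """Map Port_ID -> Port WWN (columns: Pos Port_ID Hard_Addr Port_WWN ...)."""
    table = {}
    for line in dump_map.splitlines():
        fields = line.split()
        # Data rows start with a numeric position; the header row does not.
        if len(fields) >= 4 and fields[0].isdigit():
            table[fields[1].lower()] = fields[3].lower()
    return table


def plogi_failures(messages):
    """Count PLOGI failures per (fp instance, destination D_ID)."""
    counts = Counter()
    for line in messages.splitlines():
        match = PLOGI_RE.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2).lower())] += 1
    return counts


def summarize(dump_map, messages):
    """Return sorted (fp instance, D_ID, port WWN, failure count) tuples."""
    wwns = port_wwn_by_id(dump_map)
    return sorted(
        (inst, d_id, wwns.get(d_id, "unknown"), count)
        for (inst, d_id), count in plogi_failures(messages).items()
    )


if __name__ == "__main__":
    for inst, d_id, wwn, count in summarize(DUMP_MAP, MESSAGES):
        print(f"fp({inst}) -> D_ID {d_id} (port WWN {wwn}): {count} PLOGI failure(s)")
```

With the posted data, every failed login targets D_ID 20400, which dump_map lists as port WWN 201500a0b8325b3e, the same entry for which luxadm reported "Failed to get the type."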
Severity: Major
Date: 10/01/2010 02:15:47
State: Open
Acknowledged By:
Auto Clear Yes
Description: The following volume(s) are not managed by their preferred controller: zx zxbackup zxsybase zxdata
Info:
Device: ShiYou_Infor_center_STK6540
Component: zx
Event Code: 63.66.1010
Aggregated Count: 0
Probable Cause
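One or more volumes are currently not managed by their preferred controller.
Possible causes:
* A manually started diagnostic test on the controller failed, which caused the controller to be placed offline.
* The controller was manually placed offline.
* A cable is disconnected or has failed.
* A hub or fabric switch is not working properly.
* A host adapter has failed.
* The storage array contains a failed RAID controller.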
Recommended Action
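Check for a failed controller; if there is one, replace it.
Check for an offline controller; if there is one, manually place it online.
Check for failed cables; replace or reseat them as needed.
Check the SAN connections, including cables, switches, and HBAs.
After the problem is fixed, the volumes need to be redistributed.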
Below is the output of cfgadm -l on the server:
c0 scsi-bus connected configured unknown
c3 fc-fabric connected configured unknown
c4 fc-fabric connected configured unknown
c5 scsi-bus connected configured unknown
c6 fc-fabric connected configured unknown
c7 fc-fabric connected configured unknown
Below is the output of cfgadm -al:
c3 fc-fabric connected configured unknown
c3::200500a0b819af36 disk connected configured unusable
c3::201500a0b8325b3e unavailable connected configured unusable
c3::2100001b32051006 unknown connected unconfigured unknown
c3::2100001b3205a005 unknown connected unconfigured unknown
c3::2100001b3205cd06 unknown connected unconfigured unknown
c3::2101001b32251006 unknown connected unconfigured unknown
c3::2101001b32255e05 unknown connected unconfigured unknown
c3::2101001b3225a005 unknown connected unconfigured unknown
c3::2101001b3225cd06 unknown connected unconfigured unknown
c3::500e09e00b512300 med-changer connected configured unknown
c3::500e09e00b512310 tape connected configured unknown
c3::500e09e00b512320 tape connected configured unknown
c4 fc-fabric connected configured unknown
c4::200500a0b819af36 disk connected unconfigured unknown
c4::201500a0b8325b3e unavailable connected configured failed
c4::2100001b32051006 unknown connected unconfigured unknown
c4::2100001b32055e05 unknown connected unconfigured unknown
c4::2100001b3205a005 unknown connected unconfigured unknown
c4::2100001b3205cd06 unknown connected unconfigured unknown
c4::2101001b32251006 unknown connected unconfigured unknown
c4::2101001b3225a005 unknown connected unconfigured unknown
c4::2101001b3225cd06 unknown connected unconfigured unknown
c4::500e09e00b512300 med-changer connected configured unknown
c4::500e09e00b512310 tape connected configured unknown
c4::500e09e00b512320 tape connected configured unknown
c6 fc-fabric connected configured unknown
c6::200400a0b819af36 disk connected configured unusable
c6::201400a0b8325b3e disk connected configured unknown
c6::2100001b3205ab05 unknown connected unconfigured unknown
c6::2100001b3205ad05 unknown connected unconfigured unknown
c6::2100001b3205ba06 unknown connected unconfigured unknown
c6::2101001b32257d05 unknown connected unconfigured unknown
c6::2101001b3225ab05 unknown connected unconfigured unknown
c6::2101001b3225ad05 unknown connected unconfigured unknown
c6::2101001b3225ba06 unknown connected unconfigured unknown
c7 fc-fabric connected configured unknown
c7::200400a0b819af36 disk connected unconfigured unknown
c7::201400a0b8325b3e disk connected configured unknown
c7::2100001b32057d05 unknown connected unconfigured unknown
c7::2100001b3205ab05 unknown connected unconfigured unknown
c7::2100001b3205ad05 unknown connected unconfigured unknown
c7::2100001b3205ba06 unknown connected unconfigured unknown
c7::2101001b3225ab05 unknown connected unconfigured unknown
c7::2101001b3225ad05 unknown connected unconfigured unknown
c7::2101001b3225ba06 unknown connected unconfigured unknown
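To make the cfgadm -al listing above easier to scan, here is a small sketch that pulls out only the attachment points whose condition column reads unusable or failed. The sample rows are a subset copied from the output above; the function name and the choice of which conditions to flag are assumptions made for illustration.

```python
# A subset of the rows from the cfgadm -al output above.
CFGADM_AL = """\
c3 fc-fabric connected configured unknown
c3::200500a0b819af36 disk connected configured unusable
c3::201500a0b8325b3e unavailable connected configured unusable
c3::2100001b32051006 unknown connected unconfigured unknown
c4::200500a0b819af36 disk connected unconfigured unknown
c4::201500a0b8325b3e unavailable connected configured failed
c6::200400a0b819af36 disk connected configured unusable
c6::201400a0b8325b3e disk connected configured unknown
c7::200400a0b819af36 disk connected unconfigured unknown
c7::201400a0b8325b3e disk connected configured unknown
"""

# Assumption: these are the condition values worth flagging for follow-up.
BAD_CONDITIONS = {"unusable", "failed"}


def suspect_targets(text):
    """Return (Ap_Id, type, condition) for rows whose condition is flagged."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Expected columns: Ap_Id Type Receptacle Occupant Condition.
        # Skip the controller rows themselves (no "::" in the Ap_Id).
        if len(fields) != 5 or "::" not in fields[0]:
            continue
        ap_id, dev_type, _receptacle, _occupant, condition = fields
        if condition in BAD_CONDITIONS:
            rows.append((ap_id, dev_type, condition))
    return rows


if __name__ == "__main__":
    for ap_id, dev_type, condition in suspect_targets(CFGADM_AL):
        print(f"{ap_id:<24} {dev_type:<12} {condition}")
```

On the posted data this flags the 201500a0b8325b3e port on both c3 and c4, plus one unusable entry each for 200500a0b819af36 on c3 and 200400a0b819af36 on c6, while the 201400a0b8325b3e entries on c6 and c7 are not flagged.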
Sep 25 16:54:37 M4000-db1 fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to 20400 failed state=Timeout, reason=Hardware Error
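So the HBA never logged in to the switch properly? In that case the two corresponding HBAs should not be able to see the device at all.
Failing over to controller A means that I/O through controller B keeps failing, or that I/O never comes down the paths to controller B at all (for example, the HBAs connected to controller B are all misbehaving), so the array can only move I/O to controller A to keep it flowing.
None of the information you captured is particularly useful. Does the application work normally after the cluster is moved to controller B? Rule out every possibility before you act; otherwise, even if you switch back to controller A, it will just switch over again. The best approach would be to stop all resource groups, manually switch to controller A, and then start all resource groups again. Personally, I think you should connect with the CAM software and look at the error log to find out why the switch happened.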