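Linux AS3.0 Cluster HA: Oracle service not working, help needed!
1. Hardware: 2 x HP rx4640 (2 x 1.3 GHz CPUs, 4 GB RAM), 1 x HP MSA1000 array with four 146 GB disks in RAID 5
2. System: Red Hat Linux AS3.0 + Cluster HA + Oracle 9020 for Linux IA64
(The operating system and the Oracle software are on the local disks; the Oracle data files in /oradata are on the array.)
Current problems: 1. The two-node cluster can be configured, but when the oracle service comes up the database does not start automatically; only the listener starts automatically. Why? Is my oracle service script wrong?
2. At noon today the database shut itself down for no apparent reason. The cluster log shows that the cluster oracle service was restarted automatically, but the database could not come back up on its own. Could anyone with experience take a look?
The scripts are as follows: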
oraserver.sh
----------------------
#!/bin/sh
# description: Oracle auto start-stop script.
# chkconfig: - 20 80
#
# Set ORA_HOME to be equivalent to the $ORACLE_HOME
# from which you wish to execute dbstart and dbshut;
#
# Set ORA_OWNER to the user id of the owner of the
# Oracle database in ORA_HOME.
ORA_HOME=/oracle/product/9.2.0.4.0
ORA_OWNER=oracle
if [ ! -f $ORA_HOME/bin/dbstart ]
then
echo "Oracle startup: cannot start"
exit
fi
case "$1" in
'start')
# Start the Oracle databases:
# The following command assumes that the oracle login
# will not prompt the user for any values
su - $ORA_OWNER -c "$ORA_HOME/bin/lsnrctl start"
sh /oracle/dbstart
;;
'stop')
# Stop the Oracle databases:
# The following command assumes that the oracle login
# will not prompt the user for any values
sh /oracle/dbshut
su - $ORA_OWNER -c "$ORA_HOME/bin/lsnrctl stop"
;;
esac
---------------
dbstart.sh
---------------
su - oracle <<EOF
sqlplus /nolog
connect SYS/change_on_install as SYSDBA
shutdown immediate
exit
----------
dbshut
---------
su - oracle <<EOF
sqlplus /nolog
connect SYS/change_on_install as SYSDBA
shutdown immediate
exit
-----------
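Is there anything wrong with these scripts? Will they work?
3. I am also posting the host's /var/log/messages so that everyone can help analyze it. Could my cpu0 be faulty? It keeps reporting errors.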
Aug 31 12:19:51 localhost last message repeated 15 times
Aug 31 12:31:09 localhost kernel: oracle(27266): floating-point assist fault at ip 4000000004174822
Aug 31 12:31:09 localhost last message repeated 3 times
Aug 31 12:33:33 localhost kernel: oracle(1544): floating-point assist fault at ip 4000000004174822
Aug 31 12:33:33 localhost last message repeated 3 times
Aug 31 12:50:41 localhost modprobe: modprobe: Can't locate module
Aug 31 12:50:41 localhost clusvcmgrd: [24892]: <err> service error: IP address 172.17.116.175 missing
Aug 31 12:50:41 localhost clusvcmgrd: [24892]: <err> service error: : error fetching interface information: Device not found
Aug 31 12:50:41 localhost clusvcmgrd: [24892]: <err> service error: Check status failed on IP addresses for oracle
Aug 31 12:50:41 localhost clusvcmgrd[24891]: <warning> Restarting locally failed service oracle
Aug 31 12:50:42 localhost clusvcmgrd: [24990]: <notice> service notice: Stopping service oracle ...
Aug 31 12:50:42 localhost clusvcmgrd: [24990]: <notice> service notice: Running user script '/oracle/dbserver stop'
Aug 31 12:50:42 localhost su(pam_unix)[25020]: session opened for user oracle by (uid=0)
Aug 31 12:54:45 localhost su(pam_unix)[25020]: session closed for user oracle
Aug 31 12:54:45 localhost su(pam_unix)[25256]: session opened for user oracle by (uid=0)
Aug 31 12:54:46 localhost su(pam_unix)[25256]: session closed for user oracle
Aug 31 12:54:46 localhost modprobe: modprobe: Can't locate module
Aug 31 12:54:47 localhost clusvcmgrd: [24990]: <notice> service notice: Stopped service oracle ...
Aug 31 12:54:47 localhost clusvcmgrd[24891]: <notice> Starting stopped service oracle
Aug 31 12:54:47 localhost clusvcmgrd: [25426]: <notice> service notice: Starting service oracle ...
Aug 31 12:54:47 localhost kernel: kjournald starting. Commit interval 5 seconds
Aug 31 12:54:47 localhost kernel: EXT3 FS 2.4-0.9.19, 19 August 2002 on sd(8,34), internal journal
Aug 31 12:54:47 localhost kernel: EXT3-fs: mounted filesystem with ordered data mode.
Aug 31 12:54:47 localhost clusvcmgrd: [25426]: <notice> service notice: Running user script '/oracle/dbserver start'
Aug 31 12:54:47 localhost su(pam_unix)[25621]: session opened for user oracle by (uid=0)
Aug 31 12:54:47 localhost su(pam_unix)[25621]: session closed for user oracle
Aug 31 12:54:47 localhost clusvcmgrd: [25426]: <notice> service notice: Started service oracle ...
Aug 31 13:04:56 localhost kernel: mca: CPU 0 SAL log contains CPE error record
Aug 31 13:12:27 localhost kernel: mca: CPU 0 SAL log contains CPE error record
Aug 31 13:16:12 localhost kernel: mca: CPU 0 SAL log contains CPE error record
Aug 31 13:18:12 localhost kernel: mca: CPU 0 SAL log contains CPE error record
Aug 31 13:18:46 localhost login(pam_unix)[24488]: session opened for user root by (uid=0)
Aug 31 13:18:46 localhost -- root[24488]: ROOT LOGIN ON pts/2 FROM 172.17.113.200
Aug 31 13:18:50 localhost su(pam_unix)[24768]: session opened for user oracle by root(uid=0)
Aug 31 13:19:13 localhost su(pam_unix)[24768]: session closed for user oracle
Aug 31 13:19:45 localhost su(pam_unix)[26043]: session opened for user oracle by root(uid=0)
Aug 31 13:20:12 localhost kernel: mca: CPU 0 SAL log contains CPE error record
Aug 31 13:20:24 localhost kernel: oracle(26511): floating-point assist fault at ip 40000000041743e2
Aug 31 13:20:24 localhost kernel: oracle(26511): floating-point assist fault at ip 4000000004174822
Aug 31 13:20:24 localhost kernel: oracle(26511): floating-point assist fault at ip 40000000041743e2
Aug 31 13:20:24 localhost kernel: oracle(26511): floating-point assist fault at ip 4000000004174822
Aug 31 13:20:30 localhost kernel: oracle(26511): floating-point assist fault at ip 40000000041743e2
Aug 31 13:20:30 localhost kernel: oracle(26511): floating-point assist fault at ip 4000000004174822
Aug 31 13:20:30 localhost kernel: oracle(26511): floating-point assist fault at ip 40000000041743e2
Aug 31 13:20:30 localhost kernel: oracle(26511): floating-point assist fault at ip 4000000004174822
Aug 31 13:21:09 localhost kernel: oracle(26506): floating-point assist fault at ip 4000000004174822
Aug 31 13:21:09 localhost last message repeated 3 times
Aug 31 13:21:17 localhost su(pam_unix)[26043]: session closed for user oracle
Aug 31 13:21:29 localhost su(pam_unix)[28276]: session opened for user oracle by root(uid=0)
Aug 31 13:22:12 localhost kernel: mca: CPU 0 SAL log contains CPE error record
Aug 31 13:22:36 localhost kernel: oracle(26504): floating-point assist fault at ip 40000000041743e2
Aug 31 13:22:36 localhost kernel: oracle(26504): floating-point assist fault at ip 400000000418a0b1
Aug 31 13:22:36 localhost kernel: oracle(26504): floating-point assist fault at ip 4000000004b3d261
Aug 31 13:22:36 localhost kernel: oracle(26504): floating-point assist fault at ip 40000000041743e2
Aug 31 13:22:43 localhost su(pam_unix)[28276]: session closed for user oracle
Aug 31 13:23:39 localhost su(pam_unix)[31027]: session opened for user oracle by root(uid=0)
Aug 31 13:23:46 localhost kernel: oracle(2650: floating-point assist fault at ip 4000000004174822
Aug 31 13:23:46 localhost last message repeated 3 times
Aug 31 13:24:12 localhost kernel: mca: CPU 0 SAL log contains CPE error record
Aug 31 13:26:12 localhost kernel: mca: CPU 0 SAL log contains CPE error record
The .175 IP in the log is my service IP. The database went down right at the moment it was reported as missing!
What is the cause?
There is one in the pinned thread:
http://bbs.chinaunix.net/viewthr ... &extra=page%3D1
Does anyone know of other HA software, something other than Red Hat Linux Cluster HA?
/oracle/dbstart
------------------------------
sqlplus /nolog <<eof
conn / as sysdba
startup
eof
-----------------------------
/oracle/dbshut
------------------------------
sqlplus /nolog <<eof
conn / as sysdba
shutdown immediate
eof
-----------------------------
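1. Run /oracle/dbstart and /oracle/dbshut as the oracle user and confirm that the two scripts start and stop the database correctly.
2. As root, run su - oracle -c "/oracle/dbstart" and su - oracle -c "/oracle/dbshut" to check whether the Oracle environment variables are set correctly.
3. Add su - oracle -c "/oracle/dbstart" and su - oracle -c "/oracle/dbshut" to the cluster script and test whether it works in the cluster.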
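For steps 1 to 3 it helps to have an objective check that the instance really came up or went down, rather than trusting the script's exit. A minimal sketch, assuming the usual ora_pmon_<SID> background process name; the function name instance_running and the SID "orcl" are placeholders that do not come from this thread:

```shell
#!/bin/sh
# Hypothetical helper: decide whether an Oracle instance looks alive by
# searching a process list for its pmon background process (ora_pmon_<SID>).
# The process list is read from stdin, so the check can be tried without
# a live database.
instance_running() {
    # $1 = instance name (SID); "orcl" in the usage below is only a placeholder
    grep -q "^ora_pmon_$1 *\$"
}

# Usage on a real node, after running /oracle/dbstart or /oracle/dbshut:
#   ps -eo args | instance_running orcl && echo "instance up" || echo "instance down"
```

Running it after each of the three steps shows at which step the database stops coming up.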
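Ahem....
Your dbstart.sh script is wrong, isn't it? It shuts the database down instead of starting it.
Many thanks to everyone above for the patient answers!
In that case, how should I write my oracle service script?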
This is caused by the /oracle/dbstart and /oracle/dbshut scripts failing to start the database.
PS: a small question.
Why not use the same mechanism as
su - $ORA_OWNER -c "$ORA_HOME/bin/lsnrctl stop"
to start the database?
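One way the service script could look if the database is started and stopped through the same su - mechanism as the listener. This is only a sketch under assumptions: the paths are the ones posted in this thread, /oracle/dbstart and /oracle/dbshut are assumed to be the sqlplus heredoc versions posted above and to be executable by oracle, and RUN is a hypothetical dry-run hook that is not part of any Oracle or cluster tool.

```shell
#!/bin/sh
# Sketch of a cluster user script. RUN is a hypothetical hook:
# set RUN=echo to print the commands instead of executing them.
ORA_HOME=/oracle/product/9.2.0.4.0
ORA_OWNER=oracle
RUN=${RUN:-}

oracle_service() {
    case "$1" in
    start)
        # Listener first, then the database, both through the oracle login
        # shell so that ORACLE_HOME, ORACLE_SID and PATH come from its profile.
        $RUN su - "$ORA_OWNER" -c "$ORA_HOME/bin/lsnrctl start"
        $RUN su - "$ORA_OWNER" -c /oracle/dbstart
        ;;
    stop)
        # Reverse order on the way down.
        $RUN su - "$ORA_OWNER" -c /oracle/dbshut
        $RUN su - "$ORA_OWNER" -c "$ORA_HOME/bin/lsnrctl stop"
        ;;
    esac
    # Any other action is ignored, as in the original script.
}

# Dispatch only when an action was passed on the command line.
if [ $# -gt 0 ]; then
    oracle_service "$1"
fi
```

With RUN=echo the function only prints the four su commands, which makes it easy to review the order before letting the cluster run it for real.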
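1. Run your script by hand and watch the log with tail in another console. I suspect that sh /oracle/dbstart fails either because of the permissions on the script at that point, or because the environment the script runs in leaves the startup conditions unmet.
Debugging this kind of failover Oracle cluster follows basic methods and procedures. Your problem is probably not a big one. Debug it carefully: build a checklist and verify the items one by one.
2. CPU errors like these clearly mean that a race condition was triggered; this is a 2.4 kernel bug. /proc contains the salinfo handling code, and the kernel in the OS release you are running has a flaw that produces a race condition on Itanium 2 servers. There are three ways to fix it.
a) Apply a kernel patch for salinfo. This needs a real specialist; without kernel patching and debugging experience it is very hard to pull off.
b) Upgrade only the kernel on both nodes to the latest version (currently U8).
c) Upgrade Red Hat on both nodes to the latest Update (currently Update 8).
Personally, if circumstances allow, I would go with option c.
Unfortunately, all three fixes require the person doing the work to have enough experience and skill with OS-level problems in a cluster environment, especially if your cluster is in production and cannot be down for long.
Also, back up your data and system before you start, so that you can roll back.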
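For point 1, the two suspected causes (script permissions and missing environment variables) can be turned into a small checklist item. A rough sketch; the function name check_startup_prereqs is made up for illustration, and it assumes the script is invoked directly (so it must be executable) and that sqlplus needs ORACLE_HOME and ORACLE_SID:

```shell
#!/bin/sh
# Hypothetical pre-flight check: reports the first unmet condition,
# or "ok" when the script is executable and both variables are set.
check_startup_prereqs() {
    script=$1
    [ -x "$script" ] || { echo "not executable: $script"; return 1; }
    [ -n "${ORACLE_HOME:-}" ] || { echo "ORACLE_HOME is not set"; return 1; }
    [ -n "${ORACLE_SID:-}" ] || { echo "ORACLE_SID is not set"; return 1; }
    echo "ok"
}

# Usage, run in the same environment the cluster would use
# (/tmp/precheck.sh is a placeholder for wherever this function is saved):
#   su - oracle -c '. /tmp/precheck.sh; check_startup_prereqs /oracle/dbstart'
```

Running it through su - oracle matters, because that is the environment the service script gets, not your interactive root shell.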
Good Luck,
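[ Last edited by nntp on 2006-9-1 01:28 ]
You posted in the wrong board. Doesn't the Linux section have a cluster board? I have moved your thread from System Administration to here. If it had stayed there, you probably would not have found an answer easily.
[ Last edited by nntp on 2006-9-1 01:22 ]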