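After node 1 rebooted, why didn't node 1's resources fail over to node 2?
Symptom:
A customer asked the following: while node 1 was rebooting, monitoring never showed node 1's resources failing over to node 2, as shown below: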
[Oracle@rac2 ~]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.rac.db     application    ONLINE    ONLINE    rac2
ora....c1.inst application    OFFLINE   OFFLINE
ora....c2.inst application    ONLINE    ONLINE    rac2
ora....SM1.asm application    ONLINE    OFFLINE
ora....C1.lsnr application    OFFLINE   OFFLINE
ora....ac1.gsd application    OFFLINE   OFFLINE
ora....ac1.ons application    OFFLINE   OFFLINE
ora....ac1.vip application    OFFLINE   OFFLINE
ora....SM2.asm application    ONLINE    ONLINE    rac2
ora....C2.lsnr application    ONLINE    ONLINE    rac2
ora....ac2.gsd application    ONLINE    ONLINE    rac2
ora....ac2.ons application    ONLINE    ONLINE    rac2
ora....ac2.vip application    ONLINE    ONLINE    rac2
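The customer's view was that in a high-availability system such as RAC, when a node goes down or is interrupted, the resources running on it should naturally come up on the other node;
otherwise, as in the output above, part of the application's business would be interrupted.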
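Analysis:
This is actually a very basic question. Resources fall into two kinds: local and global.
Local resources include instance, asm, lsnr, gsd and ons; these can only run on their own node.
The VIP is a global resource: when a node failure prevents the VIP from running on that node, it fails over to a surviving node and continues to provide service.
Given that, it is normal that gsd, ons, lsnr, asm and the instance did not fail over when node 1 rebooted.
But what about the VIP? While node 1 was rebooting, the VIP should have failed over to node 2, so why didn't it?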
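The local/global split described above can be applied mechanically to `crs_stat -t` output. The following is only a minimal sketch: the function names (`parse_crs_stat`, `scope`, `offline_vips`) are hypothetical, the classification by resource-name suffix is an assumption taken from the list in this note, and the sample text is a subset of the output pasted earlier.

```python
# Suffixes treated as node-local, following the classification in this note.
# This mapping is an assumption for illustration, not an Oracle-defined list.
LOCAL_SUFFIXES = ("inst", "asm", "lsnr", "gsd", "ons")

SAMPLE = """\
Name Type Target State Host
------------------------------------------------------------
ora.rac.db application ONLINE ONLINE rac2
ora....c1.inst application OFFLINE OFFLINE
ora....SM1.asm application ONLINE OFFLINE
ora....ac1.vip application OFFLINE OFFLINE
ora....ac2.vip application ONLINE ONLINE rac2
"""


def parse_crs_stat(text):
    """Parse `crs_stat -t` style output into a list of dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Resource rows start with "ora." and carry name, type, target, state.
        if len(parts) < 4 or not parts[0].startswith("ora."):
            continue
        rows.append({
            "name": parts[0],
            "target": parts[2],
            "state": parts[3],
            # The Host column is empty when the resource is offline.
            "host": parts[4] if len(parts) > 4 else None,
        })
    return rows


def scope(name):
    """Classify a resource as 'local', 'global' (VIP) or 'other' by suffix."""
    suffix = name.rsplit(".", 1)[-1]
    if suffix == "vip":
        return "global"
    if suffix in LOCAL_SUFFIXES:
        return "local"
    return "other"


def offline_vips(rows):
    """Return names of VIP resources that are not running on any node."""
    return [r["name"] for r in rows
            if scope(r["name"]) == "global" and r["state"] == "OFFLINE"]


if __name__ == "__main__":
    for name in offline_vips(parse_crs_stat(SAMPLE)):
        print("VIP not running anywhere:", name)
```

An offline local resource on a rebooting node is expected; an offline VIP is the one entry that calls for a look at the logs.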
Continuing with the relevant logs:
crsd.log
------------
2013-10-21 10:14:25.608: [ CRSRES][1495542080] Attempting to stop `ora.rac1.vip` on member `rac1`
2013-10-21 10:14:26.628: [ CRSRES][1495542080] Stop of `ora.rac1.vip` on member `rac1` succeeded.
ocssd.log
---------------
[ CSSD]2013-10-21 10:06:03.987 [1332435264] >TRACE: clssgmReconfigThread: completed for reconfig(277552174), with status(1)
[ CSSD]2013-10-21 10:06:04.632 [1269496128] >TRACE: clssgmCommonAddMember: clsomon joined (1/0x1000000/#CSS_CLSSOMON)
[ CSSD]2013-10-21 10:28:25.946 >USER: Oracle Database 10g CSS Release 11.1.0.6.0 Production Copyright 1996, 2004 Oracle. All rights reserved.
[ CSSD]2013-10-21 10:28:25.946 >USER: CSS daemon log for node rac1, number 1, in cluster rac_cluster
[ clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_CSSD))
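The logs show that node 1's VIP was stopped manually before the node was rebooted, and that is the cause: manually stopping a VIP does not trigger VIP failover, because CRS treats it as a normal maintenance operation. This matches the crs_stat output above, where ora....ac1.vip shows a Target of OFFLINE.
CRS performs the failover only when it detects a failure on node 1 (for example a NIC failure or a public IP network failure).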
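To look for this kind of VIP stop in a crsd.log, a small scan like the one below may help. It is a sketch under stated assumptions: the regular expression is modelled only on the two CRSRES lines quoted in this note, the function name `vip_stops` is hypothetical, and a matching line only shows that a stop was recorded, not who issued it or why.

```python
import re

# Pattern modelled on the two crsd.log lines quoted in this note (an assumption;
# other releases may format these messages differently). Either a backtick or an
# apostrophe is accepted around the resource and node names.
STOP_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"\[\s*CRSRES\]\[\d+\]\s*"
    r"(?P<event>Attempting to stop|Stop of) "
    r"[`'](?P<res>ora\.[\w.]+\.vip)[`'] on member [`'](?P<node>\w+)[`']"
)

SAMPLE_LOG = """\
2013-10-21 10:14:25.608: [ CRSRES][1495542080] Attempting to stop `ora.rac1.vip` on member `rac1`
2013-10-21 10:14:26.628: [ CRSRES][1495542080] Stop of `ora.rac1.vip` on member `rac1` succeeded.
"""


def vip_stops(log_text):
    """Return (timestamp, event, resource, node) for each VIP stop line found."""
    hits = []
    for line in log_text.splitlines():
        m = STOP_RE.match(line.strip())
        if m:
            hits.append((m.group("ts"), m.group("event"),
                         m.group("res"), m.group("node")))
    return hits


if __name__ == "__main__":
    for ts, event, res, node in vip_stops(SAMPLE_LOG):
        print(f"{ts}  {event}  {res}  on {node}")
```

Comparing the timestamps it reports with the node restart time recorded in ocssd.log shows whether the stop came before the reboot, as it did in this case.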