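On Oracle releases earlier than 11g, a corrupted control file produces a very obvious ORA-600 error when the database is mounted, so the corruption is easy to recognize. On Oracle 11g R2 the symptom is much less obvious.
The case described here was an Oracle 11g RAC system. When the problem occurred, the database instance could be started in nomount, but mounting the control file produced the following error: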
ORA-3113 "end of file on communication channel"
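The whole sqlplus connection was then terminated and had to be re-established. We know that when the mount stage cannot complete, the cause is usually corruption in the control file itself. For a professional, though, stopping at that assumption is clearly not good enough, so further analysis is needed, starting with the ASM alert log: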
Tue Mar 27 13:35:11 2012
NOTE: client PROD1:PROD registered, osid 6726, mbr 0x1
Tue Mar 27 13:35:24 2012
NOTE: ASM client PROD1:PROD disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_6726.trc
Tue Mar 27 13:40:35 2012
NOTE: client PROD1:PROD registered, osid 7477, mbr 0x1
Tue Mar 27 13:41:45 2012
NOTE: ASM client PROD1:PROD disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_7477.trc
Tue Mar 27 13:41:47 2012
NOTE: client PROD1:PROD registered, osid 7736, mbr 0x1
Tue Mar 27 13:42:01 2012
NOTE: ASM client PROD1:PROD disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_7736.tr
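The register/disconnect pattern in the log above can be tallied with a short script. This is only a minimal sketch: the regular expressions are derived from the message wording shown in this log, the sample text is one attempt copied from it, and the function name summarize is hypothetical.

```python
import re

# One register/disconnect cycle copied from the ASM alert log above.
ASM_LOG = """\
Tue Mar 27 13:40:35 2012
NOTE: client PROD1:PROD registered, osid 7477, mbr 0x1
Tue Mar 27 13:41:45 2012
NOTE: ASM client PROD1:PROD disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_7477.trc
"""

# Patterns assume the exact wording seen in this log; other releases may differ.
REGISTERED = re.compile(r"NOTE: client (\S+) registered, osid (\d+)")
DISCONNECTED = re.compile(r"NOTE: ASM client (\S+) disconnected unexpectedly")
TRACE = re.compile(r"Trace records dumped in trace file (\S+)")


def summarize(log_text):
    """Pair each unexpected disconnect with the last registered osid and its trace file."""
    events = []
    last_osid = {}   # client name -> most recently registered osid
    current = None   # disconnect event still waiting for its trace file line
    for line in log_text.splitlines():
        m = REGISTERED.search(line)
        if m:
            last_osid[m.group(1)] = int(m.group(2))
            continue
        m = DISCONNECTED.search(line)
        if m:
            client = m.group(1)
            current = {"client": client, "osid": last_osid.get(client), "trace": None}
            events.append(current)
            continue
        m = TRACE.search(line)
        if m and current is not None:
            current["trace"] = m.group(1)
    return events


for event in summarize(ASM_LOG):
    print(event["client"], event["osid"], event["trace"])
```

Each event returned corresponds to one database process that registered with ASM and then dropped its connection, together with the ASM-side trace file written for it.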
The generated trace files show only the following information:
2012-03-27 13:41:08.022438 :802EEFE8:KFNS:kfn.c@702:kfnDispatch(): calling server stub for KFNOP=5
2012-03-27 13:41:13.027006 :802EF0F4:KFNU:kfns.c@1924:kfnsBackground(): kfnsBackground completed in 5 seconds (KFNPM=0)
2012-03-27 13:41:13.027012 :802EF0F5:KFNS:kfn.c@729:kfnDispatch(): completed KFNOP=5
2012-03-27 13:41:13.027122 :802EF0F6:KFNS:kfn.c@702:kfnDispatch(): calling server stub for KFNOP=5
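This is clearly of little use for the problem at hand, and the cause is most likely on the database side.
So we checked the database instance's alert log. The entries written while executing alter database mount were as follows: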
Tue Mar 27 11:42:01 2012
alter database mount
This instance was first to mount
Tue Mar 27 11:42:01 2012
note: loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so
NOTE: Loaded library: System
Tue Mar 27 11:42:01 2012
SUCCESS: diskgroup PRODDATA was mounted
Tue Mar 27 11:42:01 2012
NOTE: dependency between database PROD and diskgroup resource ora.PRODDATA.dg is established
USER (ospid: 26774): terminating the instance
Tue Mar 27 11:42:07 2012
System state dump requested by (instance=1, osid=26774), summary=[abnormal instance termination].
System State dumped to trace file /d01/oracle/11.2.0/admin/PROD1_db01/diag/rdbms/prod/PROD1/trace/PROD1_diag_26656.trc
Dumping diagnostic data in directory=[cdmp_20120327114207], requested by (instance=1, osid=26774), summary=[abnormal instance termination].
Instance terminated by USER, pid = 26774
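The log still gives no clear indication, and the trace file /d01/oracle/11.2.0/admin/PROD1_db01/diag/rdbms/prod/PROD1/trace/PROD1_diag_26656.trc contains no obvious information either.
Only after tracing the mount with event 10046 did a more detailed hint appear: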
alter session set events='10046 trace name context forever,level 12';
Trace file /d01/oracle/11.2.0/admin/PROD1_db01/diag/rdbms/prod/PROD1/trace/PROD1_ora_7764.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /d01/oracle/11.2.0
System name: Linux
Node name: db01.clc.com
Release: 2.6.18-238.el5
Version: #1 SMP Sun Dec 19 14:22:44 EST 2010
Machine: x86_64
Instance name: PROD1
Redo thread mounted by this instance: 0
Oracle process number: 31
Unix process pid: 7764, image: oracle@db01.clc.com (TNS V1-V3)
*** 2012-03-27 13:41:55.101
*** SESSION ID:(1751.3) 2012-03-27 13:41:55.101
*** CLIENT ID:() 2012-03-27 13:41:55.101
*** SERVICE NAME:() 2012-03-27 13:41:55.101
*** MODULE NAME:(oraagent.bin@db01.clc.com (TNS V1-V3)) 2012-03-27 13:41:55.101
*** ACTION NAME:() 2012-03-27 13:41:55.101
*** 2012-03-27 13:41:55.101
Submitting synchronized dump request [268435460]. summary=[Controlfile header dump (kccpbsc)].
*** 2012-03-27 13:41:57.102
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+461<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+53<-ksuitm()+1325<-kccpb_sanity_check()+341<-kccbmp_get()+309<-kccsed_rbl()+111<-kccocx()+1154<-kccocf()+136<-kcfcmb()+1025<-kcfmdb()+54<-adbdrv()+63122<-opiexe()+18173<-opiosq0()+3993<-kpooprx()+274
<-kpoal8()+800<-opiodr()+910<-ttcpip()+2289<-opitsk()+1670
----- End of Abridged Call Stack Trace -----
*** 2012-03-27 13:41:57.141
USER (ospid: 7764): terminating the instance
ksuitm: waiting up to [5] seconds before killing DIAG(7652)
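As the trace above shows, in particular the controlfile header dump request (kccpbsc) and the kccpb_sanity_check() frame in the call stack, a sequence number mismatch in the control file caused its consistency check to fail, so the database could not be mounted.
With that the problem is clear: the database can be opened after the control file is repaired or re-created.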
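To make the reading of the call stack concrete, here is a minimal sketch that splits the abridged stack into frames and reports which function called ksuitm(), the routine named in the "ksuitm: waiting up to [5] seconds before killing DIAG" line. The helper names parse_frames and caller_of are hypothetical, and treating the kcc prefix as control file code is an assumption based on the "Controlfile header dump (kccpbsc)" message in the same trace.

```python
import re

# Abridged call stack copied from the 10046 trace above (its two lines joined).
STACK = (
    "ksedsts()+461<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+53"
    "<-ksuitm()+1325<-kccpb_sanity_check()+341<-kccbmp_get()+309"
    "<-kccsed_rbl()+111<-kccocx()+1154<-kccocf()+136<-kcfcmb()+1025"
    "<-kcfmdb()+54<-adbdrv()+63122<-opiexe()+18173<-opiosq0()+3993"
    "<-kpooprx()+274<-kpoal8()+800<-opiodr()+910<-ttcpip()+2289<-opitsk()+1670"
)

# A frame looks like name()+offset; frames are separated by "<-".
FRAME = re.compile(r"(\w+)\(\)\+\d+")


def parse_frames(stack):
    """Return function names in stack order, innermost frame first."""
    return FRAME.findall(stack)


def caller_of(frames, name):
    """Return the frame that called `name`, or None if absent or outermost."""
    if name not in frames:
        return None
    i = frames.index(name)
    return frames[i + 1] if i + 1 < len(frames) else None


frames = parse_frames(STACK)
# Which function asked for the instance to be terminated?
print("ksuitm called from:", caller_of(frames, "ksuitm"))
# Assumption: frames starting with "kcc" belong to the control file layer.
print("kcc frames:", [f for f in frames if f.startswith("kcc")])
```

Run against this stack, the caller of ksuitm() is kccpb_sanity_check(), which is the frame that points at a failed control file check rather than at ASM or the network.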