1、故障现象
Extract进程(SEXTR01)状态为running,但是Lag at Chkpt却达到5个多小时,且时间一直在增长,根本就不抽取新日志,状态信息如下:
GGSCI (caXXadgdb) 21> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING DPEYWGL 00:00:00 00:00:04
EXTRACT RUNNING SEXTR01 05:14:58 00:00:03
2、故障分析
2.1 debug一下进程状态信息
GGSCI (caXXadgdb) 23> info sextr01, showch debug
EXTRACT SEXTR01 Last Started 2015-01-25 22:51 Status RUNNING
Checkpoint Lag 05:21:26 (updated 00:00:04 ago)
Log Read Checkpoint Oracle Redo Logs
2015-01-25 18:05:16 Thread 1, Seqno 5554, RBA 611373536
SCN 3126.136656462 (13426204423758)
Log Read Checkpoint Oracle Redo Logs
1988-01-01 00:00:00 Thread 2, Seqno 4016, RBA 1024
SCN 44527.110828288 (191242119617280)
Current Checkpoint Detail:
Read Checkpoint #1
……
Read Checkpoint #2
Oracle Threaded Redo Log
Startup Checkpoint (starting position in the data source):
Thread #: 2
Sequence #: 4015
RBA: 640984080
Timestamp: 2015-01-25 18:03:51.000000
SCN: 3126.136602577 (13426204369873)
Redo File: +DGROUP1/caXXdb/onlinelog/group_15.267.799673791
Recovery Checkpoint (position of oldest unprocessed transaction in the data source):
Thread #: 2
Sequence #: 4015
RBA: 640984080
Timestamp: 2015-01-25 18:03:51.000000
SCN: 3126.136602577 (13426204369873)
Redo File: +DGROUP1/caXXdb/onlinelog/group_15.267.799673791
Current Checkpoint (position of last record read in the data source):
Thread #: 2
Sequence #: 4016
RBA: 1024
Timestamp: 1988-01-01 00:00:00.000000
SCN: 44527.110828288 (191242119617280)
Redo File: +DGROUP1/caXXdb/onlinelog/group_16.266.799673793
……
从这里发现一个不可思议的问题:
Thread #: 2的current checkpoint 的timestamp竟然是1988-01-0100:00:00.000000,SCN的十进制数,竟然为191242119617280,远远大于当前数据库的SCN,多出一位数。
时间倒退,SCN号反而大涨,时间还那么有个性。这个问题挺有意思。
2.2 view extract进程的report
使用view report sextr01查看进程的报告,提示有ERROR,thread#2 4016 archive log 不能打开。(由于笔者没有将信息复制下来,所以当时的状态只能这么描述了),但是4016号archivelog实际物理文件是存在的,而且权限正常。
2.3 将进程停止下来,showch其状态
GGSCI (calladgdb) 24>stop sextr01
GGSCI (calladgdb) 25> info SEXTR01, showch
EXTRACT SEXTR01 Last Started 2015-01-26 01:00 Status STOPPED
Checkpoint Lag 06:56:55 (updated 00:00:58 ago)
Log Read Checkpoint Oracle Redo Logs
2015-01-25 18:05:16 Thread 1, Seqno 5554, RBA 611373536
SCN 3126.136656462 (13426204423758)
Log Read Checkpoint Oracle Redo Logs
1988-01-01 00:00:00 Thread 2, Seqno 4016, RBA 1024
SCN 44527.110828288 (191242119617280)
Current Checkpoint Detail:
Read Checkpoint #1
……
Read Checkpoint #2
Oracle Threaded Redo Log
Startup Checkpoint (starting position in the data source):
Thread #: 2
Sequence #: 4015
RBA: 640984080
Timestamp: 2015-01-25 18:03:51.000000
SCN: 3126.136602577 (13426204369873)
Redo File: +DGROUP1/caXXdb/onlinelog/group_15.267.799673791
Recovery Checkpoint (position of oldest unprocessed transaction in the data source):
Thread #: 2
Sequence #: 4015
RBA: 640984080
Timestamp: 2015-01-25 18:03:51.000000
SCN: 3126.136602577 (13426204369873)
Redo File: +DGROUP1/caXXdb/onlinelog/group_15.267.799673791
Current Checkpoint (position of last record read in the data source):
Thread #: 2
Sequence #: 4016
RBA: 1024
Timestamp: 1988-01-01 00:00:00.000000
SCN: 44527.110828288 (191242119617280)
Redo File: +DGROUP1/caXXdb/onlinelog/group_16.266.799673793
……
3 分析原因
进程的thread#2的checkpoint time竟然是1988-01-01 00:00:00.000000,肯定是不正常,结合error信息为找不到4016归档日志文件,以及进程的current checkpoint thread#2的sequence#:4016,而RBA却为1024。
既然RBA都已经运行取1024了,就不应该找不到archivelog的情况,如果4016号归档日志文件还没有开始抽取,RBA号就应该是0。
根据此分析,决定手动的将extract进程的chkeckpoint恢得到recover chkeckpoint点,以及next extseqno调整到4016的0号RBA试一下,看能否解决。
4、解决办法
4.1 调整netxt sequno为4016,RBA为0
GGSCI (caXXadgdb)2> alter SEXTR01, thread 2, extseqno 4016, extrba 0
4.2 调整ioextseqno与ioextrba到进程的rechvercheckpoint状态
GGSCI (calladgdb)3> alter SEXTR01, thread 2, ioextseqno 4015, ioextrba 640984080
4.3 验证调整结果
GGSCI (caXXadgdb)4> info sextr01, showch
EXTRACT SEXTR01 Initialized 2015-01-26 01:48 Status STOPPED
Checkpoint Lag 07:43:41 (updated 00:00:51 ago)
Log Read Checkpoint Oracle Redo Logs
2015-01-25 18:05:16 Thread 1, Seqno 5554, RBA 611373536
SCN 3126.136656462 (13426204423758)
Log Read Checkpoint Oracle Redo Logs
1988-01-01 00:00:00 Thread 2, Seqno 4016, RBA 0
SCN 0.0 (0)
Current Checkpoint Detail:
Read Checkpoint #1
……
Read Checkpoint #2
Oracle Threaded Redo Log
Startup Checkpoint (starting position in the data source):
Thread #: 2
Sequence #: 4016
RBA: 0
Timestamp: 1988-01-01 00:00:00.000000
SCN: Not available
Redo File: +DGROUP1/caXXdb/onlinelog/group_16.266.799673793
Recovery Checkpoint (position of oldest unprocessed transaction in the data source):
Thread #: 2
Sequence #: 4015
RBA: 640984080
Timestamp:
SCN: Not available
Redo File:
Current Checkpoint (position of last record read in the data source):
Thread #: 2
Sequence #: 4016
RBA: 0
Timestamp: 1988-01-01 00:00:00.000000
SCN: Not available
Redo File: +DGROUP1/caXXdb/onlinelog/group_16.266.799673793
……
Start checkpoint和current chkeckpoint的sequence#都变成了4016,RBA号都变成了0
4.4 启动extract进程sextr01,并做brreset操作
GGSCI (caXXadgdb)6> startSEXTR01, brreset
GGSCI (caXXadgdb) 6> start SEXTR01, brreset
Sending START request to MANAGER ...
EXTRACT SEXTR01 starting
GGSCI (caXXadgdb) 7> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING DPEYWGL 00:00:00 00:00:01
EXTRACT STARTING SEXTR01 07:43:41 00:03:47
4.5 多做几次进程状态信息显示,验证延迟下降效果
GGSCI (caXXadgdb) 10> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING DPEYWGL 04:46:20 00:00:01
EXTRACT RUNNING SEXTR01 04:23:13 00:00:01
GGSCI (caXXadgdb) 11> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING DPEYWGL 01:53:22 00:00:09
EXTRACT RUNNING SEXTR01 01:27:25 00:00:08
GGSCI (caXXadgdb) 12> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING DPEYWGL 00:00:00 00:00:01
EXTRACT RUNNING SEXTR01 00:00:02 00:00:00
通过几次info all查看进程状态,看到lag at chkpt时间快速下降。问题解决。
: