A friend of mine ran into an ORA-04031 problem. The error itself is extremely common, yet this particular case is still quite startling!
Tue Aug 26 11:51:13 2014
Errors in file /Oracle/app/oracle/diag/rdbms/xx/xx1/trace/xx1_p485_28873.trc (incident=1589637)
ORA-04031: unable to allocate 32792 bytes of shared memory ("shared pool","unknown object","sga heap(2,0)","PX msg pool")
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue Aug 26 11:51:15 2014
。。。。。。。。
DDE: Problem Key 'ORA 4031' was completely flood controlled (0x6)
Further messages for this problem key will be suppressed for up to 10 minutes
Tue Aug 26 11:57:12 2014
PMON failed to acquire latch, see PMON dump
Tue Aug 26 11:57:15 2014
DIAG (ospid: 14345) has not called a wait for 49 secs.
Tue Aug 26 11:57:58 2014
Errors in file /oracle/app/oracle/diag/rdbms/xx/xx1/trace/xx1_qmnc_15686.trc (incident=1608420)
ORA-00445: background process "q00T" did not start after 120 seconds
Tue Aug 26 11:58:17 2014
。。。。。。。。
Tue Aug 26 12:07:31 2014
WARNING: inbound connection timed out (ORA-3136)
Tue Aug 26 12:07:31 2014
WARNING: inbound connection timed out (ORA-3136)
Tue Aug 26 12:07:46 2014
PMON failed to acquire latch, see PMON dump
The messages above are from RAC node 1; node 2 looks essentially the same, so I will not paste it here. From the log we can see that p621 hit an exception, from which we can already conclude that a parallel execution (PX) slave process is in trouble, and the ORA-04031 error below confirms it:
ORA-04031: unable to allocate 32792 bytes of shared memory ("shared pool","unknown object","sga heap(2,0)","PX msg pool")
From this error we can tell that the PX slave process could no longer allocate a 32792-byte chunk of memory, and that memory was being requested for the PX msg pool.
So first of all, what is the PX msg pool for?
The PX msg pool is the message buffer area used by parallel execution. Normally this memory should be allocated from the large pool; in many production systems, however, the settings are not in place, so parallel operations still take their memory from the shared pool. We will come back to this point shortly.
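To see for yourself where the PX msg pool is currently being carved from, V$SGASTAT is enough. A minimal sketch (the pool name string 'PX msg pool' is what I have seen on 11g; verify on your own release):

-- Which pool does the PX msg pool live in, and how big is it right now?
-- On a properly configured system you would expect POOL = 'large pool'; in this case it was 'shared pool'.
SELECT pool, name, ROUND(bytes/1024/1024, 1) AS mb
FROM   v$sgastat
WHERE  name = 'PX msg pool';

-- Free memory remaining in each pool, for comparison
SELECT pool, ROUND(SUM(bytes)/1024/1024, 1) AS free_mb
FROM   v$sgastat
WHERE  name = 'free memory'
GROUP  BY pool;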
Reading further down the log, we see that even PMON could no longer acquire its latch, and at that point the instance was essentially gone. Clearly the root cause is the earlier ORA-04031, so we need to analyze the trace to confirm why the process raised the error.
The content of that trace file (xx1_p485_28873.trc) is as follows:
Allocation Request Summary Informaton
=====================================
Current information setting: 04014fff
SGA Heap Dump Interval=3600 seconds
Dump Interval=300 seconds
Last Dump Time=08/26/2014 11:51:12
Dump Count=1
Allocation request for: PX msg pool
Heap: c00000012fe3ae08, size: 32792
******************************************************
HEAP DUMP heap name="sga heap(2,0)" desc=c00000012fe3ae08
extent sz=0xfe0 alt=248 het=32767 rec=9 flg=-126 opc=0
parent=0000000000000000 owner=0000000000000000 nex=0000000000000000 xsz=0x10000000 heap=0000000000000000
fl2=0x20, nex=0000000000000000, dsxvers=1, dsxflg=0x0
dsx first ext=0x30000000
latch set 2 of 2
durations disabled for this heap
reserved granules for root 0 (granule size 268435456)
As we can see, the PX msg pool requested a 32792-byte chunk of memory. From where? Heap 2, i.e. the second subpool of the shared pool. The request could not be satisfied, hence the error. Next, let's look at the actual state of subpool 2:
Memory Utilization of Subpool 2
================================
Allocation Name Size
___________________________ ____________
"free memory " 1431685456 ++++整个subpool 2 还有1.4G free memory左右
"miscellaneous " 1680
"vem_user_actlog1 " 0
"SWEEPERR " 0
"File Space Usage " 112144
"trace_knlasg " 0
"ges big msg buffers " 56801208
"dsktab_kfgsg " 57128
"hot latch diagnostics " 0
"gcs res hash bucket " 4194304
"Wait event pointers " 0
"gcs res latch table " 0
"file state object " 4698096
"vips_package_size " 0
"SEQ S.O. " 1057920
。。。。
"gcs resources " 223317648
"gc lock state " 0
"kglsim alloc latch area " 0
"gcs affinity " 0
"gcs opaque in " 49248
"gcs pnp histo " 0
"ges resource " 605005184
"knlp free small rcv buf " 0
"SQLA " 280160
"PRTI " 0
"SQLP " 53608
"SQLK " 28336
"gcs dynamic r " 2556288
"work area tab " 236640
"plwpil:wa " 0
"kebm run-once actions " 0
"grptab_kfgsg " 0
"AW SGA " 0
"dbktb: trace buffer " 53673984
"v_swperrcount " 0
"ksunfy : SSO free list " 27834240
"ksxp IPC state object " 346448
"ksdhng: cache history " 0
"DFW_CONFIG_CAPTURE " 0
"enqueue resources " 862232
"kzekm heap descriptor " 0
"kelt translation table " 0
"IPS_PACKAGE_UNPACK_HISTOR" 0
"kglsim hash table bkts " 2097152
"KSXR global channels " 0
"ges enqueues " 344980144
"trace buf hdr xtend " 2080
"KSXR receive buffers " 1061888
"kglsim Java timestamps " 131072
"VM OSD context " 0
"KCB Table Scan Buffer " 4194816
"knlp receive buffers " 0
"PLS chunk " 0
"knlu_init_btree:init " 0
"KEWS aggregation objs " 0
"DDE_USER_ACTION_DEF " 0
"keswxNotify:tabPlans " 131072
"EM_USER_ACTIVITY " 0
"KSXR large reply queue " 168296
"event-class map " 0
"KSQ event description " 0
"KSXP CONN Stats buffer " 728432
"KSXP CTX Stats buffer " 578128
"KSXP PORT Stats buffer " 142768
"kglsim alloc latches " 0
"kglsim used list arr " 0
"kglsim free obj list " 0
"KTCN global operators " 0
"KESTB existence bitvec " 131072
"KFD extent enqueue obj " 42792
"ksdhng: el wtr cache " 401280
"ksb cic process list " 0
"gcs shadows " 130592800
"value block f " 43832
"ksdhng: blkers cache " 72960
"FileOpenBlock " 255011680
"qmtb_init_xxta " 0
"sort segment handle " 54936
"latchnum to latch map " 0
"LISTEN ADDRESS ENTRY " 0
"vips_pkg_file " 0
"PX msg pool " 47109312 +++++ 当前PX msg pool已经分配了47M左右
。。。。。。
"Checkpoint queue " 116765184
"sim segment hits " 72960
"sim lru segments " 72960
"pesom.c:Global " 0
"peshm.c:Global " 0
"paralltp txn r " 22476896
"ASH buffers " 119537664
"ASM KFFD SO " 0
"statement queuing " 0
"v_actinc " 0
"message pool freequeue " 523240
"ADR_CONTROL " 0
"vshowincb " 0
"vproblem2 " 0
"PMON blockers " 0
"procs: ksunfy " 6408000
"sga listelement " 0
"primem_kfmdsg " 0
"agent name " 0
"SYSTEM PARAMETERS " 232944
"Online xxtafile Move sess" 0
"sched job queue " 0
"vproblem " 0
"object queue hash buckets" 8404992
"object queue hash table d" 229824
"obj stats allocation chun" 1290240
"qmnsf_heap " 0
"sim cache nbufs " 0
"qmu subheap " 53976
"MTTR advisory context " 0
"LGWR-network Server info " 34032
"LGWR debug module memory " 0
"logout storm management " 0
"log_simultaneous_copies " 49248
"sys event stats " 552960
"parameter handle " 4273688
"Parameter Handle " 0
"name-service recovery " 80776
"channel handle " 48360
"parameter blocks " 22016
"qmrcInitSGA:qmrcSGA " 0
"PARAMETER ENTRY " 0
"ASM map operations " 30552
"Security Class " 33280
"GCR " 0
"post/wait queues " 55872
"bloom filter " 0
"Auto tune undo info " 0
"pesldT03_Instance " 33848
"cp pool array " 0
"cp cmon array " 0
"UNDO INFO " 0
"Log Archive Config Contex" 0
"state objects " 107440104
"ArchLog Dest Array " 22784
"parallel kcbibr " 0
"archive_lag_target " 0
"poisoned memory " 32768
"name-service request " 0
"latch recovery structures" 0
"latches for protecting fs" 0
"latches for protecting ls" 0
"Label Cache Heap descrptr" 0
"cp connection " 0
"sched job slv " 0
"bt_subentry " 0
"Saved job ids " 0
"object level stat table " 0
"pso tbs: ksunfy " 780000
"dummy " 0
"qm_init_sga:qmdpsg " 0
"analytic workspace " 0
"sim trace buf " 0
"qm_init_sga:oidctx " 0
"Sort Segment " 16852160
"PRESENTATION ENTRY " 0
"cp srv array " 0
"recov_kgqbtctx " 26512
"dbkea sga " 0
"pesom.c:latch array " 0
"Banner Storage " 0
"qmps connections " 65280
"quiescing session " 0
"qm_init_uga:qmsg " 0
"enqueue " 0
"buffer handles " 5472000
"process group array " 69776
"DML lock " 7066304
"property service SO " 0
"broker globals " 0
"sga node map " 0
"prmtzdini tz region " 801416
"VIEW " 0
"savepoints " 0
"Monitor Modification Hash" 0
"RESUMABLE NAME " 0
"obj stats hash table " 0
"max allowable # log files" 506400
"OS proc request holder " 293424
"message pool context area" 22400
"media recovery state obje" 0
"db_files " 1783968
"base-kglhdusr " 1218016
"constraints " 0
"ASM scan context " 0
"PX server msg stats " 0
"PX server deq stats " 0
"cinfo_kfnsg " 0
"cco_subnode " 0
"cco_pubnode " 0
"KQR X PO " 50414728
"KQR L PO " 5977688
"parameter table block " 31158248
"KQR M SO " 192808
"KQR L SO " 128104
"KQR S SO " 90928
"KSFV SGA " 0
"KSIR SGA " 0
"ASM rollback operations " 0
"kccsgehtt " 0
"kglsgivl2 " 28160
"plugin xxtafile array " 36016
"plwxx:PLW_STR_NEW_RVAL " 0
"plwspv:PLW_STR_NEW_VAL " 0
"lm heartbeat " 0
"KGSK sga " 0
"BRANCH TABLE SEGMENTED AR" 105288
"ksv slave " 0
"RVWR post requested array" 0
"parameter value memory " 0
"RTA Bounxxry Info thread " 71808
"RTA read control file ind" 0
"k2q_lock " 4606648
"ASM extent pointer array " 318197080
。。。。。。。
"kglsim object batch " 338733696
"ktlbk state objects " 5417280
。。。。。
"kglsim heap " 197796864
"ksz parent " 204960
"kgnfs authp " 16464
"event statistics per sess" 29840640
"event statistics ptr arra" 36480
"event list to post commit" 0
"KFG SO child " 0
"KGKP randnum " 40000
Library Cache Dump
We can see that the gcs/ges resource and enqueue areas together consume about 1.4 GB, which is quite a lot, and the ASM extent pointer array takes another 300+ MB. The latter is probably related to ASM extent allocation; for a DW environment I would recommend not sticking with the default uniform 1 MB AU, which is simply too small.
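If you want to confirm the AU size of your disk groups, V$ASM_DISKGROUP shows it directly. A hedged sketch (ALLOCATION_UNIT_SIZE is in bytes; run it on an instance that can see the disk groups):

-- Check the allocation unit size of each ASM disk group
SELECT name,
       allocation_unit_size/1024/1024 AS au_mb,
       ROUND(total_mb/1024)           AS total_gb
FROM   v$asm_diskgroup;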
Overall, the subpool usage alone does not tell us whether fragmentation is severe. But the library cache statistics further down in the dump do offer some clues, as follows:
LIBRARY CACHE STATISTICS:
namespace gets hit ratio pins hit ratio reloads invalids
-------------- --------- --------- --------- --------- ---------- ----------
SQL AREA 325150175 0.984 2560976170 0.992 2656085 3942426
TABLE/PROCEDURE 73717472 0.996 786560445 0.998 481299 1362
BODY 11738443 1.000 86301250 1.000 521 0
TRIGGER 3247968 1.000 3248823 0.999 153 2
INDEX 576294 0.993 577639 0.974 8448 0
........
As you can see, the reloads and invalids for the SQL AREA namespace are far too high, which suggests that at this point there was not much usable memory left in the shared pool, even though the earlier dump still showed plenty of free memory. Keep in mind that when Oracle allocates from the shared pool it needs a contiguous chunk of memory, so plenty of fragmented free space is of no help to a large request.
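On a live instance the same pressure can be measured with V$LIBRARYCACHE, and V$SHARED_POOL_RESERVED records requests for large contiguous chunks that could not be satisfied. A minimal sketch, not taken from this case:

-- Reload/invalidation pressure per library cache namespace
SELECT namespace, gets, gethitratio, pins, pinhitratio, reloads, invalidations
FROM   v$librarycache
ORDER  BY reloads DESC;

-- Failures to find a large enough contiguous chunk (reserved area of the shared pool)
SELECT requests, request_misses, request_failures, last_failure_size
FROM   v$shared_pool_reserved;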
So if this is not severe shared pool fragmentation, there is probably some large SQL operation at work. Looking at the process dump, we find that the process is in fact holding the latch 'Child parallel query alloc buffer', with a long queue of waiters behind it, as shown below:
SO: 0xc000001842ba9750, type: 2, owner: 0x0000000000000000, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0xc000001842ba9750, name=process, file=ksu.h LINE:12616 ID:, pg=0
(process) Oracle pid:664, ser:2, calls cur/top: 0xc0000015e1989c10/0xc000001711b743a8
flags : (0x0) -
flags2: (0x0), flags3: (0x10)
intr error: -2147473260, call error: 0, sess error: 0, txn error 0
intr queue: 2147494036
ksudlp FALSE at location: 0
(post info) last post received: 2147494036 136 12
last post received-location: ksu.h LINE:13927 ID:ksusig
last process to post me: c000001832b0b620 242 0
last post sent: 0 0 356
last post sent-location: kxfp.h LINE:4561 ID:kxfprienq: slave
last process posted by me: c000001832b0b620 242 0
(latch info) wait_event=0 bits=50
holding (efd=10) c000001804617de8 Child parallel query alloc buffer level=6 child#=23
Location from where latch is held: kxfpb.h LINE:496 ID:kxfpbalo: allocate a buffer
Context saved from call: 13835058152010268280
state=busy [holder orapid=664] wlstate=free [value=0]
waiters [orapid (seconds since: put on list, posted, alive check)]:
746 (456, 1409025072, 2)
1068 (456, 1409025072, 3)
747 (456, 1409025072, 2)
857 (456, 1409025072, 1)
1067 (456, 1409025072, 3)
748 (456, 1409025072, 1)
983 (434, 1409025072, 4)
1070 (384, 1409025072, 3)
749 (384, 1409025072, 1)
1069 (384, 1409025072, 2)
912 (346, 1409025072, 4)
662 (323, 1409025072, 3)
871 (308, 1409025072, 4)
906 (270, 1409025072, 4)
735 (209, 1409025072, 4)
786 (157, 1409025072, 1)
972 (74, 1409025072, 1)
waiter count=17
holding (efd=10) c00000168571bdd0 Child process queue reference level=4 child#=10989
Location from where latch is held: kxfp.h LINE:4438 ID:kxfprialo2: process qref child init
Context saved from call: 13835058150030153416
state=busy [holder orapid=664] wlstate=free [value=0]
Process Group: DEFAULT, pseudo proc: 0xc0000018330dea98
O/S info: user: oracle, term: UNKNOWN, ospid: 28873
OSD pid info: Unix process pid: 28873, image: oracle@xx01 (P485)
PSO child state object changes :
The 'Child parallel query alloc buffer' latch is exactly the latch that has to be taken while allocating memory for the PX msg pool, and it is released as soon as the allocation completes; normally Oracle acquires and releases latches extremely quickly. Obviously, because the memory allocation failed here, this process never got to release the latch, and it ended up blocking a whole pile of sessions.
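On a running system this kind of pile-up is easy to spot from the latch views. A hedged sketch (the parent latch is named 'parallel query alloc buffer'; double-check the exact name on your version):

-- Who is holding latches right now, and which ones
SELECT pid, sid, laddr, name
FROM   v$latchholder;

-- Contention statistics for the 'parallel query alloc buffer' child latches
SELECT child#, gets, misses, sleeps
FROM   v$latch_children
WHERE  name = 'parallel query alloc buffer'
ORDER  BY sleeps DESC;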
Since my friend had no further diagnostic data, it is hard to say for certain whether this was fragmentation or something else. Still, the trace itself reveals quite a bit. As mentioned, the process was requesting memory, so the natural next step is to look at what this session was actually running, and what we found was shocking:
MERGE INTO xx.xxxx_A A
USING (SELECT /*+ parallel(a,512) parallel(b,8) */
:B2 STATIS_MONTH,
NVL(B.PROV_ID, '000') PROV_ID,
COUNT(DISTINCT A.SERV_NUM) CNT
FROM (SELECT DISTINCT SERV_NUM
FROM xxx.TDW_XXXXX_MIGU_MEMBER_M A
WHERE STATIS_MONTH = :B1
AND MEMBER_LVL = '1'
AND MEMBER_STAT IN ('1', '0')
AND EXISTS (SELECT /*+ parallel(b,64)*/
1
FROM xxx.TKR_DEV_XXXXX_CORE_M B
WHERE A.SERV_NUM = B.SERV_NUM)) A,
xxx.VDW_NUMBER_SEGMENT B
WHERE SUBSTR(A.SERV_NUM, 1, 7) = B.SEGMENT(+)
GROUP BY B.PROV_ID) B
ON (A.STATIS_MONTH = B.STATIS_MONTH AND A.PROV_ID = B.PROV_ID)
WHEN MATCHED THEN
UPDATE SET CORE_NRML = CNT
As you can see, this SQL requests a degree of parallelism of 512 on a single table; adding up the hints across all the tables gives as many as 584 parallel slaves. And this was not the only such SQL in the application.
For parallel query, we know that no matter how high you set the DOP, you will not necessarily get that many slaves, because the database parameters cap it. Unfortunately, the defaults here are already quite large: cpu_count=114, so the default DOP is as high as 228 (cpu_count x parallel_threads_per_cpu), and parallel_max_servers runs into the thousands.
On top of that, parallel_force_local is set to true, which forces all the PX slaves of a query to be allocated on a single instance.
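To review the relevant settings in one pass, something like the following is enough (a sketch; these are all standard parameters, and the values quoted above are simply what this system had):

-- Parallel-execution related parameters worth checking in a case like this
SELECT name, value
FROM   v$parameter
WHERE  name IN ('cpu_count',
                'parallel_threads_per_cpu',
                'parallel_min_servers',
                'parallel_max_servers',
                'parallel_force_local',
                'parallel_execution_message_size',
                'large_pool_size')
ORDER  BY name;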
But none of the above is the real issue. The truly fatal point is that the PX msg pool used by parallel query was being allocated from the shared pool.
From the information I was given, the large pool here was sized at a full 10 GB, and it was not being used at all.
About the PX msg pool: if PARALLEL_MIN_SERVERS is fairly large (and here it is), then a fair amount of PX msg pool memory may be consumed as soon as the database starts.
The PX memory used by parallel operations can be allocated either from the large pool or from the shared pool. Note that even if you have configured a large pool, PX memory will still come out of the shared pool unless at least one of the following conditions is met (see the check sketch after this list):
1) parallel_automatic_tuning is set to true;
2) the hidden parameter _PX_use_large_pool is set to true;
3) ASMM/AMM is in use, i.e. sga_target or memory_target is set.
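A hedged way to check all three conditions at once is shown below; the hidden parameter query against x$ksppi/x$ksppcv has to be run as SYS, and x$ structures can differ slightly between versions, so treat it as an illustration only:

-- 1) and 3): ordinary parameters
SELECT name, value
FROM   v$parameter
WHERE  name IN ('parallel_automatic_tuning', 'sga_target', 'memory_target');

-- 2): the hidden parameter _PX_use_large_pool (run as SYS)
SELECT a.ksppinm AS name, b.ksppstvl AS value
FROM   x$ksppi a, x$ksppcv b
WHERE  a.indx = b.indx
AND    LOWER(a.ksppinm) = '_px_use_large_pool';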
Unfortunately, none of the three conditions was met here.