B35 Inside rac by Julian Dyke

Inside RAC
Julian Dyke
Independent Consultant
DB Tech Showcase - Osaka
May 2013
© 2013 Julian Dyke, juliandyke.com

Agenda
 OPS versus RAC
 Buffer Cache
 Global Cache Services

RAC
Overview
[Diagram: four nodes (Node 1 to Node 4), each running one instance (Instance 1 to Instance 4). The nodes are attached to the public network, to the private network (interconnect), and through the storage network to shared storage.]

OPS versus RAC
Oracle 8.0.6 and below
OPS - Oracle 8.0.6 and below
[Diagram: Instance 1 on Node 1 and Instance 2 on Node 2, joined by the interconnect and both attached to shared storage; arrows show current writes, consistent reads and current reads]
 All I/O uses shared storage
 Only enqueues use the interconnect

OPS versus RAC
Oracle 8.1.5 to Oracle 8.1.7
OPS - Oracle 8.1.5 to Oracle 8.1.7 - Cache Fusion Phase 1
[Diagram: Instance 1 on Node 1 and Instance 2 on Node 2, joined by the interconnect and both attached to shared storage; arrows show current writes, consistent reads and current reads]
 Current I/O always uses shared storage
 Consistent reads can use interconnect

OPS versus RAC
Oracle 9.0.1 and above
RAC - Oracle 9.0.1 and above - Cache Fusion Phase 2
[Diagram: Instance 1 on Node 1 and Instance 2 on Node 2, joined by the interconnect and both attached to shared storage; arrows show current writes, consistent reads and current reads]
 Current I/O and consistent reads can use interconnect

Buffer Cache
Single Block Reads
[Animation: buffers on a list running from the head of the hot end to the head of the cold end; each buffer shows its block number and touch count]
 Read Block 42
   Get first available buffer from cold end
   Update buffer contents
   Insert buffer at head of cold end
 Read Block 11
   Get first available buffer from cold end
   Update buffer contents
   Insert buffer at head of cold end
 Read Block 42
   Update touch count for block 42
 Read Block 33
   Move block 71 to head of hot end
   Set touch count on block 71 to zero
   Get first available buffer from cold end
   Update buffer contents
   Insert buffer at head of cold end
 Read Block 34
   Update touch count for block 34
 Read Block 87
   Move block 42 to head of hot end
   Set touch count on block 42 to zero
   Get first available buffer from cold end
   Update buffer contents
   Insert buffer at head of cold end
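The single block read steps above can be sketched as a small simulation. This is an illustration only, not Oracle's implementation: the class name, the split into two queues, the hot-end capacity and the promotion threshold of 2 are all assumptions chosen to mirror what the animation appears to show.

```python
from collections import deque

HOT_CRITERIA = 2  # assumed: touch count at which a buffer is moved to the hot end


class TouchCountCache:
    """Toy model of a touch-count buffer list (hypothetical, for illustration)."""

    def __init__(self, size, hot_fraction=0.5):
        self.size = size
        self.hot_capacity = int(size * hot_fraction)
        self.hot = deque()   # left = head of hot end
        self.cold = deque()  # left = head of cold end, right = replacement end
        self.touch = {}      # block number -> touch count

    def read(self, block):
        if block in self.touch:
            # Block already cached: only the touch count changes.
            self.touch[block] += 1
            return "hit"
        if len(self.touch) >= self.size:
            self._free_buffer()
        # New blocks are inserted at the head of the cold end.
        self.cold.appendleft(block)
        self.touch[block] = 1
        return "miss"

    def _free_buffer(self):
        # Walk in from the cold end until a reusable buffer is found.
        while True:
            victim = self.cold.pop()
            if self.touch[victim] >= HOT_CRITERIA:
                # Frequently touched: move to head of hot end, reset the count.
                self.touch[victim] = 0
                self.hot.appendleft(victim)
                if len(self.hot) > self.hot_capacity:
                    # Keep the hot end bounded by demoting its oldest buffer.
                    self.cold.appendleft(self.hot.pop())
            else:
                del self.touch[victim]
                return


cache = TouchCountCache(size=3)
for blk in (42, 11, 42, 33, 87):
    cache.read(blk)
print(list(cache.hot), list(cache.cold))
```

In this run block 42 is touched twice, so when block 87 needs a buffer it is moved to the hot end with its count reset, and block 11 is the buffer that gets reused.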

Buffer Cache
Multi Block Reads
DB_FILE_MULTIBLOCK_READ_COUNT = 4
[Animation: buffers on a list running from the head of the hot end to the head of the cold end]
 Read Block 1
   Get first four available buffers from cold end
   Read next four blocks into buffers (blocks 1 to 4)
   Insert buffers at head of cold end
   Move block 1 to cold end
 Read Block 2
 Read Block 3
 Read Block 4
 Read Block 5
   Get next four available buffers from cold end
   Read next four blocks into buffers (blocks 5 to 8)
   Insert buffers at head of cold end
 Read Block 6
 Read Block 7
 Read Block 8

Global Services
Overview
 Resource
 Object to which access must be controlled at instance
level
 Enqueue
 Memory structure that serializes access to a resource
 Global Resources
 Object to which access must be controlled at cluster level
 Global Enqueue
 Locks and enqueues which need to be consistent between
all instances

Global Services
Overview
 Global Resource Directory (GRD)
 Records current state and owner of each resource
 Contains convert and write queues
 Distributed across all instances in cluster
 Maintained by GCS and GES
 Global Cache Services (GCS)
 Implements cache coherency for database
 Coordinates access to database blocks for instances
 Global Enqueue Services (GES)
 Controls access to other resources (locks) including
library cache and dictionary cache
 Performs deadlock detection

Global Cache Services
Introduction
 Global Cache Services exist to implement Cache Fusion
 Cache Fusion allows blocks to be updated by multiple
instances
 Only one instance can have the updatable (current) version of
a block
 GCS must ensure that only one instance can update a
block at any time
 Many instances can have read-only (consistent read) versions
of a block
 Instances can have multiple copies of same block at
different SCNs

2-way Current Read
[Diagram: Instance 1 to Instance 4 and the resource master; block at SCN 1318; lock states S and N]
 Instance 2 requests current read on block
   1 Request shared resource
   2 Request granted
   3 Read request
   4 Block returned

3-way Current Read
[Diagram: Instance 1 to Instance 4 and the resource master; block SCNs 1318 and 1320; lock states S, N and X]
 Instance 1 requests exclusive read on block
   1 Request exclusive resource
   2 Transfer block to Instance 1 for exclusive access
   3 Block and resource status
   4 Resource status

3-way Current Read (Dirty Block)
[Diagram: Instance 1 to Instance 4 and the resource master; block SCNs 1318, 1320 and 1323; lock states S, N and X]
 Instance 4 requests exclusive read on block
   1 Request block in exclusive mode
   2 Transfer block to Instance 4 in exclusive mode
   3 Block and resource status
   4 Resource status
 Note that Instance 1 will create a past image (PI) of the dirty block

3-way Current (Without Downgrade)
[Diagram: Instance 1 to Instance 4 and the resource master; block SCNs 1318, 1320 and 1323; lock states N and X; four numbered messages]
 Instance 2 requests current read on block
   Request block in shared mode
   Transfer block to Instance 2 in shared mode
   Resource status
 In Oracle 8.1.5 and above _fairness_threshold is used to avoid unnecessary lock conversions

3-way Current (With Downgrade)
[Diagram: Instance 1 to Instance 4 and the resource master; block SCNs 1318, 1320 and 1323; lock states N and X before the transfer, S and S after it; four numbered messages]
 Instance 2 requests current read on block
   Request block in shared mode
   Transfer block to Instance 2 in shared mode
   Resource status
 In Oracle 8.1.5 and above _fairness_threshold is used to avoid unnecessary lock conversions

Past Images
 When an instance passes a dirty block to another instance it
 Flushes redo buffer to redo log
 Retains past image (PI) of block in buffer cache
 PI is retained until another instance writes block to disk
 Used to reduce recovery times
 Recorded in V$BH.STATUS as PI
 Based on X$BH.STATE (value 8 in Oracle 10.2)

Past Images
[Animation: Instance 1 and Instance 2, each with a buffer cache and its own redo log (Redo Log 1 and Redo Log 2); block 42 on disk holds the value 7123]
 Assume table t1 contains a single row in block 42
 Instance 1 updates column to 7124
UPDATE t1
SET c1 = 7124;
COMMIT;
   Block 42 is read from disk
   Undo/redo written to Redo Log 1
   Block 42 is updated in buffer cache
 Instance 1 then updates the column to 7125, 7126, 7127 and 7128 in turn
   For each update, undo/redo is written to Redo Log 1 and block 42 is updated in buffer cache
 Instance 2 updates column to 7129
UPDATE t1
SET c1 = 7129;
COMMIT;
   GCS transfers block from Instance 1 to Instance 2
   Instance 1 makes block 42 a Past Image block
   Undo/redo written to Redo Log 2
   Block 42 is updated in buffer cache
 Instance 2 crashes
   Contents of buffer cache are lost
   DBWR has not written changes to block 42 back to disk yet
 Instance 1 must perform recovery for Instance 2
   Block 42 needs recovery
   Instance 1 uses Past Image
   Undo/redo is applied from Redo Log 2
   Block 42 is subsequently written back to disk by DBWR

Wait Events
 Wait events show reads where messages have been
exchanged with other instances
 Can include:
 gc cr grant 2-way
 gc cr block 2-way
 gc cr block 3-way
 gc cr multi block request
 gc current grant 2-way
 gc current block 2-way
 gc current block 3-way
 gc current multi block request

gc cr block 3-way wait event
Source         Destination    Description                     Bytes
RAC4 - Server  RAC2 - LMS1    Request file 8 block 15           456
RAC2 - LMS1    RAC4 - Server  OK                                212
RAC2 - LMS1    RAC3 - LMS1    Send file 8 block 15 to RAC4      480
RAC3 - LMS1    RAC2 - LMS1    OK                                212
RAC3 - LMS1    RAC4 - Server  Block file 8 block 15 part 1     1500

[Diagram: RAC1 to RAC4 and the resource master; block at SCN 1318 with rows (1,40) (2,44) and (1,42) (2,44); ten numbered messages for the statement below]
UPDATE t1
SET c2 = 50
WHERE c1 = 2;

 2-way Consistent Read

[Diagram: RAC1 to RAC4 and the resource master; block at SCN 1318 with rows (1,40) (2,44); eight numbered messages for the statement below]
UPDATE t1
SET c2 = 50
WHERE c1 = 2;

gc current block 3-way wait event
 3-way Current Read
Source       Destination  Description                    Bytes
RAC2 - LMS1  RAC3 - LMS1  Send file 8 block 15 to RAC4     480
RAC4 - LMS1  RAC2 - LMS1  Received file 8 block 15         244

[Diagram: RAC1 to RAC4 and the resource master; block at SCN 1318; rows (1,40) (2,44) become (1,42) (2,44) and then (1,42) (2,50); twelve numbered messages for the statements below]
UPDATE t1
SET c2 = 42
WHERE c1 = 1;
UPDATE t1
SET c2 = 50
WHERE c1 = 2;
 RAC3 saves past image of the dirty block until RAC4 writes the block to disk

gc cr grant 2-way wait event
 2-way Consistent Read
Source         Destination    Description                Bytes
RAC3 - LMS1    RAC4 - Server  Grant read file 6 block 69   276
RAC4 - Server  RAC3 - LMS1    OK                           212

[Diagram: RAC1 to RAC4 and the resource master; block at SCN 1318 with rows (1,40) (2,44); six numbered messages for the statement below]
SELECT c2
FROM t1
WHERE c1 = 1;

gc cr multi block request wait event
Source         Destination    Description                         Bytes
RAC4 - Server  RAC3 - LMS1    Request file 8 blocks 69-76          1872
RAC3 - LMS1    RAC4 - Server  Grant file 8 blocks 69-76 to RAC4     772
RAC4 - Server  RAC3 - LMS1    OK                                    212

[Diagram: RAC1 to RAC4 and the resource master; a range of blocks at SCN 1318, each with rows (1,40) (2,44); six numbered messages for the statement below]
SELECT c2
FROM t1
WHERE c1 = 1;

 The following 10046/8 trace is for a gc cr multi block request
WAIT #2: nam='gc cr multi block request' ela= 722 file#=4
block#=248 class#=1 obj#=51866 tim=1169728375495574
WAIT #2: nam='db file scattered read' ela= 10437 file#=4 block#=244
blocks=5 obj#=51866 tim=1169728375506092
 This trace can be misleading because:
 the gc cr multi block request specifies the LAST block in
the range
 the gc cr multi block request does not specify how many
blocks should be read
 the gc cr multi block request does not specify how many
blocks have been returned from another instance
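To make the relationship between the two trace lines easier to see, the following sketch parses them and compares block ranges. It is only an illustration: the regular expression and the function name `parse_wait` are assumptions written for the two lines quoted above, not a general 10046 trace parser.

```python
import re

# Assumed layout: WAIT #<cursor>: nam='<event>' ela= <n> key=value key=value ...
WAIT_RE = re.compile(r"WAIT #(\d+): nam='([^']+)' ela= *(\d+) (.*)")


def parse_wait(line):
    """Split one WAIT line into cursor, event name, elapsed time and parameters."""
    m = WAIT_RE.match(line)
    if m is None:
        return None
    params = {}
    for pair in m.group(4).split():
        key, value = pair.split("=", 1)
        params[key] = int(value)
    return {
        "cursor": int(m.group(1)),
        "event": m.group(2),
        "ela": int(m.group(3)),
        "params": params,
    }


gc_wait = parse_wait(
    "WAIT #2: nam='gc cr multi block request' ela= 722 file#=4 "
    "block#=248 class#=1 obj#=51866 tim=1169728375495574"
)
disk_read = parse_wait(
    "WAIT #2: nam='db file scattered read' ela= 10437 file#=4 block#=244 "
    "blocks=5 obj#=51866 tim=1169728375506092"
)

# The scattered read covers blocks 244..248; the gc wait names only block 248.
first = disk_read["params"]["block#"]
last = first + disk_read["params"]["blocks"] - 1
print(first, last, gc_wait["params"]["block#"])
```

The last block of the scattered read (244 + 5 - 1 = 248) matches the block# reported by the gc cr multi block request, which is consistent with the first point in the list above.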

Block Mastering
 Each block is mastered on one instance
 Block DBA is reported by X$KJBR.KJBRNAME
 Names have the format:
[<block_number>][<file_number>][BL]
 For example
[0x137][0x40000][BL]
 is file# 4, block# 311
 Ordering by X$KJBR.KJBRNAME is difficult because the resource names do not collate when sorted e.g.:
[0x12E][0x40000][BL]
[0x12F][0x40000][BL]
[0x13][0x40000][BL]
[0x130][0x40000][BL]
[0x131][0x40000][BL]
etc...

Block Mastering
 Some useful functions
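-- Return the file number from a resource name such as [0x137][0x40000][BL]
CREATE OR REPLACE FUNCTION get_file_number (p_resource_name VARCHAR2)
RETURN INTEGER
IS
  -- the second hex value lies between the second 'x' and the second ']'
  pos1 INTEGER := INSTR (p_resource_name,'x',1,2);
  pos2 INTEGER := INSTR (p_resource_name,']',1,2);
  s VARCHAR2(30) := SUBSTR (p_resource_name,pos1+1,pos2-pos1-1);
BEGIN
  -- the file number is held above the low 16 bits, so divide by 65536
  RETURN TO_NUMBER (s,'XXXXXXXX') / 65536;
END;
/
-- Return the block number from a resource name such as [0x137][0x40000][BL]
CREATE OR REPLACE FUNCTION get_block_number (p_resource_name VARCHAR2)
RETURN INTEGER
IS
  -- the first hex value lies between the first 'x' and the first ']'
  pos1 INTEGER := INSTR (p_resource_name,'x',1,1);
  pos2 INTEGER := INSTR (p_resource_name,']',1,1);
  s VARCHAR2(30) := SUBSTR (p_resource_name,pos1+1,pos2-pos1-1);
BEGIN
  RETURN TO_NUMBER (s,'XXXXXXXX');
END;
/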
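The same decoding can be sketched outside the database, for example to sort resource names that have already been spooled to a file. This is a hypothetical helper, not part of the original slides; the regular expression assumes the name format shown earlier and tolerates an optional comma before [BL].

```python
import re

# Assumed format: [0x<block>][0x<file << 16>][BL]
NAME_RE = re.compile(r"\[0x([0-9A-Fa-f]+)\]\[0x([0-9A-Fa-f]+)\]\s*,?\s*\[BL\]")


def parse_resource_name(name):
    """Return (file_number, block_number) for a BL resource name."""
    m = NAME_RE.match(name.strip())
    if m is None:
        raise ValueError("unexpected resource name: %r" % name)
    block_number = int(m.group(1), 16)
    file_number = int(m.group(2), 16) >> 16  # same as dividing by 65536
    return file_number, block_number


names = [
    "[0x12E][0x40000][BL]",
    "[0x12F][0x40000][BL]",
    "[0x13][0x40000][BL]",
    "[0x130][0x40000][BL]",
    "[0x131][0x40000][BL]",
]

# Sorting on the decoded (file, block) pair gives numeric rather than string order.
ordered = sorted(names, key=parse_resource_name)
print(parse_resource_name("[0x137][0x40000][BL]"))
print(ordered[0])
```

Sorting on the decoded pair places [0x13] (block 19) ahead of [0x12E] (block 302), which a plain string sort does not.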

Block Mastering
 In Oracle 10.2 block mastering is determined by
 _lm_contiguous_res_count
 Specifies number of contiguous blocks that will hash to the
same HV bucket
 Defaults to 128
 For example
Start End
0x080 0x0FF
0x180 0x1FF
0x280 0x2FF
0x380 0x3FF
0x480 0x4FF
0x580 0x5FF
etc etc
Start End
0x000 0x07F
0x100 0x17F
0x200 0x27F
0x300 0x37F
0x400 0x47F
0x500 0x57F
etc etc
Instance 0 Instance 1

Block Mastering
 The following table shows that masters are still assigned to
ranges of 128 contiguous blocks in a four-node cluster
Start Block End Block Master
0 127 1
128 255 2
256 383 2
384 511 3
512 639 3
640 767 3
768 895 1
896 1023 0
1024 1279 2
1280 1407 1
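The range boundaries in the tables above follow directly from the contiguous resource count. The sketch below computes the range containing a given block, assuming the default of 128; the function name is hypothetical, and the sketch says nothing about which instance masters the range, since that depends on a hash that is not described here.

```python
def contiguous_range(block_number, contiguous_res_count=128):
    """Return (start, end) of the group of contiguous blocks containing block_number."""
    start = (block_number // contiguous_res_count) * contiguous_res_count
    return start, start + contiguous_res_count - 1


print(contiguous_range(311))   # block 0x137 from the earlier example
print(contiguous_range(1023))
```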

Dynamic Remastering
 In Oracle 9.2
 documentation describes dynamic remastering
 not implemented in code
 In Oracle 10.1
 works at data file level
 very high threshold so difficult to test
 does occur on some customer sites
 In Oracle 10.2 and above
 works at segment level
 thresholds are relatively low

Dynamic Remastering
 Object remastering is recorded in V$GCSPFMASTER_INFO
 Instances are internally numbered 0, 1 etc
 Initially contains no rows
SELECT object_id, current_master, previous_master
FROM v$gcspfmaster_info;
 After remastering object 52084 to instance 0
Object ID Current Master Previous Master
52084 0 32767
 After remastering object 52084 to instance 1
Object ID Current Master Previous Master
52084 1 0

Dynamic Remastering
 In Oracle 10.2 and above, information about Dynamic
Remastering operations is also reported in the following fixed
views
 X$KJDRMREQ
 Dynamic Remastering Requests
 X$KJDRMAFNSTATS
 File Remastering Statistics
 X$KJDRMHVSTATS
 Hash Value Statistics

Dynamic Remastering
 In Oracle 11.1 and above, Dynamic Remastering statistics are
reported in V$DYNAMIC_REMASTER_STATS
Column Name Data Type
REMASTER_OPS NUMBER
REMASTER_TIME NUMBER
REMASTERED_OBJECTS NUMBER
QUIESCE_TIME NUMBER
FREEZE_TIME NUMBER
CLEANUP_TIME NUMBER
REPLAY_TIME NUMBER
FIXWRITE_TIME NUMBER
SYNC_TIME NUMBER
RESOURCES_CLEANED NUMBER
REPLAYED_LOCKS_SENT NUMBER
REPLAYED_LOCKS_RECEIVED NUMBER
CURRENT_OBJECTS NUMBER

Dynamic Remastering
 Dynamic remastering is coordinated by the LMD0 background process
 The LMD0 background process records limited details of dynamic remastering operations
 Excessive dynamic remastering can cause instance freezes
 Observed in both Oracle 10.1 and 10.2
 Oracle Support occasionally recommends that dynamic
remastering is disabled using the following parameters:
_gc_affinity_time = 0
_gc_undo_affinity=FALSE

Thank you for listening
info@juliandyke.com

Backup

Interconnect
Overview
 Instances communicate with each other over the interconnect
(network)
 Information transferred between instances includes
 data blocks
 locks
 SCNs
 Typically 1Gb Ethernet
 UDP protocol
 Often teamed in pairs to avoid SPOFs
 Can also use Infiniband
 Fewer levels in stack
 Other proprietary protocols are available

Interconnect
TCP/IP Five Layer Model
 All messages travel down through layers, across physical
layer then up again
[Diagram: two hosts, each with the same five-layer stack, joined at the physical layer]
5 Application
4 Transport
3 Network
2 Data Link
1 Physical

Interconnect
TCP/IP Five Layer Model
 TCP/IP has a four or five layer model
 Five-layer model shown below
Layer TCP/IP Suite
5 Application DHCP, DNS, FTP, HTTP, SSH, NFS, NTP, SMTP, SNMP, TELNET, RPC, SOAP
4 Transport TCP, UDP
3 Network IP (IPv4, IPv6), ICMP, ARP, RARP
2 Data Link Ethernet, Token Ring, 802.11, Wi-Fi, FDDI, PPP
1 Physical 10BASE-T, 100BASE-T, 1000BASE-T, Optical Fibre, Twisted Pair
 Four-layer model combines data link and physical layers

Interconnect
TCP/IP Transport Layer
 Transport Layer
 Connection-oriented (TCP)
 Connectionless (UDP)
[Diagram: protocol stack from bottom to top: Physical Layer, Ethernet, IP, then TCP and UDP, with Clusterware and RAC at the top]

Interconnect
Encapsulation
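[Diagram: each layer wraps the data passed down from the layer above]
Data
UDP Header (8 bytes) + Data
IP Header (20 bytes) + UDP Header + Data
Ethernet Header (14 bytes) + IP Header + UDP Header + Data + Ethernet Trailer (4 bytes)
MTU Size covers IP Header + UDP Header + Data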

Oracle Clusterware
Node Heartbeat Messages
 Sent to each node in cluster every second in both directions
 Checks nodes are still members of cluster
 Sent by ocssd.bin using TCP well-known port 49895
 Outgoing message is 134 bytes (80 byte payload)
 Incoming message is 66 bytes (12 byte payload)
[Diagram: Node 1 to Node 4 exchanging outgoing and incoming heartbeat messages]

Oracle Clusterware
Node Status Messages
 Number of packets exchanged by a node is determined by
number of nodes in cluster
 Number of packets per node per hour is
 (#nodes - 1) * 4 messages * 3600 seconds
Number of nodes Packets per hour
2 14,400
3 28,800
4 43,200
5 57,600
6 72,000
7 86,400
8 100,800
16 216,000
32 446,400
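The table can be reproduced from the formula above. This is a minimal sketch; the function and parameter names are illustrative only.

```python
def packets_per_hour(nodes, messages_per_second=4):
    """(#nodes - 1) * 4 messages * 3600 seconds, as stated in the formula above."""
    return (nodes - 1) * messages_per_second * 3600


for n in (2, 3, 4, 5, 6, 7, 8, 16, 32):
    print(n, format(packets_per_hour(n), ","))
```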

RAC Background Processes
Overview
[Diagram: Node 1 runs Instance 1 and Node 2 runs Instance 2. Each instance has a shared pool, a buffer cache and the background processes LMON, LMD0, LMSn, LCK0, DIAG, DBWR, LGWR, CKPT, ARCn, SMON and PMON. Storage holds the datafiles, the controlfiles and a set of redo logs for each instance.]

LMSn
 LMSn
 Global Cache Service Process
 Manage requests for data access across cluster
 Up to 20 in Oracle 10.1
 LMS0-LMS9 LMSa-LMSj
 Up to 36 in Oracle 10.2
 LMS0-LMS9 LMSa-LMSz
 In Oracle 10.1 and above, number of GCS server processes
can be configured using gcs_server_processes parameter
 Default value is 1 (single CPU system)
 Can also be configured using _lm_lms parameter

LMSn
 LMS processes run in real-time mode
 Remaining processes run in time-share mode
 Check using:
[oracle@server3 ~]$ ps -eo pid,user,opri,cmd | grep ora_lm
8596 oracle 75 ora_lmon_TEST1
8598 oracle 75 ora_lmd0_TEST1
8601 oracle 58 ora_lms0_TEST1
 58 is real time; 75 or 76 is time share
 You can also check process scheduling policies using chrt
[oracle@server3 ~]$ chrt -p 8601 # lms0 - Real Time
pid 8601's current scheduling policy: SCHED_RR
pid 8601's current scheduling priority: 1
[oracle@server3 ~]$ chrt -p 8596 # lmon - Time Share
pid 8596's current scheduling policy: SCHED_OTHER
pid 8596's current scheduling priority: 0

LCK0
 LCK0
 Instance Enqueue Process
 Part of KCL (Kernel Cache Library)
 Manages
 instance resource requests
 cross-instance call operations
 Assists LMS processes
 Formerly known as lock process
 One LCK0 process per instance
 In 9.0.1 and below, number of lock processes may be
configurable using _gc_lck_procs parameter

LMD0
 LMD0
 Global Enqueue Service Daemon
 Manages requests for global enqueues
 Updates status of enqueues when granted to / revoked
from an instance
 Responsible for deadlock detection
 One LMD0 process per instance
 In 8.1.7 and below number of lock daemons may be
configurable using _lm_dlmd_processes parameter

LMON
 LMON
 Global Enqueue Service Monitor
 One LMON process per instance
 Monitors cluster to maintain global enqueues and
resources
 Manages
 instance and process expirations
 recovery processing for cluster enqueues

DIAG
 DIAG - Diagnosability Process
 Collects diagnostic data in the event of a failure
 Creates subdirectories in BACKGROUND_DUMP_DEST
directory
 In Oracle 9.0.1 and above can be disabled using
_diag_daemon parameter
 Do not try this on a production system

UDP Messages
 There are two types of message exchanged within RAC
 These are PROBABLY defined as follows
 Synchronous
 These messages require an acknowledgement for each
packet
 In some cases the acknowledgement packet can be
larger than the original request
 e.g. SCN synchronization
 Asynchronous
 These messages do not require an individual
acknowledgement for each packet
 e.g. block transfers between instances

Lock Modes
 Lock modes can be:
 Null
 Another instance can hold an exclusive or shared lock
 Shared
 Another instance can hold a shared lock but not an
exclusive lock
 Exclusive
 No other instances can hold shared or exclusive locks
 Locks can also be:
 Local
 No other instance has held an exclusive lock
 Global
 Another instance has held an exclusive lock in the past
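The three lock modes above amount to a small compatibility matrix. The sketch below encodes it as a lookup table; the names are hypothetical, and it models only the mode rules listed on this slide, not the local/global role.

```python
# For each mode held by one instance, the modes another instance may hold
# on the same resource at the same time.
COMPATIBLE = {
    "N": {"N", "S", "X"},  # Null: others may hold shared or exclusive
    "S": {"N", "S"},       # Shared: others may hold shared but not exclusive
    "X": {"N"},            # Exclusive: no other shared or exclusive holders
}


def can_coexist(held, requested):
    """True if `requested` can be held by another instance alongside `held`."""
    return requested in COMPATIBLE[held]


print(can_coexist("S", "S"), can_coexist("S", "X"), can_coexist("X", "S"))
```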

Fairness Threshold
 Intended to prevent unnecessary lock downgrades when other
instances only require read-only copies
 For write to read transfers
 Writing instance retains X lock
 Reading instance retains null lock
 If _fairness_threshold reached then
 Writing instance downgrades X lock to S lock
 Reading instance receives S lock
 _fairness_threshold default value is 4
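The behaviour described above can be sketched as a counter on the instance holding the X lock. This is a simplified model under stated assumptions, not Oracle's code: the class name is hypothetical, and it assumes the downgrade happens on the request that brings the count up to the threshold.

```python
class WritingInstance:
    """Toy model of the holder of an X lock serving read requests."""

    def __init__(self, fairness_threshold=4):
        self.mode = "X"
        self.reads_served = 0
        self.fairness_threshold = fairness_threshold

    def serve_read(self):
        """Return the lock mode the reading instance ends up with."""
        if self.mode == "S":
            return "S"  # already downgraded, readers share the lock
        self.reads_served += 1
        if self.reads_served >= self.fairness_threshold:
            # Threshold reached: downgrade X to S, reader receives S.
            self.mode = "S"
            self.reads_served = 0
            return "S"
        # Below threshold: writer keeps X, reader retains a null lock.
        return "N"


writer = WritingInstance()
granted = [writer.serve_read() for _ in range(5)]
print(granted, writer.mode)
```

With the default threshold of 4, the first three readers keep null locks and the fourth triggers the downgrade.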

Lock Elements
 Lock elements are externalized in the V$LOCK_ELEMENT
dynamic performance view
 Based on X$LE
 Additional information is available in the X$LE view
 Past image buffers do not have a lock element
 In OPS one lock element could manage a contiguous range of
blocks
 Still can in RAC using GC_FILES_PER_LOCK parameter
 Disables Cache Fusion

Lock Elements
 Contain embedded GCS Client structures (KJBL)
[Diagram: three lock elements, each containing an embedded GCS client; each lock element is linked to one or more buffer headers]

Memory Structures
[Diagram: relationships between buffer headers (BH), lock elements (LE), GCS clients and GCS shadows (KJBL), and GCS resources (KJBR)]
 GCS Shadow describes blocks held by other instances, but mastered locally

Memory Structures
 GCS Resources (KJBR)
 Stored in segmented array
 Number of GCS resource structures determined by
 _gcs_resources parameter
 Externalized in X$KJBR
 Number of free GCS resource structures in X$KJBRFX
 GCS Enqueues (Clients / Shadows) (KJBL)
 GCS clients embedded in lock elements
 GCS shadows stored in segmented array
 Number of GCS shadow structures determined by
 _gcs_shadow_locks parameter
 Externalized in X$KJBL
 Number of free GCS shadow structures in X$KJBLFX

Dynamic Remastering
 Example
SELECT data_object_id FROM dba_objects
WHERE owner = 'US01' AND object_name = 'T1';
OBJECT_ID
---------
52084
 To remaster object at current instance use:
ORADEBUG LKDEBUG -m pkey 52084
 All blocks now mastered by the current instance
 To redistribute masters to all available instances use:
ORADEBUG LKDEBUG -m dpkey 52084
 Blocks mastered by both (all) instances again

Block Mastering
 In Oracle 10.1 and below block mastering is determined by a
hash function
 Algorithm applied to groups of 1289 contiguous blocks
 In two node cluster
 Instance 0 has 645 blocks
 etc
 In three node cluster
 etc
 Beware of small hot tables and indexes....

Dumps
 To dump the contents of the global cache use:
ALTER SESSION SET EVENTS
'IMMEDIATE TRACE NAME GC_ELEMENTS LEVEL 1';
GLOBAL CACHE ELEMENT DUMP (address: 0x21fecd18):
id1: 0x3591 id2: 0x10000 obj: 181 block: (1/13713)
lock: SL rls: 0x0000 acq: 0x0000 latch: 0
flags: 0x41 fair: 0 recovery: 0 fpin: 'kdswh05: kdsgrp'
bscn: 0x0.18a9c bctx: (nil) write: 0 scan: 0x0 xflg: 0 xid: 0x0.0.0
GCS CLIENT 0x21fecd60,1 sq[(nil),(nil)] resp[(nil),0x3591.10000] pkey 181
grant 1 cvt 0 mdrole 0x21 st 0x20 GRANTQ rl LOCAL
master 1 owner 0 sid 0 remote[(nil),0] hist 0x7c
history 0x3c.0x1.0x0.0x0.0x0.0x0. cflag 0x0 sender 2 flags 0x0 replay# 0
disk: 0x0000.00000000 write request: 0x0000.00000000
pi scn: 0x0000.00000000
msgseq 0x1 updseq 0x0 reqids[1,0,0] infop 0x0
pkey 181
hv 107 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 6, dom 0]
kjga st 0x4, step 0.0.0, cinc 8, rmno 10, flags 0x0
lb 0, hb 0, myb 178, drmb 178, apifrz 0

Dumps
 Continued
GLOBAL CACHE ELEMENT DUMP (address: 0x237f4358):
id1: 0x6a39 id2: 0x10000 obj: 74 block: (1/27193)
lock: SL rls: 0x0000 acq: 0x0000 latch: 0
flags: 0x41 fair: 0 recovery: 0 fpin: 'kdswh05: kdsgrp'
bscn: 0x0.26992 bctx: (nil) write: 0 scan: 0x0 xflg: 0 xid: 0x0.0.0
GCS SHADOW 0x237f43a0,1 sq[0x2ee64e8c,0x2eff3858] resp[0x2ee64e74,0x6a39.10000] pkey 74
master 0 owner 0 sid 0 remote[(nil),0] hist 0x12a5
.....
GCS RESOURCE 0x2ee64e74 hashq [0x2ee61894,0x2ff57390] name[0x6a39.10000] pkey 74
grant 0x2eff3858 cvt (nil) send (nil),0 write (nil),0@65535
flag 0x0 mdrole 0x1 mode 1 scan 0 role LOCAL
.....
GCS SHADOW 0x2eff3858,1 sq[0x237f43a0,0x2ee64e8c] resp[0x2ee64e74,0x6a39.10000] pkey 74
master 0 owner 1 sid 0 remote[0x23fea160,1] hist 0x65f
.....
GCS SHADOW 0x237f43a0,1 sq[0x2ee64e8c,0x2eff3858] resp[0x2ee64e74,0x6a39.10000] pkey 74
master 0 owner 0 sid 0 remote[(nil),0] hist 0x12a5
.....

System Change Number
 In RAC clusters SCN must be maintained across all nodes in
cluster
 SCN propagation scheme differs according to version
 In Oracle 10.1 and below defaults to Lamport algorithm
 Lamport in alert.log
 SCN piggy-backed on GCS/GES messages
 Recorded in redo log
 Default delay of 7 seconds
 In Oracle 10.2 and above defaults to Broadcast on Commit
algorithm
 SCN negotiated immediately
 Apparently no delay

 System Change Number algorithm is determined by the
MAX_COMMIT_PROPAGATION_DELAY parameter
 In Oracle 10.1 and below
 Initialization parameter specified in centiseconds
 Default value is 700 centiseconds (7 seconds)
 Specifies maximum time taken for a COMMIT on one node
to be reflected on other nodes in the cluster
 For some applications performing rapid updates and
queries of the same data from different instances, value
must be set to 0 (Broadcast on commit)
 Examples include:
 E-Business suite
 SAP

 In Oracle 10.2 and above, the default value of the MAX_COMMIT_PROPAGATION_DELAY parameter is 0
 SCN broadcast on commit method is used
 SCN updates are synchronized immediately
 SCN is synchronized
 after current read
 before block updated
 This ensures correct SCN is written to block

Broadcast on Commit
 Ethernet broadcast is not used
 SCN is synchronized by updating instance
 Sends UDP SCN synchronization message to each remote
instance
 Remote instances respond with their current SCN
 Another round of messages may be required if remote SCNs
are more recent than local SCN
 Synchronization occurs every time an instance needs a new
SCN
 Synchronization is always performed by the updating instance
 Number of messages = 4 x (number of instances - 1)

Broadcast on Commit
 In a 4-node cluster 12 messages are exchanged
RAC4-LMS0 RAC1-LMS0 Send current SCN 192
RAC1-LMS0 RAC4-LMS0 OK 212
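The message count quoted for the four-node cluster follows from the formula on the previous slide. A minimal sketch, with an illustrative function name:

```python
def scn_sync_messages(instances):
    """Number of messages = 4 x (number of instances - 1)."""
    return 4 * (instances - 1)


print(scn_sync_messages(4))
```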

Global Cache Service
Read Consistency
 When a read consistent version of a block is requested it may
be necessary to apply undo to a more recent version of that
block
 Undo can be applied by LMSn background process in
 Remote instance
 Local instance
 If undo applied by remote instance, any outstanding redo
must first be flushed from redo buffer of remote instance to
redo log
 Can have significant performance impact on consistent
reads
 Particularly on extended clusters

Read Consistency
 Statistics on inter-instance consistent reads are reported in
V$CR_BLOCK_SERVER
 Reports statistics for blocks served by local instances to
remote instances including
 Number of consistent reads served
 Number of current reads served
 Number of data blocks served
 Number of undo blocks served
 Number of undo headers served
 Number of fairness down converts
 Number of log flushes
 Number of times light works rule invoked

Read Consistency
 In theory, once a block has been written to disk, the LMS
process will not attempt to read it again when responding to a
consistent read request
 Light Works Rule
 Prevents LMS processes from going to disk when
responding to CR requests for data, undo or undo segment
blocks
 Can prevent LMS process from completing its response to
a CR request

Read Consistency
 Uncommitted changes MUST be flushed to the redo log before
the LMS process can ship a consistent block to another
instance
 Reading process must wait until redo log changes have been
written to redo log by LMS process
 Bad for standard RAC databases
 Reads must wait for redo log writes
 Worse for extended / stretch RAC clusters
 Increased latency of cross site disk communications

Read Consistency
 For each block on which a consistent read is performed, a
redo log flush must first be performed
 Number of redo log flushes is recorded in the FLUSHES
column of V$CR_BLOCK_SERVER
 Redo log flush time
 is recorded in the gc cr block flush time statistic for the
LMS process
 will increase time taken to serve consistent block
 will increase time taken to perform consistent read
 If LMS processes become very busy, consistent reads will
experience high wait times e.g. for a full table scan
gc cr multi block request

Read Consistency
Committed transaction on RAC2 - All blocks still in buffer cache
[Diagram: RAC1 and RAC2, each with a buffer cache and redo buffer, and the redo log; block versions labelled 108, 109 and 110; three numbered steps]

Read Consistency
Committed transaction on RAC2 - Some blocks written to disk
[Diagram: RAC1, RAC2 and the redo log; block versions labelled 108, 109 and 110; four numbered steps]

Read Consistency
Uncommitted transaction on RAC2 - All blocks still in buffer cache
[Diagram: RAC1, RAC2 and the redo log; block versions labelled 108, 109 and 110; six numbered steps]

Read Consistency
Uncommitted transaction on RAC2 - Some blocks written to disk
[Diagram: RAC1, RAC2 and the redo log; block versions labelled 108, 109 and 110; eight numbered steps]

Jumbo Frames
 By default Maximum Transmission Unit (MTU) is 1500
 MTU includes
 IP header
 UDP header
 Data
 Requires six packets to transmit one 8192 byte block
 On some adapters MTU can be increased to around 9000
 e.g. Intel PRO/1000
 At command line
ifconfig eth1 mtu 9000 up
 or in /etc/sysconfig/network-scripts/ifcfg-eth<x> (Red Hat-style systems)
MTU=9000

Jumbo Frames
 Example - cost of sending one 8192 byte block
 MTU=1500 (default)
Frame#  Ethernet Header  IP Header  UDP Header  Data  Ethernet Trailer  Total
1       14               20         8           1472  4                 1518
2       14               20         8           1472  4                 1518
3       14               20         8           1472  4                 1518
4       14               20         8           1472  4                 1518
5       14               20         8           1472  4                 1518
6       14               20         8            840  4                  886
Total   84               120        48          8200  24                8476
 MTU=9000
Frame#  Ethernet Header  IP Header  UDP Header  Data  Ethernet Trailer  Total
1       14               20         8           8200  4                 8246
Total   14               20         8           8200  4                 8246
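The totals in the two tables can be checked with a short calculation. The sketch below follows the slide's accounting, in which every frame carries an Ethernet header, IP header, UDP header and Ethernet trailer; real IP fragmentation may differ in detail. The 8200 byte payload is taken from the tables, and the function name is illustrative.

```python
ETH_HEADER, IP_HEADER, UDP_HEADER, ETH_TRAILER = 14, 20, 8, 4


def frame_sizes(payload_bytes, mtu):
    """Return the on-the-wire size of each frame needed to carry the payload."""
    data_per_frame = mtu - IP_HEADER - UDP_HEADER  # MTU covers IP + UDP + data
    sizes = []
    remaining = payload_bytes
    while remaining > 0:
        data = min(remaining, data_per_frame)
        sizes.append(ETH_HEADER + IP_HEADER + UDP_HEADER + data + ETH_TRAILER)
        remaining -= data
    return sizes


standard = frame_sizes(8200, mtu=1500)
jumbo = frame_sizes(8200, mtu=9000)
print(len(standard), sum(standard))
print(len(jumbo), sum(jumbo))
```

With MTU=1500 the payload needs six frames totalling 8476 bytes; with MTU=9000 it fits in one frame of 8246 bytes, a saving of 230 bytes and five frames per block.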

Jumbo Frames
 Not all network adapter drivers support jumbo frames
 Particularly cheap ones....
 All network adapters in private interconnect must have same
MTU size
 Switch must also be configured to support jumbo frames
 Lots of bugs and compatibility issues e.g.
 Bug 4447620: RAC UDP MTU size restricted to 1500 or 9000
 affects 10.1.0.5, 10.2.0.1
 fixed in 10.2.0.2 and above
