B35 Inside rac by Julian Dyke
- 1.
Inside RAC
Julian Dyke
Independent Consultant
DB Tech Showcase - Osaka
May 2013
© 2013 Julian Dyke juliandyke.com
- 2. © 2013 Julian Dyke juliandyke.com2
Agenda
OPS versus RAC
Buffer Cache
Global Cache Services
- 3. © 2013 Julian Dyke juliandyke.com3
RAC
Overview
[Diagram: four nodes (Node 1 to Node 4), each running one instance (Instance 1 to Instance 4). Every node is attached to the public network, the private network (interconnect) and, through the storage network, the shared storage.]
- 4. © 2013 Julian Dyke juliandyke.com4
OPS versus RAC
Oracle 8.0.6 and below
[Diagram: OPS - Oracle 8.0.6 and below. Instance 1 on Node 1 and Instance 2 on Node 2, linked by the interconnect and attached to shared storage. Arrows are labelled Current Writes, Consistent Reads and Current Reads.]
All I/O uses shared storage
Enqueues only use interconnect
- 5. © 2013 Julian Dyke juliandyke.com5
OPS versus RAC
Oracle 8.1.5 to Oracle 8.1.7
[Diagram: OPS - Oracle 8.1.5 to Oracle 8.1.7 - Cache Fusion Phase 1. Instance 1 on Node 1 and Instance 2 on Node 2, linked by the interconnect and attached to shared storage. Arrows are labelled Current Writes, Consistent Reads and Current Reads.]
Current I/O always uses shared storage
Consistent reads can use interconnect
- 6. © 2013 Julian Dyke juliandyke.com6
OPS versus RAC
Oracle 9.0.1 and above
[Diagram: RAC - Oracle 9.0.1 and above - Cache Fusion Phase 2. Instance 1 on Node 1 and Instance 2 on Node 2, linked by the interconnect and attached to shared storage. Arrows are labelled Current Writes, Consistent Reads and Current Reads.]
Current I/O and consistent reads can use interconnect
- 7. © 2006 Julian Dyke juliandyke.com7
Buffer Cache
Single Block Reads
[Animation: list of buffers running between the head of the hot end and the head of the cold end; each buffer shows a block number and a touch count]
Read Block 42: get first available buffer from cold end; update buffer contents; insert buffer at head of cold end
Read Block 11: get first available buffer from cold end; update buffer contents; insert buffer at head of cold end
Read Block 42: update touch count for block 42
Read Block 33: move block 71 to head of hot end; set touch count on block 71 to zero; get first available buffer from cold end; update buffer contents; insert buffer at head of cold end
Read Block 34: update touch count for block 34
Read Block 87: move block 42 to head of hot end; set touch count on block 42 to zero; get first available buffer from cold end; update buffer contents; insert buffer at head of cold end
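The single block read steps on this slide can be sketched as a small simulation. This is an illustrative toy model, not Oracle's implementation: the two-deque layout, the `BufferCache` class and its `read` method, the promotion threshold of 2, the starting cache contents and the refill behaviour when the cold region empties are all assumptions made for the example.

```python
from collections import deque

TOUCH_THRESHOLD = 2  # assumption: buffers touched at least twice are promoted


class BufferCache:
    """Toy model: index 0 of each deque is the head, the last entry is the tail."""

    def __init__(self, hot, cold):
        # Each buffer is a two-item list: [block_number, touch_count]
        self.hot = deque([b, t] for b, t in hot)
        self.cold = deque([b, t] for b, t in cold)

    def read(self, block):
        for buf in list(self.hot) + list(self.cold):
            if buf[0] == block:
                buf[1] += 1               # block already cached: update touch count
                return "hit"
        while True:
            if not self.cold:
                # Assumption: refill the cold region from the tail of the hot region
                self.cold.appendleft(self.hot.pop())
            victim = self.cold.pop()      # next candidate at the cold end
            if victim[1] < TOUCH_THRESHOLD:
                break                     # available buffer found, reuse it
            victim[1] = 0                 # popular block: set touch count to zero
            self.hot.appendleft(victim)   # and move it to the head of the hot end
        self.cold.appendleft([block, 1])  # new contents go to the head of the cold end
        return "miss"


cache = BufferCache(hot=[(92, 0), (34, 3)],
                    cold=[(72, 4), (52, 1), (71, 2), (66, 0)])
cache.read(42)   # miss: reuses the buffer holding block 66
cache.read(42)   # hit: touch count of block 42 becomes 2
cache.read(33)   # miss: block 71 is promoted, the buffer of block 52 is reused
print(list(cache.hot))
print(list(cache.cold))
```

The third read mirrors the "Read Block 33" step: block 71 moves to the head of the hot end with its touch count reset before an available buffer is taken from the cold end.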
- 8. © 2006 Julian Dyke juliandyke.com8
Buffer Cache
Multi Block Reads
DB_FILE_MULTIBLOCK_READ_COUNT = 4
[Animation: list of buffers running between the head of the hot end and the head of the cold end; blocks 1 to 8 are read in two batches of four]
Read Block 1: get first four available buffers from cold end; read next four blocks into buffers; insert buffers at head of cold end; move block 1 to cold end
Read Block 2: move block 2 to cold end
Read Block 3: move block 3 to cold end
Read Block 4: move block 4 to cold end
Read Block 5: get next four available buffers from cold end; read next four blocks into buffers; insert buffers at head of cold end; move block 5 to cold end
Read Block 6: move block 6 to cold end
Read Block 7: move block 7 to cold end
Read Block 8: move block 8 to cold end
- 9. © 2013 Julian Dyke juliandyke.com9
Global Services
Overview
Resource
Object to which access must be controlled at instance
level
Enqueue
Memory structure that serializes access to a resource
Global Resources
Object to which access must be controlled at cluster level
Global Enqueue
Locks and enqueues which need to be consistent between
all instances
- 10. © 2013 Julian Dyke juliandyke.com10
Global Services
Overview
Global Resource Directory (GRD)
Records current state and owner of each resource
Contains convert and write queues
Distributed across all instances in cluster
Maintained by GCS and GES
Global Cache Services (GCS)
Implements cache coherency for database
Coordinates access to database blocks for instances
Global Enqueue Services (GES)
Controls access to other resources (locks) including
library cache and dictionary cache
Performs deadlock detection
- 11. © 2013 Julian Dyke juliandyke.com11
Global Cache Services
Introduction
Global Cache Services exist to implement Cache Fusion
Cache Fusion allows blocks to be updated by multiple
instances
Only one instance can have the updatable (current) version of
a block
GCS must ensure that only one instance can update a
block at any time
Many instances can have read-only (consistent read) versions
of a block
Instances can have multiple copies of same block at
different SCNs
- 12. © 2013 Julian Dyke juliandyke.com12
Global Cache Services
2-way Current Read
[Diagram: Instances 1 to 4; Instance 3 is the resource master. Instance 2 requests current read on block (labelled 1318). Steps 1 to 4, with message labels: Request shared resource; Request granted; Read request; Block returned. Status markers shown: S, N.]
- 13. © 2013 Julian Dyke juliandyke.com13
Global Cache Services
3-way Current Read
[Diagram: Instances 1 to 4; Instance 3 is the resource master. Instance 1 requests exclusive read on block. Steps 1 to 4, with message labels: Request exclusive resource; Transfer block to Instance 1 for exclusive access; Block and resource status; Resource status. Block versions shown: 1318, 1320. Status markers shown: S, N, X.]
- 14. © 2013 Julian Dyke juliandyke.com14
Global Cache Services
3-way Current Read (Dirty Block)
[Diagram: Instances 1 to 4; Instance 3 is the resource master. Instance 4 requests exclusive read on block. Steps 1 to 4, with message labels: Request block in exclusive mode; Transfer block to Instance 4 in exclusive mode; Block and resource status; Resource status. Block versions shown: 1318, 1320, 1323. Status markers shown: S, N, X.]
Note that Instance 1 will create a past image (PI) of the dirty block
- 15. © 2013 Julian Dyke juliandyke.com15
Global Cache Services
3-way Current (Without Downgrade)
[Diagram: Instances 1 to 4; Instance 3 is the resource master. Instance 2 requests current read on block. Steps 1 to 4, with message labels: Request block in shared mode; Transfer block to Instance 2 in shared mode; Block and resource status; Resource status. Block versions shown: 1318, 1320, 1323. Status markers shown: N, X.]
In Oracle 8.1.5 and above _fairness_threshold is used to avoid unnecessary lock conversions
- 16. © 2013 Julian Dyke juliandyke.com16
Global Cache Services
3-way Current (With Downgrade)
[Diagram: Instances 1 to 4; Instance 3 is the resource master. Instance 2 requests current read on block. Steps 1 to 4, with message labels: Request block in shared mode; Transfer block to Instance 2 in shared mode; Block and resource status; Resource status. Block versions shown: 1318, 1320, 1323. Status markers shown: N, X, S.]
In Oracle 8.1.5 and above _fairness_threshold is used to avoid unnecessary lock conversions
- 17. © 2013 Julian Dyke juliandyke.com17
Global Cache Services
Past Images
When an instance passes a dirty block to another instance it
Flushes redo buffer to redo log
Retains past image (PI) of block in buffer cache
PI is retained until another instance writes block to disk
Used to reduce recovery times
Recorded in V$BH.STATUS as PI
Based on X$BH.STATE (value 8 in Oracle 10.2)
- 18. © 2013 Julian Dyke juliandyke.com18
Global Cache Services
Past Images
[Animation: Instance 1 and Instance 2, each with a buffer cache and its own redo log (Redo Log 1 and Redo Log 2); the value of column c1 in block 42 advances from 7123 to 7129]
Statements run on Instance 1:
UPDATE t1 SET c1 = 7124; COMMIT;
UPDATE t1 SET c1 = 7125; COMMIT;
UPDATE t1 SET c1 = 7126; COMMIT;
UPDATE t1 SET c1 = 7127; COMMIT;
UPDATE t1 SET c1 = 7128; COMMIT;
Statement run on Instance 2:
UPDATE t1 SET c1 = 7129; COMMIT;
Assume table t1 contains a single row in block 42
Instance 1 updates column to 7124: block 42 is read from disk; undo/redo written to Redo Log 1; block 42 is updated in buffer cache
Instance 1 updates column to 7125: undo/redo written to Redo Log 1; block 42 is updated in buffer cache
Instance 1 updates column to 7126: undo/redo written to Redo Log 1; block 42 is updated in buffer cache
Instance 1 updates column to 7127: undo/redo written to Redo Log 1; block 42 is updated in buffer cache
Instance 1 updates column to 7128: undo/redo written to Redo Log 1; block 42 is updated in buffer cache
Instance 2 updates column to 7129: GCS transfers block from Instance 1 to Instance 2; Instance 1 makes block 42 a Past Image block; undo/redo written to Redo Log 2; block 42 is updated in buffer cache
Instance 2 crashes: contents of buffer cache are lost; DBWR has not written changes to block 42 back to disk yet
Instance 1 must perform recovery for Instance 2: block 42 needs recovery; Instance 1 uses Past Image; undo/redo is applied from Redo Log 2; block 42 is subsequently written back to disk by DBWR
- 19. © 2013 Julian Dyke juliandyke.com19
Global Cache Services
Wait Events
Wait events show reads where messages have been
exchanged with other instances
Can include:
gc cr grant 2-way
gc cr block 2-way
gc cr block 3-way
gc cr multi block request
gc current grant 2-way
gc current block 2-way
gc current block 3-way
gc current multi block request
- 20. © 2013 Julian Dyke juliandyke.com20
Global Cache Services
gc cr block 3-way wait event
Source Destination Description Bytes
RAC4 - Server RAC2 - LMS1 Request file 8 block 15 456
RAC2 - LMS1 RAC4 - Server OK 212
RAC2 - LMS1 RAC3 - LMS1 Send file 8 block 15 to RAC4 480
RAC3 - LMS1 RAC2 - LMS1 OK 212
RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 1 1500
RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 2 1500
RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 3 1500
RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 4 1500
RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 5 1500
RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 6 868
- 21. © 2013 Julian Dyke juliandyke.com21
Global Cache Services
gc cr block 3-way wait event
[Diagram: RAC1 to RAC4; RAC3 is the resource master; steps 1 to 10. Block copies shown contain rows (1,40), (2,44) and (1,42), (2,44).]
UPDATE t1
SET c2 = 50
WHERE c1 = 2;
- 22. © 2013 Julian Dyke juliandyke.com22
Global Cache Services
gc cr block 2-way wait event
2-way Consistent Read
Source Destination Description Bytes
RAC4 - Server RAC3 - LMS1 Request file 6 block 69 400
RAC3 - LMS1 RAC4 - Server OK 212
RAC3 - LMS1 RAC4 - Server Block file 6 block 69 part 1 1500
RAC3 - LMS1 RAC4 - Server Block file 6 block 69 part 2 1500
RAC3 - LMS1 RAC4 - Server Block file 6 block 69 part 3 1500
RAC3 - LMS1 RAC4 - Server Block file 6 block 69 part 4 1500
RAC3 - LMS1 RAC4 - Server Block file 6 block 69 part 5 1500
RAC3 - LMS1 RAC4 - Server Block file 6 block 69 part 6 868
- 23. © 2013 Julian Dyke juliandyke.com23
Global Cache Services
gc cr block 2-way wait event
[Diagram: RAC1 to RAC4; RAC3 is the resource master; steps 1 to 8. Block copies shown contain rows (1,40), (2,44).]
UPDATE t1
SET c2 = 50
WHERE c1 = 2;
- 24. © 2013 Julian Dyke juliandyke.com24
Global Cache Services
gc current block 3-way wait event
3-way Current Read
Source Destination Description Bytes
RAC4 - Server RAC2 - LMS1 Request file 8 block 15 456
RAC2 - LMS1 RAC4 - Server OK 212
RAC2 - LMS1 RAC3 - LMS1 Send file 8 block 15 to RAC4 480
RAC3 - LMS1 RAC2 - LMS1 OK 212
RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 1 1500
RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 2 1500
RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 3 1500
RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 4 1500
RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 5 1500
RAC3 - LMS1 RAC4 - Server Block file 8 block 15 part 6 868
RAC4 - LMS1 RAC2 - LMS1 Received file 8 block 15 244
RAC2 - LMS1 RAC4 - LMS1 OK 212
- 25. © 2013 Julian Dyke juliandyke.com25
Global Cache Services
gc current block 3-way wait event
[Diagram: RAC1 to RAC4; RAC3 is the resource master; steps 1 to 12. Block copies shown contain rows (1,40), (2,44); (1,42), (2,44); and (1,42), (2,50).]
UPDATE t1
SET c2 = 50
WHERE c1 = 2;
UPDATE t1
SET c2 = 42
WHERE c1 = 1;
RAC3 saves past image of the dirty block until RAC4 writes the block to disk
- 26. © 2013 Julian Dyke juliandyke.com26
Global Cache Services
gc cr grant 2-way wait event
2-way Consistent Read
Source Destination Description Bytes
RAC4 - Server RAC3 - LMS1 Request file 6 block 69 400
RAC3 - LMS1 RAC4 - Server OK 212
RAC3 - LMS1 RAC4 - Server Grant read file 6 block 69 276
RAC4 - Server RAC3 - LMS1 OK 212
- 27. © 2013 Julian Dyke juliandyke.com27
Global Cache Services
gc cr grant 2-way wait event
[Diagram: RAC1 to RAC4; RAC3 is the resource master; steps 1 to 6. Block copies shown contain rows (1,40), (2,44).]
SELECT c2
FROM t1
WHERE c1 = 1;
- 28. © 2013 Julian Dyke juliandyke.com28
Global Cache Services
gc cr multi block request wait event
Source Destination Description Bytes
RAC4 - Server RAC3 - LMS1 Request file 8 blocks 69-76 1872
RAC3 - LMS1 RAC4 - Server OK 212
RAC3 - LMS1 RAC4 - Server Grant file 8 blocks 69-76 to RAC4 772
RAC4 - Server RAC3 - LMS1 OK 212
- 29. © 2013 Julian Dyke juliandyke.com29
Global Cache Services
gc cr multi block request wait event
[Diagram: RAC1 to RAC4; RAC3 is the resource master; steps 1 to 6. Multiple block copies are shown, each containing rows (1,40), (2,44).]
SELECT c2
FROM t1
WHERE c1 = 1;
- 30. © 2013 Julian Dyke juliandyke.com30
Global Cache Services
gc cr multi block request wait event
The following 10046/8 trace is for a gc cr multi block request
WAIT #2: nam='gc cr multi block request' ela= 722 file#=4
block#=248 class#=1 obj#=51866 tim=1169728375495574
WAIT #2: nam='db file scattered read' ela= 10437 file#=4 block#=244
blocks=5 obj#=51866 tim=1169728375506092
This trace can be misleading because:
the gc cr multi block request specifies the LAST block in
the range
the gc cr multi block request does not specify how many
blocks should be read
the gc cr multi block request does not specify how many
blocks have been returned from another instance
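The relationship between the two WAIT lines in this trace can be checked with a short script. This is a minimal sketch: the `parse_wait` helper and its regular expressions are illustrative assumptions rather than an Oracle-supplied parser, and the two trace lines are the ones quoted on this slide with their wrapped halves joined.

```python
import re

TRACE = """\
WAIT #2: nam='gc cr multi block request' ela= 722 file#=4 block#=248 class#=1 obj#=51866 tim=1169728375495574
WAIT #2: nam='db file scattered read' ela= 10437 file#=4 block#=244 blocks=5 obj#=51866 tim=1169728375506092
"""

WAIT_RE = re.compile(r"WAIT #\d+: nam='(?P<name>[^']+)' ela= *(?P<ela>\d+) (?P<params>.*)")


def parse_wait(line):
    """Return (event name, elapsed value, parameter dict) for one WAIT line."""
    m = WAIT_RE.match(line)
    if m is None:
        return None
    # Remaining fields are name=value pairs such as file#=4 or blocks=5
    params = {k: int(v) for k, v in re.findall(r"(\S+?)=(\d+)", m.group("params"))}
    return m.group("name"), int(m.group("ela")), params


waits = [parse_wait(line) for line in TRACE.splitlines()]
gc_wait, scattered = waits[0][2], waits[1][2]

# The scattered read covers blocks 244..248; the gc wait names only the last one
last_block_read = scattered["block#"] + scattered["blocks"] - 1
print(gc_wait["block#"], last_block_read)
```

Both values come out as 248, which is the point made above: the gc cr multi block request reports the last block of the range, not the first.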
- 31. © 2013 Julian Dyke juliandyke.com31
Global Cache Services
Block Mastering
Each block is mastered on one instance
Block DBA is reported by X$KJBR.KJBRNAME
Names have the format:
[<block_number>][<file_number>][BL]
For example
[0x137][0x40000][BL]
is file# 4, block# 311
Ordering by X$KJBR.KJBRNAME is difficult because the resource names do not collate when sorted e.g.:
[0x12E][0x40000][BL]
[0x12F][0x40000][BL]
[0x13][0x40000][BL]
[0x130][0x40000][BL]
[0x131][0x40000][BL]
etc...
- 32. © 2013 Julian Dyke juliandyke.com32
Global Cache Services
Block Mastering
Some useful functions
CREATE OR REPLACE FUNCTION get_file_number (p_resource_name VARCHAR2)
RETURN INTEGER
IS
pos1 INTEGER := INSTR (p_resource_name,'x',1,2);
pos2 INTEGER := INSTR (p_resource_name,']',1,2);
s VARCHAR2(30) := SUBSTR (p_resource_name,pos1+1,pos2-pos1-1);
BEGIN
RETURN TO_NUMBER (s,'XXXXXXXX') / 65536;
END;
/
CREATE OR REPLACE FUNCTION get_block_number (p_resource_name VARCHAR2)
RETURN INTEGER
IS
pos1 INTEGER := INSTR (p_resource_name,'x',1,1);
pos2 INTEGER := INSTR (p_resource_name,']',1,1);
s VARCHAR2(30) := SUBSTR (p_resource_name,pos1+1,pos2-pos1-1);
BEGIN
RETURN TO_NUMBER (s,'XXXXXXXX');
END;
/
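The same decoding can be done outside the database, for example when post-processing spooled X$KJBR output. The sketch below mirrors the two PL/SQL functions in Python; the `parse_resource_name` helper and its regular expression are assumptions made for illustration, and the sample names are the ones shown on the previous slide.

```python
import re

# [<block_number>][<file_number>][BL], both numbers in hexadecimal
NAME_RE = re.compile(r"\[0x([0-9A-Fa-f]+)\]\[0x([0-9A-Fa-f]+)\]\[BL\]")


def parse_resource_name(name):
    """Return (file_number, block_number) for a KJBRNAME-style resource name."""
    m = NAME_RE.search(name)
    if m is None:
        raise ValueError(f"unrecognised resource name: {name!r}")
    block_number = int(m.group(1), 16)
    file_number = int(m.group(2), 16) // 65536   # same scaling as get_file_number
    return file_number, block_number


names = [
    "[0x12E][0x40000][BL]",
    "[0x12F][0x40000][BL]",
    "[0x13][0x40000][BL]",
    "[0x130][0x40000][BL]",
    "[0x131][0x40000][BL]",
]

# Sorting on the decoded numbers avoids the collation problem with the raw names
ordered = sorted(names, key=parse_resource_name)
print(parse_resource_name("[0x137][0x40000][BL]"))
print(ordered)
```

Sorting on the decoded (file, block) pair places block 0x13 first, which a plain string sort does not.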
- 33. © 2013 Julian Dyke juliandyke.com33
Global Cache Services
Block Mastering
In Oracle 10.2 block mastering is determined by
_lm_contiguous_res_count
Specifies number of contiguous blocks that will hash to the
same HV bucket
Defaults to 128
For example
Start End
0x080 0x0FF
0x180 0x1FF
0x280 0x2FF
0x380 0x3FF
0x480 0x4FF
0x580 0x5FF
etc etc
Start End
0x000 0x07F
0x100 0x17F
0x200 0x27F
0x300 0x37F
0x400 0x47F
0x500 0x57F
etc etc
Instance 0 Instance 1
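The range boundaries in these tables follow from integer division by the contiguous resource count. A minimal sketch of that arithmetic, assuming the default of 128 stated above; the `contiguous_range` function is illustrative and does not model how a range is assigned to a master instance.

```python
LM_CONTIGUOUS_RES_COUNT = 128  # default value quoted on this slide


def contiguous_range(block_number, count=LM_CONTIGUOUS_RES_COUNT):
    """Return (start, end) of the contiguous block range containing block_number."""
    start = (block_number // count) * count
    return start, start + count - 1


start, end = contiguous_range(0x137)
print(hex(start), hex(end))
```

For block 0x137 (311) this gives the range 0x100 to 0x17F, one of the ranges listed above.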
- 34. © 2013 Julian Dyke juliandyke.com34
Global Cache Services
Block Mastering
The following table shows that masters are still assigned to
ranges of 128 contiguous blocks in a four-node cluster
Start Block End Block Master
0 127 1
128 255 2
256 383 2
384 511 3
512 639 3
640 767 3
768 895 1
896 1023 0
1024 1279 2
1280 1407 1
- 35. © 2013 Julian Dyke juliandyke.com35
Global Cache Services
Dynamic Remastering
In Oracle 9.2
documentation describes dynamic remastering
not implemented in code
In Oracle 10.1
work at data file level
very high threshold so difficult to test
does occur on some customer sites
In Oracle 10.2 and above
works at segment level
thresholds are relatively low
- 36. © 2013 Julian Dyke juliandyke.com36
Global Cache Services
Dynamic Remastering
Object remastering is recorded in V$GCSPFMASTER_INFO
Instances are internally numbered 0, 1 etc
Initially contains no rows
SELECT object_id, current_master, previous_master
FROM v$gcspfmaster_info;
After remastering object 52084 to instance 0
Object ID Current Master Previous Master
52084 0 32767
After remastering object 52084 to instance 1
Object ID Current Master Previous Master
52084 1 0
- 37. © 2013 Julian Dyke juliandyke.com37
Global Cache Services
Dynamic Remastering
In Oracle 10.2 and above, information about Dynamic
Remastering operations is also reported in the following fixed
views
X$KJDRMREQ
Dynamic Remastering Requests
X$KJDRMAFNSTATS
File Remastering Statistics
X$KJDRMHVSTATS
Hash Value Statistics
- 38. © 2013 Julian Dyke juliandyke.com38
Global Cache Services
Dynamic Remastering
In Oracle 11.1 and above, Dynamic Remastering statistics are
reported in V$DYNAMIC_REMASTER_STATS
Column Name Data Type
REMASTER_OPS NUMBER
REMASTER_TIME NUMBER
REMASTERED_OBJECTS NUMBER
QUIESCE_TIME NUMBER
FREEZE_TIME NUMBER
CLEANUP_TIME NUMBER
REPLAY_TIME NUMBER
FIXWRITE_TIME NUMBER
SYNC_TIME NUMBER
RESOURCES_CLEANED NUMBER
REPLAYED_LOCKS_SENT NUMBER
REPLAYED_LOCKS_RECEIVED NUMBER
CURRENT_OBJECTS NUMBER
- 39. © 2013 Julian Dyke juliandyke.com39
Global Cache Services
Dynamic Remastering
Dynamic remastering is coordinated by the LMD0 background process
The LMD0 background process reports limited details of dynamic remastering operations
Excessive dynamic remastering can cause instance freezes
Observed in both Oracle 10.1 and 10.2
Oracle Support occasionally recommends that dynamic
remastering is disabled using the following parameters:
_gc_affinity_time = 0
_gc_undo_affinity=FALSE
- 40. © 2013 Julian Dyke juliandyke.com40
Thank you for listening
info@juliandyke.com
- 42. © 2013 Julian Dyke juliandyke.com42
Interconnect
Overview
Instances communicate with each other over the interconnect
(network)
Information transferred between instances includes
data blocks
locks
SCNs
Typically 1Gb Ethernet
UDP protocol
Often teamed in pairs to avoid SPOFs
Can also use Infiniband
Fewer levels in stack
Other proprietary protocols are available
- 43. © 2013 Julian Dyke juliandyke.com43
Interconnect
TCP/IP Five Layer Model
All messages travel down through the layers, across the physical layer, then up again
[Diagram: two identical five-layer stacks, one for each host, joined at the physical layer]
5 Application
4 Transport
3 Network
2 Data Link
1 Physical
- 44. © 2013 Julian Dyke juliandyke.com44
Interconnect
TCP/IP Five Layer Model
TCP/IP has a four or five layer model
Five-layer model shown below
Layer TCP/IP Suite
5 Application DHCP, DNS, FTP, HTTP, SSH, NFS, NTP, SMTP, SNMP, TELNET, RPC, SOAP
4 Transport TCP, UDP
3 Network IP (IPv4, IPv6), ICMP, ARP, RARP
2 Data Link Ethernet, Token Ring, 802.11, Wi-Fi, FDDI, PPP
1 Physical 10BASE-T, 100BASE-T, 1000BASE-T, Optical Fibre, Twisted Pair
Four-layer model combines data link and physical layers
- 45. © 2013 Julian Dyke juliandyke.com45
Interconnect
TCP/IP Transport Layer
Transport Layer
Connection-oriented (TCP)
Connectionless (UDP)
[Diagram: protocol stack, top to bottom: Clusterware and RAC; TCP and UDP; IP; Ethernet; Physical Layer]
- 46. © 2013 Julian Dyke juliandyke.com46
Interconnect
Encapsulation
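[Diagram: each layer wraps the data passed down from the layer above]
Data
UDP Header (8 bytes) + Data
IP Header (20 bytes) + UDP Header + Data
Ethernet Header (14 bytes) + IP Header + UDP Header + Data + Ethernet Trailer (4 bytes)
MTU Size covers the IP header, UDP header and data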
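The header sizes on this slide (Ethernet header 14 bytes, IP header 20 bytes, UDP header 8 bytes, Ethernet trailer 4 bytes) determine how much data fits in one packet. A minimal sketch of the arithmetic; the function names are illustrative, and the MTU of 1500 bytes is an assumption based on standard Ethernet.

```python
ETHERNET_HEADER = 14   # bytes
IP_HEADER = 20         # bytes, IPv4 header without options
UDP_HEADER = 8         # bytes
ETHERNET_TRAILER = 4   # bytes


def max_udp_payload(mtu):
    """Largest data payload that fits in a single packet for the given MTU."""
    return mtu - IP_HEADER - UDP_HEADER


def frame_size(mtu):
    """Size of a full Ethernet frame carrying an MTU-sized packet."""
    return ETHERNET_HEADER + mtu + ETHERNET_TRAILER


MTU = 1500  # assumption: standard Ethernet MTU
print(max_udp_payload(MTU), frame_size(MTU))
```

With a 1500-byte MTU each packet carries at most 1472 bytes of data and occupies 1518 bytes as a frame.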
- 47. © 2013 Julian Dyke juliandyke.com47
Oracle Clusterware
Node Heartbeat Messages
Sent to each node in cluster every second in both directions
Checks nodes are still members of cluster
Sent by ocssd.bin using TCP well-known port 49895
Outgoing message is 134 bytes (80 byte payload)
Incoming message is 66 bytes (12 byte payload)
Node
1
Node
3
Node
2
Node
4
Outgoing
Incoming
- 48. © 2013 Julian Dyke juliandyke.com48
Oracle Clusterware
Node Status Messages
Number of packets exchanged by a node is determined by
number of nodes in cluster
Number of packets per node per hour is
(#nodes - 1) * 4 messages * 3600 seconds
Number of nodes Packets per hour
2 14,400
3 28,800
4 43,200
5 57,600
6 72,000
7 86,400
8 100,800
16 216,000
32 446,400
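The table can be reproduced directly from the formula. A minimal sketch; the function name and the `messages_per_second` parameter are illustrative names for the constants in the formula above.

```python
def packets_per_hour(nodes, messages_per_second=4):
    """(#nodes - 1) * 4 messages * 3600 seconds, as given on this slide."""
    return (nodes - 1) * messages_per_second * 3600


for nodes in (2, 3, 4, 5, 6, 7, 8, 16, 32):
    print(f"{nodes:>2} {packets_per_hour(nodes):>9,}")
```

The count grows linearly with cluster size for each node, so the total across the cluster grows roughly with the square of the number of nodes.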
- 49. © 2013 Julian Dyke juliandyke.com49
RAC Background Processes
Overview
[Diagram: Node 1 running Instance 1 and Node 2 running Instance 2. Each instance has a shared pool and a buffer cache, the processes PMON, SMON, DBWR, LGWR, CKPT and ARCn, and the processes DIAG, LMON, LCK0, LMD0 and LMSn. Storage is shown as datafiles, controlfiles and a set of redo logs for each instance.]
- 50. © 2013 Julian Dyke juliandyke.com50
RAC Background Processes
LMSn
LMSn
Global Cache Service Process
Manage requests for data access across cluster
Up to 20 in Oracle 10.1
LMS0-LMS9 LMSa-LMSj
Up to 36 in Oracle 10.2
LMS0-LMS9 LMSa-LMSz
In Oracle 10.1 and above, number of GCS server processes
can be configured using gcs_server_processes parameter
Default value is 1 (single CPU system)
Can also be configured using _lm_lms parameter
- 51. © 2013 Julian Dyke juliandyke.com51
RAC Background Processes
LMSn
In Oracle 10.2 and above
LMS processes run in real-time mode
Remaining processes run in time-share mode
Check using:
[oracle@server3 ~]$ ps -eo pid,user,opri,cmd | grep ora_lm
8596 oracle 75 ora_lmon_TEST1
8598 oracle 75 ora_lmd0_TEST1
8601 oracle 58 ora_lms0_TEST1
58 is real time; 75 or 76 is time share
You can also check process scheduling policies using chrt
[oracle@server3 ~]$ chrt -p 8601 # lms0 - Real Time
pid 8601's current scheduling policy: SCHED_RR
pid 8601's current scheduling priority: 1
[oracle@server3 ~]$ chrt -p 8596 # lmon - Time Share
pid 8596's current scheduling policy: SCHED_OTHER
pid 8596's current scheduling priority: 0
- 52. © 2013 Julian Dyke juliandyke.com52
RAC Background Processes
LCK0
LCK0
Instance Enqueue Process
Part of KCL (Kernel Cache Library)
Manages
instance resource requests
cross-instance call operations
Assists LMS processes
Formerly known as lock process
One LCK0 process per instance
In 9.0.1 and below, number of lock processes may be
configurable using _gc_lck_procs parameter
- 53. © 2013 Julian Dyke juliandyke.com53
RAC Background Processes
LMD0
LMD0
Global Enqueue Service Daemon
Manages requests for global enqueues
Updates status of enqueues when granted to / revoked
from an instance
Responsible for deadlock detection
One LMD0 process per instance
In 8.1.7 and below number of lock daemons may be
configurable using _lm_dlmd_processes parameter
- 54. © 2013 Julian Dyke juliandyke.com54
RAC Background Processes
LMON
LMON
Global Enqueue Service Monitor
One LMON process per instance
Monitors cluster to maintain global enqueues and
resources
Manages
instance and process expirations
recovery processing for cluster enqueues
- 55. © 2013 Julian Dyke juliandyke.com55
RAC Background Processes
DIAG
DIAG - Diagnosability Process
Collects diagnostic data in the event of a failure
Creates subdirectories in BACKGROUND_DUMP_DEST
directory
In Oracle 9.0.1 and above can be disabled using
_diag_daemon parameter
Do not try this on a production system
- 56. © 2013 Julian Dyke juliandyke.com56
Global Cache Services
UDP Messages
There are two types of message exchanged within RAC
These are PROBABLY defined as follows
Synchronous
These messages require an acknowledgement for each
packet
In some cases the acknowledgement packet can be
larger than the original request
e.g. SCN synchronization
Asynchronous
These messages do not require an individual
acknowledgement for each packet
e.g. block transfers between instances
- 57. © 2013 Julian Dyke juliandyke.com57
Global Cache Services
Lock Modes
Lock modes can be:
Null
Another instance can hold an exclusive or shared lock
Shared
Another instance can hold a shared lock but not an
exclusive lock
Exclusive
No other instances can hold shared or exclusive locks
Locks can also be:
Local
No other instance has held an exclusive lock
Global
Another instance has held an exclusive lock in the past
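The three lock modes can be summarised as a compatibility table. A minimal sketch of the rules listed above; the `COMPATIBLE` mapping and `can_grant` function are illustrative names, and treating a null lock as compatible with every mode, including another null lock, is an assumption.

```python
# For each mode held by one instance, the modes another instance may hold
COMPATIBLE = {
    "N": {"N", "S", "X"},   # Null: others can hold shared or exclusive
    "S": {"N", "S"},        # Shared: others can hold shared but not exclusive
    "X": {"N"},             # Exclusive: others can hold neither shared nor exclusive
}


def can_grant(held, requested):
    """True if `requested` can coexist with a lock already held in mode `held`."""
    return requested in COMPATIBLE[held]


print(can_grant("S", "S"), can_grant("S", "X"), can_grant("X", "S"))
```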
- 58. © 2013 Julian Dyke juliandyke.com58
Global Cache Services
Fairness Threshold
Intended to prevent unnecessary lock downgrades when other
instances only require read-only copies
For write to read transfers
Writing instance retains X lock
Reading instance retains null lock
If _fairness_threshold reached then
Writing instance downgrades X lock to S lock
Reading instance receives S lock
_fairness_threshold default value is 4
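The effect of the threshold can be sketched as a counter on the writing instance. This is a hedged illustration only: the `WriterBlock` class, the idea that the counter increments once per write to read transfer, the downgrade happening on the transfer that reaches the threshold, and the counter reset are all assumptions, not documented Oracle behaviour.

```python
FAIRNESS_THRESHOLD = 4  # default value quoted on this slide


class WriterBlock:
    """Toy model of one block held in exclusive mode by the writing instance."""

    def __init__(self):
        self.mode = "X"   # lock mode held by the writing instance
        self.fair = 0     # assumption: number of reads served since the last downgrade

    def serve_reader(self):
        """Serve one read request; return the lock mode the reading instance gets."""
        if self.mode == "S":
            return "S"                  # already downgraded, reader shares the block
        self.fair += 1
        if self.fair >= FAIRNESS_THRESHOLD:
            self.mode = "S"             # writer downgrades X lock to S lock
            self.fair = 0
            return "S"                  # reader receives S lock
        return "N"                      # writer keeps X, reader retains null lock


block = WriterBlock()
results = [block.serve_reader() for _ in range(5)]
print(results, block.mode)
```

Under these assumptions the first three readers receive a null lock, and the fourth request triggers the downgrade so that later readers share the block.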
- 59. © 2013 Julian Dyke juliandyke.com59
Global Cache Services
Lock Elements
Lock elements are externalized in the V$LOCK_ELEMENT
dynamic performance view
Based on X$LE
Additional information is available in the X$LE view
Past image buffers do not have a lock element
In OPS one lock element could manage a contiguous range of
blocks
Still can in RAC using GC_FILES_PER_LOCK parameter
Disables Cache Fusion
- 60. © 2013 Julian Dyke juliandyke.com60
Global Cache Services
Lock Elements
Contain embedded GCS Client structures (KJBL)
[Diagram: several lock elements, each containing an embedded GCS client and linked to one or more buffer headers]
- 61. © 2013 Julian Dyke juliandyke.com61
Global Cache Services
Memory Structures
[Diagram: memory structures and their abbreviations: GCS Resource (KJBR), GCS Client and GCS Shadow (KJBL), Lock Element (LE), Block Header (BH)]
GCS Shadow describes blocks held by other instances, but mastered locally
- 62. © 2013 Julian Dyke juliandyke.com62
Global Cache Services
Memory Structures
GCS Resources (KJBR)
Stored in segmented array
Number of GCS resource structures determined by
_gcs_resources parameter
Externalized in X$KJBR
Number of free GCS resource structures in X$KJBRFX
GCS Enqueues (Clients / Shadows) (KJBL)
GCS clients embedded in lock elements
GCS shadows stored in segmented array
Number of GCS shadow structures determined by
_gcs_shadow_locks parameter
Externalized in X$KJBL
Number of free GCS shadow structures in X$KJBLFX
- 63. © 2013 Julian Dyke juliandyke.com63
Global Cache Services
Dynamic Remastering
Example
SELECT data_object_id FROM dba_objects
WHERE owner = 'US01' AND object_name = 'T1';
OBJECT_ID
---------
52084
To remaster object at current instance use:
ORADEBUG LKDEBUG -m pkey 52084
All blocks now mastered by the current instance
To redistribute masters to all available instances use:
ORADEBUG LKDEBUG -m dpkey 52084
Blocks mastered by both (all) instances again
- 64. © 2013 Julian Dyke juliandyke.com64
Global Cache Services
Block Mastering
In Oracle 10.1 and below block mastering is determined by a
hash function
Algorithm applied to groups of 1289 contiguous blocks
In two node cluster
Instance 0 has 645 blocks
Instance 1 has 644 blocks
etc
In three node cluster
Instance 0 has 430 blocks
Instance 2 has 215 blocks
Instance 1 has 430 blocks
Instance 2 has 214 blocks
etc
Beware of small hot tables and indexes....
- 65. © 2013 Julian Dyke juliandyke.com65
Global Cache Services
Dumps
To dump the contents of the global cache use:
ALTER SESSION SET EVENTS
'IMMEDIATE TRACE NAME GC_ELEMENTS LEVEL 1';
GLOBAL CACHE ELEMENT DUMP (address: 0x21fecd18):
id1: 0x3591 id2: 0x10000 obj: 181 block: (1/13713)
lock: SL rls: 0x0000 acq: 0x0000 latch: 0
flags: 0x41 fair: 0 recovery: 0 fpin: 'kdswh05: kdsgrp'
bscn: 0x0.18a9c bctx: (nil) write: 0 scan: 0x0 xflg: 0 xid: 0x0.0.0
GCS CLIENT 0x21fecd60,1 sq[(nil),(nil)] resp[(nil),0x3591.10000] pkey 181
grant 1 cvt 0 mdrole 0x21 st 0x20 GRANTQ rl LOCAL
master 1 owner 0 sid 0 remote[(nil),0] hist 0x7c
history 0x3c.0x1.0x0.0x0.0x0.0x0. cflag 0x0 sender 2 flags 0x0 replay# 0
disk: 0x0000.00000000 write request: 0x0000.00000000
pi scn: 0x0000.00000000
msgseq 0x1 updseq 0x0 reqids[1,0,0] infop 0x0
pkey 181
hv 107 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 6, dom 0]
kjga st 0x4, step 0.0.0, cinc 8, rmno 10, flags 0x0
lb 0, hb 0, myb 178, drmb 178, apifrz 0
- 66. © 2013 Julian Dyke juliandyke.com66
Global Cache Services
Dumps
Continued
GLOBAL CACHE ELEMENT DUMP (address: 0x237f4358):
id1: 0x6a39 id2: 0x10000 obj: 74 block: (1/27193)
lock: SL rls: 0x0000 acq: 0x0000 latch: 0
flags: 0x41 fair: 0 recovery: 0 fpin: 'kdswh05: kdsgrp'
bscn: 0x0.26992 bctx: (nil) write: 0 scan: 0x0 xflg: 0 xid: 0x0.0.0
GCS SHADOW 0x237f43a0,1 sq[0x2ee64e8c,0x2eff3858] resp[0x2ee64e74,0x6a39.10000] pkey 74
grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
master 0 owner 0 sid 0 remote[(nil),0] hist 0x12a5
.....
GCS RESOURCE 0x2ee64e74 hashq [0x2ee61894,0x2ff57390] name[0x6a39.10000] pkey 74
grant 0x2eff3858 cvt (nil) send (nil),0 write (nil),0@65535
flag 0x0 mdrole 0x1 mode 1 scan 0 role LOCAL
.....
GCS SHADOW 0x2eff3858,1 sq[0x237f43a0,0x2ee64e8c] resp[0x2ee64e74,0x6a39.10000] pkey 74
grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
master 0 owner 1 sid 0 remote[0x23fea160,1] hist 0x65f
.....
GCS SHADOW 0x237f43a0,1 sq[0x2ee64e8c,0x2eff3858] resp[0x2ee64e74,0x6a39.10000] pkey 74
grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
master 0 owner 0 sid 0 remote[(nil),0] hist 0x12a5
.....
- 67. © 2013 Julian Dyke juliandyke.com67
Global Cache Services
System Change Number
In RAC clusters SCN must be maintained across all nodes in
cluster
SCN propagation scheme differs according to version
In Oracle 10.1 and below defaults to Lamport algorithm
Lamport in alert.log
SCN piggy-backed on GCS/GES messages
Recorded in redo log
Default delay of 7 seconds
In Oracle 10.2 and above defaults to Broadcast on Commit
algorithm
SCN negotiated immediately
Apparently no delay
- 68. © 2013 Julian Dyke juliandyke.com68
Global Cache Services
System Change Number
System Change Number algorithm is determined by the
MAX_COMMIT_PROPAGATION_DELAY parameter
In Oracle 10.1 and below
Initialization parameter specified in centiseconds
Default value is 700 centiseconds (7 seconds)
Specifies maximum time taken for a COMMIT on one node
to be reflected on other nodes in the cluster
For some applications performing rapid updates and
queries of the same data from different instances, value
must be set to 0 (Broadcast on commit)
Examples include:
E-Business suite
SAP
- 69. © 2013 Julian Dyke juliandyke.com69
Global Cache Services
System Change Number
In Oracle 10.2 and above
Default value of MAX_COMMIT_PROPAGATION_DELAY
parameter is 0
SCN broadcast on commit method is used
SCN updates are synchronized immediately
SCN is synchronized
after current read
before block updated
This ensures correct SCN is written to block
- 70. © 2013 Julian Dyke juliandyke.com70
Global Cache Services
Broadcast on Commit
Ethernet broadcast is not used
SCN is synchronized by updating instance
Sends UDP SCN synchronization message to each remote
instance
Remote instances respond with their current SCN
Another round of messages may be required if remote SCNs
are more recent than local SCN
Synchronization occurs every time an instance needs a new
SCN
Synchronization is always performed by the updating instance
Number of messages = 4 x (number of instances - 1)
- 71. © 2013 Julian Dyke juliandyke.com71
Global Cache Services
Broadcast on Commit
In a 4-node cluster 12 messages are exchanged
Source Destination Description Bytes
RAC4-LMS0 RAC1-LMS0 Send current SCN 192
RAC1-LMS0 RAC4-LMS0 OK 212
RAC4-LMS0 RAC2-LMS0 Send current SCN 192
RAC2-LMS0 RAC4-LMS0 OK 212
RAC4-LMS0 RAC3-LMS0 Send current SCN 192
RAC3-LMS0 RAC4-LMS0 OK 212
RAC1-LMS0 RAC4-LMS0 Send current SCN 192
RAC4-LMS0 RAC1-LMS0 OK 212
RAC2-LMS0 RAC4-LMS0 Send current SCN 192
RAC4-LMS0 RAC2-LMS0 OK 212
RAC3-LMS0 RAC4-LMS0 Send current SCN 192
RAC4-LMS0 RAC3-LMS0 OK 212
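The message count follows from the formula 4 x (number of instances - 1). The sketch below generates the same sequence of exchanges as the table for a given updating instance; the `scn_sync_trace` function is an illustrative name, and the ordering of the messages is taken from the table rather than from any documented protocol.

```python
def scn_sync_trace(local, remotes):
    """List (source, destination, description) tuples for one SCN synchronization."""
    messages = []
    for remote in remotes:                       # updating instance sends its SCN
        messages.append((local, remote, "Send current SCN"))
        messages.append((remote, local, "OK"))
    for remote in remotes:                       # remote instances respond with theirs
        messages.append((remote, local, "Send current SCN"))
        messages.append((local, remote, "OK"))
    return messages


trace = scn_sync_trace("RAC4", ["RAC1", "RAC2", "RAC3"])
for source, destination, description in trace:
    print(source, destination, description)
print(len(trace))
```

For a 4-node cluster this yields 12 messages, matching 4 x (4 - 1).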
- 72. © 2013 Julian Dyke juliandyke.com72
Global Cache Service
Read Consistency
When a read consistent version of a block is requested it may
be necessary to apply undo to a more recent version of that
block
Undo can be applied by LMSn background process in
Remote instance
Local instance
If undo applied by remote instance, any outstanding redo
must first be flushed from redo buffer of remote instance to
redo log
Can have significant performance impact on consistent
reads
Particularly on extended clusters
- 73. © 2013 Julian Dyke juliandyke.com73
Global Cache Service
Read Consistency
Statistics on inter-instance consistent reads are reported in
V$CR_BLOCK_SERVER
Reports statistics for blocks served by local instances to
remote instances including
Number of consistent reads served
Number of current reads served
Number of data blocks served
Number of undo blocks served
Number of undo headers served
Number of fairness down converts
Number of log flushes
Number of times light works rule invoked
- 74. © 2013 Julian Dyke juliandyke.com74
Global Cache Service
Read Consistency
In theory, once a block has been written to disk, the LMS
process will not attempt to read it again when responding to a
consistent read request
Light Works Rule
Prevents LMS processes from going to disk when
responding to CR requests for data, undo or undo segment
blocks
Can prevent LMS process from completing its response to
a CR request
- 75. © 2013 Julian Dyke juliandyke.com75
Global Cache Service
Read Consistency
Uncommitted changes MUST be flushed to the redo log before
the LMS process can ship a consistent block to another
instance
Reading process must wait until redo log changes have been
written to redo log by LMS process
Bad for standard RAC databases
Reads must wait for redo log writes
Worse for extended / stretch RAC clusters
Increased latency of cross site disk communications
- 76. © 2013 Julian Dyke juliandyke.com76
Global Cache Service
Read Consistency
For each block on which a consistent read is performed, a
redo log flush must first be performed
Number of redo log flushes is recorded in the FLUSHES
column of V$CR_BLOCK_SERVER
Redo log flush time
is recorded in the gc cr block flush time statistic for the
LMS process
will increase time taken to serve consistent block
will increase time taken to perform consistent read
If LMS processes become very busy, consistent reads will
experience high wait times, e.g. during a full table scan
on the gc cr multi block request event
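One way to gauge how often serving a block forces a redo log flush is to compare two snapshots of V$CR_BLOCK_SERVER. The sketch below does the arithmetic only. FLUSHES is the column named above; CR_REQUESTS and CURRENT_REQUESTS are assumed column names that should be checked against the view definition for your release, and all counter values are hypothetical.

```python
def flush_ratio(before: dict, after: dict) -> float:
    """Redo log flushes per block request served between two snapshots."""
    served = sum(after[col] - before[col]
                 for col in ("CR_REQUESTS", "CURRENT_REQUESTS"))
    flushes = after["FLUSHES"] - before["FLUSHES"]
    return flushes / served if served else 0.0


# Hypothetical counter values, not measurements from a real system.
snap1 = {"CR_REQUESTS": 10_000, "CURRENT_REQUESTS": 2_000, "FLUSHES": 150}
snap2 = {"CR_REQUESTS": 14_000, "CURRENT_REQUESTS": 3_000, "FLUSHES": 400}

print(f"{flush_ratio(snap1, snap2):.1%} of served requests needed a flush")
```

Working from deltas rather than raw values matters because the counters are cumulative since instance startup.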
- 77. © 2013 Julian Dyke juliandyke.com77
Global Cache Services
Read Consistency
Committed transaction on RAC2 - All blocks still in buffer cache
[Diagram: RAC1 and RAC2, each with a Redo Buffer and Buffer Cache, plus the Redo Log; numbered steps 1 to 3; values 108 to 110; STOP marker]
- 78. © 2013 Julian Dyke juliandyke.com78
Global Cache Services
Read Consistency
Committed transaction on RAC2 - Some blocks written to disk
[Diagram: RAC1 and RAC2, each with a Redo Buffer and Buffer Cache, plus the Redo Log; numbered steps 1 to 4; values 108 to 110; STOP marker]
- 79. © 2013 Julian Dyke juliandyke.com79
Global Cache Services
Read Consistency
Uncommitted transaction on RAC2 - All blocks still in buffer cache
[Diagram: RAC1 and RAC2, each with a Redo Buffer and Buffer Cache, plus the Redo Log; numbered steps 1 to 6; values 108 to 110; STOP marker]
- 80. © 2013 Julian Dyke juliandyke.com80
Global Cache Services
Read Consistency
Uncommitted transaction on RAC2 - Some blocks written to disk
[Diagram: RAC1 and RAC2, each with a Redo Buffer and Buffer Cache, plus the Redo Log; numbered steps 1 to 8; values 108 to 110; STOP marker]
- 81. © 2013 Julian Dyke juliandyke.com81
Global Cache Services
Jumbo Frames
By default Maximum Transmission Unit (MTU) is 1500 bytes
MTU includes
IP header
UDP header
Data
Requires six packets to transmit one 8192 byte block
On some adapters MTU can be increased to around 9000
e.g. Intel PRO/1000
At command line
ifconfig eth1 mtu 9000 up
or in /etc/sysconfig/network-scripts/ifcfg-eth<x> (Red Hat-style layout)
MTU=9000
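The six-packet figure follows from the header sizes. The sketch below is a minimal calculation assuming a 20-byte IPv4 header without options and an 8-byte UDP header in every frame, which is the same per-frame accounting the slides use; real IP fragmentation carries the UDP header only in the first fragment. The function name is hypothetical.

```python
import math

IP_HEADER = 20   # bytes, IPv4 header without options (assumption)
UDP_HEADER = 8   # bytes


def frames_needed(payload_bytes: int, mtu: int) -> int:
    """Frames required if every frame carries IP and UDP headers within the MTU."""
    per_frame = mtu - IP_HEADER - UDP_HEADER
    if per_frame <= 0:
        raise ValueError("MTU too small to carry any data")
    return math.ceil(payload_bytes / per_frame)


for mtu in (1500, 9000):
    print(f"MTU {mtu}: {frames_needed(8192, mtu)} frame(s) for an 8192 byte block")
```

With MTU 1500 each frame carries at most 1472 bytes of data, so an 8192 byte block needs six frames; with MTU 9000 it fits in one.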
- 82. © 2013 Julian Dyke juliandyke.com82
Global Cache Services
Jumbo Frames
Example - cost of sending one 8192 byte block
MTU=1500 (default)
Frame#  Ethernet Header  IP Header  UDP Header  Data  Ethernet Trailer  Total
1                    14         20           8  1472                 4   1518
2                    14         20           8  1472                 4   1518
3                    14         20           8  1472                 4   1518
4                    14         20           8  1472                 4   1518
5                    14         20           8  1472                 4   1518
6                    14         20           8   840                 4    886
Total                84        120          48  8200                24   8476
MTU=9000
Frame#  Ethernet Header  IP Header  UDP Header  Data  Ethernet Trailer  Total
1                    14         20           8  8200                 4   8246
Total                14         20           8  8200                 4   8246
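The totals in the two tables can be reproduced with a short calculation. This sketch uses the header and trailer sizes from the tables and the 8200 bytes of data they show; the function name is hypothetical and the per-frame accounting simply mirrors the tables.

```python
ETH_HEADER, IP_HEADER, UDP_HEADER, ETH_TRAILER = 14, 20, 8, 4  # bytes, from the tables


def wire_bytes(data_bytes: int, mtu: int) -> int:
    """Total bytes on the wire when data is split into MTU-sized frames."""
    per_frame = mtu - IP_HEADER - UDP_HEADER
    overhead = ETH_HEADER + IP_HEADER + UDP_HEADER + ETH_TRAILER
    total = 0
    remaining = data_bytes
    while remaining > 0:
        chunk = min(remaining, per_frame)  # last frame may be partly filled
        total += chunk + overhead
        remaining -= chunk
    return total


standard = wire_bytes(8200, 1500)
jumbo = wire_bytes(8200, 9000)
print(standard, jumbo, f"saving {standard - jumbo} bytes")
```

The byte saving is small (8476 versus 8246). The larger difference is sending one frame instead of six.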
- 83. © 2013 Julian Dyke juliandyke.com83
Global Cache Services
Jumbo Frames
Not all network adapter drivers support jumbo frames
Particularly cheap ones....
All network adapters in private interconnect must have same
MTU size
Switch must also be configured to support jumbo frames
Lots of bugs and compatibility issues e.g.
Bug 4447620: RAC UDP MTU size restricted to 1500 or 9000
affects 10.1.0.5, 10.2.0.1
fixed in 10.2.0.2 and above
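Because every adapter on the private interconnect must use the same MTU, a quick consistency check can catch a misconfigured node. The sketch below is one possible approach on Linux, reading the MTU from /sys/class/net/<interface>/mtu. The function names, node labels and sample values are hypothetical, and collecting the values from each node is left out.

```python
from pathlib import Path


def read_mtu(interface: str, sysfs_root: str = "/sys/class/net") -> int:
    """Read the configured MTU of a local interface from Linux sysfs."""
    return int((Path(sysfs_root) / interface / "mtu").read_text().strip())


def mismatched_mtus(mtus: dict[str, int], expected: int) -> dict[str, int]:
    """Return the entries whose MTU differs from the expected value."""
    return {name: mtu for name, mtu in mtus.items() if mtu != expected}


# Hypothetical values gathered from each node's interconnect adapter.
collected = {"rac1:eth1": 9000, "rac2:eth1": 9000, "rac3:eth1": 1500}
print(mismatched_mtus(collected, expected=9000))
```

On a cluster node, read_mtu("eth1") would supply the local value. The switch configuration still has to be verified separately.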