AUSOUG - NZOUG-GroundBreakers-Jun 2019 - 19c RAC
1. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 1
Troubleshooting and Diagnosing 19c RAC
Sandesh Rao
VP AIOps - Autonomous Database
@sandeshr
https://www.linkedin.com/in/raosandesh/
2.
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for
information purposes only, and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and should not be relied upon
in making purchasing decisions. The development, release, and timing of any features or
functionality described for Oracle’s products remains at the sole discretion of Oracle.
3.
Program Agenda
1. Architecture and Basics
2. Troubleshooting Scenarios
3. Proactive and Reactive tools
4. 19c and beyond
5. Q&A
4.
Program Agenda
1. Architecture and Basics
2. Troubleshooting Scenarios
3. Proactive and Reactive tools
4. 18/19c and beyond
5. Q&A
5.
Grid Infrastructure
Overview
• Grid Infrastructure is a combination of:
– Oracle Cluster Ready Services (CRS)
– Oracle Automatic Storage Management (ASM)
• The Grid Home contains the software for both products
– Must be installed in a different location from the RDBMS home
– The installer locks the Grid Home path by setting root permissions
• CRS can also run standalone for ASM and/or Oracle Restart
• CRS can run by itself or in combination with other vendor clusterware
[Diagram: a cluster of three hosts; each host runs an ASM instance and one or more database instances, all accessing shared ASM Disk Groups A and B.]
6.
Grid Infrastructure
CRS Requirements
• Shared Oracle Cluster Registry (OCR) and voting files
– Must be stored in ASM or a cluster file system (CFS)
– The OCR is backed up automatically every 4 hours to GRID_HOME/cdata
– Backups are kept at 4-, 8- and 12-hour, 1-day and 1-week intervals
– The OCR is restored with ocrconfig
– The voting file is backed up into the OCR at each change
– The voting file is restored with crsctl
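The backup and restore paths above can be sketched with the standard utilities. This is illustrative only: the Grid home path, cluster name and disk group name below are placeholders, and the actual backup file name comes from -showbackup output.

```shell
# List the automatic OCR backups CRS has kept:
ocrconfig -showbackup

# Restore the OCR from one of those backups (CRS stack down, run as root);
# the path is a placeholder for whatever -showbackup reported:
ocrconfig -restore /u01/app/19.0.0/grid/cdata/mycluster/backup00.ocr

# Voting files: list the current ones, or re-create them on a disk group:
crsctl query css votedisk
crsctl replace votedisk +DATA
```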
7.
Grid Infrastructure
CRS Network
• Requirements
– One or more redundant private networks for inter-node communication
– High speed with low latency
– Separate physical network or managed converged network
– VLANs are supported
• Usage
– The interconnect acts as a memory backplane for the cluster
– Clusterware messaging
– RDBMS messaging and block transfer
– ASM messaging
– HANFS for block traffic
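To see which interfaces a cluster actually uses for the interconnect, the stack's own tools can be queried. A sketch (the interface name and subnet shown are illustrative, not from this deck):

```shell
# Cluster-level interface classification:
oifcfg getif
# A typical output line (illustrative):
#   eth1  192.168.10.0  global  cluster_interconnect

# The database's view of the interconnect, via SQL*Plus:
sqlplus -s / as sysdba <<'EOF'
SELECT name, ip_address, is_public FROM v$cluster_interconnects;
EOF
```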
8.
Grid Infrastructure
How it works
• The CRS stack is spawned from the Oracle HA Services Daemon (ohasd)
• On Unix, ohasd runs out of inittab with respawn
• A node can be evicted when deemed unhealthy
– May require a reboot
– IPMI integration, or diskmon in the case of Exadata
• CRS provides Cluster Time Synchronization services
– Always runs, but in observer mode if ntpd is configured
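A quick sanity pass over this stack might look like the following sketch (assumes a Linux node with the Grid home binaries on the PATH; the inittab check only applies to older init systems):

```shell
# Confirm the respawn entry that (re)starts ohasd on older init systems:
grep ohasd /etc/inittab

# Check the HA services daemon, the full CRS stack on this node,
# and the time-synchronization service mode:
crsctl check has
crsctl check crs
crsctl check ctss
```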
9.
Grid Infrastructure Processes
Core resources, by startup level (Level 0 = INIT through Level 4 = CRS services):
• Level 0: init spawns ohasd
• Level 1 (HA stack): ohasd spawns its agents
– OHASD oraagent: mDNSD, GIPCD, EVMD, GPNPD, ASM
– OHASD orarootagent: CRSD, CTSSD, Diskmon
– cssdagent and cssdmonitor: CSSD
• Levels 2-4 (CRS stack and services): CRSD spawns its agents
– CRSD orarootagent: network resources, node VIPs, SCAN VIPs, GNS VIP, ACFS registry
– CRSD oraagent: ASM instance, disk groups, DB resources, SCAN listeners, listeners, services, ONS, eONS, GNS, GSD
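The two halves of this hierarchy can be listed directly with crsctl:

```shell
# Lower stack: resources managed by OHASD (ASM, CSSD, CRSD, CTSSD, ...):
crsctl stat res -init -t

# Upper stack: cluster resources managed by CRSD (VIPs, listeners, databases, ...):
crsctl stat res -t
```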
10.
Oracle RAC 12c and onwards
• Flex Cluster
• Flex ASM
• Full Oracle Multitenant & In-Memory support (with the new In-Memory column format)
• Fleet Provisioning and Patching (FPP)
http://www.slideshare.net/MarkusMichalewicz/oracle-database-inmemory-meets-oracle-rac
11.
12.2 Automatic Storage Management (ASM)
ASM Filter Driver – Full Integration
• Configured during installation
• Rejects non-Oracle I/O
– Stops OS utilities from overwriting ASM disks
– Protects database files
• Reduces OS resource usage
– Fewer open file descriptors
– Faster node recovery
• Further configuration and monitoring is done with the AFDTOOL utility:
– Provision a disk: $ afdtool -add /dev/dsk1 disk1
– Remove a disk: $ afdtool -delete disk1
– List the managed disks: $ afdtool -getdevlist
12.
Oracle RAC 12.2 Enhancements Worth Noticing
• Node Weighting – idea: if everything is equal, let the majority of work survive
• Pluggable Database & Service Isolation – improved singleton workload performance and failure behavior
• Service-oriented Buffer Cache Access – improved data access performance & planned maintenance operations
• Fully Integrated Extended RAC Support – site-awareness and installer support for extended RAC
13.
Node Eviction Basics
Behavior pre-12.1.0.2
• Node eviction follows a rather predictable pattern
– Example in a 2-node cluster: the node with the lowest node number survives.
• Customers must not base their application logic on which node survives the split brain,
– as this may(!) change in future releases.
14.
Node Weighting
Idea: everything equal, let the majority of work survive
• Node Weighting is a new feature that considers the workload run on a node during fencing.
• The idea is to let the majority of work survive, if everything else is equal.
– "Majority of work" is, for example, represented by the number of services.
• Example: in a 2-node cluster, the node hosting the majority of services (at fencing time) is meant to survive.
• DBAs can overrule this and rate a service as "critical" based on business needs.
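In 12.2+, that overrule is expressed through the css_critical attribute. A sketch — the database and service names below are placeholders:

```shell
# Mark one service as critical, so the node hosting it is preferred
# to survive a split brain:
srvctl modify service -db sales -service payroll -css_critical yes

# Alternatively, weight a whole server (run on the node in question):
crsctl set server css_critical yes
```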
15.
16.
Pluggable Database & Service Isolation
Prevents “noisy neighbors” from affecting others with unnecessary chatter
• Using Oracle Multitenant, PDBs can be opened as singletons (in one database instance only), in a subset of instances, or in all instances at once.
• If certain PDBs are only opened on some instances, Pluggable Database Isolation improves performance by
– reducing DLM operations (messages) for PDBs not open in all instances.
– optimizing block operations based on in-memory block separation.
17.
Pluggable Database & Service Isolation
Prevents instance failures of instances only hosting singleton PDBs from affecting others
• Using Oracle Multitenant, PDBs can be opened as singletons (in one database instance only), in a subset of instances, or in all instances at once.
• If certain PDBs are only opened on some instances, Pluggable Database Isolation
– improves performance by
• reducing DLM operations for PDBs not open in all instances.
• optimizing block operations based on in-memory block separation.
– ensures that instance failures of instances only hosting singleton PDBs will not impact other instances of the same RAC-based CDB.
18.
19.
Service-oriented Buffer Cache Access
Improves performance by managing data with the service to which it belongs
• Service-oriented Buffer Cache Access determines, over time, the data (at database object level) accessed by each service. This information
– is persisted in the database.
– is used to improve data access performance (e.g. do not manage data of a service in an instance that does not host the service).
– can be used to pre-warm an instance cache prior to a service startup (fresh start or relocation).
20.
21.
Why Use Oracle RAC for Your Private Database Cloud?
Cluster Domain: for cost reduction through centralization, standardization and optimization
• Centralization: centralize common management tasks on the Domain Services Cluster.
• Standardization: use the same building blocks – commodity hardware clusters – to scale databases, compute & storage, from single nodes to full clusters.
• Optimization example: version independence – run any Oracle RAC 12.2+ Member Cluster (e.g. Linux, AIX or Solaris clusters) using any platform at any time.
22.
Centralization – Cluster Domain & Domain Services
• A Cluster Domain is a logical management entity to group various clusters in your data center.
• The Mgmt Repository and the Trace File Analyzer (TFA) services are mandatory in the Cluster Domain. They represent centralized versions of their local counterparts.
• To provide centralized services in the Cluster Domain, you need to deploy a Domain Services Cluster. It hosts the central services (Mgmt Repository, TFA, Rapid Home Provisioning); additional services can be added as needed.
23.
Standardization – Member Clusters
• A (Database) Member Cluster is a cluster that registers with the Mgmt Repository Service and uses the centralized TFA service. It can use additional services as needed, and may use local ASM.
• An Application Member Cluster (available since 12.1.0.2) is a cluster designed to host applications. It uses a lightweight, GI-only stack.
24.
Standardization – Storage Consolidation
• To further standardize and centralize, various Storage Services are offered in the domain: the ASM Service, the IO Service and the ACFS Services.
• A Database Member Cluster can use the shared ASM Service directly, or the IO & ASM Services together.
• Storage flexibility: Member Clusters do not need direct connectivity to shared disks. Using the shared ASM Service, they can use network connectivity to the IO Service to access a centrally managed pool of storage.
25.
Fleet Patching & Provisioning Support
Database & Grid Infrastructure
• Supported versions: 11.2.0.3, 11.2.0.4, 12.1, 12.2, 18, 19
• Supported deployments: Single Instance, Oracle Restart, Oracle RAC One Node and Oracle RAC, on VMs or bare metal, non-CDB or CDB/PDB
• Generic software provisioning, Data Guard aware, customizable, multi-OS
26.
Standardize on Transparent Application Continuity
• Hides errors, timeouts, and maintenance: applications see no errors during outages
• No application knowledge or application changes required
• Rebuilds session state & replays in-flight transactions
• Adapts as applications change: protected for the future
27.
Oracle RAC Performance Features
Over two decades of innovation (9i through 19c), including:
• Automatic Undo Management
• Cache Fusion
• Oracle Real Application Clusters
• Session Affinity
• PDB & Services Isolation
• Service-Oriented Buffer Cache
• Leaf Block Split Optimizations
• Self-Tuning LMS
• Multithreaded Cache Fusion
• ExaFusion Direct-to-Wire Protocol
• Smart Fusion Block Transfer
• Universal Connection Pool (UCP) Support for Oracle RAC
• Support for Distributed Transactions (XA) in Oracle RAC
• Parallel Execution Optimizations for Oracle RAC
• Affinity Locking and Read-Mostly Objects
• Reader Bypass
• Flash Cache
• Connection Load Balancing
• Load Balancing Advisory
• Cluster Managed Services
• Automatic Storage Management
18c:
• Zero Downtime Patching (Clusterware)
• Fleet Provisioning and Patching
• Automated Transaction Draining
• Support for TLS Ciphers for Clusterware
• Automated PDB Relocation
19c:
• Scalable Sequences
• Undo RDMA-Read
• Commit Cache
• Database Reliability Framework
28.
RAC Enhancements
• Remastering slaves (1 slave per LMS)
– Starting with Oracle RAC 12.1, the LMS offloads heavy remastering work to the slave
– This improves the LMS's responsiveness to Cache Fusion requests during remastering
• Support for 100 LMS processes – change in default value
– Oracle RAC 12.2 supports up to 100 LMS processes (named LMS0 through LMS99), as opposed to 35 previously
– On larger systems (many CPUs, large SGA), more LMS processes start by default
– More LMS processes mean better reconfiguration time, without any impact during runtime
• More Dynamic Remastering (DRM)
– Starting with Oracle RAC 19c, DRM is planned to more adaptively consider the overall system state
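The LMS count is driven by the gcs_server_processes initialization parameter. A sketch of inspecting it, with an illustrative (commented-out) override — the value 16 is an assumption, not a recommendation from this deck:

```shell
sqlplus -s / as sysdba <<'EOF'
SHOW PARAMETER gcs_server_processes
-- Illustrative override; takes effect at the next instance restart:
-- ALTER SYSTEM SET gcs_server_processes=16 SCOPE=SPFILE SID='*';
EOF
```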
29.
Program Agenda
1. Architecture and Basics
2. Troubleshooting Scenarios
3. Proactive and Reactive tools
4. 18/19c and beyond
5. Q&A
30.
Troubleshooting Scenarios
Cluster Startup
1. Check whether the core CRS processes are running:
ps -ef | grep init.ohasd
ps -ef | grep ohasd.bin
2. If they are not running: review & fix the CRS startup configuration and logs (crsctl config crs, ohasd.log), and compare OLR permissions to a reference system, fixing any differences.
3. If they are running: review the status of the CRS services & stack:
crsctl check crs
crsctl check cluster
4. If the stack is not healthy: review & fix the issues reported in the logs (ohasd.log, agent logs, process logs).
5. If the problem persists: collect diagnostics with tfactl diagcollect and engage Oracle Support.
31.
Troubleshooting Scenarios
Node Eviction Problem Triage
1. Check for & fix resource starvation: system log; troubleshooting guides 1531223.1 (OSWatcher) and 1328466.1 (CHM).
2. Check for & fix network heartbeat problems: ocssd.log; troubleshooting guides 1050693.1, 1534949.1 and 1546004.1.
3. Check for & fix voting disk problems: troubleshooting guides 1549428.1 and 1466639.1.
4. If unresolved: collect diagnostics with tfactl diagcollect and engage Oracle Support.
32.
Reconfiguration Performance Improvements
[Chart: total reconfiguration time improved roughly 4x from 11.2.0.4 to 12.2, and a further roughly 1.5x from 12.2 to 18.1.]
33.
Reconfiguration Performance as of 18c
• Timings with different cache sizes (total reconfiguration time for an instance leave & re-join; 8 LMS processes; 2-node RAC):
Buffer Cache Size | Reconfiguration Time
25GB | 3.0 sec
50GB | 4.9 sec
100GB | 8.3 sec
• Timings with different numbers of LMS processes (total reconfiguration time for an instance leave & re-join; 100GB cache; 2-node RAC):
# LMS | Reconfiguration Time
8 LMSs | 8.3 sec
16 LMSs | 5.0 sec
32 LMSs | 3.6 sec
34.
Reconfiguration Diagnosability
A detailed timing breakdown is available in the LMON trace file:
**************** BEGIN DLM RCFG HA STATS ****************
Total dlm rcfg time (inc 6): 3.586 secs (394926177, 394929763)
Begin step .........: 0.005 secs (394926177, 394926182)
Freeze step ........: 0.019 secs (394926182, 394926201)
Sync 1 step ........: 0.002 secs (394926264, 394926266)
Sync 2 step ........: 0.024 secs (394926266, 394926290)
Enqueue cleanup step: 0.002 secs (394926290, 394926292)
Sync pcm1 step .....: 0.004 secs (394926293, 394926297)
...
Enqueue dubious step: 0.004 secs (394926432, 394926436)
Sync 5 step ........: 0.000 secs (394926436, 394926436)
Enqueue grant step .: 0.001 secs (394926436, 394926437)
Sync 6 step ........: 0.012 secs (394926437, 394926449)
Fixwrt replay step .: 0.885 secs (394928837, 394929722)
Sync 8 step ........: 0.040 secs (394929722, 394929762)
End step ...........: 0.001 secs (394929762, 394929763)
Number of replayed enqueues sent / received .......: 2246 / 893
Number of replayed fusion locks sent / received ...: 124027 / 0
Number of enqueues mastered before / after rcfg ...: 2058 / 1384
**************** END DLM RCFG HA STATS *****************
35.
DRM Diagnosability
A detailed timing breakdown is available in the AWR report:
Dynamic Remastering Statistics DB/Inst: SALES/sales1 Snaps: 393-452
-> Affinity objects - Affinity objects mastered at the begin/end snapshot
-> Read-mostly objects - Read-mostly objects mastered at the begin/end snapshot
                                                  per Begin    End
Name                             Total            Remaster Op  Snap  Snap
-------------------------------- ------------ ------------- -------- --------
remaster ops                     24           1.00
remastered objects               24           1.00
remaster time (s)                7.4          0.31
freeze time (s)                  1.5          0.06
cleanup time (s)                 2.4          0.10
replay time (s)                  0.3          0.01
fixwrite time (s)                2.4          0.10
sync time (s)                    0.1          0.01
affinity objects                 N/A                         3        27
read-mostly objects              N/A                         0        0
read-mostly objects (persistent) N/A                         0        0
36.
Program Agenda
1. Architecture and Basics
2. Troubleshooting Scenarios
3. Proactive and Reactive tools
4. 19c and beyond
5. Q&A
37.
Oracle’s Database and Clusterware Tools
• What if issues were detected before they had an impact?
• What if you were notified with a specific diagnosis and corrective actions?
• What if resource bottlenecks threatening SLAs were identified early?
• What if bottlenecks could be automatically relieved just in time?
• What if database hangs and node reboots could be eliminated?
The tools:
• Cluster Verification Utility
• ORAchk / EXAchk
• Cluster Health Monitor
• Cluster Health Advisor
• Trace File Analyzer
• Hang Manager
• Memory Guard
• Quality of Service Management
38.
Why Oracle ORAchk & EXAchk
• Automatic proactive warning of problems before they impact you
• Scheduled health reports sent to you in email
• Health checks for the most impactful recurring problems
• Runs in your environment, with no need to send anything to Oracle
• Findings can be integrated into other tools of choice
• EXAchk targets Engineered Systems and ORAchk targets non-Engineered Systems; both share a common framework
39.
Oracle Stack Coverage
• Engineered Systems: Oracle Exadata Database Machine, Oracle SuperCluster, Oracle Private Cloud Appliance, Oracle Database Appliance, Oracle Big Data Appliance, Oracle Exalogic Elastic Cloud, Oracle Exalytics In-Memory Machine, Oracle Zero Data Loss Recovery Appliance, Oracle ZFS Storage Appliance
• Systems: Oracle Solaris, cross-stack checks, Solaris Cluster, OVN, ASR
• Oracle Database: standalone database, Grid Infrastructure & RAC, Maximum Availability Architecture (MAA) Scorecard, upgrade readiness validation, GoldenGate
• Enterprise Manager Cloud Control: repository, agent, OMS
• Middleware: Application Continuity, Oracle Identity and Access Management Suite (Oracle IAM)
• E-Business Suite: Oracle Payables, Oracle Workflow, Oracle Purchasing, Oracle Order Management, Oracle Process Manufacturing, Oracle Receivables, Oracle Fixed Assets, Oracle HCM, Oracle CRM, Oracle Project Billing
• Siebel: database best practices
• PeopleSoft: database best practices
• SAP: Exadata best practices
40.
Profiles (EXAchk)
• Profiles provide a logical grouping of checks about similar topics
• Run only checks in a specific profile: ./exachk -profile <profile>
• Run everything except checks in a specific profile: ./exachk -excludeprofile <profile>
Profile Description
asm ASM checks
avdf Audit Vault configuration checks
clusterware Oracle Clusterware checks
control_VM Checks only for the Control VM (ec1-vm, ovmm, db, pc1, pc2); no cross-node checks
corroborate Exadata checks that need further review by the user to determine pass or fail
dba DBA checks
ebs Oracle E-Business Suite checks
eci_healthchecks Enterprise Cloud Infrastructure health checks
ecs_healthchecks Enterprise Cloud System health checks
goldengate Oracle GoldenGate checks
hardware Hardware-specific checks for Oracle Engineered Systems
maa Maximum Availability Architecture checks
ovn Oracle Virtual Networking checks
platinum Platinum certification checks
preinstall Pre-installation checks
prepatch Checks to execute before patching
security Security checks
solaris_cluster Solaris Cluster checks
storage Oracle Storage Server checks
switch InfiniBand switch checks
sysadmin Sysadmin checks
user_defined_checks Run user-defined checks from user_defined_checks.xml
41.
Profiles (ORAchk)
• Profiles provide a logical grouping of checks about similar topics
• Run only checks in a specific profile: ./orachk -profile <profile>
• Run everything except checks in a specific profile: ./orachk -excludeprofile <profile>
Profile Description
asm ASM checks
bi_middleware Oracle Business Intelligence checks
clusterware Oracle Clusterware checks
dba DBA checks
ebs Oracle E-Business Suite checks
emagent Cloud Control agent checks
emoms Cloud Control management server checks
em Cloud Control checks
goldengate Oracle GoldenGate checks
hardware Hardware-specific checks for Oracle Engineered Systems
oam Oracle Access Manager checks
oim Oracle Identity Manager checks
oud Oracle Unified Directory server checks
ovn Oracle Virtual Networking checks
peoplesoft PeopleSoft best practices
preinstall Pre-installation checks
prepatch Checks to execute before patching
security Security checks
siebel Siebel checks
solaris_cluster Solaris Cluster checks
storage Oracle Storage Server checks
switch InfiniBand switch checks
sysadmin Sysadmin checks
user_defined_checks Run user-defined checks from user_defined_checks.xml
42.
Enterprise Manager Integration
• Check results are integrated into the EM compliance framework via a plugin
• View results in native EM compliance dashboards
• Related checks are grouped into compliance standards
• View targets checked, violations & average score
• Drill down into a compliance standard to see individual check results
• View the breakdown by target
43.
JSON Output to Integrate with Kibana, Elasticsearch etc.
44.
Oracle Health Check Collection Manager Dashboard
45.
Diff Output
Shows the differences between each run
46.
Upgrade to Database 12.2 and beyond with confidence
• New checks to help when upgrading the database to 12.2+
• Both pre- and post-upgrade verification to prevent problems related to:
– OS configuration
– Grid Infrastructure & Database patch prerequisites
– Database configuration
– Cluster configuration
• Pre-upgrade: orachk -u -o pre
• Post-upgrade: orachk -u -o post
47.
Real-time Status Summary
tfactl summary
• Provides a high-level summary of all database components
• Choose an option to drill down
48.
Real-time Status Summary – Drill Down
Drill-downs show real-time analytics & details of any problems found
49.
Perform Analysis Using the Included Tools
Not all tools are included in the Grid or Database install; download from document 1513912.1 to get the full collection of tools. Verify which tools you have installed with: tfactl toolstatus
• orachk / exachk: health checks for the Oracle stack. Oracle Trace File Analyzer will install either Oracle EXAchk for Engineered Systems (see document 1070954.1) or Oracle ORAchk for all non-Engineered Systems (see document 1268927.2)
• oswatcher: collects and archives OS metrics, useful for instance or node evictions & performance issues (document 301137.1)
• procwatcher: automates & captures database performance diagnostics and session-level hang information (document 459694.1)
• oratop: near real-time database monitoring (document 1500864.1)
• alertsummary: summary of events for one or more database or ASM alert files from all nodes
• ls: lists all files TFA knows about for a given file name pattern, across all nodes
• pstack: generates process stacks for specified processes, across all nodes
• grep: searches alert or trace files matching a given database and file name pattern for a search string
• summary: high-level summary of the configuration
• vi: opens alert or trace files matching a given database and file name pattern in the vi editor
• tail: tails alert or trace files for a given database and file name pattern
• param: shows all database and OS parameters that match a specified pattern
• dbglevel: sets and unsets multiple CRS trace levels with one command
• history: shows the shell history for the tfactl shell
• changes: reports changes in the system setup over a given time period, including database parameters, OS parameters and patches applied
• calog: reports major events from the cluster event log
• events: reports warnings and errors seen in the logs
• managelogs: shows disk space usage and purges ADR log and trace files
• ps: finds processes
• triage: summarizes oswatcher/exawatcher data
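Several of these run straight from tfactl; a quick triage pass built from the tools listed above might look like this sketch:

```shell
# Which of the bundled tools are deployed on this node?
tfactl toolstatus

# Warnings/errors and configuration changes across the cluster:
tfactl events
tfactl changes

# Scan recent logs for problems (last hour):
tfactl analyze -last 1h
```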
50.
Cluster Health Monitor (CHM)
Generates a diagnostic metrics view of the cluster and databases: an osysmond daemon on each node collects OS data and feeds it to the ologgerd master, which stores it in the Grid Infrastructure Management Repository (GIMR).
• Always on: enabled by default
• Provides detailed OS resource metrics
• Assists node eviction analysis
• Locally logs all process data
• Users can define pinned processes
• Listens to CSS and GIPC events
• Categorizes processes by type
• Supports plug-in collectors (e.g. traceroute, netstat, ping)
• New CSV output for ease of analysis
51.
Cluster Health Monitor (CHM)
Query with the oclumon CLI, or use the full integration with EM Cloud Control
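An oclumon sketch; the node name and time window below are placeholders:

```shell
# Current node view of OS metrics:
oclumon dumpnodeview -n racnode1

# Replay the last 15 minutes of collected data:
oclumon dumpnodeview -n racnode1 -last "00:15:00"

# Check the repository size/retention:
oclumon manage -get repsize
```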
52.
Cluster Health Advisor (CHA)*
Discovers potential cluster & DB problems and notifies with corrective actions. The ochad daemon combines OS data (from CHM) and DB data, runs them through node-health and database-health prognostics engines, and stores the results in the GIMR.
• Always on: enabled by default
• Detects node and database performance problems
• Provides early-warning alerts and corrective actions
• Supports on-site calibration to improve sensitivity
• Integrated into EMCC Incident Manager and notifications
• Standalone interactive GUI tool
* Requires, and is included with, the RAC or RAC One Node license
53.
Calibrating CHA to your RAC deployment
Choosing a data set for calibration – defining “normal”:
$ chactl query calibration -cluster -timeranges 'start=2016-10-28 07:00:00,end=2016-10-28 13:00:00'
Cluster name : mycluster
Start time : 2016-10-28 07:00:00
End time : 2016-10-28 13:00:00
Total Samples : 11524
Percentage of filtered data : 100%
1) Disk read (ASM) (Mbyte/sec)
MEAN MEDIAN STDDEV MIN MAX
0.11 0.00 2.62 0.00 114.66
<25 <50 <75 <100 >=100
99.87% 0.08% 0.00% 0.02% 0.03%
2) Disk write (ASM) (Mbyte/sec)
MEAN MEDIAN STDDEV MIN MAX
0.01 0.00 0.15 0.00 6.77
<50 <100 <150 <200 >=200
100.00% 0.00% 0.00% 0.00% 0.00%
3) Disk throughput (ASM) (IO/sec)
MEAN MEDIAN STDDEV MIN MAX
2.20 0.00 31.17 0.00 1100.00
<5000 <10000 <15000 <20000 >=20000
100.00% 0.00% 0.00% 0.00% 0.00%
4) CPU utilization (total) (%)
MEAN MEDIAN STDDEV MIN MAX
9.62 9.30 7.95 1.80 77.90
<20 <40 <60 <80 >=80
92.67% 6.17% 1.11% 0.05% 0.00%
54.
Calibrating CHA to your RAC deployment
Creating a new CHA model with CHACTL:
• Create and store the new model:
$ chactl calibrate cluster -model daytime -timeranges 'start=2018-10-28 07:00:00,end=2018-10-28 13:00:00'
• Begin using the new model:
$ chactl monitor cluster -model daytime
• Confirm the new model is being used:
$ chactl status -verbose
monitoring nodes svr01, svr02 using model daytime
monitoring database qoltpacdb, instances oltpacdb_1, oltpacdb_2 using model DEFAULT_DB
55.
Cluster Health Advisor – Command Line Operations
Monitoring your databases and nodes with CHACTL:
• Enable CHA monitoring on a RAC database, with an optional model:
$ chactl monitor database -db oltpacdb [-model model_name]
• Check the monitoring status, optionally verbose:
$ chactl status -verbose
monitoring nodes svr01, svr02 using model DEFAULT_CLUSTER
monitoring database oltpacdb, instances oltpacdb_1, oltpacdb_2 using model DEFAULT_DB
56. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
CHA Command Line Operations
Checking for Health Issues and Corrective Actions with CHACTL QUERY DIAGNOSIS
$ chactl query diagnosis -db oltpacdb -start "2016-10-28 01:52:50" -end "2016-10-28 03:19:15"
2016-10-28 01:47:10.0 Database oltpacdb DB Control File IO Performance (oltpacdb_1) [detected]
2016-10-28 01:47:10.0 Database oltpacdb DB Control File IO Performance (oltpacdb_2) [detected]
2016-10-28 02:59:35.0 Database oltpacdb DB Log File Switch (oltpacdb_1) [detected]
2016-10-28 02:59:45.0 Database oltpacdb DB Log File Switch (oltpacdb_2) [detected]
Problem: DB Control File IO Performance
Description: CHA has detected that reads or writes to the control files are slower than expected.
Cause: The Cluster Health Advisor (CHA) detected that reads or writes to the control files were
slow because of an increase in disk IO.
The slow control file reads and writes may have an impact on checkpoint and Log Writer (LGWR) performance.
Action: Separate the control files from other database files and move them to faster disks or Solid
State Devices.
Problem: DB Log File Switch
Description: CHA detected that database sessions are waiting longer than expected
for log switch completions.
Cause: The Cluster Health Advisor (CHA) detected high contention during log switches
because the redo log files were small and the redo logs switched frequently.
Action: Increase the size of the redo logs.
57. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
Cluster Health Advisor – Command Line Operations
HTML Diagnostic Health Output Available (-html <file_name>)
58. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 97
Oracle 12c Hang Manager
• Always on - Enabled by default
• Reliably detects database hangs and
deadlocks
• Autonomously resolves them
• Supports QoS Performance Classes, Ranks
and Policies to maintain SLAs
• Logs all detections and resolutions
• New SQL interface to configure sensitivity
(Normal/High) and trace file sizes
Autonomously Preserves Database Availability and Performance
[Diagram: DIA0 hang-management loop per session: DETECT → ANALYZE → EVALUATE → Hung? → VERIFY victim against QoS policy]
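The DETECT/ANALYZE/VERIFY flow can be sketched as a walk over the session wait-for graph. The Python below is an illustrative sketch only; `find_root_blockers`, the session IDs, and the data shape are hypothetical, not Oracle internals:

```python
def find_root_blockers(waits):
    """Illustrative hang detection: `waits` maps a waiting session id to
    the session id blocking it. A root blocker is a session that blocks
    others but itself waits on no one -- the candidate victim that a
    VERIFY step would re-check before any resolution is attempted."""
    blockers = set(waits.values())
    waiters = set(waits.keys())
    return sorted(blockers - waiters)

# Session 10 waits on 20, 20 waits on 30; 30 waits on nothing -> root blocker.
print(find_root_blockers({10: 20, 20: 30}))
```

Note that a pure cycle (a deadlock, e.g. `{1: 2, 2: 1}`) has no root blocker at all, which is one reason deadlocks need their own resolution path rather than simple victim selection.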
59. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 98
Full Resolution Dump Trace File and DB Alert Log Audit Reports
Oracle 12c Hang Manager
Dump file …/diag/rdbms/hm6/hm62/incident/incdir_5753/hm62_dia0_12656_i5753.trc
Oracle Database 12c Enterprise Edition Release 12.2.0.0.0 - 64bit Beta
With the Partitioning, Real Application Clusters, OLAP, Advanced Analytics
and Real Application Testing options
Build label: RDBMS_MAIN_LINUX.X64_151013
ORACLE_HOME: …/3775268204/oracle
System name: Linux
Node name: slc05kyr
Release: 2.6.39-400.211.1.el6uek.x86_64
Version: #1 SMP Fri Nov 15 13:39:16 PST 2013
Machine: x86_64
VM name: Xen Version: 3.4 (PVM)
Instance name: hm62
Redo thread mounted by this instance: 2
Oracle process number: 19
Unix process pid: 12656, image: oracle@slc05kyr (DIA0)
*** 2015-10-13T16:47:59.541509+17:00
*** SESSION ID:(96.41299) 2015-10-13T16:47:59.541519+17:00
*** CLIENT ID:() 2015-10-13T16:47:59.541529+17:00
*** SERVICE NAME:(SYS$BACKGROUND) 2015-10-13T16:47:59.541538+17:00
*** MODULE NAME:() 2015-10-13T16:47:59.541547+17:00
*** ACTION NAME:() 2015-10-13T16:47:59.541556+17:00
*** CLIENT DRIVER:() 2015-10-13T16:47:59.541565+17:00
2015-10-13T16:47:59.435039+17:00
Errors in file /oracle/log/diag/rdbms/hm6/hm6/trace/hm6_dia0_12433.trc (incident=7353):
ORA-32701: Possible hangs up to hang ID=1 detected
Incident details in: …/diag/rdbms/hm6/hm6/incident/incdir_7353/hm6_dia0_12433_i7353.trc
2015-10-13T16:47:59.506775+17:00
DIA0 requesting termination of session sid:40 with serial # 43179 (ospid:13031) on instance 2
due to a GLOBAL, HIGH confidence hang with ID=1.
Hang Resolution Reason: Automatic hang resolution was performed to free a
significant number of affected sessions.
DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=1.
In the alert log on the instance local to the session (instance 2 in this case),
we see the following:
2015-10-13T16:47:59.538673+17:00
Errors in file …/diag/rdbms/hm6/hm62/trace/hm62_dia0_12656.trc (incident=5753):
ORA-32701: Possible hangs up to hang ID=1 detected
Incident details in: …/diag/rdbms/hm6/hm62/incident/incdir_5753/hm62_dia0_12656_i5753.trc
2015-10-13T16:48:04.222661+17:00
DIA0 terminating blocker (ospid: 13031 sid: 40 ser#: 43179) of hang with ID = 1
requested by master DIA0 process on instance 1
Hang Resolution Reason: Automatic hang resolution was performed to free a
significant number of affected sessions.
by terminating session sid:40 with serial # 43179 (ospid:13031)
60. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
Program Agenda
Architecture and Basics
Troubleshooting Scenarios
Proactive and Reactive tools
19c and beyond
Q&A
1
2
3
4
5
99
61. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
Oracle RAC 18c
• Manages hung database processes
– Detects & resolves
– Cross-layer hangs
• e.g., hangs caused by a blocked ASM resource
• Resolves deadlocks
• User-defined control via PL/SQL
• Early warning exposed via V$ view
100
Hang Manager
[Diagram: Hang Manager spanning the database member cluster and the ASM I/O service: Detect → Analyze → Evaluate → Hung? → Hang Resolution]
62. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
Diagnostic Service
Oracle Confidential – Highly Restricted 101
• All data aggregated in one place
• Real-time overview of infrastructure & services
• Fine-grained drill-down for diagnosis
63. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
Initial Anomalous Events → Ranked Anomalous Events
Timestamp Correlation & Ranking
Full initial list of anomalous events
1. Sort the anomalous events in chronological order
2. Keep track of unique events and their first occurrence
3. Compare the sequence of events to previous timeframes in the same collection
4. Prioritize unique events not seen previously in the collection
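The four ranking steps can be sketched in a few lines of Python. `rank_anomalies`, the `(timestamp, event_name)` tuples, and the timeframe sets are illustrative names for this sketch, not part of any Oracle tool:

```python
def rank_anomalies(events, previous_timeframes):
    """Rank anomalous events per the four steps on the slide.
    `events` is a list of (timestamp, event_name) tuples;
    `previous_timeframes` is a list of sets of event names already
    seen earlier in the same collection."""
    # 1. Sort the anomalous events in chronological order.
    events = sorted(events)
    # 2. Keep track of unique events and their first occurrence.
    first_seen = {}
    for ts, name in events:
        first_seen.setdefault(name, ts)
    # 3. Compare against event sequences from previous timeframes.
    seen_before = set()
    for frame in previous_timeframes:
        seen_before.update(frame)
    # 4. Prioritize unique events not seen previously in the collection.
    novel = [(ts, n) for n, ts in first_seen.items() if n not in seen_before]
    known = [(ts, n) for n, ts in first_seen.items() if n in seen_before]
    return sorted(novel) + sorted(known)

events = [("02:00", "ora-600"), ("01:00", "high-cpu"), ("01:30", "ora-600")]
ranked = rank_anomalies(events, previous_timeframes=[{"high-cpu"}])
```

Here `ora-600` ranks first because it never appeared in an earlier timeframe, while the recurring `high-cpu` event drops to the bottom of the list.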
64. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
Oracle 19c
• Applied Machine Learning for
Database Diagnostics
– Efficient diagnosis using Machine
Learning
– Automatically performs
corrective actions to prevent
possible issues
– Provides simple alerts &
recommendations for issues that
require manual intervention
Oracle Domain Services Cluster
[Diagram: Domain Services Cluster on shared ASM, offering IO, ACFS, ASM, TFA, Management, and RHP services]
[Diagram: ML pipeline: logs, ASH, and metrics plus subject-matter-expert input feed knowledge extraction and model generation under human supervision, yielding application-optimized models with a feedback loop]
65. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
• Monitors for problems before service disruption
– e.g., heartbeat for critical processes
• Detects the cause of the problem
• Uses data collected across all nodes to identify the root cause
– e.g., waits on the GRD
• Resolves the problem with minimal disruption
– e.g., resizing internal structures
Introducing Database Reliability Framework
66. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
Monitor → Detect → Review → Resolve
• Increase in the number of resources in the Global Resource Directory (GRD)
• Resulting in higher wait times for the GRD
• Several solutions possible
– Is the wait time due to high CPU load?
– Would increasing the number of LMS processes help?
– Would increasing CR slaves help?
– Would increasing internal thresholds help?
Database Reliability Framework in Action
67. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
• Busy FG process(es) using CPU
• Potential upcoming memory starvation
• LGWR constrained by CPU
• Too many RT processes
• Insufficient CR slaves
• DLM resource cache incorrectly sized
• Control file IO (CFIO) stall
• v$ views
– v$gcr_metrics - details on all defined metrics
– v$gcr_actions - details on all defined actions
– v$gcr_log - metric/action history summary log
– v$gcr_status - details on latest metric/action status
107
Examples and DRF Views
68. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
• Increase the maximum number of LMS processes
– Based on system utilization (DRF)
• Each LMS will spawn a dedicated CR slave
– Threshold of rollback changes
– Threaded CR slave in 18c
• Optimized for multi-core/thread architectures
• Remastering slaves (RMV0..)
– Offload heavy remastering work to slaves
Cache Fusion Optimizations
69. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
Commit Cache
• Reduces Cache Fusion traffic for remote undo header lookups
• Often becomes a bottleneck with DML-heavy OLTP/mixed workloads
• Remote undo header lookups are needed to:
– Check if a transaction has committed
– Perform delayed block cleanout
109
[Chart: block transfers (thousands) by block type (data blocks, undo headers, undo blocks, others), split into CR (immediate/busy) and current (immediate/busy) transfers]
70. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
• Undo Block RDMA-read
– In some workloads, more than half of remote reads are for undo blocks to satisfy read consistency
– Undo Block RDMA-read uses RDMA to directly and rapidly access undo blocks in remote instances
• Commit Cache
– The commit cache maintains an in-memory table on each instance which records the commit time of transactions
– A remote LMS directly reads the commit cache and sends back commit times for requested transactions
• Replaces having to send the entire 8K transaction table block
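The commit-cache idea amounts to a per-instance map from transaction ID to commit time, consulted before falling back to a remote undo header read. The sketch below is illustrative only; `CommitCache` and its method names are assumptions, not Oracle internals:

```python
class CommitCache:
    """Illustrative per-instance commit cache: maps a transaction id to
    its commit time so a remote LMS can answer 'has txn X committed?'
    without shipping the whole 8K transaction table block."""

    def __init__(self):
        self._commits = {}  # txn_id -> commit time (SCN-like integer)

    def record_commit(self, txn_id, commit_time):
        self._commits[txn_id] = commit_time

    def lookup(self, txn_id):
        """Return the commit time, or None if unknown; on a miss the
        caller falls back to reading the remote undo header block."""
        return self._commits.get(txn_id)

cache = CommitCache()
cache.record_commit("T1", 1001)
# A reader building a consistent view as of time 1500 can now decide
# visibility of T1's changes from the cache alone; an unknown "T2"
# forces the traditional undo header lookup.
```

The design trade-off sketched here is the one the slide describes: a small, cheap in-memory lookup on the remote instance replaces shipping a full transaction table block across the interconnect.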
110
RAC Optimizations for Exadata
71. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 111
Fusion Block Transfer
[Diagram: four-step fusion block transfer between shadow process and LGWR; related waits: gc current block busy, gc buffer busy acquire, gc buffer busy release]
72. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
• On Exadata, Oracle does not wait for the log write notification
– Exadata ensures the log write completes before changes to the block on another instance commit, guaranteeing durability
– The wait for log I/O during transfer of hot blocks is eliminated
– Up to 40% throughput and 33% response time improvement in some heavily contended OLTP workloads
• The storage software ensures correct ordering of writes
112
Smart Fusion Block Transfer
[Diagram: 1. issue log write; 2. wait for log write completion; 3. transfer block. Exadata avoids the I/O wait confirmation at the storage layer]
73. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
Continuous Feature Improvements
• Lock domain per PDB
• Utilize Bloom filters to further reduce reconfiguration times
• Utilize the Database Reliability Framework
74. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
Scalable Sequences
Continuous Application Availability
Oracle RAC Sharding
Cluster Domains
Cluster Health Advisor (CHA)
RAC Reader Nodes
Application Continuity (AC)
Oracle Flex ASM & Flex Clusters
Rapid Home Provisioning (RHP)
Cluster Health Monitor (CHM)
Oracle Quality of Service Management (QoS)
Policy-Based Cluster Management
Oracle RAC One Node & RACcheck
Oracle ASM Cluster File System (ACFS)
Oracle Grid Infrastructure (GI)
UCP and OCI Load Balancing Support for RAC
Cluster Verification Utility (CVU)
Cluster-Managed Services
Oracle Clusterware
Oracle Automatic Storage Management (ASM)
Oracle Real Application Clusters (RAC)
Oracle RAC's Journey into the Autonomous Database
20 years of continuous innovation*: Oracle 9i, 10g, 11g, 12c, 18c/19c
* Documented features list is selective; 20 years include development time
114
75. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
Flex Cluster – Changes Down the Road
• Leaf nodes deprecated
• Massive Parallel Query Oracle RAC deprecated
• Oracle RAC Reader Nodes to be implemented on Hub nodes
115
76. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
Better Management
• gridSetup and zip-based install for Oracle Grid Infrastructure ($ORACLE_HOME/gridSetup.sh)
• NEW: RPM-based installs for the Oracle Database and Oracle Client
• ASM management for NFS-based Clusterware files (configure ASM on NFS) for easier management and thereby better availability
• Separate diskgroup for the Grid Infrastructure Management Repository (GIMR) allows for more flexibility during Grid Infrastructure installation
116
77. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
More Changes..
• Desupport of Direct File System Placement for Oracle Clusterware Files
– Introduced with Oracle Clusterware 12c Rel. 2 (12.2.0.1)
– Effective with Oracle Clusterware 18c
– Desupport revoked effective with Oracle Clusterware 19c
• Oracle Grid Infrastructure Management Repository (GIMR)
– Around since Oracle Grid Infrastructure 11g Release 2
– Automatic Installation of the GIMR introduced with Grid Infrastructure 12.1.0.2
– Separate diskgroup installation introduced with Grid Infrastructure 12c Release 2
– Automatic install revised for Oracle Grid Infrastructure 19c
• Plans foresee a GIMR installation outside of the Oracle Grid Infrastructure home for Standard Clusters
• Centralized GIMR hosting on a Domain Services Cluster (for Member Clusters) remains unchanged
117
78. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
Patching Improvements
• OJVM is Oracle RAC rolling patch enabled with Oracle RAC 18c (18.4)
– Non-Java services are available at all times
– Java services are available all the time, except for a ~10 seconds brownout
• No errors are reported during the brownout
• Zero-Downtime Oracle Grid Infrastructure Patching (*18.3)
– Patch Oracle Grid Infrastructure without interrupting database operations
– Patches are applied out-of-place and in a rolling fashion with one node being patched
at a time while the database instance(s) on that node remain up and running
– Supported for Oracle RAC and RAC One Node clusters with two or more nodes
119
79. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
The Road Ahead Leads into the Autonomous Database Cloud
• Future scalability & performance improvements
– Tailored to scale well within Exadata dimensions ("scale linearly across 64 nodes, not 200")
– Designed to meet ADB performance requirements and to grow as ADB evolves
– Will leverage RDMA technology for server-less communication
– Plan to use RoCE as the next-generation network for the cloud
• Details in MOS note “Oracle RAC Interconnect Protocols – Support and Roadmap (ID 2434852.1)”
– Will substitute storage access with network-based access to data on remote nodes
– Are likely to utilize NVM for storage on independent servers
• Future availability improvements
– Will focus on reducing re-configuration times (brownouts) further to come closer to “zero”
– Will provide even more ways to perform maintenance & admin tasks with no downtime
121