An Introduction to RAC System Test Planning Methods

Ajith Narayanan
ERP Advisor, Dell IT
Bangalore, 29th June 2013
Who Am I?

Ajith Narayanan
ERP Advisor
Dell IT
8.5 years of Oracle Apps & Oracle DBA experience.
Blogger: http://oracledbascriptsfromajith.blogspot.com
Website Chair: http://www.oracleracsig.org – Oracle RAC SIG
Agenda
Real Application Clusters Testing Objectives
Oracle Technologies Used For Tests
Test 1: Planned Node Reboot
Test 2: Unplanned Node Failure Of OCR Master
Test 3: Restart Failed Node
Test 4: Reboot All Nodes Same Time
Test 5: Unplanned Instance Failure
Test 6: Planned Instance Termination
Test 7: Clusterware and Fencing
Test 8: Service Failover
Test 9: Public Network Failure
Test 10: Interconnect Network Failure
Sample Cluster Callout Script
Q&A
Real Application Clusters Testing Objectives

To verify that the system has been installed and configured correctly, and that nothing is in a broken state.
To verify that basic functionality still works in a specific environment and for a specific workload.
To make sure that the system will achieve its objectives, in particular its availability and performance objectives.
Oracle Technologies Used For Tests
Fast Application Notification (FAN) – Notification mechanism that alerts applications to service-level changes in the database.
Fast Connection Failover (FCF) – Uses FAN events to enable database clients to proactively react to down events by quickly failing over connections to surviving database instances.
Transparent Application Failover (TAF) – Allows connections to be automatically re-established to a surviving database instance if the instance servicing the initial connection fails. TAF can fail over in-flight SELECT statements (if configured), but uncommitted INSERT, UPDATE and DELETE transactions are rolled back.
Runtime Connection Load Balancing (RCLB) – Provides application connection pools with information about the current service level of each database instance. This improves application performance by directing requests to the least-loaded instances, and allows the workload to be rebalanced dynamically when an instance stops providing a service or a new instance is added.
Test 1: Planned Node Reboot
Procedure
Start the client workload and identify the instance with the most client connections.
Reboot the node where the most loaded instance is running:
For AIX, HPUX, Windows: “shutdown -r”; for Linux: “shutdown -r now”; for Solaris: “reboot”
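To find the instance with the most client connections, per-instance session counts can be taken from GV$SESSION and the largest one picked. A minimal sketch, assuming SQL*Plus is on the PATH and OS authentication as SYSDBA works; `session_counts` and `busiest_instance` are hypothetical helper names, not Oracle tools.

```shell
#!/bin/sh
# Hypothetical helper: print "INST_ID COUNT" for user sessions on every instance.
# Assumes SQL*Plus and "/ as sysdba" OS authentication are available.
session_counts() {
  sqlplus -s / as sysdba <<'EOF'
set heading off feedback off pagesize 0
select inst_id, count(*) from gv$session where type = 'USER' group by inst_id;
exit
EOF
}

# Hypothetical helper: read "INST_ID COUNT" lines on stdin and print the
# INST_ID with the highest count (the first one wins on a tie).
busiest_instance() {
  awk 'NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
         if (best == "" || $2 + 0 > max + 0) { max = $2; best = $1 }
       }
       END { if (best != "") print best }'
}
```

On a live cluster, `session_counts | busiest_instance` prints the INST_ID; GV$INSTANCE maps it to an instance and host name.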

Expected Results
The instances and other Clusterware resources go offline (see the ‘SERVER’ field of the crsctl stat res -t output).
The node VIP fails over to one of the surviving nodes and will show a state of “INTERMEDIATE” with state_details of “FAILED_OVER”.
The SCAN VIP(s) that were running on the rebooted node will fail over to surviving nodes.
The SCAN Listener(s) running on that node will fail over to a surviving node.
Instance recovery is performed by another instance.
Services are moved to available instances.
Client connections are moved / reconnected to surviving instances (procedure and timings will depend on client types and configuration). With TAF configured, select statements should continue. Active DML will be aborted.
After the database reconfiguration, surviving instances continue processing their workload.

Measures
Time to detect node or instance failure.
Time to complete instance recovery (the alert log of the recovering instance helps here).
Time to restore client activity to same level.
Time before failed instance is restarted automatically by Clusterware and is accepting new connections
Successful failover of the SCAN VIP(s) and SCAN Listener(s)
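The timing measures can be captured by polling a status command until it succeeds and recording the elapsed seconds. A minimal sketch, assuming a `date` that supports `+%s` (GNU, BSD and BusyBox do); `wait_for` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: wait_for TIMEOUT INTERVAL CMD [ARGS...]
# Runs CMD every INTERVAL seconds until it succeeds and prints the elapsed
# seconds; returns 1 if TIMEOUT seconds pass without success.
wait_for() {
  wf_timeout=$1; wf_interval=$2; shift 2
  wf_start=$(date +%s)
  while :; do
    if "$@" >/dev/null 2>&1; then
      echo $(( $(date +%s) - wf_start ))
      return 0
    fi
    if [ $(( $(date +%s) - wf_start )) -ge "$wf_timeout" ]; then
      return 1
    fi
    sleep "$wf_interval"
  done
}
```

For example, started right after the reboot, `wait_for 900 5 sh -c 'srvctl status instance -d RAC -i RAC1 | grep -q "is running"'` reports how long Clusterware took to bring RAC1 back.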
Test 2: Unplanned Node Failure Of OCR Master
Procedure

Start client workload.
Identify the node that is the OCR master using the following grep command from any of the nodes:
grep -i "OCR MASTER" $GI_HOME/log/<node_name>/crsd/crsd.l*

NOTE: Windows users must manually review the $GI_HOME/log/<node_name>/crsd/crsd.l* logs to determine the OCR
Master.
Power off the node that is the OCR master.
NOTE: On many servers the power-off switch performs a controlled shutdown, so cut the power supply instead.

Expected Results
The instances and other Clusterware resources go offline (see the ‘SERVER’ field of the crsctl stat res -t output).
The node VIP fails over to one of the surviving nodes and will show a state of “INTERMEDIATE” with state_details of “FAILED_OVER”.
The SCAN VIP(s) that were running on the failed node will fail over to surviving nodes.
The SCAN Listener(s) running on that node will fail over to a surviving node.
Instance recovery is performed by another instance.
Services are moved to available instances.
Client connections are moved / reconnected to surviving instances (procedure and timings will depend on client types and configuration). With TAF configured, select statements should continue. Active DML will be aborted.
After the database reconfiguration, surviving instances continue processing their workload.
Test 3: Restart Failed Node
Procedure
ajithpathiyil2:/home/oracle[RAC1]$ srvctl start instance -d RAC -i RAC1

Expected Results
On clusters having 3 or fewer nodes, one of the SCAN VIPs and Listeners will be relocated to the restarted node when
the Oracle Clusterware starts.
The node VIP will migrate back to the restarted node.
Services that had failed over as a result of the node failure will NOT automatically be relocated.
Failed resources (ASM, listener, instance, etc.) will be restarted by Oracle Clusterware.

Measures
Time for all resources to become available again. Check with “crsctl stat res -t”.
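Whether all resources are back can be judged from the TARGET and STATE columns of `crsctl stat res -t`: a resource whose TARGET is ONLINE but whose STATE is anything else (OFFLINE, INTERMEDIATE, ...) is not yet available. A minimal sketch, assuming the 11.2 tabular layout in which cluster resources prefix each state line with an instance number; `count_not_online` and `wait_all_online` are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper: read `crsctl stat res -t` output on stdin and print how
# many state lines have TARGET=ONLINE but a STATE other than ONLINE.
count_not_online() {
  awk '{
         t = 1                          # column holding TARGET
         if ($1 ~ /^[0-9]+$/) t = 2     # cluster resources: leading instance number
         if ($t == "ONLINE" && $(t + 1) != "ONLINE") n++
       }
       END { print n + 0 }'
}

# Hypothetical usage on a live cluster (not run here): poll until all are ONLINE.
wait_all_online() {
  until [ "$(crsctl stat res -t | count_not_online)" -eq 0 ]; do
    sleep 5
  done
}
```

Timing `wait_all_online` from the moment the node restarts gives the "time for all resources to become available again" measure.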
Test 4: Reboot All Nodes Same Time
Procedure
Issue a reboot on all nodes at the same time:
For AIX, HPUX, Windows: ‘shutdown -r’
For Linux: ‘shutdown -r now’
For Solaris: ‘reboot’
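Issuing the reboots from one administration host over ssh keeps them close together in time. A minimal sketch, assuming root ssh equivalence to every node and that all nodes run the same Unix OS (Windows is not covered); OS names are `uname -s` values, and `reboot_cmd` and `reboot_all_nodes` are hypothetical names. It only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical helper: reboot command for an OS, keyed by `uname -s` output,
# following the per-OS commands listed in the procedure.
reboot_cmd() {
  case $1 in
    AIX|HP-UX) echo "shutdown -r" ;;
    Linux)     echo "shutdown -r now" ;;
    SunOS)     echo "reboot" ;;
    *)         return 1 ;;
  esac
}

# Hypothetical helper: reboot_all_nodes OS NODE...
# Starts the reboot on every node in parallel over ssh as root (assumption).
# Only prints the ssh commands unless DRY_RUN=0 is set.
reboot_all_nodes() {
  rb_cmd=$(reboot_cmd "$1") || { echo "unsupported OS: $1" >&2; return 1; }
  shift
  for node in "$@"; do
    if [ "${DRY_RUN:-1}" = 1 ]; then
      echo "ssh root@$node $rb_cmd"
    else
      ssh "root@$node" "$rb_cmd" &   # background: all reboots are issued together
    fi
  done
  wait
}
```

Running `reboot_all_nodes Linux ajithpathiyil1 ajithpathiyil2` first in dry-run mode shows exactly what would be sent before anything is rebooted.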

Expected Results
All nodes, instances and resources are restarted without problems

Measures
Time for all resources to become available again. Check with “crsctl stat res -t”.
Test 5: Unplanned Instance Failure
Procedure
Start client workload
Identify the single database instance with the most client connections and abnormally terminate that instance:
For AIX, HPUX, Linux, Solaris:
Obtain the PID of the pmon process of the database instance:
# ps -ef | grep pmon
Kill the pmon process:
# kill -9 <pmon pid>
For Windows:
Obtain the thread ID of the pmon thread of the database instance by running:
SQL> select b.name, p.spid from v$bgprocess b, v$process p where b.paddr=p.addr and b.name='PMON';
Run orakill to kill the thread:
cmd> orakill <SID> <Thread ID>
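`ps -ef | grep pmon` also lists the ASM instance's pmon (asm_pmon_+ASM1) and the grep itself, so picking the right PID by eye is error-prone. A minimal sketch that matches the exact process name `ora_pmon_<SID>`, assuming the usual Unix naming of database background processes and a `ps` that accepts `-o pid,args` (HP-UX may need the UNIX95 environment variable); `pmon_pid` and `kill_pmon` are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper: read `ps -eo pid,args` output on stdin and print the PID
# of the pmon process of database instance SID (exact match on ora_pmon_SID).
pmon_pid() {
  awk -v proc="ora_pmon_$1" '$2 == proc { print $1 }'
}

# Hypothetical wrapper for a live test (not run here): kill -9 that instance's pmon.
kill_pmon() {
  pid=$(ps -eo pid,args | pmon_pid "$1")
  if [ -z "$pid" ]; then
    echo "no pmon process found for $1" >&2
    return 1
  fi
  kill -9 "$pid"
}
```

The exact-name match also keeps `kill_pmon RAC1` from picking up an instance such as RAC11.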
Expected Results
One of the other instances performs instance recovery
Services are moved to available instances, if a preferred instance failed
Client connections are moved / reconnected to surviving instances (Procedure and timings will depend on client
types and configuration)
After a short freeze, surviving instances continue processing the workload
Failing instance will be restarted by Oracle Clusterware, unless this feature has been disabled

Measures
Time to detect instance failure
Time to complete instance recovery. Check alert log for recovering instance
Time to restore client activity to same level (assuming remaining nodes have sufficient capacity to run workload)
Duration of database freeze during failover.
Time before failed instance is restarted automatically by Oracle Clusterware and is accepting new connections
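Instance recovery time can be read from the recovering instance's alert log. A minimal sketch, assuming the pre-12c text alert log format, in which each entry is preceded by a timestamp line such as "Sun Jun 30 01:56:12 2013", and assuming the recovery messages start with "Beginning instance recovery" and "Completed instance recovery" (check both patterns against your own alert log). `recovery_window` is a hypothetical name; it prints the timestamps of the two entries.

```shell
#!/bin/sh
# Hypothetical helper: recovery_window START_RE END_RE < alert_<SID>.log
# Prints the timestamp of the entry holding the first START_RE match, then the
# timestamp of the entry holding the next END_RE match.
recovery_window() {
  awk -v s="$1" -v e="$2" '
    /^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) [A-Z][a-z][a-z] [ 0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] [0-9][0-9][0-9][0-9]$/ { ts = $0; next }
    start == "" && $0 ~ s { start = ts; next }
    start != "" && $0 ~ e { print start; print ts; exit }'
}
```

With GNU date, the two printed timestamps convert to a duration via `echo $(( $(date -d "$end" +%s) - $(date -d "$start" +%s) ))`.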
Test 6: Planned Instance Termination
Procedure
Issue a ‘shutdown abort’ on one database instance (for example, from SQL*Plus connected as SYSDBA).

Expected Results
One of the surviving instances performs instance recovery.
Services are moved to available instances, if a preferred instance failed.
Client connections are moved / reconnected to surviving instances (procedure and timings will depend on client types and configuration).
The instance will NOT be automatically restarted by Oracle Clusterware due to the user-invoked shutdown.

Measures
Time to detect instance failure.
Time to complete instance recovery. Check alert log for recovering instance.
Time to restore client activity to same level (assuming remaining nodes have sufficient capacity to run workload).
Confirm that the instance is NOT restarted by Oracle Clusterware, because the shutdown was user-induced.
Test 7: Clusterware and Fencing
Node fencing is a general concept used by computer clusters to forcefully remove a malfunctioning node from the cluster. This preventive technique ensures that no I/O can be issued from the malfunctioning node, preventing data corruption and guaranteeing cluster integrity.
Procedure
1. Start with a normal, running cluster with the database instances up and running.
The network heartbeat is governed by a timeout called misscount, which defaults to 30 seconds from 11g Release 1 onwards. Check it, and the interface used for the cluster interconnect, with:
ajithpathiyil1:/home/oracle[+ASM1] $crsctl get css misscount
30
ajithpathiyil1:/home/oracle[+ASM1] $oifcfg getif
bond0 192.168.78.51 global public
bond1 10.10.0.0 global cluster_interconnect
2. Monitor the Clusterware log files on each node. On each node, open a new window and run the following commands:
ajithpathiyil1:/home/oracle[grid]$ tail -f /u01/grid/oracle/product/11.2.0/grid_1/log/`hostname -s`/crsd/crsd.log
ajithpathiyil1:/home/oracle[grid]$ tail -f /u01/grid/oracle/product/11.2.0/grid_1/log/`hostname -s`/cssd/ocssd.log
3. Trigger the failure on one node by bringing down a private network interface:
ajithpathiyil2:/home/oracle[grid]$ ifconfig eth1 down
Expected Results
After running this command, watch the log files you began monitoring in step 2. You should see errors in those log files, and eventually (this can literally take a minute or two) one node will reboot itself.
If you used ifconfig to trigger the failure, the node will rejoin the cluster and the instance should start automatically.
Alert Log
[cssd(2864)]CRS-1612:Network communication with node rac1 (1) missing for 50% of timeout interval.
Removal of this node from cluster in 14.920 seconds
…
[cssd(2864)]CRS-1610:Network communication with node rac1 (1) missing for 90% of timeout interval.
Removal of this node from cluster in 2.900 seconds
[cssd(2864)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is
going down to preserve cluster integrity
More debugging information is written to the ocssd.bin process log file:
[CSSD][1119164736](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
Cohort of 1 nodes with leader 2, rac2, is smaller than cohort of 1 nodes led by node 1, rac1, based on
map type 2
[CSSD][1119164736]###################################
[CSSD][1119164736]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
[CSSD][1119164736]###################################
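The CRS-1612 and CRS-1610 warnings above fire at fixed fractions of misscount, so the remaining time they report can be cross-checked: with misscount 30, 50% elapsed leaves about 15 seconds (log: 14.920) and 90% leaves about 3 seconds (log: 2.900). A minimal sketch with two hypothetical helpers: `css_warnings` filters the heartbeat and eviction messages out of a log (CRS-1609, CRS-1610 and CRS-1612 appear above; CRS-1611 is assumed to be the intermediate 75% warning), and `removal_in` does the arithmetic.

```shell
#!/bin/sh
# Hypothetical helper: keep only CSS heartbeat/eviction messages from a log on stdin.
css_warnings() {
  grep -E 'CRS-16(09|10|11|12)'
}

# Hypothetical helper: whole seconds left before eviction once PCT percent of
# MISSCOUNT seconds has elapsed without a network heartbeat.
removal_in() {
  echo $(( $1 * (100 - $2) / 100 ))
}
```

During the test, `tail -f $GI_HOME/log/$(hostname -s)/alert$(hostname -s).log | css_warnings` shows only the countdown messages.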
Test 8: Service Failover
Procedure
Create and start a service, then check where it is running:
ajithpathiyil2:/home/oracle[RAC1]$ srvctl add service -d RAC -s svctest -r RAC1 -a RAC2 -P BASIC
ajithpathiyil2:/home/oracle[RAC1]$ srvctl start service -d RAC -s svctest
ajithpathiyil2:/home/oracle[RAC1]$ srvctl status service -d RAC -s svctest
Service svctest is running on instance(s) RAC1
ajithpathiyil2:/home/oracle[RAC1]$
Warning! Never change the SERVICE_NAMES init parameter directly on a RAC database. This parameter is maintained automatically by Oracle Clusterware.
SQL> show user
USER is "SYS"
SQL> select instance_name from v$instance;
INSTANCE_NAME
----------------
RAC1
SQL> shutdown abort;
ORACLE instance shut down.
SQL>
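After the shutdown abort of RAC1, svctest would be expected to relocate to its available instance RAC2 (the -a RAC2 in the srvctl add service command). A minimal sketch for checking that from `srvctl status service` output, assuming the 11.2 message format shown above, with multiple instances comma-separated; `service_instances` and `service_on` are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper: read `srvctl status service` output on stdin and print
# the instances the service is running on, one per line.
service_instances() {
  sed -n 's/^Service .* is running on instance(s) //p' | tr ',' '\n'
}

# Hypothetical helper: succeed if the service output on stdin lists instance $1.
service_on() {
  service_instances | grep -qx "$1"
}
```

For example: `srvctl status service -d RAC -s svctest | service_on RAC2 && echo "svctest failed over to RAC2"`.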
Test 9: Public Network Failure
Procedure
Unplug all network cables for the public network
NOTE: Do not use ifconfig to bring the interface down; the address may remain plumbed to the interface, leading to unexpected results.

Expected Results
Check with “crsctl stat res -t”:
The ora.*.network and listener resources will go offline for the node.
SCAN VIPs and SCAN LISTENERs running on the node will fail over to a surviving node.
ajithpathiyil2:/home/oracle[grid]$ srvctl status scan
SCAN VIP scan1 is enabled
SCAN VIP scan1 is running on node ajithpathiyil2
ajithpathiyil2:/home/oracle[grid]$
ajithpathiyil2:/home/oracle[grid]$ srvctl status scan_listener
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is running on node ajithpathiyil2
ajithpathiyil2:/home/oracle[grid]$

The VIP for the node will fail over to a surviving node.
The database instance will remain up but will be unregistered with the remote listeners.
Database services will fail over to one of the other available nodes.
If TAF is configured, clients should fail over to an available instance.
NODE VERSION=1.0 host=ajithpathiyil2 incarn=0 status=nodedown reason=public_nw_down timestamp=30-Aug-2009 01:56:12 reported=Sun Jan 30 01:56:13 CDT 2013
NODE VERSION=1.0 host=ajithpathiyil2 incarn=147028525 status=nodedown reason=member_leave timestamp=30-Aug-2009 01:57:19 reported=Sun Aug 30 01:57:20 CDT 2013
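Each FAN event is a set of key=value pairs, and server-side callouts receive them as separate command-line arguments (the sample callout script logs them with $*). A minimal sketch for pulling one field, such as status or reason, out of such an event; `fan_value` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: fan_value KEY EVENT_ARGS...
# Prints the value of KEY from a FAN event passed as separate arguments;
# returns 1 if the key is absent.
fan_value() {
  fv_key=$1; shift
  for fv_pair in "$@"; do
    case $fv_pair in
      "$fv_key"=*) echo "${fv_pair#*=}"; return 0 ;;   # strip "KEY="
    esac
  done
  return 1
}
```

Inside a callout, `fan_value reason "$@"` would return public_nw_down for the first event above, which distinguishes a public network failure from a plain member_leave.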

Measures
Time to detect the network failure and relocate resources.
Test 10: Interconnect Network Failure
Procedure
Unplug all network cables for the interconnect network
NOTE: Do not use ifconfig to bring the interface down; the address may remain plumbed to the interface, leading to unexpected results.

Expected Results
For 11.2.0.2 and above:
CSSD will detect the split-brain situation and perform one of the following:
o In a two-node cluster, the node with the lowest node number will survive.
o In a multi-node cluster, the largest sub-cluster will survive.
On the node(s) being evicted, a graceful shutdown of Oracle Clusterware will be attempted:
o All I/O-capable client processes will be terminated and all resources will be cleaned up. If process termination and/or resource cleanup does not complete successfully, the node will be rebooted.
o Assuming that the above has completed successfully, OHASD will attempt to restart the stack. In this case the stack will be restarted once network connectivity on the private interconnect has been restored.
Review the following logs:
o $GI_HOME/log/<nodename>/alert<nodename>.log
o $GI_HOME/log/<nodename>/cssd/ocssd.log
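The survival rule above can be modelled in a few lines, which helps predict which side of a split should stay up before the cables are pulled. This is a simplified sketch of the rule as stated on this slide only (the largest sub-cluster wins; between equal-sized sub-clusters, the one containing the lowest node number wins), not of CSSD's actual implementation. `surviving_cohort` is a hypothetical name, and each argument is one sub-cluster given as space-separated node numbers.

```shell
#!/bin/sh
# Hypothetical, simplified model of the survival rule described above.
# Each argument is one sub-cluster, e.g. "1 3". Prints the surviving sub-cluster.
surviving_cohort() {
  sc_best=""; sc_best_size=0; sc_best_min=0
  for sc_cohort in "$@"; do
    sc_size=0; sc_min=""
    for sc_node in $sc_cohort; do              # word-split into node numbers
      sc_size=$((sc_size + 1))
      if [ -z "$sc_min" ] || [ "$sc_node" -lt "$sc_min" ]; then sc_min=$sc_node; fi
    done
    [ "$sc_size" -eq 0 ] && continue           # ignore empty arguments
    # Larger sub-cluster wins; on a tie, the one holding the lowest node number wins.
    if [ "$sc_size" -gt "$sc_best_size" ] ||
       { [ "$sc_size" -eq "$sc_best_size" ] && [ "$sc_min" -lt "$sc_best_min" ]; }; then
      sc_best=$sc_cohort; sc_best_size=$sc_size; sc_best_min=$sc_min
    fi
  done
  printf '%s\n' "$sc_best"
}
```

For a two-node split, `surviving_cohort 1 2` prints 1, consistent with the ocssd.log excerpt in Test 7, where the one-node cohort led by node 2 aborts in favour of the one led by node 1.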
Measures
For 11.2.0.2 and above:
Oracle Clusterware will shut down gracefully; should the graceful shutdown fail (because I/O processes cannot be terminated or resource cleanup does not complete), the node will be rebooted.
Assuming that the graceful shutdown of Oracle Clusterware succeeded, OHASD will restart the stack once network connectivity for the private interconnect has been restored.
Sample Cluster Callout Script
#!/bin/ksh
# Author: Ajith Narayanan
# http://oracledbascriptsfromajith.blogspot.com
# Version 1.0
# This callout script is extended to report/mail the affected WebLogic services when any Oracle cluster event occurs.
# Oracle Clusterware runs every executable in $ORACLE_HOME/racg/usrco and passes the FAN event as arguments.
umask 022
USRCO=$ORACLE_HOME/racg/usrco
FAN_LOGFILE=$USRCO/`hostname`_uptime.log
# Per-invocation work files ($$ = PID), so concurrent callouts do not overwrite each other
EVENTLINE_MID=$USRCO/`hostname`_eventline_mid.$$.log
MAIL_CONT=$USRCO/`hostname`_mail.$$.log
WEBLOGIC_DS=$USRCO/weblogic_ds

# Log the event in the foreground (the original "&" let the next step read the log before the write finished)
echo "$* reported=`date`" >> "$FAN_LOGFILE"

# One key=value pair per line, taken from this event's own arguments rather than the shared log
for FIELD in "$@"; do
  echo "$FIELD"
done > "$EVENTLINE_MID"

# Value of a key=value pair, e.g. get_field status -> nodedown
get_field() {
  grep "^$1=" "$EVENTLINE_MID" | head -1 | cut -d= -f2-
}
SER=`get_field service`
DB=`get_field database`
INST=`get_field instance`
HOST=`get_field host`
STAT=`get_field status`

# The original "[ ... | ... ]" piped commands together instead of testing with OR
if [ -n "$SER" ] || [ -n "$DB" ] || [ -n "$INST" ] || [ -n "$HOST" ] || [ -n "$STAT" ]; then
  cat "$EVENTLINE_MID" > "$MAIL_CONT"
  if [ "$STAT" = nodedown ]; then
    echo "**============================SERVICES AFFECTED===============================**" >> "$MAIL_CONT"
    # NODE events carry no database= field; list every data source, as the original grep on the empty $DB_ did
    cat "$WEBLOGIC_DS" >> "$MAIL_CONT"
  elif [ "$STAT" = up ]; then
    echo "**============================SERVICES RESTORED===============================**" >> "$MAIL_CONT"
    # ${DB}_ : the original "$DB_" expanded an unset variable named DB_
    grep -i "${DB}_" "$WEBLOGIC_DS" | grep "SERVICE_NAME=$SER" >> "$MAIL_CONT"
  else
    echo "**============================SERVICES AFFECTED===============================**" >> "$MAIL_CONT"
    grep -i "${DB}_" "$WEBLOGIC_DS" | grep "SERVICE_NAME=$SER" >> "$MAIL_CONT"
  fi
  # Mail for every event type (the misplaced "fi" previously limited this to the else branch)
  /bin/mail -s "Cluster $STAT event: $DB $INST $SER $HOST" ajithpathiyil@gmail.com < "$MAIL_CONT"
fi
rm -f "$EVENTLINE_MID" "$MAIL_CONT"
Q&A
Thank You For Attending AIOUG Tech Day
Be A Part Of AIOUG For Sharing & Gaining Knowledge

Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 

Kürzlich hochgeladen (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

An introduction to_rac_system_test_planning_methods

  • 1. An Introduction to RAC System Test Planning Methods
    Ajith Narayanan, ERP Advisor, Dell IT
    Bangalore, 29th June 2013
  • 2. Who Am I?
    Ajith Narayanan, ERP Advisor, Dell IT
    8.5 years of Oracle Apps and Oracle DBA experience
    Blogger: http://oracledbascriptsfromajith.blogspot.com
    Website Chair: http://www.oracleracsig.org (Oracle RAC SIG)
  • 3. Agenda
    o Real Application Clusters Testing Objectives
    o Oracle Technologies Used for Tests
    o Test 1: Planned Node Reboot
    o Test 2: Unplanned Node Failure of OCR Master
    o Test 3: Restart Failed Node
    o Test 4: Reboot All Nodes at the Same Time
    o Test 5: Unplanned Instance Failure
    o Test 6: Planned Instance Termination
    o Test 7: Clusterware and Fencing
    o Test 8: Service Failover
    o Test 9: Public Network Failure
    o Test 10: Interconnect Network Failure
    o Sample Cluster Callout Script
    o Q&A
  • 4. Real Application Clusters Testing Objectives
    o To verify that the system has been installed and configured correctly, and that nothing is in a broken state.
    o To verify that basic functionality still works in a specific environment and for a specific workload.
    o To make sure that the system will achieve its objectives, in particular its availability and performance objectives.
  • 5. Oracle Technologies Used for Tests
    o Fast Application Notification (FAN): a notification mechanism that alerts applications to service-level changes in the database.
    o Fast Connection Failover (FCF): uses FAN events to let database clients react proactively to down events by quickly failing over connections to surviving database instances.
    o Transparent Application Failover (TAF): automatically re-establishes connections to a surviving database instance if the instance servicing the initial connection fails. TAF can fail over in-flight SELECT statements (if configured), but in-flight INSERT, UPDATE and DELETE transactions are rolled back.
    o Runtime Connection Load Balancing (RCLB): gives application connection pools current information about the service level of each database instance. This improves application performance by sending requests to the least loaded servers, and rebalances the workload dynamically when an instance stops providing a service or a new instance starts providing it.
  • 6. Test 1: Planned Node Reboot
    Procedure
    o Start the client workload and identify the instance with the most client connections.
    o Reboot the node where that instance is running. For AIX, HP-UX and Windows: "shutdown -r"; for Linux: "shutdown -r now"; for Solaris: "reboot".
    Expected Results
    o The instances and other Clusterware resources on the rebooted node go offline (check the 'SERVER' field of the "crsctl stat res -t" output).
    o The node VIP fails over to a surviving node and shows a state of "INTERMEDIATE" with state_details of "FAILED_OVER".
    o The SCAN VIP(s) that were running on the rebooted node fail over to surviving nodes.
    o The SCAN Listener(s) running on that node fail over to a surviving node.
    o Instance recovery is performed by another instance.
    o Services are moved to available instances.
    o Client connections are moved or reconnected to surviving instances (the procedure and timings depend on client types and configuration). With TAF configured, SELECT statements should continue; active DML will be aborted.
    o After the database reconfiguration, the surviving instances continue processing their workload.
    Measures
    o Time to detect the node or instance failure.
    o Time to complete instance recovery (the alert log of the recovering instance helps here).
    o Time to restore client activity to the same level.
    o Time before the failed instance is restarted automatically by Clusterware and is accepting new connections.
    o Successful failover of the SCAN VIP(s) and SCAN Listener(s).
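Most of the Measures above are elapsed times. A minimal sketch of a polling timer, written as a POSIX sh function to match the shell used elsewhere in this deck. `time_until` is a hypothetical helper name, the check command is whatever the tester chooses, and `date +%s` is assumed to be available (GNU and BSD date provide it; POSIX does not require it).

```shell
# time_until TIMEOUT INTERVAL CHECK-COMMAND [ARGS...]
# Hypothetical helper, not part of Oracle Clusterware. Re-runs the check
# command every INTERVAL seconds until it exits 0, then prints the number
# of whole seconds elapsed since the first attempt. Returns 1 without
# printing anything if TIMEOUT seconds pass first.
# Assumes `date +%s` works (GNU and BSD date support it; it is not POSIX).
time_until() {
    _tu_timeout=$1
    _tu_interval=$2
    shift 2
    _tu_start=$(date +%s)
    while :; do
        if "$@" >/dev/null 2>&1; then
            echo $(( $(date +%s) - _tu_start ))
            return 0
        fi
        if [ $(( $(date +%s) - _tu_start )) -ge "$_tu_timeout" ]; then
            return 1
        fi
        sleep "$_tu_interval"
    done
}
```

For example, started just after issuing the reboot, something like `time_until 600 5 sh -c 'crsctl stat res ora.ajithpathiyil2.vip -t | grep -q INTERMEDIATE'` would report roughly how long the node VIP took to fail over. The resource name follows the usual ora.<node>.vip pattern but should be checked against your own "crsctl stat res -t" output, and a 5-second interval limits the resolution to about 5 seconds.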
  • 7. Test 2: Unplanned Node Failure of OCR Master
    Procedure
    o Start the client workload.
    o Identify the node that is the OCR master by running the following grep command from any of the nodes:
      grep -i "OCR MASTER" $GI_HOME/log/<node_name>/crsd/crsd.l*
      NOTE: Windows users must manually review the $GI_HOME/log/<node_name>/crsd/crsd.l* logs to determine the OCR master.
    o Power off the node that is the OCR master.
      NOTE: On many servers the power-off switch performs a controlled shutdown, so the power supply has to be cut instead.
    Expected Results
    o The instances and other Clusterware resources on the failed node go offline (check the 'SERVER' field of the "crsctl stat res -t" output).
    o The node VIP fails over to a surviving node and shows a state of "INTERMEDIATE" with state_details of "FAILED_OVER".
    o The SCAN VIP(s) that were running on the failed node fail over to surviving nodes.
    o The SCAN Listener(s) running on that node fail over to a surviving node.
    o Instance recovery is performed by another instance.
    o Services are moved to available instances.
    o Client connections are moved or reconnected to surviving instances (the procedure and timings depend on client types and configuration). With TAF configured, SELECT statements should continue; active DML will be aborted.
    o After the database reconfiguration, the surviving instances continue processing their workload.
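The grep on this slide returns every OCR master message in every crsd log, including stale ones from earlier reconfigurations. A small sketch that keeps only the newest line. It assumes (not verified here) that each crsd log line begins with a sortable timestamp such as `2013-06-29 10:15:01.123:`, so that sorting puts the latest message last whichever rotated file it came from. The function name is hypothetical.

```shell
# latest_ocr_master_line FILE...
# Hypothetical helper: prints the most recent "OCR MASTER" line found in the
# given crsd log files, e.g. $GI_HOME/log/<node_name>/crsd/crsd.l*
# Assumption: every log line starts with an ISO-style timestamp, so a plain
# sort orders matching lines chronologically across all rotated files.
latest_ocr_master_line() {
    cat -- "$@" 2>/dev/null | grep -i "OCR MASTER" | sort | tail -n 1
}
```

Usage: `latest_ocr_master_line $GI_HOME/log/<node_name>/crsd/crsd.l*`. If the timestamp assumption does not hold on your release, fall back to reading the full grep output on the slide.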
  • 8. Test 3: Restart Failed Node
    Procedure
    o ajithpathiyil2:/home/oracle[RAC1]$ srvctl start instance -d RAC -i RAC1
    Expected Results
    o On clusters having 3 or fewer nodes, one of the SCAN VIPs and Listeners will be relocated to the restarted node when Oracle Clusterware starts.
    o The node VIP will migrate back to the restarted node.
    o Services that had failed over as a result of the node failure will NOT automatically be relocated.
    o Failed resources (ASM, listener, instance, etc.) will be restarted by Clusterware.
    Measures
    o Time for all resources to become available again; check with "crsctl stat res -t".
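Checking "crsctl stat res -t" by eye is error-prone on a busy cluster. A sketch of a filter for the non-tabular `crsctl stat res` output, assumed here to be blocks of NAME=, TYPE=, TARGET= and STATE= lines with one comma-separated entry per node. It prints each resource that has a member whose TARGET is ONLINE but whose STATE is not (for example OFFLINE or INTERMEDIATE); resources whose target is OFFLINE by design are not reported. `not_online` is a hypothetical name.

```shell
# not_online: reads `crsctl stat res` (without -t) on stdin and prints the
# NAME of every resource where some TARGET entry is ONLINE but the matching
# STATE entry does not start with ONLINE. The NAME=/TARGET=/STATE= layout
# and the ", " separator between per-node entries are assumptions.
not_online() {
    awk '
        /^NAME=/   { name = substr($0, 6) }
        /^TARGET=/ { nt = split(substr($0, 8), tgt, ", *") }
        /^STATE=/  {
            ns = split(substr($0, 7), st, ", *")
            for (i = 1; i <= nt && i <= ns; i++)
                if (tgt[i] == "ONLINE" && st[i] !~ /^ONLINE/) { print name; break }
        }
    '
}
```

An empty result from `crsctl stat res | not_online` suggests that everything meant to be online is online again; it does not replace reading the state details for resources that are reported.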
  • 9. Test 4: Reboot All Nodes at the Same Time
    Procedure
    o Issue a reboot on all nodes at the same time. For AIX, HP-UX and Windows: 'shutdown -r'; for Linux: 'shutdown -r now'; for Solaris: 'reboot'.
    Expected Results
    o All nodes, instances and resources are restarted without problems.
    Measures
    o Time for all resources to become available again; check with "crsctl stat res -t".
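Issuing the reboots by hand from several terminals rarely makes them simultaneous. A sketch that starts the same command on every node in parallel from one shell. The helper name, the RUNNER override, and passwordless ssh as a user allowed to reboot the nodes are all assumptions.

```shell
# run_on_nodes COMMAND NODE...
# Hypothetical helper: starts COMMAND on every NODE at (nearly) the same
# time by launching one remote session per node in the background, then
# waits for all of them. The remote runner defaults to ssh and can be
# overridden through the RUNNER variable (e.g. RUNNER=echo for a dry run).
run_on_nodes() {
    _ron_cmd=$1
    shift
    for _ron_node in "$@"; do
        # Background each session so no node waits for the previous one.
        ${RUNNER:-ssh} "$_ron_node" "$_ron_cmd" &
    done
    wait
}
```

`run_on_nodes 'shutdown -r now' ajithpathiyil1 ajithpathiyil2` would use the Linux form from the slide, while `RUNNER=echo run_on_nodes 'shutdown -r now' ajithpathiyil1 ajithpathiyil2` only prints the commands. The ssh sessions will usually end with an error as the nodes go down, which is why the function does not check their exit status.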
  • 10. Test 5: Unplanned Instance Failure
    Procedure
    o Start the client workload.
    o Identify the single database instance with the most client connections and terminate it abnormally.
    o For AIX, HP-UX, Linux and Solaris: obtain the PID of the pmon process of the database instance:
      # ps -ef | grep pmon
      then kill the pmon process:
      # kill -9 <pmon pid>
    o For Windows: obtain the thread ID of the pmon thread of the database instance by running:
      SQL> select b.name, p.spid from v$bgprocess b, v$process p where b.paddr=p.addr and b.name='PMON';
      then run orakill to kill the thread:
      cmd> orakill <SID> <Thread ID>
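`ps -ef | grep pmon` also matches the grep process itself, the ASM instance's `asm_pmon_+ASM1` and the PMON of any other database on the node. A sketch that picks the PID of exactly `ora_pmon_<SID>`, assuming the usual `ps -ef` layout with the PID in the second column and the process name in the last. `pmon_pid` is a hypothetical name.

```shell
# pmon_pid SID: reads `ps -ef` output on stdin and prints the PID (column 2)
# of the process whose command (last column) is exactly ora_pmon_<SID>.
pmon_pid() {
    awk -v name="ora_pmon_$1" '$NF == name { print $2 }'
}
```

For example, `kill -9 "$(ps -ef | pmon_pid RAC1)"` terminates RAC1's PMON; it is worth checking that the function printed exactly one PID before running the kill.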
  • 11. Test 5: Unplanned Instance Failure
    Expected Results
    o One of the other instances performs instance recovery.
    o Services are moved to available instances if a preferred instance failed.
    o Client connections are moved or reconnected to surviving instances (the procedure and timings depend on client types and configuration).
    o After a short freeze, the surviving instances continue processing the workload.
    o The failed instance will be restarted by Oracle Clusterware, unless this feature has been disabled.
    Measures
    o Time to detect the instance failure.
    o Time to complete instance recovery; check the alert log of the recovering instance.
    o Time to restore client activity to the same level (assuming the remaining nodes have sufficient capacity to run the workload).
    o Duration of the database freeze during failover.
    o Time before the failed instance is restarted automatically by Oracle Clusterware and is accepting new connections.
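"Time to complete instance recovery" can be read off the alert log of the recovering instance. A sketch, assuming the 11g alert-log layout in which messages follow a timestamp line such as `Sat Jun 29 10:15:03 2013`, that prints the timestamp preceding the last occurrence of a message. Both function names are hypothetical, and the two message texts used in `recovery_window` are assumptions to check against your own alert log.

```shell
# last_stamp PATTERN ALERT_LOG
# Prints the timestamp line preceding the last line that matches PATTERN
# (an extended regular expression). Timestamp lines are recognised by their
# leading day and month names, as in "Sat Jun 29 10:15:03 2013".
last_stamp() {
    awk -v pat="$1" '
        /^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) / { ts = $0; next }
        $0 ~ pat { hit = ts }
        END { if (hit != "") print hit }
    ' "$2"
}

# recovery_window ALERT_LOG
# The message texts below are assumed, not taken from this deck.
recovery_window() {
    echo "began:     $(last_stamp 'Beginning instance recovery' "$1")"
    echo "completed: $(last_stamp 'Completed instance recovery' "$1")"
}
```

The difference between the two printed timestamps approximates the recovery time; because alert-log timestamps are only written per group of messages, the result is accurate to a few seconds at best.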
  • 12. Test 6: Planned Instance Termination
    Procedure
    o Issue a 'shutdown abort' on one instance.
    Expected Results
    o One other instance performs instance recovery.
    o Services are moved to available instances if a preferred instance was terminated.
    o Client connections are moved or reconnected to surviving instances (the procedure and timings depend on client types and configuration).
    o The instance will NOT be automatically restarted by Oracle Clusterware, because the shutdown was user-invoked.
    Measures
    o Time to detect the instance failure.
    o Time to complete instance recovery; check the alert log of the recovering instance.
    o Time to restore client activity to the same level (assuming the remaining nodes have sufficient capacity to run the workload).
    o Confirm that the instance is NOT restarted by Oracle Clusterware, since the shutdown was user-induced.
  • 13. Test 7: Clusterware and Fencing
    Node fencing is a general concept used by computer clusters to forcefully remove a malfunctioning node from the cluster. This preventive measure ensures that the malfunctioning node cannot perform any I/O, which prevents data corruption and guarantees cluster integrity.
    Procedure
    1. Start with a normal, running cluster with the database instances up and running.
    2. Monitor the Clusterware log files on each node: on each node, open a new window and run the commands below. The network heartbeats are associated with a timeout called misscount, which has been 30 seconds since 11g Release 1.
       ajithpathiyil1:/home/oracle[+ASM1]$ crsctl get css misscount
       30
       ajithpathiyil1:/home/oracle[+ASM1]$ oifcfg getif
       bond0 192.168.78.51 global public
       bond1 10.10.0.0 global cluster_interconnect
       ajithpathiyil1:/home/oracle[grid]$ tail -f /u01/grid/oracle/product/11.2.0/grid_1/log/ajithpathiyil2/crsd/crsd.l*
       ajithpathiyil1:/home/oracle[grid]$ tail -f /u01/grid/oracle/product/11.2.0/grid_1/log/`hostname -s`/cssd/ocssd.log
    3. Simulate an interconnect failure by taking down the private interconnect interface on one node:
       ajithpathiyil2:/home/oracle[grid]$ ifconfig eth1 down
  • 14. Test 7: Clusterware and Fencing
    Expected Results
    After issuing this command, watch the log files you began monitoring in step 2. You should see errors in those log files, and eventually (this could literally take a minute or two) you will observe one node reboot itself. If you used ifconfig to trigger the failure, the node will rejoin the cluster and the instance should start automatically.
    Alert log:
      [cssd(2864)]CRS-1612:Network communication with node rac1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.920 seconds
      …
      [cssd(2864)]CRS-1610:Network communication with node rac1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.900 seconds
      [cssd(2864)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity
    More debugging information is written to the ocssd.bin process log file:
      [CSSD][1119164736](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, rac2, is smaller than cohort of 1 nodes led by node 1, rac1, based on map type 2
      [CSSD][1119164736]###################################
      [CSSD][1119164736]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
      [CSSD][1119164736]###################################
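The countdown in the CRS-1612 and CRS-1610 messages follows from the misscount value shown in step 2. A sketch of the arithmetic: if heartbeats have been missing for P percent of a misscount of M seconds, about M * (100 - P) / 100 seconds remain before the node is removed. The function name is hypothetical.

```shell
# seconds_left MISSCOUNT PERCENT
# Seconds remaining before eviction once network heartbeats have been
# missing for PERCENT% of a MISSCOUNT-second interval:
#   remaining = MISSCOUNT * (100 - PERCENT) / 100
seconds_left() {
    awk -v m="$1" -v p="$2" 'BEGIN { printf "%.1f\n", m * (100 - p) / 100 }'
}
```

With misscount 30, `seconds_left 30 50` gives 15.0 and `seconds_left 30 90` gives 3.0, close to the 14.920 and 2.900 seconds in the log above; the small shortfall is presumably the delay between crossing the threshold and writing the message.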
  • 15. Test 8: Service Failover
    Procedure
    o Create and start a service with RAC1 as the preferred instance and RAC2 as the available instance:
      ajithpathiyil2:/home/oracle[RAC1]$ srvctl add service -d RAC -s svctest -r RAC1 -a RAC2 -P BASIC
      ajithpathiyil2:/home/oracle[RAC1]$ srvctl start service -d RAC -s svctest
      ajithpathiyil2:/home/oracle[RAC1]$ srvctl status service -d RAC -s svctest
      Service svctest is running on instance(s) RAC1
    o Warning! Never change the SERVICE_NAMES init parameter directly on a RAC database. This parameter is maintained automatically by Clusterware.
    o Abort the instance currently running the service:
      SQL> show user
      USER is "SYS"
      SQL> select instance_name from v$instance;
      INSTANCE_NAME
      ---------------
      RAC1
      SQL> shutdown abort;
      ORACLE instance shut down.
      SQL>
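To confirm the failover without reading srvctl output by eye, a sketch of a filter that extracts where a service, SCAN VIP or SCAN listener is running. It relies only on the "is running on instance(s)" and "is running on node" wording of the sample srvctl output in this deck; `running_on` is a hypothetical name.

```shell
# running_on: reads srvctl status output on stdin and prints whatever follows
# "is running on instance(s)" (services) or "is running on node" (SCAN VIPs
# and SCAN listeners). Lines that do not say where something runs print nothing.
running_on() {
    sed -n -e 's/.* is running on instance(s) //p' \
           -e 's/.* is running on node //p'
}
```

After the `shutdown abort` of RAC1, `srvctl status service -d RAC -s svctest | running_on` would be expected to print RAC2, the available instance given with `-a RAC2`.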
  • 16. Test 9: Public Network Failure
    Procedure
    o Unplug all network cables for the public network.
      NOTE: It is recommended NOT to use ifconfig to bring the interface down; this may leave the address plumbed to the interface, with unexpected results.
    Expected Results
    o Check with "crsctl stat res -t".
    o The ora.*.network and listener resources will go offline for the node.
    o SCAN VIPs and SCAN Listeners running on the node will fail over to a surviving node:
      ajithpathiyil2:/home/oracle[grid]$ srvctl status scan
      SCAN VIP scan1 is enabled
      SCAN VIP scan1 is running on node ajithpathiyil2
      ajithpathiyil2:/home/oracle[grid]$ srvctl status scan_listener
      SCAN Listener LISTENER_SCAN1 is enabled
      SCAN listener LISTENER_SCAN1 is running on node ajithpathiyil2
      ajithpathiyil2:/home/oracle[grid]$
  • 17. Test 9: Public Network Failure
    o The VIP for the node will fail over to a surviving node.
    o The database instance will remain up but will be unregistered from the remote listeners.
    o Database services will fail over to one of the other available nodes.
    o If TAF is configured, clients should fail over to an available instance.
    FAN events as logged by the callout script:
      NODE VERSION=1.0 host=ajithpathiyil2 incarn=0 status=nodedown reason=public_nw_down timestamp=30-Aug-2009 01:56:12 reported=Sun Jan 30 01:56:13 CDT 2013
      NODE VERSION=1.0 host=ajithpathiyil2 incarn=147028525 status=nodedown reason=member_leave timestamp=30-Aug-2009 01:57:19 reported=Sun Aug 30 01:57:20 CDT 2013
    Measures
    o Time to detect the network failure and relocate resources.
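Each FAN event is a single line of key=value pairs, and a server-side callout receives it as its arguments. The callout script at the end of this deck extracts fields by transposing the line into a temporary file; a lighter sketch of the same idea, with a hypothetical function name, reads a field straight from the arguments.

```shell
# fan_field KEY WORD...
# Hypothetical helper: prints the value of KEY from a FAN event passed as
# separate words, as a callout receives it in "$@". Returns 1 if KEY is
# absent. Values containing spaces (such as the time part of timestamp=)
# are cut at the first space by this simple word split.
fan_field() {
    _ff_key=$1
    shift
    for _ff_word in "$@"; do
        case $_ff_word in
            "$_ff_key"=*) printf '%s\n' "${_ff_word#*=}"; return 0 ;;
        esac
    done
    return 1
}
```

In a callout, `STAT=$(fan_field status "$@")` would yield nodedown for both events shown on this slide.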
  • 18. Test 10: Interconnect Network Failure
    Procedure
    o Unplug all network cables for the interconnect network.
      NOTE: It is recommended NOT to use ifconfig to bring the interface down; this may leave the address plumbed to the interface, with unexpected results.
    Expected Results (11.2.0.2 and above)
    o CSSD will detect a split-brain situation and act as follows:
      o In a two-node cluster, the node with the lowest node number will survive.
      o In a multi-node cluster, the largest sub-cluster will survive.
    o On the node(s) being evicted, a graceful shutdown of Oracle Clusterware will be attempted:
      o All I/O-capable client processes will be terminated and all resources will be cleaned up. If process termination and/or resource cleanup does not complete successfully, the node will be rebooted.
      o Assuming the above has completed successfully, OHASD will attempt to restart the stack. The stack is restarted once network connectivity on the private interconnect has been restored.
    o Review the following logs:
      o $GI_HOME/log/<nodename>/alert<nodename>.log
      o $GI_HOME/log/<nodename>/cssd/ocssd.log
  • 19. Test 10: Interconnect Network Failure
    Measures (11.2.0.2 and above)
    o Oracle Clusterware shuts down gracefully; should the graceful shutdown fail (because I/O processes could not be terminated or resources could not be cleaned up), the node is rebooted.
    o Assuming the graceful shutdown of Oracle Clusterware succeeded, OHASD restarts the stack once network connectivity on the private interconnect has been restored.
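The survival rule from the previous slide can be modelled in a few lines, which helps when predicting which node(s) should go down before pulling the cables. This sketch only encodes the rule as stated (largest sub-cluster wins; in a two-node cluster the lowest node number wins). Breaking ties between equal-sized sub-clusters in a larger cluster by lowest node number is an assumption here, generalising the two-node case and consistent with the ocssd.log excerpt in Test 7. It is a model, not CSSD's actual algorithm, which also relies on voting-disk heartbeats. `survivor` is a hypothetical name.

```shell
# survivor: reads one candidate sub-cluster per line (space-separated node
# numbers) and prints the one expected to survive: the largest, with ties
# broken in favour of the sub-cluster containing the lowest node number.
survivor() {
    awk '
        NF == 0 { next }
        {
            lo = $1
            for (i = 2; i <= NF; i++) if ($i + 0 < lo + 0) lo = $i
            if (best == "" || NF > bestn || (NF == bestn && lo + 0 < bestlo + 0)) {
                best = $0; bestn = NF; bestlo = lo
            }
        }
        END { if (best != "") print best }
    '
}
```

For a two-node cluster split into nodes 2 and 1, `printf '2\n1\n' | survivor` prints 1; for a five-node cluster split into {1,2} and {3,4,5}, the three-node side is printed.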
  • 20. Sample Cluster Callout Script
    #!/bin/ksh
    #
    # Author: Ajith Narayanan
    ## http://oracledbascriptsfromajith.blogspot.com
    ## Version 1.0
    ## This callout script is extended to report/mail the affected WebLogic services
    ## when any Oracle cluster event occurs.
    ##
    umask 022
    FAN_LOGFILE=$ORACLE_HOME/racg/usrco/`hostname`_uptime.log
    EVENTLINE=$ORACLE_HOME/racg/usrco/`hostname`_eventline.log
    EVENTLINE_MID=$ORACLE_HOME/racg/usrco/`hostname`_eventline_mid.log
    MAIL_CONT=$ORACLE_HOME/racg/usrco/`hostname`_mail.log
    WEBLOGIC_DS=$ORACLE_HOME/racg/usrco/weblogic_ds
    # Log the event (passed as arguments) in the foreground: the next
    # command reads the last line of this file straight back.
    echo "$*" "reported="`date` >> $FAN_LOGFILE
    tail -1 $FAN_LOGFILE > $EVENTLINE
    # Transpose the event line so that each key=value pair is on its own line
    # (explicit %s format, so a "%" inside a field cannot break printf).
    awk '{ for (f = 1; f <= NF; f++) { a[NR, f] = $f } } NF > nf { nf = NF }
         END { for (f = 1; f <= nf; f++) { for (r = 1; r <= NR; r++) {
               printf "%s%s", a[r, f], (r == NR ? RS : FS) } } }' $EVENTLINE > $EVENTLINE_MID
    SER=`grep "service=" $EVENTLINE_MID|awk -F= '{print $2}'`
    DB=`grep "database=" $EVENTLINE_MID|awk -F= '{print $2}'`
  • 21. Sample Cluster Callout Script
    INST=`grep "instance=" $EVENTLINE_MID|awk -F= '{print $2}'`
    HOST=`grep "host=" $EVENTLINE_MID|awk -F= '{print $2}'`
    STAT=`grep "status=" $EVENTLINE_MID|awk -F= '{print $2}'`
    # Continue only if the event carried at least one of these fields.
    if [ -n "$SER" ] || [ -n "$DB" ] || [ -n "$INST" ] || [ -n "$HOST" ] || [ -n "$STAT" ]; then
      if [ "$STAT" = nodedown ]; then
        cat $EVENTLINE_MID > $MAIL_CONT
        echo "**============================SERVICES AFFECTED===============================**" >> $MAIL_CONT
        # "${DB}_" rather than "$DB_", which would expand an unset variable named DB_
        grep -i "${DB}_" $WEBLOGIC_DS >> $MAIL_CONT
      elif [ "$STAT" = up ]; then
        cat $EVENTLINE_MID > $MAIL_CONT
        echo "**============================SERVICES RESTORED===============================**" >> $MAIL_CONT
        grep -i "${DB}_" $WEBLOGIC_DS | grep "SERVICE_NAME=$SER" >> $MAIL_CONT
      else
        cat $EVENTLINE_MID > $MAIL_CONT
        echo "**============================SERVICES AFFECTED===============================**" >> $MAIL_CONT
        grep -i "${DB}_" $WEBLOGIC_DS | grep "SERVICE_NAME=$SER" >> $MAIL_CONT
      fi
      # Mail the report for every event type, not only the final "else" branch.
      cat $MAIL_CONT | /bin/mail -s "Cluster $STAT event: $DB $INST $SER $HOST" ajithpathiyil@gmail.com
    fi
    rm -f $EVENTLINE $EVENTLINE_MID $MAIL_CONT
  • 22. Q&A
  • 23. Thank You for Attending AIOUG Tech Day
    Be a Part of AIOUG for Sharing & Gaining Knowledge