If SQL Server is the heart of our environment, its health should be very important, right? And if SQL Server is important, its availability for our businesses (internal and external) is important too. Our customers do not care where the data is stored, how it is stored, or what we do with it. This is especially true for our managers. The data must be available on demand, on time, at the moment of the request. High Availability is our responsibility. How can we prepare our environment for HA? How is HA connected with the SLA? And why are Service Level Agreements important for us? In this session I want to discuss the HA options for SQL Server (2008, 2012), our different customers, and Service Level Agreements (formal or not).
3. SELECT {BIO}
Polish SQL Server User Group Leader
Microsoft Certified Trainer
MCP, MCSA, MLSS, MLSBS, MCTS, MCITP, MCT
SQL Server MVP from 2010
Friends of RedGate PLUS
PASS SQL Azure Virtual Chapter Co-Founder
Blogger, Influencer, Technical Writer
Last 7 years (living) in Data Center in Wrocław
Generally about 12 years in IT/banking area
GITCA Technical Lead & Vice-Chair EMEA Board
Speaker at SQL Server Community Launch, Time for SharePoint,
CodeCamps, SharePoint Community Launch, CISSP Day, InfoTRAMS,
SQLSaturday, SQLBits, CarreerCon,
Author of a few articles on TechNet (PL) and the WSS.pl portal
Deep Dives Co-Author:
High availability of SQL Server in the context
of Service Level Agreements (Chapter 18th)
Working for the MS Subject Matter Expert and MS Terminology
community (Windows 7, 8 & Visual Studio 2010, 2011)
4. Agenda
Back to the school:
What is High Availability
What is Service Level Agreement
Using HA in SQL Server 2008
HA solutions in SQL Server 2008 that means:
Enterprise, Enterprise
Why SLA and DBA
Dependency of SLA and HA
Case Studies
Q&A
5. What is High Availability?
High Availability (HA) means ensuring the
continued operation of equipment and
systems, usually in an
enterprise production environment.
It is designed to prevent data loss as a result of:
software bugs,
manufacturing defects
hardware failure
natural disasters
human error
other unforeseen events
8. Two kinds of monster:
PSO > USO > SLA
PSO: Planned System Outages (Planned System Unavailability)
Minimal planned unavailability, needed to carry out
modernization work, install patches, or replace / extend
hardware.
Agreed with and accepted by the client, and not affecting the
provisions of the HA and SLA, until...
...USO: Unplanned System Outages (Unplanned System Unavailability)
An error that partially or completely prevents the environment from
working, in a way that is tangible and measurable for the customer,
resulting in high costs if repairs are needed, as well as penalty
payments for not meeting the SLA.
9. Performance metrics (HA)
What does availability on the order of 99.99% really mean?
Availability of 99.99% means 0.01% UNAVAILABILITY in a
requested period (e.g. a year), which ...
How much is that in terms of the unavailability of the
server / environment / database?
Availability = MTBF / (MTBF + MTTR)
MTBF -> Mean Time Between Failures
MTTR -> Mean Time To Repair
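As a minimal sketch, the availability formula on this slide can be evaluated directly. The function name and the sample MTBF / MTTR values are hypothetical and used only for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical server: fails on average every 1000 hours, takes 1 hour to repair.
example = availability(mtbf_hours=1000.0, mttr_hours=1.0)
print(f"Availability: {example:.4%}")
```

Note that a shorter repair time raises availability just as effectively as a longer time between failures, which is why tested recovery procedures matter as much as reliable hardware.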
10. Unavailability in minutes, hours, days, weeks...
Availability % | Downtime per year | Downtime per month* | Downtime per week
--- | --- | --- | ---
90% | 36.5 days | 72 hours | 16.8 hours
95% | 18.25 days | 36 hours | 8.4 hours
98% | 7.30 days | 14.4 hours | 3.36 hours
99% | 3.65 days | 7.20 hours | 1.68 hours
99.5% | 1.83 days | 3.60 hours | 50.4 min
99.8% | 17.52 hours | 86.4 min | 20.16 min
99.9% ("three nines") | 8.76 hours | 43.2 min | 10.1 min
99.95% | 4.38 hours | 21.6 min | 5.04 min
99.99% ("four nines") | 52.6 min | 4.32 min | 1.01 min
99.999% ("five nines") | 5.26 min | 25.9 s | 6.05 s
99.9999% ("six nines") | 31.5 s | 2.59 s | 0.605 s
* Monthly figures assume a 30-day month.
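The downtime figures can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes a 365-day year, a 30-day month and a 7-day week, and the names `PERIODS` and `allowed_downtime` are made up for the example.

```python
from datetime import timedelta

# Assumed period lengths (not stated on the slide): 365-day year, 30-day month.
PERIODS = {
    "year": timedelta(days=365),
    "month": timedelta(days=30),
    "week": timedelta(days=7),
}


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Return the downtime permitted in `period` at the given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


for percent in (99.0, 99.9, 99.99, 99.999):
    row = ", ".join(
        f"per {name}: {allowed_downtime(percent, length)}"
        for name, length in PERIODS.items()
    )
    print(f"{percent}% -> {row}")
```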
11. What is SLA?
SLA - Service Level Agreement.
The origins date back to the 1980s and the agreements between
operators and end customers.
A mutually negotiated contract for the provision of services (not
just IT services, but these in particular)
It should be concluded formally, although an informal agreement
is legally permissible
It defines the level and range of the services provided by means of
measurable indicators (level of accessibility, usability,
performance)
The contract should specify a minimum and maximum
range for each service it covers
12. Metrics of SLA
There is no specific SLA measurement WITHOUT indicators!
SAMPLE CALL CENTER / SERVICE DESK:
ABA (Abandonment Rate): Percentage of calls abandoned while waiting for
a response.
ASA (Average Speed to Answer): Average time (usually in seconds) required
for a call to be answered by the help desk.
TSF (Time Service Factor): Percentage of calls answered within a defined time
frame, such as 80% in 20 seconds.
FCR (First Call Resolution): Percentage of calls where the problem was
solved without having to switch to another expert
TAT (Turn Around Time): The time it takes to complete certain tasks.
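The indicators above are simple ratios over call records. The sketch below shows one way to compute them; the `Call` record and its fields are hypothetical, and the choice of denominators (TSF over all calls, FCR over answered calls) is an assumption, since real contracts define these differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Call:
    wait_seconds: float                   # time the caller spent waiting
    answered: bool                        # False: caller hung up while waiting
    resolved_first_contact: bool = False  # solved without another expert


def service_desk_metrics(calls: List[Call], tsf_seconds: float = 20.0) -> Dict[str, float]:
    """Compute ABA, ASA, TSF and FCR for a list of calls."""
    if not calls:
        raise ValueError("at least one call is required")
    total = len(calls)
    answered = [c for c in calls if c.answered]
    # ABA: share of calls abandoned while waiting for a response.
    aba = 100.0 * (total - len(answered)) / total
    # ASA: average wait of the calls that were actually answered.
    asa = sum(c.wait_seconds for c in answered) / len(answered) if answered else 0.0
    # TSF: share of all calls answered within the agreed time frame (assumption).
    in_time = sum(1 for c in answered if c.wait_seconds <= tsf_seconds)
    tsf = 100.0 * in_time / total
    # FCR: share of answered calls solved on first contact (assumption).
    solved = sum(1 for c in answered if c.resolved_first_contact)
    fcr = 100.0 * solved / len(answered) if answered else 0.0
    return {"ABA": aba, "ASA": asa, "TSF": tsf, "FCR": fcr}


sample = [Call(10, True, True), Call(30, True), Call(60, False), Call(5, True, True)]
print(service_desk_metrics(sample))
```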
13. High Availability in SQL Server 2008
Microsoft SQL Server 2008 offers:
• Database Mirroring
• Database Snapshots
• Windows Clustering
• SQL Server Replication
• Hot-add memory and CPU
• Online Index Operations
• Table and Index Partitioning
• Failover Clustering
• Peer-To-Peer Replication
14. Solutions for HA for SQL Server
AREA | DATABASE MIRRORING | FAILOVER CLUSTERING | TRANSACTIONAL REPLICATION | LOG SHIPPING
--- | --- | --- | --- | ---
Data Loss | no data loss | no data loss | some data loss possible | some data loss possible
Automatic Failover | YES (in HA mode) | YES | no | no
Transparent To Client | YES, autodirect | YES, connect to same IP | no, NLB helps | no, NLB helps
Downtime | < 3 seconds | 20 seconds or more + time to recovery | seconds | seconds plus time to recovery
Standby Read Access | Yes, with db snapshots | no | YES |
Data Granularity | DB only | all systems and db's | table or view | DB only
Masking of hdd failure | YES | No, shared disk | YES | YES
Special hardware | NO, duplicate recommended | Cluster HCL | NO, duplicate recommended | NO, duplicate recommended
Complexity | Some | More | More | More
15. Why High Availability?
Businesses need to work around the clock to meet customer demands
When systems are not running, businesses are losing revenue, opportunities,
customers and reputation
High availability reduces the impact of required maintenance on
day-to-day operations and helps recover quickly from disasters
Businesses need flexibility to easily build high availability solutions that meet
business and technology needs
Features that help prevent unplanned downtime and reduce planned downtime:
Online operations
Multiple instance clustering
Live Migration
Automatic page repair with database mirroring
Hot-add CPU and RAM
Database snapshots
Peer-to-peer replication
16. Prevent Unplanned Downtime
Multiple-Instance Database Clustering
• More than one passive node is available to host instances from
multiple failovers on active nodes
• Having multiple failover nodes provides greater availability
• Multiple instances can share the same failover node, which reduces
hardware costs
• Simplified setup reduces administrative costs
(Diagram: Applications & Business Logic; cluster nodes labeled Active, Failover, Offline)
Because of the critical nature of the G4S application,
CASON sets up the servers in a failover cluster to
ensure high availability.
—CASON Case Study
17. Enhanced Database Mirroring
High Performance Mirroring
• Increase performance through asynchronous mirroring
Automatic Page Repair
• Automatically detects page corruption and retrieves data from the mirror
• Reduces downtime and management costs
• Minimizes application changes to correctly handle I/O errors
Reporting from Mirror
• Increase utilization of mirror server
• Reduce need for reporting servers
(Diagram: Applications & Business Logic; Principal and Mirror)
“This is a really powerful enhancement because prior
to this… you would have to run DBCC CHECKDB...
and that would likely mean taking downtime… With
SQL Server 2008 Database Mirroring you can avoid
the effort and downtime.”
18. Help Recover From User Errors
Database Snapshots
• Provide a read-only static view of the database at a point in time
• Revert to a point in time before user error
• Data loss is limited to changes after the snapshot
• Run reports from a snapshot created on the mirror server
to better utilize resources
(Diagram: Applications & Business Logic; Snapshot and Source)
“Database snapshots allow you to create read-only
databases for reporting and can also be useful in your
data recovery efforts in the event of a disaster.”
—Tim Chapman, SQL Server Database Administrator
19. Maintain Databases Without Downtime
Online Operations
• Allow routine maintenance without corresponding downtime
‒ Online index operations
‒ Online page and file restoration
‒ Online configuration of peer-to-peer nodes
• Users and applications can access data while the table, key, or index is
being updated
(Diagram: Applications & Business Logic; Table and Index with deleted rows)
We recommend performing online index operations for
business environments that operate 24 hours a day,
seven days a week, in which the need for concurrent
user activity during index operations is vital.
— SQL Server Books Online
20. Minimize Planned Downtime and Increase Efficiency
Live Migration
• Move running instances of VMs between host servers
• Virtual machines can be moved for maintenance or to balance
workload on host servers
• Perform maintenance on physical machines without any downtime
• Requires Windows Server 2008 R2 Hyper-V
(Diagram: Applications & Business Logic; virtual machines moving between host servers)
“This server already runs on our cluster solution with
high availability, but after we have tested live migration
on the new hardware, we’ll move it over to ensure
optimal performance and reliability”
—Rodrigo Immaginario, IT Manager, Universidade Vila Velha
21. Minimize Planned Downtime
Hot-Add CPU and RAM
• Dynamically add memory and processors to servers without
incurring downtime
• Requires hardware support for either physical or virtual hardware
(Diagram: Applications & Business Logic; servers with added CPU and RAM)
Hot-add CPU is the ability to dynamically add CPUs to
a running system. Adding CPUs can occur physically
by adding new hardware, logically by online hardware
partitioning, or virtually through a virtualization layer.
—SQL Server Books Online
22. Access Data Seamlessly Across Servers
Peer-to-Peer Replication
• Increases reliability by replicating data to multiple servers
• Provides higher availability in case of failure or to allow maintenance
at any of the participating nodes
• Offers improved performance for each node with geo-scale
architecture
• Add and remove servers easily without taking replication offline,
by using the new topology wizard
(Diagram: Applications & Business Logic; replicated peer servers)
“[Microsoft] SQL Server 2008 replication proved to be
very predictable and reliable in our testing. This helps
us to create flexible and scalable replication solutions.
Reliability must be at the foundation of all that we do.”
— Sergey Elchinsky, Leading System Engineer, Baltika Breweries
23. Database Mirroring
Mirroring maintains a mirror image of the data
Works with exactly two copies of a database (principal, mirror)
A witness server is optional, but desirable (it enables automatic failover)
Requirements:
principal, mirror - SQL Server Standard or Enterprise, both on the same edition (asynchronous high-performance mode requires Enterprise)
witness - can be SQL Server Express
Availability for the database:
a copy of the database on a different physical and / or virtual server
Availability for the system:
a copy of the entire environment on a different physical
and / or virtual server
24. Database Mirroring Refresher Synchronous Mode
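KEY POINT: the mirror database is an EXACT copy of the principal
(Diagram: principal and mirror servers, each with a DB and a Log; the numbered steps are listed below)
1. Commit
2. Write to local log; transmit to mirror
3. Committed in log
4. Write to remote log
5. Committed in log
6. Acknowledge
7. Acknowledge
The mirror is constantly redoing the log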
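The ordering in synchronous mode can be sketched as a toy model: the principal writes its own log, waits for the mirror to write the same record, and only then acknowledges the client. This is an illustration of the sequence only, not how SQL Server is implemented, and the class and method names are invented for the example.

```python
class MirrorServer:
    """Toy mirror: hardens log records, then redoes them into its database."""

    def __init__(self) -> None:
        self.log = []       # records written to the remote log
        self.database = []  # records already redone on the mirror

    def harden(self, record: str) -> bool:
        self.log.append(record)  # write to remote log
        return True              # acknowledge to the principal

    def redo(self) -> None:
        # Apply every log record that has not been redone yet.
        self.database.extend(self.log[len(self.database):])


class PrincipalServer:
    """Toy principal: a commit succeeds only after the mirror acknowledges."""

    def __init__(self, mirror: MirrorServer) -> None:
        self.mirror = mirror
        self.log = []
        self.database = []

    def commit(self, record: str) -> bool:
        self.log.append(record)                    # write to local log
        acknowledged = self.mirror.harden(record)  # transmit to mirror and wait
        if not acknowledged:
            raise RuntimeError("mirror did not acknowledge the log record")
        self.database.append(record)
        return True                                # acknowledge to the client


mirror = MirrorServer()
principal = PrincipalServer(mirror)
principal.commit("T1: INSERT order 42")
mirror.redo()
print(mirror.database == principal.database)
```

Because the client is acknowledged only after the mirror has the log record, a failover at any moment finds every committed transaction in the mirror's log; the price is the extra round trip on each commit.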
25. Hot-add memory and CPU
SQL Server 2005 added the ability to use memory added "on
the fly"
SQL Server 2008 extends these dynamic capabilities
by allowing you to hot-add CPUs
"Hot-add" is the ability to add RAM / CPUs to the computer
while the computer is running, and then have SQL Server
start using the new hardware ONLINE
The hardware must support hot-add (of course!)
Supported only in the Enterprise Edition running on a 64-bit version of Windows
Server 2008 Datacenter / Enterprise
SQL Server does not automatically start using the new processor / memory
You need to run RECONFIGURE
Queries that are already running will not use the newly added memory / processor.
26. Hot-Add CPU: Affinity Masks
Affinity masks control which CPUs are used by SQL Server, and for
what purpose
Any affinity masks will need to be updated after hot-adding new
CPUs
If the affinity mask is set to non-zero, you will need to update it so
that SQL Server knows it can use the new CPUs.
On systems with > 32 CPUs, you will need to set the affinity64
mask to pick up the new CPUs
If you want to use the new CPUs for IO only, you must add the
relevant bits to the affinity I/O (or affinity64 I/O) mask
If questioned about affinity masks:
All zeroes means that Windows decides which CPUs are used
Non-zero: a single bit per CPU; if the bit is 1, SQL Server will use that CPU
The same bit should not be set in both the affinity AND affinity I/O masks
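The mask arithmetic can be sketched in a few lines. This only illustrates the bit manipulation described on this slide; the helper names are hypothetical, CPUs are assumed to be numbered from 0, and applying the resulting values to the server is outside the scope of the example.

```python
from typing import Iterable, Tuple


def add_cpus_to_affinity(affinity_mask: int, new_cpus: Iterable[int], io_mask: int = 0) -> int:
    """Return the affinity mask extended with the bits of hot-added CPUs."""
    if affinity_mask == 0:
        # All zeroes: Windows decides which CPUs are used, nothing to update.
        return 0
    for cpu in new_cpus:
        bit = 1 << cpu
        if io_mask & bit:
            raise ValueError(f"CPU {cpu} is already set in the affinity I/O mask")
        affinity_mask |= bit
    return affinity_mask


def split_mask(mask: int) -> Tuple[int, int]:
    """Split a mask into (CPUs 0-31, CPUs 32-63) for affinity and affinity64."""
    low = mask & 0xFFFFFFFF
    high = (mask >> 32) & 0xFFFFFFFF
    return low, high


# Hypothetical server: CPUs 0-3 in use, CPUs 4 and 5 hot-added.
updated = add_cpus_to_affinity(0b1111, [4, 5])
print(bin(updated), split_mask(updated))
```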
27. Fast Manual Failover
In High Security mode (synchronous mirroring without a witness),
manual failover is always used
In SQL Server 2005, when a failover occurs, the
database on the mirror is closed and restarted to force
recovery of the uncommitted part of the transaction log
This can greatly increase the failover time
Consider a database with hundreds of files, all of which have to be opened
to start the database
SQL Server 2008 removes this step, thus speeding up failover and
reducing the downtime it causes
28. Peer-to-Peer Topology (?)
SQL Server 2005 introduced peer-to-peer
(or "two-way") Transactional Replication
A great way to scale out the resources needed for the workload
Partly also a way to have a "redundant copy"
One major drawback: changing the peer-to-peer topology
required stopping ALL activity on the servers in the topology
In SQL Server 2008,
these restrictions have been removed (in most cases)
The peer-to-peer topology setup wizard in SSMS has also been upgraded
Switching partitions can be repeated
29. Topology Wizard
The wizard is now graphical, with drag-n-drop functionality for making topology
connections
30. SQL Server 2012 & AlwaysOn | marketing
Help reduce planned and unplanned downtime with the new
integrated high availability and disaster recovery solution, SQL Server
AlwaysOn.
Simplify deployment and management of HA requirements using
integrated configuration and monitoring tools.
Improve IT cost efficiency and performance using Active Secondary.
Reduce planned downtime with Windows Server Core.
31. SQL Server 2012 & AlwaysOn | technical
AlwaysOn Failover Cluster Instances
As part of the SQL Server AlwaysOn offering, AlwaysOn Failover Cluster Instances leverages Windows Server Failover
Clustering (WSFC) functionality to provide local high availability through redundancy at the server-instance level—a
failover cluster instance (FCI). An FCI is a single instance of SQL Server that is installed across Windows Server
Failover Clustering (WSFC) nodes and, possibly, across multiple subnets. On the network, an FCI appears to be an
instance of SQL Server running on a single computer, but the FCI provides failover from one WSFC node to another
if the current node becomes unavailable.
AlwaysOn Availability Groups
AlwaysOn Availability Groups is an enterprise-level high-availability and disaster recovery solution introduced in SQL
Server 2012 to enable you to maximize availability for one or more user databases. AlwaysOn Availability Groups
requires that the SQL Server instances reside on Windows Server Failover Clustering (WSFC) nodes.
Database mirroring
Avoid using this feature in new development work, and plan to modify applications that currently use this feature. We
recommend that you use AlwaysOn Availability Groups instead. Database mirroring is a solution to increase
database availability by supporting almost instantaneous failover. Database mirroring can be used to maintain a
single standby database, or mirror database, for a corresponding production database that is referred to as the
principal database. For more information, see Database Mirroring (SQL Server).
Log shipping
Like AlwaysOn Availability Groups and database mirroring, log shipping operates at the database level. You can use
log shipping to maintain one or more warm standby databases (referred to as secondary databases) for a single
production database that is referred to as the primary database. For more information about log shipping, see
About Log Shipping (SQL Server).
33. SLA - what does this have to do with the DBA?
Production hours:
Hours in which the partition / table / database must be available
May be different for different parts of a database, for example, depending on the
application
Percentage of service uptime:
The percentage of time within a given time range when the service / partition / table /
database is available
Hours reserved for downtime:
Downtime hours (maintenance windows) announced in advance make users' work easier
Customer support methods:
The response time from the HelpDesk
DBA response time for an event
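Uptime measured against production hours only counts the part of an outage that falls inside the agreed window. A minimal sketch of that calculation follows; the function name and the sample times are hypothetical, and the outages are assumed not to overlap each other.

```python
from datetime import datetime
from typing import Iterable, Tuple


def uptime_percent(
    window_start: datetime,
    window_end: datetime,
    outages: Iterable[Tuple[datetime, datetime]],
) -> float:
    """Percentage of the production window during which the service was up."""
    window_seconds = (window_end - window_start).total_seconds()
    if window_seconds <= 0:
        raise ValueError("the production window must have a positive length")
    down_seconds = 0.0
    for start, end in outages:
        # Count only the part of the outage inside the production window.
        overlap_start = max(start, window_start)
        overlap_end = min(end, window_end)
        if overlap_end > overlap_start:
            down_seconds += (overlap_end - overlap_start).total_seconds()
    return 100.0 * (window_seconds - down_seconds) / window_seconds


# Hypothetical day: production hours 08:00-18:00, two outages.
day = datetime(2012, 5, 14)
result = uptime_percent(
    day.replace(hour=8),
    day.replace(hour=18),
    [
        (day.replace(hour=7, minute=30), day.replace(hour=8, minute=30)),
        (day.replace(hour=20), day.replace(hour=21)),
    ],
)
print(f"{result:.2f}%")
```

In this example the evening outage does not count against the SLA at all, which is exactly why production hours and hours reserved for downtime should be written into the agreement.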
34. SLA - what does this have to do with the DBA?
Number of users on the system
Number of transactions processed per unit of time
Acceptable performance levels for access to the various operations
Minimum time required to replicate the different servers
Deadline for data recovery from failures
Accidental deletion of data
Damage to the database
SQL Server Crash
OS Server Crash
Time it takes to access data over the network (e.g. reading / writing the sales table)
so that sales can continue
Maximum amount of space
Maximum number of tables / databases
Number of users in specific roles
35. Why is the SLA so important?
In fact, it's more than just a signed agreement between the client and
your boss.
It is also a contract that YOU need to meet
If an agreement promising zero downtime and zero data loss
(an abstraction?) has been signed, you need to make sure you can fulfill
it even in the event of corruption (for example, data changed or deleted
on purpose by an authorized user).
If you cannot meet the SLA, the business is exposed to downtime
and data loss
The end result: you submit your CV to a recruitment agency ...
36. Do you think you can meet your Service Level Agreement?
To know whether you meet the SLA, you need to know its
conditions / requirements
How can you meet an SLA if you do not know that it exists?
How can you review the contract if nobody invited you to the
meeting on the creation of the Service Level Agreement?
The end result: you submit your CV to a recruitment agency ...
37. Do you think you can meet your SLA?
The recovery plan looks great on paper - but have you ever tested it?
Suppose this situation:
We allow 15 minutes of unavailability for a 100 GB database.
We are able to substitute a copy of the user database within those
15 minutes
What will you do if the database is damaged?
What will you do in the event of a disk failure?
What will you do if the motherboard burns out?
What will you do when the FC cable is cut?
How much time will it take to recover from a backup?
How much time will it take to bring the backup tapes from a second
location 25 kilometers away, through the city center at 2 p.m.?
Do you still meet the SLA of 15 minutes of downtime?
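These questions can be turned into a rough back-of-the-envelope check. The sketch below uses the 100 GB database, the 25 km distance and the 15-minute SLA from this slide; the restore throughput (100 MB/s) and the average driving speed (20 km/h) are invented assumptions that you should replace with numbers measured in your own environment.

```python
def restore_minutes(database_gb: float, restore_mb_per_second: float) -> float:
    """Minutes needed to restore a database at a given sustained throughput."""
    return database_gb * 1024.0 / restore_mb_per_second / 60.0


def transport_minutes(distance_km: float, average_speed_kmh: float) -> float:
    """Minutes needed to bring the backup media from the second location."""
    return distance_km / average_speed_kmh * 60.0


def meets_sla(total_minutes: float, allowed_downtime_minutes: float) -> bool:
    return total_minutes <= allowed_downtime_minutes


# From the slide: 100 GB database, tapes 25 km away, 15 minutes allowed.
# Assumed (hypothetical): 100 MB/s restore speed, 20 km/h in city traffic.
restore = restore_minutes(database_gb=100, restore_mb_per_second=100)
transport = transport_minutes(distance_km=25, average_speed_kmh=20)
total = restore + transport

print(f"restore: {restore:.1f} min, transport: {transport:.1f} min")
print("SLA met" if meets_sla(total, 15) else "SLA missed")
```

Under these assumptions even the restore alone exceeds the allowed downtime, before anyone has fetched a tape, which is the point of testing the plan instead of trusting it on paper.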
39. Summary
You need to know that an SLA exists
You must take part in creating the Service Level Agreement
(requirements / features / technology)
You need to have contingency plans - TESTED
You must know your
responsibilities
You must be technically able to meet the SLA