Sql azure cluster dashboard public.ppt

Agenda
• About the Project
• Lessons Learned
• Rethinking the tester’s career

Today’s Technique
Trends
• Continuous Integration
• Continuous Delivery
• Live Site First
• DevOps
• Bigdata/Hadoop
• Testing in Production
• Real Time Analysis
Please visit infoq.com
Topics covered by this talk
• Monitoring in Production
• Data Driven Quality
• Data Pipeline
• Alert

About Me
I have been on the SQL Server team for 8 years
Lucky to have always reported to great managers
Now mainly focused on Windows Azure SQL Database
I will share the lessons learned from monitoring our service through telemetry.
Blog: http://blogs.msdn.com/b/qingsongyao/
Read my test career blog

What is Windows Azure SQL Database
Windows Azure SQL Database, formerly SQL Azure, is a fully
managed relational database service that delivers flexible
manageability, includes built-in high availability, offers predictable
performance, and supports massive scale-out.
In other words, you create the server and database, and we manage them for
you to achieve HA and reliable performance at low cost.

What we had before
• Flexible tools to get service status in real time through a rich API
• Rich telemetry data exposed in different ways:
o PerfStore stores all perf counters
o OpStore stores all operation records
o Cluster Manager contains state for machines, watchdogs and alerts
The Problem
o Hard to correlate data from different sources
o No separation of telemetry data from customer data
o Very hard to write and deploy new telemetry and alerts

Project Overview
• Problem Statement
We have a lot of data, but lack an easy way to retrieve and present it.
• What is the project about?
We want a central place to display real-time information about all clusters.
• What is the goal of this project?
o Help people get insight into the service
o Effectively detect production issues and help people solve them quickly
o Help analyze issues in depth and understand the root cause

Architecture
[Diagram labels: SQL Azure Clusters (GPM, LPM, MSDB, PerfStore, OpStore, …); Multi Thread PowerShell Data Collection Agent; Data Collector; Command Gateway; Dashboard DW; Dashboard Report; Incident Response Team; raise alerts; SQL Azure and IASS]

Business Value
Trend Analysis
• Use the dashboard during live site incidents
• Drive repair items and feature planning
Internal Monitoring Alerts
• From reactive to proactive
• Reduce issue detection and mitigation time
Monitoring Testing Clusters
• All A1 clusters are monitored

On 3/8 at 7 PM UTC, a couple of machines went down and 150 DBs were impacted;
we started using the dashboard to monitor the recovery progress.
Availability Trend

Bug # | Assert | Count
1229076 | "Assert Assert Failed: Stack: at System.Environment.GetStackTrace(Exception | 1
 | "Assert Assert Failed: ClientId: 00000000-0000-0000-0000-000000000000 NodeInfo: | 66
1229073 | "Assert Assert Failed: Incoming epoch 0-130072965305395635-6f6f103456a59b4f4a44d | 71
1192590 | "Assert Assert Failed: PartitionId <App>dbo</App><TG>UserDb</TG><Lo>0x8000000000 | 85806
1228173 | "FabricUnhandledException System.ArgumentException: Illegal characters in path. | 2
1229079 | "FabricUnhandledException System.ComponentModel.Win32Exception (0x80004005) | 11
1229087 | "FabricUnhandledException System.Data.Fabric.Common.AsyncCallbackException: | 9
1224236 | "FabricUnhandledException System.InsufficientMemoryException: Insufficient winso | 818
1229089 | "FabricUnhandledException System.IO.FileNotFoundException: Could not find file ' | 5
1226404 | "FabricUnhandledException System.IO.IOException: The process cannot access the f | 9
1228178 | "FabricUnhandledException System.NullReferenceException: Object reference not se | 2
1229081 | "FabricUnhandledException System.ObjectDisposedException: Cannot access a dispos | 1
1229084 | "FabricUnhandledException System.Runtime.CallbackException: Async Callback threw | 3
When we have an outage in one cluster, we scan all exceptions and measure the
potential impact on other clusters.
Incident
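The cross-cluster scan described on this slide could be sketched roughly as follows. This is only an illustration: the record layout, cluster names, and the `exposure` helper are assumptions made for the example, not the dashboard's actual schema or code.

```python
from collections import Counter, defaultdict

# Hypothetical exception records as (cluster, exception signature) pairs.
# Cluster names and signatures are invented for illustration.
records = [
    ("cluster-a", "FabricUnhandledException System.NullReferenceException"),
    ("cluster-a", "Assert Failed: PartitionId"),
    ("cluster-b", "Assert Failed: PartitionId"),
    ("cluster-b", "Assert Failed: PartitionId"),
    ("cluster-c", "FabricUnhandledException System.IO.IOException"),
]


def exposure(records, outage_cluster):
    """For each signature seen in the outage cluster, count its hits in every other cluster."""
    outage_signatures = {sig for cluster, sig in records if cluster == outage_cluster}
    impact = defaultdict(Counter)
    for cluster, sig in records:
        if cluster != outage_cluster and sig in outage_signatures:
            impact[sig][cluster] += 1
    return {sig: dict(counts) for sig, counts in impact.items()}


# Which other clusters are already hitting the exceptions seen in cluster-a?
print(exposure(records, "cluster-a"))
```

Signatures that appear only in the outage cluster drop out of the result, so the output lists just the clusters that may be exposed to the same issue.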

SE Repl LCK_M_X Hit Per Cluster
Trend Analysis and Prediction

New Alert based on dashboard
The original goal was to collect real-time cluster information in a
dashboard
It quickly turned into a very important way to alert on and
resolve live site issues
Highlights
• Data lag is usually less than 10 minutes
• Data is aggregated at a central DW
• Writing and deploying a new alert takes hours
• We can watch and tune an alert at any time
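As a rough illustration of the data lag highlight, a check on the pipeline's own freshness might look like the sketch below. The 10-minute threshold comes from the slide; the function names, the message format, and the timestamps are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the slide: data lag is usually under 10 minutes.
LAG_THRESHOLD = timedelta(minutes=10)


def data_lag(latest_record_time, now):
    """Time elapsed between the newest record in the DW and the current time."""
    return now - latest_record_time


def lag_alert(latest_record_time, now, threshold=LAG_THRESHOLD):
    """Return an alert message when the lag exceeds the threshold, else None."""
    lag = data_lag(latest_record_time, now)
    if lag > threshold:
        lag_min = int(lag.total_seconds() // 60)
        limit_min = int(threshold.total_seconds() // 60)
        return f"Data lag {lag_min} min exceeds {limit_min} min"
    return None


# Hypothetical timestamps, fixed so the example is repeatable.
now = datetime(2000, 1, 1, 12, 0, tzinfo=timezone.utc)
print(lag_alert(now - timedelta(minutes=25), now))  # stale data: returns a message
print(lag_alert(now - timedelta(minutes=5), now))   # fresh data: returns None
```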

From Passive to Reactive and to Predictive
What happened yesterday:
• Customers notified us that we had an outage.
• Every day we only looked at issues that happened in the past.
What happens today with the assistance of the dashboard:
• You always know what is happening in a cluster now.
• You notice a live site issue as soon as it happens.
• You have enough information to troubleshoot.

Long Term Alert Process
Monitoring data generators
• SAWA
• Autopilot
• MDS
• Internal customers
• Real-time log parsing
• (No alerts fire at this stage.)
Automatic data aggregation
• Filter noisy data
• Align data by time series
• Enable cross-domain/dimension analysis
Automatic issue detection
• Based on the cluster health model
• Built-in knowledge of issue diagnostics (replaces TSG)
• Heuristic and statistical models
Fast and accurate solutions for issues
• Greatly reduce false failures
• Root causes are correctly identified
• Issues are routed to the right team
• Auto-health support will be built into the system
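The aggregation and detection stages above could be sketched along these lines. This is a minimal example under stated assumptions: the `(timestamp, source, value)` sample layout, the one-minute bucket size, and the simple standard-deviation rule are all illustrative choices, not the models the talk refers to.

```python
from collections import defaultdict
from statistics import mean, pstdev


def align(samples, bucket_seconds=60):
    """Group (timestamp, source, value) samples into fixed time buckets.

    Values from the same source that fall in the same bucket are averaged,
    so data from different sources can be compared on one time axis.
    """
    buckets = defaultdict(dict)
    for ts, source, value in samples:
        bucket = ts - ts % bucket_seconds
        buckets[bucket].setdefault(source, []).append(value)
    return {
        bucket: {source: mean(values) for source, values in sources.items()}
        for bucket, sources in sorted(buckets.items())
    }


def outliers(series, threshold=2.0):
    """Return indexes of points more than `threshold` standard deviations from the mean."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return []
    return [i for i, x in enumerate(series) if abs(x - mu) > threshold * sigma]


# Hypothetical samples: integer timestamps in seconds, invented source names.
samples = [(0, "perf", 1.0), (30, "perf", 3.0), (61, "ops", 5.0)]
print(align(samples))
print(outliers([10, 10, 10, 10, 10, 10, 10, 10, 10, 100]))
```

A fixed threshold like this is the simplest possible statistical rule; a health model with built-in diagnostic knowledge would sit on top of aligned data of this kind.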

Lessons Learned from Building a Data Pipeline

Choosing the right technique is important
o You don’t necessarily need Hadoop to process a large
amount of data.
o Latency does matter: the faster you can get the data,
the more valuable it is.
o Allow others to quickly author and consume your
data.

Build resilience into your data pipeline
o The flow of one kind of data should not impact any
other flow
o Build retry logic into your data flow
o Always assume that your data flow can
fail, and allow the same flow to be reprocessed
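The three resilience points can be illustrated with a small sketch. Everything here is hypothetical: the function names, the in-memory `store`, and the batch key are stand-ins chosen for the example, not the dashboard's implementation.

```python
import time


def run_with_retry(step, attempts=3, delay_seconds=0.0):
    """Call step() up to `attempts` times and re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)


def load_batch(store, batch_id, rows):
    """Idempotent load: rows are keyed by batch_id, so reprocessing overwrites instead of duplicating."""
    store[batch_id] = list(rows)


def run_flows(flows):
    """Run each flow independently so one failing flow does not stop the others."""
    failures = {}
    for name, step in flows.items():
        try:
            run_with_retry(step)
        except Exception as exc:
            failures[name] = exc
    return failures


# Demo with two hypothetical flows: one fails twice and then succeeds, one always fails.
store = {}
calls = {"perf": 0}


def perf_flow():
    calls["perf"] += 1
    if calls["perf"] < 3:
        raise RuntimeError("transient failure")
    load_batch(store, "batch-1", [1, 2])


def ops_flow():
    raise ZeroDivisionError("permanent failure")


failures = run_flows({"perf": perf_flow, "ops": ops_flow})
load_batch(store, "batch-1", [1, 2])  # reprocessing the same batch leaves one copy
print(store, list(failures))
```

Keying each load by a batch identifier is one simple way to make reprocessing safe; an upsert or delete-then-insert in the warehouse would serve the same purpose.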

Monitoring your pipeline
• Data processing time
• Data processing error frequency
• Performance of your database
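One possible way to track the first two measurements, processing time and error frequency, is sketched below. The `FlowMonitor` class and its window size are assumptions made for illustration; database performance would be measured separately on the database itself.

```python
import time
from collections import deque


class FlowMonitor:
    """Keep processing time and success/failure for the most recent runs of a flow."""

    def __init__(self, window=100):
        self.durations = deque(maxlen=window)  # seconds per run
        self.outcomes = deque(maxlen=window)   # True for success, False for failure

    def record(self, step):
        """Run step(), recording how long it took and whether it raised."""
        start = time.perf_counter()
        try:
            result = step()
            self.outcomes.append(True)
            return result
        except Exception:
            self.outcomes.append(False)
            raise
        finally:
            self.durations.append(time.perf_counter() - start)

    def error_rate(self):
        """Fraction of recorded runs that failed."""
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def slowest(self):
        """Longest recorded processing time in seconds."""
        return max(self.durations, default=0.0)


# Demo: one successful run and one failing run of a hypothetical flow.
monitor = FlowMonitor()
monitor.record(lambda: "ok")
try:
    monitor.record(lambda: 1 / 0)
except ZeroDivisionError:
    pass
print(monitor.error_rate(), len(monitor.durations))
```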

How we run Cluster Dashboard
• DevOps model:
o New changes need to pass unit tests
o Deployed to the test cluster dashboard for a couple of hours
o Xcopy deployment to production on demand
• HA and monitoring built in
o Backup collector machines and DW machines
o DW has a daily full backup and hourly incremental backups
o Measure, monitor and alert on both collector and DW machines
• Data size and performance
o Key tables and queries are extensively tuned for better performance
o Data retention policy applied to several tables

Rethinking the tester’s career

What am I doing every day?
• 0% writing tests
• 0% sign-off
• 0% test planning
• 0% on test lab
• 60% monitoring production
• 40% learning and thinking

Key Takeaway
• Data visualization is needed for people to understand
the data.
• Your telemetry/big data project should drive actions,
instead of only providing data.
• It takes time and resources to build a data pipeline,
and building one is a fun learning process.
• Alerts have a life cycle as well.
