1. Agenda
• About the Project
• Lessons Learned
• Rethinking the tester’s career
2. Today’s Technique
Trends
• Continuous Integration
• Continuous Delivery
• Live Site First
• DevOps
• Big Data/Hadoop
• Testing in Production
• Real-Time Analysis
Please visit infoq.com
Topics covered by this talk
• Monitoring in Production
• Data Driven Quality
• Data Pipeline
• Alert
3. About Me
Has been on the SQL Server team for 8 years
Lucky to have always reported to great managers
Now mainly focused on Windows Azure SQL Database
I will share the lessons learned from monitoring our service through telemetry.
Blog: http://blogs.msdn.com/b/qingsongyao/
Read my test career blog
4. What is Windows Azure SQL Database
Windows Azure SQL Database, formerly SQL Azure, is a fully
managed relational database service that delivers flexible
manageability, includes built-in high availability, offers predictable
performance, and supports massive scale-out.
In other words, you create the server and database, and we manage them for
you to achieve high availability (HA) and reliable performance at low cost.
5. What we had before
• Flexible tools to get service status in real time through a rich API
• Rich telemetry data exposed in different ways:
o PerfStore stores all perf counters
o OpStore stores all operation records
o Cluster Manager contains state for machines, watchdogs and alerts
The Problem
o Hard to correlate data from different sources
o No separation of telemetry data from customer data
o Very hard to write and deploy new telemetry and alerts
6. Project Overview
• Problem Statement
We have a lot of data, but lack an easy way to retrieve and present it.
• What is the project about?
We want a central place to display real-time information about all clusters.
• What is the goal of this project?
o Help people get insight into the service
o Effectively detect production issues and help people solve them quickly
o Help analyze issues in depth and understand the root cause
8. Business Value
Trend Analysis
• Use the dashboard for live site incidents
• Drive repair items and feature planning
Internal Monitoring Alert
• From reactive to proactive
• Reduce issue detection and mitigation time
Monitoring Testing Clusters
• All A1 clusters are monitored
10. On 3/8 at 7 PM UTC, a couple of machines went down and 150 DBs were impacted.
We started using the dashboard to monitor the recovery progress.
Availability Trend
11. Bug # | Assert | Count
1229076 | "Assert Assert Failed: Stack: at System.Environment.GetStackTrace(Exception | 1
 | "Assert Assert Failed: ClientId: 00000000-0000-0000-0000-000000000000 NodeInfo: | 66
1229073 | "Assert Assert Failed: Incoming epoch 0-130072965305395635-6f6f103456a59b4f4a44d | 71
1192590 | "Assert Assert Failed: PartitionId <App>dbo</App><TG>UserDb</TG><Lo>0x8000000000 | 85806
1228173 | "FabricUnhandledException System.ArgumentException: Illegal characters in path. | 2
1229079 | "FabricUnhandledException System.ComponentModel.Win32Exception (0x80004005) | 11
1229087 | "FabricUnhandledException System.Data.Fabric.Common.AsyncCallbackException: | 9
1224236 | "FabricUnhandledException System.InsufficientMemoryException: Insufficient winso | 818
1229089 | "FabricUnhandledException System.IO.FileNotFoundException: Could not find file ' | 5
1226404 | "FabricUnhandledException System.IO.IOException: The process cannot access the f | 9
1228178 | "FabricUnhandledException System.NullReferenceException: Object reference not se | 2
1229081 | "FabricUnhandledException System.ObjectDisposedException: Cannot access a dispos | 1
1229084 | "FabricUnhandledException System.Runtime.CallbackException: Async Callback threw | 3
When one cluster has an outage, we scan all exceptions and measure the
potential impact on other clusters.
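The cross-cluster scan described above can be sketched roughly as follows. This is a minimal illustration, not the dashboard's actual implementation: the record shape `(cluster, message)`, the 40-character signature, and the function names are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def signature(message, length=40):
    """Reduce an exception message to a coarse signature (hypothetical rule:
    its first `length` characters, so variable details at the end are ignored)."""
    return message.strip()[:length]


def impact_by_cluster(records):
    """Count occurrences of each exception signature per cluster.

    `records` is an iterable of (cluster_name, message) pairs.
    """
    counts = defaultdict(Counter)
    for cluster, message in records:
        counts[signature(message)][cluster] += 1
    return counts


def clusters_sharing(counts, outage_cluster):
    """For every signature seen in the outage cluster, return the other
    clusters that also hit it, with their counts."""
    shared = {}
    for sig, per_cluster in counts.items():
        if outage_cluster in per_cluster:
            others = {c: n for c, n in per_cluster.items() if c != outage_cluster}
            if others:
                shared[sig] = others
    return shared
```

Grouping by a coarse signature rather than the full message keeps exceptions that differ only in IDs or paths in the same row, which is what a per-bug count like the table above needs.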
Incident
12. SE Repl LCK_M_X Hit Per Cluster
Trend Analysis and Prediction
14. New Alert based on dashboard
The original goal was to collect real-time cluster information in a
dashboard.
It quickly turned into a very important way to alert on and
resolve live site issues.
Highlights
• Data lag is usually less than 10 minutes
• Data is aggregated at a central DW
• Writing and deploying a new alert takes only hours
• We can watch and tune an alert at any time
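A data-lag figure like the one above is easy to turn into an alert of its own. The sketch below assumes the newest row's timestamp is available from the DW; the function names and the way that timestamp is obtained are hypothetical, and only the 10-minute figure comes from the slide.

```python
from datetime import datetime, timedelta, timezone


def data_lag(latest_row_time, now=None):
    """Return how far the newest aggregated row is behind the current time."""
    now = now or datetime.now(timezone.utc)
    return now - latest_row_time


def lag_alert(latest_row_time, threshold=timedelta(minutes=10), now=None):
    """Return True when the lag exceeds the threshold (default: 10 minutes)."""
    return data_lag(latest_row_time, now) > threshold
```

Passing `now` explicitly keeps the check testable and lets the threshold be tuned per data source without changing the code.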
16. From Passive to Reactive and to Predictive
What happened yesterday:
• Customers notified us that we had an outage.
• Every day we only looked at issues that had happened in the past.
What happens today with the help of the dashboard:
• You always know what is happening in a cluster right now.
• You notice a live site issue as soon as it happens.
• You have enough information to troubleshoot.
17. Long Term Alert Process
Monitoring Data Generators
• SAWA
• Autopilot
• MDS
• Internal Customer
• Real-time log parsing
• (No alerts fire at this stage.)
Automatic Data Aggregation
• Filter noisy data
• Align data by time series
• Enable cross-domain/dimension analysis
Automatic Issue Detection
• Based on the cluster health model
• Built-in knowledge of issue diagnostics (replaces TSG)
• Heuristic and statistical models
Fast and Accurate Solution for Issues
• Greatly reduce false failures
• Root causes are correctly identified
• Route to the right team
• Auto-health support will be built into the system
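The aggregation and detection stages can be illustrated with a small sketch. It assumes samples arrive as `(timestamp_seconds, value)` pairs and uses a simple "mean plus k standard deviations" rule as a stand-in for the statistical models; the bucket width, the rule, and all names are assumptions for the example, not the team's actual health model.

```python
from collections import defaultdict
from statistics import mean, pstdev


def align(samples, bucket_seconds=300):
    """Sum (timestamp_seconds, value) samples into fixed-width time buckets,
    so data from different sources shares one time axis."""
    buckets = defaultdict(float)
    for ts, value in samples:
        buckets[ts - ts % bucket_seconds] += value
    return dict(sorted(buckets.items()))


def detect_spikes(series, min_history=3, k=3.0):
    """Flag buckets whose value exceeds mean + k * stddev of all earlier buckets.

    `series` maps bucket start time to value, in time order.
    """
    flagged = []
    points = list(series.items())
    for i in range(min_history, len(points)):
        history = [value for _, value in points[:i]]
        threshold = mean(history) + k * pstdev(history)
        if points[i][1] > threshold:
            flagged.append(points[i][0])
    return flagged
```

Requiring a minimum history before flagging anything is one cheap way to cut false alerts while a new metric is still warming up.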
19. Choosing the right technique is important
o You don’t necessarily need Hadoop to process a large
amount of data.
o Latency matters: the faster you can get the data,
the more valuable it is.
o Allow others to quickly author and consume your
data.
20. Build resilience into your data pipeline
o The flow of one kind of data does not impact any
other flow
o Build retry logic into your data flows
o Always assume that a data flow can fail,
and allow the same flow to be reprocessed
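The three points above (isolated flows, retry, safe reprocessing) might look like this in miniature. Everything here is a hypothetical sketch: flows are plain callables returning `{key: row}`, and the "store" is an in-memory dict standing in for the DW.

```python
import time


def run_with_retry(step, attempts=3, delay_seconds=0.0):
    """Call `step()` up to `attempts` times; re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)


def run_flows(flows, store, attempts=3):
    """Run each named flow independently so one failure does not stop the others.

    Each flow returns {key: row}. Rows are written by key, so reprocessing
    the same flow overwrites existing rows instead of duplicating them.
    Returns {flow_name: exception} for flows that failed after all retries.
    """
    failures = {}
    for name, step in flows.items():
        try:
            rows = run_with_retry(step, attempts=attempts)
        except Exception as exc:
            failures[name] = exc
            continue
        store.setdefault(name, {}).update(rows)
    return failures
```

Keying every write is what makes "just rerun the flow" a safe recovery action after a failure.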
21. Monitoring your pipeline
• Data processing time
• Data processing error frequency
• Performance of your database
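The first two measurements above can be captured with a thin wrapper around each processing step. This is a rough sketch with hypothetical names (`FlowStats`, `timed_run`); database performance would come from the database's own counters and is not covered here.

```python
import time
from dataclasses import dataclass, field


@dataclass
class FlowStats:
    """Running totals for one data flow."""
    runs: int = 0
    errors: int = 0
    durations: list = field(default_factory=list)

    @property
    def error_rate(self):
        """Fraction of runs that raised an exception."""
        return self.errors / self.runs if self.runs else 0.0

    @property
    def max_duration(self):
        """Longest processing time seen so far, in seconds."""
        return max(self.durations, default=0.0)


def timed_run(stats, step, clock=time.monotonic):
    """Run `step()`, recording its duration and whether it failed."""
    start = clock()
    stats.runs += 1
    try:
        return step()
    except Exception:
        stats.errors += 1
        raise
    finally:
        stats.durations.append(clock() - start)
```

Because the duration is recorded in `finally`, slow failures show up in the timing data as well as in the error count.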
22. How we run the Cluster Dashboard
• DevOps model:
o New changes need to pass unit tests
o Deployed to the test cluster dashboard for a couple of hours
o Xcopy deployment to production on demand
• HA and monitoring built in
o Backup collector and DW machines
o DW has a daily full backup and hourly incremental backups
o Measure, monitor, and alert on both collector and DW machines
• Data size and performance
o Key tables and queries are extensively tuned for better performance
o Data retention policies are applied to several tables
24. What am I doing every day?
• 0% writing tests
• 0% sign-off
• 0% test planning
• 0% on test lab
• 60% monitoring production
• 40% learning and thinking
25. Key Takeaway
• Data visualization is needed for people to understand
the data.
• Your telemetry/big data project should drive actions,
instead of only providing data.
• It takes time and resources to build a data pipeline,
and building one is a fun learning process.
• Alerts have a life cycle as well.