4. 1 / The need for diagnostic data in cloud applications
2 / Data we can monitor
3 / Using the Microsoft Azure Diagnostic Agent
4 / Real-world guidance for troubleshooting Microsoft Azure apps
10. Resolution
o Step 0 – Enable Azure diagnostics
• Set key performance counters
o Step 1 – Add logging statements around key functionality
• Especially external services
o Step 3 – Test, test, test
o Step 4 – Analyze
o Step 5 – Fix it
Scenario
17. Diagnostic Item                       Table Name                            Blob Container Name
Windows Event Logs                        WADWindowsEventLogsTable
Performance Counters                      WADPerformanceCountersTable
Trace Log Statements                      WADLogsTable
Azure Diagnostic Infrastructure Logs      WADDiagnosticInfrastructureLogsTable
Custom Logs (e.g. log4net, NLog, etc.)                                          <custom>
IIS Logs                                  WADDirectoriesTable*                  wad-iis-logfiles
IIS Failed Request Logs                   WADDirectoriesTable*                  wad-iis-failedreqlogfiles
Crash Dumps                               WADDirectoriesTable*
* The location of the blob log file is specified in the Container field and the name of the blob in the RelativePath field. The AbsolutePath field contains the name of the file as it existed on the role instance.
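To make the footnote concrete, here is a minimal Python sketch of turning one WADDirectoriesTable row into the URL of its blob. It assumes the row has already been read into a dict keyed by field name and that the storage account uses the public-cloud blob endpoint; the account name and row values in the usage note are made up for illustration.

```python
from urllib.parse import quote


def blob_url(account: str, row: dict) -> str:
    """Build the blob URL for one WADDirectoriesTable row.

    Assumes `row` is a dict holding the Container and RelativePath
    fields described in the footnote (a hypothetical in-memory shape).
    """
    container = row["Container"]
    # Drop any leading slash so the URL has exactly one separator.
    relative_path = row["RelativePath"].lstrip("/")
    # quote() keeps "/" intact and escapes spaces and other unsafe characters.
    return (
        f"https://{account}.blob.core.windows.net/"
        f"{container}/{quote(relative_path)}"
    )
```

Usage, with hypothetical values: `blob_url("mystorage", {"Container": "wad-iis-logfiles", "RelativePath": "deployment1/WebRole1/u_ex14.log"})`. Reading a private blob still needs credentials or a shared access signature.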
26. Instruct WAD to transfer specific data sources to storage
Overwrites current diagnostic configuration
http://msdn.microsoft.com/en-us/library/gg433075.aspx
31. Compute node resource usage
Windows Event logs
Database queries response times
Application specific exceptions
Database connection & cmd failures
Microsoft Azure Storage Analytics
Process for Azure hosted solutions is not that different from traditional, on-premises solutions.
Successful projects share one common trait . . . Not what you might think
Latest hot language
Hot platform
Smartest people
Agile vs. waterfall
Money
The #1 problem I see over & over
Multiple servers – more difficult to handle
Keep locally? Hard
What if a server dies?
Need a central location
Configure in Visual Studio
Show the declarative way
Show where the file is located – bin and root for Web and Worker respectively (D:\Projects\Demos\JustAzure\AzureDiagnostics\AzureDiagnostics\csx\Release\roles\WorkerRole1\approot)
Show in storage – show using AMS.
Show the file in blob storage (pghtechfest14)
Show viewing data in Visual Studio
Show LinqPad
Show AMS
Log all calls to external services
Include as much detail as possible (destination, method, timing info, result, etc.)
Log details of transient faults
Number of retry actions
Cause of the fault
Did the application fail over to a secondary instance?
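A rough sketch of what this logging could look like, in Python for brevity. The `TransientError` class, the `call_with_logging` helper, and the log field names are all hypothetical and not part of any Azure SDK; the point is only that each call records destination, method, timing, result, retry count, and the cause of each fault.

```python
import logging
import time

log = logging.getLogger("external_calls")


class TransientError(Exception):
    """Hypothetical marker for faults worth retrying (e.g. throttling)."""


def call_with_logging(destination, method, func, max_retries=3, delay=0.0):
    """Run func(), retrying transient faults and logging every attempt."""
    retries = 0
    start = time.perf_counter()
    while True:
        try:
            result = func()
        except TransientError as exc:
            if retries >= max_retries:
                # Out of retries: record the final failure and re-raise.
                log.error("dest=%s method=%s result=failed retries=%d cause=%r",
                          destination, method, retries, exc)
                raise
            retries += 1
            # Record the cause of the fault and which retry comes next.
            log.warning("dest=%s method=%s transient_fault=%r retry=%d",
                        destination, method, exc, retries)
            time.sleep(delay)
        else:
            elapsed_ms = (time.perf_counter() - start) * 1000
            log.info("dest=%s method=%s result=ok retries=%d elapsed_ms=%.1f",
                     destination, method, retries, elapsed_ms)
            return result
```

Any other exception type is treated as non-transient here and propagates immediately without a retry. A failover flag could be added to the same log line if the application has a secondary instance.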
Detect an emerging problem!
Partition telemetry data by date (or hour) – reduce impact of data aggregation or reporting
Use a different storage account!
Remove old / non-relevant telemetry data
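One possible way to partition by hour and find old partitions to purge, sketched in Python. The key format (`YYYYMMDDHH` in UTC), the 30-day default, and both function names are assumptions for illustration; timestamps are assumed to be timezone-aware.

```python
from datetime import datetime, timedelta, timezone


def partition_key(ts: datetime) -> str:
    """Return an hourly partition key such as '2014060713' (UTC).

    Assumes `ts` is timezone-aware.
    """
    return ts.astimezone(timezone.utc).strftime("%Y%m%d%H")


def expired_partitions(keys, now: datetime, keep_days: int = 30):
    """Return the partition keys older than the retention window.

    Keys are fixed-width digit strings, so plain string comparison
    orders them chronologically.
    """
    cutoff = partition_key(now - timedelta(days=keep_days))
    return sorted(k for k in keys if k < cutoff)
```

Because each hour lands in its own partition, a report over one day touches 24 partitions rather than scanning everything, and the purge only needs the list of keys.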
Detect before issues impact your users
Poll data sources, monitor, and alert
Centralized repository
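A minimal sketch of the monitor-and-alert step, in Python. It assumes recent samples have already been pulled from the central repository into a dict of counter name to values; the counter names, thresholds, and the `check_thresholds` helper are hypothetical.

```python
def check_thresholds(samples, thresholds):
    """Return (counter, average, limit) for every counter over its limit.

    `samples` maps a counter name to a list of recent numeric values;
    `thresholds` maps a counter name to its alert limit.
    """
    alerts = []
    for counter, limit in thresholds.items():
        values = samples.get(counter, [])
        if not values:
            # No data for this counter in the polling window; skip it.
            continue
        average = sum(values) / len(values)
        if average > limit:
            alerts.append((counter, average, limit))
    return alerts
```

Run on a schedule, the returned list can feed whatever alerting channel is in place (email, pager, dashboard). Averaging over a window rather than alerting on a single sample is one design choice to avoid noise from short spikes.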
Transient vs. Systemic
Transient: SQL Database throttling
Systemic: bug in the code; no retries will fix
Recover First
The right data can help speed up the process . . . even with Microsoft support
Your problem . . . Your solution
Root Cause Analysis
What, why, and how to fix going forward
We Don’t Know What We Don’t Know
Incredibly hard to find problems solely by looking at code
Preemptive vs. Reactionary
Regular analysis of telemetry data can help find problems before they become severe.
Recovery & Root Cause Analysis
What is failing?
Are we making it better or worse?
What caused this problem in the first place?
http://blogs.msdn.com/b/windowsazure/archive/2012/03/09/summary-of-windows-azure-service-disruption-on-feb-29th-2012.aspx