CONNECT THE DOTS
http://www.fotopedia.com/items/flickr-3247053188
David Shafer ● UI Tech Forum ● May 30, 2013
Schaefer
Schaeffer
Schafer
Schaffer
Shafer
Shaffer
Shafner (Not Really An Option)
This One
1995: My first numeric pager
1999: Upgraded to
alphanumeric (fancy!)
Now: iPhone, PagerDuty,
push notifications, e-mail alerts
Tomorrow: Nagios for Google Glass?
http://www.last.fm/music/Foreigner/+images/81717445
IT’S ALL URGENT
DON’T BE A HERO
Only You Can Prevent Active Directory Forest Fires
http://www.flickr.com/photos/navalsurfaceforces/5552828713/
FIX ALERTING FIRST
[Bar chart: Storage Services Alertable Incidents (June 2012–May 2013). Quarters: Jun–Aug, Sep–Nov, Dec–Feb, Mar–May. Data labels: 86, 74, 52, 26. Vertical axis: 0–100.]
CATALOG YOUR SERVICES
(Models Optional)
http://www.flickr.com/photos/telstar/2131982712/
INFORMATION
NEEDS
CONTEXT
http://www.flickr.com/photos/lrargerich/2984777106/
PLATFORM MONITORING TOOLS:
UNIVERSALLY BAD
MOST OPEN SOURCE TOOLS:
NOT MUCH BETTER
SOLUTION:
BUILD YOUR OWN
MONITOR
WHAT THE
USER
CARES ABOUT
TELL A STORY
http://www.flickr.com/photos/sanjoselibrary/3812501631/
KEEP THE STORY GOING…
Define
(and Refine)
Your Services
Monitor
Relevant
Metrics
Communicate
the Good with
the Bad
David S. Shafer
ITS Storage Services
The University of Iowa
david-shafer@uiowa.edu
DavidScottShafer
@DavidSShafer
QUESTIONS?

Connect the Dots: Monitoring, Metrics, and Alerting

Editor's Notes

  1. My name is Dave Shafer, and I’m going to talk about a bunch of random things. We’ll see if we can find some connection between them. (I’m serious… This is pretty random.)
  2. There are many ways to spell my last name. You can see the one I use, though I’ll respond to any of them. I’ve received mail for all of these at one time or another. (If we’re being honest, I don’t think “Shafner” is really an option, but that didn’t stop someone using that variation once.) I managed to snag david.shafer@gmail.com early on, during the decade when Gmail was considered Beta. Now all the other David Shafers with Gmail accounts can’t remember their addresses, so when they give somebody the wrong address, I end up getting their mail. I’ve received e-mail for a state legislator in Georgia, a guy who works in craft services in Hollywood, and a racist amateur pilot in Texas. I work with the ITS Storage Services group along with Susanne Branson, Seth Clarke, and Mark Weber. Our group is responsible for 1.5 petabytes of disk storage for applications you use every day, from personal file storage to ICON and institutional databases. I started in ITS 14 years ago as a Unix systems administrator, and 7 years ago I moved from Unix administration to storage services. For the past 14 years, and 4 years before that, I’ve been on-call.
  3. My first pager in 1995 was a numeric pager. It couldn’t even do text. Then I upgraded to an alphanumeric pager when I started in ITS in 1999. I thought that was pretty fancy, and I should’ve stopped there, because sometime later I got a Motorola RAZR and figured out how to check my e-mail on it, and it was all downhill from there. Now I’ve got my iPhone and push notifications from PagerDuty and HawkAlerts and I’m pretty much always tethered to the University. This scenario will sound familiar to many of you: You’re sound asleep, burrowed under the covers getting some well-deserved rest, when it happens. Your phone is trying to get your attention.
  4. (If you’re like me, your phone is repeating the chorus from Foreigner’s 1981 hit, “Urgent”.) You squint to see the time and think, “What could possibly be broken on a Thursday night at 3 a.m.?” But you already know the answer, and it’s bad. You instantly go into troubleshooter mode. Because you are the fixer. You’re going to fix this. You’ll wrestle with servers, you’ll tame wild processes, you’ll beat HTML forms into submission. You will emerge victorious, and you will be the hero.
  5. The problem is that being a hero all the time takes a toll. As budgets get leaner, and systems get more complex, it gets harder to keep up with all the fires. The basic dysfunction is that we spend so much time reacting to incidents, we don’t have time to manage our systems proactively, which would prevent the incidents from happening in the first place and save time in the long run. We don’t notice the application that’s been slower for the past month. We don’t see the disks that are filling up twice as fast. We’re fighting fires when we should have been looking for smoke. The worst part is that we know we’re doing it, but we can’t break out of the cycle. In our data centers, we monitor for smoke in a literal sense. But we don’t always monitor for the metaphorical smoke. We have access to so much data, but we don’t use it to make better decisions _or_ to give clarity to the rest of the organization. Today the I.T. hero isn’t the person who puts out the fires. The hero is the person who uses complex data in clever ways to make plans, to explain things to others, and to prevent the fires from happening. The hero is the person who can look at all that data and help themselves and others to connect the dots. (See how I worked the title in there?)
  6. As a first step, you have to take control of incident response. The emergencies will happen. You can’t stop them altogether, but you can make them less painful and start to understand the size of the problem. When I first started in ITS, we used (and still use) a system called Spong, or “Son of Pong” (“pong” being the only appropriate response to a “ping”). That worked great for servers, but not for proprietary storage systems. When a storage system has a problem, it expects to send an e-mail. If you’re lucky, you can get it to e-mail you a text message. As we added more storage systems, this model wasn’t really sustainable. One of the most important things we did was move the alerting function to a service called PagerDuty (www.pagerduty.com), which we started using in February 2012. All of our systems send their alerts to PagerDuty’s servers. Through the PagerDuty web interface, we’re able to define groups of systems, on-call schedules and overrides, escalation policies, and notification methods. When an incident happens, PagerDuty decides who should be notified based on the schedule. It’s not free – the cost is about $16/month per system administrator – but it’s been worth every penny.
  7. This is a graph of incidents received by PagerDuty for the past year. Even as our storage systems have continued to grow, both in size and complexity, the number of alertable incidents has actually decreased – because now we’re better able to 1) filter out the noise, and 2) identify the recurring problems – things that need a long-term solution, instead of a quick fix. PagerDuty has worked so well for us, we’ve expanded it to two other groups in ITS: Core I.T. Facilities, and the DNA Team. (Educational discounts are available, so talk to me if you’d like to try it out.)
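As a minimal sketch of the "identify the recurring problems" step, one could group an incident log by a key and flag the keys that keep coming back. The record fields (`service`, `summary`), the sample data, and the threshold of three are hypothetical, chosen only for illustration; this is not PagerDuty's API or export format.

```python
from collections import Counter


def recurring_incidents(incidents, min_count=3):
    """Return (key, count) pairs for incident keys seen at least min_count times.

    Each incident is a dict with hypothetical 'service' and 'summary' fields.
    Results are ordered from most to least frequent.
    """
    counts = Counter((i["service"], i["summary"]) for i in incidents)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Hypothetical sample log: one problem repeats, the others are one-offs.
incidents = [
    {"service": "nas01", "summary": "volume nearly full"},
    {"service": "san02", "summary": "controller failover"},
    {"service": "nas01", "summary": "volume nearly full"},
    {"service": "nas03", "summary": "replication lag"},
    {"service": "nas01", "summary": "volume nearly full"},
]

for (service, summary), count in recurring_incidents(incidents):
    print(f"{service}: {summary} ({count} times)")
```

Anything that shows up in this list is a candidate for a long-term fix rather than another quick one.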
  8. Once you understand the size of the problem, getting out of the firefighting mentality starts with knowing what you’re supposed to be doing, and why you’re doing it. You need to make a list. A catalog of services. A “Service Catalog”™©®, if you will. This doesn’t have to be big and complicated; think about what you do, from the perspective of your users. Start by listing the services you’re responsible for providing. Note that technologies or specific products are not services. For example, server-side virus scanning is one service the Storage Services group provides today, but our users don’t care whether it’s from McAfee or Symantec. Virus scanning is the service; McAfee is the technology we’re using today. Sometimes the technology is the service. But many times it isn’t. Your service catalog will drive every other thing you do. Every dollar, every hour, every project should relate to a service on that list, or a service you’re adding to the list. The service catalog is the foundation that gives structure to everything else you do. For each service on the list, you need to figure out: What is the service? Who makes important decisions about the service? (Who is the service owner?) Who are the users? What metrics are important to the users? (Availability? Responsiveness?) Figure out the measures that tell you whether you’re meeting users’ expectations, and set goals. Those are your service level objectives, and those are the things you should be monitoring. Maybe it’s 99.99% uptime, or 20 millisecond response times. Whatever the case, you should be alerted when you’re not meeting those objectives. You might be hesitant to quantify objectives. Don’t be. It’s not a contract, it’s just a way to establish mutual understanding. A lack of understanding is misunderstanding; that’s what you’re trying to avoid. Start by measuring what you achieve today, and ask your users if they’re satisfied. Did you achieve 99.9% uptime this year? Were your users satisfied? Then you’ve got an easy objective for next year. If they weren’t satisfied, you can talk about making incremental improvements.
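To make an uptime objective concrete, it can help to translate the percentage into minutes of allowed downtime. The sketch below assumes a 365-day year and a single yearly downtime total; the function names and the 300-minute figure are illustrative, not measurements from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def availability(downtime_minutes, period_minutes=MINUTES_PER_YEAR):
    """Fraction of the period during which the service was up."""
    return 1.0 - downtime_minutes / period_minutes


def allowed_downtime(objective, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime permitted by an availability objective (e.g. 0.999)."""
    return (1.0 - objective) * period_minutes


def meets_objective(downtime_minutes, objective, period_minutes=MINUTES_PER_YEAR):
    """True if the measured availability is at or above the objective."""
    return availability(downtime_minutes, period_minutes) >= objective


# Hypothetical year with 300 minutes of total downtime.
measured_downtime = 300
for objective in (0.999, 0.9999):
    budget = allowed_downtime(objective)
    ok = meets_objective(measured_downtime, objective)
    print(f"{objective:.2%} allows {budget:.1f} min/year; met: {ok}")
```

Seen this way, 99.9% leaves roughly 8.8 hours a year, while 99.99% leaves under an hour, which is a useful starting point for the conversation with users.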
  9. The service catalog is an area where we’re still improving. We’ve defined our available storage services, and the pricing. Now that we’ve finished the move to the new data center, we’ll be revisiting the service definitions and we’ll have a new site detailing the infrastructure. Where I think we can also improve is in defining service level objectives, especially for performance, and also for uptime. We have some rough guidelines – for example, we want our high-performance EMC SAN storage to provide response times at or below 20 milliseconds – but we don’t have a good way to get alerts when we exceed those performance objectives. This will be something we continue to work on over the next year. The same goes for quantifying uptime; I know we’ve had very few outages, and people seem to be happy with our uptime, but I want to put some numbers on it.
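One possible shape for a performance alert, sketched under assumptions: treat a response-time objective as breached only when several consecutive samples exceed it, so a single spike does not page anyone. The 20 ms objective comes from the note above; the three-sample window and the sample values are hypothetical.

```python
def sustained_breach(samples_ms, objective_ms=20.0, min_consecutive=3):
    """Return the index at which a run of min_consecutive samples above
    objective_ms is first completed, or None if no such run exists."""
    run = 0
    for idx, value in enumerate(samples_ms):
        # Extend the run on a breach; reset it when a sample is within objective.
        run = run + 1 if value > objective_ms else 0
        if run >= min_consecutive:
            return idx
    return None


# Hypothetical response times in milliseconds, one sample per polling interval.
samples = [12, 25, 18, 22, 31, 27, 15]
breach_at = sustained_breach(samples)
if breach_at is not None:
    print(f"Objective breached; alert at sample index {breach_at}")
```

Requiring a sustained breach is a design choice that trades a slightly later alert for less noise.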
  10. This leads to the next set of questions: We have so much information at our disposal, but do we have the right data when we need it, and can we relate it to other things that matter? Imagine you’re staring at a single data point, isolated. Maybe it’s a process completion time. Or a disk I/O rate. Or a page load time. It’s not terribly useful on its own. To make it useful, you have to relate the data in four directions: up, down, backward, and forward. Relate the data up to understand how it affects services. What other things will be impacted? Is it relevant to our service level objectives? Is it something we should communicate to management, other workgroups, or our users? How can we help them understand? Relate down to lower-level measures to understand deeper meaning, causes, and hidden factors. Relate backward in time to understand the data in historical context, establish benchmarks and baselines, and expose trends. And relate the data forward in time with forecasts and projections that help plan for future capacity. (My husband used to work in IT engineering at GoDaddy.com. GoDaddy’s customers are the people who own domain names and web sites, but the engineering team’s ultimate customer was GoDaddy.com owner and CEO Bob Parsons. They were able to summarize every service into one single metric: dollars per minute. When dollars per minute was in line with historical trends and forward-looking projections, the customer – Bob – was happy.)
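Relating data backward and forward in time can be as simple as fitting a straight line to past measurements and extending it. The sketch below uses an ordinary least-squares fit on hypothetical capacity readings to estimate when a storage pool would fill; real growth is rarely this linear, so treat the result as a rough planning figure.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(days, used_tb, capacity_tb):
    """Estimate days remaining after the last sample until usage reaches
    capacity_tb. Returns None if usage is flat or shrinking."""
    slope, intercept = linear_fit(days, used_tb)
    if slope <= 0:
        return None
    full_day = (capacity_tb - intercept) / slope
    return max(0.0, full_day - days[-1])


# Hypothetical readings: day offsets and terabytes used on a 150 TB pool.
days = [0, 30, 60, 90]
used_tb = [100, 110, 120, 130]
remaining = days_until_full(days, used_tb, capacity_tb=150)
print(f"Estimated days until full: {remaining:.0f}")
```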
  11. In a perfect world, we’d have the tools to do all these things. But the reality is that we usually don’t. In the storage world, for instance, every storage platform has a different management console. Here you can see the management interfaces for our EMC, Dell EqualLogic, and NetApp systems. They don’t talk to each other, and they don’t always expose the data we need. And the problem isn’t unique to storage. In 2011, I worked with Steve Troester (ITS Network Services) to look at how different groups in ITS monitored their systems, and where we might be able to improve things. We found every group was doing something different, because the systems they were responsible for were so different. To this day, all of our monitoring tools are separate, so there’s no easy way to find correlations between the networks, storage, servers, databases, and applications. There’s also no easy way to present information in real-time about the status of our services to users. There’s more than one piece of commercial software that promises a unified view of all your systems, across the entire environment, with visibility into every layer of the stack. But it’s all very expensive, and doesn’t work particularly well, and requires a lot of effort to implement. In some cases, you need dedicated full-time staff just to keep everything in sync. (When you hear the sales team mention “a single pane of glass”, run.)
  12. So instead we look to the Internet, because the largest sites out there have also tackled this problem, but they haven’t done it with commercial software. Instead, they each use a customized tool chest of open source and homegrown tools. Because it turns out that sending a text message alert through PagerDuty is a fairly universal need – that’s not something we have to write for ourselves – but maintaining records about storage allocations across EMC, EqualLogic, and NetApp storage systems and reconciling that with the ITS service billing system? That’s more specialized, so we developed our own solution using SQL Server and PowerShell. Susanne Branson is responsible for our storage accounting system, and you can ask her if you’d like to know more about it.
  13. Pingdom is another tool we’ve looked at, and it’s being used actively by some other groups in ITS. Like PagerDuty, it’s a hosted service with a monthly subscription fee. Pingdom can monitor your web site or your application from their servers around the world, tell you how fast it’s responding, and notify you when problems occur. Our users aren’t always on campus. Because Pingdom is monitoring from off-campus, it can recreate the user’s experience better than anything we could do ourselves. As a result, it gets closer to our goal of measuring the service the way that users experience it, instead of just measuring the back-end technology. You can monitor one system for free with Pingdom, so here we’re using it to monitor one of our monitoring servers.
  14. Most of the open source monitoring packages we’ve investigated haven’t been very useful. Here you can see Cacti, Zabbix, Nagios, and a Nagios derivative called OpsView. They’ve been useful for other things in the past, but not terribly useful for proprietary storage systems. We needed a new approach.
  15. Graphite (https://github.com/graphite-project) is an example of something we’re doing with open source software and locally developed scripts. Graphite offers some advantages over some of the traditional open source monitoring tools. It’s really just a data collection and graphing engine. Using Graphite, we can ingest massive amounts of data from different sources and then make graphs on the fly (literally connecting the dots). We’re just starting to use Graphite with our NetApp storage systems. The upper graphs show increased latency on one of our VMware volumes, and the lower graph shows that one VMware is generating more NFS operations than the rest. We may add our EMC and EqualLogic systems to Graphite in the future. Ask Mark Weber if you’d like to know more.
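As a rough illustration of the "locally developed scripts" side of slide 15, the sketch below formats data points for Graphite's plaintext protocol, where each line is `metric.path value timestamp` and is sent to the Carbon listener (TCP port 2003 by default). This is not the script used at Iowa: the host name, metric paths, and sample values are hypothetical, and a real collector would pull the numbers from the storage system rather than hard-coding them.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one plaintext-protocol line: path, value, and epoch seconds."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="graphite.example.edu", port=2003):
    """Send pre-formatted lines to a Carbon listener over TCP.

    The host is a placeholder; 2003 is Carbon's default plaintext port.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Hypothetical samples; the naming scheme is an assumption for illustration.
samples = {
    "storage.netapp.filer01.vmware_vol1.nfs_latency_ms": 4.2,
    "storage.netapp.filer01.vmware_vol1.nfs_ops": 1850,
}
now = int(time.time())
lines = [format_metric(path, value, now) for path, value in sorted(samples.items())]

# send_metrics(lines)  # uncomment once host and port point at a real Carbon instance
```

Because each data point is just a dotted name, a number, and a time, any system that can produce those three things can feed the same graphs, which is what makes it practical to mix sources.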
  16. When you have good data about your services, and you can make the data relevant to others, it becomes a very powerful tool. It’s especially powerful for an introvert like me, because good data, when presented in the right context, can often speak for itself. And it’s a lot easier to ask for things (money, staff, equipment) when you have data explaining where your capacity has gone, and what’s driving growth. You can highlight where you’ve done well, something that too often goes unnoticed in I.T. And you can also explain where you haven’t done so well. None of us likes to fail, but the best thing you can do in that situation is be absolutely transparent: show you understand the problem, you understand the cause, and you have a plan to prevent it from happening again. Of course it doesn’t stop after you submit that request for a new server. You continue the narrative, keep your eyes on the road ahead, and continue telling the story.
  17. In April 2011, I started writing a weekly Friday status update for the Storage Services group, which I publish to a blog on the ITS Intranet and e-mail to a dozen people who’ve asked, including most of the ITS Leadership Team. In the two years since I started, I’ve only missed a handful of weeks when I was out of the office. I include updates on our current projects, and any news about our services, good or bad. The whole thing is less than a page, even in a busy week. It’s a chance for me to highlight the work we’re doing and communicate what I think our priorities are. It’s also a great way to cap off the week; I get to review our accomplishments and collect my thoughts going into the weekend. It’s become such an important part of my process, I can’t imagine ever stopping. It’s another way I can use our understanding of our services, and the data we collect, to continue telling a story about the work we do. This talk is also an example of storytelling: I’ve shown you data demonstrating how we’re reducing alertable incidents; I’ve told you about new tools we’ve developed and are using; I’ve shown you how we’re collecting and using in-depth performance data; and I’ve told you about our weekly blog updates. So even as I’m talking to you, I’m reinforcing the messages about the services we provide.
  18. The bottom line is this: To break out of the reactive cycle, you have to begin using data in a proactive way. Define your services first. Understand what you’re doing, why you’re doing it, and what others expect. This tells you what you need to be monitoring. Then monitor those things: not just the underlying technologies, but the high-level metrics that matter most to people. Make sure you’re meeting people’s expectations. Communicate frequently, not just when you need something. Highlight your successes, and be transparent about problems. Help people relate to the data. Review your service definitions regularly, listen to feedback, and think about whether your service definitions can be refined. When you start to do all these things, then you help other people to see through the smoke and flames, to appreciate the complexity and the work you’re doing, and in their own heads they can begin to connect the dots as well.
  19. Feel free to send me any feedback, add me on LinkedIn, follow me on Twitter, etc. I’d love to hear your thoughts!