Design Review Best Practices 
Velocity Barcelona 2014 
Mandi Walls 
Consulting Director, EMEA, Chef 
mandi@getchef.com
Design Reviews for Operations - Velocity Europe 2014
  1. Design Review Best Practices
     Velocity Barcelona 2014
     Mandi Walls
     Consulting Director, EMEA, Chef
     mandi@getchef.com
  2. whoami
     • Mandi Walls
     • Professional Services for Chef
     • @lnxchk
  3. Why Design Reviews?
     • Your opportunity to get a look at an incoming project
     • Hopefully early enough to have influence! Talk to development before it’s too late to make changes to support Ops
  4. What You Want
     • Maximize the conversation
     • Get a good feel for important details
     • Don’t get too in the weeds, focus on what alleviates the most headaches
  5. Set Your Goals
     • Initial deployment of the application
     • Runtime management and day-to-day operations
     • Releases and upgrades
     • Handling Outages
  6. Have A Checklist
     • Combination of global requirements and needs specific to the project
     • Doesn’t have to be exhaustive, or hit everything the first time
     • Sample: http://bit.ly/lnxchk_reviewcklist
     • Sets goals by layer
     • You can also set goals by activity
     • It’s a living document
     • Make changes as you learn, launch more projects
  7. Some Application
     [Diagram: FE, Web Tier, LB, App Tier, External Services, Data Tier, BI]
  8. Investigating Layer by Layer
     • Different components will require different focus!
     • Some topics will hit all layers
  9. Frontend Entry Point
     • To look at during deploy
     • Gathering all FQDNs
     • Determining SSL Requirements
     • Looking at Geographic Topologies
  10. FE Gotchas: Rewrite Rules, Divided Farms
      • Sophisticated load balancing equipment allows for lots of cleverness
      • Documentation of intent is key
      • Later migrations, requirements changes are more difficult
      [Diagram labels: /foo, /bar]
  11. Web Tier
      • Server type and version
      • Access methods to other tiers
      • Caching?
      • Locations
      • Modules and their configurations
  12. Web Insanity: Custom HTTP Servers
      Image: https://www.flickr.com/photos/16210667@N02/13973604274
  13. Application Tier
      • Basics: server type and version, port requirements, protocols
      • Outbound connections to outside services
      • User affinity
      • Necessary libraries and their release cycles
  14. App Tier Gotchas
      • Things that affect scaling, replacement of hosts
      • Whitelisting! Why!?
      • What changes require restarts?
      • New FE connections? New BE connections to cache? To DB?
  15. Data Tier – Our Data
      • In multi-location topologies, where is the master?
      • Is there whitelisting or other app authorization?
      • What are compliance/reporting requirements?
      • SOX? HIPAA? PCI?
      • Will schemas be versioned?
  16. Doing Horrible Things to Data
      • Premature optimization can hamstring a project
      • Are the data tiers susceptible to queuing?
      • Management of data should be automated
  17. Data Tier – Imports and ETL
      • Schedules
      • Measuring successful load
      • Contact information for upstream provider
      • Storage locations and intended size
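One possible way to make "measuring successful load" concrete is to compare the number of rows that arrived against the number the upstream provider says it sent. The sketch below is only an illustration: the function name `load_succeeded`, its parameters, and the 1% default tolerance are assumptions, not something taken from the slides.

```python
def load_succeeded(rows_loaded, rows_expected, tolerance=0.01):
    """Return True when the loaded row count is within `tolerance` of the expected count.

    All names and the 1% default are hypothetical; pick thresholds per feed.
    """
    if rows_expected <= 0:
        # An empty feed only counts as a success if nothing was loaded either.
        return rows_loaded == 0
    deviation = abs(rows_loaded - rows_expected) / rows_expected
    return deviation <= tolerance
```

A check like this can run at the end of each scheduled import and feed the same alerting path as the rest of the data tier.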
  18. Operationalizing ETL
      • Some of the fiddliest fiddly bits
      Image: https://www.flickr.com/photos/joming/2182022195
  19. Data Tier – Exports, Reporting, BI
      • Packaging
      • Scheduling
      • Live replica versus full dump and export
      • Should have no impact on production operation of the datastore!
  20. External APIs - Consumer
      • Relying on services you don’t own is tricky business
      • Account name and owner
      • SLA of service provider
      • Contact info for outages
  21. Graceful Degradation
      • What does the App do if the external service is unavailable?
      • Shouldn’t hang
      • Should log some message
      • Should have some reasonable timeout, plus a hold down
      • Users shouldn’t notice
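The "timeout, plus a hold down" behavior can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the speaker's implementation: `HoldDownClient`, the injected `call` function, the fallback value, and the 30-second and 2-second defaults are all hypothetical.

```python
import logging
import time

log = logging.getLogger("external")


class HoldDownClient:
    """Wrap a call to an external service with a fallback and a hold-down window."""

    def __init__(self, call, fallback, hold_down_seconds=30.0, clock=time.monotonic):
        self._call = call            # hypothetical function that talks to the service
        self._fallback = fallback    # what users get while the service is unavailable
        self._hold_down = hold_down_seconds
        self._clock = clock
        self._retry_at = float("-inf")  # earliest time the service may be tried again

    def fetch(self, timeout=2.0):
        now = self._clock()
        if now < self._retry_at:
            # In hold-down: skip the broken service entirely so requests do not hang.
            return self._fallback
        try:
            return self._call(timeout=timeout)
        except Exception as exc:  # timeouts, refused connections, bad responses
            log.warning("external service unavailable, serving fallback: %s", exc)
            self._retry_at = now + self._hold_down
            return self._fallback
```

The first failure is logged and starts the hold-down window; until it expires, callers get the fallback immediately instead of waiting on another timeout.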
  22. Red Button
      • Can you completely disconnect the code from a broken service?
      • Who can make the call?
      Image: https://www.flickr.com/photos/bluetsunami/535781178
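One simple form a red button could take is a flag file that Ops can create without a restart or a deploy. The sketch below assumes that approach; the `.disabled` file naming, the `flag_dir` location, and both function names are made up for illustration.

```python
from pathlib import Path


def is_disabled(service, flag_dir):
    """Return True when a '<service>.disabled' flag file exists in flag_dir (hypothetical convention)."""
    return (Path(flag_dir) / f"{service}.disabled").exists()


def call_if_enabled(service, call, fallback, flag_dir):
    """Run `call` unless the red button for `service` has been pressed."""
    if is_disabled(service, flag_dir):
        # The integration is switched off: make no outbound call at all.
        return fallback
    return call()
```

Because the file is checked on every call, pressing the button takes effect immediately; who is allowed to create the file is the organizational question the slide raises.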
  23. Gotchas for All Tiers
      • Host firewalls
      • Application user accounts – in the central service?
      • Logging policy
      • Metrics collection
      • Backups
      • OS Versions
  24. Alerts
      • A stacktrace is not an alertable event
      • Log hygiene is important – what’s in the prod logs needs to be useful
      • If it’s worth writing to disk, someone should know what to do with it
      2014-06-01:16:44:00:UTC ERROR: $$$$$$$$
      2014-06-01:16:44:07:UTC ERROR: Host Not Found
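In contrast to the two log lines on the slide, a message that someone can act on usually says which component failed, what it was talking to, and where to look next. The helper below is a hedged sketch of that idea; the field names, the `runbook` reference, and the example hostname are assumptions for illustration only.

```python
import logging

log = logging.getLogger("app")


def format_dependency_error(component, target, error, runbook):
    """Build a single-line, key=value error message (field names are hypothetical)."""
    return f"component={component} target={target} error={str(error)!r} runbook={runbook}"


def log_dependency_error(component, target, error, runbook):
    """Log the formatted message at ERROR level so it can drive an alert."""
    log.error(format_dependency_error(component, target, error, runbook))
```

For example, `log_dependency_error("billing", "db01.example.internal", OSError("Host Not Found"), "RB-12")` records which host was not found and by whom, rather than a bare "Host Not Found".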
  25. Releases
      • Schedule – Overnight? Weekends? Rolling continuous deploy?
      • Getting operations requirements into Dev cycle
      • Security requirements
      • Upgrades and patches
      • Service restarts
      • Graceful horizontal restarts
      • Full-zone downtime
  26. Nightmare: Attaching Prod App to Dev DB
      • All artifacts should ship production ready
      • Localized configurations ready in advance
      • Lean on config management tools for other environments
      Image: https://www.flickr.com/photos/garon/121923087
  27. Software Library
      • Set organizational standards
      • Rely on your OS vendors when possible
      • Pre-check that all dependencies are available in all repos for all environments
      • Suppress the urge to build special packages for absolutely everything
      • Streamlined, repeatable, reliable installs
  28. Spelunking Through History
      • Keeping up with OS releases makes later scaling, replacement of dead hosts smoother
      • Always be current or no more than one release back in all envs
  29. Performance and Tuning
      • Hard to be concise with new projects
      • Performance regression should be part of the testing process!
      • Know in advance what tunables exist, what their indicators are
      • Heap
      • GC
      • Network stacks
      • In-memory caching
  30. Outage Planning
      • Known failure patterns and indicators
      • Who is on-call from Dev?
      • Who is on-call from Product or other decision-making stakeholders?
  31. The Dark Art of Healthchecks
      • At some point, there should be a check that works back through the whole stack for status
      • These can be expensive
      • You have to know, when you hit the FE, that it is able to serve all the components of the app
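A check that works back through the whole stack can be sketched as an aggregate of per-component probes, with the result cached briefly because the probes are expensive. Everything named here (`DeepHealthcheck`, the probe callables, the 10-second TTL) is an assumption for illustration, not a description of any particular tool.

```python
import time


class DeepHealthcheck:
    """Aggregate per-component probes and cache the result for a short TTL."""

    def __init__(self, checks, ttl_seconds=10.0, clock=time.monotonic):
        self._checks = checks        # name -> zero-argument callable, True when healthy
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached = None
        self._expires = float("-inf")

    def status(self):
        now = self._clock()
        if self._cached is not None and now < self._expires:
            # Reuse the recent answer so frequent LB polling does not hammer the backends.
            return self._cached
        results = {}
        for name, check in self._checks.items():
            try:
                results[name] = bool(check())
            except Exception:
                # A probe that raises counts as an unhealthy component.
                results[name] = False
        self._cached = {"healthy": all(results.values()), "components": results}
        self._expires = now + self._ttl
        return self._cached
```

The frontend's health endpoint would return `status()["healthy"]`, so a load balancer only routes to hosts that can actually reach every component they need.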
  32. Static Failover Pages
      • Mechanism for not serving up 500s or blanks during downtime
      • Set to non-cacheable, show a picture, give a better experience
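The "non-cacheable" part comes down to the response headers sent with the static page. The sketch below builds such a response; the function name, the default 503 status, and the `Retry-After` value are assumptions, since the slide does not say which status code or headers to use.

```python
def build_failover_response(body_html, status=503, retry_after_seconds=300):
    """Return (status, headers, body) for a static failover page.

    The status code and Retry-After default are hypothetical choices.
    """
    body = body_html.encode("utf-8")
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        "Content-Length": str(len(body)),
        # Keep browsers and intermediate caches from holding on to the outage page.
        "Cache-Control": "no-store, max-age=0",
        "Retry-After": str(retry_after_seconds),
    }
    return status, headers, body
```

Whatever serves the page (load balancer, CDN, or web tier) needs to send the equivalent of these headers, otherwise users may keep seeing the failover page after the service is back.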
  33. Other Team Info
      • Know who’s who
      • Dev teams should have some sort of on-call in case of problems Ops can’t fix
      • Intake process for bug reports, feedback, getting Ops issues fixed
      • Like those crazy log messages!
      • Get invited to team meetings! Go occasionally!
  34. Success Metrics
      • What will success actually look like after launch?
      • Is the product team watching a particular set of KPIs?
      • How will the data for success be collected?
      • BI, beacons, third party collectors
  35. Things That are Challenging to Find in Review
      • Bottlenecks
      • That flat table-top graph isn’t a good thing
      • Impact of third-party inclusions
      • Rigor of the integration testing environment
      • What parts of the system users are really going to use
      • People might like something you thought was going to be minor
  36. Learn From Experience
      • Over time, the infrastructure build out will get easier
      • Operations teams develop their own heuristics for difficulty
      • Always be learning and improving
      • Give everyone a chance to learn and improve too
      • Create standards so you don’t have to investigate all dark corners
      • Update your checklist periodically
  37. Thanks!
