5. Dilemma: Innovative or Stable? Innovative: frequent (bi-weekly) releases of new features, with a higher risk of bugs and downtime. Stable: higher uptime and better customer perception, with seasonal releases of new features.
6. We wanted both … … to be innovative and agile while staying as stable as possible
7. Stability in our terms: 99.999% uptime for serving ads; 2 datacenters + clouds; 500 M requests / day
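To put these targets in perspective, the short calculation below converts them into a yearly downtime budget and an average request rate. It is illustrative arithmetic only: the inputs are the figures quoted on the slide, the function names are made up for this sketch, and a 365-day year with evenly spread traffic is assumed.

```python
# Illustrative arithmetic only; the inputs are the figures quoted on the slide.
MINUTES_PER_YEAR = 365 * 24 * 60   # assumes a non-leap year
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_budget_minutes(uptime_percent: float) -> float:
    """Allowed downtime per year, in minutes, for a given uptime target."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


def average_requests_per_second(requests_per_day: float) -> float:
    """Average rate, assuming traffic is spread evenly over the day."""
    return requests_per_day / SECONDS_PER_DAY


budget = downtime_budget_minutes(99.999)
rate = average_requests_per_second(500_000_000)
print(f"{budget:.2f} minutes of downtime per year")   # 5.26
print(f"{rate:.0f} requests per second on average")   # 5787
```

Real traffic has peaks, so the peak rate will be higher than this average.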
9. Challenges we ha(d/ve): Detect issues in production as soon as possible. Test new features in production while reducing the impact on customers. Roll out new features in a controlled manner.
10. Detect issues in production ASAP. Monitoring: Choose your monitoring system carefully; it took us about 1 year (Zabbix). First list all your possible monitoring use cases. Prepare your software for monitoring: logging is a must-have! Performance / SLA counters help to measure and understand the software better. Create a clear baseline to compare against after releases.
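The baseline idea can be sketched as a comparison of counters sampled before and after a release. This is a minimal sketch, not the monitoring setup described in the talk: the `Snapshot` fields, the 10% tolerance, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical set of counters sampled from the monitoring system."""
    avg_response_ms: float
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    cpu_percent: float


def regressions(baseline: Snapshot, current: Snapshot, tolerance: float = 0.10) -> list[str]:
    """Return the counters that exceed the baseline by more than `tolerance` (relative)."""
    worse = []
    for name in ("avg_response_ms", "error_rate", "cpu_percent"):
        before = getattr(baseline, name)
        after = getattr(current, name)
        if after > before * (1 + tolerance):
            worse.append(name)
    return worse


# Made-up sample values, for illustration only.
baseline = Snapshot(avg_response_ms=40.0, error_rate=0.001, cpu_percent=35.0)
after_release = Snapshot(avg_response_ms=47.0, error_rate=0.001, cpu_percent=36.0)
print(regressions(baseline, after_release))   # ['avg_response_ms']
```

A check like this is only as good as the baseline, which is why the baseline should be recorded under comparable load before the release.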
11. Detect issues in production ASAP. Automated functional tests: Designed to detect end-user issues. Different from unit and integration tests: they cover UI / business logic. Still not as many as we want (Selenium UI / C#). Unifying automated QA tests is an ongoing process. Run after each release and on a periodic basis. Very important if you have > 1 server. Huge time saver if tests are repetitive.
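Running the same functional checks against every server can be sketched as below. This is an assumption-laden illustration rather than the Selenium / C# suite mentioned on the slide: the server names and the two stand-in checks are hypothetical, and real checks would drive a browser or call the service.

```python
from typing import Callable, Dict, List

# A check takes a server name and returns True when the end-user scenario works.
Check = Callable[[str], bool]


def run_checks(servers: List[str], checks: Dict[str, Check]) -> Dict[str, List[str]]:
    """Run every check on every server; return the failed check names per server."""
    failures: Dict[str, List[str]] = {}
    for server in servers:
        for name, check in checks.items():
            try:
                ok = check(server)
            except Exception:
                ok = False   # a crashing check counts as a failure
            if not ok:
                failures.setdefault(server, []).append(name)
    return failures


# Stand-in checks (hypothetical); real ones would exercise the UI or business logic.
def ad_is_served(server: str) -> bool:
    return server != "web-03"   # pretend web-03 is broken


def login_page_loads(server: str) -> bool:
    return True


report = run_checks(
    ["web-01", "web-02", "web-03"],
    {"ad_is_served": ad_is_served, "login_page_loads": login_page_loads},
)
print(report)   # {'web-03': ['ad_is_served']}
```

Scheduling the same runner after each release and periodically turns the functional tests into a form of functional monitoring.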
12. Although unit tests help find bugs during coding, they are even more vital as the software evolves!
13. Test new features in production. Even an ideal staging environment is not equal to the production environment. Before starting to roll out a new feature, it is important to check its: Resource consumption (CPU / RAM / HDD / IO / network); Performance impact on existing functionality (response times / SLA); Stability (errors / memory leaks).
14. Test new features in production Use Case #1: Safely roll out a new feature that integrates into the core data collection pipeline
15. Test new features in production. Dark releases: Work best with brand-new features. Release the new feature to one or several servers. The new feature gets real load but is not available to customers. Have an automated rollback package ready in case something goes wrong.
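One way to picture a dark release in code: the customer is always served by the current code path, while the new feature receives the same request on the side and its result is discarded. This is a minimal sketch under stated assumptions; the handler names and the `stats` dictionary are hypothetical and not taken from the talk.

```python
import logging

logger = logging.getLogger("dark_release")


def handle_with_dark_feature(request, current_handler, dark_handler, stats):
    """Serve the request with the current code; run the dark feature on the side."""
    response = current_handler(request)
    try:
        dark_handler(request)          # result is deliberately discarded
        stats["dark_ok"] += 1
    except Exception:
        stats["dark_failed"] += 1      # the dark path must never hurt the customer
        logger.exception("dark feature failed")
    return response


# Hypothetical handlers, for illustration only.
def current(request):
    return f"ad for {request}"


def new_pipeline(request):
    if request == "bad":
        raise ValueError("unexpected input")
    return "collected"


stats = {"dark_ok": 0, "dark_failed": 0}
responses = [handle_with_dark_feature(r, current, new_pipeline, stats)
             for r in ("a", "bad", "c")]
print(responses)   # ['ad for a', 'ad for bad', 'ad for c']
print(stats)       # {'dark_ok': 2, 'dark_failed': 1}
```

In this sketch the dark path runs inline, so it still adds latency to each request; running it asynchronously would be a natural refinement.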
16. Test new features in production Dark release notes from our release plan
17. Test new features in production Use Case #2: Safely migrate to the new SQL connection pooling mechanism
18. Test new features in production. Feature flags and switches: Work both for brand-new features and for updates. A feature can be switched on / off at any time: if (FeatureEnabled) then … if (UseNewLogic) then … else … Can affect existing customers. Possible to test each server one by one by switching the feature on / off.
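The `if (UseNewLogic) then … else …` pattern can be sketched as a small runtime registry of switches. This is an illustrative sketch, not the implementation used in the talk: the `FeatureSwitches` class and the `UseNewConnectionPool` switch name are hypothetical, loosely modeled on Use Case #2.

```python
import threading


class FeatureSwitches:
    """Hypothetical in-memory registry of on/off switches that can be flipped at runtime."""

    def __init__(self, defaults):
        self._lock = threading.Lock()
        self._state = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        with self._lock:
            return self._state.get(name, False)   # unknown features stay off

    def set(self, name: str, enabled: bool) -> None:
        with self._lock:
            self._state[name] = enabled

    def status(self) -> dict:
        """Copy of all switch states, e.g. for a status page."""
        with self._lock:
            return dict(self._state)


switches = FeatureSwitches({"UseNewConnectionPool": False})


def open_connection() -> str:
    if switches.is_enabled("UseNewConnectionPool"):
        return "connection from new pool"
    return "connection from old pool"


before = open_connection()
switches.set("UseNewConnectionPool", True)   # flipped without a redeploy
after = open_connection()
print(before, "->", after)   # connection from old pool -> connection from new pool
```

Defaulting unknown switches to off keeps a typo in a switch name from silently enabling new logic.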
19. Test new features in production Use Case #3: Safely migrate to the brand-new intelligent targeting subsystem
20. Test new features in production. Valves: Very similar to switches. A feature can get from 0% to 100% of real load. Very handy for gradually rolling out new features on each server one by one. So far they have helped us a lot, though they require extra development effort.
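A valve can be sketched as a switch with a percentage: each request is mapped to a stable bucket from 0 to 99, and the feature handles it only if the bucket falls below the configured share. The sketch below is one possible design, assuming hash-based bucketing; the `Valve` class, the valve name, and the user keys are hypothetical.

```python
import hashlib


class Valve:
    """Hypothetical valve that lets a configurable share of traffic reach a feature."""

    def __init__(self, name: str, percent: int = 0):
        self.name = name
        self.set_percent(percent)

    def set_percent(self, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.percent = percent

    def lets_through(self, request_key: str) -> bool:
        # Hash the key so the same user always lands in the same bucket (0..99).
        digest = hashlib.sha256(f"{self.name}:{request_key}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < self.percent


valve = Valve("new_targeting")
keys = [f"user-{i}" for i in range(10_000)]


def share_at(percent: int) -> float:
    """Fraction of the sample keys that reach the feature at a given setting."""
    valve.set_percent(percent)
    return sum(valve.lets_through(k) for k in keys) / len(keys)


for percent in (0, 10, 50, 100):
    print(percent, share_at(percent))
```

Because the bucket is derived from the key rather than drawn at random, a user who sees the new feature at 10% keeps seeing it at 50%, which avoids users flipping between old and new behavior as the valve opens.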
21. Test new features in production. Caveats we had so far: Make sure you can turn features on / off without affecting connected users. Create a simple interface to display the current status of all switches and valves on each affected server. Secure access to switches and valves.
22. Controlling roll-out of a new feature. Switches and valves enable a very smooth and controlled roll-out. Partial roll-out to different datacenters / clouds: different datacenters / clouds can have different versions of the feature released. Redirect all traffic to the new or old version of the feature.
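A partial roll-out across locations can be pictured as a per-location plan of valve settings, plus a helper that sends all traffic to one version. This is only a sketch: the location names, the percentages, and the `bucket` argument (a stable number from 0 to 99 per user) are assumptions made for illustration.

```python
# Hypothetical roll-out plan: share of traffic (0..100) that gets the new version.
ROLLOUT_PLAN = {
    "dc-1": 100,     # fully on the new version
    "dc-2": 25,      # partially rolled out
    "cloud-a": 0,    # still on the old version
}


def version_for(location: str, bucket: int) -> str:
    """Pick 'new' or 'old'; `bucket` is a stable number 0..99 for the user."""
    percent = ROLLOUT_PLAN.get(location, 0)   # unknown locations stay on the old version
    return "new" if bucket < percent else "old"


def redirect_all(plan: dict, to_new: bool) -> None:
    """Send all traffic in every location to the new or the old version."""
    for location in plan:
        plan[location] = 100 if to_new else 0


partial = [version_for("dc-1", 60), version_for("dc-2", 10),
           version_for("dc-2", 60), version_for("cloud-a", 10)]
print(partial)   # ['new', 'new', 'old', 'old']

redirect_all(ROLLOUT_PLAN, to_new=False)   # e.g. an emergency fallback
rolled_back = version_for("dc-1", 10)
print(rolled_back)   # old
```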
23. Controlling roll-out of a new feature. Future research: application-level load balancing. A load balancer can act as a switch / valve without actually programming load distribution logic. Ability to automatically redirect users to the new version of the application while preserving the old one.
24. Summary. A monitoring system is very important, but your software should be prepared for it. Automated functional tests are functional monitoring of your software. Switches and valves are a very powerful concept for testing in production and for roll-outs, but require extra development and maintenance time. Dark releases and partial roll-outs are the most cost-effective safety mechanisms.