In the world of software-as-a-service, just about anyone with a laptop and an Internet connection can spin up their very own cloud-based web service. Software startups, in particular, are often big on ideas but small on staff. This makes streamlining the traditional develop-test-integrate-deploy-monitor pipeline critically important. Melissa Benua says that an effective way to accomplish this is to reduce the number of different test suites that verify many of the same things at each stage. Melissa explains how teams can achieve this by authoring the right set of tests and using the right frameworks. Drawing on lessons learned in companies both large and small, Melissa shows how teams can drastically slash time spent developing automation, verifying builds for release, and monitoring code in production, all without sacrificing availability or reliability.
2. The challenge: Monitoring SaaS products
Software as a service is exploding, and so is testing complexity:
1. It's not enough to run tests at build time; you also need deploy-time integration tests and continuous network monitoring
2. Every layer of tests adds complexity and maintenance costs
3. There are a limited number of engineer-hours in the day
4. Engineers want to use their time with maximum efficiency
Time spent writing the same tests over again is time that could be spent doing more interesting and important stuff!
4. Cloud Monitoring Services
Providers:
• Keynote
• Gomez
• Pingdom
Pros:
• Lightweight
• Integrated alerting
• Public vs. private status pages
Cons:
• Difficult to manage multiple contributors
• Can’t do complex checks easily (log in a user and verify inventory)
• Can get expensive or require enterprise contracts
5. Hosted Monitoring Services
Providers:
• Sensu
• System Center Operations Manager (SCOM)
• Nagios
Pros:
• Extremely powerful
• Mature, well-established technology
Cons:
• Complex to set up
• Single centralized server
• Overkill for many services hosted in the cloud
6. OUR APPROACH
Do it the PlayFab way!
October 5, 2015 PlayFab Confidential 6
7. Our Solution
1. Author one set of HTTP-level tests
• Same as how clients connect
• Self-contained and self-initializing
• Repeatable and reliable
2. Deploy tests both within the build environment and within the monitoring cloud
3. Collect data from tests into one central location
4. Present data for use by both devops and customers
Pros:
• Efficient use of engineering resources
• VM hosting bill is very small
• Can run complex tests without worrying about maintainability
Cons:
• Pipeline requires some maintenance
• Requires knowing how to use two different clouds
• Must be able to do test setup from within a different ecosystem
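The "one set of HTTP-level tests" idea can be sketched as a small Python script. This is a minimal sketch only: the `/login` and `/inventory` endpoints, field names, and credentials below are hypothetical stand-ins, not PlayFab's actual API.

```python
import json
import urllib.request

# Hypothetical base URL; a real monitoring test would point at the service
# under test, exactly the way clients connect (plain HTTP/JSON).
BASE_URL = "https://api.example.com"

def make_login_request(user, password):
    """Build the HTTP login request (the test creates its own session state)."""
    body = json.dumps({"user": user, "password": password}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def check_inventory_response(payload):
    """Judge one HTTP-level check: did the inventory call return a sane result?"""
    return payload.get("status") == "OK" and isinstance(payload.get("items"), list)

def run_monitoring_pass():
    """One self-contained, repeatable pass: log in, fetch inventory, report.
    Requires a live service, so it is not invoked here."""
    with urllib.request.urlopen(make_login_request("monitor", "secret")) as resp:
        session = json.load(resp)
    req = urllib.request.Request(
        BASE_URL + "/inventory",
        headers={"Authorization": "Bearer " + session["token"]},
    )
    with urllib.request.urlopen(req) as resp:
        return check_inventory_response(json.load(resp))
```

Because the pass initializes its own state and judges responses with pure helper functions, the same script can run unchanged on the build server and on the monitoring VMs.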
8. Our solution, cont’d
Goals:
• Minimize number of lines of code duplicated per functional piece
• Reliable & trustworthy reporting
• Affordable cost
• Adequate geo-location
• Very low maintenance time cost
• Easy to access
• More free time for engineering!
Limitations:
• Smaller # of monitoring leaf nodes (~10 instead of ~100 or ~1000)
• Vulnerable to gaps in dev logic
• Not as straightforward to set up
• Monitoring is only as good as your testing!
10. Scenario A – RESTful API
Sample characteristics:
• Custom service in Java layered on Apache
• Private hosting
• Tests via JUnit
• Authenticates using private login
• Connects to several different backend services (MongoDB, SQL, analytics, queueing, etc.)
11. Scenario B – MVC Website
Sample characteristics:
• Built on .NET MVC
• Hosted in Azure
• Testing via custom harness
• Authenticates using OAuth and Facebook
• Backends into locally hosted SQL Server
12. Scenario C - PlayFab
Characteristics:
• JSON API built on C# + management website
• https://api.playfab.com/documentation
• Hosted in Windows on AWS
• Tests via VSTest
• Many moving parts
• Game server hosting
• Client versus server authentication
• Third-party purchasing and auth providers
• Various backend data sources
14. Architecture
[Diagram] Developer writes tests and submits code → build server compiles the code and runs the tests → deployment packages go to production. The same tests run on monitoring nodes in Microsoft Azure (Europe) and Amazon Web Services (US-West, US-East, Asia); a web server collects the results and a web site displays the data.
15. Utilized Tech
Test Framework
• VSTest or JUnit or custom executor
• Must output a predictable, machine-readable format
(.TRX from VSTest comes with an XSD for easy parsing)
Execution + Communication Layer
• Consul or custom cross-DC chatter
• Consul API is available in many languages, easy to secure, and simple to configure
• Regularly executes the test executable
• Shares test results as ‘service health checks’ across DCs
Custom Data Bridge
• Transform test framework output into Consul input
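The data bridge above can be sketched in a few lines of Python: parse the test framework's machine-readable output and map each result onto a Consul check status. The TRX fragment below is simplified for illustration (real .trx files carry an XML namespace and many more attributes), and the outcome-to-status mapping is one reasonable choice, not the only one.

```python
import xml.etree.ElementTree as ET

# Map VSTest outcomes onto the three Consul check states.
OUTCOME_TO_CONSUL = {
    "Passed": "pass",        # -> /v1/agent/check/pass/<id>
    "Failed": "fail",        # -> /v1/agent/check/fail/<id>
    "Inconclusive": "warn",  # -> /v1/agent/check/warn/<id>
}

def bridge_trx(trx_xml):
    """Yield (check_id, consul_status) pairs from TRX-style test results.
    Unknown outcomes are treated as failures, erring on the loud side."""
    root = ET.fromstring(trx_xml)
    for result in root.iter("UnitTestResult"):
        name = result.get("testName")
        status = OUTCOME_TO_CONSUL.get(result.get("outcome"), "fail")
        yield name, status

# Simplified .trx-like fragment for demonstration.
sample = """<TestRun><Results>
  <UnitTestResult testName="LoginWorks" outcome="Passed" duration="00:00:01.2"/>
  <UnitTestResult testName="InventoryLoads" outcome="Failed" duration="00:00:03.0"/>
</Results></TestRun>"""

results = dict(bridge_trx(sample))
```

The same shape works for JUnit XML: only the element and attribute names in the parser change, while the Consul side stays identical.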
16. Picking Monitoring Tests
[Diagram] Test suite hierarchy: the full-app integration test suite sits at the top; beneath it, Internal Service A and Internal Service B each have their own test suite and integration suite; library unit test suites form the base.
17. Picking Monitoring Tests, cont'd
Must-haves:
• Happen at the same layer clients access (HTTP, generally)
• Cover key ‘P0’ functionality areas
• Cover areas with lots of ‘moving parts’
Nice-to-haves:
• All exposed APIs
• Third-party integrations
• Full success-testing run
Ideal world:
• Full integration test suite
18. Scenario Must-Have Test Cases
REST API:
• Login/Authenticate
• Logout
• One test per downstream service
• Stretch: one test per API
MVC Website:
• One test per login method (OAuth, Facebook)
• Key pages
• Basic SQL coverage
19. Deployment Pipeline
The fewer manual steps the better!
Sample flow: Submit code to repo → CI runs build → CI runs tests → deployment packages created → tests deployed into monitoring cloud storage → cloud storage distributes to VMs
20. Monitoring Cloud
Any cloud will do!
Number of regions is important
• Azure has https://azure.microsoft.com/en-us/regions/#services
• AWS has http://docs.aws.amazon.com/general/latest/gr/rande.html#ec2_region
VMs can be teeny – no need for heavy compute or memory usage
21. Test Execution Frequency
How complex is it to run your tests?
• Run a simple executable?
• Have to download a lot of data?
• Long setup phase?
• How long does a full test pass take?
Periodic execution (every N seconds)
Faster is better! Pingdom ‘free’ tier is every 15 minutes per check
Ideal range is between 30 seconds and 5 minutes
Be careful not to drown your ‘real traffic’
• Test traffic hiding problems with real users is a legitimate issue!
• Try to stay under 10% of total traffic if possible
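The 10% budget above is easy to sanity-check with arithmetic. Here is a back-of-envelope helper; all the numbers in the example call are illustrative, not measurements from a real service.

```python
def synthetic_traffic_fraction(requests_per_pass, pass_interval_s,
                               leaf_nodes, real_requests_per_s):
    """Estimate what fraction of total traffic the monitoring tests generate.

    requests_per_pass:  HTTP requests issued by one full test pass
    pass_interval_s:    seconds between passes on each node
    leaf_nodes:         number of monitoring VMs running the pass
    real_requests_per_s: organic traffic rate
    """
    test_rps = requests_per_pass * leaf_nodes / pass_interval_s
    return test_rps / (test_rps + real_requests_per_s)

# e.g. 50 requests per pass, every 60 seconds, from 10 monitoring VMs,
# against 100 real requests/second of organic traffic:
frac = synthetic_traffic_fraction(50, 60, 10, 100.0)
```

If the fraction creeps toward the 10% line, either the pass interval grows or the pass itself gets trimmed; the formula makes that trade-off explicit.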
22. Collecting Results
Execute Tests
Put machine-readable test results into collator
• Consul accepts Datacenter, CaseName, Pass/Warn/Fail, Note (we store latency)
• Agents may be updated using SDK or direct to HTTP interface
• Example: http://localhost:8500/v1/agent/check/pass/mytestcase
• Full HTTP API: https://www.consul.io/docs/agent/http.html
Small adapter program reads test results and outputs to Consul Agent (SDK or HTTP)
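The HTTP side of that adapter is little more than URL construction against the agent endpoint shown above. A minimal sketch (the `note` value here is illustrative; `send_update` needs a live local Consul agent, so only the URL builder is exercised):

```python
import urllib.parse
import urllib.request

# Consul agent is conventionally reached on localhost:8500.
CONSUL_AGENT = "http://localhost:8500"

def check_update_url(check_id, status, note=None):
    """Build /v1/agent/check/{pass|warn|fail}/<id>, optionally with ?note=."""
    assert status in ("pass", "warn", "fail")
    url = "%s/v1/agent/check/%s/%s" % (
        CONSUL_AGENT, status, urllib.parse.quote(check_id))
    if note is not None:
        url += "?note=" + urllib.parse.quote(note)  # e.g. latency details
    return url

def send_update(check_id, status, note=None):
    """Report one test result to the local Consul agent (network call)."""
    urllib.request.urlopen(check_update_url(check_id, status, note)).read()
```

Storing latency in the note, as the slide describes, is just `send_update("mytestcase", "pass", note="latency=120ms")`.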
24. Alerting
Ideal to hear about outages as a push rather than a pull
Determine what ‘failure’ means to you
• Balance between false alarms and missing real alarms
Many options!
• Post alerts into VictorOps for paging
• Send email from monitoring website
• Send push notification through your cloud
28. Consul Commands
Full HTTP API: https://www.consul.io/docs/agent/http.html
Add a health check:
$body = @'
{
  "ID": "mypath",
  "Name": "Path Works",
  "Notes": "Checking uptime and latency",
  "HTTP": "http://my.service.com/path",
  "TTL": "45s"
}
'@
• Invoke-WebRequest -Method PUT -Body $body http://localhost:8500/v1/agent/check/register
List the health checks:
• Invoke-WebRequest http://localhost:8500/v1/health/checks/myservice
[
  {
    "Node": "somenode",
    "CheckID": "mypath",
    "Name": "Path Works",
    "Status": "passing"
  }
]
29. Consul Commands
Update a health check:
• Can add ?note=foo to pass details like latency
• Invoke-WebRequest http://localhost:8500/v1/agent/check/pass/mypath
• Invoke-WebRequest http://localhost:8500/v1/agent/check/warn/mypath
• Invoke-WebRequest http://localhost:8500/v1/agent/check/fail/mypath