2. Netflix, Inc.
“Netflix is the world’s leading Internet television
network with more than 33 million members in
40 countries enjoying more than one billion
hours of TV shows and movies per month,
including original series . . .”
Source: http://ir.netflix.com
3. Me
Director of Engineering @ Netflix
Responsible for:
Cloud app, product, infrastructure, ops security
Previously:
Led security team @ VMware
Earlier, primarily security consulting at @stake, iSEC Partners
15. On the way to the cloud . . . (organization)
(or NoOps, depending on definitions)
16. Some As-Is #s
33m+ subscribers
10,000s of systems
100s of engineers, apps
~250 test deployments/day **
~70 production deployments/day **
** Sample based on one week's activities
18. A common graph @ Netflix
Weekend afternoon ramp-up
Lots of watching in prime time
Not as much in early morning
Old way - pay and provision for peak, 24/7/365
Multiply this pattern across the dozens of apps that comprise the
Netflix streaming service
20. Autoscaling
Goals:
# of systems matches load requirements
Load per server is constant
Happens without intervention (the 'auto' in autoscaling)
Results:
Clusters continuously add & remove nodes
New nodes must mirror existing
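The first two goals can be illustrated with a little arithmetic. This is a minimal sketch, not Netflix's autoscaling policy: the load samples, the per-server target of 500 requests per second, and the function name are all assumptions made for illustration.

```python
import math


def desired_instances(total_load: float, target_load_per_server: float,
                      minimum: int = 1) -> int:
    """Return how many servers keep per-server load at or below the target."""
    if target_load_per_server <= 0:
        raise ValueError("target_load_per_server must be positive")
    return max(minimum, math.ceil(total_load / target_load_per_server))


# Hypothetical requests-per-second samples, from early morning to prime time.
load_samples = [800, 1200, 4500, 9000, 3000]
plan = [desired_instances(load, target_load_per_server=500)
        for load in load_samples]

# "Old way": provision for the peak in every period.
peak_provisioned = max(plan) * len(plan)
# Autoscaled: only what each period needs.
autoscaled = sum(plan)
print(plan, peak_provisioned, autoscaled)
```

With these made-up numbers the cluster size follows the load curve, and the autoscaled total is well under half of what peak provisioning would consume.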
21. Every change requires a new cluster push
(not an incremental change to existing systems)
23. Netflix Deployment Pipeline
Perforce/Git (code change, config change)
-> YUM (RPM with app-specific bits)
-> Bakery (base image + RPM)
-> AMI (VM template ready to launch)
-> ASG (cluster config, running systems)
24. Operational Impact
No changes to running systems
No systems mgmt infrastructure (Puppet, Chef, etc.)
Fewer logins to prod
No snowflakes
Trivial “rollback”
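One way to see why rollback becomes trivial is to model a push as "launch a new, versioned cluster and leave the old one untouched". The sketch below is a toy model under that assumption. The `Cluster` class and its methods are hypothetical, and the `-vNNN` naming simply follows the ASG names quoted in the Exploit Monkey slide.

```python
import re
from dataclasses import dataclass
from typing import Optional


def next_asg_name(current: str) -> str:
    """Increment the trailing -vNNN suffix of an ASG name."""
    match = re.fullmatch(r"(.+)-v(\d+)", current)
    if match is None:
        raise ValueError(f"unexpected ASG name: {current!r}")
    prefix, version = match.groups()
    return f"{prefix}-v{int(version) + 1:03d}"


@dataclass
class Cluster:
    """Hypothetical model: a push creates a new ASG; running systems never change."""
    active: str
    previous: Optional[str] = None

    def push(self) -> str:
        # The old ASG is kept around so traffic can be pointed back at it.
        self.previous, self.active = self.active, next_asg_name(self.active)
        return self.active

    def rollback(self) -> str:
        if self.previous is None:
            raise RuntimeError("no previous ASG to roll back to")
        self.active, self.previous = self.previous, None
        return self.active


cluster = Cluster(active="testapp-live-v001")
print(cluster.push())      # new cluster takes over
print(cluster.rollback())  # previous cluster takes over again
```

Because nothing is patched in place, rolling back is only a matter of making the previous cluster active again.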
25. Security Impact
Need to think differently on:
Vulnerability management
Patch management
User activity monitoring
File integrity monitoring
Forensic investigations
29. Points of Emphasis
Integrate
Make the right way easy
Self-service, with exceptions
Trust, but verify
Two contexts:
1. Integration with your engineering ecosystem
2. Integration of your security controls
Organization
SCM, build and release
Monitoring and alerting
30. Integration: Base AMI Testing
Base AMI – VM/instance template used for all cloud systems
Average instance age = ~24 days (one-time sample)
The base AMI is managed like other packages, via P4, Jenkins, etc.
We watch the SCM directory & kick off testing when it changes
Launch an instance of the AMI, perform vuln scan and other checks
SCAN COMPLETED ALERT
Site name: AMI1
Stopped by: N/A
Total Scan Time: 4 minutes 46 seconds
Critical Vulnerabilities: 5
Severe Vulnerabilities: 4
Moderate Vulnerabilities: 4
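A scan summary like the one above can be turned into a pass/fail signal for the base AMI. The following is a sketch only: the parsing assumes the alert format shown on this slide, and the zero-critical threshold is a hypothetical policy, not something the deck specifies.

```python
import re

ALERT = """\
SCAN COMPLETED ALERT
Site name: AMI1
Stopped by: N/A
Total Scan Time: 4 minutes 46 seconds
Critical Vulnerabilities: 5
Severe Vulnerabilities: 4
Moderate Vulnerabilities: 4
"""


def vulnerability_counts(alert_text: str) -> dict:
    """Extract '<Severity> Vulnerabilities: <n>' lines into a dict."""
    pattern = re.compile(r"^(\w+) Vulnerabilities:\s*(\d+)\s*$", re.MULTILINE)
    return {severity.lower(): int(count)
            for severity, count in pattern.findall(alert_text)}


def ami_passes(counts: dict, max_critical: int = 0) -> bool:
    """Hypothetical gate: fail the AMI when criticals exceed the threshold."""
    return counts.get("critical", 0) <= max_critical


counts = vulnerability_counts(ALERT)
print(counts, ami_passes(counts))
```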
31. Integration: Control Packaging and Installation
From the RPM spec file of a webserver:
Requires: ossec cloudpassage nflx-base-harden hyperguard-enforcer
Pulls in the following RPMs:
HIDS agent
Config assessment/firewall agent
Host hardening package
WAF
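Because the controls arrive as ordinary package dependencies, a build-time check can confirm that a spec file still pulls them in. This is a rough sketch: the package names come from the slide, while the function names and the idea of running such a check in the build are assumptions.

```python
REQUIRED_SECURITY_PACKAGES = {
    "ossec",                # HIDS agent
    "cloudpassage",         # config assessment/firewall agent
    "nflx-base-harden",     # host hardening package
    "hyperguard-enforcer",  # WAF
}


def parse_requires(spec_text: str) -> set:
    """Collect tokens from every 'Requires:' line of an RPM spec file."""
    packages = set()
    for line in spec_text.splitlines():
        if line.strip().startswith("Requires:"):
            deps = line.split(":", 1)[1]
            # Dependencies may be comma- or whitespace-separated. Version
            # constraints are not parsed; their tokens are harmless extras.
            packages.update(deps.replace(",", " ").split())
    return packages


def missing_controls(spec_text: str) -> set:
    """Return the required security packages the spec does not depend on."""
    return REQUIRED_SECURITY_PACKAGES - parse_requires(spec_text)


spec = "Name: webserver\nRequires: ossec cloudpassage nflx-base-harden hyperguard-enforcer\n"
print(missing_controls(spec))
```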
32. Integration: Timeline (Chronos)
What IP addresses have been blacklisted by the WAF in
the last few weeks?
GET /api/v1/event?timelines=type:blacklist&start=20130125000000000
Which security groups have changed today?
GET /api/v1/event?timelines=type:securitygroup&start=20130206000000000
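The two queries above differ only in the timeline type and start time, so they are easy to generate. A small sketch follows; it assumes the `start` value is a `YYYYMMDDHHMMSS` timestamp followed by three millisecond digits, which is inferred from the 17-digit examples rather than documented here, and the helper names are hypothetical.

```python
from datetime import datetime
from urllib.parse import urlencode


def chronos_timestamp(moment: datetime) -> str:
    """Format a datetime as YYYYMMDDHHMMSS plus milliseconds (assumed layout)."""
    return moment.strftime("%Y%m%d%H%M%S") + f"{moment.microsecond // 1000:03d}"


def event_query(timeline_type: str, start: datetime) -> str:
    """Build the request path for an event query of the given timeline type."""
    params = {
        "timelines": f"type:{timeline_type}",
        "start": chronos_timestamp(start),
    }
    # Keep ':' unescaped so the path matches the form shown on the slide.
    return "/api/v1/event?" + urlencode(params, safe=":")


print(event_query("blacklist", datetime(2013, 1, 25)))
print(event_query("securitygroup", datetime(2013, 2, 6)))
```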
33. Integration: Static Analysis
Available self-service through build environment
FindBugs, PMD
Jenkins plugin to display graphs and support drill
through to results
35. Integration: Alerting (Central Alerting Gateway)
Single place to generate and deliver alerts
Python, Java libraries (or JSON post)
Ties in to PagerDuty notification/escalation system
Permits stateful alerting and some response
A prerequisite that our security tools will leverage
36. CAG Example
import CORE.Gateway
gw = CORE.Gateway.Gateway()
# testcluster is a defined app with associated escalation
# schedule in PagerDuty
gw.send("testcluster", "normal", "Something went wrong")
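For tools that cannot use the Python or Java libraries, the same three values could travel as a JSON post. The deck does not show the wire format, so the field names below (`app`, `severity`, `message`) are purely hypothetical, and the sketch only builds the request body without sending anything.

```python
import json


def build_alert_body(app: str, severity: str, message: str) -> str:
    """Serialize an alert as JSON. Field names are illustrative assumptions."""
    for name, value in (("app", app), ("severity", severity), ("message", message)):
        if not value:
            raise ValueError(f"{name} must be a non-empty string")
    return json.dumps(
        {"app": app, "severity": severity, "message": message},
        sort_keys=True,
    )


body = build_alert_body("testcluster", "normal", "Something went wrong")
print(body)
```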
37. Points of Emphasis
Integrate
Make the right way easy
Self-service, with exceptions
Trust, but verify
Developers are lazy
38. Making it Easy: Cryptex
Crypto: DDIY (“Don't Do It Yourself”)
Many uses of crypto in web/distributed systems:
Encrypt/decrypt (cookies, data, etc.)
Sign/verify (URLs, data, etc.)
Netflix also uses crypto heavily for device activation, DRM
playback, etc.
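To make the sign/verify use case concrete, here is a sketch using only the Python standard library. It is not Cryptex and says nothing about how Cryptex works; the key, the URL, and the function names are assumptions, and the key handling shown here is exactly what a shared service is meant to take off developers' hands.

```python
import hashlib
import hmac


def sign(key: bytes, message: str) -> str:
    """Return a hex HMAC-SHA256 signature for the message."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


def verify(key: bytes, message: str, signature: str) -> bool:
    """Check a signature using a constant-time comparison."""
    return hmac.compare_digest(sign(key, message), signature)


# Hypothetical values for illustration only.
demo_key = b"demo-key-not-for-production"
url = "/play?movie=123&device=abc"
signature = sign(demo_key, url)

print(verify(demo_key, url, signature))                            # untouched URL
print(verify(demo_key, url.replace("123", "456"), signature))      # tampered URL
```

Even this small example has to get the algorithm choice, key storage, and constant-time comparison right, which is the argument for not doing it yourself.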
39. Making it Easy: Cryptex
Multi-layer crypto system (HSM basis, scale out layer)
Easy to use
Key management handled transparently
Access control and auditable operations
40. Making it Easy: Cloud-Based SSO
In the AWS cloud, access to data center services is
problematic
Examples: AD, LDAP, DNS
But, many cloud-based systems require authN, authZ
Examples: Dashboards, admin UIs
Asking developers to securely handle/accept credentials
is also problematic
41. Making it Easy: Cloud-Based SSO
Solution: Leverage OneLogin SaaS SSO (SAML) used
by IT for enterprise apps (e.g. Workday, Google Apps)
Uses Active Directory credentials
Provides a single & centralized login page
Developers don't accept username & password directly
Built filter for our base server to make SSO/authN trivial
42. Points of Emphasis
Integrate
Make the right way easy
Self-service, with exceptions
Trust, but verify
Self-service is perhaps the most transformative cloud characteristic
Failing to adopt this for security controls will lead to friction
43. Self-Service: Security Groups
Asgard cloud orchestration tool allows developers to
configure their own firewall rules
Limited to same AWS account, no IP-based rules
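The two limits above amount to a simple guardrail on what a developer may configure without review. The sketch below shows one way such a check could look; it is not Asgard's implementation, and the `IngressRule` type, field names, and account IDs are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IngressRule:
    port: int
    source_group: Optional[str] = None    # name of another security group
    source_account: Optional[str] = None  # AWS account that owns source_group
    source_cidr: Optional[str] = None     # IP-based source, e.g. "10.0.0.0/8"


def rejection_reason(rule: IngressRule, own_account: str) -> Optional[str]:
    """Return why a rule is not self-service, or None if it is allowed."""
    if rule.source_cidr is not None:
        return "IP-based rules need an exception"
    if rule.source_group is None:
        return "a source security group is required"
    if rule.source_account != own_account:
        return "source group must be in the same AWS account"
    return None


allowed = IngressRule(port=7001, source_group="frontend", source_account="111111111111")
by_ip = IngressRule(port=22, source_cidr="0.0.0.0/0")
print(rejection_reason(allowed, "111111111111"), rejection_reason(by_ip, "111111111111"))
```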
44. Points of Emphasis
Integrate
Make the right way easy
Self-service, with exceptions
Trust, but verify
Culture precludes traditional “command and control” approach
Organizational desire for agile, DevOps, CI/CD blur traditional
security engagement touchpoints
45. Trust but Verify: Security Monkey
Cloud APIs make verification and analysis of configuration
and running state simpler
Security Monkey created as the framework for this analysis
Includes:
Certificate checking
Firewall analysis
IAM entity analysis
Limit warnings
Resource policy analysis
46. Trust but Verify: Security Monkey
From: Security Monkey
Date: Wed, 24 Oct 2012 17:08:18 +0000
To: Security Alerts
Subject: prod Changes Detected
Table of Contents:
Security Groups
Changed Security Group
<sgname> (eu-west-1 / prod)
<#Security Group/<sgname> (eu-west-1 / prod)>
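A change report like this one can be produced by comparing two snapshots of security group configuration. The sketch below is not Security Monkey's code: the snapshot shape (group name mapped to a set of rule tuples) and the function name are assumptions, and real snapshots would come from the cloud APIs rather than literals.

```python
def diff_security_groups(before: dict, after: dict) -> dict:
    """Classify each security group as added, removed, or changed."""
    changes = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old is None:
            changes[name] = "added"
        elif new is None:
            changes[name] = "removed"
        elif old != new:
            changes[name] = "changed"
    return changes


# Hypothetical snapshots: rules are (protocol, port, source group) tuples.
yesterday = {
    "web": {("tcp", 80, "elb")},
    "db": {("tcp", 3306, "web")},
}
today = {
    "web": {("tcp", 80, "elb"), ("tcp", 8080, "elb")},
    "db": {("tcp", 3306, "web")},
    "cache": {("tcp", 11211, "web")},
}
print(diff_security_groups(yesterday, today))
```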
47. Trust but Verify: Exploit Monkey
AWS Autoscaling group is unit of deployment, so
changes signal a good time to rerun dynamic scans
On 10/23/12 12:35 PM, Exploit Monkey wrote:
I noticed that testapp-live has changed current ASG name from
testapp-live-v001 to testapp-live-v002.
I'm starting a vulnerability scan against test app from these
private/public IPs:
10.29.24.174
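The trigger described here is just "the current ASG name for an app differs from the last one seen". A minimal sketch of that comparison follows; the function names and the in-memory dictionaries are assumptions, and starting the scan itself is left out.

```python
def changed_asgs(previous: dict, current: dict) -> list:
    """Return (app, old_asg, new_asg) for apps whose current ASG changed."""
    return [
        (app, previous[app], asg)
        for app, asg in sorted(current.items())
        if app in previous and previous[app] != asg
    ]


def scan_notice(app: str, old_asg: str, new_asg: str) -> str:
    """Format a notice in the style of the email above."""
    return (f"I noticed that {app} has changed current ASG name "
            f"from {old_asg} to {new_asg}.")


last_seen = {"testapp-live": "testapp-live-v001", "otherapp": "otherapp-v007"}
now = {"testapp-live": "testapp-live-v002", "otherapp": "otherapp-v007"}

for change in changed_asgs(last_seen, now):
    print(scan_notice(*change))
```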
48. Trust but Verify: ELB Checker (gauntlt)
AWS Elastic Load Balancer (ELB) provides cross-datacenter
traffic balancing, but no security controls
If your cluster is attached to an ELB, it is available to the Internet
Engineers may misunderstand:
ELB use cases (and alternatives)
Security features
Other measures used to protect ELB-fronted clusters
49. Trust but Verify: ELB Checker (gauntlt)
1. Launch gauntlt test runner instance,
loaded with “master list” of ELBs and
expected state
2. Determine “target list” of current ELBs
to evaluate
3. Generate per-ELB listener gauntlt
attack files
4. Execute attacks
5. Alert on failures and new ELBs
6. Triage findings and update master list
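Steps 2 and 5 hinge on comparing what is deployed with the master list. The sketch below covers only that comparison, under stated assumptions: each ELB is reduced to a set of listener ports, the names and data are hypothetical, and the gauntlt attack files and their execution (steps 3 and 4) are not reproduced.

```python
def evaluate_elbs(master: dict, current: dict) -> dict:
    """Compare current ELB listeners against the expected master list."""
    findings = {"new_elbs": [], "failures": []}
    for name in sorted(current):
        if name not in master:
            # Not yet triaged: alert so it can be added to the master list.
            findings["new_elbs"].append(name)
        elif current[name] != master[name]:
            # Listener ports differ from the expected state.
            findings["failures"].append(name)
    return findings


# Hypothetical data: ELB name -> set of open listener ports.
master_list = {"api-frontend": {443}, "www": {80, 443}}
target_list = {"api-frontend": {443, 8080}, "www": {80, 443}, "debug-ui": {80}}
print(evaluate_elbs(master_list, target_list))
```

After triage (step 6), an accepted ELB would be added to `master_list` with its expected listeners so it stops alerting.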
50. Takeaways
Netflix runs a large, dynamic service in AWS
Newer concepts like cloud & DevOps need an
updated approach to application security
Specific context can help jumpstart a pragmatic
and effective security program
Don't swim upstream - integrate and collaborate
with your engineering partners