Every environment comes with its own set of unique challenges. Across our global client base, advanced techniques have emerged to solve common, and sometimes very specific, problems. This session covers a re-imagining of autonomous healing, advanced networking and proxying patterns, data exfiltration controls, and continuous delivery of networks. It is a fast-paced technical session that provides an in-the-trenches view of some of the solutions, a discussion of considerations at scale, a demonstration, and actionable designs to take into your organisation. Join us while we present tips, strategies, and cutting-edge patterns from Sourced's battle-hardened consultants.
Speakers:
John Painter, Principal Consultant, Sourced Group
Brent Harrison, Consultant, Sourced Group
4. ENGINEERED SERVICES
• Behind the firewall
• BYO AWS Account
• Guaranteed single tenancy
• Multi-cloud options
• Customer-controlled encryption
• Customer retains custody of all data
Over half a petabyte per year of Splunk throughput under management
5. PATTERN 1 – AUTO HEALING GEN2
AUTOMATED HEALING OF SINGLE INSTANCES WITH DEEP
HEALTH CHECKING
6. THE BASIC AUTO-HEALING PATTERN
1. Create ASG with min:max of 1:1
2. Elastic Load Balancer (ELB) provides deep health checking
Note: this pattern is not the same as EC2 Auto Recovery
Top Tip: Deep Health Check
1. Script that checks multiple variables (e.g. process + disk space + memory) and opens/closes a port via Netcat
2. Set the ELB to “port” type check
[Diagram: Auto Scaling Group, min: 1, max: 1]
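The deep health check tip can be sketched as follows. This is a minimal illustration, not the presenters' script: it swaps Netcat for a plain Python socket, and the names `HealthPort`, `disk_ok`, and the 10% free-space threshold are assumptions made for the example. The port stays open only while every check passes, so a “port” type ELB check fails as soon as any one of them does.

```python
import shutil
import socket
import time
from typing import Callable, Iterable, Optional


def disk_ok(path: str = "/", min_free_fraction: float = 0.10) -> bool:
    """True when the filesystem holding `path` has enough free space (threshold is an assumption)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_fraction


class HealthPort:
    """Keeps a TCP port listening only while every health check passes."""

    def __init__(self, port: int, host: str = "0.0.0.0") -> None:
        self.host = host
        self.port = port
        self._sock: Optional[socket.socket] = None

    @property
    def is_open(self) -> bool:
        return self._sock is not None

    def open(self) -> None:
        if self._sock is None:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            sock.bind((self.host, self.port))
            sock.listen(16)
            self._sock = sock

    def close(self) -> None:
        if self._sock is not None:
            self._sock.close()
            self._sock = None

    def update(self, checks: Iterable[Callable[[], bool]]) -> bool:
        """Run every check; open the port if all pass, close it otherwise."""
        healthy = all(check() for check in checks)
        if healthy:
            self.open()
        else:
            self.close()
        return healthy

    def drain(self, seconds: float) -> None:
        """Accept and close probe connections for `seconds` so the backlog never fills."""
        deadline = time.monotonic() + seconds
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return
            if self._sock is None:
                time.sleep(remaining)
                return
            self._sock.settimeout(remaining)
            try:
                conn, _ = self._sock.accept()
            except socket.timeout:
                return
            conn.close()
```

A caller would loop over `update([...])` followed by `drain(interval)`, adding process and memory checks alongside `disk_ok` as further callables.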
7. There are strong fiscal motivators to reduce tier-1 operational costs
via the use of automated healing actions
8. ASGS, ETH0, AND STATIC IPS
• LOTS of “cloud” applications want static networks
• ASG instances receive dynamic IPs
• Add a secondary interface (Elastic Network Interface – ENI) which maintains a fixed network address
[Diagram: Auto Scaling Group (min: 1, max: 1); “re-mappable” ENI; other cluster members/users]
9. THE PREVIOUS APPROACH
1. Virtual Private Cloud (VPC) with large subnets (e.g. larger than /24)
2. ASG with min:max of 1:1
3. Scripts call EC2 API on boot to “bring” a re-mappable interface to the instance
✓ Runs in the operating system, simple to pause apps for interface
✕ Lots of upstream/deployment co-ordination required
✕ Must maintain support for multiple operating systems
✕ Incompatible with the increasing number of AWS Marketplace offerings
✕ Prone to failure if the AWS API is under duress (which is also probably when you really want to be healing!)
10. ONE ALTERNATIVE
[Diagram: Auto Scaling Group (min: 1, max: 1); other cluster members/users; ASG Notifications, SQS/SNS, Lambda, AWS CLI]
✕ Slow
✕ Prone to AWS API/backplane duress
✕ Does not understand the state of the operating system
11. GOING BACK TO FIRST PRINCIPLES
• What hands out IP addresses in AWS?
– DHCP (VPC DHCP options set)
• Where does the range of IPs come from?
– Subnet size
• Can we reduce the number of IPs available from DHCP to 1?
– Provision an ENI in the subnet -> 1 fewer IP (for FREE!)
– Provision lots of ENIs in a subnet and there will only be 1 IP left
12. AUTO-HEAL GEN2
1. Create a Subnet for the Auto-Heal node (at the moment /28 is the smallest)
2. Create enough ENIs to remove all but 1 IP from DHCP
3. Create the normal ASG with min:max 1:1
4. Create the ELB with deep health checking as per normal
✓ No scripts, co-ordination, or complexity inside the OS or the deployment framework
✓ Fully compatible with the wide range of black-box AMIs from AWS Marketplace
✕ Wastes address space (which may not be an issue depending on your network design and integration points)
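The arithmetic behind step 2 can be sketched as below. It assumes AWS's documented behaviour of reserving five addresses in every subnet (the first four and the last); the function name `enis_to_leave_one_ip` is hypothetical and only illustrates the count, it does not create anything.

```python
import ipaddress

# Assumption: AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5
SMALLEST_PREFIX = 28  # per the slide, /28 is the smallest subnet at the moment


def enis_to_leave_one_ip(cidr: str) -> int:
    """Number of placeholder ENIs needed so that exactly one address remains assignable."""
    subnet = ipaddress.ip_network(cidr, strict=True)
    if subnet.prefixlen > SMALLEST_PREFIX:
        raise ValueError(f"{cidr} is smaller than /{SMALLEST_PREFIX}")
    usable = subnet.num_addresses - AWS_RESERVED_PER_SUBNET
    return usable - 1


if __name__ == "__main__":
    # A /28 has 16 addresses, 11 of them assignable, so 10 ENIs leave a single IP.
    print(enis_to_leave_one_ip("10.0.0.0/28"))
```

With only one address left, any replacement instance launched by the 1:1 ASG into this subnet can only receive that address.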
13. The same technique can be used for “fixed clusters”, sets of quorum servers, and container systems
14. PATTERN 2 - ADVANCED PROXY
A SCALABLE, HIGHLY AVAILABLE PROXY WITH ACTIVE DATA CONTROLS AND STATIC IP RANGES
15. EC2 with Outbound Internet Access
[Diagram: Public Subnet (MNAT, EIP); Private Subnet (EC2 ×3)]
✕ Uncontrolled access to the internet
✕ Reactive techniques such as VPC flow logs + Lambda are not capable of running in real time
* Diagram simplified for clarity, excludes multi availability zone elements
16. ~HA Transparent Proxy Design
• Limitation of VPC: routes can only reference a single interface
✓ Active control of traffic
✓ HTTP/S inspection
? Non-trivial engineering required
✕ Not truly HA
✕ Relatively low and finite throughput
✕ Prone to EC2 backplane saturation
✕ 100s of Mb/s per EIP
[Diagram: Public Subnet (two PROXY instances, ENI, EIP; Whitelist, Blacklist, IP List); Private Subnet (EC2 ×3)]
* Diagram simplified for clarity, excludes multi availability zone elements
17. Auto Scaling Proxy
✓ Active control of traffic
✓ Actively load balanced
✓ Truly HA
✓ ≈ Infinite bandwidth
✕ Variable public IPs
[Diagram: Availability Zones A and B; a Public Subnet per zone with PROXY instances in one ASG; a Private Subnet per zone with EC2 ×3]
“Auto Scaled” EIPs
19. Auto Scaling Proxies with Static IPs
✓ Active control of traffic
✓ Actively load balanced
✓ Static external IP addresses
? ≈ Infinite bandwidth requires co-ordination
[Diagram: Availability Zones A and B; Private Subnets; Public Subnets; EC2 ×3 per zone; ASG of six PROXY instances; MNAT and EIP per zone]
20. Why? ...and hang on, I still see EIPs?
Scaling Increments:
10GB @ $42/month/10GB
100s of Mb/s @ ~$210/month/100s of Mb/s
• Provision 50/100/200Gb/s upfront
• If you move in increments of +/-100Gb/s, see pattern 3.
• Simple, HA, static IP proxies for a relatively low uplift in cost
[Diagram: same as slide 19]
21. Complex Inspection Sandwich
• Lots of vendor solutions can now support healing
• Some even support scaling
• Few support ENI/EIP handling
[Diagram: Availability Zones A and B; Private Subnets with EC2 ×3 per zone; INSPECTION SANDWICH; Public Subnets with MNAT and EIP per zone]
22. PATTERN 3 - AUTO SCALING ANYTHING
A TECHNIQUE LEVERAGING EXISTING SERVICES TO AUTOSCALE ALMOST ANYTHING
23. The fiscal and operational benefits of Auto Scaling are well understood.
Auto Scaling is currently limited to scaling EC2 instances
We want to apply scaling to entire solutions, not just EC2
24. SCALING CLUSTERS? SCALING CELLS?
• Enterprises have many applications that cannot scale on compute alone
– Sharded databases
– Life Sciences Clusters
– Simulation Clusters
• Organisations are starting to adopt “Cell Architecture” to account for scale
• Auto Scaling Auto Healing
Client Example
~8000 instances connected in “rings” of 20 nodes via a cluster protocol + ~1500 Cassandra nodes. 50% variance in daily traffic volume. An ideal use case for Auto Scaling.
25. THE GENERAL CASE - CELL / SHARD / CLUSTER
[Diagram: one CloudFormation Stack containing EC2 Node1, EC2 Node2, … EC2 Node-n, with a Health Check]
26. THE GENERAL CASE - CELL / SHARD / CLUSTER
[Diagram: two CloudFormation Stacks, each containing EC2 Node1, EC2 Node2, … EC2 Node-n, with a Health Check]
27. STEP 1 – INSTRUMENT THE SCALING METRIC
[Diagram: CloudWatch Custom Metric (Number of Users) feeding two CloudWatch Alarms, ScaleUp and ScaleDown]
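One way to express the two alarms on the custom metric is as a pair of parameter sets, sketched below. The keys follow the CloudWatch `PutMetricAlarm` parameters; the namespace, thresholds, period, and action ARNs are placeholders invented for the example, and `scaling_alarms` is a hypothetical helper.

```python
from typing import Any, Dict, List


def scaling_alarms(
    namespace: str,
    metric_name: str,
    scale_up_at: float,
    scale_down_at: float,
    scale_up_action_arn: str,
    scale_down_action_arn: str,
    period_seconds: int = 60,
    evaluation_periods: int = 3,
) -> List[Dict[str, Any]]:
    """Build ScaleUp and ScaleDown alarm definitions for one custom metric."""
    if scale_down_at >= scale_up_at:
        raise ValueError("scale_down_at must be below scale_up_at to avoid flapping")

    def alarm(name: str, operator: str, threshold: float, action: str) -> Dict[str, Any]:
        return {
            "AlarmName": f"{metric_name}-{name}",
            "Namespace": namespace,
            "MetricName": metric_name,
            "Statistic": "Average",
            "Period": period_seconds,
            "EvaluationPeriods": evaluation_periods,
            "Threshold": threshold,
            "ComparisonOperator": operator,
            "AlarmActions": [action],
        }

    return [
        alarm("ScaleUp", "GreaterThanOrEqualToThreshold", scale_up_at, scale_up_action_arn),
        alarm("ScaleDown", "LessThanOrEqualToThreshold", scale_down_at, scale_down_action_arn),
    ]
```

Assuming boto3 is available, each dictionary could be passed as `cloudwatch.put_metric_alarm(**definition)`, with the action ARNs pointing at the scaling policies of whichever group is being scaled.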
28. OPTION 1 – USE LAMBDA
[Diagram: Number of Users; ScaleUp and ScaleDown alarms; SNS; Build Lambda and Teardown Lambda; CloudFormation]
29. WHY NOT LAMBDA?
• Duplication of AWS engineering investment
• Ongoing cost to maintain cadence with the growing features of Auto Scaling
– Scheduled Scaling
– Percentile Scaling
– Machine Learning Scaling / Predictive Scaling
• Lambda still needs a state machine
• We don’t have healing
30. There are strong fiscal and complexity
motivators to use native ASGs
31. STEP 2 – “SHADOW” ASG
[Diagram: Number of Users; ScaleUp and ScaleDown alarms; Shadow ASG containing “Shadow” instances]
32. STEP 3 – ADD THE CFN LAMBDAS
[Diagram: Number of Users; ScaleUp and ScaleDown alarms; Shadow ASG containing “Shadow” instances; Auto Scaling SNS]
EC2_INSTANCE_LAUNCH -> Create Stack
EC2_INSTANCE_TERMINATE -> Delete Stack
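The mapping from Shadow ASG notifications to stack operations might look like the sketch below. It assumes the Lambda is subscribed to the Auto Scaling SNS topic and that `cfn` behaves like a boto3 CloudFormation client (`create_stack` / `delete_stack`); the `cell-` stack naming scheme and the `template_url` parameter are illustrative assumptions, not the presenters' implementation.

```python
import json
from typing import Any, Dict, List, Tuple

# Event names as they appear in Auto Scaling SNS notification messages.
LAUNCH = "autoscaling:EC2_INSTANCE_LAUNCH"
TERMINATE = "autoscaling:EC2_INSTANCE_TERMINATE"


def stack_name_for(instance_id: str) -> str:
    """Tie each stack to the shadow instance that represents it (hypothetical naming)."""
    return f"cell-{instance_id}"


def handle(event: Dict[str, Any], cfn: Any, template_url: str) -> List[Tuple[str, str]]:
    """Create a stack per launched shadow instance, delete it when the instance terminates."""
    actions: List[Tuple[str, str]] = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        instance_id = message.get("EC2InstanceId")
        if not instance_id:
            continue  # e.g. test notifications carry no instance
        name = stack_name_for(instance_id)
        kind = message.get("Event")
        if kind == LAUNCH:
            cfn.create_stack(StackName=name, TemplateURL=template_url)
            actions.append(("create", name))
        elif kind == TERMINATE:
            cfn.delete_stack(StackName=name)
            actions.append(("delete", name))
    return actions
```

Deriving the stack name from the shadow instance ID keeps the Lambda stateless: the terminate notification alone is enough to find the stack to delete.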
41. CONTINUOUS DELIVERY FOR CLUSTERS
• Using nothing but the ASG capacity, blue/green roll clusters of almost any size
• Increment ASG in V2.0, wait for health check, decrement ASG in V1.0
[Diagram: cluster cells split between V1.0 and V2.0]
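The increment, wait, decrement loop can be sketched as follows. The three callables stand in for whatever sets the desired capacity of each ASG and polls the health check; they, the `max_polls` limit, and the function name are assumptions for illustration only.

```python
import time
from typing import Callable, List, Tuple


def blue_green_roll(
    old_capacity: int,
    set_old_capacity: Callable[[int], None],
    set_new_capacity: Callable[[int], None],
    new_is_healthy: Callable[[int], bool],
    max_polls: int = 60,
    poll_interval: float = 0.0,
) -> List[Tuple[int, int]]:
    """Move capacity from the V1.0 ASG to the V2.0 ASG one cell at a time.

    Returns the (old, new) capacities after each completed step.
    """
    history: List[Tuple[int, int]] = []
    old, new = old_capacity, 0
    while old > 0:
        new += 1
        set_new_capacity(new)  # increment ASG in V2.0
        for _ in range(max_polls):  # wait for the health check
            if new_is_healthy(new):
                break
            time.sleep(poll_interval)
        else:
            raise TimeoutError(f"V2.0 did not become healthy at capacity {new}")
        old -= 1
        set_old_capacity(old)  # decrement ASG in V1.0
        history.append((old, new))
    return history
```

Because V1.0 is only decremented after V2.0 reports healthy, a failed health check stops the roll with the old capacity still intact.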
42. AUTO SCALE ANYTHING
• Solution works with many non-scaling AWS services
• CloudFormation can use Custom Resources to create almost anything
• The “Shadow” system only needs the scaling alarms from any CloudWatch metric and a health check endpoint. It is decoupled and does not interact with the system in any way.
47. CUSTOM SCALING EXAMPLES
[Diagram: Number of items in the queue; ScaleUp and ScaleDown Alarms; Shadow ASG; SNS; Lambda; CloudFormation; Life Sciences Application ×3]
48. CUSTOM SCALING EXAMPLES
[Diagram: Number of planes currently in the air; ScaleUp and ScaleDown Alarms; Shadow ASG; SNS; Lambda; CloudFormation; Flight Analysis Stack ×3]
49. CUSTOM SCALING EXAMPLES
[Diagram: Number of door entries; ScaleUp and ScaleDown Alarms; Shadow ASG; SNS; Lambda; CloudFormation; Trading Stack ×3]
50. CUSTOM SCALING EXAMPLES
[Diagram: Order Volume; ScaleUp and ScaleDown Alarms; Shadow ASG; SNS; Lambda; CloudFormation; Number of robots on station ×3]
51. Find Out MORE:
Visit Us: At our booth or online – www.sourcedgroup.com
Careers: www.sourcedgroup.com/careers
In the news:
• Computerworld (2016):
• Foreign Exchange Service OFX Embarks on Cloud Migration
• Connecting the Australian Channel (2015):
• Meet the Partner who took Qantas to the AWS Cloud
• The Australian Business Review (2015):
• Greater Buying Power lets Aussie bank on Adobe Experience Manager
Our Awards:
• AWS – Sydney Partners Summit - Invent & Simplify (2015)
• AWS – Global - Customer Obsessed Partner (2014)