Advanced AWS techniques from the trenches of the Enterprise – Sourced Group

Advanced AWS Patterns from the
trenches of the enterprise

John Painter
Principal Consultant
john.painter@sourcedgroup.com
Brent Harrison
Consultant
brent.harrison@sourcedgroup.com
OUR TEAM TODAY
SYDNEY | TORONTO | VANCOUVER | KELOWNA

CONSULTING
Banks, Aviation, Telecom, FinTech, Media, Healthcare, Smartphone Manufacturer, Utilities

ENGINEERED SERVICES
• Behind the firewall
• BYO AWS Account
• Guaranteed single tenancy
• Multi-cloud options
• Customer-controlled encryption
• Customer retains custody of all data
Over half a petabyte per year of Splunk throughput under management

PATTERN 1 – AUTO HEALING GEN2
AUTOMATED HEALING OF SINGLE INSTANCES WITH DEEP
HEALTH CHECKING

THE BASIC AUTO-HEALING PATTERN
1. Create an Auto Scaling Group (ASG) with min:max of 1:1
2. Elastic Load Balancer (ELB) provides deep health checking
Note: this is not the same as EC2 Auto Recovery
Top Tip: Deep Health Check
1. Script that checks multiple variables (e.g. process + disk space + memory) and opens/closes a port via Netcat
2. Set the ELB to “port” type check
Auto Scaling Group
min: 1, max: 1
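
The deep health check in the Top Tip can be sketched roughly as follows. This is a minimal illustration, not the presenters' script: it swaps Netcat for a plain Python socket, assumes a Linux host (it reads /proc/meminfo and calls pgrep), and the port number, process name, and thresholds are hypothetical placeholders.

```python
import shutil
import socket
import subprocess
import time

HEALTH_PORT = 8999       # hypothetical port the ELB "port" check targets
PROCESS_NAME = "myapp"   # hypothetical process to watch
MIN_FREE_DISK = 0.10     # assumed threshold: at least 10% disk free
MIN_FREE_MEM = 0.10      # assumed threshold: at least 10% memory available


def process_running(name):
    # pgrep exits with 0 when at least one process matches exactly
    result = subprocess.run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def disk_free_fraction(path="/"):
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def mem_available_fraction(meminfo_path="/proc/meminfo"):
    fields = {}
    with open(meminfo_path) as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.split()[0])
    return fields["MemAvailable"] / fields["MemTotal"]


def is_healthy(proc_ok, disk_free, mem_free,
               min_disk=MIN_FREE_DISK, min_mem=MIN_FREE_MEM):
    # Healthy only when every individual check passes
    return proc_ok and disk_free >= min_disk and mem_free >= min_mem


def open_port(port):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("0.0.0.0", port))
    listener.listen(128)
    listener.setblocking(False)
    return listener


def drain(listener):
    # Accept and close pending health-check connections so the backlog never fills
    while True:
        try:
            conn, _ = listener.accept()
        except BlockingIOError:
            return
        conn.close()


def run_forever(interval=1.0):
    listener = None
    while True:
        healthy = is_healthy(process_running(PROCESS_NAME),
                             disk_free_fraction(),
                             mem_available_fraction())
        if healthy and listener is None:
            listener = open_port(HEALTH_PORT)   # port open: ELB check passes
        elif not healthy and listener is not None:
            listener.close()                    # port closed: ELB check fails
            listener = None
        if listener is not None:
            drain(listener)
        time.sleep(interval)
```

Calling `run_forever()` from a service unit or init script would keep the port open only while all three checks pass, so the ELB marks the instance unhealthy and the ASG replaces it when any check fails.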

There are strong fiscal motivators to reduce tier-1 operational costs
via the use of automated healing actions

ASGS, ETH0, AND STATIC IPS
• LOTS of “cloud” applications want static networks
• ASG instances receive dynamic IPs
• Add a secondary interface (Elastic Network Interface – ENI) which maintains a fixed network address
[Diagram labels: Auto Scaling Group (min: 1, max: 1), “Re-mappable” ENI, Other cluster members/users]

THE PREVIOUS APPROACH
1. Virtual Private Cloud (VPC) with large subnets (e.g. larger than /24)
2. ASG with min:max of 1:1
3. Scripts call the EC2 API on boot to “bring” a re-mappable interface to the instance
✓ Runs in the operating system, simple to pause apps for the interface
✕ Lots of upstream/deployment co-ordination required
✕ Must maintain support for multiple operating systems
✕ Incompatible with the increasing number of AWS Marketplace offerings
✕ Prone to failure if the AWS API is under duress (which is also probably when you really want to be healing!)

ONE ALTERNATIVE
[Diagram: Auto Scaling Group (min: 1, max: 1) -> ASG Notifications -> SQS/SNS -> Lambda -> AWS CLI; Other cluster members/users]
✕ Slow
✕ Prone to AWS API/backplane duress
✕ Does not understand the state of the operating system

GOING BACK TO FIRST PRINCIPLES
• What hands out IP addresses in AWS?
– DHCP (VPC DHCP Options Set)
• Where does the range of IPs come from?
– Subnet size
• Can we reduce the number of IPs available from DHCP to 1?
– Provision an ENI in the subnet -> 1 fewer IP (for FREE!)
– Provision lots of ENIs in a subnet and there will only be 1 IP left

AUTO-HEAL GEN2
1. Create a Subnet for the Auto-Heal node (at the moment /28 is the smallest)
2. Create enough ENIs to remove all but 1 IP from DHCP
3. Create the normal ASG with min:max 1:1
4. Create the ELB with deep health checking as per normal
✓ No scripts, co-ordination, or complexity inside the OS or the deployment framework
✓ Fully compatible with the wide range of black-box AMIs from AWS Marketplace
✕ Wastes address space (which may not be an issue depending on your network design and integration points)
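
The arithmetic behind step 2 can be sketched as below. It relies on the fact that AWS reserves five addresses in every subnet (the first four and the last one); the function names and the example CIDR are illustrative assumptions, and the actual ENI creation (for instance via the EC2 CreateNetworkInterface API with a fixed private IP) is deliberately left out.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses plus the last one


def enis_to_reserve(cidr, nodes=1):
    """How many placeholder ENIs leave exactly `nodes` addresses for DHCP."""
    subnet = ipaddress.ip_network(cidr)
    usable = subnet.num_addresses - AWS_RESERVED_PER_SUBNET
    if usable < nodes:
        raise ValueError("subnet too small for the requested node count")
    return usable - nodes


def placeholder_addresses(cidr, keep):
    """Addresses to pin to placeholder ENIs so that only `keep` stays free."""
    subnet = ipaddress.ip_network(cidr)
    assignable = list(subnet)[4:-1]          # skip the AWS-reserved addresses
    keep_addr = ipaddress.ip_address(keep)
    if keep_addr not in assignable:
        raise ValueError("address to keep is not assignable in this subnet")
    return [str(addr) for addr in assignable if addr != keep_addr]
```

For a /28, `enis_to_reserve("10.0.0.0/28")` gives 10: sixteen addresses, minus five reserved, minus the one left for the auto-healed instance.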

Same technique can be used for “fixed clusters”, sets of quorum
servers, container systems

PATTERN 2 - ADVANCED PROXY
A SCALABLE, HIGHLY AVAILABLE PROXY WITH ACTIVE DATA
CONTROLS AND STATIC IP RANGES

EC2 with Outbound Internet Access
✕ Uncontrolled access to the internet
✕ Reactive techniques such as VPC flow logs + Lambda are not capable of running in real-time
[Diagram labels: MNAT, EIP, Public Subnet, Private Subnet, EC2 ×3]
* Diagram simplified for clarity, excludes multi availability zone elements

~HA Transparent Proxy Design
• Limitation of VPC: Routes can only reference a single interface
✓ Active control of traffic
✓ HTTP/S inspection
? Non-trivial engineering required
✕ Not truly HA
✕ Relatively low and finite throughput
✕ Prone to EC2 backplane saturation
✕ 100s of Mb/s per EIP
[Diagram labels: Whitelist, Blacklist, IP List, EIP, ENI, Public Subnet, Private Subnet, PROXY ×2, EC2 ×3]
* Diagram simplified for clarity, excludes multi availability zone elements

Auto Scaling Proxy
✓ Actively load balanced
✓ Truly HA
✓ ≈ Infinite bandwidth
✕ Variable public IPs
[Diagram labels: Availability Zone A, Availability Zone B, Public Subnet ×2, Private Subnet ×2, ASG, PROXY ×2, EC2 ×6, “Auto Scaled” EIPs]

Variable edge IPs are undesirable in the enterprise

Auto Scaling Proxies with Static IPs
✓ Actively load balanced
✓ Static external IP addresses
? ≈ Infinite bandwidth requires co-ordination
[Diagram labels: Availability Zone A, Availability Zone B, Private Subnets, Public Subnets, ASG, PROXY ×6, EC2 ×6, MNAT ×2, EIP ×2]

Why? ...and hang on, I still see EIPs?
Scaling Increments:
• 10Gb/s @ $42/month per 10Gb/s
• 100s of Mb/s @ ~$210/month per 100s of Mb/s
• Provision 50/100/200Gb/s upfront
• If you move in increments of +/-100Gb/s, see pattern 3.
• Simple, HA, static IP proxies for a relatively low uplift in cost
[Diagram labels: ASG, PROXY ×6, EC2 ×6, MNAT ×2, Public Subnets ×2, EIP ×2]

Complex Inspection Sandwich
• Lots of vendor solutions can now support healing
• Some even support scaling
• Few support ENI/EIP handling
[Diagram labels: INSPECTION SANDWICH, EC2 ×6, MNAT ×2, Public Subnets ×2, EIP ×2]

PATTERN 3 - AUTO SCALING ANYTHING
A TECHNIQUE LEVERAGING EXISTING SERVICES TO AUTOSCALE ALMOST
ANYTHING

The fiscal and operational benefits of Auto Scaling are well understood.
Auto Scaling is currently limited to scaling EC2 instances
We want to apply scaling to entire solutions, not just EC2

SCALING CLUSTERS? SCALING CELLS?
• Enterprises have many applications that cannot scale on compute alone
– Sharded databases
– Life Sciences Clusters
– Simulation Clusters
• Organisations are starting to adopt “Cell Architecture” to account for scale
• Auto Scaling  Auto Healing
Client Example
~8000 instances connected in “rings” of 20 nodes via a cluster protocol + ~1500 Cassandra
nodes. 50% variance in daily traffic volume. Ideal use-case for Auto Scale

THE GENERAL CASE - CELL / SHARD / CLUSTER
[Diagram: CloudFormation Stack containing EC2 Node1, EC2 Node2, … EC2 Node-n, with a Health Check]

THE GENERAL CASE - CELL / SHARD / CLUSTER
[Diagram: two cells, each containing EC2 Node1, EC2 Node2, … EC2 Node-n, each with its own Health Check]

STEP 1 – INSTRUMENT THE SCALING METRIC
[Diagram labels: CloudWatch Custom Metric (Number of Users), CloudWatch Alarm ScaleUp, CloudWatch Alarm ScaleDown]
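
One way to picture Step 1 is as the request parameters for the CloudWatch PutMetricData and PutMetricAlarm calls. The sketch below only builds those parameter dictionaries; the namespace, metric name, thresholds, periods, and the scaling policy ARN are hypothetical values chosen for illustration, and `cloudwatch` is assumed to be a client such as the one returned by `boto3.client("cloudwatch")`.

```python
NAMESPACE = "Example/Clusters"   # hypothetical custom namespace
METRIC_NAME = "NumberOfUsers"    # hypothetical name for the "Number of Users" metric


def metric_datum(users):
    """Parameters for publishing one sample of the custom metric."""
    return {
        "Namespace": NAMESPACE,
        "MetricData": [
            {"MetricName": METRIC_NAME, "Value": float(users), "Unit": "Count"}
        ],
    }


def alarm_params(name, threshold, comparison, policy_arn):
    """Parameters for a ScaleUp or ScaleDown alarm on the custom metric."""
    return {
        "AlarmName": name,
        "Namespace": NAMESPACE,
        "MetricName": METRIC_NAME,
        "Statistic": "Average",
        "Period": 300,                 # assumed 5-minute evaluation window
        "EvaluationPeriods": 2,        # assumed: two breaching periods in a row
        "Threshold": float(threshold),
        "ComparisonOperator": comparison,
        "AlarmActions": [policy_arn],  # scaling policy the alarm triggers
    }


def publish(cloudwatch, users):
    # `cloudwatch` is an injected client, e.g. boto3.client("cloudwatch")
    cloudwatch.put_metric_data(**metric_datum(users))


def create_alarms(cloudwatch, up_policy_arn, down_policy_arn,
                  up_threshold=1000, down_threshold=200):
    cloudwatch.put_metric_alarm(**alarm_params(
        "ScaleUp", up_threshold, "GreaterThanThreshold", up_policy_arn))
    cloudwatch.put_metric_alarm(**alarm_params(
        "ScaleDown", down_threshold, "LessThanThreshold", down_policy_arn))
```

Keeping the parameter construction separate from the client calls makes the thresholds easy to review and test without touching AWS.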

OPTION 1 – USE LAMBDA
[Diagram labels: ScaleUp and ScaleDown alarms on Number of Users, SNS ×2, Build Lambda, Teardown Lambda, CloudFormation]

WHY NOT LAMBDA?
• Duplication of AWS engineering investment
• Ongoing cost to maintain cadence with the growing features of Auto Scaling
– Scheduled Scaling
– Percentile Scaling
– Machine Learning Scaling / Predictive Scaling
• Lambda still needs a state machine
• We don’t have healing

There are strong fiscal and complexity
motivators to use native ASGs

STEP 2 – “SHADOW” ASG
[Diagram labels: ScaleUp and ScaleDown alarms on Number of Users, Shadow ASG containing Shadow ×3]

STEP 3 – ADD THE CFN LAMBDAS
[Diagram: ScaleUp and ScaleDown alarms on Number of Users -> Shadow ASG -> Auto Scaling SNS; EC2_INSTANCE_LAUNCH -> Create Stack; EC2_INSTANCE_TERMINATE -> Delete Stack]
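
The launch/terminate mapping in Step 3 might look something like the Lambda sketch below. It assumes the Auto Scaling notification arrives through SNS with an `Event` field of `autoscaling:EC2_INSTANCE_LAUNCH` or `autoscaling:EC2_INSTANCE_TERMINATE`; the stack naming convention, the template URL, and the `make_handler` wiring are hypothetical, and `cfn` is assumed to be a CloudFormation client such as `boto3.client("cloudformation")`.

```python
import json

TEMPLATE_URL = "https://example.com/cell-template.json"  # hypothetical template location


def stack_name_for(instance_id):
    # Hypothetical convention: one stack per shadow instance
    return "cell-" + instance_id


def plan_action(sns_event):
    """Map an Auto Scaling SNS notification to a (action, stack_name) pair."""
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    event_type = message.get("Event", "")
    instance_id = message.get("EC2InstanceId")
    if instance_id is None:
        return ("ignore", None)
    if event_type == "autoscaling:EC2_INSTANCE_LAUNCH":
        return ("create", stack_name_for(instance_id))
    if event_type == "autoscaling:EC2_INSTANCE_TERMINATE":
        return ("delete", stack_name_for(instance_id))
    return ("ignore", None)   # e.g. test notifications or launch errors


def make_handler(cfn, template_url=TEMPLATE_URL):
    """Build a Lambda-style handler around an injected CloudFormation client."""
    def handler(event, context):
        action, stack = plan_action(event)
        if action == "create":
            cfn.create_stack(StackName=stack, TemplateURL=template_url)
        elif action == "delete":
            cfn.delete_stack(StackName=stack)
        return action, stack
    return handler
```

Because each stack is keyed to a shadow instance ID, terminating a shadow instance tears down exactly the cluster that was created for it.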

$5.76 per month per stack
(Unoptimized)

STEP 4 – HEALTH CHECK THE CLUSTERS
[Diagram labels: ScaleUp and ScaleDown alarms on Number of Users, Shadow ASG, Auto Scaling SNS, EC2_INSTANCE_LAUNCH, Create Stack, Delete Stack]

HEALING SCENARIO 1
1. Cluster fails
2. Shadow terminated
3. ASG is impacted (Shadow ASG desired: 3, actual: 2)
4. Cluster restored
[Diagram labels, repeated on each step: ScaleUp and ScaleDown alarms on Number of Users, Shadow ASG, Auto Scaling SNS, EC2_INSTANCE_LAUNCH, Create Stack, Delete Stack]
Continuous Delivery for Clusters
Blue/Green Updates for Clusters at Huge Scale

CONTINUOUS DELIVERY FOR CLUSTERS
• Using nothing but the ASG capacity, blue/green roll clusters of almost any size
• Increment ASG in V2.0, wait for health check, decrement ASG in V1.0
[Diagram labels: clusters in V1.0, clusters in V2.0]
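
The increment, wait, decrement loop can be sketched as a small function. This is an illustration under stated assumptions: the capacity setters and the health probe are injected callables (in practice they might wrap the Auto Scaling SetDesiredCapacity API and the cluster health check), and the rollback of a failed increment is a design choice of this sketch rather than something the slides specify.

```python
def blue_green_roll(old_capacity, set_old, set_new, new_is_healthy,
                    max_checks=60, wait=lambda: None):
    """Shift capacity one cluster at a time from the V1.0 ASG to the V2.0 ASG.

    set_old / set_new: callables that set the desired capacity of each ASG.
    new_is_healthy: callable taking the V2.0 capacity, True once it is healthy.
    wait: callable invoked between health probes (e.g. a sleep).
    """
    new_capacity = 0
    while old_capacity > 0:
        # Increment ASG in V2.0
        new_capacity += 1
        set_new(new_capacity)

        # Wait for health check
        for _ in range(max_checks):
            if new_is_healthy(new_capacity):
                break
            wait()
        else:
            # Assumed policy: undo the failed increment and leave V1.0 untouched
            set_new(new_capacity - 1)
            raise RuntimeError("V2.0 cluster did not become healthy")

        # Decrement ASG in V1.0
        old_capacity -= 1
        set_old(old_capacity)
    return new_capacity
```

Because V1.0 is only decremented after the matching V2.0 cluster is healthy, total healthy capacity never drops below the starting level during the roll.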

AUTO SCALE ANYTHING
• Solution works with many non-scaling AWS services
• CloudFormation can use Custom Resources to create almost anything
• The “Shadow” system only needs the scaling alarms from any CloudWatch metric and a
health check endpoint. Decoupled and does not interact with the system in any way.

CUSTOM SCALING EXAMPLES
Each example pairs a scaling metric (ScaleUp Alarm / ScaleDown Alarm -> SNS -> Lambda -> CloudFormation, alongside a Shadow ASG) with the unit being scaled:
• Database Throughput -> RDS Read Slave
• Number of user sign-ups/logins -> Application Shard
• CPU/Memory -> VMware Node
• CPU/Memory -> Other infrastructure platforms
• Number of items in the queue -> Life Sciences Application
• Number of planes currently in the air -> Flight Analysis Stack
• Number of door entries -> Trading Stack
• Order Volume -> Number of robots on station

Find Out MORE:
Visit Us: At our booth or online – www.sourcedgroup.com
Careers: www.sourcedgroup.com/careers
In the news:
• Computerworld (2016):
• Foreign Exchange Service OFX Embarks on Cloud Migration
• Connecting the Australian Channel (2015):
• Meet the Partner who took Qantas to the AWS Cloud
• The Australian Business Review (2015):
• Greater Buying Power lets Aussie bank on Adobe Experience Manager
Our Awards:
• AWS – Sydney Partners Summit - Invent & Simplify (2015)
• AWS – Global - Customer Obsessed Partner (2014)
