2. AWS Cost Reduction Plan and Targets
• Current AWS Spend (all services) is just above $490k/yr.
• The current AWS technology choices, implementations
and practices result in significant overspend, which is an
opportunity to reduce costs with no loss of features or
endangerment of SLAs
• Reductions will be iterative starting in Q1 with most
reductions implemented in Q2 and Q3
• Reductions month over month can be validated and
demonstrated
• Target is approximately $250k (>50%) in full-year
operational expense reductions
3. AWS Cost Reduction Plan
Changes will focus on five specific areas
• Autoscaling Environments
• Managing Dev and Production Instance Usage
• Using Spot Instances for ML Model Training
• Switch Model Builds to Serverless Technologies
• Control S3 Storage and Implement Data Life Cycles
…and one Architecture Roadmap item
• Introduce Architecture Changes - Serverless
4. Autoscaling Environments
Observation:
• Current non-Open Enrollment (OE) daytime utilization is between 1% and 5%, with
occasional very transient peaks to 15%.
Cost Reduction Approach:
• Use time based and Utilization Threshold triggers during OE to ensure capacity
and SLA compliance and always maintain n+1 HA configuration
• Reduce to ~10% of current levels for weekday evening and weekend (with
Autoscaling for unexpected demand spikes)
• Increase size and number of instances to match expected load during the
workday (with Autoscaling for unexpected demand spikes)
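The time-based part of this approach could be sketched as two recurring scheduled actions. This is a minimal sketch, not the deployed configuration: the group name, hours, and sizes are hypothetical, "n+1" is assumed here to mean never fewer than two instances, and the dictionaries are shaped as keyword arguments for boto3's `put_scheduled_update_group_action` call (which is not invoked here).

```python
from typing import Dict, List


def build_scheduled_actions(group_name: str, workday_size: int,
                            offhours_size: int, max_size: int) -> List[Dict]:
    """Build keyword arguments for two recurring scheduled scaling actions.

    All names and hours are illustrative assumptions, not values from the plan.
    """
    # Assumption: n+1 HA is interpreted as "never fewer than two instances".
    offhours = max(offhours_size, 2)
    return [
        {
            "AutoScalingGroupName": group_name,
            "ScheduledActionName": "workday-scale-up",
            "Recurrence": "0 7 * * 1-5",   # cron: 07:00 Monday to Friday
            "MinSize": workday_size,
            "MaxSize": max_size,           # headroom for load-based scaling
            "DesiredCapacity": workday_size,
        },
        {
            "AutoScalingGroupName": group_name,
            "ScheduledActionName": "evening-scale-down",
            "Recurrence": "0 19 * * 1-5",  # cron: 19:00 Monday to Friday
            "MinSize": offhours,
            "MaxSize": max_size,
            "DesiredCapacity": offhours,
        },
    ]


actions = build_scheduled_actions("web-asg", workday_size=4,
                                  offhours_size=1, max_size=8)
```

Because the scale-up action only fires on weekdays, the Friday evening scale-down stays in effect through the weekend. Utilization-threshold scaling would be configured separately as a scaling policy.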
5. Autoscaling Environments - Details
• Current deployment has 4 high compute capacity EC2
instances running 24x7x365
• The current capacity exceeds the maximum load demand for peak
usage periods (OE periods) by a factor of almost 2:1
• During non-peak usage periods, the over provisioning for max load
is closer to 20:1 with an average of 50:1 over provisioning
• AWS Autoscaling can allow DevOps to define time of
day/time of year based provisioning targets with load
based scale up and down thresholds
• Rough scale and impact of these changes are
illustrated on following slides
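As a rough illustration of what the over-provisioning ratios above imply, the sketch below estimates an instance count from a ratio. It assumes identically sized instances, load that scales linearly with instance count, and one spare instance for n+1; none of these assumptions are stated in the plan.

```python
import math


def required_instances(current_instances: int, overprovision_ratio: float,
                       spare: int = 1) -> int:
    """Estimate instances needed to carry the load, plus spare capacity.

    overprovision_ratio is current capacity divided by actual load,
    e.g. 2.0 for the 2:1 peak figure or 20.0 for the off-peak figure.
    """
    needed = math.ceil(current_instances / overprovision_ratio)
    return max(needed, 1) + spare


peak = required_instances(4, 2.0)       # OE peak, about 2:1 over-provisioned
off_peak = required_instances(4, 20.0)  # non-peak, about 20:1 over-provisioned
```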
10. Managing Dev and Production
Instance Usage
Observation:
• All environments run 24x7x365 and deployments do not reflect usage patterns
Cost Reduction Approach:
• Turn off Dev, Staging & UAT overnight and on weekends - about 65% reduction, as
stopped instances do not accrue billing (a week has 14 12-hr periods; environments
should be on only for the 5 weekday daytime periods, so they are off for 9 of 14)
• We will provide capability to turn on & shut down evening and weekends as needed
• This process is mostly scriptable so startup and shutdown will be fast, error free and not impede
development
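The off-hours arithmetic and the on/off decision a start/stop script would make can be sketched as follows. The 07:00 to 19:00 window is an assumed definition of "weekday daytime", not a value from the plan.

```python
from datetime import datetime

PERIODS_PER_WEEK = 14  # 7 days x two 12-hour periods
ON_PERIODS = 5         # weekday daytime only


def off_fraction() -> float:
    """Fraction of the week the non-production environments are stopped."""
    return (PERIODS_PER_WEEK - ON_PERIODS) / PERIODS_PER_WEEK


def should_be_running(now: datetime, start_hour: int = 7,
                      end_hour: int = 19) -> bool:
    """True during the assumed weekday daytime window, False otherwise."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start_hour <= now.hour < end_hour
```

Nine of fourteen periods off is roughly 64%, which is consistent with the approximately 65% reduction quoted above.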
11. Managing Dev and Production
Instance Usage (2)
Observation:
• Based on usage pattern, most cost effective EC2 products are not deployed
Cost Reduction Approach:
• Use AWS Reserved Instances (RIs; class of RI: 1-year term, all upfront) for RDS and
other “baseline” services
• Cost savings of about 45% to 60% based on instance Region, Type and Size
12. Using Spot instances for ML Model
Training
Observation:
• Machine Learning and Model Training do not use the most cost effective EC2 Products
Cost Reduction Approach:
• Significant cost savings will result from a simple configuration change to use “Spot
Instances” instead of “On-Demand Instances”
• Effort to switch only involves setting the option to use Spot Instances in a configuration
file for ML model training and related jobs
• Spot Instances average 75% less per hour than On-Demand Instances, which equates to an
$8k/month savings
• This has already been implemented and is delivering a $4k/month immediate cost
savings
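As a sanity check on the Spot figures, the sketch below back-calculates the On-Demand spend implied by a given monthly saving and discount. It assumes the 75% average discount applies uniformly to the whole training workload, which is a simplification.

```python
def implied_on_demand_spend(monthly_saving_k: float, discount: float) -> float:
    """On-Demand spend (in $k/month) implied by a saving at a given discount."""
    return monthly_saving_k / discount


def spot_spend(on_demand_k: float, discount: float) -> float:
    """Remaining spend (in $k/month) after moving the workload to Spot."""
    return on_demand_k * (1 - discount)


on_demand_k = implied_on_demand_spend(8.0, 0.75)  # about $10.7k/month
remaining_k = spot_spend(on_demand_k, 0.75)       # about $2.7k/month
```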
13. Switch Model Builds to Serverless
Technologies
Observation:
• Process and AWS Products used in Machine Learning and Model Training require
significant additional 3rd party product licensing
Cost Reduction Approach:
• Investigate replacing DataBricks with PyWren or AWS SageMaker.
• Switch will result in minimal cost of AWS usage for model builds and training
• Given an estimate of $40k for development costs to re-engineer process to eliminate
DataBricks, there would be an overall cost reduction of $100k/yr in DataBricks licensing
costs.
• Payback period with the elimination of DataBricks licensing would be about 5 months, and
Picwell will also migrate off the problematic DataBricks product.
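The payback figure follows directly from the two estimates above; a short worked calculation, using only the $40k development cost and the $100k/yr licensing reduction:

```python
def payback_months(one_time_cost_k: float, annual_saving_k: float) -> float:
    """Months of savings needed to recover a one-time cost."""
    monthly_saving_k = annual_saving_k / 12
    return one_time_cost_k / monthly_saving_k


months = payback_months(40, 100)  # 4.8 months, i.e. about 5
```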
14. Control S3 Use and Implement Data
Life Cycles
Observation:
• S3 usage is out of control. Volume of data in S3 has been growing unchecked and is now over
102 Terabytes
• Initial attempts to task individuals with S3 reduction in Sprint work have met with limited
success
• Current “S3 Standard” costs are about $30k/yr, should be closer to $10k/yr
Cost Reduction Approach:
• Plan S3 “Data Party”: Give staff time to review their S3 buckets. Plan meeting to walk through
all buckets and set disposition of all content (Delete or move to IA or Glacier)
• Implement Mandatory Data Life Cycles for all S3 data. Provide tool to monitor large buckets
(i.e., any bucket with a total volume > 1 TB) and publish via Slackbot in an appropriate channel
• Educate staff on use of AWS S3 Glacier for long term (i.e. compliance need based) storage and
define data maintenance lifecycle (i.e. set delete dates where appropriate based on legal or
contractual obligations)
• AWS S3 Glacier cost is 75% lower than AWS S3 Standard storage
• Educate staff on use of S3 Standard-IA (Infrequent Access) data maintenance lifecycle where
appropriate
• AWS Standard-IA cost is 50% lower than AWS S3 Standard storage
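A mandatory life cycle and the large-bucket check could be sketched as below. This is an illustration under assumptions: the 30-day and 90-day transition ages, the rule ID, and the bucket names are hypothetical, and the returned dictionary is shaped for the `LifecycleConfiguration` argument of boto3's `put_bucket_lifecycle_configuration` (not called here). How bucket sizes are collected and how the Slackbot posts them is left out.

```python
from typing import Dict, List, Optional

TB = 10 ** 12  # bytes; the 1 TB monitoring threshold from the plan


def lifecycle_rules(ia_after_days: int = 30, glacier_after_days: int = 90,
                    delete_after_days: Optional[int] = None) -> Dict:
    """Build a whole-bucket life cycle: Standard -> Standard-IA -> Glacier.

    The day counts are illustrative assumptions, not values from the plan.
    """
    rule = {
        "ID": "default-data-life-cycle",  # hypothetical rule name
        "Filter": {"Prefix": ""},         # empty prefix: applies to all objects
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }
    # Only set a delete date where legal or contractual terms allow it.
    if delete_after_days is not None:
        rule["Expiration"] = {"Days": delete_after_days}
    return {"Rules": [rule]}


def large_buckets(sizes_bytes: Dict[str, int],
                  threshold_bytes: int = TB) -> List[str]:
    """Names of buckets whose total volume exceeds the threshold."""
    return sorted(name for name, size in sizes_bytes.items()
                  if size > threshold_bytes)
```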
15. Expected Savings
                                                    Monthly Target (k)   2018 Target (k)   Full Year (k)
Autoscaling Environments                            $8                   $16               $96
Managing Dev and Production Instance Usage          $7                   $35               $84
Using Spot Instances for ML Model Training          $4                   $36               $48
Switch Model Builds to Serverless Technologies*     $0.5                 $1.5              $6
Control S3 Storage and Implement Data Life Cycles   $1.5                 $10.5             $18
Totals                                                                   $99               $252
* An additional $100k/yr reduction in elimination of DataBricks licensing will be realized but
will require $40k in Engineering spend to realize savings (i.e. positive ROI in about 5
months after completing work)
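The totals can be checked by summing each column of the savings rows. The figures below are copied from the Expected Savings rows (in $k); only the variable and function names are introduced for illustration.

```python
# (monthly target, 2018 target, full year), all in $k
SAVINGS_K = {
    "Autoscaling Environments": (8, 16, 96),
    "Managing Dev and Production Instance Usage": (7, 35, 84),
    "Using Spot Instances for ML Model Training": (4, 36, 48),
    "Switch Model Builds to Serverless Technologies": (0.5, 1.5, 6),
    "Control S3 Storage and Implement Data Life Cycles": (1.5, 10.5, 18),
}


def column_totals(rows):
    """Sum each column across all rows."""
    return tuple(sum(column) for column in zip(*rows.values()))


monthly_k, target_2018_k, full_year_k = column_totals(SAVINGS_K)
```

The 2018 and full-year sums match the stated totals of $99k and $252k. The monthly column sums to $21k, a total the slide does not state.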
16. Introduce Architecture Changes -
Serverless
Observations:
• Current “traditional” architecture approach can scale but at significantly higher cost than Serverless
architectures
• With Serverless architecture, DevOps and other labor requirements grow at a significantly lower
rate than usage
• Multi-region high availability architectures are significantly easier to maintain and grow under Serverless
• The company’s applications will have a linear and predictable cost growth rate.
• API usage pricing can be implemented and managed
• Lower operational expense – 75% to as much as 90% lower for same throughput
• Most code examined will need few changes to adapt to Serverless deployments. The DevOps changes are more involved
• The Serverless IaaS and PaaS basis is more agile and responds to demand spikes and outages more
effectively than other reference architectures
Cost Reduction Approach:
• This is a very complex topic that will be the focus of another architecture working session
AWS Load Balancers can support asymmetric load balancing (between machines with different load capacities or “types” and “sizes”), but using EC2 instances of all the same type and size reduces complexity and opportunities for configuration errors