Microservices are an architectural approach in which a complex application is decomposed into smaller, independent services. Containers are well suited to running small, decoupled services, but how do you coordinate microservices in production at scale, and which AWS services do you use?
In this session, we explore the reasoning and concepts behind microservices and how containers simplify building microservices-based applications. We also demonstrate how to launch microservices on Amazon EC2 Container Service and how to use ELB and Route 53 for service discovery between microservices.
16. Expedia’s ECS Base AMI
• Based on Amazon’s ECS Optimized AMI
• e.g. “amzn-ami-2016.03.b-amazon-ecs-optimized”
• CloudFormation userdata runs at launch time to set up:
• OS Hardening
• Security
• Network configuration
• Log forwarding
• Cron job: Push EC2 statistics and custom metrics
• Run ‘cadvisor’ and ‘docker-cleanup’ as ECS Tasks on each instance (using ‘start-task’)
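As a rough illustration of the last bullet, the sketch below starts one copy of each helper task on a specific container instance. It assumes `ecs` behaves like `boto3.client("ecs")`; the function name, the `DAEMON_TASKS` list, and the use of the task names as task definition families are hypothetical and not taken from the slides.

```python
# Hypothetical names: DAEMON_TASKS and start_daemon_tasks are illustrative only.
DAEMON_TASKS = ["cadvisor", "docker-cleanup"]


def start_daemon_tasks(ecs, cluster, container_instance_arn,
                       task_definitions=DAEMON_TASKS):
    """Start one copy of each helper task on a specific container instance.

    `ecs` is assumed to behave like boto3.client("ecs"). start_task places the
    task on the named instance instead of letting the scheduler choose one.
    """
    started = []
    for task_definition in task_definitions:
        response = ecs.start_task(
            cluster=cluster,
            taskDefinition=task_definition,
            containerInstances=[container_instance_arn],
        )
        # Placement problems (for example, insufficient memory) come back
        # in the "failures" list rather than as an exception.
        if response.get("failures"):
            raise RuntimeError(
                f"could not start {task_definition}: {response['failures']}"
            )
        started.extend(task["taskArn"] for task in response.get("tasks", []))
    return started
```

Passing the client in as a parameter keeps the function easy to exercise with a stub before pointing it at a real cluster.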
17. Zero-Downtime Instance Replacement
• Uses a Lambda to avoid outages in production during a cluster instance rolling update
• Lambda is triggered by AutoScaling EC2_INSTANCE_TERMINATE SNS events
• Lambda deregisters the instance from the ECS cluster
• Lambda also sends a heartbeat to the ASG to keep the instance in the Terminating:Wait state for 30 minutes
• This is generally enough time for ECS to reschedule any tasks that are part of a service onto another instance
• Downsides:
• Tasks can get rescheduled to another old instance in the ASG that is about to be replaced, so tasks can get bumped from instance to instance until all instances are replaced
• 30 minutes is a long time for old containers to stay registered in the services' ELBs. Any deploys during that time can cause confusion about why old and new versions of a service are running behind the ELB
• The ECS agent pulls Docker images serially, so launching a batch of new tasks can take a while
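The Lambda flow described on this slide might look roughly like the sketch below. It is not Expedia's code: it assumes the SNS message is an Auto Scaling lifecycle hook notification (with `LifecycleTransition`, `EC2InstanceId`, and related keys), that `ecs` and `autoscaling` behave like the corresponding boto3 clients, and that the cluster has at most 100 instances so pagination can be skipped. The function names are hypothetical.

```python
import json


def find_container_instance(ecs, cluster, ec2_instance_id):
    """Return the ECS container instance ARN for an EC2 instance ID, or None."""
    # Pagination is omitted: this assumes at most 100 instances in the cluster.
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    if not arns:
        return None
    described = ecs.describe_container_instances(
        cluster=cluster, containerInstances=arns
    )
    for instance in described["containerInstances"]:
        if instance["ec2InstanceId"] == ec2_instance_id:
            return instance["containerInstanceArn"]
    return None


def handle_termination(event, ecs, autoscaling, cluster):
    """Deregister a terminating instance from ECS and extend its wait state."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    # Assumed message shape: a lifecycle hook notification for termination.
    if message.get("LifecycleTransition") != "autoscaling:EC2_INSTANCE_TERMINATING":
        return None

    instance_id = message["EC2InstanceId"]
    arn = find_container_instance(ecs, cluster, instance_id)
    if arn is not None:
        # force=True deregisters even while tasks are still running, which
        # lets ECS reschedule service tasks onto other instances.
        ecs.deregister_container_instance(
            cluster=cluster, containerInstance=arn, force=True
        )

    # One heartbeat extends the Terminating:Wait timeout once; how the full
    # 30 minutes is reached is not specified in the slides.
    autoscaling.record_lifecycle_action_heartbeat(
        LifecycleHookName=message["LifecycleHookName"],
        AutoScalingGroupName=message["AutoScalingGroupName"],
        LifecycleActionToken=message["LifecycleActionToken"],
        InstanceId=instance_id,
    )
    return {"instance": instance_id, "deregistered": arn}
```

In a real Lambda, a thin `lambda_handler(event, context)` wrapper would create the two clients and pass them in along with the cluster name.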
39. Detecting & Remediating Broken Instances?
• Custom CloudWatch metrics
• How long does “docker images” take? Alarm if longer than 4 seconds for 5 minutes
• How long does “docker ps” take? Alarm if longer than 4 seconds for 5 minutes
• Is the ECS agent running? Alarm if not for 5 minutes
• Manual remediation based on email alert
• Run “evict_instance” script
• Terminates instance via ASG – allows Lambda to deregister and pause termination
• aws autoscaling terminate-instance-in-auto-scaling-group --region $REGION --instance-id $INSTANCE_ID --no-should-decrement-desired-capacity
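The timing checks above can be sketched locally as follows. This is only an illustration of the rule "slower than 4 seconds for 5 minutes": it assumes one sample per minute, and in practice CloudWatch would evaluate the alarm itself once the timings are published as a custom metric. The function names and the 30-second timeout are hypothetical.

```python
import subprocess
import time


def time_command(argv, timeout=30.0):
    """Return wall-clock seconds a command took; a hang counts as `timeout`."""
    started = time.monotonic()
    try:
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # A hung Docker daemon is exactly the case the alarm is meant to catch.
        return timeout
    return time.monotonic() - started


def should_alarm(samples, threshold=4.0, periods=5):
    """True when the most recent `periods` samples all exceed `threshold`."""
    recent = samples[-periods:]
    return len(recent) == periods and all(value > threshold for value in recent)
```

A cron job could call `time_command(["docker", "ps"])` once a minute and publish the result; `should_alarm` mirrors what the alarm would conclude from the last five samples.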
41. Auto-Scaling ECS Host Instances
• Scale Up:
• CPU Reservation across entire cluster > 70% for 5 minutes
or
• Memory Reservation across entire cluster > 60% for 5 minutes
• Scale Down:
• CPU Reservation < 20% for 5 minutes
or
• Memory Reservation < 40% for 5 minutes
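The thresholds above can be written as a small decision function. This is a sketch, not the actual alarm configuration: it assumes the inputs are cluster-wide reservation percentages that have already held for 5 minutes, and it gives scale-up precedence when both rules match, which the slide does not specify.

```python
def scaling_decision(cpu_reservation, memory_reservation):
    """Map cluster-wide reservation percentages to a scaling action.

    Inputs are assumed to be values sustained for 5 minutes.
    """
    # Scale up if either resource is running short.
    if cpu_reservation > 70 or memory_reservation > 60:
        return "scale_up"
    # Scale down if either resource is mostly idle. Because this is an "or",
    # one idle resource is enough even when the other is moderately used.
    if cpu_reservation < 20 or memory_reservation < 40:
        return "scale_down"
    return "hold"
```

Note that the two rules can overlap: CPU at 80% with memory at 30% satisfies both the scale-up and the scale-down condition, so separate alarms would both fire. The sketch resolves this in favor of scaling up.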
42. Lessons Learnt
• Use Immutable Servers with CloudFormation
• Suspend ASG Processes During CFN Rolling Update
• Scale Down Carefully
43. Future Work
• Auto Scaling at task level
• Bulk Instance Replacement
• Workload Profiles
• Treat Clusters as Cattle