Five years of EC2
          distilled
                  Grig Gheorghiu

Silicon Valley Cloud Computing Meetup, Feb. 19th 2013

                     @griggheo
             agiletesting.blogspot.com
whoami

• Dir of Technology at Reliam (managed
  hosting)
• Sr Sys Architect at OpenX
• VP Technical Ops at Evite
• VP Technical Ops at Nasty Gal
EC2 creds

• Started with personal m1.small instance in
    2008
• Still around!
• UPTIME:
•   5:13:52 up 438 days, 23:33,   1 user,   load average:
    0.03, 0.09, 0.08
EC2 at OpenX
• end of 2008
• 100s then 1000s of instances
• one of largest AWS customers at the time
• NAMING is very important
 • terminated DB server by mistake
 • in ideal world naming doesn’t matter
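One way to guard against this kind of mistake is to have the termination tool check the instance name first. The following is a minimal sketch only: the naming scheme `<env>-<role>-<nn>` (for example `prod-db-01`), the protected-role list and the `can_terminate` helper are all hypothetical and not from the talk.

```python
import re

# Hypothetical naming scheme: <env>-<role>-<two-digit index>, e.g. "prod-db-01".
NAME_PATTERN = re.compile(r"^(?P<env>[a-z]+)-(?P<role>[a-z]+)-(?P<index>\d{2})$")

# Hypothetical roles that must never be terminated without an explicit override.
PROTECTED_ROLES = {"db"}


def can_terminate(name, force=False):
    """Return True if an instance with this name may be terminated."""
    match = NAME_PATTERN.match(name)
    if match is None:
        # Names that do not follow the scheme are treated as unsafe,
        # because we cannot tell what the instance runs.
        return force
    if match.group("role") in PROTECTED_ROLES:
        return force
    return True
```

A management script would call `can_terminate(name)` before issuing the real terminate request and require `force=True` for anything protected.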
EC2 at OpenX (cont.)
• Failures are very frequent at scale
• Forced to architect for failure and
  horizontal scaling
• Hard to scale at all layers at the same time
  (scaling app server layer can overwhelm DB
  layer; play whack-a-mole)
• Elasticity: easier to scale out than scale back
EC2 at OpenX (cont.)
• Automation and configuration management
  become critical
 • Used little-known tool - ‘slack’
 • Rolled own EC2 management tool in
    Python, wrapped around EC2 Java API
 • Testing deployments is critical (one
    mistake can get propagated everywhere)
EC2 at OpenX (cont.)
• Hard to scale at the DB layer (MySQL)
 • mysql-proxy for r/w split
 • slaves behind HAProxy for reads
• HAProxy for LB, then ELB
 • ELB melted initially, had to be gradually
    warmed up
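The read/write split can be illustrated with a small router. This is a sketch of the idea only, not how mysql-proxy or HAProxy work internally; the `ReadWriteRouter` class and the host names in the usage note are hypothetical.

```python
import itertools


class ReadWriteRouter:
    """Route read statements to replicas (round-robin), everything else to the master."""

    # Assumption: statements starting with these keywords are safe to run on a replica.
    READ_PREFIXES = ("select", "show")

    def __init__(self, master, replicas):
        self.master = master
        # Fall back to the master when no replicas are configured.
        self._replicas = itertools.cycle(replicas or [master])

    def route(self, sql):
        statement = sql.lstrip().lower()
        if statement.startswith(self.READ_PREFIXES):
            return next(self._replicas)
        return self.master
```

For example, `ReadWriteRouter("db-master", ["db-slave-1", "db-slave-2"]).route("SELECT 1")` returns `"db-slave-1"`. A real deployment would also need to handle transactions, `SELECT ... FOR UPDATE` and replication lag, which this sketch ignores.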
EC2 at Evite

• Sharded MySQL at DB layer; application
  very write-intensive
• Didn’t do proper capacity planning/dark
  launching; had to move quickly from data
  center to EC2 to scale horizontally
• Engaged Percona at the same time
EC2 at Evite (cont.)
• Started with EBS volumes (separate for
  data, transaction logs, temp files)
• EBS horror stories
• CPU I/O wait up to 100%, instances AWOL
• I/O very inconsistent, unpredictable
• Striped EBS volumes in RAID0 helps with
  performance but not with reliability
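The reliability point about RAID0 follows from simple arithmetic: a stripe is lost as soon as any one member volume is lost. A rough sketch, assuming volumes fail independently with the same probability (an optimistic assumption, since failures of shared storage tend to be correlated); the function name and the numbers in the usage note are illustrative, not from the talk.

```python
def raid0_failure_probability(per_volume_failure, volumes):
    """Probability that at least one of `volumes` independent volumes fails."""
    if not 0.0 <= per_volume_failure <= 1.0:
        raise ValueError("per_volume_failure must be between 0 and 1")
    if volumes < 1:
        raise ValueError("volumes must be at least 1")
    # The stripe survives only if every member volume survives.
    return 1.0 - (1.0 - per_volume_failure) ** volumes
```

With an illustrative per-volume failure probability of 1%, a 4-volume stripe fails with a probability of about 3.9%, so striping roughly multiplies the risk by the number of volumes.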
EC2 at Evite (cont.)
•   EBS apocalypse in April 2011

•   Hit us even with masters and slaves in diff.
    availability zones (but all in single region -
    mistake!)

•   IMPORTANT: rebuilding redundancy into your
    system is HARD

•   For DB servers, reloading data on new server is
    a lengthy process
EC2 at Evite (cont.)
• General operation: very frequent failures
  (once a week); nightmare for pager duty
• Got very good at disaster recovery!
  •   Failover of master to slave

  •   Rebuilding of slave from master (xtrabackup)

• Local disks striped in RAID0 better than
  EBS
EC2 at Evite (cont.)
• Ended up moving DB servers back to data
  center
• Bare metal (Dell C2100, 144 GB RAM,
  RAID10); 2 MySQL instances per server
• Lots of tuning help from Percona
• BUT: EC2 was great for capacity planning!
  (Zynga does the same)
EC2 at Evite (cont.)
• Relational databases are not ready for the
  cloud (reliability, I/O performance)
• Still keep MySQL slaves in EC2 for DR
• Ryan Mack (Facebook): "We chose well-
  understood technologies so we could better
  predict capacity needs and rely on our existing
  monitoring and operational tool kits."
EC2 at Evite (cont.)
• Didn’t use provisioned IOPS for EBS
• Didn’t use VPC
• Great experience with Elastic MapReduce,
  S3, Route 53 DNS
• Not so great experience with DynamoDB
• ELB OK but still need HAProxy behind it
EC2 at Nasty Gal
• VPC - really good idea!
 • Extension of data center infrastructure
 • Currently using it for dev/staging + some
    internal backend production
 • Challenging to set up VPN tunnels to
    various firewall vendors (Cisco, Fortinet)
    - not much debugging on VPC side
Interacting with AWS
• AWS API (mostly Java based, but also Ruby
  and Python)
• Multi-cloud libraries: jclouds (Java), libcloud
  (Python), deltacloud (Ruby)
• Chef knife
• Vagrant EC2 provider
• Roll your own
Proper infrastructure care
              and feeding
• Monitoring - alerting, logging, graphing
• It’s not in production if it’s not monitored
  and graphed
• Monitoring is for ops what testing is for
  dev
  • Great way to learn a new infrastructure
  • Dev and ops on pager
Proper infrastructure care
       and feeding
• Going from #monitoringsucks to
  #monitoringlove and @monitorama
• Modern monitoring/graphing/logging tools
 • Sensu, Graphite, Boundary, Server
    Density, New Relic, Papertrail, Pingdom,
    Dead Man’s Snitch
Proper infrastructure care
       and feeding
•   Dashboards!

•   Mission Control page with graphs based on
    Graphite and Google Visualization API

•   Correlate spikes and dips in graphs with errors
    (external and internal monitoring)

    •   Akamai HTTP 500 alerts correlated with Web
        server 500 errors and DB server I/O wait
        increase
Proper infrastructure care
       and feeding



•   HTTP 500 errors as a percentage of all HTTP
    requests across all app servers in the last 60
    minutes
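The metric on this graph can be computed from per-server counters. A minimal sketch, assuming each app server reports its total request count and HTTP 500 count for the same 60-minute window; the function name and input shape are hypothetical.

```python
def http_500_percentage(per_server_counts):
    """Percentage of HTTP 500 responses across all servers.

    `per_server_counts` maps a server name to a (total_requests, http_500_count)
    tuple covering the same time window, e.g. the last 60 minutes.
    """
    total = sum(requests for requests, _ in per_server_counts.values())
    failed = sum(errors for _, errors in per_server_counts.values())
    if total == 0:
        # No traffic in the window: report 0% rather than dividing by zero.
        return 0.0
    return 100.0 * failed / total
```

The counts are summed before dividing, so busy servers weigh more than idle ones; averaging per-server percentages would give a different and usually misleading number.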
Proper infrastructure care
       and feeding
•   Expect failures and recover quickly

•   Capacity planning
    •   Dark launching

    •   Measure baselines

    •   Correlate external symptoms (HTTP 500) with
        metrics (CPU I/O Wait) then keep metrics
        under certain thresholds by adding resources
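Correlating an external symptom with an internal metric can be sketched with a plain Pearson coefficient over two time series sampled at the same instants. The function names below are hypothetical, and the threshold itself has to come from measured baselines, not from this code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


def needs_more_capacity(io_wait_samples, threshold_percent):
    """True if the most recent I/O wait sample exceeds the chosen threshold."""
    return io_wait_samples[-1] > threshold_percent
```

A coefficient close to 1 between the HTTP 500 rate and CPU I/O wait suggests that I/O wait is a useful leading metric to keep under a threshold; it does not prove causation.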
Proper infrastructure care
       and feeding
•   Automate, automate, automate! - Chef, Puppet,
    CFEngine, Jenkins, Capistrano, Fabric

•   Chef - can be single source of truth for
    infrastructure
    •   Running chef-client continuously on nodes
        requires discipline

    •   Logging into remote node is anti-pattern (hard!)
Proper infrastructure care
       and feeding
•   Chef best practices

    •   Use knife - no snowflakes!

    •   Deploy new nodes, don’t do massive updates
        in place

•   BUT! beware of OS monoculture

    •   kernel bug after 200+ days

    •   leapocalypse
Is the cloud worth the
          hype?
•   It’s a game changer, but it’s not magical; try before
    you buy! (benchmarks could surprise you)

•   Cloud expert? Carry pager or STFU

•   Forces you to think about failure recovery,
    horizontal scalability, automation

•   Something to be said about abstracting away the
    physical network - the most obscure bugs are
    network-related (ARP caching, routing tables)
So...when should I use
      the cloud?
• Great for dev/staging/testing
• Great for layers of infrastructure that
  contain many identical nodes and that are
  forgiving of node failures (web farms,
  Hadoop nodes, distributed databases)
• Not great for ‘snowflake’-type systems
• Not great for RDBMS (esp. write-intensive)
If you still want to use
       the cloud
•   Watch that monthly bill!

•   Use multiple cloud vendors
•   Design your infrastructure to scale horizontally
    and to be portable across cloud vendors

    •   Shared nothing

    •   No SAN, NAS
If you still want to use
       the cloud
•   Don’t get locked into vendor-proprietary
    services
    •   EC2, S3, Route 53, EMR are OK

    •   Data stores are not OK (DynamoDB)

    •   OpsWorks - debatable (based on Chef, but still
        locks you in)

    •   Wrap services in your own RESTful endpoints
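The wrapping idea can be sketched as an interface that application code depends on, with the vendor-specific part hidden behind it. This is an illustration under assumptions: `KeyValueStore`, `InMemoryStore` and `record_rsvp` are hypothetical names, and a RESTful endpoint would sit in front of the same interface.

```python
from abc import ABC, abstractmethod


class KeyValueStore(ABC):
    """Hypothetical storage interface that application code depends on."""

    @abstractmethod
    def get(self, key):
        """Return the stored value, or None if the key is absent."""

    @abstractmethod
    def put(self, key, value):
        """Store a value under the given key."""


class InMemoryStore(KeyValueStore):
    """Stand-in backend; a vendor-specific backend would implement the same methods."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


def record_rsvp(store, guest, answer):
    """Application code sees only the interface, never a vendor SDK."""
    store.put("rsvp:" + guest, answer)
    return store.get("rsvp:" + guest)
```

Switching vendors then means writing one new `KeyValueStore` subclass instead of touching every call site.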
Does EC2 have rivals?
•   No (or at least not yet)
•   Anybody use GCE?
•   Other public clouds are either toys or
    smaller, with fewer features (no names named)
•   Perception matters - not a contender unless
    featured on High Scalability blog
•   APIs matter less (can use multi-cloud libs)
Does EC2 have rivals?
•   OpenStack, CloudStack, Eucalyptus all seem
    promising
•   Good approach: private infrastructure (bare
    metal, private cloud) for performance/
    reliability + extension into public cloud for
    elasticity/agility (EC2 VPC, RackConnect)

• How about PaaS?
 • Personally: too hard to relinquish control

Weitere ähnliche Inhalte

Was ist angesagt?

(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014
(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014
(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014Amazon Web Services
 
Rainbows, Unicorns, and other Fairy Tales in the Land of Serverless Dreams
Rainbows, Unicorns, and other Fairy Tales in the Land of Serverless DreamsRainbows, Unicorns, and other Fairy Tales in the Land of Serverless Dreams
Rainbows, Unicorns, and other Fairy Tales in the Land of Serverless DreamsJosh Carlisle
 
AppScale Talk at SBonRails
AppScale Talk at SBonRailsAppScale Talk at SBonRails
AppScale Talk at SBonRailsChris Bunch
 
AppScale @ LA.rb
AppScale @ LA.rbAppScale @ LA.rb
AppScale @ LA.rbChris Bunch
 
AppScale + Neptune @ HPCDB
AppScale + Neptune @ HPCDBAppScale + Neptune @ HPCDB
AppScale + Neptune @ HPCDBChris Bunch
 
Greenfields tech decisions
Greenfields tech decisionsGreenfields tech decisions
Greenfields tech decisionsTrent Hornibrook
 
Fullstack DDD with ASP.NET Core and Anguar 2 - Ronald Harmsen, NForza
Fullstack DDD with ASP.NET Core and Anguar 2 - Ronald Harmsen, NForzaFullstack DDD with ASP.NET Core and Anguar 2 - Ronald Harmsen, NForza
Fullstack DDD with ASP.NET Core and Anguar 2 - Ronald Harmsen, NForzaCodemotion Tel Aviv
 
How Serverless Changes DevOps
How Serverless Changes DevOpsHow Serverless Changes DevOps
How Serverless Changes DevOpsRichard Donkin
 
From vagrant to production - Mark Eijsermans
From vagrant to production - Mark EijsermansFrom vagrant to production - Mark Eijsermans
From vagrant to production - Mark EijsermansDevopsdays
 
Building a PaaS with Docker and AWS
Building a PaaS with Docker and AWSBuilding a PaaS with Docker and AWS
Building a PaaS with Docker and AWSvesirin
 
SF Hadoop Users Group August 2014 Meetup Slides
SF Hadoop Users Group August 2014 Meetup SlidesSF Hadoop Users Group August 2014 Meetup Slides
SF Hadoop Users Group August 2014 Meetup SlidesYash Ranadive
 
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...DataStax
 
Applications in the Cloud
Applications in the CloudApplications in the Cloud
Applications in the CloudEberhard Wolff
 
Appscale at CLOUDCOMP '09
Appscale at CLOUDCOMP '09Appscale at CLOUDCOMP '09
Appscale at CLOUDCOMP '09Chris Bunch
 
Deployment pipeline for Azure SQL Databases
Deployment pipeline for Azure SQL DatabasesDeployment pipeline for Azure SQL Databases
Deployment pipeline for Azure SQL DatabasesEduardo Piairo
 

Was ist angesagt? (20)

(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014
(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014
(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014
 
Rainbows, Unicorns, and other Fairy Tales in the Land of Serverless Dreams
Rainbows, Unicorns, and other Fairy Tales in the Land of Serverless DreamsRainbows, Unicorns, and other Fairy Tales in the Land of Serverless Dreams
Rainbows, Unicorns, and other Fairy Tales in the Land of Serverless Dreams
 
AppScale Talk at SBonRails
AppScale Talk at SBonRailsAppScale Talk at SBonRails
AppScale Talk at SBonRails
 
AppScale @ LA.rb
AppScale @ LA.rbAppScale @ LA.rb
AppScale @ LA.rb
 
AppScale + Neptune @ HPCDB
AppScale + Neptune @ HPCDBAppScale + Neptune @ HPCDB
AppScale + Neptune @ HPCDB
 
Olivier_Tisserand_projects
Olivier_Tisserand_projectsOlivier_Tisserand_projects
Olivier_Tisserand_projects
 
Ph.D. Defense
Ph.D. DefensePh.D. Defense
Ph.D. Defense
 
Greenfields tech decisions
Greenfields tech decisionsGreenfields tech decisions
Greenfields tech decisions
 
Fullstack DDD with ASP.NET Core and Anguar 2 - Ronald Harmsen, NForza
Fullstack DDD with ASP.NET Core and Anguar 2 - Ronald Harmsen, NForzaFullstack DDD with ASP.NET Core and Anguar 2 - Ronald Harmsen, NForza
Fullstack DDD with ASP.NET Core and Anguar 2 - Ronald Harmsen, NForza
 
How Serverless Changes DevOps
How Serverless Changes DevOpsHow Serverless Changes DevOps
How Serverless Changes DevOps
 
From vagrant to production - Mark Eijsermans
From vagrant to production - Mark EijsermansFrom vagrant to production - Mark Eijsermans
From vagrant to production - Mark Eijsermans
 
Building a PaaS with Docker and AWS
Building a PaaS with Docker and AWSBuilding a PaaS with Docker and AWS
Building a PaaS with Docker and AWS
 
SF Hadoop Users Group August 2014 Meetup Slides
SF Hadoop Users Group August 2014 Meetup SlidesSF Hadoop Users Group August 2014 Meetup Slides
SF Hadoop Users Group August 2014 Meetup Slides
 
Campus days Azure HDInsight automation
Campus days Azure HDInsight automationCampus days Azure HDInsight automation
Campus days Azure HDInsight automation
 
Inrastructure as Code
Inrastructure as CodeInrastructure as Code
Inrastructure as Code
 
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...
 
Applications in the Cloud
Applications in the CloudApplications in the Cloud
Applications in the Cloud
 
Neptune @ SoCal
Neptune @ SoCalNeptune @ SoCal
Neptune @ SoCal
 
Appscale at CLOUDCOMP '09
Appscale at CLOUDCOMP '09Appscale at CLOUDCOMP '09
Appscale at CLOUDCOMP '09
 
Deployment pipeline for Azure SQL Databases
Deployment pipeline for Azure SQL DatabasesDeployment pipeline for Azure SQL Databases
Deployment pipeline for Azure SQL Databases
 

Andere mochten auch

Thing. An unexpected journey. Devoxx UK 2014
Thing. An unexpected journey. Devoxx UK 2014Thing. An unexpected journey. Devoxx UK 2014
Thing. An unexpected journey. Devoxx UK 2014darach
 
Government as a Platform
Government as a PlatformGovernment as a Platform
Government as a PlatformTim O'Reilly
 
Agile Testing Pasadena JUG Aug2009
Agile Testing Pasadena JUG Aug2009Agile Testing Pasadena JUG Aug2009
Agile Testing Pasadena JUG Aug2009Grig Gheorghiu
 
[デブサミ2012]趣味と実益の脆弱性発見
[デブサミ2012]趣味と実益の脆弱性発見[デブサミ2012]趣味と実益の脆弱性発見
[デブサミ2012]趣味と実益の脆弱性発見Yosuke HASEGAWA
 
Picnic version: The Clothesline Paradox and the Sharing Economy (pdf with notes)
Picnic version: The Clothesline Paradox and the Sharing Economy (pdf with notes)Picnic version: The Clothesline Paradox and the Sharing Economy (pdf with notes)
Picnic version: The Clothesline Paradox and the Sharing Economy (pdf with notes)Tim O'Reilly
 

Andere mochten auch (6)

Tools I Carry
Tools I CarryTools I Carry
Tools I Carry
 
Thing. An unexpected journey. Devoxx UK 2014
Thing. An unexpected journey. Devoxx UK 2014Thing. An unexpected journey. Devoxx UK 2014
Thing. An unexpected journey. Devoxx UK 2014
 
Government as a Platform
Government as a PlatformGovernment as a Platform
Government as a Platform
 
Agile Testing Pasadena JUG Aug2009
Agile Testing Pasadena JUG Aug2009Agile Testing Pasadena JUG Aug2009
Agile Testing Pasadena JUG Aug2009
 
[デブサミ2012]趣味と実益の脆弱性発見
[デブサミ2012]趣味と実益の脆弱性発見[デブサミ2012]趣味と実益の脆弱性発見
[デブサミ2012]趣味と実益の脆弱性発見
 
Picnic version: The Clothesline Paradox and the Sharing Economy (pdf with notes)
Picnic version: The Clothesline Paradox and the Sharing Economy (pdf with notes)Picnic version: The Clothesline Paradox and the Sharing Economy (pdf with notes)
Picnic version: The Clothesline Paradox and the Sharing Economy (pdf with notes)
 

Ähnlich wie Five Years of EC2 Distilled

Kubernetes Manchester - 6th December 2018
Kubernetes Manchester - 6th December 2018Kubernetes Manchester - 6th December 2018
Kubernetes Manchester - 6th December 2018David Stockton
 
Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for HadoopJoe Crobak
 
Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)Ilya Ganelin
 
Journey towards serverless infrastructure
Journey towards serverless infrastructureJourney towards serverless infrastructure
Journey towards serverless infrastructureVille Seppänen
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications OpenEBS
 
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?DATAVERSITY
 
Store
StoreStore
StoreESUG
 
NDev Talk - Serverless Design Patterns
NDev Talk - Serverless Design PatternsNDev Talk - Serverless Design Patterns
NDev Talk - Serverless Design PatternsRyan Green
 
NLUUG print conference May 26 2016
NLUUG print conference May 26 2016NLUUG print conference May 26 2016
NLUUG print conference May 26 2016Igmar Palsenberg
 
AWS to Bare Metal: Motivation, Pitfalls, and Results
AWS to Bare Metal: Motivation, Pitfalls, and ResultsAWS to Bare Metal: Motivation, Pitfalls, and Results
AWS to Bare Metal: Motivation, Pitfalls, and ResultsMongoDB
 
12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQL12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQLKonstantin Gredeskoul
 
MySQL in the Hosted Cloud
MySQL in the Hosted CloudMySQL in the Hosted Cloud
MySQL in the Hosted CloudColin Charles
 
Moving to the Cloud: AWS, Zend, RightScale
Moving to the Cloud: AWS, Zend, RightScaleMoving to the Cloud: AWS, Zend, RightScale
Moving to the Cloud: AWS, Zend, RightScalemmoline
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudyJohn Adams
 
Hashicorp at holaluz
Hashicorp at holaluzHashicorp at holaluz
Hashicorp at holaluzRicard Clau
 
Cloudsolutionday 2016: DevOps workflow with Docker on AWS
Cloudsolutionday 2016: DevOps workflow with Docker on AWSCloudsolutionday 2016: DevOps workflow with Docker on AWS
Cloudsolutionday 2016: DevOps workflow with Docker on AWSAWS Vietnam Community
 
Webinar - DreamObjects/Ceph Case Study
Webinar - DreamObjects/Ceph Case StudyWebinar - DreamObjects/Ceph Case Study
Webinar - DreamObjects/Ceph Case StudyCeph Community
 

Ähnlich wie Five Years of EC2 Distilled (20)

Kubernetes Manchester - 6th December 2018
Kubernetes Manchester - 6th December 2018Kubernetes Manchester - 6th December 2018
Kubernetes Manchester - 6th December 2018
 
Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for Hadoop
 
Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)
 
Journey towards serverless infrastructure
Journey towards serverless infrastructureJourney towards serverless infrastructure
Journey towards serverless infrastructure
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications
 
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
 
Store
StoreStore
Store
 
PaaS with Java
PaaS with JavaPaaS with Java
PaaS with Java
 
NDev Talk - Serverless Design Patterns
NDev Talk - Serverless Design PatternsNDev Talk - Serverless Design Patterns
NDev Talk - Serverless Design Patterns
 
NLUUG print conference May 26 2016
NLUUG print conference May 26 2016NLUUG print conference May 26 2016
NLUUG print conference May 26 2016
 
AWS to Bare Metal: Motivation, Pitfalls, and Results
AWS to Bare Metal: Motivation, Pitfalls, and ResultsAWS to Bare Metal: Motivation, Pitfalls, and Results
AWS to Bare Metal: Motivation, Pitfalls, and Results
 
12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQL12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQL
 
Stackato v2
Stackato v2Stackato v2
Stackato v2
 
MySQL in the Hosted Cloud
MySQL in the Hosted CloudMySQL in the Hosted Cloud
MySQL in the Hosted Cloud
 
Moving to the Cloud: AWS, Zend, RightScale
Moving to the Cloud: AWS, Zend, RightScaleMoving to the Cloud: AWS, Zend, RightScale
Moving to the Cloud: AWS, Zend, RightScale
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
Kubernetes
KubernetesKubernetes
Kubernetes
 
Hashicorp at holaluz
Hashicorp at holaluzHashicorp at holaluz
Hashicorp at holaluz
 
Cloudsolutionday 2016: DevOps workflow with Docker on AWS
Cloudsolutionday 2016: DevOps workflow with Docker on AWSCloudsolutionday 2016: DevOps workflow with Docker on AWS
Cloudsolutionday 2016: DevOps workflow with Docker on AWS
 
Webinar - DreamObjects/Ceph Case Study
Webinar - DreamObjects/Ceph Case StudyWebinar - DreamObjects/Ceph Case Study
Webinar - DreamObjects/Ceph Case Study
 

Kürzlich hochgeladen

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 

Kürzlich hochgeladen (20)

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 

Five Years of EC2 Distilled

  • 1. Five years of EC2 distilled Grig Gheorghiu Silicon Valley Cloud Computing Meetup, Feb. 19th 2013 @griggheo agiletesting.blogspot.com
  • 2. whoami • Dir of Technology at Reliam (managed hosting) • Sr Sys Architect at OpenX • VP Technical Ops at Evite • VP Technical Ops at Nasty Gal
  • 3. EC2 creds • Started with personal m1.small instance in 2008 • Still around! • UPTIME: • 5:13:52 up 438 days, 23:33, 1 user, load average: 0.03, 0.09, 0.08
  • 4. EC2 at OpenX • end of 2008 • 100s then 1000s of instances • one of largest AWS customers at the time • NAMING is very important • terminated DB server by mistake • in ideal world naming doesn’t matter
  • 5. EC2 at OpenX (cont.) • Failures are very frequent at scale • Forced to architect for failure and horizontal scaling • Hard to scale at all layers at the same time (scaling app server layer can overwhelm DB layer; play wack-a-mole) • Elasticity: easier to scale out than scale back
  • 6. EC2 at OpenX (cont.) • Automation and configuration management become critical • Used little-known tool - ‘slack’ • Rolled own EC2 management tool in Python, wrapped around EC2 Java API • Testing deployments is critical (one mistake can get propagated everywhere)
  • 7. EC2 at OpenX (cont.) • Hard to scale at the DB layer (MySQL) • mysql-proxy for r/w split • slaves behind HAProxy for reads • HAProxy for LB, then ELB • ELB melted initially, had to be gradually warmed up
  • 8. EC2 at Evite • Sharded MySQL at DB layer; application very write-intensive • Didn’t do proper capacity planning/dark launching; had to move quickly from data center to EC2 to scale horizontally • Engaged Percona at the same time
  • 9. EC2 at Evite (cont.) • Started with EBS volumes (separate for data, transaction logs, temp files) • EBS horror stories • CPU Wait up to 100%, instances AWOL • I/O very inconsistent, unpredictable • Striped EBS volumes in RAID0 helps with performance but not with reliability
  • 10. EC2 at Evite (cont.) • EBS apocalypse in April 2011 • Hit us even with masters and slaves in diff. availability zones (but all in single region - mistake!) • IMPORTANT: rebuilding redundancy into your system is HARD • For DB servers, reloading data on new server is a lengthy process
  • 11. EC2 at Evite (cont.) • General operation: very frequent failures (once a week); nightmare for pager duty • Got very good at disaster recovery! • Failover of master to slave • Rebuilding of slave from master (xtrabackup) • Local disks striped in RAID0 better than EBS
  • 12. EC2 at Evite (cont.) • Ended up moving DB servers back to data center • Bare metal (Dell C2100, 144 GB RAM, RAID10); 2 MySQL instances per server • Lots of tuning help from Percona • BUT: EC2 was great for capacity planning! (Zynga does the same)
  • 13. EC2 at Evite (cont.) • Relational databases are not ready for the cloud (reliability, I/O performance) • Still keep MySQL slaves in EC2 for DR • Ryan Mack (Facebook): “We chose well-understood technologies so we could better predict capacity needs and rely on our existing monitoring and operational tool kits.”
  • 14. EC2 at Evite (cont.) • Didn’t use provisioned IOPS for EBS • Didn’t use VPC • Great experience with Elastic Map Reduce, S3, Route 53 DNS • Not so great experience with DynamoDB • ELB OK but still need HAProxy behind it
  • 15. EC2 at Nasty Gal • VPC - really good idea! • Extension of data center infrastructure • Currently using it for dev/staging + some internal backend production • Challenging to set up VPN tunnels to various firewall vendors (Cisco, Fortinet) - not much debugging on VPC side
  • 16. Interacting with AWS • AWS API (mostly Java based, but also Ruby and Python) • Multi-cloud libraries: jclouds (Java), libcloud (Python), deltacloud (Ruby) • Chef knife • Vagrant EC2 provider • Roll your own
  • 17. Proper infrastructure care and feeding • Monitoring - alerting, logging, graphing • It’s not in production if it’s not monitored and graphed • Monitoring is for ops what testing is for dev • Great way to learn a new infrastructure • Dev and ops on pager
  • 18. Proper infrastructure care and feeding • Going from #monitoringsucks to #monitoringlove and @monitorama • Modern monitoring/graphing/logging tools • Sensu, Graphite, Boundary, Server Density, New Relic, Papertrail, Pingdom, Dead Man’s Snitch
  • 19. Proper infrastructure care and feeding • Dashboards! • Mission Control page with graphs based on Graphite and Google Visualization API • Correlate spikes and dips in graphs with errors (external and internal monitoring) • Akamai HTTP 500 alerts correlated with Web server 500 errors and DB server I/O wait increase
  • 20. Proper infrastructure care and feeding • HTTP 500 errors as a percentage of all HTTP requests across all app servers in the last 60 minutes
  • 21. Proper infrastructure care and feeding • Expect failures and recover quickly • Capacity planning • Dark launching • Measure baselines • Correlate external symptoms (HTTP 500) with metrics (CPU I/O Wait) then keep metrics under certain thresholds by adding resources
  • 22. Proper infrastructure care and feeding • Automate, automate, automate! - Chef, Puppet, CFEngine, Jenkins, Capistrano, Fabric • Chef - can be single source of truth for infrastructure • Running chef-client continuously on nodes requires discipline • Logging into remote node is anti-pattern (hard!)
  • 23. Proper infrastructure care and feeding • Chef best practices • Use knife - no snowflakes! • Deploy new nodes, don’t do massive updates in place • BUT! beware of OS monoculture • kernel bug after 200+ days • leapocalypse
  • 24. Is the cloud worth the hype? • It’s a game changer, but it’s not magical; try before you buy! (benchmarks could surprise you) • Cloud expert? Carry pager or STFU • Forces you to think about failure recovery, horizontal scalability, automation • Something to be said about abstracting away the physical network - the most obscure bugs are network-related (ARP caching, routing tables)
  • 25. So...when should I use the cloud? • Great for dev/staging/testing • Great for layers of infrastructure that contain many identical nodes and that are forgiving of node failures (web farms, Hadoop nodes, distributed databases) • Not great for ‘snowflake’-type systems • Not great for RDBMS (esp. write-intensive)
  • 26. If you still want to use the cloud • Watch that monthly bill! • Use multiple cloud vendors • Design your infrastructure to scale horizontally and to be portable across cloud vendors • Shared nothing • No SAN, NAS
  • 27. If you still want to use the cloud • Don’t get locked into vendor-proprietary services • EC2, S3, Route 53, EMR are OK • Data stores are not OK (DynamoDB) • OpsWorks - debatable (based on Chef, but still locks you in) • Wrap services in your own RESTful endpoints
  • 28. Does EC2 have rivals? • No (or at least not yet) • Anybody use GCE? • Other public clouds are either toys or smaller, with fewer features (no names named) • Perception matters - not a contender unless featured on High Scalability blog • APIs matter less (can use multi-cloud libs)
  • 29. Does EC2 have rivals? • OpenStack, CloudStack, Eucalyptus all seem promising • Good approach: private infrastructure (bare metal, private cloud) for performance/reliability + extension into public cloud for elasticity/agility (EC2 VPC, RackConnect) • How about PaaS? • Personally: too hard to relinquish control
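
Slides 20 and 21 describe graphing HTTP 500 errors as a percentage of all HTTP requests across all app servers in the last 60 minutes. The talk does not show how that number was computed, so the following is only a minimal sketch of the arithmetic. The Request record, its field names, and the sample data are hypothetical; a real setup would more likely pull these counts from a metrics store such as Graphite.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass(frozen=True)
class Request:
    """One access-log entry. The field names are hypothetical."""
    server: str
    timestamp: datetime
    status: int


def http_500_percentage(requests: Iterable[Request], now: datetime,
                        window: timedelta = timedelta(minutes=60)) -> float:
    """Return HTTP 500 responses as a percentage of all requests in the window.

    Requests from every app server are pooled, matching the slide's
    "across all app servers" wording.
    """
    cutoff = now - window
    total = 0
    errors = 0
    for req in requests:
        if cutoff <= req.timestamp <= now:
            total += 1
            if req.status == 500:
                errors += 1
    # No traffic in the window: report 0.0 rather than divide by zero.
    return 100.0 * errors / total if total else 0.0


# Made-up sample data: four requests inside the window, one of them a 500.
now = datetime(2013, 2, 19, 12, 0)
sample = [
    Request("app1", now - timedelta(minutes=5), 200),
    Request("app1", now - timedelta(minutes=10), 500),
    Request("app2", now - timedelta(minutes=30), 200),
    Request("app2", now - timedelta(minutes=50), 200),
    Request("app2", now - timedelta(minutes=90), 500),  # outside the window
]
print(http_500_percentage(sample, now))  # 25.0
```

Once a baseline for this percentage is known, it can be compared against an alert threshold and correlated with internal metrics such as CPU I/O wait, as slide 21 suggests.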
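
Slide 9 notes that striping EBS volumes in RAID0 helps with performance but not with reliability. One way to see why: a RAID0 stripe loses data if any single member volume fails, so with n volumes that each fail with probability p the stripe fails with probability 1 - (1 - p)^n. The sketch below works through that arithmetic. It assumes volume failures are independent, which is an idealization (the April 2011 event on slide 10 was a correlated failure), and the 1% per-volume figure is a made-up illustration, not a measured EBS failure rate.

```python
def stripe_loss_probability(volume_failure_prob: float, volumes: int) -> float:
    """Probability that a RAID0 stripe loses data.

    Assumes each volume fails independently with the same probability.
    The stripe survives only if every volume survives.
    """
    if not 0.0 <= volume_failure_prob <= 1.0:
        raise ValueError("volume_failure_prob must be between 0 and 1")
    if volumes < 1:
        raise ValueError("volumes must be at least 1")
    return 1.0 - (1.0 - volume_failure_prob) ** volumes


# Hypothetical 1% failure probability per volume over some period.
for n in (1, 2, 4, 8):
    print(n, round(stripe_loss_probability(0.01, n), 4))
```

Under these assumptions the chance of losing the array grows with every volume added to the stripe, while throughput is the only thing that improves.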