Planning
to fail

@davegardnerisme
#phpuk2013
dave
the taxi app
Planning
 to fail
Planning
for failure
Planning
 to fail
The beginning
<?php
My website: single VPS running PHP + MySQL
No growth, low volume, simple functionality, one engineer (me!)
Large growth, high volume, complex functionality, lots of engineers
• Launched in London
  November 2011

• Now in 5 cities in 3 countries
  (30%+ growth every month)

• A Hailo hail is accepted around
  the world every 5 seconds
“.. Brooks [1] reveals that the complexity
of a software project grows as the square
of the number of engineers and Leveson
[17] cites evidence that most failures in
complex systems result from unexpected
inter-component interaction rather than
intra-component bugs, we conclude that
less machinery is (quadratically) better.”

http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf
• SOA (10+ services)

• AWS (3 regions, 9 AZs, lots of
  instances)

• 10+ engineers building services
                and you?
                (hailo is hiring)
Our overall
reliability is in
    danger
Embracing failure

(a coping strategy)
VPS
(running PHP+MySQL)




                      reliable?
Reliable
  !==
Resilient
Choosing a stack
“Hailo”
(running PHP+MySQL)




                      reliable?
Service    Service         Service        Service


      each service does one job well



          Service Oriented Architecture
• Fewer lines of code

• Fewer responsibilities

• Changes less frequently

• Can swap entire implementation
  if needed
Service
(running PHP+MySQL)




                      reliable?
Service                     MySQL




   MySQL running on different box
MySQL
Service
                            MySQL



 MySQL running in Multi-Master mode
Going global
CRUD
                          Locking
MySQL                     Search
                          Analytics
                          ID generation
                          also queuing…

        Separating concerns
At Hailo we look for technologies that are:

• Distributed
  run on more than one machine

• Homogeneous
  all nodes look the same

• Resilient
  can cope with the loss of node(s) with no
  loss of data
“There is no such thing as standby
infrastructure: there is stuff you
always use and stuff that won’t
work when you need it.”




http://blog.b3k.us/2012/01/24/some-rules.html
Cassandra
• Highly performant, scalable and
  resilient data store

• Underpins much of what we do
  at Hailo

• Makes multi-DC easy!
ZooKeeper
• Highly reliable distributed
  coordination

• We implement locking and
  leadership election on top of ZK
  and use sparingly
Elasticsearch
• Distributed, RESTful, Search
  Engine built on top of Apache
  Lucene

• Replaced basic foo LIKE ‘%bar%’
  queries (so much better)
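
As a rough illustration of what replaces a LIKE '%bar%' query, the sketch below builds the JSON body of an Elasticsearch match query. The helper name matchQuery and the field name "name" are hypothetical; the body would be sent to an index's _search endpoint, which is not shown here.

```php
<?php
// Hypothetical helper: build the JSON body for a full-text "match" query.
function matchQuery(string $field, string $text): string
{
    return json_encode(['query' => ['match' => [$field => $text]]]);
}

// "name" is an assumed field; this is the search-engine counterpart of
// WHERE name LIKE '%bar%'.
$body = matchQuery('name', 'bar');
echo $body, PHP_EOL;
```

Unlike the LIKE query, the match query is analysed and scored by Lucene, so it does not need to scan every row.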
NSQ
• Realtime message processing
  system designed to handle
  billions of messages per day

• Fault tolerant, highly available
  with reliable message delivery
  guarantee
Cruftflake
• Distributed ID generation with
  no coordination required

• Rock solid
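
To show why no coordination is needed, here is a minimal sketch of a Snowflake-style ID: a millisecond timestamp, a machine ID and a per-millisecond sequence packed into one 64-bit integer. The 10-bit and 12-bit field widths and the function name composeId are assumptions for illustration, not taken from the Cruftflake source, and the code assumes a 64-bit PHP build.

```php
<?php
// Assumed field widths (Snowflake-style layout).
const MACHINE_BITS = 10;
const SEQUENCE_BITS = 12;

// Hypothetical helper: pack timestamp, machine ID and sequence into one int.
function composeId(int $timestampMs, int $machineId, int $sequence): int
{
    if ($machineId < 0 || $machineId >= (1 << MACHINE_BITS)) {
        throw new \InvalidArgumentException('machine ID out of range');
    }
    if ($sequence < 0 || $sequence >= (1 << SEQUENCE_BITS)) {
        throw new \InvalidArgumentException('sequence out of range');
    }
    // The timestamp sits in the high bits, so IDs sort roughly by time.
    return ($timestampMs << (MACHINE_BITS + SEQUENCE_BITS))
        | ($machineId << SEQUENCE_BITS)
        | $sequence;
}

$id = composeId(1, 1, 1);
echo $id, PHP_EOL;
```

Each generator only needs its own machine ID and clock, so two nodes can never produce the same value as long as their machine IDs differ.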
• All these technologies have
  similar properties of distribution
  and resilience

• They are designed to cope with
  failure

• They are not broken by design
Lessons learned
Minimise the
critical path
What is the minimum viable service?
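class HailoMemcacheService {
    private $mc = null;      // created on first use, not in the constructor
    private $servers;        // server list passed to addServers()

    public function __construct(array $servers) {
        $this->servers = $servers;   // no connection is made here
    }

    public function __call($name, $args) {
        $mc = $this->getInstance();
        // do stuff
    }

    private function getInstance() {
        if ($this->mc === null) {
             $this->mc = new Memcached;
             $this->mc->addServers($this->servers);
        }
        return $this->mc;
    }
}

Lazy-init instances; connect on use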
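
Connecting on use keeps the cache off the critical path of requests that never touch it. A complementary step, sketched below under assumptions, is to treat the cache as optional when it is touched: if the lookup throws, fall back to the slower source. The helper withFallback and the two stand-in closures are hypothetical.

```php
<?php
// Hypothetical helper: run $primary, and if it throws, use $fallback instead.
function withFallback(callable $primary, callable $fallback)
{
    try {
        return $primary();
    } catch (\Throwable $e) {
        // A real service would log or count the failure here.
        return $fallback();
    }
}

// Stand-ins for a cache lookup that fails and the slower source of truth.
$fromCache = function () {
    throw new \RuntimeException('cache unavailable');
};
$fromSource = function () {
    return 'driver-42';
};

$value = withFallback($fromCache, $fromSource);
echo $value, PHP_EOL;
```

The request still succeeds, only more slowly, which is one way to answer the "minimum viable service" question for a cache.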
Configure clients
   carefully
$this->mc = new Memcached;
$this->mc->addServers($s);

// connect and poll timeouts are in milliseconds;
// send and receive timeouts are in microseconds
$this->mc->setOption(
    Memcached::OPT_CONNECT_TIMEOUT,
    $connectTimeout);
$this->mc->setOption(
    Memcached::OPT_SEND_TIMEOUT,
    $sendRecvTimeout);
$this->mc->setOption(
    Memcached::OPT_RECV_TIMEOUT,
    $sendRecvTimeout);
$this->mc->setOption(
    Memcached::OPT_POLL_TIMEOUT,
    $connectionPollTimeout);

Make sure timeouts are configured
here?




Choose timeouts based on data
“Fail Fast: Set aggressive timeouts
such that failing components
don’t make the entire system
crawl to a halt.”




http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
here?




95th percentile
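
One way to ground a timeout in data is to compute a high percentile of observed latencies and start from there. The sketch below uses the nearest-rank method; the function name percentile and the sample latencies are made up for illustration.

```php
<?php
// Hypothetical helper: nearest-rank percentile of a list of samples.
function percentile(array $samples, int $p): float
{
    if ($samples === [] || $p < 1 || $p > 100) {
        throw new \InvalidArgumentException('need samples and 1 <= p <= 100');
    }
    sort($samples);
    // Smallest rank such that at least p% of the samples are at or below it.
    $rank = (int) ceil($p * count($samples) / 100);
    return (float) $samples[$rank - 1];
}

// Made-up response times in milliseconds, including one slow outlier.
$latenciesMs = [2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 9, 10, 12, 14, 15, 18, 22, 30, 45, 250];
$p95 = percentile($latenciesMs, 95);
echo $p95, PHP_EOL;
```

A timeout near the 95th percentile cuts off the slow tail while letting the bulk of normal requests through; the right percentile is a judgment call for each dependency.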
Test
• Kill memcache on box A,
  measure impact on application

• Kill memcache on box B,
  measure impact on application


All fine.. we’ve got this covered!
FAIL
• Box A, running in AWS, locks up

• Any parts of application that
  touch Memcache stop working
Things fail in
exotic ways
$ iptables -A INPUT -i eth0 \
     -p tcp --dport 11211 -j REJECT



    $ php test-memcache.php

    Working OK!




Packets rejected and source notified by ICMP. Expect fast fails.
$ iptables -A INPUT -i eth0 \
 -p tcp --dport 11211 -j DROP



$ php test-memcache.php

Working OK!




 Packets silently dropped. Expect long timeouts.
$ iptables -A INPUT -i eth0 \
 -p tcp --dport 11211 \
 -m state --state ESTABLISHED \
 -j DROP



$ php test-memcache.php




           Hangs! Uh oh.
• When AWS instances hang they
  appear to accept connections
  but drop packets

• Bug!

https://bugs.launchpad.net/libmemcached/+bug/583031
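
The "accepts connections but never answers" case can be reproduced locally without iptables. The sketch below is an illustration using plain PHP streams, not the Memcached client: it opens a listening socket that never replies, connects to it, and shows that a read timeout bounds the wait. The helper name readOrTimeout and the 0.2 second limit are assumptions.

```php
<?php
// Hypothetical helper: read from a peer, giving up after $seconds.
function readOrTimeout($stream, float $seconds): array
{
    $sec = (int) floor($seconds);
    $usec = (int) round(($seconds - $sec) * 1000000);
    stream_set_timeout($stream, $sec, $usec);
    $data = fread($stream, 1024);
    $meta = stream_get_meta_data($stream);
    return ['data' => $data, 'timed_out' => $meta['timed_out']];
}

// Stand-in for the hung box: a socket that listens but never accepts or replies.
$server = stream_socket_server('tcp://127.0.0.1:0', $errno, $errstr);
$address = stream_socket_get_name($server, false);

// The connection is established from the listen backlog, so connecting works.
$client = stream_socket_client("tcp://$address", $errno, $errstr, 1.0);
fwrite($client, "get some_key\r\n");

$start = microtime(true);
$result = readOrTimeout($client, 0.2);
$elapsed = microtime(true) - $start;

echo $result['timed_out'] ? "timed out\n" : "got data\n";
```

Without the read timeout the fread() call would block for the default socket timeout, which is the kind of hang the test script ran into.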
Fix, rinse, repeat
It would be
nice if we could
 automate this
Automate!
• Hailo run a dedicated automated
  test environment

• Powered by bash, JMeter and
  Graphite

• Continuous automated testing
  with failure simulations
Fix attempt 1: bad timeouts configured
Fix attempt 2: better timeouts
Simulate in
system tests
Simulate failure

Assert monitoring endpoint
picks this up




       Assert features still work
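
The three steps can be sketched as a small self-contained test. Everything here is hypothetical: the Cache interface, the BrokenCache stub that simulates the failure, and a QuoteService standing in for a real feature with a health check.

```php
<?php
interface Cache
{
    public function get(string $key);
}

// Simulated failure: every lookup throws.
class BrokenCache implements Cache
{
    public function get(string $key)
    {
        throw new \RuntimeException('simulated failure');
    }
}

class QuoteService
{
    private $cache;
    private $cacheHealthy = true;

    public function __construct(Cache $cache)
    {
        $this->cache = $cache;
    }

    public function quote(string $route): string
    {
        try {
            $cached = $this->cache->get($route);
            $this->cacheHealthy = true;
            if ($cached !== null) {
                return $cached;
            }
        } catch (\Throwable $e) {
            $this->cacheHealthy = false;   // remembered for the health check
        }
        return 'computed:' . $route;       // feature keeps working without the cache
    }

    // What a monitoring endpoint might expose.
    public function health(): array
    {
        return ['cache' => $this->cacheHealthy ? 'ok' : 'failing'];
    }
}

$service = new QuoteService(new BrokenCache());
$quote = $service->quote('A-B');
$health = $service->health();
echo $quote, ' / cache: ', $health['cache'], PHP_EOL;
```

In a real system test the failure would be injected at the network level, for example with firewall rules, rather than with a stub.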
In conclusion
“the best way to avoid
failure is to fail constantly.”




http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
TIMED BLOCK ALL
THE THINGS
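
One reading of this slide is that every call to an external dependency should be wrapped in a timing block so the data for choosing timeouts exists. A minimal sketch follows, assuming a hypothetical timed() helper that records durations in an in-memory array; a real setup would ship them to a metrics system.

```php
<?php
// Hypothetical helper: run $block and record how long it took, in milliseconds.
function timed(string $name, callable $block, array &$timings)
{
    $start = microtime(true);
    try {
        return $block();
    } finally {
        // Recorded even when $block throws, so failures are measured too.
        $timings[$name][] = (microtime(true) - $start) * 1000.0;
    }
}

$timings = [];
$result = timed('memcache.get', function () {
    usleep(1000);          // stand-in for the real call
    return 'value';
}, $timings);

printf("memcache.get took %.2f ms\n", $timings['memcache.get'][0]);
```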
Thanks


Software used at Hailo

http://cassandra.apache.org/
http://zookeeper.apache.org/
http://www.elasticsearch.org/
http://www.acunu.com/acunu-analytics.html
https://github.com/bitly/nsq
https://github.com/davegardnerisme/cruftflake
https://github.com/davegardnerisme/nsqphp

Plus a load of other things I’ve not mentioned.
Further reading
Hystrix: Latency and Fault Tolerance for Distributed Systems
https://github.com/Netflix/Hystrix

Timelike: a network simulator
http://aphyr.com/posts/277-timelike-a-network-simulator

Notes on distributed systems for young bloods
http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/

Stream de-duplication (relevant to NSQ)
http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/

ID generation in distributed systems
http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems

 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 

Planning to Fail #phpuk13

  • 9. My website: single VPS running PHP + MySQL
  • 10. No growth, low volume, simple functionality, one engineer (me!)
  • 11. Large growth, high volume, complex functionality, lots of engineers
  • 12. Launched in London, November 2011; now in 5 cities in 3 countries (30%+ growth every month); a Hailo hail is accepted around the world every 5 seconds
  • 13. “.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.” http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf
  • 14. SOA (10+ services); AWS (3 regions, 9 AZs, lots of instances); 10+ engineers building services. And you? (Hailo is hiring)
  • 23. Service Oriented Architecture: Service, Service, Service, Service; each service does one job well
  • 24. Fewer lines of code; fewer responsibilities; changes less frequently; can swap entire implementation if needed
  • 26. Service, MySQL: MySQL running on a different box
  • 27. MySQL, Service, MySQL: MySQL running in Multi-Master mode
  • 29. Separating concerns. MySQL: CRUD, locking, search, analytics, ID generation, also queuing…
  • 30. At Hailo we look for technologies that are: distributed (run on more than one machine); homogeneous (all nodes look the same); resilient (can cope with the loss of node(s) with no loss of data)
  • 31. “There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.” http://blog.b3k.us/2012/01/24/some-rules.html
  • 32. Cassandra: highly performant, scalable and resilient data store; underpins much of what we do at Hailo; makes multi-DC easy!
  • 33. ZooKeeper: highly reliable distributed coordination; we implement locking and leadership election on top of ZK and use sparingly
  • 34. Elasticsearch: distributed, RESTful search engine built on top of Apache Lucene; replaced basic foo LIKE ‘%bar%’ queries (so much better)
  • 35. NSQ: realtime message processing system designed to handle billions of messages per day; fault tolerant, highly available with reliable message delivery guarantee
  • 36. Cruftflake: distributed ID generation with no coordination required; rock solid
  • 37. All these technologies have similar properties of distribution and resilience; they are designed to cope with failure; they are not broken by design
  • 40. What is the minimum viable service?
  • 41. Lazy-init instances; connect on use:
        class HailoMemcacheService {
            private $mc = null;
            public function __call($name, $args) {
                $mc = $this->getInstance();
                // do stuff
            }
            private function getInstance() {
                // connect only on first use, then reuse the same instance
                if ($this->mc === null) {
                    $this->mc = new Memcached;
                    $this->mc->addServers($s); // $s: server list, elided on the slide
                }
                return $this->mc;
            }
        }
  • 42. Configure clients carefully
  • 43. Make sure timeouts are configured:
        $this->mc = new Memcached;
        $this->mc->addServers($s);
        $this->mc->setOption(Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
        $this->mc->setOption(Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
        $this->mc->setOption(Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);
        $this->mc->setOption(Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);
  • 45. “Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.” http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
  • 47. Test
  • 48. Kill memcache on box A, measure impact on application; kill memcache on box B, measure impact on application. All fine.. we’ve got this covered!
  • 49. FAIL
  • 50. Box A, running in AWS, locks up; any parts of application that touch Memcache stop working
  • 52. $ iptables -A INPUT -i eth0 -p tcp --dport 11211 -j REJECT
        $ php test-memcache.php
        Working OK!
        Packets rejected and source notified by ICMP. Expect fast fails.
  • 53. $ iptables -A INPUT -i eth0 -p tcp --dport 11211 -j DROP
        $ php test-memcache.php
        Working OK!
        Packets silently dropped. Expect long timeouts.
  • 54. $ iptables -A INPUT -i eth0 -p tcp --dport 11211 -m state --state ESTABLISHED -j DROP
        $ php test-memcache.php
        Hangs! Uh oh.
  • 55. When AWS instances hang they appear to accept connections but drop packets. Bug! https://bugs.launchpad.net/libmemcached/+bug/583031
  • 57. It would be nice if we could automate this
  • 59. Hailo run a dedicated automated test environment; powered by bash, JMeter and Graphite; continuous automated testing with failure simulations
  • 60. Fix attempt 1: bad timeouts configured
  • 61. Fix attempt 2: better timeouts
  • 63. Simulate failure; assert monitoring endpoint picks this up; assert features still work
  • 65. “the best way to avoid failure is to fail constantly.” http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
  • 67. Thanks. Software used at Hailo:
        http://cassandra.apache.org/
        http://zookeeper.apache.org/
        http://www.elasticsearch.org/
        http://www.acunu.com/acunu-analytics.html
        https://github.com/bitly/nsq
        https://github.com/davegardnerisme/cruftflake
        https://github.com/davegardnerisme/nsqphp
        Plus a load of other things I’ve not mentioned.
  • 68. Further reading:
        Hystrix: Latency and Fault Tolerance for Distributed Systems, https://github.com/Netflix/Hystrix
        Timelike: a network simulator, http://aphyr.com/posts/277-timelike-a-network-simulator
        Notes on distributed systems for young bloods, http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
        Stream de-duplication (relevant to NSQ), http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/
        ID generation in distributed systems, http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems
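
The slides on lazy initialisation, failing fast and "simulate failure, assert features still work" can be tied together in a small sketch. This is not Hailo's code: LazyClient, DeadCache, ArrayCache, CacheUnavailable and fetchProfile are hypothetical names invented for illustration, and the stand-in caches replace a real Memcached instance so the example runs without the extension. A real client would be built inside the factory with the timeout options from slide 43, and would typically report failure through result codes rather than exceptions.

```php
<?php
// Hypothetical sketch: every name below is illustrative, not from the talk.

class CacheUnavailable extends RuntimeException {}

final class LazyClient
{
    private $factory;        // callable that builds the real client
    private $client = null;  // created on first use only
    public $connects = 0;    // how many times the factory has run

    public function __construct(callable $factory)
    {
        $this->factory = $factory; // nothing is connected yet
    }

    // Forward every call to the real client, creating it on first use.
    public function __call($method, array $args)
    {
        if ($this->client === null) {
            $this->client = ($this->factory)();
            $this->connects++;
        }
        return $this->client->$method(...$args);
    }
}

// Simulated failure: a cache whose box has gone away.
final class DeadCache
{
    public function get($key)
    {
        throw new CacheUnavailable('cache down');
    }
}

// Healthy stand-in cache backed by an array.
final class ArrayCache
{
    private $data;

    public function __construct(array $data)
    {
        $this->data = $data;
    }

    public function get($key)
    {
        return $this->data[$key] ?? false;
    }
}

// The feature must still work without the cache; it only gets slower.
function fetchProfile($cache, $id, callable $loadFromStore)
{
    try {
        $hit = $cache->get("profile:$id");
        if ($hit !== false) {
            return $hit;
        }
    } catch (CacheUnavailable $e) {
        // Degrade quietly; a real service would also bump a monitoring counter.
    }
    return $loadFromStore($id);
}

$store   = function ($id) { return "profile-$id-from-store"; };
$healthy = new LazyClient(function () { return new ArrayCache(['profile:7' => 'cached-7']); });
$dead    = new LazyClient(function () { return new DeadCache(); });

echo fetchProfile($healthy, 7, $store), "\n";
echo fetchProfile($dead, 7, $store), "\n";
```

Because the client is created only inside `__call`, a page that never touches the cache never pays for a connection, and a dead cache box only affects the code paths that actually use it.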

Editor's notes

  1. I’m dave!
  2. I work at Hailo. This presentation draws on my experiences building Hailo into one of the world’s leading taxi companies.
  3. The title of my talk is “planning to fail”
  4. First PHP conf; tempting fate. Thought about this title, but sounds more like monitoring.
  5. This talk is more proactive than that. Talking about my experiences at Hailo building reliable web services by continually failing.
  6. But first, let’s rewind to the beginning
  7. The pure joy of inserting a php tag in the middle of an HTML table
  8. My website still follows this pattern. I’d like to think my website is quite reliable.
  9. My website is reliable, but simple. Doesn’t change very often.
  10. Hailo is complex!
  11. Hailo is growing.
  12. Key quote: less machinery is quadratically better.
  13. Hailo have a lot of machinery!
  14. Enter the chaos monkey… If you want to be good at something, practice often!
  15. How about the “reliable” VPS that runs my website?
  16. But not resilient; my website would not cope well with the chaos monkey approach.
  17. This doesn’t matter for my website – this is not a bus timetable app – this is not life and death stuff.
  18. We have to choose our stack appropriately if we are going to go down the chaos monkey route.
  19. Hailo didn’t start out this way; but the PHP component did
  20. Splitting into an SOA. Makes it much easier to change bits of code since each service does less, has fewer lines of code and changes less frequently. Also makes it easier to work in larger teams.
  21. Advantages
  22. Here’s one of our services… is this reliable?
  23. But Hailo is going global
  24. At Hailo we are splitting out the features of MySQL and using different technologies where appropriate
  25. Don’t pick things that are broken by design
  26. We remove services from the critical path using lazy-init pattern
  27. We want to define timeouts so that under failure conditions we don’t hang forever
  28. Instrumenting operation times: mean, upper 90th, upper bound (highest observed value)
  29. Let’s aim for 95th percentile as our timeout – but instrument when we do have timeouts so that we know what’s going on
  30. Yay!
  31. Boo
  32. This was after we fixed the bug, but we had the timeouts configured badly.
  33. Better: memcache failure has less impact now; some features might be degraded, but the minimum viable service now works
  34. Runnable .md based system tests
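
Notes 28 and 29 describe instrumenting operation times and using the 95th percentile as the timeout. A minimal sketch of that calculation follows, assuming latency samples in milliseconds have already been collected. The function names percentile and summarise are hypothetical, and the nearest-rank method is one common percentile definition, not necessarily the one Hailo's metrics pipeline used.

```php
<?php
// Hypothetical helpers for picking a timeout from observed operation times.

// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(array $samplesMs, float $p): float
{
    if ($samplesMs === [] || $p <= 0 || $p > 100) {
        throw new InvalidArgumentException('need samples and 0 < p <= 100');
    }
    sort($samplesMs); // sorts the local copy only
    $rank = (int) ceil($p * count($samplesMs) / 100); // 1-based rank
    return (float) $samplesMs[$rank - 1];
}

// The figures named in the notes, plus a candidate timeout.
function summarise(array $samplesMs): array
{
    if ($samplesMs === []) {
        throw new InvalidArgumentException('need at least one sample');
    }
    return [
        'mean'    => array_sum($samplesMs) / count($samplesMs),
        'upper90' => percentile($samplesMs, 90),
        'upper'   => (float) max($samplesMs), // highest observed value
        'timeout' => percentile($samplesMs, 95),
    ];
}

// Illustrative samples only, not measurements from the talk.
$observedMs = [12, 3, 7, 5, 40, 6, 4, 8, 5, 9];
print_r(summarise($observedMs));
```

A timeout set at the 95th percentile will, by construction, cut off roughly one call in twenty under normal conditions, which is why the note pairs it with instrumenting the timeouts themselves.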