Planning
to fail

@davegardnerisme
#phpne13
dave
the taxi app
Planning
 to fail
Planning
for failure
Planning
 to fail
Why?


http://en.wikipedia.org/wiki/High_availability
99.9%        (three nines)

Downtime:

43.2 minutes per month
8.76 hours per year
99.99%       (four nines)

Downtime:

4.32 minutes per month
52.56 minutes per year
99.999% (five nines)

Downtime:

25.9 seconds per month
5.26 minutes per year
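
The downtime figures above follow from simple arithmetic. A minimal sketch, assuming a 30-day month and a 365-day year (the slides do not state which period lengths were used, and a different month length shifts the monthly figures slightly); `downtimeSeconds` is a name made up for this example:

```php
<?php
// Assumed period lengths (not stated on the slides): 30-day month, 365-day year.
const SECONDS_PER_MONTH = 30 * 24 * 3600;
const SECONDS_PER_YEAR = 365 * 24 * 3600;

// Seconds of allowed downtime in a period for a given availability percentage.
function downtimeSeconds(float $availabilityPercent, int $periodSeconds): float
{
    return $periodSeconds * (100.0 - $availabilityPercent) / 100.0;
}

foreach ([99.9, 99.99, 99.999] as $target) {
    printf(
        "%s%%: %.1f s/month, %.2f min/year\n",
        $target,
        downtimeSeconds($target, SECONDS_PER_MONTH),
        downtimeSeconds($target, SECONDS_PER_YEAR) / 60
    );
}
```

Each extra nine cuts the budget by a factor of ten, which is why five nines leaves well under half a minute per month.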
www.whoownsmyavailability.com



             ?
www.whoownsmyavailability.com



          YOU
The beginning
<?php
My website: single VPS running PHP + MySQL
No growth, low volume, simple functionality, one engineer (me!)
Large growth, high volume, complex functionality, lots of engineers
• Launched in London
  November 2011

• Now in 5 cities in 3 countries
  (30%+ growth every month)

• A Hailo hail is accepted around
  the world every 5 seconds
“.. Brooks [1] reveals that the complexity
of a software project grows as the square
of the number of engineers and Leveson
[17] cites evidence that most failures in

complex systems result from unexpected
inter-component interaction rather than
intra-component bugs, we conclude that
less machinery is (quadratically) better.”

http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf
• SOA (10+ services)

• AWS (3 regions, 9 AZs, lots of
  instances)

• 10+ engineers building services
                and you?
                (hailo is hiring)
Our overall
reliability is in
    danger
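
One way to see why: if a request needs every service to be up, the availabilities multiply. A rough sketch, assuming independent failures and that each request touches all ten services (both are simplifying assumptions, not claims from the talk); `serialAvailability` is a hypothetical helper:

```php
<?php
// Availability of a request that needs every one of $count dependencies,
// each independently available with probability $each (assumed values).
function serialAvailability(float $each, int $count): float
{
    return pow($each, $count);
}

$overall = serialAvailability(0.999, 10);
printf("10 services at 99.9%% each: %.2f%% overall\n", $overall * 100);
```

Ten three-nines services in series give roughly two nines overall, unless each service can cope with the failure of the others.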
Embracing failure

(a coping strategy)
VPS
(running PHP+MySQL)




                      reliable?
Reliable
  !==
Resilient
Choosing a stack
“Hailo”
(running PHP+MySQL)




                      reliable?
Service    Service         Service        Service


      each service does one job well



          Service Oriented Architecture
• Fewer lines of code

• Fewer responsibilities

• Changes less frequently

• Can swap entire implementation
  if needed
Service
(running PHP+MySQL)




                      reliable?
Service                     MySQL




   MySQL running on different box
MySQL
Service
                            MySQL



 MySQL running in Multi-Master mode
Going global
CRUD
                          Locking
MySQL                     Search
                          Analytics
                          ID generation
                          also queuing…

        Separating concerns
At Hailo we look for technologies that are:

• Distributed
  run on more than one machine

• Homogeneous
  all nodes look the same

• Resilient
  can cope with the loss of node(s) with no
  loss of data
“There is no such thing as standby
infrastructure: there is stuff you
always use and stuff that won’t
work when you need it.”




http://blog.b3k.us/2012/01/24/some-rules.html
Cassandra
• Highly performant, scalable and
  resilient data store

• Underpins much of what we do
  at Hailo

• Makes multi-DC easy!
ZooKeeper
• Highly reliable distributed
  coordination

• We implement locking and
  leadership election on top of ZK
  and use sparingly
Elasticsearch
• Distributed, RESTful, Search
  Engine built on top of Apache
  Lucene

• Replaced basic foo LIKE ‘%bar%’
  queries (so much better)
NSQ
• Realtime message processing
  system designed to handle
  billions of messages per day

• Fault tolerant, highly available
  with reliable message delivery
  guarantee
Acunu Analytics
• Real time incremental analytics
  platform, backed by Apache
  Cassandra

• Powerful SQL-like interface

• Scalable and highly available
Cruftflake
• Distributed ID generation with
  no coordination required

• Rock solid
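
A sketch of how IDs can be generated without coordination, assuming a Snowflake-style layout (timestamp, then machine ID, then sequence). The bit widths, the Unix epoch and the class name `IdGenerator` are assumptions for illustration, not Cruftflake's actual code, and 64-bit PHP integers are required:

```php
<?php
// Snowflake-style 64-bit ID: milliseconds in the high bits, then 10 bits of
// machine ID, then a 12-bit per-millisecond sequence. Layout is an assumption.
final class IdGenerator
{
    private $machineId;
    private $lastMs = -1;
    private $sequence = 0;

    public function __construct(int $machineId)
    {
        if ($machineId < 0 || $machineId > 1023) {
            throw new InvalidArgumentException('machine ID must fit in 10 bits');
        }
        $this->machineId = $machineId;
    }

    public function next(int $nowMs): int
    {
        if ($nowMs < $this->lastMs) {
            throw new RuntimeException('clock moved backwards; refusing to issue IDs');
        }
        if ($nowMs === $this->lastMs) {
            // Same millisecond: bump the sequence, fail if all 4096 slots are used.
            $this->sequence = ($this->sequence + 1) & 0xFFF;
            if ($this->sequence === 0) {
                throw new RuntimeException('sequence exhausted for this millisecond');
            }
        } else {
            $this->sequence = 0;
            $this->lastMs = $nowMs;
        }
        return ($nowMs << 22) | ($this->machineId << 12) | $this->sequence;
    }
}

$a = new IdGenerator(1);
$b = new IdGenerator(2);
$now = (int) floor(microtime(true) * 1000);
$ids = [$a->next($now), $a->next($now), $b->next($now)];
```

Two generators never need to talk to each other: as long as their machine IDs differ, their IDs differ, and IDs from one generator sort by time.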
• All these technologies have
  similar properties of distribution
  and resilience

• They are designed to cope with
  failure

• They are not broken by design
Lessons learned
Minimise the
critical path
What is the minimum viable service?
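class HailoMemcacheService {
    private $mc = null;
    private $servers;

    public function __construct(array $servers) {
        // store config only; no connection is made here
        $this->servers = $servers;
    }

    public function __call($name, $args) {
        $mc = $this->getInstance();
        // do stuff
    }

    private function getInstance() {
        // create the client the first time it is needed
        if ($this->mc === null) {
            $this->mc = new Memcached;
            $this->mc->addServers($this->servers);
        }
        return $this->mc;
    }
}

Lazy-init instances; connect on use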
Configure clients
   carefully
$this->mc = new Memcached;
$this->mc->addServers($s);

// connect timeout: milliseconds
$this->mc->setOption(
    Memcached::OPT_CONNECT_TIMEOUT,
    $connectTimeout);
// send and receive timeouts: microseconds
$this->mc->setOption(
    Memcached::OPT_SEND_TIMEOUT,
    $sendRecvTimeout);
$this->mc->setOption(
    Memcached::OPT_RECV_TIMEOUT,
    $sendRecvTimeout);
// poll timeout: milliseconds
$this->mc->setOption(
    Memcached::OPT_POLL_TIMEOUT,
    $connectionPollTimeout);

Make sure timeouts are configured
here?




Choose timeouts based on data
“Fail Fast: Set aggressive timeouts
such that failing components
don’t make the entire system
crawl to a halt.”




http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
here?




95th percentile
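
Choosing a timeout from data can be sketched as a percentile calculation. This uses the nearest-rank method on made-up latency samples; the `percentile` helper and the sample values are assumptions, and whether the talk sets the timeout exactly at the 95th percentile is not stated on the slide:

```php
<?php
// Nearest-rank percentile of observed latencies (milliseconds).
function percentile(array $samples, float $p): float
{
    if ($samples === [] || $p <= 0 || $p > 100) {
        throw new InvalidArgumentException('need samples and 0 < p <= 100');
    }
    sort($samples);
    $rank = (int) ceil($p * count($samples) / 100);
    return (float) $samples[$rank - 1];
}

// Hypothetical latencies in ms; real values would come from production metrics.
$latenciesMs = [2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 18, 22, 40, 250];
$p95 = percentile($latenciesMs, 95);
printf("p95 = %.0f ms\n", $p95);
```

A timeout near the 95th percentile cuts off the slow tail (the 250 ms outlier here) while letting the bulk of normal requests through.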
Test
• Kill memcache on box A,
  measure impact on application

• Kill memcache on box B,
  measure impact on application


All fine.. we’ve got this covered!
FAIL
• Box A, running in AWS, locks up

• Any parts of application that
  touch Memcache stop working
Things fail in
exotic ways
$ iptables -A INPUT -i eth0 \
     -p tcp --dport 11211 -j REJECT



    $ php test-memcache.php

    Working OK!




Packets rejected and source notified by ICMP. Expect fast fails.
$ iptables -A INPUT -i eth0 \
 -p tcp --dport 11211 -j DROP



$ php test-memcache.php

Working OK!




 Packets silently dropped. Expect long timeouts.
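
The difference between a rejected and a dropped connection shows up as how long the connect attempt takes. A small sketch that puts an explicit upper bound on the connect and reports the elapsed time; `probeTcp` is a hypothetical helper and the host is an assumption (11211 is the memcached port from the slides):

```php
<?php
// probeTcp is a hypothetical helper: attempt a TCP connect with an explicit
// upper bound and report how long the attempt took.
function probeTcp(string $host, int $port, float $timeoutSec): array
{
    $start = microtime(true);
    $errno = 0;
    $errstr = '';
    // The fifth argument bounds the connect itself; @ silences the warning on failure.
    $sock = @fsockopen($host, $port, $errno, $errstr, $timeoutSec);
    $elapsed = microtime(true) - $start;
    if ($sock === false) {
        return ['ok' => false, 'seconds' => $elapsed, 'error' => $errstr];
    }
    fclose($sock);
    return ['ok' => true, 'seconds' => $elapsed, 'error' => ''];
}

$result = probeTcp('127.0.0.1', 11211, 0.5);
printf(
    "%s in %.3fs %s\n",
    $result['ok'] ? 'connected' : 'failed',
    $result['seconds'],
    $result['error']
);
```

With REJECT the failure should come back almost immediately; with DROP the attempt runs until the timeout, so the timeout value is what the caller actually waits.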
$ iptables -A INPUT -i eth0 \
 -p tcp --dport 11211 \
 -m state --state ESTABLISHED \
 -j DROP



$ php test-memcache.php




           Hangs! Uh oh.
• When AWS instances hang they
  appear to accept connections
  but drop packets

• Bug!

https://bugs.launchpad.net/libmemcached/+bug/583031
Fix, rinse, repeat
RabbitMQ     RabbitMQ    RabbitMQ


                          HA cluster

      AMQP (port 5672)


 Service
$ iptables -A INPUT -i eth0 \
 -p tcp --dport 5672 \
 -m state --state ESTABLISHED \
 -j DROP



$ php test-rabbitmq.php




  Fantastic! Block AMQP port, client times out
FAIL
“RabbitMQ clusters do not
tolerate network partitions
well.”




http://www.rabbitmq.com/partitions.html
$ epmd -names
epmd: up and running on port 4369 with data:
name rabbit at port 60278




 Each node listens on a port assigned by EPMD
$ iptables -A INPUT -i eth0 \
 -p tcp --dport 60278 \
 -m state --state ESTABLISHED \
 -j DROP



$ php test-rabbitmq.php




           Hangs! Uh oh.
Mnesia('rabbit@dmzutilities03-global01-test'): ** ERROR ** mnesia_event got
     {inconsistent_database,
     running_partitioned_network,
     'rabbit@dmzutilities01-global01-test'}




     application: rabbitmq_management
     exited: shutdown
     type: temporary




RabbitMQ logs show partitioned network error; nodes shut down
while ($read < $n
    && !feof($this->sock->real_sock())
    && (false !== ($buf = fread(
        $this->sock->real_sock(),
        $n - $read)))) {
    $read += strlen($buf);
    $res .= $buf;
}




  PHP library didn’t have any time limit on reading a frame
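
A read loop like the one above can be given a deadline. The following is a sketch of the general technique, not the fix that went into the library: `readWithDeadline` is a hypothetical helper that waits with `stream_select` and gives up once the time budget is spent.

```php
<?php
// Read exactly $n bytes from a stream, giving up once $timeoutSec has elapsed.
// readWithDeadline is a hypothetical helper, not part of any AMQP library.
function readWithDeadline($stream, int $n, float $timeoutSec): string
{
    $deadline = microtime(true) + $timeoutSec;
    $res = '';
    while (strlen($res) < $n) {
        $remaining = $deadline - microtime(true);
        if ($remaining <= 0) {
            throw new RuntimeException(
                'Timed out after reading ' . strlen($res) . " of $n bytes"
            );
        }
        $read = [$stream];
        $write = null;
        $except = null;
        $sec = (int) floor($remaining);
        $usec = (int) (($remaining - $sec) * 1000000);
        // Wait until data is readable or the remaining time runs out.
        $ready = stream_select($read, $write, $except, $sec, $usec);
        if ($ready === false) {
            throw new RuntimeException('stream_select failed');
        }
        if ($ready === 0) {
            continue; // nothing arrived; the loop re-checks the deadline
        }
        $buf = fread($stream, $n - strlen($res));
        if ($buf === false || ($buf === '' && feof($stream))) {
            throw new RuntimeException('Connection closed while reading');
        }
        $res .= $buf;
    }
    return $res;
}
```

If the peer accepts the connection but never sends the rest of the frame, the caller gets an exception after `$timeoutSec` instead of hanging.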
Fix, rinse, repeat
It would be
nice if we could
 automate this
Automate!
• Hailo run a dedicated automated
  test environment

• Powered by bash, JMeter and
  Graphite

• Continuous automated testing
  with failure simulations
Fix attempt 1: bad timeouts configured
Fix attempt 2: better timeouts
Simulate in
system tests
Simulate failure

Assert monitoring endpoint
picks this up




       Assert features still work
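
The three steps above can be sketched as a tiny test. All names here (`Cache`, `FailingCache`, `ProfileService`) are hypothetical and the failure is simulated in-process; the environment described in the talk injects failures at the network level instead.

```php
<?php
// All names here are hypothetical; this sketches the shape of a
// failure-simulation test, not Hailo's actual test environment.
interface Cache
{
    public function get(string $key): ?string;
}

// Simulated failure: every cache call throws, as if the cache were unreachable.
final class FailingCache implements Cache
{
    public function get(string $key): ?string
    {
        throw new RuntimeException('simulated cache outage');
    }
}

final class ProfileService
{
    private $cache;
    private $cacheHealthy = true;

    public function __construct(Cache $cache)
    {
        $this->cache = $cache;
    }

    public function displayName(int $userId): string
    {
        try {
            $hit = $this->cache->get("name:$userId");
            $this->cacheHealthy = true;
            if ($hit !== null) {
                return $hit;
            }
        } catch (RuntimeException $e) {
            // The cache is off the critical path: note the failure and carry on.
            $this->cacheHealthy = false;
        }
        return "user-$userId"; // stand-in for a read from the primary data store
    }

    // What a monitoring endpoint might report.
    public function health(): array
    {
        return ['cache' => $this->cacheHealthy ? 'ok' : 'down'];
    }
}

$service = new ProfileService(new FailingCache());
$name = $service->displayName(42);
$health = $service->health();

assert($health['cache'] === 'down'); // monitoring picks up the failure
assert($name === 'user-42');         // the feature still works
```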
In conclusion
“the best way to avoid
failure is to fail constantly.”




http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
You should test for
failure

How does the software react?
How does the PHP client react?
Automation makes
continuous failure
testing feasible
Systems that cope well
with failure are easier
to operate
TIMED BLOCK ALL
THE THINGS
Thanks


Software used at Hailo

http://cassandra.apache.org/
http://zookeeper.apache.org/
http://www.elasticsearch.org/
http://www.acunu.com/acunu-analytics.html
https://github.com/bitly/nsq
https://github.com/davegardnerisme/cruftflake
https://github.com/davegardnerisme/nsqphp

Plus a load of other things I’ve not mentioned.
Further reading
Hystrix: Latency and Fault Tolerance for Distributed Systems
https://github.com/Netflix/Hystrix

Timelike: a network simulator
http://aphyr.com/posts/277-timelike-a-network-simulator

Notes on distributed systems for young bloods
http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/

Stream de-duplication (relevant to NSQ)
http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/

ID generation in distributed systems
http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 

Recently uploaded (20)

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 

Planning to Fail #phpne13

  • 8. 99.9% (three nines). Downtime: 43.8 minutes per month, 8.76 hours per year
  • 9. 99.99% (four nines). Downtime: 4.32 minutes per month, 52.56 minutes per year
  • 10. 99.999% (five nines). Downtime: 25.9 seconds per month, 5.26 minutes per year
  • 14. <?php
  • 15. My website: single VPS running PHP + MySQL
  • 16. No growth, low volume, simple functionality, one engineer (me!)
  • 17. Large growth, high volume, complex functionality, lots of engineers
  • 18. • Launched in London November 2011 • Now in 5 cities in 3 countries (30%+ growth every month) • A Hailo hail is accepted around the world every 5 seconds
  • 19. “.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.” http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf
  • 20. • SOA (10+ services) • AWS (3 regions, 9 AZs, lots of instances) • 10+ engineers building services and you? (hailo is hiring)
  • 28. Service Service Service Service each service does one job well Service Oriented Architecture
  • 29. • Fewer lines of code • Fewer responsibilities • Changes less frequently • Can swap entire implementation if needed
  • 31. [Diagram: Service, MySQL] MySQL running on a different box
  • 32. [Diagram: MySQL, Service, MySQL] MySQL running in Multi-Master mode
  • 34. [Diagram: MySQL handling CRUD, locking, search, analytics, ID generation; also queuing…] Separating concerns
  • 35. At Hailo we look for technologies that are: • Distributed: run on more than one machine • Homogeneous: all nodes look the same • Resilient: can cope with the loss of node(s) with no loss of data
  • 36. “There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.” http://blog.b3k.us/2012/01/24/some-rules.html
  • 37. • Highly performant, scalable and resilient data store • Underpins much of what we do at Hailo • Makes multi-DC easy!
  • 38. ZooKeeper • Highly reliable distributed coordination • We implement locking and leadership election on top of ZK and use sparingly
  • 39. • Distributed, RESTful, Search Engine built on top of Apache Lucene • Replaced basic foo LIKE ‘%bar%’ queries (so much better)
  • 40. NSQ • Realtime message processing system designed to handle billions of messages per day • Fault tolerant, highly available with reliable message delivery guarantee
  • 41. • Real time incremental analytics platform, backed by Apache Cassandra • Powerful SQL-like interface • Scalable and highly available
  • 42. Cruftflake • Distributed ID generation with no coordination required • Rock solid
  • 43. • All these technologies have similar properties of distribution and resilience • They are designed to cope with failure • They are not broken by design
  • 46. What is the minimum viable service?
  • 47. class HailoMemcacheService { private $mc = null; public function __call($name, $args) { $mc = $this->getInstance(); // do stuff } private function getInstance() { if ($this->mc === null) { $this->mc = new Memcached; $this->mc->addServers($s); } return $this->mc; } } Lazy-init instances; connect on use
  • 48. Configure clients carefully
  • 49. $this->mc = new Memcached; $this->mc->addServers($s); $this->mc->setOption( Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout); $this->mc->setOption( Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout); $this->mc->setOption( Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout); $this->mc->setOption( Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout); Make sure timeouts are configured
  • 51. “Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.” http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
  • 53. Test
  • 54. • Kill memcache on box A, measure impact on application • Kill memcache on box B, measure impact on application All fine.. we’ve got this covered!
  • 55. FAIL
  • 56. • Box A, running in AWS, locks up • Any parts of application that touch Memcache stop working
  • 58. $ iptables -A INPUT -i eth0 -p tcp --dport 11211 -j REJECT $ php test-memcache.php Working OK! Packets rejected and source notified by ICMP. Expect fast fails.
  • 59. $ iptables -A INPUT -i eth0 -p tcp --dport 11211 -j DROP $ php test-memcache.php Working OK! Packets silently dropped. Expect long time outs.
  • 60. $ iptables -A INPUT -i eth0 -p tcp --dport 11211 -m state --state ESTABLISHED -j DROP $ php test-memcache.php Hangs! Uh oh.
  • 61. • When AWS instances hang they appear to accept connections but drop packets • Bug! https://bugs.launchpad.net/libmemcached/+bug/583031
  • 63. RabbitMQ RabbitMQ RabbitMQ HA cluster AMQP (port 5672) Service
  • 64. $ iptables -A INPUT -i eth0 -p tcp --dport 5672 -m state --state ESTABLISHED -j DROP $ php test-rabbitmq.php Fantastic! Block AMQP port, client times out
  • 65. FAIL
  • 66. “RabbitMQ clusters do not tolerate network partitions well.” http://www.rabbitmq.com/partitions.html
  • 67. $ epmd -names epmd: up and running on port 4369 with data: name rabbit at port 60278 Each node listens on a port assigned by EPMD
  • 69. $ iptables -A INPUT -i eth0 -p tcp --dport 60278 -m state --state ESTABLISHED -j DROP $ php test-rabbitmq.php Hangs! Uh oh.
  • 70. Mnesia('rabbit@dmzutilities03-global01-test'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@dmzutilities01-global01-test'} application: rabbitmq_management exited: shutdown type: temporary RabbitMQ logs show partitioned network error; nodes shut down
  • 72. while ($read < $n && !feof($this->sock->real_sock()) && (false !== ($buf = fread( $this->sock->real_sock(), $n - $read)))) { $read += strlen($buf); $res .= $buf; } PHP library didn’t have any time limit on reading a frame
  • 74. It would be nice if we could automate this
  • 76. • Hailo run a dedicated automated test environment • Powered by bash, JMeter and Graphite • Continuous automated testing with failure simulations
  • 77. Fix attempt 1: bad timeouts configured
  • 78. Fix attempt 2: better timeouts
  • 80. Simulate failure Assert monitoring endpoint picks this up Assert features still work
  • 82. “the best way to avoid failure is to fail constantly.” http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
  • 83. You should test for failure How does the software react? How does the PHP client react?
  • 85. Systems that cope well with failure are easier to operate
  • 87. Thanks Software used at Hailo http://cassandra.apache.org/ http://zookeeper.apache.org/ http://www.elasticsearch.org/ http://www.acunu.com/acunu-analytics.html https://github.com/bitly/nsq https://github.com/davegardnerisme/cruftflake https://github.com/davegardnerisme/nsqphp Plus a load of other things I’ve not mentioned.
  • 88. Further reading Hystrix: Latency and Fault Tolerance for Distributed Systems https://github.com/Netflix/Hystrix Timelike: a network simulator http://aphyr.com/posts/277-timelike-a-network-simulator Notes on distributed systems for young bloods http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/ Stream de-duplication (relevant to NSQ) http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/ ID generation in distributed systems http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems
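
Slide 72 shows a frame-reading loop with no time limit, which is why the client hung when packets were silently dropped. One way to bound such a loop is sketched below. It is an illustration only: it assumes a plain PHP stream socket, and readWithDeadline is a hypothetical helper name, not part of any AMQP client library and not necessarily how the library was actually fixed.

```php
<?php
/**
 * Read exactly $n bytes from a stream, giving up once $timeoutSec has elapsed.
 * Hypothetical helper for illustration; not taken from any AMQP client.
 */
function readWithDeadline($sock, int $n, float $timeoutSec): string
{
    $deadline = microtime(true) + $timeoutSec;
    $res = '';

    while (strlen($res) < $n) {
        $remaining = $deadline - microtime(true);
        if ($remaining <= 0) {
            throw new RuntimeException("read timed out after {$timeoutSec}s");
        }

        // Wait until the socket is readable, but never longer than the time left.
        $r = [$sock];
        $w = null;
        $e = null;
        $sec = (int) floor($remaining);
        $usec = (int) (($remaining - $sec) * 1000000);
        $ready = stream_select($r, $w, $e, $sec, $usec);

        if ($ready === false) {
            throw new RuntimeException('stream_select failed');
        }
        if ($ready === 0) {
            continue; // nothing arrived; the loop re-checks the deadline
        }

        $buf = fread($sock, $n - strlen($res));
        if ($buf === false || ($buf === '' && feof($sock))) {
            throw new RuntimeException('connection closed before the read completed');
        }
        $res .= $buf;
    }

    return $res;
}
```

The deadline covers the whole read rather than each fread call, so a peer that trickles one byte at a time cannot keep the caller waiting indefinitely.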

Editor's Notes

  1. I’m dave!
  2. I work at Hailo. This presentation draws on my experiences building Hailo into one of the world’s leading taxi companies.
  3. The title of my talk is “planning to fail”
  4. First PHP conf; tempting fate. Thought about this title, but sounds more like monitoring.
  5. This talk more pro-active than that. Talking about my experiences at Hailo building reliable web services by continually failing.
  6. Why do we care about reliability?
  7. Advantages
  8. Advantages
  9. Advantages
  10. Advantages
  11. Advantages
  12. But first, let’s rewind to the beginning
  13. The pure joy of inserting a php tag in the middle of an HTML table
  14. My website still follows this pattern. I’d like to think my website is quite reliable.
  15. My website is reliable, but simple. Doesn’t change very often.
  16. Hailo is complex!
  17. Hailo is growing.
  18. Key quote: less machinery is quadratically better.
  19. Hailo have a lot of machinery!
  20. Enter the chaos monkey… If you want to be good at something, practice often!
  21. How about the “reliable” VPS that runs my website?
  22. But not resilient; my website would not cope well with the chaos monkey approach.
  23. We have to choose our stack appropriately if we are going to go down the chaos monkey route.
  24. Hailo didn’t start out this way; but the PHP component did
  25. Splitting into an SOA. Makes it much easier to change bits of code since each service does less, has fewer lines of code and changes less frequently. Also makes it easier to work in larger teams.
  26. Advantages
  27. Here’s one of our services… is this reliable?
  28. But Hailo is going global
  29. At Hailo we are splitting out the features of MySQL and using different technologies where appropriate
  30. Don’t pick things that are broken by design
  31. We remove services from the critical path using lazy-init pattern
  32. We want to define timeouts so that under failure conditions we don’t hang forever
  33. Instrumenting operation times: mean, upper 90th, upper bound (highest observed value)
  34. Let’s aim for 95th percentile as our timeout – but instrument when we do have timeouts so that we know what’s going on
  35. Yay!
  36. Boo
  37. Boo
  38. This was after we fixed the bug, but we had the timeouts configured badly.
  39. Better: memcache failure has less impact now; some features might be degraded, but the minimal viable service now works
  40. Runnable .md based system tests
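
Notes 33 and 34 describe instrumenting operation times and aiming for the 95th percentile as the timeout. A rough sketch of that calculation follows. The percentile function and the sample timings are made up for illustration, and the nearest-rank method is one common definition of a percentile, not necessarily the one Hailo's instrumentation used.

```php
<?php
/**
 * Nearest-rank percentile of a list of observed timings.
 * Hypothetical helper for illustration.
 */
function percentile(array $samples, float $p): float
{
    if ($samples === []) {
        throw new InvalidArgumentException('no samples');
    }
    sort($samples);
    // Nearest rank: the smallest value with at least $p percent of samples at or below it.
    $rank = (int) ceil($p * count($samples) / 100);
    return (float) $samples[max(0, $rank - 1)];
}

// Invented timings in milliseconds, including two slow outliers.
$timingsMs = [2, 3, 2, 4, 3, 2, 5, 3, 2, 3, 4, 2, 3, 6, 2, 3, 4, 2, 45, 120];

// Candidate timeout: the 95th percentile, rounded up to a whole millisecond.
$timeoutMs = (int) ceil(percentile($timingsMs, 95));
echo "suggested timeout: {$timeoutMs}ms\n";
```

When feeding a value like this into the Memcached options from slide 49, check the unit each option expects in the PHP Memcached documentation, since not all of the timeout options use the same unit. As note 34 says, count how often the timeout actually fires so that a badly chosen value shows up in monitoring.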