SlideShare ist ein Scribd-Unternehmen logo
1 von 49
Downloaden Sie, um offline zu lesen
Fixing Twitter
... and Finding your own Fail Whale




            John Adams
        Twitter Operations
        <jna@twitter.com>
Operations
• Small team, growing rapidly.
• What do we do?
 • Software Performance (back-end)
 • Availability
 • Capacity Planning (metrics-driven)
 • Configuration Management
• We don’t deal with the physical plant.
Managed Services
• Dedicated team (NTTA)
• 24/7 Hands on remote support
• No clouds. We tried that!
 • Need raw processing power, latency too
     high in existing cloud offerings
• Frees us to deal with real, intellectual,
  computer science problems.
752%
2008 Growth
           5



         3.75



          2.5



         1.25



           0
                Dec 07    Feb 08   Apr 08       Jun 08   Aug 08   Oct 08 Dec 08
                Unique Visitors (in Millions)
That was only the beginning...




previous
 graph!
Uniques




 Not slowing down, despite what outsiders say.
  Hard for outsiders to measure API usage!
Growth = Pain
 + an appreciation for Institutionalized Fear
Mantra!


   Find Weakest
       Point




   Metrics +
Logs + Science =
    Analysis
Mantra!


   Find Weakest    Take Corrective
       Point           Action




   Metrics +         Process
Logs + Science =
    Analysis
Mantra!


   Find Weakest    Take Corrective    Move to Next
       Point           Action         Weakest Point




   Metrics +         Process         Repeatability
Logs + Science =
    Analysis
Find the Weakest Point

  • Metrics + Graphs
   • Individual metrics are irrelevant
  • Logs
  • SCIENCE!
  • Find out what the actionable items are.
Instrument Everything




                        (cc) seenoevil@flickr
Monitoring

 • Graph and report critical metrics in as near
   real time as possible
 • You already have the tools.
  • RRD
  • Ganglia + custom gMetric scripts
  • MRTG
Dashboards
 • “Criticals” view
 • Smokeping/MRTG
 • Google Analytics
  • Not just for
     HTTP 200s/SEO
 • XML Feeds from
   managed services
 • Data Porn!
Analyze
  • Turn data into information
   • Where is the code base going?
   • Are things worse than they were?
     • Understand the impact of the last
        software deploy
      • Run check scripts during and after
        deploys
  • Capacity Planning, not Fire Fighting!
Forecasting                   Curve-fitting for capacity planning
                               (R, fityk, Mathematica, CurveFit)



              unsigned int (32 bit)
                Twitpocolypse



  status_id

                                      signed int (32 bit)
                                        Twitpocolypse




                                                  r2=0.99
Deploys

  • Graph time-of-deploy along side server
      CPU and Latency
  • Display time-of-last-deploy on dashboard


 last deploy times
Whale-Watcher
•   Simple shell script,
    •   MASSIVE WIN.
•   Whale = HTTP 503 (timeout)
•   Robot = HTTP 500 (error)
•   Examines last 100,000 lines of aggregated
    daemon / www logs
•   “Whales per Second” > Wthreshold
    •   Thar be whales! Call in ops.
Take Action !
Feature “Darkmode”
 • Specific site controls to enable and
   disable computationally or IO-Heavy site
   function
 • The “Emergency Stop” button
 • Changes logged and reported to all teams
 • Around 60 switches we can throw
 • Static / Read-only mode
Configuration
Management
• Start automated configuration management
  EARLY in your company.
• Don’t wait until it’s too late.
• Twitter started within the first few months.
Configuration
Management
   • Complex Environment
   • Multiple Admins
   • Unknown Interactions
   • Solution: 2nd set of eyes.
Process through Reviews
Reviewboard
         www.review-board.org

 • SVN pre-commit hook causes a failure if
   the log message doesn’t include
   ‘reviewed’
 • SVN post-commit hook informs people
   what changed via email
 • Watches the entire SVN tree
Improve
Communication


                Campfire
Subsystems
Many limiting factors in the request pipeline

        Apache                      Rails
      MPM Model                  (mongrel)
      MaxClients            2:1 oversubscribed
 TCP Listen queue depth
                                  to cores

                                 Memcached
                                # connections


                                    MySQL
   Varnish (search)            # db connections
      # threads
Make an attack plan.
 Symptom    Bottleneck      Vector       Solution

                            HTTP
Bandwidth   Network                     Servers++
                           Latency
                                          Better
 Timeline   Database     Update Delay
                                        algorithm
                                         DBs++
  Search    Database        Delays
                                          Code
 Updates    Algorithm      Latency      Algorithms
CPU: More with Less
• Reduction in 40% of CPU by replacing dual
  and quad core machines with 8 core
• Switching from AMD to Intel Xeon = 30%
  gain
• Saved data center space, power, cost per
  month.
• Not the best option if you own machines.
  Capital expenditure = hard to realize new
  technology gains.
Rails
• Stop blaming Rails.
• Analysis found:
 • Caching + Cache invalidation problems
 • Bad queries generated by ActiveRecord,
    resulting in slow queries against the db
 • Queue Latency
 • Memcache / Page Cache Corruption
 • Replication Lag
Disk is the new Tape.
• Social Networking application profile has
  many   O(ny)   operations.
• Page requests have to happen in < 500mS
  or users start to notice. Goal: 250-300mS
• Web 2.0 isn’t possible without lots of RAM
• What to do?
Caching
  • We’re the real-time web, but lots of caching
    opportunity
  • Most caching strategies rely on long TTLs
    (>60 s)
  • Separate memcache pools for different data
    types to prevent eviction
  • Optimize Ruby Gem to libmemcached +
    FNV Hash instead of Ruby + MD5
  • Twitter now largest contributor to
    libmemcached
Caching   50% decrease in load with Native C
                gem + libmemcached
Cache Money!
• Active Record Plugin
 • Cache when reading from the DB
 • Cache when writing to the DB
• Transparently provides caching
 • Removes need for set/get cache code
 • Open Source!
Caching

 • “Cache Everything!” not the best policy
 • Invalidating caches at the right time is
   difficult.
 • Cold Cache problem
 • Network Memory Bus != Infinite
Memcached
• memcached isn’t perfect.
 • Memcached SEGVs hurt us early on.
• Evictions make the cache unreliable for
  important configuration data
  (loss of darkmode flags, for example)
• Data and Hash Corruption (even in 1.2.6)
 • Exposed corruption issue with specific
    inputs causing SEGV and unexpected
    behavior
API + Caching (search)
• Cache and control abusive clients
• Varnish between two Apache Virtual Hosts
  (failover to another backend if Varnish
  dies)
• Remove Cache busting query strings before
  applying hash algorithm
• Using ESI to cache jQuery requests when
  specifying a callback= parameter - big win.
Relational Databases
not a Panacea
• Good for:
 • Users, Relational Data, Transactions
• Bad:
 • Queues. Polling operations. Caching.
• You don’t need ACID for everything.
• Enter the message queue...
Queues
• Many message queue solutions on the
  market
• At high loads, most perform poorly when
  used in ‘durable’ mode.
• Erlang based queues work well
  (RabbitMQ), but you need in house Erlang
  experience.
• We wrote our own.
 • Kestrel to the rescue!
Kestrel
Falco tinnunculus




  • Works like memcache (same protocol)
  • SET = enqueue | GET = dequeue
  • No strict ordering of jobs
  • No shared state between servers
  • Written in Scala.
Asynchronous
Requests
• Inbound traffic consumes a mongrel
• Outbound traffic consumes a mongrel
• The request pipeline should not be used to
  handle 3rd party communications or
  back-end work.
• Daemons, Daemons, Daemons.
Don’t make services
dependent
• Move operations out of the synchronous
  request cycle
 • Email
 • Complex object generation (timelines)
 • 3rd party services (bit.ly, sms, etc.)
Daemons
• Many different types at Twitter.
• # of daemons have to match the workload
 • Early Kestrel would crash if queues filled
• “Seppaku” patch
 • Kill daemons after n requests
• Long-running daemons = low memory
MySQL Challenges
• Replication Delay
 • Single threaded. Slow.
• Social Networking not good for RDBMS
 • N x N relationships and social graph /
    tree traversal
 • Sharding importance
 • Disk issues (FS Choice, noatime,
    scheduling algorithm)
MySQL

• Replication delay and cache eviction
  produce inconsistent results to the end
  user.
• Locks create resource contention for
  popular data
Database Replication
  • Major issues around users and statuses
    tables
  • Multiple functional masters (FRP, FWP)
  • Make sure your code reads and writes to
    the write DBs. Reading from master = slow
    death
    • Monitor the DB. Find slow / poorly
      designed queries
  • Kill long running queries before they kill
    you (mkill)
status.twitter.com

• Keep users in the loop, or suffer.
• Hosted on different service (Tumblr)
• No matter how little information you have
  available.
Key Points

• Databases not always the best store.
• Instrument everything.
• Use metrics to make decisions, not guesses.
• Don’t make services dependent
• Process asynchronously when possible
Thanks!
Twitter Open Source (Apache License):

- CacheMoney Gem (Write through Caching)
http://github.com/nkallen/cache-money/tree/master

- Libmemcached
http://tangent.org/552/libmemcached.html

- Kestrel (Memcache-like message queue)
http://github.com/robey/kestrel

- mod_memcache_block (Apache 2.x Limiter/blocker)
http://github.com/netik/mod_memcache_block

Weitere ähnliche Inhalte

Was ist angesagt?

Grokking TechTalk #16: Html js and three way binding
Grokking TechTalk #16: Html js and three way bindingGrokking TechTalk #16: Html js and three way binding
Grokking TechTalk #16: Html js and three way bindingGrokking VN
 
Query or Not to Query? Using Apache Spark Metrics to Highlight Potentially Pr...
Query or Not to Query? Using Apache Spark Metrics to Highlight Potentially Pr...Query or Not to Query? Using Apache Spark Metrics to Highlight Potentially Pr...
Query or Not to Query? Using Apache Spark Metrics to Highlight Potentially Pr...Databricks
 
Machine learning pipeline
Machine learning pipelineMachine learning pipeline
Machine learning pipelineVadym Kuzmenko
 
Open Data Science Conference Agile Data
Open Data Science Conference Agile DataOpen Data Science Conference Agile Data
Open Data Science Conference Agile DataDataKitchen
 
Building A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning PipelineBuilding A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning PipelineRobert Dempsey
 
Sydney Continuous Delivery Meetup May 2014
Sydney Continuous Delivery Meetup May 2014Sydney Continuous Delivery Meetup May 2014
Sydney Continuous Delivery Meetup May 2014Andreas Grabner
 
Scaling Your Architecture with Services and Events
Scaling Your Architecture with Services and EventsScaling Your Architecture with Services and Events
Scaling Your Architecture with Services and EventsRandy Shoup
 
Eventually, time will kill your data processing
Eventually, time will kill your data processingEventually, time will kill your data processing
Eventually, time will kill your data processingLars Albertsson
 
Minimal Viable Architecture - Silicon Slopes 2020
Minimal Viable Architecture - Silicon Slopes 2020Minimal Viable Architecture - Silicon Slopes 2020
Minimal Viable Architecture - Silicon Slopes 2020Randy Shoup
 
Building data intensive applications
Building data intensive applicationsBuilding data intensive applications
Building data intensive applicationsAmit Kejriwal
 
Just In Time Scalability Agile Methods To Support Massive Growth Presentation
Just In Time Scalability  Agile Methods To Support Massive Growth PresentationJust In Time Scalability  Agile Methods To Support Massive Growth Presentation
Just In Time Scalability Agile Methods To Support Massive Growth PresentationEric Ries
 
Reactive Integrations - Caveats and bumps in the road explained
Reactive Integrations - Caveats and bumps in the road explained  Reactive Integrations - Caveats and bumps in the road explained
Reactive Integrations - Caveats and bumps in the road explained Markus Eisele
 
Scaling Your Architecture for the Long Term
Scaling Your Architecture for the Long TermScaling Your Architecture for the Long Term
Scaling Your Architecture for the Long TermRandy Shoup
 
An Approach to Data Quality for Netflix Personalization Systems
An Approach to Data Quality for Netflix Personalization SystemsAn Approach to Data Quality for Netflix Personalization Systems
An Approach to Data Quality for Netflix Personalization SystemsDatabricks
 
Using JMeter in CloudTest for Continuous Testing
Using JMeter in CloudTest for Continuous TestingUsing JMeter in CloudTest for Continuous Testing
Using JMeter in CloudTest for Continuous TestingSOASTA
 
What to Do—Develop Your Own Automation or Use Crowdsourced Testing?
What to Do—Develop Your Own Automation or Use Crowdsourced Testing?What to Do—Develop Your Own Automation or Use Crowdsourced Testing?
What to Do—Develop Your Own Automation or Use Crowdsourced Testing?TechWell
 
Webinar: Load Testing for Your Peak Season
Webinar: Load Testing for Your Peak SeasonWebinar: Load Testing for Your Peak Season
Webinar: Load Testing for Your Peak SeasonSOASTA
 
Designing Modern Web Applications
Designing Modern Web ApplicationsDesigning Modern Web Applications
Designing Modern Web ApplicationsLucas Carlson
 
Model Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and VertaModel Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and VertaDatabricks
 

Was ist angesagt? (20)

Grokking TechTalk #16: Html js and three way binding
Grokking TechTalk #16: Html js and three way bindingGrokking TechTalk #16: Html js and three way binding
Grokking TechTalk #16: Html js and three way binding
 
Query or Not to Query? Using Apache Spark Metrics to Highlight Potentially Pr...
Query or Not to Query? Using Apache Spark Metrics to Highlight Potentially Pr...Query or Not to Query? Using Apache Spark Metrics to Highlight Potentially Pr...
Query or Not to Query? Using Apache Spark Metrics to Highlight Potentially Pr...
 
Machine learning pipeline
Machine learning pipelineMachine learning pipeline
Machine learning pipeline
 
Open Data Science Conference Agile Data
Open Data Science Conference Agile DataOpen Data Science Conference Agile Data
Open Data Science Conference Agile Data
 
Building A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning PipelineBuilding A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning Pipeline
 
Sydney Continuous Delivery Meetup May 2014
Sydney Continuous Delivery Meetup May 2014Sydney Continuous Delivery Meetup May 2014
Sydney Continuous Delivery Meetup May 2014
 
Continuous database deployment
Continuous database deploymentContinuous database deployment
Continuous database deployment
 
Scaling Your Architecture with Services and Events
Scaling Your Architecture with Services and EventsScaling Your Architecture with Services and Events
Scaling Your Architecture with Services and Events
 
Eventually, time will kill your data processing
Eventually, time will kill your data processingEventually, time will kill your data processing
Eventually, time will kill your data processing
 
Minimal Viable Architecture - Silicon Slopes 2020
Minimal Viable Architecture - Silicon Slopes 2020Minimal Viable Architecture - Silicon Slopes 2020
Minimal Viable Architecture - Silicon Slopes 2020
 
Building data intensive applications
Building data intensive applicationsBuilding data intensive applications
Building data intensive applications
 
Just In Time Scalability Agile Methods To Support Massive Growth Presentation
Just In Time Scalability  Agile Methods To Support Massive Growth PresentationJust In Time Scalability  Agile Methods To Support Massive Growth Presentation
Just In Time Scalability Agile Methods To Support Massive Growth Presentation
 
Reactive Integrations - Caveats and bumps in the road explained
Reactive Integrations - Caveats and bumps in the road explained  Reactive Integrations - Caveats and bumps in the road explained
Reactive Integrations - Caveats and bumps in the road explained
 
Scaling Your Architecture for the Long Term
Scaling Your Architecture for the Long TermScaling Your Architecture for the Long Term
Scaling Your Architecture for the Long Term
 
An Approach to Data Quality for Netflix Personalization Systems
An Approach to Data Quality for Netflix Personalization SystemsAn Approach to Data Quality for Netflix Personalization Systems
An Approach to Data Quality for Netflix Personalization Systems
 
Using JMeter in CloudTest for Continuous Testing
Using JMeter in CloudTest for Continuous TestingUsing JMeter in CloudTest for Continuous Testing
Using JMeter in CloudTest for Continuous Testing
 
What to Do—Develop Your Own Automation or Use Crowdsourced Testing?
What to Do—Develop Your Own Automation or Use Crowdsourced Testing?What to Do—Develop Your Own Automation or Use Crowdsourced Testing?
What to Do—Develop Your Own Automation or Use Crowdsourced Testing?
 
Webinar: Load Testing for Your Peak Season
Webinar: Load Testing for Your Peak SeasonWebinar: Load Testing for Your Peak Season
Webinar: Load Testing for Your Peak Season
 
Designing Modern Web Applications
Designing Modern Web ApplicationsDesigning Modern Web Applications
Designing Modern Web Applications
 
Model Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and VertaModel Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and Verta
 

Andere mochten auch

Andere mochten auch (8)

Changing Skins
Changing SkinsChanging Skins
Changing Skins
 
Q
QQ
Q
 
Pricing To Sell
Pricing To SellPricing To Sell
Pricing To Sell
 
Buyer Presentation for Real Living
Buyer Presentation for Real LivingBuyer Presentation for Real Living
Buyer Presentation for Real Living
 
Rl listing-presentation powerpoint 5-27-11
Rl listing-presentation powerpoint 5-27-11Rl listing-presentation powerpoint 5-27-11
Rl listing-presentation powerpoint 5-27-11
 
Real Living Lifestyles
Real Living LifestylesReal Living Lifestyles
Real Living Lifestyles
 
Rl listing-presentation powerpoint 5-27-11
Rl listing-presentation powerpoint 5-27-11Rl listing-presentation powerpoint 5-27-11
Rl listing-presentation powerpoint 5-27-11
 
Facebook Friend Lists
Facebook Friend ListsFacebook Friend Lists
Facebook Friend Lists
 

Ähnlich wie Fixing Twitter Improving The Performance And Scalability Of The Worlds Most Popular Micro Blogging Site Presentation

John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudyJohn Adams
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterJohn Adams
 
Fixing Twitter Velocity2009
Fixing Twitter Velocity2009Fixing Twitter Velocity2009
Fixing Twitter Velocity2009John Adams
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the Worldjhugg
 
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...confluent
 
Building FoundationDB
Building FoundationDBBuilding FoundationDB
Building FoundationDBFoundationDB
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInLinkedIn
 
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20....Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...Javier García Magna
 
Kafka Summit SF 2017 - Running Kafka for Maximum Pain
Kafka Summit SF 2017 - Running Kafka for Maximum PainKafka Summit SF 2017 - Running Kafka for Maximum Pain
Kafka Summit SF 2017 - Running Kafka for Maximum Painconfluent
 
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd KnownCassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd KnownDataStax
 
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in ProductionCassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in ProductionDataStax Academy
 
Cassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in ProductionCassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in ProductionDataStax Academy
 
Cassandra Day London 2015: Diagnosing Problems in Production
Cassandra Day London 2015: Diagnosing Problems in ProductionCassandra Day London 2015: Diagnosing Problems in Production
Cassandra Day London 2015: Diagnosing Problems in ProductionDataStax Academy
 
Using Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.comUsing Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.comDamien Krotkine
 
Performance Oriented Design
Performance Oriented DesignPerformance Oriented Design
Performance Oriented DesignRodrigo Campos
 
In Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics SolutionIn Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics SolutionAdaryl "Bob" Wakefield, MBA
 
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestDataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestHakka Labs
 
Scalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestScalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestKrishna Gade
 
Diagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraDiagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraJon Haddad
 
SeaJUG May 2012 mybatis
SeaJUG May 2012 mybatisSeaJUG May 2012 mybatis
SeaJUG May 2012 mybatisWill Iverson
 

Ähnlich wie Fixing Twitter Improving The Performance And Scalability Of The Worlds Most Popular Micro Blogging Site Presentation (20)

John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling Twitter
 
Fixing Twitter Velocity2009
Fixing Twitter Velocity2009Fixing Twitter Velocity2009
Fixing Twitter Velocity2009
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the World
 
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
 
Building FoundationDB
Building FoundationDBBuilding FoundationDB
Building FoundationDB
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
 
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20....Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
 
Kafka Summit SF 2017 - Running Kafka for Maximum Pain
Kafka Summit SF 2017 - Running Kafka for Maximum PainKafka Summit SF 2017 - Running Kafka for Maximum Pain
Kafka Summit SF 2017 - Running Kafka for Maximum Pain
 
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd KnownCassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
 
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in ProductionCassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
 
Cassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in ProductionCassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in Production
 
Cassandra Day London 2015: Diagnosing Problems in Production
Cassandra Day London 2015: Diagnosing Problems in ProductionCassandra Day London 2015: Diagnosing Problems in Production
Cassandra Day London 2015: Diagnosing Problems in Production
 
Using Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.comUsing Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.com
 
Performance Oriented Design
Performance Oriented DesignPerformance Oriented Design
Performance Oriented Design
 
In Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics SolutionIn Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics Solution
 
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestDataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
 
Scalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestScalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at Pinterest
 
Diagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraDiagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - Cassandra
 
SeaJUG May 2012 mybatis
SeaJUG May 2012 mybatisSeaJUG May 2012 mybatis
SeaJUG May 2012 mybatis
 

Kürzlich hochgeladen

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 

Kürzlich hochgeladen (20)

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Fixing Twitter Improving The Performance And Scalability Of The Worlds Most Popular Micro Blogging Site Presentation

  • 1. Fixing Twitter ... and Finding your own Fail Whale John Adams Twitter Operations <jna@twitter.com>
  • 2. Operations • Small team, growing rapidly. • What do we do? • Software Performance (back-end) • Availability • Capacity Planning (metrics-driven) • Configuration Management • We don’t deal with the physical plant.
  • 3. Managed Services • Dedicated team (NTTA) • 24/7 Hands on remote support • No clouds. We tried that! • Need raw processing power, latency too high in existing cloud offerings • Frees us to deal with real, intellectual, computer science problems.
  • 4. 752% 2008 Growth 5 3.75 2.5 1.25 0 Dec 07 Feb 08 Apr 08 Jun 08 Aug 08 Oct 08 Dec 08 Unique Visitors (in Millions)
  • 5. That was only the beginning... previous graph!
  • 6. Uniques Not slowing down, despite what outsiders say. Hard for outsiders to measure API usage!
  • 7. Growth = Pain + an appreciation for Institutionalized Fear
  • 8. Mantra! Find Weakest Point Metrics + Logs + Science = Analysis
  • 9. Mantra! Find Weakest Take Corrective Point Action Metrics + Process Logs + Science = Analysis
  • 10. Mantra! Find Weakest Take Corrective Move to Next Point Action Weakest Point Metrics + Process Repeatability Logs + Science = Analysis
  • 11. Find the Weakest Point • Metrics + Graphs • Individual metrics are irrelevant • Logs • SCIENCE! • Find out what the actionable items are.
  • 12. Instrument Everything (cc) seenoevil@flickr
  • 13. Monitoring • Graph and report critical metrics in as near real time as possible • You already have the tools. • RRD • Ganglia + custom gMetric scripts • MRTG
  • 14. Dashboards • “Criticals” view • Smokeping/MRTG • Google Analytics • Not just for HTTP 200s/SEO • XML Feeds from managed services • Data Porn!
  • 15. Analyze • Turn data into information • Where is the code base going? • Are things worse than they were? • Understand the impact of the last software deploy • Run check scripts during and after deploys • Capacity Planning, not Fire Fighting!
  • 16. Forecasting Curve-fitting for capacity planning (R, fityk, Mathematica, CurveFit) unsigned int (32 bit) Twitpocolypse status_id signed int (32 bit) Twitpocolypse r2=0.99
  • 17. Deploys • Graph time-of-deploy along side server CPU and Latency • Display time-of-last-deploy on dashboard last deploy times
  • 18. Whale-Watcher • Simple shell script, • MASSIVE WIN. • Whale = HTTP 503 (timeout) • Robot = HTTP 500 (error) • Examines last 100,000 lines of aggregated daemon / www logs • “Whales per Second” > Wthreshold • Thar be whales! Call in ops.
  • 20. Feature “Darkmode” • Specific site controls to enable and disable computationally or IO-Heavy site function • The “Emergency Stop” button • Changes logged and reported to all teams • Around 60 switches we can throw • Static / Read-only mode
  • 21. Configuration Management • Start automated configuration management EARLY in your company. • Don’t wait until it’s too late. • Twitter started within the first few months.
  • 22. Configuration Management • Complex Environment • Multiple Admins • Unknown Interactions • Solution: 2nd set of eyes.
  • 24. Reviewboard www.review-board.org • SVN pre-commit hook causes a failure if the log message doesn’t include ‘reviewed’ • SVN post-commit hook informs people what changed via email • Watches the entire SVN tree
  • 27. Many limiting factors in the request pipeline Apache Rails MPM Model (mongrel) MaxClients 2:1 oversubscribed TCP Listen queue depth to cores Memcached # connections MySQL Varnish (search) # db connections # threads
  • 28. Make an attack plan. Symptom Bottleneck Vector Solution HTTP Bandwidth Network Servers++ Latency Better Timeline Database Update Delay algorithm DBs++ Search Database Delays Code Updates Algorithm Latency Algorithms
  • 29. CPU: More with Less • Reduction in 40% of CPU by replacing dual and quad core machines with 8 core • Switching from AMD to Intel Xeon = 30% gain • Saved data center space, power, cost per month. • Not the best option if you own machines. Capital expenditure = hard to realize new technology gains.
  • 30. Rails • Stop blaming Rails. • Analysis found: • Caching + Cache invalidation problems • Bad queries generated by ActiveRecord, resulting in slow queries against the db • Queue Latency • Memcache / Page Cache Corruption • Replication Lag
  • 31. Disk is the new Tape. • Social Networking application profile has many O(ny) operations. • Page requests have to happen in < 500mS or users start to notice. Goal: 250-300mS • Web 2.0 isn’t possible without lots of RAM • What to do?
  • 32. Caching • We’re the real-time web, but lots of caching opportunity • Most caching strategies rely on long TTLs (>60 s) • Separate memcache pools for different data types to prevent eviction • Optimize Ruby Gem to libmemcached + FNV Hash instead of Ruby + MD5 • Twitter now largest contributor to libmemcached
  • 33. Caching 50% decrease in load with Native C gem + libmemcached
  • 34. Cache Money! • Active Record Plugin • Cache when reading from the DB • Cache when writing to the DB • Transparently provides caching • Removes need for set/get cache code • Open Source!
  • 35. Caching • “Cache Everything!” not the best policy • Invalidating caches at the right time is difficult. • Cold Cache problem • Network Memory Bus != Infinite
  • 36. Memcached • memcached isn’t perfect. • Memcached SEGVs hurt us early on. • Evictions make the cache unreliable for important configuration data (loss of darkmode flags, for example) • Data and Hash Corruption (even in 1.2.6) • Exposed corruption issue with specific inputs causing SEGV and unexpected behavior
  • 37. API + Caching (search) • Cache and control abusive clients • Varnish between two Apache Virtual Hosts (failover to another backend if Varnish dies) • Remove Cache busting query strings before applying hash algorithm • Using ESI to cache jQuery requests when specifying a callback= parameter - big win.
  • 38. Relational Databases not a Panacea • Good for: • Users, Relational Data, Transactions • Bad: • Queues. Polling operations. Caching. • You don’t need ACID for everything. • Enter the message queue...
  • 39. Queues • Many message queue solutions on the market • At high loads, most perform poorly when used in ‘durable’ mode. • Erlang based queues work well (RabbitMQ), but you need in house Erlang experience. • We wrote our own. • Kestrel to the rescue!
  • 40. Kestrel Falco tinnunculus • Works like memcache (same protocol) • SET = enqueue | GET = dequeue • No strict ordering of jobs • No shared state between servers • Written in Scala.
  • 41. Asynchronous Requests • Inbound traffic consumes a mongrel • Outbound traffic consumes a mongrel • The request pipeline should not be used to handle 3rd party communications or back-end work. • Daemons, Daemons, Daemons.
  • 42. Don’t make services dependent • Move operations out of the synchronous request cycle • Email • Complex object generation (timelines) • 3rd party services (bit.ly, sms, etc.)
  • 43. Daemons • Many different types at Twitter. • # of daemons have to match the workload • Early Kestrel would crash if queues filled • “Seppaku” patch • Kill daemons after n requests • Long-running daemons = low memory
  • 44. MySQL Challenges • Replication Delay • Single threaded. Slow. • Social Networking not good for RDBMS • N x N relationships and social graph / tree traversal • Sharding importance • Disk issues (FS Choice, noatime, scheduling algorithm)
  • 45. MySQL • Replication delay and cache eviction produce inconsistent results to the end user. • Locks create resource contention for popular data
  • 46. Database Replication • Major issues around users and statuses tables • Multiple functional masters (FRP, FWP) • Make sure your code reads and writes to the write DBs. Reading from master = slow death • Monitor the DB. Find slow / poorly designed queries • Kill long running queries before they kill you (mkill)
  • 47. status.twitter.com • Keep users in the loop, or suffer. • Hosted on different service (Tumblr) • No matter how little information you have available.
  • 48. Key Points • Databases not always the best store. • Instrument everything. • Use metrics to make decisions, not guesses. • Don’t make services dependent • Process asynchronously when possible
  • 49. Thanks! Twitter Open Source (Apache License): - CacheMoney Gem (Write through Caching) http://github.com/nkallen/cache-money/tree/master - Libmemcached http://tangent.org/552/libmemcached.html - Kestrel (Memcache-like message queue) http://github.com/robey/kestrel - mod_memcache_block (Apache 2.x Limiter/blocker) http://github.com/netik/mod_memcache_block