Diagnosing Problems in Production Systems


This document provides guidance on diagnosing problems in Cassandra production systems. It covers preventative monitoring with tools such as OpsCenter, Munin, Nagios, and Graphite. When problems occur, it recommends narrowing down the issue by examining areas like consistency, repair, queries, compaction, and system metrics. It then presents tools for inspecting compaction, system utilities, histograms, query tracing, and JVM garbage collection, including how to profile GC problems.


Diagnosing Problems in Production
Jon Haddad, Technical Evangelist, DataStax
Blake Eggleston, Software Developer, DataStax
©2013 DataStax Confidential. Do not distribute without consent.

Preventative Measures
• OpsCenter
• Metrics Integration
• Munin
• Monit
• Nagios / Icinga
• Graphite / StatsD (application level)
• Variety of 3rd party monitoring services

Narrow Down the Problem
• Weird consistency issues - NTP?
• Problems with Streaming / Repair - version conflicts
• Slow queries
• Compaction
• Histograms
• Tracing
• Nodes flapping / failing
• Dig into system metrics
• JVM GC issues
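
The NTP hint in the first bullet follows from how Cassandra reconciles conflicting writes: the cell with the highest write timestamp wins. A minimal sketch of how a clock that runs ahead can let an older write shadow a newer one. The `Write` class, the values, and the skew figures are made up for illustration and are not Cassandra code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    real_time_ms: int    # when the write really happened (hypothetical)
    clock_skew_ms: int   # error of the clock that stamped it (hypothetical)

    @property
    def timestamp_ms(self) -> int:
        # The timestamp stored with the write comes from the skewed clock.
        return self.real_time_ms + self.clock_skew_ms


def resolve(writes):
    """Last-write-wins: keep the write with the highest timestamp."""
    return max(writes, key=lambda w: w.timestamp_ms)


first = Write("old", real_time_ms=1_000, clock_skew_ms=500)  # clock 500 ms fast
second = Write("new", real_time_ms=1_200, clock_skew_ms=0)   # clock in sync

winner = resolve([first, second])
print(winner.value)  # prints "old": the earlier write wins because of the skew
```

With both clocks in sync the later write would win, which is why unexplained "lost" updates are a reason to check NTP first.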

Compaction
• Compaction merges SSTables
• Too much compaction?
• OpsCenter provides insight into compaction cluster wide
• nodetool
• compactionhistory
• getcompactionthroughput
• Leveled vs Size Tiered
• Leveled on SSD + Read Heavy
• Size tiered on Spinning rust
• Size tiered is great for write heavy time series workloads
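
To make "compaction merges SSTables" concrete, here is a simplified sketch of the merge step: several key-sorted tables are combined into one, keeping only the newest cell per key. The tuple layout and the `compact` function are assumptions for illustration; real compaction also handles tombstones, TTLs, and on-disk formats, which are ignored here:

```python
import heapq


def compact(*sstables):
    """Merge key-sorted SSTables, keeping the newest cell for each key.

    Each SSTable is a list of (key, timestamp, value) tuples sorted by key,
    with each key appearing at most once per table.
    """
    merged = []
    # heapq.merge yields tuples in (key, timestamp, value) order, so for
    # equal keys the entry with the highest timestamp arrives last.
    for key, timestamp, value in heapq.merge(*sstables):
        if merged and merged[-1][0] == key:
            merged[-1] = (key, timestamp, value)  # newer cell replaces older
        else:
            merged.append((key, timestamp, value))
    return merged


# Made-up data: "b" was overwritten after the first table was flushed.
older = [("a", 1, "a1"), ("b", 1, "b1")]
newer = [("b", 2, "b2"), ("c", 2, "c2")]

print(compact(older, newer))
# [('a', 1, 'a1'), ('b', 2, 'b2'), ('c', 2, 'c2')]
```

Every merge rewrites data that was already on disk, which is why too much compaction shows up as disk I/O pressure.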

System Utilities
• iostat
• disk level statistics
• htop
• process overview
• iftop & netstat
• network utilities
• dstat
• all the above in 1 tool
• strace
• …for the hardcore

Histograms
• proxyhistograms
• High level read and write times
• Includes network latency
• cfhistograms <keyspace> <table>
• reports stats for a single table on a single node
• Used to identify tables with performance problems
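
Histogram output is easiest to act on once it is reduced to percentiles. A minimal sketch of reading a percentile out of bucketed latency counts. The bucket boundaries and counts are invented, and the `(upper_bound_micros, count)` layout is an assumption, not the exact format nodetool prints:

```python
def percentile(buckets, pct):
    """Return the upper bound of the bucket that contains the pct-th percentile.

    buckets: list of (upper_bound_micros, count) sorted by upper bound.
    """
    total = sum(count for _, count in buckets)
    if total == 0:
        raise ValueError("histogram is empty")
    threshold = total * pct / 100.0
    seen = 0
    for upper_bound, count in buckets:
        seen += count
        if seen >= threshold:
            return upper_bound
    return buckets[-1][0]


# Made-up read latency histogram for one table on one node.
read_latency = [(100, 700), (500, 250), (2000, 45), (10000, 5)]

print(percentile(read_latency, 50))  # 100
print(percentile(read_latency, 99))  # 2000
```

A table whose median looks fine but whose 99th percentile is far higher is a typical candidate for a closer look.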

JVM GC Overview
• What is garbage collection?
• Manual vs automatic memory management
• Generational garbage collection (ParNew & CMS)
• New Generation
• Old Generation

New Generation
• New objects are created in the new gen
• Minor GC
• Occurs when new gen fills up
• Stop the world
• Dead objects are removed
• Live objects are promoted into old gen
• Removing objects is fast, promoting objects is slow

Old Generation
• Objects are promoted to old gen from new gen
• Major GC
• Occurs when the old generation fills up to some percentage
• Mostly concurrent
• 2 short stop the world pauses
• Full GC
• Occurs when old gen fills up or objects can’t be promoted
• Stop the world
• Collects all generations
• These are bad!
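
The interaction between the two generations can be illustrated with a toy model. This is a sketch under strong simplifying assumptions, not how the JVM works internally: every survivor of a minor GC is promoted at once (no survivor spaces or tenuring threshold), the concurrent major GC is left out, and object lifetimes are counted in allocation ticks. The `ToyHeap` class and all sizes are hypothetical:

```python
class ToyHeap:
    """Toy two-generation heap; each object is stored as the tick it dies at."""

    def __init__(self, new_gen_size, old_gen_size):
        self.new_gen_size = new_gen_size
        self.old_gen_size = old_gen_size
        self.new_gen = []
        self.old_gen = []
        self.now = 0
        self.events = []  # "minor" / "full", in the order they happened

    def allocate(self, lifetime):
        self.now += 1
        if len(self.new_gen) >= self.new_gen_size:
            self.minor_gc()  # new gen is full
        self.new_gen.append(self.now + lifetime)

    def minor_gc(self):
        survivors = [dies_at for dies_at in self.new_gen if dies_at > self.now]
        self.new_gen = []  # dead objects are simply dropped
        if len(self.old_gen) + len(survivors) > self.old_gen_size:
            self.full_gc()  # survivors cannot be promoted
        if len(self.old_gen) + len(survivors) > self.old_gen_size:
            raise MemoryError("old gen still full after full GC")
        self.old_gen.extend(survivors)  # live objects are promoted
        self.events.append("minor")

    def full_gc(self):
        self.old_gen = [dies_at for dies_at in self.old_gen if dies_at > self.now]
        self.events.append("full")


short_lived = ToyHeap(new_gen_size=4, old_gen_size=8)
medium_lived = ToyHeap(new_gen_size=4, old_gen_size=8)
for _ in range(20):
    short_lived.allocate(lifetime=1)   # dies before the next minor GC
    medium_lived.allocate(lifetime=6)  # dies shortly after being promoted

print(short_lived.events.count("full"))   # 0
print(medium_lived.events.count("full"))  # 2
```

In the second run the objects live just long enough to be promoted and then die in the old gen, so garbage accumulates there until a full GC is forced.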

GC Profiling
• OpsCenter GC stats
• Look for correlations between GC spikes and read/write latency
• Cassandra GC Logging
• Can be activated in cassandra-env.sh
• jstat
• prints gc activity
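
Output from `jstat -gcutil <pid> <interval>` can be post-processed to spot full GCs between samples. A minimal sketch that parses the table by its header names, so it does not depend on the exact column set of a given JVM version. The sample text is illustrative and was not captured from a real node, and the helper names are hypothetical. It assumes every cell is numeric:

```python
def parse_jstat(text):
    """Parse `jstat -gcutil` style output into a list of dicts keyed by column."""
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, map(float, row))) for row in rows]


def full_gcs_between(samples):
    """Number of full GCs between the first and the last sample (FGC column)."""
    return int(samples[-1]["FGC"] - samples[0]["FGC"])


# Illustrative sample only; the numbers are made up.
SAMPLE = """
  S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
  0.00  12.50  43.10  71.02  59.80   1201   14.320     3    1.950   16.270
 10.10   0.00  12.75  88.40  59.80   1202   14.331     3    1.950   16.281
  0.00   9.80  67.30  35.15  59.81   1203   14.345     4    4.610   18.955
"""

samples = parse_jstat(SAMPLE)
print(full_gcs_between(samples))  # 1
```

Here the old gen utilisation (`O`) climbs to 88.40 and then drops to 35.15 while `FGC` increases by one, the signature of a full GC.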

GC Profiling
• What to look out for:
• Long, multi-second pauses
• Caused by full GCs: the old gen fills up faster than the concurrent GC can keep up. This typically means garbage is being promoted out of the new gen too soon
• Long minor GC
• Many of the objects in the new gen are being promoted to the old gen.
• Most commonly caused by new gen being too big
• Sometimes caused by objects being promoted prematurely

