Diego Pacheco
Jackson Oliveira
Marcelo Serpa (a.k.a Tarzan)
Experiences Building a multi-region
Cassandra Operations Orchestrator on AWS
About us - Diego Pacheco
@diego_pacheco
❏ Cat's Father
❏ Principal Software Architect
❏ Agile Coach
❏ SOA/Microservices Expert
❏ DevOps Practitioner
❏ Speaker
❏ Author
diegopacheco
http://diego-pacheco.blogspot.com.br/
https://diegopacheco.github.io/
About us - Jackson Oliveira
❏ Father
❏ Software Architect
❏ DevOps Engineer
❏ GCP Cloud Architect Certified
http://jackson-s-oliveira.blogspot.com/
@cyber_jso
cyberjso
About us - Marcelo Serpa (Tarzan)
@_marceloserpa
❏ Software Developer
❏ Microservice / DevOps Practitioner
❏ Speaker
❏ Meetup coordinator - NodeJS POA
marceloserpa
https://medium.com/@marceloserpa
About ilegra.com
T
CM Blog POSTs
http://diego-pacheco.blogspot.com/2018/07/experiences-building-cassandra.html
http://ilegra.com/beyonddata/2018/08/experiences-building-a-cassandra-orchestratorcm/
T
Agenda
❏ About us
❏ Problem, Principles and Design
❏ Team Practices
❏ Outages, Issues and Lessons
❏ Remediation & Cmsh
❏ Lessons Learned
❏ Q&A
T
Problem, Principles and Design
D
Problem CM solves
❏ Operation Automation
❏ Create Clusters, Decommission Clusters, Search Clusters
❏ Observability, Remediation
❏ Deployment Automation
❏ Security Groups
❏ Launch Configurations
❏ Auto Scaling Groups
❏ Route53 DNS Entries
❏ EIP
❏ S3 Buckets
❏ Scaling Cloud Operations
❏ No Code Needed
❏ No Manual Work
D
Why build CM?
D
Why java?
Team Background, Troubleshooting, CASS is written in Java
D
CM Features
❏ Support for CASS 2.2.x and 3.1.x
❏ Backups and point in time restores
❏ Seeds / Token Management
❏ Full AWS Automation (SG, LC and ASG)
❏ Automated node Replacement
❏ Automated Node-by-node repairs
❏ Multi-dc support
❏ REST interfaces
❏ CM Internal state durability / Recovery (local disk and S3)
❏ 100% automated operations for:
❏ Cluster: creation, search, shutdown
D
CM Philosophy: Self Healing - Self Operating
Self Healing Self Operating
D
Dynomite Experiences
https://www.youtube.com/watch?v=Z4_rzsZd70o&feature=youtu.be (Netflix 2016)
D
CM use cases
❏ Source of Truth of Most microservices
❏ Single Region Cluster
❏ Batch/Streaming Application (Previously with HBase)
❏ Multi-Region Cluster
❏ API Gateway (Kong)
❏ Authentication Microservice
D
CM Architecture
D
Internal Design
D
Heartbeat Algo and Design
J
Heartbeat Algo and Design
J
Step Framework
❏ One task has multiple steps
❏ Order
❏ Run a list of steps for cassandra nodes
❏ Track the current step running on each node
❏ Skip steps
❏ If a step fails, send a message to the Slack channel
❏ SignalFX
BACKUP
1- Create directories
2- Copy data
3- Send to S3
RESTORE
1- Download backups
2- Copy data
3- Restart Cassandra
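The step framework described above can be sketched roughly as follows. This is a minimal illustration, not CM's actual code: the names Step, NodeAction, StepRunner and the notifier callback are all hypothetical, and the notifier merely stands in for whatever Slack or SignalFX client the real system uses.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

/** Work to perform against one Cassandra node (hypothetical interface). */
interface NodeAction {
    void apply(String node) throws Exception;
}

/** A named step; a task such as BACKUP or RESTORE is an ordered list of these. */
final class Step {
    final String name;
    final NodeAction action;

    Step(String name, NodeAction action) {
        this.name = name;
        this.action = action;
    }
}

/** Runs the steps in order for a node, tracks the current step, reports failures. */
final class StepRunner {
    private final List<Step> steps;
    private final Consumer<String> notifier; // stand-in for a Slack/SignalFX client
    private final Map<String, String> currentStepByNode = new LinkedHashMap<>();

    StepRunner(List<Step> steps, Consumer<String> notifier) {
        this.steps = new ArrayList<>(steps);
        this.notifier = notifier;
    }

    /** Returns true if every non-skipped step succeeded on the node. */
    boolean runOn(String node, Set<String> skip) {
        for (Step step : steps) {
            if (skip.contains(step.name)) {
                continue; // skipped steps are never recorded as current
            }
            currentStepByNode.put(node, step.name);
            try {
                step.action.apply(node);
            } catch (Exception e) {
                notifier.accept("Step '" + step.name + "' failed on " + node + ": " + e.getMessage());
                return false; // stop here; the failed step stays recorded as current
            }
        }
        currentStepByNode.put(node, "DONE");
        return true;
    }

    String currentStep(String node) {
        return currentStepByNode.get(node);
    }
}

final class StepDemo {
    public static void main(String[] args) {
        List<String> alerts = new ArrayList<>();
        StepRunner backup = new StepRunner(List.of(
                new Step("create-directories", node -> { }),
                new Step("copy-data", node -> { }),
                new Step("send-to-s3", node -> { throw new IllegalStateException("upload refused"); })),
                alerts::add);
        boolean ok = backup.runOn("cass-node-1", Set.of());
        System.out.println(ok + " " + backup.currentStep("cass-node-1") + " " + alerts);
    }
}
```

Keeping the failed step recorded as the current one is a design choice made for this sketch: an operator can see where a node stopped without reading logs.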
J
Graceful shutdown
J
Recovery old and new model
❏ OLD way
❏ Disk first
❏ S3 every minute
❏ Flaky: Did not cover all corner cases
❏ New way
❏ Disk
❏ Send to all Cass nodes
❏ In case of failure call all cass nodes
❏ Get the highest TIMESTAMP and use it.
❏ More reliable
TODO draw jackson
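The "get the highest TIMESTAMP" rule of the new recovery model can be illustrated with a small sketch. It assumes each Cassandra node returns a copy of CM's state tagged with a timestamp; the StateSnapshot shape and the StateRecovery helper are hypothetical names, and how the nodes are actually called is left out.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** A copy of CM's internal state as held by one Cassandra node (hypothetical shape). */
final class StateSnapshot {
    final String sourceNode;
    final long timestampMillis;
    final String payload;

    StateSnapshot(String sourceNode, long timestampMillis, String payload) {
        this.sourceNode = sourceNode;
        this.timestampMillis = timestampMillis;
        this.payload = payload;
    }
}

final class StateRecovery {
    /** Returns the reply with the highest timestamp, or empty if no node answered. */
    static Optional<StateSnapshot> newest(List<StateSnapshot> replies) {
        return replies.stream()
                .max(Comparator.comparingLong((StateSnapshot s) -> s.timestampMillis));
    }
}
```

After a failure, the recovering CM would collect one snapshot per reachable node and restore from StateRecovery.newest(replies); an empty result means no node answered and another source, such as the local disk, is needed.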
J
Jenkins JOBS
J
Multi-Region Design
❏ CM Topology
❏ Dedicated: 1-1
❏ Shared: 1-N
❏ Infrastructure details:
❏ CM instances in both regions exchange information
❏ CM internode communication with EIP
❏ Public IP + PEM -> VPC Peering
❏ Cassandra:
❏ 2 seeds on US, 1 seed EU
❏ Seeds boot up first
❏ Replication is async between regions
J
Team Practices
D
How the team works? Practices.
❏ Clients are: Developers and Cloud Operators
❏ Planning per Quarter
❏ Tech Lead / Coach
❏ Retro every month
❏ Coaching Sessions - 101
❏ Design session
❏ Reviews
❏ Refactoring
❏ Kanban + google sheets + trello
❏ DevOps Principles - i.e: Immutable Infrastructure
D
How the team works? Tracking.
❏ Tell me an engineer who likes JIRA? Only PMs like JIRA.
❏ We were not using issue tracking at first
❏ Issues lost
❏ Look for emails
❏ Ask several times about issues
❏ Repeat same design over and over
❏ Come up in a retrospective
❏ Github as issue tracking
❏ Log issues: bugs and enhancements
❏ Github release tracking
D
How the team works? Kanban + Predictability
❏ Simple Google Sheets
❏ Items / Weeks
❏ Check every week if you are on track or not
❏ 100% accuracy for features
❏ 100% WRONG estimate for BUGS (2 weeks ~ 2 months)
❏ Different Nature: Microservices VS Data Layer
❏ Very hard to estimate bugs - Solution?
❏ You can't automate what you don't know
❏ Stability Mindset
❏ Don't introduce bugs == Developer Checklists
❏ Forces you to know what to automate later
D
How the team works? Releases. Stabilization Windows
❏ 4 Quarters
❏ ~Monthly releases
❏ Looks like waterfall or buffering
❏ Avoid shipping bugs to customers
❏ Avoid downtimes
❏ Avoid losing data
❏ It's a must in data layer
❏ The data layer needs to be more reliable than microservices
❏ How did we do it?
❏ Single Region - Stabilization window 1
❏ Multi-DC - Stabilization Window 2
D
How the team works? Documentation and Scalability
❏ About our customer: an organization spanning 42 countries
❏ Meetings are a bottleneck for scalability
❏ Jenkins DSL (Code in General) kills scalability
❏ Self-Service kills tickets
❏ Documentation kills meetings
❏ Documentation matters
❏ Time Zones
❏ English
❏ Avoid Repetition
D
How the team works? Tests! Stability + Checklists
❏ Unit Tests
❏ Integration Tests
❏ Exploratory Tests
❏ Release 1 - 30 issues (mostly bugs)
❏ Release 2 - 20 issues (mostly enhancements)
❏ Stability Mindset / Principles
❏ Exploratory tests are a MUST
❏ Try to maximize coverage spectrum
❏ Developer Checklists work very well
D
How the team works? Refactorings.
❏ Strategic VS Tactical Programming
❏ Several Important Refactorings (Re-Designs) like:
❏ Thread Model
❏ Tasks Responsibility
❏ Utils
❏ And much more…
❏ Easy to do in Java with good tooling such as Eclipse.
❏ Pays off in the long run
❏ Kills you if you don't do it.
D
Flaky Tests
❏ Integration tests
❏ ~20 minutes
❏ Cassandra 3x and Cassandra 2x
❏ Hard to maintain
❏ Async AWS apis (SG, LC and ASG)
❏ Fixed timeout == unstable tests
❏ Solution: Progressive timeout
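One possible reading of "progressive timeout" is a polling loop whose wait grows between attempts, so fast AWS responses finish quickly while slow ones still get enough time. The sketch below assumes that reading; ProgressiveWait and its parameters are hypothetical names, and the condition would in practice wrap an AWS describe call for the SG, LC or ASG under test.

```java
import java.util.function.BooleanSupplier;

final class ProgressiveWait {
    /**
     * Polls the condition up to maxAttempts times, doubling the pause after each
     * miss (capped at maxDelayMillis). Returns true as soon as the condition holds.
     */
    static boolean await(BooleanSupplier condition, int maxAttempts,
                         long initialDelayMillis, long maxDelayMillis) {
        long delay = initialDelayMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return false;
                }
                delay = Math.min(delay * 2, maxDelayMillis);
            }
        }
        return false;
    }
}

final class ProgressiveWaitDemo {
    public static void main(String[] args) {
        long readyAt = System.currentTimeMillis() + 300; // simulated async resource
        boolean ready = ProgressiveWait.await(
                () -> System.currentTimeMillis() >= readyAt, 10, 50L, 1_000L);
        System.out.println("resource ready: " + ready);
    }
}
```

Compared with a single fixed sleep, the test no longer has to choose between being slow on every run and failing whenever AWS takes longer than usual.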
T
Remediation & Cmsh
D
Remediation
❏ Why Remediate?
❏ Manual Steps are dangerous
❏ Bad time == Lots of pressure
❏ Started with Dynomite
❏ Scale Up
❏ AMI Patch
❏ Refactor to support Cassandra and CM
❏ Calls DM and CM Health Checkers
❏ Procedural process
❏ Relies on: DM cold bootstrap and CM node_replace + repair.
D
Remediation
D
Downtime VS No Downtime: Forklift + Dual Write
❏ Downtime
❏ Dump data to file
❏ Dump Keyspace/Schema to file
❏ Upload to S3
❏ Import in new cluster
❏ No-Downtime
❏ Forklift + Dual writer pattern
❏ Requires code in the microservices
❏ Requires orchestration in Spinnaker.
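The forklift plus dual-writer pattern mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the team's code: KeyValueStore is a hypothetical stand-in for a Cassandra session, InMemoryStore replaces a real cluster, and the existence check in Forklift.copy is not atomic, which a real migration would have to handle (for example with write timestamps).

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal store abstraction standing in for a Cassandra session (hypothetical). */
interface KeyValueStore {
    void put(String key, String value);
    String get(String key); // null when the key is absent
}

final class InMemoryStore implements KeyValueStore {
    private final Map<String, String> data = new HashMap<>();
    public void put(String key, String value) { data.put(key, value); }
    public String get(String key) { return data.get(key); }
}

/** Writes go to both clusters; reads stay on the old one until cut-over. */
final class DualWriter implements KeyValueStore {
    private final KeyValueStore oldCluster;
    private final KeyValueStore newCluster;
    private volatile boolean readFromNew = false;

    DualWriter(KeyValueStore oldCluster, KeyValueStore newCluster) {
        this.oldCluster = oldCluster;
        this.newCluster = newCluster;
    }

    public void put(String key, String value) {
        oldCluster.put(key, value); // old cluster remains the source of truth
        newCluster.put(key, value); // new cluster receives every live write
    }

    public String get(String key) {
        return (readFromNew ? newCluster : oldCluster).get(key);
    }

    /** Switches reads to the new cluster once the forklift has finished. */
    void cutOver() { readFromNew = true; }
}

final class Forklift {
    /** Copies a dump into the target, never overwriting keys that live writes already set. */
    static int copy(Map<String, String> dump, KeyValueStore target) {
        int copied = 0;
        for (Map.Entry<String, String> entry : dump.entrySet()) {
            if (target.get(entry.getKey()) == null) {
                target.put(entry.getKey(), entry.getValue());
                copied++;
            }
        }
        return copied;
    }
}
```

The order of operations is what makes it work without downtime: enable dual writes first, forklift the historical data second, and cut reads over last.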
D
CMSH
D
Outages, Issues and Lessons
J
Troubleshooting / Police Forensic Skills
J
Troubleshooting / Police Forensic Skills
Remediation killed too many nodes and replace did not happen... why?
A) AWS Ec2
B) Jenkins
C) CM Java Code
D) Python Daemon
E) Java Remediation code
F) AWS S3
G) Cassandra Node
H) Cassandra Cluster
I) Time
J) None of the above
J
Troubleshooting / Police Forensic Skills
J
Diagram labels: Remediation, CM; Cassandra nodes US 2A, US 2B, US 2C, EU 1A, EU 2B, EU 2C; cluster activity?; Cass US 2A; ASG (kill box); new IP?
Fast Vs Slow Issue!
❏ Only with Theories
❏ EVIDENCE to back up our theories/assumptions
❏ Simulations
❏ Solution:
❏ AWS Chaos service :-)
❏ < 1 min = FAST
❏ > 3 min = SLOW
❏ At the end of the day it's all about the 90s internal TTL
❏ Wait for replace to make sure we reflect the REAL world
❏ Wait for HC to make sure we capture the real world
J
Kilometers approach
J
Tar Pits
D
Outage in prod
32k
J
Outage in prod
J
Outage in prod: No outage because there was no live data there!
J
Threads Re-design
D
Threads Re-design
D
Observability Rules!
T
Cass 2.1.x to cass 2.2.x issues
❏ Node replace stopped working
❏ We generate cass config files
❏ Position and parameters changed from 2.1 to 2.2
❏ Our code broke
❏ Big changes on migration from Cass 2.1.x to 2.2.x
❏ Improvements
❏ Improved repair performance.
❏ The commit log is compressed to save disk space.
❏ Fixes
❏ Fix repair hang when snapshot failed (CASSANDRA-10057)
❏ Fix potential NPE on ORDER BY queries with IN (CASSANDRA-10955)
❏ Fix handling of nulls and unsets in IN conditions (CASSANDRA-12981)
❏ https://github.com/apache/cassandra/blob/cassandra-2.2/CHANGES.txt
T
S3 Upload issue
J
CASS Stress/Load Tests
❏ Some bugs only appear when testing with volume
❏ Adding volume can be tricky and time-consuming
❏ Latency (do not run scripts from your local env)
❏ Filling up a table with a few text files will take too much time
❏ Parallelization is needed
❏ The cassandra-stress tool comes in handy in such scenarios
❏ Customize how many rows and how many parallel threads write
❏ It uses tables with blobs
❏ Customize schema, replication factors and consistency level while running scripts
J
OOM Outage! EBS vs Instance Store
❏ EBS is a SPOF
❏ EBS is more expensive
❏ EBS has lower performance
❏ EBS is more flexible
❏ Disk space was critical to us
❏ You don't want to run out of disk, believe us.
❏ Dynamic Disk space definition while launching a cluster
❏ Disk space validations before starting a backup
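A disk space validation before starting a backup could look roughly like the sketch below. The BackupPreflight class, the headroom factor and the way the data size is obtained are assumptions made for illustration; only Files.getFileStore and FileStore.getUsableSpace are standard JDK calls.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class BackupPreflight {
    /**
     * Pure check: the backup may start only if the usable space covers the data
     * size multiplied by a safety factor (for example 1.2 for 20% headroom).
     */
    static boolean hasRoom(long usableBytes, long dataBytes, double headroomFactor) {
        long required = (long) Math.ceil(dataBytes * headroomFactor);
        return usableBytes >= required;
    }

    /** Reads the usable space of the volume holding backupDir and applies the check. */
    static boolean canBackup(Path backupDir, long dataBytes, double headroomFactor) {
        try {
            long usable = Files.getFileStore(backupDir).getUsableSpace();
            return hasRoom(usable, dataBytes, headroomFactor);
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot read disk space for " + backupDir, e);
        }
    }
}
```

Splitting the arithmetic from the file-system call keeps the rule unit-testable without a real volume; a backup step would call canBackup first and refuse to start, with an alert, when it returns false.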
J
Side note on Cass 4.x
❏ Cass 3.x is better than cass 2.x right now
❏ Cass 4.x will be awesome
❏ Netflix work on incremental repairs
❏ Bug fixes - like gossip threads and restart issue
❏ Way more stable - everybody should migrate.
❏ Having fewer Cassandra versions reduces complexity
❏ Different configurations
❏ Bugs that were fixed but you don't get the fix - lack of backports (old versions)
D
Lessons Learned
D
Design is strategic: Avoid complexity, bugs and reduce cost
D
Avoid Classitis - FAT classes rules
D
Java over bash always | Right tool for the job
D
Tooling
Refactoring Dev Vs Ops
Tooling / Mindset
Proof of 9 - Validate the code not the tests
D
Hard to Estimate Bugs (Data Layer) = Stabilization Payoff
D
Microservices Data Layer
Make sure you expand your test coverage radius
D
Forensic Mindset & Skill | Observability over debug
D
Devops Is Plumbing. Automate the Hidden Pipelines!
❏ Remediation
❏ Scale Up
❏ Patch
❏ Upgrade
❏ Much more...
❏ OS Patches
❏ Telemetry
❏ Discovery
❏ Destroy
❏ Restore
D
Make tools for your Tools
❏ REPL - Cmsh
❏ Better than:
❏ Run books
❏ REST
❏ Bash Alias
❏ Shared Dashboards
❏ Avoid problems
❏ What to monitor
❏ Self-Service Jobs
❏ Better than:
❏ Coding
❏ Jenkins DSL
❏ TF Templates
D

Weitere ähnliche Inhalte

Was ist angesagt?

Integrating CloudStack & Ceph
Integrating CloudStack & CephIntegrating CloudStack & Ceph
Integrating CloudStack & CephShapeBlue
 
Devops - why, what and how?
Devops - why, what and how?Devops - why, what and how?
Devops - why, what and how?Malinda Kapuruge
 
How We Made Scylla Maintenance Easier, Safer and Faster
How We Made Scylla Maintenance Easier, Safer and FasterHow We Made Scylla Maintenance Easier, Safer and Faster
How We Made Scylla Maintenance Easier, Safer and FasterScyllaDB
 
C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...
C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...
C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...DataStax Academy
 
Cassandra from tarball to production
Cassandra   from tarball to productionCassandra   from tarball to production
Cassandra from tarball to productionRon Kuris
 
Rails Conf Europe 2007 Notes
Rails Conf  Europe 2007  NotesRails Conf  Europe 2007  Notes
Rails Conf Europe 2007 NotesRoss Lawley
 
Microservices for performance - GOTO Chicago 2016
Microservices for performance - GOTO Chicago 2016Microservices for performance - GOTO Chicago 2016
Microservices for performance - GOTO Chicago 2016Peter Lawrey
 
Scylla Summit 2018: Consensus in Eventually Consistent Databases
Scylla Summit 2018: Consensus in Eventually Consistent DatabasesScylla Summit 2018: Consensus in Eventually Consistent Databases
Scylla Summit 2018: Consensus in Eventually Consistent DatabasesScyllaDB
 
Using and Benchmarking Galera in different architectures (PLUK 2012)
Using and Benchmarking Galera in different architectures (PLUK 2012)Using and Benchmarking Galera in different architectures (PLUK 2012)
Using and Benchmarking Galera in different architectures (PLUK 2012)Henrik Ingo
 
Erasure Code at Scale - Thomas William Byrne
Erasure Code at Scale - Thomas William ByrneErasure Code at Scale - Thomas William Byrne
Erasure Code at Scale - Thomas William ByrneCeph Community
 
Orchestrating Cassandra with Kubernetes: Challenges and Opportunities
Orchestrating Cassandra with Kubernetes: Challenges and OpportunitiesOrchestrating Cassandra with Kubernetes: Challenges and Opportunities
Orchestrating Cassandra with Kubernetes: Challenges and OpportunitiesRaghavendra Prabhu
 
9 DevOps Tips for Going in Production with Galera Cluster for MySQL - Slides
9 DevOps Tips for Going in Production with Galera Cluster for MySQL - Slides9 DevOps Tips for Going in Production with Galera Cluster for MySQL - Slides
9 DevOps Tips for Going in Production with Galera Cluster for MySQL - SlidesSeveralnines
 
SaaS startups - Software Engineering Challenges
SaaS startups - Software Engineering ChallengesSaaS startups - Software Engineering Challenges
SaaS startups - Software Engineering ChallengesMalinda Kapuruge
 
Building your own NSQL store
Building your own NSQL storeBuilding your own NSQL store
Building your own NSQL storeEdward Capriolo
 
Oreilly Webcast 01 19 10
Oreilly Webcast 01 19 10Oreilly Webcast 01 19 10
Oreilly Webcast 01 19 10Sean Hull
 
Infinispan from POC to Production
Infinispan from POC to ProductionInfinispan from POC to Production
Infinispan from POC to ProductionC2B2 Consulting
 
PagerDuty: One Year of Cassandra Failures
PagerDuty: One Year of Cassandra FailuresPagerDuty: One Year of Cassandra Failures
PagerDuty: One Year of Cassandra FailuresDataStax Academy
 
Open HFT libraries in @Java
Open HFT libraries in @JavaOpen HFT libraries in @Java
Open HFT libraries in @JavaPeter Lawrey
 

Was ist angesagt? (20)

Integrating CloudStack & Ceph
Integrating CloudStack & CephIntegrating CloudStack & Ceph
Integrating CloudStack & Ceph
 
Devops - why, what and how?
Devops - why, what and how?Devops - why, what and how?
Devops - why, what and how?
 
How We Made Scylla Maintenance Easier, Safer and Faster
How We Made Scylla Maintenance Easier, Safer and FasterHow We Made Scylla Maintenance Easier, Safer and Faster
How We Made Scylla Maintenance Easier, Safer and Faster
 
C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...
C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...
C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...
 
Cassandra at teads
Cassandra at teadsCassandra at teads
Cassandra at teads
 
Cassandra from tarball to production
Cassandra   from tarball to productionCassandra   from tarball to production
Cassandra from tarball to production
 
Rails Conf Europe 2007 Notes
Rails Conf  Europe 2007  NotesRails Conf  Europe 2007  Notes
Rails Conf Europe 2007 Notes
 
Netty training
Netty trainingNetty training
Netty training
 
Microservices for performance - GOTO Chicago 2016
Microservices for performance - GOTO Chicago 2016Microservices for performance - GOTO Chicago 2016
Microservices for performance - GOTO Chicago 2016
 
Scylla Summit 2018: Consensus in Eventually Consistent Databases
Scylla Summit 2018: Consensus in Eventually Consistent DatabasesScylla Summit 2018: Consensus in Eventually Consistent Databases
Scylla Summit 2018: Consensus in Eventually Consistent Databases
 
Using and Benchmarking Galera in different architectures (PLUK 2012)
Using and Benchmarking Galera in different architectures (PLUK 2012)Using and Benchmarking Galera in different architectures (PLUK 2012)
Using and Benchmarking Galera in different architectures (PLUK 2012)
 
Erasure Code at Scale - Thomas William Byrne
Erasure Code at Scale - Thomas William ByrneErasure Code at Scale - Thomas William Byrne
Erasure Code at Scale - Thomas William Byrne
 
Orchestrating Cassandra with Kubernetes: Challenges and Opportunities
Orchestrating Cassandra with Kubernetes: Challenges and OpportunitiesOrchestrating Cassandra with Kubernetes: Challenges and Opportunities
Orchestrating Cassandra with Kubernetes: Challenges and Opportunities
 
9 DevOps Tips for Going in Production with Galera Cluster for MySQL - Slides
9 DevOps Tips for Going in Production with Galera Cluster for MySQL - Slides9 DevOps Tips for Going in Production with Galera Cluster for MySQL - Slides
9 DevOps Tips for Going in Production with Galera Cluster for MySQL - Slides
 
SaaS startups - Software Engineering Challenges
SaaS startups - Software Engineering ChallengesSaaS startups - Software Engineering Challenges
SaaS startups - Software Engineering Challenges
 
Building your own NSQL store
Building your own NSQL storeBuilding your own NSQL store
Building your own NSQL store
 
Oreilly Webcast 01 19 10
Oreilly Webcast 01 19 10Oreilly Webcast 01 19 10
Oreilly Webcast 01 19 10
 
Infinispan from POC to Production
Infinispan from POC to ProductionInfinispan from POC to Production
Infinispan from POC to Production
 
PagerDuty: One Year of Cassandra Failures
PagerDuty: One Year of Cassandra FailuresPagerDuty: One Year of Cassandra Failures
PagerDuty: One Year of Cassandra Failures
 
Open HFT libraries in @Java
Open HFT libraries in @JavaOpen HFT libraries in @Java
Open HFT libraries in @Java
 

Ähnlich wie Experiences building a multi region cassandra operations orchestrator on aws

Amazon builder Library notes
Amazon builder Library notesAmazon builder Library notes
Amazon builder Library notesDiego Pacheco
 
Cloud-Native DevOps Engineering
Cloud-Native DevOps EngineeringCloud-Native DevOps Engineering
Cloud-Native DevOps EngineeringDiego Pacheco
 
Five Lessons in Distributed Databases
Five Lessons  in Distributed DatabasesFive Lessons  in Distributed Databases
Five Lessons in Distributed Databasesjbellis
 
DIscover Spark and Spark streaming
DIscover Spark and Spark streamingDIscover Spark and Spark streaming
DIscover Spark and Spark streamingMaturin BADO
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...Codemotion
 
Webinar slides: 9 DevOps Tips for Going in Production with Galera Cluster for...
Webinar slides: 9 DevOps Tips for Going in Production with Galera Cluster for...Webinar slides: 9 DevOps Tips for Going in Production with Galera Cluster for...
Webinar slides: 9 DevOps Tips for Going in Production with Galera Cluster for...Severalnines
 
Apache Cassandra - part 2
Apache Cassandra - part 2Apache Cassandra - part 2
Apache Cassandra - part 2Diego Pacheco
 
Dip into prometheus
Dip into prometheusDip into prometheus
Dip into prometheusZaar Hai
 
Object Compaction in Cloud for High Yield
Object Compaction in Cloud for High YieldObject Compaction in Cloud for High Yield
Object Compaction in Cloud for High YieldScyllaDB
 
It's always sunny with OpenJ9
It's always sunny with OpenJ9It's always sunny with OpenJ9
It's always sunny with OpenJ9DanHeidinga
 
Antoine Coetsier - billing the cloud
Antoine Coetsier - billing the cloudAntoine Coetsier - billing the cloud
Antoine Coetsier - billing the cloudShapeBlue
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...Codemotion Tel Aviv
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...DataStax
 
Stress Test & Chaos Engineering
Stress Test & Chaos EngineeringStress Test & Chaos Engineering
Stress Test & Chaos EngineeringDiego Pacheco
 
Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionDatabricks
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 

Ähnlich wie Experiences building a multi region cassandra operations orchestrator on aws (20)

Kubernetes
KubernetesKubernetes
Kubernetes
 
Amazon builder Library notes
Amazon builder Library notesAmazon builder Library notes
Amazon builder Library notes
 
Cloud-Native DevOps Engineering
Cloud-Native DevOps EngineeringCloud-Native DevOps Engineering
Cloud-Native DevOps Engineering
 
Five Lessons in Distributed Databases
Five Lessons  in Distributed DatabasesFive Lessons  in Distributed Databases
Five Lessons in Distributed Databases
 
DIscover Spark and Spark streaming
DIscover Spark and Spark streamingDIscover Spark and Spark streaming
DIscover Spark and Spark streaming
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
 
Webinar slides: 9 DevOps Tips for Going in Production with Galera Cluster for...
Webinar slides: 9 DevOps Tips for Going in Production with Galera Cluster for...Webinar slides: 9 DevOps Tips for Going in Production with Galera Cluster for...
Webinar slides: 9 DevOps Tips for Going in Production with Galera Cluster for...
 
Apache Cassandra - part 2
Apache Cassandra - part 2Apache Cassandra - part 2
Apache Cassandra - part 2
 
Dip into prometheus
Dip into prometheusDip into prometheus
Dip into prometheus
 
Object Compaction in Cloud for High Yield
Object Compaction in Cloud for High YieldObject Compaction in Cloud for High Yield
Object Compaction in Cloud for High Yield
 
System Design.pdf
System Design.pdfSystem Design.pdf
System Design.pdf
 
It's always sunny with OpenJ9
It's always sunny with OpenJ9It's always sunny with OpenJ9
It's always sunny with OpenJ9
 
ES & Kafka
ES & KafkaES & Kafka
ES & Kafka
 
Galaxy Big Data with MariaDB
Galaxy Big Data with MariaDBGalaxy Big Data with MariaDB
Galaxy Big Data with MariaDB
 
Antoine Coetsier - billing the cloud
Antoine Coetsier - billing the cloudAntoine Coetsier - billing the cloud
Antoine Coetsier - billing the cloud
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
 
Stress Test & Chaos Engineering
Stress Test & Chaos EngineeringStress Test & Chaos Engineering
Stress Test & Chaos Engineering
 
Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in Production
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 

Mehr von Diego Pacheco

Naming Things Book : Simple Book Review!
Naming Things Book : Simple Book Review!Naming Things Book : Simple Book Review!
Naming Things Book : Simple Book Review!Diego Pacheco
 
Continuous Discovery Habits Book Review.pdf
Continuous Discovery Habits  Book Review.pdfContinuous Discovery Habits  Book Review.pdf
Continuous Discovery Habits Book Review.pdfDiego Pacheco
 
Thoughts about Shape Up
Thoughts about Shape UpThoughts about Shape Up
Thoughts about Shape UpDiego Pacheco
 
Encryption Deep Dive
Encryption Deep DiveEncryption Deep Dive
Encryption Deep DiveDiego Pacheco
 
Management: Doing the non-obvious! III
Management: Doing the non-obvious! IIIManagement: Doing the non-obvious! III
Management: Doing the non-obvious! IIIDiego Pacheco
 
Design is not Subjective
Design is not SubjectiveDesign is not Subjective
Design is not SubjectiveDiego Pacheco
 
Architecture & Engineering : Doing the non-obvious!
Architecture & Engineering :  Doing the non-obvious!Architecture & Engineering :  Doing the non-obvious!
Architecture & Engineering : Doing the non-obvious!Diego Pacheco
 
Management doing the non-obvious II
Management doing the non-obvious II Management doing the non-obvious II
Management doing the non-obvious II Diego Pacheco
 
Testing in production
Testing in productionTesting in production
Testing in productionDiego Pacheco
 
Nine lies about work
Nine lies about workNine lies about work
Nine lies about workDiego Pacheco
 
Management: doing the nonobvious!
Management: doing the nonobvious!Management: doing the nonobvious!
Management: doing the nonobvious!Diego Pacheco
 
Dealing with dependencies
Dealing  with dependenciesDealing  with dependencies
Dealing with dependenciesDiego Pacheco
 
Dealing with dependencies in tests
Dealing  with dependencies in testsDealing  with dependencies in tests
Dealing with dependencies in testsDiego Pacheco
 

Mehr von Diego Pacheco (20)

Naming Things Book : Simple Book Review!
Naming Things Book : Simple Book Review!Naming Things Book : Simple Book Review!
Naming Things Book : Simple Book Review!
 
Continuous Discovery Habits Book Review.pdf
Continuous Discovery Habits  Book Review.pdfContinuous Discovery Habits  Book Review.pdf
Continuous Discovery Habits Book Review.pdf
 
Thoughts about Shape Up
Thoughts about Shape UpThoughts about Shape Up
Thoughts about Shape Up
 
Holacracy
HolacracyHolacracy
Holacracy
 
AWS IAM
AWS IAMAWS IAM
AWS IAM
 
CDKs
CDKsCDKs
CDKs
 
Encryption Deep Dive
Encryption Deep DiveEncryption Deep Dive
Encryption Deep Dive
 
Sec 101
Sec 101Sec 101
Sec 101
 
Reflections on SCM
Reflections on SCMReflections on SCM
Reflections on SCM
 
Management: Doing the non-obvious! III
Management: Doing the non-obvious! IIIManagement: Doing the non-obvious! III
Management: Doing the non-obvious! III
 
Design is not Subjective
Design is not SubjectiveDesign is not Subjective
Design is not Subjective
 
Architecture & Engineering : Doing the non-obvious!
Architecture & Engineering :  Doing the non-obvious!Architecture & Engineering :  Doing the non-obvious!
Architecture & Engineering : Doing the non-obvious!
 
Management doing the non-obvious II
Management doing the non-obvious II Management doing the non-obvious II
Management doing the non-obvious II
 
Testing in production
Testing in productionTesting in production
Testing in production
 
Nine lies about work
Nine lies about workNine lies about work
Nine lies about work
 
Management: doing the nonobvious!
Management: doing the nonobvious!Management: doing the nonobvious!
Management: doing the nonobvious!
 
AI and the Future
AI and the FutureAI and the Future
AI and the Future
 
Dealing with dependencies
Dealing  with dependenciesDealing  with dependencies
Dealing with dependencies
 
Dealing with dependencies in tests
Dealing  with dependencies in testsDealing  with dependencies in tests
Dealing with dependencies in tests
 
Kanban 2020
Kanban 2020Kanban 2020
Kanban 2020
 

Kürzlich hochgeladen

My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Experiences Building a Multi-Region Cassandra Operations Orchestrator on AWS

  • 1. Diego Pacheco Jackson Oliveira Marcelo Serpa (a.k.a Tarzan) Experiences Building a multi-region Cassandra Operations Orchestrator on AWS
  • 2. About us - Diego Pacheco @diego_pacheco ❏ Cat's Father ❏ Principal Software Architect ❏ Agile Coach ❏ SOA/Microservices Expert ❏ DevOps Practitioner ❏ Speaker ❏ Author diegopacheco http://diego-pacheco.blogspot.com.br/ https://diegopacheco.github.io/
  • 3. About us - Jackson Oliveira ❏ Father ❏ Software Architect ❏ Devops Engineer ❏ GCP Cloud Architect Certified http://jackson-s-oliveira.blogspot.com/ @cyber_jso cyberjso
  • 4. About us - Marcelo Serpa (Tarzan) @_marceloserpa ❏ Software Developer ❏ Microservice / DevOps Practitioner ❏ Speaker ❏ Meetup coordinator - NodeJS POA marceloserpa https://medium.com/@marceloserpa
  • 7. Agenda ❏ About us ❏ Problem, Principles and Design ❏ Team Practices ❏ Outages, Issues and Lessons ❏ Remediation & Cmsh ❏ Lessons Learned ❏ Q&A T
  • 9. Problem CM solves ❏ Operation Automation ❏ Create Clusters, Decommission Clusters, Search Clusters ❏ Observability, Remediation ❏ Deployment Automation ❏ Security Groups ❏ Launch Configurations ❏ Auto Scaling Groups ❏ Route53 DNS Entries ❏ EIP ❏ S3 Buckets ❏ Scaling Cloud Operations ❏ No Code Needed ❏ No Manual Work D
  • 11. Why Java? Team Background, Troubleshooting, CASS is written in Java D
  • 12. CM Features ❏ Support for CASS 2.2.X and 3.1.X ❏ Backups and point-in-time restores ❏ Seeds / Token Management ❏ Full AWS Automation (SG, LC and ASG) ❏ Automated Node Replacement ❏ Automated Node-by-node repairs ❏ Multi-DC support ❏ REST interfaces ❏ CM Internal state durability / Recovery (local disk and S3) ❏ 100% automated operations for: ❏ Cluster: creation, search, shutdown D
  • 13. CM Philosophy: Self Healing - Self Operating D
  • 15. CM use cases ❏ Source of Truth of Most microservices ❏ Single Region Cluster ❏ Batch/Streaming Application (Previously with HBase) ❏ Multi-Region Cluster ❏ API Gateway (Kong) ❏ Authentication Microservice D
  • 18. Heartbeat Algo and Design J
  • 19. Heartbeat Algo and Design J
  • 20. Step Framework ❏ One task has multiple steps ❏ Order ❏ Run a list of steps for Cassandra nodes ❏ Track the current step running on each node ❏ Skip steps ❏ If a step fails, send a message to the Slack channel ❏ SignalFX BACKUP: 1- Create directories 2- Copy data 3- Send to S3 RESTORE: 1- Download backups 2- Copy data 3- Restart Cassandra J
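
The step framework described on this slide can be sketched roughly as follows. This is a minimal illustration, not CM's actual code: the names `StepRunner`, `Step`, `Action` and `runTask` are hypothetical, and the `notifier` callback merely stands in for whatever posts to Slack or SignalFX.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical sketch of a step runner; none of these names come from CM itself.
class StepRunner {

    // A named unit of work executed against one node.
    interface Step {
        String name();
        void run(String node) throws Exception;
    }

    // Lambda-friendly body of a step.
    interface Action {
        void apply(String node) throws Exception;
    }

    // Remembers the last step each node started.
    private final Map<String, String> currentStepByNode = new LinkedHashMap<>();
    // Stand-in for the Slack/SignalFX notification hook (an assumption).
    private final Consumer<String> notifier;

    StepRunner(Consumer<String> notifier) {
        this.notifier = notifier;
    }

    static Step step(String name, Action action) {
        return new Step() {
            public String name() { return name; }
            public void run(String node) throws Exception { action.apply(node); }
        };
    }

    // Runs the steps in order on every node; returns the nodes that finished all steps.
    List<String> runTask(List<Step> steps, List<String> nodes, Set<String> skip) {
        List<String> completed = new ArrayList<>();
        for (String node : nodes) {
            boolean ok = true;
            for (Step step : steps) {
                if (skip.contains(step.name())) {
                    continue; // explicitly skipped step
                }
                currentStepByNode.put(node, step.name());
                try {
                    step.run(node);
                } catch (Exception e) {
                    notifier.accept("Step '" + step.name() + "' failed on " + node + ": " + e.getMessage());
                    ok = false;
                    break; // later steps are not run on this node
                }
            }
            if (ok) {
                completed.add(node);
            }
        }
        return completed;
    }

    String currentStep(String node) {
        return currentStepByNode.get(node);
    }

    public static void main(String[] args) {
        StepRunner runner = new StepRunner(System.out::println);
        List<Step> backup = List.of(
                step("create-directories", n -> {}),
                step("copy-data", n -> {}),
                step("send-to-s3", n -> {}));
        System.out.println(runner.runTask(backup, List.of("cass-1", "cass-2"), Set.of()));
    }
}
```

Keeping the current step per node makes it possible to report where a backup or restore stopped without reading logs.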
  • 22. Recovery: old and new model ❏ Old way ❏ Disk first ❏ S3 every minute ❏ Flaky: did not cover all corner cases ❏ New way ❏ Disk ❏ Send to all Cass nodes ❏ In case of failure, call all Cass nodes ❏ Get the highest TIMESTAMP and use it ❏ More reliable J
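
The "get the highest TIMESTAMP" rule of the new recovery model could look roughly like the sketch below. It assumes each Cassandra node returns a copy of the CM state tagged with a timestamp; `StateRecovery`, `StateCopy` and `newest` are hypothetical names, and how the copies are fetched from the nodes is left out.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: choose the freshest copy of CM state among the nodes that answered.
class StateRecovery {

    // One node's copy of the orchestrator state (all fields are assumptions).
    static final class StateCopy {
        final String node;
        final long timestamp;
        final String payload;

        StateCopy(String node, long timestamp, String payload) {
            this.node = node;
            this.timestamp = timestamp;
            this.payload = payload;
        }
    }

    // Returns the copy with the highest timestamp, or empty if no node answered.
    static Optional<StateCopy> newest(List<StateCopy> copies) {
        return copies.stream().max(Comparator.comparingLong((StateCopy c) -> c.timestamp));
    }

    public static void main(String[] args) {
        List<StateCopy> copies = List.of(
                new StateCopy("cass-1", 100L, "state-v1"),
                new StateCopy("cass-2", 300L, "state-v3"),
                new StateCopy("cass-3", 200L, "state-v2"));
        newest(copies).ifPresent(c -> System.out.println("Recover from " + c.node));
    }
}
```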
  • 24. Multi-Region Design ❏ CM Topology ❏ Dedicated: 1-1 ❏ Shared: 1-N ❏ Infrastructure details: ❏ CM in both regions exchanges information ❏ CM internode communication with EIP ❏ Public IP + PEM -> VPC Peering ❏ Cassandra: ❏ 2 seeds in US, 1 seed in EU ❏ Seeds boot up first ❏ Replication is async between regions J
  • 26. How the team works? Practices. ❏ Clients are: Developers and Cloud Operators ❏ Planning per quarter ❏ Tech Lead / Coach ❏ Retro every month ❏ Coaching Sessions - 101 ❏ Design session ❏ Reviews ❏ Refactoring ❏ Kanban + Google Sheets + Trello ❏ DevOps Principles - e.g.: Immutable Infrastructure D
  • 27. How the team works? Tracking. ❏ Tell me an engineer who likes JIRA? Just PMs like JIRA. ❏ We were not using issue tracking at first ❏ Issues lost ❏ Look for emails ❏ Ask several times about issues ❏ Repeat the same design over and over ❏ Came up in a retrospective ❏ GitHub as issue tracking ❏ Log issues: bugs and enhancements ❏ GitHub release tracking D
  • 28. How the team works? Kanban + Predictability ❏ Simple Google Sheets ❏ Items / Weeks ❏ Check every week if you are on track or not ❏ 100% accuracy for features ❏ 100% WRONG estimate for BUGS (2 weeks ~ 2 months) ❏ Different Nature: Microservices VS Data Layer ❏ Very hard to estimate bugs - Solution? ❏ You can't automate what you don't know ❏ Stability Mindset ❏ Don't introduce bugs == Developer Checklists ❏ Forces you to know what to automate later D
  • 29. How the team works? Releases. Stabilization Windows ❏ 4 Quarters ❏ ~Monthly releases ❏ Looks like waterfall or buffering ❏ Avoid shipping bugs to customers ❏ Avoid downtime ❏ Avoid losing data ❏ It's a must in the data layer ❏ The data layer needs to be more reliable than microservices ❏ How did we do it? ❏ Single Region - Stabilization Window 1 ❏ Multi-DC - Stabilization Window 2 D
  • 30. How the team works? Documentation and Scalability ❏ About our customer: 42 countries organization ❏ Meetings are a bottleneck for scalability ❏ Jenkins DSL (Code in General) kills scalability ❏ Self-Service kills tickets ❏ Documentation kills meetings ❏ Documentation matters ❏ Time Zones ❏ English ❏ Avoid Repetition D
  • 31. How the team works? Tests! Stability + Checklists ❏ Unit Tests ❏ Integration Tests ❏ Exploratory Tests ❏ Release 1 - 30 issues (mostly bugs) ❏ Release 2 - 20 issues (mostly enhancements) ❏ Stability Mindset / Principles ❏ Exploratory tests are a MUST ❏ Try to maximize coverage spectrum ❏ Developer Checklists work very well D
  • 32. How the team works? Refactorings. ❏ Strategic VS Tactical Programming ❏ Several Important Refactorings (Re-Designs) like: ❏ Thread Model ❏ Tasks Responsibility ❏ Utils ❏ And much more… ❏ Easy to do in Java with good tooling like Eclipse. ❏ Pays off in the long run ❏ Kills you if you don't do it. D
  • 33. Flaky Tests ❏ Integration tests ❏ ~20 minutes ❏ Cassandra 3.x and Cassandra 2.x ❏ Hard to maintain ❏ Async AWS APIs (SG, LC and ASG) ❏ Fixed timeout == unstable tests ❏ Solution: Progressive timeout T
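
The slide does not define "progressive timeout". One plausible reading, sketched below under that assumption, is to poll the asynchronous AWS resource with a wait that grows after every miss instead of sleeping for one fixed interval. `ProgressiveWait` and its parameters are hypothetical names.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a progressive wait for integration tests against async APIs.
class ProgressiveWait {

    // Polls `condition`, sleeping a little longer after each miss (doubling up to
    // maxDelayMs), until the condition holds or totalBudgetMs has been spent.
    // Assumes initialDelayMs > 0.
    static boolean await(BooleanSupplier condition, long initialDelayMs, long maxDelayMs, long totalBudgetMs) {
        long waited = 0;
        long delay = initialDelayMs;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (waited >= totalBudgetMs) {
                return false; // budget exhausted, give up
            }
            long sleep = Math.min(delay, totalBudgetMs - waited);
            try {
                Thread.sleep(sleep);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
            waited += sleep;
            delay = Math.min(delay * 2, maxDelayMs);
        }
    }

    public static void main(String[] args) {
        long readyAt = System.currentTimeMillis() + 300;
        // Stand-in for "is the Auto Scaling Group ready yet?"
        boolean ready = await(() -> System.currentTimeMillis() >= readyAt, 50, 400, 5_000);
        System.out.println("Resource ready: " + ready);
    }
}
```

A fast resource is detected after the first short wait, while a slow one still gets the full budget, so the test neither fails early nor always pays the worst-case delay.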
  • 35. Remediation ❏ Why Remediate? ❏ Manual Steps are dangerous ❏ Bad time == Lots of pressure ❏ Started with Dynomite ❏ Scale Up ❏ AMI Patch ❏ Refactored to support Cassandra and CM ❏ Calls DM and CM Health Checkers ❏ Procedural process ❏ Relies on: DM cold bootstrap and CM node_replace + repair. D
  • 37. Downtime VS No Downtime: Forklift + Dual Write ❏ Downtime ❏ Dump data to file ❏ Dump Keyspace/Schema to file ❏ Upload to S3 ❏ Import in new cluster ❏ No-Downtime ❏ Forklift + Dual writer pattern ❏ Requires code in the microservices ❏ Requires orchestration in Spinnaker. D
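
As a rough illustration of the no-downtime path, the sketch below shows the kind of code a microservice might carry for forklift plus dual write. Everything here is an assumption for illustration: `KeyValueStore`, `InMemoryStore`, `DualWriter` and `forklift` are hypothetical, in-memory maps stand in for the old and new clusters, and skipping keys that already exist is only one simple way to keep a stale forklift snapshot from overwriting fresher dual-written data.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of forklift + dual write; maps stand in for the two clusters.
class DualWriteSketch {

    interface KeyValueStore {
        void put(String key, String value);
        void putIfAbsent(String key, String value);
        String get(String key);
    }

    static final class InMemoryStore implements KeyValueStore {
        private final Map<String, String> rows = new ConcurrentHashMap<>();
        public void put(String key, String value) { rows.put(key, value); }
        public void putIfAbsent(String key, String value) { rows.putIfAbsent(key, value); }
        public String get(String key) { return rows.get(key); }
    }

    // Writes go to both clusters; reads stay on the old one until cut-over.
    static final class DualWriter {
        private final KeyValueStore oldCluster;
        private final KeyValueStore newCluster;
        private volatile boolean readFromNew = false;

        DualWriter(KeyValueStore oldCluster, KeyValueStore newCluster) {
            this.oldCluster = oldCluster;
            this.newCluster = newCluster;
        }

        void put(String key, String value) {
            oldCluster.put(key, value);
            newCluster.put(key, value);
        }

        String get(String key) {
            return (readFromNew ? newCluster : oldCluster).get(key);
        }

        void cutOver() {
            readFromNew = true;
        }
    }

    // Bulk-copies a snapshot of the old cluster without clobbering newer dual-written rows.
    static void forklift(Map<String, String> snapshot, KeyValueStore target) {
        snapshot.forEach(target::putIfAbsent);
    }

    public static void main(String[] args) {
        InMemoryStore oldCluster = new InMemoryStore();
        InMemoryStore newCluster = new InMemoryStore();
        oldCluster.put("user:1", "v1");
        DualWriter writer = new DualWriter(oldCluster, newCluster);
        writer.put("user:1", "v2");                    // live traffic during the migration
        forklift(Map.of("user:1", "v1"), newCluster);  // stale snapshot of the old cluster
        writer.cutOver();
        System.out.println(writer.get("user:1"));
    }
}
```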
  • 40. Troubleshooting / Police Forensic Skills J
  • 41. Troubleshooting / Police Forensic Skills Remediation killed too many nodes and replace did not happen... why? A) AWS EC2 B) Jenkins C) CM Java Code D) Python Daemon E) Java Remediation code F) AWS S3 G) Cassandra Node H) Cassandra Cluster I) Time J) None of the above J
  • 42. Troubleshooting / Police Forensic Skills J [Diagram: Remediation, CM, Cass US 2A / US 2B / US 2C, Cass EU 1A / EU 2B / EU 2C, ASG (kill box) on Cass US 2A. Questions: Cluster activity? New IP?]
  • 43. Fast Vs Slow Issue! ❏ Only with Theories ❏ EVIDENCE to back up our theories/assumptions ❏ Simulations ❏ Solution: ❏ AWS Chaos service :-) ❏ < 1 min = FAST ❏ > 3 min = SLOW ❏ At the end of the day it's all about the 90s internal TTL ❏ Wait for replace to make sure it reflects the REAL world ❏ Wait for HC to make sure it captures the real world J
  • 48. Outage in prod: No outage because there was no live data there! J
  • 52. Cass 2.1.x to Cass 2.2.x issues ❏ Node replace stopped working ❏ We generate Cass config files ❏ Position and parameters changed from 2.1 to 2.2 ❏ Our code broke ❏ Big changes in the migration from Cass 2.1.x to 2.2.x ❏ Improvements ❏ Improved repair performance. ❏ The commit log is compressed to save disk space. ❏ Fixes ❏ Fix repair hang when snapshot failed (CASSANDRA-10057) ❏ Fix potential NPE on ORDER BY queries with IN (CASSANDRA-10955) ❏ Fix handling of nulls and unsets in IN conditions (CASSANDRA-12981) ❏ https://github.com/apache/cassandra/blob/cassandra-2.2/CHANGES.txt T
  • 54. CASS Stress/Load Tests ❏ Some bugs only appear when testing with volume ❏ Adding volume might be tricky and time consuming ❏ Latency (do not run scripts from your local env) ❏ Filling up a table with a few text files will take too much time ❏ Parallelization is needed ❏ The Cassandra-Stress tool comes in handy in such scenarios ❏ Customize how many rows and how many parallel threads write ❏ It uses tables with blobs ❏ Customize schema, replication factors and consistency level while running scripts J
  • 55. OOM Outage! EBS vs Instance Store ❏ EBS is a SPOF ❏ EBS is more expensive ❏ EBS is less performant ❏ EBS is more flexible ❏ Disk space was critical to us ❏ You don't want to run out of disk, believe us. ❏ Dynamic disk space definition while launching a cluster ❏ Disk space validations before starting a backup J
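
A disk space validation before a backup might be sketched as below. The slides do not say how CM performs this check, so the approach (compare the data directory size plus some headroom against the usable space of the backup volume) and the names `BackupDiskCheck`, `hasRoom`, `directorySize` and `canBackup` are assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical pre-backup check: refuse to start if the copy would not fit on disk.
class BackupDiskCheck {

    // True if usableBytes covers dataBytes plus a safety margin (headroom 0.5 = 50% extra).
    static boolean hasRoom(long dataBytes, long usableBytes, double headroom) {
        return usableBytes >= (long) Math.ceil(dataBytes * (1.0 + headroom));
    }

    // Sums the sizes of all regular files under dir.
    static long directorySize(Path dir) {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compares the data directory size against free space on the backup volume.
    static boolean canBackup(Path dataDir, Path backupDir, double headroom) {
        try {
            long usable = Files.getFileStore(backupDir).getUsableSpace();
            return hasRoom(directorySize(dataDir), usable, headroom);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path here = Path.of(".");
        System.out.println("Enough room for a backup: " + canBackup(here, here, 0.5));
    }
}
```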
  • 56. Side note on Cass 4.x ❏ Cass 3.x is better than Cass 2.x right now ❏ Cass 4.x will be awesome ❏ Netflix work on incremental repairs ❏ Bug fixes - like gossip threads and restart issue ❏ Way more stable - everybody should migrate. ❏ Having fewer Cassandra versions reduces complexity ❏ Different configurations ❏ Bugs that were fixed but you don't get the fix - lack of backports (old versions) D
  • 58. Design is strategic: Avoid complexity, bugs and reduce cost D
  • 59. Avoid Classitis - FAT classes rules D
  • 60. Java over bash always | Right tool for the job D Tooling Refactoring Dev Vs Ops Tooling / Mindset
  • 61. Proof of 9 - Validate the code not the tests D
  • 62. Hard to Estimate Bugs(Data Layer) = Stabilization Payoff D Microservices Data Layer
  • 63. Make sure you expand your test coverage radius D
  • 64. Forensic Mindset & Skill | Observability over debug D
  • 65. DevOps Is Plumbing. Automate the Hidden Pipelines! ❏ Remediation ❏ Scale Up ❏ Patch ❏ Upgrade ❏ Much more... ❏ OS Patches ❏ Telemetry ❏ Discovery ❏ Destroy ❏ Restore D
  • 66. Make tools for your Tools ❏ REPL - Cmsh ❏ Better than: ❏ Run books ❏ REST ❏ Bash Alias ❏ Shared Dashboards ❏ Avoid problems ❏ What to monitor ❏ Self-Service Jobs ❏ Better than: ❏ Coding ❏ Jenkins DSL ❏ TF Templates D
  • 67. Diego Pacheco Jackson Oliveira Marcelo Serpa (a.k.a Tarzan) Experiences Building a multi-region Cassandra Operations Orchestrator on AWS