12. CM Features
❏ Support for Cassandra 2.2.x and 3.1.x
❏ Backups and point-in-time restores
❏ Seeds / token management
❏ Full AWS automation (SG, LC, and ASG)
❏ Automated node replacement
❏ Automated node-by-node repairs
❏ Multi-DC support
❏ REST interfaces
❏ CM internal state durability / recovery (local disk and S3)
❏ 100% automated operations for:
❏ Cluster: creation, search, shutdown
15. CM use cases
❏ Source of truth for most microservices
❏ Single Region Cluster
❏ Batch/Streaming Application (Previously with HBase)
❏ Multi-Region Cluster
❏ API Gateway (Kong)
❏ Authentication Microservice
20. Step Framework
❏ One task has multiple steps
❏ Order
❏ Run a list of steps on Cassandra nodes
❏ Track the current step running on each node
❏ Skip steps
❏ If a step fails, send a message to a Slack channel
❏ SignalFX
BACKUP
1- Create directories
2- Copy data
3- Send to S3
RESTORE
1- Download backups
2- Copy data
3- Restart Cassandra
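The step framework above (ordered steps per task, a per-node tracker, skippable steps, Slack notification on failure) can be sketched as below. All class and method names are illustrative assumptions, not the actual CM code; the failure notifier stands in for a Slack webhook.

```java
import java.util.List;
import java.util.function.Consumer;

// Sketch of an ordered step runner: runs steps in order, tracks the
// current step for observability, supports skipping, and invokes a
// notifier (e.g. a Slack poster) when a step fails. Illustrative only.
public class StepRunner {
    public interface Step {
        String name();
        default boolean shouldSkip(String node) { return false; }
        void run(String node) throws Exception;
    }

    private final Consumer<String> onFailure;  // e.g. posts to a Slack channel
    private String currentStep = "none";       // tracker: which step runs now

    public StepRunner(Consumer<String> onFailure) {
        this.onFailure = onFailure;
    }

    public String currentStep() { return currentStep; }

    // Runs the steps in order on one node; returns false on the first failure.
    public boolean runAll(List<Step> steps, String node) {
        for (Step step : steps) {
            if (step.shouldSkip(node)) continue;  // skip support
            currentStep = step.name();
            try {
                step.run(node);
            } catch (Exception e) {
                onFailure.accept("step " + step.name() + " failed on "
                        + node + ": " + e.getMessage());
                return false;
            }
        }
        return true;
    }
}
```

Tracking the current step per node means a failure report can name the exact step (e.g. "send-to-s3") and node, rather than just "backup failed".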
22. Recovery old and new model
❏ OLD way
❏ Disk first
❏ S3 every minute
❏ Flaky: did not cover all corner cases
❏ New way
❏ Disk
❏ Send state to all Cassandra nodes
❏ In case of failure, query all Cassandra nodes
❏ Get the copy with the highest TIMESTAMP and use it
❏ More reliable
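The "highest timestamp wins" recovery can be sketched as below: each Cassandra node returns the CM state copy it holds, and the recovering CM picks the newest one. `StateCopy` and its fields are assumed names for illustration, not the real CM types.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Sketch of the new recovery model: gather the state copies that the
// Cassandra nodes returned and keep the one with the highest timestamp.
public class StateRecovery {
    // Illustrative shape of one node's answer.
    public record StateCopy(String node, long timestampMillis, String payload) {}

    // Returns the freshest state among all copies, if any node answered.
    public static Optional<StateCopy> pickNewest(List<StateCopy> copies) {
        return copies.stream()
                .max(Comparator.comparingLong(StateCopy::timestampMillis));
    }
}
```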
24. Multi-Region Design
❏ CM Topology
❏ Dedicated: 1-1
❏ Shared: 1-N
❏ Infrastructure details:
❏ CMs in both regions exchange information
❏ CM internode communication with EIP
❏ Public IP + PEM -> VPC Peering
❏ Cassandra:
❏ 2 seeds in US, 1 seed in EU
❏ Seeds boot up first
❏ Replication is async between regions
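In cassandra.yaml terms, the seed layout above (two seeds in the US region, one in the EU region) maps to the `seed_provider` section; the addresses below are placeholders, not real ones.

```yaml
# Sketch only: placeholder addresses for 2 US seeds + 1 EU seed.
# Seeds must boot first so joining nodes can discover the cluster.
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.1.10,10.0.2.10,172.16.1.10"   # us-a, us-b, eu-a
```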
26. How the team works? Practices.
❏ Clients are: developers and cloud operators
❏ Planning per quarter
❏ Tech lead / coach
❏ Retro every month
❏ Coaching sessions - 1:1s
❏ Design sessions
❏ Reviews
❏ Refactoring
❏ Kanban + Google Sheets + Trello
❏ DevOps principles - e.g. immutable infrastructure
27. How the team works? Tracking.
❏ Name an engineer who likes JIRA. Only PMs like JIRA.
❏ We were not using issue tracking at first
❏ Issues lost
❏ Look for emails
❏ Ask several times about issues
❏ Repeat same design over and over
❏ Came up in a retrospective
❏ Github as issue tracking
❏ Log issues: bugs and enhancements
❏ Github release tracking
28. How the team works? Kanban + Predictability
❏ Simple Google Sheets
❏ Items / Weeks
❏ Check every week if you are on track or not
❏ 100% accuracy for features
❏ 100% WRONG estimate for BUGS (2 weeks ~ 2 months)
❏ Different Nature: Microservices VS Data Layer
❏ Very hard to estimate bugs - Solution?
❏ You can't automate what you don't know
❏ Stability Mindset
❏ Don't introduce bugs == Developer Checklists
❏ Force you to know what to automate later
29. How the team works? Releases. Stabilization Windows
❏ 4 Quarters
❏ ~Monthly releases
❏ Looks like waterfall or buffering
❏ Avoid shipping bugs to customers
❏ Avoid downtimes
❏ Avoid losing data
❏ It's a must in data layer
❏ The data layer needs to be more reliable than microservices
❏ How we did it?
❏ Single Region - Stabilization window 1
❏ Multi-DC - Stabilization Window 2
30. How the team works? Documentation and Scalability
❏ About our customer: 42 countries organization
❏ Meetings are a bottleneck for scalability
❏ Jenkins DSL (Code in General) kills scalability
❏ Self-service kills tickets
❏ Documentation kills meetings
❏ Documentation matters
❏ Time Zones
❏ English
❏ Avoid Repetition
31. How the team works? Tests! Stability + Checklists
❏ Unit Tests
❏ Integration Tests
❏ Exploratory Tests
❏ Release 1 - 30 issues (mostly bugs)
❏ Release 2 - 20 issues (mostly enhancements)
❏ Stability Mindset / Principles
❏ Exploratory tests are a MUST
❏ Try to maximize the coverage spectrum
❏ Developer checklists work very well
32. How the team works? Refactorings.
❏ Strategic VS tactical programming
❏ Several important refactorings (re-designs) like:
❏ Thread Model
❏ Tasks Responsibility
❏ Utils
❏ And much more…
❏ Easy to do in Java, with good tooling like Eclipse
❏ Pays off in the long run
❏ Kills you if you don't do it.
33. Flaky Tests
❏ Integration tests
❏ ~20 minutes
❏ Cassandra 3x and Cassandra 2x
❏ Hard to maintain
❏ Async AWS APIs (SG, LC, and ASG)
❏ Fixed timeout == unstable tests
❏ Solution: Progressive timeout
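A progressive timeout can be sketched as polling with a wait that grows each attempt instead of a single fixed sleep, which tolerates the variable latency of async AWS APIs. The names and parameters below are illustrative, not the actual test code.

```java
import java.util.function.BooleanSupplier;

// Sketch of a progressive timeout for flaky async conditions: poll with
// a doubling wait (capped at maxWaitMs) until the budget is spent.
public class ProgressiveWait {
    // Returns true as soon as the condition holds, false if the total
    // budget is exhausted. Waits grow: initial, 2x, 4x, ... up to the cap.
    public static boolean await(BooleanSupplier condition,
                                long initialWaitMs,
                                long maxWaitMs,
                                long totalBudgetMs) {
        long waited = 0;
        long wait = initialWaitMs;
        while (waited < totalBudgetMs) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(wait);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            waited += wait;
            wait = Math.min(wait * 2, maxWaitMs);
        }
        return condition.getAsBoolean();  // one last check after the budget
    }
}
```

Compared to a fixed `Thread.sleep(60_000)`, this returns early when the resource appears quickly and keeps retrying when it is slow, which is exactly the fixed-timeout instability the slide describes.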
35. Remediation
❏ Why Remediate?
❏ Manual Steps are dangerous
❏ Bad time == Lots of pressure
❏ Started with Dynomite
❏ Scale Up
❏ AMI Patch
❏ Refactor to support Cassandra and CM
❏ Calls DM and CM Health Checkers
❏ Procedural process
❏ Relies on: DM cold bootstrap and CM node_replace + repair.
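A remediation pass over the health checkers might be sketched as below. The per-pass replacement cap is an assumption added for illustration (it is the kind of safety valve that prevents the "remediation killed too many nodes" incident later in this deck); all names are illustrative, not the real remediation code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Sketch of one remediation pass: ask the health checkers about every
// node and trigger an automated replacement for the unhealthy ones,
// capped so a single pass can never kill too many nodes at once.
public class Remediation {
    public static List<String> remediate(List<String> nodes,
                                         Predicate<String> isHealthy,
                                         Consumer<String> replaceNode,
                                         int maxReplacements) {
        List<String> replaced = new ArrayList<>();
        for (String node : nodes) {
            if (replaced.size() >= maxReplacements) {
                break;  // safety valve: let earlier replacements settle first
            }
            if (!isHealthy.test(node)) {
                replaceNode.accept(node);  // e.g. CM node_replace + repair
                replaced.add(node);
            }
        }
        return replaced;
    }
}
```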
37. Downtime VS No Downtime: Forklift + Dual Write
❏ Downtime
❏ Dump data to file
❏ Dump Keyspace/Schema to file
❏ Upload to S3
❏ Import in new cluster
❏ No-Downtime
❏ Forklift + Dual writer pattern
❏ Requires code in the microservices
❏ Requires orchestration in Spinnaker.
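The dual-writer code required in the microservices can be sketched as below: every write goes to both the old and the new cluster while the forklift backfills history, and reads stay on the old cluster until the data converges. The `Store` interface and the reconciliation comments are illustrative assumptions, not a real client API.

```java
// Sketch of the dual-writer half of "forklift + dual write":
// writes fan out to both stores; reads stay on the old store
// until the migration is validated.
public class DualWriter {
    public interface Store {
        void write(String key, String value);
        String read(String key);
    }

    private final Store oldStore;  // source of truth during migration
    private final Store newStore;  // being backfilled by the forklift

    public DualWriter(Store oldStore, Store newStore) {
        this.oldStore = oldStore;
        this.newStore = newStore;
    }

    public void write(String key, String value) {
        oldStore.write(key, value);      // must succeed: still the source of truth
        try {
            newStore.write(key, value);  // best effort during migration
        } catch (RuntimeException e) {
            // assumed behavior: log and continue; the bulk forklift
            // re-copy reconciles any gaps later
        }
    }

    public String read(String key) {
        return oldStore.read(key);  // cut reads over only after validation
    }
}
```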
41. Troubleshooting / Police Forensic Skills
Remediation killed too many nodes and the replace did not happen... why?
A) AWS Ec2
B) Jenkins
C) CM Java Code
D) Python daemon
E) Java Remediation code
F) AWS S3
G) Cassandra Node
H) Cassandra Cluster
I) Time
J) None of the above
42. Troubleshooting / Police Forensic Skills
[Diagram: Remediation and CM checking cluster activity against Cass nodes in US (2A, 2B, 2C) and EU (1A, 2B, 2C); the ASG kills the Cass US 2A box; does the replacement come back with a new IP?]
43. Fast Vs Slow Issue!
❏ Only with theories
❏ EVIDENCE to back up our theories/assumptions
❏ Simulations
❏ Solution:
❏ AWS Chaos service :-)
❏ < 1 min = FAST
❏ > 3 min = SLOW
❏ At the end of the day, it's all about a 90s internal TTL
❏ Wait for the replace to make sure it reflects the REAL world
❏ Wait for HC to make sure it captures the real world
52. Cass 2.1.x to Cass 2.2.x issues
❏ Node replace stopped working
❏ We generate Cassandra config files
❏ Positions and parameters changed from 2.1 to 2.2
❏ Our code broke
❏ Big changes migrating from Cass 2.1.x to 2.2.x
❏ Improvements
❏ Improved repair performance.
❏ The commit log is compressed to save disk space.
❏ Fixes
❏ Fix repair hang when snapshot failed (CASSANDRA-10057)
❏ Fix potential NPE on ORDER BY queries with IN (CASSANDRA-10955)
❏ Fix handling of nulls and unsets in IN conditions (CASSANDRA-12981)
❏ https://github.com/apache/cassandra/blob/cassandra-2.2/CHANGES.txt
54. CASS Stress/Load Tests
❏ Some bugs only appear when testing with volume
❏ Adding volume might be tricky and time-consuming
❏ Latency (do not run scripts from your local env)
❏ Filling up a table with a few text files will take too much time
❏ Parallelization is needed
❏ The cassandra-stress tool comes in handy in such scenarios
❏ Customize how many rows and how many parallel write threads
❏ We used tables with blobs
❏ Customize schema, replication factors, and consistency level while running scripts
55. OOM Outage! EBS vs Instance Store
❏ EBS is a SPOF
❏ EBS is more expensive
❏ EBS is less performant
❏ EBS is more flexible
❏ Disk space was critical to us
❏ You don't want to run out of disk, believe us...
❏ Dynamic disk space definition while launching a cluster
❏ Disk space validations before starting a backup
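A pre-backup disk space validation can be sketched with `java.nio.file.FileStore`: refuse to start a backup unless the data volume can hold the expected backup plus a margin. The 20% margin is an assumed threshold for illustration, not the real CM setting.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of a disk space validation run before starting a backup.
public class DiskCheck {
    // Returns true only if the volume holding dataDir has room for the
    // expected backup size plus a 20% margin (illustrative threshold).
    public static boolean hasRoomForBackup(Path dataDir, long backupBytes) {
        try {
            FileStore store = Files.getFileStore(dataDir);
            return store.getUsableSpace() > backupBytes + backupBytes / 5;
        } catch (IOException e) {
            return false;  // cannot stat the volume: do not start the backup
        }
    }
}
```

Failing fast here is cheap; discovering mid-backup that the disk is full is exactly the OOM-style outage the slide warns about.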
56. Side note on Cass 4.x
❏ Cass 3.x is better than Cass 2.x right now
❏ Cass 4.x will be awesome
❏ Netflix's work on incremental repairs
❏ Bug fixes - like the gossip threads and restart issue
❏ Way more stable - everybody should migrate.
❏ Having fewer Cassandra versions reduces complexity
❏ Different configurations
❏ Bugs that were fixed but you don't get - lack of backports to old versions