MongoDB is a great NoSQL database: it's flexible and easy to use.
But can it handle massive read/write throughput?
And what happens when you need to scale everything out, quickly and easily?
We will lay out the reasons for, and the steps of, migrating our data pipeline to Apache Cassandra in a short period, without any prior knowledge of it.
We'll list our lessons learned as well.
Bio:
Demi Ben-Ari, Sr. Data Engineer @Windward.
I have over 9 years of experience in building various systems, from near-real-time applications to Big Data distributed systems.
Co-organizer of the “Big Things” Big Data community: http://somebigthings.com/big-things-intro/
2. About me
Demi Ben-Ari
Senior Software Engineer at Windward Ltd.
B.Sc. Computer Science – Academic College of Tel-Aviv-Yaffo
In the Past:
Software Team Leader & Senior Java Software Engineer,
Missile Defense and Alert System – “Ofek” unit, IAF
3. Agenda
Data flow with MongoDB
The Problem
Solution
Lessons learned from a Newbie
Conclusion
5. Data pipeline flow – Use Case
Batch Apache Spark applications running
every 10 - 60 minutes
Request Rate:
◦ Bursts of ~9 million requests per batch job
◦ Beginning – Reads
◦ End - Writes
9. The Problem
Batch jobs
◦ Should run for 5–10 minutes in total
◦ Actual – runs for ~40 minutes
Why?
◦ ~20 minutes to write with the Java Mongo driver – async (Unacknowledged)
◦ ~20 minutes to sync the journal
◦ Total: ~40 minutes of the DB being unavailable
◦ No batch process response and no UI serving
10. Alternative Solutions
Sharded MongoDB (with replica sets)
◦ Pros:
Increases throughput by the number of shards
Increases the availability of the DB
◦ Cons:
Very hard to manage DevOps-wise (for a small team of developers)
High server costs – each shard needs 3 replicas
12. Our DevOps – After that solution
We had no DevOps guy at all at that time
13. Alternative Solutions
DynamoDB (we're hosted on Amazon)
◦ Pros:
No DevOps to manage
◦ Cons:
Vendor lock-in – a “Catholic wedding” with Amazon's service
Not enough usage examples and use cases
The service might become very costly
14. Alternative Solutions
Apache Cassandra
◦ Pros:
Very large developer community
Linearly scalable database
No single-master architecture
Proven to work with distributed engines like Apache Spark
◦ Cons:
We had no experience at all with the database
No geospatial index – we needed to implement it ourselves
15. The Solution
Migration to Apache Cassandra (steps)
◦ Writing to Mongo and Cassandra simultaneously
◦ Easily creating a Cassandra cluster using the DataStax Community AMI on AWS
◦ First easy step – using the spark-cassandra-connector (an easy bootstrap move to Spark + Cassandra)
◦ Creating a monitoring dashboard for Cassandra
17. Result
Performance improvement
◦ The batch-write parts of the job run in 3 minutes, instead of ~40 minutes with MongoDB
It took 2 weeks to go from “Zero to Hero” and to ramp up a running solution that works without glitches
18. Lessons learned from a Newbie
Use TokenAwarePolicy when connecting to the cluster – it spreads the load across the coordinators
// DataStax Java driver (2.x): wrap DCAwareRoundRobinPolicy with
// TokenAwarePolicy so each request is routed to a replica of its data.
Builder builder = Cluster.builder()
    .withSocketOptions(socketOptions)
    .withLoadBalancingPolicy(
        new TokenAwarePolicy(new DCAwareRoundRobinPolicy()));
Cluster cluster = builder.build();
19. Lessons learned from a Newbie
Monitor everything!!! – All of the Metrics
◦ Cassandra
◦ JVM
◦ OS
Feature-flag every connection parameter – you'll need it for tuning later
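One simple way to get such flags (a minimal sketch; the property names and defaults here are made up for illustration, not the ones we used) is reading connection parameters from system properties with sensible fallbacks, so they can be changed per run without a redeploy:

```java
// Sketch: tunable connection parameters as JVM system properties,
// e.g. -Dcassandra.readTimeoutMillis=20000 on the command line.
// Property names and default values are illustrative only.
public class ConnectionFlags {
    static int intFlag(String name, int defaultValue) {
        String v = System.getProperty(name);
        return (v == null) ? defaultValue : Integer.parseInt(v);
    }

    public static void main(String[] args) {
        int readTimeout = intFlag("cassandra.readTimeoutMillis", 12000);
        int connectTimeout = intFlag("cassandra.connectTimeoutMillis", 5000);
        // These values would then feed SocketOptions when building the Cluster.
        System.out.println(readTimeout + " " + connectTimeout);
    }
}
```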
21. Monitor Everything!!!
Graphite + Grafana
◦ Pluggable metrics – since Cassandra 2.0.x
Cassandra internal metrics
JVM metrics
◦ OS metrics
CollectD / StatsD – reporting to Graphite
◦ Should be combined with application-level metrics in the same graphs
Better visibility into correlations between the metrics
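Application-level metrics can be shipped to Graphite over its plaintext protocol: one line per sample, "metric.path value unix-timestamp", sent over TCP (Carbon listens on port 2003 by default). A minimal sketch, with a made-up metric name:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Sketch of Graphite's plaintext protocol:
// each sample is a line "<metric.path> <value> <unix-timestamp>\n".
public class GraphiteReporter {
    static String formatLine(String path, double value, long epochSeconds) {
        return path + " " + value + " " + epochSeconds + "\n";
    }

    // Sends one metric line to a Carbon endpoint (default port 2003).
    static void send(String host, int port, String line) throws IOException {
        try (Socket socket = new Socket(host, port);
             OutputStream out = socket.getOutputStream()) {
            out.write(line.getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) {
        // Metric name is illustrative only.
        System.out.print(formatLine("app.batch.write_ms", 3200.0, 1400000000L));
    }
}
```

In practice a metrics library (e.g. Dropwizard Metrics with its Graphite reporter) handles the batching and scheduling for you.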
23. Lessons learned from a Newbie
“nodetool” is your friend
◦ tpstats, cfhistograms, cfstats…
Data modeling
◦ Time-series data
◦ Evenly distributed partitions
◦ Everything becomes more rigid
Know your queries before you model
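For time-series data, a common way to keep partitions evenly distributed is bucketing the partition key by a time window, so no single partition grows without bound. A sketch under our own assumptions (the day-sized bucket, table, and column names are illustrative):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch: derive a day-sized partition bucket from an event timestamp,
// so each partition holds one day of one source instead of all history.
public class TimeBucket {
    private static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

    // Partition key = (source_id, dayBucket); clustering key = event time.
    static String dayBucket(Instant eventTime) {
        return DAY.format(eventTime);
    }

    public static void main(String[] args) {
        System.out.println(dayBucket(Instant.parse("2015-06-15T13:45:00Z")));
        // Corresponding CQL (illustrative):
        // CREATE TABLE events (
        //   source_id text, day text, ts timestamp, payload text,
        //   PRIMARY KEY ((source_id, day), ts));
    }
}
```

The bucket size should match the queries: pick it so a typical read touches only one or a few partitions.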
24. Lessons learned from a Newbie
CQL Queries
◦ Once we got to know our data model better, it became more efficient, performance-wise, to use CQL statements instead of the spark-cassandra-connector
◦ Prepared statements, delete queries (of full partitions), range queries…
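To illustrate the statement shapes meant above (table and column names are hypothetical; real code would prepare these once via the DataStax driver and bind values per execution, the strings are built plainly here to keep the example self-contained):

```java
// Sketch: the CQL shapes referred to above, with "?" bind markers
// as used by prepared statements in the DataStax Java driver.
public class CqlExamples {
    // Deleting a full partition at once is cheap in Cassandra:
    // it writes a single partition-level tombstone.
    static String deletePartition(String table, String keyColumn) {
        return "DELETE FROM " + table + " WHERE " + keyColumn + " = ?";
    }

    // Range query over a clustering column, within a single partition.
    static String rangeQuery(String table, String keyColumn, String clusteringColumn) {
        return "SELECT * FROM " + table
            + " WHERE " + keyColumn + " = ?"
            + " AND " + clusteringColumn + " >= ? AND " + clusteringColumn + " < ?";
    }

    public static void main(String[] args) {
        System.out.println(deletePartition("events", "source_id"));
        System.out.println(rangeQuery("events", "source_id", "ts"));
    }
}
```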
25. Useful Cassandra GUI Clients
DevCenter – By DataStax - Free
DBeaver – Free & open source
◦ Supports a wide variety of databases
26. Conclusion
Cassandra is a great, linearly scaling distributed database
Monitor as much as you can
◦ Get visibility into what's going on in the cluster
Correct data modeling is the key to success
Be ready for your next war
◦ Cassandra performance tuning – you'll get to that for sure