MongoDB is a great NoSQL database: it's flexible and easy to use.
But can it handle massive read/write throughput?
And what happens when you need to scale everything out, quickly and easily?
We will lay out the reasons for, and the steps of, migrating our data pipeline to Apache Cassandra in a short period, without any prior knowledge of it.
We'll list our lessons learned as well.
Bio:
Demi Ben-Ari, Sr. Data Engineer @Windward.
I have over 9 years of experience in building various systems, from near-real-time applications to Big Data distributed systems.
Co-organizer of the “Big Things” Big Data community: http://somebigthings.com/big-things-intro/
2. About me
Demi Ben-Ari
Senior Software Engineer at Windward Ltd.
B.Sc. Computer Science – Academic College of Tel-Aviv-Yaffo
In the Past:
Software Team Leader & Senior Java Software Engineer,
Missile Defense and Alert System – “Ofek” unit, IAF
3. Agenda
Data flow with MongoDB
The Problem
Solution
Lessons learned from a Newbie
Conclusion
5. Data pipeline flow – Use Case
Batch Apache Spark applications running
every 10 - 60 minutes
Request Rate:
◦ Bursts of ~9 million requests per batch job
◦ Beginning – Reads
◦ End - Writes
9. The Problem
Batch jobs
◦ Should run for 5–10 minutes in total
◦ Actual – runs for ~40 minutes
Why?
◦ ~20 minutes to write with the Java Mongo driver – async (Unacknowledged)
◦ ~20 minutes to sync the journal
◦ Total: ~40 minutes of the DB being unavailable
◦ No batch process response and no UI serving
10. Alternative Solutions
Sharded MongoDB (with replica sets)
◦ Pros:
Increases throughput by the number of shards
Increases the availability of the DB
◦ Cons:
Very hard to manage DevOps-wise (for a small team of developers)
High server costs – each shard needs 3 replicas
12. Our DevOps – After that solution
We had no DevOps guy at all at that time
13. Alternative Solutions
DynamoDB (we're hosted on Amazon)
◦ Pros:
No DevOps to manage
◦ Cons:
Vendor lock-in – a “Catholic wedding” with Amazon's service
Not enough usage examples and use cases
The service might become very costly
14. Alternative Solutions
Apache Cassandra
◦ Pros:
Very large developer community
Linearly scalable database
No single-master architecture
Proven to work with distributed engines like Apache Spark
◦ Cons:
We had no experience at all with the database
No geospatial index – we needed to implement it ourselves
15. The Solution
Migration to Apache Cassandra (steps)
◦ Writing to Mongo and Cassandra simultaneously
◦ Easily creating a Cassandra cluster using the DataStax Community AMI on AWS
◦ First easy step – using the spark-cassandra-connector (an easy bootstrap move to Spark + Cassandra)
◦ Creating a monitoring dashboard for Cassandra
17. Result
Performance improvement
◦ The batch-write parts of the job run in 3 minutes, instead of ~40 minutes with MongoDB
It took 2 weeks to go from “Zero to Hero” and to ramp up a running solution that works without glitches
18. Lessons learned from a Newbie
Use TokenAwarePolicy when connecting to the cluster – it spreads the load across the coordinators
// DataStax Java driver (2.x): wrap DCAwareRoundRobinPolicy with
// TokenAwarePolicy so each request is routed to a replica of its data.
Builder builder = Cluster.builder()
    .withSocketOptions(socketOptions)
    .withLoadBalancingPolicy(
        new TokenAwarePolicy(new DCAwareRoundRobinPolicy()));
Cluster cluster = builder.build();
19. Lessons learned from a Newbie
Monitor everything!!! – All of the Metrics
◦ Cassandra
◦ JVM
◦ OS
Feature-flag every connection parameter – you'll need it for tuning later
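One simple way to get such flags (a minimal sketch; the property names and defaults here are made up for illustration, not the ones we used) is reading connection parameters from system properties with sensible fallbacks, so they can be changed per run without a redeploy:

```java
// Sketch: tunable connection parameters as JVM system properties,
// e.g. -Dcassandra.readTimeoutMillis=20000 on the command line.
// Property names and default values are illustrative only.
public class ConnectionFlags {
    static int intFlag(String name, int defaultValue) {
        String v = System.getProperty(name);
        return (v == null) ? defaultValue : Integer.parseInt(v);
    }

    public static void main(String[] args) {
        int readTimeout = intFlag("cassandra.readTimeoutMillis", 12000);
        int connectTimeout = intFlag("cassandra.connectTimeoutMillis", 5000);
        // These values would then feed SocketOptions when building the Cluster.
        System.out.println(readTimeout + " " + connectTimeout);
    }
}
```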
21. Monitor Everything!!!
Graphite + Grafana
◦ Pluggable metrics – since Cassandra 2.0.x
Cassandra internal metrics
JVM metrics
◦ OS metrics
CollectD / StatsD – reporting to Graphite
◦ Should be combined with application-level metrics in the same graphs
Better visibility into correlations between the metrics
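Application-level metrics can be shipped to Graphite over its plaintext protocol: one line per sample, "metric.path value unix-timestamp", sent over TCP (Carbon listens on port 2003 by default). A minimal sketch, with a made-up metric name:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Sketch of Graphite's plaintext protocol:
// each sample is a line "<metric.path> <value> <unix-timestamp>\n".
public class GraphiteReporter {
    static String formatLine(String path, double value, long epochSeconds) {
        return path + " " + value + " " + epochSeconds + "\n";
    }

    // Sends one metric line to a Carbon endpoint (default port 2003).
    static void send(String host, int port, String line) throws IOException {
        try (Socket socket = new Socket(host, port);
             OutputStream out = socket.getOutputStream()) {
            out.write(line.getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) {
        // Metric name is illustrative only.
        System.out.print(formatLine("app.batch.write_ms", 3200.0, 1400000000L));
    }
}
```

In practice a metrics library (e.g. Dropwizard Metrics with its Graphite reporter) handles the batching and scheduling for you.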
23. Lessons learned from a Newbie
“nodetool” is your friend
◦ tpstats, cfhistograms, cfstats…
Data modeling
◦ Time-series data
◦ Evenly distributed partitions
◦ Everything becomes more rigid
Know your queries before you model
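For time-series data, a common way to keep partitions evenly distributed is bucketing the partition key by a time window, so no single partition grows without bound. A sketch under our own assumptions (the day-sized bucket, table, and column names are illustrative):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch: derive a day-sized partition bucket from an event timestamp,
// so each partition holds one day of one source instead of all history.
public class TimeBucket {
    private static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

    // Partition key = (source_id, dayBucket); clustering key = event time.
    static String dayBucket(Instant eventTime) {
        return DAY.format(eventTime);
    }

    public static void main(String[] args) {
        System.out.println(dayBucket(Instant.parse("2015-06-15T13:45:00Z")));
        // Corresponding CQL (illustrative):
        // CREATE TABLE events (
        //   source_id text, day text, ts timestamp, payload text,
        //   PRIMARY KEY ((source_id, day), ts));
    }
}
```

The bucket size should match the queries: pick it so a typical read touches only one or a few partitions.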
24. Lessons learned from a Newbie
CQL Queries
◦ Once we got to know our data model better, it became more efficient, performance-wise, to use CQL statements instead of the spark-cassandra-connector
◦ Prepared statements, delete queries (of full partitions), range queries…
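To illustrate the statement shapes meant above (table and column names are hypothetical; real code would prepare these once via the DataStax driver and bind values per execution, the strings are built plainly here to keep the example self-contained):

```java
// Sketch: the CQL shapes referred to above, with "?" bind markers
// as used by prepared statements in the DataStax Java driver.
public class CqlExamples {
    // Deleting a full partition at once is cheap in Cassandra:
    // it writes a single partition-level tombstone.
    static String deletePartition(String table, String keyColumn) {
        return "DELETE FROM " + table + " WHERE " + keyColumn + " = ?";
    }

    // Range query over a clustering column, within a single partition.
    static String rangeQuery(String table, String keyColumn, String clusteringColumn) {
        return "SELECT * FROM " + table
            + " WHERE " + keyColumn + " = ?"
            + " AND " + clusteringColumn + " >= ? AND " + clusteringColumn + " < ?";
    }

    public static void main(String[] args) {
        System.out.println(deletePartition("events", "source_id"));
        System.out.println(rangeQuery("events", "source_id", "ts"));
    }
}
```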
25. Useful Cassandra GUI Clients
DevCenter – By DataStax - Free
DBeaver – Free & open source
◦ Supports a wide variety of databases
26. Conclusion
Cassandra is a great, linearly scaling distributed database
Monitor as much as you can
◦ Get visibility into what's going on in the cluster
Correct data modeling is the key to success
Be ready for your next war
◦ Cassandra performance tuning – you'll get to that for sure