Many organizations struggle to balance traditional big data infrastructure with NoSQL databases. Other organizations do the smart thing and consolidate the two. This presentation explores Numberly’s experience migrating an intensive, join-hungry production workload from MongoDB and Hive to Scylla. Using Scylla, we were able to accommodate a join of billions of rows in seconds, while also dramatically reducing operational and development complexity by using a single database for our hybrid analytical use case. As a bonus, we’ll cover benchmarks for Dask (a flexible parallel computing library for analytic computing) and Spark, highlighting their differences and lessons learned along the way.
Scylla Summit 2018: Joining Billions of Rows in Seconds with One Database Instead of Two - Replacing MongoDB and Hive with Scylla
1. Joining Billions of Rows in Seconds
with One Database Instead of Two:
Replacing MongoDB and Hive with Scylla
Alexys Jacob
CTO, Numberly
2. 1 Eiffel Tower
2 Soccer World Cups
15 Years in the Data industry
Pythonista
OSS enthusiast & contributor
Gentoo Linux developer
CTO at Numberly - living in Paris, France
whoami
@ultrabug
3. Business context of Numberly
Digital Marketing Technologist (MarTech)
Handling the relationship between brands and people (People based)
Dealing with multiple sources and a wide range of data types (Events)
Mixing and correlating a massive amount of different types of events...
...which all have their own identifiers (think primary keys)
4. Business context of Numberly
Web navigation tracking (browser ID: cookie)
CRM databases (email address, customer ID)
Partners’ digital platforms (cookie ID, hash(email address))
Mobile phone apps (device ID: IDFA, GAID)
Ability to synchronize and translate identifiers between all data sources and destinations.
➔ For this we use ID matching tables.
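The role of an ID matching table can be sketched in a few lines of plain Python. This is an illustration only, with hypothetical data and names, not Numberly's actual schema (in production the table lives in a database and the emails would be hashed):

```python
# Hypothetical ID matching table: email -> cookie ID known for that person.
id_matching_table = {
    "generous@coconut.fr": "cookie-123",
    "isupportu@lab.com": "cookie-297",
    "openinternet@free.fr": "cookie-896",
}

def translate(reference_population):
    """SELECT the reference population, JOIN it with the matching
    table, and keep only the MATCHED identifiers."""
    return {
        email: id_matching_table[email]
        for email in reference_population
        if email in id_matching_table
    }

# Only identifiers present in the matching table survive the join
matched = translate(["generous@coconut.fr", "wiki4ever@wp.eu"])
print(matched)
```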
5. ID matching tables
1. SELECT reference population
2. JOIN with the ID matching table
3. MATCHED population is usable by partner
Queried AND updated all the time!
➔ High read AND write workload
6. Real life example: retargeting
From a database (email) to a web banner (cookie)
[Diagram: a “Previous donors” email list (generous@coconut.fr, isupportu@lab.com, wiki4ever@wp.eu, openinternet@free.fr) is SELECTed, MATCHed through the ID matching table to cookie IDs (123, 297, ?, 896), then ACTIVATEd towards ad platforms (AppNexus, Google, ...) where the Ad Exchange recognizes user cookie id 123 and serves the banner on https://kitty.eu]
10. Future implementation using Scylla?
[Diagram: Events feed message queues, which fan out to a real-time pipeline (real-time programs) and a batch pipeline (batch calculation), with a single Scylla cluster serving both]
11. Proof Of Concept hardware
Recycled hardware…
▪ 2x DELL R510
• 19GB RAM, 16 cores, RAID0 SAS spinning disks, 1Gbps NIC
▪ 1x DELL R710
• 19GB RAM, 8 cores, RAID0 SAS spinning disks, 1Gbps NIC
➔ If it can compete with our production, Scylla is in!
12. Finding the right schema model
Query based AND test-driven data modeling
1. What are all the cookie IDs associated with the given partner ID over the last N months?
2. What is the last cookie ID/date for the given partner ID?
Gotcha: the reverse questions must also be answered!
➔ Denormalization
➔ Prototype with your language of choice!
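The "prototype with your language of choice" advice can be sketched as a tiny test-driven model: validate that the two query questions above are answerable against an in-memory stand-in before writing any CQL. All names and data here are hypothetical:

```python
# Minimal sketch of test-driven data modeling (hypothetical data).
from datetime import date

# Denormalized rows: (partnerid, date, cookieid)
rows = [
    ("partner-1", date(2018, 9, 1), "cookie-123"),
    ("partner-1", date(2018, 10, 5), "cookie-297"),
    ("partner-2", date(2018, 8, 12), "cookie-896"),
]

def cookies_for(partnerid, since):
    """Q1: all cookie IDs for a partner ID since a given date."""
    return [c for p, d, c in rows if p == partnerid and d >= since]

def last_cookie(partnerid):
    """Q2: the latest (cookie ID, date) pair for a partner ID."""
    d, c = max((d, c) for p, d, c in rows if p == partnerid)
    return c, d

# Tests proving the model answers both questions
assert cookies_for("partner-1", date(2018, 9, 1)) == ["cookie-123", "cookie-297"]
assert last_cookie("partner-1") == ("cookie-297", date(2018, 10, 5))
```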
13. Schema tip!
> What is the last cookie ID for the given partner ID?
TIP: CLUSTERING ORDER
▪ Defaults to ASC
➔ Latest value at the end of the sstable!
▪ Change “date” ordering to DESC
➔ Latest value at the top of the sstable
➔ Reduced read latency!
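In CQL, the tip above looks roughly like this; the table and column names are hypothetical, not the actual production schema:

```cql
-- Hypothetical table for illustration only.
CREATE TABLE ids_by_partnerid (
    partnerid text,
    date      timestamp,
    cookieid  text,
    PRIMARY KEY ((partnerid), date, cookieid)
) WITH CLUSTERING ORDER BY (date DESC);

-- With DESC ordering the latest row sits first, so this is cheap:
SELECT cookieid, date FROM ids_by_partnerid
WHERE partnerid = ? LIMIT 1;
```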
14. scylla-grafana-monitoring
Set it up and test it!
▪ Use cassandra-stress
Key graphs:
▪ number of open connections
▪ cache hits / misses
▪ per shard/node distribution
▪ sstable reads
TIP: reduce default scrape interval
▪ scrape_interval: 2s (4s default)
▪ scrape_timeout: 1s (5s default)
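As a sketch, the tighter scrape settings above go into the `global` section of the monitoring stack's `prometheus.yml`:

```yaml
# prometheus.yml fragment (sketch): tighter scraping for short test runs
global:
  scrape_interval: 2s   # default shipped config: 4s
  scrape_timeout: 1s    # default shipped config: 5s
```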
15. Reference data and metrics
Reference dataset
▪ 10M population
▪ 400M ID matching table
➔ Representative volumes
Measured on our production stack, with real load
NOT a benchmark!
18. Testing with Scylla
Distinguish between hot and cold cache scenarios
▪ Cold cache: mostly disk I/O bound
▪ Hot cache: mostly memory bound
Push your Scylla cluster to its limits!
21. Spark 2 tuning (1/2)
Use a fixed number of executors
▪ spark.dynamicAllocation.enabled=false
▪ spark.executor.instances=30
Change Spark split size to match Scylla for read performance
▪ spark.cassandra.input.split.size_in_mb=1
Adjust reads per seconds
▪ spark.cassandra.input.reads_per_sec=6666
22. Spark 2 tuning (2/2)
Tune the number of connections opened by each executor
▪ spark.cassandra.connection.connections_per_executor_max=100
Align driver timeouts with server timeouts (check scylla.yaml)
▪ spark.cassandra.connection.timeout_ms=150000
▪ spark.cassandra.read.timeout_ms=150000
ScyllaDB blog posts & webinar
▪ https://www.scylladb.com/2018/07/31/spark-scylla/
▪ https://www.scylladb.com/2018/08/21/spark-scylla-2/
▪ https://www.scylladb.com/2018/10/08/hooking-up-spark-and-scylla-part-3/
▪ https://www.scylladb.com/2018/07/17/spark-webinar-questions-answered/
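The tuning from the two slides above can be passed as `--conf` flags on submission; this is a sketch, and `your_job.jar` is a placeholder for your actual application:

```shell
# Sketch: Spark 2 + Scylla tuning as spark-submit flags
spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.instances=30 \
  --conf spark.cassandra.input.split.size_in_mb=1 \
  --conf spark.cassandra.input.reads_per_sec=6666 \
  --conf spark.cassandra.connection.connections_per_executor_max=100 \
  --conf spark.cassandra.connection.timeout_ms=150000 \
  --conf spark.cassandra.read.timeout_ms=150000 \
  your_job.jar
```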
24. OK for Scala, what about Python?
No joinWithCassandraTable when using pyspark...
Maybe we don’t need Spark 2 at all!
1. Load the 10M rows from Hive
2. For every row lookup the ID matching table from Scylla
3. Count the resulting number of matches
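The three steps above can be sketched as a toy, in-memory Python program. The Hive table and the Scylla matching table are replaced by plain Python objects here; in the real pipeline they are read with hdfs3/pyarrow and queried through the cassandra driver:

```python
# 1. "Load" the population rows (stand-in for the Hive read)
population = ["partner-1", "partner-2", "partner-3"]

# Stand-in for the Scylla ID matching table
matching_table = {"partner-1": "cookie-123", "partner-3": "cookie-896"}

# 2. For every row, look up the ID matching table
matches = [matching_table[p] for p in population if p in matching_table]

# 3. Count the resulting number of matches
print(len(matches))  # 2
```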
29. Python+Scylla with Parquet tips!
▪ Use execute_concurrent()
▪ Increase concurrency parameter (defaults to 100)
▪ Use libev as connection_class instead of asyncore
▪ Use hdfs3 + pyarrow to read and load Parquet files:
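A sketch combining the tips above (not runnable as-is: it needs a reachable HDFS namenode and Scylla cluster plus the `hdfs3`, `pyarrow` and `cassandra-driver` packages; host, keyspace and table names are placeholders):

```python
def lookup_matches(parquet_path, hdfs_host="namenode", scylla_hosts=("scylla1",)):
    # Imports kept inside the function so the sketch defines cleanly
    # even without the third-party packages installed.
    import hdfs3
    import pyarrow.parquet as pq
    from cassandra.cluster import Cluster
    from cassandra.concurrent import execute_concurrent_with_args
    from cassandra.io.libevreactor import LibevConnection  # libev, not asyncore

    # hdfs3 + pyarrow: read and load the Parquet population file
    hdfs = hdfs3.HDFileSystem(host=hdfs_host)
    with hdfs.open(parquet_path) as f:
        population = pq.read_table(f).to_pydict()["partnerid"]

    cluster = Cluster(scylla_hosts, connection_class=LibevConnection)
    session = cluster.connect("ids")
    lookup = session.prepare(
        "SELECT cookieid FROM ids_by_partnerid WHERE partnerid = ? LIMIT 1")

    # execute_concurrent(): fire lookups in parallel, concurrency > default 100
    results = execute_concurrent_with_args(
        session, lookup, [(p,) for p in population], concurrency=512)
    return sum(1 for success, rows in results if success and rows.current_rows)
```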
Keeping (in sync) two copies of the same data
Batch data freshness
Operational burden
=> neither can sustain both the read and write workloads
Can Scylla sustain our ID matching tables workloads while maintaining consistently low upsert/write and lookup/read latencies?
Simpler data consistency
Operational simplicity and efficiency
Reduced costs
Always try a technology under the best omens :)
Running Gentoo Linux
ID translations must be done both ways => denormalization
I wrote tests on my dataset so I could concentrate on the model while making sure that all my questions were being answered correctly and consistently.
We ended up with three denormalized tables
History-like table (just like a log)
Optimize for latest value?
This ensures that the latest values (rows) are stored at the beginning of the sstable file, effectively reducing read latency when the row is not in cache!
Docker-based, easy to install, multi-environment support
Understand the performance of your cluster
Tune your workload for optimal performance
Reference dataset: data cardinality, representative volumes
On the MongoDB cluster, make sure to shard and index the dataset just like you do on the production collections.
On Hive, respect the storage file format of your current implementations as well as their partitioning.
How many machines in production?! Say it!
It’s time to break Scylla: your goal here is to saturate the Scylla cluster and get it to ~90% load.
read the 10M population rows from Hive in a partitioned manner
for each partition (slice of 10M), query Scylla to lookup the possibly matching partnerid
create a dataframe from the resulting matches
gather back all the dataframes and merge them
count the number of matches
Spark 2 cold is 12min
Spark 2 hot is 2min
I experienced pretty poor performance at first; Grafana monitoring showed that Scylla was not the bottleneck.
Repartitioning leverages the driver’s knowledge of how the data is sharded to optimize how it is split between Spark workers.
Take your clusters’ utilization into account
With spinning disks, the cold-start result can compete with the results of a heavily loaded Hadoop cluster where pending containers and parallelism are knocking down its performance
Those three refurbished machines can compete with our current production machines and implementations
They can even match an idle Hive cluster of a medium size
DIGRESSION!
I went on the crazy quest of beating Spark 2 performance using a pure Python implementation.
The main problem in competing with Spark 2 is that it is a distributed framework and Python by itself is not.
So you can’t possibly imagine outperforming Spark 2 with a single machine.
Spark 2 is shipped and run on executors using YARN, so we are firing up JVMs and dispatching containers all the time.
This is quite an expensive process that we have a chance to avoid using Python!
joinWithCassandraTable JOINs 10M with 400M...
The libhdfs3 + pyarrow combo: it is faster to load everything on a single machine than to load from Hive on multiple ones!
The Hive loading + partitioning went down from 50s to 10s
The conclusion of the evaluation was not driven by the good figures we got out of our test workloads.
These are no benchmarks and never pretended to be, but we could still prove that performance was solid enough not to block the adoption of Scylla.
Instead we decided on the following points of interest (in no particular order):
data consistency
production reliability
datacenter awareness
ease of operation
infrastructure rationalisation
Developer friendliness (but it’s not Mongo)
Costs (training engineers)