This case study concerns moving large amounts of patent data from Cassandra to Solr: how we approached the problem, how Spark was introduced as a solution, and how we optimized the Spark job. I will cover:
* Understanding the parts of a Spark job: which components run where, and common issues.
* Adding metrics to show where pain points are in your code.
* Comparing various methods in the API to achieve more performant code.
* How we saved time and made a repeatable process with Spark.
What’s Next?
• Optimization of the C2S Spark job
• More Spark jobs
• Newer version of Spark & DSE
• Scala Spark jobs instead of Java
Editor’s Notes
DataStax Certified Cassandra Architect
Created Trireme
OSC specializes in search, discovery, and analytics solutions.
We have published quite a few books and series, including:
Apache Solr Enterprise Search Server (Packt)
Relevant Search (Manning)
Building a Search Server with Elasticsearch
Technologies:
Spark
Cassandra
Solr
Elasticsearch
Camel
225 years of patent data starting in 1790
Patents are stored as TIF images, with XML documents providing the metadata (currently around 250 fields per patent)
Multiple collections spanning many countries (2 currently implemented with an additional 5 coming online this year)
Supports a custom query syntax which has been used at the Patent Office over the past 30 years
DataStax Enterprise 4.5 and 4.6
Cassandra 2.0
Solr 4.10.2
Spark 0.9.2 and 1.1 (more on that later)
CSS2C – reads compressed XML documents into Cassandra tables
C2S – loads data from many Cassandra tables into Solr documents
CSS2C is fairly fast. One process is spun up per archive. Each archive can span many years of data.
C2S reads data for a given partition (year and month) converting them into patent documents and shipping them to Solr Cloud
Process is kicked off on a utility server
Data is read from the given partition
Records are iterated over (subsequent queries are made) and then pushed to Solr Cloud
Communication with Solr is through Solr4J client which is pointed at a load balancer
Process was scaled out by running multiple processes from multiple utility servers.
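To make that concrete, here is a rough sketch of what one of those utility-server processes looked like, assuming the DataStax Java driver and the SolrJ 4.x client; the keyspace, table, and column names are hypothetical.

```java
// Hypothetical sketch of the original C2S process: one JVM per partition,
// reading from Cassandra and pushing documents to Solr through a load balancer.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class C2SProcess {
    public static void main(String[] args) throws Exception {
        int year = Integer.parseInt(args[0]);
        int month = Integer.parseInt(args[1]);

        Cluster cluster = Cluster.builder().addContactPoint("cassandra-node").build();
        Session session = cluster.connect("patents");   // keyspace name is illustrative

        // SolrJ 4.x client pointed at the load balancer, not at a specific Solr node
        HttpSolrServer solr = new HttpSolrServer("http://solr-lb:8983/solr/patents");

        // Read one (year, month) partition and iterate; the subsequent per-record queries are omitted
        for (Row row : session.execute(
                "SELECT * FROM patent_metadata WHERE year = ? AND month = ?", year, month)) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", row.getString("patent_id"));
            // ... map the remaining ~250 metadata fields ...
            solr.add(doc);
        }
        solr.commit();
        solr.shutdown();
        cluster.close();
    }
}
```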
Think about this for a minute. For each partition of data you have to fire up a process on a utility server. If you have many partitions, this scales out to many servers. You have to log in to each machine, start a screen session, kick off the process, rinse and repeat.
What happens when a partition has an error? How do you track what is being run and what has finished? This ultimately led to a gnarly Excel file. Gross.
Did it work?
Technically, yes
Why change it?
It didn’t meet the SLA. Even with a fairly large number of processes running, we couldn’t meet the re-ingestion SLA requirements
How could we make it better?
There are two possible approaches
Optimize the C2S process
add caching
multi-thread where possible
We ended up doing this. It met the SLA, but just barely. We asked ourselves “What happens when the dataset increases?”
Look for a new way to ingest the data
Instead of moving the data to the code for ingestion, move the code to the data.
Our system of record is Cassandra running on DSE. Let’s use Spark (which is included within DSE) to run ingestion jobs.
Benefits:
Data is local to the node running the job. Loading the table content into an RDD pulls from the local node. There are no extra network requests. Ingestion occurs on the node where the data resides.
Built in job tracker – multiple jobs may be queued up
Dashboard to view output and see the status of jobs
We could perform joins with our data!
Here’s the original architecture again
In the new approach the job is submitted to the Spark cluster.
Joined data is loaded into an RDD
The RDD is mapped into Solr documents
Solr documents are batched and pushed to Solr Cloud
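A minimal sketch of how loading and joining the data might look, assuming the DataStax spark-cassandra-connector Java API on Spark 1.x (Java 8 lambdas used for brevity); the keyspace, table names, and join key are illustrative.

```java
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

import com.datastax.spark.connector.japi.CassandraRow;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class C2SSparkJob {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("C2S")
                .set("spark.cassandra.connection.host", "127.0.0.1"); // usually preconfigured on DSE
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each executor reads the token ranges that live on its own Cassandra node,
        // so data stays local until it is shipped to Solr.
        JavaPairRDD<String, CassandraRow> metadata = javaFunctions(sc)
                .cassandraTable("patents", "patent_metadata")         // illustrative names
                .keyBy(row -> row.getString("patent_id"));

        JavaPairRDD<String, CassandraRow> claims = javaFunctions(sc)
                .cassandraTable("patents", "patent_claims")
                .keyBy(row -> row.getString("patent_id"));

        // The joined RDD: one entry per patent, carrying rows from both tables.
        JavaPairRDD<String, Tuple2<CassandraRow, CassandraRow>> joined = metadata.join(claims);

        // ... map each entry to a SolrInputDocument and push batches to Solr Cloud ...
    }
}
```

The real job involved more tables and filtered by partition as well; this only shows the shape of the loading and join step.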
Q: How did this work?
A: Not too well. It was a little faster than the original process, but not by much. There was no major load on the Solr cluster; the bottleneck was definitely within the Spark job.
How did we move forward? Metrics, Metrics, Metrics
By running the job with metrics enabled: we instrumented every method call with timings and collated the results when the job completed. This painted a pretty clear picture.
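The instrumentation doesn’t have to be elaborate; here is a hypothetical sketch of the kind of timing helper we mean (not our actual code).

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical timing helper: accumulate total nanoseconds per labelled step,
// then dump the totals when the job finishes.
public class StepTimer {
    private static final Map<String, AtomicLong> TOTALS = new ConcurrentHashMap<>();

    public static <T> T time(String step, Callable<T> work) throws Exception {
        long start = System.nanoTime();
        try {
            return work.call();
        } finally {
            TOTALS.computeIfAbsent(step, k -> new AtomicLong())
                  .addAndGet(System.nanoTime() - start);
        }
    }

    public static void report() {
        TOTALS.forEach((step, nanos) ->
                System.out.printf("%-30s %,d ms%n", step, nanos.get() / 1_000_000));
    }
}
```

Because the work runs on the executors, the totals live in each worker JVM; in practice they have to be logged per worker or fed back through Spark accumulators so they can be collated on the driver when the job completes.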
The majority of our work was being done in a foreach on the joined RDD. Each iteration within the foreach loop would connect, send the document, then continue.
The logic which created a connection to the SolrCloud cluster was a huge drain on time. The creation of the HTTP client took 4 times longer than any other part of the iteration.
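The hot loop looked roughly like this sketch (not the exact code); note the client that is created and torn down for every single document. toSolrDocument is a hypothetical mapping helper.

```java
// Anti-pattern: a new SolrJ client (and underlying HTTP connection pool) per document.
joined.foreach(entry -> {
    HttpSolrServer solr = new HttpSolrServer("http://solr-lb:8983/solr/patents");
    solr.add(toSolrDocument(entry));   // hypothetical helper mapping the joined rows to a SolrInputDocument
    solr.shutdown();
});
```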
We determined that a single Solr Cloud connection (rather than one per document) would be sufficient. We tried declaring a shared connection in a few places, but ran into issues, like the connection not being in scope: the functions Spark ships to the executors are serialized, so a client created on the driver isn’t simply available inside them.
We did some digging in the documentation and found foreachPartition(). This looked perfect! The catch? It wasn’t available in the 0.9.2 Java API, only in Scala, which we didn’t have experience with or permission to use.
Digging through the APIs some more, we did find a mapPartitions() method that was available. We refactored our code to run mapPartitions() on the joined RDD. Each partition would instantiate its own Solr connection and reuse it for every document. The only problem was that we had removed our action (foreach()), so nothing would actually execute. This was solved by calling collect() on the RDD returned from our mapPartitions() invocation.
This solved our performance issue with instantiating and tearing down a bunch of Solr connections. Everything appeared to be fixed, but now we were occasionally getting out-of-memory exceptions: collect() was hauling every mapped document back to the driver. This was resolved by changing our mapPartitions() function to return a count of the documents processed instead of the documents themselves.
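Put together, the refactored job looked roughly like this sketch (batch size and names are illustrative; in the Spark 1.x Java API the mapPartitions function returns an Iterable, while in 2.x+ it returns an Iterator).

```java
// Continues the Spark job from the earlier sketch; needs java.util.* plus the SolrJ imports.
// One Solr connection per partition, and only a count comes back to the driver.
JavaRDD<Integer> counts = joined.mapPartitions(rows -> {
    HttpSolrServer solr = new HttpSolrServer("http://solr-lb:8983/solr/patents");
    List<SolrInputDocument> batch = new ArrayList<>();
    int sent = 0;
    while (rows.hasNext()) {
        batch.add(toSolrDocument(rows.next()));   // hypothetical mapping helper
        if (batch.size() >= 500) {                // illustrative batch size
            solr.add(batch);
            sent += batch.size();
            batch.clear();
        }
    }
    if (!batch.isEmpty()) {
        solr.add(batch);
        sent += batch.size();
    }
    solr.shutdown();
    return Collections.singletonList(sent);       // return a count, not the documents
});

// collect() is the action that triggers execution; it only pulls back one
// integer per partition instead of every document.
long total = 0;
for (int partitionCount : counts.collect()) {
    total += partitionCount;
}
```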
Everything appeared to be working fine. We ran the job and looked at our monitoring while the job executed. We were seeing fantastic throughput on the Spark job, but then everything failed.
What happened? The Solr cluster failed. Why? The naïve approach of using a load balancer to spread traffic around ended up taking down the cluster. Documents that landed on the wrong node were forwarded on to the appropriate node in the cluster; couple that internal forwarding with all of the traffic coming from Spark, and the nodes were overloaded.
How can we fix this?
We changed our Solr client to be Solr Cloud aware. The client communicates with ZooKeeper, which keeps track of cluster state, and can now send each document directly to the appropriate node, eliminating the intra-cluster forwarding of documents.
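With the SolrJ that ships alongside Solr 4.10.2, that cluster-aware client is CloudSolrServer, which takes the ZooKeeper connection string instead of a Solr URL. A minimal sketch (hostnames and collection name are illustrative):

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

// ZooKeeper-aware SolrJ client: reads cluster state from ZooKeeper and routes
// each document straight to the correct shard, so no load balancer and no
// intra-cluster forwarding of update requests.
CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
solr.setDefaultCollection("patents");        // illustrative collection name
solr.connect();

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "US-1234567");            // illustrative document
solr.add(doc);
solr.commit();
solr.shutdown();
```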
Here is the updated ingestion process.
Note the removal of the load balancer and communication between the Spark and ZooKeeper nodes.
The new Spark-based process was well within the SLA. It provided additional admin features and …
Take some of the optimizations from the original multithreaded C2S job and apply them within the Spark job (caches, etc.)
Add additional jobs (parity checking)
Upgrade DSE (thus upgrading Spark)
Write our Spark jobs in Scala instead of Java to have the more robust API available