Lessons Learned with Spark at the US Patent & Trademark Office

This case study concerns moving large amounts of patent data from Cassandra to Solr: how we approached the problem, the introduction of Spark as a solution, and how we optimized the Spark job. I will cover:
* Understanding the parts of a Spark job: which components run where, and common issues.
* Adding metrics to show where the pain points are in your code.
* Comparing various methods in the API to achieve more performant code.
* How we saved time and made a repeatable process with Spark.

Lessons Learned with Spark at the US Patent & Trademark Office
Christopher Bradford
Big Data Architect at OpenSource Connections

Christopher Bradford
Twitter: @bradfordcp
GitHub: bradfordcp

EST – Data Loading
CSS Ingestion (CSS2C) Solr Ingestion (C2S)

EST – C2S Process
Note: some connections are omitted for clarity

EST – C2S Process (Scaled Out)
Note: some connections are omitted for clarity

EST – C2S Review
Did it work?
Why change it?
How could we make it better?

EST – Old C2S Process
Note: some connections are omitted for clarity

EST – Spark C2S Process
Note: some connections are omitted for clarity

Poor Performance
joinedRDD = …
joinedRDD.foreach()
    document = … // build document
    sc = new SolrConnection()
    sc.push(document)
    sc.disconnect()
// Job is done

Poor Performance
sc = new SolrConnection()
sc.push(document)
sc.disconnect()

Optimum Performance
// Attempt 1: one connection shared by every foreach() iteration
joinedRDD = …
sc = new SolrConnection()
joinedRDD.foreach()
    document = … // build document
    sc.push(document)
sc.disconnect()
// Job is done
// Attempt 2: one connection per partition with foreachPartition()
joinedRDD = …
joinedRDD.foreachPartition()
    sc = new SolrConnection()
    partition.foreach()
        document = … // build document
        sc.push(document)
    sc.disconnect()
// Job is done
Almost

The Solution!
// Variant 1: return the processed rows (occasionally ran out of memory)
joinedRDD = …
joinedRDD.mapPartitions()
    sc = new SolrConnection()
    partition.foreach()
        document = … // build document
        sc.push(document)
    sc.close()
    return partition.rows
.collect()
// Variant 2: return only a count of the rows processed
joinedRDD = …
joinedRDD.mapPartitions()
    sc = new SolrConnection()
    partition.foreach()
        document = … // build document
        sc.push(document)
    sc.close()
    return partition.rows.count
.collect()

Better Solr Indexing
Note: some connections are omitted for clarity

EST – Spark C2S Process v2
Note: some connections are omitted for clarity

Success?
YUP
5x faster than the original C2S process (with optimizations)

What’s Next?
• Optimization of the C2S Spark job
• More Spark jobs
• Newer version of Spark & DSE
• Scala Spark jobs instead of Java

Lessons Learned with Spark at the US Patent & Trademark Office

3. OpenSource Connections

4. Exploring Search Technologies - EST

5. EST – Technology Stack

13. How did this work out? Poorly

18. Results?

19. Solr Indexing

Editor's Notes

DataStax Certified Cassandra Architect. Created Trireme.
OSC specializes in search, discovery, and analytics solutions. We have published quite a few books and series, including Apache Solr Enterprise Search Server (Packt), Relevant Search (Manning), and Building a Search Server with Elasticsearch. Technologies: Spark, Cassandra, Solr, Elasticsearch, Camel.
225 years of patent data, starting in 1790. Patents are currently stored as TIF images, with XML documents providing metadata (currently around 250 fields per patent). Multiple collections spanning many countries (2 currently implemented, with an additional 5 coming online this year). Supports a custom query syntax that has been used at the Patent Office over the past 30 years.
DataStax Enterprise 4.5 and 4.6; Cassandra 2.0; Solr 4.10.2; Spark 0.9.2 and 1.1 (more on that later).
CSS2C reads compressed XML documents into Cassandra tables. C2S loads data from many Cassandra tables into Solr documents. CSS2C is fairly fast: one process is spun up per archive, and each archive can span many years of data. C2S reads data for a given partition (year and month), converts the records into patent documents, and ships them to Solr Cloud.
The process is kicked off on a utility server. Data is read from the given partition. Records are iterated over (subsequent queries are made along the way) and then pushed to Solr Cloud. Communication with Solr goes through the SolrJ client, which is pointed at a load balancer.
The process was scaled out by running multiple processes from multiple utility servers. Think about this for a minute. For each partition of data you have to fire up a process on a utility server. Should you have many partitions, this will scale out to many servers. Each machine must be logged in to, a screen session started, and the process kicked off; rinse and repeat. What happens when a partition has an error? How do you track what is being run and what has finished? This ultimately led to a gnarly Excel file. Gross.
Did it work? Technically, yes. Why change it? It didn’t meet the SLA: even with a fairly large number of processes running, we couldn’t meet the re-ingestion SLA requirements. How could we make it better? There were two possible approaches. (1) Optimize the C2S process: add caching and multi-thread where possible. We ended up doing this; it met the SLA, but just barely, and we asked ourselves, “What happens when the dataset increases?” (2) Look for a new way to ingest the data.
Instead of moving the data to the code for ingestion, move the code to the data. Our system of record is Cassandra running on DSE, so let’s use Spark (which is included within DSE) to run ingestion jobs. Benefits: Data is local to the node running the job; loading the table content into an RDD pulls from the local node, so there are no extra network requests, and ingestion occurs on the node where the data resides. A built-in job tracker lets multiple jobs be queued up. A dashboard shows output and the status of jobs. We could perform joins with our data!
Here’s the original architecture again
In the new approach, the job is submitted to the Spark cluster. Joined data is loaded into an RDD. The RDD is mapped into Solr documents. Solr documents are batched and pushed to Solr Cloud.
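The batching step can be sketched in plain Java. This is a minimal illustration, not the job's actual code: the class `Batching`, the method `sendInBatches`, and the `sender` callback are hypothetical names, and the callback stands in for whatever call pushes one batch to Solr Cloud.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not taken from the talk.
final class Batching {
    // Hands documents to `sender` in lists of at most batchSize, preserving order.
    // Returns the number of batches that were sent.
    static <T> int sendInBatches(Iterator<T> documents, int batchSize,
                                 Consumer<List<T>> sender) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batchesSent = 0;
        List<T> current = new ArrayList<>(batchSize);
        while (documents.hasNext()) {
            current.add(documents.next());
            if (current.size() == batchSize) {
                sender.accept(current);                // one request per full batch
                batchesSent++;
                current = new ArrayList<>(batchSize);  // never reuse a sent list
            }
        }
        if (!current.isEmpty()) {                      // flush the final partial batch
            sender.accept(current);
            batchesSent++;
        }
        return batchesSent;
    }
}
```

Working from an iterator means only one batch is held in memory at a time, which fits the per-partition iterators that Spark hands to a function.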
Q: How did this work? A: Not too well. It was a little faster than the original process, but not by much. There was no major load on the Solr cluster; the bottleneck was definitely within the Spark job. How did we move forward? Metrics, metrics, metrics. We ran the job with metrics enabled: we instrumented every method call with timings and collated the results when the job completed. This painted a pretty clear picture.
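A rough idea of what such instrumentation could look like, in plain Java 8. The `StepTimings` class and its labels are hypothetical; the talk does not show the metrics code that was actually used. In a real Spark job the per-executor totals would still have to be brought back to the driver, which is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper: accumulates wall-clock time per labelled step.
final class StepTimings {
    private final Map<String, Long> totalNanos = new LinkedHashMap<>();
    private final Map<String, Integer> calls = new LinkedHashMap<>();

    // Runs the step, records how long it took, and returns its result.
    <T> T time(String label, Supplier<T> step) {
        long start = System.nanoTime();
        try {
            return step.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            totalNanos.merge(label, elapsed, Long::sum);
            calls.merge(label, 1, Integer::sum);
        }
    }

    int callCount(String label) {
        return calls.getOrDefault(label, 0);
    }

    long totalNanos(String label) {
        return totalNanos.getOrDefault(label, 0L);
    }

    // One line per step, in first-seen order, for collating when the job ends.
    String report() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Long> e : totalNanos.entrySet()) {
            out.append(String.format("%s: %d calls, %.3f ms total%n",
                    e.getKey(), calls.get(e.getKey()), e.getValue() / 1e6));
        }
        return out.toString();
    }
}
```

Wrapping each step, for example `timings.time("connect", () -> openConnection())` with a hypothetical `openConnection()`, makes it easy to compare how much time goes to connecting versus building or sending a document.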
The majority of our work was being done in a foreach on the joined RDD. Each iteration within the foreach loop would connect, send the document, then continue.
The logic that created a connection to the SolrCloud cluster was a huge drain on time. Creating the HTTP client took 4 times longer than any other part of the iteration.
We determined that a single Solr Cloud connection was sufficient. We tried declaring a shared connection in a few places, but ran into issues (such as it not being in scope). We did some digging in the documentation and found foreachPartition. This looked perfect! The catch? It wasn’t available in the 0.9.2 Java API, only in Scala, which we didn’t have experience with or permission to use.
Digging through the APIs some more, we found a mapPartitions() method that was available. We refactored our code to run mapPartitions() on the joined RDD. Each partition would instantiate its own Solr connection and reuse it for each document. The only problem was that we had removed our action (foreach()). We solved this by calling collect() on the RDD returned from our mapPartitions() invocation. This fixed our performance issue with instantiating and tearing down a bunch of Solr connections. Everything appeared to be fixed, but now we were occasionally getting out-of-memory exceptions. We resolved this by changing our mapPartitions function to return a count of the documents processed instead of the documents themselves.
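The effect of moving the connection from per-record to per-partition can be simulated without Spark. In this sketch the partitions are plain lists, `FakeSolrConnection` is a hypothetical stand-in that only counts how often it is opened, and `indexPerPartition` imitates calling collect() on the result of mapPartitions(); none of these names come from the original job.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a Solr client; it only counts what happens to it.
final class FakeSolrConnection implements AutoCloseable {
    static int opened = 0;
    static int pushed = 0;

    FakeSolrConnection() { opened++; }
    void push(String document) { pushed++; }
    @Override public void close() { }
}

final class PartitionIndexing {
    // Slow pattern from the talk: a new connection for every record.
    static void indexPerRecord(List<List<String>> partitions) {
        for (List<String> partition : partitions) {
            for (String row : partition) {
                try (FakeSolrConnection sc = new FakeSolrConnection()) {
                    sc.push("doc:" + row);
                }
            }
        }
    }

    // Body of a mapPartitions-style function: one connection per partition,
    // and only a count is returned so the driver never holds the documents.
    static Iterator<Integer> indexPartition(Iterator<String> rows) {
        int count = 0;
        try (FakeSolrConnection sc = new FakeSolrConnection()) {
            while (rows.hasNext()) {
                sc.push("doc:" + rows.next());
                count++;
            }
        }
        return Collections.singletonList(count).iterator();
    }

    // Stands in for mapPartitions(...).collect() on the driver.
    static List<Integer> indexPerPartition(List<List<String>> partitions) {
        List<Integer> counts = new ArrayList<>();
        for (List<String> partition : partitions) {
            indexPartition(partition.iterator()).forEachRemaining(counts::add);
        }
        return counts;
    }
}
```

With two partitions of three and two rows, the per-record version opens five connections while the per-partition version opens two, and the collected result is just one small integer per partition.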
Everything appeared to be working fine. We ran the job and looked at our monitoring while the job executed. We were seeing fantastic throughput on the Spark job, but then everything failed.
What happened? The Solr Cluster failed. Why? The naïve approach of using a load balancer to send traffic around ended up taking down the cluster. Requests to certain nodes would be forwarded to the appropriate node in the cluster. Couple that with all of the traffic from Spark and the nodes were being overloaded.
How can we fix this? We changed our Solr client to be Solr Cloud aware. Our client communicates with ZooKeeper, which keeps track of cluster state. Our client may now send a document directly to the appropriate node, which alleviates the intra-cluster forwarding of document requests.
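A toy model shows why this matters. Everything in it is an assumption made for illustration: nodes are numbered 0..n-1, the load balancer is round-robin, and ownership is decided by a simple hash of the document id, which is not how Solr actually assigns documents to shards.

```java
import java.util.List;

// Toy model only; names and the ownership rule are hypothetical.
final class RoutingSketch {
    // Assumed ownership rule: hash of the id modulo the node count.
    static int ownerOf(String docId, int nodeCount) {
        return Math.floorMod(docId.hashCode(), nodeCount);
    }

    // Round-robin load balancer: the receiving node must forward the
    // document whenever it is not the owner.
    static int forwardsViaLoadBalancer(List<String> docIds, int nodeCount) {
        int forwards = 0;
        int next = 0;
        for (String id : docIds) {
            int receiver = next;
            next = (next + 1) % nodeCount;
            if (receiver != ownerOf(id, nodeCount)) {
                forwards++;
            }
        }
        return forwards;
    }

    // Cluster-aware client: it looks up the owner first and sends there,
    // so the receiving node never has to forward.
    static int forwardsViaAwareClient(List<String> docIds, int nodeCount) {
        int forwards = 0;
        for (String id : docIds) {
            int receiver = ownerOf(id, nodeCount);
            if (receiver != ownerOf(id, nodeCount)) {
                forwards++;
            }
        }
        return forwards;
    }
}
```

In this model every forward is an extra network hop inside the cluster on top of the indexing work itself, which is the kind of load the cluster-aware client removes.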
Here is the new updated ingestion process. Note the removal of the load balancer and communication between the Spark and ZooKeeper nodes.
The new Spark-based process was well within the SLA. It also provided additional admin features and …
Take some of the optimizations from the original multithreaded C2S job and apply them within the Spark job (caches, etc.). Add additional jobs (parity checking). Upgrade DSE (thus upgrading Spark). Write our Spark jobs in Scala instead of Java to have the more robust API available.
