Leveraging the Power of SOLR with SPARK

Johannes Weigend
QAware GmbH Germany

Apache Big Data Europe

September 2015
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Welcome
• Johannes Weigend

- CTO QAware GmbH

- Software architect / developer

- 25 years of experience

- Custom enterprise solutions (Java, JS,…)

- Lecturer for UI development at the University of Applied Science in Rosenheim

- Focus on performance and scalability

- SOLR user since 2011
Brute Force Data Analysis
Dataflow (data not indexed): Read -> Filter -> Map on each of three parallel branches, then a single Reduce
foreach()
-> Minutes / Hours
Search Based Data Analysis
Dataflow (indexed data): Search -> Filter -> Map on each of three parallel branches, then a single Reduce
Indexed data (there’s no free lunch)
foreach()
-> Seconds/Minutes
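
A minimal sketch of the difference between the two dataflows, in plain Python rather than SOLR or SPARK. The sample records and the helper names (`scan`, `build_index`, `search`) are made up for illustration. The brute-force version reads and filters every record on every query; the search-based version pays for building an index once (the "no free lunch" part) and then answers a query by lookup.

```python
from collections import defaultdict

# Hypothetical sample data: (id, text) pairs standing in for log records.
RECORDS = [
    (1, "login ok user alice"),
    (2, "payment failed user bob"),
    (3, "login failed user carol"),
    (4, "logout ok user alice"),
]

def scan(records, term):
    """Brute force: read and filter every record on every query."""
    return [rid for rid, text in records if term in text.split()]

def build_index(records):
    """Index once up front: map each term to the ids of records containing it."""
    index = defaultdict(set)
    for rid, text in records:
        for token in text.split():
            index[token].add(rid)
    return index

def search(index, term):
    """Search based: a single dictionary lookup instead of a full pass."""
    return sorted(index.get(term, set()))

INDEX = build_index(RECORDS)
print(scan(RECORDS, "failed"))    # [2, 3]
print(search(INDEX, "failed"))    # [2, 3]
```

Both functions return the same ids; only the amount of data touched per query differs.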
Agenda
1. SOLR cloud (demo)
2. SPARK cluster (demo)
3. Importing data into SOLR with SPARK (demo)
4. Analysis with SOLR and SPARK (demo)
• Horizontally scalable, distributed NoSQL (Index) Database
• Document oriented

• A document is a collection of fields (string, number, date, …)

• Simple and multiple fields (similar to arrays)

• Schema and schemaless

• Powerful query language (Lucene)

• Distributed data in shards

• Replication

• Powerful full text search capabilities

• Aggregation functions (aka facets)

• Stable -> V 5.3
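
As a rough illustration of the document model above, a document can be pictured as a flat map of field names to values, where a multi-valued field holds a list. The field names below (`id`, `level`, `duration_ms`, `timestamp`, `tags`) are invented for this sketch and are not taken from the talk.

```python
import json

# Hypothetical document: every field name here is an assumption for illustration.
doc = {
    "id": "log-0001",                       # single-valued string field
    "level": "ERROR",                       # single-valued string field
    "duration_ms": 125,                     # number field
    "timestamp": "2015-09-28T10:15:00Z",    # date field in ISO 8601 form
    "tags": ["checkout", "payment"],        # multi-valued field, similar to an array
}

def multi_valued_fields(document):
    """Return the names of the fields that hold more than one value."""
    return sorted(name for name, value in document.items() if isinstance(value, list))

# A JSON array of documents is a common payload shape for a JSON update request.
payload = json.dumps([doc])
print(multi_valued_fields(doc))   # ['tags']
```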
SOLR@QAware
• AIR

• Aftersales Information Research

• ZEBRA

• Part explosion for complex products

• EKG 

• Software Electro Cardiogram

• QAsearch

• Enterprise search across all repositories including history
Apache SOLR for BigData Analysis?
• Text Search Engine?

• Aggregations?

• Slice and Dice?

• Pivots?
Demo: SOLR Cloud
• Installing and configuring SOLR Cloud

• Searching, sorting and filtering

• Facets

• Terms (count by term)

• Ranges (count in range)

• Functions (avg, sum, …)

• Sub-Facets (pivot)
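
The facet types listed above map onto Solr's JSON Facet API, which was introduced in the Solr 5.x line and takes a `json.facet` request parameter. The sketch below only builds such a request; the field names (`level`, `duration_ms`, `host`) and the bucket settings are assumptions for illustration, and nothing is sent to a server.

```python
import json

def build_facets():
    """Build a json.facet body with one facet of each kind from the list above."""
    return {
        # Terms facet: count documents per distinct value of a field.
        "by_level": {"type": "terms", "field": "level"},
        # Range facet: count documents per fixed-width bucket.
        "by_duration": {
            "type": "range", "field": "duration_ms",
            "start": 0, "end": 1000, "gap": 100,
        },
        # Function facet: a single aggregate over all matching documents.
        "avg_duration": "avg(duration_ms)",
        # Sub-facet (pivot): a terms facet nested inside another terms facet.
        "by_host": {
            "type": "terms", "field": "host",
            "facet": {"by_level": {"type": "terms", "field": "level"}},
        },
    }

# Query parameters for a /select request; rows=0 asks for facet results only.
params = {"q": "*:*", "rows": 0, "json.facet": json.dumps(build_facets())}
print(params["json.facet"])
```

Nesting a `facet` block inside a terms facet is what turns a flat count into a pivot: each host bucket gets its own breakdown by level.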
Counting as Term Facet
Statistics as Function Facet
Pivots as Sub Facets
careerbuilder.com
Banana
What’s Missing?
• Client-side processing of SOLR results does not scale

• No built-in M/R support

• Where to store really big data?

• Images

• Videos

• Binaries / large text documents

• No interfaces to R / ML
• Distributed job execution engine

• Map/Reduce framework

• Scala based (runs on JVM)

• Java/Scala/Python APIs

• Processes data from various data sources

• Textfiles (accessible from all nodes)

• Hadoop File System (HDFS)

• Databases (JDBC)

• SOLR!
Must Read: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Combining Spark with SOLR
• Use Cases

• Distributed ETL – Importing data into SOLR-Cloud

• Our use case: importing N logfiles into SOLR

• Distributed processing – data analysis

• Statistics on binary data

• Map/Reduce
Four Ways to Import Data into SOLR
1. Using built-in functions: post script, Dataimport handler, Admin-UI

2. Writing custom parallel code using the SOLRJ API

3. Using and customizing Apache Nutch (Hadoop!)

4. Using and customizing Apache Spark
Demo: Import Logfiles with Spark
• Writing a Spark job which imports a bunch of logfiles in one directory

• Using Lucidworks’ spark-solr library
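
A minimal sketch of the per-line work such an import job has to do, written as plain Python functions so it runs without a cluster. The log format, the regular expression and the field names are assumptions, not the format used in the demo. In a Spark job, a function like `to_document` would typically be handed to a `map` over the lines of the files, and the resulting documents passed on to the indexing library.

```python
import re

# Assumed line format: "<ISO timestamp> <LEVEL> <message>"
LINE = re.compile(r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$")

def to_document(path, line_no, line):
    """Turn one log line into a dict of fields, or None if it does not parse."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    doc = match.groupdict()
    # A unique id lets a re-run overwrite documents instead of duplicating them.
    doc["id"] = f"{path}:{line_no}"
    return doc

def to_documents(path, lines):
    """Parse all lines of one file, skipping the ones that do not match."""
    docs = (to_document(path, n, line) for n, line in enumerate(lines, start=1))
    return [d for d in docs if d is not None]

sample = [
    "2015-09-28T10:15:00Z INFO order created",
    "garbage",
    "2015-09-28T10:15:02Z ERROR payment failed",
]
docs = to_documents("app.log", sample)
print(len(docs))  # 2
```

Keeping the parser free of any Spark or SOLR dependency makes it easy to test on a laptop before running it across many files.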
Demo: Distributed Analysis with Spark
• Write a Spark job which calculates the duration of business actions
• Use Spark to access SOLR via SQL / JDBC
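
One way to picture the duration calculation is as a map step that emits `(action id, timestamp)` pairs and a reduce step that keeps the earliest and latest timestamp per action. The sketch below does this in plain Python with made-up events; the event shape and the definition of duration (last event minus first event) are assumptions, not the demo's actual job.

```python
from functools import reduce

# Hypothetical events: (business action id, timestamp in milliseconds).
EVENTS = [
    ("order-1", 1000), ("order-2", 1200), ("order-1", 1750),
    ("order-2", 1300), ("order-1", 1400),
]

def to_pair(event):
    """Map: turn an event into (key, (min_ts, max_ts)) with both set to its timestamp."""
    action, ts = event
    return action, (ts, ts)

def merge(a, b):
    """Reduce: combine two (min_ts, max_ts) ranges; associative and commutative."""
    return min(a[0], b[0]), max(a[1], b[1])

def reduce_by_key(pairs, func):
    """Local stand-in for a distributed reduce-by-key."""
    def step(acc, pair):
        key, value = pair
        acc[key] = func(acc[key], value) if key in acc else value
        return acc
    return reduce(step, pairs, {})

ranges = reduce_by_key(map(to_pair, EVENTS), merge)
durations = {action: hi - lo for action, (lo, hi) in ranges.items()}
print(durations)  # {'order-1': 750, 'order-2': 100}
```

Because `merge` is associative and commutative, partial results computed on different workers can be combined in any order, which is what makes this shape suitable for a distributed reduce.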
SolrRDD - The Spark Abstraction to process SOLR Results

https://github.com/LucidWorks/spark-solr
SPARK Supports Parallel SQL
Dataframe API
Demo cluster: five Odroid XU4 boards (2 GB RAM, 64 GB eMMC disk, Ubuntu Linux, 70$)
• Every board runs a SPARK Worker and one SOLR 5.3 shard (SHARD #0 to SHARD #4)
• Additional services: SPARK Master, ZOOKEEPER, NFS
• Total: 40 cores, 10 GB RAM, 320 GB eMMC disk
Summary
Any Questions?