SlideShare a Scribd company logo
1 of 31
Download to read offline
Leveraging the Power of
SOLR with SPARK

Johannes Weigend
QAware GmbH Germany

pache Big Data Europe

September 2015
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Welcome
• Johannes Weigend

- CTO QAware GmbH

- Software architect / developer

- 25 years of experience

- Custom enterprise solutions (Java, JS,…)

- Lecturer for UI development at the University of
Applied Science in Rosenheim 

- Focus on performance and scalability

- SOLR user since 2011
2
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Brute Force Data Analysis
3
Read Read Read
Filter Filter Filter
Map Map Map
Reduce
Dataflow
Not Indexed
foreach()
-> Minutes / Hours
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Search Based Data Analysis
4
Filter
Search Search Search
Map Map Map
Reduce
DataflowFilter Filter
Indexed Data
(There’s no free lunch)
foreach()
-> Seconds/Minutes
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Agenda
SOLR cloud

Demo
SPARK cluster

Demo
Importing data into SOLR with SPARK

Demo
Analysis with SOLR and SPARK

Demo
5
1
2
3
4
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
• Horizontally scalable, distributed NoSQL (Index) Database
• Document oriented

• A document is a collection of fields (string, number, date, …)

• Simple and multiple fields (similar to arrays)

• Schema and schema less

• Powerful query language (Lucene)

• Distributed data in shards

• Replication

• Powerful full text search capabilities

• Aggregation functions (aka facets)

• Stable —> V 5.3
6
1 2 3 4
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
SOLR@QAware
• AIR

• Aftersales Information Research

• ZEBRA

• Part explosion for complex products

• EKG 

• Software Electro Cardiogram

• QAsearch

• Enterprise search across all repositories including
history
7
8
9
10
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Apache SOLR for BigData Analysis?
• Text Search Engine?

• Aggregations?

• Slice and Dice?

• Pivots?
11
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Demo: SOLR Cloud
• Installing and configuring SOLR Cloud

• Searching, sorting and filtering

• Facets

• Terms (count by term)

• Ranges (count in range)

• Functions (avg, sum, …)

• Sub-Facets (pivot)
12
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Counting as Term Facet
13
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Statistics as Function Facet
14
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Pivots as Sub Facets
15
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
careerbuilder.com
16
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Banana
17
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany 18
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
What’s Missing?
• Client-side processing of SOLR results does not scale

• No built-in M/R support

• Where to store really big data?

• Images

• Videos

• Binaries / large text documents

• No interfaces to R / ML
19
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
• Distributed job execution engine

• Map/Reduce framework

• Scala based (runs on JVM)

• Java/Scala/Python APIs

• Processes data from various data sources

• Textfiles (accessible from all nodes)

• Hadoop File System (HDFS)

• Databases (JDBC)

• SOLR!
20
1 2 3 4
Must Read: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Combining Spark with SOLR
• Use Cases

• Distributed ETL – Importing data into SOLR-
Cloud

• Our Usecase: importing N logfiles into SOLR

• Distributed processing – data analysis

• Statistics on binary data

• Map/Reduce
21
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Four Ways to Import Data into
SOLR
1. Using built-in functions

post script

Dataimport handler,

Admin-UI

2. Writing custom parallel code using the SOLRJ API 

3. Using and customizing Apache Nutch (Hadoop !)

4. Using and customizing Apache Spark
22
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Demo: Import Logfiles with Spark
• Writing a Spark job which imports a bunch of
logfiles in one directory 

• Using Lucidwork’s Solr-Spark library
23
1 2 3 4
24
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Demo: Distributed Analysis with Spark
• Write a Spark Job which calculates the Duration of Business Actions
• Use Spark to access SOLR per SQL / JDBC
25
1 2 3 4
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
SolrRDD - The Spark Abstraction to process SOLR Results

https://github.com/LucidWorks/spark-solr
26
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
SPARK Supports Parallel SQL
27
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Dataframe API
28
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
SPARK Worker
SOLR 5.3
SHARD #4
29
Odroid XU4
2 GB RAM
64 GB eMMC Disk
Ubuntu Linux
70$
SPARK Worker
SOLR 5.3
SHARD #3
SPARK Worker
SOLR 5.3
SHARD #1
SPARK Worker
SOLR 5.3
SHARD #2
SPARK Master
SOLR 5.3
SHARD #0
SPARK Worker
ZOOKEEPER
NFS
40 Cores
10 GB RAM
320 GB eMMC Disk
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Summary
30
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Any Questions ?
31

More Related Content

What's hot

Using Azure Databricks, Structured Streaming, and Deep Learning Pipelines to ...
Using Azure Databricks, Structured Streaming, and Deep Learning Pipelines to ...Using Azure Databricks, Structured Streaming, and Deep Learning Pipelines to ...
Using Azure Databricks, Structured Streaming, and Deep Learning Pipelines to ...Databricks
 
HBase Meetup @ Cask HQ 09/25
HBase Meetup @ Cask HQ 09/25HBase Meetup @ Cask HQ 09/25
HBase Meetup @ Cask HQ 09/25Cask Data
 
Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytic...
Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytic...Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytic...
Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytic...Spark Summit
 
Spark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit
 
Query or Not to Query? Using Apache Spark Metrics to Highlight Potentially Pr...
Query or Not to Query? Using Apache Spark Metrics to Highlight Potentially Pr...Query or Not to Query? Using Apache Spark Metrics to Highlight Potentially Pr...
Query or Not to Query? Using Apache Spark Metrics to Highlight Potentially Pr...Databricks
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development KitJen Aman
 
Agile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAAgile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAPaolo Platter
 
Apache spark y cómo lo usamos en nuestros proyectos
Apache spark y cómo lo usamos en nuestros proyectosApache spark y cómo lo usamos en nuestros proyectos
Apache spark y cómo lo usamos en nuestros proyectosOpenSistemas
 
E2E Data Pipeline - Apache Spark/Airflow/Livy
E2E Data Pipeline - Apache Spark/Airflow/LivyE2E Data Pipeline - Apache Spark/Airflow/Livy
E2E Data Pipeline - Apache Spark/Airflow/LivyRikin Tanna
 
Scaling Your Skillset with Your Data with Jarrett Garcia (Nielsen)
Scaling Your Skillset with Your Data with Jarrett Garcia (Nielsen)Scaling Your Skillset with Your Data with Jarrett Garcia (Nielsen)
Scaling Your Skillset with Your Data with Jarrett Garcia (Nielsen)Spark Summit
 
Should You Read Kafka as a Stream or in Batch? Should You Even Care? | Ido Na...
Should You Read Kafka as a Stream or in Batch? Should You Even Care? | Ido Na...Should You Read Kafka as a Stream or in Batch? Should You Even Care? | Ido Na...
Should You Read Kafka as a Stream or in Batch? Should You Even Care? | Ido Na...HostedbyConfluent
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IODatabricks
 
Big Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardBig Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardITCamp
 
Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit
 
Spark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan RavatSpark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan RavatSpark Summit
 
xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014Claudiu Barbura
 
True Reusable Code - DevSum2016
True Reusable Code - DevSum2016True Reusable Code - DevSum2016
True Reusable Code - DevSum2016Eduard Lazar
 
Lessons Learned in Deploying the ELK Stack (Elasticsearch, Logstash, and Kibana)
Lessons Learned in Deploying the ELK Stack (Elasticsearch, Logstash, and Kibana)Lessons Learned in Deploying the ELK Stack (Elasticsearch, Logstash, and Kibana)
Lessons Learned in Deploying the ELK Stack (Elasticsearch, Logstash, and Kibana)Cohesive Networks
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Spark Summit
 
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowMay 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowAdam Doyle
 

What's hot (20)

Using Azure Databricks, Structured Streaming, and Deep Learning Pipelines to ...
Using Azure Databricks, Structured Streaming, and Deep Learning Pipelines to ...Using Azure Databricks, Structured Streaming, and Deep Learning Pipelines to ...
Using Azure Databricks, Structured Streaming, and Deep Learning Pipelines to ...
 
HBase Meetup @ Cask HQ 09/25
HBase Meetup @ Cask HQ 09/25HBase Meetup @ Cask HQ 09/25
HBase Meetup @ Cask HQ 09/25
 
Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytic...
Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytic...Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytic...
Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytic...
 
Spark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean Wampler
 
Query or Not to Query? Using Apache Spark Metrics to Highlight Potentially Pr...
Query or Not to Query? Using Apache Spark Metrics to Highlight Potentially Pr...Query or Not to Query? Using Apache Spark Metrics to Highlight Potentially Pr...
Query or Not to Query? Using Apache Spark Metrics to Highlight Potentially Pr...
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Agile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAAgile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKA
 
Apache spark y cómo lo usamos en nuestros proyectos
Apache spark y cómo lo usamos en nuestros proyectosApache spark y cómo lo usamos en nuestros proyectos
Apache spark y cómo lo usamos en nuestros proyectos
 
E2E Data Pipeline - Apache Spark/Airflow/Livy
E2E Data Pipeline - Apache Spark/Airflow/LivyE2E Data Pipeline - Apache Spark/Airflow/Livy
E2E Data Pipeline - Apache Spark/Airflow/Livy
 
Scaling Your Skillset with Your Data with Jarrett Garcia (Nielsen)
Scaling Your Skillset with Your Data with Jarrett Garcia (Nielsen)Scaling Your Skillset with Your Data with Jarrett Garcia (Nielsen)
Scaling Your Skillset with Your Data with Jarrett Garcia (Nielsen)
 
Should You Read Kafka as a Stream or in Batch? Should You Even Care? | Ido Na...
Should You Read Kafka as a Stream or in Batch? Should You Even Care? | Ido Na...Should You Read Kafka as a Stream or in Batch? Should You Even Care? | Ido Na...
Should You Read Kafka as a Stream or in Batch? Should You Even Care? | Ido Na...
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 
Big Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardBig Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David Giard
 
Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John Musser
 
Spark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan RavatSpark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan Ravat
 
xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014
 
True Reusable Code - DevSum2016
True Reusable Code - DevSum2016True Reusable Code - DevSum2016
True Reusable Code - DevSum2016
 
Lessons Learned in Deploying the ELK Stack (Elasticsearch, Logstash, and Kibana)
Lessons Learned in Deploying the ELK Stack (Elasticsearch, Logstash, and Kibana)Lessons Learned in Deploying the ELK Stack (Elasticsearch, Logstash, and Kibana)
Lessons Learned in Deploying the ELK Stack (Elasticsearch, Logstash, and Kibana)
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0
 
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowMay 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
 

Similar to Leveraging the power of solr with spark

Real World Analytics with Solr Cloud and Spark
Real World Analytics with Solr Cloud and SparkReal World Analytics with Solr Cloud and Spark
Real World Analytics with Solr Cloud and SparkQAware GmbH
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...DataWorks Summit/Hadoop Summit
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Spark Summit
 
Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020Pavel Hardak
 
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Eren Avşaroğulları
 
A short introduction to Spark and its benefits
A short introduction to Spark and its benefitsA short introduction to Spark and its benefits
A short introduction to Spark and its benefitsJohan Picard
 
Spark Summit EU talk by Heiko Korndorf
Spark Summit EU talk by Heiko KorndorfSpark Summit EU talk by Heiko Korndorf
Spark Summit EU talk by Heiko KorndorfSpark Summit
 
Debugging and Profiling Cloud Apps? Sure, You Can Do It Now!
Debugging and Profiling Cloud Apps? Sure, You Can Do It Now!Debugging and Profiling Cloud Apps? Sure, You Can Do It Now!
Debugging and Profiling Cloud Apps? Sure, You Can Do It Now!Vladimir Pavlov
 
APEX Alpe Adria Mike Hichwa Keynote April 11th 2019- Zagreb
APEX Alpe Adria Mike Hichwa Keynote April 11th 2019- ZagrebAPEX Alpe Adria Mike Hichwa Keynote April 11th 2019- Zagreb
APEX Alpe Adria Mike Hichwa Keynote April 11th 2019- ZagrebMichael Hichwa
 
Ibm leads way with hadoop and spark 2015 may 15
Ibm leads way with hadoop and spark 2015 may 15Ibm leads way with hadoop and spark 2015 may 15
Ibm leads way with hadoop and spark 2015 may 15IBMInfoSphereUGFR
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to productionGeorg Heiler
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_VeriticalsPeyman Mohajerian
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014mahchiev
 
Big(ger) Data in Software Engineering
Big(ger) Data in Software EngineeringBig(ger) Data in Software Engineering
Big(ger) Data in Software EngineeringMehdi Mirakhorli
 
Big Data for Data Scientists - WeCloudData
Big Data for Data Scientists - WeCloudDataBig Data for Data Scientists - WeCloudData
Big Data for Data Scientists - WeCloudDataWeCloudData
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation WorkflowsSCAPE Project
 
Raster Algebra mit Oracle Spatial und uDig
Raster Algebra mit Oracle Spatial und uDigRaster Algebra mit Oracle Spatial und uDig
Raster Algebra mit Oracle Spatial und uDigKarin Patenge
 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Databricks
 
Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...
Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...
Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...Qubole
 

Similar to Leveraging the power of solr with spark (20)

Real World Analytics with Solr Cloud and Spark
Real World Analytics with Solr Cloud and SparkReal World Analytics with Solr Cloud and Spark
Real World Analytics with Solr Cloud and Spark
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
 
Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020
 
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
 
A short introduction to Spark and its benefits
A short introduction to Spark and its benefitsA short introduction to Spark and its benefits
A short introduction to Spark and its benefits
 
Spark Summit EU talk by Heiko Korndorf
Spark Summit EU talk by Heiko KorndorfSpark Summit EU talk by Heiko Korndorf
Spark Summit EU talk by Heiko Korndorf
 
Debugging and Profiling Cloud Apps? Sure, You Can Do It Now!
Debugging and Profiling Cloud Apps? Sure, You Can Do It Now!Debugging and Profiling Cloud Apps? Sure, You Can Do It Now!
Debugging and Profiling Cloud Apps? Sure, You Can Do It Now!
 
APEX Alpe Adria Mike Hichwa Keynote April 11th 2019- Zagreb
APEX Alpe Adria Mike Hichwa Keynote April 11th 2019- ZagrebAPEX Alpe Adria Mike Hichwa Keynote April 11th 2019- Zagreb
APEX Alpe Adria Mike Hichwa Keynote April 11th 2019- Zagreb
 
Ibm leads way with hadoop and spark 2015 may 15
Ibm leads way with hadoop and spark 2015 may 15Ibm leads way with hadoop and spark 2015 may 15
Ibm leads way with hadoop and spark 2015 may 15
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
 
Big(ger) Data in Software Engineering
Big(ger) Data in Software EngineeringBig(ger) Data in Software Engineering
Big(ger) Data in Software Engineering
 
Big Data for Data Scientists - WeCloudData
Big Data for Data Scientists - WeCloudDataBig Data for Data Scientists - WeCloudData
Big Data for Data Scientists - WeCloudData
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation Workflows
 
Raster Algebra mit Oracle Spatial und uDig
Raster Algebra mit Oracle Spatial und uDigRaster Algebra mit Oracle Spatial und uDig
Raster Algebra mit Oracle Spatial und uDig
 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)
 
Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...
Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...
Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...
 

Recently uploaded

Automate your OpenSIPS config tests - OpenSIPS Summit 2024
Automate your OpenSIPS config tests - OpenSIPS Summit 2024Automate your OpenSIPS config tests - OpenSIPS Summit 2024
Automate your OpenSIPS config tests - OpenSIPS Summit 2024Andreas Granig
 
Jax, FL Admin Community Group 05.14.2024 Combined Deck
Jax, FL Admin Community Group 05.14.2024 Combined DeckJax, FL Admin Community Group 05.14.2024 Combined Deck
Jax, FL Admin Community Group 05.14.2024 Combined DeckMarc Lester
 
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)Gáspár Nagy
 
Workshop: Enabling GenAI Breakthroughs with Knowledge Graphs - GraphSummit Milan
Workshop: Enabling GenAI Breakthroughs with Knowledge Graphs - GraphSummit MilanWorkshop: Enabling GenAI Breakthroughs with Knowledge Graphs - GraphSummit Milan
Workshop: Enabling GenAI Breakthroughs with Knowledge Graphs - GraphSummit MilanNeo4j
 
Anypoint Code Builder - Munich MuleSoft Meetup - 16th May 2024
Anypoint Code Builder - Munich MuleSoft Meetup - 16th May 2024Anypoint Code Builder - Munich MuleSoft Meetup - 16th May 2024
Anypoint Code Builder - Munich MuleSoft Meetup - 16th May 2024MulesoftMunichMeetup
 
Effective Strategies for Wix's Scaling challenges - GeeCon
Effective Strategies for Wix's Scaling challenges - GeeConEffective Strategies for Wix's Scaling challenges - GeeCon
Effective Strategies for Wix's Scaling challenges - GeeConNatan Silnitsky
 
architecting-ai-in-the-enterprise-apis-and-applications.pdf
architecting-ai-in-the-enterprise-apis-and-applications.pdfarchitecting-ai-in-the-enterprise-apis-and-applications.pdf
architecting-ai-in-the-enterprise-apis-and-applications.pdfWSO2
 
Lessons Learned from Building a Serverless Notifications System.pdf
Lessons Learned from Building a Serverless Notifications System.pdfLessons Learned from Building a Serverless Notifications System.pdf
Lessons Learned from Building a Serverless Notifications System.pdfSrushith Repakula
 
From Theory to Practice: Utilizing SpiraPlan's REST API
From Theory to Practice: Utilizing SpiraPlan's REST APIFrom Theory to Practice: Utilizing SpiraPlan's REST API
From Theory to Practice: Utilizing SpiraPlan's REST APIInflectra
 
What is a Recruitment Management Software?
What is a Recruitment Management Software?What is a Recruitment Management Software?
What is a Recruitment Management Software?NYGGS Automation Suite
 
OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024Shane Coughlan
 
COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...
COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...
COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...naitiksharma1124
 
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdfStrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdfsteffenkarlsson2
 
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-CloudAlluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-CloudAlluxio, Inc.
 
Food Delivery Business App Development Guide 2024
Food Delivery Business App Development Guide 2024Food Delivery Business App Development Guide 2024
Food Delivery Business App Development Guide 2024Chirag Panchal
 
Modern binary build systems - PyCon 2024
Modern binary build systems - PyCon 2024Modern binary build systems - PyCon 2024
Modern binary build systems - PyCon 2024Henry Schreiner
 
^Clinic ^%[+27788225528*Abortion Pills For Sale In harare
^Clinic ^%[+27788225528*Abortion Pills For Sale In harare^Clinic ^%[+27788225528*Abortion Pills For Sale In harare
^Clinic ^%[+27788225528*Abortion Pills For Sale In hararekasambamuno
 
Spring into AI presented by Dan Vega 5/14
Spring into AI presented by Dan Vega 5/14Spring into AI presented by Dan Vega 5/14
Spring into AI presented by Dan Vega 5/14VMware Tanzu
 

Recently uploaded (20)

Automate your OpenSIPS config tests - OpenSIPS Summit 2024
Automate your OpenSIPS config tests - OpenSIPS Summit 2024Automate your OpenSIPS config tests - OpenSIPS Summit 2024
Automate your OpenSIPS config tests - OpenSIPS Summit 2024
 
Jax, FL Admin Community Group 05.14.2024 Combined Deck
Jax, FL Admin Community Group 05.14.2024 Combined DeckJax, FL Admin Community Group 05.14.2024 Combined Deck
Jax, FL Admin Community Group 05.14.2024 Combined Deck
 
Abortion Clinic In Johannesburg ](+27832195400*)[ 🏥 Safe Abortion Pills in Jo...
Abortion Clinic In Johannesburg ](+27832195400*)[ 🏥 Safe Abortion Pills in Jo...Abortion Clinic In Johannesburg ](+27832195400*)[ 🏥 Safe Abortion Pills in Jo...
Abortion Clinic In Johannesburg ](+27832195400*)[ 🏥 Safe Abortion Pills in Jo...
 
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
 
Workshop: Enabling GenAI Breakthroughs with Knowledge Graphs - GraphSummit Milan
Workshop: Enabling GenAI Breakthroughs with Knowledge Graphs - GraphSummit MilanWorkshop: Enabling GenAI Breakthroughs with Knowledge Graphs - GraphSummit Milan
Workshop: Enabling GenAI Breakthroughs with Knowledge Graphs - GraphSummit Milan
 
Anypoint Code Builder - Munich MuleSoft Meetup - 16th May 2024
Anypoint Code Builder - Munich MuleSoft Meetup - 16th May 2024Anypoint Code Builder - Munich MuleSoft Meetup - 16th May 2024
Anypoint Code Builder - Munich MuleSoft Meetup - 16th May 2024
 
Effective Strategies for Wix's Scaling challenges - GeeCon
Effective Strategies for Wix's Scaling challenges - GeeConEffective Strategies for Wix's Scaling challenges - GeeCon
Effective Strategies for Wix's Scaling challenges - GeeCon
 
architecting-ai-in-the-enterprise-apis-and-applications.pdf
architecting-ai-in-the-enterprise-apis-and-applications.pdfarchitecting-ai-in-the-enterprise-apis-and-applications.pdf
architecting-ai-in-the-enterprise-apis-and-applications.pdf
 
Lessons Learned from Building a Serverless Notifications System.pdf
Lessons Learned from Building a Serverless Notifications System.pdfLessons Learned from Building a Serverless Notifications System.pdf
Lessons Learned from Building a Serverless Notifications System.pdf
 
From Theory to Practice: Utilizing SpiraPlan's REST API
From Theory to Practice: Utilizing SpiraPlan's REST APIFrom Theory to Practice: Utilizing SpiraPlan's REST API
From Theory to Practice: Utilizing SpiraPlan's REST API
 
What is a Recruitment Management Software?
What is a Recruitment Management Software?What is a Recruitment Management Software?
What is a Recruitment Management Software?
 
OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024
 
COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...
COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...
COMPUTER AND ITS COMPONENTS PPT.by naitik sharma Class 9th A mittal internati...
 
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdfStrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
 
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-CloudAlluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
 
Food Delivery Business App Development Guide 2024
Food Delivery Business App Development Guide 2024Food Delivery Business App Development Guide 2024
Food Delivery Business App Development Guide 2024
 
Modern binary build systems - PyCon 2024
Modern binary build systems - PyCon 2024Modern binary build systems - PyCon 2024
Modern binary build systems - PyCon 2024
 
^Clinic ^%[+27788225528*Abortion Pills For Sale In harare
^Clinic ^%[+27788225528*Abortion Pills For Sale In harare^Clinic ^%[+27788225528*Abortion Pills For Sale In harare
^Clinic ^%[+27788225528*Abortion Pills For Sale In harare
 
Abortion Clinic Pretoria ](+27832195400*)[ Abortion Clinic Near Me ● Abortion...
Abortion Clinic Pretoria ](+27832195400*)[ Abortion Clinic Near Me ● Abortion...Abortion Clinic Pretoria ](+27832195400*)[ Abortion Clinic Near Me ● Abortion...
Abortion Clinic Pretoria ](+27832195400*)[ Abortion Clinic Near Me ● Abortion...
 
Spring into AI presented by Dan Vega 5/14
Spring into AI presented by Dan Vega 5/14Spring into AI presented by Dan Vega 5/14
Spring into AI presented by Dan Vega 5/14
 

Leveraging the power of solr with spark

  • 1. Leveraging the Power of SOLR with SPARK
 Johannes Weigend QAware GmbH Germany pache Big Data Europe September 2015
  • 2. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany Welcome • Johannes Weigend - CTO QAware GmbH - Software architect / developer - 25 years of experience - Custom enterprise solutions (Java, JS,…) - Lecturer for UI development at the University of Applied Science in Rosenheim - Focus on performance and scalability - SOLR user since 2011 2
  • 3. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany Brute Force Data Analysis 3 Read Read Read Filter Filter Filter Map Map Map Reduce Dataflow Not Indexed foreach() -> Minutes / Hours
  • 4. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany Search Based Data Analysis 4 Filter Search Search Search Map Map Map Reduce DataflowFilter Filter Indexed Data (There’s no free lunch) foreach() -> Seconds/Minutes
  • 5. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany Agenda SOLR cloud Demo SPARK cluster Demo Importing data into SOLR with SPARK Demo Analysis with SOLR and SPARK Demo 5 1 2 3 4
  • 6. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany • Horizontally scalable, distributed NoSQL (Index) Database • Document oriented • A document is a collection of fields (string, number, date, …) • Simple and multiple fields (similar to arrays) • Schema and schema less • Powerful query language (Lucene) • Distributed data in shards • Replication • Powerful full text search capabilities • Aggregation functions (aka facets) • Stable —> V 5.3 6 1 2 3 4
  • 7. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany SOLR@QAware • AIR • Aftersales Information Research • ZEBRA • Part explosion for complex products • EKG • Software Electro Cardiogram • QAsearch • Enterprise search across all repositories including history 7
  • 8. 8
  • 9. 9
  • 10. 10
  • 11. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany Apache SOLR for BigData Analysis? • Text Search Engine? • Aggregations? • Slice and Dice? • Pivots? 11
  • 12. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany Demo: SOLR Cloud • Installing and configuring SOLR Cloud • Searching, sorting and filtering • Facets • Terms (count by term) • Ranges (count in range) • Functions (avg, sum, …) • Sub-Facets (pivot) 12
  • 13. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany Counting as Term Facet 13
  • 14. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany Statistics as Function Facet 14
  • 15. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany Pivots as Sub Facets 15
  • 16. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany careerbuilder.com 16
  • 17. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany Banana 17
  • 18. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany 18
  • 19. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany What’s Missing? • Client-side processing of SOLR results does not scale • No built-in M/R support • Where to store really big data? • Images • Videos • Binaries / large text documents • No interfaces to R / ML 19
  • 20. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany • Distributed job execution engine • Map/Reduce framework • Scala based (runs on JVM) • Java/Scala/Python APIs • Processes data from various data sources • Textfiles (accessible from all nodes) • Hadoop File System (HDFS) • Databases (JDBC) • SOLR! 20 1 2 3 4 Must Read: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  • 21. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany Combining Spark with SOLR • Use Cases • Distributed ETL – Importing data into SOLR- Cloud • Our Usecase: importing N logfiles into SOLR • Distributed processing – data analysis • Statistics on binary data • Map/Reduce 21
  • 22. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany Four Ways to Import Data into SOLR 1. Using built-in functions post script Dataimport handler, Admin-UI 2. Writing custom parallel code using the SOLRJ API 3. Using and customizing Apache Nutch (Hadoop !) 4. Using and customizing Apache Spark 22
  • 23. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany Demo: Import Logfiles with Spark • Writing a Spark job which imports a bunch of logfiles in one directory • Using Lucidwork’s Solr-Spark library 23 1 2 3 4
  • 24. 24
  • 25. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany Demo: Distributed Analysis with Spark • Write a Spark Job which calculates the Duration of Business Actions • Use Spark to access SOLR per SQL / JDBC 25 1 2 3 4
  • 26. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany SolrRDD - The Spark Abstraction to process SOLR Results https://github.com/LucidWorks/spark-solr 26
  • 27. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany SPARK Supports Parallel SQL 27
  • 28. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany Dataframe API 28
  • 29. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany SPARK Worker SOLR 5.3 SHARD #4 29 Odroid XU4 2 GB RAM 64 GB eMMC Disk Ubuntu Linux 70$ SPARK Worker SOLR 5.3 SHARD #3 SPARK Worker SOLR 5.3 SHARD #1 SPARK Worker SOLR 5.3 SHARD #2 SPARK Master SOLR 5.3 SHARD #0 SPARK Worker ZOOKEEPER NFS 40 Cores 10 GB RAM 320 GB eMMC Disk
  • 30. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany Summary 30
  • 31. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany Any Questions ? 31