Leveraging the Power of SOLR with SPARK
Johannes Weigend
QAware GmbH Germany

Apache Big Data Europe

September 2015

Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Welcome
• Johannes Weigend

- CTO QAware GmbH

- Software architect / developer

- 25 years of experience

- Custom enterprise solutions (Java, JS,…)

- Lecturer for UI development at the University of Applied Sciences in Rosenheim

- Focus on performance and scalability

- SOLR user since 2011

Brute Force Data Analysis
• Dataflow: Read -> Filter -> Map on each partition in parallel, then Reduce
• Data is not indexed: foreach() runs over all data
• Runtime: minutes to hours

Search Based Data Analysis
• Dataflow: Search -> Filter -> Map on each partition in parallel, then Reduce
• Data is indexed (there’s no free lunch)
• foreach() runs over the search results
• Runtime: seconds to minutes
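The difference between the two dataflows can be sketched in plain Python. This is only a single-machine illustration, not Solr or Spark code; the records, the field names and the helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample records standing in for log entries.
records = [
    {"id": 1, "level": "ERROR", "msg": "timeout"},
    {"id": 2, "level": "INFO", "msg": "started"},
    {"id": 3, "level": "ERROR", "msg": "disk full"},
    {"id": 4, "level": "WARN", "msg": "slow query"},
]


def brute_force(records, level):
    """Read every record, then filter, map and reduce."""
    matching = (r for r in records if r["level"] == level)  # filter
    lengths = (len(r["msg"]) for r in matching)             # map
    return sum(lengths)                                     # reduce


def build_index(records, field):
    """Pay the indexing cost once: map each field value to its records."""
    index = defaultdict(list)
    for r in records:
        index[r[field]].append(r)
    return index


def search_based(index, level):
    """Look up the matching records in the index, then map and reduce."""
    hits = index.get(level, [])                 # search (no full scan)
    return sum(len(r["msg"]) for r in hits)     # map + reduce


level_index = build_index(records, "level")
total_scan = brute_force(records, "ERROR")
total_search = search_based(level_index, "ERROR")
```

Both variants return the same result; the search-based one touches only the matching records, at the price of building and storing the index first.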

Agenda
1. SOLR cloud (demo)
2. SPARK cluster (demo)
3. Importing data into SOLR with SPARK (demo)
4. Analysis with SOLR and SPARK (demo)

Apache SOLR
• Horizontally scalable, distributed NoSQL (Index) Database
• Document oriented

• A document is a collection of fields (string, number, date, …)

• Simple and multiple fields (similar to arrays)

• Schema-based or schemaless

• Powerful query language (Lucene)

• Distributed data in shards

• Replication

• Powerful full text search capabilities

• Aggregation functions (aka facets)

• Stable: version 5.3
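As a rough illustration of the document model and the Lucene query syntax, the following sketch builds a document and a select URL without contacting a server. The field names, the collection name "logs" and the base URL are assumptions made for this example.

```python
import json
from urllib.parse import urlencode

# Hypothetical document: the field names are illustrative only.
doc = {
    "id": "log-0001",
    "host": "node1",                      # single-valued string field
    "duration_ms": 137,                   # number field
    "timestamp": "2015-09-28T10:15:00Z",  # date field (ISO 8601, UTC)
    "tags": ["import", "batch"],          # multi-valued field
}

# Lucene query syntax: field:value terms combined with AND, plus a range.
params = {
    "q": "host:node1 AND duration_ms:[100 TO *]",
    "fq": "tags:import",       # filter query
    "sort": "timestamp desc",
    "rows": 10,
    "wt": "json",
}

# Assumed base URL and collection name.
select_url = "http://localhost:8983/solr/logs/select?" + urlencode(params)
doc_json = json.dumps(doc)
```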

SOLR@QAware
• AIR

• Aftersales Information Research

• ZEBRA

• Part explosion for complex products

• EKG

• Software Electro Cardiogram

• QAsearch

• Enterprise search across all repositories including history

Apache SOLR for BigData Analysis?
• Text Search Engine?

• Aggregations?

• Slice and Dice?

• Pivots?

Demo: SOLR Cloud
• Installing and configuring SOLR Cloud

• Searching, sorting and filtering

• Facets

• Terms (count by term)

• Ranges (count in range)

• Functions (avg, sum, …)

• Sub-Facets (pivot)
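A range facet ("count in range") might be requested through the JSON Facet API roughly as follows. The field name "duration_ms" and the range bounds are assumptions; the helper only mirrors the bucket arithmetic locally and relies on the default behaviour that a bucket includes its lower bound.

```python
import json

# Hypothetical numeric field and bounds.
START, END, GAP = 0, 500, 100

range_facet_request = {
    "query": "*:*",
    "limit": 0,  # return facet counts only, no documents
    "facet": {
        "by_duration": {
            "type": "range",
            "field": "duration_ms",
            "start": START,
            "end": END,
            "gap": GAP,
        }
    },
}
range_body = json.dumps(range_facet_request)


def bucket_start(value, start=START, end=END, gap=GAP):
    """Return the lower bound of the bucket a value falls into, or None."""
    if value < start or value >= end:
        return None
    return start + ((value - start) // gap) * gap
```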

Counting as Term Facet
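A term facet counts documents per distinct field value. The sketch below builds a JSON Facet API request body and reads the buckets from a response; the field name "level" is hypothetical and the sample response is hand-written to mirror the response shape, not real output.

```python
import json

term_facet_request = {
    "query": "*:*",
    "limit": 0,  # return facet counts only, no documents
    "facet": {
        "by_level": {"type": "terms", "field": "level", "limit": 5}
    },
}
term_body = json.dumps(term_facet_request)


def bucket_counts(response, facet_name):
    """Turn the bucket list of a terms facet into a value -> count dict."""
    buckets = response["facets"][facet_name]["buckets"]
    return {b["val"]: b["count"] for b in buckets}


# Hand-written sample in the shape of a JSON Facet API response.
sample_response = {
    "facets": {
        "count": 4,
        "by_level": {
            "buckets": [
                {"val": "ERROR", "count": 2},
                {"val": "INFO", "count": 1},
                {"val": "WARN", "count": 1},
            ]
        },
    }
}
level_counts = bucket_counts(sample_response, "by_level")
```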

Statistics as Function Facet
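Function facets compute statistics over the matching documents. A minimal request body could look like this, assuming a numeric field named "duration_ms" (hypothetical) and a hypothetical filter on "level".

```python
import json

stats_request = {
    "query": "level:ERROR",  # hypothetical filter on a "level" field
    "limit": 0,               # statistics only, no documents
    "facet": {
        "avg_duration": "avg(duration_ms)",
        "sum_duration": "sum(duration_ms)",
        "min_duration": "min(duration_ms)",
        "max_duration": "max(duration_ms)",
    },
}
stats_body = json.dumps(stats_request)
```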

Pivots as Sub Facets
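A pivot can be expressed by nesting one facet inside another. The following sketch nests a terms facet and a statistic under each host bucket and flattens a response into rows; the field names are hypothetical and the sample response is hand-written for illustration.

```python
import json

pivot_request = {
    "query": "*:*",
    "limit": 0,
    "facet": {
        "by_host": {
            "type": "terms",
            "field": "host",
            "facet": {  # sub-facets, evaluated per host bucket
                "by_level": {"type": "terms", "field": "level"},
                "avg_duration": "avg(duration_ms)",
            },
        }
    },
}
pivot_body = json.dumps(pivot_request)


def pivot_rows(response):
    """Flatten host -> level buckets into (host, level, count) rows."""
    rows = []
    for host in response["facets"]["by_host"]["buckets"]:
        for level in host["by_level"]["buckets"]:
            rows.append((host["val"], level["val"], level["count"]))
    return rows


# Hand-written sample in the shape of a nested facet response.
sample_pivot_response = {
    "facets": {
        "count": 3,
        "by_host": {
            "buckets": [
                {
                    "val": "node1",
                    "count": 2,
                    "avg_duration": 120.0,
                    "by_level": {
                        "buckets": [
                            {"val": "ERROR", "count": 1},
                            {"val": "INFO", "count": 1},
                        ]
                    },
                },
                {
                    "val": "node2",
                    "count": 1,
                    "avg_duration": 80.0,
                    "by_level": {"buckets": [{"val": "INFO", "count": 1}]},
                },
            ]
        },
    }
}
rows = pivot_rows(sample_pivot_response)
```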

careerbuilder.com

Banana


What’s Missing?
• Client-side processing of SOLR results does not scale

• No built-in Map/Reduce support

• Where to store really big data?

• Images

• Videos

• Binaries / large text documents

• No interfaces to R / ML

Apache SPARK
• Distributed job execution engine

• Map/Reduce framework

• Scala based (runs on JVM)

• Java/Scala/Python APIs

• Processes data from various data sources

• Text files (accessible from all nodes)

• Hadoop File System (HDFS)

• Databases (JDBC)

• SOLR!
Must Read: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
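The map/reduce idea behind a distributed job can be imitated on one machine. The sketch below is not Spark code: a thread pool stands in for the workers, and the partitioning scheme and sample data are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical input: one log level per record.
levels = ["ERROR", "INFO", "INFO", "WARN", "ERROR", "INFO"]


def partition(items, n):
    """Split the input into n partitions (round-robin)."""
    return [items[i::n] for i in range(n)]


def count_partition(part):
    """Map step: each worker counts the values of its own partition."""
    return Counter(part)


def merge(a, b):
    """Reduce step: combine two partial results."""
    return a + b


# The thread pool stands in for the workers of a cluster.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(count_partition, partition(levels, 2)))

totals = reduce(merge, partials, Counter())
```

The reduce step is associative, so the partial results can be merged in any order. That property is what lets the work be spread over several nodes.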

Combining Spark with SOLR
• Use Cases

• Distributed ETL: importing data into SOLR Cloud

• Our use case: importing N logfiles into SOLR

• Distributed processing – data analysis

• Statistics on binary data

• Map/Reduce

Four Ways to Import Data into SOLR
1. Using built-in functions: post script, Data Import Handler, Admin UI

2. Writing custom parallel code using the SOLRJ API

3. Using and customizing Apache Nutch (Hadoop!)

4. Using and customizing Apache Spark
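Whichever way is chosen, the documents end up at an update handler of SOLR. As a hedged illustration of custom import code, the sketch below prepares an HTTP request for the JSON update handler instead of using the SolrJ API; the base URL, the collection name "logs" and the helper name are assumptions, and the request is only built, not sent.

```python
import json
from urllib.request import Request


def build_update_request(base_url, collection, docs, commit=True):
    """Prepare a POST of a JSON array of documents to the update handler."""
    url = f"{base_url}/{collection}/update"
    if commit:
        url += "?commit=true"
    return Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical documents and endpoint.
docs = [
    {"id": "log-0001", "level": "INFO", "message": "import started"},
    {"id": "log-0002", "level": "ERROR", "message": "import failed"},
]
request = build_update_request("http://localhost:8983/solr", "logs", docs)
# Sending would be urllib.request.urlopen(request) against a running SOLR.
```

Parallel import code would typically send several such batches concurrently, one per input file or partition.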

Demo: Import Logfiles with Spark
• Writing a Spark job which imports a bunch of logfiles in one directory

• Using Lucidworks’ spark-solr library
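The per-line work of such an import job is turning a log line into a document. The sketch below shows that step in plain Python, without Spark or the spark-solr library; the log format, the field names and the assumption that timestamps are in UTC are all hypothetical.

```python
import re
from itertools import islice

# Hypothetical log format: "2015-09-28 10:15:00,123 INFO [node1] message"
LOG_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}),(?P<millis>\d{3}) "
    r"(?P<level>[A-Z]+) \[(?P<host>[^\]]+)\] (?P<message>.*)$"
)


def to_document(path, line_no, line):
    """Parse one log line into a document dict, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    return {
        "id": f"{path}:{line_no}",  # unique per file and line
        # Assumes the log timestamps are UTC.
        "timestamp": f"{m['date']}T{m['time']}.{m['millis']}Z",
        "level": m["level"],
        "host": m["host"],
        "message": m["message"],
    }


def batches(iterable, size):
    """Yield lists of at most `size` items, so documents are sent in bulk."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


sample_lines = [
    "2015-09-28 10:15:00,123 INFO [node1] import started",
    "not a log line",
    "2015-09-28 10:15:02,456 ERROR [node2] import failed",
]
parsed = (to_document("app.log", n, l) for n, l in enumerate(sample_lines, 1))
documents = [d for d in parsed if d is not None]
document_batches = list(batches(documents, 1))
```

In a Spark job, the parsing function would be applied to every line of every file in parallel, and each partition would send its batches to SOLR.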

Demo: Distributed Analysis with Spark
• Write a Spark job which calculates the duration of business actions
• Use Spark to access SOLR via SQL / JDBC
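One way to compute the duration of a business action is to group its log events by an action ID and take the span between the earliest and latest timestamp. The sketch below does this in plain Python in a map/reduce-by-key style; the event structure, the field names and the definition of duration are assumptions, not the method shown in the demo.

```python
# Hypothetical events: each log entry carries an action ID and a timestamp.
events = [
    {"action_id": "a1", "ts_ms": 1000},
    {"action_id": "a2", "ts_ms": 1200},
    {"action_id": "a1", "ts_ms": 1750},
    {"action_id": "a2", "ts_ms": 1300},
    {"action_id": "a1", "ts_ms": 1400},
]


def to_pair(event):
    """Map step: emit (key, (earliest, latest)) for a single event."""
    return event["action_id"], (event["ts_ms"], event["ts_ms"])


def merge_span(a, b):
    """Reduce step: combine two (earliest, latest) spans of one action."""
    return min(a[0], b[0]), max(a[1], b[1])


def reduce_by_key(pairs, merge):
    """Merge all values that share a key."""
    acc = {}
    for key, value in pairs:
        acc[key] = merge(acc[key], value) if key in acc else value
    return acc


spans = reduce_by_key(map(to_pair, events), merge_span)
durations = {action: latest - earliest for action, (earliest, latest) in spans.items()}
```

Because merge_span is associative and commutative, the same logic can run per partition first and be merged afterwards, which is the pattern a distributed job relies on.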

SolrRDD - The Spark Abstraction to process SOLR Results

https://github.com/LucidWorks/spark-solr

SPARK Supports Parallel SQL

Dataframe API

Demo Cluster
• 5 x Odroid XU4, each with 2 GB RAM, 64 GB eMMC disk, Ubuntu Linux, 70$
• Node 0: SPARK Master, SPARK Worker, SOLR 5.3 (shard #0), ZOOKEEPER, NFS
• Nodes 1 to 4: SPARK Worker, SOLR 5.3 (shards #1 to #4)
• Total: 40 cores, 10 GB RAM, 320 GB eMMC disk

Summary

Any questions?