1. Leveraging the Power of SOLR with SPARK
Johannes Weigend
QAware GmbH Germany
Apache Big Data Europe
September 2015
2. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Welcome
• Johannes Weigend
- CTO QAware GmbH
- Software architect / developer
- 25 years of experience
- Custom enterprise solutions (Java, JS,…)
- Lecturer for UI development at the University of Applied Sciences in Rosenheim
- Focus on performance and scalability
- SOLR user since 2011
Brute Force Data Analysis
Dataflow: Read -> Filter -> Map (three parallel lanes), then a single Reduce
• Data is not indexed: foreach() visits every record
• Runtime: minutes to hours
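A minimal sketch of the brute-force dataflow, assuming a small in-memory list of hypothetical log records (the field names `host`, `level` and `duration_ms` are illustrative, not taken from the talk). Every record is read before it can be filtered, mapped and reduced:

```python
from functools import reduce

# Hypothetical log records; without an index every one of them must be read.
records = [
    {"host": "node1", "level": "ERROR", "duration_ms": 120},
    {"host": "node2", "level": "INFO", "duration_ms": 15},
    {"host": "node1", "level": "ERROR", "duration_ms": 300},
    {"host": "node3", "level": "WARN", "duration_ms": 42},
]


def brute_force_total(records, level):
    """Read -> Filter -> Map -> Reduce over the full data set."""
    scanned = 0

    def read():
        nonlocal scanned
        for record in records:
            scanned += 1  # count how many records had to be touched
            yield record

    filtered = (r for r in read() if r["level"] == level)
    mapped = (r["duration_ms"] for r in filtered)
    total = reduce(lambda a, b: a + b, mapped, 0)
    return total, scanned


total, scanned = brute_force_total(records, "ERROR")
print(total, scanned)  # 420 4
```

The cost is proportional to the total number of records, not to the number of matches.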
Search Based Data Analysis
Dataflow: Search -> Filter -> Map (three parallel lanes), then a single Reduce
• Data is indexed (there's no free lunch): foreach() visits only the search hits
• Runtime: seconds to minutes
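A minimal sketch of the search-based variant, assuming the same hypothetical log records. A plain dictionary stands in for the search index; it is not how SOLR or Lucene store their index, it only illustrates that the lookup touches the matching records alone, at the price of building the index first:

```python
from collections import defaultdict

# Hypothetical log records (same shape as in a brute-force scan).
records = [
    {"host": "node1", "level": "ERROR", "duration_ms": 120},
    {"host": "node2", "level": "INFO", "duration_ms": 15},
    {"host": "node1", "level": "ERROR", "duration_ms": 300},
    {"host": "node3", "level": "WARN", "duration_ms": 42},
]


def build_index(records, field):
    """Map each field value to the positions of the records containing it."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[field]].append(position)
    return index


# The "no free lunch" part: the index is paid for once, at import time.
level_index = build_index(records, "level")


def indexed_total(records, index, level):
    """Search via the index, then map and reduce only the hits."""
    hits = index.get(level, [])
    return sum(records[p]["duration_ms"] for p in hits), len(hits)


total, touched = indexed_total(records, level_index, "ERROR")
print(total, touched)  # 420 2
```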
Agenda
1. SOLR cloud (demo)
2. SPARK cluster (demo)
3. Importing data into SOLR with SPARK (demo)
4. Analysis with SOLR and SPARK (demo)
Apache SOLR
• Horizontally scalable, distributed NoSQL (Index) Database
• Document oriented
• A document is a collection of fields (string, number, date, …)
• Simple and multiple fields (similar to arrays)
• Schema and schema less
• Powerful query language (Lucene)
• Distributed data in shards
• Replication
• Powerful full text search capabilities
• Aggregation functions (aka facets)
• Stable (version 5.3)
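To make the document model concrete, here is a small sketch of one document with single-valued and multi-valued fields, serialized as JSON. All field names and values are hypothetical. The `shard_for` helper only illustrates the idea of spreading documents over shards by hashing the id; it is a stand-in and not SOLR's actual routing algorithm:

```python
import json
import zlib

# One document: a flat collection of typed fields.
doc = {
    "id": "log-0001",                      # string
    "timestamp": "2015-09-28T10:15:30Z",   # date in ISO 8601 / UTC
    "level": "ERROR",                      # string
    "duration_ms": 120,                    # number
    "tags": ["checkout", "payment"],       # multi-valued field (like an array)
}


def shard_for(doc_id, num_shards):
    """Illustrative only: pick a shard from a deterministic hash of the id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


payload = json.dumps([doc])  # documents are typically sent as a JSON array
shard = shard_for(doc["id"], 5)
print(payload)
print("shard", shard)
```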
SOLR@QAware
• AIR
- Aftersales Information Research
• ZEBRA
- Part explosion for complex products
• EKG
- Software Electrocardiogram
• QAsearch
- Enterprise search across all repositories including history
Apache SOLR for BigData Analysis?
• Text Search Engine?
• Aggregations?
• Slice and Dice?
• Pivots?
Demo: SOLR Cloud
• Installing and configuring SOLR Cloud
• Searching, sorting and filtering
• Facets
- Terms (count by term)
- Ranges (count in range)
- Functions (avg, sum, …)
- Sub-Facets (pivot)
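The searching, sorting and filtering part of the demo can be sketched as a request URL for SOLR's `/select` handler, using the standard `q`, `fq`, `sort` and `rows` parameters. The collection name `logs`, the field names and the host are assumptions for illustration; the sketch only builds the URL and does not contact a server:

```python
from urllib.parse import parse_qs, urlencode


def select_url(base, collection, query, filters=(), sort=None, rows=10):
    """Build a /select URL with a main query, filter queries and a sort."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    params += [("fq", f) for f in filters]  # one fq parameter per filter
    if sort:
        params.append(("sort", sort))
    return f"{base}/{collection}/select?{urlencode(params)}"


url = select_url(
    "http://localhost:8983/solr",      # assumed local SOLR instance
    "logs",                            # hypothetical collection
    "message:timeout",                 # Lucene query syntax
    filters=["level:ERROR", "duration_ms:[100 TO *]"],
    sort="timestamp desc",
)
params = parse_qs(url.split("?", 1)[1])
print(url)
print(params["fq"])
```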
Counting as Term Facet
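The original slide content is not preserved here. As a hedged illustration, a terms facet in SOLR's JSON Facet API could look like the request body below (the field name `level` and the facet name `by_level` are hypothetical). The `terms_facet` function emulates locally what such a facet computes: a count per distinct term, returned as buckets:

```python
import json
from collections import Counter

# Request body for the JSON Request API; names are illustrative.
facet_request = {
    "query": "*:*",
    "limit": 0,  # only the facet counts are of interest, not the documents
    "facet": {"by_level": {"type": "terms", "field": "level"}},
}

docs = [
    {"host": "node1", "level": "ERROR"},
    {"host": "node2", "level": "INFO"},
    {"host": "node1", "level": "ERROR"},
    {"host": "node3", "level": "WARN"},
]


def terms_facet(docs, field):
    """Local emulation: count documents per term, most frequent first."""
    counts = Counter(d[field] for d in docs if field in d)
    return {"buckets": [{"val": v, "count": c} for v, c in counts.most_common()]}


result = terms_facet(docs, "level")
print(json.dumps(facet_request))
print(result["buckets"][0])  # {'val': 'ERROR', 'count': 2}
```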
Statistics as Function Facet
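The original slide content is not preserved here. As a sketch, function facets in the JSON Facet API are expressed as strings such as `avg(field)` or `sum(field)`. The field name `duration_ms` and the facet names are assumptions; `function_facets` emulates the aggregation locally:

```python
import json

# Request body with three function facets; names are illustrative.
facet_request = {
    "query": "*:*",
    "limit": 0,
    "facet": {
        "avg_ms": "avg(duration_ms)",
        "sum_ms": "sum(duration_ms)",
        "max_ms": "max(duration_ms)",
    },
}

docs = [
    {"level": "ERROR", "duration_ms": 120},
    {"level": "INFO", "duration_ms": 15},
    {"level": "ERROR", "duration_ms": 300},
    {"level": "WARN", "duration_ms": 42},
]


def function_facets(docs, field):
    """Local emulation of avg/sum/max over one numeric field."""
    values = [d[field] for d in docs if field in d]
    return {
        "avg_ms": sum(values) / len(values),
        "sum_ms": sum(values),
        "max_ms": max(values),
    }


stats = function_facets(docs, "duration_ms")
print(json.dumps(facet_request))
print(stats)  # {'avg_ms': 119.25, 'sum_ms': 477, 'max_ms': 300}
```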
Pivots as Sub Facets
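The original slide content is not preserved here. As a sketch, a pivot can be expressed in the JSON Facet API by nesting one terms facet inside another. The field names `host` and `level` are hypothetical; `pivot` emulates the nested counting locally:

```python
import json
from collections import Counter, defaultdict

# A terms facet on host with a nested terms facet on level.
facet_request = {
    "query": "*:*",
    "limit": 0,
    "facet": {
        "by_host": {
            "type": "terms",
            "field": "host",
            "facet": {"by_level": {"type": "terms", "field": "level"}},
        }
    },
}

docs = [
    {"host": "node1", "level": "ERROR"},
    {"host": "node2", "level": "INFO"},
    {"host": "node1", "level": "ERROR"},
    {"host": "node3", "level": "WARN"},
]


def pivot(docs, outer, inner):
    """Local emulation: count inner terms within each outer term."""
    nested = defaultdict(Counter)
    for d in docs:
        nested[d[outer]][d[inner]] += 1
    return {key: dict(counter) for key, counter in nested.items()}


table = pivot(docs, "host", "level")
print(json.dumps(facet_request))
print(table)
```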
careerbuilder.com
Banana
What’s Missing?
• Client-side processing of SOLR results does not scale
• No built-in Map/Reduce support
• Where to store really big data?
- Images
- Videos
- Binaries / large text documents
• No interfaces to R / ML
Apache SPARK
• Distributed job execution engine
• Map/Reduce framework
• Scala based (runs on JVM)
• Java/Scala/Python APIs
• Processes data from various data sources
- Text files (accessible from all nodes)
- Hadoop File System (HDFS)
- Databases (JDBC)
- SOLR!
Must Read: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Combining Spark with SOLR
• Use cases
- Distributed ETL: importing data into SOLR Cloud (our use case: importing N logfiles into SOLR)
- Distributed processing: data analysis (statistics on binary data, Map/Reduce)
Four Ways to Import Data into SOLR
1. Using built-in functions: post script, DataImportHandler, Admin UI
2. Writing custom parallel code using the SOLRJ API
3. Using and customizing Apache Nutch (Hadoop!)
4. Using and customizing Apache Spark
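Whichever of the four ways is used, the documents end up at SOLR's update handler. The sketch below builds such a request over plain HTTP with the Python standard library; it is not SolrJ and not the Spark-based import from the talk. The collection name `logs`, the host and the document fields are assumptions, and the request is only constructed, not sent:

```python
import json
import urllib.request


def build_update_request(base, collection, docs):
    """Prepare a POST of a JSON document array to the /update handler."""
    url = f"{base}/{collection}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


docs = [
    {"id": "log-0001", "level": "ERROR", "duration_ms": 120},
    {"id": "log-0002", "level": "INFO", "duration_ms": 15},
]
request = build_update_request("http://localhost:8983/solr", "logs", docs)

# urllib.request.urlopen(request) would send it; left out here so that the
# sketch runs without a SOLR server.
print(request.get_method(), request.full_url)
```

Committing on every request is convenient for a demo; for bulk imports, larger batches and fewer commits are the usual choice.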
Demo: Import Logfiles with Spark
• Writing a Spark job which imports a bunch of logfiles in one directory
• Using Lucidworks' spark-solr library
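The shape of such an import job can be sketched without Spark: each logfile is parsed independently into documents, so the files can be processed in parallel. Here a thread pool stands in for the Spark workers, and the log line format, file names and field names are all hypothetical. The actual demo uses Spark and the spark-solr library, whose API is not shown:

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Assumed line format: "<timestamp> <LEVEL> <host> <message>"
LINE = re.compile(
    r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<host>\S+) (?P<message>.*)$"
)


def parse_file(name, lines):
    """Turn the parseable lines of one logfile into documents."""
    docs = []
    for number, line in enumerate(lines, start=1):
        match = LINE.match(line)
        if match:  # lines that do not fit the format are skipped
            doc = match.groupdict()
            doc["id"] = f"{name}:{number}"  # unique id per file and line
            docs.append(doc)
    return docs


# In-memory stand-in for a directory of logfiles.
logfiles = {
    "app1.log": ["2015-09-28T10:15:30Z ERROR node1 payment timeout", "garbage line"],
    "app2.log": ["2015-09-28T10:15:31Z INFO node2 order created"],
}

with ThreadPoolExecutor(max_workers=2) as pool:
    batches = list(pool.map(lambda item: parse_file(*item), logfiles.items()))

docs = [doc for batch in batches for doc in batch]
print(len(docs), [d["id"] for d in docs])  # 2 ['app1.log:1', 'app2.log:1']
```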
Demo: Distributed Analysis with Spark
• Write a Spark job which calculates the duration of business actions
• Use Spark to access SOLR via SQL / JDBC
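The duration calculation can be sketched as a two-step map/reduce in plain Python. The event layout is an assumption (each business action instance logs several timestamped events sharing an `action_id`); the talk's actual log schema is not shown:

```python
from collections import defaultdict

# Hypothetical events: two per action instance (first and last log entry).
events = [
    {"action_id": "a1", "action": "checkout", "timestamp_ms": 1000},
    {"action_id": "a1", "action": "checkout", "timestamp_ms": 1400},
    {"action_id": "a2", "action": "login", "timestamp_ms": 2000},
    {"action_id": "a2", "action": "login", "timestamp_ms": 2100},
    {"action_id": "a3", "action": "checkout", "timestamp_ms": 3000},
    {"action_id": "a3", "action": "checkout", "timestamp_ms": 3200},
]


def average_durations(events):
    """Reduce events to (first, last) per instance, then average per action."""
    spans = {}
    for e in events:
        key = (e["action"], e["action_id"])
        ts = e["timestamp_ms"]
        first, last = spans.get(key, (ts, ts))
        spans[key] = (min(first, ts), max(last, ts))

    per_action = defaultdict(list)
    for (action, _), (first, last) in spans.items():
        per_action[action].append(last - first)

    return {a: sum(d) / len(d) for a, d in per_action.items()}


durations = average_durations(events)
print(durations)  # {'checkout': 300.0, 'login': 100.0}
```

Both steps are keyed reductions, which is why the calculation distributes well over a cluster.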
SolrRDD - The Spark Abstraction to process SOLR Results
https://github.com/LucidWorks/spark-solr
SPARK Supports Parallel SQL
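The original slide content is not preserved here. To illustrate the kind of aggregate query meant, the sketch below runs a `GROUP BY` statement with Python's built-in `sqlite3` module. This is a local, single-process stand-in and not Spark SQL; the table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (host TEXT, level TEXT, duration_ms INTEGER)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?, ?)",
    [
        ("node1", "ERROR", 120),
        ("node2", "INFO", 15),
        ("node1", "ERROR", 300),
        ("node3", "WARN", 42),
    ],
)

# Count and average duration per log level.
rows = conn.execute(
    "SELECT level, COUNT(*), AVG(duration_ms) "
    "FROM logs GROUP BY level ORDER BY level"
).fetchall()
conn.close()

for row in rows:
    print(row)
```

In a distributed engine the same statement would be split into partial aggregates per partition that are then merged.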
DataFrame API
• Five nodes, each an Odroid XU4: 2 GB RAM, 64 GB eMMC disk, Ubuntu Linux, $70
• Node 0: SPARK Master, SPARK Worker, SOLR 5.3 (shard #0), ZooKeeper, NFS
• Nodes 1 to 4: SPARK Worker, SOLR 5.3 (shards #1 to #4)
• Total: 40 cores, 10 GB RAM, 320 GB eMMC disk
Summary
Any questions?