Leveraging the Power of SOLR with SPARK
1. Leveraging the Power of SOLR with SPARK
Johannes Weigend
QAware GmbH Germany
Apache Big Data Europe
September 2015
2. Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Welcome
• Johannes Weigend
- CTO QAware GmbH
- Software architect / developer
- 25 years of experience
- Custom enterprise solutions (Java, JS,…)
- Lecturer for UI development at the University of Applied Sciences in Rosenheim
- Focus on performance and scalability
- SOLR user since 2011
3. Brute Force Data Analysis
[Diagram: dataflow over data that is not indexed. Three parallel pipelines each run Read -> Filter -> Map, and their results feed a single Reduce.]
• foreach() over all records -> minutes / hours
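The brute-force dataflow can be sketched in a few lines of plain Python. This is only an illustration, not code from the talk: the records, the field names (`level`, `host`, `millis`) and the function `brute_force_total` are assumptions made up for the example. The point it shows is that every record is read, even though only some survive the filter.

```python
from functools import reduce

# Hypothetical log records; in practice these would be read from files.
RECORDS = [
    {"level": "ERROR", "host": "node1", "millis": 120},
    {"level": "INFO", "host": "node1", "millis": 15},
    {"level": "ERROR", "host": "node2", "millis": 300},
    {"level": "INFO", "host": "node2", "millis": 20},
]


def brute_force_total(records, level):
    """Read every record, filter by level, map to millis, reduce to a sum."""
    scanned = 0
    mapped = []
    for record in records:                  # read: every record is touched
        scanned += 1
        if record["level"] == level:        # filter
            mapped.append(record["millis"])  # map
    total = reduce(lambda a, b: a + b, mapped, 0)  # reduce
    return total, scanned


total, scanned = brute_force_total(RECORDS, "ERROR")
print(total, scanned)
```

The `scanned` counter always equals the size of the input, which is why this approach grows with the total data volume rather than with the number of matches.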
4. Search Based Data Analysis
[Diagram: dataflow over indexed data. The Filter is applied inside three parallel Search steps, each followed by Map, and their results feed a single Reduce.]
• Indexed data (there's no free lunch)
• foreach() over the search results only -> seconds / minutes
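A minimal sketch of the search-based variant, again in plain Python and not taken from the talk: a small inverted index stands in for the search engine. The records, field names and the helpers `build_index` and `indexed_total` are hypothetical. Building the index is the price paid up front ("no free lunch"); afterwards only the hits are mapped and reduced.

```python
from collections import defaultdict

# Hypothetical log records, used only for illustration.
RECORDS = [
    {"level": "ERROR", "host": "node1", "millis": 120},
    {"level": "INFO", "host": "node1", "millis": 15},
    {"level": "ERROR", "host": "node2", "millis": 300},
    {"level": "INFO", "host": "node2", "millis": 20},
]


def build_index(records, field):
    """Pay the indexing cost once: map each field value to record positions."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[field]].append(position)
    return index


def indexed_total(records, index, value):
    """Search via the index, then map and reduce only the matching records."""
    hits = index.get(value, [])                       # search + filter
    total = sum(records[pos]["millis"] for pos in hits)  # map + reduce
    return total, len(hits)


LEVEL_INDEX = build_index(RECORDS, "level")
total, touched = indexed_total(RECORDS, LEVEL_INDEX, "ERROR")
print(total, touched)
```

Here `touched` counts only the matching records, so query time depends on the size of the result rather than on the size of the data set.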
5. Agenda
1. SOLR cloud (Demo)
2. SPARK cluster (Demo)
3. Importing data into SOLR with SPARK (Demo)
4. Analysis with SOLR and SPARK (Demo)
6. SOLR
• Horizontally scalable, distributed NoSQL (index) database
• Document oriented
• A document is a collection of fields (string, number, date, …)
• Single-valued and multi-valued fields (multi-valued fields are similar to arrays)
• Schema-based or schemaless
• Powerful query language (Lucene)
• Data distributed across shards
• Replication
• Powerful full-text search capabilities
• Aggregation functions (aka facets)
• Stable -> V 5.3
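To make "a document is a collection of fields" concrete, here is a small sketch of one document with single-valued and multi-valued fields, serialized as the JSON array that Solr's update handler accepts. The field names and values are invented for the example and do not come from the talk.

```python
import json

# Hypothetical document: all field names and values are illustrative only.
doc = {
    "id": "log-0001",                      # string field
    "host": "node1",                       # string field
    "duration_ms": 120,                    # number field
    "timestamp": "2015-09-28T10:15:30Z",   # date field (ISO 8601, UTC)
    "tags": ["checkout", "payment"],       # multi-valued field, like an array
}

# A JSON update request body is a list of such documents.
payload = json.dumps([doc])
print(payload)
```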
7. SOLR@QAware
• AIR
- Aftersales Information Research
• ZEBRA
- Part explosion for complex products
• EKG
- Software Electro Cardiogram
• QAsearch
- Enterprise search across all repositories including history
11. Apache SOLR for Big Data Analysis?
• Text Search Engine?
• Aggregations?
• Slice and Dice?
• Pivots?
12. Demo: SOLR Cloud
• Installing and configuring SOLR Cloud
• Searching, sorting and filtering
• Facets
- Terms (count by term)
- Ranges (count in range)
- Functions (avg, sum, …)
- Sub-Facets (pivot)
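The four facet kinds listed above map naturally onto Solr's JSON Facet API. The sketch below only builds such a request body in Python; it does not contact a server. The slides do not show the actual demo requests, so the field names (`host`, `duration_ms`), the facet labels and the helper `build_facet_request` are assumptions for illustration.

```python
import json


def build_facet_request(query="*:*"):
    """Build a JSON request body with term, range, function and sub-facets."""
    return {
        "query": query,
        "limit": 0,  # only the facet results are of interest, not the documents
        "facet": {
            # Terms: count documents per distinct value of a field.
            "by_host": {
                "type": "terms",
                "field": "host",
                # Sub-facet (pivot): a statistic computed inside each bucket.
                "facet": {"avg_duration": "avg(duration_ms)"},
            },
            # Ranges: count documents per fixed-width interval.
            "duration_ranges": {
                "type": "range",
                "field": "duration_ms",
                "start": 0,
                "end": 1000,
                "gap": 250,
            },
            # Functions: a statistic over all matching documents.
            "total_duration": "sum(duration_ms)",
        },
    }


request_body = build_facet_request()
print(json.dumps(request_body, indent=2))
```

Nesting a `facet` block inside a terms facet is what turns a plain count into a pivot-style breakdown.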
13. Counting as Term Facet
14. Statistics as Function Facet
15. Pivots as Sub Facets
16. careerbuilder.com
17. Banana
19. What’s Missing?
• Client-side processing of SOLR results does not scale
• No built-in Map/Reduce support
• Where to store really big data?
- Images
- Videos
- Binaries / large text documents
• No interfaces to R / ML
20. SPARK
• Distributed job execution engine
• Map/Reduce framework
• Scala based (runs on the JVM)
• Java/Scala/Python APIs
• Processes data from various data sources
- Text files (accessible from all nodes)
- Hadoop File System (HDFS)
- Databases (JDBC)
- SOLR!
Must Read: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
21. Combining Spark with SOLR
• Use Cases
- Distributed ETL: importing data into SOLR Cloud
- Our use case: importing N log files into SOLR
- Distributed processing: data analysis
- Statistics on binary data
- Map/Reduce
22. Four Ways to Import Data into SOLR
1. Using built-in functions: post script, Data Import Handler, Admin UI
2. Writing custom parallel code using the SOLRJ API
3. Using and customizing Apache Nutch (Hadoop!)
4. Using and customizing Apache Spark
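Whichever of the four routes is chosen, the import boils down to the same two steps: turn each input record into a document, then send documents in batches. The sketch below shows only those two steps in plain Python and sends nothing. The log line layout ("timestamp level message"), the field names and the helpers `parse_line` and `batches` are hypothetical; the real demo's format is not shown on the slides.

```python
import json


def parse_line(line_number, line):
    """Split an assumed 'timestamp level message' log line into a document."""
    timestamp, level, message = line.split(" ", 2)
    return {
        "id": f"line-{line_number}",
        "timestamp": timestamp,
        "level": level,
        "message": message,
    }


def batches(docs, size):
    """Yield lists of at most `size` documents, so each request stays small."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the incomplete last batch
        yield batch


# Hypothetical input lines.
LINES = [
    "2015-09-28T10:15:30Z INFO order started",
    "2015-09-28T10:15:31Z ERROR payment failed",
    "2015-09-28T10:15:32Z INFO order finished",
]

docs = [parse_line(n, line) for n, line in enumerate(LINES, start=1)]
payloads = [json.dumps(batch) for batch in batches(docs, 2)]
print(len(payloads))
```

In a distributed import, each worker would run this parse-and-batch loop over its own share of the files, which is what makes the fourth option scale with the number of nodes.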
23. Demo: Import Logfiles with Spark
• Writing a Spark job that imports all the logfiles in one directory
• Using Lucidworks' spark-solr library
25. Demo: Distributed Analysis with Spark
• Write a Spark job that calculates the duration of business actions
• Use Spark to access SOLR via SQL / JDBC
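The duration calculation can be phrased as a map step followed by a reduce step. The following is a plain-Python sketch of that idea, not the Spark job from the demo: the event layout (action, request id, start/end marker, timestamp in milliseconds) and the function `durations_by_action` are assumptions, and the code assumes exactly one start and one end event per request.

```python
from collections import defaultdict

# Hypothetical events: (action, request id, kind, timestamp in ms).
EVENTS = [
    ("checkout", "req-1", "start", 1000),
    ("checkout", "req-1", "end", 1400),
    ("search", "req-2", "start", 1100),
    ("search", "req-2", "end", 1150),
    ("checkout", "req-3", "start", 2000),
    ("checkout", "req-3", "end", 2200),
]


def durations_by_action(events):
    """Return the average duration in ms per business action."""
    # Map + reduce by key: a start contributes -ts and an end +ts, so the
    # per-request sum is the duration (one start and one end are assumed).
    per_request = defaultdict(int)
    for action, request_id, kind, ts in events:
        per_request[(action, request_id)] += ts if kind == "end" else -ts

    # Second reduce: group the request durations by action and average them.
    per_action = defaultdict(list)
    for (action, _request_id), duration in per_request.items():
        per_action[action].append(duration)
    return {action: sum(d) / len(d) for action, d in per_action.items()}


averages = durations_by_action(EVENTS)
print(averages)
```

Because addition is associative and commutative, the per-request sums can be computed in any order and on any node, which is the property a distributed reduce relies on.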
26. SolrRDD - The Spark Abstraction to process SOLR Results
https://github.com/LucidWorks/spark-solr
27. SPARK Supports Parallel SQL
28. DataFrame API
29. [Diagram: demo cluster of Odroid XU4 boards, one SOLR shard per node]
• Per board: Odroid XU4, 2 GB RAM, 64 GB eMMC disk, Ubuntu Linux, $70
• Shard #0 node: SPARK Master, SPARK Worker, SOLR 5.3, ZOOKEEPER, NFS
• Shard #1 to #4 nodes: SPARK Worker, SOLR 5.3
• Cluster total: 40 cores, 10 GB RAM, 320 GB eMMC disk
30. Summary
31. Any Questions?