Real-World Analytics with Solr Cloud and Spark
Solving Analytic Problems for Billions of Records Within Seconds
Apache Big Data North America | Vancouver | 05.05.2016 | Johannes Weigend | © QAware GmbH

Real World Analytics with Solr Cloud and Spark

Apache Big Data Conference 2016, Vancouver BC: Talk by Johannes Weigend (@JohannesWeigend, CTO at QAware).

Abstract: Apache Solr is a distributed NoSQL database with impressive search capabilities. Apache Spark makes M/R faster and richer. This code-intensive session shows how to combine both to solve real-time search and processing problems. The demos feature a portable Solr Cloud / Spark cluster based on Intel NUC hardware.



  1. Real-World Analytics with Solr Cloud and Spark
     Solving Analytic Problems for Billions of Records Within Seconds
     Vancouver, May 2016 | Johannes Weigend | QAware GmbH
     Apache Big Data North America 2016
  2. Any questions? Ask, or tweet with the hashtag #cloudnativenerd
  3. The Problem We Want to Solve
     ■ Interactive applications with runtimes lower than a second!
     ■ Processing of billions of records (>10^9 rows / records)
     ■ Continuous data import (near real time)
     ■ Applications built on top of the Reactive Manifesto
  6. Horizontal Scalability can be difficult!
     ■ Horizontal scalability of functions
       ■ Trivial
         ■ Load balancing of (stateless) services (macro- / microservices)
         ■ More users → more machines
       ■ Not trivial
         ■ More machines → faster response times
     ■ Horizontal scalability of data
       ■ Trivial
         ■ Linear distribution of data on multiple machines
         ■ More machines → more data
       ■ Not trivial
         ■ Constant response times with growing datasets
  7. Hadoop Gives Answers for Horizontal Scalability of Data and Functions
  9. The Processing of Distributed Data can be Quite Slow!
     [Diagram: data flow Read → Filter → Map (three parallel lanes) → Reduce on top of HDFS / NFS / NoSQL; foreach() -> minutes / hours]
  10. With Prior Indexing and Searching, Less Data has to be Read and Filtered.
      [Diagram: data flow Search → Filter → Map (three parallel lanes) → Reduce on top of Search / NoSQL; foreach() -> seconds / minutes]
  11. [Diagram labels: Distributed Data (Search), Cluster Processing (Spark: Map, Reduce), Business Layer, Frontend]
  12. DEMO
  13. Part 1: Solr Cloud for Analytics
      [Diagram: data flow Search → Filter → Map → Reduce on top of Search / NoSQL, with Spark]
  14. ■ Document-based NoSQL database with outstanding search capabilities
        ■ A document is a collection of fields (string, number, date, …)
        ■ Single- and multi-valued fields (fields can be arrays)
        ■ Nested documents
        ■ Static and dynamic schema
        ■ Powerful query language (Lucene)
      ■ Horizontally scalable with Solr Cloud
        ■ Distributed data in separate shards
        ■ Resilience through the combination of ZooKeeper and replication
      ■ Powerful aggregations (aka facets)
      ■ Stable → V 6.0
  15. The Architecture of Solr Cloud
      [Diagram labels: ZooKeeper cluster (three ZooKeeper nodes), Solr Cloud with several Solr servers, collection, shards (Shard1 to Shard9), replicas (Replica1 to Replica9), leader, scale out]
  16. Solr Stores Everything in a Single "Table" (BigTable). Searching is Extremely Fast and Powerful.*
      [Diagram: Customer (Name, Address) 1 to * Order (Amount, Product)]

      Type      ID  Name  Address  Amount  Product  K2B
      Customer  1   K1    A1       -       -        [3,5]
      Customer  2   K2    A2       -       -        [4]
      Order     3   -     -        Z1      P1       [1]
      Order     4   -     -        Z2      P2       [2]
      ...

      Each row is one SolrDocument.
      (*) With 100 million documents per shard, runtimes of queries and aggregations are normally less than 100 ms
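As a rough illustration of the single-table idea, the sample rows can be mimicked with plain Python dictionaries. This is a minimal sketch, not Solr code: the lower-case field names, the `orders_of` helper, and the reading of `K2B` as a list of related document IDs are assumptions made for the example.

```python
# Hypothetical in-memory stand-in for the single "table"; values follow the slide.
DOCS = [
    {"type": "Customer", "id": 1, "name": "K1", "address": "A1", "k2b": [3, 5]},
    {"type": "Customer", "id": 2, "name": "K2", "address": "A2", "k2b": [4]},
    {"type": "Order", "id": 3, "amount": "Z1", "product": "P1", "k2b": [1]},
    {"type": "Order", "id": 4, "amount": "Z2", "product": "P2", "k2b": [2]},
]


def orders_of(customer_id, documents):
    """Follow the K2B references of one customer to its order documents."""
    by_id = {doc["id"]: doc for doc in documents}
    customer = by_id[customer_id]
    return [
        by_id[ref]
        for ref in customer["k2b"]
        if ref in by_id and by_id[ref]["type"] == "Order"
    ]
```

Customers and orders live side by side and are told apart only by the `type` field. Customer 1 references orders 3 and 5, but only order 3 is part of the sample rows, so `orders_of(1, DOCS)` returns that single document.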
  17. A Solr Cloud can be Started in Seconds.
      ■ Create a schema by reusing an existing set of Solr config files
        ■ There are examples in the installation directory $SOLR_HOME/server/solr/configsets which can be copied and modified
            cp -r $SOLR_HOME/server/solr/configsets/basic_configs $SOLR_HOME/server/solr/configsets/bigdata2016
      ■ Start Solr
        ■ When the wizard asks for a collection name, use "bigdata2016"
            $SOLR_HOME/bin/solr start -e cloud
      ■ Make a first test
            curl "localhost:8983/solr/bigdata2016/query?q=*:*"
  18. With the Solr Cloud Collections API, Shards can be Created, Changed or Deleted.
      ■ Create a collection
          <<SOLR URL>>/solr/admin/collections?action=CREATE&name=<<name of collection>>&numShards=16&replicationFactor=2&maxShardsPerNode=8&collection.configName=<<name of uploaded zookeeper configuration>>
      ■ Delete a collection
          <<SOLR URL>>/solr/admin/collections?action=DELETE&name=<<name of collection>>
      https://cwiki.apache.org/confluence/display/solr/Collections+API
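The CREATE call can be assembled programmatically. The following is a minimal sketch that only builds the URL and sends nothing; the function names, the host `http://localhost:8983`, and the collection name are assumptions for illustration, while the parameter names are the ones used in the CREATE URL of the Collections API slide.

```python
import math
from urllib.parse import urlencode


def create_collection_url(solr_url, name, config_name,
                          num_shards=16, replication_factor=2,
                          max_shards_per_node=8):
    """Build a Collections API CREATE URL with the parameters from the slide."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "maxShardsPerNode": max_shards_per_node,
        "collection.configName": config_name,
    }
    return f"{solr_url}/solr/admin/collections?{urlencode(params)}"


def min_nodes_needed(num_shards, replication_factor, max_shards_per_node):
    """Smallest node count that can host every shard replica."""
    return math.ceil(num_shards * replication_factor / max_shards_per_node)


url = create_collection_url("http://localhost:8983", "bigdata2016", "bigdata2016")
```

With the values from the slide, 16 shards with 2 replicas each at no more than 8 per node need at least `min_nodes_needed(16, 2, 8)` = 4 nodes.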
  19. Zookeeper has to be Started First and the Solr Configuration must be Uploaded to Use a Solr Cloud.
      1. Start ZooKeeper on 2n+1 nodes (odd number)
           $ZOO_HOME/bin/zkServer.sh start
      2. Upload the Solr configuration into ZooKeeper
           cd $SOLR_HOME/server/scripts/cloud-scripts
           ./zkcli.sh -cmd upconfig -zkhost 192.168.1.100:2181,192.168.1.101:2181,192.168.1.102:2181 -confname ekgdata -solrhome /opt/solr/server/solr -confdir /opt/solr/server/solr/configsets/ekgdata_configs/conf
      3. Start Solr on n nodes connected to the ZooKeeper cluster
           $SOLR_HOME/bin/solr start -c -z 192.168.1.100:2181,192.168.1.101:2181,192.168.1.102:2181
      4. Create a collection with a number of shards and replicas
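The reason for the odd node count is that a ZooKeeper ensemble stays available only while a majority of its nodes is up. A small sketch of that arithmetic (the function names are made up for this example):

```python
def quorum_size(ensemble_size):
    """Number of nodes that form a majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many nodes may fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)
```

An ensemble of 2n+1 nodes tolerates n failures. Adding one more node to reach an even size does not help: four nodes tolerate one failure, the same as three.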
  20. Example: Solr Cloud for Analytics of Insurance Data
      ■ Insurance sample data with the following fields: Education, Income, Gender, ...
  21. DEMO
  22. Solr Supports JSON Queries via HTTP POST
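The query shown on the slide is not part of this transcript. As a hedged sketch of what such a request can look like, the following builds (but does not send) a POST request for Solr's JSON Request API. The helper name, host, collection name, and the `gender` filter are assumptions for illustration.

```python
import json
import urllib.request


def build_json_query(base_url, collection, query, filters=(), limit=10):
    """Prepare a POST request carrying a JSON query body for /query."""
    body = {"query": query, "limit": limit}
    if filters:
        body["filter"] = list(filters)
    return urllib.request.Request(
        url=f"{base_url}/solr/{collection}/query",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_json_query(
    "http://localhost:8983", "bigdata2016", "*:*", filters=["gender:f"], limit=5
)
```

Passing `request` to `urllib.request.urlopen` would send it to a running Solr node; the sketch stops short of that so it can be read and run without a cluster.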
  23. Term Facets Group and Count a Single Field.
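To make the grouping concrete, here is a minimal sketch of a terms facet in Solr's JSON Facet API syntax, next to a local emulation of what it counts. The field name `gender` is borrowed from the insurance example, and the facet label `by_gender` and both function names are assumptions.

```python
from collections import Counter


def terms_facet_request(field):
    """JSON request body asking Solr to bucket all documents by one field."""
    return {
        "query": "*:*",
        "limit": 0,  # return only the facet buckets, no documents
        "facet": {f"by_{field}": {"type": "terms", "field": field}},
    }


def terms_facet_local(docs, field):
    """What a terms facet computes, emulated on a local list of dicts."""
    return Counter(doc[field] for doc in docs if field in doc)
```

Solr evaluates the facet on every shard and merges the buckets, whereas the local version only shows the semantics: one count per distinct value of the field.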
  24. Function Facets Aggregate Fields.
      http://yonik.com/solr-facet-functions/
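A function facet replaces the bucket count with an aggregate such as `avg(...)`, `min(...)` or `max(...)`. The sketch below assumes a numeric `income` field from the insurance example; the labels and function names are illustrative, and the local version only emulates the result on a list of dicts.

```python
from statistics import mean


def function_facet_request(field):
    """JSON request body with three aggregation functions over one field."""
    return {
        "query": "*:*",
        "limit": 0,  # aggregates only, no documents
        "facet": {
            f"avg_{field}": f"avg({field})",
            f"min_{field}": f"min({field})",
            f"max_{field}": f"max({field})",
        },
    }


def function_facet_local(docs, field):
    """The same aggregates computed locally for comparison."""
    values = [doc[field] for doc in docs if field in doc]
    return {
        f"avg_{field}": mean(values),
        f"min_{field}": min(values),
        f"max_{field}": max(values),
    }
```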
  25. Pivot Facets Compose Facets into Hierarchies.
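One way to express such a hierarchy is to nest one terms facet inside another in the JSON Facet API. Whether the slide used this syntax or the classic `facet.pivot` parameter is not visible in the transcript, so treat the following as a sketch under that assumption, with hypothetical field names and helper names.

```python
from collections import Counter, defaultdict


def nested_facet_request(outer_field, inner_field):
    """Terms facet on outer_field with a sub-facet on inner_field per bucket."""
    return {
        "query": "*:*",
        "limit": 0,
        "facet": {
            f"by_{outer_field}": {
                "type": "terms",
                "field": outer_field,
                "facet": {
                    f"by_{inner_field}": {"type": "terms", "field": inner_field}
                },
            }
        },
    }


def pivot_local(docs, outer_field, inner_field):
    """Two-level counts emulated locally: outer value -> inner value -> count."""
    tree = defaultdict(Counter)
    for doc in docs:
        tree[doc[outer_field]][doc[inner_field]] += 1
    return {outer: dict(inner) for outer, inner in tree.items()}
```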
  26. Solr 6 Supports SQL
      ■ Solr 6 supports distributed SQL
      ■ The JDBC driver is part of the SolrJ client library
      ■ A collection is currently mapped as a single table
        ■ Collection -> Table
        ■ SolrDocument -> Row
        ■ Field -> Column
      ■ The SQL support in Solr 6.0 is limited, but more functionality is expected in upcoming versions
        ■ No database metadata, no prepared statements, no mapping to tables per type field
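Besides JDBC, Solr 6 also accepts SQL over HTTP through the `/sql` request handler, which takes the statement in a `stmt` parameter. The sketch below only prepares such a request; the helper name, host, collection (table) name, and the `gender` column are assumptions for illustration.

```python
import urllib.request
from urllib.parse import urlencode


def build_sql_request(base_url, collection, statement):
    """Prepare a form-encoded POST to the collection's /sql handler."""
    return urllib.request.Request(
        url=f"{base_url}/solr/{collection}/sql",
        data=urlencode({"stmt": statement}).encode("utf-8"),
        method="POST",
    )


STATEMENT = "SELECT gender, count(*) FROM bigdata2016 GROUP BY gender"
sql_request = build_sql_request("http://localhost:8983", "bigdata2016", STATEMENT)
```

The table name in the statement is the collection name, in line with the Collection -> Table mapping on the slide.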
  27. Resilience
      ■ The number of replicas per shard is configurable (replication factor)
        ■ This number corresponds to the number of nodes that can fail silently
      ■ ZooKeeper is the single point of failure, but can itself be made fail-safe by running multiple instances
        ■ Solr knows all ZooKeeper instances and can silently switch over to the next available instance if the ZooKeeper it was last connected to crashes
  28. You've Got Everything You Need! Or Not?
      ■ Client-side processing of Solr documents does not scale
      ■ No possibility to run parallel business logic inside Solr
      ■ The Solr index is not a general-purpose store for huge data
        ■ Images
        ■ Videos
        ■ Binaries / large text documents
      ■ No interface to machine learning or typical statistics libraries (R)
      ...
  29. Distributed In-Memory Computing with Apache Spark
      [Diagram: data flow Search → Filter → Map → Reduce on top of Search / NoSQL, with Spark]
  30. ■ Distributed computing (up to 100x faster than Hadoop M/R)
      ■ Distributed Map/Reduce on distributed data can be done in-memory
      ■ Written in Scala (JVM)
        ■ Java/Scala/Python APIs
      ■ Processes data from distributed and non-distributed sources
        ■ Text files (accessible from all nodes)
        ■ Hadoop File System (HDFS)
        ■ Databases (JDBC)
        ■ Solr via the Lucidworks API
        ■ ...
      READ THIS: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  31. [Diagram: Spark cluster. The driver application on the driver node creates a Spark Context (with a master URL) and uses Resilient Distributed Datasets (RDDs), which are split into partitions. The master (standalone master / YARN / Mesos) runs on the master host; worker JVMs on the slave hosts start executor JVMs, which run the tasks for each partition.]
  32. A Very First Spark Application
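The code from this slide is not included in the transcript. As a stand-in, here is a minimal sketch of a first Spark job using the Python API: it distributes a range of numbers, squares them on the executors, and reduces the squares to a sum. The application name, function names, and the choice of Python are assumptions. The `pyspark` import is deferred so that the local reference function can be used without a Spark installation.

```python
from operator import add


def square(x):
    return x * x


def sum_of_squares_local(n):
    """Single-process reference result for comparison."""
    return sum(square(i) for i in range(1, n + 1))


def sum_of_squares_spark(n, master="local[*]", partitions=4):
    """The same computation as a Spark job; requires the pyspark package."""
    from pyspark import SparkConf, SparkContext  # deferred: only needed here

    conf = SparkConf().setAppName("first-spark-app").setMaster(master)
    sc = SparkContext(conf=conf)
    try:
        # Distribute the input over the given number of partitions.
        numbers = sc.parallelize(range(1, n + 1), partitions)
        # map() runs on the executors; reduce() combines the partial results.
        return numbers.map(square).reduce(add)
    finally:
        sc.stop()
```

With `master="local[*]"` the job runs inside a single JVM on the local machine; pointing it at a cluster master URL distributes the same code without further changes.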
  33. Spark Pattern 1: Distributed Task with Params
  34. Spark Pattern 2: Distributed Read from External Sources
  35. Spark Pattern 3: Caching and Further Processing with RDDs
  36. DEMO
  37. Putting It All Together: Solr & Spark in Action
      [Diagram: data flow Search → Filter → Map → Reduce on top of Search / NoSQL, with Spark]
  38. How to implement readFromShard()?
      ■ Several possibilities for that:
        ■ SolrJ: SolrStream
          ■ The /export handler can stream bulk data out of Solr
          ■ Supports JSON export only (no binary format!)
        ■ Or: SolrJ cursor marks
        ■ Or: Custom export handler
      http://localhost:8983/solr/jax2016/export?q=*:*&sort=id%20asc&fl=id&wt=xml
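The cursor-mark option can be sketched as a simple loop: start with `cursorMark=*`, pass the returned `nextCursorMark` into the next request, and stop once it no longer changes. The code below is an illustration under assumptions, not the speaker's implementation: `read_all` and `make_fake_shard` are hypothetical names, and the fake shard uses a page offset as its cursor so the loop can be tried without a Solr instance.

```python
def read_all(fetch_page, query="*:*", sort="id asc", rows=1000):
    """Yield every matching document by following Solr cursor marks.

    fetch_page(params) must return a parsed Solr response containing
    response.docs and nextCursorMark. The sort has to include the unique key.
    """
    cursor = "*"
    while True:
        params = {"q": query, "sort": sort, "rows": rows, "cursorMark": cursor}
        result = fetch_page(params)
        yield from result["response"]["docs"]
        next_cursor = result["nextCursorMark"]
        if next_cursor == cursor:  # unchanged cursor: no more results
            break
        cursor = next_cursor


def make_fake_shard(docs):
    """In-memory stand-in for a shard; its cursor is just an offset."""
    ordered = sorted(docs, key=lambda doc: doc["id"])

    def fetch_page(params):
        cursor, rows = params["cursorMark"], params["rows"]
        start = 0 if cursor == "*" else int(cursor)
        page = ordered[start:start + rows]
        next_cursor = str(start + len(page)) if page else cursor
        return {"response": {"docs": page}, "nextCursorMark": next_cursor}

    return fetch_page


shard = make_fake_shard([{"id": i} for i in range(5)])
all_ids = [doc["id"] for doc in read_all(shard, rows=2)]
```

In a real setup, `fetch_page` would issue an HTTP request against one shard. Running one such reader per shard inside a Spark task is one way to parallelize the read.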
  39. LucidWorks has released a Spark/Solr Integration Library.
      https://github.com/lucidworks/spark-solr
  40. Lucidworks Solr-Spark Adapter V 2.1
  41. Logfile Analytics with Solr and Spark
      ■ Histogram of all exceptions from hosts A, B, C during time interval D
      ■ Step 1: Search with Solr
        ■ Solr query (q=*Exception AND (server:A OR server:B OR server:C) AND timestamp between [1.1.2015, 31.12.2015])
      ■ Step 2: Create a map with key = <<exception name>>, value = count
        ■ Group with Spark
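The two steps can be sketched without a cluster. The first function turns the pseudo-query from the slide into Lucene syntax, and the second builds the histogram locally, standing in for the grouping that Spark would do across partitions. The field names `message`, `server` and `timestamp`, the regular expression, and both function names are assumptions for illustration.

```python
import re
from collections import Counter


def build_query(servers, start, end):
    """Step 1: Lucene query for exception log lines on the given hosts."""
    server_clause = " OR ".join(f"server:{name}" for name in servers)
    return (
        f"message:*Exception AND ({server_clause}) "
        f"AND timestamp:[{start} TO {end}]"
    )


# Matches names such as java.lang.NullPointerException.
EXCEPTION_NAME = re.compile(r"\b([A-Za-z_][\w.]*Exception)\b")


def exception_histogram(messages):
    """Step 2: map each message to its exception name and count per name."""
    counts = Counter()
    for message in messages:
        match = EXCEPTION_NAME.search(message)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)
```

In the distributed version, the extraction would run as a map over the documents returned by Solr and the counting as a reduce by key, so only the small histogram travels back to the driver.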
  43. DEMO
  44. Specifications – Intel NUC6i5SYK
      CPU    6th generation Intel® Core™ i5-6260U processor with Intel® Iris™ graphics (1.9 GHz up to 2.8 GHz Turbo, dual core, 4 MB cache, 15 W TDP)
      RAM    32 GB dual-channel DDR4 SODIMMs, 1.2 V, 2133 MHz
      Disk   256 GB Samsung M.2 internal SSD
      Total  8 cores, 16 HT units, 128 GB RAM, 1 TB disk
      → This case is as powerful as four notebooks
  45. Technical Cluster Architecture
      [Diagram: four Ubuntu Linux nodes (1 to 4) with HDFS. Every node runs Solr Cloud and Spark (master, slave and executor JVMs); ZooKeeper #1 to #3 run on nodes 1 to 3; Zeppelin runs on nodes 1 and 2. Shards s1 to s16 are shown in groups of four per node.]
  46. You can even run Solr Cloud and Spark on Odroid XU4 $70 ARM Computers
      ■ 8 cores
      ■ ca. 1/10 of the CPU performance of the Intel NUC 6 / Core i5
  47. [Diagram: Odroid XU4 cluster. Each board: 2 GB RAM, 64 GB eMMC disk, Ubuntu Linux, $70. The boards run Spark workers with Solr 5.3, plus a Spark master and ZooKeeper. Total: 40 cores, 10 GB RAM, 320 GB eMMC disk.]
  49. Summary
      ■ Solr Cloud and Spark are a powerful combination for interactive analytics and data-intensive applications
      ■ Writing distributed software stays hard. Only distribute if you have to.
      ■ 100% open source
      ■ A simple integration of Solr and Spark is easy. For high-performance applications, things can get more complicated.
      ■ If professional product support is needed, customers can switch to Lucidworks Fusion to get a pre-integrated and supported Solr/Spark platform
  50. @JohannesWeigend | @qaware | slideshare.net/qaware | blog.qaware.de