Solr as a Spark SQL Datasource
1.
2. Solr as a Spark SQL Datasource
Kiran Chitturi,
Lucidworks
3. Solr & Spark
• A few interesting things about Spark
• Overview of SparkSQL and DataFrames
• Solr as a SparkSQL DataSource in depth
• Use Lucene for text analysis in ML pipelines
• Example Use Case: Lucidworks Fusion and Spark
5. What’s interesting about Spark?
• Wealth of overview / getting started resources on the Web
➢ Start here -> https://spark.apache.org/
➢ Should READ! https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
• Faster, more modernized alternative to MapReduce
➢ Spark running on Hadoop sorted 100TB in 23 minutes (3x faster than Yahoo’s previous record while using 10x less computing power)
• Unified platform for Big Data
➢ Great for iterative algorithms (PageRank, K-Means, Logistic regression) & interactive data mining
➢ Runs on YARN, Mesos, and plays well with HDFS
• Nice API for Java, Scala, Python, R, and sometimes SQL … REPL interface too
• >14,200 Issues in JIRA, 1000+ code contributors, 2.0 coming soon!
7. Physical Architecture (Standalone)
[Diagram: a Spark Master daemon, 1..N Spark Worker nodes each running a Spark Slave daemon and Spark Executor JVM processes with a cache, and “My Spark App” (my-spark-job.jar w/ shaded deps) whose SparkContext acts as the driver.]
• Spark Master (daemon): keeps track of live workers, schedules tasks, restarts failed tasks; Web UI on port 8080
➢ Losing the master prevents new applications from being executed; HA can be achieved using ZooKeeper and multiple master nodes
• Spark Worker Node (1..N): the executor runs in a separate JVM process from the slave daemon
• Tasks: each task works on some partition of a data set to apply a transformation or action; tasks are assigned based on data locality, i.e. the master takes data locality into account when selecting which node to execute a task on
• SparkContext (driver): RDD graph, DAG scheduler, block tracker, shuffle tracker
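These roles map directly onto a small driver program. A minimal sketch, assuming a standalone master reachable at spark-master-host:7077 (the host name is an assumption; 7077 is the default master RPC port, while 8080 above is only the Web UI):

import org.apache.spark.{SparkConf, SparkContext}

object MySparkApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("my-spark-app")
      .setMaster("spark://spark-master-host:7077")
    val sc = new SparkContext(conf)   // the driver: builds the RDD graph and schedules tasks
    try {
      // work is split into tasks that run inside the executors on the worker nodes
      println(sc.parallelize(1 to 100).sum())
    } finally {
      sc.stop()
    }
  }
}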
8. Spark SQL
• DataSource API for reading from and writing to external data sources
• DataFrame is an RDD[Row] + schema
• The secret sauce is the logical plan optimizer (Catalyst)
• SQL or relational operators on DF
• JDBC / ODBC
• UDFs!
• Machine Learning Pipelines
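To make the “RDD[Row] + schema” point concrete, here is a minimal sketch using stock Spark SQL APIs; the field names and values are purely illustrative:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

// A DataFrame is just an RDD of Rows plus a schema describing the columns
val rows = sc.parallelize(Seq(Row("cash", 12.5), Row("card", 9.0)))
val schema = StructType(Seq(
  StructField("payment_type", StringType, nullable = true),
  StructField("fare_amount", DoubleType, nullable = true)))
val df = sqlContext.createDataFrame(rows, schema)

// UDFs and SQL / relational operators both work on the DataFrame
sqlContext.udf.register("isCash", (p: String) => p == "cash")
df.registerTempTable("payments")
sqlContext.sql("SELECT fare_amount FROM payments WHERE isCash(payment_type)").show()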
9. Solr as a Spark SQL Data Source
• Read/write data from/to Solr as DataFrame
• Use Solr Schema API to access field-level metadata
• Push predicates down into Solr query constructs, e.g. fq clause
• Deep-paging, shard partitioning, intra-shard splitting, streaming results
// Connect to Solr
val opts = Map("zkhost" -> "localhost:9983", "collection" -> "nyc_trips")
val solrDF = sqlContext.read.format("solr").options(opts).load
// Register DF as temp table
solrDF.registerTempTable("trips")
// Perform SQL queries
sqlContext.sql("SELECT avg(tip_amount), avg(fare_amount) FROM trips").show()
12. Data-locality Hint
• SolrRDD extends RDD[SolrDocument] (written in Scala)
• Give hint to Spark task scheduler about where data lives
override def getPreferredLocations(split: Partition): Seq[String] = {
// return preferred hostname for a Solr partition
}
• Useful when Spark executors and Solr replicas live on the same physical host, as they do in Fusion
• A query to a shard has a “preferred” replica; it can fall back to other replicas if the preferred one goes down (will be in 2.1)
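A minimal sketch of what such an override could look like; the partition case class and its host field below are hypothetical illustrations, not the actual spark-solr types:

import org.apache.spark.Partition

// Hypothetical partition type that remembers which Solr host serves its shard/replica
case class SolrShardPartition(index: Int, preferredHost: String) extends Partition

// Inside the RDD[SolrDocument] subclass:
override def getPreferredLocations(split: Partition): Seq[String] = {
  // Ask the scheduler to run this task on the node hosting the replica it reads from
  Seq(split.asInstanceOf[SolrShardPartition].preferredHost)
}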
13. Solr Streaming API for fast reads
• Contributed to spark-solr by the Bloomberg team (we ❤ PRs!)
• Extremely fast “table scans” over large result sets in Solr
• Relies on a column-oriented data structure in Lucene: docValues
• DocValues help speed up faceting and sorting too!
• Coming soon! Push SQL predicates down into Solr’s Parallel SQL engine, available in Solr 6.x
14. Writing to Solr (aka indexing)
• Cloud-aware client sends updates to shard leaders in parallel
• Solr Schema API used to create fields on-the-fly using the DataFrame schema
• Better parallelism than traditional approaches like Solr DIH
val dbOpts = Map(
  "url" -> "jdbc:postgresql:mydb",
  "dbtable" -> "schema.table",
  "partitionColumn" -> "foo",
  "numPartitions" -> "10")
val jdbcDF = sqlContext.read.format("jdbc").options(dbOpts).load

val solrOpts = Map("zkhost" -> "localhost:9983", "collection" -> "mycoll")
jdbcDF.write.format("solr").options(solrOpts).mode(SaveMode.Overwrite).save
15. Solr / Lucene Analyzers for Spark ML Pipelines
• Spark ML Pipeline provides nice API for defining stages to train / predict ML models
• Crazy idea ~ use battle-hardened Lucene for text analysis in Spark
• Pipelines support import/export (work in progress, more coming in Spark 2.0)
• Can try different text analysis techniques during cross-validation
[Diagram: DataFrame → Spark ML Pipeline (Lucene Analyzer → HashingTF → StandardScaler) → trained model (SVM) → save]
https://lucidworks.com/blog/2016/04/13/spark-solr-lucenetextanalyzer/
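A rough sketch of such a pipeline built from stock Spark ML stages; here a plain Tokenizer stands in for the Lucene analyzer stage, LogisticRegression stands in for the SVM, and trainingDF / testDF are assumed DataFrames with “text” and “label” columns:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{Tokenizer, HashingTF, StandardScaler}
import org.apache.spark.ml.classification.LogisticRegression

// Stand-in for the Lucene analyzer stage from the diagram above
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val scaler = new StandardScaler().setInputCol("rawFeatures").setOutputCol("features")
val clf = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, scaler, clf))
val model = pipeline.fit(trainingDF)                       // train the whole pipeline
model.transform(testDF).select("text", "prediction").show() // predict with the same stages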
16.
17. Fusion & Spark
• spark-solr 2.0.1 released, built into Fusion 2.4
• Users leave evidence of their needs & experience as they use your app
• Fusion closes the feedback loop to improve results based on user “signals”
• Train and serve ML Pipeline- and MLlib-based machine learning models
• Run custom Scala “script” jobs in the background in Fusion
➢Complex aggregation jobs (see next slide)
➢Unsupervised learning (LDA topic modeling)
➢Re-train supervised models as new training data flows in
18. Scheduled Fusion job to compute stats for user sessions
val opts = Map("zkhost" -> "localhost:9983”, "collection" -> "apachelogs”)
var logEvents = sqlContext.read.format("solr").options(opts).load
logEvents.registerTempTable("logs”)
sqlContext.udf.register("ts2ms", (d: java.sql.Timestamp) => d.getTime)
sqlContext.udf.register("asInt", (b: String) => b.toInt)
val sessions = sqlContext.sql("""
|SELECT *, sum(IF(diff_ms > 30000, 1, 0))
|OVER (PARTITION BY clientip ORDER BY ts) session_id
|FROM (SELECT *, ts2ms(ts) - lag(ts2ms(ts))
|OVER (PARTITION BY clientip ORDER BY ts) as diff_ms FROM logs) tmp
""".stripMargin)
sessions.registerTempTable("sessions")
var sessionsAgg = sqlContext.sql("""
|SELECT concat_ws('||', clientip,session_id) as id,
| first(clientip) as clientip,
| min(ts) as session_start,
| max(ts) as session_end,
| (ts2ms(max(ts)) - ts2ms(min(ts))) as session_len_ms_l,
| count(*) as total_requests_l
|FROM sessions
|GROUP BY clientip,session_id
""".stripMargin)
sessionsAgg.write.format("solr").options(Map("zkhost" -> "localhost:9983", "collection" -> "apachelogs_signals_aggr"))
.mode(org.apache.spark.sql.SaveMode.Overwrite).save
19. Getting started with spark-solr
• Import the package via its Maven coordinates (--packages)
./bin/spark-shell --packages "com.lucidworks.solr:spark-solr:2.0.1"
• Build from source
git clone https://github.com/LucidWorks/spark-solr
cd spark-solr
mvn clean package -DskipTests
./bin/spark-shell --jars 2.1.0-SNAPSHOT.jar
20. Example : Deep paging via shards
// Connect to Solr
val opts = Map(
"zkhost" -> "localhost:9983",
"collection" -> "nyc_trips")
val solrDF = sqlContext.read.format("solr").options(opts).load
// Register DF as temp table
solrDF.registerTempTable("trips")
sqlContext.sql("SELECT * FROM trips LIMIT 2").show()
21. Example : Deep paging with intra shard splitting
// Connect to Solr
val opts = Map(
"zkhost" -> "localhost:9983",
"collection" -> "nyc_trips",
"splits" -> "true")
val solrDF = sqlContext.read.format("solr").options(opts).load
// Register DF as temp table
solrDF.registerTempTable("trips")
sqlContext.sql("SELECT * FROM trips").count()
22. Example : Streaming API (/export handler)
// Connect to Solr
val opts = Map(
"zkhost" -> "localhost:9983",
"collection" -> "nyc_trips")
val solrDF = sqlContext.read.format("solr").options(opts).load
// Register DF as temp table
solrDF.registerTempTable("trips")
sqlContext.sql("SELECT avg(tip_amount), avg(fare_amount) FROM
trips").show()
23. Performance test
• NYC taxi data (30 months - 91.7M rows)
• Dataset loaded into an AWS RDS instance (Postgres)
• 3 EC2 r3.2xlarge instances
• Solr and Spark instances co-located together
• Collection ‘nyc-taxi’ created with 6 shards, replication factor 1
• Deployed using solr-scale-tk (https://github.com/LucidWorks/solr-scale-tk)
• Dataset link: https://github.com/toddwschneider/nyc-taxi-data
• More details: https://gist.github.com/kiranchitturi/0be62fc13e4ec7f9ae5def53180ed181
24. Query performance
• Query: a simple aggregation query to calculate averages
• Streaming expressions took 2.3 minutes across 6 tasks
• Deep paging took 20 minutes across 120 tasks
25. Index performance
• 91.4M rows imported to Solr in 49 minutes
• Docs per second: 31K
• JDBC batch size: 5000
• Indexing batch size: 50000
• Partitions: 200
26. Wrap-up and Q & A
Download Fusion: http://lucidworks.com/fusion/download/
Feel free to reach out to me with questions:
kiran.chitturi@lucidworks.com / @chitturikiran
27. Spark Streaming: Nuts & Bolts
• Transform a stream of records into small, deterministic batches
✓ Discretized stream: sequence of RDDs
✓ Once you have an RDD, you can use all the other Spark libs (MLlib, etc)
✓ Low-latency micro batches
✓ Time to process a batch must be less than the batch interval time
• Two types of operators:
✓ Transformations (group by, join, etc)
✓ Output (send to some external sink, e.g. Solr)
• Impressive performance!
✓ 4GB/s (40M records/s) on 100 node cluster with less than 1 second latency (note: not indexing rate)
✓ Haven’t found any unbiased, reproducible performance comparisons between Storm / Spark
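A minimal sketch of the micro-batch model using a socket source (the host, port, and 5-second batch interval are illustrative):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each 5-second batch of records becomes one RDD in the discretized stream
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)

// Transformation operators on the stream...
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

// ...and an output operator sending each batch to a sink (print here; Solr on the next slide)
counts.print()

ssc.start()
ssc.awaitTermination()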
28. Spark Streaming Example: Solr as Sink
./spark-submit --master MASTER --class com.lucidworks.spark.SparkApp spark-solr-1.0.jar \
  twitter-to-solr -zkHost localhost:2181 -collection social

// Receive a stream of tweets from Twitter (provided by Spark)
JavaReceiverInputDStream<Status> tweets =
  TwitterUtils.createStream(jssc, null, filters);

// map(): various transformations / enrichments on each tweet
// (e.g. sentiment analysis, language detection)
JavaDStream<SolrInputDocument> docs = tweets.map(
  new Function<Status,SolrInputDocument>() {
    // Convert a twitter4j Status object into a SolrInputDocument
    public SolrInputDocument call(Status status) {
      SolrInputDocument doc = new SolrInputDocument();
      …
      return doc;
    }});

// Custom Java / Scala code lives in a stream processor class
class TwitterToSolrStreamProcessor
    extends SparkApp.StreamProcessor

// Index the docs into Solr in batches of 100 (provided by Lucidworks)
SolrSupport.indexDStreamOfDocs(zkHost, collection, 100, docs);

Slide legend: provided by Spark / custom Java or Scala code / provided by Lucidworks
29. Document Matching using Stored Queries
• For each document, determine which of a large set of stored queries matches
• Useful for alerts, alternative flow paths through a stream, etc.
• Index a micro-batch into an embedded (in-memory) Solr instance and then determine which queries match
• It is a matching framework; you decide where to load the stored queries from and what to do when matches are found
• Scale it out using Spark; if you need to scale to many queries, check out Luwak
30. Document Matching using Stored Queries
// Receive a stream of tweets (provided by Spark)
JavaReceiverInputDStream<Status> tweets =
  TwitterUtils.createStream(jssc, null, filters);

// map(): convert a twitter4j Status object into a SolrInputDocument
JavaDStream<SolrInputDocument> docs = tweets.map(
  new Function<Status,SolrInputDocument>() {
    public SolrInputDocument call(Status status) {
      SolrInputDocument doc = new SolrInputDocument();
      …
      return doc;
    }});

// Each micro-batch of docs is indexed into an EmbeddedSolrServer (initialized
// from configs stored in ZooKeeper) to determine which stored queries match.
// DocFilterContext is the key abstraction: it lets you plug in how the stored
// queries are loaded and what action to take when docs match.
JavaDStream<SolrInputDocument> enriched =
  SolrSupport.filterDocuments(docFilterContext, …);

Slide legend: provided by Spark / custom Java or Scala code / provided by Lucidworks
31. RDD Illustrated: Word count
[Diagram: a file RDD from HDFS (“quick brown fox jumped …”, “quick brownie recipe …”, “quick drying glue …”) split across Partitions 1..3; flatMap splits lines into words; map turns each word into a pair with count 1; reduceByKey shuffles across machine boundaries, sending all identical keys to the same reducer and summing, e.g. (quick,1) + (quick,1) + (quick,1) → (quick,3)]
• Executors are assigned based on data locality if possible; narrow transformations occur in the same executor
• Spark keeps track of the transformations made to generate each RDD
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
32. Understanding Resilient Distributed Datasets (RDD)
• Read-only partitioned collection of records with fault-tolerance
• Created from external system OR using a transformation of another RDD
• RDDs track the lineage of coarse-grained transformations (map, join, filter, etc)
• If a partition is lost, RDDs can be re-computed by re-playing the transformations
• User can choose to persist an RDD (for reusing during interactive data-mining)
• User can control partitioning scheme
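A small sketch tying these points together (the HDFS path and partition count are illustrative):

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

// Coarse-grained transformations; Spark records this lineage rather than the data itself
val pairs = sc.textFile("hdfs://...").flatMap(_.split(" ")).map((_, 1))

// User-controlled partitioning scheme
val partitioned = pairs.partitionBy(new HashPartitioner(8)).reduceByKey(_ + _)

// Persist for reuse during interactive data mining; a lost partition is
// re-computed by re-playing the lineage above
partitioned.persist(StorageLevel.MEMORY_ONLY)
partitioned.count()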
33. Physical Architecture
(Repeat of the standalone architecture diagram from slide 7: Spark Master daemon, Spark Worker nodes running slave daemons and executor JVM processes, and the driver’s SparkContext with the RDD graph, DAG scheduler, block tracker, and shuffle tracker.)