SlideShare ist ein Scribd-Unternehmen logo
1 von 24
Downloaden Sie, um offline zu lesen
Solr as a Spark SQL Datasource
Kiran Chitturi
Lucidworks
The standard for
enterprise search.
of Fortune 500
uses Solr.
90%
Solr as a Spark SQL Data Source
• Read/write data from/to Solr as DataFrame
• Use Solr Schema API to access field-level metadata
• Push predicates down into Solr query constructs, e.g. fq clause
• Shard partitioning, intra-shard splitting, streaming results
// Connect to Solr
val opts = Map("zkhost" -> "localhost:9983", "collection" -> "nyc_trips")
val solrDF = sqlContext.read.format("solr").options(opts).load
// Register DF as temp table
solrDF.registerTempTable("trips")
// Perform SQL queries
sqlContext.sql("SELECT avg(tip_amount), avg(fare_amount) FROM trips").show()
Shard 1
Shard 2
nyc_trips
collection
Partition 1
(spark task)
solr.DefaultSource
Partition 2
(spark task)
Spark
Driver
App
ZooKeeper
Read collection metadata
(num shards, replica URLs, etc)
Table	Scan	Query:	
q=*:*&rows=1000&	
distrib=false&cursorMark=*
Results streamed back from Solr
Parallelizequeryexecution
sqlContext.load(“solr”,opts)
DataFrame
(schema + RDD<Row>)
Map(“collection”	->	“nyc_trips”,	
					“zkHost”	->	“…”)
SolrRDD: Reading data from Solr into Spark
Shard 1
Shard 2
nyc_trips
collection
Partition 1
(spark task)
solr.DefaultSource
Partition 3
(spark task)
Spark
Driver
App
ZooKeeper
Read collection metadata
(num shards, replica URLs, etc)
Table	Scan	Query:	
q=*:*&rows=1000&	
distrib=false&cursorMark=*	
&split_field:[aa	TO	ba}
Parallelizequeryexecution
sqlContext.load(“solr”,opts)
DataFrame
(schema + RDD<Row>)
Map(“collection”	->	“nyc_trips”,	
					“zkHost”	->	“…”)
Partition 2
(spark task) &split_field:[ba	TO	bz]
Partition 4
(spark task)
&split_field:[aa	TO	ac}
&split_field:[ac	TO	ae]
SolrRDD: Reading data from Solr into Spark
Solr Streaming API for fast reads
• Contributed to spark-solr by Bloomberg team
• Extremely fast “table scans” over large result sets in Solr
• Relies on a column-oriented data structure in Lucene: docValues
• DocValues help speed up faceting and sorting too!
• Push SQL predicates down into Solr’s Parallel SQL engine available in Solr 6.x (Coming
soon)
Data-locality Hint
• SolrRDD extends RDD[SolrDocument] (written in Scala)
• Give hint to Spark task scheduler about where data lives
override def getPreferredLocations(split:	Partition):	Seq[String]	=	{	
			//	return	preferred	hostname	for	a	Solr	partition

		}	
• Useful when Spark executor and Solr replicas live on same physical host, as we do in
Fusion
• Query to a shard has a “preferred” replica; can fallback to other replicas if the
preferred goes down (will be in 2.1)
Writing to Solr (aka indexing)
• Cloud-aware client sends updates to shard leaders in parallel
• Solr Schema API used to create fields on-the-fly using the DataFrame schema
• Better parallelism than traditional approaches like Solr DIH
val dbOpts = Map(
"url" -> "jdbc:postgresql:mydb",
"dbtable" -> "schema.table",
"partitionColumn" -> "foo",
"numPartitions" -> "10")
val jdbcDF = sqlContext.read.format("jdbc").options(dbOpts).load
val solrOpts = Map("zkhost" -> "localhost:9983", "collection" -> "mycoll")
jdbcDF.write.format("solr").options(solrOpts).mode(SaveMode.Overwrite).save
• Spark ML Pipeline provides nice API for defining stages to train / predict ML models
• Crazy idea ~ use battle-hardened Lucene for text analysis in Spark
• Pipelines support import/export
• Can try different text analysis techniques during cross-validation
DF
Spark ML Pipeline
Lucene
Analyzer
HashingTF
Standard
Scaler
Trained
Model
(SVM)
save
https://lucidworks.com/blog/2016/04/13/spark-solr-lucenetextanalyzer/
Solr / Lucene Analyzers for Spark ML Pipelines
• Use Lucene text analyzers for text analysis
• Wide selection of tokenizers, filters, Character filters in Lucene
• JSON declared schema analysis definition
Lucene Text Analysis
{ "analyzers": [{ "name": "...",
"charFilters": [{ "type": "...", ...}, ... ],
"tokenizer": { "type": "...", ... },
"filters": [{ "type": "...", ... } ... ] }] },
{ "name": "...",
"charFilters": [{ "type": "...", ...}, ... ],
"tokenizer": { "type": "...", ... },
"filters": [{ "type": "...", ... }, ... ] }] } ],
"fields": [{"name": "...", "analyzer": "..."},
{ "regex": ".+", "analyzer": "..." }, ... ] }
• Word count example of spark-solr README file
Lucene Text Analysis (example)
import com.lucidworks.spark.analysis.LuceneTextAnalyzer
val schema = """{ "analyzers": [{ "name": "StdTokLower",
| "tokenizer": { "type": "standard" },
| "filters": [{ "type": "lowercase" }] }],
| "fields": [{ "regex": ".+", "analyzer": "StdTokLower" }] }
""".stripMargin
val analyzer = new LuceneTextAnalyzer(schema)
val file = sc.textFile("README.adoc")
val counts = file.flatMap(line => analyzer.analyze("anything", line))
.map(word => (word, 1))
.reduceByKey(_ + _)
.sortBy(_._2, false) // descending sort by count
println(counts.take(10).map(t => s"${t._1}(${t._2})").mkString(", "))
• Word count example of spark-solr README file with stop word filtering
Lucene Text Analysis (example)
import com.lucidworks.spark.analysis.LuceneTextAnalyzer
val schema = """{ "analyzers": [{ "name": "StdTokLower",
| "tokenizer": { "type": "standard" },
| "filters": [{ "type": "lowercase" }] },
| { "name": "StdTokLowerStop",
| "tokenizer": { "type": "standard" },
| "filters": [{ "type": "lowercase" },
| { "type": "stop" }] }],
| "fields": [{ "name": "all_tokens", "analyzer": "StdTokLower" },
| { "name": "no_stopwords", "analyzer": "StdTokLowerStop" } ]}
""".stripMargin
val analyzer = new LuceneTextAnalyzer(schema)
val file = sc.textFile("README.adoc")
val counts = file.flatMap(line => analyzer.analyze("no_stopwords", line))
.map(word => (word, 1))
.reduceByKey(_ + _)
.sortBy(_._2, false)
println(counts.take(10).map(t => s"${t._1}(${t._2})").mkString(", "))
Lucene ML transformer (example)
• LuceneTextAnalyzerTransformer as an ML pipeline transformer
• Supports multi-valued input columns
val WhitespaceTokSchema =
"""{ "analyzers": [{ "name": "ws_tok", "tokenizer": { "type": "whitespace" } }],
| "fields": [{ "regex": ".+", "analyzer": "ws_tok" }] }""".stripMargin
val StdTokLowerSchema =
"""{ "analyzers": [{ "name": "std_tok_lower", "tokenizer": { "type": "standard" },
| "filters": [{ "type": "lowercase" }] }],
| "fields": [{ "regex": ".+", "analyzer": "std_tok_lower" }] }""".stripMargin
[...]
val analyzer = new LuceneTextAnalyzerTransformer().setInputCols(contentFields).setOutputCol(WordsCol)
[..]
val pipeline = new Pipeline().setStages(Array(labelIndexer, analyzer, hashingTF, estimatorStage,
labelConverter))
• spark-solr 2.0.1 released, built into Fusion 2.4
• Users leave evidence of their needs & experience as they use your app
• Fusion closes the feedback loop to improve results based on user “signals”
• Train and serve ML Pipeline and mllib based Machine Learning models
• Run custom Scala “script” jobs in the background in Fusion
➢ Complex aggregation jobs
➢ Unsupervised learning (LDA topic modeling)
➢ Re-train supervised models as new training data flows in
Fusion & Spark
• Import package via maven. Available at spark-packages.org
./bin/spark-shell --packages "com.lucidworks.spark:spark-solr:2.0.1"
• Build from source
git clone https://github.com/Lucidworks/spark-solr
cd spark-solr
mvn clean package -DskipTests
./bin/spark-shell --jars 2.1.0-SNAPSHOT.jar
Getting started with spark-solr
// Connect to Solr
val opts = Map(
"zkhost" -> "localhost:9983",
"collection" -> "nyc_trips")
val solrDF = sqlContext.read.format("solr").options(opts).load
// Register DF as temp table
solrDF.registerTempTable("trips")
sqlContext.sql("SELECT * FROM trips LIMIT 2").show()
Example : Deep paging via shards
// Connect to Solr
val opts = Map(
"zkhost" -> "localhost:9983",
"collection" -> "nyc_trips",
"splits" -> "true")
val solrDF = sqlContext.read.format("solr").options(opts).load
// Register DF as temp table
solrDF.registerTempTable("trips")
sqlContext.sql("SELECT * FROM trips").count()
Example : Deep paging with intra shard splitting
// Connect to Solr
val opts = Map(
"zkhost" -> "localhost:9983",
"collection" -> "nyc_trips")
val solrDF = sqlContext.read.format("solr").options(opts).load
// Register DF as temp table
solrDF.registerTempTable("trips")
sqlContext.sql("SELECT avg(tip_amount), avg(fare_amount) FROM
trips").show()
Example : Streaming API (/export handler)
• NYC taxi data (30 months - 91.7M rows)
• Dataset loaded in to AWS RDS instance (Postgres)
• 3 EC2 nodes of r3.2x large instances
• Solr and Spark instances co-located together
• Collection ‘nyc-taxi’ created with 6 shards, 1 replication
• Deployed using solr-scale-tk (https://github.com/LucidWorks/solr-scale-tk)
• Dataset link: https://github.com/toddwschneider/nyc-taxi-data
• More details: https://gist.github.com/kiranchitturi/0be62fc13e4ec7f9ae5def53180ed181
Performance test
• 91.4M rows indexed to Solr in 49 minutes
• Docs per second: 31K
• JDBC batch size: 5000
• Indexing batch size: 50000
• Partitions: 200
Index performance
• Query - simple aggregation query to calculate averages
• Streaming expressions took 2.3 mins across 6 tasks
• Deep paging took 20 minutes across 120 tasks
Query performance
• Implement Datasource with Catalyst scan.
• Fallback to different replica if the preferred replica is down
• HashQParserPlugin for intra-shard splits
• Output PreAnalyzedField compatible JSON for Solr
Roadmap
THANK YOU.
kiran.chitturi@lucidworks.com / @chitturikiran
https://github.com/Lucidworks/spark-solr/
Download Fusion: http://lucidworks.com/fusion/download/

Weitere ähnliche Inhalte

Andere mochten auch

Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and SparkLucidworks
 
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)Spark Summit
 
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Lucidworks
 
ApacheCon NA 2015 Spark / Solr Integration
ApacheCon NA 2015 Spark / Solr IntegrationApacheCon NA 2015 Spark / Solr Integration
ApacheCon NA 2015 Spark / Solr Integrationthelabdude
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibJen Aman
 
Adding Search to the Hadoop Ecosystem
Adding Search to the Hadoop EcosystemAdding Search to the Hadoop Ecosystem
Adding Search to the Hadoop EcosystemCloudera, Inc.
 
Reactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark StreamingReactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark StreamingSpark Summit
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark Summit
 
Leveraging the Power of Solr with Spark
Leveraging the Power of Solr with SparkLeveraging the Power of Solr with Spark
Leveraging the Power of Solr with SparkQAware GmbH
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On SparkJen Aman
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and SparkLucidworks
 
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...Spark Summit
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Jen Aman
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLSpark Summit
 
From MapReduce to Apache Spark
From MapReduce to Apache SparkFrom MapReduce to Apache Spark
From MapReduce to Apache SparkJen Aman
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersJen Aman
 
Scalable Machine Learning Pipeline For Meta Data Discovery From eBay Listings
Scalable Machine Learning Pipeline For Meta Data Discovery From eBay ListingsScalable Machine Learning Pipeline For Meta Data Discovery From eBay Listings
Scalable Machine Learning Pipeline For Meta Data Discovery From eBay ListingsSpark Summit
 
Huawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingHuawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingJen Aman
 
Big Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the CloudBig Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the CloudJen Aman
 
Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkDatabricks
 

Andere mochten auch (20)

Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
 
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
 
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
 
ApacheCon NA 2015 Spark / Solr Integration
ApacheCon NA 2015 Spark / Solr IntegrationApacheCon NA 2015 Spark / Solr Integration
ApacheCon NA 2015 Spark / Solr Integration
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 
Adding Search to the Hadoop Ecosystem
Adding Search to the Hadoop EcosystemAdding Search to the Hadoop Ecosystem
Adding Search to the Hadoop Ecosystem
 
Reactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark StreamingReactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark Streaming
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
Leveraging the Power of Solr with Spark
Leveraging the Power of Solr with SparkLeveraging the Power of Solr with Spark
Leveraging the Power of Solr with Spark
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On Spark
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
 
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQL
 
From MapReduce to Apache Spark
From MapReduce to Apache SparkFrom MapReduce to Apache Spark
From MapReduce to Apache Spark
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
 
Scalable Machine Learning Pipeline For Meta Data Discovery From eBay Listings
Scalable Machine Learning Pipeline For Meta Data Discovery From eBay ListingsScalable Machine Learning Pipeline For Meta Data Discovery From eBay Listings
Scalable Machine Learning Pipeline For Meta Data Discovery From eBay Listings
 
Huawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingHuawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark Streaming
 
Big Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the CloudBig Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the Cloud
 
Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache Spark
 

Ähnlich wie Solr As A SparkSQL DataSource

Solr as a Spark SQL Datasource
Solr as a Spark SQL DatasourceSolr as a Spark SQL Datasource
Solr as a Spark SQL DatasourceChitturi Kiran
 
Webinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data AnalyticsWebinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data AnalyticsLucidworks
 
Spark Structured Streaming
Spark Structured Streaming Spark Structured Streaming
Spark Structured Streaming Revin Chalil
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solrthelabdude
 
Solr4 nosql search_server_2013
Solr4 nosql search_server_2013Solr4 nosql search_server_2013
Solr4 nosql search_server_2013Lucidworks (Archived)
 
Solr 6 Feature Preview
Solr 6 Feature PreviewSolr 6 Feature Preview
Solr 6 Feature PreviewYonik Seeley
 
Using spark data frame for sql
Using spark data frame for sqlUsing spark data frame for sql
Using spark data frame for sqlDaeMyung Kang
 
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)Kai Chan
 
Apache solr liferay
Apache solr liferayApache solr liferay
Apache solr liferayBinesh Gummadi
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2Fabio Fumarola
 
ZFConf 2011: Что такое Sphinx, зачем он вообще нужен и как его использовать с...
ZFConf 2011: Что такое Sphinx, зачем он вообще нужен и как его использовать с...ZFConf 2011: Что такое Sphinx, зачем он вообще нужен и как его использовать с...
ZFConf 2011: Что такое Sphinx, зачем он вообще нужен и как его использовать с...ZFConf Conference
 
04 data accesstechnologies
04 data accesstechnologies04 data accesstechnologies
04 data accesstechnologiesBat Programmer
 
20150210 solr introdution
20150210 solr introdution20150210 solr introdution
20150210 solr introdutionXuan-Chao Huang
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLDatabricks
 
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...Lucidworks
 
Spark Programming
Spark ProgrammingSpark Programming
Spark ProgrammingTaewook Eom
 
Building and Deploying Application to Apache Mesos
Building and Deploying Application to Apache MesosBuilding and Deploying Application to Apache Mesos
Building and Deploying Application to Apache MesosJoe Stein
 
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkRde:code 2017
 

Ähnlich wie Solr As A SparkSQL DataSource (20)

Solr as a Spark SQL Datasource
Solr as a Spark SQL DatasourceSolr as a Spark SQL Datasource
Solr as a Spark SQL Datasource
 
Webinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data AnalyticsWebinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data Analytics
 
Spark Structured Streaming
Spark Structured Streaming Spark Structured Streaming
Spark Structured Streaming
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solr
 
Solr4 nosql search_server_2013
Solr4 nosql search_server_2013Solr4 nosql search_server_2013
Solr4 nosql search_server_2013
 
Solr 6 Feature Preview
Solr 6 Feature PreviewSolr 6 Feature Preview
Solr 6 Feature Preview
 
Using spark data frame for sql
Using spark data frame for sqlUsing spark data frame for sql
Using spark data frame for sql
 
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache solr liferay
Apache solr liferayApache solr liferay
Apache solr liferay
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
 
ZFConf 2011: Что такое Sphinx, зачем он вообще нужен и как его использовать с...
ZFConf 2011: Что такое Sphinx, зачем он вообще нужен и как его использовать с...ZFConf 2011: Что такое Sphinx, зачем он вообще нужен и как его использовать с...
ZFConf 2011: Что такое Sphinx, зачем он вообще нужен и как его использовать с...
 
04 data accesstechnologies
04 data accesstechnologies04 data accesstechnologies
04 data accesstechnologies
 
20150210 solr introdution
20150210 solr introdution20150210 solr introdution
20150210 solr introdution
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
Sparklyr
SparklyrSparklyr
Sparklyr
 
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
 
Building and Deploying Application to Apache Mesos
Building and Deploying Application to Apache MesosBuilding and Deploying Application to Apache Mesos
Building and Deploying Application to Apache Mesos
 
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
 

Mehr von Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Mehr von Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

KĂźrzlich hochgeladen

Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectBoston Institute of Analytics
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 

KĂźrzlich hochgeladen (20)

Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 

Solr As A SparkSQL DataSource

  • 1. Solr as a Spark SQL Datasource Kiran Chitturi Lucidworks
  • 2. The standard for enterprise search. of Fortune 500 uses Solr. 90%
  • 3. Solr as a Spark SQL Data Source • Read/write data from/to Solr as DataFrame • Use Solr Schema API to access field-level metadata • Push predicates down into Solr query constructs, e.g. fq clause • Shard partitioning, intra-shard splitting, streaming results // Connect to Solr val opts = Map("zkhost" -> "localhost:9983", "collection" -> "nyc_trips") val solrDF = sqlContext.read.format("solr").options(opts).load // Register DF as temp table solrDF.registerTempTable("trips") // Perform SQL queries sqlContext.sql("SELECT avg(tip_amount), avg(fare_amount) FROM trips").show()
  • 4. Shard 1 Shard 2 nyc_trips collection Partition 1 (spark task) solr.DefaultSource Partition 2 (spark task) Spark Driver App ZooKeeper Read collection metadata (num shards, replica URLs, etc) Table Scan Query: q=*:*&rows=1000& distrib=false&cursorMark=* Results streamed back from Solr Parallelizequeryexecution sqlContext.load(“solr”,opts) DataFrame (schema + RDD<Row>) Map(“collection” -> “nyc_trips”, “zkHost” -> “…”) SolrRDD: Reading data from Solr into Spark
  • 5. Shard 1 Shard 2 nyc_trips collection Partition 1 (spark task) solr.DefaultSource Partition 3 (spark task) Spark Driver App ZooKeeper Read collection metadata (num shards, replica URLs, etc) Table Scan Query: q=*:*&rows=1000& distrib=false&cursorMark=* &split_field:[aa TO ba} Parallelizequeryexecution sqlContext.load(“solr”,opts) DataFrame (schema + RDD<Row>) Map(“collection” -> “nyc_trips”, “zkHost” -> “…”) Partition 2 (spark task) &split_field:[ba TO bz] Partition 4 (spark task) &split_field:[aa TO ac} &split_field:[ac TO ae] SolrRDD: Reading data from Solr into Spark
  • 6. Solr Streaming API for fast reads • Contributed to spark-solr by Bloomberg team • Extremely fast “table scans” over large result sets in Solr • Relies on a column-oriented data structure in Lucene: docValues • DocValues help speed up faceting and sorting too! • Push SQL predicates down into Solr’s Parallel SQL engine available in Solr 6.x (Coming soon)
  • 7. Data-locality Hint • SolrRDD extends RDD[SolrDocument] (written in Scala) • Give hint to Spark task scheduler about where data lives override def getPreferredLocations(split: Partition): Seq[String] = { // return preferred hostname for a Solr partition
 } • Useful when Spark executor and Solr replicas live on same physical host, as we do in Fusion • Query to a shard has a “preferred” replica; can fallback to other replicas if the preferred goes down (will be in 2.1)
  • 8. Writing to Solr (aka indexing) • Cloud-aware client sends updates to shard leaders in parallel • Solr Schema API used to create fields on-the-fly using the DataFrame schema • Better parallelism than traditional approaches like Solr DIH val dbOpts = Map( "url" -> "jdbc:postgresql:mydb", "dbtable" -> "schema.table", "partitionColumn" -> "foo", "numPartitions" -> "10") val jdbcDF = sqlContext.read.format("jdbc").options(dbOpts).load val solrOpts = Map("zkhost" -> "localhost:9983", "collection" -> "mycoll") jdbcDF.write.format("solr").options(solrOpts).mode(SaveMode.Overwrite).save
  • 9. • Spark ML Pipeline provides nice API for defining stages to train / predict ML models • Crazy idea ~ use battle-hardened Lucene for text analysis in Spark • Pipelines support import/export • Can try different text analysis techniques during cross-validation DF Spark ML Pipeline Lucene Analyzer HashingTF Standard Scaler Trained Model (SVM) save https://lucidworks.com/blog/2016/04/13/spark-solr-lucenetextanalyzer/ Solr / Lucene Analyzers for Spark ML Pipelines
  • 10. • Use Lucene text analyzers for text analysis • Wide selection of tokenizers, filters, Character filters in Lucene • JSON declared schema analysis definition Lucene Text Analysis { "analyzers": [{ "name": "...", "charFilters": [{ "type": "...", ...}, ... ], "tokenizer": { "type": "...", ... }, "filters": [{ "type": "...", ... } ... ] }] }, { "name": "...", "charFilters": [{ "type": "...", ...}, ... ], "tokenizer": { "type": "...", ... }, "filters": [{ "type": "...", ... }, ... ] }] } ], "fields": [{"name": "...", "analyzer": "..."}, { "regex": ".+", "analyzer": "..." }, ... ] }
  • 11. • Word count example of spark-solr README file Lucene Text Analysis (example) import com.lucidworks.spark.analysis.LuceneTextAnalyzer val schema = """{ "analyzers": [{ "name": "StdTokLower", | "tokenizer": { "type": "standard" }, | "filters": [{ "type": "lowercase" }] }], | "fields": [{ "regex": ".+", "analyzer": "StdTokLower" }] } """.stripMargin val analyzer = new LuceneTextAnalyzer(schema) val file = sc.textFile("README.adoc") val counts = file.flatMap(line => analyzer.analyze("anything", line)) .map(word => (word, 1)) .reduceByKey(_ + _) .sortBy(_._2, false) // descending sort by count println(counts.take(10).map(t => s"${t._1}(${t._2})").mkString(", "))
  • 12. • Word count example of spark-solr README file with stop word filtering Lucene Text Analysis (example) import com.lucidworks.spark.analysis.LuceneTextAnalyzer val schema = """{ "analyzers": [{ "name": "StdTokLower", | "tokenizer": { "type": "standard" }, | "filters": [{ "type": "lowercase" }] }, | { "name": "StdTokLowerStop", | "tokenizer": { "type": "standard" }, | "filters": [{ "type": "lowercase" }, | { "type": "stop" }] }], | "fields": [{ "name": "all_tokens", "analyzer": "StdTokLower" }, | { "name": "no_stopwords", "analyzer": "StdTokLowerStop" } ]} """.stripMargin val analyzer = new LuceneTextAnalyzer(schema) val file = sc.textFile("README.adoc") val counts = file.flatMap(line => analyzer.analyze("no_stopwords", line)) .map(word => (word, 1)) .reduceByKey(_ + _) .sortBy(_._2, false) println(counts.take(10).map(t => s"${t._1}(${t._2})").mkString(", "))
  • 13. Lucene ML transformer (example) • LuceneTextAnalyzerTransformer as an ML pipeline transformer • Supports multi-valued input columns val WhitespaceTokSchema = """{ "analyzers": [{ "name": "ws_tok", "tokenizer": { "type": "whitespace" } }], | "fields": [{ "regex": ".+", "analyzer": "ws_tok" }] }""".stripMargin val StdTokLowerSchema = """{ "analyzers": [{ "name": "std_tok_lower", "tokenizer": { "type": "standard" }, | "filters": [{ "type": "lowercase" }] }], | "fields": [{ "regex": ".+", "analyzer": "std_tok_lower" }] }""".stripMargin [...] val analyzer = new LuceneTextAnalyzerTransformer().setInputCols(contentFields).setOutputCol(WordsCol) [..] val pipeline = new Pipeline().setStages(Array(labelIndexer, analyzer, hashingTF, estimatorStage, labelConverter))
  • 14.
  • 15. • spark-solr 2.0.1 released, built into Fusion 2.4 • Users leave evidence of their needs & experience as they use your app • Fusion closes the feedback loop to improve results based on user “signals” • Train and serve ML Pipeline and mllib based Machine Learning models • Run custom Scala “script” jobs in the background in Fusion ➢ Complex aggregation jobs ➢ Unsupervised learning (LDA topic modeling) ➢ Re-train supervised models as new training data flows in Fusion & Spark
  • 16. • Import package via maven. Available at spark-packages.org ./bin/spark-shell --packages "com.lucidworks.spark:spark-solr:2.0.1" • Build from source git clone https://github.com/Lucidworks/spark-solr cd spark-solr mvn clean package -DskipTests ./bin/spark-shell --jars 2.1.0-SNAPSHOT.jar Getting started with spark-solr
  • 17. // Connect to Solr val opts = Map( "zkhost" -> "localhost:9983", "collection" -> "nyc_trips") val solrDF = sqlContext.read.format("solr").options(opts).load // Register DF as temp table solrDF.registerTempTable("trips") sqlContext.sql("SELECT * FROM trips LIMIT 2").show() Example : Deep paging via shards
  • 18. // Connect to Solr val opts = Map( "zkhost" -> "localhost:9983", "collection" -> "nyc_trips", "splits" -> "true") val solrDF = sqlContext.read.format("solr").options(opts).load // Register DF as temp table solrDF.registerTempTable("trips") sqlContext.sql("SELECT * FROM trips").count() Example : Deep paging with intra shard splitting
  • 19. // Connect to Solr val opts = Map( "zkhost" -> "localhost:9983", "collection" -> "nyc_trips") val solrDF = sqlContext.read.format("solr").options(opts).load // Register DF as temp table solrDF.registerTempTable("trips") sqlContext.sql("SELECT avg(tip_amount), avg(fare_amount) FROM trips").show() Example : Streaming API (/export handler)
  • 20. • NYC taxi data (30 months - 91.7M rows) • Dataset loaded in to AWS RDS instance (Postgres) • 3 EC2 nodes of r3.2x large instances • Solr and Spark instances co-located together • Collection ‘nyc-taxi’ created with 6 shards, 1 replication • Deployed using solr-scale-tk (https://github.com/LucidWorks/solr-scale-tk) • Dataset link: https://github.com/toddwschneider/nyc-taxi-data • More details: https://gist.github.com/kiranchitturi/0be62fc13e4ec7f9ae5def53180ed181 Performance test
  • 21. • 91.4M rows indexed to Solr in 49 minutes • Docs per second: 31K • JDBC batch size: 5000 • Indexing batch size: 50000 • Partitions: 200 Index performance
  • 22. • Query - simple aggregation query to calculate averages • Streaming expressions took 2.3 mins across 6 tasks • Deep paging took 20 minutes across 120 tasks Query performance
  • 23. • Implement Datasource with Catalyst scan. • Fallback to different replica if the preferred replica is down • HashQParserPlugin for intra-shard splits • Output PreAnalyzedField compatible JSON for Solr Roadmap
  • 24. THANK YOU. kiran.chitturi@lucidworks.com / @chitturikiran https://github.com/Lucidworks/spark-solr/ Download Fusion: http://lucidworks.com/fusion/download/