Data Engineering with
Solr and Spark
Grant Ingersoll
@gsingers
CTO, Lucidworks
Lucidworks Fusion Is Search-Driven Everything
Fusion is built on three core principles:
• Drive next-generation relevance via Content, Collaboration and Context
• Harness best-in-class Open Source: Apache Solr + Spark
• Simplify application development and reduce ongoing maintenance
Fusion Architecture
[Architecture diagram. Labels: REST API and Admin UI (security built in); Core Services (Connectors, Alerting/Messaging, NLP, Pipelines, Blob Storage, Scheduling, Recommenders/Signals, …); Apache Spark (Workers, Cluster Mgr.); Apache Solr (Shards), HDFS (optional); Apache ZooKeeper (ZK 1 … ZK N) for shared config management, leader election and load balancing; Connectors to Database, Web, File, Logs, Hadoop and Cloud sources.]
https://twitter.com/gsingers/status/700459516362625026
Get Started
https://github.com/Lucidworks/fusion-examples/tree/master/great-wide-open-2016
Let’s Do This
• Why Search for Data Engineering?
• Quick intro to Solr
• Quick intro to Spark
• Solr + Spark
• Relevance 101
• Machine learning with Spark and Solr
• What’s next?
Examples throughout!
The Importance of Importance
Search-Driven Everything
• Customer Service
• Customer Insights
• Fraud Surveillance
• Research Portal
• Online Retail
• Digital Content
Caveat Emptor: Data Engineering Edition
• Data Engineering, esp. with text, is a strange and magical world filled with…
– Evil villains
– Jesters
– Wizards
– Unicorns
– Heroes!
• In other words, no system will be perfect
• You will spend most of your time in data engineering, search, machine learning and NLP doing “grunt” work nicely labeled as:
– Preprocessing
– Feature Selection
– Sampling
– Validation/testing/etc.
– Content extraction
– ETL
• Corollary: Start with simple, tried and true algorithms, then iterate
Why do data engineering with Solr and Spark?
Solr
• Data exploration and visualization
• Easy ingestion and feature selection
• Powerful ranking features
• Quick and dirty classification and clustering
• Simple operation and scaling
• Stats and math built in
Spark
• Advanced machine learning: MLlib, Mahout, Deeplearning4j
• Fast, large scale iterative algorithms
• General purpose batch/streaming compute engine
• Whole collection analysis!
• Lots of integrations with other big data systems
Solr Key Features
• Full text search (Info Retr.)
• Facets/Guided Nav galore!
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene
• Grouping and Joins
• Stats, expressions, transformations and more
• Lang. Detection
• Extensible
• Massive Scale/Fault tolerance
Lucene for the Win!
• Vector Space or Probabilistic, it’s your choice!
• Killer FST
• Wicked fast
• Pluggable compression, queries, indexing and more
• Advanced Similarity Models
• Lang. Modeling, Divergence from Random, more
• Easy to plug-in ranking
Solr and Your Tools
• Data ingest:
• JSON, CSV, XML, Rich types (PDF, etc.), custom
• Clients for Python, R, Java, .NET and more
• http://cran.r-project.org/web/packages/solr/index.html, amongst others
• Output formats: JSON, CSV, XML, custom
Basics of Solr Requests
• Querying:
• Simple: term, phrases, boolean, wildcards, weights
• Advanced: query parsers, spatial, etc.
• Facets: term, query, range, pivot, stats
• Highlighting
• Spell checking
Solr Basics
Spark Key Features
• General purpose, high powered cluster computing system
• Modern, faster alternative to MapReduce
• 3x faster w/ 10x less hardware for Terasort
• Great for iterative algorithms
• APIs for Java, Scala, Python and R
• Rich set of add-on libraries for machine learning, graph processing, integrations with SQL and other systems
• Deploys: Standalone, Hadoop YARN, Mesos
Spark Basics
• Resilient Distributed Datasets
• Spark SQL provides a Data Source, which provides a DataFrame
• DataFrames: a DSL for distributed data manipulation
• Seamless integration with other Spark tech: SparkR, Python
Spark Components
[Stack diagram. Labels: languages (Scala, Java, Python, R); components (Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (BSP)); engine (Spark Core: Execution Model, The Shuffle, Caching); cluster mgmt (Hadoop YARN, Mesos, Standalone); HDFS; shared memory (Tachyon).]
Why Spark for Solr?
• Build the index very, very quickly!
• Aggregations
• Boosts, stats, iterative computations
• Offline compute to update index with additional info (e.g. PageRank, popularity)
• Whole corpus analytics, clustering, classification
• Joins with other storage (Cassandra, HDFS, DB, HBase)
Why Solr for Spark?
• Massive simplification of operations!
• Non “dumb” distributed, resilient storage
• Random access with smart queries
• Table scans
• Advanced filtering, feature selection
• Schemaless when you want, predefined when you don’t
• Spatial, columnar, sparse
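Spark + Solr in Anger
http://github.com/lucidworks/spark-solr
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.DataFrame;

// Connection options for the spark-solr data source
Map<String, String> options = new HashMap<String, String>();
options.put("zkhost", zkHost);        // ZooKeeper connection string of the SolrCloud cluster
options.put("collection", "tweets");  // Solr collection to read

// Load the collection as a DataFrame, then count documents whose type_s is "echo"
DataFrame df = sqlContext.read().format("solr").options(options).load();
long count = df.filter(df.col("type_s").equalTo("echo")).count();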
Spark Shell in a Nutshell
• Common commands
• Solr in Spark: queries, filters and other requests
• See commands.md in the GitHub repo
But is it relevant?
Tales from the trenches
Look before you leap
• Wing it
• Ask — Caveat Emptor
• Log analysis
• Experimentation: A/B (A/A) testing
Approaches
• Precision/Recall (also, Mean Avg. Precision)
• Mean Reciprocal Rank (MRR)
• Number of {Zero|Embarrassing} Results
• Inter-Annotator Agreement
• Normalized Discounted Cumulative Gain (NDCG)
Common Metrics
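To make a few of the metrics above concrete, here is a minimal sketch in plain Java. It assumes relevance judgments are already available as lists (booleans for MRR and precision, graded gains for NDCG); the class and method names are illustrative and are not part of Solr, Spark or Fusion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RelevanceMetrics {

    /** Reciprocal rank of the first relevant result (ranks start at 1), or 0 if none is relevant. */
    static double reciprocalRank(List<Boolean> relevant) {
        for (int i = 0; i < relevant.size(); i++) {
            if (relevant.get(i)) {
                return 1.0 / (i + 1);
            }
        }
        return 0.0;
    }

    /** Mean Reciprocal Rank: average reciprocal rank over all judged queries. */
    static double meanReciprocalRank(List<List<Boolean>> queries) {
        if (queries.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (List<Boolean> q : queries) {
            sum += reciprocalRank(q);
        }
        return sum / queries.size();
    }

    /** Precision at k: fraction of the top k positions holding a relevant result. */
    static double precisionAtK(List<Boolean> relevant, int k) {
        if (k <= 0) {
            return 0.0;
        }
        int hits = 0;
        for (int i = 0; i < Math.min(k, relevant.size()); i++) {
            if (relevant.get(i)) {
                hits++;
            }
        }
        return (double) hits / k;
    }

    /** Discounted Cumulative Gain: sum of gain / log2(rank + 1), ranks starting at 1. */
    static double dcg(List<Integer> gains) {
        double total = 0.0;
        for (int i = 0; i < gains.size(); i++) {
            total += gains.get(i) / (Math.log(i + 2) / Math.log(2));
        }
        return total;
    }

    /** NDCG: DCG of the ranking divided by the DCG of the ideal (descending) ordering. */
    static double ndcg(List<Integer> gains) {
        List<Integer> ideal = new ArrayList<>(gains);
        ideal.sort(Collections.reverseOrder());
        double idealDcg = dcg(ideal);
        return idealDcg == 0.0 ? 0.0 : dcg(gains) / idealDcg;
    }

    public static void main(String[] args) {
        List<List<Boolean>> judged = Arrays.asList(
                Arrays.asList(false, true),
                Arrays.asList(true),
                Arrays.asList(false, false));
        System.out.println("MRR  = " + meanReciprocalRank(judged));
        System.out.println("P@2  = " + precisionAtK(Arrays.asList(true, false, true, false), 2));
        System.out.println("NDCG = " + ndcg(Arrays.asList(1, 3, 2)));
    }
}
```

In practice the judgment lists would come from log analysis or annotators, which is where inter-annotator agreement matters.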
Tips and Traps
Big Picture on Relevance
Algorithms: The mainstay of any approach. Leverages Lucene/Solr’s built-in similarity engine, function queries and other capabilities to determine importance based on the core index.
Collective Intelligence: Especially effective for curating the long tail. Feedback from users and other systems provides key insights into importance. Can also be used to inform the business about trends and interests.
Editors/Rules: Should be used sparingly to handle key situations such as promotions and edge cases. Review often. Encourage experimentation instead. Works well for landing pages, boosts and blocks where you know the answers. Not to be confused with curating content.
Algorithms
• Similarity Models
– Default, BM25F, others
• Function Queries, Reranking, Boosts
• Phrases are almost always a win (edismax does most of this for you)
e.g.: (exact match terms)^100 AND ("termA termB…"~10)^50 AND (termA AND termB…)^10 AND (termA OR termB)
• Mind your analysis
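The layered phrase-boost pattern in the example query can be generated from a list of terms. The following is a rough sketch only: the class name, the `title_t` field and the choice to field-qualify every clause are assumptions made for illustration, and the operator joining the layers is left as a parameter because the slide writes AND while some setups may prefer OR so that partial matches still return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PhraseBoostQuery {

    /** Wraps text in double quotes, escaping embedded backslashes and quotes. */
    static String quote(String text) {
        return "\"" + text.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /**
     * Builds four layers for the given terms: exact phrase (^100), sloppy phrase
     * within 10 positions (^50), all terms required (^10), any term (no boost).
     * The boosts and slop mirror the slide; the layers are combined with `joiner`.
     */
    static String build(String field, List<String> terms, String joiner) {
        String phrase = quote(String.join(" ", terms));
        List<String> fielded = new ArrayList<>();
        for (String term : terms) {
            fielded.add(field + ":" + quote(term));
        }
        String exact = "(" + field + ":" + phrase + ")^100";
        String sloppy = "(" + field + ":" + phrase + "~10)^50";
        String allTerms = "(" + String.join(" AND ", fielded) + ")^10";
        String anyTerm = "(" + String.join(" OR ", fielded) + ")";
        return String.join(" " + joiner + " ", exact, sloppy, allTerms, anyTerm);
    }

    public static void main(String[] args) {
        // title_t is a hypothetical field name
        System.out.println(build("title_t", Arrays.asList("solr", "spark"), "OR"));
    }
}
```

As the slide notes, edismax covers most of this through its own phrase handling, so a hand-built query like this is mainly useful for seeing what the layers do.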
• UI, UI, UI!
• 1000’s of rules
• Second is the first loser
• Local minimum
• Pet peeve queries
• Oprah effect
• Assumptions
It’s a trap!
Level up
Machine Learning at Work
• Spark ships with good out-of-the-box machine learning capabilities
• Spark-Solr brings enhanced feature selection tools via Lucene analyzers
• Examples
– k-means
– word2vec
– Find synonyms
Sneak Peek
Just When You Thought SQL was Dead
Full, Parallelized, SQL Support
• Parallel Execution of SQL across SolrCloud
• Realtime Map-Reduce (“ish”) Functionality
• Parallel Relational Algebra
• Builds on streaming capabilities in 5.x
• JDBC client in the works
SQL Guts
• Lots of Functions:
• Search, Merge, Group, Unique, Parallel, Select, Reduce, innerJoin, hashJoin, Top, Rollup, Facet, Stats, Update, JDBC, Intersect, Complement, Logit
• Composable Streams
• Query optimization built in
Example
select str_s, count(*), sum(field_i), min(field_i), max(field_i), avg(field_i)
from collection1
where text='XXXX'
group by str_s
rollup(
   search(collection1,
          q="(text:XXXX)",
          qt="/export",
          fl="str_s,field_i",
          partitionKeys=str_s,
          sort="str_s asc",
          zkHost="localhost:9989/solr"),
   over=str_s,
   count(*),
   sum(field_i),
   min(field_i),
   max(field_i),
   avg(field_i))
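The expression sorts the stream by `str_s` before rolling it up, which lets the aggregation run in a single pass that emits a group each time the key changes. The sketch below illustrates that idea in plain Java over an in-memory list. It is not Solr's implementation, and the `Tuple` and `Bucket` types are hypothetical stand-ins for streamed tuples and their aggregates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RollupSketch {

    /** One input tuple: the grouping key (str_s) and a numeric value (field_i). */
    static final class Tuple {
        final String key;
        final long value;

        Tuple(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Running aggregates for one group. */
    static final class Bucket {
        final String key;
        long count;
        long sum;
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;

        Bucket(String key) {
            this.key = key;
        }

        void add(long v) {
            count++;
            sum += v;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }

        double avg() {
            return (double) sum / count;
        }
    }

    /** Single pass over tuples already sorted by key; starts a new bucket whenever the key changes. */
    static List<Bucket> rollup(List<Tuple> sortedByKey) {
        List<Bucket> out = new ArrayList<>();
        Bucket current = null;
        for (Tuple t : sortedByKey) {
            if (current == null || !current.key.equals(t.key)) {
                current = new Bucket(t.key);
                out.add(current);
            }
            current.add(t.value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tuple> sorted = Arrays.asList(new Tuple("a", 1), new Tuple("a", 3), new Tuple("b", 5));
        for (Bucket b : rollup(sorted)) {
            System.out.println(b.key + " count=" + b.count + " sum=" + b.sum
                    + " min=" + b.min + " max=" + b.max + " avg=" + b.avg());
        }
    }
}
```

Because only one bucket is open at a time, memory use does not grow with the number of groups, provided the input really is sorted on the grouping key.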
Cross Data Center Replication
Never Go Down, or at least Recover Quickly!
• Provides replication between two or more SolrCloud clusters located in two or more data centers
• Uses existing transaction logs
• Asynchronous indexing
• No Single Point of Failure or bottlenecks
• Leader-to-leader communication to ensure updates are only sent once
Make Connections, Get Better Results
• Graph Traversal
• Find all tweets mentioning “Solr” by me or people I follow
• Find all draft blog posts about “Parallel SQL” written by a developer
• Find 3-star hotels in NYC my friends stayed in last year
• BM25F Default Similarity
• Geo3D search
But Wait! There’s More!
• Jetty 9.3 and HTTP/2 (6.x)
• Fully multiplexed over a single connection
• Reduced chance of distributed deadlock
• Backup/Restore API
• Optimizations to distributed search algorithm
• AngularJS-based UI
2016
OCTOBER 13-16, 2016
BOSTON, MA
Resources
• This code: https://github.com/Lucidworks/fusion-examples/tree/master/great-wide-open-2016
• Company: http://www.lucidworks.com
• Our blog: http://www.lucidworks.com/blog
• Book: http://www.manning.com/ingersoll
• Solr: http://lucene.apache.org/solr
• Fusion: http://www.lucidworks.com/products/fusion
• Twitter: @gsingers
Appendix A: SQL details
Streaming API & Expressions
●API
○ Java API to provide programming framework
○ Returns tuples as a JSON stream
○ org.apache.solr.client.solrj.io
●Expressions
○ String Query Language
○ Serialization format
○ Allows non-Java programmers to access Streaming API
DocValues must be enabled for any field to be returned
Streaming Expression Request
curl --data-urlencode 'stream=search(sample,
                                     q="*:*",
                                     fl="id,field_i",
                                     sort="field_i asc")' http://localhost:8901/solr/sample/stream
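For callers that cannot shell out to curl, the same form body can be assembled with the JDK alone. This sketch only builds the `stream=` parameter and does not send the request; the class and helper names are hypothetical, and the expression string follows the shape of the curl example above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class StreamRequestBody {

    /** Assembles a search(...) streaming expression from its parts. */
    static String searchExpression(String collection, String q, String fl, String sort) {
        return "search(" + collection
                + ",q=\"" + q + "\""
                + ",fl=\"" + fl + "\""
                + ",sort=\"" + sort + "\")";
    }

    /** URL-encodes the expression as the value of the stream form parameter. */
    static String formBody(String expression) {
        try {
            return "stream=" + URLEncoder.encode(expression, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String expr = searchExpression("sample", "*:*", "id,field_i", "field_i asc");
        System.out.println(expr);
        System.out.println(formBody(expr));
    }
}
```

The resulting string would be posted with content type application/x-www-form-urlencoded to the collection's /stream handler, as in the curl call.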
Streaming Expression Response
{"responseHeader": {"status": 0, "QTime": 1},
    "tuples": {
        "numFound": -1,
        "start": -1,
        "docs": [
            {"id": "doc1", "field_i": 1},
            {"id": "doc2", "field_i": 2},
            {"EOF": true}]
    }}
Architecture
●MapReduce-ish
○ Borrows Shuffling concept from M/R
●Logical tiers for performing the query
○ SQL tier: translates SQL to streaming expressions for parallel query plan, selects worker nodes, merges results
○ Worker tier: executes parallel query plan, streams tuples from data tables back
○ Data Table tier: queries SolrCloud collections, performs initial sort and partitioning of results for worker nodes
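The shuffle step borrowed from MapReduce comes down to routing every tuple with the same partition key to the same worker. A minimal sketch of that routing follows, assuming a plain hash-modulo scheme. Solr's actual hashing and worker assignment may differ, and the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShufflePartitioner {

    /** Maps a partition key to a worker index in [0, numWorkers). */
    static int workerFor(String key, int numWorkers) {
        // floorMod keeps the result non-negative even when hashCode() is negative
        return Math.floorMod(key.hashCode(), numWorkers);
    }

    /** Groups keys by the worker they are routed to, preserving input order per worker. */
    static Map<Integer, List<String>> partition(List<String> keys, int numWorkers) {
        Map<Integer, List<String>> byWorker = new TreeMap<>();
        for (String key : keys) {
            byWorker.computeIfAbsent(workerFor(key, numWorkers), w -> new ArrayList<>()).add(key);
        }
        return byWorker;
    }

    public static void main(String[] args) {
        System.out.println(partition(Arrays.asList("a", "b", "a", "c"), 2));
    }
}
```

Since equal keys always land on the same worker, each worker can aggregate its groups independently, and the SQL tier only has to merge the per-worker results.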
JDBC Client
●Parallel SQL includes a “thin” JDBC client
●Expanded to include SQL Clients such as DbVisualizer (SOLR-8502)
●Client only works with Parallel SQL features
Learning More
Joel Bernstein’s presentation at Lucene Revolution:
●https://www.youtube.com/watch?v=baWQfHWozXc
Apache Solr Reference Guide:
●https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
●https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface
Spark Architecture
[Architecture diagram, summarized:]
• My Spark App: my-spark-job.jar (w/ shaded deps); the SparkContext (driver) holds the RDD Graph, DAG Scheduler, Block tracker and Shuffle tracker
• Spark Master (daemon): keeps track of live workers, Web UI on port 8080, Task Scheduler, restarts failed tasks
• Losing a master prevents new applications from being executed; HA can be achieved using ZooKeeper and multiple master nodes
• Spark Worker Node (1...N of these): Spark Slave (daemon) plus a Spark Executor (JVM process) with Tasks and a Cache; the executor runs in a separate process from the slave daemon
• Each task works on some partition of a data set to apply a transformation or action
• Tasks are assigned based on data locality: when selecting which node to execute a task on, the master takes data locality into account

Data Engineering with Solr and Spark

  • 1.
  • 3. Data Engineering with Solr and Spark Grant Ingersoll @gsingers CTO, Lucidworks
  • 4. Lucidworks  Fusion  Is  Search-­‐Driven  Everything •Drive  next  genera=on  relevance   via  Content,  Collabora=on  and   Context   •Harness  best  in  class  Open   Source:  Apache  Solr  +  Spark   •Simplify  applica=on   development  and  reduce   ongoing  maintenance Fusion  is  built  on  three   core  principles:
  • 5. Fusion  Architecture RESTAPI Worker Worker Cluster Mgr. Apache Spark Shards Shards Apache Solr HDFS(Optional) Shared Config Mgmt Leader Election Load Balancing ZK 1 Apache Zookeeper ZK N DATABASEWEBFILELOGSHADOOP CLOUD Connectors Aler=ng/Messaging NLP Pipelines Blob  Storage Scheduling Recommenders/Signals … Core Services Admin  UI SECURITY  BUILT-­‐IN
  • 8. • Why  Search  for  Data  Engineering?   • Quick  intro  to  Solr   • Quick  intro  to  Spark   • Solr  +  Spark   • Relevance  101   • Machine  learning  with  Spark  and  Solr   • What’s  next? Let’s  Do  This Examples  throughout!
  • 9. The Importance of Importance
  • 10. Search-Driven Everything: Customer Service, Customer Insights, Fraud Surveillance, Research Portal, Online Retail, Digital Content
  • 11. Caveat Emptor: Data Engineering Edition • Data Engineering, esp. with text, is a strange and magical world filled with… – Evil villains – Jesters – Wizards – Unicorns – Heroes! • In other words, no system will be perfect
  • 12. • You will spend most of your time in data engineering, search, machine learning and NLP doing “grunt” work nicely labeled as: – Preprocessing – Feature Selection – Sampling – Validation/testing/etc. – Content extraction – ETL • Corollary: Start with simple, tried and true algorithms, then iterate
  • 13. Why do data engineering with Solr and Spark? Solr: • Data exploration and visualization • Easy ingestion and feature selection • Powerful ranking features • Quick and dirty classification and clustering • Simple operation and scaling • Stats and math built in. Spark: • Advanced machine learning: MLlib, Mahout, Deeplearning4j • Fast, large scale iterative algorithms • General purpose batch/streaming compute engine • Whole collection analysis! • Lots of integrations with other big data systems
  • 14. Solr Key Features • Full text search (Info Retr.) • Facets/Guided Nav galore! • Lots of data types • Spelling, auto-complete, highlighting • Cursors • More Like This • De-duplication • Apache Lucene • Grouping and Joins • Stats, expressions, transformations and more • Lang. Detection • Extensible • Massive Scale/Fault tolerance
  • 15. Lucene  for  the  Win! • Vector Space or Probabilistic, it’s your choice! • Killer FST • Wicked fast • Pluggable compression, queries, indexing and more • Advanced Similarity Models • Lang. Modeling, Divergence from Random, more • Easy to plug-in ranking
  • 16. Solr  and  Your  Tools • Data ingest: • JSON, CSV, XML, Rich types (PDF, etc.), custom • Clients for Python, R, Java, .NET and more • http://cran.r-project.org/web/packages/solr/index.html, amongst others • Output formats: JSON, CSV, XML, custom
  • 17. Basics  of  Solr  Requests • Querying: • Simple: term, phrases, boolean, wildcards, weights • Advanced: query parsers, spatial, etc. • Facets: term, query, range, pivot, stats • Highlighting • Spell checking
  • 19. Spark  Key  Features • General purpose, high powered cluster computing system • Modern, faster alternative to MapReduce • 3x faster w/ 10x less hardware for Terasort • Great for iterative algorithms • APIs for Java, Scala, Python and R • Rich set of add-on libraries for machine learning, graph processing, integrations with SQL and other systems • Deploys: Standalone, Hadoop YARN, Mesos
  • 20. Spark  Basics • Resilient Distributed Datasets • Spark SQL provides a Data Source, which provides a DataFrame • DataFrames — a DSL for distributed data manipulation • Seamless integration with other Spark tech: SparkR, Python
  • 21. Spark Components (diagram) • Languages: Scala, Java, Python, R • Components: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (BSP) • Engine: Spark Core (Execution Model, The Shuffle, Caching) • Cluster mgmt: Hadoop YARN, Mesos, Standalone • Storage: HDFS • Shared memory: Tachyon
  • 22. Why  Spark  for  Solr? • Build the index very, very quickly! • Aggregations • Boosts, stats, iterative computations • Offline compute to update index with additional info (e.g. PageRank, popularity) • Whole corpus analytics, clustering, classification • Joins with other storage (Cassandra, HDFS, DB, HBase)
  • 23. Why  Solr  for  Spark? • Massive simplification of operations! • Non “dumb” distributed, resilient storage • Random access with smart queries • Table scans • Advanced filtering, feature selection • Schemaless when you want, predefined when you don’t • Spatial, columnar, sparse
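  • 24. Spark + Solr in Anger http://github.com/lucidworks/spark-solr
 // Read a Solr collection into a Spark DataFrame through the spark-solr data source
 Map<String, String> options = new HashMap<String, String>();
 options.put("zkhost", zkHost);        // ZooKeeper connection string of the SolrCloud cluster
 options.put("collection", "tweets");  // Solr collection to read
 
 DataFrame df = sqlContext.read().format("solr").options(options).load();
 // Count the documents whose type_s field equals "echo"
 long count = df.filter(df.col("type_s").equalTo("echo")).count();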
  • 25. Spark  Shell  in  a  Nutshell • Common commands • Solr in Spark: queries, filters and other requests • See commands.md in the Github repo
  • 26.
  • 27. But is it relevant?
  • 30. Approaches • Wing it • Ask (Caveat Emptor) • Log analysis • Experimentation: A/B (A/A) testing
  • 31. Common Metrics • Precision/Recall (also, Mean Avg. Precision) • Mean Reciprocal Rank (MRR) • Number of {Zero|Embarrassing} Results • Inter-Annotator Agreement • Normalized Discounted Cumulative Gain (NDCG)
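The metrics on this slide can be sketched in a few lines of plain Java. This is a minimal illustration, not code from the talk: the class name `RelevanceMetrics`, the method names, and the choice of the `(2^gain - 1) / log2(rank + 1)` form of DCG are all assumptions, and the "ideal" ordering for NDCG is computed only from the gains of the results that were actually returned.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical helper class; names and formulas are illustrative assumptions.
final class RelevanceMetrics {
    private RelevanceMetrics() {}

    /** Fraction of the top-k slots filled with relevant documents. */
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        if (k <= 0) throw new IllegalArgumentException("k must be positive");
        return (double) hitsInTopK(ranked, relevant, k) / k;
    }

    /** Fraction of all relevant documents that appear in the top k. */
    static double recallAtK(List<String> ranked, Set<String> relevant, int k) {
        if (relevant.isEmpty()) return 0.0;
        return (double) hitsInTopK(ranked, relevant, k) / relevant.size();
    }

    /** 1 / rank of the first relevant document, or 0 if none was returned. */
    static double reciprocalRank(List<String> ranked, Set<String> relevant) {
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) return 1.0 / (i + 1);
        }
        return 0.0;
    }

    /** Mean of the reciprocal ranks over a set of queries. */
    static double meanReciprocalRank(List<List<String>> rankings, List<Set<String>> relevants) {
        if (rankings.size() != relevants.size()) {
            throw new IllegalArgumentException("one relevant set per ranking is required");
        }
        if (rankings.isEmpty()) return 0.0;
        double sum = 0.0;
        for (int i = 0; i < rankings.size(); i++) {
            sum += reciprocalRank(rankings.get(i), relevants.get(i));
        }
        return sum / rankings.size();
    }

    /** NDCG over graded gains listed in the order the results were returned. */
    static double ndcg(List<Integer> gains) {
        List<Integer> ideal = new ArrayList<>(gains);
        ideal.sort(Collections.reverseOrder());
        double idcg = dcg(ideal);
        return idcg == 0.0 ? 0.0 : dcg(gains) / idcg;
    }

    private static double dcg(List<Integer> gains) {
        double sum = 0.0;
        for (int i = 0; i < gains.size(); i++) {
            double log2OfRankPlusOne = Math.log(i + 2) / Math.log(2);
            sum += (Math.pow(2, gains.get(i)) - 1) / log2OfRankPlusOne;
        }
        return sum;
    }

    private static int hitsInTopK(List<String> ranked, Set<String> relevant, int k) {
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) hits++;
        }
        return hits;
    }
}
```

In practice the ranked lists would come from query logs or a judged test set; the sketch only covers the arithmetic.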
  • 33. Big Picture on Relevance • Algorithms: The mainstay of any approach: leverages Lucene/Solr’s built in similarity engine, function queries and other capabilities to determine importance based on core index. • Collective Intelligence: Especially effective for curating the long tail, feedback from users and other systems provide key insights into importance. Can also be used to inform the business about trends and interests. • Editors/Rules: Should be used sparingly to handle key situations such as promotions and edge cases. Review often. Encourage experimentation instead. Works well for landing pages, boosts and blocks where you know the answers. Not to be confused with curating content.
  • 34. Algorithms • Similarity Models: Default, BM25F, others • Function Queries, Reranking, Boosts • Phrases are almost always a win (edismax does most of this for you), e.g.: (exact match terms)^100 AND ("termA termB…"~10)^50 AND (termA AND termB…)^10 AND (termA OR termB) • Mind your analysis
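The layered boost pattern on this slide can be generated mechanically from a list of query terms. The sketch below is an assumption-laden illustration: `LayeredQuery` and `build` are hypothetical names, "exact match terms" is interpreted as an exact phrase, the terms are assumed to be plain tokens that need no escaping, and no field names are applied. The slide joins the clauses with AND; the joining operator is a parameter here so that OR can be passed when looser matches should still be returned.

```java
import java.util.List;

// Hypothetical helper; mirrors the slide's boost tiers (100 / 50 / 10 / none).
final class LayeredQuery {
    private LayeredQuery() {}

    /**
     * Builds a Lucene-syntax query string with four tiers:
     * exact phrase, sloppy phrase (slop 10), all terms, any term.
     */
    static String build(List<String> terms, String joiner) {
        if (terms.isEmpty()) {
            throw new IllegalArgumentException("need at least one term");
        }
        String phrase = String.join(" ", terms);
        String exact = "(\"" + phrase + "\")^100";
        String sloppy = "(\"" + phrase + "\"~10)^50";
        String allTerms = "(" + String.join(" AND ", terms) + ")^10";
        String anyTerm = "(" + String.join(" OR ", terms) + ")";
        return String.join(" " + joiner + " ", exact, sloppy, allTerms, anyTerm);
    }
}
```

As the slide notes, edismax does most of this for you, so a hand-built string like this is mainly useful for seeing what the tiers look like.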
  • 35. It’s a trap! • UI, UI, UI! • 1000’s of rules • Second is the first loser • Local minimum • Pet peeve queries • Oprah effect • Assumptions
  • 37. Machine Learning at Work • Spark ships with good out of the box machine learning capabilities • Spark-Solr brings enhanced feature selection tools via Lucene analyzers • Examples: k-means, word2vec, Find synonyms
  • 39. Just When You Thought SQL was Dead: Full, Parallelized, SQL Support • Parallel Execution of SQL across SolrCloud • Realtime Map-Reduce (“ish”) Functionality • Parallel Relational Algebra • Builds on streaming capabilities in 5.x • JDBC client in the works
  • 40. SQL Guts • Lots of Functions: Search, Merge, Group, Unique, Parallel, Select, Reduce, innerJoin, hashJoin, Top, Rollup, Facet, Stats, Update, JDBC, Intersect, Complement, Logit • Composable Streams • Query optimization built in. Example SQL, followed by the equivalent streaming expression:
 select str_s, count(*), sum(field_i), min(field_i), max(field_i), avg(field_i) from collection1 where text='XXXX' group by str_s
 rollup(
     search(collection1, q="(text:XXXX)", qt="/export", fl="str_s,field_i", partitionKeys=str_s, sort="str_s asc", zkHost="localhost:9989/solr"),
     over=str_s, count(*), sum(field_i), min(field_i), max(field_i), avg(field_i))
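To make the rollup example concrete, here is a small in-memory sketch of what a rollup over a sorted tuple stream computes. It is not Solr code: `RollupSketch`, `Tuple` and `Bucket` are hypothetical names, and the sketch assumes the input is already sorted by the grouping key, as the `sort="str_s asc"` parameter in the expression above arranges.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a streaming group-by over key-sorted tuples.
final class RollupSketch {
    private RollupSketch() {}

    /** One (str_s, field_i) pair from the underlying search stream. */
    static final class Tuple {
        final String key;
        final long value;

        Tuple(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Running count/sum/min/max for one group; avg is derived on demand. */
    static final class Bucket {
        final String key;
        long count;
        long sum;
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;

        Bucket(String key) {
            this.key = key;
        }

        void add(long v) {
            count++;
            sum += v;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }

        double avg() {
            return (double) sum / count;
        }
    }

    /** Single pass: a new bucket starts whenever the key changes. */
    static List<Bucket> rollup(List<Tuple> sortedByKey) {
        List<Bucket> out = new ArrayList<>();
        Bucket current = null;
        for (Tuple t : sortedByKey) {
            if (current == null || !current.key.equals(t.key)) {
                current = new Bucket(t.key);
                out.add(current);
            }
            current.add(t.value);
        }
        return out;
    }
}
```

Because the input is sorted, only one bucket is open at a time, which is what lets this style of aggregation run over a stream without holding every group in memory.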
  • 41. Cross Data Center Replication: Never Go Down, or at least Recover Quickly! • Provides replication between two or more SolrCloud clusters located in two or more data centers • Uses existing transaction logs • Asynchronous indexing • No Single Point of Failure or bottlenecks • Leader-to-leader communication to ensure updates are only sent once
  • 42. Make Connections, Get Better Results • Graph Traversal • Find all tweets mentioning “Solr” by me or people I follow • Find all draft blog posts about “Parallel SQL” written by a developer • Find 3-star hotels in NYC my friends stayed in last year • BM25F Default Similarity • Geo3D search
  • 43. But Wait! There’s More! • Jetty 9.3 and http2 (6.x) • Fully multiplexed over a single connection • Reduced chance of distributed deadlock • Backup/Restore API • Optimizations to distributed search algorithm • AngularJS-based UI
  • 45. Resources • This code: https://github.com/Lucidworks/fusion-examples/tree/master/great-wide-open-2016 • Company: http://www.lucidworks.com • Our blog: http://www.lucidworks.com/blog • Book: http://www.manning.com/ingersoll • Solr: http://lucene.apache.org/solr • Fusion: http://www.lucidworks.com/products/fusion • Twitter: @gsingers
  • 46. Appendix  A:  SQL  details
  • 47. Streaming API & Expressions ●API ○ Java API to provide programming framework ○ Returns tuples as a JSON stream ○ org.apache.solr.client.solrj.io   ●Expressions ○ String Query Language ○ Serialization format ○ Allows non-Java programmers to access Streaming API DocValues must be enabled for any field to be returned
  • 48. Streaming Expression Request
 curl --data-urlencode 'stream=search(sample, q="*:*", fl="id,field_i", sort="field_i asc")' http://localhost:8901/solr/sample/stream
  • 49. Streaming Expression Response
 {"responseHeader": {"status": 0, "QTime": 1}, "tuples": {"numFound": -1, "start": -1, "docs": [{"id": "doc1", "field_i": 1}, {"id": "doc2", "field_i": 2}, {"EOF": true}]}}
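For callers not using curl, the request on slide 48 amounts to a form-encoded POST body with a single `stream` parameter. The sketch below only builds that body and the endpoint URL, using the JDK's `URLEncoder`; `StreamRequest`, `formBody` and `streamUrl` are hypothetical names, and actually sending the request and reading tuples until the `EOF` marker is left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; prepares what `curl --data-urlencode 'stream=...'` would send.
final class StreamRequest {
    private StreamRequest() {}

    /** URL-encodes the expression and prefixes it with the "stream" parameter name. */
    static String formBody(String expression) {
        try {
            return "stream=" + URLEncoder.encode(expression, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException("UTF-8 is not available", e);
        }
    }

    /** Builds the /stream handler URL, e.g. http://localhost:8901/solr/sample/stream. */
    static String streamUrl(String solrBaseUrl, String collection) {
        return solrBaseUrl + "/" + collection + "/stream";
    }
}
```

The body would be posted with the content type `application/x-www-form-urlencoded`, which is what curl uses for `--data-urlencode`.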
  • 50. Architecture ●MapReduce-ish ○ Borrows Shuffling concept from M/R ●Logical tiers for performing the query ○ SQL tier: translates SQL to streaming expressions for parallel query plan, selects worker nodes, merges results ○ Worker tier: executes parallel query plan, streams tuples from data tables back ○ Data Table tier: queries SolrCloud collections, performs initial sort and partitioning of results for worker nodes
  • 51. JDBC Client ●Parallel SQL includes a “thin” JDBC client ●Expanded to include SQL Clients such as DbVisualizer (SOLR-8502) ●Client only works with Parallel SQL features
  • 52. Learning More Joel Bernstein’s presentation at Lucene Revolution: ●https://www.youtube.com/watch?v=baWQfHWozXc Apache Solr Reference Guide: ●https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions ●https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface
  • 53. Spark Architecture (diagram) • My Spark App: my-spark-job.jar (w/ shaded deps) with SparkContext (driver): RDD Graph, DAG Scheduler, Block tracker, Shuffle tracker • Spark Master (daemon): Keeps track of live workers, Web UI on port 8080, Task Scheduler, Restart failed tasks. Losing a master prevents new applications from being executed. Can achieve HA using ZooKeeper and multiple master nodes. Tasks are assigned based on data-locality: when selecting which node to execute a task on, the master takes into account data locality • Spark Worker Node (1...N of these): Spark Slave (daemon) and Spark Executor (JVM process) with Tasks and Cache. Executor runs in a separate process from the slave daemon. Each task works on some partition of a data set to apply a transformation or action