REMINDER
Check in on the COLLABORATE
mobile app
Building Recommendation
Platforms with Hadoop
Prepared by:
Jayant Shekhar
Sr. Solutions Architect
Cloudera
Agenda
■ Why Big Data Recommendation Platform?
■ Common Recommendation Patterns & Algorithms
■ Lambda Architecture
■ Architecture & Design of Computation & Serving Layers
■ Social Recommendations with Giraph
■ Recommendations with Solr
■ Recommendation with Storm/HBase
Recommendations are one of the most common use cases of Hadoop
Recommendations Broader Use Cases
• Product Recommendation
• People/Social Recommendation
• Merchant Recommendation
• Content Recommendation
• Query Recommendation
• Sponsored Search Advertising
Recommendations can be
• Realtime
  ▪ News Recommendations
  ▪ Merchant/Offer Recommendations on mobile
• Offline
  ▪ Similar Profiles/Resumes
• In Between
Recommendations are delivered through
• Web
• Mobile
• Email
• Postal mail
• Newspaper/Magazine ads
Data Sets Involved
• Items/Products/Content
• Transaction Data
• User Data
• Logs & User Activity
• Additional 3rd Party Data
• Geo
• Social
• Reviews
• …
Different Time to Action Targeted
• User would view the content/buy the product now
• User would buy the product in a week
  ▪ Next time he/she goes grocery shopping
• User would buy the product in the next 3 months
  ▪ TV/Dishwasher etc.
  ▪ Vacation
Also need to be able to determine/differentiate between the users in a household
Common Recommendation
Patterns & associated
Algorithms
Some ML Algorithms used for Recommendations
• Collaborative Filtering
  ▪ ALS
  ▪ SVD
  ▪ Slope One Recommender
• Clustering
  ▪ K-means
  ▪ Canopy
  ▪ Fuzzy K-Means
• Classification & Regression
  ▪ Logistic Regression
  ▪ Naïve Bayes
  ▪ Random Forest
• Pattern Mining
  ▪ Parallel FP-Growth
Product Recommendation
Use Cases
• Recommend Product
• Recommend Movies/Videos
Algorithms
• Collaborative Filtering
• Logistic Regression
Frequently bought/viewed together
Use Cases
• Find items that are frequently
bought together
• Related Searches/Query
Suggestion
• View Item Page
Algorithms
• Parallel FP-Growth: http://infolab.stanford.edu/~echang/recsys08-69.pdf
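A minimal sketch of the "frequently bought together" idea, assuming baskets are plain lists of item names. It counts co-occurring pairs against a minimum support threshold; this is a simplified single-machine stand-in, not Parallel FP-Growth itself, and the function name and sample baskets are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that co-occur in at least `min_support` baskets."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) gives every unordered pair one canonical form
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical transaction data
baskets = [
    ["milk", "bread", "butter"],
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk", "bread", "eggs"],
]
pairs = frequent_pairs(baskets, min_support=2)
```

FP-Growth generalizes this from pairs to itemsets of any size without enumerating every candidate combination.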
Related Searches or Query Recommendations
Design
• Use Query Log Data
• Cluster similar queries
• Use Parallel FP Growth to
find the related searches
Query Distance
• Based on keywords or phrases
• Based on searches in the same
session
• Based on common clicked URLs
• Based on the distance of the
clicked documents
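One possible way to combine two of the query distances listed above (keywords and common clicked URLs) is a weighted Jaccard distance. This is a sketch under assumptions: the weights, the `clicks` mapping, and the function names are illustrative, not taken from the slides.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def query_distance(q1, q2, clicks, w_terms=0.5, w_clicks=0.5):
    """Blend keyword distance with clicked-URL distance (weights are assumptions)."""
    term_d = jaccard_distance(q1.lower().split(), q2.lower().split())
    click_d = jaccard_distance(clicks.get(q1, ()), clicks.get(q2, ()))
    return w_terms * term_d + w_clicks * click_d


# Hypothetical query log: query -> set of clicked URLs
clicks = {
    "hadoop tutorial": {"a.example/1", "b.example/2"},
    "hadoop guide": {"b.example/2", "c.example/3"},
}
d = query_distance("hadoop tutorial", "hadoop guide", clicks)
```

Queries whose distance falls under a chosen threshold can then be clustered together as related searches.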
Related Articles/News
• Batch clustering with K-Means
• NRT clustering using the centroids
• Perform canopy on left over articles
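The NRT step can be sketched as a nearest-centroid lookup: a new article joins the closest batch-computed cluster, or is set aside as a leftover when nothing is close enough. The vectors, the `max_distance` threshold, and the function name are assumptions for illustration.

```python
import math


def assign_to_centroid(vector, centroids, max_distance):
    """Return the index of the nearest centroid, or None if it is too far away."""
    best_idx, best_dist = None, float("inf")
    for idx, centroid in enumerate(centroids):
        dist = math.dist(vector, centroid)  # Euclidean distance
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx if best_dist <= max_distance else None


# Hypothetical centroids from the last batch K-Means run
centroids = [(0.0, 0.0), (10.0, 10.0)]
leftovers = []
for article in [(1.0, 1.0), (9.0, 9.0), (5.0, -20.0)]:
    if assign_to_centroid(article, centroids, max_distance=3.0) is None:
        leftovers.append(article)  # candidates for a later canopy pass
```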
Social/People Recommendations
Use Case
• Recommend Missing Links in a
Social Network
• Bipartite Matching –
Recommend Men/Women
Design
• Take existing edges and friend of
friends
• Build Regression Models based on
latest activity
• Scale easily offline with Hadoop as
number of friends of friends and
activities could be very high.
• Giraph
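A small sketch of the candidate-generation step in the design above: take existing edges, walk to friends of friends, and rank candidates by mutual-friend count. The regression models on latest activity are not covered here; the graph and function name are hypothetical.

```python
from collections import Counter


def friend_of_friend_candidates(graph, user):
    """Rank non-friends reachable in two hops by number of mutual friends."""
    friends = graph.get(user, set())
    counts = Counter()
    for friend in friends:
        for fof in graph.get(friend, set()):
            if fof != user and fof not in friends:
                counts[fof] += 1
    return counts.most_common()


# Hypothetical undirected friendship graph as adjacency sets
graph = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d", "e"},
    "d": {"b", "c"},
    "e": {"c"},
}
candidates = friend_of_friend_candidates(graph, "a")
```

The mutual-friend count could serve as one feature of the regression model, alongside activity-based features.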
Lambda Architecture
[Diagram: a New Data Stream feeds All Data, from which Batch Views are pre-computed, and Stream Processing, which maintains a Realtime View; a Query reads both the Batch Views and the Realtime View]
Lambda architecture proposed by Nathan Marz, creator of Storm
Lambda Architecture
Large Scale Offline Batch
+
Real-time Online Streaming
■ Batch Layer : offline, asynchronous
■ Serving Layer : real-time, incremental, approximate
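To make the batch-plus-streaming split concrete, here is a minimal sketch of a query that merges a precomputed batch view with a realtime view. It assumes both views are simple count maps and that the realtime view only covers events newer than the last batch run; all names and numbers are illustrative.

```python
def merged_count(key, batch_view, realtime_view):
    """Combine the precomputed batch count with the realtime increment."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


def top_items(batch_view, realtime_view, n):
    """Rank every known item by its merged count (ties broken by name)."""
    keys = set(batch_view) | set(realtime_view)
    ranked = sorted(keys, key=lambda k: (-merged_count(k, batch_view, realtime_view), k))
    return ranked[:n]


# Hypothetical views: batch covers data up to the last run, realtime the rest
batch_view = {"item-1": 120, "item-2": 80}
realtime_view = {"item-2": 45, "item-3": 7}
best = top_items(batch_view, realtime_view, n=2)
```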
Computation & Serving Layers
for Recommendations
Oryx Lambda Architecture
Closer View of Oryx Serving & Model Generation
[Diagram: three Serving Layer replicas behind an API; a Computation Layer; HDFS holding Generation 0, Generation 1, Generation 2 and Generation 3]
Generation directory contains:
• Input data
• Configuration
• Model
Feature Generation & Model Building
[Diagram: Data in HDFS → Feature Generation → Model Generation → Models]
Hadoop makes it easy to iterate on model generation and to test the resulting models offline.
Requirements for ML on Hadoop
■ Model Building
▪ Large Scale Distributed
▪ Continuous
■ Model Serving
▪ Real-time query
▪ Real-time updates
■ Algorithms
▪ Parallelizable
▪ Updateable
■ Interoperable
▪ PMML model format
▪ Simple REST API
▪ Open Source
Computation Layer Vs Serving Layer
■ Computation Layer
▪ Periodically builds generation from recent data and past model
▪ Babysits the MapReduce job
▪ Publishes Model
■ Serving Layer
▪ Consumes Model
▪ Serves queries from model in memory
▪ Updates the model from new input
▪ Also writes input to HDFS
▪ Replicas for scale
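The serving-layer responsibilities above can be sketched with a toy in-memory class. This is not Oryx's API: the class, its method names, and the popularity-score "model" are assumptions, and a plain list stands in for input written to HDFS.

```python
class ServingLayer:
    """Toy serving layer: consumes a model, serves queries, updates from input."""

    def __init__(self):
        self.generation = -1
        self.item_scores = {}    # the in-memory model
        self.pending_input = []  # stand-in for input appended to HDFS

    def load_generation(self, generation, item_scores):
        """Consume a published model, ignoring generations that are not newer."""
        if generation > self.generation:
            self.generation = generation
            self.item_scores = dict(item_scores)

    def ingest(self, item, weight=1.0):
        """Record new input for the next batch build and update the model now."""
        self.pending_input.append((item, weight))
        self.item_scores[item] = self.item_scores.get(item, 0.0) + weight

    def top(self, n):
        """Serve a query from the in-memory model."""
        ranked = sorted(self.item_scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [item for item, _ in ranked[:n]]


layer = ServingLayer()
layer.load_generation(0, {"a": 2.0, "b": 1.0})
layer.ingest("b")
layer.ingest("b")
```

The computation layer would later fold `pending_input` into the next generation and publish it for `load_generation` to pick up.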
Collaborative Filtering : ALS
■ Alternating Least Squares
■ Matrix Factorization
■ Faster than SVD
■ Real-time update
■ Parallelizable
Clustering : K-means++
■ Well-known and understood
■ Parallelizable
■ Clusters Updateable
■ Obtains an initial set of centers that is close to the optimum
solution.
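The seeding step that gives K-means++ its good initial centers can be sketched as follows: the first center is picked uniformly, and each later center is sampled with probability proportional to its squared distance from the nearest center chosen so far. The function name, seed, and sample points are assumptions; the sketch expects at least `k` distinct points.

```python
import random


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers using K-means++ style D^2 sampling."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from every point to its nearest chosen center
        d2 = [
            min(sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers)
            for point in points
        ]
        # Already-chosen points have weight 0 and cannot be drawn again
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = kmeans_pp_init(points, k=2)
```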
Classification/Regression : RDF
■ Random Decision Forests
■ Ensemble Method
■ Numerical, Categorical features and target
■ Very Parallel
■ Nodes Updateable
Social Recommendation with
Giraph
Graph Use Cases
• Social Recommendations
• Recommend missing links in a social network
• Twitter Graph
• Who to follow
• Similar To
• Bipartite Matching
• Matching job/employees, men/women
• How are users connected
• Clustering – find related people in groups
Giraph
■ Each vertex has an id, a value, a list of its adjacent
neighbour ids and the corresponding edge values
■ Edges are always directed
▪ Out-edges attached to a node
▪ Nodes can’t see inbound edges
■ Nodes communicate via messages
■ No remote reads
Giraph BSP
■ Input is a directed graph
■ Each vertex is invoked in each superstep, can
recompute its value and send messages to other
vertices, which are delivered over superstep barriers
■ This is done till every Vertex votes to halt
■ Output is a directed graph
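The superstep loop can be illustrated with a single-process simulation of the classic "propagate the maximum value" computation. This is a sketch of the BSP model only, not Giraph's Java API; the function name and sample graph are made up.

```python
def run_max_value(out_edges, values):
    """Simulate BSP supersteps until every vertex has halted with no pending messages."""
    values = dict(values)
    inbox = {v: [] for v in values}
    active = set(values)  # every vertex is invoked in superstep 0
    superstep = 0
    while active:
        outbox = {v: [] for v in values}
        for v in sorted(active):
            changed = False
            if inbox[v] and max(inbox[v]) > values[v]:
                values[v] = max(inbox[v])
                changed = True
            if superstep == 0 or changed:
                # Send the current value along out-edges only
                for target in out_edges.get(v, []):
                    outbox[target].append(values[v])
            # Each vertex implicitly votes to halt once its compute step ends
        # Superstep barrier: messages become visible in the next superstep,
        # and only vertices that received a message are woken up again
        inbox = outbox
        active = {v for v, msgs in inbox.items() if msgs}
        superstep += 1
    return values


# Hypothetical directed ring 1 -> 2 -> 3 -> 1
out_edges = {1: [2], 2: [3], 3: [1]}
result = run_max_value(out_edges, {1: 3, 2: 6, 3: 2})
```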
ML Algorithms with Graph Processing
■ Collaborative Filtering
■ Clustering
■ Gradient Descent : Linear Regression, Logistic Regression
Matrix factorization
M = U × V
ALS : fix one side and solve for the other
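"Fix one side and solve for the other" can be sketched in pure Python for the simplest case, a rank-1 factorization over observed cells only. The regularization value, iteration count, and sample ratings are assumptions; a production ALS would use higher rank and a distributed solver.

```python
def als_rank1(ratings, n_rows, n_cols, iterations=50, reg=0.01):
    """Rank-1 ALS. `ratings` maps (row, col) -> value for observed cells only."""
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iterations):
        # Fix v, solve the regularized least-squares problem for each u[i]
        for i in range(n_rows):
            num = sum(val * v[j] for (r, j), val in ratings.items() if r == i)
            den = sum(v[j] ** 2 for (r, j) in ratings if r == i) + reg
            u[i] = num / den
        # Fix u, solve for each v[j]
        for j in range(n_cols):
            num = sum(val * u[i] for (i, c), val in ratings.items() if c == j)
            den = sum(u[i] ** 2 for (i, c) in ratings if c == j) + reg
            v[j] = num / den
    return u, v


# Hypothetical ratings taken from the rank-1 matrix [[3, 4], [6, 8]], cell (1, 1) hidden
ratings = {(0, 0): 3.0, (0, 1): 4.0, (1, 0): 6.0}
u, v = als_rank1(ratings, n_rows=2, n_cols=2)
predicted = u[1] * v[1]  # estimate for the hidden cell
```

Each half-step is a closed-form least-squares solve, and the solves for different rows (or columns) are independent of each other, which is what makes ALS easy to parallelize.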
Representing Matrix by Graphs
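Matrix:
3 - 8
- 9 5
5 - -
[Diagram: bipartite graph with Row vertices 1, 2, 3 and Column vertices 1, 2, 3; each non-empty cell is an edge labeled with its value (3, 8, 9, 5, 5)]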
• every vertex holds a row vector
Recommendations with Solr
Lucene Inverted Index
Input Documents
Document   Content Field
1          framework for job scheduling
2          data warehouse infrastructure and
3          fast and general compute engine
4          data serialization system and
5          wide variety of companies
…          …

Index
Term        Documents
framework   1[1x]
for         1[1x], 5[1x]
job         1[1x]
data        2[1x], 4[1x]
…           …
and         3[1x], 4[1x]
wide        5[1x]
variety     5[2x]
…           …
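A rough sketch of how such an inverted index is built: each term maps to the documents containing it, with a per-document term frequency. The document texts below are the truncated excerpts shown on the slide, so counts for some terms will differ from the slide's index; the function name is made up, and real Lucene analysis (tokenizers, stop words, stemming) is not modeled.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return dict(index)


docs = {
    1: "framework for job scheduling",
    2: "data warehouse infrastructure and",
    3: "fast and general compute engine",
    4: "data serialization system and",
    5: "wide variety of companies",
}
index = build_inverted_index(docs)
```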
Recommendation Approaches in Solr
■ Attribute-based
■ Textual Similarity-based
■ More-like-this
■ Collaborative Filtering
Attribute-based Recommendations
■ Example: Match User Attributes to Item Attribute Fields
/solr/select/?q=(grouptitle:"big data"^25 OR grouptitle:(java)^10) AND
((city:"Las Vegas" AND state:"NV")^15 OR state:"NV")
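A query like the one above could be assembled from a user profile along these lines. The field names `grouptitle`, `city` and `state` come from the slide; the function name, the profile values, and the fixed location boost of 15 are assumptions for illustration.

```python
from urllib.parse import urlencode


def attribute_query(interests, city, state):
    """Build a boosted Solr query from (phrase, boost) interests and a location."""
    title_clauses = " OR ".join(
        f'grouptitle:"{phrase}"^{boost}' for phrase, boost in interests
    )
    location = f'(city:"{city}" AND state:"{state}")^15 OR state:"{state}"'
    return f"({title_clauses}) AND ({location})"


# Hypothetical user attributes
q = attribute_query([("big data", 25), ("java", 10)], "Las Vegas", "NV")
url = "/solr/select/?" + urlencode({"q": q})  # URL-encodes quotes and spaces
```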
Textual Similarity-based Recommendations
■ Solr’s MoreLikeThis Search Component.
■ Extracts important keywords from one or more documents
and uses them in search.
■ This results in secondary search results which demonstrate
textual similarity to the original document
■ http://wiki.apache.org/solr/MoreLikeThis
Content Recommendation
■ Even a single keyword can be enough to begin making meaningful
recommendations.
■ Filtering or boosting results based upon geographical area or
distance can help greatly for certain use cases:
▪ Jobs/Resumes, Events, Restaurants
■ /solr/select/?q=(Standard Recommendation Query) AND
_val_:"(recip(geodist(location, 40.7142, 74.0064),1,1,0))"
Behavior Based Recommendation Approaches
Collaborative Filtering : Users who like these also liked…
■ Step 1: Find similar users who like the same documents
q=documentid:("doc1" OR "doc4")
■ Step 2: Search for docs "liked" by those similar users
/solr/select/?q=userlikes:("user5"^2 OR "user4"^2 OR "user1"^1)
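The two steps can be mimicked in memory to show what the Solr queries compute: similar users are weighted by how many of the seed documents they like (the boosts in step 2), and candidate documents are scored by the summed weights. The data and function name are hypothetical; Solr's actual relevance scoring is more involved.

```python
from collections import Counter


def recommend(likes, seed_docs, top_n=3):
    """likes: dict mapping user -> set of liked document ids."""
    seed = set(seed_docs)
    # Step 1: users who like the seed docs, weighted by overlap
    user_weight = {u: len(docs & seed) for u, docs in likes.items() if docs & seed}
    # Step 2: docs liked by those users, scored by summed user weight
    scores = Counter()
    for user, weight in user_weight.items():
        for doc in likes[user] - seed:
            scores[doc] += weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical "likes" data
likes = {
    "user1": {"doc1", "doc2"},
    "user2": {"doc9"},
    "user4": {"doc1", "doc4", "doc3"},
    "user5": {"doc1", "doc4", "doc3", "doc5"},
}
recs = recommend(likes, ["doc1", "doc4"])
```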
Cloudera Search Architecture
[Diagram: Flume indexes online streaming data in NRT with Morphlines; MapReduce batch indexing with Morphlines processes raw, filtered, or annotated data from HDFS and applies GoLive updates; OLTP data in an HBase cluster is indexed through NRT replication with Morphlines; all paths write indexed data to the SolrCloud cluster(s), which answer search queries from an end-user client app (e.g. Hue); Cloudera Manager oversees the deployment]
Storm & Recommendations
Real-time Architecture using Storm & Hadoop
[Diagram: Incoming Data flows to Storm and to Hadoop; both write to a Key/Value Store, which serves the Query]
Real-time and Storm
■ The query layer queries the real-time and batch views and merges
the results
■ Some algorithms are hard to implement in real time. For
those cases we could estimate the results.
■ The model is generated offline on Hadoop and deployed into
Storm.
■ Online learning algorithms can be used in Storm. They learn
continuously through streaming training data.
■ Storm can also be used for scoring.
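As an example of an online learning algorithm of the kind mentioned above, here is a sketch of logistic regression trained one event at a time with stochastic gradient descent. The class name, learning rate, and toy stream are assumptions; wiring it into a Storm bolt is not shown.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression updated one training example at a time (SGD)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict(self, x):
        """Score an event: probability of the positive class."""
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """Take one gradient step on the log loss for a single example."""
        error = self.predict(x) - y
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * error


# Hypothetical stream: feature 0 signals a click, feature 1 signals no click
model = OnlineLogisticRegression(n_features=2)
for _ in range(200):
    model.update([1.0, 0.0], 1)
    model.update([0.0, 1.0], 0)
```

The same `predict` method covers the scoring use case, whether the weights were learned online or loaded from a model built offline on Hadoop.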
Storm/Track Realtime Events
■ Real-time streaming analytics/stats on consumer viewing behavior
and digital content trends.
■ Track impressions, clicks, conversions, bid requests etc. in real
time. Push per minute aggregations to HBase.
■ Most Popular Searches/Downloads/News Articles/Movies/Products
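The per-minute aggregation can be sketched as bucketing events by the start of their minute before counting. The event tuple layout and function names are assumptions; in the architecture above the resulting counts would be pushed to HBase rather than kept in a local counter.

```python
from collections import Counter


def minute_bucket(epoch_seconds):
    """Round a timestamp down to the start of its minute."""
    return epoch_seconds - epoch_seconds % 60


def aggregate_per_minute(events):
    """events: iterable of (epoch_seconds, event_type, item_id) tuples."""
    counts = Counter()
    for ts, event_type, item_id in events:
        counts[(minute_bucket(ts), event_type, item_id)] += 1
    return counts


# Hypothetical tracked events
events = [
    (120, "click", "p1"),
    (150, "click", "p1"),
    (185, "click", "p1"),
    (130, "impression", "p1"),
]
counts = aggregate_per_minute(events)
```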
Training of Models
A/B Testing
Offline Training & Testing of Models
Use Cases
• Recommend Missing Links in a Social Network
• How are users connected
• Clustering – find related people in groups
• Iterative Graph Ranking
Hadoop provides an excellent platform to train and test out the Models and various Algorithms
[Diagram: the Training Set is used to train the Model; the Test Set is used to test it and produce a Score]
A/B Testing
[Diagram: Traffic is split, with X% routed to the New Model and (100-X)% to the Old Model]
A/B testing is used to test the performance of the Models online
A/B testing involves:
• Partitioning real traffic between two models and then measuring performance against the
desired result (maximize CTR, revenue, page views etc.).
• The partitioning logic can get complicated. In such cases the partitions can be pre-computed on
Hadoop offline and pushed to an online store.
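A simple form of the partitioning logic is a deterministic hash of the user id, so the same user always sees the same model. This is a sketch under assumptions: the function name, the choice of SHA-256, and the "user-N" ids are illustrative; more complicated rules would be precomputed offline as described above.

```python
import hashlib


def assign_model(user_id, new_model_percent):
    """Route a user to the new model for roughly `new_model_percent`% of users."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "new" if bucket < new_model_percent else "old"


# Hypothetical check of the split with X = 20
assignments = [assign_model(f"user-{i}", 20) for i in range(10000)]
share_new = assignments.count("new") / len(assignments)
```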
Trends, Aggregations & Counters
• Most Popular Searches/Downloads/News Articles/Movies/Products
• Load results into HBase
• Use HBase where we need NRT count of things (categories/products etc.)
• Impala is very useful here for faster SLAs
HBase Counters
• HBase has the concept of incrementing column values
• Avoids the lock row/read value/increment/write back/unlock row cycle
• Great for counting specific metrics over time
• Example - count per URL/Product
• Can disable write to WAL on puts

Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
 
Quality by design.. ppt for RA (1ST SEM
Quality by design.. ppt for  RA (1ST SEMQuality by design.. ppt for  RA (1ST SEM
Quality by design.. ppt for RA (1ST SEM
 
Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective ...
Testing with Fewer Resources:  Toward Adaptive Approaches for Cost-effective ...Testing with Fewer Resources:  Toward Adaptive Approaches for Cost-effective ...
Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective ...
 
A Guide to Choosing the Ideal Air Cooler
A Guide to Choosing the Ideal Air CoolerA Guide to Choosing the Ideal Air Cooler
A Guide to Choosing the Ideal Air Cooler
 
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunity
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunityDon't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunity
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunity
 
INDIAN GCP GUIDELINE. for Regulatory affair 1st sem CRR
INDIAN GCP GUIDELINE. for Regulatory  affair 1st sem CRRINDIAN GCP GUIDELINE. for Regulatory  affair 1st sem CRR
INDIAN GCP GUIDELINE. for Regulatory affair 1st sem CRR
 
cse-csp batch4 review-1.1.pptx cyber security
cse-csp batch4 review-1.1.pptx cyber securitycse-csp batch4 review-1.1.pptx cyber security
cse-csp batch4 review-1.1.pptx cyber security
 
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx
 

Building Recommendation Platforms with Hadoop

  • 1. REMINDER Check in on the COLLABORATE mobile app Building Recommendation Platforms with Hadoop Prepared by: Jayant Shekhar Sr. Solutions Architect Cloudera
  • 2. Agenda ■ Why Big Data Recommendation Platform? ■ Common Recommendation Patterns & Algorithms ■ Lambda Architecture ■ Architecture & Design of Computation & Serving Layers ■ Social Recommendations with Giraph ■ Recommendations with Solr ■ Recommendation with Storm/HBase
  • 4. Recommendations are one of the common use cases of Hadoop. Broader Use Cases: • Product Recommendation • People/Social Recommendation • Merchant Recommendation • Content Recommendation • Query Recommendation • Sponsored Search Advertising. Recommendations can be: • Realtime (News Recommendations; Merchant/Offer Recommendations on mobile) • Offline (Similar Profiles/Resumes) • In Between
  • 5. Recommendations are delivered through: • Web • Mobile • Email • Postal mail • Newspaper/Magazine ads. Data Sets Involved: • Items/Products/Content • Transaction Data • User Data • Logs & User Activity • Additional 3rd Party Data • Geo • Social • Reviews • … Different Time to Action Targeted: • User would view the content now/buy the product now • User would buy the product in a week (e.g. the next time he/she goes grocery shopping) • User would buy the product in the next 3 months (TV/Dishwasher etc., Vacation). Also need to be able to determine/differentiate between the users in a household.
  • 6. Common Recommendation Patterns & associated Algorithms
  • 7. Some ML Algorithms used for Recommendations • Collaborative Filtering: ALS, SVD, Slope One Recommender • Clustering: K-means, Canopy, Fuzzy K-Means • Classification & Regression: Logistic Regression, Naïve Bayes, Random Forest • Pattern Mining: Parallel FP-Growth
  • 8. Product Recommendation Use Cases • Recommend Product • Recommend Movies/Videos Algorithms • Collaborative Filtering • Logistic Regression Frequently bought/viewed together Use Cases • Find items that are frequently bought together • Related Searches/Query Suggestion • View Item Page Algorithms • Parallel FP-Growth: http://infolab.stanford.edu/~echang/recsys08-69.pdf
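The "frequently bought together" use case on slide 8 can be illustrated with a much simpler stand-in than Parallel FP-Growth: plain pair co-occurrence counting. This is only a sketch; the basket data, the function name `frequent_pairs` and the `min_support` threshold are hypothetical and not from the slides.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs that co-occur in at least min_support baskets.

    NOTE: simple pair counting, not Parallel FP-Growth (illustrative stand-in).
    """
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) gives each pair one canonical form, counted once per basket
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical transaction data
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "milk", "jam"],
]
pairs = frequent_pairs(baskets, min_support=2)
```

Pair counting grows quadratically with basket size, which is one reason a pattern-mining algorithm such as FP-Growth is preferred at scale.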
  • 9. Related Searches or Query Recommendations Design • Use Query Log Data • Cluster similar queries • Use Parallel FP Growth to find the related searches Query Distance • Based on keywords or phrases • Based on searches in the same session • Based on common clicked URLs • Based on the distance of the clicked documents Related Articles/News • Batch clustering with K-Means • NRT clustering using the centroids • Perform canopy on left over articles
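One of the query distances listed on slide 9, "based on common clicked URLs", could be sketched as a Jaccard distance over the sets of URLs clicked for each query. The function name and the sample data are assumptions made for illustration; the slides do not specify a formula.

```python
def click_distance(urls_a, urls_b):
    """Jaccard distance between the sets of URLs clicked for two queries.

    0.0 means identical click sets, 1.0 means no clicked URL in common.
    """
    a, b = set(urls_a), set(urls_b)
    if not a and not b:
        return 1.0  # no click evidence at all: treat the queries as unrelated
    return 1.0 - len(a & b) / len(a | b)

# Hypothetical click logs for two queries
d = click_distance(["u1", "u2"], ["u2", "u3"])
```

A distance like this can feed the "cluster similar queries" step; the choice of Jaccard here is an assumption, not the deck's stated method.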
  • 10. Social/People Recommendations Use Case • Recommend Missing Links in a Social Network • Bipartite Matching – Recommend Men/Women Design • Take existing edges and friend of friends • Build Regression Models based on latest activity • Scale easily offline with Hadoop as number of friends of friends and activities could be very high. • Giraph
  • 12. Lambda Architecture [Diagram: New Data Stream → All Data → Pre-compute Views → Batch Views → Query; New Data Stream → Stream Processing → Realtime View → Query] The Lambda architecture was proposed by Nathan Marz, creator of Storm
  • 13. Lambda Architecture Large Scale Offline Batch + Real-time Online Streaming ■ Batch Layer : offline, asynchronous ■ Serving Layer : real-time, incremental, approximate
  • 14. Computation & Serving Layers for Recommendations
  • 16. Closer View of Oryx Serving & Model Generation [Diagram labels: API; Serving Layer (×3); Computation Layer; HDFS with Generation 0, Generation 1, Generation 2, Generation 3] Generation directory contains: • Input data • Configuration • Model
  • 17. Feature Generation & Model Building [Diagram labels: HDFS; Data (several sets); Feature Generation; Model Generation; Model (several models)] Hadoop enables easy iteration over the process of model generation and offline testing.
  • 18. Requirements for ML on Hadoop ■ Model Building ▪ Large Scale Distributed ▪ Continuous ■ Model Serving ▪ Real-time query ▪ Real-time updates ■ Algorithms ▪ Parallelizable ▪ Updateable ■ Interoperable ▪ PMML model format ▪ Simple REST API ▪ Open Source
  • 19. Computation Layer vs. Serving Layer ■ Computation Layer ▪ Periodically builds a generation from recent data and the past model ▪ Babysits the MapReduce job ▪ Publishes the model ■ Serving Layer ▪ Consumes the model ▪ Serves queries from the model in memory ▪ Updates the model from new input ▪ Also writes input to HDFS ▪ Replicas for scale
  • 20. Collaborative Filtering : ALS ■ Alternating Least Squares ■ Matrix Factorization ■ Faster than SVD ■ Real-time update ■ Parallelizable
  • 21. Clustering : K-means++ ■ Well-known and understood ■ Parallelizable ■ Clusters Updateable ■ Obtains an initial set of centers that is close to the optimum solution.
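The last bullet on slide 21 refers to the k-means++ seeding step. A minimal sketch of that seeding in plain Python follows; the function name, the fixed seed and the sample points are assumptions, and the sketch assumes `k` is no larger than the number of distinct points.

```python
import random

def kmeanspp_init(points, k, seed=0):
    """Pick k initial centers using k-means++ seeding.

    Assumes k <= number of distinct points (otherwise all weights become zero).
    """
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Squared distance from each point to its nearest already-chosen center
        d2 = [
            min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
            for p in points
        ]
        # Next center: sampled with probability proportional to that distance
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

# Hypothetical 2-D data with two well-separated groups
points = [(0.0, 0.0), (0.0, 1.0), (100.0, 100.0), (100.0, 101.0)]
centers = kmeanspp_init(points, k=2)
```

Because already-chosen points have weight zero, the same point is never picked twice; far-away points are strongly favoured, which is what spreads the initial centers out.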
  • 22. Classification/Regression : RDF ■ Random Decision Forests ■ Ensemble Method ■ Numerical, Categorical features and target ■ Very Parallel ■ Nodes Updateable
  • 24. Graph Use Cases • Social Recommendations • Recommend missing links in a social network • Twitter Graph • Who to follow • Similar To • Bipartite Matching • Matching job/employees, men/women • How are users connected • Clustering – find related people in groups
  • 25. Giraph ■ Each vertex has an id, a value, a list of its adjacent neighbour ids and the corresponding edge values ■ Edges are always directed ▪ Out-edges attached to a node ▪ Nodes can’t see inbound edges ■ Nodes communicate via messages ■ No remote reads
  • 27. Giraph BSP ■ Input is a directed graph ■ Each vertex is invoked in each superstep, can recompute its value and send messages to other vertices, which are delivered over superstep barriers ■ This is done till every Vertex votes to halt ■ Output is a directed graph
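The superstep model on slide 27 can be imitated on a single machine. The sketch below is not Giraph code; it is a plain-Python simulation (function and variable names are hypothetical) of the classic "propagate the maximum value" vertex program: each vertex sends along its out-edges, messages are delivered at the barrier, and a vertex that learns nothing new stays silent, i.e. votes to halt.

```python
def bsp_max_value(out_edges, values):
    """Simulate BSP supersteps that spread the maximum vertex value.

    out_edges: {vertex: [neighbour ids]} (directed out-edges only)
    values:    {vertex: number}
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value along its out-edges
    inbox = {v: [] for v in values}
    for v, nbrs in out_edges.items():
        for n in nbrs:
            inbox[n].append(values[v])
    # Run until no messages are in flight (every vertex has voted to halt)
    while any(inbox.values()):
        outbox = {v: [] for v in values}
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)  # recompute the vertex value
                for n in out_edges.get(v, []):
                    outbox[n].append(values[v])
            # otherwise: vertex sends nothing, i.e. votes to halt
        inbox = outbox  # superstep barrier: messages become visible next step
    return values

# Hypothetical directed ring a -> b -> c -> a
result = bsp_max_value({"a": ["b"], "b": ["c"], "c": ["a"]}, {"a": 3, "b": 6, "c": 2})
```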
  • 28. ML Algorithms with Graph Processing ■ Collaborative Filtering ■ Clustering ■ Gradient Descent : Linear Regression, Logistic Regression
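  • 29. Matrix factorization: M = U × V. ALS: fix one side and solve for the other
  • 30. Representing Matrix by Graphs [Figure: a sparse matrix drawn as a graph between Row vertices 1, 2, 3 and Column vertices 1, 2, 3, with the non-missing entries (3, 8, 9, 5, 5) as edge values] • every vertex holds a row vector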
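"Fix one side and solve for the other" can be made concrete with a rank-1 factorization, where each alternating step has a closed-form least-squares solution. This is a toy sketch under stated assumptions: rank 1 only, a small ridge term `reg`, pure Python, and hypothetical names and data. Production ALS uses higher rank and distributed solves.

```python
def als_rank1(ratings, n_rows, n_cols, iters=50, reg=1e-6):
    """Rank-1 alternating least squares on observed entries only.

    ratings: {(row, col): value} for the known cells of M.
    Returns (u, v) such that M[i][j] is approximated by u[i] * v[j].
    """
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iters):
        # Fix v, solve each u[i] by regularised least squares
        for i in range(n_rows):
            num = sum(val * v[j] for (r, j), val in ratings.items() if r == i)
            den = reg + sum(v[j] ** 2 for (r, j) in ratings if r == i)
            u[i] = num / den
        # Fix u, solve each v[j] the same way
        for j in range(n_cols):
            num = sum(val * u[i] for (i, c), val in ratings.items() if c == j)
            den = reg + sum(u[i] ** 2 for (i, c) in ratings if c == j)
            v[j] = num / den
    return u, v

# Hypothetical matrix [[3, 4, 5], [6, 8, ?]] with cell (1, 2) unobserved
ratings = {(0, 0): 3.0, (0, 1): 4.0, (0, 2): 5.0, (1, 0): 6.0, (1, 1): 8.0}
u, v = als_rank1(ratings, n_rows=2, n_cols=3)
predicted = u[1] * v[2]  # estimate for the missing cell
```

Each row update depends only on that row's observed entries, which is why the method parallelizes well.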
  • 32. Lucene Inverted Index. Input Documents (Document: Content Field): 1: framework for job scheduling; 2: data warehouse infrastructure and; 3: fast and general compute engine; 4: data serialization system and; 5: wide variety of companies; … Index (Term: Documents): framework: 1[1x]; for: 1[1x], 5[1x]; job: 1[1x]; data: 2[1x], 4[1x]; …; and: 3[1x], 4[1x]; wide: 5[1x]; variety: 5[2x]; …
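The term-to-documents mapping on slide 32 can be reproduced in a few lines. This is a toy sketch of the data structure only, not Lucene's implementation: tokenisation is a plain whitespace split, and the function name is hypothetical. The documents are the content fields shown on the slide.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive whitespace tokenisation; real analyzers do much more
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return dict(index)

docs = {
    1: "framework for job scheduling",
    2: "data warehouse infrastructure and",
    3: "fast and general compute engine",
    4: "data serialization system and",
    5: "wide variety of companies",
}
index = build_inverted_index(docs)
```

Looking up `index["data"]` returns the documents containing that term together with their term counts, which is the lookup a search query starts from.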
  • 33. Recommendation Approaches in Solr ■ Attribute-based ■ Textual Similarity-based ■ More-like-this ■ Collaborative Filtering
  • 34. Attribute-based Recommendations ■ Example: Match User Attributes to Item Attribute Fields /solr/select/?q=(grouptitle:"big data"^25 OR grouptitle:(java)^10) AND ((city:"Las Vegas" AND state:"NV")^15 OR state:"NV")
  • 35. Textual Similarity-based Recommendations ■ Solr’s MoreLikeThis Search Component. ■ Extracts important keywords from one or more documents and uses them in search. ■ This results in secondary search results which demonstrate textual similarity to the original document ■ http://wiki.apache.org/solr/MoreLikeThis
  • 36. Content Recommendation ■ Even a single keyword can be enough to begin making meaningful recommendations. ■ Filtering or boosting results based upon geographical area or distance can help greatly for certain use cases: ▪ Jobs/Resumes, Events, Restaurants ■ /solr/select/?q=(Standard Recommendation Query) AND _val_:"(recip(geodist(location, 40.7142, 74.0064),1,1,0))"
  • 37. Behavior Based Recommendation Approaches. Collaborative Filtering: "Users who like these also liked…" ■ Step 1: Find similar users who like the same documents q=documentid:("doc1" OR "doc4") ■ Step 2: Search for docs "liked" by those similar users /solr/select/?q=userlikes:("user5"^2 OR "user4"^2 OR "user1"^1)
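The distance boost in the query on slide 36 combines two Solr functions: `geodist`, a great-circle distance, and `recip(x,m,a,b)`, which computes a/(m·x+b). The sketch below mimics that arithmetic in Python to show how the boost shrinks with distance. The haversine helper, the Earth radius constant, the function names and the sample coordinates are assumptions for illustration, not Solr internals.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def recip(x, m, a, b):
    """Reciprocal function a / (m*x + b)."""
    return a / (m * x + b)

def geo_boost(doc_lat, doc_lon, user_lat, user_lon):
    """Boost with the slide's parameters (m=1, a=1, b=0), i.e. 1 / distance."""
    return recip(haversine_km(doc_lat, doc_lon, user_lat, user_lon), 1, 1, 0)

# Hypothetical documents at increasing distance from a user at (0, 0)
near = geo_boost(0.0, 1.0, 0.0, 0.0)
far = geo_boost(0.0, 10.0, 0.0, 0.0)
```

With b=0 the boost is undefined at distance zero, so a small positive b would be a safer choice in practice.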
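The two Solr steps on slide 37 can be mirrored over in-memory data to show the logic: first weight users by how many of the seed documents they like, then score other documents by the summed weights of the users who like them. All names and the sample `likes` data are hypothetical; this is a sketch of the idea, not of Solr's scoring.

```python
from collections import Counter

def recommend(likes, liked_docs, top_n=3):
    """likes: {user: set of doc ids}; liked_docs: the seed documents."""
    liked_docs = set(liked_docs)
    # Step 1: similar users, weighted by overlap with the seed documents
    weights = {u: len(docs & liked_docs) for u, docs in likes.items()}
    # Step 2: score documents liked by those users (excluding the seeds)
    scores = Counter()
    for user, w in weights.items():
        if w == 0:
            continue
        for doc in likes[user] - liked_docs:
            scores[doc] += w
    # Highest score first, ties broken by doc id for determinism
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc for doc, _ in ranked[:top_n]]

likes = {
    "user1": {"doc1", "doc7"},
    "user4": {"doc1", "doc4", "doc9"},
    "user5": {"doc1", "doc4", "doc7", "doc9"},
    "user8": {"doc2"},
}
recs = recommend(likes, ["doc1", "doc4"])
```

The overlap weights play the role of the `^2` and `^1` boosts in the slide's second query.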
  • 38. Cloudera Search Architecture HDFS Online Streaming Data End User Client App (e.g. Hue) Flume Raw, filtered, or annotated data SolrCloud Cluster(s) NRT Data indexed w/ Morphlines Indexed data MapReduce Batch Indexing w/ Morphlines GoLive updates HBase Cluster NRT Replication Events indexed w/ Morphlines OLTP Data ClouderaManager Search queries
  • 40. Real-time Architecture using Storm & Hadoop [Diagram labels: Incoming Data, Storm, Hadoop, Key/Value Store, Query]
  • 41. Real-time and Storm ■ The query layer queries the real-time and batch and merges the result ■ Some algorithms are hard to implement in real time. For those cases we could estimate the results. ■ The model is generated offline on Hadoop and deployed into Storm. ■ Online learning algorithms can be used in Storm. They learn continuously through streaming training data. ■ Storm can also be used for scoring.
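The "online learning" bullet on slide 41 can be illustrated with logistic regression trained one example at a time by stochastic gradient descent. This is a plain-Python sketch with no Storm API involved; the class name, learning rate and the toy stream are assumptions. The same object also shows "scoring": `predict_proba` can be called at any point between updates.

```python
import math

class OnlineLogisticRegression:
    """Logistic regression updated one (x, y) example at a time via SGD."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        # Gradient of the log loss for one example is (p - y) * x
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err

# Hypothetical stream: positive feature means a click (1), negative means none (0)
model = OnlineLogisticRegression(n_features=1)
for _ in range(200):
    model.update([1.0], 1)
    model.update([-1.0], 0)
```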
  • 42. Storm/Track Realtime Events ■ Real-time streaming analytics/stats on consumer viewing behavior and digital content trends. ■ Track impressions, clicks, conversions, bid requests etc. in real time. Push per minute aggregations to HBase. ■ Most Popular Searches/Downloads/News Articles/Movies/Products
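The per-minute aggregation mentioned on slide 42 amounts to bucketing events by the minute they fall into. The sketch below shows only that bucketing logic in plain Python; it makes no Storm or HBase calls, and the event tuple layout and function name are hypothetical.

```python
from collections import Counter

def per_minute_counts(events):
    """Aggregate events into per-minute counters.

    events: iterable of (epoch_seconds, event_type, key) tuples.
    Returns a Counter keyed by (minute_start_epoch, event_type, key).
    """
    counts = Counter()
    for ts, event_type, key in events:
        minute_start = int(ts) // 60 * 60  # truncate to the start of the minute
        counts[(minute_start, event_type, key)] += 1
    return counts

# Hypothetical click/impression events for one product
events = [
    (0, "click", "product-a"),
    (59, "click", "product-a"),
    (60, "click", "product-a"),
    (61, "impression", "product-a"),
]
counts = per_minute_counts(events)
```

Each resulting key maps naturally onto a row/column in a key-value store, with the count as the value to write or increment.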
  • 44. Offline Training & Testing of Models Use Cases • Recommend Missing Links in a Social Network • How are users connected • Clustering – find related people in groups • Iterative Graph Ranking Hadoop provides an excellent platform to train and test out the Models and various Algorithms Model Train Test Training Set Test Set Score
  • 45. A/B Testing [Diagram: Traffic split X% to the New Model and (100-X)% to the Old Model] A/B testing is used to test the performance of the models online. A/B testing involves: • Partitioning real traffic between two models and then measuring their performance against the desired metric (maximize CTR, revenue, page views etc.). • The partitioning logic can get complicated. In such cases the assignments can be pre-computed on Hadoop offline and pushed to an online store.
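A simple form of the traffic partitioning on slide 45 is a deterministic hash of the user id into 100 buckets, so the same user always sees the same model. This is a sketch of one common approach, not the deck's stated method; the function name and the user-id format are assumptions.

```python
import hashlib

def assign_model(user_id, new_model_percent):
    """Route a user to the 'new' or 'old' model.

    Deterministic: the same user_id always lands in the same bucket (0..99).
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "new" if bucket < new_model_percent else "old"

# Hypothetical population: send roughly 20% of users to the new model
assignments = [assign_model(f"user-{i}", 20) for i in range(10000)]
share_new = assignments.count("new") / len(assignments)
```

More complicated rules (per-segment splits, holdouts) are the cases the slide suggests pre-computing offline and pushing to an online store.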
  • 46. Please complete the session evaluation on the mobile app We appreciate your feedback and insight
  • 47. Trends, Aggregations & Counters • Most Popular Searches/Downloads/News Articles/Movies/Products • Load results into HBase • Use HBase where we need NRT count of things (categories/products etc.) • Impala is very useful here for faster SLAs HBase Counters • Has concept of Incrementing column values • Avoids lock row/read value/increment it/write it back/unlock rows • Great for counting specific metrics over time • Example - count per URL/Product • Can disable write to WAL on puts

Editor's Notes

  1. Parallel Frequent Pattern Growth: If we divide the entire database into several partitions, then an itemset can be frequent only if it is frequent in at least one partition. Bear in mind that the support of an itemset is a percentage: if the minimum percentage is not met in any individual partition, it cannot be met for the whole database. This property enables divide-and-conquer algorithms. Random Forest: bootstrapping the recommender system.
  2. Batch and realtime views can be queried with HBase or Impala. The layers are the Batch Layer, Speed Layer and Serving Layer. http://jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting In essence, the speed layer is the same as the batch layer in that it computes views from the data it receives. Realtime views can be discarded once the data that they contain is captured in the batch layer.
  3. • Recommend Missing Links in a Social Network • How are users connected • Clustering – find related people in groups • Iterative Graph Ranking
  4. The MoreLikeThis search component enables users to query for documents similar to a document in their result list. It does this by using terms from the original document to find similar documents in the index.