New Directions in Mahout’s Recommenders
Sebastian Schelter, Apache Software Foundation
Recommender Systems Get-together Berlin
NewDirectionsinMahout’sRecommenders
2/28
New Directions?
Mahout in Action is the prime source of
information for using Mahout in practice.
As it is more than two years old, it
is missing a lot of recent developments.
This talk describes what has been added to the recommenders
of Mahout since then.
Single machine recommenders
MyMediaLite, a scientific library of recommender system algorithms
Mahout now features a couple of popular latent factor models,
mostly ported by Zeno Gantner.
New recommenders and factorizers
BiasedItemBasedRecommender, item-based kNN with
user-item-bias estimation
Koren: Factor in the Neighbors: Scalable and Accurate Collaborative Filtering, TKDD ’09
RatingSGDFactorizer, biased matrix factorization
Koren et al.: Matrix Factorization Techniques for Recommender Systems, IEEE Computer ’09
SVDPlusPlusFactorizer, SVD++
Koren: Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model, KDD ’08
ALSWRFactorizer, matrix factorization using Alternating
Least Squares
Zhou et al.: Large-Scale Parallel Collaborative Filtering for the Netflix Prize, AAIM ’08
Hu et al.: Collaborative Filtering for Implicit Feedback Datasets, ICDM ’08
Batch Item-Similarities on a single machine
Simple but powerful way to deploy Mahout: Use item-based
collaborative filtering with periodically precomputed item
similarities.
Mahout now supports multithreaded item similarity
computation on a single machine for data sizes that don’t
require a Hadoop-based solution.
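// Classes below come from org.apache.mahout.cf.taste.* (plus java.io.File).
// k (similar items kept per item), numThreads, maxDurationInHours and
// resultFile are supplied by the caller.
DataModel dataModel = new FileDataModel(new File("movielens.csv"));
ItemSimilarity similarity = new LogLikelihoodSimilarity(dataModel);
ItemBasedRecommender recommender =
    new GenericItemBasedRecommender(dataModel, similarity);
// Precompute the k most similar items per item using several threads.
BatchItemSimilarities batch =
    new MultithreadedBatchItemSimilarities(recommender, k);
batch.computeItemSimilarities(numThreads, maxDurationInHours,
    new FileSimilarItemsWriter(resultFile));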
Parallel processing
Collaborative Filtering
idea: infer recommendations from patterns found in the
historical user-item interactions
data can be explicit feedback (ratings) or implicit feedback
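(clicks, pageviews), represented in the interaction matrix A
$$
A = \begin{array}{c|cccc}
 & \text{item}_1 & \cdots & \text{item}_3 & \cdots \\ \hline
\text{user}_1 & 3 & \cdots & 4 & \cdots \\
\text{user}_2 & - & \cdots & 4 & \cdots \\
\text{user}_3 & 5 & \cdots & 1 & \cdots \\
\cdots & \cdots & \cdots & \cdots & \cdots
\end{array}
$$
row $a_i$ denotes the interaction history of user i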
we target use cases with millions of users and hundreds of
millions of interactions
MapReduce
paradigm for data-intensive parallel processing
data is partitioned in a distributed file system
computation is moved to data
system handles distribution, execution, scheduling, failures
fixed processing pipeline where user specifies two
functions
map : (k1, v1) → list(k2, v2)
reduce : (k2, list(v2)) → list(v2)
[Figure: MapReduce data flow. Inputs are read from the DFS by map tasks, shuffled to reduce tasks, and the outputs are written back to the DFS.]
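The two user-supplied functions can be imitated in memory to see how the stages fit together. The following is a minimal single-process sketch in plain Java, not Hadoop's API; the class `MiniMapReduce`, the record `Pair` and the example job are hypothetical names chosen for illustration. The example job counts the interactions per item from (user, item) records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

/** In-memory imitation of map -> shuffle -> reduce (illustration only, not Hadoop). */
class MiniMapReduce {

    /** A key-value pair flowing between the stages. */
    record Pair<K, V>(K key, V value) {}

    static <K1, V1, K2 extends Comparable<K2>, V2> Map<K2, List<V2>> run(
            List<Pair<K1, V1>> input,
            BiFunction<K1, V1, List<Pair<K2, V2>>> map,
            BiFunction<K2, List<V2>, List<V2>> reduce) {
        // map phase followed by the shuffle: group intermediate values by key
        Map<K2, List<V2>> groups = new TreeMap<>();
        for (Pair<K1, V1> record : input) {
            for (Pair<K2, V2> out : map.apply(record.key(), record.value())) {
                groups.computeIfAbsent(out.key(), k -> new ArrayList<>()).add(out.value());
            }
        }
        // reduce phase: one call per distinct intermediate key
        Map<K2, List<V2>> result = new TreeMap<>();
        groups.forEach((key, values) -> result.put(key, reduce.apply(key, values)));
        return result;
    }

    /** Example job: number of interactions per item from (user, item) records. */
    static Map<String, List<Integer>> interactionsPerItem(List<Pair<String, String>> interactions) {
        BiFunction<String, String, List<Pair<String, Integer>>> map =
                (user, item) -> List.of(new Pair<>(item, 1));
        BiFunction<String, List<Integer>, List<Integer>> reduce =
                (item, ones) -> List.of(ones.stream().mapToInt(Integer::intValue).sum());
        return run(interactions, map, reduce);
    }

    public static void main(String[] args) {
        List<Pair<String, String>> interactions = List.of(
                new Pair<>("user1", "item1"), new Pair<>("user1", "item3"),
                new Pair<>("user2", "item3"), new Pair<>("user3", "item1"),
                new Pair<>("user3", "item3"));
        System.out.println(interactionsPerItem(interactions)); // prints {item1=[2], item3=[3]}
    }
}
```

In a real cluster the grouping step is the shuffle over the network, which is why the later slides focus on how much data the map phase emits.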
Scalable neighborhood methods
Neighborhood Methods
Item-Based Collaborative Filtering is one of the most
deployed CF algorithms, because:
simple and intuitively understandable
additionally gives non-personalized, per-item
recommendations (people who like X might also like Y)
recommendations for new users without model retraining
comprehensible explanations (we recommend Y because
you liked X)
Cooccurrences
start with a simplified view: imagine interaction matrix A was binary
→ we look at cooccurrences only
item similarity computation becomes matrix multiplication
$$r_i = (A^\top A)\, a_i$$
scale-out of the item-based approach reduces to finding an
efficient way to compute the item similarity matrix
$$S = A^\top A$$
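A small worked instance may make the matrix view concrete. The binary matrix below (three users, five items, indexed from 0) is an illustrative assumption, and $a_0$ is read as a column vector so that the product is defined.

```latex
A = \begin{pmatrix}
0 & 0 & 1 & 0 & 1 \\
1 & 0 & 1 & 1 & 0 \\
1 & 1 & 1 & 1 & 0
\end{pmatrix},
\qquad
S = A^\top A = \begin{pmatrix}
2 & 1 & 2 & 2 & 0 \\
1 & 1 & 1 & 1 & 0 \\
2 & 1 & 3 & 2 & 1 \\
2 & 1 & 2 & 2 & 0 \\
0 & 0 & 1 & 0 & 1
\end{pmatrix}

a_0 = (0, 0, 1, 0, 1)^\top
\quad\Longrightarrow\quad
r_0 = S\, a_0 = (2, 1, 4, 2, 2)^\top
```

User 0 has not yet interacted with items 0, 1 and 3; their scores are 2, 1 and 2, so items 0 and 3 would rank ahead of item 1.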
Parallelizing $S = A^\top A$
standard approach of computing item cooccurrences requires
random access to both users and items
foreach item f do
    foreach user i who interacted with f do
        foreach item j that i also interacted with do
            S_fj = S_fj + 1
→ not efficiently parallelizable on partitioned data
row outer product formulation of matrix multiplication is
efficiently parallelizable on a row-partitioned A
$$S = A^\top A = \sum_{i \in A} a_i a_i^\top$$
mappers compute the outer products of rows of A, emit the
results row-wise, reducers sum these up to form S
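The outer-product formulation can be sketched on a single machine as a "map" step per user row and a "reduce" step that sums partial rows per item. This is a minimal illustration with dense arrays, not Mahout's implementation; the class and method names are hypothetical, and the toy matrix is an assumption.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Row-outer-product computation of S = A^T A, written as a map step and a reduce step. */
class OuterProductCooccurrence {

    /** "Map": for one user row a_i, emit the non-zero rows of the outer product, keyed by item. */
    static Map<Integer, int[]> mapRow(int[] userRow) {
        Map<Integer, int[]> emitted = new TreeMap<>();
        for (int f = 0; f < userRow.length; f++) {
            if (userRow[f] != 0) {
                int[] partialRow = new int[userRow.length];
                for (int j = 0; j < userRow.length; j++) {
                    partialRow[j] = userRow[f] * userRow[j];
                }
                emitted.put(f, partialRow);
            }
        }
        return emitted;
    }

    /** "Reduce": sum all partial rows emitted for the same item to obtain the rows of S. */
    static int[][] cooccurrences(int[][] a) {
        int numItems = a[0].length;
        int[][] s = new int[numItems][numItems];
        for (int[] userRow : a) { // each user row could be processed on a different machine
            mapRow(userRow).forEach((item, partialRow) -> {
                for (int j = 0; j < numItems; j++) {
                    s[item][j] += partialRow[j];
                }
            });
        }
        return s;
    }

    public static void main(String[] args) {
        int[][] a = { {0, 0, 1, 0, 1}, {1, 0, 1, 1, 0}, {1, 1, 1, 1, 0} };
        for (int[] row : cooccurrences(a)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Each call to `mapRow` only needs a single row of A, which is what makes the approach work on row-partitioned data without random access.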
Parallel similarity computation
real datasets are not binary, and we want to use a variety of
similarity measures, e.g. Pearson correlation
express similarity measures by 3 canonical functions, which
can be efficiently embedded into the computation (cf.
VectorSimilarityMeasure)
preprocess adjusts an item rating vector
$\hat{f} = \mathrm{preprocess}(f) \qquad \hat{j} = \mathrm{preprocess}(j)$
norm computes a single number from the adjusted vector
$n_f = \mathrm{norm}(\hat{f}) \qquad n_j = \mathrm{norm}(\hat{j})$
similarity computes the similarity of two vectors from the
norms and their dot product
$S_{fj} = \mathrm{similarity}(\mathrm{dot}_{fj}, n_f, n_j)$
Example: Jaccard coefficient
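preprocess binarizes the rating vectors
$$f = \begin{pmatrix} 3 \\ - \\ 5 \end{pmatrix} \qquad j = \begin{pmatrix} 4 \\ 4 \\ 1 \end{pmatrix} \qquad \hat{f} = \mathrm{bin}(f) = \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix} \qquad \hat{j} = \mathrm{bin}(j) = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$$
norm computes the number of users that rated each item
$$n_f = \|\hat{f}\|_1 = 2 \qquad n_j = \|\hat{j}\|_1 = 3$$
similarity finally computes the Jaccard coefficient from
the norms and the dot product of the vectors
$$\mathrm{jaccard}(f, j) = \frac{|f \cap j|}{|f \cup j|} = \frac{\mathrm{dot}_{fj}}{n_f + n_j - \mathrm{dot}_{fj}} = \frac{2}{2 + 3 - 2} = \frac{2}{3}$$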
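The three-function decomposition can be sketched in Java as follows. The nested interface `Measure` is a hypothetical contract written for this illustration and is not the signature of Mahout's VectorSimilarityMeasure; encoding a missing rating as 0 is likewise an assumption. The `main` method replays the Jaccard example with f = (3, −, 5) and j = (4, 4, 1).

```java
/** Sketch of the preprocess / norm / similarity decomposition, instantiated for Jaccard. */
class JaccardSketch {

    /** Hypothetical three-function contract; not Mahout's actual interface. */
    interface Measure {
        double[] preprocess(double[] itemVector);
        double norm(double[] preprocessed);
        double similarity(double dot, double normF, double normJ);
    }

    /** Jaccard coefficient; a rating of 0 stands for "no interaction" (assumption). */
    static final Measure JACCARD = new Measure() {
        @Override public double[] preprocess(double[] v) {
            double[] bin = new double[v.length];
            for (int u = 0; u < v.length; u++) {
                bin[u] = v[u] != 0 ? 1 : 0; // binarize the rating vector
            }
            return bin;
        }
        @Override public double norm(double[] bin) {
            double sum = 0;
            for (double x : bin) {
                sum += Math.abs(x); // L1 norm = number of users who rated the item
            }
            return sum;
        }
        @Override public double similarity(double dot, double normF, double normJ) {
            return dot / (normF + normJ - dot); // |f ∩ j| / |f ∪ j|
        }
    };

    static double dot(double[] f, double[] j) {
        double sum = 0;
        for (int u = 0; u < f.length; u++) {
            sum += f[u] * j[u];
        }
        return sum;
    }

    /** Applies the three functions in the order the distributed computation would. */
    static double compute(Measure m, double[] f, double[] j) {
        double[] fHat = m.preprocess(f);
        double[] jHat = m.preprocess(j);
        return m.similarity(dot(fHat, jHat), m.norm(fHat), m.norm(jHat));
    }

    public static void main(String[] args) {
        double[] f = {3, 0, 5};
        double[] j = {4, 4, 1};
        System.out.println(compute(JACCARD, f, j)); // 2 / (2 + 3 - 2)
    }
}
```

Because `similarity` only needs the dot product and two precomputed norms, the norms can be computed once per item and the dot products accumulated separately.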
Implementation in Mahout
o.a.m.math.hadoop.similarity.cooccurrence.RowSimilarityJob
computes the top-k pairwise similarities for each row of a
matrix using some similarity measure
o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob
computes the top-k similar items per item using
RowSimilarityJob
o.a.m.cf.taste.hadoop.item.RecommenderJob
computes recommendations and similar items using
RowSimilarityJob
MapReduce pass 1
data partitioned by items (row-partitioned $A^\top$)
invokes preprocess and norm for each item vector
transposes input to form A

[Figure: MapReduce pass 1 (map, combine, shuffle, reduce) on a toy example with 3 users and 5 items]

Input: $A^\top$ pointing from items to users

         user 0  user 1  user 2
item 0     -       1       2
item 1     -       -       1
item 2     3       1       5
item 3     -       2       4
item 4     1       -       -

Output: binarized A pointing from users to items

         item 0  item 1  item 2  item 3  item 4
user 0     -       -       1       -       1
user 1     1       -       1       1       -
user 2     1       1       1       1       -

Output: item "norms" (items 0 to 4): 2 1 3 2 1
MapReduce pass 2
data partitioned by users (row-partitioned A)
computes dot products of columns
loads norms and invokes similarity
implementation contains several optimizations
(sparsification, exploit symmetry and thresholds)
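[Figure: MapReduce pass 2 (map, combine, shuffle, reduce) on a toy example with 3 users and 5 items]

Input: binarized A

         item 0  item 1  item 2  item 3  item 4
user 0     -       -       1       -       1
user 1     1       -       1       1       -
user 2     1       1       1       1       -

Input: item "norms" (items 0 to 4): 2 1 3 2 1

Intermediate: dot products of the item columns (upper triangle)

         item 0  item 1  item 2  item 3  item 4
item 0     -       1       2       2       -
item 1     -       -       1       1       -
item 2     -       -       -       2       1

Output: "$A^\top A$" holding item similarities (upper triangle)

         item 0  item 1  item 2  item 3  item 4
item 0     -      1/2     2/3      1       -
item 1     -       -      1/3     1/2      -
item 2     -       -       -      2/3     1/3
item 3     -       -       -       -       -
item 4     -       -       -       -       -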
Cost of the algorithm
major cost in our algorithm is the communication in the
second MapReduce pass: for each user, we have to process the
square of the number of their interactions
$$S = \sum_{i \in A} a_i a_i^\top$$
→ cost is dominated by the densest rows of A
(the users with the highest number of interactions)
distribution of interactions per user is usually heavy-tailed
→ a small number of power users with a disproportionately
high number of interactions drastically increases the runtime
if a user has more than p interactions, only use a random
sample of size p of their interactions
we saw a negligible effect on prediction quality for moderate p
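The interaction cut described above can be sketched as a simple down-sampling step applied to each user's history before the outer products are formed. This is an illustrative sketch, not Mahout's code: the class `InteractionSampler`, its methods, and the choice of shuffling to draw the sample are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Caps the number of interactions considered per user at p (hypothetical helper). */
class InteractionSampler {

    /** Returns the history unchanged if it has at most p entries, else a uniform random sample of size p. */
    static <T> List<T> downsample(List<T> interactions, int p, Random random) {
        if (interactions.size() <= p) {
            return interactions;
        }
        List<T> shuffled = new ArrayList<>(interactions);
        Collections.shuffle(shuffled, random); // uniform random permutation
        return new ArrayList<>(shuffled.subList(0, p));
    }

    /** Number of item pairs the outer product of a history with n interactions touches. */
    static long pairCost(int n) {
        return (long) n * n;
    }

    public static void main(String[] args) {
        List<Integer> powerUser = new ArrayList<>();
        for (int item = 0; item < 10_000; item++) {
            powerUser.add(item);
        }
        List<Integer> sampled = downsample(powerUser, 500, new Random(42));
        System.out.println(pairCost(powerUser.size()) + " pairs before, "
                + pairCost(sampled.size()) + " pairs after sampling");
    }
}
```

Since the cost per user grows quadratically, capping a history of 10,000 interactions at 500 reduces the work for that user by a factor of 400.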
Scalable Neighborhood Methods: Experiments
Setup
26 machines running Java 7 and Hadoop 1.0.4
two 4-core Opteron CPUs, 32 GB memory and four 1 TB
disk drives per machine
Results
Yahoo Songs dataset (700M datapoints, 1.8M users, 136K
items), 26 machines, similarity computation takes less than 40
minutes
Scalable matrix factorization
Latent factor models: idea
interactions are deeply influenced by a set of factors that are
very specific to the domain (e.g. amount of action or
complexity of characters in movies)
these factors are in general not obvious, we might be able to
think of some of them but it’s hard to estimate their impact
on the interactions
need to infer those so-called latent factors from the
interaction data
low-rank matrix factorization
approximately factor A into the product of two rank-r feature
matrices U and M such that $A \approx UM$.
U models the latent features of the users, M models the latent
features of the items
dot product $u_i^\top m_j$ in the latent feature space predicts the strength
of interaction between user i and item j
to obtain a factorization, minimize the regularized squared error
over the observed interactions, e.g.:
$$\min_{U,M} \sum_{(i,j) \in A} (a_{ij} - u_i^\top m_j)^2 + \lambda \Big( \sum_i n_{u_i} \|u_i\|^2 + \sum_j n_{m_j} \|m_j\|^2 \Big)$$
Alternating Least Squares
ALS rotates between fixing U and M. When U is fixed, the
system recomputes M by solving a least-squares problem per
item, and vice versa.
easy to parallelize, as all users (and vice versa, items) can be
recomputed independently
additionally, ALS is able to solve non-sparse models from
implicit data
[Figure: A (u × i) ≈ U (u × k) × M (k × i)]
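The least-squares problem solved for a single user can be written out explicitly. The derivation below is a sketch under two assumptions: the regularized objective is the one given on the previous slide, and, following Zhou et al., $n_{u_i}$ is taken to be the number of observed interactions of user i. $I_i$ denotes the set of items user i interacted with, $M_{I_i}$ the columns of M belonging to those items, and $a_{i,I_i}$ the corresponding observed entries of row i of A.

```latex
% With M fixed, the objective decouples into one problem per user i:
f(u_i) = \sum_{j \in I_i} (a_{ij} - u_i^\top m_j)^2 + \lambda\, n_{u_i} \|u_i\|^2

% Setting the gradient to zero:
\nabla f(u_i) = -2 \sum_{j \in I_i} (a_{ij} - u_i^\top m_j)\, m_j + 2 \lambda\, n_{u_i} u_i = 0

% gives a small r x r linear system per user:
\Big( M_{I_i} M_{I_i}^\top + \lambda\, n_{u_i} I \Big)\, u_i = M_{I_i}\, a_{i,I_i}
\quad\Longrightarrow\quad
u_i = \Big( M_{I_i} M_{I_i}^\top + \lambda\, n_{u_i} I \Big)^{-1} M_{I_i}\, a_{i,I_i}
```

The update for an item vector $m_j$ with U fixed is symmetric. Each system involves only one user's interactions, which is why all users can be recomputed independently.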
Implementation in Mahout
o.a.m.cf.taste.hadoop.als.ParallelALSFactorizationJob
computes a factorization using Alternating Least Squares, has
different solvers for explicit and implicit data
Zhou et al.: Large-Scale Parallel Collaborative Filtering for the Netflix Prize, AAIM ’08
Hu et al.: Collaborative Filtering for Implicit Feedback Datasets, ICDM ’08
o.a.m.cf.taste.hadoop.als.FactorizationEvaluator computes
the prediction error of a factorization on a test set
o.a.m.cf.taste.hadoop.als.RecommenderJob computes
recommendations from a factorization
Scalable Matrix Factorization: Implementation
Recompute user feature matrix U using a broadcast-join:
1. Run a map-only job using multithreaded mappers
2. load item-feature matrix M into memory from HDFS to
share it among the individual mappers
3. mappers read the interaction histories of the users
4. multithreaded: solve a least squares problem per user to
recompute its feature vector
[Figure: broadcast-join. The item features M are broadcast to machine 1, machine 2 and machine 3; on each machine a map task hash-joins its local partition of the user histories A with M, recomputes the user features U and forwards them locally.]
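The steps above can be imitated on a single machine: M is held in memory once and shared read-only, and each user's least-squares problem is solved independently in parallel. The following is a minimal sketch, not ParallelALSFactorizationJob itself. The class and method names are hypothetical, a parallel stream stands in for the multithreaded mappers, a small dense Gaussian elimination is swapped in as the solver, and the regularization term is weighted by the number of interactions per user as in Zhou et al. It assumes λ > 0 and at least one interaction per user.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** Half an ALS iteration on one machine: recompute all user vectors with M held fixed. */
class UserFeatureRecomputation {

    /** Solves one user's regularized least-squares problem; m is k x numItems (k = rank). */
    static double[] solveUser(double[][] m, int[] items, double[] ratings, double lambda) {
        int k = m.length;
        double[][] lhs = new double[k][k]; // M_I M_I^T + lambda * n_u * I
        double[] rhs = new double[k];      // M_I a_i
        for (int n = 0; n < items.length; n++) {
            int j = items[n];
            for (int r = 0; r < k; r++) {
                rhs[r] += m[r][j] * ratings[n];
                for (int c = 0; c < k; c++) {
                    lhs[r][c] += m[r][j] * m[c][j];
                }
            }
        }
        for (int r = 0; r < k; r++) {
            lhs[r][r] += lambda * items.length;
        }
        return solve(lhs, rhs);
    }

    /** Gaussian elimination with partial pivoting; overwrites its arguments. */
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        for (int col = 0; col < n; col++) {
            int pivot = col;
            for (int row = col + 1; row < n; row++) {
                if (Math.abs(a[row][col]) > Math.abs(a[pivot][col])) pivot = row;
            }
            double[] tmpRow = a[col]; a[col] = a[pivot]; a[pivot] = tmpRow;
            double tmp = b[col]; b[col] = b[pivot]; b[pivot] = tmp;
            for (int row = col + 1; row < n; row++) {
                double factor = a[row][col] / a[col][col];
                b[row] -= factor * b[col];
                for (int c = col; c < n; c++) a[row][c] -= factor * a[col][c];
            }
        }
        double[] x = new double[n];
        for (int row = n - 1; row >= 0; row--) {
            double sum = b[row];
            for (int c = row + 1; c < n; c++) sum -= a[row][c] * x[c];
            x[row] = sum / a[row][row];
        }
        return x;
    }

    /** Recomputes every user vector in parallel; m is shared read-only across threads. */
    static double[][] recomputeUsers(double[][] m, int[][] items, double[][] ratings, double lambda) {
        return IntStream.range(0, items.length).parallel()
                .mapToObj(u -> solveUser(m, items[u], ratings[u], lambda))
                .toArray(double[][]::new);
    }

    public static void main(String[] args) {
        double[][] m = { {1, 0}, {0, 1} };            // 2 latent features, 2 items
        int[][] items = { {0, 1}, {1} };              // item indices per user
        double[][] ratings = { {3, 4}, {2} };         // matching observed values
        for (double[] u : recomputeUsers(m, items, ratings, 1.0)) {
            System.out.println(Arrays.toString(u));
        }
    }
}
```

No shuffle is needed in this scheme because every user's history is joined against the same in-memory copy of M, which mirrors the map-only job described on the slide.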
Scalable Matrix Factorization: Experiments
Setup
26 machines running Java 7 and Hadoop 1.0.4
two 4-core Opteron CPUs, 32 GB memory and four 1 TB
disk drives per machine
configured Hadoop to reuse JVMs, ran multithreaded
mappers
Results
Yahoo Songs dataset (700M datapoints), 26 machines, single
iteration (two map-only jobs) takes less than 2 minutes
Thanks for listening!
Follow me on Twitter at http://twitter.com/sscdotopen
Join Mahout’s mailing lists at http://s.apache.org/mahout-lists
picture on slide 3 by Tim Abott, http://www.flickr.com/photos/theabbott/
picture on slide 21 by Crimson Diabolics, http://crimsondiabolics.deviantart.com/

Auchitya Theory by Kshemendra Indian Poetics
 
3.12.24 The Social Construction of Gender.pptx
3.12.24 The Social Construction of Gender.pptx3.12.24 The Social Construction of Gender.pptx
3.12.24 The Social Construction of Gender.pptx
 
Quantitative research methodology and survey design
Quantitative research methodology and survey designQuantitative research methodology and survey design
Quantitative research methodology and survey design
 
POST ENCEPHALITIS case study Jitendra bhargav
POST ENCEPHALITIS case study  Jitendra bhargavPOST ENCEPHALITIS case study  Jitendra bhargav
POST ENCEPHALITIS case study Jitendra bhargav
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (FRIE...
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (FRIE...BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (FRIE...
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (FRIE...
 
AUDIENCE THEORY - PARTICIPATORY - JENKINS.pptx
AUDIENCE THEORY - PARTICIPATORY - JENKINS.pptxAUDIENCE THEORY - PARTICIPATORY - JENKINS.pptx
AUDIENCE THEORY - PARTICIPATORY - JENKINS.pptx
 
DLL Catch Up Friday March 22.docx CATCH UP FRIDAYS
DLL Catch Up Friday March 22.docx CATCH UP FRIDAYSDLL Catch Up Friday March 22.docx CATCH UP FRIDAYS
DLL Catch Up Friday March 22.docx CATCH UP FRIDAYS
 
Plant Tissue culture., Plasticity, Totipotency, pptx
Plant Tissue culture., Plasticity, Totipotency, pptxPlant Tissue culture., Plasticity, Totipotency, pptx
Plant Tissue culture., Plasticity, Totipotency, pptx
 
Metabolism , Metabolic Fate& disorders of cholesterol.pptx
Metabolism , Metabolic Fate& disorders of cholesterol.pptxMetabolism , Metabolic Fate& disorders of cholesterol.pptx
Metabolism , Metabolic Fate& disorders of cholesterol.pptx
 
LEAD6001 - Introduction to Advanced Stud
LEAD6001 - Introduction to Advanced StudLEAD6001 - Introduction to Advanced Stud
LEAD6001 - Introduction to Advanced Stud
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - HK2 (...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - HK2 (...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - HK2 (...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - HK2 (...
 
The OERs: Transforming Education for Sustainable Future by Dr. Sarita Anand
The OERs: Transforming Education for Sustainable Future by Dr. Sarita AnandThe OERs: Transforming Education for Sustainable Future by Dr. Sarita Anand
The OERs: Transforming Education for Sustainable Future by Dr. Sarita Anand
 
Least Significance Difference:Biostatics and Research Methodology
Least Significance Difference:Biostatics and Research MethodologyLeast Significance Difference:Biostatics and Research Methodology
Least Significance Difference:Biostatics and Research Methodology
 
LEAD5623 The Economics of Community Coll
LEAD5623 The Economics of Community CollLEAD5623 The Economics of Community Coll
LEAD5623 The Economics of Community Coll
 

New Directions in Mahout's Recommenders

  • 1. New Directions in Mahout’s Recommenders Sebastian Schelter, Apache Software Foundation Recommender Systems Get-together Berlin
  • 2. New Directions? Mahout in Action is the prime source of information for using Mahout in practice. As it is more than two years old, it is missing a lot of recent developments. This talk describes what has been added to the recommenders of Mahout since then.
  • 4. MyMediaLite, a scientific library of recommender system algorithms. Mahout now features a couple of popular latent factor models, mostly ported by Zeno Gantner.
  • 5. New recommenders and factorizers
      • BiasedItemBasedRecommender: item-based kNN with user-item-bias estimation (Koren: Factor in the Neighbors: Scalable and Accurate Collaborative Filtering, TKDD ’09)
      • RatingSGDFactorizer: biased matrix factorization (Koren et al.: Matrix Factorization Techniques for Recommender Systems, IEEE Computer ’09)
      • SVDPlusPlusFactorizer: SVD++ (Koren: Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model, KDD ’08)
      • ALSWRFactorizer: matrix factorization using Alternating Least Squares (Zhou et al.: Large-Scale Parallel Collaborative Filtering for the Netflix Prize, AAIM ’08; Hu et al.: Collaborative Filtering for Implicit Feedback Datasets, ICDM ’08)
  • 6. Batch item similarities on a single machine. A simple but powerful way to deploy Mahout: use item-based collaborative filtering with periodically precomputed item similarities. Mahout now supports multithreaded item similarity computation on a single machine for data sizes that don’t require a Hadoop-based solution.
        // k, numThreads, maxDurationInHours and resultFile are not defined on the slide
        DataModel dataModel = new FileDataModel(new File("movielens.csv"));
        ItemSimilarity similarity = new LogLikelihoodSimilarity(dataModel);
        ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, similarity);
        BatchItemSimilarities batch = new MultithreadedBatchItemSimilarities(recommender, k);
        batch.computeItemSimilarities(numThreads, maxDurationInHours, new FileSimilarItemsWriter(resultFile));
  • 8. Collaborative Filtering
      • idea: infer recommendations from patterns found in the historical user-item interactions
      • data can be explicit feedback (ratings) or implicit feedback (clicks, pageviews), represented in the interaction matrix A:
                    item1   ···   item3   ···
            user1     3     ···     4     ···
            user2     −     ···     4     ···
            user3     5     ···     1     ···
             ···     ···    ···    ···    ···
      • row $a_i$ denotes the interaction history of user i
      • we target use cases with millions of users and hundreds of millions of interactions
  • 9. MapReduce: a paradigm for data-intensive parallel processing
      • data is partitioned in a distributed file system
      • computation is moved to data
      • system handles distribution, execution, scheduling, failures
      • fixed processing pipeline where the user specifies two functions:
            map : (k1, v1) → list(k2, v2)
            reduce : (k2, list(v2)) → list(v2)
      [Figure: inputs read from the DFS flow through map, shuffle, and reduce, and the outputs are written back to the DFS]
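To make the two function signatures concrete, here is a minimal single-process sketch of the map, shuffle, and reduce stages. It is not Hadoop code: the class `MiniMapReduceSketch`, its `run` method, and the example job are hypothetical names chosen for illustration, and the sketch assumes Java 9 or later.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

/** Hypothetical in-memory imitation of the map -> shuffle -> reduce pipeline. */
final class MiniMapReduceSketch {

    /** Applies map to every input record, groups emitted values by key, then reduces each group. */
    static <K1, V1, K2, V2> Map<K2, List<V2>> run(
            Map<K1, V1> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> map,
            BiFunction<K2, List<V2>, List<V2>> reduce) {
        // map + shuffle: collect all values emitted under the same key
        Map<K2, List<V2>> groups = new LinkedHashMap<>();
        for (Map.Entry<K1, V1> record : input.entrySet()) {
            for (Map.Entry<K2, V2> emitted : map.apply(record.getKey(), record.getValue())) {
                groups.computeIfAbsent(emitted.getKey(), key -> new ArrayList<>())
                      .add(emitted.getValue());
            }
        }
        // reduce: one call per distinct key
        Map<K2, List<V2>> output = new LinkedHashMap<>();
        for (Map.Entry<K2, List<V2>> group : groups.entrySet()) {
            output.put(group.getKey(), reduce.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    /** Example job: count how many users interacted with each item. */
    static Map<String, List<Integer>> countInteractionsPerItem(Map<String, List<String>> histories) {
        return run(
                histories,
                (user, items) -> {
                    List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
                    for (String item : items) {
                        emitted.add(Map.entry(item, 1)); // emit (item, 1) per interaction
                    }
                    return emitted;
                },
                (item, ones) -> {
                    int sum = 0;
                    for (int one : ones) {
                        sum += one;
                    }
                    return List.of(sum);
                });
    }
}
```

In a real cluster the groups would be partitioned across machines during the shuffle; here everything lives in one map purely to show the data flow.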
  • 11. Neighborhood Methods. Item-based collaborative filtering is one of the most deployed CF algorithms, because:
      • it is simple and intuitively understandable
      • it additionally gives non-personalized, per-item recommendations (people who like X might also like Y)
      • it gives recommendations for new users without model retraining
      • it gives comprehensible explanations (we recommend Y because you liked X)
  • 12. Cooccurrences
      • start with a simplified view: imagine the interaction matrix A was binary → we look at cooccurrences only
      • item similarity computation becomes matrix multiplication: $r_i = (A^\top A)\, a_i$
      • scale-out of the item-based approach reduces to finding an efficient way to compute the item similarity matrix $S = A^\top A$
  • 13. Parallelizing $S = A^\top A$
      • the standard approach of computing item cooccurrences requires random access to both users and items:
            foreach item f do
              foreach user i who interacted with f do
                foreach item j that i also interacted with do
                  S_fj = S_fj + 1
        → not efficiently parallelizable on partitioned data
      • the row outer product formulation of matrix multiplication is efficiently parallelizable on a row-partitioned A: $S = A^\top A = \sum_{i \in A} a_i a_i^\top$
      • mappers compute the outer products of rows of A and emit the results row-wise, reducers sum these up to form S
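The row outer product formulation can be sketched in plain Java as follows. This is a dense, single-machine illustration rather than Mahout's implementation; the class and method names are hypothetical, and A is assumed to be a small binary matrix with users as rows and items as columns.

```java
/** Hypothetical sketch: S = A^T A computed as a sum of outer products of the rows of A. */
final class CooccurrenceSketch {

    /** "Map" step: the outer product a_i a_i^T of a single user row. */
    static int[][] outerProduct(int[] row) {
        int numItems = row.length;
        int[][] partial = new int[numItems][numItems];
        for (int f = 0; f < numItems; f++) {
            if (row[f] == 0) {
                continue; // the user did not interact with item f, so this row stays zero
            }
            for (int j = 0; j < numItems; j++) {
                partial[f][j] = row[f] * row[j];
            }
        }
        return partial;
    }

    /** "Reduce" step: sum the partial results of all users to form S. */
    static int[][] cooccurrences(int[][] a) {
        int numItems = a[0].length;
        int[][] s = new int[numItems][numItems];
        for (int[] userRow : a) {
            int[][] partial = outerProduct(userRow);
            for (int f = 0; f < numItems; f++) {
                for (int j = 0; j < numItems; j++) {
                    s[f][j] += partial[f][j];
                }
            }
        }
        return s;
    }
}
```

Each call to `outerProduct` needs only one user's row, which is why the work can be spread over a row-partitioned A without random access to other users.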
  • 14. Parallel similarity computation
      • real datasets are not binary, and we want to use a variety of similarity measures, e.g. Pearson correlation
      • express similarity measures by 3 canonical functions, which can be efficiently embedded into the computation (cf. VectorSimilarityMeasure):
          • preprocess adjusts an item rating vector: $\hat{f} = \mathrm{preprocess}(f)$, $\hat{j} = \mathrm{preprocess}(j)$
          • norm computes a single number from the adjusted vector: $n_f = \mathrm{norm}(\hat{f})$, $n_j = \mathrm{norm}(\hat{j})$
          • similarity computes the similarity of two vectors from the norms and their dot product: $S_{fj} = \mathrm{similarity}(\mathrm{dot}_{fj}, n_f, n_j)$
  • 15. Example: Jaccard coefficient
      • preprocess binarizes the rating vectors: for $f = (3, -, 5)^\top$ and $j = (4, 4, 1)^\top$ we get $\hat{f} = \mathrm{bin}(f) = (1, 0, 1)^\top$ and $\hat{j} = \mathrm{bin}(j) = (1, 1, 1)^\top$
      • norm computes the number of users that rated each item: $n_f = \|\hat{f}\|_1 = 2$, $n_j = \|\hat{j}\|_1 = 3$
      • similarity finally computes the Jaccard coefficient from the norms and the dot product of the vectors: $\mathrm{jaccard}(f, j) = \frac{|f \cap j|}{|f \cup j|} = \frac{\mathrm{dot}_{fj}}{n_f + n_j - \mathrm{dot}_{fj}} = \frac{2}{2 + 3 - 2} = \frac{2}{3}$
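The Jaccard example can be reproduced with a small standalone sketch of the three canonical functions. The method names follow the slide rather than the actual method names of Mahout's VectorSimilarityMeasure interface, the class name is hypothetical, and encoding a missing rating as `Double.NaN` is an assumption made for this illustration.

```java
/** Hypothetical sketch of preprocess / norm / similarity for the Jaccard coefficient. */
final class JaccardSketch {

    /** preprocess: binarize the rating vector (NaN marks a missing rating). */
    static double[] preprocess(double[] ratings) {
        double[] binarized = new double[ratings.length];
        for (int user = 0; user < ratings.length; user++) {
            binarized[user] = Double.isNaN(ratings[user]) ? 0.0 : 1.0;
        }
        return binarized;
    }

    /** norm: L1 norm of the binarized vector, i.e. the number of users who rated the item. */
    static double norm(double[] vector) {
        double sum = 0.0;
        for (double value : vector) {
            sum += Math.abs(value);
        }
        return sum;
    }

    /** Dot product of two preprocessed vectors: the number of users who rated both items. */
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int user = 0; user < a.length; user++) {
            sum += a[user] * b[user];
        }
        return sum;
    }

    /** similarity: Jaccard coefficient; yields NaN if neither item has any rating. */
    static double similarity(double dot, double normF, double normJ) {
        return dot / (normF + normJ - dot);
    }

    static double jaccard(double[] f, double[] j) {
        double[] fHat = preprocess(f);
        double[] jHat = preprocess(j);
        return similarity(dot(fHat, jHat), norm(fHat), norm(jHat));
    }
}
```

Calling `JaccardSketch.jaccard(new double[] {3, Double.NaN, 5}, new double[] {4, 4, 1})` follows the same steps as the worked example: a dot product of 2 and norms of 2 and 3.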
  • 16. Implementation in Mahout
      • o.a.m.math.hadoop.similarity.cooccurrence.RowSimilarityJob computes the top-k pairwise similarities for each row of a matrix using some similarity measure
      • o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob computes the top-k similar items per item using RowSimilarityJob
      • o.a.m.cf.taste.hadoop.item.RecommenderJob computes recommendations and similar items using RowSimilarityJob
  • 17. MapReduce pass 1
      • data partitioned by items (row-partitioned $A^\top$)
      • invokes preprocess and norm for each item vector
      • transposes input to form A
      [Figure: map, combine, shuffle, and reduce stages of pass 1, turning $A^\top$ (pointing from items to users) into the binarized A (pointing from users to items) and the item "norms"]
  • 18. MapReduce pass 2
      • data partitioned by users (row-partitioned A)
      • computes dot products of columns
      • loads norms and invokes similarity
      • implementation contains several optimizations (sparsification, exploiting symmetry and thresholds)
      [Figure: map, combine, shuffle, and reduce stages of pass 2, turning the binarized A and the item "norms" into $A^\top A$, which holds the item similarities]
  • 19. Cost of the algorithm
      • the major cost in our algorithm is the communication in the second MapReduce pass: for each user, we have to process the square of the number of their interactions, since $S = \sum_{i \in A} a_i a_i^\top$
      • → cost is dominated by the densest rows of A (the users with the highest number of interactions)
      • the distribution of interactions per user is usually heavy-tailed → a small number of power users with a disproportionately high number of interactions drastically increases the runtime
      • if a user has more than p interactions, only use a random sample of size p of their interactions
      • we saw a negligible effect on prediction quality for moderate p
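The interaction cut described above might look like the following sketch. It illustrates the idea only and is not the sampling code used inside Mahout's jobs; `InteractionSamplingSketch`, `downsample`, and `pairCost` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: cap the interaction history of power users at p random interactions. */
final class InteractionSamplingSketch {

    /** Returns the history unchanged if it has at most p entries, else a uniform random sample of size p. */
    static <T> List<T> downsample(List<T> interactions, int p, Random random) {
        if (interactions.size() <= p) {
            return new ArrayList<>(interactions);
        }
        List<T> shuffled = new ArrayList<>(interactions);
        Collections.shuffle(shuffled, random);
        return new ArrayList<>(shuffled.subList(0, p));
    }

    /** Number of item pairs the outer product a_i a_i^T touches for a user with this many interactions. */
    static long pairCost(int numInteractions) {
        return (long) numInteractions * numInteractions;
    }
}
```

Because the cost is quadratic, capping a user with 1,000 interactions at p = 100 reduces that user's share of the work from 1,000,000 item pairs to 10,000.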
  • 20. Scalable Neighborhood Methods: Experiments
      • Setup: 26 machines running Java 7 and Hadoop 1.0.4, each with two 4-core Opteron CPUs, 32 GB memory and four 1 TB disk drives
      • Results: on the Yahoo Songs dataset (700M datapoints, 1.8M users, 136K items), similarity computation on the 26 machines takes less than 40 minutes
  • 22. Latent factor models: idea
      • interactions are deeply influenced by a set of factors that are very specific to the domain (e.g. amount of action or complexity of characters in movies)
      • these factors are in general not obvious: we might be able to think of some of them, but it’s hard to estimate their impact on the interactions
      • we need to infer those so-called latent factors from the interaction data
  • 23. Low-rank matrix factorization
      • approximately factor A into the product of two rank r feature matrices U and M such that $A \approx UM$
      • U models the latent features of the users, M models the latent features of the items
      • the dot product $u_i^\top m_j$ in the latent feature space predicts the strength of interaction between user i and item j
      • to obtain a factorization, minimize the regularized squared error over the observed interactions, e.g.: $\min_{U,M} \sum_{(i,j) \in A} (a_{ij} - u_i^\top m_j)^2 + \lambda \left( \sum_i n_{u_i} \|u_i\|^2 + \sum_j n_{m_j} \|m_j\|^2 \right)$
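As a rough illustration, the objective above can be evaluated for a small dense example as follows. The sketch assumes that $n_{u_i}$ and $n_{m_j}$ count the observed interactions of user i and item j (the weighted regularization of Zhou et al.), that unobserved entries are encoded as `Double.NaN`, and that item features are stored one item per row; the class and method names are hypothetical.

```java
/** Hypothetical sketch: regularized squared error of a factorization over the observed entries. */
final class FactorizationObjectiveSketch {

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            sum += a[k] * b[k];
        }
        return sum;
    }

    /**
     * @param a            interaction matrix (users x items), NaN marks an unobserved entry
     * @param userFeatures one feature vector u_i per user
     * @param itemFeatures one feature vector m_j per item
     * @param lambda       regularization weight
     */
    static double objective(double[][] a, double[][] userFeatures, double[][] itemFeatures, double lambda) {
        int[] interactionsPerUser = new int[userFeatures.length];
        int[] interactionsPerItem = new int[itemFeatures.length];
        double squaredError = 0.0;
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                if (Double.isNaN(a[i][j])) {
                    continue; // only observed interactions contribute
                }
                double error = a[i][j] - dot(userFeatures[i], itemFeatures[j]);
                squaredError += error * error;
                interactionsPerUser[i]++;
                interactionsPerItem[j]++;
            }
        }
        double regularization = 0.0;
        for (int i = 0; i < userFeatures.length; i++) {
            regularization += interactionsPerUser[i] * dot(userFeatures[i], userFeatures[i]);
        }
        for (int j = 0; j < itemFeatures.length; j++) {
            regularization += interactionsPerItem[j] * dot(itemFeatures[j], itemFeatures[j]);
        }
        return squaredError + lambda * regularization;
    }
}
```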
  • 24. Alternating Least Squares
      • ALS alternates between fixing U and M: when U is fixed, the system recomputes M by solving a least-squares problem per item, and vice versa
      • easy to parallelize, as all users (and vice versa, items) can be recomputed independently
      • additionally, ALS is able to solve non-sparse models from implicit data
      [Figure: A (u × i) ≈ U (u × k) × M (k × i)]
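One half-step of ALS, recomputing a single user's feature vector while M stays fixed, can be sketched as below. This is an illustrative dense solver for explicit feedback, not Mahout's code. It assumes the weighted regularization of Zhou et al., so the per-user system is $(\sum_j m_j m_j^\top + \lambda n_u I)\, u = \sum_j a_{ij} m_j$ over the items j the user interacted with; missing ratings are encoded as `Double.NaN`, and all names are hypothetical.

```java
/** Hypothetical sketch: recompute one user's feature vector with the item features held fixed. */
final class AlsUpdateSketch {

    /** Assumes the user has at least one observed rating and lambda > 0 (or full-rank item features). */
    static double[] recomputeUser(double[] ratings, double[][] itemFeatures, double lambda) {
        int rank = itemFeatures[0].length;
        double[][] lhs = new double[rank][rank];
        double[] rhs = new double[rank];
        int observed = 0;
        for (int j = 0; j < ratings.length; j++) {
            if (Double.isNaN(ratings[j])) {
                continue; // item j was not rated by this user
            }
            observed++;
            double[] mj = itemFeatures[j];
            for (int p = 0; p < rank; p++) {
                rhs[p] += ratings[j] * mj[p];
                for (int q = 0; q < rank; q++) {
                    lhs[p][q] += mj[p] * mj[q];
                }
            }
        }
        for (int p = 0; p < rank; p++) {
            lhs[p][p] += lambda * observed; // weighted regularization term
        }
        return solve(lhs, rhs);
    }

    /** Gaussian elimination with partial pivoting; modifies its arguments in place. */
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        for (int col = 0; col < n; col++) {
            int pivot = col;
            for (int row = col + 1; row < n; row++) {
                if (Math.abs(a[row][col]) > Math.abs(a[pivot][col])) {
                    pivot = row;
                }
            }
            double[] swapRow = a[col]; a[col] = a[pivot]; a[pivot] = swapRow;
            double swapValue = b[col]; b[col] = b[pivot]; b[pivot] = swapValue;
            for (int row = col + 1; row < n; row++) {
                double factor = a[row][col] / a[col][col];
                b[row] -= factor * b[col];
                for (int k = col; k < n; k++) {
                    a[row][k] -= factor * a[col][k];
                }
            }
        }
        double[] x = new double[n];
        for (int row = n - 1; row >= 0; row--) {
            double sum = b[row];
            for (int k = row + 1; k < n; k++) {
                sum -= a[row][k] * x[k];
            }
            x[row] = sum / a[row][row];
        }
        return x;
    }
}
```

Since each call reads only one user's ratings plus the shared item features, the calls for different users are independent, which is what makes the update easy to parallelize.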
  • 25. Implementation in Mahout
      • o.a.m.cf.taste.hadoop.als.ParallelALSFactorizationJob computes a factorization using Alternating Least Squares, has different solvers for explicit and implicit data (Zhou et al.: Large-Scale Parallel Collaborative Filtering for the Netflix Prize, AAIM ’08; Hu et al.: Collaborative Filtering for Implicit Feedback Datasets, ICDM ’08)
      • o.a.m.cf.taste.hadoop.als.FactorizationEvaluator computes the prediction error of a factorization on a test set
      • o.a.m.cf.taste.hadoop.als.RecommenderJob computes recommendations from a factorization
  • 26. Scalable Matrix Factorization: Implementation. Recompute the user feature matrix U using a broadcast-join:
      1. run a map-only job using multithreaded mappers
      2. load the item-feature matrix M into memory from HDFS to share it among the individual mappers
      3. mappers read the interaction histories of the users
      4. multithreaded: solve a least squares problem per user to recompute its feature vector
      [Figure: on each of three machines, a map task hash-joins the local user histories A with the broadcast item features M and recomputes the user features U]
  • 27. Scalable Matrix Factorization: Experiments
      • Setup: 26 machines running Java 7 and Hadoop 1.0.4, each with two 4-core Opteron CPUs, 32 GB memory and four 1 TB disk drives; Hadoop configured to reuse JVMs, running multithreaded mappers
      • Results: on the Yahoo Songs dataset (700M datapoints), a single iteration (two map-only jobs) on the 26 machines takes less than 2 minutes
  • 28. Thanks for listening! Follow me on Twitter at http://twitter.com/sscdotopen. Join Mahout’s mailing lists at http://s.apache.org/mahout-lists. Picture on slide 3 by Tim Abott, http://www.flickr.com/photos/theabbott/; picture on slide 21 by Crimson Diabolics, http://crimsondiabolics.deviantart.com/