Scalable Similarity-Based Neighborhood Methods with MapReduce
6th ACM Conference on Recommender Systems, Dublin, 2012




Sebastian Schelter, Christoph Boden, Volker Markl

Database Systems and Information Management Group
Technische Universität Berlin
motivation
• with rapid growth in data sizes, the processing efficiency,
  scalability and fault tolerance of recommender systems in
  production become a major concern

• run data-intensive computations
  in parallel on a large number of
  commodity machines
  → need to rephrase algorithms

• our work: rephrase and scale-out the similarity-based
  neighborhood methods on MapReduce

• proposed solution forms the core of the
  distributed recommender of Apache Mahout
MapReduce
• popular paradigm for data-intensive parallel processing
   –   data is partitioned across the cluster in a distributed file system
   –   computation is moved to data
   –   fixed processing pipeline where the user specifies two functions (map and reduce)
   –   system handles distribution, execution, scheduling, failures etc.
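The paradigm above can be illustrated with a toy, single-process sketch. It is not Hadoop code: `run_mapreduce`, `map_fn`, `reduce_fn` and the sample interactions are hypothetical names and data chosen for illustration, and the "shuffle" is just an in-memory dictionary.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy stand-in for a MapReduce run (illustrative only, no distribution)."""
    groups = defaultdict(list)
    # map phase: every record is processed independently of the others
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # "shuffle": group emitted values by key
    # reduce phase: every key is processed independently of the others
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# hypothetical input: (user, item) interaction pairs
interactions = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u3", "a"), ("u3", "c")]

def map_fn(record):
    user, item = record
    yield item, 1  # emit one count per interaction, keyed by item

def reduce_fn(item, counts):
    return sum(counts)  # number of interactions per item

item_counts = run_mapreduce(interactions, map_fn, reduce_fn)
print(item_counts)
```

In a real cluster the map and reduce calls would run on different machines next to their data partitions; the user-supplied part stays limited to the two functions.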




cooccurrences
• start with a simplified view:
  binary |U|x|I| matrix A holds interactions
  between users U and items I

• neighborhood methods share same
  fundamental computational model

   [figure: user-based vs. item-based]



• we focus on the item-based approach; its scale-out reduces to
  finding an efficient way to compute the item similarity matrix



parallelizing S = A^T A
• standard approach of computing item cooccurrences requires
  random access to both users and items
   foreach item i
    foreach user u who interacted with i
     foreach item j that u also interacted with
         S_ij = S_ij + 1
   → not efficiently parallelizable on partitioned data

• row outer product formulation of matrix multiplication
  is efficiently parallelizable on a row-partitioned A




• each map invocation computes the outer product of a row of A,
  emits the resulting matrix row-wise
• reducers sum these up to form S
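The row outer product formulation rests on the identity S = A^T A = sum over all users u of a_u a_u^T, where a_u is row u of A written as a column vector. A minimal single-process sketch of the map and reduce steps follows; the names `rows`, `map_row` and `reduce_rows` and the toy data are assumptions for illustration, not the Mahout implementation.

```python
from collections import defaultdict

# hypothetical row-partitioned binary A: one row per user,
# stored sparsely as the list of items that user interacted with
rows = {
    "u1": ["a", "b"],
    "u2": ["a", "c"],
    "u3": ["a", "b", "c"],
}

def map_row(items):
    """Outer product of one binary row with itself, emitted row-wise."""
    for i in items:
        # row i of the outer product holds a 1 for every item j in this user row
        yield i, {j: 1 for j in items}

def reduce_rows(partial_rows):
    """Sum all partial rows emitted for one item into the matching row of S."""
    total = defaultdict(int)
    for partial in partial_rows:
        for j, count in partial.items():
            total[j] += count
    return dict(total)

# simulate the shuffle: group the emitted partial rows by item
grouped = defaultdict(list)
for items in rows.values():
    for i, partial in map_row(items):
        grouped[i].append(partial)

S = {i: reduce_rows(partials) for i, partials in grouped.items()}
print(S["a"])
```

Each map call only touches a single row of A, so no random access across partitions is needed; the diagonal entry S[i][i] ends up holding the number of users who interacted with item i.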
parallel similarity computation
• real datasets not binary, either contain explicit feedback (ratings)
  or implicit feedback (clicks, pageviews)

• algorithm computes dot products; these alone are not enough,
  as we want to support a variety of similarity measures (cosine, Jaccard
  coefficient, Pearson correlation, ...)

• express similarity measures by 3 canonical functions, which can be
  efficiently embedded into our algorithm
    – preprocess adjusts an item rating vector
    – norm computes a single number from the adjusted item rating vector
    – similarity computes the similarity of two vectors from the norms and
      their dot product
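As a rough illustration of the three canonical functions, the sketch below maps cosine similarity onto them. This particular split (identity preprocess, Euclidean norm, dot product divided by both norms) is one possible mapping assumed here for illustration; the helper `dot_product` and the toy rating vectors are hypothetical as well.

```python
import math

# hypothetical item rating vectors, stored sparsely as {user: rating}
item_a = {"u1": 5.0, "u2": 3.0}
item_b = {"u1": 4.0, "u3": 2.0}

def preprocess(vector):
    """Adjust an item rating vector; this cosine mapping leaves it unchanged."""
    return dict(vector)

def norm(vector):
    """Compute a single number from the adjusted vector (here: Euclidean length)."""
    return math.sqrt(sum(r * r for r in vector.values()))

def similarity(dot, norm_i, norm_j):
    """Compute the similarity from the two norms and the dot product."""
    return dot / (norm_i * norm_j)

def dot_product(v, w):
    """Dot product over the users both items have in common."""
    return sum(r * w[u] for u, r in v.items() if u in w)

a, b = preprocess(item_a), preprocess(item_b)
cosine = similarity(dot_product(a, b), norm(a), norm(b))
print(cosine)
```

Only `preprocess`, `norm` and `similarity` change between measures; the dot products come out of the matrix multiplication step regardless of the measure chosen.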

example: Jaccard coefficient
• preprocess binarizes the rating vectors




• norm computes the number of users that rated each item



• similarity finally computes the Jaccard coefficient from the
  norms and the dot product of the vectors
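The three steps for the Jaccard coefficient can be sketched as follows. The formula used in `similarity` is the standard definition (size of the intersection divided by size of the union), expressed through the dot product and the two norms; the toy vectors and the helper `dot_product` are assumptions for illustration.

```python
# hypothetical item rating vectors, stored sparsely as {user: rating}
item_a = {"u1": 5.0, "u2": 3.0, "u4": 1.0}
item_b = {"u1": 4.0, "u3": 2.0, "u4": 2.0}

def preprocess(vector):
    """Binarize: every non-zero rating becomes 1."""
    return {user: 1 for user, rating in vector.items() if rating != 0}

def norm(vector):
    """Number of users that rated the item (sum of the binarized entries)."""
    return sum(vector.values())

def similarity(dot, norm_i, norm_j):
    """Jaccard coefficient: |intersection| / |union| = dot / (n_i + n_j - dot)."""
    return dot / (norm_i + norm_j - dot)

def dot_product(v, w):
    """For binarized vectors this counts the users who rated both items."""
    return sum(x * w[u] for u, x in v.items() if u in w)

a, b = preprocess(item_a), preprocess(item_b)
jaccard = similarity(dot_product(a, b), norm(a), norm(b))
print(jaccard)
```

Here two users rated both items and four users rated at least one of them, so the coefficient is 2 / 4.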




cost of the algorithm
• determined by the amount of data that has to be sent over the
  network in the matrix multiplication step

• for each user, we have to process the square of the number of their
  interactions → cost is dominated by the densest rows of A

• distribution of interactions per user is usually heavy-tailed
  → a small number of power users with a disproportionately high
  number of interactions drastically increases the runtime

• apply ‘interaction-cut’
   – if a user has more than p interactions, only use a random sample of
     size p of their interactions
   – saw negligible effect on prediction quality for moderately sized p
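A minimal sketch of the interaction-cut and its effect on the quadratic per-user cost. The user histories, the value p = 500 and the function name `interaction_cut` are assumptions chosen for illustration; they are not taken from the experiments.

```python
import random

def interaction_cut(interactions, p, rng):
    """Keep at most p interactions of one user, sampled uniformly at random."""
    if len(interactions) <= p:
        return list(interactions)
    return rng.sample(interactions, p)

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# hypothetical histories: one casual user and one power user
history = {
    "casual": list(range(10)),
    "power": list(range(10_000)),
}

p = 500  # assumed cut size, for illustration only
cut = {user: interaction_cut(items, p, rng) for user, items in history.items()}

# per-user cost is the square of the number of interactions
cost_before = sum(len(items) ** 2 for items in history.values())
cost_after = sum(len(items) ** 2 for items in cut.values())
print(cost_before, cost_after)
```

In this toy setting the single power user accounts for almost all of the cost before the cut, while the casual user is left untouched by it.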
scalability experiments
• cluster: Apache Hadoop on 6 machines (two 8-core Opteron CPUs, 32 GB
  memory and four 1 TB drives each)
• dataset: R2 - Yahoo! Music (717M ratings, 1.8M users, 136k songs)
• similarity computation with differently sized interaction-cuts, measured
  prediction quality on 18M held out ratings




scalability experiments
• ran several experiments in Amazon's EC2 cloud using up to 20 m1.xlarge
  instances (15 GB RAM, 8 virtual cores each)

   → linear speedup with the number of machines
   → linear scalability with a growing user base




thank you.

                                 Questions?

Sebastian Schelter, Christoph Boden, Volker Markl
Database Systems and Information Management Group (DIMA), TU Berlin

mail: ssc@apache.org       twitter: @sscdotopen

code to reproduce our experiments is available at:
https://github.com/dima-tuberlin/publications-ssnmm

The research leading to these results has received funding from the European Union (EU)
in the course of the project 'ROBUST' (EU grant no. 257859) and used data provided by
'Yahoo Academic Relations'.

Weitere ähnliche Inhalte

Was ist angesagt?

Uncertainty aware multidimensional ensemble data visualization and exploration
Uncertainty aware multidimensional ensemble data visualization and explorationUncertainty aware multidimensional ensemble data visualization and exploration
Uncertainty aware multidimensional ensemble data visualization and explorationSubhashis Hazarika
 
Southwick anguiano lmu-symposium_presentation_20140329
Southwick anguiano lmu-symposium_presentation_20140329Southwick anguiano lmu-symposium_presentation_20140329
Southwick anguiano lmu-symposium_presentation_20140329GRNsight
 
Integrating Network Discovery and Community Detection (IRE IIITH) Team 24
Integrating Network Discovery and Community Detection (IRE IIITH) Team 24Integrating Network Discovery and Community Detection (IRE IIITH) Team 24
Integrating Network Discovery and Community Detection (IRE IIITH) Team 24Nikhil Daliya
 
J. Park, H. Shim, AAAI 2022, MLILAB, KAISTAI
J. Park, H. Shim, AAAI 2022, MLILAB, KAISTAIJ. Park, H. Shim, AAAI 2022, MLILAB, KAISTAI
J. Park, H. Shim, AAAI 2022, MLILAB, KAISTAIMLILAB
 
Linear regression on 1 terabytes of data? Some crazy observations and actions
Linear regression on 1 terabytes of data? Some crazy observations and actionsLinear regression on 1 terabytes of data? Some crazy observations and actions
Linear regression on 1 terabytes of data? Some crazy observations and actionsHesen Peng
 
Varshneya samdarshi lmu_symposium_2016
Varshneya samdarshi lmu_symposium_2016Varshneya samdarshi lmu_symposium_2016
Varshneya samdarshi lmu_symposium_2016GRNsight
 
G. Park, J.-Y. Yang, et. al., NeurIPS 2020, MLILAB, KAIST AI
G. Park, J.-Y. Yang, et. al., NeurIPS 2020, MLILAB, KAIST AIG. Park, J.-Y. Yang, et. al., NeurIPS 2020, MLILAB, KAIST AI
G. Park, J.-Y. Yang, et. al., NeurIPS 2020, MLILAB, KAIST AIMLILAB
 
J. Park, AAAI 2022, MLILAB, KAIST AI
J. Park, AAAI 2022, MLILAB, KAIST AIJ. Park, AAAI 2022, MLILAB, KAIST AI
J. Park, AAAI 2022, MLILAB, KAIST AIMLILAB
 
How to digitize penstocks leading to powerhouse of a hydropower plant from th...
How to digitize penstocks leading to powerhouse of a hydropower plant from th...How to digitize penstocks leading to powerhouse of a hydropower plant from th...
How to digitize penstocks leading to powerhouse of a hydropower plant from th...Mrinmoy Majumder
 
Cross domain sentiment classification via spectral feature alignment
Cross domain sentiment classification via spectral feature alignmentCross domain sentiment classification via spectral feature alignment
Cross domain sentiment classification via spectral feature alignmentlau
 
Southwick britain gr_nsight_cmsi402-presentation_20140508
Southwick britain gr_nsight_cmsi402-presentation_20140508Southwick britain gr_nsight_cmsi402-presentation_20140508
Southwick britain gr_nsight_cmsi402-presentation_20140508GRNsight
 
Machine Learning - Matt Moloney
Machine Learning - Matt MoloneyMachine Learning - Matt Moloney
Machine Learning - Matt MoloneyPhillip Trelford
 
Probabilistic data structures
Probabilistic data structuresProbabilistic data structures
Probabilistic data structuresYoav chernobroda
 
House price prediction
House price predictionHouse price prediction
House price predictionKaranseth30
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsIJDKP
 
Efficiently Maintaining Distributed Model-Based Views on Real-Time Data Streams
Efficiently Maintaining Distributed Model-Based Views on Real-Time Data StreamsEfficiently Maintaining Distributed Model-Based Views on Real-Time Data Streams
Efficiently Maintaining Distributed Model-Based Views on Real-Time Data StreamsPlanetData Network of Excellence
 
user_defined_functions_forinterpolation
user_defined_functions_forinterpolationuser_defined_functions_forinterpolation
user_defined_functions_forinterpolationsushanth tiruvaipati
 

Was ist angesagt? (19)

Uncertainty aware multidimensional ensemble data visualization and exploration
Uncertainty aware multidimensional ensemble data visualization and explorationUncertainty aware multidimensional ensemble data visualization and exploration
Uncertainty aware multidimensional ensemble data visualization and exploration
 
rscript_paper-1
rscript_paper-1rscript_paper-1
rscript_paper-1
 
Southwick anguiano lmu-symposium_presentation_20140329
Southwick anguiano lmu-symposium_presentation_20140329Southwick anguiano lmu-symposium_presentation_20140329
Southwick anguiano lmu-symposium_presentation_20140329
 
Integrating Network Discovery and Community Detection (IRE IIITH) Team 24
Integrating Network Discovery and Community Detection (IRE IIITH) Team 24Integrating Network Discovery and Community Detection (IRE IIITH) Team 24
Integrating Network Discovery and Community Detection (IRE IIITH) Team 24
 
J. Park, H. Shim, AAAI 2022, MLILAB, KAISTAI
J. Park, H. Shim, AAAI 2022, MLILAB, KAISTAIJ. Park, H. Shim, AAAI 2022, MLILAB, KAISTAI
J. Park, H. Shim, AAAI 2022, MLILAB, KAISTAI
 
Linear regression on 1 terabytes of data? Some crazy observations and actions
Linear regression on 1 terabytes of data? Some crazy observations and actionsLinear regression on 1 terabytes of data? Some crazy observations and actions
Linear regression on 1 terabytes of data? Some crazy observations and actions
 
Varshneya samdarshi lmu_symposium_2016
Varshneya samdarshi lmu_symposium_2016Varshneya samdarshi lmu_symposium_2016
Varshneya samdarshi lmu_symposium_2016
 
G. Park, J.-Y. Yang, et. al., NeurIPS 2020, MLILAB, KAIST AI
G. Park, J.-Y. Yang, et. al., NeurIPS 2020, MLILAB, KAIST AIG. Park, J.-Y. Yang, et. al., NeurIPS 2020, MLILAB, KAIST AI
G. Park, J.-Y. Yang, et. al., NeurIPS 2020, MLILAB, KAIST AI
 
J. Park, AAAI 2022, MLILAB, KAIST AI
J. Park, AAAI 2022, MLILAB, KAIST AIJ. Park, AAAI 2022, MLILAB, KAIST AI
J. Park, AAAI 2022, MLILAB, KAIST AI
 
How to digitize penstocks leading to powerhouse of a hydropower plant from th...
How to digitize penstocks leading to powerhouse of a hydropower plant from th...How to digitize penstocks leading to powerhouse of a hydropower plant from th...
How to digitize penstocks leading to powerhouse of a hydropower plant from th...
 
Cross domain sentiment classification via spectral feature alignment
Cross domain sentiment classification via spectral feature alignmentCross domain sentiment classification via spectral feature alignment
Cross domain sentiment classification via spectral feature alignment
 
Southwick britain gr_nsight_cmsi402-presentation_20140508
Southwick britain gr_nsight_cmsi402-presentation_20140508Southwick britain gr_nsight_cmsi402-presentation_20140508
Southwick britain gr_nsight_cmsi402-presentation_20140508
 
Machine Learning - Matt Moloney
Machine Learning - Matt MoloneyMachine Learning - Matt Moloney
Machine Learning - Matt Moloney
 
Probabilistic data structures
Probabilistic data structuresProbabilistic data structures
Probabilistic data structures
 
House price prediction
House price predictionHouse price prediction
House price prediction
 
Lalal
LalalLalal
Lalal
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
 
Efficiently Maintaining Distributed Model-Based Views on Real-Time Data Streams
Efficiently Maintaining Distributed Model-Based Views on Real-Time Data StreamsEfficiently Maintaining Distributed Model-Based Views on Real-Time Data Streams
Efficiently Maintaining Distributed Model-Based Views on Real-Time Data Streams
 
user_defined_functions_forinterpolation
user_defined_functions_forinterpolationuser_defined_functions_forinterpolation
user_defined_functions_forinterpolation
 

Ähnlich wie Scalable Similarity-Based Neighborhood Methods with MapReduce

IEEE.BigData.Tutorial.2.slides
IEEE.BigData.Tutorial.2.slidesIEEE.BigData.Tutorial.2.slides
IEEE.BigData.Tutorial.2.slidesNish Parikh
 
Large scale Click-streaming and tranaction log mining
Large scale Click-streaming and tranaction log miningLarge scale Click-streaming and tranaction log mining
Large scale Click-streaming and tranaction log miningitstuff
 
IOTA 2016 Social Recomender System Presentation.
IOTA 2016 Social Recomender System Presentation.IOTA 2016 Social Recomender System Presentation.
IOTA 2016 Social Recomender System Presentation.ASHISH JAGTAP
 
PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...Feng Li
 
Overview of DuraMat software tool development
Overview of DuraMat software tool developmentOverview of DuraMat software tool development
Overview of DuraMat software tool developmentAnubhav Jain
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-Systeminside-BigData.com
 
Mohan C R CV
Mohan C R CVMohan C R CV
Mohan C R CVMOHAN C R
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2Mohit Garg
 
Data Parallel and Object Oriented Model
Data Parallel and Object Oriented ModelData Parallel and Object Oriented Model
Data Parallel and Object Oriented ModelNikhil Sharma
 
LDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC council
 
House price prediction
House price predictionHouse price prediction
House price predictionSabahBegum
 
Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine L...
Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine L...Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine L...
Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine L...Soheila Dehghanzadeh
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 

Ähnlich wie Scalable Similarity-Based Neighborhood Methods with MapReduce (20)

IEEE.BigData.Tutorial.2.slides
IEEE.BigData.Tutorial.2.slidesIEEE.BigData.Tutorial.2.slides
IEEE.BigData.Tutorial.2.slides
 
Large scale Click-streaming and tranaction log mining
Large scale Click-streaming and tranaction log miningLarge scale Click-streaming and tranaction log mining
Large scale Click-streaming and tranaction log mining
 
IOTA 2016 Social Recomender System Presentation.
IOTA 2016 Social Recomender System Presentation.IOTA 2016 Social Recomender System Presentation.
IOTA 2016 Social Recomender System Presentation.
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...
 
Overview of DuraMat software tool development
Overview of DuraMat software tool developmentOverview of DuraMat software tool development
Overview of DuraMat software tool development
 
The Genopolis Microarray database
The Genopolis Microarray databaseThe Genopolis Microarray database
The Genopolis Microarray database
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-System
 
Mohan C R CV
Mohan C R CVMohan C R CV
Mohan C R CV
 
CINET: A CyberInfrastructure for Network Science
CINET: A CyberInfrastructure for Network ScienceCINET: A CyberInfrastructure for Network Science
CINET: A CyberInfrastructure for Network Science
 
Data visualization
Data visualizationData visualization
Data visualization
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2
 
Data Parallel and Object Oriented Model
Data Parallel and Object Oriented ModelData Parallel and Object Oriented Model
Data Parallel and Object Oriented Model
 
LDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status update
 
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
 
Project Matsu
Project MatsuProject Matsu
Project Matsu
 
House price prediction
House price predictionHouse price prediction
House price prediction
 
Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine L...
Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine L...Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine L...
Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine L...
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 

Mehr von sscdotopen

Co-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and SparkCo-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and Sparksscdotopen
 
Bringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to MahoutBringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to Mahoutsscdotopen
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenderssscdotopen
 
New Directions in Mahout's Recommenders
New Directions in Mahout's RecommendersNew Directions in Mahout's Recommenders
New Directions in Mahout's Recommenderssscdotopen
 
Introduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache MahoutIntroduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache Mahoutsscdotopen
 
Latent factor models for Collaborative Filtering
Latent factor models for Collaborative FilteringLatent factor models for Collaborative Filtering
Latent factor models for Collaborative Filteringsscdotopen
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraphsscdotopen
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processingsscdotopen
 

Mehr von sscdotopen (9)

Co-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and SparkCo-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and Spark
 
Bringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to MahoutBringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to Mahout
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenders
 
New Directions in Mahout's Recommenders
New Directions in Mahout's RecommendersNew Directions in Mahout's Recommenders
New Directions in Mahout's Recommenders
 
Introduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache MahoutIntroduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache Mahout
 
Latent factor models for Collaborative Filtering
Latent factor models for Collaborative FilteringLatent factor models for Collaborative Filtering
Latent factor models for Collaborative Filtering
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
 
mahout-cf
mahout-cfmahout-cf
mahout-cf
 

Kürzlich hochgeladen

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 

Kürzlich hochgeladen (20)

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison

Scalable Similarity-Based Neighborhood Methods with MapReduce

  • 1. Scalable Similarity-Based Neighborhood Methods with MapReduce
       6th ACM Conference on Recommender Systems, Dublin, 2012
       Sebastian Schelter, Christoph Boden, Volker Markl
       Database Systems and Information Management Group, Technische Universität Berlin
  • 2. motivation
       • with rapid growth in data sizes, the processing efficiency, scalability and fault tolerance of recommender systems in production become a major concern
       • run data-intensive computations in parallel on a large number of commodity machines → need to rephrase algorithms
       • our work: rephrase and scale out the similarity-based neighborhood methods on MapReduce
       • the proposed solution forms the core of the distributed recommender of Apache Mahout
  • 3. MapReduce
       • popular paradigm for data-intensive parallel processing
         – data is partitioned across the cluster in a distributed file system
         – computation is moved to the data
         – fixed processing pipeline where the user specifies two functions
         – the system handles distribution, execution, scheduling, failures etc.
  • 4. cooccurrences
       • start with a simplified view: a binary |U|x|I| matrix A holds the interactions between users U and items I
       • neighborhood methods (user-based and item-based) share the same fundamental computational model
       • we focus on the item-based approach; its scale-out reduces to finding an efficient way to compute the item similarity matrix
  • 5. parallelizing S = AᵀA
       • the standard approach of computing item cooccurrences requires random access to both users and items, so it is not efficiently parallelizable on partitioned data:
           foreach item i
             foreach user u who interacted with i
               foreach item j that u also interacted with
                 S_ij = S_ij + 1
       • the row outer product formulation of matrix multiplication is efficiently parallelizable on a row-partitioned A
       • each map invocation computes the outer product of a row of A with itself and emits the resulting matrix row-wise
       • reducers sum these up to form S
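The row outer product formulation on slide 5 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the Mahout implementation: the function names `map_row`, `reduce_rows` and `cooccurrences` are hypothetical, each user row is assumed to be a list of distinct item ids (a binary row of A), and the shuffle phase is simulated in memory.

```python
from collections import defaultdict


def map_row(items):
    """Map step: outer product of one binary user row with itself.

    Emits the result row-wise, i.e. one (item, partial row of S) pair
    per item the user interacted with.
    """
    for i in items:
        yield i, {j: 1 for j in items}


def reduce_rows(emitted):
    """Reduce step: sum all partial rows that share the same item key."""
    S = defaultdict(lambda: defaultdict(int))
    for i, partial_row in emitted:
        for j, count in partial_row.items():
            S[i][j] += count
    return {i: dict(row) for i, row in S.items()}


def cooccurrences(rows):
    """Run map over every user row, then reduce (in-memory stand-in for MapReduce)."""
    emitted = (pair for items in rows for pair in map_row(items))
    return reduce_rows(emitted)


# Hypothetical toy data: each list holds the items one user interacted with.
rows = [
    ["a", "b"],
    ["a", "b", "c"],
    ["b", "c"],
]
S = cooccurrences(rows)
print(S["a"]["b"])  # number of users who interacted with both "a" and "b"
```

Each call to `map_row` only touches a single row of A, which is why the formulation works on a row-partitioned matrix without random access to other users.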
  • 6. parallel similarity computation
       • real datasets are not binary; they contain either explicit feedback (ratings) or implicit feedback (clicks, pageviews)
       • the algorithm computes dot products, but these are not enough: we want to use a variety of similarity measures (cosine, Jaccard coefficient, Pearson correlation, ...)
       • express similarity measures by 3 canonical functions, which can be efficiently embedded into our algorithm
         – preprocess adjusts an item rating vector
         – norm computes a single number from the adjusted item rating vector
         – similarity computes the similarity of two vectors from their norms and their dot product
  • 7. example: Jaccard coefficient
       • preprocess binarizes the rating vectors
       • norm computes the number of users that rated each item
       • similarity finally computes the Jaccard coefficient from the norms and the dot product of the vectors
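The three canonical functions from slides 6 and 7 can be sketched for the Jaccard coefficient as follows. This is a minimal sketch, assuming item rating vectors are stored as dicts from user id to rating; the data layout, the helper `dot` and the sample vectors are illustrative assumptions. The formula used in `similarity` is the standard Jaccard identity, intersection / (|x| + |y| - intersection).

```python
def preprocess(ratings):
    """Binarize an item rating vector: every observed rating becomes 1."""
    return {user: 1 for user in ratings}


def norm(vector):
    """Number of users that rated the item (sum of the binarized vector)."""
    return sum(vector.values())


def dot(a, b):
    """Sparse dot product; on binarized vectors this counts shared users."""
    return sum(value * b[user] for user, value in a.items() if user in b)


def similarity(dot_product, norm_a, norm_b):
    """Jaccard coefficient computed only from the two norms and the dot product."""
    union = norm_a + norm_b - dot_product
    return dot_product / union if union else 0.0


# Hypothetical rating vectors for two items, keyed by user id.
item_x = {"u1": 5, "u2": 3, "u3": 1}
item_y = {"u2": 4, "u3": 2, "u4": 5}

x, y = preprocess(item_x), preprocess(item_y)
jaccard = similarity(dot(x, y), norm(x), norm(y))
print(jaccard)
```

Because `similarity` needs only the two norms and the dot product, other measures could in principle be plugged in by swapping the three functions while the cooccurrence computation stays unchanged.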
  • 8. cost of the algorithm
       • determined by the amount of data that has to be sent over the network in the matrix multiplication step
       • for each user, we have to process the square of the number of their interactions → cost is dominated by the densest rows of A
       • the distribution of interactions per user is usually heavy-tailed → a small number of power users with a disproportionately high number of interactions drastically increases the runtime
       • apply 'interaction-cut'
         – if a user has more than p interactions, only use a random sample of size p of their interactions
         – saw negligible effect on prediction quality for moderately sized p
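The interaction-cut and its effect on the quadratic cost can be illustrated with a short sketch. The names `interaction_cut` and `cost`, the seed and the toy data are assumptions made for this example; the cost model is simply the sum of squared row lengths described on the slide.

```python
import random


def interaction_cut(items, p, rng):
    """Keep all interactions of a user if there are at most p, else a random sample of size p."""
    items = list(items)
    if len(items) <= p:
        return items
    return rng.sample(items, p)


def cost(rows):
    """Cost proxy from the slide: sum over users of the squared number of interactions."""
    return sum(len(items) ** 2 for items in rows)


# Hypothetical data: one power user with 1000 interactions, three users with 10 each.
rows = [list(range(1000))] + [list(range(10)) for _ in range(3)]

rng = random.Random(42)  # fixed seed only to make the sketch repeatable
p = 100
cut_rows = [interaction_cut(items, p, rng) for items in rows]

print(cost(rows), cost(cut_rows))
```

In this toy setting the single power user accounts for almost all of the uncut cost, which is the heavy-tail effect the slide describes.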
  • 9. scalability experiments
       • cluster: Apache Hadoop on 6 machines (two 8-core Opteron CPUs, 32 GB memory and four 1 TB drives each)
       • dataset: R2 - Yahoo! Music (717M ratings, 1.8M users, 136k songs)
       • similarity computation with differently sized interaction-cuts; prediction quality measured on 18M held-out ratings
  • 10. scalability experiments
       • ran several experiments in Amazon's EC2 cloud using up to 20 m1.xlarge instances (15 GB RAM, 8 virtual cores each)
         → linear speedup with the number of machines
         → linear scalability with a growing user base
  • 11. thank you. Questions?
       Sebastian Schelter, Christoph Boden, Volker Markl
       Database Systems and Information Management Group (DIMA), TU Berlin
       mail: ssc@apache.org
       twitter: @sscdotopen
       code to reproduce our experiments is available at: https://github.com/dima-tuberlin/publications-ssnmm
       The research leading to these results has received funding from the European Union (EU) in the course of the project 'ROBUST' (EU grant no. 257859) and used data provided by 'Yahoo Academic Relations'.