Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Essentials of Mahout
Mastering Hadoop Map-reduce for Data Analysis


Shashank Tiwari
blog: shanky.org | twitter: @tshanky
st@treasuryofideas.com

What is Apache Mahout?

• A scalable machine learning infrastructure


• Built on top of Hadoop MapReduce


• Currently supports:


   • Clustering, classification, collaborative filtering, and more

A Little History

• Founded by folks active in the Lucene community


• Inspired by work at Stanford: “Map-Reduce for Machine Learning on
  Multicore” -- http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf

Project Goal

• Create a community-driven, scalable, and robust machine learning
  infrastructure


• Leverage Hadoop for parallel processing and scalability


• Provide an abstraction on top of Hadoop so that machine-learning users do
  not have to deal with the map and reduce primitives when they build their
  solutions.

Supported Algorithms

 • Collaborative Filtering


 • User- and item-based recommenders


 • K-Means, Fuzzy K-Means clustering


 • Mean Shift clustering


 • Dirichlet process clustering

More Supported Algorithms

 • Latent Dirichlet Allocation


 • Singular value decomposition


 • Parallel Frequent Pattern mining


 • Complementary Naive Bayes classifier


 • Random forest (decision-tree-based) classifier


 • ...and growing

Focus Areas

 • Collaborative Filtering


 • Clustering


 • Classification

Build and Install

• Required Software:


   • Java 1.6.x


   • Maven 2.0.11+


• Get source: svn co http://svn.apache.org/repos/asf/mahout/trunk mahout


• Compile and install core and examples: mvn install


   • Alternatively, run the steps individually: mvn compile, mvn package,
     and then mvn install

Recommendation Examples

 • mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner" -Dexec.args="-i /Users/tshanky/workspace/hadoop_workspace/grouplens/ratings.dat"


 • https://cwiki.apache.org/confluence/display/MAHOUT/RecommendationExamples

Common Use Cases

 • Shopping: Amazon, Netflix


 • Who to follow/friend: Twitter/Facebook


 • Web resource classification, spam filtering, pattern recognition in
   financial markets

Collaborative Filtering Basis

  • User-based: recommend items by finding similar users. User preferences
    change over time, which makes this method harder to keep accurate.


  • Item-based: calculate the similarity between items and recommend items
    similar to those a user already likes. Items usually change little, so
    this method is often reliable.


  • Slope-one: a fast, efficient item-based method for ratings that are
    numeric rather than boolean (yes/no, like/dislike).


  • Model-based: make recommendations from a model built from users and
    their ratings.
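
To make the slope-one idea concrete, here is a minimal plain-Java sketch of a weighted slope-one prediction. It does not use Mahout; the class name SlopeOneSketch, the method predict, and the sample ratings are all illustrative assumptions. For each item the user has already rated, it adds the average rating difference between the target item and that item, and weights the result by how many users rated both.

```java
import java.util.HashMap;
import java.util.Map;

// Toy weighted slope-one predictor (not Mahout code; names are illustrative).
class SlopeOneSketch {

    // ratings: userId -> (itemId -> rating). Assumes the user exists in the map.
    static double predict(Map<String, Map<String, Double>> ratings, String user, String target) {
        Map<String, Double> mine = ratings.get(user);
        double numerator = 0.0;
        int totalWeight = 0;
        for (Map.Entry<String, Double> rated : mine.entrySet()) {
            String item = rated.getKey();
            if (item.equals(target)) {
                continue;
            }
            double diffSum = 0.0;
            int count = 0;
            // Average difference (target - item) over users who rated both items.
            for (Map<String, Double> other : ratings.values()) {
                if (other.containsKey(target) && other.containsKey(item)) {
                    diffSum += other.get(target) - other.get(item);
                    count++;
                }
            }
            if (count > 0) {
                // Weight each item's estimate by the number of co-ratings behind it.
                numerator += (rated.getValue() + diffSum / count) * count;
                totalWeight += count;
            }
        }
        return totalWeight == 0 ? Double.NaN : numerator / totalWeight;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> ratings = new HashMap<>();
        ratings.put("alice", Map.of("A", 5.0, "B", 3.0, "C", 2.0));
        ratings.put("bob", Map.of("A", 3.0, "B", 4.0));
        ratings.put("carol", Map.of("B", 2.0, "C", 5.0));
        // Estimated rating of item C for bob, who has not rated it.
        System.out.println(predict(ratings, "bob", "C"));
    }
}
```

The method returns NaN when no other user has co-rated the target with any of the user's items, which a caller would treat as "no estimate available".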

Clustering Basis

 • Clustering algorithms also use the notion of similarity to group similar
   items into a cluster.


 • Both collaborative filtering and clustering rely on a notion of distance,
   which can be calculated using a number of different techniques.


    • Example: Euclidean distance
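
As a small illustration of the distance idea, the sketch below computes the Euclidean distance between two preference vectors in plain Java. It is not Mahout code; the class name EuclideanSketch and the 1 / (1 + distance) conversion to a similarity score are assumptions chosen for illustration, and other conversions are possible.

```java
// Euclidean distance between two preference vectors, plus one common way
// to turn a distance into a similarity in (0, 1]. Not Mahout code.
class EuclideanSketch {

    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sumOfSquares += d * d;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Identical vectors give 1.0; the score shrinks toward 0 as distance grows.
    static double similarity(double[] a, double[] b) {
        return 1.0 / (1.0 + distance(a, b));
    }

    public static void main(String[] args) {
        double[] u = {1.0, 2.0, 3.0};
        double[] v = {4.0, 6.0, 3.0};
        // Differences are (3, 4, 0), so the distance is sqrt(9 + 16) = 5.0.
        System.out.println(distance(u, v));
        System.out.println(similarity(u, v));
    }
}
```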

Mahout Taste Framework

• Taste Collaborative Filtering:


   • Taste began as an open source collaborative filtering (CF) project
     started by Sean Owen on SourceForge; it was donated to Mahout in 2008.


   • It has been applied successfully to a number of different data sets.


• Mahout's support for building recommendation engines is based primarily on
  the Taste library.


   • The library supports both user-based and item-based recommendations.


• Can be used directly from Java or through RESTful web-service endpoints.

Taste Framework : Primary Classes

 • DataModel: Model for Users, Items, and Preferences


 • UserSimilarity: Interface defining the similarity between two users


 • ItemSimilarity: Interface defining the similarity between two items


 • Recommender: Interface for providing recommendations


 • UserNeighborhood: Interface for computing a neighborhood of similar
   users. Recommenders use these neighborhoods.
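
How these roles fit together can be sketched in plain Java. The code below is a toy stand-in, not Mahout's API: the real Taste interfaces have different and richer signatures, and the class name TasteRolesSketch, the method names, and the sample preferences are all assumptions. A nested map plays the DataModel role, userSimilarity the UserSimilarity role, neighborhood the UserNeighborhood role, and estimatePreference the Recommender role.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy stand-ins for the roles named above (NOT Mahout's interfaces).
class TasteRolesSketch {

    // "UserSimilarity" role: 1 / (1 + Euclidean distance over co-rated items).
    static double userSimilarity(Map<Long, Map<Long, Double>> model, long u, long v) {
        double sum = 0.0;
        int shared = 0;
        for (Map.Entry<Long, Double> e : model.get(u).entrySet()) {
            Double other = model.get(v).get(e.getKey());
            if (other != null) {
                double d = e.getValue() - other;
                sum += d * d;
                shared++;
            }
        }
        return shared == 0 ? 0.0 : 1.0 / (1.0 + Math.sqrt(sum));
    }

    // "UserNeighborhood" role: the n users most similar to u.
    static List<Long> neighborhood(Map<Long, Map<Long, Double>> model, long u, int n) {
        List<Long> others = new ArrayList<>(model.keySet());
        others.remove(Long.valueOf(u));
        others.sort(Comparator.comparingDouble((Long v) -> -userSimilarity(model, u, v)));
        return others.subList(0, Math.min(n, others.size()));
    }

    // "Recommender" role: similarity-weighted average of the neighbors' preferences.
    static double estimatePreference(Map<Long, Map<Long, Double>> model, long u, long item, int n) {
        double weighted = 0.0;
        double totalSimilarity = 0.0;
        for (long v : neighborhood(model, u, n)) {
            Double pref = model.get(v).get(item);
            if (pref != null) {
                double s = userSimilarity(model, u, v);
                weighted += s * pref;
                totalSimilarity += s;
            }
        }
        return totalSimilarity == 0.0 ? Double.NaN : weighted / totalSimilarity;
    }

    public static void main(String[] args) {
        // "DataModel" role: userId -> (itemId -> preference).
        Map<Long, Map<Long, Double>> model = new TreeMap<>();
        model.put(1L, Map.of(10L, 5.0, 11L, 3.0));
        model.put(2L, Map.of(10L, 5.0, 11L, 3.0, 12L, 4.0));
        model.put(3L, Map.of(10L, 1.0, 11L, 5.0, 12L, 2.0));
        // User 2 agrees with user 1 on every shared item, so it is the nearest neighbor.
        System.out.println(neighborhood(model, 1L, 1));
        System.out.println(estimatePreference(model, 1L, 12L, 1));
    }
}
```

Keeping similarity, neighborhood, and recommendation as separate steps mirrors the separation in the class list above: each piece can be swapped (for example, a different similarity measure) without touching the others.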

Taste Framework : Online vs Offline

 • Can serve recommendations online for small data sets (a few thousand
   entries).


 • Leverages Hadoop for offline recommendation calculations on large data
   sets.

Understanding the Group Lens Implementation

• Provides insight into a sample Mahout Taste framework implementation.


• Uses the publicly available GroupLens data set.


• Part of the distribution, so you can analyze it, modify it, and use it as
  inspiration for your own implementation.


• An easy-to-follow example.

Group Lens Implementation Source

• GroupLensDataModel.java


• GroupLensRecommender.java


• GroupLensRecommenderBuilder.java


• GroupLensRecommenderEvaluatorRunner.java

Group Lens Runner -- evaluator

• Instantiates an evaluator:


   • RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();


   • Scores a recommender by mean absolute error: the average absolute
     difference between predicted and actual ratings (lower is better)


• Parses input parameters:


   • File ratingsFile = TasteOptionParser.getRatings(args);
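
The metric behind this evaluator, the average absolute difference between predicted and actual ratings, is simple to compute by hand. The following plain-Java sketch shows the arithmetic only; it is not Mahout's implementation, and the class name MeanAbsoluteErrorSketch and the sample ratings are assumptions.

```java
// Mean absolute error between predicted and actual ratings. Not Mahout code.
class MeanAbsoluteErrorSketch {

    static double meanAbsoluteError(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        double total = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            total += Math.abs(predicted[i] - actual[i]);
        }
        return total / predicted.length;
    }

    public static void main(String[] args) {
        double[] predicted = {4.0, 3.5, 2.0};
        double[] actual = {5.0, 3.0, 2.0};
        // |4 - 5| + |3.5 - 3| + |2 - 2| = 1.5, averaged over 3 ratings = 0.5
        System.out.println(meanAbsoluteError(predicted, actual));
    }
}
```

A score of 0.0 would mean every prediction matched the held-out rating exactly; on a 1-to-5 rating scale, a score of 0.5 means predictions are off by half a star on average.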

Group Lens Runner -- data model

 • Parses a colon-delimited ratings file:


    • DataModel model = ratingsFile == null ? new GroupLensDataModel() : new GroupLensDataModel(ratingsFile);
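
For a sense of what parsing such a file involves, here is a minimal plain-Java sketch that reads one line. It is not the GroupLensDataModel source; the class name RatingLineSketch, the "::" delimiter, and the field order userID::itemID::rating::timestamp are assumptions made for illustration, so check them against the actual ratings.dat you download.

```java
// Parses one line of a "::"-delimited ratings file. The field layout
// userID::itemID::rating::timestamp is an assumption for illustration.
class RatingLineSketch {

    static final class Rating {
        final long userId;
        final long itemId;
        final float value;

        Rating(long userId, long itemId, float value) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
        }
    }

    static Rating parse(String line) {
        String[] fields = line.trim().split("::");
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected at least 3 fields: " + line);
        }
        // Any trailing fields (such as a timestamp) are ignored in this sketch.
        return new Rating(
                Long.parseLong(fields[0]),
                Long.parseLong(fields[1]),
                Float.parseFloat(fields[2]));
    }

    public static void main(String[] args) {
        Rating r = parse("1::1193::5::978300760");
        System.out.println(r.userId + " rated " + r.itemId + " as " + r.value);
    }
}
```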

Group Lens Runner -- evaluate with recommendation builder

• Evaluates using GroupLensRecommender:


  • double evaluation = evaluator.evaluate(new GroupLensRecommenderBuilder(), null, model, 0.9, 0.3);
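
In this call, the last two arguments of evaluate are the training percentage (0.9) and the evaluation percentage (0.3): only part of the users take part in the evaluation, and for those users most preferences train the recommender while the rest are held out for scoring. The plain-Java sketch below shows one plausible way to realize such a split. It is not Mahout's code; the class name HoldOutSplitSketch, the string-encoded preferences, and the exact sampling rule are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hold-out split driven by two fractions (illustrative, not Mahout code).
class HoldOutSplitSketch {

    // Returns a two-element list: index 0 is the training set, index 1 the test set.
    // Users not sampled by evaluationFraction appear in neither set.
    static List<List<String>> split(List<List<String>> prefsByUser,
                                    double trainingFraction,
                                    double evaluationFraction,
                                    Random random) {
        List<String> training = new ArrayList<>();
        List<String> test = new ArrayList<>();
        for (List<String> userPrefs : prefsByUser) {
            if (random.nextDouble() >= evaluationFraction) {
                continue; // this user is left out of the evaluation
            }
            for (String pref : userPrefs) {
                if (random.nextDouble() < trainingFraction) {
                    training.add(pref);
                } else {
                    test.add(pref);
                }
            }
        }
        return List.of(training, test);
    }

    public static void main(String[] args) {
        List<List<String>> prefsByUser = new ArrayList<>();
        for (int user = 0; user < 100; user++) {
            List<String> prefs = new ArrayList<>();
            for (int item = 0; item < 10; item++) {
                prefs.add("user" + user + ":item" + item);
            }
            prefsByUser.add(prefs);
        }
        List<List<String>> parts = split(prefsByUser, 0.9, 0.3, new Random(42));
        System.out.println("training=" + parts.get(0).size() + " test=" + parts.get(1).size());
    }
}
```

Sampling only a fraction of users keeps evaluation runs short on large data sets, at the cost of a noisier score.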

Questions?




• blog: shanky.org | twitter: @tshanky


• st@treasuryofideas.com

More Related Content

Similar to SDEC2011 Essentials of Mahout

Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)Cataldo Musto
 
Neev Open Source Contributions
Neev Open Source ContributionsNeev Open Source Contributions
Neev Open Source ContributionsNeev Technologies
 
Buidling large scale recommendation engine
Buidling large scale recommendation engineBuidling large scale recommendation engine
Buidling large scale recommendation engineKeeyong Han
 
SDEC2011 Essentials of Hive
SDEC2011 Essentials of HiveSDEC2011 Essentials of Hive
SDEC2011 Essentials of HiveKorea Sdec
 
Recommendation engines : Matching items to users
Recommendation engines : Matching items to usersRecommendation engines : Matching items to users
Recommendation engines : Matching items to usersjobinwilson
 
Recommendation engines matching items to users
Recommendation engines matching items to usersRecommendation engines matching items to users
Recommendation engines matching items to usersFlytxt
 
Building Large Sustainable Apps
Building Large Sustainable AppsBuilding Large Sustainable Apps
Building Large Sustainable AppsBuğra Oral
 
Docs as Part of the Product - Open Source Summit North America 2018
Docs as Part of the Product - Open Source Summit North America 2018Docs as Part of the Product - Open Source Summit North America 2018
Docs as Part of the Product - Open Source Summit North America 2018Den Delimarsky
 
Automated perf optimization - jQuery Conference
Automated perf optimization - jQuery ConferenceAutomated perf optimization - jQuery Conference
Automated perf optimization - jQuery ConferenceMatthew Lancaster
 
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science LabScalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science LabSri Ambati
 
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedInDataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedInHakka Labs
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Lucidworks
 
Part of the DLM story: Get your Database under Source Control - SQL In The City
Part of the DLM story: Get your Database under Source Control - SQL In The City Part of the DLM story: Get your Database under Source Control - SQL In The City
Part of the DLM story: Get your Database under Source Control - SQL In The City Red Gate Software
 
Presentation 1 Web--dev
Presentation 1 Web--devPresentation 1 Web--dev
Presentation 1 Web--devaltsav
 
Collaborative Filtering and Recommender Systems By Navisro Analytics
Collaborative Filtering and Recommender Systems By Navisro AnalyticsCollaborative Filtering and Recommender Systems By Navisro Analytics
Collaborative Filtering and Recommender Systems By Navisro AnalyticsNavisro Analytics
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...S. Diana Hu
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...Joaquin Delgado PhD.
 
LF_APIStrat17_Don't Repeat Yourself - Your API is Your Documentation
LF_APIStrat17_Don't Repeat Yourself - Your API is Your DocumentationLF_APIStrat17_Don't Repeat Yourself - Your API is Your Documentation
LF_APIStrat17_Don't Repeat Yourself - Your API is Your DocumentationLF_APIStrat
 

Similar to SDEC2011 Essentials of Mahout (20)

Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)
 
Neev Open Source Contributions
Neev Open Source ContributionsNeev Open Source Contributions
Neev Open Source Contributions
 
Buidling large scale recommendation engine
Buidling large scale recommendation engineBuidling large scale recommendation engine
Buidling large scale recommendation engine
 
SDEC2011 Essentials of Hive
SDEC2011 Essentials of HiveSDEC2011 Essentials of Hive
SDEC2011 Essentials of Hive
 
Recommendation engines : Matching items to users
Recommendation engines : Matching items to usersRecommendation engines : Matching items to users
Recommendation engines : Matching items to users
 
Recommendation engines matching items to users
Recommendation engines matching items to usersRecommendation engines matching items to users
Recommendation engines matching items to users
 
Building Large Sustainable Apps
Building Large Sustainable AppsBuilding Large Sustainable Apps
Building Large Sustainable Apps
 
Docs as Part of the Product - Open Source Summit North America 2018
Docs as Part of the Product - Open Source Summit North America 2018Docs as Part of the Product - Open Source Summit North America 2018
Docs as Part of the Product - Open Source Summit North America 2018
 
Automated perf optimization - jQuery Conference
Automated perf optimization - jQuery ConferenceAutomated perf optimization - jQuery Conference
Automated perf optimization - jQuery Conference
 
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science LabScalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
 
MahoutNew
MahoutNewMahoutNew
MahoutNew
 
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedInDataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
 
Part of the DLM story: Get your Database under Source Control - SQL In The City
Part of the DLM story: Get your Database under Source Control - SQL In The City Part of the DLM story: Get your Database under Source Control - SQL In The City
Part of the DLM story: Get your Database under Source Control - SQL In The City
 
Presentation 1 Web--dev
Presentation 1 Web--devPresentation 1 Web--dev
Presentation 1 Web--dev
 
Collaborative Filtering and Recommender Systems By Navisro Analytics
Collaborative Filtering and Recommender Systems By Navisro AnalyticsCollaborative Filtering and Recommender Systems By Navisro Analytics
Collaborative Filtering and Recommender Systems By Navisro Analytics
 
Case study
Case studyCase study
Case study
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
 
LF_APIStrat17_Don't Repeat Yourself - Your API is Your Documentation
LF_APIStrat17_Don't Repeat Yourself - Your API is Your DocumentationLF_APIStrat17_Don't Repeat Yourself - Your API is Your Documentation
LF_APIStrat17_Don't Repeat Yourself - Your API is Your Documentation
 

More from Korea Sdec

SDEC2011 Big engineer vs small entreprenuer
SDEC2011 Big engineer vs small entreprenuerSDEC2011 Big engineer vs small entreprenuer
SDEC2011 Big engineer vs small entreprenuerKorea Sdec
 
SDEC2011 Implementing me2day friend suggestion
SDEC2011 Implementing me2day friend suggestionSDEC2011 Implementing me2day friend suggestion
SDEC2011 Implementing me2day friend suggestionKorea Sdec
 
SDEC2011 Introducing Hadoop
SDEC2011 Introducing HadoopSDEC2011 Introducing Hadoop
SDEC2011 Introducing HadoopKorea Sdec
 
Sdec2011 shashank-introducing hadoop
Sdec2011 shashank-introducing hadoopSdec2011 shashank-introducing hadoop
Sdec2011 shashank-introducing hadoopKorea Sdec
 
SDEC2011 NoSQL Data modelling
SDEC2011 NoSQL Data modellingSDEC2011 NoSQL Data modelling
SDEC2011 NoSQL Data modellingKorea Sdec
 
SDEC2011 NoSQL concepts and models
SDEC2011 NoSQL concepts and modelsSDEC2011 NoSQL concepts and models
SDEC2011 NoSQL concepts and modelsKorea Sdec
 
Sdec2011 Introducing Hadoop
Sdec2011 Introducing HadoopSdec2011 Introducing Hadoop
Sdec2011 Introducing HadoopKorea Sdec
 
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and HiveSDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and HiveKorea Sdec
 
SDEC2011 Rapidant
SDEC2011 RapidantSDEC2011 Rapidant
SDEC2011 RapidantKorea Sdec
 
SDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whySDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whyKorea Sdec
 
SDEC2011 Going by TACC
SDEC2011 Going by TACCSDEC2011 Going by TACC
SDEC2011 Going by TACCKorea Sdec
 
SDEC2011 Glory-FS development & Experiences
SDEC2011 Glory-FS development & ExperiencesSDEC2011 Glory-FS development & Experiences
SDEC2011 Glory-FS development & ExperiencesKorea Sdec
 
SDEC2011 Using Couchbase for social game scaling and speed
SDEC2011 Using Couchbase for social game scaling and speedSDEC2011 Using Couchbase for social game scaling and speed
SDEC2011 Using Couchbase for social game scaling and speedKorea Sdec
 
SDEC2011 Arcus NHN memcached cloud
SDEC2011 Arcus NHN memcached cloudSDEC2011 Arcus NHN memcached cloud
SDEC2011 Arcus NHN memcached cloudKorea Sdec
 

More from Korea Sdec (14)

SDEC2011 Big engineer vs small entreprenuer
SDEC2011 Big engineer vs small entreprenuerSDEC2011 Big engineer vs small entreprenuer
SDEC2011 Big engineer vs small entreprenuer
 
SDEC2011 Implementing me2day friend suggestion
SDEC2011 Implementing me2day friend suggestionSDEC2011 Implementing me2day friend suggestion
SDEC2011 Implementing me2day friend suggestion
 
SDEC2011 Introducing Hadoop
SDEC2011 Introducing HadoopSDEC2011 Introducing Hadoop
SDEC2011 Introducing Hadoop
 
Sdec2011 shashank-introducing hadoop
Sdec2011 shashank-introducing hadoopSdec2011 shashank-introducing hadoop
Sdec2011 shashank-introducing hadoop
 
SDEC2011 NoSQL Data modelling
SDEC2011 NoSQL Data modellingSDEC2011 NoSQL Data modelling
SDEC2011 NoSQL Data modelling
 
SDEC2011 NoSQL concepts and models
SDEC2011 NoSQL concepts and modelsSDEC2011 NoSQL concepts and models
SDEC2011 NoSQL concepts and models
 
Sdec2011 Introducing Hadoop
Sdec2011 Introducing HadoopSdec2011 Introducing Hadoop
Sdec2011 Introducing Hadoop
 
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and HiveSDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
 
SDEC2011 Rapidant
SDEC2011 RapidantSDEC2011 Rapidant
SDEC2011 Rapidant
 
SDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whySDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the why
 
SDEC2011 Going by TACC
SDEC2011 Going by TACCSDEC2011 Going by TACC
SDEC2011 Going by TACC
 
SDEC2011 Glory-FS development & Experiences
SDEC2011 Glory-FS development & ExperiencesSDEC2011 Glory-FS development & Experiences
SDEC2011 Glory-FS development & Experiences
 
SDEC2011 Using Couchbase for social game scaling and speed
SDEC2011 Using Couchbase for social game scaling and speedSDEC2011 Using Couchbase for social game scaling and speed
SDEC2011 Using Couchbase for social game scaling and speed
 
SDEC2011 Arcus NHN memcached cloud
SDEC2011 Arcus NHN memcached cloudSDEC2011 Arcus NHN memcached cloud
SDEC2011 Arcus NHN memcached cloud
 

Recently uploaded

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Recently uploaded (20)

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
SDEC2011 Essentials of Mahout

  • 1. Essentials of Mahout
      Mastering Hadoop Map-reduce for Data Analysis
      Shashank Tiwari
      blog: shanky.org | twitter: @tshanky | st@treasuryofideas.com
      Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners.
  • 2. What is Apache Mahout?
      • A scalable machine learning infrastructure
      • Built on top of Hadoop MapReduce
      • Currently supports:
          • Clustering, classification, collaborative filtering, and more
  • 3. A Little History
      • Founded by folks active in the Lucene community
      • Inspired by work at Stanford: “Map-Reduce for Machine Learning on Multicore” -- http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf
  • 4. Project Goal
      • Create a community-driven, scalable, and robust machine learning infrastructure
      • Leverage Hadoop for parallel processing and scalability
      • Provide an abstraction on top of Hadoop so that machine-learning users need not deal with the map and reduce primitives when they build their solutions
  • 5. Supported Algorithms
      • Collaborative Filtering
      • User- and item-based recommenders
      • K-Means, Fuzzy K-Means clustering
      • Mean Shift clustering
      • Dirichlet process clustering
  • 6. More Supported Algorithms
      • Latent Dirichlet Allocation
      • Singular value decomposition
      • Parallel Frequent Pattern mining
      • Complementary Naive Bayes classifier
      • Random forest decision tree based classifier
      • ...and growing
  • 7. Focus Areas
      • Collaborative Filtering
      • Clustering
      • Classification
  • 8. Build and Install
      • Required software:
          • Java 1.6.x
          • Maven 2.0.11+
      • Get source: svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
      • Compile and install core and examples: mvn install
      • Alternatively, run mvn compile, mvn package, and mvn install individually
  • 9. Recommendation Examples
      • mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner" -Dexec.args="-i /Users/tshanky/workspace/hadoop_workspace/grouplens/ratings.dat"
      • https://cwiki.apache.org/confluence/display/MAHOUT/RecommendationExamples
  • 10. Common Use Cases
      • Shopping: Amazon, Netflix
      • Who to follow/friend: Twitter/Facebook
      • Web resource classification, spam filtering, and pattern recognition in financial markets
  • 11. Collaborative Filtering Basis
      • User-based: recommend items by finding similar users. User preferences keep changing, so this method poses challenges.
      • Item-based: calculate similarity between items and make recommendations. Items usually don’t change much, so this method is often reliable.
      • Slope-one: fast and efficient item-based recommendation when user ratings are more than boolean yes/no, like/dislike.
      • Model-based: provide recommendations by building a model of users and their ratings.
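To make the slope-one idea concrete, here is a minimal weighted Slope One sketch in plain Java. It is an illustration only, not Mahout's implementation: the class name `SlopeOneSketch`, the dense `double[][]` ratings matrix, and the use of `NaN` for "not rated" are all assumptions made for this example.

```java
/** Minimal weighted Slope One sketch; hypothetical class, not part of Mahout. */
final class SlopeOneSketch {

    /**
     * Predicts the rating of `user` for item `target`.
     * ratings[u][i] holds user u's rating of item i, or NaN if unrated (an assumption of this sketch).
     * Returns NaN when no item rated by the user is co-rated with the target by anyone.
     */
    static double predict(double[][] ratings, int user, int target) {
        double weightedSum = 0.0;
        int totalCount = 0;
        for (int other = 0; other < ratings[user].length; other++) {
            if (other == target || Double.isNaN(ratings[user][other])) {
                continue; // only items the user has actually rated contribute
            }
            // Average difference (target - other) over users who rated both items.
            double diffSum = 0.0;
            int count = 0;
            for (double[] row : ratings) {
                if (!Double.isNaN(row[target]) && !Double.isNaN(row[other])) {
                    diffSum += row[target] - row[other];
                    count++;
                }
            }
            if (count == 0) {
                continue; // nobody rated both items
            }
            // Weight each per-item prediction by the number of co-rating users.
            weightedSum += (diffSum / count + ratings[user][other]) * count;
            totalCount += count;
        }
        return totalCount == 0 ? Double.NaN : weightedSum / totalCount;
    }
}
```

The prediction uses only average rating differences between item pairs, which is why the method needs graded ratings rather than boolean like/dislike data.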
  • 12. Clustering Basis
      • Clustering algorithms also use the notion of similarity to group similar items into a cluster.
      • Both collaborative filtering and clustering use the notion of a distance, which could be calculated using a number of different techniques.
      • Example: Euclidean distance
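As an illustration of the distance notion, the following sketch computes the Euclidean distance between two items represented as equal-length feature vectors. The class and method names are hypothetical and chosen for this example; Mahout ships its own distance measures, which this sketch does not reproduce.

```java
/** Euclidean distance between two feature vectors; hypothetical helper for illustration. */
final class EuclideanDistanceSketch {

    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff; // accumulate squared per-dimension differences
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

A smaller distance means the two items are more alike, so a clustering algorithm can assign each item to the cluster whose center is nearest.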
  • 13. Mahout Taste Framework
      • Taste Collaborative Filtering:
          • Taste is an open source project for CF started by Sean Owen on SourceForge and donated to Mahout in 2008.
          • Has been applied to a number of different data sets successfully.
  • 14. Mahout Taste Framework
      • Taste Collaborative Filtering:
          • Taste is an open source project for CF started by Sean Owen on SourceForge and donated to Mahout in 2008.
          • Has been applied to a number of different data sets successfully.
      • Mahout builds recommendation engines primarily on top of the Taste library.
      • The library supports both user-based and item-based recommendations.
      • Can be used with Java or over RESTful web-service endpoints.
  • 15. Taste Framework : Primary Classes
      • DataModel: Model for Users, Items, and Preferences
      • UserSimilarity: Interface defining the similarity between two users
      • ItemSimilarity: Interface defining the similarity between two items
      • Recommender: Interface for providing recommendations
      • UserNeighborhood: Interface for computing a neighborhood of similar users. These are used by the Recommenders.
  • 16. Taste Framework : Online vs Offline
      • Can do online recommendations for a few thousand data sets.
      • Leverages Hadoop for offline recommendation calculations on large data sets.
  • 17. Understanding the Group Lens Implementation
      • Provides insight into a sample Mahout Taste Framework implementation.
      • Uses the publicly available GroupLens data set.
      • Part of the distribution, so you can analyze it, modify it, and use it as inspiration for your own implementation.
      • Easy-to-follow example.
  • 18. Group Lens Implementation Source
      • GroupLensDataModel.java
      • GroupLensRecommender.java
      • GroupLensRecommenderBuilder.java
      • GroupLensRecommenderEvaluatorRunner.java
  • 19. Group Lens Runner -- evaluator
      • Instantiates an evaluator:
          • RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
          • a “mean absolute error” algorithm: the average absolute difference between predicted and actual ratings
      • Parses input parameters:
          • File ratingsFile = TasteOptionParser.getRatings(args);
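The metric behind this evaluator can be sketched in a few lines of plain Java. This is not Mahout's code: the class `MeanAbsoluteErrorSketch` and its array-based signature are assumptions made for illustration, and the real evaluator also handles splitting the data and building the recommender.

```java
/** Average absolute difference between predicted and actual ratings; hypothetical helper. */
final class MeanAbsoluteErrorSketch {

    static double meanAbsoluteError(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double total = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            total += Math.abs(predicted[i] - actual[i]); // per-rating absolute error
        }
        return total / predicted.length;
    }
}
```

A lower score means the predicted ratings are closer to the held-out ratings; 0.0 would be a perfect match.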
  • 20. Group Lens Runner -- data model
      • Parses a colon-delimited ratings file:
          • DataModel model = ratingsFile == null ? new GroupLensDataModel() : new GroupLensDataModel(ratingsFile);
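To show what parsing such a file involves, here is a small sketch that reads one line in the layout `UserID::MovieID::Rating::Timestamp`. That layout is an assumption based on the GroupLens (MovieLens) ratings.dat format; the class and field names are hypothetical and this is not how GroupLensDataModel is implemented.

```java
/** Parses one "::"-delimited ratings line; hypothetical helper for illustration. */
final class RatingLineParserSketch {

    /** Simple value holder for one parsed rating. */
    static final class Rating {
        final long userId;
        final long itemId;
        final float value;

        Rating(long userId, long itemId, float value) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
        }
    }

    static Rating parse(String line) {
        // Assumed layout: UserID::MovieID::Rating::Timestamp (timestamp is ignored here).
        String[] fields = line.trim().split("::");
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected at least 3 '::'-separated fields: " + line);
        }
        return new Rating(
                Long.parseLong(fields[0]),
                Long.parseLong(fields[1]),
                Float.parseFloat(fields[2]));
    }
}
```

For example, `RatingLineParserSketch.parse("1::1193::5::978300760")` yields user 1, item 1193, and a rating of 5.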
  • 21. Group Lens Runner -- evaluate with recommendation builder
      • Evaluates using GroupLensRecommender:
          • double evaluation = evaluator.evaluate(new GroupLensRecommenderBuilder(), null, model, 0.9, 0.3);
          • 0.9 is the training percentage (share of each user's preferences used to build the recommender); 0.3 is the evaluation percentage (share of users included in the evaluation)
  • 22. Questions?
      • blog: shanky.org | twitter: @tshanky
      • st@treasuryofideas.com