SDEC2011 Essentials of Mahout
1. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
Copyright for all other & referenced work is retained by their respective owners.
Essentials of Mahout
Mastering Hadoop Map-reduce for Data Analysis
Shashank Tiwari
blog: shanky.org | twitter: @tshanky
st@treasuryofideas.com
What is Apache Mahout?
• A scalable machine learning infrastructure
• Built on top of Hadoop MapReduce
• Currently supports:
• Clustering, classification, collaborative filtering, and more
A Little History
• Founded by folks active in the Lucene community
• Inspired by work at Stanford: “Map-Reduce for Machine Learning on
Multicore” -- http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf
Project Goal
• Create a community-driven, scalable, and robust machine learning
infrastructure
• Leverage Hadoop for parallel processing and scalability
• Provide an abstraction on top of Hadoop so that machine-learning users
need not deal with the map and reduce primitives when they build their
solutions
Supported Algorithms
• Collaborative Filtering
• User and Item based recommenders
• K-Means, Fuzzy K-Means clustering
• Mean Shift clustering
• Dirichlet process clustering
More Supported Algorithms
• Latent Dirichlet Allocation
• Singular value decomposition
• Parallel Frequent Pattern mining
• Complementary Naive Bayes classifier
• Random forest (decision-tree-based) classifier
• ...and growing
Focus Areas
• Collaborative Filtering
• Clustering
• Classification
Build and Install
• Required Software:
• Java 1.6.x
• Maven 2.0.11+
• Get source: svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
• Compile & install core & examples: mvn install
• Alternatively, run the steps individually: mvn compile, mvn package, and mvn install
Recommendation Examples
• mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner" -Dexec.args="-i /Users/tshanky/workspace/hadoop_workspace/grouplens/ratings.dat"
• https://cwiki.apache.org/confluence/display/MAHOUT/RecommendationExamples
Common Use Cases
• Shopping: Amazon, Netflix
• Who to follow/friend: Twitter/Facebook
• Web resource classification, spam filtering, and pattern recognition in
financial markets
Collaborative Filtering Basis
• User-based: recommend items by finding similar users. User preferences
keep changing, so this method poses challenges.
• Item-based: calculate similarity between items and recommend items
similar to those a user already prefers. Items usually don’t change much,
so this method is often reliable.
• Slope-one: fast and efficient item-based recommendation when user
ratings are more than boolean (yes/no, like/dislike).
• Model-based: recommend on the basis of a model built from users and
their ratings.
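The slope-one idea above can be sketched in plain Java. This is a minimal illustration of the weighted Slope One scheme, not Mahout's own implementation; the class name, method names, and sample ratings are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java sketch of weighted Slope One; not Mahout's implementation. */
class SlopeOneSketch {

    /**
     * Predicts a user's rating for targetItem.
     * ratings maps userId -> (itemId -> rating).
     * Returns NaN when no other item the user rated was co-rated with the target.
     */
    static double predict(Map<String, Map<String, Double>> ratings,
                          String user, String targetItem) {
        Map<String, Double> userRatings = ratings.get(user);
        if (userRatings == null) {
            return Double.NaN;
        }
        double weightedSum = 0.0;
        int totalCount = 0;
        for (Map.Entry<String, Double> rated : userRatings.entrySet()) {
            String otherItem = rated.getKey();
            if (otherItem.equals(targetItem)) {
                continue;
            }
            // Average difference (target - other) over users who rated both items.
            double diffSum = 0.0;
            int count = 0;
            for (Map<String, Double> row : ratings.values()) {
                Double target = row.get(targetItem);
                Double other = row.get(otherItem);
                if (target != null && other != null) {
                    diffSum += target - other;
                    count++;
                }
            }
            if (count > 0) {
                // Shift the user's own rating by the average difference,
                // weighted by the number of users supporting that difference.
                weightedSum += (diffSum / count + rated.getValue()) * count;
                totalCount += count;
            }
        }
        return totalCount == 0 ? Double.NaN : weightedSum / totalCount;
    }

    /** Hypothetical sample data: three users rating items A, B, and C. */
    static Map<String, Map<String, Double>> sampleRatings() {
        Map<String, Double> john = new HashMap<>();
        john.put("A", 5.0); john.put("B", 3.0); john.put("C", 2.0);
        Map<String, Double> mark = new HashMap<>();
        mark.put("A", 3.0); mark.put("B", 4.0);
        Map<String, Double> lucy = new HashMap<>();
        lucy.put("B", 2.0); lucy.put("C", 5.0);
        Map<String, Map<String, Double>> ratings = new HashMap<>();
        ratings.put("john", john);
        ratings.put("mark", mark);
        ratings.put("lucy", lucy);
        return ratings;
    }

    public static void main(String[] args) {
        // Lucy has not rated A; the prediction is 13/3, roughly 4.33.
        System.out.println(predict(sampleRatings(), "lucy", "A"));
    }
}
```

Because only pairwise rating differences are needed, they can be precomputed once and reused across requests, which is one reason the method is considered fast.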
Clustering Basis
• Clustering algorithms also use the notion of similarity to group similar
items into a cluster.
• Both collaborative filtering and clustering rely on a notion of distance,
which can be calculated using a number of different techniques.
• Example: Euclidean distance,
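Euclidean distance, the example named above, is the square root of the summed squared differences between two vectors. A minimal plain-Java sketch follows; the class and method names are hypothetical and independent of Mahout's distance classes.

```java
/** Plain-Java sketch of Euclidean distance between two equal-length vectors. */
class EuclideanDistanceSketch {

    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // The classic 3-4-5 right triangle: prints 5.0
        System.out.println(distance(new double[] {0, 0}, new double[] {3, 4}));
    }
}
```

A smaller distance means the two items (or users) are more similar, so a similarity score can be derived from it, for instance as 1 / (1 + distance).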
Mahout Taste Framework
• Taste Collaborative Filtering:
• Taste is an open source project for CF started by Sean Owen on
SourceForge and donated to Mahout in 2008.
• Has been applied to a number of different data sets successfully.
• Mahout's support for building recommendation engines is based primarily on
the Taste library.
• The library supports both user-based and item-based recommendations.
• Can be used with Java or over RESTful web-service endpoints.
Taste Framework : Primary Classes
• DataModel: Model for Users, Items, and Preferences
• UserSimilarity: Interface defining the similarity between two users
• ItemSimilarity: Interface defining the similarity between two items
• Recommender: Interface for providing recommendations
• UserNeighborhood: Interface for computing a neighborhood of similar
users. These are used by the Recommenders.
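To show how the roles listed above cooperate, here is a minimal plain-Java sketch of a user-based recommender. The methods are simplified stand-ins defined in the block itself, not Mahout's actual interfaces, and the distance-based similarity and sample preferences are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-ins for the roles listed above; not Mahout's real interfaces. */
class TasteRolesSketch {

    /** Data model role: userId -> (itemId -> preference). Hypothetical sample data. */
    static Map<Integer, Map<Integer, Double>> sampleModel() {
        Map<Integer, Map<Integer, Double>> model = new HashMap<>();
        model.put(1, prefs(new int[] {10, 11}, new double[] {5, 3}));
        model.put(2, prefs(new int[] {10, 11, 12, 13}, new double[] {5, 3, 4, 1}));
        model.put(3, prefs(new int[] {10, 11, 12, 13}, new double[] {1, 5, 1, 5}));
        return model;
    }

    static Map<Integer, Double> prefs(int[] items, double[] values) {
        Map<Integer, Double> result = new HashMap<>();
        for (int i = 0; i < items.length; i++) {
            result.put(items[i], values[i]);
        }
        return result;
    }

    /** User similarity role: 1 / (1 + Euclidean distance) over co-rated items. */
    static double similarity(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sumOfSquares = 0.0;
        int shared = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                double diff = e.getValue() - other;
                sumOfSquares += diff * diff;
                shared++;
            }
        }
        return shared == 0 ? 0.0 : 1.0 / (1.0 + Math.sqrt(sumOfSquares));
    }

    /** User neighborhood role: the n users most similar to the given user. */
    static List<Integer> neighborhood(Map<Integer, Map<Integer, Double>> model, int user, int n) {
        final Map<Integer, Double> target = model.get(user);
        List<Integer> others = new ArrayList<>();
        for (Integer id : model.keySet()) {
            if (id != user) {
                others.add(id);
            }
        }
        // Sort by similarity to the target user, most similar first.
        others.sort((x, y) -> Double.compare(
                similarity(target, model.get(y)), similarity(target, model.get(x))));
        return new ArrayList<>(others.subList(0, Math.min(n, others.size())));
    }

    /** Recommender role: the unseen item with the best similarity-weighted average. */
    static Integer recommend(Map<Integer, Map<Integer, Double>> model, int user, int n) {
        Map<Integer, Double> target = model.get(user);
        Map<Integer, Double> weightedSums = new HashMap<>();
        Map<Integer, Double> weights = new HashMap<>();
        for (Integer neighbor : neighborhood(model, user, n)) {
            double sim = similarity(target, model.get(neighbor));
            for (Map.Entry<Integer, Double> e : model.get(neighbor).entrySet()) {
                if (target.containsKey(e.getKey())) {
                    continue; // the user has already rated this item
                }
                weightedSums.merge(e.getKey(), sim * e.getValue(), Double::sum);
                weights.merge(e.getKey(), sim, Double::sum);
            }
        }
        Integer best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Integer, Double> e : weightedSums.entrySet()) {
            double weight = weights.get(e.getKey());
            if (weight > 0 && e.getValue() / weight > bestScore) {
                bestScore = e.getValue() / weight;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // User 2 matches user 1 exactly on shared items, so user 2's top unseen item wins.
        System.out.println(recommend(sampleModel(), 1, 2));
    }
}
```

In the sketch the neighborhood is computed from the similarity, and the recommender consumes both, which mirrors the dependency described in the list above.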
Taste Framework : Online vs Offline
• Can compute online (real-time) recommendations for smaller data sets, on
the order of a few thousand entries.
• Leverages Hadoop for offline recommendation calculations on large data
sets.
Understanding the Group Lens Implementation
• Provides insight into a sample Mahout Taste Framework implementation.
• Uses the publicly available GroupLens data set
• Part of the distribution so you can analyze it, modify it, and use it as an
inspiration for your own implementation
• Easy to follow example
Group Lens Implementation Source
• GroupLensDataModel.java
• GroupLensRecommender.java
• GroupLensRecommenderBuilder.java
• GroupLensRecommenderEvaluatorRunner.java
Group Lens Runner -- evaluator
• Instantiates an evaluator:
• RecommenderEvaluator evaluator = new
AverageAbsoluteDifferenceRecommenderEvaluator();
• a “mean absolute error” (MAE) metric
• Parses input parameters:
• File ratingsFile = TasteOptionParser.getRatings(args);
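The metric behind the evaluator named above is the average absolute difference between actual and predicted ratings. A minimal plain-Java sketch, with a hypothetical class name and made-up numbers, independent of Mahout's evaluator class:

```java
/** Plain-Java sketch of mean absolute error over paired ratings. */
class MeanAbsoluteErrorSketch {

    static double meanAbsoluteError(double[] actual, double[] predicted) {
        if (actual.length != predicted.length || actual.length == 0) {
            throw new IllegalArgumentException("Need two non-empty arrays of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]);
        }
        return sum / actual.length;
    }

    public static void main(String[] args) {
        double[] actual = {4.0, 3.0, 5.0};
        double[] predicted = {3.5, 3.0, 4.0};
        // (0.5 + 0.0 + 1.0) / 3: prints 0.5
        System.out.println(meanAbsoluteError(actual, predicted));
    }
}
```

A lower value means the recommender's estimates are closer to the ratings users actually gave; 0.0 would be a perfect match.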
Group Lens Runner -- data model
• Parses a ratings file whose fields are delimited by double colons (::):
• DataModel model = ratingsFile == null ? new GroupLensDataModel() :
new GroupLensDataModel(ratingsFile);
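Parsing one line of such a file can be sketched in plain Java as follows. The field order user::item::rating::timestamp is an assumption based on the public GroupLens ratings files, the sample line is illustrative, and the class is hypothetical rather than a copy of GroupLensDataModel.

```java
/** Plain-Java sketch of parsing one "::"-delimited rating line. */
class RatingLineSketch {
    final long userId;
    final long itemId;
    final double rating;

    RatingLineSketch(long userId, long itemId, double rating) {
        this.userId = userId;
        this.itemId = itemId;
        this.rating = rating;
    }

    /** Parses "user::item::rating[::timestamp]"; fields after the rating are ignored. */
    static RatingLineSketch parse(String line) {
        String[] fields = line.trim().split("::");
        if (fields.length < 3) {
            throw new IllegalArgumentException("Expected at least 3 fields: " + line);
        }
        return new RatingLineSketch(
                Long.parseLong(fields[0]),
                Long.parseLong(fields[1]),
                Double.parseDouble(fields[2]));
    }

    public static void main(String[] args) {
        RatingLineSketch r = parse("1::1193::5::978300760");
        // prints "1 1193 5.0"
        System.out.println(r.userId + " " + r.itemId + " " + r.rating);
    }
}
```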
Group Lens Runner -- evaluate with recommendation builder
• Evaluates using GroupLensRecommender
• double evaluation = evaluator.evaluate(new
GroupLensRecommenderBuilder(), null, model, 0.9, 0.3);
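In the call above, as I understand the Taste evaluator API, 0.9 is the training percentage (the share of each user's preferences used to build the recommender) and 0.3 is the evaluation percentage (the share of users included in the evaluation). The hold-out idea can be sketched in plain Java; the class, method, and sample data below are hypothetical and this is not Mahout's splitting code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Plain-Java sketch of a hold-out split for recommender evaluation. */
class HoldOutSplitSketch {

    /**
     * Splits preferences into training and test sets.
     * prefsByUser holds one list of preferences per user.
     * Returns a two-element list: index 0 is training, index 1 is test.
     */
    static <T> List<List<T>> split(List<List<T>> prefsByUser,
                                   double trainingPct,
                                   double evaluationPct,
                                   Random random) {
        List<T> training = new ArrayList<>();
        List<T> test = new ArrayList<>();
        for (List<T> userPrefs : prefsByUser) {
            if (random.nextDouble() >= evaluationPct) {
                continue; // this user is not sampled for the evaluation
            }
            for (T pref : userPrefs) {
                if (random.nextDouble() < trainingPct) {
                    training.add(pref);
                } else {
                    test.add(pref); // held out to compare against predictions
                }
            }
        }
        List<List<T>> result = new ArrayList<>();
        result.add(training);
        result.add(test);
        return result;
    }

    /** Hypothetical sample: three users with 2, 3, and 1 preferences. */
    static List<List<String>> sample() {
        List<List<String>> prefs = new ArrayList<>();
        prefs.add(Arrays.asList("u1:i1", "u1:i2"));
        prefs.add(Arrays.asList("u2:i1", "u2:i2", "u2:i3"));
        prefs.add(Arrays.asList("u3:i1"));
        return prefs;
    }

    public static void main(String[] args) {
        List<List<String>> parts = split(sample(), 0.9, 0.3, new Random(42));
        System.out.println("training=" + parts.get(0).size() + " test=" + parts.get(1).size());
    }
}
```

The recommender is then built from the training part, asked to estimate the held-out preferences, and scored by how far its estimates are from the actual values.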
Questions?
• blog: shanky.org | twitter: @tshanky
• st@treasuryofideas.com