Apache Mahout is a scalable machine learning library built on Hadoop. It provides algorithms for recommendation engines, clustering, classification and other machine learning tasks. Key algorithms include user-based and item-based collaborative filtering for recommendations, k-means and fuzzy k-means for clustering, and logistic regression for classification. Mahout is well suited to large datasets and lets machine learning tasks be parallelized across a Hadoop cluster. Its advantages include being open source, scalable, and built on production-quality libraries.
2. What is Machine Learning?
“Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data”
3. Typical Use Cases
● Recommend products/friends …
● Classify content into predefined groups
● Computer vision
● Sentiment analysis/opinion mining
● Find patterns in user behavior/actions
● Identify key topics/summarize text
● Detect anomalies/fraud
● Ranking search results
● Speech and handwriting recognition
● Natural language processing
4. ML Algorithms (subset):
● Supervised learning
  – Linear regression
  – Logistic regression
  – Support Vector Machines
  – Random Forests
● Unsupervised learning
  – Clustering
  – Blind signal separation
  – Hidden Markov models
● Semi-supervised
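As a small illustration of the unsupervised family listed above, here is a minimal k-means sketch in plain Python. It is not Mahout code: the function name `kmeans`, the toy `points`, and the fixed iteration cap are assumptions made for this example only.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd-style k-means on tuples of floats (illustrative only)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its old centroid
        if new_centroids == centroids:
            break  # assignments are stable, so stop early
        centroids = new_centroids
    return centroids

# Hypothetical 2-D data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
print(sorted(kmeans(points, k=2)))
```

No labels are supplied; the grouping comes from the data alone, which is what separates this family from the supervised algorithms in the list.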
5. Many ML libraries, frameworks and tools:
● Weka
● Python scikit-learn
● Pylearn/Pylearn2
● Theano
● Orange
● SSBrain :)
● More can be found here: http://mloss.org/software/
9. What is Mahout?
● Scalable ML library built on Hadoop, written in Java
● Driven by Ng et al.'s paper “MapReduce for Machine Learning on Multicore”
● Started as Lucene sub-project. Became Apache TLP in April 2010
● 25 July 2013 - Apache Mahout 0.8 released
● Taste Recommender Framework by Sean Owen was added in 2008
11. When do you need Mahout?
Data Size                     | Task                       | Tools
Lines, Sample Data            | Analysis and visualization | Whiteboard, bash, ...
KBs – low MBs, Prototype Data | Analysis and visualization | Octave, R, bash, ...
MBs – low GBs, Online Data    | Storage                    | Databases (MySQL, PostgreSQL), ...
                              | Analysis                   | NumPy, SciPy, BLAS, Weka
                              | Visualization              | Protovis, D3, ...
GBs – TBs – PBs, Big Data     | Storage                    | HDFS, HBase, Cassandra, ...
                              | Analysis                   | Mahout, Hive, Pig, ...
Table from Varad Meru
16. Algorithms
● User and Item based recommenders
● Matrix factorization based recommenders
● K-Means, Fuzzy K-Means clustering
● Latent Dirichlet Allocation
● Singular value decomposition
● Logistic regression based classifier
● Complementary Naive Bayes classifier
● Random forest decision tree based classifier
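To make the first item in the list concrete, the following is a rough sketch of a user-based recommender in plain Python. It does not use Mahout's API; the `ratings` data, the function names, and the choice of Pearson correlation as the similarity measure are assumptions for illustration.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating}.
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob":   {"A": 4.0, "B": 3.0, "C": 5.0, "D": 5.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0},
}

def pearson(u, v):
    """Pearson correlation over the items both users have rated."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if len(common) < 2:
        return 0.0
    xs = [ratings[u][i] for i in common]
    ys = [ratings[v][i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def recommend(user, n_neighbors=2, top_n=3):
    """Score unseen items by the similarity-weighted ratings of the nearest users."""
    neighbors = sorted(
        ((pearson(user, other), other) for other in ratings if other != user),
        reverse=True,
    )[:n_neighbors]
    scores, weights = {}, {}
    for sim, other in neighbors:
        if sim <= 0:
            continue  # ignore users who are dissimilar or uncorrelated
        for item, rating in ratings[other].items():
            if item in ratings[user]:
                continue  # only recommend items the user has not rated
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
    return [(item, score) for score, item in ranked[:top_n]]

print(recommend("alice"))  # [('D', 5.0)]
```

Here only bob correlates positively with alice (0.5), so his rating of item D becomes the estimate. An item-based recommender would follow the same pattern with similarities computed between items instead of users.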
18. Personalization level
● Generic / Non-Personalized: everyone receives the same recommendations
● Demographic: matches a target group
● Ephemeral: matches current activity
● Persistent: matches long-term interests
19. Content based
● User Ratings x Item Attributes => Model
● Model applied to new items via attributes
● Alternative: knowledge-based (Item attributes form model of item space)
● Example: Personalized news feeds
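The "User Ratings x Item Attributes => Model" idea can be sketched as follows. This is a simplified illustration, not a method taken from the slides: the article names, the binary topic attributes, and the averaging scheme in `build_profile` are all assumptions.

```python
# Hypothetical news items described by binary topic attributes.
item_attributes = {
    "article1": {"sports": 1, "politics": 0, "tech": 0},
    "article2": {"sports": 0, "politics": 1, "tech": 0},
    "article3": {"sports": 0, "politics": 0, "tech": 1},
    "article4": {"sports": 1, "politics": 0, "tech": 1},  # not yet rated
}
user_ratings = {"article1": 5.0, "article2": 1.0, "article3": 4.0}

def build_profile(ratings, attributes):
    """Model: per-attribute average of the user's mean-centered ratings."""
    mean = sum(ratings.values()) / len(ratings)
    totals, counts = {}, {}
    for item, rating in ratings.items():
        for attr, value in attributes[item].items():
            if value:
                totals[attr] = totals.get(attr, 0.0) + (rating - mean)
                counts[attr] = counts.get(attr, 0) + 1
    return {attr: totals[attr] / counts[attr] for attr in totals}

def score(profile, attrs):
    """Apply the model to an item using only that item's attributes."""
    active = [a for a, v in attrs.items() if v]
    if not active:
        return 0.0
    return sum(profile.get(a, 0.0) for a in active) / len(active)

profile = build_profile(user_ratings, item_attributes)
print(score(profile, item_attributes["article4"]))
```

Because the score depends only on attributes, a brand-new article can be ranked before anyone has rated it, which is why this approach suits personalized news feeds.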
28. Evaluation
● Average absolute difference
● RMSE
● Precision and recall
● Precision is the proportion of top results that are relevant, for some definition of relevant.
● Recall is the proportion of all relevant results included in the top results.
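The metrics above can be computed with a few lines of plain Python. The function names and sample numbers below are made up for illustration; "top results" is taken to mean the first k recommended items.

```python
from math import sqrt

def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root mean squared error; penalizes large misses more than small ones."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall over the top k recommended items."""
    top = recommended[:k]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

predicted = [4.0, 3.0, 5.0, 2.0]
actual = [5.0, 3.0, 4.0, 4.0]
print(mean_absolute_error(predicted, actual))  # 1.0
print(rmse(predicted, actual))                 # sqrt(1.5), about 1.22
print(precision_recall_at_k(["A", "B", "C", "D"], {"A", "C", "E"}, k=2))
```

In the last call one of the two top items is relevant, so precision is 1/2, and one of the three relevant items was retrieved, so recall is 1/3.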
32. Mahout Classification Algorithms
● Logistic regression (SGD) - model parameter selection can be done in Hadoop
● Naive Bayes - training runs on Hadoop
● Random Forests - training is done in Hadoop
● Hidden Markov Models - training is done in MapReduce
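For readers unfamiliar with the first entry, here is a minimal single-machine sketch of logistic regression trained by stochastic gradient descent (SGD). It is independent of Mahout's Java implementation; the function names, learning rate, epoch count, and toy data are assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_sgd(examples, learning_rate=0.1, epochs=200, seed=0):
    """Fit weights and bias by updating after every single example."""
    rng = random.Random(seed)
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new random order each epoch
        for features, label in data:
            z = bias + sum(w * x for w, x in zip(weights, features))
            error = label - sigmoid(z)  # gradient of the log-likelihood for one example
            bias += learning_rate * error
            weights = [w + learning_rate * error * x for w, x in zip(weights, features)]
    return weights, bias

def predict(weights, bias, features):
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(z) >= 0.5 else 0

# Hypothetical two-feature examples, already scaled to roughly [-1, 1].
examples = [
    ([-1.0, -0.5], 0), ([-0.8, -1.0], 0), ([-0.3, -0.7], 0),
    ([0.9, 0.6], 1), ([0.4, 1.0], 1), ([1.0, 0.2], 1),
]
weights, bias = train_logistic_sgd(examples)
print(predict(weights, bias, [0.7, 0.7]), predict(weights, bias, [-0.6, -0.9]))
```

Because each update touches only one example, the training loop is inherently sequential.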