In this session, we introduce Mahout, a machine learning library with multiple algorithms implemented on top of Hadoop and HDInsight. We start with the foundational concepts needed to understand clustering, classification and collaborative filtering, then demonstrate what it takes to get started with Mahout. In addition to learning how to get Mahout set up, you will learn how to process and prepare data, how to execute an “embarrassingly parallel” batch recommendation job, and how to integrate the results back into your existing ecosystem.
5. MAKING BUSINESS INTELLIGENT
www.pragmaticworks.com
Riding the Elephant
Born out of the Apache Lucene project
Top-level Apache project
A scalable machine learning library
Fast, Efficient & Pragmatic
Many of the algorithms can be run on Hadoop
Classification
Using a pre-determined set of groups:
Predict the type of a new object based on its features
Classifiable Data
Continuous – Quantitative value (e.g. stock price)
Categorical – Small known set (e.g. colors)
Word-Like – Large unknown set
Text-Like – Many word-like values that are unordered
Examples:
Spam Identification
Photo Facial Recognition
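One way to picture the four kinds of classifiable data is to encode a single record into a fixed-size numeric vector. The sketch below uses feature hashing, which is one common approach and not necessarily the one used in this session. The field names (`price`, `color`, `user`, `text`), the vector size, and the helper names are all hypothetical.

```python
import zlib


def _slot(name, size):
    """Map a feature name to a vector index with a deterministic hash."""
    return zlib.crc32(name.encode("utf-8")) % size


def encode(record, size=32):
    """Encode one record with hypothetical fields into a hashed feature vector."""
    vec = [0.0] * size
    # Continuous: store the quantitative value itself.
    vec[_slot("price", size)] += record["price"]
    # Categorical: one slot per (field, value) pair from a small known set.
    vec[_slot("color=" + record["color"], size)] += 1.0
    # Word-like: same idea, but the set of values is large and unknown.
    vec[_slot("user=" + record["user"], size)] += 1.0
    # Text-like: an unordered bag of word-like values.
    for word in record["text"].lower().split():
        vec[_slot("text=" + word, size)] += 1.0
    return vec


sample = {"price": 2.5, "color": "red", "user": "u123", "text": "Buy cheap pills now"}
vector = encode(sample)
```

Because the hash maps an open-ended vocabulary into a fixed number of slots, collisions are possible; a larger `size` makes them less likely.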
Data Acquisition
Sources of Data for Recommendation
Explicit
Ratings
Feedback
Demographics
Psychographics (Personality/Lifestyle/Attitude)
Ephemeral Need (Need for a moment)
Implicit
Purchase History
Click/Browse History
Product/Item
Taxonomy
Attributes
Descriptions
CF Pseudo-Code
for each item i for which u has no preference yet
    for each other user v that has a preference for i
        compute similarity s between u and v
        add v's preference for i, weighted by s, to a running average
return the top-ranked items i by weighted average
Restrict to Neighborhood
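The pseudo-code above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not Mahout's implementation: cosine similarity is an assumed choice of similarity measure, the `neighborhood` parameter is a hypothetical way to restrict the computation to the k most similar users, and the sample preferences are made up.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two users' preference dicts (missing = 0)."""
    dot = sum(a[i] * b[i] for i in set(a) & set(b))
    norm_a = sqrt(sum(x * x for x in a.values()))
    norm_b = sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(prefs, u, top_n=3, neighborhood=None):
    """Return up to top_n (item, score) pairs for user u.

    prefs maps user -> {item: preference}. The score is the
    similarity-weighted average of other users' preferences.
    """
    sims = {v: cosine_similarity(prefs[u], prefs[v]) for v in prefs if v != u}
    if neighborhood is not None:
        # Restrict to the k most similar users.
        nearest = sorted(sims, key=sims.get, reverse=True)[:neighborhood]
        sims = {v: sims[v] for v in nearest}

    totals, weights = {}, {}
    for v, s in sims.items():
        if s <= 0:
            continue  # ignore users with no positive similarity to u
        for item, pref in prefs[v].items():
            if item in prefs[u]:
                continue  # only score items u has no preference for
            totals[item] = totals.get(item, 0.0) + s * pref
            weights[item] = weights.get(item, 0.0) + s

    ranked = sorted(((totals[i] / weights[i], i) for i in totals), reverse=True)
    return [(item, score) for score, item in ranked[:top_n]]


# Made-up sample data for illustration only.
prefs = {
    "alice": {"A": 5.0, "B": 3.0},
    "bob": {"A": 5.0, "B": 3.0, "C": 4.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 5.0},
}
recommendations = recommend(prefs, "alice")
```

The loops are ordered by user rather than by item, which yields the same weighted averages while computing each similarity only once. Each user's recommendations depend only on the shared preference data, which is why a batch job over all users parallelizes so easily.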