Apache Mahout
Thursday, November 4, 2010
Now with extra whitening and classification powers!
• Mahout intro
• Scalability in general
• Supervised learning recap
• The new SGD classifiers
Mahout?
• Hebrew for “essence”
• Hindi for a guy who drives an elephant
Mahout!
• Scalable data-mining and recommendations
• Not all data-mining
• Not the fanciest data-mining
• Just some of the scalable stuff
• Not a competitor for R or Weka
General Areas
• Recommendations
  • lots of support, lots of flexibility, production ready
• Unsupervised learning (clustering)
  • lots of options, lots of flexibility, production ready (ish)
• Supervised learning (classification)
  • multiple architectures, fair number of options, somewhat inter-operable
  • production ready (for the right definition of production and ready)
• Large scale SVD
  • larger scale coming, beware sharp edges
Scalable?
• Scalable means
  • Time is proportional to problem size divided by resource size: $t \propto |P| / |R|$
  • Does not imply Hadoop or parallel
[Figure: wall-clock time vs. # of training examples. Curves: "Non-scalable Algorithm" and "Scalable Algorithm (Mahout wins!)". Regions: "Traditional Datamining Works here" and "Scalable Solutions Required".]
Scalable means ...
• One unit of work requires about a unit of time
• Not like the company store (bit.ly/22XVa4)
$t \propto |P| / |R|$
$|P| = O(1) \implies t = O(1)$
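A worked reading of the proportionality above may help. This is only a sketch: the constant $c$ and the factor of 2 are assumptions introduced here for illustration and do not appear on the slide. If time scales as problem size over resource size, doubling both leaves the time unchanged.

```latex
t = c\,\frac{|P|}{|R|}
\quad\Longrightarrow\quad
t' = c\,\frac{2\,|P|}{2\,|R|} = c\,\frac{|P|}{|R|} = t
```

With $|R|$ held fixed, the same form gives the slide's second line: a bounded unit of work, $|P| = O(1)$, takes bounded time, $t = O(1)$.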
[Figure: wall-clock time vs. # of training examples. Curves: "Sequential Algorithm" and "Parallel Algorithm". Regions: "Sequential Algorithm Preferred" and "Parallel Algorithm Preferred".]
Toy Example
Training Data Sample
Filled? (target variable) | x coordinate | y coordinate | shape
yes | 0.92 | 0.01 | circle
no | 0.30 | 0.41 | square
Predictor variables: x coordinate, y coordinate, shape
What matters most?
SGD Classification
• Supervised learning of logistic regression
• Stochastic gradient descent (SGD): sequential, not parallel
• Highly optimized for high dimensional sparse data, possibly with interactions
• Scalable, real dang fast to train
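The idea behind the SGD classifier can be sketched in a few lines of plain Java. This is a minimal illustration, not Mahout's implementation: the class name SgdLogisticSketch, the dense double[] features, the fixed learning rate, and the toy "x > y" target are all assumptions made here, and the sketch omits regularization and sparse vectors.

```java
import java.util.Random;

/** Hypothetical sketch: binary logistic regression trained one example at a time. */
final class SgdLogisticSketch {
    private final double[] beta;        // one weight per feature
    private final double learningRate;  // fixed step size (an assumption of this sketch)

    SgdLogisticSketch(int numFeatures, double learningRate) {
        this.beta = new double[numFeatures];
        this.learningRate = learningRate;
    }

    /** Estimated probability that the target is 1. */
    double classifyScalar(double[] features) {
        double dot = 0.0;
        for (int i = 0; i < beta.length; i++) {
            dot += beta[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    /** One gradient step on a single example; actual must be 0 or 1. */
    void train(int actual, double[] features) {
        double error = actual - classifyScalar(features);
        for (int i = 0; i < beta.length; i++) {
            beta[i] += learningRate * error * features[i];
        }
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        SgdLogisticSketch model = new SgdLogisticSketch(3, 0.5);
        // Toy data: the target is 1 when x > y; feature 0 is a constant bias term.
        for (int n = 0; n < 5000; n++) {
            double x = rng.nextDouble();
            double y = rng.nextDouble();
            model.train(x > y ? 1 : 0, new double[] {1.0, x, y});
        }
        System.out.println(model.classifyScalar(new double[] {1.0, 0.9, 0.1}));
        System.out.println(model.classifyScalar(new double[] {1.0, 0.1, 0.9}));
    }
}
```

Each call to train touches one example and does work proportional to the number of features, which is why a sequential learner of this kind can still be fast.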
Supervised Learning
[Diagram: labeled training examples (T, x1 ... xn) feed a learning algorithm that produces a model (sequential but fast); the model then assigns a target T to each unlabeled example (?, x1 ... xn) (stateless, parallel).]
Small example
• On 20 newsgroups
  • converges in < 10,000 training examples (less than one pass through the data)
  • accuracy comparable to SVM, Naive Bayes, Complementary Naive Bayes
  • learning rate, regularization set automagically on held-out data
System Structure
• AdaptiveLogisticRegression (1 holds 20 CrossFoldLearner)
  • EvolutionaryProcess ep
  • void train(target, features)
• CrossFoldLearner (1 holds 5 OnlineLogisticRegression)
  • OnlineLogisticRegression folds
  • void train(target, tracking, features)
  • double auc()
• OnlineLogisticRegression
  • Matrix beta
  • void train(target, features)
  • double classifyScalar(features)
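One plausible reading of the CrossFoldLearner layer is sketched below: each example is held out from one fold, chosen by its tracking key, and used to train the others, so the held-out fold's prediction gives an honest running score. This is an assumption about the design, not Mahout's code. The class CrossFoldSketch, the dense double[] features, and the use of mean log-likelihood instead of AUC are all choices made here for brevity.

```java
import java.util.Random;

/** Hypothetical sketch: several online logistic learners with held-out scoring. */
final class CrossFoldSketch {
    private final double[][] beta;     // beta[k] holds the weights of fold k
    private final double learningRate;
    private double logLikelihoodSum;   // sum of held-out log-likelihoods
    private long scored;               // number of held-out predictions

    CrossFoldSketch(int folds, int numFeatures, double learningRate) {
        this.beta = new double[folds][numFeatures];
        this.learningRate = learningRate;
    }

    private static double predict(double[] w, double[] x) {
        double dot = 0.0;
        for (int i = 0; i < w.length; i++) {
            dot += w[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    /** The fold selected by trackingKey scores the example; all other folds train on it. */
    void train(long trackingKey, int actual, double[] features) {
        int heldOut = (int) Math.floorMod(trackingKey, (long) beta.length);
        for (int k = 0; k < beta.length; k++) {
            double p = predict(beta[k], features);
            if (k == heldOut) {
                double q = Math.min(Math.max(p, 1e-9), 1.0 - 1e-9); // avoid log(0)
                logLikelihoodSum += actual == 1 ? Math.log(q) : Math.log(1.0 - q);
                scored++;
            } else {
                for (int i = 0; i < features.length; i++) {
                    beta[k][i] += learningRate * (actual - p) * features[i];
                }
            }
        }
    }

    /** Mean held-out log-likelihood so far (0.0 before any example is seen). */
    double logLikelihood() {
        return scored == 0 ? 0.0 : logLikelihoodSum / scored;
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        CrossFoldSketch learner = new CrossFoldSketch(5, 3, 0.5);
        for (long n = 0; n < 5000; n++) {
            double x = rng.nextDouble();
            double y = rng.nextDouble();
            learner.train(n, x > y ? 1 : 0, new double[] {1.0, x, y});
        }
        System.out.println(learner.logLikelihood());
    }
}
```

A score computed this way needs no separate test set, which is what would let an outer layer compare many such learners while training is still in progress.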
Training API
public interface OnlineLearner {
    void train(int actual, Vector instance);
    void train(long trackingKey, int actual, Vector instance);
    void train(long trackingKey, String groupKey, int actual, Vector instance);
    void close();
}
Classification API
public class AdaptiveLogisticRegression implements OnlineLearner {
    public AdaptiveLogisticRegression(int numCategories, int numFeatures,
                                      PriorFunction prior);
    public void train(int actual, Vector instance);
    public void train(long trackingKey, int actual, Vector instance);
    public void train(long trackingKey, String groupKey, int actual,
                      Vector instance);
    public void close();
    public double auc();
    public State<Wrapper> getBest();
}

CrossFoldLearner model = learningAlgorithm.getBest().getPayload().getLearner();
double averageCorrect = model.percentCorrect();
double averageLL = model.logLikelihood();
double p = model.classifyScalar(features);
Speed?
• Encoding API for hashed feature vectors
• String, byte[] or double interfaces
  • String allows simple parsing
  • byte[] and double allow speed
• Abstract interactions supported
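Hashed feature encoding can be illustrated with a short sketch. This is not Mahout's encoder API: the class HashedEncoderSketch, its method names, the FNV-1a hash, and the "name:value" key format are assumptions chosen here to show the technique. Each feature is hashed to an index in a fixed-size vector, and an interaction is encoded by hashing the combined key.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the hashing trick for building fixed-size feature vectors. */
final class HashedEncoderSketch {
    private final int numFeatures;

    HashedEncoderSketch(int numFeatures) {
        this.numFeatures = numFeatures;
    }

    /** FNV-1a hash of the key bytes, reduced to an index in [0, numFeatures). */
    int indexFor(byte[] key) {
        int h = 0x811C9DC5;
        for (byte b : key) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return Math.floorMod(h, numFeatures);
    }

    /** String interface: simple to call, but pays for string building and encoding. */
    void addToVector(String name, String value, double weight, double[] vector) {
        byte[] key = (name + ":" + value).getBytes(StandardCharsets.UTF_8);
        vector[indexFor(key)] += weight;
    }

    /** byte[] interface: the caller supplies a ready-made key and skips the parsing. */
    void addToVector(byte[] key, double weight, double[] vector) {
        vector[indexFor(key)] += weight;
    }

    /** Interaction of two features: hash the pair as a single combined key. */
    void addInteraction(String a, String b, double weight, double[] vector) {
        byte[] key = (a + "*" + b).getBytes(StandardCharsets.UTF_8);
        vector[indexFor(key)] += weight;
    }

    public static void main(String[] args) {
        HashedEncoderSketch encoder = new HashedEncoderSketch(16);
        double[] vector = new double[16];
        encoder.addToVector("shape", "circle", 1.0, vector);
        encoder.addToVector("x", "0.92", 1.0, vector);
        encoder.addInteraction("shape:circle", "x:0.92", 1.0, vector);
        System.out.println(java.util.Arrays.toString(vector));
    }
}
```

Because the vector size is fixed up front, no dictionary of feature names has to be built or shared, at the cost of occasional collisions between unrelated features.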
Speed!
• Parsing and encoding dominate the run time of a single learner
• Moderate optimization allows 1 million training examples with 200 features to be encoded in 14 seconds on a single core
• 20 million mixed text, categorical features with many interactions learned in ~1 hour
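The encoding figure above can be turned into per-second rates with a little arithmetic. This is a back-of-the-envelope reading that assumes all 200 features of every example are encoded.

```latex
\frac{10^{6}\ \text{examples}}{14\ \text{s}} \approx 7.1 \times 10^{4}\ \text{examples/s}
\qquad
\frac{10^{6} \times 200\ \text{features}}{14\ \text{s}} \approx 1.4 \times 10^{7}\ \text{features/s}
\approx 70\ \text{ns per feature}
```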
More Speed!
• Evolutionary optimization of learning parameters allows simple operation
• 20x threading allows high machine use
• 20 newsgroup test completes in less time on single node with SGD than on Hadoop with Complementary Naive Bayes
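The combination of evolutionary parameter search and threading can be sketched as follows. This is a simplified illustration and not Mahout's EvolutionaryProcess: the class EvolvingRateSketch, the log-normal mutation, the starting rate of 0.01, and tuning only the learning rate are all assumptions made here. Each generation scores a pool of candidate rates in parallel, keeps the best, and mutates around it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: pick a learning rate by parallel trial and mutation. */
final class EvolvingRateSketch {

    /** Trains one online logistic learner; returns its mean score-before-train log-likelihood. */
    static double score(double learningRate, double[][] xs, int[] ys) {
        double[] w = new double[xs[0].length];
        double sum = 0.0;
        for (int n = 0; n < xs.length; n++) {
            double dot = 0.0;
            for (int i = 0; i < w.length; i++) {
                dot += w[i] * xs[n][i];
            }
            double p = 1.0 / (1.0 + Math.exp(-dot));
            double q = Math.min(Math.max(p, 1e-9), 1.0 - 1e-9); // avoid log(0)
            sum += ys[n] == 1 ? Math.log(q) : Math.log(1.0 - q);
            for (int i = 0; i < w.length; i++) {
                w[i] += learningRate * (ys[n] - p) * xs[n][i];
            }
        }
        return sum / xs.length;
    }

    /** Runs the given number of generations and returns the best rate found. */
    static double evolve(double[][] xs, int[] ys, int poolSize, int generations, long seed) {
        Random rng = new Random(seed);
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            double best = 0.01; // starting rate, an arbitrary choice for this sketch
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int g = 0; g < generations; g++) {
                List<Double> rates = new ArrayList<>();
                rates.add(best); // the current best always survives
                while (rates.size() < poolSize) {
                    rates.add(best * Math.exp(rng.nextGaussian())); // mutated copies
                }
                List<Future<Double>> scores = new ArrayList<>();
                for (double rate : rates) {
                    Callable<Double> task = () -> score(rate, xs, ys);
                    scores.add(pool.submit(task)); // one candidate per thread
                }
                for (int k = 0; k < rates.size(); k++) {
                    double s = scores.get(k).get();
                    if (s > bestScore) {
                        bestScore = s;
                        best = rates.get(k);
                    }
                }
            }
            return best;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The candidates are independent of one another, so the pool keeps one core busy per learner even though each learner is itself sequential.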
Summary
• Mahout provides early production quality scalable data-mining
• New classification systems allow industrial scale classification
Contact Info
Ted Dunning
tdunning@maprtech.com
or tdunning@apache.com
SD Forum 11 04-2010
