Dunning ml-conf-2014
MapR Technologies
1.
© 2014 MapR Technologies
2.
Who I am
Ted Dunning, Chief Applications Architect, MapR Technologies
Email: tdunning@mapr.com, tdunning@apache.org
Twitter: @Ted_Dunning
Apache Mahout: https://mahout.apache.org/
Twitter: @ApacheMahout
3.
Agenda
• Background: recommending with puppies and ponies
• Speed tricks
• Accuracy tricks
• Moving to real-time
4.
Puppies and Ponies
5.
Cooccurrence Analysis
6.
How Often Do Items Co-occur?
$A \rightarrow A^\top A$
7.
Which Co-occurrences Are Interesting?
$A^\top A \rightarrow \operatorname{sparsify}(A^\top A)$
Each row of indicators becomes a field in a search engine document.
8.
Recommendations
Alice: Alice got an apple and a puppy
9.
Recommendations
Alice: Alice got an apple and a puppy
Charles: Charles got a bicycle
10.
Recommendations
Alice: Alice got an apple and a puppy
Charles: Charles got a bicycle
Bob: Bob got an apple
11.
Recommendations
Alice: Alice got an apple and a puppy
Charles: Charles got a bicycle
Bob: What else would Bob like?
12.
Recommendations
Alice: Alice got an apple and a puppy
Charles: Charles got a bicycle
Bob: A puppy!
13.
By the way, like me, Bob also wants a pony…
14.
Recommendations
Users: Alice, Bob, Charles, Amelia (?)
What if everybody gets a pony? What else would you recommend for new user Amelia?
15.
Recommendations
Users: Alice, Bob, Charles, Amelia (?)
If everybody gets a pony, it’s not a very good indicator of what else to predict...
16.
Problems with Raw Co-occurrence
• Very popular items co-occur with everything (or: why it’s not very helpful to know that everybody wants a pony…)
– Examples: Welcome document; Elevator music
• Very widespread occurrence is not interesting for generating indicators for recommendation
– Unless you want to offer an item that is constantly desired, such as razor blades (or ponies)
• What we want is anomalous co-occurrence
– This is the source of interesting indicators of preference on which to base recommendations
17.
Overview: Get Useful Indicators from Behaviors
  1. Use log files to build a history matrix of users x items
  – Remember: this history of interactions will be sparse compared to all potential combinations
  2. Transform to a co-occurrence matrix of items x items
  3. Look for useful indicators by identifying anomalous co-occurrences to make an indicator matrix
  – Log Likelihood Ratio (LLR) can be helpful to judge which co-occurrences can with confidence be used as indicators of preference
  – ItemSimilarityJob in Apache Mahout uses LLR
18.
Which one is the anomalous co-occurrence?

         A       not A
B        13      1000
not B    1000    100,000

         A       not A
B        1       0
not B    0       10,000

         A       not A
B        10      0
not B    0       100,000

         A       not A
B        1       0
not B    0       2
19.
Which one is the anomalous co-occurrence?

         A       not A
B        13      1000
not B    1000    100,000
Score: 0.90

         A       not A
B        1       0
not B    0       10,000
Score: 4.52

         A       not A
B        10      0
not B    0       100,000
Score: 14.3

         A       not A
B        1       0
not B    0       2
Score: 1.95

Dunning, Ted. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, vol. 19, no. 1 (1993)
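The four scores can be checked with a short calculation. The sketch below assumes the slide reports the square root of the G² log-likelihood ratio statistic from the cited paper; the function names `llr` and `root_llr` are illustrative and not taken from Mahout.

```python
import math

def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio statistic for a 2x2 contingency table."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, r, c in cells:
        if k > 0:  # a zero cell contributes nothing (0 * log 0 is taken as 0)
            total += k * math.log(k * n / (rows[r] * cols[c]))
    return 2.0 * total

def root_llr(k11, k12, k21, k22):
    """Unsigned square root of G^2 (assumed to be the score on the slide)."""
    return math.sqrt(max(llr(k11, k12, k21, k22), 0.0))

tables = [
    (13, 1000, 1000, 100000),
    (1, 0, 0, 10000),
    (10, 0, 0, 100000),
    (1, 0, 0, 2),
]
scores = [root_llr(*t) for t in tables]
```

Under this assumption the large table with 13 co-occurrences scores lowest, because 13 is close to what independence predicts given how often A and B occur overall, while 10 co-occurrences with no other occurrences of either item scores highest.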
20.
Collection of Documents: Insert Meta-Data
Diagram: item meta-data is loaded into the search engine (ingest easily via NFS).
Document for “puppy”:
  id: t4
  title: puppy
  desc: The sweetest little puppy ever.
  keywords: puppy, dog, pet
  indicators: (t1)
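To make the role of the indicators field concrete, here is a minimal in-memory stand-in for the search engine. It assumes that `t1` is the id of the apple document and invents a `t7` bicycle document; neither is stated on the slide. A real search engine would rank with its own relevance scoring, whereas this sketch only counts overlapping indicators.

```python
# Hypothetical documents; only the "puppy" document (t4) appears on the slide.
docs = [
    {"id": "t4", "title": "puppy", "indicators": ["t1"]},
    {"id": "t1", "title": "apple", "indicators": ["t4"]},
    {"id": "t7", "title": "bicycle", "indicators": []},
]

def recommend(history, docs, limit=10):
    """Rank items whose indicators field overlaps the user's recent history."""
    seen = set(history)
    scored = []
    for doc in docs:
        if doc["id"] in seen:
            continue  # do not recommend what the user already has
        hits = len(seen.intersection(doc["indicators"]))
        if hits > 0:
            scored.append((hits, doc["id"]))
    # Most overlapping indicators first; ties broken by id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:limit]]

bob_recommendations = recommend(["t1"], docs)
```

The recommendation step is just a query: the user's recent history is the query text and the indicators field is what gets searched.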
21.
Cooccurrence Mechanics
• Cooccurrence is just a self-join

  for each user, i
    for each history item j1 in Ai*
      for each history item j2 in Ai*
        count pair (j1, j2)

$$\sum_i a_{ij_1} a_{ij_2} = (A^\top A)_{j_1 j_2}$$
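The self-join can be sketched directly in a few lines. This is an illustrative version that assumes binary histories (each item counted once per user); the user and item names are borrowed from the earlier slides, and the function name is hypothetical.

```python
from collections import Counter
from itertools import product

def cooccurrence(histories):
    """Count item pairs (j1, j2) sharing a user: entries of A'A for binary A."""
    counts = Counter()
    for items in histories.values():
        distinct = set(items)  # treat each user's history as a set of items
        for j1, j2 in product(distinct, repeat=2):
            counts[(j1, j2)] += 1
    return counts

histories = {
    "alice": ["apple", "puppy"],
    "bob": ["apple"],
    "charles": ["bicycle"],
}
counts = cooccurrence(histories)
```

Diagonal entries such as `counts[("apple", "apple")]` hold the number of users who have the item, and pairs that never share a user are simply absent from the counter.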
22.
Cross-occurrence Mechanics
• Cross-occurrence is just a self-join of adjoined matrices

  for each user, i
    for each history item j1 in Ai*
      for each history item j2 in Bi*
        count pair (j1, j2)

$$\sum_i a_{ij_1} b_{ij_2} = (A^\top B)_{j_1 j_2}$$

$$[A \mid B]^\top [A \mid B] = \begin{bmatrix} A^\top A & A^\top B \\ B^\top A & B^\top B \end{bmatrix}$$
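The same loop works across two kinds of behavior. In the sketch below the two behaviors are called purchases (A) and views (B); those labels and the sample data are assumptions for illustration, since the slide does not name the behaviors.

```python
from collections import Counter

def cross_occurrence(a_histories, b_histories):
    """Count pairs (j1 from A, j2 from B) sharing a user: entries of A'B."""
    counts = Counter()
    for user, a_items in a_histories.items():
        b_items = set(b_histories.get(user, ()))
        for j1 in set(a_items):
            for j2 in b_items:
                counts[(j1, j2)] += 1
    return counts

# Hypothetical data: A = purchases, B = views.
purchases = {"alice": ["puppy"], "bob": ["pony"]}
views = {"alice": ["apple", "puppy"], "bob": ["apple"], "charles": ["bicycle"]}
cross = cross_occurrence(purchases, views)
```

Users who appear in only one of the two histories contribute nothing, which matches the sum over users of the product of the two matrix entries.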
23.
A word about scaling
24.
A few pragmatic tricks
• Downsample all user histories to max length (interaction cut)
– Can be random or most-recent (no apparent effect on accuracy)
– Prolific users are often pathological anyway
– Common limit is 300 items (no apparent effect on accuracy)
• Downsample all items to limit max viewers (frequency limit)
– Can be random or earliest (no apparent effect)
– Ubiquitous items are uninformative
– Common limit is 500 users (no apparent effect)

Schelter, et al. Scalable similarity-based neighborhood methods with MapReduce. Proceedings of the sixth ACM conference on Recommender systems. 2012
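Both cuts are simple filters over the event log. The sketch below assumes events are `(user, item, time)` tuples with at most one event per user and item, uses the most-recent variant for the interaction cut and the random variant for the frequency limit, and assumes both limits are at least 1. The function names are illustrative.

```python
import random
from collections import defaultdict

def interaction_cut(events, k_max=300):
    """Keep at most k_max most-recent events per user (k_max >= 1)."""
    by_user = defaultdict(list)
    for event in events:
        by_user[event[0]].append(event)
    kept = []
    for user_events in by_user.values():
        user_events.sort(key=lambda e: e[2])  # oldest first
        kept.extend(user_events[-k_max:])     # keep the newest k_max
    return kept

def frequency_limit(events, max_users=500, seed=0):
    """Keep a random sample of at most max_users events per item."""
    rng = random.Random(seed)
    by_item = defaultdict(list)
    for event in events:
        by_item[event[1]].append(event)
    kept = []
    for item_events in by_item.values():
        if len(item_events) > max_users:
            item_events = rng.sample(item_events, max_users)
        kept.extend(item_events)
    return kept

one_user = [("u", "item%d" % t, t) for t in range(5)]
recent = interaction_cut(one_user, k_max=2)
one_item = [("user%d" % n, "pony", n) for n in range(10)]
sampled = frequency_limit(one_item, max_users=3)
```

Applying `interaction_cut` and then `frequency_limit` before counting pairs gives the batch pipeline its bounded per-user and per-item cost.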
25.
But note!
• Number of pairs for a user history with $k_i$ distinct items is ≈ $k_i^2/2$
• Average size of user history increases with increasing dataset
– Average may grow more slowly than N (or not!)
– Full cooccurrence cost grows strictly faster than N
– i.e. it just doesn’t scale
• Downsampling interactions places bounds on per user cost
– Cooccurrence with interaction cut is scalable

$$t \propto \sum_i k_i^2 \in \omega(N)$$

$$t \propto \sum_i \min\left(k_{max}^2, k_i^2\right) \le N k_{max}^2 \in O(N)$$
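A small numeric example shows how much a single prolific user dominates the pair count, and how the interaction cut bounds it. The history sizes are made up for illustration, and `pair_cost` is a hypothetical helper that uses the approximation of $k_i^2/2$ pairs per user.

```python
def pair_cost(history_sizes, k_max=None):
    """Approximate pair count sum_i k_i^2 / 2, optionally capping k_i at k_max."""
    total = 0.0
    for k in history_sizes:
        if k_max is not None:
            k = min(k, k_max)
        total += k * k / 2
    return total

sizes = [10, 50, 5000]                 # hypothetical: one prolific user
full = pair_cost(sizes)                # dominated by the 5000-item history
capped = pair_cost(sizes, k_max=300)   # each user costs at most 300^2 / 2
bound = len(sizes) * 300 ** 2 / 2      # N * k_max^2 / 2
```

With the cut in place the total can never exceed `bound`, which is linear in the number of users.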
26.
Figure: Benefit of down-sampling. x-axis: user limit (0 to 1000); y-axis: pairs (×10^9, 0 to 6). Curves: without down-sampling, and track limit = 1000, 500, 200. Computed on 48,373,586 pair-wise triples from the million song dataset.
27.
Batch Scaling in Time Implies Scaling in Space
• Note:
– With frequency limit sampling, max cooccurrence count is small (<1000)
– With interaction cut, total number of non-zero pairs is relatively small
– Entire cooccurrence matrix can be stored in memory in ~10-15 GB
• Specifically:
– With interaction cut, cooccurrence scales in size: $\|A^\top A\|_0 \in O(N)$
– Without interaction cut, cooccurrence does not scale size-wise: $\|A^\top A\|_0 \in \omega(N)$
28.
Impact of Interaction Cut Downsampling
• Interaction cut allows batch cooccurrence analysis to be O(N) in time and space
• This is intriguing
– Amortized cost is low
– Could this be extended to an on-line form?
• Incremental matrix factorization is hard
– Could cooccurrence be a key alternative?
• Scaling matters most at scale
– Cooccurrence is very accurate at large scale
– Factorization shows benefits at smaller scales
29.
Online update
30.
Requirements for Online Algorithms
• Each unit of input must require O(1) work
– Theoretical bound
• The constants have to be small enough on average
– Pragmatic constraint
• Total accumulated data must be small (enough)
– Pragmatic constraint
31.
Real-time recommendations using MapR data platform
Diagram labels: Log Files; MapR Cluster (via NFS); Batch co-occurrence; Item Meta-Data (via NFS); Search Technology; Web Tier; New User History; Recommendations; Pre; Post.
Annotations: “Recommendations happen in real-time”; “Want this to be real-time” (next to the batch co-occurrence step).
32.
Space Bound Implies Time Bound
• Because user histories are pruned, only a limited number of value updates need be made with each new observation
• This bound is just twice the interaction cut $k_{max}$
– Which is a constant
• Bounding the number of updates trivially bounds the time
33.
Implications for Online Update

$$(A + e_{ij})^\top (A + e_{ij}) - A^\top A = e_{ij}^\top A + A^\top e_{ij} + \delta_{jj}$$

$$= \begin{bmatrix} 0 \\ A_{i*} \\ 0 \end{bmatrix} + \begin{bmatrix} 0 & (A_{i*})^\top & 0 \end{bmatrix} + \delta_{jj}$$
34.
With interaction cut at $k_{max}$:

$$\left\| A_{i*} + (A_{i*})^\top + \delta_{jj} \right\|_0 \le 2 k_{max} - 1 \in O(1)$$
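The bound can be seen in a small incremental counter. This is a sketch under stated assumptions, not the talk's implementation: histories are kept as in-memory sets, events for users already at the interaction cut are dropped, and the class and method names are hypothetical.

```python
from collections import Counter, defaultdict

class OnlineCooccurrence:
    """Incremental cooccurrence counts with an interaction cut (illustrative)."""

    def __init__(self, k_max=300):
        self.k_max = k_max
        self.histories = defaultdict(set)  # user -> items seen so far
        self.counts = Counter()            # (j1, j2) -> cooccurrence count

    def observe(self, user, item):
        """Apply one (user, item) event; return the number of cells updated."""
        history = self.histories[user]
        if item in history or len(history) >= self.k_max:
            return 0  # duplicate event, or user already at the interaction cut
        touched = 1
        self.counts[(item, item)] += 1       # the delta_jj term
        for other in history:
            self.counts[(item, other)] += 1  # row j gains A_i*
            self.counts[(other, item)] += 1  # column j gains (A_i*)'
            touched += 2
        history.add(item)
        return touched

oc = OnlineCooccurrence(k_max=3)
updates = [oc.observe("u", item) for item in ["a", "b", "c", "d"]]
```

Because an accepted event finds at most `k_max - 1` items already in the history, it touches at most `2 * k_max - 1` cells, which is the constant bound on the slide.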
35.
But Wait, It Gets Better
• The new observation may be pruned
– For users at the interaction cut, we can ignore updates
– For items at the frequency cut, we can ignore updates
– Ignoring updates only affects indicators, not recommendation query
– At million song dataset size, half of all updates are pruned
• On average $k_i$ is much less than the interaction cut
– For million song dataset, average appears to grow with log of frequency limit, with little dependency on values of interaction cut > 200
• LLR cutoff avoids almost all updates to index
• Average grows slowly with frequency cut
36.
Figure: x-axis: interaction cut ($k_{max}$), 0 to 1000; y-axis: $k_{ave}$, 0 to 35. Curves: frequency cut = 1000, 500, 200.
37.
Figure: x-axis: frequency cut, 0 to 1000; y-axis: $k_{ave}$, 0 to 35.
38.
Recap
• Cooccurrence-based recommendations are simple
– Deploying with a search engine is even better
• Interaction cut and frequency cut are key to batch scalability
• Similar effect occurs in online form of updates
– Only dozens of updates per transaction needed
– Data structure required is relatively small
– Very, very few updates cause search engine updates
• Fully online recommendation is very feasible, almost easy
39.
More Details Available
Available for free at http://www.mapr.com/practical-machine-learning
40.
Who I am
Ted Dunning, Chief Applications Architect, MapR Technologies
Email: tdunning@mapr.com, tdunning@apache.org
Twitter: @Ted_Dunning
Apache Mahout: https://mahout.apache.org/
Twitter: @ApacheMahout
Apache Drill: http://incubator.apache.org/drill/
Twitter: @ApacheDrill
41.
Q&A
Engage with us!
@mapr, maprtech, MapR, mapr-technologies
tdunning@mapr.com
Editor's notes
Mention that the Pony book said “RowSimilarityJob”…