3. What is Mahout?
• Recommendations (people who x this also x that)
• Clustering (segment data into groups)
• Classification (learn decision making from examples)
• Stuff (LDA, SVD, frequent item-set, math)
7. Classification in Detail
• Naive Bayes Family
– Hadoop based training
• Decision Forests
– Hadoop based training
• Logistic Regression (aka SGD)
– fast on-line (sequential) training
– Now with MORE topping!
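The "fast on-line (sequential) training" of the SGD variant can be sketched as follows. This is a minimal Python illustration of sequential logistic-regression updates, not Mahout's actual (Java) OnlineLogisticRegression implementation; the feature encoding, learning rate, and toy data are made up:

```python
import math

# Minimal sketch of on-line (sequential) SGD training for logistic
# regression: one gradient step per example, no batching.
def sgd_train(examples, num_features, rate=0.1):
    w = [0.0] * num_features
    for x, y in examples:            # x: indices of active binary features, y: 0 or 1
        score = sum(w[i] for i in x)
        p = 1.0 / (1.0 + math.exp(-score))   # predicted probability of class 1
        for i in x:                  # update only the weights of active features
            w[i] += rate * (y - p)
    return w

# Toy data: feature 0 indicates class 1, feature 1 indicates class 0.
data = [([0], 1), ([1], 0)] * 200
w = sgd_train(data, 2)               # w[0] ends up positive, w[1] negative
```

Because each example is processed once and immediately discarded, this style of training needs no Hadoop pass over the data, which is the point of the contrast drawn on the slide.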
9. And Another
From: Dr. May Acquah
Date: Thu, May 20, 2010 at 10:51 AM
Re: Proposal for over-invoice Contract Benevolence
Dear Sir,
Based on information gathered from the India
hospital directory, I am pleased to propose a
confidential business deal for our mutual
benefit. I have in my possession, instruments
(documentation) to transfer the sum of
33,100,000.00 eur (thirty-three million one hundred
thousand euros, only) into a foreign company's
bank account for our favor.
...
From: George <george@fumble-tech.com>
Hi Ted, it was a pleasure talking to you last night
at the Hadoop User Group. I liked the idea of
going for lunch together. Are you available
tomorrow (Friday) at noon?
13. How it Works
• We are given “features”
– Often binary values in a vector
• Algorithm learns weights
– Weighted sum of feature * weight is the key
• Each weight is a single real value
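The scoring step described above can be sketched directly; this is an illustrative Python fragment (names and values made up), showing a binary feature vector dotted with learned real-valued weights and squashed into a probability:

```python
import math

# "Weighted sum of feature * weight is the key": dot the feature
# vector with the weight vector, then map the score to (0, 1).
def classify(features, weights):
    score = sum(f * w for f, w in zip(features, weights))
    return 1.0 / (1.0 + math.exp(-score))   # logistic link

# Features 0 and 2 are active; their weights dominate the score.
p = classify([1, 0, 1], [2.0, -1.0, 0.5])   # sigmoid(2.5), about 0.92
```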
14. A Quick Diversion
• You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
• I flip the coin and while it is in the air ask again
• I catch the coin and ask again
• I look at the coin (and you don’t) and ask again
• Why does the answer change?
– And did it ever have a single value?
15. A First Conclusion
• Probability as expressed by humans is
subjective and depends on information and
experience
16. A Second Conclusion
• A single number is a bad way to express
uncertain knowledge
• A distribution of values might be better
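The coin example above can be made concrete with a standard conjugate-prior sketch (my illustration, not from the slides): after observing some flips, belief about the heads probability is a Beta distribution rather than a single number, and sampling from it expresses the remaining uncertainty.

```python
import random

# After h heads and t tails, starting from a uniform prior, the
# posterior over the heads probability is Beta(h + 1, t + 1).
def posterior_samples(heads, tails, n=10000, seed=42):
    rng = random.Random(seed)
    return [rng.betavariate(heads + 1, tails + 1) for _ in range(n)]

samples = posterior_samples(heads=7, tails=3)
mean = sum(samples) / len(samples)   # near (7 + 1) / (7 + 3 + 2) = 2/3
```

More data narrows the distribution; new information (like seeing the coin land) collapses it further, which is exactly why the answer kept changing in the diversion above.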
23. Which One to Play?
• One may be better than the other
• The better machine pays off at some rate
• Playing the other will pay off at a lesser rate
– Playing the lesser machine has “opportunity cost”
• But how do we know which is which?
– Explore versus Exploit!
25. Bayesian Bandit
• Compute distributions based on data
• Sample p1 and p2 from these distributions
• Put a coin in bandit 1 if p1 > p2
• Else, put the coin in bandit 2
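The four steps on this slide are Thompson sampling, and they fit in a few lines. This is a self-contained Python sketch; the true payoff rates, trial count, and priors are made-up illustration values:

```python
import random

# Bayesian bandit: keep a Beta posterior over each bandit's payoff
# rate, sample p1 and p2 from those posteriors, and put the coin in
# bandit 1 when p1 > p2 (else bandit 2).
def bayesian_bandit(true_rates, trials=2000, seed=1):
    rng = random.Random(seed)
    wins = [0] * len(true_rates)    # payoffs observed per bandit
    plays = [0] * len(true_rates)   # coins put in each bandit
    for _ in range(trials):
        # Sample a plausible payoff rate from each posterior (uniform prior).
        samples = [rng.betavariate(wins[i] + 1, plays[i] - wins[i] + 1)
                   for i in range(len(true_rates))]
        i = samples.index(max(samples))
        plays[i] += 1
        if rng.random() < true_rates[i]:
            wins[i] += 1
    return plays

plays = bayesian_bandit([0.2, 0.1])   # bandit 1 pays off at a higher rate
```

Early on the posteriors are wide, so both bandits get played (exploration); as evidence accumulates, the better bandit's samples win more often and it receives most of the coins (exploitation), with no explicit switch between the two regimes.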
26. [graph-only slide; see the speaker notes at the end]
27. [graph-only slide; see the speaker notes at the end]
28. The Basic Idea
• We can encode a distribution by sampling
• Sampling allows unification of exploration and
exploitation
• Can be extended to more general response
models
29. Deployment with Storm/MapR
[Architecture diagram: a Targeting Engine talks over RPC to a Model
Selector and to several Online Models; Impression Logs and Click Logs
feed Online Training, and a Conversion Detector feeds model Training;
results appear in a Conversion Dashboard. All state is managed
transactionally in the MapR file system.]
30. Service Architecture
[Architecture diagram: the same pipeline hosted under MapR Pluggable
Service Management, with Storm running the online path (Targeting
Engine, Model Selector, Online Models, and Conversion Detector,
connected over RPC) and Hadoop running model Training from the
Impression Logs and Click Logs; a Conversion Dashboard reports
results. Storage sits on MapR Lockless Storage Services.]
31. Find Out More
• Me: tdunning@mapr.com
ted.dunning@gmail.com
tdunning@apache.org
• MapR: http://www.mapr.com
• Mahout: http://mahout.apache.org
• Code: https://github.com/tdunning
Speaker notes
Having no information would give a relative expected payoff of -0.25. This graph shows the 25th-, 50th- and 75th-percentile results for sampled experiments with uniform random probabilities. Convergence to the optimum goes nearly as sqrt(n). Note the log scale on the number of trials.
Here is how the system converges in terms of how likely it is to pick the better bandit when the payoff probabilities are only slightly different. After 1000 trials, the system is already giving 75% of the bandwidth to the better option. This graph was produced by averaging several thousand runs with the same probabilities.