© 2014 MapR Technologies
Me, Us
• Ted Dunning, Chief Application Architect, MapR
– Committer and PMC member, ZooKeeper and Drill
– VP Incubator
– Bought the beer at the first HUG
• MapR
– Distributes more open source components for Hadoop
– Adds major technology for performance, HA, industry standard APIs
• Info
– Hashtags - #mapr #mlconfatl
– See also - @ApacheDrill, @ted_dunning and @mapR
Agenda
• Rationale
• Why cheap isn't the same as simple-minded
• Some techniques
• Examples
Why is cheap better than deep (sometimes)?
Greenfield problems can be
– Easy (large number of these)
– Impossible (large number of these)
– Hard but possible (right on the boundary)
Mature problems can be
– Easy (these are already done)
– Impossible (still a large number of these)
– Hard but possible (now the majority of the effort)
Most data isn’t worth much in isolation
First data is valuable
Later data is dregs
Suddenly worth processing
First data is valuable
Later data is dregs
But has high aggregate value
If we can handle the scale
It’s really big
With great scale comes great opportunity
• Increasing scale by 1000x changes the game
• We essentially have green fields opening up all around
• Most of the opportunities don’t require advanced learning
A simple example - security monitoring
• “Small” data
– Capture IDS logs
– Detect what you already know
• “Big” data
– Capture switch, server, firewall logs as well
– New patterns emerge immediately
Another example – fraud detection
• “Small” data
– Maintain card profiles
– Segment models
– Evaluate all transactions
• “Big” data
– Maintain card profiles, full 90-day transaction history
– Per-user hierarchical models
– Evaluate all transactions
Easy != Stupid
• You still have to do things reasonably well
– Techniques that are not well founded are still problems
• Heuristic frequency ratios still fail
– Coincidences still dominate the data
– Accidental 100% correlations abound
• Related techniques still broken for coincidence
– Pearson’s χ²
– Simple correlations
Blast from the past
Scale does not cure wrong
It just makes easy more common
A core technique
• Many of these easy problems reduce to finding interesting coincidences
• This can be summarized as a 2 x 2 table
• Actually, many of these tables

         A      Other
B        k11    k12
Other    k21    k22
How do you do that?
• This is well handled using the G-test
– See Wikipedia
– See http://bit.ly/surprise-and-coincidence
• Original application in linguistics now cited > 2000 times
• Available in Elasticsearch, in Solr, in Mahout
• Available in R, C, Java, Python
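For reference, the G statistic for a 2 x 2 table of counts k11, k12, k21, k22 can be written as below. The symbols r_i (row sums), c_j (column sums) and N (grand total) are notation introduced here for illustration; they do not appear on the slides. Terms with k_ij = 0 contribute zero.

```latex
G = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} k_{ij} \, \ln \frac{k_{ij} \, N}{r_i \, c_j},
\qquad
r_i = \sum_{j} k_{ij}, \quad
c_j = \sum_{i} k_{ij}, \quad
N = \sum_{i,j} k_{ij}
```

The ratio inside the logarithm compares each observed count with the count expected if A and B were independent, which is r_i c_j / N.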
Which one is the anomalous co-occurrence?
Table 1
         A        not A
B        13       1,000
not B    1,000    100,000

Table 2
         A        not A
B        1        0
not B    0        10,000

Table 3
         A        not A
B        10       0
not B    0        100,000

Table 4
         A        not A
B        1        0
not B    0        2
Which one is the anomalous co-occurrence?
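Table 1 (score 0.90)
         A        not A
B        13       1,000
not B    1,000    100,000

Table 2 (score 4.52)
         A        not A
B        1        0
not B    0        10,000

Table 3 (score 14.3)
         A        not A
B        10       0
not B    0        100,000

Table 4 (score 1.95)
         A        not A
B        1        0
not B    0        2

Each score is the square root of the G statistic for its table.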
Dunning, Ted. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics 19(1), 1993.
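The four scores can be checked with a few lines of Python. This is a minimal sketch, assuming the scores are the signed square root of the G statistic computed in its entropy form; the function names (`xlogx`, `entropy`, `llr`, `root_llr`) are chosen here for illustration and are not taken from the slides.

```python
import math


def xlogx(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts (N times H)."""
    return xlogx(sum(counts)) - sum(xlogx(k) for k in counts)


def llr(k11, k12, k21, k22):
    """G statistic (log-likelihood ratio) for a 2 x 2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def root_llr(k11, k12, k21, k22):
    """Signed square root of G: negative when A and B co-occur less than expected."""
    root = math.sqrt(llr(k11, k12, k21, k22))
    # Cross-multiplied comparison of k11/(k11+k12) with k21/(k21+k22).
    if k11 * (k21 + k22) < k21 * (k11 + k12):
        return -root
    return root


if __name__ == "__main__":
    tables = [
        (13, 1000, 1000, 100000),
        (1, 0, 0, 10000),
        (10, 0, 0, 100000),
        (1, 0, 0, 2),
    ]
    for t in tables:
        print(t, f"{root_llr(*t):.2f}")
    # prints 0.90, 4.52, 14.29 and 1.95 (the slide rounds 14.29 to 14.3)
```

The signed root is convenient for ranking because its magnitude is roughly comparable to a count of standard deviations, while the sign separates over-represented from under-represented coincidences.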
So we can find interesting coincidence
and that gets us exactly what?
Cooccurrence Analysis
Real-life example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
– “hombres de paco” times 400
– not much else
• Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff
Real-life example
Any other domains?
Document classification
Language identification
OK … Works for language
Anything else?
Species identification
Anything useful?
Like, to do with money?
Common Point of Compromise
• Scenario:
– Merchant 0 is compromised, leaks account data during compromise
– Fraud committed elsewhere during exploit
– High background level of fraud
– Limited detection rate for exploits
• Goal:
– Find merchant 0
• Meta-goal:
– Screen algorithms for this task without leaking sensitive data
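One way to frame the search for merchant 0 as a coincidence problem is to build, for each merchant, a 2 x 2 table of "consumer visited this merchant" against "consumer later had detected fraud", and rank merchants by the signed root of the G statistic. The sketch below is an illustration under that assumption, not the speaker's implementation; `g_statistic`, `breach_scores` and the input layout (`visits`, `fraud_victims`) are hypothetical names.

```python
import math
from collections import defaultdict


def g_statistic(k11, k12, k21, k22):
    """G = 2 * sum(observed * ln(observed / expected)) for a 2 x 2 table."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    g = 0.0
    for i, row in enumerate(((k11, k12), (k21, k22))):
        for j, k in enumerate(row):
            if k > 0:  # empty cells contribute nothing
                g += k * math.log(k * n / (rows[i] * cols[j]))
    return max(0.0, 2.0 * g)


def breach_scores(visits, fraud_victims):
    """Score each merchant by how strongly visiting it coincides with fraud.

    visits: dict mapping consumer -> set of merchants visited in the window
    fraud_victims: set of consumers with detected fraud afterwards
    """
    n = len(visits)
    n_fraud = sum(1 for c in visits if c in fraud_victims)
    visited = defaultdict(int)
    visited_fraud = defaultdict(int)
    for consumer, merchants in visits.items():
        for m in merchants:
            visited[m] += 1
            if consumer in fraud_victims:
                visited_fraud[m] += 1

    scores = {}
    for m, total in visited.items():
        k11 = visited_fraud[m]        # visited and defrauded
        k12 = total - k11             # visited, no fraud
        k21 = n_fraud - k11           # defrauded, never visited
        k22 = n - total - k21         # neither
        root = math.sqrt(g_statistic(k11, k12, k21, k22))
        # Positive only when fraud is over-represented among visitors.
        over = k11 * (k21 + k22) > k21 * (k11 + k12)
        scores[m] = root if over else -root
    return scores
```

A high background fraud rate raises k21 for every merchant, which is why a raw fraud ratio is a weak signal and a test that accounts for expected counts is preferable.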
Simulation Setup
[Plot: daily count (0 to 500) of compromises and frauds over days 0 to 100, with the compromise period and the exploit period marked]
Simulation Strategy
• For each consumer
– Pick consumer parameters such as transaction rate, preferences
– Generate transactions until end of sim-time
• If merchant 0 during compromise time, possibly mark as compromised
• For all transactions, possibly mark as fraud, probability depends on history
• Merchants are selected using a hierarchical Pitman-Yor process
• Restate data
– Flatten transaction streams
– Sort by time
• Tunables
– Compromise probability, background fraud, detection probability
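The strategy above can be sketched in a few dozen lines. This is a simplified illustration rather than the speaker's simulator: it uses a single flat Pitman-Yor process instead of a hierarchical one, it omits the detection-probability tunable, and every name and default value (`PitmanYor`, `simulate`, `p_compromise`, `p_background`, `p_exploit`) is an assumption made for the example.

```python
import random


class PitmanYor:
    """Chinese-restaurant sampler for a Pitman-Yor process.

    Returns integer merchant ids; ids chosen often become more likely,
    which gives a heavy-tailed popularity distribution.
    """

    def __init__(self, alpha, discount, rng):
        self.alpha = alpha
        self.discount = discount
        self.rng = rng
        self.counts = []   # counts[i] = times merchant i was chosen
        self.total = 0

    def sample(self):
        u = self.rng.random() * (self.total + self.alpha)
        # Mass reserved for a brand-new merchant.
        u -= self.alpha + self.discount * len(self.counts)
        choice = len(self.counts)
        if u >= 0:
            for i, c in enumerate(self.counts):
                u -= c - self.discount
                if u < 0:
                    choice = i
                    break
            else:
                choice = len(self.counts) - 1   # guard against rounding
        if choice == len(self.counts):
            self.counts.append(0)
        self.counts[choice] += 1
        self.total += 1
        return choice


def simulate(n_consumers=100, days=100.0, compromise=(20.0, 30.0), seed=1,
             p_compromise=0.5, p_background=0.001, p_exploit=0.05):
    """Return a time-sorted list of (time, consumer, merchant, is_fraud)."""
    rng = random.Random(seed)
    merchants = PitmanYor(alpha=10.0, discount=0.3, rng=rng)
    transactions = []
    for consumer in range(n_consumers):
        rate = rng.uniform(0.1, 1.0)        # consumer parameter: transactions per day
        compromised = False
        t = rng.expovariate(rate)
        while t < days:
            merchant = merchants.sample()
            in_window = compromise[0] <= t < compromise[1]
            if merchant == 0 and in_window and rng.random() < p_compromise:
                compromised = True
            # Fraud probability depends on the consumer's history.
            p_fraud = p_background + (p_exploit if compromised else 0.0)
            transactions.append((t, consumer, merchant, rng.random() < p_fraud))
            t += rng.expovariate(rate)
    transactions.sort()                     # restate: flatten and sort by time
    return transactions
```

Calling `simulate(seed=1)` twice returns identical data, so candidate detection algorithms can be compared on the same synthetic stream without touching real transactions.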
But that isn’t very realistic!
• No details of the fraud
• No details of the fraudsters
• No details on the transactions
• No details on the models
• How can this be any good at all?
Secure Development is Hard
[Diagram labels: System knowledge, Observed data, Training algorithm, Model, Model deployment, New measurements, Anomaly scores]
Outside collaborators are outside the security perimeter
They can’t see the data and they can’t tune new algorithms to fit reality
How To Make Realistic Data
[Diagram labels: Live data, System under test, Failure signatures; Fake data, Failure signatures]
Parametric Simulation
[Diagram: live data and fake data are each run through the system under test to produce failure signatures; the failure signatures are marked “Match here”]
Parametric matching of failure signatures allows emulation of complex data properties
Matching on KPIs and failure modes guarantees practical fidelity
Performance Indicators to Match
• User and merchant population
• Transaction count/consumer
• Merchant propensity skew
• Level of detected fraud
• Spectrum of meta-model scores
So how does it work in practice?
[Plot: LLR score for real data. Breach score (LLR, 0 to 80) against number of merchants (log scale, 10^0 to 10^6); the few high-scoring outliers are labeled “Really truly bad guys”]
Summary
• We live in a golden age of newly achieved scale
• That scale has lowered the tree
– Hard problems are much easier
– Lots of low-hanging fruit all around us
• Cheap learning has huge value
• Code available at http://github.com/tdunning
GenCyber Cyber Security Day Presentation
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15

  • 1. © 2014 MapR Technologies
  • 2. Me, Us
      • Ted Dunning, Chief Application Architect, MapR
          – Committer, PMC member: ZooKeeper, Drill
          – VP Incubator
          – Bought the beer at the first HUG
      • MapR
          – Distributes more open source components for Hadoop
          – Adds major technology for performance, HA, industry-standard APIs
      • Info
          – Hashtags: #mapr #mlconfatl
          – See also: @ApacheDrill, @ted_dunning and @mapR
  • 3. Agenda
      • Rationale
      • Why cheap isn't the same as simple-minded
      • Some techniques
      • Examples
  • 4. Why is cheap better than deep (sometimes)
      Greenfield problems can be
          – Easy (large number of these)
          – Impossible (large number of these)
          – Hard but possible (right on the boundary)
      Mature problems can be
          – Easy (these are already done)
          – Impossible (still a large number of these)
          – Hard but possible (now the majority of the effort)
  • 5. Most data isn’t worth much in isolation
      First data is valuable
      Later data is dregs
  • 6. Suddenly worth processing
      First data is valuable
      Later data is dregs
      But has high aggregate value
  • 7. If we can handle the scale
      It’s really big
  • 8. With great scale comes great opportunity
      • Increasing scale by 1000x changes the game
      • We essentially have green fields opening up all around
      • Most of the opportunities don’t require advanced learning
  • 9. A simple example - security monitoring
      • “Small” data
          – Capture IDS logs
          – Detect what you already know
      • “Big” data
          – Capture switch, server, firewall logs as well
          – New patterns emerge immediately
  • 10. Another example – fraud detection
      • “Small” data
          – Maintain card profiles
          – Segment models
          – Evaluate all transactions
      • “Big” data
          – Maintain card profiles, full 90 day transaction history
          – Per user hierarchical models
          – Evaluate all transactions
  • 11. Easy != Stupid
      • You still have to do things reasonably well
          – Techniques that are not well founded are still problems
      • Heuristic frequency ratios still fail
          – Coincidences still dominate the data
          – Accidental 100% correlations abound
      • Related techniques still broken for coincidence
          – Pearson’s χ²
          – Simple correlations
  • 12. Blast from the past
  • 13. Scale does not cure wrong
      It just makes easy more common
  • 14. A core technique
      • Many of these easy problems reduce to finding interesting coincidences
      • This can be summarized as a 2 x 2 table
      • Actually, many of these tables

                   A      Other
          B        k11    k12
          Other    k21    k22

  • 15. How do you do that?
      • This is well handled using the G-test
          – See Wikipedia
          – See http://bit.ly/surprise-and-coincidence
      • Original application in linguistics now cited > 2000 times
      • Available in Elasticsearch, in Solr, in Mahout
      • Available in R, C, Java, Python
  • 16. Which one is the anomalous co-occurrence?

                   A       not A                      A       not A
          B        13      1000              B        1       0
          not B    1000    100,000           not B    0       10,000

                   A       not A                      A       not A
          B        10      0                 B        1       0
          not B    0       100,000           not B    0       2

  • 17. Which one is the anomalous co-occurrence? (with scores)

          k11    k12     k21     k22        Score
          13     1000    1000    100,000    0.90
          1      0       0       2          1.95
          1      0       0       10,000     4.52
          10     0       0       100,000    14.3

      Dunning, Ted. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics 19(1), 1993.
  • 18. So we can find interesting coincidence and that gets us exactly what?
  • 19. Cooccurrence Analysis
  • 20. Real-life example
      • Query: “Paco de Lucia”
      • Conventional meta-data search results:
          – “hombres de paco” times 400
          – not much else
      • Recommendation based search:
          – Flamenco guitar and dancers
          – Spanish and classical guitar
          – Van Halen doing a classical/flamenco riff
  • 21. Real-life example
  • 22. Any other domains?
  • 23. Document classification
  • 24. Language identification
  • 25. OK … Works for language
      Anything else?
  • 26. Species identification
  • 27. Anything useful?
      Like, to do with money?
  • 28. Common Point of Compromise
      • Scenario:
          – Merchant 0 is compromised, leaks account data during compromise
          – Fraud committed elsewhere during exploit
          – High background level of fraud
          – Limited detection rate for exploits
      • Goal:
          – Find merchant 0
      • Meta-goal:
          – Screen algorithms for this task without leaking sensitive data
  • 29. Simulation Setup
      [Figure: count per day (y-axis, 0 to 500) over days 0 to 100 (x-axis) for two series, compromises and frauds, with the compromise period and the exploit period marked]
  • 30. Simulation Strategy
      • For each consumer
          – Pick consumer parameters such as transaction rate, preferences
          – Generate transactions until end of sim-time
              • If merchant 0 during compromise time, possibly mark as compromised
              • For all transactions, possibly mark as fraud; probability depends on history
      • Merchants are selected using hierarchical Pitman-Yor
      • Restate data
          – Flatten transaction streams
          – Sort by time
      • Tunables
          – Compromise probability, background fraud, detection probability
  • 31. But that isn’t very realistic!
      • No details of the fraud
      • No details of the fraudsters
      • No details on the transactions
      • No details on the models
      • How can this be any good at all?
  • 32. Secure Development is Hard
      [Diagram labels: System knowledge; Observed data; Training algorithm; Model; New measurements; Model; Anomaly scores; Model deployment]
  • 33. Secure Development is Hard
      [Diagram labels: System knowledge; Observed data; Training algorithm; Model; New measurements; Model; Anomaly scores; Model deployment]
      Outside collaborators are outside the security perimeter
      They can’t see the data and they can’t tune new algorithms to fit reality
  • 34. How To Make Realistic Data
      [Diagram labels: System under test; Live data; Failure signatures; Fake data; Failure signatures]
  • 35. Parametric Simulation
      [Diagram labels: Match here; Live data; System under test; Failure signatures; Fake data; Failure signatures; Fake data; System under test; Failure signatures]
      Parametric matching of failure signatures allows emulation of complex data properties
      Matching on KPIs and failure modes guarantees practical fidelity
  • 36. Performance Indicators to Match
      • User and merchant population
      • Transaction count/consumer
      • Merchant propensity skew
      • Level of detected fraud
      • Spectrum of meta-model scores
  • 37. So how does it work in practice?
  • 38. [No text on slide]
  • 39. LLR score for real data
      [Figure: BreachScore (LLR) on the y-axis, 0 to 80, against Number of Merchants on a log-scale x-axis, 10^0 to 10^6; annotations: “Real truly bad guys”, “Really truly bad guys”]
  • 40. Summary
      [Thumbnails: the LLR score figure, the Cooccurrence Analysis slide and the simulation setup figure]
      • We live in a golden age of newly achieved scale
      • That scale has lowered the tree
          – Hard problems are much easier
          – Lots of low-hanging fruit all around us
      • Cheap learning has huge value
      • Code available at http://github.com/tdunning
  • 41. Me, Us
      • Ted Dunning, Chief Application Architect, MapR
          – Committer, PMC member: ZooKeeper, Drill
          – VP Incubator
          – Bought the beer at the first HUG
      • MapR
          – Distributes more open source components for Hadoop
          – Adds major technology for performance, HA, industry-standard APIs
      • Info
          – Hashtags: #mapr #mlconfatl
          – See also: @ted_dunning and @mapR
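
The G-test on a 2 x 2 table (slides 14 to 17) can be sketched in a few lines of Python. This is a minimal illustration, not the code from the talk: the function names `xlogx`, `entropy`, `llr` and `root_llr` are chosen here for the example, and the entropy-based formulation is one common way to compute the statistic. The score is the square root of the log-likelihood ratio G², signed by whether A and B occur together more or less often than expected. The four tables are the ones shown on slide 16.

```python
import math


def xlogx(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy: N ln N - sum(k ln k)."""
    return xlogx(sum(counts)) - sum(xlogx(k) for k in counts)


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio statistic for a 2 x 2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def root_llr(k11, k12, k21, k22):
    """Signed square root of G^2; negative when A and B avoid each other."""
    score = math.sqrt(llr(k11, k12, k21, k22))
    # Compare k11/(k11+k12) with k21/(k21+k22) without dividing by zero.
    if k11 * (k21 + k22) < k21 * (k11 + k12):
        score = -score
    return score


# The four tables from slide 16, as (k11, k12, k21, k22).
tables = [
    (13, 1000, 1000, 100_000),
    (1, 0, 0, 10_000),
    (10, 0, 0, 100_000),
    (1, 0, 0, 2),
]

for t in tables:
    print(t, f"{root_llr(*t):.2f}")
```

Run on these tables, the scores agree with the values on slide 17 to the precision shown there. The table with the largest raw count in the A-and-B cell (13) gets the lowest score, because with marginal totals of about 1000 each, roughly 10 co-occurrences are expected by chance.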

Editor's Notes

  1. The value of analytics always increases with more data, but the rate of increase drops dramatically after an initial quick increase.
  2. Laugh on tech term in American English = garbage 10:38