Streaming Data Mining
PRESENTED BY Edo Liberty | April 11, 2014
Copyright © 2014 Yahoo! All rights reserved. No reproduction or distribution allowed without express written permission.
Parts of this presentation were given with Jelani Nelson (Harvard) as a KDD tutorial on streaming data mining.
2 Yahoo Confidential & Proprietary
Single machine data mining
[Diagram labels: The World, Data, Computation, Result]
Distributed storage
[Diagram labels: The World, Data (×4), Computation, Result]
Distributed model (map/reduce, message passing, …)
[Diagram labels: The World, Data + Compute (×8), Computation, Result]
Distributed model (indexes, tables, databases, …)
[Diagram labels: The World, Data + Compute (×8), Computation, Query, Result]
207 big-data infographics (meta infographic)
The streaming model
[Diagram labels: The World, Computation, Sketch, Query, Query Algorithm, Result]
The parallel streaming model
[Diagram labels: The World, Compute + Sketch (×4), Aggregate + Sketch, Query, Query Algorithm, Result]
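As a rough illustration of the parallel streaming model, the Python sketch below splits a stream into partitions, summarizes each partition in one pass, and then aggregates the summaries. Everything here is an assumption made for illustration and not taken from the talk: the `CountSumSketch` class, its `update` and `merge` methods, and the choice of a running count and sum as a stand-in for a real sketch.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class CountSumSketch:
    """Toy mergeable summary (hypothetical): keeps only a count and a sum."""
    count: int = 0
    total: float = 0.0

    def update(self, item: float) -> None:
        # Constant work and constant space per stream item.
        self.count += 1
        self.total += item

    def merge(self, other: "CountSumSketch") -> "CountSumSketch":
        # Combine two summaries without revisiting the underlying data.
        return CountSumSketch(self.count + other.count, self.total + other.total)

    def mean(self) -> float:
        # Query step: answer from the summary alone.
        return self.total / self.count if self.count else 0.0


def sketch_partition(items: Iterable[float]) -> CountSumSketch:
    """The 'Compute + Sketch' box: one pass over one partition."""
    sketch = CountSumSketch()
    for item in items:
        sketch.update(item)
    return sketch


def aggregate(sketches: List[CountSumSketch]) -> CountSumSketch:
    """The 'Aggregate + Sketch' box: fold the per-partition summaries together."""
    result = CountSumSketch()
    for s in sketches:
        result = result.merge(s)
    return result


# Example data only: a short stream split into three partitions.
stream = [1, 7, 8, 1, 0, 1, 7, 7]
partitions = [stream[:3], stream[3:6], stream[6:]]
merged = aggregate([sketch_partition(p) for p in partitions])
single = sketch_partition(stream)
```

The point of the design is that `merged` and `single` hold the same summary, so the query algorithm does not need to know how the stream was partitioned.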
The streaming model (more accurately)
[Diagram labels: stream 1 7 8 1 0 1 7 7, Iterator, Computation, Sketch, Result]
O(n) items
O(polylog(n)) space
O(polylog(n)) computation per item
Communication complexity
[Diagram labels: stream 1 7 8 1 0 1 7 7, Iterator (×2), Sketch, Result]
Frequent items
Misra, Gries. Finding repeated elements, 1982.
Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet streams with limited space, 2002.
Karp, Shenker, Papadimitriou. A simple algorithm for finding frequent elements in streams and bags, 2003.
The name "Lossy Counting" was used for a different algorithm by Manku and Motwani, 2002.
Metwally, Agrawal, Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams, 2006.
[Slides 13-22: step-by-step illustration of the frequent-items algorithm; the images did not survive. Remaining labels: $d$, $n$, $f(\cdot) = 5$, $\ell$, $f'(\cdot) = 0$, $f'(\cdot) = 2$.]
The proof (very short)
Assume we do this (the delete step) $t$ times.
First fact: $f'(x) \le f(x)$
Second fact: $f'(x) \ge f(x) - t$
Third (not so obvious) fact: $0 \le \sum_x f'(x) = \sum_x f(x) - t \cdot \ell = n - t \cdot \ell$
Which gives $t \le n/\ell$. In words: we can only delete $\ell$ items $n/\ell$ times!
Combined with the first two facts: $|f'(x) - f(x)| \le n/\ell$ ∎
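The facts above can be checked on a small example. The following is a minimal Python sketch of a frequent-items counter in the style of the Misra and Gries reference, under the assumption that the delete step removes one copy of each of $\ell$ distinct items as soon as $\ell$ distinct items are held. The function name `frequent_items` and the returned deletion count are illustrative choices, not taken from the slides.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def frequent_items(stream: Iterable[Hashable], ell: int) -> Tuple[Dict[Hashable, int], int]:
    """One pass over the stream; returns (approximate counts f', number of delete steps t)."""
    if ell < 1:
        raise ValueError("ell must be at least 1")
    counters: Dict[Hashable, int] = {}
    deletions = 0
    for x in stream:
        counters[x] = counters.get(x, 0) + 1
        if len(counters) >= ell:
            # ell distinct items are held: delete one copy of each,
            # which removes exactly ell items from the summary.
            deletions += 1
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters, deletions


# Example data only.
stream = [1, 7, 8, 1, 0, 1, 7, 7]
ell = 3
approx, t = frequent_items(stream, ell)
exact = Counter(stream)
n = len(stream)

# Check the three facts and the final bound on this example.
assert all(approx.get(x, 0) <= exact[x] for x in exact)        # f'(x) <= f(x)
assert all(approx.get(x, 0) >= exact[x] - t for x in exact)    # f'(x) >= f(x) - t
assert sum(approx.values()) == n - t * ell                     # sum f' = n - t * ell
assert all(abs(approx.get(x, 0) - exact[x]) <= n / ell for x in exact)
```

Between items the summary holds fewer than `ell` counters, so the space used depends on `ell` and not on the stream length.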
Useful form…
Define $p(x) = f(x)/n$
And $p'(x) = f'(x)/n$
We get that $|p'(x) - p(x)| \le 1/\ell$
This is very useful for keeping approximate distributions!
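To make the normalized form concrete, here is a small Python sketch that turns approximate counts into an approximate distribution and compares it with the exact one. The helper names `approx_distribution` and `max_probability_error` and the example numbers are assumptions for illustration; items without a counter are treated as having $p'(x) = 0$.

```python
from typing import Dict, Hashable


def approx_distribution(counters: Dict[Hashable, int], n: int) -> Dict[Hashable, float]:
    """Return p'(x) = f'(x) / n for every item that still has a counter."""
    if n <= 0:
        raise ValueError("n must be the (positive) number of stream items")
    return {x: c / n for x, c in counters.items()}


def max_probability_error(ell: int) -> float:
    """Worst-case |p'(x) - p(x)| implied by the bound above: 1 / ell."""
    return 1.0 / ell


# Hypothetical numbers: approximate and exact counts for a stream of n = 8 items, ell = 3.
approx_counts = {1: 1, 7: 1}
exact_counts = {1: 3, 7: 3, 8: 1, 0: 1}
n, ell = 8, 3

p_approx = approx_distribution(approx_counts, n)
p_exact = {x: c / n for x, c in exact_counts.items()}

# Largest observed gap between the approximate and exact probabilities.
worst = max(abs(p_approx.get(x, 0.0) - p) for x, p in p_exact.items())
```

The guarantee depends only on `ell`, so the accuracy of the distribution can be chosen up front, independently of the stream length.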
Threading Machine Generated Email
Email threads
A simple email thread (that’s not very hard to do…)
Ailon, Karnin, Maarek, Liberty, Threading Machine Generated Email, WSDM 2013
What else can we do in the streaming model…
Items (words, IP addresses, events, clicks, ...):
  - Item frequencies
  - Counting distinct elements
  - Moment and entropy estimation
  - Approximate set operations
Vectors (text documents, images, example features, ...):
  - Dimensionality reduction
  - Clustering (k-means, k-median, …)
  - Linear regression
  - Machine learning (some of it at least)
Matrices (text corpora, user preferences, graphs, ...):
  - Covariance matrix estimation
  - Low rank approximation
  - Sparsification
Thanks!
Yahoo does big data algorithms, software and systems!
Speak to our Talent Team or visit Careers.Yahoo.com and explore our career opportunities in NYC or Sunnyvale, CA.
Seth Tropper
satropper@yahoo-inc.com
Doug DeSimone
desimone@yahoo-inc.com
Keith Daniels
kdnl@yahoo-inc.com
Yahoo is an equal opportunity employer.


MLconf NYC Edo Liberty
