Hokusai
Sketching streams in real time
Sergiy Matusevych (1), Alexander J. Smola (2), Amr Ahmed (2)
(1) Yahoo! Research, Santa Clara, CA
(2) Google, Mountain View, CA
UAI 2012
Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 1
Thanks
Alex Smola
Google and CMU
Amr Ahmed
Google
Motivation
Compute frequencies of elements in the data stream
Item frequencies change over time.
Number of items unknown and variable.
Example: logging query frequency over time.
Applications
Flow counting for IP traffic (who sent what, when and how much)
Spam detection and filtering (detect bursts immediately)
Website analytics (feedback to editors, trend detection)
State of the art
CountMin sketch is instantaneous but does not log time.
Naive snapshotting costs linear memory.
MapReduce batch jobs provide exact counts but with long delays.
Resource constraints
Fixed memory footprint for entire sketch regardless of duration
High query throughput
Real time aggregation and response
Strategy
1. Use CountMin sketch to store snapshots of data.
(this solves the real time logging problem)
2. Compress snapshots linearly as they age
We care most about recent events
Logarithmic storage since $\sum_{t=1}^{T} t^{-1} = O(\log T)$
3. Exploit CountMin data structure for efficient compression
Variant 1: reduce storage per snapshot
Variant 2: increase timespan per snapshot
4. Interpolate between both variants for improved accuracy
CountMin Sketch (Cormode & Muthukrishnan)
$M \in \mathbb{R}^{d \times n}$ matrix, d hash functions, n bins
hash h1: M11 M12 M13 M14 M15 M16 ... M1n
hash h2: M21 M22 M23 M24 M25 M26 ... M2n
hash h3: M31 M32 M33 M34 M35 M36 ... M3n
(each item x is hashed to one bin per row)
In-memory data structure for instantaneous retrieval
Aggregate statistic of observation interval (instantaneous retrieval)
Intuition — Bloom filter with integers
Algorithm
insert(x):
  for i = 1 to d do
    M[i, h_i(x)] ← M[i, h_i(x)] + 1
  end for
query(x):
  $\hat{n}_x \leftarrow \min_{i \in \{1, \ldots, d\}} M[i, h_i(x)]$
  return $\hat{n}_x$
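The insert and query routines above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class name `CountMinSketch` and the choice of a salted BLAKE2b digest as the hash family are assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Minimal CountMin sketch: d hash rows, n bins per row."""

    def __init__(self, d, n):
        self.d = d
        self.n = n
        self.M = [[0] * n for _ in range(d)]

    def _bin(self, i, x):
        # Assumed hash family: BLAKE2b salted with the row index i.
        digest = hashlib.blake2b(
            str(x).encode("utf-8"),
            digest_size=8,
            salt=i.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.n

    def insert(self, x, count=1):
        # Increment one bin in every row.
        for i in range(self.d):
            self.M[i][self._bin(i, x)] += count

    def query(self, x):
        # The minimum over rows never underestimates the true count.
        return min(self.M[i][self._bin(i, x)] for i in range(self.d))
```

Because every row is only ever incremented, collisions can inflate a bin but never shrink it, which is why taking the minimum is safe.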
Guarantees
Approximation guarantee
For a sketch with $d = \log \frac{1}{\delta}$ and $n = \frac{e}{\epsilon}$ we have with probability $1 - \delta$ that the estimate $\hat{n}_x$ deviates from the count $n_x$ via
$n_x \leq \hat{n}_x \leq n_x + \epsilon \sum_{x'} n_{x'}$ for all x.
Linear statistic of the data
Power law distributions with exponent z only use $O(N^{-1/z})$ space.
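As a worked example of the approximation guarantee, the sketch dimensions can be derived from a target error and failure probability. The function name `sketch_dimensions`, the use of the natural logarithm, and rounding up to integers are assumptions made for this illustration.

```python
import math


def sketch_dimensions(epsilon, delta):
    """Return (d, n) for additive error epsilon * total count with probability 1 - delta."""
    d = math.ceil(math.log(1.0 / delta))  # number of hash functions (assumed natural log)
    n = math.ceil(math.e / epsilon)       # number of bins per hash function
    return d, n
```

For instance, a 1% error at a 1% failure probability needs only 5 rows of 272 bins each, independent of the number of distinct items.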
Step 1: Combining time intervals
$M_T$ and $M_{T'}$ are sketches at time intervals $T$ and $T'$ with $T \cap T' = \emptyset$.
Combine sketches by adding them up: $M_{T \cup T'} = M_T + M_{T'}$
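Since the sketch is a linear statistic, combining two disjoint time intervals is an element-wise matrix sum. A minimal sketch, assuming both sketches are lists of rows built with the same hash functions (the name `merge_sketches` is hypothetical):

```python
def merge_sketches(m_a, m_b):
    """Element-wise sum of two sketches with identical shape and hash functions."""
    same_shape = len(m_a) == len(m_b) and all(
        len(ra) == len(rb) for ra, rb in zip(m_a, m_b)
    )
    if not same_shape:
        raise ValueError("sketches must have the same shape")
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(m_a, m_b)]
```

Querying the merged sketch gives the same result as if all items from both intervals had been inserted into a single sketch.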
Step 1: Efficient computation
Keep aggregates for time intervals of length $\{1, 1, 2, 4, 8, \ldots, 2^m\}$.
Insert into the leftmost aggregation interval.
Aggregate as cumulative sum from the left using $1 + \sum_{i=0}^{n} 2^i = 2^{n+1}$
Computation is $\sum_{n=1}^{\infty} n \cdot 2^{-n} = O(1)$ amortized time, $O(\log t)$ space.
[Figure: aggregation intervals of length 1, 1, 2, 4 being merged as new data arrives]
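The cumulative-sum update can be sketched as below. This is one possible reading of the slide, not the authors' code: the class name `TimeAggregator`, the `tick` method, and the convention that level j holds the sum of the most recent 2^j unit intervals are assumptions for the illustration.

```python
def _add(a, b):
    """Element-wise sum of two sketches stored as lists of rows."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


class TimeAggregator:
    """Keeps one sketch per power-of-two time span."""

    def __init__(self):
        self.t = 0
        self.levels = []  # levels[j] covers 2**j unit intervals

    def tick(self, unit_sketch):
        """Absorb the sketch of the unit interval that just ended."""
        self.t += 1
        carry = [row[:] for row in unit_sketch]  # copy so the caller can reuse its buffer
        j = 0
        # Level j is refreshed only when t is a multiple of 2**j.
        while self.t % (2 ** j) == 0:
            if j == len(self.levels):
                self.levels.append([[0] * len(row) for row in carry])
            backup = carry
            carry = _add(carry, self.levels[j])  # cumulative sum from the left
            self.levels[j] = backup
            j += 1
```

Level j is touched once every 2^j ticks, which is where the O(1) amortized cost comes from, and after t ticks only about log2(t) levels exist.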
Step 2: Folding over
$M^b$ is a sketch with $n = 2^b$ bins.
$M^{b-1}$ can be obtained as
$M^{b-1}[i, j] = M^b[i, j] + M^b[i, j + 2^{b-1}]$
by "folding over" the sketch.
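The folding step can be written in a few lines. This sketch assumes the matrix is a list of rows whose length is a power of two, and that bin indices are obtained by reducing a hash value modulo the number of bins, so that bins j and j + 2^(b-1) land on the same bin after halving. The name `fold` is hypothetical.

```python
def fold(m):
    """Halve the number of bins by adding the right half of each row onto the left half."""
    n = len(m[0])
    if n < 2 or n & (n - 1):
        raise ValueError("number of bins must be a power of two and at least 2")
    half = n // 2
    return [[row[j] + row[j + half] for j in range(half)] for row in m]
```

With a modulo-based bin index, an item that hashed to bin `h % n` before folding is found at bin `h % (n // 2)` afterwards, so queries keep working on the smaller sketch at the price of more collisions.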
Step 2: Efficient computation
Halve the size of the sketch every $2^t$ intervals.
Computation costs O(1) time and O(log t) space.
interval 1: 1 x 16 bins
intervals 2 to 3: 2 x 8 bins
intervals 4 to 7: 4 x 4 bins
...
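The halving schedule shown above can be expressed as a function of snapshot age. This is a sketch under the assumption that ages start at 1 and that a snapshot of age a keeps 2^(b - floor(log2 a)) bins, which matches the 16, 8, 4 pattern listed; the name `bins_for_age` and its parameters are hypothetical.

```python
def bins_for_age(age, max_bits):
    """Bins kept for a snapshot that is `age` unit intervals old (age >= 1)."""
    if age < 1:
        raise ValueError("age must be at least 1")
    level = age.bit_length() - 1  # floor(log2(age))
    return 2 ** max(max_bits - level, 0)
```

Each band of ages (1, 2 to 3, 4 to 7, ...) then occupies the same total number of bins, so storage grows with the number of bands, that is, logarithmically in time.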
Step 3: Resolution Interpolation
Time aggregation reports good estimate over long time interval.
Item aggregation reports poor estimate over short time interval.
Marginals of joint distribution — assume independence & interpolate
[Figure: joint counts with marginals n(t) and n(x) and total n]
Torso and Tail
Item aggregated estimate $n_x$
Time aggregated estimate $n_t$
Count interpolation
$\hat{n}_{xt} = \frac{n_x \cdot n_t}{n}$ where $n = \sum_t n_t = \sum_x n_x$
Head
Sketch accuracy decreases with e · t
Use regular CountMin sketch whenever
$\tilde{n}(x, t) > e \cdot t \cdot 2^{-b}$
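The head versus torso-and-tail rule can be sketched as a small decision function. The function and argument names are hypothetical, and treating the direct sketch reading as the quantity compared against the threshold is an assumption about how the slide's rule is applied.

```python
import math


def interpolate_count(n_x, n_t, n_total):
    """Independence-based estimate: n_x * n_t / n."""
    if n_total <= 0:
        return 0.0
    return n_x * n_t / n_total


def estimate_count(n_direct, n_x, n_t, n_total, t, b):
    """Use the direct sketch reading for heavy items, interpolation otherwise."""
    threshold = math.e * t * 2.0 ** (-b)  # e * t * 2^(-b) from the slide
    if n_direct > threshold:
        return float(n_direct)
    return interpolate_count(n_x, n_t, n_total)
```

For example, with marginals 30 and 20 out of a total of 100, the interpolated estimate is 6, which is used only when the direct reading falls below the threshold.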
Setup and Throughput
Web query data, 5 days sample: 97.9M unique terms, 378.1M total
Wikipedia data: 4.5M unique terms, 1291.5M total
[Plots for both datasets: number of unique terms versus term frequency, log-log axes from 10^0 to 10^6]
Configuration
Platform: 64-bit Linux, 4-core 2GHz x86, 16GB RAM, Gigabit network
Sketch setup: 4 hash functions, $2^{23}$ bins, $2^{11}$ aggregation intervals (7 days in 5 minute intervals)
3-gram interpolation: 12GB sketch with 3 hash functions, $2^{30}$ bins
Speed
Software: client-server system, ICE middleware, 1 server, 10 clients
Throughput/s: 50k inserts, 22k requests (time aggregation), 8.5k requests (resolution interpolation)
Limiting factors: TCP/IP overhead, package query, memory latency, random access
Accuracy (aggregate absolute error $\hat{n} - n$)
Accuracy (stratified absolute error $\hat{n} - n$)
Sketching for Graphical Models
Goal
Observe stream of observations
Estimate joint probability in O(1) time
CountMin is good for head but interpolation better for torso and tail
General Strategy
Markov network with junction tree: cliques C and separator sets S.
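Estimate counts for $x_C$ and $x_S$ with $C \in \mathcal{C}$ and $S \in \mathcal{S}$ to generate
$\hat{p}(x) = n^{|\mathcal{S}| - |\mathcal{C}|} \prod_{C \in \mathcal{C}} n_{x_C} \prod_{S \in \mathcal{S}} n_{x_S}^{-1}$.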
Estimates are fast — only lookup in CountMin sketch. No need to
solve convex program for graphical model inference.
Markov Chain
Unigrams: $p(abc) \approx n^{-3} \cdot \hat{n}_a \cdot \hat{n}_b \cdot \hat{n}_c$
Bigrams: $p(abc) \approx n^{-1} \cdot \frac{\hat{n}_{ab} \cdot \hat{n}_{bc}}{\hat{n}_b}$
Backoff smoothing (e.g. Kneser-Ney) in practice.
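The junction-tree estimate from the General Strategy formula reduces to a product of looked-up counts. A minimal sketch, assuming the clique and separator counts have already been read from a sketch; the function name `junction_tree_estimate` is hypothetical. For a chain a, b, c with bigram cliques {ab, bc} and separator {b}, the exponent |S| - |C| is 1 - 2 = -1.

```python
import math


def junction_tree_estimate(n, clique_counts, separator_counts):
    """p(x) ~ n^(|S| - |C|) * prod(clique counts) / prod(separator counts)."""
    if n <= 0 or any(c == 0 for c in separator_counts):
        return 0.0
    exponent = len(separator_counts) - len(clique_counts)
    numerator = math.prod(clique_counts)
    denominator = math.prod(separator_counts)  # empty product is 1
    return n ** exponent * numerator / denominator
```

With n = 100, bigram counts 10 and 8, and a separator count of 20, the estimate is 0.04; passing three unigram counts and no separators gives the unigram approximation instead.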
n-gram Interpolation
Trigram approximation
Wikipedia dataset (1291.5M terms, 405M unique trigrams)
Method                          Absolute error   Relative error
Unigram approximation           2.50 · 10^7      0.266
Bigram approximation            1.22 · 10^6      0.013
Trigram sketching (CountMin)    8.35 · 10^6      0.089
Sketching trigrams is not accurate enough on the tail.
Summary
Fast and simple algorithm to aggregate statistics of data streams.
Effective compressed representation of the temporal data.
Works well for graphical models.
High-performance scalable implementation with O(1) time access.
Can be distributed over many servers.
Hokusai Katsushika
Great Wave off Kanagawa
Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 21

Weitere ähnliche Inhalte

Was ist angesagt?

Moment Preserving Approximation of Independent Components for the Reconstruct...
Moment Preserving Approximation of Independent Components for the Reconstruct...Moment Preserving Approximation of Independent Components for the Reconstruct...
Moment Preserving Approximation of Independent Components for the Reconstruct...
rahulmonikasharma
 

Was ist angesagt? (20)

MVPA with SpaceNet: sparse structured priors
MVPA with SpaceNet: sparse structured priorsMVPA with SpaceNet: sparse structured priors
MVPA with SpaceNet: sparse structured priors
 
CLIM Fall 2017 Course: Statistics for Climate Research, Guest lecture: Data F...
CLIM Fall 2017 Course: Statistics for Climate Research, Guest lecture: Data F...CLIM Fall 2017 Course: Statistics for Climate Research, Guest lecture: Data F...
CLIM Fall 2017 Course: Statistics for Climate Research, Guest lecture: Data F...
 
A common fixed point theorem for two random operators using random mann itera...
A common fixed point theorem for two random operators using random mann itera...A common fixed point theorem for two random operators using random mann itera...
A common fixed point theorem for two random operators using random mann itera...
 
Problem Understanding through Landscape Theory
Problem Understanding through Landscape TheoryProblem Understanding through Landscape Theory
Problem Understanding through Landscape Theory
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Maneuvering target track prediction model
Maneuvering target track prediction modelManeuvering target track prediction model
Maneuvering target track prediction model
 
SIAM - Minisymposium on Guaranteed numerical algorithms
SIAM - Minisymposium on Guaranteed numerical algorithmsSIAM - Minisymposium on Guaranteed numerical algorithms
SIAM - Minisymposium on Guaranteed numerical algorithms
 
On learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihoodOn learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihood
 
Moment Preserving Approximation of Independent Components for the Reconstruct...
Moment Preserving Approximation of Independent Components for the Reconstruct...Moment Preserving Approximation of Independent Components for the Reconstruct...
Moment Preserving Approximation of Independent Components for the Reconstruct...
 
LupoPasini_SIAMCSE15
LupoPasini_SIAMCSE15LupoPasini_SIAMCSE15
LupoPasini_SIAMCSE15
 
Non Deterministic and Deterministic Problems
Non Deterministic and Deterministic Problems Non Deterministic and Deterministic Problems
Non Deterministic and Deterministic Problems
 
Hyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientHyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradient
 
Constrained Support Vector Quantile Regression for Conditional Quantile Estim...
Constrained Support Vector Quantile Regression for Conditional Quantile Estim...Constrained Support Vector Quantile Regression for Conditional Quantile Estim...
Constrained Support Vector Quantile Regression for Conditional Quantile Estim...
 
Distributed Support Vector Machines
Distributed Support Vector MachinesDistributed Support Vector Machines
Distributed Support Vector Machines
 
Lec17 sparse signal processing & applications
Lec17 sparse signal processing & applicationsLec17 sparse signal processing & applications
Lec17 sparse signal processing & applications
 
Dictionary Learning for Massive Matrix Factorization
Dictionary Learning for Massive Matrix FactorizationDictionary Learning for Massive Matrix Factorization
Dictionary Learning for Massive Matrix Factorization
 
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
 
Convex Optimization Modelling with CVXOPT
Convex Optimization Modelling with CVXOPTConvex Optimization Modelling with CVXOPT
Convex Optimization Modelling with CVXOPT
 

Andere mochten auch

Jones_Talei_EDE202_Assessment2ppt
Jones_Talei_EDE202_Assessment2pptJones_Talei_EDE202_Assessment2ppt
Jones_Talei_EDE202_Assessment2ppt
Talei85
 
Jackson Pollock
Jackson PollockJackson Pollock
Jackson Pollock
Eric
 
Japanese printmaking elementary lesson ppt
Japanese printmaking elementary lesson pptJapanese printmaking elementary lesson ppt
Japanese printmaking elementary lesson ppt
dandeliondandelion23
 
Vincent van gogh
Vincent van goghVincent van gogh
Vincent van gogh
mkredford
 

Andere mochten auch (20)

Jones_Talei_EDE202_Assessment2ppt
Jones_Talei_EDE202_Assessment2pptJones_Talei_EDE202_Assessment2ppt
Jones_Talei_EDE202_Assessment2ppt
 
Hokusai
HokusaiHokusai
Hokusai
 
Hokusai
HokusaiHokusai
Hokusai
 
George seurat
George seuratGeorge seurat
George seurat
 
Post impressionism Art Period Study Guide
Post impressionism Art Period Study GuidePost impressionism Art Period Study Guide
Post impressionism Art Period Study Guide
 
Seurat powerpoint
Seurat powerpointSeurat powerpoint
Seurat powerpoint
 
Jackson Pollock
Jackson PollockJackson Pollock
Jackson Pollock
 
Japanese printmaking elementary lesson ppt
Japanese printmaking elementary lesson pptJapanese printmaking elementary lesson ppt
Japanese printmaking elementary lesson ppt
 
Henri Matisse
Henri MatisseHenri Matisse
Henri Matisse
 
Paul klee.ppt
Paul klee.pptPaul klee.ppt
Paul klee.ppt
 
Paul Klee
Paul KleePaul Klee
Paul Klee
 
Hokusai~ The Last Series ~ Pictures for 100 Poems by 100 Poets (nx power lite)
Hokusai~ The Last Series ~ Pictures for 100 Poems by 100 Poets  (nx power lite)Hokusai~ The Last Series ~ Pictures for 100 Poems by 100 Poets  (nx power lite)
Hokusai~ The Last Series ~ Pictures for 100 Poems by 100 Poets (nx power lite)
 
PaulKlee
PaulKleePaulKlee
PaulKlee
 
Leonardo da Vinci
Leonardo da VinciLeonardo da Vinci
Leonardo da Vinci
 
Hokusai Nº2
Hokusai Nº2Hokusai Nº2
Hokusai Nº2
 
Henri Matisse
Henri MatisseHenri Matisse
Henri Matisse
 
Vincent van gogh
Vincent van goghVincent van gogh
Vincent van gogh
 
Bauhaus
BauhausBauhaus
Bauhaus
 
Bauhaus
BauhausBauhaus
Bauhaus
 
Periods of Art
Periods of ArtPeriods of Art
Periods of Art
 

Ähnlich wie Hokusai - Sketching streams in real time

!Business statistics tekst
!Business statistics tekst!Business statistics tekst
!Business statistics tekst
King Nisar
 
Introduction to Machine Vision
Introduction to Machine VisionIntroduction to Machine Vision
Introduction to Machine Vision
Nasir Jumani
 
11.optimal nonlocal means algorithm for denoising ultrasound image
11.optimal nonlocal means algorithm for denoising ultrasound image11.optimal nonlocal means algorithm for denoising ultrasound image
11.optimal nonlocal means algorithm for denoising ultrasound image
Alexander Decker
 
Projection methods for stochastic structural dynamics
Projection methods for stochastic structural dynamicsProjection methods for stochastic structural dynamics
Projection methods for stochastic structural dynamics
University of Glasgow
 
Information-theoretic clustering with applications
Information-theoretic clustering  with applicationsInformation-theoretic clustering  with applications
Information-theoretic clustering with applications
Frank Nielsen
 
Ch-2 final exam documet compler design elements
Ch-2 final exam documet compler design elementsCh-2 final exam documet compler design elements
Ch-2 final exam documet compler design elements
MAHERMOHAMED27
 

Ähnlich wie Hokusai - Sketching streams in real time (20)

D143136
D143136D143136
D143136
 
!Business statistics tekst
!Business statistics tekst!Business statistics tekst
!Business statistics tekst
 
Cg
CgCg
Cg
 
Introduction to Machine Vision
Introduction to Machine VisionIntroduction to Machine Vision
Introduction to Machine Vision
 
Analysis of Algorithum
Analysis of AlgorithumAnalysis of Algorithum
Analysis of Algorithum
 
13_Unsupervised Learning.pdf
13_Unsupervised Learning.pdf13_Unsupervised Learning.pdf
13_Unsupervised Learning.pdf
 
Optimal nonlocal means algorithm for denoising ultrasound image
Optimal nonlocal means algorithm for denoising ultrasound imageOptimal nonlocal means algorithm for denoising ultrasound image
Optimal nonlocal means algorithm for denoising ultrasound image
 
11.optimal nonlocal means algorithm for denoising ultrasound image
11.optimal nonlocal means algorithm for denoising ultrasound image11.optimal nonlocal means algorithm for denoising ultrasound image
11.optimal nonlocal means algorithm for denoising ultrasound image
 
Model-counting Approaches For Nonlinear Numerical Constraints
Model-counting Approaches For Nonlinear Numerical ConstraintsModel-counting Approaches For Nonlinear Numerical Constraints
Model-counting Approaches For Nonlinear Numerical Constraints
 
Projection methods for stochastic structural dynamics
Projection methods for stochastic structural dynamicsProjection methods for stochastic structural dynamics
Projection methods for stochastic structural dynamics
 
Viii sem
Viii semViii sem
Viii sem
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Information-theoretic clustering with applications
Information-theoretic clustering  with applicationsInformation-theoretic clustering  with applications
Information-theoretic clustering with applications
 
Teknik Simulasi
Teknik SimulasiTeknik Simulasi
Teknik Simulasi
 
Ch-2 final exam documet compler design elements
Ch-2 final exam documet compler design elementsCh-2 final exam documet compler design elements
Ch-2 final exam documet compler design elements
 
Meshing for computer graphics
Meshing for computer graphicsMeshing for computer graphics
Meshing for computer graphics
 
AINL 2016: Strijov
AINL 2016: StrijovAINL 2016: Strijov
AINL 2016: Strijov
 
Design and Implementation of Parallel and Randomized Approximation Algorithms
Design and Implementation of Parallel and Randomized Approximation AlgorithmsDesign and Implementation of Parallel and Randomized Approximation Algorithms
Design and Implementation of Parallel and Randomized Approximation Algorithms
 
Lecture _Line Scan Conversion.ppt
Lecture _Line Scan Conversion.pptLecture _Line Scan Conversion.ppt
Lecture _Line Scan Conversion.ppt
 
Atomic algorithm and the servers' s use to find the Hamiltonian cycles
Atomic algorithm and the servers' s use to find the Hamiltonian cyclesAtomic algorithm and the servers' s use to find the Hamiltonian cycles
Atomic algorithm and the servers' s use to find the Hamiltonian cycles
 

Kürzlich hochgeladen

Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
PirithiRaju
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
Areesha Ahmad
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
AlMamun560346
 
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET
 

Kürzlich hochgeladen (20)

Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATIONSTS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Unit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oUnit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 o
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 

Hokusai - Sketching streams in real time

  • 1. Hokusai Sketching streams in real time Sergiy Matusevych1 Alexander J. Smola2 Amr Ahmed2 1Yahoo! Research, Santa Clara, CA 2Google, Mountain View, CA UAI 2012 Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 1
  • 2. Thanks Alex Smola Google and CMU Amr Ahmed Google Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 2
  • 3. Motivation Compute frequencies of elements in the data stream Item frequencies change over time. Number of items unkonwn and variable. Example - logging query frequency over time. Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 3
  • 4. Motivation Compute frequencies of elements in the data stream Item frequencies change over time. Number of items unkonwn and variable. Example - logging query frequency over time. Applications Flow counting for IP traffic (who sent what, when and how much) Spam detection and filtering (detect bursts immediately) Website analytics (feedback to editors, trend detection) State of the art CountMin sketch is instantaneous but does not log time. Naive snapshotting costs linear memory. MapReduce batch job provides exact counts but long delays. Resource constraints Fixed memory footprint for entire sketch regardless of duration High query throughput Real time aggregation and response Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 4
  • 5. Strategy 1. Use CountMin sketch to store snapshots of data. (this solves the real time logging problem) 2. Compress snapshots linearly as they age We care most about recent events Logarithmic storage since T t=1 t−1 = O(log T) 3. Exploit CountMin data structure for efficient compression Variant 1: reduce storage per snapshot Variant 2: increase timespan per snapshot 4. Interpolate between both variants for improved accuracy Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 5
  • 6. CountMin Sketch (Cormode & Muthukrishnan) M ∈ Rd×n matrix d hash functions n bins hash h1 M11 M12 M13 M14 M15 M16 . . . M1n hash h2 M21 M22 M23 M24 M25 M26 . . . M2n hash h3 M31 M32 M33 M34 M35 M36 . . . M3n x In-memory data structure for instantaneous retrieval Aggregate statistic of observation interval (instantanous retrieval) Intuition — Bloom filter with integers Algorithm insert(x): for i = 1 to d do M[i, hi (x)] ← M[i, hi (x)] + 1 end for query(x): ˆnx ← min i∈{1,...d} M[i, hi (x)] return ˆnx Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 6
  • 7. Guarantees M ∈ Rd×n matrix d hash functions n bins hash h1 M11 M12 M13 M14 M15 M16 . . . M1n hash h2 M21 M22 M23 M24 M25 M26 . . . M2n hash h3 M31 M32 M33 M34 M35 M36 . . . M3n x Approximation guarantee For sketch with d = log 1 δ and n = e we have with probability 1 − δ that the estimate ˆnx deviates from the count nx via nx ≤ ˆnx ≤ nx + x nx for all x. Linear statistic of the data Power law distributions with exponent z only use O(N −1/z) space. Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 7
  • 8. Step 1: Combining time intervals M ∈ Rd×n matrix d hash functions n bins hash h1 M11 M12 M13 M14 M15 M16 . . . M1n hash h2 M21 M22 M23 M24 M25 M26 . . . M2n hash h3 M31 M32 M33 M34 M35 M36 . . . M3n x MT and MT sketches at time intervals T and T with T ∩ T = ∅. Combine sketches by adding them up + Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 8
  • 9. Step 1: Efficient computation Keep aggregates for time intervals of length {1, 1, 2, 4, 8, . . . , 2m}. Insert into the leftmost aggregation interval. Aggregate as cumulative sum from the left using 1 + n i=0 2i = 2n+1 Computation is ∞ n=1 n · 2−n = O(1) amortized time, O(log t) space. 4 2 1 1 1 1 2 1 1 1 1 1 1 1 1 42 4 2 1 2 1 1 1 1 1 2 4 Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 9
  • 10. Step 1: Efficient computation Keep aggregates for time intervals of length {1, 1, 2, 4, 8, . . . , 2m}. Insert into the leftmost aggregation interval. Aggregate as cumulative sum from the left using 1 + n i=0 2i = 2n+1 Computation is ∞ n=1 n · 2−n = O(1) amortized time, O(log t) space. 2 2 8 1 1 1 1 1 1 1 1 421 8 8 8 4 4 4 2 4 2 1 1 1 1 1 1 42 4 2 1 1 1 1 1 2 4 8 8 8 8 Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 10
  • 11. Step 1: Efficient computation Keep aggregates for time intervals of length {1, 1, 2, 4, 8, . . . , 2m}. Insert into the leftmost aggregation interval. Aggregate as cumulative sum from the left using 1 + n i=0 2i = 2n+1 Computation is ∞ n=1 n · 2−n = O(1) amortized time, O(log t) space. 2 2 8 1 1 1 1 1 1 1 1 421 8 8 8 4 4 4 2 4 2 1 1 1 1 1 1 42 4 2 1 1 1 1 1 2 4 8 8 8 8 Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 11
  • 12. Step 2: Folding over M ∈ Rd×n matrix d hash functions n bins hash h1 M11 M12 M13 M14 M15 M16 . . . M1n hash h2 M21 M22 M23 M24 M25 M26 . . . M2n hash h3 M31 M32 M33 M34 M35 M36 . . . M3n x Mb is sketch with n = 2b bins. Mb−1 can obtained as Mb−1[i, j] = Mb[i, j] + Mb[i, j + 2b−1 ] by “folding over” the sketch Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 12
  • 13. Step 2: Efficient computation. Halve the size of the sketch every $2^t$ intervals. Computation costs $O(1)$ time and $O(\log t)$ space. [Figure: one interval at 16 bins, two intervals at 8 bins each, four intervals at 4 bins each; intervals numbered 1 to 7]
  • 14. Step 3: Resolution interpolation. Time aggregation reports a good estimate over a long time interval. Item aggregation reports a poor estimate over a short time interval. Treat both as marginals of the joint distribution: assume independence and interpolate. Torso and tail: with item-aggregated estimate $n_x$ and time-aggregated estimate $n_t$, the interpolated count is $\hat{n}_{xt} = \frac{n_x \cdot n_t}{n}$, where $n = \sum_t n_t = \sum_x n_x$. Head: sketch accuracy decreases with $e \cdot t$, so use the regular CountMin sketch whenever $\tilde{n}(x, t) > e \cdot t \cdot 2^{-b}$.
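The interpolation rule can be written as a small function. This is a sketch under assumptions: the function names are hypothetical, and the head threshold is passed in as a plain number standing for the slide's $e \cdot t \cdot 2^{-b}$ rather than being computed here.

```python
def interpolate(n_x, n_t, n):
    """Independence-based estimate: n_x * n_t / n (0.0 for an empty stream)."""
    return n_x * n_t / n if n > 0 else 0.0


def estimate(direct, n_x, n_t, n, threshold):
    """Pick the direct sketch count for head items, else interpolate.

    direct    : count for (x, t) read from the regular CountMin sketch
    threshold : stands for e * t * 2**(-b) from the slide (passed in as-is)
    """
    if direct > threshold:
        return float(direct)
    return interpolate(n_x, n_t, n)


head = estimate(direct=50, n_x=30, n_t=20, n=100, threshold=10)
tail = estimate(direct=2, n_x=30, n_t=20, n=100, threshold=10)
```

Here `head` keeps the direct count of 50, while `tail` falls back to the interpolated value 30 * 20 / 100.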
  • 15. Setup and Throughput. Data: web query data, 5-day sample (97.9M unique terms, 378.1M total); Wikipedia data (4.5M unique terms, 1291.5M total). [Figure: for each dataset, number of unique terms vs. term frequency on log-log axes] Configuration. Platform: 64-bit Linux, 4-core 2GHz x86, 16GB RAM, Gigabit network. Sketch setup: 4 hash functions, $2^{23}$ bins, $2^{11}$ aggregation intervals (7 days in 5-minute intervals). 3-gram interpolation: 12GB sketch with 3 hash functions and $2^{30}$ bins.
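The sketch dimensions quoted above can be sanity-checked with a little arithmetic. The 4-byte counter size is an assumption (the slides do not state it), chosen because it reproduces the 12GB figure.

```python
# 7 days of 5-minute intervals should fit into 2**11 aggregation intervals.
MINUTES_PER_DAY = 24 * 60
intervals = 7 * MINUTES_PER_DAY // 5
fits = intervals <= 2 ** 11

# Memory of the 3-gram sketch: 3 hash functions x 2**30 bins.
BYTES_PER_COUNTER = 4  # assumption, not stated on the slide
ngram_sketch_gib = 3 * 2 ** 30 * BYTES_PER_COUNTER / 2 ** 30

# Memory of one full-resolution time sketch: 4 hash functions x 2**23 bins.
time_sketch_mib = 4 * 2 ** 23 * BYTES_PER_COUNTER / 2 ** 20
```

Under that assumption, one full-resolution sketch with 4 hash functions and 2**23 bins takes 128 MiB, and the 3-gram sketch takes 12 GiB.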
  • 16. Setup and Throughput: Speed. Software: client-server system using ICE middleware, 1 server, 10 clients. Throughput per second: 50k inserts; 22k requests (time aggregation); 8.5k requests (resolution interpolation). Limiting factors: TCP/IP overhead (package query); memory latency (random access).
  • 17. Accuracy (aggregate absolute error $\hat{n} - n$)
  • 18. Accuracy (stratified absolute error $\hat{n} - n$)
  • 19. Sketching for Graphical Models. Goal: observe a stream of observations and estimate the joint probability in $O(1)$ time. CountMin is good for the head, but interpolation is better for the torso and tail. General strategy: take a Markov network with a junction tree, with cliques $\mathcal{C}$ and separator sets $\mathcal{S}$. Estimate counts for $x_C$ and $x_S$ with $C \in \mathcal{C}$ and $S \in \mathcal{S}$ to generate $\hat{p}(x) = n^{|\mathcal{S}| - |\mathcal{C}|} \prod_{C \in \mathcal{C}} n_{x_C} \prod_{S \in \mathcal{S}} n_{x_S}^{-1}$. Estimates are fast: they need only lookups in the CountMin sketch. There is no need to solve a convex program for graphical model inference. Markov chain. Unigrams: $p(abc) \approx n^{-3} \cdot \hat{n}_a \cdot \hat{n}_b \cdot \hat{n}_c$. Bigrams: $p(abc) \approx n^{-1} \cdot \frac{\hat{n}_{ab} \cdot \hat{n}_{bc}}{\hat{n}_b}$. Use backoff smoothing (e.g. Kneser-Ney) in practice.
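The general junction-tree estimate can be sketched as a function of looked-up counts. The function name and the example counts below are made up for illustration; in the setting of the slides the counts would come from CountMin lookups.

```python
import math


def junction_tree_estimate(clique_counts, separator_counts, n):
    """p_hat = n**(|S| - |C|) * prod(clique counts) / prod(separator counts)."""
    p = float(n) ** (len(separator_counts) - len(clique_counts))
    for count in clique_counts:
        p *= count
    for count in separator_counts:
        if count == 0:
            return 0.0  # unseen separator: no evidence for this configuration
        p /= count
    return p


# Hypothetical counts for the chain a - b - c in a stream with n = 9 events.
n = 9
n_a, n_b, n_c = 3, 3, 2
n_ab, n_bc = 3, 2

# Unigram model: cliques {a}, {b}, {c}, no separators.
p_unigram = junction_tree_estimate([n_a, n_b, n_c], [], n)
# Bigram chain: cliques {a,b}, {b,c}, separator {b}.
p_bigram = junction_tree_estimate([n_ab, n_bc], [n_b], n)
```

Each estimate costs one lookup per clique and per separator, which is why no optimization step is needed at query time.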
  • 20. n-gram Interpolation. Trigram approximation on the Wikipedia dataset (1291.5M terms, 405M unique trigrams), reported as absolute error and relative error. Unigram approximation: $2.50 \cdot 10^7$ absolute, 0.266 relative. Bigram approximation: $1.22 \cdot 10^6$ absolute, 0.013 relative. Trigram sketching (CountMin): $8.35 \cdot 10^6$ absolute, 0.089 relative. Sketching trigrams is not accurate enough on the tail.
  • 21. Summary. Fast and simple algorithm to aggregate statistics of data streams. Effective compressed representation of the temporal data. Works well for graphical models. High-performance scalable implementation with $O(1)$ time access. Can be distributed over many servers. [Image: Katsushika Hokusai, The Great Wave off Kanagawa]