In the analysis of big data there are often problem queries that don’t scale because they require huge compute resources to generate exact results, or don’t parallelize well.
A Production Quality Sketching Library for the Analysis of Big Data
1.
2. A Production Quality Sketching
Library for the Analysis of Big Data
Lee Rhodes
Distinguished Architect, Verizon Media (Yahoo), Inc.
3. Outline
Problematic Queries of Big Data
Where traditional analysis methods don’t work well
Approximate Analysis
Using Sketches
How using stochastic processes and probabilistic
analysis wins in a systems architecture context
The Open Source Apache
DataSketches Library
A quick overview of this unique library
dedicated to production systems that
process big data
4. The Data Analysis Challenge …
… Analyze This Data in Near-Real Time
Time
Stamp
User
ID
Device
ID
Site Time Spent
Sec
Items
Viewed
9:00 AM U1 D1 Apps 59 5
9:30 AM U2 D2 Apps 179 15
10:00 AM U3 D3 Music 29 3
1:00 PM U1 D4 Music 89 10
Billions of Rows or Key, Value Pairs …
Example: Web Site Logs
5. Some Very Common Queries …
Unique Identifiers
with Set Expressions:
(AUB) ∩ (CUD) E
Quantiles, CDFs
Histograms, PMFs
Frequent Items /
Heavy Hitters
Vector & Matrix
Operations:
SVD, etc.
Graph Analysis
Uniform
Weighted
Reservoir Sampling
5 ⋯ 2
⋮ ⋱ ⋮
4 ⋯ 3
All Non-Additive
6. When The Data Gets Large Or Resources Are Limited,
All Of These Queries Become Problematic,
Because the Aggregations are Non-Additive.
A
B
Result A
Result B
+ Combined
Result
Current
Result
New
Item
Updated
Result+
7. Col1, ..., Item, ..., Coln
… Billions of rows ...
Ω(u) size: ~ Big Data
Exact
Results
Difficult
Query
Local Item Copies
Query processing
often requires sorting
…which is very slow.
Query Engine
Traditional, Exact Analysis Methods Require Local
Copies
Note: Micro-batch “Streaming Platforms”, e.g., Storm
Do Not Solve The Fundamental Problem!
Big Data
8. Parallelization Does Not Help Much
Because of Non-Additivity.
You have to keep the copies somewhere!
Col1, ..., Item, ..., Coln
… Billions of rows ...
𝝨
Exact
Results
Expensive Shuffle
Copy
Copy
Copy
Copy
Copy
Copy
Example: Map-Reduce
9. Every dataset is processed N times for a rolling N-day window!
Traditional, Exact Time Windowing
Requires Multiple Touches of Every Item
10. Let’s challenge a fundamental premise:
… that our results must be exact!
If we can allow for approximation,
along with some accuracy guarantees,
we can achieve orders-of-magnitude improvement in
• speed and
• reduction of resources.
11. Introducing the Sketch (a.k.a., Stochastic Streaming Algorithm)
Stream
Processor
Data
Structure
Size = f(k)
Query
Processor
Stochastic Process
Merge / Set
Operations
Results +/- ε
𝛆 = f(1/k)
Probabilistic Analysis
Result Sketch
Sketch
Stream
Data
Stream
Query
Model the Problem as a Stochastic Process
Analyze using Probability & Statistics
Random Selection Sizing, Storing
12. How & Why Sketches Achieve Superior Performance
For Systems Processing Massive Data
13. Major Sketch Properties
• Small Stored Size
• Sub-linear in Space
• Single-pass, “One-Touch”
• Data Insensitive
• Mergeable
• Approximate, Probabilistic
• Mathematically Proven Error Properties
Sub-linear
Stream Size
Linear
SketchSize
Sketching Sampling
Sketches Overlap
with Sampling
Based on the
Specific Sketch
14. Win #1: Small Query Space
O(k) size: ~ Kilobytes
Approximate
Answer ± ε
Difficult
Query
Minimal or
no sorting required!
Ideal for Streaming & Batch
Query Engine
Sketch
Col1, ..., Item, ..., Coln
… Billions of rows ...
Sketches Start Small
Sublinear Means they Stay Small
Single Pass Simplifies Processing
15. Win #2: Mergeability
Sketch Approx. ± ε
Sketch
Sketch
… , Item
… many rows ...
… , Item
… many rows ...
Query
Query
Partitions
Merge
Mergeability Enables Parallelism … With No Additional Loss of Accuracy!
Sketches Transform Non-Additive Metrics Into Additive Objects
The Result of a Sketch Merge is Another Sketch … Enabling Set Expressions for Selected Sketches
Col1, ..., Item, ..., Coln
… Billions of rows ...
19. Win #6: Lower System Cost ($)
Case Study: Real-time Flurry, Before and After
• Customers: >250K Mobile App Developers
• Data: 40-50 TB per day
• Platform: 2 clusters X 80 Nodes = 160 Nodes
– Node: 24 CPUs, 250GB RAM
Before Sketches After Sketches
VCS* / Mo. ~80B ~20B
Result Freshness
Daily: 2 to 8 hours; Weekly: ~3 days
Real-time Unique Counts Not Feasible
15 seconds!
Big Wins!
Near-Real Time
Lower System $
* VCS: Virtual Core Seconds
21. The DataSketches Team
Core Team / Committers
• Lee Rhodes, Distinguished Architect, Yahoo/VM*. Started internal DataSketches project 2012
• Alex Saydakov, Systems Developer, Yahoo/VM, joined 2015
• Jon Malkin, Ph.D., Research Engineer, Developer, Yahoo/VM, joined 2016
• Edo Liberty, Ph.D., Founder, HyperCube Technologies. Joined 2015
• Justin Thaler, Ph.D., Assistant Professor, Georgetown University, Computer Science. Joined 2015
• Roman Leventov, Systems Developer for Apache Druid, Metamarkets, joined 2018
• Eshcar Hillel, Ph.D., Sr Scientist, Yahoo/VM, Israel, joined 2018
Extended Team & Consultants
• Graham Cormode, Ph.D., Professor, University of Warwick, Computer Science, joined 2017
• Jelani Nelson, Ph.D., Professor, U.C. Berkeley, joined 2019
• Daniel Ting, Ph.D., Sr Scientist, Tableau / Salesforce, joined 2019
… And our Community is Growing!
* VM = Verizon Media
22. Our Mission…
Combine Deep Science with Exceptional Engineering
To Develop Production Quality Sketches
That Address These Difficult Queries
23. Cardinality, 4 Families
• HLL (on/off Heap): A very high performing implementation of this well-known sketch
• CPC: The best accuracy per space
• Theta Sketches: Set Expressions (e.g., Union, Intersection, Difference), on/off Heap
• Tuple Sketches: Generic, Associative Theta Sketches, multiple derived sketches:
Quantiles Sketches, 2 Families
• Quantiles, Histograms, PMF’s and CDF’s of streams of comparable objects, on/off Heap.
KLL, highly optimized for accuracy-space.
• Relative Error Quantiles (under development)
Frequent Items (Heavy-Hitters) Sketches, 2 Families
• Frequent Items: Weighted or Unweighted
• Frequent Directions: Approximate SVD (a Vector Sketch)
Sampling: Reservoir and VarOpt (Edith Cohen) Sketches, 2 Families
• Uniform and weighted sampling to fixed-k sized buckets
Specialty Sketches
• Customer Engagement, Frequent Distinct Tuples, Maps, etc.
The Apache DataSketches Library
Languages Supported:
• Java, C++, Python
• Binary Compatibility
24. Bright Future for Sketching Technology & Solutions
Items (words, IDs, events, clicks, …)
• Count Distinct
• Frequent Items, Heavy-Hitters, etc
• Quantiles, Ranks, PMFs, CDFs, Histograms
• Set Operations
• Sampling
• Mobile (IoT)
• Moment and Entropy Estimation
Graphs (Social Networks, Communications, …)
• Connectivity
• Cut Sparsification
• Weighted Matching
• …
Areas where we have sketch implementations
Areas of research (World-wide)
Vectors (text docs, images, features, …) &
Matrices (text corpora, recommendations, …)
• Dimensionality Reduction (SVD)
• Covariance Estimation
• Low Rank Approximation
• Sparsification
• Clustering (k-means, k-median, …)
• Linear Regression
• Machine Learning (in some areas)
• Density Estimation
25. THANK YOU!
Open Invitation for
Collaboration
Learn More About Apache DataSketches
Come And Visit Us!
https://datasketches.apache.org