Small Summaries for
Big Data
Ke Yi
HKUST
yike@ust.hk
The case for “Big Data” in one slide
 “Big” data arises in many forms:
– Medical data: genetic sequences, time series
– Activity data: GPS location, social network activity
– Business data: customer behavior tracking at fine detail
– Physical Measurements: from science (physics, astronomy)
 The 3 V’s:
– Volume
– Velocity
– Variety
(Volume and Velocity are the focus of this talk)
Computational scalability
 The first (prevailing) approach: scale up the computation
 Many great technical ideas:
– Use many cheap commodity devices
– Accept and tolerate failure
– Move code to data
– MapReduce: BSP for programmers
– Break problem into many small pieces
– Decide which constraints to drop: noSQL, ACID, CAP
 Scaling up comes with its disadvantages:
– Expensive (hardware, equipment, energy), still not always fast
 This talk is not about this approach!
Downsizing data
 A second approach to computational scalability:
scale down the data!
– A compact representation of a large data set
– Too much redundancy in big data anyway
– What we finally want is small: human readable analysis / decisions
– Necessarily gives up some accuracy: approximate answers
– Often randomized (small constant probability of error)
– Examples: samples, sketches, histograms, wavelet transforms
 Complementary to the first approach: not a case of either-or
 Some drawbacks:
– Not a general purpose approach: need to fit the problem
– Some computations don’t allow any useful summary
Outline for the talk
 Some of the most important data summaries
– Sketches: Bloom filter, Count-Min, Count-Sketch (AMS)
– Counters: Heavy hitters, quantiles
– Sampling: simple samples, distinct sampling
– Summaries for more complex objects: graphs and matrices
 Current trends and future challenges for data summarization
 Many abbreviations and omissions (histograms, wavelets, ...)
 A lot of work relevant to compact summaries
– In both the database and the algorithms literature
Summary Construction
 There are several different models for summary construction
– Offline computation: e.g. sort data, take percentiles
– Streaming: summary merged with one new item each step
– Full mergeability: allow arbitrary merges of partial summaries
 The most general and widely applicable category
 Key methods for summaries:
– Create an empty summary
– Update with one new tuple: streaming processing
 Insert and delete
– Merge summaries together: distributed processing (e.g. MapReduce)
– Query: may tolerate some approximation (parameterized by 𝜀)
 Several important cost metrics (as function of 𝜀, 𝑛):
– Size of summary, time cost of each operation
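As an illustrative Python rendering of this generic interface (the names are ours, not from any particular library):

```python
from abc import ABC, abstractmethod

class Summary(ABC):
    """The generic summary interface: create, update, merge, query."""

    @abstractmethod
    def update(self, item):
        """Streaming: fold one new tuple into the summary."""

    @abstractmethod
    def merge(self, other):
        """Distributed: combine with another partial summary."""

    @abstractmethod
    def query(self, *args):
        """Return an approximate answer, parameterized by eps."""
```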
Sketches
Bloom Filters [Bloom 1970]
 Bloom filters compactly encode set membership
– E.g. store a list of many long URLs compactly
– 𝑘 random hash functions map each item to 𝑘 positions in an 𝑚-bit vector
– Update: Set all 𝑘 entries to 1 to indicate item is present
– Query: Is 𝑥 in the set?
 Return “yes” if all 𝑘 bits 𝑥 maps to are 1.
 Duplicate insertions do not change Bloom filters
 Can be merged by OR-ing vectors (of same size)
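A minimal Bloom filter sketch in Python (salted SHA-256 standing in for the 𝑘 random hash functions; real implementations use faster hashes):

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k          # m bits, k hash functions
        self.bits = [0] * m

    def _hashes(self, x):
        # k salted hashes into [0, m); not truly random, but works well in practice
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def insert(self, x):
        for pos in self._hashes(x):
            self.bits[pos] = 1

    def query(self, x):
        # "yes" may be a false positive; "no" is always correct
        return all(self.bits[pos] for pos in self._hashes(x))

    def merge(self, other):
        # OR-ing two filters with the same size and hash functions
        assert (self.m, self.k) == (other.m, other.k)
        self.bits = [a | b for a, b in zip(self.bits, other.bits)]
```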
Bloom Filters Analysis
 If item is in the set, then always return “yes”
– No false negative
 If item is not in the set, may return “yes” with some probability
– Prob. that a bit is 0: (1 − 1/𝑚)^{𝑘𝑛}
– Prob. that all 𝑘 bits are 1: (1 − (1 − 1/𝑚)^{𝑘𝑛})^𝑘 ≈ exp(𝑘 ln(1 − 𝑒^{−𝑘𝑛/𝑚}))
– Setting 𝑘 = (𝑚/𝑛) ln 2 minimizes this probability, to 0.6185^{𝑚/𝑛}
 For false positive prob. 𝛿, a Bloom filter needs 𝑂(𝑛 log(1/𝛿)) bits
Bloom Filters Applications
 Bloom Filters widely used in “big data” applications
– Many problems require storing a large set of items
 Can generalize to allow deletions
– Swap bits for counters: increment on insert, decrement on delete
– If no duplicates, small counters suffice: 4 bits per counter
 Bloom Filters are still an active research area
– Often appear in networking conferences
– Also known as “signature files” in DB
Invertible Bloom Lookup Table [GM11]
 A summary that
– Supports insertions and deletions
– When # items ≤ 𝑡, can retrieve all items
– Also known as sparse recovery
 Easy case when 𝑡 = 1
– Assume all items are integers
– Just keep 𝑧 = the bitwise-XOR of all items
 To insert 𝑥: 𝑧 ← 𝑧 ⊕ 𝑥
 To delete 𝑥: 𝑧 ← 𝑧 ⊕ 𝑥
– Assumption: no duplicates; can’t delete a non-existing item
 A problem: How do we know when # items = 1?
– Just keep another counter
Invertible Bloom Lookup Table
 Structure is the same as a Bloom filter, but each cell
– is the XOR of all items mapped here
– also keeps a counter of such items
 How to recover?
– First check if any cell has only 1 item
– Recover the item if there is one
– Delete the item from all cells, and repeat
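A minimal IBLT sketch in Python (integer items, no duplicates, salted SHA-256 as stand-in hash functions; `recover` implements the peeling described above):

```python
import hashlib

class IBLT:
    """Each cell keeps (XOR of items, count of items) mapped to it."""
    def __init__(self, cells, k=3):
        self.k = k
        self.xor = [0] * cells
        self.cnt = [0] * cells

    def _cells(self, x):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
            yield int(h, 16) % len(self.xor)

    def insert(self, x):
        for c in self._cells(x):
            self.xor[c] ^= x
            self.cnt[c] += 1

    def delete(self, x):
        for c in self._cells(x):
            self.xor[c] ^= x
            self.cnt[c] -= 1

    def recover(self):
        # Peel any cell holding exactly one item; repeat until no progress
        items, progress = [], True
        while progress:
            progress = False
            for c in range(len(self.xor)):
                if self.cnt[c] == 1:
                    items.append(self.xor[c])
                    self.delete(self.xor[c])
                    progress = True
        return items
```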
Invertible Bloom Lookup Table
 Can show that
– Using 1.5𝑡 cells, can recover ≤ 𝑡 items with high probability
Count-Min Sketch [CM 04]
 The Count-Min sketch encodes item counts
– Allows estimation of frequencies (e.g. for selectivity estimation)
– Some similarities in appearance to Bloom filters
 Model input data as a vector 𝑥 of dimension 𝑈
– E.g., [0. . 𝑈] is the space of all IP addresses, 𝑥[𝑖] is the frequency
of IP address 𝑖 in the data
– Create a small summary as a 𝑑 × 𝑤 array 𝐶𝑀[𝑖, 𝑗]
– Use 𝑑 hash functions to map vector entries to [1..𝑤]
Count-Min Sketch Structure
 Update: each entry in vector 𝑥 is mapped to one bucket per row.
 Merge two sketches by entry-wise summation
 Query: estimate 𝑥[𝑖] by taking min_𝑘 𝐶𝑀[𝑘, ℎ_𝑘(𝑖)]
– Never underestimates
– Will bound the error of overestimation
[Diagram: an update (𝑖, +𝑐) adds 𝑐 to bucket 𝐶𝑀[𝑘, ℎ_𝑘(𝑖)] in each of the 𝑑 rows; each row has width 𝑤 = 2/𝜀]
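A minimal Count-Min sketch in Python (salted SHA-256 as a stand-in for the 𝑑 hash functions; per the analysis below, take 𝑤 = 2/𝜀 and 𝑑 = log(1/𝛿)):

```python
import hashlib

class CountMin:
    def __init__(self, w, d):
        self.w, self.d = w, d                  # w = 2/eps, d = log(1/delta)
        self.CM = [[0] * w for _ in range(d)]

    def _h(self, k, i):
        return int(hashlib.sha256(f"{k}:{i}".encode()).hexdigest(), 16) % self.w

    def update(self, i, c=1):
        # add c to one bucket per row
        for k in range(self.d):
            self.CM[k][self._h(k, i)] += c

    def query(self, i):
        # never underestimates x[i]
        return min(self.CM[k][self._h(k, i)] for k in range(self.d))

    def merge(self, other):
        # entry-wise summation of two sketches with the same parameters
        for k in range(self.d):
            for j in range(self.w):
                self.CM[k][j] += other.CM[k][j]
```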
Count-Min Sketch Analysis
 Focusing on first row:
– 𝑥[𝑖] is always added to 𝐶𝑀[1, ℎ₁(𝑖)]
– 𝑥[𝑗], 𝑗 ≠ 𝑖, is added to 𝐶𝑀[1, ℎ₁(𝑖)] with prob. 1/𝑤
– The expected error from 𝑥[𝑗] is 𝑥[𝑗]/𝑤
– The total expected error is Σ_{𝑗≠𝑖} 𝑥[𝑗]/𝑤 ≤ ||𝑥||₁/𝑤
– By the Markov inequality, Pr[error > 2||𝑥||₁/𝑤] < 1/2
 By taking the minimum of 𝑑 rows, this prob. is (1/2)^𝑑
 To give an 𝜀||𝑥||₁ error with prob. 1 − 𝛿, the sketch needs size (2/𝜀) × log(1/𝛿)
Generalization: Sketch Structures
 Sketch is a class of summary that is a linear transform of input
– Sketch(𝑥) = 𝑆𝑥 for some matrix 𝑆
– Hence, Sketch(𝛼𝑥 + 𝛽𝑦) = 𝛼 Sketch(𝑥) + 𝛽 Sketch(𝑦)
– Trivial to update and merge
 Often describe 𝑆 in terms of hash functions
 Analysis relies on properties of the hash functions
– Some analyses assume truly random hash functions (e.g., Bloom filter)
 These don’t exist, but ad hoc hash functions work well for many real-world data sets
– Others need hash functions with limited independence.
Count Sketch / AMS Sketch [CCF-C02, AMS96]
 Second hash function: 𝑔_𝑘 maps each 𝑖 to +1 or −1
 Update: each entry in vector 𝑥 is mapped to one bucket per row.
 Merge two sketches by entry-wise summation
 Query: estimate 𝑥[𝑖] by taking median_𝑘 𝐶[𝑘, ℎ_𝑘(𝑖)] ⋅ 𝑔_𝑘(𝑖)
– May either overestimate or underestimate
[Diagram: an update (𝑖, +𝑐) adds 𝑐·𝑔_𝑘(𝑖) to bucket ℎ_𝑘(𝑖) in each of the 𝑑 rows of width 𝑤]
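A matching Count Sketch in Python; the only changes from the Count-Min sketch above are the sign hash 𝑔_𝑘 and the median (same illustrative salted-hash assumption as before):

```python
import hashlib
from statistics import median

class CountSketch:
    def __init__(self, w, d):
        self.w, self.d = w, d
        self.C = [[0] * w for _ in range(d)]

    def _h(self, k, i):
        return int(hashlib.sha256(f"h{k}:{i}".encode()).hexdigest(), 16) % self.w

    def _g(self, k, i):
        # sign hash: +1 or -1
        return 1 if int(hashlib.sha256(f"g{k}:{i}".encode()).hexdigest(), 16) % 2 else -1

    def update(self, i, c=1):
        for k in range(self.d):
            self.C[k][self._h(k, i)] += c * self._g(k, i)

    def query(self, i):
        # each row is unbiased; the median over rows boosts confidence
        return median(self.C[k][self._h(k, i)] * self._g(k, i) for k in range(self.d))
```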
Count Sketch Analysis
 Focusing on first row:
– 𝑥[𝑖]𝑔₁(𝑖) is always added to 𝐶[1, ℎ₁(𝑖)]
 We return 𝐶[1, ℎ₁(𝑖)] ⋅ 𝑔₁(𝑖) = 𝑥[𝑖] + error
– 𝑥[𝑗]𝑔₁(𝑗), 𝑗 ≠ 𝑖, is added to 𝐶[1, ℎ₁(𝑖)] with prob. 1/𝑤
– The expected error from 𝑥[𝑗] is 𝑥[𝑗] E[𝑔₁(𝑖)𝑔₁(𝑗)]/𝑤 = 0
– E[total error] = 0, E[|total error|] ≤ ||𝑥||₂/√𝑤
– By the Chebyshev inequality, Pr[|total error| > 2||𝑥||₂/√𝑤] < 1/4
 By taking the median of 𝑑 rows, this prob. is (1/2)^{𝑂(𝑑)}
 To give an 𝜀||𝑥||₂ error with prob. 1 − 𝛿, the sketch needs size 𝑂((1/𝜀²) log(1/𝛿))
Count-Min Sketch vs Count Sketch
 Count-Min:
– Size: 𝑂(𝑤𝑑)
– Error: ||𝑥||₁/𝑤
 Count Sketch:
– Size: 𝑂(𝑤𝑑)
– Error: ||𝑥||₂/√𝑤
 Other benefits of Count Sketch:
– Unbiased estimator
– Can also be used to estimate 𝑥 ⋅ 𝑦 (join size)
 Count Sketch is better when ||𝑥||₂ < ||𝑥||₁/√𝑤
Application to Large Scale Machine Learning
 In machine learning, often have very large feature space
– Many objects, each with huge, sparse feature vectors
– Slow and costly to work in the full feature space
 “Hash kernels”: work with a sketch of the features
– Effective in practice! [Weinberger, Dasgupta, Langford, Smola, Attenberg ‘09]
 Similar analysis explains why:
– Essentially, not too much noise on the important features
Counter Based Summaries
Heavy hitters [MG82]
 Misra-Gries (MG) algorithm finds up to 𝑘 items that occur
more than 1/𝑘 fraction of the time in a stream
– Estimate their frequencies with additive error ≤ 𝑁/𝑘
– Equivalently, achieves 𝜀𝑁 error with 𝑂(1/𝜀) space
 Better than Count-Min but doesn’t support deletions
 Keep 𝑘 different candidates in hand. To add a new item:
– If item is monitored, increase its counter
– Else, if < 𝑘 items monitored, add new item with count 1
– Else, decrease all counts by 1
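A direct Python rendering of this algorithm (a dict holds the ≤ 𝑘 monitored items):

```python
def misra_gries(stream, k):
    counters = {}                       # at most k monitored items
    for x in stream:
        if x in counters:
            counters[x] += 1            # item already monitored
        elif len(counters) < k:
            counters[x] = 1             # room for a new item
        else:
            # decrement all counters; drop any that reach zero
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters                     # each count underestimates by at most N/(k+1)
```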
MG analysis
 𝑁 = total input size
 𝑀 = sum of counters in data structure
 Error in any estimated count is at most (𝑁 − 𝑀)/(𝑘 + 1)
– Estimated count a lower bound on true count
– Each decrement spread over (𝑘 + 1) items: 1 new one and 𝑘 in
MG
– Equivalent to deleting (𝑘 + 1) distinct items from stream
– At most (𝑁 − 𝑀)/(𝑘 + 1) decrement operations
– Hence, can have “deleted” (𝑁 − 𝑀)/(𝑘 + 1) copies of any item
– So estimated counts have at most this much error
Merging two MG summaries [ACHPZY12]
 Merging algorithm:
– Merge two sets of 𝑘 counters in the obvious way
– Take the (𝑘 + 1)-th largest counter 𝐶_{𝑘+1}, and subtract it from all
– Delete non-positive counters
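A sketch of this merge in Python (counters represented as dicts, as in the Misra-Gries code above):

```python
def merge_mg(c1, c2, k):
    # merge two sets of counters in the obvious way
    merged = dict(c1)
    for x, c in c2.items():
        merged[x] = merged.get(x, 0) + c
    # subtract the (k+1)-th largest counter from all; drop non-positives
    counts = sorted(merged.values(), reverse=True)
    ck1 = counts[k] if len(counts) > k else 0
    return {x: c - ck1 for x, c in merged.items() if c - ck1 > 0}
```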
Merging two MG summaries
 This algorithm gives mergeability:
– Merge subtracts at least (𝑘 + 1)𝐶_{𝑘+1} from the counter sums
– So (𝑘 + 1)𝐶_{𝑘+1} ≤ 𝑀₁ + 𝑀₂ − 𝑀₁₂
 Sum of remaining (at most 𝑘) counters is 𝑀₁₂
– By induction, the error is
((𝑁₁ − 𝑀₁) + (𝑁₂ − 𝑀₂) + (𝑀₁ + 𝑀₂ − 𝑀₁₂))/(𝑘 + 1) = ((𝑁₁ + 𝑁₂) − 𝑀₁₂)/(𝑘 + 1)
Quantiles (order statistics)
 Exact quantiles: 𝐹⁻¹(𝜙) for 0 < 𝜙 < 1, where 𝐹 is the CDF
 Approximate version: tolerate any answer between 𝐹⁻¹(𝜙 − 𝜀) … 𝐹⁻¹(𝜙 + 𝜀)
 Dual problem, rank estimation: given 𝑥, estimate 𝐹(𝑥)
– Can use binary search to find quantiles
Quantiles give an equi-height histogram
 Automatically adapts to skewed data distributions
 Equi-width histograms (fixed binning) are easy to construct but do not adapt to the data distribution
The GK Quantile Summary [GK01]
 Maintain a list of nodes, each storing a value and its rank
– While every element is stored, we know the exact rank of any element; space: 𝑂(#nodes)
– An insertion in the middle affects the ranks of all later nodes
 Deleting a node introduces errors
– E.g., after deleting the node for value 14, is rank(13) = 2 or 3? Each surviving node therefore stores rank bounds, e.g., value 13 with rank 2-3
– An insertion itself does not introduce errors: existing error is retained within its interval
 One comes, one goes
– New elements are inserted immediately
– Try to delete one node after each insertion: remove the node with the smallest error, if that error is smaller than 𝜀𝑁
 Query: e.g., for rank(20), return rank(20) = 4; the error is at most 1
The GK Quantile Summary
 The practical version
– Space: Open problem!
– Small in practice
 A more complicated version achieves 𝑂((1/𝜀) log(𝜀𝑁)) space
 Doesn’t support deletions
 Doesn’t (don’t know how to) support merging
 There is a randomized mergeable quantile summary
A Quantile Summary Supporting Deletions [WLYC13]
 Assumption: all elements are integers from [0..U]
 Use a dyadic decomposition
[Diagram: dyadic decomposition of [0..U]: each node counts the elements in its interval, e.g., the number of elements between U/2 and U−1; rank(x) is then the sum of the counts of the O(log U) dyadic intervals covering [0, x)]
A Quantile Summary Supporting Deletions
 Consider each layer as a frequency vector, and build a sketch
– Count Sketch is better than Count-Min due to unbiasedness
Space: 𝑂((1/𝜀) log^{1.5} 𝑈)
[Diagram: one Count Sketch (CS) per dyadic level]
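To make the dyadic structure concrete, here is an illustrative Python version that keeps exact counts per level; the actual summary of [WLYC13] replaces each level's counter vector with a Count Sketch to save space:

```python
import math

class DyadicRank:
    """Exact dyadic counters over [0..U-1], U a power of 2 (illustration only)."""
    def __init__(self, U):
        self.levels = int(math.log2(U))
        self.count = [dict() for _ in range(self.levels + 1)]

    def update(self, x, c=1):
        # insert with c=+1, delete with c=-1
        for lvl in range(self.levels + 1):
            cell = x >> lvl              # dyadic interval containing x at this level
            self.count[lvl][cell] = self.count[lvl].get(cell, 0) + c

    def rank(self, x):
        # [0, x) decomposes into one dyadic interval per set bit of x
        r = 0
        for lvl in range(self.levels + 1):
            if (x >> lvl) & 1:
                r += self.count[lvl].get((x >> lvl) - 1, 0)
        return r
```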
Random Sampling
Random Sample as a Summary
 Cost
– A random sample is cheaper to build if you have random access
to data
 Error-size tradeoff
– Specialized summaries often have size 𝑂(1/𝜀)
– A random sample needs size 𝑂(1/𝜀²)
 Generality
– Specialized summaries can only answer one type of query
– Random sample can be used to answer different queries
 In particular, ad hoc queries that are unknown beforehand
Sampling Definitions
 Sampling without replacement
– Randomly draw an element
– Don’t put it back
– Repeat s times
 Sampling with replacement
– Randomly draw an element
– Put it back
– Repeat s times
 Coin-flip sampling
– Sample each element with probability 𝑝
– Sample size changes for dynamic data
The statistical difference between them is very small, for 𝑁 ≫ 𝑠 ≫ 1
Reservoir Sampling
 Given a sample of size 𝑠 drawn from a data set of size 𝑁
(without replacement)
 A new element is added; how do we update the sample?
 Algorithm
– 𝑁 ← 𝑁 + 1
– With probability 𝑠/𝑁, use it to replace an item in the
current sample chosen uniformly at random
– With probability 1 − 𝑠/𝑁, throw it away
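A direct Python rendering:

```python
import random

def reservoir_sample(stream, s):
    sample, N = [], 0
    for x in stream:
        N += 1
        if len(sample) < s:
            sample.append(x)                  # the first s items are always kept
        elif random.random() < s / N:
            sample[random.randrange(s)] = x   # replace a uniformly random item
        # otherwise: throw the new item away
    return sample
```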
Reservoir Sampling Correctness Proof
 Many “proofs” found online are actually wrong
– They only show that each item is sampled with probability 𝑠/𝑁
– Need to show that every subset of size 𝑠 has the same
probability to be the sample
 The correct proof relates to the Fisher-Yates shuffle
[Diagram: the reachable sample states correspond to prefixes of Fisher-Yates shuffles of the items a, b, c, d, shown for s = 2]
Merging random samples
 To merge a sample 𝑆₁ of 𝑁₁ items with a sample 𝑆₂ of 𝑁₂ items into a single sample of size 𝑠:
– With prob. 𝑁₁/(𝑁₁ + 𝑁₂), move a random item of 𝑆₁ into the merged sample and decrement 𝑁₁
– With prob. 𝑁₂/(𝑁₁ + 𝑁₂), move a random item of 𝑆₂ into the merged sample and decrement 𝑁₂
– Repeat until 𝑠 items have been chosen: the result is a sample of the combined data set of 𝑁₃ = 𝑁₁ + 𝑁₂ items
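A minimal sketch of this merge in Python (assuming both input samples have size at least 𝑠):

```python
import random

def merge_samples(S1, N1, S2, N2, s):
    """Merge without-replacement samples S1 (of N1 items) and S2 (of N2 items)."""
    S1, S2 = list(S1), list(S2)
    random.shuffle(S1)
    random.shuffle(S2)
    merged = []
    while len(merged) < s:
        # draw from S1 or S2 in proportion to the remaining populations
        if random.random() < N1 / (N1 + N2):
            merged.append(S1.pop())
            N1 -= 1
        else:
            merged.append(S2.pop())
            N2 -= 1
    return merged
```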
Random Sampling with Deletions
 Naïve method:
– Maintain a sample of size 𝑀 > 𝑠
– Handle insertions as in reservoir sampling
– If the deleted item is not in the sample, ignore it
– If the deleted item is in the sample, delete it from the sample
– Problem: the sample size may drop to below 𝑠
 Some other algorithms in the DB literature but sample size
still drops when there are many deletions
Min-wise Hashing / Sampling
 For each item 𝑥, compute ℎ(𝑥)
 Return the 𝑠 items with the smallest values of ℎ(𝑥) as the
sample
– Assumption: ℎ is a truly random hash function
– A lot of research on how to remove this assumption
𝐿₀-Sampling [FIS05, CMR05]
 Suppose ℎ(𝑥) ∈ [0..𝑢 − 1] and always has log 𝑢 bits (pad leading zeroes when necessary)
 Map all items to level 0
 Map items that have one leading zero in ℎ(𝑥) to level 1
 Map items that have two leading zeroes in ℎ(𝑥) to level 2
 …
 Use a 2𝑠-sparse recovery summary for each level
[Diagram: items reach level 𝑗 with probability 2^{−𝑗}, from 𝑝 = 1 at level 0 down to 𝑝 = 1/𝑢; one 2𝑠-sparse recovery summary per level]
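A small sketch of the level assignment in Python (salted hashing as a stand-in for a truly random ℎ); each level then keeps its own 2𝑠-sparse recovery summary, such as the IBLT above, and a sample is drawn from a level whose summary successfully decodes:

```python
import hashlib

def level(x, log_u):
    """Level of item x: the number of leading zeros in its log_u-bit hash."""
    h = int(hashlib.sha256(str(x).encode()).hexdigest(), 16) % (1 << log_u)
    bits = format(h, "0{}b".format(log_u))   # pad to log_u bits with leading zeros
    return len(bits) - len(bits.lstrip("0"))
```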
Estimating Set Similarity with MinHash
 Assumption: The sample from 𝐴 and the sample from 𝐵 are
obtained by the same ℎ
 Pr[(an item randomly sampled from 𝐴 ∪ 𝐵) ∈ 𝐴 ∩ 𝐵] = 𝐽(𝐴, 𝐵)
 To estimate 𝐽(𝐴, 𝐵):
– Merge Sample(𝐴) and Sample(𝐵) to get Sample(𝐴 ∪ 𝐵) of size 𝑠
– Count how many items in Sample(𝐴 ∪ 𝐵) are in 𝐴 ∩ 𝐵
– Divide by 𝑠
(Here 𝐽(𝐴, 𝐵) = |𝐴 ∩ 𝐵| / |𝐴 ∪ 𝐵| is the Jaccard similarity.)
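A minimal sketch in Python (SHA-256 standing in for the shared truly random ℎ):

```python
import hashlib

def h(x):
    # shared hash function; must be the same for both sets
    return int(hashlib.sha256(str(x).encode()).hexdigest(), 16)

def minhash_sample(S, s):
    # the s items of S with the smallest hash values
    return set(sorted(S, key=h)[:s])

def jaccard_estimate(sample_a, sample_b, s):
    # the s smallest hashes of A ∪ B can be found from the two samples alone
    union_sample = sorted(sample_a | sample_b, key=h)[:s]
    # such an item lies in A ∩ B iff it appears in both samples
    hits = sum(1 for x in union_sample if x in sample_a and x in sample_b)
    return hits / s
```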
Advanced Summaries
Geometric data: 𝜀-approximations
 A sample that preserves the “density” of point sets
– For any range (e.g., a circle),
|fraction of sample points − fraction of all points| ≤ 𝜀
– Can simply draw a random sample of size 𝑂(1/𝜀²)
– More careful construction yields size 𝑂(1/𝜀^{2𝑑/(𝑑+1)})
Geometric data: 𝜀-kernels
 𝜀-kernels approximately preserve the convex hull of points
– An 𝜀-kernel has size 𝑂(1/𝜀^{(𝑑−1)/2})
– A streaming 𝜀-kernel has size 𝑂(1/𝜀^{(𝑑−1)/2} log(1/𝜀))
Graph Sketching [AGM12]
 Goal: Build a sketch for each vertex and use it to answer queries
 Connectivity: want to test if there is a path between any two nodes
– Trivial if edges can only be inserted; want to support deletions
 Basic idea: repeatedly contract edges between components
– Use L0 sampling to get edges from vector of adjacencies
– The L0 sampling sketch supports deletions and merges
 Problem: as components grow, sampling edges from a component most likely produces internal links
Graph Sketching
 Idea: use clever encoding of edges
 Encode edge (i,j), i<j, as ((i,j),+1) in node i’s vector and as ((i,j),−1) in node j’s vector
 When node i and node j get merged, sum their L0 sketches
– Contribution of edge (i,j) exactly cancels out
– Only non-internal edges remain in the L0 sketches
 Use independent sketches for each iteration of the algorithm
– Only need O(log n) rounds with high probability
 Result: O(poly-log n) space per node for connectivity
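A toy illustration of the cancellation in Python, with plain dicts standing in for the L0 sketches (legitimate because sketches are linear, so summing sketches corresponds to summing these vectors):

```python
def node_vector(node, edges):
    # edges are pairs (i, j) with i < j: +1 entry at the smaller endpoint, -1 at the larger
    v = {}
    for (i, j) in edges:
        if node == i:
            v[(i, j)] = v.get((i, j), 0) + 1
        elif node == j:
            v[(i, j)] = v.get((i, j), 0) - 1
    return v

def add_vectors(u, v):
    # summing node vectors: entries of internal edges cancel exactly
    w = dict(u)
    for e, c in v.items():
        w[e] = w.get(e, 0) + c
    return {e: c for e, c in w.items() if c != 0}

edges = [(1, 2), (2, 3)]
merged = add_vectors(node_vector(1, edges), node_vector(2, edges))
print(merged)   # {(2, 3): 1} -- the internal edge (1, 2) has cancelled out
```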
Other Graph Results via sketching
 Recent flurry of activity in summaries for graph problems
– 𝑘-connectivity via connectivity
– Bipartiteness via connectivity
– (Weight of the) minimum spanning tree
– Sparsification: find G’ with few edges so that cut(G,C) ≈ cut(G’,C)
– Matching: find a maximal matching (assuming it is small)
 Total cost is typically O(|V|), rather than O(|E|)
– Semi-streaming / semi-external model
Matrix Sketching
 Given matrices A, B, want to approximate matrix product AB
– Measure the normed error of the approximation C: ǁAB − Cǁ
 Main results are for the Frobenius (entrywise) norm ǁ⋅ǁF
– ǁCǁF = (Σ_{i,j} C_{i,j}²)^{1/2}
– Results rely on sketches, so this entrywise norm is most natural
Direct Application of Sketches
 Build AMS sketch of each row of A (Ai), each column of B (Bj)
 Estimate Ci,j by estimating inner product of Ai with Bj
– Absolute error in each estimate is 𝜀ǁA_iǁ₂ǁB_jǁ₂ (whp)
– Summed over all entries of the matrix, the squared error is 𝜀²ǁAǁF²ǁBǁF², i.e., ǁAB − CǁF ≤ 𝜀ǁAǁFǁBǁF
 Outline formalized & improved by Clarkson & Woodruff [09,13]
– Improve running time to linear in number of non-zeros in A,B
Compressed Matrix Multiplication
 What if we are just interested in the large entries of AB?
– Or, the ability to estimate any entry of (AB)
– Arises in recommender systems, other ML applications
 If we had a sketch of (AB), could find these approximately
 Compressed Matrix Multiplication [Pagh 12]:
– Can we compute sketch(AB) from sketch(A) and sketch(B)?
– To do this, need to dive into structure of the Count (AMS) sketch
 Several insights needed to build the method:
– Express matrix product as summation of outer products
– Take convolution of sketches to get a sketch of outer product
– New hash function enables this to proceed
– Use the FFT to speed up from O(w²) to O(w log w)
More Linear Algebra
 Matrix multiplication improvement: use more powerful hash fns
– Obtain a single accurate estimate with high probability
 Linear regression: given matrix A and vector b,
find x ∈ ℝ^d to (approximately) solve min_x ǁAx − bǁ
– Approach: solve the minimization in “sketch space”
– From a summary of size O(d²/𝜀) [independent of the number of rows of A]
 Frequent directions: approximate matrix-vector product
[Ghashami, Liberty, Phillips, Woodruff 15]
– Use the SVD to (incrementally) summarize matrices
 The relevant sketches can be built quickly: proportional to the
number of nonzeros in the matrices (input sparsity)
– Survey: Sketching as a tool for linear algebra [Woodruff 14]
Current Directions in Data Summarization
 Sparse representations of high dimensional objects
– Compressed sensing, sparse fast Fourier transform
 General purpose numerical linear algebra for (large) matrices
– k-rank approximation, linear regression, PCA, SVD, eigenvalues
 Summaries to verify full calculation: a ‘checksum for computation’
 Geometric (big) data: coresets, clustering, machine learning
 Use of summaries in large-scale, distributed computation
– Build them in MapReduce, Continuous Distributed models
 Communication-efficient maintenance of summaries
– As the (distributed) input is modified
Small Summary of Small Summaries
 Two complementary approaches in response to growing data sizes
– Scale the computation up; scale the data down
 The theory and practice of data summarization has many guises
– Sampling theory (since the start of statistics)
– Streaming algorithms in computer science
– Compressive sampling, dimensionality reduction… (maths, stats, CS)
 Continuing interest in applying and developing new theory
Thank you!