This presentation covers what an algorithm is; what hashing is, along with various hashing algorithms and their applications; the approximate counting algorithm, the counting distinct elements problem, and the frequency estimation problem, each with their applications; and the research papers we referenced.
2. TABLE OF CONTENTS
01 Algorithm: What is an algorithm?
02 Hashing: What hashing is, along with algorithms for hashing
03 Approximate Counting Algorithm: Allows counting of large numbers of events using low memory
04 Counting Distinct Elements: Given an input stream of data, output the distinct elements in the stream
05 Frequency Estimation: Estimate the frequency of any item x, i.e. the number of occurrences of x
06 References: All the research papers referred to
4. What is an Algorithm?
An algorithm is a finite set of instructions that produces an output or a result. It tells the system what to do in order to achieve the desired result; the system may not know what the result is beforehand, but it knows how to arrive at one.
6. Hashing is a kind of algorithm that takes data of any size and converts it into data of fixed size. The main difference between hashing and encryption is that a hash is irreversible.
Hashing is most commonly used to implement hash tables. A hash table stores key/value pairs in the form of a list where any element can be accessed using its index.
Hashing is also used to protect stored data: passwords can be stored in the form of their hashes, so that even if a database is breached, plaintext passwords are not accessible. MD5, SHA-1 and SHA-2 are popular cryptographic hashes.
7. Hashing algorithms are functions that generate a fixed-length result (the hash, or hash value) from a given input. The hash value is a summary of the original data.
Definition:
A hash function is a function h: D -> R, where the domain D = {0,1}* and the range R = {0,1}^n for some n >= 1.
DESCRIPTION OF A HASH FUNCTION
In general, hash functions work as follows:
● The input message is divided into blocks.
● The hash of the first block, a value with a fixed size, is calculated.
● The hash of the second block is then calculated and combined with the previous output.
● This process is repeated until all blocks have been processed.
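The block-chaining scheme above can be sketched in a few lines. This is a toy illustration only (the block size, initial value, and mixing constant are arbitrary choices, and the result is not cryptographically secure); it just shows how variable-size input is folded, block by block, into a fixed-size state:

```python
def toy_block_hash(message: bytes, block_size: int = 8) -> int:
    """Toy illustration of block-chained hashing (NOT cryptographically secure).

    The message is split into fixed-size blocks; each block is mixed into
    the output of the previous step, as described above.
    """
    MASK = (1 << 32) - 1          # fixed 32-bit output size
    state = 0x12345678            # arbitrary initial value
    # Split the message into fixed-size blocks.
    blocks = [message[i:i + block_size] for i in range(0, len(message), block_size)]
    for block in blocks:
        for byte in block:
            # Mix each byte into the running state (FNV-style xor-multiply).
            state = ((state ^ byte) * 0x01000193) & MASK
    return state

print(hex(toy_block_hash(b"hello world")))  # always a 32-bit value
```

Note that the same input always yields the same 32-bit output, while inputs of any length are compressed to that fixed size.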
8. Properties of a good hash function:
● Unique hash values
● Hashing speed
● Secure hashing
Applications:
● Hash functions are widely used in IT.
● We can use them for digital signatures, message authentication codes (MACs), and other forms of authentication.
● We can also use them for indexing data in hash tables, for fingerprinting, identifying files, and detecting duplicates, or as checksums (we can detect whether a sent file suffered accidental or intentional data corruption).
● We can also use them for password storage.
10. 1. MD5 PSEUDOCODE
Step 1: //Define the per-round shift amounts r:
var int[64] r, k
r[ 0..15] := {7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22}
r[16..31] := {5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20}
r[32..47] := {4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23}
r[48..63] := {6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21}
Step 2: //Use the binary integer part of the sines of integers as constants:
for i from 0 to 63
    k[i] := floor(abs(sin(i + 1)) × 2^32)
//Initialize variables:
var int h0 := 0x67452301
var int h1 := 0xEFCDAB89
var int h2 := 0x98BADCFE
var int h3 := 0x10325476
Step 3: //Pre-processing:
append "1" bit to message
append "0" bits until message length in bits ≡ 448 (mod 512)
append bit length of original message as 64-bit little-endian integer to message
//Process the message in successive 512-bit chunks:
for each 512-bit chunk of message
    break chunk into sixteen 32-bit little-endian words w[i], 0 ≤ i ≤ 15
Step 4: //Initialize hash value for this chunk:
    var int a := h0
    var int b := h1
    var int c := h2
    var int d := h3
Step 5: //Main loop:
    for i from 0 to 63
        if 0 ≤ i ≤ 15 then
            f := (b and c) or ((not b) and d)
            g := i
        else if 16 ≤ i ≤ 31 then
            f := (d and b) or ((not d) and c)
            g := (5×i + 1) mod 16
        else if 32 ≤ i ≤ 47 then
            f := b xor c xor d
            g := (3×i + 5) mod 16
        else if 48 ≤ i ≤ 63 then
            f := c xor (b or (not d))
            g := (7×i) mod 16
        temp := d
        d := c
        c := b
        b := ((a + f + k[i] + w[g]) leftrotate r[i]) + b
        a := temp
Step 6: //Add this chunk's hash to the result so far:
    h0 := h0 + a
    h1 := h1 + b
    h2 := h2 + c
    h3 := h3 + d
var int digest := h0 append h1 append h2 append h3 //(expressed as little-endian)
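The pseudocode above (and the SHA variants that follow) can be checked against real implementations: Python's standard `hashlib` module ships MD5 and the SHA family, so no third-party code is needed:

```python
import hashlib

# Hash the same message with the three algorithms discussed here.
# "abc" is a standard test-vector input for these hash functions.
message = b"abc"

print(hashlib.md5(message).hexdigest())     # 128-bit digest as 32 hex chars
print(hashlib.sha1(message).hexdigest())    # 160-bit digest as 40 hex chars
print(hashlib.sha256(message).hexdigest())  # 256-bit digest as 64 hex chars
```

The digest lengths (128, 160, and 256 bits) match the fixed output sizes discussed above, regardless of the input length.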
11. 2. SHA-1 PSEUDOCODE
Step 1: Initialize all the variables:
    ml = message length in bits (always a multiple of the number of bits in a character)
Step 2: Pre-processing:
    append the bit '1' to the message (i.e. by adding 0x80 if characters are 8 bits)
    append '0' bits until the message length in bits ≡ 448 (mod 512)
    append ml as a 64-bit big-endian integer to the message
Step 3: Process the message in successive 512-bit chunks:
    break message into 512-bit chunks
    for each chunk
        break chunk into sixteen 32-bit big-endian words w[i], 0 ≤ i ≤ 15
Step 4: Extend the sixteen 32-bit words into eighty 32-bit words:
    for i from 16 to 79
        w[i] = (w[i-3] xor w[i-8] xor w[i-14] xor w[i-16]) leftrotate 1
Step 5: Initialize the hash value for this chunk (a = h0, b = h1, c = h2, d = h3, e = h4), then run the main loop:
    for i from 0 to 79
        if 0 ≤ i ≤ 19 then
            f = (b and c) or ((not b) and d)
            k = 0x5A827999
        else if 20 ≤ i ≤ 39 then
            f = b xor c xor d
            k = 0x6ED9EBA1
        else if 40 ≤ i ≤ 59 then
            f = (b and c) or (b and d) or (c and d)
            k = 0x8F1BBCDC
        else if 60 ≤ i ≤ 79 then
            f = b xor c xor d
            k = 0xCA62C1D6
        temp = (a leftrotate 5) + f + e + k + w[i]
        e = d
        d = c
        c = b leftrotate 30
        b = a
        a = temp
Step 6: Add this chunk's hash to the result so far:
    h0 = h0 + a
    h1 = h1 + b
    h2 = h2 + c
    h3 = h3 + d
    h4 = h4 + e
Step 7: Produce the final hash value (big-endian) as a 160-bit number
12. 3. SHA-2 (SHA-256) PSEUDOCODE
Step 1: Initialize hash values h0..h7:
    first 32 bits of the fractional parts of the square roots of the first 8 primes 2..19
Step 2: Initialize the array of round constants k[0..63]:
    first 32 bits of the fractional parts of the cube roots of the first 64 primes 2..311
Step 3: Pre-processing:
    append the bit '1' to the message
    append k bits '0', where k is the minimum number >= 0 such that the resulting message length (modulo 512 in bits) is 448
    append the length of the message (without the '1' bit or padding), in bits, as a 64-bit big-endian integer (this will make the entire post-processed length a multiple of 512 bits)
Step 4: Process the message in successive 512-bit chunks:
    break message into 512-bit chunks
    for each chunk
        create a 64-entry message schedule array w[0..63] of 32-bit words
        copy the chunk into the first 16 words w[0..15]
Step 5: Extend the first 16 words into the remaining 48 words w[16..63] of the message schedule array:
    for i from 16 to 63
        s0 := (w[i-15] rightrotate 7) xor (w[i-15] rightrotate 18) xor (w[i-15] rightshift 3)
        s1 := (w[i-2] rightrotate 17) xor (w[i-2] rightrotate 19) xor (w[i-2] rightshift 10)
        w[i] := w[i-16] + s0 + w[i-7] + s1
Step 6: Initialize the working variables to the current hash value (a = h0, ..., h = h7), then run the compression function main loop:
    for i from 0 to 63
        S1 := (e rightrotate 6) xor (e rightrotate 11) xor (e rightrotate 25)
        ch := (e and f) xor ((not e) and g)
        temp1 := h + S1 + ch + k[i] + w[i]
        S0 := (a rightrotate 2) xor (a rightrotate 13) xor (a rightrotate 22)
        maj := (a and b) xor (a and c) xor (b and c)
        temp2 := S0 + maj
        h := g
        g := f
        f := e
        e := d + temp1
        d := c
        c := b
        b := a
        a := temp1 + temp2
Step 7: Add the compressed chunk to the current hash value (h0 := h0 + a, ..., h7 := h7 + h)
Step 8: Produce the final hash value (big-endian):
    digest := h0 append h1 append h2 append h3 append h4 append h5 append h6 append h7
SHA-224 is identical to SHA-256 [11], except that the initial hash values h0 through h7 are different, and the output is constructed by omitting h7.
13. Algorithms and Limitations
1. SHA-1: Requires a lot of computing power and resources.
2. SHA-2: Increased collision resistance means SHA-256 and SHA-512 produce longer outputs (256 and 512 bits respectively) than SHA-1 (160 bits). Those defending the use of SHA-2 cite this increased output size as the reason behind its attack resistance.
3. SHA-3: Designed to be a good hash function, not a good password-hashing scheme (PHS), whereas bcrypt is designed to be a PHS and was analyzed in this direction as well.
4. MD5: Using salted MD5 for passwords is a bad idea: not because of MD5's cryptographic weaknesses, but because it is fast, which means an attacker can try billions of candidate passwords per second on a single GPU.
15. Overview and Origin
● The approximate counting algorithm, also known as the Morris algorithm, allows counting of large numbers of events using low memory.
● Invented by Robert Morris, it uses probabilistic counting to increment the counter.
● This algorithm is considered one of the precursors of current data-streaming algorithms.
● The basic idea is to track log n instead of n, using log log n bits instead of log n bits.
● The space complexity of this technique is O(log log n).
17. Working of the Algorithm
We need log2 n bits to store an integer between 1 and n; otherwise two integers would map to the same bitstring and be indistinguishable. But if we only care about recovering the integer up to a constant factor, then it suffices to recover log n, and storing log n only requires O(log log n) bits.
Consider the streaming problem: there is a stream of n increments. We would like to compute n, albeit approximately and with some small probability of failure. We could keep an explicit counter in memory and increment it after each stream update, but that would require log2 n bits. Morris's clever algorithm works as follows: initialize a counter c to 1, and after each update increment c with probability 1/2^c and do nothing otherwise. Philippe Flajolet showed that the expected value of 2^c is n + 2 after n updates, and thus 2^c − 2 is an unbiased estimator of n.
18. Applications
● The algorithm is useful in examining large data streams for patterns.
● It is particularly useful in data compression applications.
● Audio and image recognition.
● Artificial intelligence applications.
19. Morris's Counter:
1. Init():
    a. c ← 0
2. Update(item):
    a. Increment c with probability 2^−c
    b. Do nothing with probability 1 − 2^−c
3. Query():
    a. Return 2^c − 1
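The counter above can be sketched directly in Python. This is a minimal illustration of the pseudocode; `random.random()` stands in for the probabilistic coin flips:

```python
import random

class MorrisCounter:
    """Morris's approximate counter: stores roughly log log n bits instead of log n.

    A direct sketch of the Init/Update/Query pseudocode above.
    """

    def __init__(self):
        self.c = 0  # Init(): c <- 0

    def update(self):
        # Increment c with probability 2^-c; do nothing otherwise.
        if random.random() < 2.0 ** -self.c:
            self.c += 1

    def query(self):
        # 2^c - 1 is an unbiased estimate of the number of updates.
        return 2 ** self.c - 1

m = MorrisCounter()
for _ in range(10_000):
    m.update()
print(m.c, m.query())  # tiny stored counter, estimate on the order of 10,000
```

A single counter has high variance; in practice one averages several independent counters to tighten the estimate.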
20. 04 Counting Distinct Elements
Given an input stream of data, output the distinct elements in the data stream.
For example, count the number of distinct IP addresses you encounter.
21. What do you mean by Counting Distinct Elements?
Our first problem is to approximate the Fp-norm of the items in a stream.
Fp-norm: Let S be a multi-set, where every item i of S is in [N]. Let mi be the number of occurrences of item i in S. Then the Fp-norm of S is defined by
    Fp(S) = Σi∈[N] (mi)^p,
where 0^p is set to be 0. By definition, the F0-norm of S is the number of distinct items in S, and the F1-norm of S is the number of items in S.
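The definition above is easy to compute exactly when memory is no concern, which makes the streaming problem concrete. A minimal sketch (the example stream is an illustrative choice):

```python
from collections import Counter

def fp_norm(stream, p):
    """Exact Fp-norm of a stream: the sum of m_i^p over distinct items i.

    This illustrates the definition; the streaming algorithms below
    approximate it without storing a count for every distinct item.
    """
    counts = Counter(stream)               # m_i for every distinct item i
    return sum(m ** p for m in counts.values())

stream = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
print(fp_norm(stream, 0))  # F0 = number of distinct items -> 7
print(fp_norm(stream, 1))  # F1 = number of items          -> 11
print(fp_norm(stream, 2))  # F2 = sum of squared counts    -> 21
```

The exact computation needs memory proportional to the number of distinct items; the point of the algorithms that follow is to avoid exactly that cost.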
22. Problem Statement: Let S be a data stream representing a multi-set S. Items of S arrive consecutively, and every item i ∈ [n]. Design a streaming algorithm to (ε,δ)-approximate the F0-norm of S, where ε is the approximation parameter and δ is the confidence (failure probability) parameter.
To solve this problem, we can implement 3 different algorithms:
1. The AMS algorithm (primitive)
2. The BJKST algorithm (basic)
3. The Indyk algorithm (advanced)
23. 1. The AMS Algorithm
This algorithm for approximating F0 is by Noga Alon, Yossi Matias, and Mario Szegedy.
Assume that we have seen sufficiently many numbers, and that these numbers are uniformly distributed. We look at the binary expression Binary(x) of every item x, and we expect that one out of 2^d distinct items has Binary(x) ending with d consecutive zeros. More generally, let
24. ρ(x) be the number of zeros that Binary(x) ends with, and we have the following observation:
1. If ρ(x) = 1 for some x, then it is likely that the number of distinct integers is 2^1 = 2.
2. If ρ(x) = 2 for some x, then it is likely that the number of distinct integers is 2^2 = 4.
3. If ρ(x) = 3 for some x, then it is likely that the number of distinct integers is 2^3 = 8.
4. If ρ(x) = r for some x, then it is likely that the number of distinct integers is 2^r.
To implement this idea, we use a hash function h so that, after applying h, all items in S are uniformly distributed, and on average one out of F0 distinct numbers satisfies ρ(h(x)) ≥ log F0. Hence the maximum value of ρ(h(x)) over all items x in the stream gives a good approximation of the number of distinct items.
25. An Algorithm for Approximating F0:
1. Choose a random function h: [n]→[n] from a family of pairwise independent hash functions;
2. z ← 0;
3. While an item x arrives do
    a. if ρ(h(x)) > z then
        i. z ← ρ(h(x));
4. Return 2^(z+1/2)
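The steps above can be sketched as follows. The hash family h(x) = (ax + b) mod p is a standard pairwise-independent choice; the specific prime is an illustrative assumption:

```python
import random

def rho(y):
    """rho(y): number of trailing zeros in the binary expansion of y (0 for y = 0 here)."""
    if y == 0:
        return 0
    r = 0
    while y % 2 == 0:
        y //= 2
        r += 1
    return r

def ams_f0(stream, prime=2**31 - 1):
    """Sketch of the AMS F0 algorithm: track the maximum rho(h(x)) seen.

    h(x) = (a*x + b) mod prime is drawn from a pairwise-independent family.
    """
    a = random.randrange(1, prime)
    b = random.randrange(prime)
    z = 0
    for x in stream:
        z = max(z, rho((a * x + b) % prime))
    return 2 ** (z + 0.5)   # step 4: return 2^(z + 1/2)

print(ams_f0(range(1, 1001)))  # rough estimate of 1000 distinct items
```

A single run is only accurate up to a constant factor; as noted later in the deck, one runs Θ(log(1/δ)) copies and takes the median.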
26. 2. The BJKST Algorithm
Our second algorithm for approximating F0 is a simplified version of the algorithm by Bar-Yossef et al. In contrast to the AMS algorithm, the BJKST algorithm uses a set to keep the sampled items.
The basic idea behind the sampling scheme of the BJKST algorithm is as follows:
1. Let B be a set that is used to retain sampled items; B = ∅ initially. The size of B is O(1/ε^2) and depends only on the approximation parameter ε.
2. The initial sampling probability is 1, i.e. the algorithm keeps all items seen so far in B.
3. When the set B becomes full, shrink B by removing about half of its items; from then on the sampling probability becomes smaller.
4. In the end, the number of items in B and the current sampling probability are used to approximate the F0-norm.
27. The BJKST Algorithm (Simplified Version)
1. Choose a random function h: [n]→[n] from a family of pairwise independent hash functions.
2. z ← 0 //z is the index of the current level
3. B ← ∅ //Set B keeps sampled items
4. While an item x arrives do
    a. if ρ(h(x)) ≥ z then
        i. B ← B ∪ {(x, ρ(h(x)))}
        ii. while |B| ≥ c/ε^2 do //Set B becomes full
            1. z ← z + 1 //Increase the level
            2. Shrink B by removing all (x, ρ(h(x))) with ρ(h(x)) < z
5. Return |B| · 2^z
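The simplified version above translates almost line-for-line into Python. Here `threshold` plays the role of c/ε^2 (the capacity of B), and the hash family h(x) = (ax + b) mod p is again an illustrative pairwise-independent choice:

```python
import random

def trailing_zeros(y):
    """rho(y): number of trailing zeros in the binary expansion of y."""
    r = 0
    while y > 0 and y % 2 == 0:
        y //= 2
        r += 1
    return r

def bjkst_f0(stream, threshold=64, prime=2**31 - 1):
    """Sketch of the simplified BJKST algorithm above.

    B maps each sampled item x to rho(h(x)); when B fills up, the level z
    is raised and low-level items are evicted, halving the sample rate.
    """
    a = random.randrange(1, prime)
    b = random.randrange(prime)
    z = 0          # current level
    B = {}         # sampled items: x -> rho(h(x))
    for x in stream:
        r = trailing_zeros((a * x + b) % prime)
        if r >= z:
            B[x] = r
            while len(B) >= threshold:       # set B becomes full
                z += 1                       # increase the level
                B = {item: rv for item, rv in B.items() if rv >= z}
    return len(B) * 2 ** z

print(bjkst_f0(list(range(500)) * 3))  # estimate for 500 distinct items
```

Note that repeated occurrences of an item simply overwrite the same entry of B, so the sketch depends only on the set of distinct items.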
28. 3. The Indyk Algorithm
We next show that the F0-norm of a set S can be estimated in dynamic streams. This algorithm, due to Piotr Indyk, presents beautiful applications of the so-called stable distributions in designing streaming algorithms.
Let S be a stream consisting of pairs of the form (si, Ui), where si ∈ [n] and Ui = +/− represents a dynamic change of si. Design a data streaming algorithm that, for any ε and δ, (ε,δ)-approximates the F0-norm of S.
Assume that every item in the stream is in [n], and we want to achieve an (ε,δ)-approximation of the Fp-norm. Let us further assume that we have a matrix M of k = Θ(ε^−2 log(1/δ)) rows and n columns, where every entry of M is a random variable drawn from a p-stable distribution. Given matrix M, the algorithm keeps a vector z ∈ R^k which can be expressed as a linear combination of the columns of M.
29. The F0-norm of a multi-set S can be approximated by the Indyk algorithm by choosing a sufficiently small p, assuming that we have an upper bound K on the number of occurrences of every item in the stream.
Approximating the Fp-norm in a Turnstile Stream (An Idealized Algorithm):
1. For 1 ≤ i ≤ k do
    a. zi ← 0
2. While an operation arrives do
    a. If item j is added then
        i. For i ← 1 to k do
            1. zi ← zi + M[i,j]
    b. If item j is deleted then
        i. For i ← 1 to k do
            1. zi ← zi − M[i,j]
    c. If the Fp-norm is asked then
        i. Return median1≤i≤k{|zi|^p} · scalefactor(p)
This idealized algorithm relies on a matrix M of size k×n, and for every occurrence of item i, the algorithm needs the i-th column of M.
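The idealized algorithm can be sketched for the concrete case p = 1, where the 1-stable distribution is the standard Cauchy and the scale factor is 1 (the median of |Cauchy| is tan(π/4) = 1), so median |zi| directly estimates the F1-norm. This is an illustrative specialization; Indyk's F0 algorithm uses a small p > 0 instead, but the turnstile update structure is identical:

```python
import math
import random

def cauchy(rng):
    """Sample from the standard Cauchy distribution (the 1-stable distribution)."""
    return math.tan(math.pi * (rng.random() - 0.5))

class IndykF1Sketch:
    """Idealized Indyk-style sketch, specialized to p = 1 for illustration.

    Maintains z = M·f for a random k x n Cauchy matrix M; by 1-stability each
    z_i is Cauchy with scale ||f||_1, so median(|z_i|) estimates the F1-norm.
    """

    def __init__(self, n, k, seed=0):
        rng = random.Random(seed)
        self.M = [[cauchy(rng) for _ in range(n)] for _ in range(k)]
        self.z = [0.0] * k

    def update(self, j, delta):
        # delta = +1 when item j is added, -1 when item j is deleted.
        for i in range(len(self.z)):
            self.z[i] += delta * self.M[i][j]

    def query(self):
        zs = sorted(abs(v) for v in self.z)
        return zs[len(zs) // 2]   # median of |z_i|; scale factor is 1 for p = 1
```

Because updates only add or subtract a column of M, deletions cancel insertions exactly, which is what makes the sketch work in turnstile streams.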
30. Complexity of the Various Algorithms:
1. The AMS Algorithm
Running k = Θ(log(1/δ)) independent copies of the algorithm above and returning the median value, we can make the two failure probabilities at most δ. This gives an (O(1), δ)-approximation of the number of distinct items over the stream.
2. The BJKST Algorithm
By running Θ(log(1/δ)) independent copies in parallel and returning the median of these outputs, the BJKST algorithm (ε,δ)-approximates the F0-norm of the multi-set S.
3. The Indyk Algorithm
For any parameters ε, δ, there is an algorithm that (ε,δ)-approximates the number of distinct elements in a turnstile stream. The algorithm needs O(ε^−2 log n log(1/δ)) bits of space. The update time for every arriving item is O(ε^−2 log(1/δ)).
32. Frequency Estimation
- Frequency estimation is to estimate the frequency of any item x, i.e. the number of occurrences of x.
- The basic setting is as follows:
- Let S be a multi-set, which is empty initially.
- The data stream consists of a sequence of update operations to S, and each operation takes one of the following three forms:
33. Three Forms
- INSERT(S, x): perform the operation S ← S ∪ {x}
- DELETE(S, x): perform the operation S ← S \ {x}
- QUERY(S, x): return the number of occurrences of x in the multi-set S
34. Algorithm
- We use the Count-Min sketch for this frequency estimation problem.
- It consists of a fixed array C of counters of width w and depth d.
- These counters are all initialized to zero. Each row j is associated with a pairwise independent hash function hj, where each hj maps an element from U to {1, . . . , w}.
35. Algorithm
1: d = ⌈ln(1/δ)⌉
2: w = ⌈e/ε⌉
3: while an operation arrives do
4:     if Insert(S, x) then
5:         for j ← 1 to d do
6:             C[j, hj(x)] ← C[j, hj(x)] + 1
7:     if Delete(S, x) then
8:         for j ← 1 to d do
9:             C[j, hj(x)] ← C[j, hj(x)] − 1
10:     if the number of occurrences of x is asked then
11:         Return mx = min1≤j≤d C[j, hj(x)]
where ε is the approximation parameter and δ is the confidence (failure probability) parameter.
36. Choosing w and d
- For given parameters ε and δ, the width and depth of the Count-Min sketch are set to w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉.
- Hence for constant ε and δ, the sketch consists of only a constant number of counters.
- Note that the size of the Count-Min sketch depends only on the accuracy of the approximation, and is independent of the size of the universe.
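The sketch described above fits in a few lines of Python. The hash family hj(x) = ((aj·x + bj) mod p) mod w is an illustrative pairwise-independent choice; the key guarantee visible in code is that a query never undercounts:

```python
import math
import random

class CountMinSketch:
    """Count-Min sketch as described above: a d x w array of counters."""

    def __init__(self, eps, delta, seed=0, prime=2**31 - 1):
        self.w = math.ceil(math.e / eps)         # width  w = ceil(e / eps)
        self.d = math.ceil(math.log(1 / delta))  # depth  d = ceil(ln(1 / delta))
        rng = random.Random(seed)
        # One pairwise-independent hash (a, b) per row.
        self.hashes = [(rng.randrange(1, prime), rng.randrange(prime))
                       for _ in range(self.d)]
        self.prime = prime
        self.C = [[0] * self.w for _ in range(self.d)]

    def _h(self, j, x):
        a, b = self.hashes[j]
        return ((a * x + b) % self.prime) % self.w

    def insert(self, x):
        for j in range(self.d):
            self.C[j][self._h(j, x)] += 1

    def delete(self, x):
        for j in range(self.d):
            self.C[j][self._h(j, x)] -= 1

    def query(self, x):
        # Minimum over the d rows: never less than the true count,
        # and within eps * (stream size) of it with probability >= 1 - delta.
        return min(self.C[j][self._h(j, x)] for j in range(self.d))
```

For example, with ε = 0.01 and δ = 0.01 the sketch has d = 5 rows of w = 272 counters, regardless of how large the universe of items is.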
39. CREDITS: This presentation template was created
by Slidesgo, including icons by Flaticon, and
infographics & images by Freepik.
THANKS!
Our team:
1.Aryan Singh(18070124017)
2.Hridyesh Singh Bisht(18070124030)
3.Kavya Suthar(18070124037)
4.Sejal Shrestha(18070124064)