6. From Wikipedia
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
8. Algebraic Structure
“Set of values, coupled with one or more
finitary operations, and a set of laws those
operations must obey.”
e.g. Sum, Magma, Semigroup, Groups, Monoid,
Abelian Group, Semilattices, Rings, Monads,
etc.
10. Semigroup
Semigroup Law :
(x <> y) <> z = x <> (y <> z)
(associativity)
trait Semigroup[T] {
def aggregate(x : T, y : T) : T
}
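A minimal sketch of what instances of this trait could look like. The names IntMax, StringConcat and combineAll are made up for illustration and are not Algebird's API.

```scala
// The slide's trait, repeated so the sketch is self-contained.
trait Semigroup[T] {
  def aggregate(x: T, y: T): T
}

// Hypothetical instances, named here for illustration only.
object IntMax extends Semigroup[Int] {
  def aggregate(x: Int, y: Int): Int = x max y
}

object StringConcat extends Semigroup[String] {
  def aggregate(x: String, y: String): String = x + y
}

// Combine a non-empty list. A semigroup has no identity,
// so an empty list is not supported here.
def combineAll[T](xs: List[T])(sg: Semigroup[T]): T =
  xs.reduce((a, b) => sg.aggregate(a, b))

val largest = combineAll(List(3, 9, 4))(IntMax)

// Associativity: both groupings give the same result.
val leftGrouped  = StringConcat.aggregate(StringConcat.aggregate("a", "b"), "c")
val rightGrouped = StringConcat.aggregate("a", StringConcat.aggregate("b", "c"))
```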
12. Monoids
Monoid Laws :
(x <> y) <> z = x <> (y <> z)
(associativity)
identity <> x = x
x <> identity = x
(identity / zero)
trait Monoid[T] {
def identity : T
def aggregate(x : T, y : T) : T
}
13. Monoids
Monoid Laws :
(x <> y) <> z = x <> (y <> z)
(associativity)
identity <> x = x
x <> identity = x
trait Monoid[T] extends Semigroup[T]{
def identity : T
}
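A minimal sketch of two monoid instances built on the traits above. IntSum, SetUnion and combineAll are illustrative names, not Algebird's. The point is that the identity gives a well-defined answer even for an empty collection.

```scala
trait Semigroup[T] {
  def aggregate(x: T, y: T): T
}

trait Monoid[T] extends Semigroup[T] {
  def identity: T
}

// Hypothetical instances for illustration.
object IntSum extends Monoid[Int] {
  def identity: Int = 0
  def aggregate(x: Int, y: Int): Int = x + y
}

object SetUnion extends Monoid[Set[String]] {
  def identity: Set[String] = Set.empty
  def aggregate(x: Set[String], y: Set[String]): Set[String] = x union y
}

// Start from the identity, so an empty input is fine.
def combineAll[T](xs: Seq[T])(m: Monoid[T]): T =
  xs.foldLeft(m.identity)((acc, x) => m.aggregate(acc, x))

val total   = combineAll(Seq(1, 2, 3, 4))(IntSum)
val nothing = combineAll(Seq.empty[Int])(IntSum)
val users   = combineAll(Seq(Set("ann"), Set("bob", "ann")))(SetUnion)
```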
14. Groups
Group Laws:
(x <> y) <> z = x <> (y <> z)
(associativity)
identity <> x = x
x <> identity = x
(identity)
x <> inverse x = identity
inverse x <> x = identity
(invertibility)
15. Groups
Group Laws
(x <> y) <> z = x <> (y <> z)
identity <> x = x
x <> identity = x
x <> inverse x = identity
inverse x <> x = identity
trait Group[T] extends Monoid[T]{
def inverse(v : T) : T
}
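A minimal sketch of a group instance, assuming integer addition. IntAddition and remove are illustrative names. The inverse is what lets us retract a value that was already aggregated.

```scala
trait Semigroup[T] { def aggregate(x: T, y: T): T }
trait Monoid[T] extends Semigroup[T] { def identity: T }
trait Group[T] extends Monoid[T] { def inverse(v: T): T }

// Hypothetical instance: integers under addition.
object IntAddition extends Group[Int] {
  def identity: Int = 0
  def aggregate(x: Int, y: Int): Int = x + y
  def inverse(v: Int): Int = -v
}

// "Subtracting" b from a is a <> inverse(b).
def remove[T](a: T, b: T)(g: Group[T]): T =
  g.aggregate(a, g.inverse(b))

val runningTotal    = IntAddition.aggregate(10, 5)
val afterRetraction = remove(runningTotal, 5)(IntAddition)
```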
16. Many More
- Abelian groups (commutative groups)
- Rings
- Semilattices
- Ordered Semigroups
- Fields ...
Many of those are in Algebird ...
17. Examples
- (a min b) min c = a min (b min c) with Int.
- a max (b max c) = (a max b) max c
- a or (b or c) = (a or b) or c
- a and (b and c) = (a and b) and c
- int addition
- set union
- harmonic sum
- Integer mean (carried as a (sum, count) pair)
- Priority queue
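A quick way to sanity-check a few of these examples on sample values. This is a spot check, not a proof, and isAssociative, naiveMean and Mean are names made up here. It also shows why a mean has to be carried as a (sum, count) pair to be associative.

```scala
// Check associativity of a binary operation on three sample values.
def isAssociative[T](a: T, b: T, c: T)(op: (T, T) => T): Boolean =
  op(op(a, b), c) == op(a, op(b, c))

val minOk   = isAssociative(7, 2, 5)(_ min _)
val maxOk   = isAssociative(7, 2, 5)(_ max _)
val orOk    = isAssociative(true, false, false)(_ || _)
val andOk   = isAssociative(true, false, true)(_ && _)
val unionOk = isAssociative(Set(1), Set(2), Set(1, 3))(_ union _)

// Averaging two means directly is not associative...
def naiveMean(x: Double, y: Double): Double = (x + y) / 2
val naiveOk = isAssociative(1.0, 2.0, 6.0)(naiveMean)

// ...but merging (sum, count) pairs is. Divide only when reading the result.
final case class Mean(sum: Long, count: Long) {
  def value: Double = sum.toDouble / count
}

def mergeMean(x: Mean, y: Mean): Mean =
  Mean(x.sum + y.sum, x.count + y.count)

val meanOk = isAssociative(Mean(1, 1), Mean(2, 1), Mean(6, 1))(mergeMean)
```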
19. Why do we need those algebraic structures?
20. We want to :
- Build scalable analytics systems
- Leverage distributed computing to perform aggregation
on really large data sets.
- A lot of operations in analytics are just sorting and
counting at the end of the day
24. Distributed Computing → Parallelism
Associativity enables parallelism: partial results can be grouped any way we like
Identity means we can ignore empty or missing data
Commutativity helps us ignore order
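The three properties above can be sketched with a word count, assuming each chunk stands for the share of data one worker would see. mergeCounts and countWords are illustrative helpers, and everything runs sequentially here. Only the grouping and the order change.

```scala
// Merge two partial word counts by adding counts per key.
def mergeCounts(x: Map[String, Int], y: Map[String, Int]): Map[String, Int] =
  y.foldLeft(x) { case (acc, (word, n)) =>
    acc.updated(word, acc.getOrElse(word, 0) + n)
  }

def countWords(words: Seq[String]): Map[String, Int] =
  words.foldLeft(Map.empty[String, Int])((acc, w) => mergeCounts(acc, Map(w -> 1)))

val words = Seq("spark", "hadoop", "spark", "scalding", "hadoop", "spark")

// One pass over everything.
val sequential = countWords(words)

// Split into chunks. The empty chunk maps to the identity (empty map).
val chunks: Seq[Seq[String]] = words.grouped(2).toSeq :+ Seq.empty[String]
val partials = chunks.map(countWords)

// Associativity: any grouping works. Commutativity: any order works.
val merged   = partials.foldLeft(Map.empty[String, Int])(mergeCounts)
val reversed = partials.reverse.foldLeft(Map.empty[String, Int])(mergeCounts)
```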
27. .sortWithTake( … )
Looking into .sortWithTake in Scalding, there’s one
nice thing :
class PriorityQueueMonoid[T] (max : Int)
(implicit order : Ordering[T] )
extends Monoid[PriorityQueue[T] ]
29. class PriorityQueueMonoid[T] (max : Int)
(implicit order : Ordering[T] )
extends Monoid[PriorityQueue[T] ]
Let’s take a look :
PQ1 : 55, 45, 21, 3
PQ2 : 100, 80, 40, 3
top-4 (PQ1 U PQ2) : 100, 80, 55, 45
Priority Queue :
Can be empty
Two Priority Queues can be “added” in any order
Associative + Commutative
Makes Scalding go fast, by doing sorting, filtering and
extracting in one single “map” step.
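The PQ1 / PQ2 example can be replayed with a small stand-in. This is only a sketch: TopK, topKOf and mergeTopK are made-up names, and a sorted immutable list replaces the real priority queue used by PriorityQueueMonoid.

```scala
// Keep only the `max` largest values, stored in descending order.
final case class TopK(max: Int, items: List[Int])

def topKOf(max: Int, values: Int*): TopK =
  TopK(max, values.toList.sorted(Ordering[Int].reverse).take(max))

// "Adding" two queues: pool the items, keep the top `max`.
// Assumes both sides were built with the same `max`.
def mergeTopK(x: TopK, y: TopK): TopK =
  TopK(x.max, (x.items ++ y.items).sorted(Ordering[Int].reverse).take(x.max))

val pq1   = topKOf(4, 55, 45, 21, 3)
val pq2   = topKOf(4, 100, 80, 40, 3)
val empty = topKOf(4)

val top4 = mergeTopK(pq1, pq2)
```

Because each partial queue never holds more than `max` items, the merge stays cheap no matter how much data each side has seen.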
30. Stream Mining Challenges
- Update predictions after each observation
- Single pass : can’t read old data or replay
the stream
- Full size of the stream often unknown
- Limited time for computation per
observation
- O(1) memory size
33. Tradeoff : Space and speed over accuracy.
Use sketches.
34. Sketches
Probabilistic data structures that store a summary
(mostly hash-based) of a data set that would be costly to
store in its entirety, and therefore usually need only
sublinear space and time.
E.g. Bloom Filters, Counter Sketch, KMV counters,
Count Min Sketch, HyperLogLog, Min Hashes
35. Bloom filters
Approximate data structure for set membership
Behaves like an approximate set
BloomFilter.contains(x) => NO | Maybe
P(False Positive) > 0
P(False Negative) = 0
36. Internally :
Bit Array of fixed size
add(x) : for each hash function i, set b[h(x,i)] = 1
(Boolean OR => associative)
contains(x) : TRUE if b[h(x,i)] == 1 for all i
(Boolean AND => associative)
Both are associative => BF can be designed as a Monoid
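The add / contains rules above can be sketched in plain Scala. This is not Algebird's implementation: the Bloom class is made up for illustration, and using MurmurHash3 seeded with i as the i-th hash function is an assumption of this sketch.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

final case class Bloom(numHashes: Int, width: Int, bits: BitSet) {
  // i-th hash: MurmurHash3 seeded with i, folded into [0, width).
  private def positions(x: String): Seq[Int] =
    (0 until numHashes).map(i => Math.floorMod(MurmurHash3.stringHash(x, i), width))

  // Set every bit the item hashes to.
  def add(x: String): Bloom = copy(bits = positions(x).foldLeft(bits)(_ + _))

  // true means "maybe", false means "definitely not".
  def contains(x: String): Boolean = positions(x).forall(i => bits(i))

  // Monoid "plus": bitwise OR of two filters of the same shape.
  def merge(other: Bloom): Bloom = copy(bits = bits | other.bits)
}

val emptyBloom = Bloom(numHashes = 6, width = 6000, bits = BitSet.empty)

val left  = (1 to 150).map(_.toString).foldLeft(emptyBloom)((bf, x) => bf.add(x))
val right = (151 to 300).map(_.toString).foldLeft(emptyBloom)((bf, x) => bf.add(x))
val all   = left.merge(right)
```

Merging the two halves gives the same bits as adding all 300 items to one filter, which is what lets the filters be built in parallel.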
37. Bloom filters
import com.twitter.algebird._
import com.twitter.algebird.Operators._
// generate a list
val A = (1 to 300).toList
// Generate a Bloom filter
val NUM_HASHES = 6
val WIDTH = 6000 // bits
val SEED = 1
implicit val bfm = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)
// approximate set with Bloom filter
val A_bf = A.map{i => bfm.create(i.toString)}.reduce(_ + _)
val approxBool = A_bf.contains("150") // ApproximateBoolean(true, 0.9995…)
38. Count Min Sketch
Gives an approximation of the number of occurrences of an
element in a data stream.
39. Count Min Sketch
Adding an element is a numerical addition.
Querying uses a MIN function.
Both are associative.
Useful for detecting heavy hitters, top-K, LSH.
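The add and query steps can be sketched in plain Scala. This is not Algebird's implementation: CountMin and emptyCountMin are made-up names, and seeding MurmurHash3 with the row index is an assumption of this sketch.

```scala
import scala.util.hashing.MurmurHash3

// `depth` rows of `width` counters. Row i hashes with MurmurHash3 seeded with i.
final case class CountMin(depth: Int, width: Int, table: Vector[Vector[Long]]) {
  private def column(row: Int, x: String): Int =
    Math.floorMod(MurmurHash3.stringHash(x, row), width)

  // Adding is a numerical addition in one cell per row.
  def add(x: String, count: Long = 1L): CountMin =
    copy(table = table.zipWithIndex.map { case (cells, row) =>
      val col = column(row, x)
      cells.updated(col, cells(col) + count)
    })

  // Collisions only inflate a cell, so the smallest cell is the best estimate.
  def estimate(x: String): Long =
    (0 until depth).map(row => table(row)(column(row, x))).min

  // Merging two sketches of the same shape is cell-wise addition.
  def merge(other: CountMin): CountMin =
    copy(table = table.zip(other.table).map { case (a, b) =>
      a.zip(b).map { case (p, q) => p + q }
    })
}

def emptyCountMin(depth: Int, width: Int): CountMin =
  CountMin(depth, width, Vector.fill(depth, width)(0L))

val first  = Seq("a", "b", "a").foldLeft(emptyCountMin(4, 64))((s, x) => s.add(x))
val second = Seq("a", "c").foldLeft(emptyCountMin(4, 64))((s, x) => s.add(x))
val both   = first.merge(second)
```

The estimate can overcount when items collide, but it never undercounts.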
We have in Algebird :
40. HyperLogLog
Popular sketch for cardinality estimation.
Estimates the number of distinct values in a data set,
within a probabilistic error bound.
HLL.size = Approx[Number]
Intuition
Long runs of trailing 0s in a random bit
string are rare
But the more bit strings you look at, the more
likely you are to find a long one
The longest run of trailing 0-bits seen can be
an estimator of the number of unique bit strings
observed.
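The intuition can be sketched with a single register. This is deliberately crude and is not HyperLogLog itself, which keeps many registers and combines them. The function names are made up, and MurmurHash3 stands in for "random bits".

```scala
import scala.util.hashing.MurmurHash3

// Number of trailing 0-bits in a hash (32 for the value 0).
def trailingZeros(hash: Int): Int = Integer.numberOfTrailingZeros(hash)

// The whole sketch is one number: the longest run seen so far (0 when empty).
def observe(longestRun: Int, item: String): Int =
  longestRun max trailingZeros(MurmurHash3.stringHash(item))

// Two sketches merge with max, which is associative and commutative.
def merge(x: Int, y: Int): Int = x max y

// Crude estimate: a run of r zeros suggests roughly 2^r distinct items.
def estimate(longestRun: Int): Long = 1L << longestRun

val items      = (1 to 1000).map(i => s"user-$i")
val whole      = items.foldLeft(0)(observe)
val firstHalf  = items.take(500).foldLeft(0)(observe)
val secondHalf = items.drop(500).foldLeft(0)(observe)

// Seeing every item twice does not change the sketch.
val withDuplicates = (items ++ items).foldLeft(0)(observe)
```

A single register gives a very noisy estimate, which is why the real structure averages over many of them.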
41. Adding an element uses a Max and a Sum function.
Both are associative and form Monoids. (Max is really an
ordered semigroup in Algebird.)
Querying the estimate uses a harmonic mean,
which is a Monoid.
In Algebird :
42. Many More juicy sketches ...
- MinHashes to compute Jaccard similarity
- QTree for quantile estimation. Neat for anomaly
detection.
- SpaceSaverMonoid, awesome for finding the approximate
most frequent and top-K elements.
- TopKMonoid
- SGD, PriorityQueues, Histograms, etc.
43. SummingBird : Lambda in a box
46. SummingBird
Same code, for both batch and real time processing.
But works only on Monoids.
Uses Storehaus, as a mergeable store layer.
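A rough sketch of why a mergeable store needs a monoid. MergeableStore below is a hypothetical in-memory stand-in written for this example. It is not the Storehaus API and not how SummingBird is implemented.

```scala
// Hypothetical in-memory stand-in for a mergeable store.
// `zero` and `plus` are the monoid's identity and operation.
final class MergeableStore[K, V](zero: V)(plus: (V, V) => V) {
  private var data = Map.empty[K, V]

  // Combine the incoming delta with whatever is already stored.
  def merge(key: K, delta: V): Unit =
    data = data.updated(key, plus(data.getOrElse(key, zero), delta))

  def get(key: K): V = data.getOrElse(key, zero)
}

// A batch job writes one large partial sum, an online job writes small deltas.
def replay(): Long = {
  val store = new MergeableStore[String, Long](0L)(_ + _)
  store.merge("clicks", 1000L) // batch result
  store.merge("clicks", 3L)    // real-time delta
  store.merge("clicks", 2L)    // real-time delta
  store.get("clicks")
}
```

Since the operation is associative, the store does not care whether a value arrived as one batch total or as many small updates.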
51. Links
- Algebra for analytics by Oscar Boykin (Creator of Algebird)
http://speakerdeck.com/johnynek/algebra-for-analytics
- Take a look into HLearn https://github.com/mikeizbicki/HLearn
- Great intro into Algebird by Michael Noll
http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
- Aggregate Knowledge
http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure
- Probabilistic data structures for web analytics.
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
- http://debasishg.blogspot.fr/2014/01/count-min-sketch-data-structure-for.html
- http://infolab.stanford.edu/~ullman/mmds/ch3.pdf