SlideShare ist ein Scribd-Unternehmen logo
1 von 51
Algebird 
Abstract Algebra 
for 
Analytics 
Sam BESSALAH 
@samklr 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop @samklr
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
Abstract Algebra 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
From WikiPedia 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
Algebraic Structure 
“ Set of values, coupled with one or 
more finite operations,and a set of 
laws those operations must obey. “ 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Algebraic Structure 
“ Set of values, coupled with one or more 
finite operations, and a set of laws those 
operations must obey. “ 
e.g Sum, Magma, Semigroup, Groups, Monoid, 
Abelian Group, Semi Lattices, Rings, Monads, 
etc. 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Semigroup 
Semigroup Law : 
(x <> y) <> z = x <> (y <> z) 
(associativity) 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Semigroup 
Semigroup Law : 
(x <> y) <> z = x <> (y <> z) 
(associativity) 
trait Semigroup[T] { 
def aggregate(x : T, y : T) : T 
} 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Monoids 
Monoid Laws : 
(x <> y) <> z = x <> (y <> z) 
(associativity) 
identity <> x = x 
x <> identity = x 
(identity) 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Monoids 
Monoid Laws : 
(x <> y) <> z = x <> (y <> z) 
(associativity) 
identity <> x = x 
x <> identity = x 
(identiy / zero) 
trait Monoid[T] { 
def identity : T 
def aggregate (x, y) : T 
} 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Monoids 
Monoid Laws : 
(x <> y) <> z = x <> (y <> z) 
(associativity) 
identity <> x = x 
x <> identity = x 
trait Monoid[T] extends Semigroup[T]{ 
def identity : T 
} 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Groups 
Group Laws: 
(x <> y) <> z = x <> (y <> z) 
(associativity) 
identity <> x = x 
x <> identity = x 
(identity) 
x <> inverse x = identity 
inverse x <> x = identity 
(invertibility) 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Groups 
Group Laws 
(x <> y) <> z = x <> (y <> z) 
identity <> x = x 
x <> identity = x 
x <> inverse x = identity 
inverse x <> x = identity 
trait Group[T] extends Monoid[T]{ 
def inverse (v : T) :T 
} 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Many More 
- Abelian groups (Commutative Sets) 
- Rings 
- Semi Lattices 
- Ordered Semigroups 
- Fields .. 
Many of those are in Algebird …. 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Examples 
- (a min b) min c = a (b min c) with Int. 
- a max ( b max c) = (a max b) max c ** 
- a or (b or c) = (a or b) or c 
- a and (b and c) = (a and b) and c 
- int addition 
- set union 
- harmonic sum 
- Integer mean 
- Priority queue 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Why do we need those algebraic 
structures ? 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
We want to : 
- Build scalable analytics systems 
- Leverage distributed computing to perform aggregation 
on really large data sets. 
- A lot of operations in analytics are just sorting and 
counting at the end of the day 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Distributed Computing → Parallellism 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Distributed Computing → Parallellism 
Associativity → enables parallelism 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Distributed Computing → Parallellism 
Associativity enables parallelism 
Identity means we can ignore some data 
Commutativity helps us ignore order 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Typical Map Reduce ... 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Finding Top-K Elements in Scalding ... 
class TopKJob(args : Args) extends Job (args) { 
Tsv ( args(‘input’), visitScheme) 
.filter (. ..) 
.leftJoinWithTiny ( … ) 
.filter ( … ) 
.groupBy( ‘fieldOne) 
{ _.sortWithTake (visitScheme -> top } 
(biggerSale) 
.write(Tsv(...) ) 
} 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
.sortWithTake( … ) 
Looking into .sortWithTake in Scalding, there’s one 
nice thing : 
class PiorityQueueMonoid[T] (max : Int) 
(implicit order : Ordering[T] ) 
extends Monoid[Priorityqueue[T] ] 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
class PiorityQueueMonoid[T] (max : Int) 
(implicit order : Ordering[T] ) 
extends Monoid[Priorityqueue[T] ] 
Let’s take a look : 
PQ1 : 55, 45, 21, 3 
PQ2: 100, 80, 40, 3 
top-4 (PQ1 U PQ2 ): 100, 80, 55, 45 
Priority Queue : 
Can be empty 
Two Priority Queues can be “added” in any order 
Associative + Commutative 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
class PiorityQueueMonoid[T] (max : Int) 
(implicit order : Ordering[T] ) 
extends Monoid[Priorityqueue[T] ] 
Let’s take a look : 
PQ1 : 55, 45, 21, 3 
PQ2: 100, 80, 40, 3 
top-4 (PQ1 U PQ2 ): 100, 80, 55, 45 
Priority Queue : 
Makes Scalding go fast, 
by doing sorting, 
filtering and extracting 
in one single “map” 
step. 
Can be empty 
Two Priority Queues can be “added” in any order 
Associative + Commutative 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Stream Mining Challenges 
- Update predictions after each observation 
- Single pass : can’t read old data or replay 
the stream 
- Full size of the stream often unknown 
- Limited time for computation per 
observation 
- O(1) memory size 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Stream Mining Challenges 
http://radar.oreilly.com/2013/10/stream-mining-essentials.html 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Tradeoff : Space and speed over 
accuracy. 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Tradeoff : Space and speed over 
accuracy. 
use sketches. 
Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Sketches 
Probabilistic data structures that store a summary 
(hashed mostly)of a data set that would be costly to 
store in its entirety, thus providing most of the 
time, sublinear algorithmic properties. 
E.g Bloom Filters, Counter Sketch, KMV counters, 
Count Min Sketch, HyperLogLog, Min Hashes 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Bloom filters 
Approximate data structure for set membership 
Behaves like an approximate set 
BloomFilter.contains(x) => NO | Maybe 
P(False Positive) > 0 
P(False Negative) = 0 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Internally : 
Bit Array of fixed size 
add(x) : for all element i, b[h(x,i)]=1 
contains(x) : TRUE if b[h(x,i)] = = 1 for all i. 
(Boolean AND => associative) 
Both are associative => BF can be designed as a Monoid 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Bloom filters 
import com.twitter.algebird._ 
import com.twitter.algebird.Operators._ 
// generate 2 lists 
val A = (1 to 300).toList 
// Generate a Bloomfilter 
val NUM_HASHES = 6 
val WIDTH = 6000 // bits 
val SEED = 1 
implicit val bfm = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED) 
// approximate set with bloomfilter 
val A_bf = A.map{i => bfm.create(i.toString)}.reduce(_ + _) 
val approxBool = A_bf.contains(“150”) ---> ApproximateBoolean(true, 0.9995…) 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Count Min Sketch 
Gives an approximation of the number of occurrences of an 
element in a set. 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Count Min Sketch 
Count min sketch 
Adding an element is a numerical addition 
Querying uses a MIN function. 
Both are associative. 
useful for detecting heavy hitters, topK, LSH 
We have in Algebird : 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
HyperLogLog 
Popular sketch for cardinality estimtion. 
Gives within a probilistic distribution of an error 
the number of distinct values in a data set. 
HLL.size = Approx[Number] 
Intuition 
Long runs of trailings 0 in a random bits 
chain are rare 
But the more bit chains you look at, the more 
likely you are to find a long one 
The longest run of trailing 0-bits seen can be 
an estimator of the number of unique bit chains 
observed. 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Adding an element uses a Max and Sum function. 
Both are associative and Monoids. (Max is an 
ordered 
semigroup in Algebird really) 
Querying for an element uses an harmonic mean 
which is a Monoid. 
In Algebird : 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Many More juicy sketches ... 
- MinHashes to compute Jaccard similarity 
- QTree for quantiles estimation. Neat for anomaly 
detection. 
- SpaceSaverMonoid, Awesome to find the approximate 
most frequent and top K elements. 
- TopKMonoid 
- SGD, PriorityQueues, Histograms, etc. 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
SummingBird : Lamba in a box 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Heard of Lambda Architecture ? 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
SummingBird 
Same code for both batch and real time processing. 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
SummingBird 
Same code, for both batch and real time processing. 
But works only on Monoids. 
Uses Storehaus, as a mergeable store layer. 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
http://github.com/twitter/algebird 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
http://github.com/twitter/algebird 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr 
These slides : 
http://bit.ly/1szncAZ 
http://slidesha.re/1zhhXKU
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr
Links 
-Algebra for analytics by Oscar Boykin (Creator of Algebird) 
http://speakerdeck.com/johnynek/algebra-for-analytics 
- Take a look into HLearn https://github.com/mikeizbicki/HLearn 
- Great intro into Algebird by Michael Noll 
http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad- 
for-large-scala-data-analytics/ 
-Aggregate Knowledge http://research.neustar.biz/2012/10/25/sketch-of- 
the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure 
- Probabilistic data structures for web analytics. 
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures- 
web-analytics-data-mining/ 
- http://debasishg.blogspot.fr/2014/01/count-min-sketch-data-structure- 
for.html 
- http://infolab.stanford.edu/~ullman/mmds/ch3.pdf 
#Devoxx #algebird #scalding #monoid #hadoop #spark 
@samklr

Weitere ähnliche Inhalte

Was ist angesagt?

Effective testing for spark programs Strata NY 2015
Effective testing for spark programs   Strata NY 2015Effective testing for spark programs   Strata NY 2015
Effective testing for spark programs Strata NY 2015
Holden Karau
 
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka Streams
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka StreamsFresh from the Oven (04.2015): Experimental Akka Typed and Akka Streams
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka Streams
Konrad Malawski
 

Was ist angesagt? (20)

Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
 
Developing distributed applications with Akka and Akka Cluster
Developing distributed applications with Akka and Akka ClusterDeveloping distributed applications with Akka and Akka Cluster
Developing distributed applications with Akka and Akka Cluster
 
Asynchronous stream processing with Akka Streams
Asynchronous stream processing with Akka StreamsAsynchronous stream processing with Akka Streams
Asynchronous stream processing with Akka Streams
 
mesos-devoxx14
mesos-devoxx14mesos-devoxx14
mesos-devoxx14
 
Reactive Streams / Akka Streams - GeeCON Prague 2014
Reactive Streams / Akka Streams - GeeCON Prague 2014Reactive Streams / Akka Streams - GeeCON Prague 2014
Reactive Streams / Akka Streams - GeeCON Prague 2014
 
Reactive programming on Android
Reactive programming on AndroidReactive programming on Android
Reactive programming on Android
 
XML-Motor
XML-MotorXML-Motor
XML-Motor
 
Writing Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using ScaldingWriting Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using Scalding
 
Scalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/CascadingScalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/Cascading
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
 
Akka Actor presentation
Akka Actor presentationAkka Actor presentation
Akka Actor presentation
 
whats new in java 8
whats new in java 8 whats new in java 8
whats new in java 8
 
Effective testing for spark programs Strata NY 2015
Effective testing for spark programs   Strata NY 2015Effective testing for spark programs   Strata NY 2015
Effective testing for spark programs Strata NY 2015
 
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
 
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka Streams
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka StreamsFresh from the Oven (04.2015): Experimental Akka Typed and Akka Streams
Fresh from the Oven (04.2015): Experimental Akka Typed and Akka Streams
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Async - react, don't wait - PingConf
Async - react, don't wait - PingConfAsync - react, don't wait - PingConf
Async - react, don't wait - PingConf
 
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraEscape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
 
Norikra: SQL Stream Processing In Ruby
Norikra: SQL Stream Processing In RubyNorikra: SQL Stream Processing In Ruby
Norikra: SQL Stream Processing In Ruby
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
 

Andere mochten auch

An application of abstract algebra to music theory
An application of abstract algebra to music theoryAn application of abstract algebra to music theory
An application of abstract algebra to music theory
morkir
 
Snapdragon processors
Snapdragon processorsSnapdragon processors
Snapdragon processors
Deepak Mathew
 
Data By The People, For The People
Data By The People, For The PeopleData By The People, For The People
Data By The People, For The People
Daniel Tunkelang
 

Andere mochten auch (19)

Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In Production
 
algebraic-geometry
algebraic-geometryalgebraic-geometry
algebraic-geometry
 
A Study to Design and Implement a Manual for the Learning Process of Technica...
A Study to Design and Implement a Manual for the Learning Process of Technica...A Study to Design and Implement a Manual for the Learning Process of Technica...
A Study to Design and Implement a Manual for the Learning Process of Technica...
 
An application of abstract algebra to music theory
An application of abstract algebra to music theoryAn application of abstract algebra to music theory
An application of abstract algebra to music theory
 
Deep learning for mere mortals - Devoxx Belgium 2015
Deep learning for mere mortals - Devoxx Belgium 2015Deep learning for mere mortals - Devoxx Belgium 2015
Deep learning for mere mortals - Devoxx Belgium 2015
 
Information Security Seminar #2
Information Security Seminar #2Information Security Seminar #2
Information Security Seminar #2
 
Definition ofvectorspace
Definition ofvectorspaceDefinition ofvectorspace
Definition ofvectorspace
 
Snapdragon Processor
Snapdragon ProcessorSnapdragon Processor
Snapdragon Processor
 
Production and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsProduction and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning Models
 
Snapdragon processors
Snapdragon processorsSnapdragon processors
Snapdragon processors
 
Kill the mutants - A better way to test your tests
Kill the mutants - A better way to test your testsKill the mutants - A better way to test your tests
Kill the mutants - A better way to test your tests
 
Lecture 1: What is Machine Learning?
Lecture 1: What is Machine Learning?Lecture 1: What is Machine Learning?
Lecture 1: What is Machine Learning?
 
Machine Learning on Big Data
Machine Learning on Big DataMachine Learning on Big Data
Machine Learning on Big Data
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data Scientists
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Tutorial on Deep learning and Applications
Tutorial on Deep learning and ApplicationsTutorial on Deep learning and Applications
Tutorial on Deep learning and Applications
 
Data By The People, For The People
Data By The People, For The PeopleData By The People, For The People
Data By The People, For The People
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
 
Build Features, Not Apps
Build Features, Not AppsBuild Features, Not Apps
Build Features, Not Apps
 

Ähnlich wie Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Taxonomy of Scala
Taxonomy of ScalaTaxonomy of Scala
Taxonomy of Scala
shinolajla
 
A tour on ruby and friends
A tour on ruby and friendsA tour on ruby and friends
A tour on ruby and friends
旻琦 潘
 
Pharo, an innovative and open-source Smalltalk
Pharo, an innovative and open-source SmalltalkPharo, an innovative and open-source Smalltalk
Pharo, an innovative and open-source Smalltalk
Serge Stinckwich
 
Ruby 入門 第一次就上手
Ruby 入門 第一次就上手Ruby 入門 第一次就上手
Ruby 入門 第一次就上手
Wen-Tien Chang
 

Ähnlich wie Algebird : Abstract Algebra for big data analytics. Devoxx 2014 (20)

Everything is Permitted: Extending Built-ins
Everything is Permitted: Extending Built-insEverything is Permitted: Extending Built-ins
Everything is Permitted: Extending Built-ins
 
Concurrent programming with Celluloid (MWRC 2012)
Concurrent programming with Celluloid (MWRC 2012)Concurrent programming with Celluloid (MWRC 2012)
Concurrent programming with Celluloid (MWRC 2012)
 
The things we don't see – stories of Software, Scala and Akka
The things we don't see – stories of Software, Scala and AkkaThe things we don't see – stories of Software, Scala and Akka
The things we don't see – stories of Software, Scala and Akka
 
Ruby — An introduction
Ruby — An introductionRuby — An introduction
Ruby — An introduction
 
Taxonomy of Scala
Taxonomy of ScalaTaxonomy of Scala
Taxonomy of Scala
 
Blocks by Lachs Cox
Blocks by Lachs CoxBlocks by Lachs Cox
Blocks by Lachs Cox
 
Ruby 2: some new things
Ruby 2: some new thingsRuby 2: some new things
Ruby 2: some new things
 
A tour on ruby and friends
A tour on ruby and friendsA tour on ruby and friends
A tour on ruby and friends
 
Pharo, an innovative and open-source Smalltalk
Pharo, an innovative and open-source SmalltalkPharo, an innovative and open-source Smalltalk
Pharo, an innovative and open-source Smalltalk
 
Ruby 入門 第一次就上手
Ruby 入門 第一次就上手Ruby 入門 第一次就上手
Ruby 入門 第一次就上手
 
Spock: Test Well and Prosper
Spock: Test Well and ProsperSpock: Test Well and Prosper
Spock: Test Well and Prosper
 
Solr Masterclass Bangkok, June 2014
Solr Masterclass Bangkok, June 2014Solr Masterclass Bangkok, June 2014
Solr Masterclass Bangkok, June 2014
 
Ruby is an Acceptable Lisp
Ruby is an Acceptable LispRuby is an Acceptable Lisp
Ruby is an Acceptable Lisp
 
Ruby Topic Maps Tutorial (2007-10-10)
Ruby Topic Maps Tutorial (2007-10-10)Ruby Topic Maps Tutorial (2007-10-10)
Ruby Topic Maps Tutorial (2007-10-10)
 
Rails by example
Rails by exampleRails by example
Rails by example
 
Introductionto fp with groovy
Introductionto fp with groovyIntroductionto fp with groovy
Introductionto fp with groovy
 
Code is not text! How graph technologies can help us to understand our code b...
Code is not text! How graph technologies can help us to understand our code b...Code is not text! How graph technologies can help us to understand our code b...
Code is not text! How graph technologies can help us to understand our code b...
 
BASE Meetup: "Analysing Scala Puzzlers: Essential and Accidental Complexity i...
BASE Meetup: "Analysing Scala Puzzlers: Essential and Accidental Complexity i...BASE Meetup: "Analysing Scala Puzzlers: Essential and Accidental Complexity i...
BASE Meetup: "Analysing Scala Puzzlers: Essential and Accidental Complexity i...
 
Scala Up North: "Analysing Scala Puzzlers: Essential and Accidental Complexit...
Scala Up North: "Analysing Scala Puzzlers: Essential and Accidental Complexit...Scala Up North: "Analysing Scala Puzzlers: Essential and Accidental Complexit...
Scala Up North: "Analysing Scala Puzzlers: Essential and Accidental Complexit...
 
Code for Startup MVP (Ruby on Rails) Session 2
Code for Startup MVP (Ruby on Rails) Session 2Code for Startup MVP (Ruby on Rails) Session 2
Code for Startup MVP (Ruby on Rails) Session 2
 

Kürzlich hochgeladen

%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
VictoriaMetrics
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
masabamasaba
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
masabamasaba
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
masabamasaba
 

Kürzlich hochgeladen (20)

Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
WSO2Con2024 - From Blueprint to Brilliance: WSO2's Guide to API-First Enginee...
WSO2Con2024 - From Blueprint to Brilliance: WSO2's Guide to API-First Enginee...WSO2Con2024 - From Blueprint to Brilliance: WSO2's Guide to API-First Enginee...
WSO2Con2024 - From Blueprint to Brilliance: WSO2's Guide to API-First Enginee...
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open SourceWSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 

Algebird : Abstract Algebra for big data analytics. Devoxx 2014

  • 1. Algebird Abstract Algebra for Analytics Sam BESSALAH @samklr Room 4 #Devoxx #algebird #scalding #monoid #hadoop @samklr
  • 2. Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 3. Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 4. Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 5. Abstract Algebra Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 6. From WikiPedia Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 7. Algebraic Structure “ Set of values, coupled with one or more finite operations,and a set of laws those operations must obey. “ Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 8. Algebraic Structure “ Set of values, coupled with one or more finite operations, and a set of laws those operations must obey. “ e.g Sum, Magma, Semigroup, Groups, Monoid, Abelian Group, Semi Lattices, Rings, Monads, etc. Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 9. Semigroup Semigroup Law : (x <> y) <> z = x <> (y <> z) (associativity) Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 10. Semigroup Semigroup Law : (x <> y) <> z = x <> (y <> z) (associativity) trait Semigroup[T] { def aggregate(x : T, y : T) : T } Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 11. Monoids Monoid Laws : (x <> y) <> z = x <> (y <> z) (associativity) identity <> x = x x <> identity = x (identity) Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 12. Monoids Monoid Laws : (x <> y) <> z = x <> (y <> z) (associativity) identity <> x = x x <> identity = x (identiy / zero) trait Monoid[T] { def identity : T def aggregate (x, y) : T } Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 13. Monoids Monoid Laws : (x <> y) <> z = x <> (y <> z) (associativity) identity <> x = x x <> identity = x trait Monoid[T] extends Semigroup[T]{ def identity : T } Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 14. Groups Group Laws: (x <> y) <> z = x <> (y <> z) (associativity) identity <> x = x x <> identity = x (identity) x <> inverse x = identity inverse x <> x = identity (invertibility) Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 15. Groups Group Laws (x <> y) <> z = x <> (y <> z) identity <> x = x x <> identity = x x <> inverse x = identity inverse x <> x = identity trait Group[T] extends Monoid[T]{ def inverse (v : T) :T } Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 16. Many More - Abelian groups (Commutative Sets) - Rings - Semi Lattices - Ordered Semigroups - Fields .. Many of those are in Algebird …. Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 17. Examples - (a min b) min c = a (b min c) with Int. - a max ( b max c) = (a max b) max c ** - a or (b or c) = (a or b) or c - a and (b and c) = (a and b) and c - int addition - set union - harmonic sum - Integer mean - Priority queue Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 18. Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 19. Why do we need those algebraic structures ? Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 20. We want to : - Build scalable analytics systems - Leverage distributed computing to perform aggregation on really large data sets. - A lot of operations in analytics are just sorting and counting at the end of the day Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 21. Distributed Computing → Parallellism Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 22. Distributed Computing → Parallellism Associativity → enables parallelism Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 23. Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 24. Distributed Computing → Parallellism Associativity enables parallelism Identity means we can ignore some data Commutativity helps us ignore order Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 25. Typical Map Reduce ... Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 26. Finding Top-K Elements in Scalding ... class TopKJob(args : Args) extends Job (args) { Tsv ( args(‘input’), visitScheme) .filter (. ..) .leftJoinWithTiny ( … ) .filter ( … ) .groupBy( ‘fieldOne) { _.sortWithTake (visitScheme -> top } (biggerSale) .write(Tsv(...) ) } Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 27. .sortWithTake( … ) Looking into .sortWithTake in Scalding, there’s one nice thing : class PiorityQueueMonoid[T] (max : Int) (implicit order : Ordering[T] ) extends Monoid[Priorityqueue[T] ] Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 28. class PiorityQueueMonoid[T] (max : Int) (implicit order : Ordering[T] ) extends Monoid[Priorityqueue[T] ] Let’s take a look : PQ1 : 55, 45, 21, 3 PQ2: 100, 80, 40, 3 top-4 (PQ1 U PQ2 ): 100, 80, 55, 45 Priority Queue : Can be empty Two Priority Queues can be “added” in any order Associative + Commutative Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 29. class PiorityQueueMonoid[T] (max : Int) (implicit order : Ordering[T] ) extends Monoid[Priorityqueue[T] ] Let’s take a look : PQ1 : 55, 45, 21, 3 PQ2: 100, 80, 40, 3 top-4 (PQ1 U PQ2 ): 100, 80, 55, 45 Priority Queue : Makes Scalding go fast, by doing sorting, filtering and extracting in one single “map” step. Can be empty Two Priority Queues can be “added” in any order Associative + Commutative Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 30. Stream Mining Challenges - Update predictions after each observation - Single pass : can’t read old data or replay the stream - Full size of the stream often unknown - Limited time for computation per observation - O(1) memory size Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 31. Stream Mining Challenges http://radar.oreilly.com/2013/10/stream-mining-essentials.html Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 32. Tradeoff : Space and speed over accuracy. Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 33. Tradeoff : Space and speed over accuracy. use sketches. Room 4 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 34. Sketches Probabilistic data structures that store a summary (hashed mostly)of a data set that would be costly to store in its entirety, thus providing most of the time, sublinear algorithmic properties. E.g Bloom Filters, Counter Sketch, KMV counters, Count Min Sketch, HyperLogLog, Min Hashes #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 35. Bloom filters Approximate data structure for set membership Behaves like an approximate set BloomFilter.contains(x) => NO | Maybe P(False Positive) > 0 P(False Negative) = 0 #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 36. Internally : Bit Array of fixed size add(x) : for all element i, b[h(x,i)]=1 contains(x) : TRUE if b[h(x,i)] = = 1 for all i. (Boolean AND => associative) Both are associative => BF can be designed as a Monoid #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 37. Bloom filters import com.twitter.algebird._ import com.twitter.algebird.Operators._ // generate 2 lists val A = (1 to 300).toList // Generate a Bloomfilter val NUM_HASHES = 6 val WIDTH = 6000 // bits val SEED = 1 implicit val bfm = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED) // approximate set with bloomfilter val A_bf = A.map{i => bfm.create(i.toString)}.reduce(_ + _) val approxBool = A_bf.contains(“150”) ---> ApproximateBoolean(true, 0.9995…) #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 38. Count Min Sketch Gives an approximation of the number of occurrences of an element in a set. #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 39. Count Min Sketch Count min sketch Adding an element is a numerical addition Querying uses a MIN function. Both are associative. useful for detecting heavy hitters, topK, LSH We have in Algebird : #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 40. HyperLogLog Popular sketch for cardinality estimtion. Gives within a probilistic distribution of an error the number of distinct values in a data set. HLL.size = Approx[Number] Intuition Long runs of trailings 0 in a random bits chain are rare But the more bit chains you look at, the more likely you are to find a long one The longest run of trailing 0-bits seen can be an estimator of the number of unique bit chains observed. #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 41. Adding an element uses a Max and Sum function. Both are associative and Monoids. (Max is an ordered semigroup in Algebird really) Querying for an element uses an harmonic mean which is a Monoid. In Algebird : #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 42. Many More juicy sketches ... - MinHashes to compute Jaccard similarity - QTree for quantiles estimation. Neat for anomaly detection. - SpaceSaverMonoid, Awesome to find the approximate most frequent and top K elements. - TopKMonoid - SGD, PriorityQueues, Histograms, etc. #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 43. SummingBird : Lamba in a box #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 44. Heard of Lambda Architecture ? #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 45. SummingBird Same code for both batch and real time processing. #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 46. SummingBird Same code, for both batch and real time processing. But works only on Monoids. Uses Storehaus, as a mergeable store layer. #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 47. http://github.com/twitter/algebird #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 48. http://github.com/twitter/algebird #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 49. #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr These slides : http://bit.ly/1szncAZ http://slidesha.re/1zhhXKU
  • 50. #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr
  • 51. Links -Algebra for analytics by Oscar Boykin (Creator of Algebird) http://speakerdeck.com/johnynek/algebra-for-analytics - Take a look into HLearn https://github.com/mikeizbicki/HLearn - Great intro into Algebird by Michael Noll http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad- for-large-scala-data-analytics/ -Aggregate Knowledge http://research.neustar.biz/2012/10/25/sketch-of- the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure - Probabilistic data structures for web analytics. http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures- web-analytics-data-mining/ - http://debasishg.blogspot.fr/2014/01/count-min-sketch-data-structure- for.html - http://infolab.stanford.edu/~ullman/mmds/ch3.pdf #Devoxx #algebird #scalding #monoid #hadoop #spark @samklr