Distributed Decision Tree Learning for Mining Big Data Streams
Master Thesis presentation by:
Arinto Murdopo
EMDC
arinto@yahoo-inc.com
Supervisors:
Albert Bifet
Gianmarco de Francisci Morales
Ricard Gavaldà
Big Data
200 million users
400 million tweets/day
1+ TB/day to Hadoop
2.7 TB/day follower update
4.5 billion likes/day
350 million photos/day
Volume
Velocity
Variety
May 2013
March 2013
May 2013
Machine Learning (ML)
Make sense of the data, but how?
Machine Learning = learn & adapt based on data
Due to the 3Vs, we should:
1. Distribute, to scale
2. Stream, to be fast
3. Distribute and stream, to be both scalable and fast
Are We Satisfied?
[Figure: frameworks labeled by the properties they provide (scale, fast, loose-coupling)]
We want machine learning frameworks that are scalable, fast, and loosely coupled
SAMOA
Scalable Advanced Massive Online Analysis
Distributed Streaming Machine Learning Framework:
• Fast, using a streaming model
• Scalable, running on top of distributed stream processing engines (SPEs): Storm and S4
• Loose coupling between ML algorithms and SPEs
Contributions
SAMOA
• Architecture and Abstractions
• Stream Processing Engine Adapter
• Integration with Storm
Vertical Hoeffding Tree
• Higher throughput than MOA for a high number of attributes
SAMOA Architecture
[Figure: SAMOA runs on top of SPEs (Storm, S4, other SPEs) and hosts Classification Methods, Clustering Methods, and Frequent Pattern Mining]
SAMOA Abstractions
To develop distributed ML algorithms
• Topology
• Processor
• Processing Item (PI)
• Entrance Processing Item (EPI)
• Stream
• Content Events
• Grouping
• Parallelism Hint
• External Event Source
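The abstractions listed on this slide can be pictured with a small in-memory sketch. This is not the SAMOA API: the names `ContentEvent`, `Processor`, `Stream`, and `key_grouping` are hypothetical stand-ins, and the parallelism hint is modeled simply as the number of processor copies behind a stream.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class ContentEvent:
    """Hypothetical unit of data that flows through a stream."""
    key: str
    payload: Any


class Processor:
    """Hypothetical user logic; a PI would wrap one copy of it."""

    def __init__(self) -> None:
        self.seen: List[ContentEvent] = []

    def process(self, event: ContentEvent) -> None:
        self.seen.append(event)


def key_grouping(event: ContentEvent, parallelism: int) -> int:
    """Route events that share a key to the same processor copy."""
    return sum(event.key.encode()) % parallelism


class Stream:
    """Hypothetical stream feeding `parallelism` copies of a processor."""

    def __init__(self, parallelism: int,
                 grouping: Callable[[ContentEvent, int], int]) -> None:
        self.copies = [Processor() for _ in range(parallelism)]
        self.grouping = grouping

    def put(self, event: ContentEvent) -> None:
        index = self.grouping(event, len(self.copies))
        self.copies[index].process(event)


# Parallelism hint of 3; events with key "a" always reach the same copy.
stream = Stream(parallelism=3, grouping=key_grouping)
for key in ["a", "b", "a", "c", "a"]:
    stream.put(ContentEvent(key=key, payload=None))
counts = [len(copy.seen) for copy in stream.copies]
```

The grouping function is the only place that knows how events are distributed, so the processor logic stays independent of the degree of parallelism.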
SAMOA SPE-adapter
• Transforms the abstractions into SPE-specific runtime components
• Abstract factory pattern to decouple API and SPE
• Platform developers need to provide
  1. PI and EPI
  2. Stream
  3. Grouping
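A minimal sketch of the abstract factory idea, assuming a hypothetical `ComponentFactory` interface with one creation method per component kind. The two concrete factories are stand-ins that only return descriptive strings; they call no real Storm or S4 API, and none of these names come from SAMOA itself.

```python
from abc import ABC, abstractmethod
from typing import List


class ComponentFactory(ABC):
    """Hypothetical interface a platform developer would implement."""

    @abstractmethod
    def create_epi(self, name: str) -> str:
        """Create an entrance processing item."""

    @abstractmethod
    def create_pi(self, name: str) -> str:
        """Create a processing item."""

    @abstractmethod
    def create_stream(self, source: str, grouping: str) -> str:
        """Create a stream leaving `source` with the given grouping."""


class StormLikeFactory(ComponentFactory):
    """Stand-in for a Storm-backed adapter."""

    def create_epi(self, name: str) -> str:
        return f"spout:{name}"

    def create_pi(self, name: str) -> str:
        return f"bolt:{name}"

    def create_stream(self, source: str, grouping: str) -> str:
        return f"storm-stream:{source}:{grouping}"


class S4LikeFactory(ComponentFactory):
    """Stand-in for an S4-backed adapter."""

    def create_epi(self, name: str) -> str:
        return f"s4-epi:{name}"

    def create_pi(self, name: str) -> str:
        return f"s4-pe:{name}"

    def create_stream(self, source: str, grouping: str) -> str:
        return f"s4-stream:{source}:{grouping}"


def build_topology(factory: ComponentFactory) -> List[str]:
    """Algorithm-side code: it only sees the abstract factory."""
    return [
        factory.create_epi("source"),
        factory.create_pi("learner"),
        factory.create_stream("source", "key"),
    ]


storm_components = build_topology(StormLikeFactory())
s4_components = build_topology(S4LikeFactory())
```

`build_topology` is written once and runs unchanged against either factory, which is the decoupling between API and SPE that the slide refers to.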
SAMOA SPE-adapter
Examples of SPE-specific runtime components from the SPE-adapter
[Figure annotation: focus of this thesis]
Storm
• Distributed Streaming Processing Engine
• MapReduce-like programming model
[Figure: a Storm topology is a DAG of spouts (S1, S2) and bolts (B1 to B5) connected by streams of tuples (stream A, stream B); useful information is stored in data storage]
Key terms: Stream, Spout, Bolt, DAG, Tuples
SAMOA-Storm Integration
Mapping between Storm and SAMOA
1. Spout → Entrance Processing Item (EPI)
2. Bolt → Processing Item (PI)
   • Use composition for EPI and PI
3. Bolt Stream & Spout Stream → Stream
   • Storm pull model
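"Use composition" can be read as: the Storm-side processing item holds an engine component and a platform-independent processor, rather than inheriting from either. The sketch below illustrates that reading only. `BoltLike` is a stand-in with a single `execute` method, not Storm's bolt class, and `StormLikePI` and `Processor` are hypothetical names.

```python
class Processor:
    """Hypothetical platform-independent logic: sums the events it sees."""

    def __init__(self) -> None:
        self.total = 0

    def process(self, event: int) -> None:
        self.total += event


class BoltLike:
    """Stand-in for an engine component; no real Storm class is used."""

    def __init__(self, on_tuple) -> None:
        self.on_tuple = on_tuple

    def execute(self, tup: dict) -> None:
        # The engine would call this for every incoming tuple.
        self.on_tuple(tup)


class StormLikePI:
    """Processing item that contains a bolt and a processor (composition)."""

    def __init__(self, processor: Processor) -> None:
        self.processor = processor
        self.bolt = BoltLike(on_tuple=self._handle)

    def _handle(self, tup: dict) -> None:
        # Unwrap the engine-specific tuple and hand the event to the processor.
        self.processor.process(tup["event"])


pi = StormLikePI(Processor())
for value in [1, 2, 3]:
    pi.bolt.execute({"event": value})
total = pi.processor.total
```

Because the processor never sees the tuple format, the same processor could be wrapped by an adapter for a different engine.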
Contributions so far
[Figure: SAMOA layers. Algorithm and API; ML-adapter (MOA, other ML frameworks); SPE-adapter with samoa-SPE modules (samoa-S4, samoa-storm, samoa-other-SPEs) for S4, Storm, and other SPEs]
Design goals: Flexibility, Scalability, Extensibility
Next Contribution…
Distributed Algorithm implementation:
Vertical Hoeffding Tree
Decision tree:
• Classification
• Divide and conquer
• Easy to interpret
Sample Dataset
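ID Code | Outlook  | Temperature | Humidity | Windy | Play
a       | sunny    | hot         | high     | false | no
b       | sunny    | hot         | high     | true  | no
c       | overcast | hot         | high     | false | yes
d       | rainy    | mild        | high     | false | yes
…       | …        | …           | …        | …     | …
Outlook, Temperature, Humidity, and Windy are attributes; Play is the class.
Each row is a datum (an instance) used to build the tree.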
Decision Tree
outlook (root)
• sunny → humidity (split node)
  • high → N (leaf node)
  • normal → Y (leaf node)
• overcast → Y (leaf node)
• rainy → windy (split node)
  • true → N (leaf node)
  • false → Y (leaf node)
Very Fast Decision Tree (VFDT)
• Pioneering decision tree algorithm for streaming data
• Information Gain + Gain Ratio + Hoeffding bound
• The Hoeffding bound decides whether the difference in information gain between the two best attributes is large enough to split
• Often called Hoeffding Tree
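The split test can be written down compactly. For a quantity with range R observed n times, the Hoeffding bound is ε = sqrt(R² · ln(1/δ) / (2n)); a leaf splits when the gap in information gain between the best and second-best attribute exceeds ε. The sketch below assumes information gain as the split measure (so R = log2 of the number of classes); the values of δ, n, and the gains are made up for illustration.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """epsilon such that the true mean is within epsilon with prob. 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 num_classes: int, delta: float, n: int) -> bool:
    """Split when the observed gain gap is larger than the Hoeffding bound."""
    value_range = math.log2(num_classes)  # range of information gain
    epsilon = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > epsilon


# Illustrative numbers only: two classes, delta = 1e-7, gain gap of 0.10.
early = should_split(0.30, 0.20, num_classes=2, delta=1e-7, n=200)
later = should_split(0.30, 0.20, num_classes=2, delta=1e-7, n=5000)
print(early, later)
```

With few instances the bound is wide and the leaf keeps collecting statistics; as n grows the bound shrinks and the same gain gap becomes sufficient to split.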
Distributed Decision Tree
Types of parallelism
• Horizontal
  • Partition the data by the instance
• Vertical
  • Partition the data by the attribute
• Task
  • Tree leaf nodes grow in parallel
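The difference between horizontal and vertical partitioning can be shown on two rows of the sample dataset. This is an illustrative sketch: the round-robin assignment and the function names are assumptions, and the class label is left out of the slices for brevity.

```python
from typing import Dict, List

Instance = Dict[str, str]


def horizontal_partition(instances: List[Instance],
                         workers: int) -> List[List[Instance]]:
    """Each worker receives whole instances (round-robin by position)."""
    parts: List[List[Instance]] = [[] for _ in range(workers)]
    for index, instance in enumerate(instances):
        parts[index % workers].append(instance)
    return parts


def vertical_partition(instances: List[Instance],
                       workers: int) -> List[List[Instance]]:
    """Each worker receives a subset of attributes of every instance."""
    parts: List[List[Instance]] = [[] for _ in range(workers)]
    for instance in instances:
        for worker in range(workers):
            attribute_slice = {
                name: value
                for position, (name, value) in enumerate(instance.items())
                if position % workers == worker
            }
            parts[worker].append(attribute_slice)
    return parts


rows = [
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "false"},
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "true"},
]
by_instance = horizontal_partition(rows, workers=2)
by_attribute = vertical_partition(rows, workers=2)
```

Under horizontal partitioning each worker sees one full row; under vertical partitioning every worker sees both rows but only two of the four attributes.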
MOA Hoeffding Tree Profiling
CPU time breakdown, 200 attributes:
• Learn: 70%
• Split: 24%
• Other: 6%
Vertical Hoeffding Tree
[Figure: VHT topology]
Processing items: source PI, model-aggregator PI, local-statistic PI, evaluator PI
Streams: source, attribute, control, local-result, result
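A single-process sketch of how the PIs and streams named on this slide might cooperate. It assumes that the model-aggregator PI forwards each attribute value over the attribute stream and that local-statistic PIs answer a control message with a local-result. Routing by attribute index, all class and method names, and the use of plain information gain are illustrative assumptions, not the thesis implementation. The four rows are instances a to d from the sample dataset slide.

```python
import math
from collections import defaultdict


def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def info_gain(table):
    """table maps (attribute value, class label) -> count."""
    labels = defaultdict(int)
    by_value = defaultdict(lambda: defaultdict(int))
    for (value, label), count in table.items():
        labels[label] += count
        by_value[value][label] += count
    total = sum(labels.values())
    remainder = sum(
        sum(d.values()) / total * entropy(list(d.values()))
        for d in by_value.values()
    )
    return entropy(list(labels.values())) - remainder


class LocalStatisticPI:
    """Keeps class counts for the attributes routed to it."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def on_attribute(self, leaf, attribute, value, label):
        self.counts[(leaf, attribute)][(value, label)] += 1

    def on_control(self, leaf):
        """Reply with a local-result: (gain, attribute) pairs for the leaf."""
        return [
            (info_gain(table), attribute)
            for (leaf_id, attribute), table in self.counts.items()
            if leaf_id == leaf
        ]


class ModelAggregatorPI:
    """Splits instances into attribute messages and merges local results."""

    def __init__(self, attributes, parallelism):
        self.attributes = attributes
        self.workers = [LocalStatisticPI() for _ in range(parallelism)]

    def on_instance(self, leaf, instance, label):
        for index, attribute in enumerate(self.attributes):
            worker = self.workers[index % len(self.workers)]
            worker.on_attribute(leaf, attribute, instance[attribute], label)

    def best_attribute(self, leaf):
        results = [r for w in self.workers for r in w.on_control(leaf)]
        return max(results)


attributes = ["outlook", "temperature", "humidity", "windy"]
rows = [
    (("sunny", "hot", "high", "false"), "no"),
    (("sunny", "hot", "high", "true"), "no"),
    (("overcast", "hot", "high", "false"), "yes"),
    (("rainy", "mild", "high", "false"), "yes"),
]
aggregator = ModelAggregatorPI(attributes, parallelism=2)
for values, label in rows:
    aggregator.on_instance(leaf=0, instance=dict(zip(attributes, values)), label=label)
best_gain, best_attr = aggregator.best_attribute(leaf=0)
```

Each local-statistic PI only ever stores statistics for its own attributes, which is why this layout is expected to help most when the number of attributes is large.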
Evaluation
Metrics:
• Accuracy
• Throughput
Input data:
• Random Tree Generator
• Text Generator – resembles tweets
Cluster: 3 shared nodes; 48 GB of RAM; Intel Xeon CPU E5620 @ 2.4 GHz, 16 processors; Linux kernel 2.6.18
VHT Iteration 1 (VHT1)
• Goal: verify algorithm correctness (same accuracy as MOA)
• Used 2 internal queues: instances queue and local-result queue
• Achieved the same accuracy, but throughput was low, so we proceeded with VHT2
VHT Iteration 2 (VHT2)
Goal: improve VHT1 throughput
• Kryo serializer: 2.5x throughput improvement
• long identifier instead of String
• Removed the 2 internal queues of VHT1 → instances are discarded while attempting to split
tree-10
Around 8.2% difference in accuracy
tree-100
Same trend as tree-10 (7.9% difference in accuracy)
No. Leaf Nodes VHT2 – tree-100
Very close and very high accuracy
Accuracy VHT2 – text-1000
Low accuracy when the number of attributes increased
Throughput VHT2 – tree-generator
Not good for dense instances and a low number of attributes
Throughput VHT2 – text-generator
Higher throughput than MHT
Profiling Results for text-1000 with 1000000 instances
[Figure: bar chart of execution time (seconds) per classifier (VHT2-par-3, MHT), broken down into t_calc, t_comm, and t_serial]
Minimizing t_comm will increase throughput
Profiling Results for text-10000 with 100000 instances
[Figure: bar chart of execution time (seconds) per classifier (VHT2-par-3, MHT), broken down into t_calc, t_comm, and t_serial]
Throughput
• VHT2-par-3: 2631 inst/sec
• MHT: 507 inst/sec
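The two throughput figures translate directly into a speedup and into approximate run times for this experiment. The short calculation below uses only the numbers on this slide and assumes that throughput is an average over the whole run.

```python
vht_throughput = 2631  # inst/sec, VHT2-par-3 (from the slide)
mht_throughput = 507   # inst/sec, MHT (from the slide)
instances = 100_000    # size of the text-10000 run (from the slide)

speedup = vht_throughput / mht_throughput
vht_seconds = instances / vht_throughput
mht_seconds = instances / mht_throughput

print(f"speedup: {speedup:.2f}x")
print(f"VHT2-par-3: {vht_seconds:.1f} s, MHT: {mht_seconds:.1f} s")
```

Under that assumption VHT2-par-3 is roughly 5.2 times faster than MHT on this dataset, about 38 seconds against about 197 seconds.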
Future Work
• Open Source
• Evaluation layer in SAMOA architecture
• Online classification algorithms that are
based on horizontal parallelism
Conclusions
Mining big data streams is challenging
• Systems need to satisfy the 3Vs of big data
SAMOA – Distributed Streaming ML Framework
• Architecture and Abstractions
• Stream Processing Engine (SPE) adapter
• SAMOA Integration with Storm
Vertical Hoeffding Tree
• Higher throughput than MOA for a high number of attributes

Weitere ähnliche Inhalte

Was ist angesagt?

Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisVincenzo Gulisano
 
Mining data streams
Mining data streamsMining data streams
Mining data streamsAkash Gupta
 
Handling Selfishness in Replica Allocation over a Mobile Ad-Hoc Network
Handling Selfishness in Replica Allocation over a Mobile Ad-Hoc NetworkHandling Selfishness in Replica Allocation over a Mobile Ad-Hoc Network
Handling Selfishness in Replica Allocation over a Mobile Ad-Hoc NetworkIJCERT
 
Mining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDTMining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDTDavide Gallitelli
 
Mahoney mlconf-nov13
Mahoney mlconf-nov13Mahoney mlconf-nov13
Mahoney mlconf-nov13MLconf
 
Josh Patterson MLconf slides
Josh Patterson MLconf slidesJosh Patterson MLconf slides
Josh Patterson MLconf slidesMLconf
 
Cloud-based Data Stream Processing
Cloud-based Data Stream ProcessingCloud-based Data Stream Processing
Cloud-based Data Stream ProcessingZbigniew Jerzak
 
Predicting rainfall using ensemble of ensembles
Predicting rainfall using ensemble of ensemblesPredicting rainfall using ensemble of ensembles
Predicting rainfall using ensemble of ensemblesVarad Meru
 
Elag 2012 - Under the hood of 3TU.Datacentrum.
Elag 2012 - Under the hood of 3TU.Datacentrum.Elag 2012 - Under the hood of 3TU.Datacentrum.
Elag 2012 - Under the hood of 3TU.Datacentrum.Egbert Gramsbergen
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming AlgorithmsJoe Kelley
 
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLParikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLMLconf
 
IJSETR-VOL-3-ISSUE-12-3358-3363
IJSETR-VOL-3-ISSUE-12-3358-3363IJSETR-VOL-3-ISSUE-12-3358-3363
IJSETR-VOL-3-ISSUE-12-3358-3363SHIVA REDDY
 
The study on mining temporal patterns and related applications in dynamic soc...
The study on mining temporal patterns and related applications in dynamic soc...The study on mining temporal patterns and related applications in dynamic soc...
The study on mining temporal patterns and related applications in dynamic soc...Thanh Hieu
 
Temporal Pattern Mining
Temporal Pattern MiningTemporal Pattern Mining
Temporal Pattern MiningPrakhar Dhama
 
ACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics PatternsACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics PatternsSrinath Perera
 
Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxMalla Reddy University
 

Was ist angesagt? (20)

Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data Analysis
 
Mining data streams
Mining data streamsMining data streams
Mining data streams
 
Handling Selfishness in Replica Allocation over a Mobile Ad-Hoc Network
Handling Selfishness in Replica Allocation over a Mobile Ad-Hoc NetworkHandling Selfishness in Replica Allocation over a Mobile Ad-Hoc Network
Handling Selfishness in Replica Allocation over a Mobile Ad-Hoc Network
 
Mining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDTMining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDT
 
Mahoney mlconf-nov13
Mahoney mlconf-nov13Mahoney mlconf-nov13
Mahoney mlconf-nov13
 
Josh Patterson MLconf slides
Josh Patterson MLconf slidesJosh Patterson MLconf slides
Josh Patterson MLconf slides
 
Cloud-based Data Stream Processing
Cloud-based Data Stream ProcessingCloud-based Data Stream Processing
Cloud-based Data Stream Processing
 
Predicting rainfall using ensemble of ensembles
Predicting rainfall using ensemble of ensemblesPredicting rainfall using ensemble of ensembles
Predicting rainfall using ensemble of ensembles
 
Elag 2012 - Under the hood of 3TU.Datacentrum.
Elag 2012 - Under the hood of 3TU.Datacentrum.Elag 2012 - Under the hood of 3TU.Datacentrum.
Elag 2012 - Under the hood of 3TU.Datacentrum.
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming Algorithms
 
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLParikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
 
Temporal data mining
Temporal data miningTemporal data mining
Temporal data mining
 
IJSETR-VOL-3-ISSUE-12-3358-3363
IJSETR-VOL-3-ISSUE-12-3358-3363IJSETR-VOL-3-ISSUE-12-3358-3363
IJSETR-VOL-3-ISSUE-12-3358-3363
 
The study on mining temporal patterns and related applications in dynamic soc...
The study on mining temporal patterns and related applications in dynamic soc...The study on mining temporal patterns and related applications in dynamic soc...
The study on mining temporal patterns and related applications in dynamic soc...
 
Os
OsOs
Os
 
Spark
SparkSpark
Spark
 
Temporal Pattern Mining
Temporal Pattern MiningTemporal Pattern Mining
Temporal Pattern Mining
 
useR 2014 jskim
useR 2014 jskimuseR 2014 jskim
useR 2014 jskim
 
ACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics PatternsACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics Patterns
 
Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptx
 

Andere mochten auch

New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...Revolution Analytics
 
Data mining & big data presentation 01
Data mining & big data presentation 01Data mining & big data presentation 01
Data mining & big data presentation 01Aseem Chakrabarthy
 
Logistic Regression, Linear and Quadratic Discriminant Analyses, and KNN
Logistic Regression, Linear and Quadratic Discriminant Analyses, and KNN Logistic Regression, Linear and Quadratic Discriminant Analyses, and KNN
Logistic Regression, Linear and Quadratic Discriminant Analyses, and KNN Tarek Dib
 
Final thesis_Knowledge Discovery from Academic Data using Association Rule Mi...
Final thesis_Knowledge Discovery from Academic Data using Association Rule Mi...Final thesis_Knowledge Discovery from Academic Data using Association Rule Mi...
Final thesis_Knowledge Discovery from Academic Data using Association Rule Mi...shibbirtanvin
 
Moodboards eda
Moodboards edaMoodboards eda
Moodboards edaedaozdemir
 
Cultura mites
Cultura mitesCultura mites
Cultura mitesComalat1D
 
153 test plan
153 test plan153 test plan
153 test plan< <
 
Queens Parh Rangers AD410 น.ส.ฐิติมา ประเสริฐชัย เลขที่8
Queens Parh Rangers AD410 น.ส.ฐิติมา  ประเสริฐชัย เลขที่8Queens Parh Rangers AD410 น.ส.ฐิติมา  ประเสริฐชัย เลขที่8
Queens Parh Rangers AD410 น.ส.ฐิติมา ประเสริฐชัย เลขที่8yaying-yingg
 
Practica 2 luis ivan cruz val.
Practica 2 luis ivan cruz val.Practica 2 luis ivan cruz val.
Practica 2 luis ivan cruz val.persi-10
 
An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...Arinto Murdopo
 
The counting system for small animals in japanese
The counting system for small animals in japaneseThe counting system for small animals in japanese
The counting system for small animals in japaneseCheyanneStotlar
 
Quantum Cryptography and Possible Attacks
Quantum Cryptography and Possible AttacksQuantum Cryptography and Possible Attacks
Quantum Cryptography and Possible AttacksArinto Murdopo
 
Distributed Computing - What, why, how..
Distributed Computing - What, why, how..Distributed Computing - What, why, how..
Distributed Computing - What, why, how..Arinto Murdopo
 
Architecting a Cloud-Scale Identity Fabric
Architecting a Cloud-Scale Identity FabricArchitecting a Cloud-Scale Identity Fabric
Architecting a Cloud-Scale Identity FabricArinto Murdopo
 
Intelligent Placement of Datacenter for Internet Services
Intelligent Placement of Datacenter for Internet Services Intelligent Placement of Datacenter for Internet Services
Intelligent Placement of Datacenter for Internet Services Arinto Murdopo
 
Cultura mites
Cultura mitesCultura mites
Cultura mitesComalat1D
 
how to say foods and drinks in japanese
how to say foods and drinks in japanesehow to say foods and drinks in japanese
how to say foods and drinks in japaneseCheyanneStotlar
 

Andere mochten auch (20)

Decision Trees
Decision TreesDecision Trees
Decision Trees
 
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
 
смирнов Data mining
смирнов Data miningсмирнов Data mining
смирнов Data mining
 
Data mining & big data presentation 01
Data mining & big data presentation 01Data mining & big data presentation 01
Data mining & big data presentation 01
 
Logistic Regression, Linear and Quadratic Discriminant Analyses, and KNN
Logistic Regression, Linear and Quadratic Discriminant Analyses, and KNN Logistic Regression, Linear and Quadratic Discriminant Analyses, and KNN
Logistic Regression, Linear and Quadratic Discriminant Analyses, and KNN
 
Final thesis_Knowledge Discovery from Academic Data using Association Rule Mi...
Final thesis_Knowledge Discovery from Academic Data using Association Rule Mi...Final thesis_Knowledge Discovery from Academic Data using Association Rule Mi...
Final thesis_Knowledge Discovery from Academic Data using Association Rule Mi...
 
 
Moodboards eda
Moodboards edaMoodboards eda
Moodboards eda
 
Cultura mites
Cultura mitesCultura mites
Cultura mites
 
153 test plan
153 test plan153 test plan
153 test plan
 
Queens Parh Rangers AD410 น.ส.ฐิติมา ประเสริฐชัย เลขที่8
Queens Parh Rangers AD410 น.ส.ฐิติมา  ประเสริฐชัย เลขที่8Queens Parh Rangers AD410 น.ส.ฐิติมา  ประเสริฐชัย เลขที่8
Queens Parh Rangers AD410 น.ส.ฐิติมา ประเสริฐชัย เลขที่8
 
Practica 2 luis ivan cruz val.
Practica 2 luis ivan cruz val.Practica 2 luis ivan cruz val.
Practica 2 luis ivan cruz val.
 
An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...
 
The counting system for small animals in japanese
The counting system for small animals in japaneseThe counting system for small animals in japanese
The counting system for small animals in japanese
 
Quantum Cryptography and Possible Attacks
Quantum Cryptography and Possible AttacksQuantum Cryptography and Possible Attacks
Quantum Cryptography and Possible Attacks
 
Distributed Computing - What, why, how..
Distributed Computing - What, why, how..Distributed Computing - What, why, how..
Distributed Computing - What, why, how..
 
Architecting a Cloud-Scale Identity Fabric
Architecting a Cloud-Scale Identity FabricArchitecting a Cloud-Scale Identity Fabric
Architecting a Cloud-Scale Identity Fabric
 
Intelligent Placement of Datacenter for Internet Services
Intelligent Placement of Datacenter for Internet Services Intelligent Placement of Datacenter for Internet Services
Intelligent Placement of Datacenter for Internet Services
 
Cultura mites
Cultura mitesCultura mites
Cultura mites
 
how to say foods and drinks in japanese
how to say foods and drinks in japanesehow to say foods and drinks in japanese
how to say foods and drinks in japanese
 

Ähnlich wie Distributed Decision Tree Learning for Mining Big Data Streams

Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaDataStax Academy
 
Scalability20140226
Scalability20140226Scalability20140226
Scalability20140226Nick Kypreos
 
Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Vincenzo Gulisano
 
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi and Eri...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi and Eri...How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi and Eri...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi and Eri...confluent
 
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...confluent
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedTuri, Inc.
 
Huawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingHuawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingJen Aman
 
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2Tyrone Systems
 
Automating the Hunt for Non-Obvious Sources of Latency Spreads
Automating the Hunt for Non-Obvious Sources of Latency SpreadsAutomating the Hunt for Non-Obvious Sources of Latency Spreads
Automating the Hunt for Non-Obvious Sources of Latency SpreadsScyllaDB
 
Distributed Database Consistency: Architectural Considerations and Tradeoffs
Distributed Database Consistency: Architectural Considerations and TradeoffsDistributed Database Consistency: Architectural Considerations and Tradeoffs
Distributed Database Consistency: Architectural Considerations and TradeoffsScyllaDB
 
Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016Nisha Talagala
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architectureinside-BigData.com
 
Intel’S Larrabee
Intel’S LarrabeeIntel’S Larrabee
Intel’S Larrabeevipinpnair
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the Worldjhugg
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesDavid Martínez Rego
 
Big Data presentation at GITPRO 2013
Big Data presentation at GITPRO 2013Big Data presentation at GITPRO 2013
Big Data presentation at GITPRO 2013Sameer Wadkar
 
Exascale Deep Learning for Climate Analytics
Exascale Deep Learning for Climate AnalyticsExascale Deep Learning for Climate Analytics
Exascale Deep Learning for Climate Analyticsinside-BigData.com
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2Mohit Garg
 

Ähnlich wie Distributed Decision Tree Learning for Mining Big Data Streams (20)

Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in China
 
Scalability20140226
Scalability20140226Scalability20140226
Scalability20140226
 
Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)
 
Pdc lecture1
Pdc lecture1Pdc lecture1
Pdc lecture1
 
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi and Eri...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi and Eri...How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi and Eri...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi and Eri...
 
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and Distributed
 
Huawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingHuawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark Streaming
 
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
Automating the Hunt for Non-Obvious Sources of Latency Spreads
Automating the Hunt for Non-Obvious Sources of Latency SpreadsAutomating the Hunt for Non-Obvious Sources of Latency Spreads
Automating the Hunt for Non-Obvious Sources of Latency Spreads
 
Distributed Database Consistency: Architectural Considerations and Tradeoffs
Distributed Database Consistency: Architectural Considerations and TradeoffsDistributed Database Consistency: Architectural Considerations and Tradeoffs
Distributed Database Consistency: Architectural Considerations and Tradeoffs
 
Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architecture
 
Intel’S Larrabee
Intel’S LarrabeeIntel’S Larrabee
Intel’S Larrabee
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the World
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming Architectures
 
Big Data presentation at GITPRO 2013
Big Data presentation at GITPRO 2013Big Data presentation at GITPRO 2013
Big Data presentation at GITPRO 2013
 
Exascale Deep Learning for Climate Analytics
Exascale Deep Learning for Climate AnalyticsExascale Deep Learning for Climate Analytics
Exascale Deep Learning for Climate Analytics
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2
 

Mehr von Arinto Murdopo

Next Generation Hadoop: High Availability for YARN
Next Generation Hadoop: High Availability for YARN Next Generation Hadoop: High Availability for YARN
Next Generation Hadoop: High Availability for YARN Arinto Murdopo
 
High Availability in YARN
High Availability in YARNHigh Availability in YARN
High Availability in YARNArinto Murdopo
 
An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...Arinto Murdopo
 
Quantum Cryptography and Possible Attacks-slide
Quantum Cryptography and Possible Attacks-slideQuantum Cryptography and Possible Attacks-slide
Quantum Cryptography and Possible Attacks-slideArinto Murdopo
 
Parallelization of Smith-Waterman Algorithm using MPI
Parallelization of Smith-Waterman Algorithm using MPIParallelization of Smith-Waterman Algorithm using MPI
Parallelization of Smith-Waterman Algorithm using MPIArinto Murdopo
 
Megastore - ID2220 Presentation
Megastore - ID2220 PresentationMegastore - ID2220 Presentation
Megastore - ID2220 PresentationArinto Murdopo
 
Flume Event Scalability
Flume Event ScalabilityFlume Event Scalability
Flume Event ScalabilityArinto Murdopo
 
Large Scale Distributed Storage Systems in Volunteer Computing - Slide
Large Scale Distributed Storage Systems in Volunteer Computing - SlideLarge Scale Distributed Storage Systems in Volunteer Computing - Slide
Large Scale Distributed Storage Systems in Volunteer Computing - SlideArinto Murdopo
 
Large-Scale Decentralized Storage Systems for Volunter Computing Systems
Large-Scale Decentralized Storage Systems for Volunter Computing SystemsLarge-Scale Decentralized Storage Systems for Volunter Computing Systems
Large-Scale Decentralized Storage Systems for Volunter Computing SystemsArinto Murdopo
 
Rise of Network Virtualization
Rise of Network VirtualizationRise of Network Virtualization
Rise of Network VirtualizationArinto Murdopo
 
Consistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignConsistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignArinto Murdopo
 
Distributed Storage System for Volunteer Computing
Distributed Storage System for Volunteer ComputingDistributed Storage System for Volunteer Computing
Distributed Storage System for Volunteer ComputingArinto Murdopo
 
Why File Sharing is Dangerous?
Why File Sharing is Dangerous?Why File Sharing is Dangerous?
Why File Sharing is Dangerous?Arinto Murdopo
 
Why Use “REST” Architecture for Web Services?
Why Use “REST” Architecture for Web Services?Why Use “REST” Architecture for Web Services?
Why Use “REST” Architecture for Web Services?Arinto Murdopo
 

Mehr von Arinto Murdopo (17)

Next Generation Hadoop: High Availability for YARN
Next Generation Hadoop: High Availability for YARN Next Generation Hadoop: High Availability for YARN
Next Generation Hadoop: High Availability for YARN
 
High Availability in YARN
High Availability in YARNHigh Availability in YARN
High Availability in YARN
 
An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...
 
Quantum Cryptography and Possible Attacks-slide
Quantum Cryptography and Possible Attacks-slideQuantum Cryptography and Possible Attacks-slide
Quantum Cryptography and Possible Attacks-slide
 
Distributed Decision Tree Learning for Mining Big Data Streams

  • 1. Distributed Decision Tree Learning for Mining Big Data Streams. Master Thesis presentation by: Arinto Murdopo (EMDC, arinto@yahoo-inc.com). Supervisors: Albert Bifet, Gianmarco de Francisci Morales, Ricard Gavaldà.
  • 2. Big Data. Examples: 200 million users, 400 million tweets/day; 1+ TB/day to Hadoop, 2.7 TB/day follower update; 4.5 billion likes/day, 350 million photos/day. The 3Vs: Volume, Velocity, Variety. (Figures dated May 2013, March 2013, and May 2013.)
  • 3. Machine Learning (ML). Make sense of the data, but how? Machine learning = learn and adapt based on data. Due to the 3Vs, we should: 1. Distribute, to scale. 2. Stream, to be fast. 3. Distribute and stream, to both scale and be fast.
  • 4. Are We Satisfied? [Figure labels: scale, fast, loose-coupling.] We want machine learning frameworks that scale, are fast, and are loosely coupled.
  • 5. SAMOA: Scalable Advanced Massive Online Analysis. A distributed streaming machine learning framework that is: • Fast, using a streaming model • Scalable, running on top of distributed stream processing engines (SPEs) such as Storm and S4 • Loosely coupled, separating ML algorithms from SPEs
  • 6. Contributions. SAMOA: • Architecture and abstractions • Stream Processing Engine adapter • Integration with Storm. Vertical Hoeffding Tree: • Better than MOA for a high number of attributes
  • 7. SAMOA Architecture. [Figure labels: Classification Methods, Clustering Methods, Frequent Pattern Mining; SAMOA; S4, Storm, Other SPEs.]
  • 8. SAMOA Abstractions, used to develop distributed ML algorithms. [Figure labels: Topology, Processor, PI, EPI, External Event Source, Stream, Content Events, Grouping, Parallelism Hint.]
  • 9. SAMOA SPE-adapter • Transforms the abstractions into SPE-specific runtime components • Uses the abstract factory pattern to decouple the API from the SPE • Platform developers need to provide: 1. PI and EPI 2. Stream 3. Grouping
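The abstract factory idea on slide 9 can be illustrated with a minimal sketch. This is not SAMOA's actual API: `Stream`, `ComponentFactory`, `LocalStream`, `LocalFactory`, and `build_topology` are hypothetical names chosen for illustration, and the in-memory "engine" stands in for a real SPE such as Storm or S4.

```python
from abc import ABC, abstractmethod


class Stream(ABC):
    """Hypothetical stream abstraction that carries content events."""

    @abstractmethod
    def put(self, event):
        ...


class ComponentFactory(ABC):
    """Hypothetical abstract factory; algorithm code depends only on this."""

    @abstractmethod
    def create_stream(self, name):
        ...


class LocalStream(Stream):
    """Toy in-memory stream standing in for an SPE-specific component."""

    def __init__(self, name):
        self.name = name
        self.events = []

    def put(self, event):
        self.events.append(event)


class LocalFactory(ComponentFactory):
    """Concrete factory a platform developer would supply for one engine."""

    def create_stream(self, name):
        return LocalStream(name)


def build_topology(factory):
    # The algorithm side never names a concrete engine class.
    stream = factory.create_stream("instances")
    stream.put({"id": "a"})
    return stream


stream = build_topology(LocalFactory())
print(type(stream).__name__, stream.events)
```

Supporting another engine would then mean adding another factory and component set, leaving `build_topology` unchanged.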
  • 10. SAMOA SPE-adapter. Examples of SPE-specific runtime components produced by the SPE-adapter. [Figure annotation: Focus of this thesis.]
  • 11. Storm • Distributed stream processing engine • MapReduce-like programming model. [Figure: example topology with spouts S1 and S2, bolts B1 to B5, streams A and B, and a data storage that stores useful information. Labels: Stream, Spout, Bolt, DAG, Tuples.]
  • 12. SAMOA-Storm Integration. Mapping between Storm and SAMOA: 1. Spout → Entrance Processing Item (EPI) 2. Bolt → Processing Item (PI) • Use composition for EPI and PI 3. Bolt Stream and Spout Stream → Stream • Storm pull model
  • 13. Contributions so far. [Figure labels: SAMOA Algorithm and API; SPE-adapter (S4, Storm, other SPEs); ML-adapter (MOA, other ML frameworks); samoa-SPE modules: samoa-S4, samoa-storm, samoa-other-SPEs. Properties: Flexibility, Scalability, Extensibility.]
  • 14. Next Contribution: a distributed algorithm implementation, the Vertical Hoeffding Tree. Decision tree: • Classification • Divide and conquer • Easy to interpret
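  • 15. Sample Dataset. Each row is a datum (an instance) used to build the tree; Outlook, Temperature, Humidity, and Windy are attributes, and Play is the class.
      ID Code | Outlook  | Temperature | Humidity | Windy | Play
      a       | sunny    | hot         | high     | false | no
      b       | sunny    | hot         | high     | true  | no
      c       | overcast | hot         | high     | false | yes
      d       | rainy    | mild        | high     | false | yes
      …       | …        | …           | …        | …     | …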
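  • 16. Decision Tree. [Figure: tree with root "outlook" (branches: sunny, overcast, rainy), split nodes "humidity" (branches: high, normal) and "windy" (branches: true, false), and leaf nodes labeled Y or N. Annotations: root, split node, leaf node.]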
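A common way to pick the attribute tested at a split node is information gain, which slide 17 names as one of the VFDT criteria. The sketch below computes it over only the four rows shown on slide 15 (the elided rows are not guessed), so the numbers illustrate the mechanics and are not the values for the full dataset. The helper names `entropy` and `information_gain` are illustrative.

```python
import math
from collections import Counter

COLUMNS = ["id", "outlook", "temperature", "humidity", "windy", "play"]
# Only the four instances visible on slide 15.
ROWS = [
    ("a", "sunny", "hot", "high", "false", "no"),
    ("b", "sunny", "hot", "high", "true", "no"),
    ("c", "overcast", "hot", "high", "false", "yes"),
    ("d", "rainy", "mild", "high", "false", "yes"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, attribute):
    """Entropy of the class minus the weighted entropy after splitting."""
    a = COLUMNS.index(attribute)
    labels = [r[-1] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[a], []).append(r[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


for name in COLUMNS[1:-1]:
    print(name, round(information_gain(ROWS, name), 4))
```

On these four rows, `outlook` separates the classes perfectly and `humidity` carries no information because every row has the same value.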
  • 17. Very Fast Decision Tree (VFDT) • Pioneering decision tree algorithm for streaming data • Information gain + gain ratio + Hoeffding bound • The Hoeffding bound decides whether the difference in information gain between candidate attributes is large enough to split • Often called the Hoeffding Tree
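The split test on slide 17 can be written down using the standard Hoeffding bound, ε = sqrt(R² ln(1/δ) / (2n)), where R is the range of the split criterion (log2 of the number of classes for information gain), δ is the allowed error probability, and n is the number of instances seen at the leaf. The sketch below is a minimal illustration, not code from SAMOA or MOA; the function names and the example values of δ and n are assumptions.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean lies within epsilon of the
    observed mean with probability at least 1 - delta after n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, n_classes, delta, n):
    """Split when the best attribute leads the runner-up by more than epsilon."""
    value_range = math.log2(n_classes)  # range of information gain
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Illustrative values only: two classes, delta = 1e-7, 1000 instances seen.
print(hoeffding_bound(1.0, 1e-7, 1000))
print(should_split(0.30, 0.15, 2, 1e-7, 1000))
print(should_split(0.30, 0.25, 2, 1e-7, 1000))
```

Because ε shrinks as 1/sqrt(n), a leaf that cannot yet separate its two best attributes only needs to see more instances before the test is repeated.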
  • 18. Distributed Decision Tree. Types of parallelism: • Horizontal: partition the data by instance • Vertical: partition the data by attribute • Task: tree leaf nodes grow in parallel
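The difference between horizontal and vertical partitioning can be shown on the four instances from slide 15. This is a toy sketch: round-robin assignment and the function names are assumptions made for illustration, not the scheme used in the thesis.

```python
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
INSTANCES = [
    {"id": "a", "outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "false", "play": "no"},
    {"id": "b", "outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "true", "play": "no"},
    {"id": "c", "outlook": "overcast", "temperature": "hot", "humidity": "high", "windy": "false", "play": "yes"},
    {"id": "d", "outlook": "rainy", "mild": None, "temperature": "mild", "humidity": "high", "windy": "false", "play": "yes"},
]
INSTANCES[3].pop("mild")  # keep only the columns shown on the slide


def horizontal_partition(rows, n_workers):
    """Each worker receives whole instances (round-robin by instance)."""
    parts = [[] for _ in range(n_workers)]
    for i, row in enumerate(rows):
        parts[i % n_workers].append(row)
    return parts


def vertical_partition(rows, attributes, n_workers):
    """Each worker receives every instance, but only its own attributes
    (plus the instance id and the class label)."""
    parts = []
    for w in range(n_workers):
        owned = attributes[w::n_workers]
        parts.append([
            {"id": r["id"], "play": r["play"], **{a: r[a] for a in owned}}
            for r in rows
        ])
    return parts


h = horizontal_partition(INSTANCES, 2)
v = vertical_partition(INSTANCES, ATTRIBUTES, 2)
print([r["id"] for r in h[0]], [r["id"] for r in h[1]])
print(sorted(v[0][0]), sorted(v[1][0]))
```

With vertical partitioning the per-worker state grows with the number of attributes a worker owns, which is why it is attractive when instances have many attributes.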
  • 19. MOA Hoeffding Tree Profiling. CPU time breakdown with 200 attributes: Learn 70%, Split 24%, Other 6%.
  • 20. Vertical Hoeffding Tree. [Figure: topology with processing items source PI, model-aggregator PI, local-statistic PI, and evaluator PI, connected by streams labeled source, attribute, control, local-result, and result.]
  • 21. Evaluation. Metrics: • Accuracy • Throughput. Input data: • Random Tree Generator • Text Generator (resembles tweets). Cluster: 3 shared nodes; 48 GB of RAM, Intel Xeon CPU E5620 @ 2.4 GHz (16 processors), Linux kernel 2.6.18.
  • 22. VHT Iteration 1 (VHT1) • Goal: verify algorithm correctness (same accuracy as MOA) • Used 2 internal queues: an instances queue and a local-result queue • Achieved the same accuracy, but throughput was low, so we proceeded with VHT2
  • 23. VHT Iteration 2 (VHT2). Goal: improve VHT1 throughput • Kryo serializer: 2.5x throughput improvement • long identifiers instead of String • Removed the 2 internal queues of VHT1 → instances are discarded while a split is being attempted
  • 24. tree-10. Around 8.2% difference in accuracy.
  • 25. tree-100. Same trend as tree-10 (7.9% difference in accuracy).
  • 26. Number of leaf nodes, VHT2, tree-100. Very close and very high accuracy.
  • 27. Accuracy, VHT2, text-1000. Accuracy is low when the number of attributes increases.
  • 28. Throughput, VHT2, tree-generator. Not good for dense instances and a low number of attributes.
  • 29. Throughput, VHT2, text-generator. Higher throughput than MHT.
  • 30. [Figure: profiling results for text-1000 with 1000000 instances. x-axis: classifier (VHT2-par-3, MHT); y-axis: execution time (seconds); stacked components: t_calc, t_comm, t_serial.] Minimizing t_comm will increase throughput.
  • 31. [Figure: profiling results for text-10000 with 100000 instances. x-axis: classifier (VHT2-par-3, MHT); y-axis: execution time (seconds); stacked components: t_calc, t_comm, t_serial.] Throughput: VHT2-par-3: 2631 inst/sec; MHT: 507 inst/sec.
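The two throughput figures on slide 31 imply a speedup that the slide does not state explicitly. The short calculation below derives it, along with the run time each figure would imply for 100000 instances, under the assumption that throughput is simply instances divided by total execution time; the variable names are illustrative.

```python
# Throughput figures reported on slide 31 (instances per second).
vht2_par_3 = 2631
mht = 507
instances = 100_000  # size of the text-10000 run on the slide

speedup = vht2_par_3 / mht

# Assumption: throughput = instances / total execution time.
implied_time_vht2 = instances / vht2_par_3
implied_time_mht = instances / mht

print(f"speedup: {speedup:.2f}x")
print(f"implied time VHT2-par-3: {implied_time_vht2:.1f} s")
print(f"implied time MHT: {implied_time_mht:.1f} s")
```

So on this workload VHT2 with parallelism 3 processes instances roughly five times as fast as MHT.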
  • 32. Future Work • Open source • Evaluation layer in the SAMOA architecture • Online classification algorithms based on horizontal parallelism
  • 33. Conclusions. Mining big data streams is challenging • Systems need to satisfy the 3Vs of big data. SAMOA, a distributed streaming ML framework • Architecture and abstractions • Stream Processing Engine (SPE) adapter • SAMOA integration with Storm. Vertical Hoeffding Tree • Better than MOA for a high number of attributes