Real–time mining of evolving data streams involves new challenges when targeting today's application domains such as the Internet of Things: increasing volume, velocity and volatility requires data to be processed on–the–fly with fast reaction and adaptation to changes. This paper presents a high-performance, scalable design for decision trees and ensemble combinations that makes use of the vector SIMD and multicore capabilities available in modern processors to provide the required throughput and accuracy. The proposed design offers very low latency and good scalability with the number of cores on commodity hardware when compared to other state-of-the-art implementations. On an Intel i7-based system, processing a single decision tree is 6x faster than MOA (Java) and 7x faster than StreamDM (C++), two well-known reference implementations. On the same system, using the 6 cores (and 12 hardware threads) available allows an ensemble of 100 learners to be processed 85x faster than MOA while providing the same accuracy. Furthermore, our solution is highly scalable: on an Intel Xeon socket with large core counts, the proposed ensemble design achieves up to 16x speed-up when employing 24 cores with respect to a single-threaded execution.
1. Low-latency Multi-threaded Ensemble Learning for Dynamic Big Data Streams
Diego Marrón (dmarron@ac.upc.edu)
Eduard Ayguadé (eduard.ayguade@bsc.es)
José R. Herrero (josepr@ac.upc.edu)
Jesse Read (jesse.read@polytechnique.edu)
Albert Bifet (albert.bifet@telecom-paristech.fr)
2017 IEEE International Conference on Big Data
December 11-14, 2017, Boston, MA, USA
2. Introduction Hoeffding Tree Ensembles Evaluations Conclusions
Real–time mining of dynamic data streams
• Unprecedented amount of dynamic big data streams (Volume)
• Data is generated at a high rate (Velocity)
• Newly created data rapidly supersedes old data (Volatility)
• This increase in volume, velocity and volatility requires data
to be processed on–the–fly in real–time
3. Real–time dynamic data streams classification
• Real-time classification imposes some challenges:
• Deal with potentially infinite streams
• Single pass on each instance
• React to changes on the stream (concept drifting)
• Bounded response-time:
• Low latency: Milliseconds (ms) per instance
• High latency: Few seconds (s) per instance
• Limited CPU-time to process each instance
• Preferred methods:
• Hoeffding Tree (HT)
• Ensemble of HT
4. Hoeffding Tree
• Decision tree suitable for large data streams
• Easy–to–deploy
• They are usually able to keep up with the arrival rate
5. Hoeffding Tree: Basics
• Build tree structure incrementally (on-the-fly)
• Tree structure uses attributes to route an instance to a leaf node
• Leaf node:
• Contains the classifier (Naive Bayes)
• Statistics to decide next attribute to split (attribute counters)
• Split decision: needs per-attribute information gain
• Naive Bayes Classifier: calculates each attribute probability
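The split decision described above hinges on per-attribute information gain and the Hoeffding bound. A minimal Python sketch of a leaf node, kept purely illustrative (the actual LMHT implementation is in C++, and all names here are hypothetical):

```python
import math

def entropy(counts):
    """Shannon entropy of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

class HoeffdingLeaf:
    """Leaf node: attribute counters + split test (illustrative sketch)."""
    def __init__(self, n_classes, n_attrs):
        self.n_classes = n_classes
        self.n_attrs = n_attrs
        self.stats = [dict() for _ in range(n_attrs)]  # attr -> {value: class counts}
        self.class_counts = [0] * n_classes
        self.seen = 0

    def learn(self, x, y):
        """Update the attribute counters with one labelled instance."""
        self.seen += 1
        self.class_counts[y] += 1
        for a, v in enumerate(x):
            cnts = self.stats[a].setdefault(v, [0] * self.n_classes)
            cnts[y] += 1

    def info_gain(self, a):
        """Information gain of splitting on attribute a."""
        base = entropy(self.class_counts)
        cond = sum((sum(c) / self.seen) * entropy(c)
                   for c in self.stats[a].values())
        return base - cond

    def should_split(self, delta=1e-7):
        """Split when the gap between the two best attributes exceeds
        the Hoeffding bound eps = sqrt(R^2 ln(1/delta) / 2n)."""
        gains = sorted((self.info_gain(a) for a in range(self.n_attrs)),
                       reverse=True)
        r = math.log2(self.n_classes)  # range of information gain
        eps = math.sqrt(r * r * math.log(1 / delta) / (2 * self.seen))
        return gains[0] - gains[1] > eps
```

With a perfectly predictive attribute and an uninformative one, the gap quickly exceeds the bound and the leaf requests a split.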
6. Ensembles: Random Forest of Hoeffding Trees
• Random Forest uses RandomHT (a variation of HT):
• Split decision uses a random subset of attributes
• For each RandomHT in the ensemble:
• Input: sampling with repetition
• Tree is reset if change (drift) is detected
• Responses are combined to form final prediction
• Ensembles require more work for each instance
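The per-learner input sampling and the vote combination can be sketched as follows. This is a hypothetical Python illustration using Poisson(1) weights, the usual way to emulate sampling with repetition on a stream (online bagging); the real system is multi-threaded C++:

```python
import math
import random
from collections import Counter

def poisson(lam=1.0, rng=random):
    """Draw k ~ Poisson(lam) by Knuth's inversion method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def ensemble_learn(learners, x, y):
    # Each learner trains on the instance k times, k ~ Poisson(1),
    # which emulates sampling with repetition over the stream.
    for t in learners:
        for _ in range(poisson()):
            t.learn(x, y)

def ensemble_predict(learners, x):
    # Responses are combined by majority vote.
    votes = Counter(t.predict(x) for t in learners)
    return votes.most_common(1)[0][0]
```

The inner training loop is exactly the "more work per instance" cost the slide mentions: on average each of the L learners sees each instance once.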
7. Exploiting CPU Parallelism
• Parallelism: more work in the same amount of time
• Improves throughput per unit of time
• Or improves accuracy by using more CPU-intensive methods
• Modern CPUs offer parallel features: multithreading, SIMD instructions
8. Contributions
• Very low-latency response time
• few micro-seconds (µs) per instance
• Compared to a state-of-the-art implementation, MOA:
• Same accuracy
• Single HT, on average:
• Response time: 2 microseconds (µs) per instance
• 6.73x faster
• Multithreaded ensemble, on average:
• Response time: 10 microseconds (µs) per instance
• 85x faster
• Up to 70% parallel efficiency on a 24-core CPU
• Highly scalable/adaptive design tested on:
• Intel platforms: i7, Xeon
• ARM SoCs: from server range to low end (Raspberry Pi 3)
9. Hoeffding Tree
• Our implementation uses a binary tree
• Split into smaller sub-trees, each fitting in one L1 cache line
• Cache line: 64 bytes (8 × 64-bit pointers)
• Max sub-tree height: 3
• SIMD instructions to accelerate calculations:
• Information Gain
• Naive Bayes Classifier
[Figure: sub-trees laid out on consecutive cache lines]
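The information-gain computation over the whole table of attribute counters is data-parallel, which is what makes it a good SIMD target. A NumPy sketch of the vectorizable form (an illustration of the arithmetic shape only, not the deck's actual C++ SIMD kernels):

```python
import numpy as np

def _H(p):
    """Elementwise entropy over the last axis, with 0*log2(0) = 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        t = np.where(p > 0, p * np.log2(p), 0.0)
    return -t.sum(axis=-1)

def info_gains(counts):
    """Information gain for every attribute at once.

    counts: array of shape (n_attrs, n_values, n_classes) holding the
    leaf's attribute counters. Whole rows of this table are processed
    in lockstep, the same pattern a SIMD unit exploits.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts[0].sum()                        # instances seen at the leaf
    base = _H(counts[0].sum(axis=0) / n)       # entropy of class distribution
    nv = counts.sum(axis=2, keepdims=True)     # instances per (attr, value)
    probs = np.where(nv > 0, counts / np.where(nv > 0, nv, 1.0), 0.0)
    cond = ((nv[..., 0] / n) * _H(probs)).sum(axis=1)
    return base - cond
```

A perfectly predictive attribute gets gain 1 bit and an uninformative one gets 0, with no Python-level loop over attributes or values.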
10. LMHT: Architecture
• N threads (up to the number of CPU hardware threads)
• Thread 1: data load/parser
• N-1 workers for L learners
• Common instance buffer (lockless ring buffer)
11. LMHT: Achieving low latency
• Lockless data structures: at least one thread always makes progress
• Key for scaling with low latency
• Lockless ring buffer:
• Single writer principle:
• Only the owner can write to it
• Everyone can read it
12. LMHT: Achieving low latency
• Lockless ring buffer:
• Buffer Head:
• Signals a new instance in the buffer
• Owned by the parser
• Buffer Tail:
• Each worker owns its LastProcessed sequence number
• The tail is the lowest LastProcessed among all workers
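Putting the two slides together, the single-writer discipline can be sketched as follows. This is sequential Python pseudocode of the index logic only; the real implementation would use C++ atomics with acquire/release ordering, and all names here are hypothetical:

```python
class RingBuffer:
    """Single-writer ring buffer sketch.

    The parser owns `head` and is its only writer; each worker owns its
    own slot of `last_processed`. Every location has exactly one writer,
    so no lock is ever needed (the single-writer principle).
    """
    def __init__(self, capacity, n_workers):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0                          # written only by the parser
        self.last_processed = [0] * n_workers  # slot i written only by worker i

    def tail(self):
        # The tail is the lowest LastProcessed among all workers.
        return min(self.last_processed)

    def push(self, instance):
        # Parser: refuse (or spin) if the slowest worker is a lap behind.
        if self.head - self.tail() >= self.capacity:
            return False
        self.buf[self.head % self.capacity] = instance
        self.head += 1  # publishing head signals the new instance
        return True

    def pop(self, worker):
        # Worker: consume the next instance it has not yet processed.
        if self.last_processed[worker] >= self.head:
            return None
        item = self.buf[self.last_processed[worker] % self.capacity]
        self.last_processed[worker] += 1
        return item
```

Note how `push` never blocks on a worker: it only reads their counters, so a slow learner delays the parser but never deadlocks it.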
14. Single Hoeffding Tree Performance vs MOA
• Single thread
• Same accuracy as MOA
• Average throughput 525.65 instances per millisecond (ms)
• 6.73x faster than MOA
• 7x faster than StreamDM
• Including instance loading/parsing time
• Except on the RPI3: all instances already parsed in memory
• Data parsing is currently a bottleneck
• Using data from memory: 3x faster on the Intel i7
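The throughput above and the per-instance latency quoted on the contributions slide are two views of the same measurement; a quick sanity check of the arithmetic:

```python
# 525.65 instances per millisecond (this slide)
per_ms = 525.65
# ...corresponds to roughly 2 microseconds per instance (contributions slide)
latency_us = 1000.0 / per_ms
```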
21. Conclusions
• We presented a high-performance, scalable design for real-time data stream classification
• Very low latency: few microseconds (µs) per instance
• Same accuracy as MOA
• Highly adaptive to a variety of hardware platforms
• From server to edge computing (ARM and Intel)
• Up to 70% parallel efficiency on a 24-core CPU
• On Intel Platforms, on average:
• Single HT: 6.73x faster than MOA
• Multithreaded Ensemble: 85x faster than MOA
• On ARM-based SoCs, on average:
• Single HT: 2x faster than MOA (i7)
• A Raspberry Pi 3 (ARM) matches the performance of MOA on the i7
• Multithreaded ensemble: 24x faster than MOA (i7)
22. Future Work
• Parser thread can easily limit throughput
• Find the appropriate ratio of learners per parser
• Implement counters for all kinds of attributes
• Scaling to multi-socket nodes (NUMA architectures)
• Distribute ensemble across several nodes