4. Motivation
Need a classifier system on streaming data
The data used for learning come as a stream
So do the data to be classified
5. Motivation
$90 $90 $120 $90 $90 $150 $200
$90 $75 $90 $90 $90 $90 $90
$120 $90 Sold out Sold out $75 $90 $90
$120 $90 $90 $90 $100 $90 $120
6. Motivation
[Figure: the same price grid, with predictions attached to entries: (predicted) to increase in zero to two days, (predicted) to increase this week, (predicted) to increase next week]
8. Motivation
FRA – NYC: needs attention (revenue decrease)
FRA – LON: needs attention (passenger decrease)
FRA – MEX: needs attention (revenue decrease, cost increase)
9. Motivation
Need a classifier system on streaming data
The data used for learning come as a stream
So do the data to be classified
10. Motivation
The classifier is kept fresh
No need for separate batch learning/evaluation
Feedback is taken into account in real time, at regular intervals
The classifier can be introspected
Transparent model structure (e.g. the tree itself, the information gain at each split point)
Known expected performance (accuracy, precision, recall, AUC)
Seamless support for the machine-learning workflow
Data preprocessing: up/down-sampling, imputation, …
Feature selection
Model evaluation, cross-validation, …
MUST
11. Motivation
The classifier is immediately available
The classifier can already predict during learning
When a learning phase terminates, another learning cycle starts
The classifier has a meta-learning capability
The classifier maintains several models with different parameters
It is possible to learn about the learning capability of the models
NICE TO HAVE
15. Decision Trees
From origin to recent developments
“Understand data by asking a sequence of questions”
Classification and Regression Trees (CART), Breiman et al., 1984
“Pool decision trees to improve generalization”
Random Forests, Breiman, 1999
“Let’s play: pose estimation for Xbox’s Kinect”
Shotton et al., 2011
16. Decision Trees
Streaming Decision Trees
“A classifier for streaming data with a bound”
Hoeffding Tree (VFDT), Domingos & Hulten, 2000
“Use of approximate histograms for decision trees”
Streaming Parallel Decision Tree, Ben-Haim & Tom-Tov, 2010
20. Decision Trees
Streaming Decision Trees
The batch version of decision trees requires a view of the full learning data set
In streaming:
each point can only be seen once
the processing should be fast; we can’t afford much access to disk
21. Decision Trees
Streaming Decision Tree – get the intuition!
Instead of using every point, the points are compressed
The real position of each point is then approximated
24. Decision Trees
Compressing Data
An approximate histogram is built for each label/feature pair
[Figure: histogram for label 1 / feature 1 — counts (0–8) over bin centers 2, 5, 7.5, 9, 11]
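The compression step above can be written down concretely. Below is a minimal sketch in plain Scala (no Flink), assuming the (centroid, count) bin representation of Ben-Haim & Tom-Tov; the names `Histogram`, `add`, and `maxBins` are illustrative, not from the deck:

```scala
// Approximate histogram: at most `maxBins` (centroid, count) pairs.
// Adding a point may fuse the two closest bins into a weighted average,
// so each point's real position is only approximated.
case class Histogram(maxBins: Int, bins: Vector[(Double, Long)] = Vector.empty) {
  def add(p: Double): Histogram = {
    val withPoint = (bins :+ (p, 1L)).sortBy(_._1)
    copy(bins = Histogram.shrink(withPoint, maxBins))
  }
}

object Histogram {
  // repeatedly merge the two closest centroids until at most maxBins remain
  def shrink(bs: Vector[(Double, Long)], maxBins: Int): Vector[(Double, Long)] =
    if (bs.size <= maxBins) bs
    else {
      val i = (0 until bs.size - 1).minBy(j => bs(j + 1)._1 - bs(j)._1)
      val (c1, m1) = bs(i); val (c2, m2) = bs(i + 1)
      val merged = ((c1 * m1 + c2 * m2) / (m1 + m2), m1 + m2)
      shrink(bs.patch(i, Vector(merged), 2), maxBins)
    }
}
```

Each point is seen exactly once and the state stays bounded, which is exactly the streaming constraint from slide 20.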
25. Decision Trees
Prepare Split Candidates (1/2)
For each feature, all histograms of the feature are merged
[Figure: per-label histograms for feature 1 (label 1 with bins at 2, 5, 7.5, 9, 11; label n with bins at 4, 6, 8.5, 10, 13) merged into a Total histogram with bins at 1, 3.5, 7, 11, 14]
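The merge of two approximate histograms works like the add step: concatenate the bins, then fuse closest centroids until the bin budget is respected. A self-contained sketch (the function name mirrors the deck's `mergeHistogram`, but this standalone version is an assumption):

```scala
// Merge two approximate histograms represented as sorted (centroid, count)
// bins, keeping at most `maxBins` bins in the result.
def mergeHistograms(a: Vector[(Double, Long)],
                    b: Vector[(Double, Long)],
                    maxBins: Int): Vector[(Double, Long)] = {
  def fuse(bs: Vector[(Double, Long)]): Vector[(Double, Long)] =
    if (bs.size <= maxBins) bs
    else {
      // fuse the two closest centroids into their weighted average
      val i = (0 until bs.size - 1).minBy(j => bs(j + 1)._1 - bs(j)._1)
      val (c1, m1) = bs(i); val (c2, m2) = bs(i + 1)
      fuse(bs.patch(i, Vector(((c1 * m1 + c2 * m2) / (m1 + m2), m1 + m2)), 2))
    }
  fuse((a ++ b).sortBy(_._1))
}
```

Because the merge is associative, it can be used directly as a `reduce` function over a stream of partial histograms, which is what the Flink outline on slide 33 relies on.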
26. Decision Trees
Prepare Split Candidates (2/2)
Get the split candidates such that the intervals between two consecutive split candidates contain the same number of points (each colored area is equally large)
[Figure: the Total histogram (bins at 1, 3.5, 7, 11, 14) with split candidates u1, u2 marked]
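Equal-count split candidates are approximate quantiles of the merged histogram. A coarse sketch that reads off bin centroids at cumulative-count thresholds (the Ben-Haim & Tom-Tov paper refines this with interpolation inside bins; that refinement is omitted here):

```scala
// Pick numIntervals - 1 candidates so that each interval between consecutive
// candidates holds roughly the same number of points.
def splitCandidates(bins: Vector[(Double, Long)], numIntervals: Int): Vector[Double] = {
  val total = bins.map(_._2).sum.toDouble
  val cum = bins.scanLeft(0L)(_ + _._2).tail          // cumulative counts per bin
  (1 until numIntervals).toVector.map { k =>
    val target = total * k / numIntervals             // k-th equal-count threshold
    val i = cum.indexWhere(_ >= target)               // first bin crossing the threshold
    bins(i)._1
  }
}
```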
27. Decision Trees
Determine Split
Find the split point that maximizes the information gain, using:
the split candidates
the histogram per feature/label
[Figure: the Total histogram (bins at 1, 3.5, 7, 11, 14) with candidate split points u1, u2]
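Scoring one candidate from the per-label histograms can be sketched as follows. The per-label counts below/above the candidate are estimated coarsely by whole bins; `infoGain` and the label-keyed map are illustrative names, not the deck's API:

```scala
// Shannon entropy of a class-count distribution
def entropy(counts: Seq[Double]): Double = {
  val total = counts.sum
  if (total == 0) 0.0
  else -counts.filter(_ > 0).map { c => val p = c / total; p * math.log(p) }.sum
}

// Information gain of splitting one feature at candidate u, given the
// approximate histogram of that feature for each label.
def infoGain(byLabel: Map[String, Vector[(Double, Long)]], u: Double): Double = {
  val below = byLabel.map { case (_, bins) => bins.filter(_._1 <= u).map(_._2).sum.toDouble }.toSeq
  val above = byLabel.map { case (_, bins) => bins.filter(_._1 > u).map(_._2).sum.toDouble }.toSeq
  val all   = below.zip(above).map { case (b, a) => b + a }
  val n = all.sum
  entropy(all) - (below.sum / n) * entropy(below) - (above.sum / n) * entropy(above)
}
```

The split actually chosen is then the candidate (over all features) with the maximal gain, as on the slide.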
29. Decision Tree
Subsequent Split – get the intuition!
A different data set is used for each iteration*
* If there are not enough data, the same data can be reinjected instead; Kafka is very good for this
32. Implementation
Stream Learner
We use two Kafka streams:
• One for the labeled data stream
• One for the tree developed so far (the topic is also used by classifying applications)
• We need the second stream because each message is annotated with the tree so far
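The annotation step described above can be sketched without Flink: keep the latest tree seen on one stream and attach it to every point arriving on the other, which is what a CoFlatMap does with its two inputs. All type and method names here (`Annotator`, `onTree`, `onPoint`) are illustrative stand-ins for the deck's `AnnotateMessageCoFlatMap`:

```scala
// Minimal model of annotating a data stream with the latest tree.
case class Point(features: Vector[Double], label: String)
case class Node(id: Int)                        // stand-in for a tree node/root
case class AnnotatedPoint(point: Point, tree: Option[Node])

class Annotator {
  private var latestTree: Option[Node] = None
  // called for each message on the tree stream
  def onTree(t: Node): Unit = latestTree = Some(t)
  // called for each message on the data stream
  def onPoint(p: Point): AnnotatedPoint = AnnotatedPoint(p, latestTree)
}
```

Points that arrive before any tree has been published simply carry `None`, so downstream operators must handle the cold-start case.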
33. Implementation
Code Outlines
// Kafka sources (construction elided on the slide)
val kafkaDataStream: DataStream[Point] = …
val kafkaTreeStream: DataStream[Node] = …
// annotate each message with the latest tree
val annotatedDataStream: DataStream[AnnotatedPoint] =
  (kafkaDataStream connect kafkaTreeStream)
    .flatMap(new AnnotateMessageCoFlatMap(…))
// create a histogram per feature/node, pre-aggregated per time window
val histograms = annotatedDataStream
  .map { p => toSingletonHistograms(p) }
  .timeWindowAll(Time.of(1, TimeUnit.MINUTES))
  .reduce { (n1, n2) => mergeHistogram(n1, n2) }
// merge histograms per node id
val mergedHistograms = histograms
  .keyBy(_.id)
  .reduce { (n1, n2) => mergeHistogram(n1, n2) }
// grow the tree where a node has enough points and is worth splitting
val newTree = mergedHistograms
  .filter { hs => haveEnoughPoints(hs) && toSplit(hs) }
  .map { n =>
    val splitPoint = maxInformationGain(calculateSplitCandidates(n))
    …
  }
36. Conclusions
Summary
Streaming algorithms based on approximate histograms were explained
The streaming decision tree algorithm opens the possibility of classifiers with interesting properties: freshness and continuous learning
Flink together with Kafka allows a clean implementation of the algorithm
37. Conclusions
Next Steps
Random Forests:
Trees with randomly selected features at each level
Trees with different spans of data (trees with more but older data might behave worse than trees with less but fresher data: forgetting capabilities)
Providing information on which types of trees behave better at a given period of time (meta-learning)