Random Forest for Big Data
Nathalie Villa-Vialaneix
http://www.nathalievilla.org
& Robin Genuer, Jean-Michel Poggi & Christine Tuleau-Malot
February 7, 2017
Institut Élie Cartan, Université de Lorraine
Outline
1 Random Forest
2 Strategies to use random forest with big data
Bag of Little Bootstrap (BLB)
Map Reduce
Online learning
3 Application
1 Random Forest
A short introduction to random forest
Introduced by [Breiman, 2001], random forests are ensemble methods [Dietterich, 2000], in the same family as Bagging, Boosting, Randomizing Outputs, and Random Subspace.
A statistical learning algorithm that can be used for classification and regression, it has been applied with success in many situations involving real data:
microarrays [Díaz-Uriarte and Alvarez de Andres, 2006]
ecology [Prasad et al., 2006]
pollution forecasting [Ghattas, 1999]
genomics [Goldstein et al., 2010, Boulesteix et al., 2012]
for more references, see [Verikas et al., 2011]
Description of RF
$\mathcal{L}_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$: i.i.d. observations of a random pair of variables $(X, Y)$ such that
$X \in \mathbb{R}^p$ (explanatory variables)
$Y \in \mathcal{Y}$ (target variable), where $\mathcal{Y}$ is $\mathbb{R}$ (regression) or $\{1, \ldots, M\}$ (classification).
Purpose: define a predictor $\hat{f}: \mathbb{R}^p \to \mathcal{Y}$ from $\mathcal{L}_n$.
Random forest from [Breiman, 2001]
$\hat{f}(\cdot, \Theta_b)$, $1 \leq b \leq B$, is a set of regression or classification trees, where $(\Theta_b)_{1 \leq b \leq B}$ are i.i.d. random variables, independent of $\mathcal{L}_n$.
The random forest is obtained by aggregation of the set. Aggregation:
regression: $\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}(x, \Theta_b)$
classification: $\hat{f}(x) = \arg\max_{1 \leq c \leq M} \sum_{b=1}^{B} \mathbb{1}_{\{\hat{f}(x, \Theta_b) = c\}}$
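To make the two aggregation rules concrete, here is a minimal R sketch on a hypothetical matrix of per-tree predictions (rows are observations, columns are the B trees; all values are illustrative only):

# Hypothetical per-tree predictions for 2 observations and B = 3 trees.
P_reg <- cbind(c(1.0, 2.1), c(1.2, 1.9), c(0.8, 2.0))
rowMeans(P_reg)                                          # regression: mean over the trees
P_cls <- cbind(c("a", "b"), c("a", "a"), c("b", "a"))
apply(P_cls, 1, function(v) names(which.max(table(v))))  # classification: majority vote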
CART
Tree: piecewise constant predictor obtained by recursive binary partitioning of $\mathbb{R}^p$.
Constraint: splits are parallel to the axes.
At every step of the binary partitioning, data in the current node are split "at best" (i.e., so as to obtain the greatest decrease in heterogeneity in the two child nodes).
Figure: Regression tree
CART partitioning
Classification and regression frameworks
Figure: Regression tree
Figure: Classification tree
Random forest with Bagging [Breiman, 1996]
From the full sample $(X_i, Y_i)_{i=1,\ldots,n}$, subsample with replacement $B$ times to obtain bootstrap samples $(X_i, Y_i)_{i \in \tau_1}, \ldots, (X_i, Y_i)_{i \in \tau_B}$; a CART tree $\hat{f}_b$ is built on each sample $\tau_b$.
Aggregation: $\hat{f}_{bag} = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b$
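A minimal bagging sketch under these definitions; using rpart as the CART builder and the simulated data are assumptions for illustration:

library(rpart)
set.seed(1)
n <- 1000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- d$x1^2 + d$x2 + rnorm(n, sd = 0.1)
B <- 50
trees <- lapply(1:B, function(b) {
  tau_b <- sample(n, n, replace = TRUE)   # bootstrap sample of size n
  rpart(y ~ x1 + x2, data = d[tau_b, ])   # CART tree on the b-th sample
})
# aggregation: average the B tree predictions
f_bag <- function(newdata) rowMeans(sapply(trees, predict, newdata = newdata))
f_bag(data.frame(x1 = 0, x2 = 1))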
Trees in random forests
Variant of CART [Breiman et al., 1984]: a piecewise constant predictor, obtained by recursive partitioning of $\mathbb{R}^p$ with splits parallel to the axes.
But: at each step of the partitioning, the "best" split of the data is sought among mtry randomly picked directions (variables).
No pruning (fully developed trees).
Figure: Regression tree
OOB error and estimation of the prediction error
OOB = Out Of Bag
OOB error
For predicting $Y_i$, only the predictors $\hat{f}_b$ such that $i \notin \tau_b$ are used $\Rightarrow \hat{Y}_i$.
OOB error $= \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ (regression)
OOB error $= \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{Y_i \neq \hat{Y}_i\}}$ (classification)
The estimation is similar to a standard cross-validation estimation... without splitting the training dataset, because splitting is already included in the bootstrap sample generation.
Warning: a different forest is used for the prediction of each $Y_i$!
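In R, the randomForest package computes this estimate as part of the fit; a small sketch (iris stands in for an arbitrary classification problem):

library(randomForest)
data(iris)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
oob_pred <- predict(rf)          # no newdata: each Y_i predicted only by trees with i out of bag
mean(oob_pred != iris$Species)   # OOB misclassification error
rf$err.rate[rf$ntree, "OOB"]     # the same quantity, tracked by the package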
Variable importance
Definition
For $j \in \{1, \ldots, p\}$ and for every bootstrap sample $b$, permute the values of the $j$-th variable in the bootstrap sample.
Predict observation $i$ (OOB prediction) for tree $b$:
in a standard way: $\hat{f}_b(X_i)$
after permutation: $\hat{f}_b(X_i^{(j)})$
The importance of variable $j$ for tree $\hat{f}_b$ is the average increase in prediction error after permutation of variable $j$. Regression case:
$I(X^j) = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{n} \sum_{i=1}^{n} \left[ (\hat{f}_b(X_i^{(j)}) - Y_i)^2 - (\hat{f}_b(X_i) - Y_i)^2 \right]$
The greater the increase, the more important the variable is.
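A sketch of the permutation principle in R. For brevity it permutes a variable and re-predicts with the whole forest on the training set, rather than computing the exact per-tree OOB average of the formula above; importance() from the randomForest package does the latter:

library(randomForest)
set.seed(1)
n <- 500
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n))   # only x1 is informative
y <- 2 * x$x1 + rnorm(n)
rf <- randomForest(x, y, ntree = 200, importance = TRUE)
mse <- function(d) mean((predict(rf, d) - y)^2)
x_perm <- x
x_perm$x1 <- sample(x_perm$x1)    # permute the values of variable x1
mse(x_perm) - mse(x)              # increase in error: large, so x1 is important
importance(rf, type = 1)          # the package's OOB-based permutation importance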
Why RF and Big Data?
On the one hand, bagging is appealing because it is easily computed in parallel.
On the dark side, each bootstrap sample has the same size as the original dataset (i.e., n, which is supposed to be LARGE) and contains approximately 0.63n distinct observations (which is also LARGE)!
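The 0.63n figure is easy to check numerically (the limiting proportion of distinct observations is $1 - 1/e \approx 0.632$):

set.seed(1)
n <- 1e6
length(unique(sample(n, n, replace = TRUE))) / n  # ~0.632
1 - exp(-1)                                       # theoretical limit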
2 Strategies to use random forest with big data
Types of strategy to handle big data problems
Considered here: the Bag of Little Bootstrap, a MapReduce approach, and online learning.
Overview of BLB
[Kleiner et al., 2012, Kleiner et al., 2014]
a method used to scale any bootstrap estimation
a consistency result is demonstrated for bootstrap estimation
Here: we describe the approach in the simplified case of bagging (as for random forest).
Framework: $(X_i, Y_i)_{i=1,\ldots,n}$ a learning set. We want to define a predictor of $Y \in \mathbb{R}$ from $X$ given the learning set.
Problem with standard bagging
When n is big, the number of distinct observations in $\tau_b$ is $\sim 0.63n$ $\Rightarrow$ still BIG!
First solution: [Bickel et al., 1997] propose the "m-out-of-n" bootstrap: bootstrap samples have size m with $m \ll n$.
But: the quality of the estimator strongly depends on m!
Idea behind BLB
Use bootstrap samples of size n but with a very small number of distinct observations in each of them.
Presentation of BLB
From the full sample $(X_1, Y_1), \ldots, (X_n, Y_n)$:
sampling without replacement (size $m \ll n$) gives $B_1$ subsamples $(X_1^{(j)}, Y_1^{(j)}), \ldots, (X_m^{(j)}, Y_m^{(j)})$, for $j = 1, \ldots, B_1$;
over-sampling each subsample $B_2$ times gives weights $n_1^{(j,k)}, \ldots, n_m^{(j,k)}$, $k = 1, \ldots, B_2$, and an estimator $\hat{f}^{(j,k)}$ is trained on each weighted sample;
a mean within each subsample gives $\hat{f}_j = \frac{1}{B_2} \sum_{k=1}^{B_2} \hat{f}^{(j,k)}$;
a final mean across subsamples gives $\hat{f}_{BLB} = \frac{1}{B_1} \sum_{j=1}^{B_1} \hat{f}_j$.
What is over-sampling and why is it working?
BLB steps:
1 create $B_1$ samples (without replacement) of size $m \sim n^{\gamma}$, with $\gamma \in [0.5, 1]$: for $n = 10^6$ and $\gamma = 0.6$, a typical m is about 4,000, compared to 630,000 for standard bootstrap;
2 for every subsample $\tau_b$, repeat $B_2$ times:
over-sampling: assign weights $(n_1, \ldots, n_m)$, simulated as $\mathcal{M}(n, \frac{1}{m}\mathbb{1}_m)$, to the observations in $\tau_b$;
estimation step: train an estimator with weighted observations (if the learning algorithm allows a genuine processing of weights, the computational cost is low because m is small);
3 aggregate by averaging.
Remark: the final sample size ($\sum_{i=1}^{m} n_i$) is equal to n (with replacement), as in standard bootstrap samples.
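A minimal BLB sketch following these steps; rpart as the weighted base learner and all sizes are illustrative assumptions:

library(rpart)
set.seed(1)
n <- 1e5
d <- data.frame(x = rnorm(n))
d$y <- sin(d$x) + rnorm(n, sd = 0.1)
gamma <- 0.6; m <- floor(n^gamma)   # m = 1,000 here, against n = 100,000
B1 <- 5; B2 <- 20
preds <- replicate(B1, {
  sub <- d[sample(n, m), ]                        # subsample without replacement, size m
  rowMeans(replicate(B2, {
    w <- as.vector(rmultinom(1, n, rep(1/m, m)))  # over-sampling: weights summing to n
    fit <- rpart(y ~ x, data = sub, weights = w)  # weighted estimation at the cost of m points
    predict(fit, d)
  }))                                             # mean over the B2 repetitions
})
f_blb <- rowMeans(preds)                          # final mean over the B1 subsamples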
Overview of Map Reduce
Map Reduce is a generic method to deal with massive datasets stored on a distributed filesystem.
It was developed by Google [Dean and Ghemawat, 2004] (see also [Chamandy et al., 2012] for an example of use at Google).
The data are broken into several bits (Data 1, Data 2, Data 3, ...).
Each bit is processed through ONE Map step, which outputs pairs {(key, value)}. Map jobs must be independent! Result: indexed data.
Each key is processed through ONE Reduce step to produce the output.
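A toy simulation of this Map/Reduce contract in plain R (no distributed framework), on the canonical word-count example from [Dean and Ghemawat, 2004]:

bits <- list(c("big", "data"), c("random", "forest", "big"), c("data", "data"))
# Map: each bit independently emits (key, value) pairs
mapped <- lapply(bits, function(bit) lapply(bit, function(w) list(key = w, value = 1)))
pairs <- unlist(mapped, recursive = FALSE)
# shuffle: group the pairs by key
grouped <- split(pairs, vapply(pairs, `[[`, "", "key"))
# Reduce: one job per key produces the output
reduced <- lapply(grouped, function(g) sum(vapply(g, `[[`, 0, "value")))
reduced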
MR implementation of random forest
A Map/Reduce implementation of random forest is included in Mahout (the Apache scalable machine learning library), which works as follows [del Rio et al., 2014]:
data are split into Q bits, each sent to one Map job;
each Map job trains a random forest with a small number of trees;
there is no Reduce step (the final forest is the combination of all trees learned in the Map jobs).
Note that this implementation is not equivalent to the original random forest algorithm, because the forests are not built on bootstrap samples of the original dataset.
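A sketch of this map-only scheme in R, with random data chunks standing in for Map jobs (an assumption: real splits follow data locality) and randomForest::combine merging the sub-forests:

library(randomForest)
set.seed(1)
data(iris)
Q <- 5
chunks <- split(iris, sample(rep(1:Q, length.out = nrow(iris))))
sub_forests <- lapply(chunks, function(chunk)
  randomForest(Species ~ ., data = chunk, ntree = 20))     # one small forest per "Map job"
big_forest <- do.call(randomForest::combine, sub_forests)  # no Reduce: concatenate the trees
big_forest$ntree  # 100 trees in the combined forest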
Drawbacks of MR implementation of random forest
Locality of data can yield biased random forests in the different Map jobs $\Rightarrow$ the combined forest might have poor prediction performance.
The OOB error cannot be computed precisely because Map jobs are independent. A proxy for this quantity is given by the average of the OOB errors obtained in the different Map tasks $\Rightarrow$ again, this quantity may be biased by data locality (similar problem with VI).
Another MR implementation of random forest
... using the Poisson bootstrap [Chamandy et al., 2012], which is based on the fact that, for large n,
$\mathcal{B}\left(n, \frac{1}{n}\right) \simeq \mathcal{P}(1)$.
1 Map step: $\forall\, r = 1, \ldots, Q$ (chunk of data $\tau_r$) and $\forall\, i \in \tau_r$, generate B i.i.d. random variables $n_i^b \sim \mathcal{P}(1)$ ($b = 1, \ldots, B$).
Output: the (key, value) pairs are $(b, (i, n_i^b))$ for all pairs $(i, b)$ such that $n_i^b > 0$ (indices i with $n_i^b > 0$ are in bootstrap sample number b, $n_i^b$ times);
2 Reduce step: processes bootstrap sample number b: a tree is built from the indices i such that $n_i^b > 0$, each repeated $n_i^b$ times.
Output: a tree... All trees are collected in a forest.
Closer to using RF directly on the entire dataset. But: every Reduce job must deal with approximately $0.63 \times n$ distinct observations... (only the bootstrap part is simplified).
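A sketch of the Poisson-bootstrap Map and Reduce logic, simulated in plain R; the data, the chunking, and rpart as tree builder are illustrative assumptions:

library(rpart)
set.seed(1)
n <- 1e4; B <- 10; Q <- 4
d <- data.frame(x = rnorm(n))
d$y <- factor(d$x + rnorm(n) > 0)
chunks <- split(seq_len(n), rep(1:Q, length.out = n))
# Map: for each chunk, emit (b, (i, n_i^b)) whenever n_i^b > 0
emitted <- do.call(rbind, lapply(chunks, function(idx) {
  w <- matrix(rpois(length(idx) * B, lambda = 1), ncol = B)  # n_i^b ~ Poisson(1)
  pos <- which(w > 0, arr.ind = TRUE)
  data.frame(b = pos[, 2], i = idx[pos[, 1]], n = w[pos])
}))
# Reduce: for key b, build one tree on the indices repeated n_i^b times
forest <- lapply(split(emitted, emitted$b), function(s)
  rpart(y ~ x, data = d[rep(s$i, s$n), ], method = "class"))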
Online learning framework
Data stream: observations $(X_i, Y_i)_{i=1,\ldots,n}$ have been used to obtain a predictor $\hat{f}_n$.
New data arrive, $(X_i, Y_i)_{i=n+1,\ldots,n+m}$: how to obtain a predictor from the entire dataset $(X_i, Y_i)_{i=1,\ldots,n+m}$?
Naive approach: re-train a model from $(X_i, Y_i)_{i=1,\ldots,n+m}$.
More interesting approach: update $\hat{f}_n$ with the new information $(X_i, Y_i)_{i=n+1,\ldots,n+m}$.
Why is it interesting?
computational gain, if the update has a small computational cost (it can even be worthwhile to process big data that do not arrive in a stream this way)
storage gain
Framework of online bagging
$\hat{f}_n = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_n^b$, in which
$\hat{f}_n^b$ has been built from a bootstrap sample of $\{1, \ldots, n\}$
we know how to update $\hat{f}_n^b$ online with new data
Question: can we update the bootstrap samples online when new data $(X_i, Y_i)_{i=n+1,\ldots,n+m}$ arrive?
Online bootstrap using Poisson bootstrap
1 generate weights for every bootstrap sample and every new observation: $n_i^b \sim \mathcal{P}(1)$ for $i = n + 1, \ldots, n + m$ and $b = 1, \ldots, B$
2 update $\hat{f}_n^b$ with the observations $X_i$ such that $n_i^b > 0$, each repeated $n_i^b$ times
3 update the predictor: $\hat{f}_{n+m} = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_{n+m}^b$
PRF
In Purely Random Forests [Biau et al., 2008], the splits are generated independently from the data:
splits are obtained by randomly choosing a variable and a splitting point within the range of this variable
the decision in each leaf is made in the standard way
Online PRF
PRF is described by:
$\forall\, b = 1, \ldots, B$, $\hat{f}_n^b$: the PR tree for bootstrap sample number b
$\forall\, b = 1, \ldots, B$ and for every terminal leaf l of $\hat{f}_n^b$, $obs_n^{b,l}$ is the number of observations in $(X_i)_{i=1,\ldots,n}$ falling in leaf l, and $val_n^{b,l}$ is the average Y of these observations (regression framework)
Online update with Poisson bootstrap: $\forall\, b = 1, \ldots, B$, $\forall\, i \in \{n + 1, \ldots, n + m\}$ such that $n_i^b > 0$, and for the terminal leaf l containing $X_i$:
$val_i^{b,l} = \frac{val_{i-1}^{b,l} \times obs_{i-1}^{b,l} + n_i^b \times Y_i}{obs_{i-1}^{b,l} + n_i^b}$ (online update of the mean)
$obs_i^{b,l} = obs_{i-1}^{b,l} + n_i^b$
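A minimal sketch of this leaf update in R, for one tree b and one terminal leaf l (function and variable names are illustrative):

update_leaf <- function(state, y_new, w = rpois(1, lambda = 1)) {
  if (w > 0) {  # the new observation enters this bootstrap sample w times
    state$val <- (state$val * state$obs + w * y_new) / (state$obs + w)  # online mean
    state$obs <- state$obs + w                                          # online count
  }
  state
}
# stream three new observations through a leaf holding 100 points of mean 0.5
state <- list(obs = 100, val = 0.5)
for (y_new in c(0.7, 0.4, 0.9)) state <- update_leaf(state, y_new)
state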
Online RF
Developed to handle data streams (data arrive sequentially) in an online manner (we cannot keep all the past data): [Saffari et al., 2009]
Can deal with massive data streams (addressing both the Volume and Velocity characteristics), but can also handle massive (static) data by running through them sequentially
In-depth adaptation of Breiman's RF: even the tree-growing mechanism is changed
Main idea: reason only in terms of proportions of the output classes, instead of individual observations (classification framework)
Consistency results in [Denil et al., 2013]
3 Application
When should we consider data as "big"?
We deal with Big Data when:
data are at Google scale (rare)
data are big compared to our computing capacities... and depending on what we need to do with them
[R Core Team, 2016, Kane et al., 2013]
R is not well-suited for working with data structures larger than about 10–20% of a computer's RAM. Data exceeding 50% of available RAM are essentially unusable because the overhead of all but the simplest of calculations quickly consumes all available RAM. Based on these guidelines, we consider a data set large if it exceeds 20% of the RAM on a given machine and massive if it exceeds 50%.
Implementation
standard randomForest R package (done)
bigrf R package (based on bigmemory)
MapReduce:
at hand (done)
R packages rmr, rhadoop
Mahout implementation
Online RF: Python code from [Denil et al., 2013] and C++ code from [Saffari et al., 2009]
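A minimal usage sketch of the sequential baseline, the randomForest package:

library(randomForest)
data(iris)
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500,        # B, the number of trees
                   mtry = 2,           # variables tried at each split
                   importance = TRUE)
print(rf)        # includes the OOB error estimate
importance(rf)   # permutation-based variable importance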
MR-RF in practice: case study [Genuer et al., 2015]
15,000,000 observations generated from: Y with $P(Y = 1) = P(Y = -1) = 0.5$, and the conditional distribution of $(X^{(j)})_{j=1,\ldots,7}$ given $Y = y$:
with probability 0.7, $X^{(j)} \sim \mathcal{N}(jy, 1)$ for $j \in \{1, 2, 3\}$ and $X^{(j)} \sim \mathcal{N}(0, 1)$ for $j \in \{4, 5, 6\}$;
with probability 0.3, $X^{(j)} \sim \mathcal{N}(0, 1)$ for $j \in \{1, 2, 3\}$ and $X^{(j)} \sim \mathcal{N}((j - 3)y, 1)$ for $j \in \{4, 5, 6\}$;
$X^{(7)} \sim \mathcal{N}(0, 1)$.
Comparison of subsampling, BLB, and MR with well-distributed data within Map jobs.
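A sketch of this data-generating model in R (a vectorized translation of the slide; the reduced sample size is for illustration):

sim_data <- function(n) {
  y <- sample(c(-1, 1), n, replace = TRUE)   # P(Y = 1) = P(Y = -1) = 0.5
  grp <- rbinom(n, 1, 0.7)                   # 1 = first submodel (probability 0.7)
  X <- matrix(rnorm(n * 7), n, 7)            # all variables start as N(0, 1)
  for (j in 1:3) X[, j] <- X[, j] + grp * j * y              # informative if grp = 1
  for (j in 4:6) X[, j] <- X[, j] + (1 - grp) * (j - 3) * y  # informative if grp = 0
  data.frame(X, y = factor(y))               # X7 stays pure noise
}
d <- sim_data(1e4)  # 15,000,000 observations in the original study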
Airline data
Benchmark data in Big Data articles (e.g., [Wang et al., 2015]), containing more than 124 million observations of 29 variables.
Aim: predict delay_status (1 = delayed, 0 = on time) of a flight using 4 explanatory variables (distance, night, week-end, departure_time).
Not really massive data: a 12 GB CSV file.
Still useful to illustrate some Big Data issues:
too large to fit in the RAM of most of today's laptops
R struggles to perform complex computations unless the data take less than 10–20% of the RAM (the total memory size of manipulated objects cannot exceed the RAM limit)
long computation times to deal with this dataset
Preliminary experiments on a 64-bit Linux server with 8 processors, 32 cores and 256 GB of RAM.
Comparison of different BDRF on a simulation study
Sequential forest: took approximately 7 hours; the resulting OOB error was equal to 4.564e-3.
Method          Comp. time  BDerrForest  errForest  errTest
sampling 10%    3 min       4.622e-3     4.381e-3   4.300e-3
sampling 1%     9 sec       4.586e-3     4.363e-3   4.400e-3
sampling 0.1%   1 sec       5.600e-3     4.714e-3   4.573e-3
sampling 0.01%  0.3 sec     4.666e-3     5.957e-3   5.753e-3
BLB-RF 5/20     1 min       4.138e-3     4.294e-3   4.267e-3
BLB-RF 10/10    3 min       4.138e-3     4.278e-3   4.267e-3
MR-RF 100/1     2 min       1.397e-2     4.235e-3   4.006e-3
MR-RF 100/10    2 min       8.646e-3     4.155e-3   4.293e-3
MR-RF 10/10     6 min       8.501e-3     4.290e-3   4.253e-3
MR-RF 10/100    21 min      4.556e-3     4.249e-3   4.260e-3
All methods provide satisfactory results, comparable with the sequential RF.
The average OOB error over the Map forests can be a bad approximation of the true OOB error (sometimes optimistic, sometimes pessimistic).
What happens if Map jobs are unbalanced between the two submodels? Between the two classes of Y?
Method                  Comp. time  BDerrForest  errForest  errTest
unbalanced1 100/1       3 min       3.900e-3     5.470e-3   5.247e-3
unbalanced1 10/10       8 min       2.575e-3     4.714e-3   4.473e-3
unbalanced2 100/1/0.1   2 min       5.880e-3     4.502e-3   4.373e-3
unbalanced2 10/10/0.1   3 min       4.165e-3     4.465e-3   4.260e-3
unbalanced2 100/1/0.01  1 min       1.926e-3     8.734e-2   4.484e-2
unbalanced2 10/10/0.01  4 min       9.087e-4     7.612e-2   7.299e-2
x-biases 100/1          3 min       3.504e-3     1.010e-1   1.006e-1
x-biases 100/10         3 min       2.082e-3     1.010e-1   1.008e-1
Unbalancing Y between the different Map jobs does not affect the performance much.
However, if the relation between (X, Y) differs in different subsets of the data, unbalancing these submodels can strongly deteriorate the performance.
Airline dataset
Method          Computational time  BDerrForest  errForest
sampling 10%    32 min              18.32%       18.32%
sampling 1%     2 min               18.35%       18.33%
sampling 0.1%   7 sec               18.36%       18.39%
sampling 0.01%  2 sec               18.44%       18.49%
BLB-RF 15/7     25 min              18.35%       18.33%
MR-RF 15/7      15 min              18.33%       18.27%
MR-RF 15/20     25 min              18.34%       18.20%
MR-RF 100/10    17 min              18.33%       18.20%
Sequential forest: the RF took 16 hours to train and its OOB error was equal to 18.32% (note that only 19.3% of the flights are actually late).
Perspectives
Sampling for MapReduce-RF:
Use a partition into Map jobs stratified on Y, or at least a random partition
Use the Bag of Little Bootstrap from [Kleiner et al., 2012]
Possible variants for MapReduce-RF:
Use simplified RF, e.g., Extremely Randomized Trees [Geurts et al., 2006] (as in Online RF)
See the whole forest as a forest of forests and adapt the majority vote scheme using weights
Have you survived Big Data?
Questions?
References
Biau, G., Devroye, L., and Lugosi, G. (2008).
Consistency of random forests and other averaging classifiers.
The Journal of Machine Learning Research, 9:2015–2033.
Bickel, P., Götze, F., and van Zwet, W. (1997).
Resampling fewer than n observations: gains, losses and remedies for losses.
Statistica Sinica, 7(1):1–31.
Boulesteix, A., Kruppa, J., and König, I. (2012).
Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(6):493–507.
Breiman, L. (1996).
Heuristics of instability in model selection.
Annals of Statistics, 24(6):2350–2383.
Breiman, L. (2001).
Random forests.
Machine Learning, 45(1):5–32.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984).
Classification and Regression Trees.
Chapman and Hall, Boca Raton, Florida, USA.
Chamandy, N., Muralidharan, O., Najmi, A., and Naidu, S. (2012).
Estimating uncertainty for massive data streams.
Technical report, Google.
Dean, J. and Ghemawat, S. (2004).
MapReduce: simplified data processing on large clusters.
In Proceedings of Sixth Symposium on Operating System Design and Implementation (OSDI 2004).
del Rio, S., López, V., Benítez, J., and Herrera, F. (2014).
On the use of MapReduce for imbalanced big data using random forest.
Information Sciences, 285:112–137.
Denil, M., Matheson, D., and de Freitas, N. (2013).
Consistency of online random forests.
In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), pages 1256–1264.
Díaz-Uriarte, R. and Alvarez de Andres, S. (2006).
Gene selection and classification of microarray data using random forest.
BMC Bioinformatics, 7(1):3.
Dietterich, T. (2000).
An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and
randomization.
Machine Learning, 40(2):139–157.
Genuer, R., Poggi, J., Tuleau-Malot, C., and Villa-Vialaneix, N. (2015).
Random forests for big data.
Preprint arXiv:1511.08327. Submitted for publication.
Geurts, P., Ernst, D., and Wehenkel, L. (2006).
Extremely randomized trees.
Machine Learning, 63(1):3–42.
Ghattas, B. (1999).
Prévisions des pics d’ozone par arbres de régression, simples et agrégés par bootstrap.
Revue de statistique appliquée, 47(2):61–80.
Goldstein, B., Hubbard, A., Cutler, A., and Barcellos, L. (2010).
An application of random forests to a genome-wide association dataset: methodological considerations & new findings.
BMC Genetics, 11(1):1.
Kane, M., Emerson, J., and Weston, S. (2013).
Scalable strategies for computing with massive data.
Journal of Statistical Software, 55(14).
Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. (2012).
The big data bootstrap.
In Proceedings of 29th International Conference on Machine Learning (ICML 2012), Edinburgh, Scotland, UK.
Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. (2014).
A scalable bootstrap for massive data.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):795–816.
Prasad, A., Iverson, L., and Liaw, A. (2006).
Newer classification and regression tree techniques: bagging and random forests for ecological prediction.
Ecosystems, 9(2):181–199.
R Core Team (2016).
R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria.
Saffari, A., Leistner, C., Santner, J., Godec, M., and Bischof, H. (2009).
On-line random forests.
In Proceedings of IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pages 1393–1400.
IEEE.
Verikas, A., Gelzinis, A., and Bacauskiene, M. (2011).
Mining data with random forests: a survey and results of new tests.
Pattern Recognition, 44(2):330–349.
Wang, C., Chen, M., Schifano, E., Wu, J., and Yan, J. (2015).
A survey of statistical methods and computing for big data.
arXiv preprint arXiv:1502.07989.
 
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...tuxette
 
Apprentissage pour la biologie moléculaire et l’analyse de données omiques
Apprentissage pour la biologie moléculaire et l’analyse de données omiquesApprentissage pour la biologie moléculaire et l’analyse de données omiques
Apprentissage pour la biologie moléculaire et l’analyse de données omiquestuxette
 
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...tuxette
 
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...tuxette
 
Journal club: Validation of cluster analysis results on validation data
Journal club: Validation of cluster analysis results on validation dataJournal club: Validation of cluster analysis results on validation data
Journal club: Validation of cluster analysis results on validation datatuxette
 
Overfitting or overparametrization?
Overfitting or overparametrization?Overfitting or overparametrization?
Overfitting or overparametrization?tuxette
 
Selective inference and single-cell differential analysis
Selective inference and single-cell differential analysisSelective inference and single-cell differential analysis
Selective inference and single-cell differential analysistuxette
 
SOMbrero : un package R pour les cartes auto-organisatrices
SOMbrero : un package R pour les cartes auto-organisatricesSOMbrero : un package R pour les cartes auto-organisatrices
SOMbrero : un package R pour les cartes auto-organisatricestuxette
 
Graph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype PredictionGraph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype Predictiontuxette
 
Explanable models for time series with random forest
Explanable models for time series with random forestExplanable models for time series with random forest
Explanable models for time series with random foresttuxette
 
Présentation du projet ASTERICS
Présentation du projet ASTERICSPrésentation du projet ASTERICS
Présentation du projet ASTERICStuxette
 
Présentation du projet ASTERICS
Présentation du projet ASTERICSPrésentation du projet ASTERICS
Présentation du projet ASTERICStuxette
 

Mehr von tuxette (20)

Racines en haut et feuilles en bas : les arbres en maths
Racines en haut et feuilles en bas : les arbres en mathsRacines en haut et feuilles en bas : les arbres en maths
Racines en haut et feuilles en bas : les arbres en maths
 
Méthodes à noyaux pour l’intégration de données hétérogènes
Méthodes à noyaux pour l’intégration de données hétérogènesMéthodes à noyaux pour l’intégration de données hétérogènes
Méthodes à noyaux pour l’intégration de données hétérogènes
 
Méthodologies d'intégration de données omiques
Méthodologies d'intégration de données omiquesMéthodologies d'intégration de données omiques
Méthodologies d'intégration de données omiques
 
Projets autour de l'Hi-C
Projets autour de l'Hi-CProjets autour de l'Hi-C
Projets autour de l'Hi-C
 
Can deep learning learn chromatin structure from sequence?
Can deep learning learn chromatin structure from sequence?Can deep learning learn chromatin structure from sequence?
Can deep learning learn chromatin structure from sequence?
 
Multi-omics data integration methods: kernel and other machine learning appro...
Multi-omics data integration methods: kernel and other machine learning appro...Multi-omics data integration methods: kernel and other machine learning appro...
Multi-omics data integration methods: kernel and other machine learning appro...
 
ASTERICS : une application pour intégrer des données omiques
ASTERICS : une application pour intégrer des données omiquesASTERICS : une application pour intégrer des données omiques
ASTERICS : une application pour intégrer des données omiques
 
Autour des projets Idefics et MetaboWean
Autour des projets Idefics et MetaboWeanAutour des projets Idefics et MetaboWean
Autour des projets Idefics et MetaboWean
 
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...
 
Apprentissage pour la biologie moléculaire et l’analyse de données omiques
Apprentissage pour la biologie moléculaire et l’analyse de données omiquesApprentissage pour la biologie moléculaire et l’analyse de données omiques
Apprentissage pour la biologie moléculaire et l’analyse de données omiques
 
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...
 
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
 
Journal club: Validation of cluster analysis results on validation data
Journal club: Validation of cluster analysis results on validation dataJournal club: Validation of cluster analysis results on validation data
Journal club: Validation of cluster analysis results on validation data
 
Overfitting or overparametrization?
Overfitting or overparametrization?Overfitting or overparametrization?
Overfitting or overparametrization?
 
Selective inference and single-cell differential analysis
Selective inference and single-cell differential analysisSelective inference and single-cell differential analysis
Selective inference and single-cell differential analysis
 
SOMbrero : un package R pour les cartes auto-organisatrices
SOMbrero : un package R pour les cartes auto-organisatricesSOMbrero : un package R pour les cartes auto-organisatrices
SOMbrero : un package R pour les cartes auto-organisatrices
 
Graph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype PredictionGraph Neural Network for Phenotype Prediction
Graph Neural Network for Phenotype Prediction
 
Explanable models for time series with random forest
Explanable models for time series with random forestExplanable models for time series with random forest
Explanable models for time series with random forest
 
Présentation du projet ASTERICS
Présentation du projet ASTERICSPrésentation du projet ASTERICS
Présentation du projet ASTERICS
 
Présentation du projet ASTERICS
Présentation du projet ASTERICSPrésentation du projet ASTERICS
Présentation du projet ASTERICS
 

Kürzlich hochgeladen

Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfSumit Kumar yadav
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxjana861314
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINsankalpkumarsahoo174
 

Kürzlich hochgeladen (20)

Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 

Random Forest for Big Data

  • 12. Trees in random forests
Variant of CART [Breiman et al., 1984]: a piecewise constant predictor, obtained by recursive binary partitioning of $\mathbb{R}^p$ with splits parallel to the axes. But:
- at each step of the partitioning, the "best" split is sought among mtry randomly picked directions (variables);
- no pruning (fully grown trees).
Figure: Regression tree
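To make the two differences with CART concrete, here is a minimal sketch (not from the slides) using the randomForest R package, where mtry is exactly the number of directions tried at each split and fully grown trees are the default:

    library(randomForest)
    set.seed(42)
    # toy regression data: n observations, p explanatory variables
    n <- 500; p <- 6
    X <- as.data.frame(matrix(rnorm(n * p), n, p))
    y <- X[[1]] + 2 * X[[2]]^2 + rnorm(n)
    # B = 200 fully grown trees; 2 variables tried at each split
    rf <- randomForest(x = X, y = y, ntree = 200, mtry = 2)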
  • 13. OOB error and estimation of the prediction error
OOB = Out Of Bag.
OOB error: to predict $Y_i$, only the predictors $\hat{f}^b$ such that $i \notin \tau_b$ are used, which gives $\hat{Y}_i$. Then:
- regression: $\text{OOB error} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$;
- classification: $\text{OOB error} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{Y_i \neq \hat{Y}_i\}}$.
This estimation is similar to a standard cross-validation estimation... without splitting the training dataset, because the splitting is already included in the bootstrap sample generation.
Warning: a different forest is used for the prediction of each $Y_i$!
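In the randomForest package, OOB predictions are what predict() returns when no new data are given, so the OOB error of the toy forest above can be computed directly (a sketch, regression case):

    oob_pred <- predict(rf)          # OOB prediction of each Y_i
    mean((y - oob_pred)^2)           # OOB error (regression)
    tail(rf$mse, 1)                  # same quantity, as stored by the package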
  • 14. Variable importance
Definition: for $j \in \{1, \ldots, p\}$ and every bootstrap sample $b$, permute the values of the $j$-th variable in the bootstrap sample. Predict observation $i$ (OOB prediction) for tree $b$:
- in the standard way: $\hat{f}^b(X_i)$;
- after permutation: $\hat{f}^b(X_i^{(j)})$.
The importance of variable $j$ for tree $\hat{f}^b$ is the average increase of the prediction error after permutation of variable $j$. Regression case:
$I(X^{(j)}) = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{n} \sum_{i=1}^{n} \left[ \left(\hat{f}^b(X_i^{(j)}) - Y_i\right)^2 - \left(\hat{f}^b(X_i) - Y_i\right)^2 \right]$
The greater the increase, the more important the variable is.
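This permutation importance is available off the shelf; a sketch on the same toy data (importance = TRUE is required so that permuted OOB errors are computed during training; scale = FALSE returns the raw average increase, closest to $I(X^{(j)})$ above):

    rf <- randomForest(x = X, y = y, ntree = 200, importance = TRUE)
    importance(rf, type = 1, scale = FALSE)  # mean increase of OOB MSE after permutation
    varImpPlot(rf)                           # quick visual check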
  • 15-16. Why RF and Big Data?
On one hand, bagging is appealing because it is easily computed in parallel. On the dark side, each bootstrap sample has the same size as the original dataset (i.e., $n$, which is supposed to be LARGE) and contains approximately $0.63\,n$ distinct observations (which is also LARGE)!
  • 17. Outline
1 Random Forest
2 Strategies to use random forest with big data: Bag of Little Bootstrap (BLB); Map Reduce; Online learning
3 Application
  • 18-21. Types of strategy to handle big data problems
(Figure: overview of the possible strategies; the next sections focus successively on the Bag of Little Bootstrap and on a MapReduce approach.)
  • 22-24. Overview of BLB [Kleiner et al., 2012, Kleiner et al., 2014]
- a method used to scale any bootstrap estimation;
- a consistency result is demonstrated for a bootstrap estimation.
Here: we describe the approach in the simplified case of bagging (as for random forest).
Framework: $(X_i, Y_i)_{i=1,\ldots,n}$ is a learning set. We want to define a predictor of $Y \in \mathbb{R}$ from $X$ given this learning set.
  • 25-28. Problem with standard bagging
When $n$ is big, the number of distinct observations in $\tau_b$ is $\sim 0.63\,n$ ⇒ still BIG!
First solution: [Bickel et al., 1997] propose the "m-out-of-n" bootstrap, in which bootstrap samples have size $m$ with $m \ll n$. But: the quality of the estimator strongly depends on $m$!
Idea behind BLB: use bootstrap samples of size $n$ but with a very small number of distinct observations in each of them.
  • 29-34. Presentation of BLB
Start from $(X_1, Y_1), \ldots, (X_n, Y_n)$:
- sampling without replacement (size $m \ll n$): $B_1$ subsamples $(X_i^{(b_1)}, Y_i^{(b_1)})_{i=1,\ldots,m}$, for $b_1 = 1, \ldots, B_1$;
- over-sampling: for each subsample, $B_2$ weight vectors $(n_1^{(b_1,b_2)}, \ldots, n_m^{(b_1,b_2)})$, each giving an estimator $\hat{f}^{(b_1,b_2)}$;
- mean within each subsample: $\hat{f}^{b_1} = \frac{1}{B_2} \sum_{b_2=1}^{B_2} \hat{f}^{(b_1,b_2)}$;
- mean over the subsamples: $\hat{f}_{\mathrm{BLB}} = \frac{1}{B_1} \sum_{b_1=1}^{B_1} \hat{f}^{b_1}$.
  • 35-39. What is over-sampling and why is it working?
BLB steps:
1 create $B_1$ samples (without replacement) of size $m \sim n^{\gamma}$ with $\gamma \in [0.5, 1]$: for $n = 10^6$ and $\gamma = 0.6$, a typical $m$ is about 4,000, compared to 630,000 for the standard bootstrap;
2 for every subsample $\tau_b$, repeat $B_2$ times:
- over-sampling: assign weights $(n_1, \ldots, n_m)$, simulated as $\mathcal{M}\left(n, \frac{1}{m}\mathbf{1}_m\right)$, to the observations in $\tau_b$;
- estimation step: train an estimator with weighted observations (if the learning algorithm allows a genuine processing of weights, the computational cost is low because $m$ is small);
3 aggregate by averaging.
Remark: the final sample size, $\sum_{i=1}^{m} n_i$, is equal to $n$ (with replacement), as in standard bootstrap samples.
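A minimal sketch of these three steps for bagged trees, assuming rpart-style case weights stand in for the over-sampled observations and that the response column is named Y (function and variable names are illustrative, not the authors' code):

    library(rpart)
    blb_bagging <- function(data, B1 = 10, B2 = 20, gamma = 0.6) {
      n <- nrow(data); m <- ceiling(n^gamma)
      lapply(seq_len(B1), function(b1) {
        sub <- data[sample(n, m), ]                     # step 1: subsample, no replacement
        lapply(seq_len(B2), function(b2) {
          w <- as.vector(rmultinom(1, n, rep(1/m, m)))  # step 2: multinomial weights, sum = n
          rpart(Y ~ ., data = sub, weights = w)         # weighted fit, cost driven by m
        })
      })
    }
    # step 3: predict with every tree and average, first within each
    # subsample, then across the B1 subsamples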
  • 40-44. Overview of Map Reduce
Map Reduce is a generic method to deal with massive datasets stored on a distributed filesystem. It was developed by Google [Dean and Ghemawat, 2004] (see also [Chamandy et al., 2012] for an example of use at Google).
- The data are broken into several bits.
- Each bit is processed through ONE Map step, which outputs pairs {(key, value)}. Map jobs must be independent! Result: indexed data.
- Each key is then processed through ONE Reduce step to produce the output.
  • 45-46. MR implementation of random forest
A Map/Reduce implementation of random forest is included in Mahout (the Apache scalable machine learning library), which works as follows [del Rio et al., 2014]:
- the data are split into $Q$ bits, one sent to each Map job;
- each Map job trains a random forest with a small number of trees in it;
- there is no Reduce step (the final forest is the combination of all the trees learned in the Map jobs).
Note that this implementation is not equivalent to the original random forest algorithm, because the forests are not built on bootstrap samples of the original data set.
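This strategy is easy to mimic in plain R; a rough sketch, reusing the toy data above and assuming a Unix-like system for mclapply. randomForest::combine concatenates the sub-forests, which reproduces the "no Reduce step" behaviour (not the actual Mahout code):

    library(randomForest)
    library(parallel)
    Q <- 4                                                # number of Map jobs
    chunks <- split(seq_len(nrow(X)), rep_len(1:Q, nrow(X)))
    sub_forests <- mclapply(chunks, function(idx)         # one small forest per data bit
      randomForest(x = X[idx, ], y = y[idx], ntree = 50),
      mc.cores = Q)
    big_rf <- do.call(combine, unname(sub_forests))       # final forest = union of all trees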
  • 47-48. Drawbacks of the MR implementation of random forest
- Locality of data can yield biased random forests in the different Map jobs ⇒ the combined forest might have poor prediction performance.
- The OOB error cannot be computed precisely because the Map jobs are independent. A proxy for this quantity is given by the average of the OOB errors obtained in the different Map tasks ⇒ again, this quantity may be biased by data locality (similar problem for VI).
  • 49-54. Another MR implementation of random forest
... using the Poisson bootstrap [Chamandy et al., 2012], which is based on the fact that, for large $n$, $\mathrm{Binom}\left(n, \frac{1}{n}\right) \simeq \mathrm{Poisson}(1)$.
1 Map step: $\forall\, r = 1, \ldots, Q$ (chunk of data $\tau_r$), $\forall\, i \in \tau_r$, generate $B$ i.i.d. random variables $n_i^b \sim \mathrm{Poisson}(1)$, $b = 1, \ldots, B$. Output: the (key, value) pairs $(b, (i, n_i^b))$ for all pairs $(i, b)$ such that $n_i^b > 0$ (an index $i$ with $n_i^b > 0$ belongs to bootstrap sample number $b$, repeated $n_i^b$ times).
2 Reduce step: processes bootstrap sample number $b$: a tree is built from the indices $i$ with $n_i^b > 0$, each repeated $n_i^b$ times. Output: a tree... All the trees are collected in a forest.
Closer to using RF directly on the entire dataset. But: every Reduce job still has to deal with approximately $0.63 \times n$ distinct observations... (only the bootstrap part is simplified).
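A sketch of the Map step alone, with illustrative names, to show how cheap the Poisson trick is (each observation draws its own weights, with no need to know n):

    map_poisson <- function(chunk_idx, B = 100) {
      out <- lapply(chunk_idx, function(i) {
        w <- rpois(B, lambda = 1)              # n_i^b for b = 1, ..., B
        b <- which(w > 0)
        data.frame(key = b, i = i, n = w[b])   # emit (b, (i, n_i^b)) when n_i^b > 0
      })
      do.call(rbind, out)
    }
    # a Reduce job for key b would build one tree on the rows with key == b,
    # each index i repeated n times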
  • 55-57. Online learning framework
Data stream: observations $(X_i, Y_i)_{i=1,\ldots,n}$ have been used to obtain a predictor $\hat{f}_n$; new data $(X_i, Y_i)_{i=n+1,\ldots,n+m}$ arrive. How to obtain a predictor from the entire dataset $(X_i, Y_i)_{i=1,\ldots,n+m}$?
- Naive approach: re-train a model from $(X_i, Y_i)_{i=1,\ldots,n+m}$.
- More interesting approach: update $\hat{f}_n$ with the new information $(X_i, Y_i)_{i=n+1,\ldots,n+m}$.
Why is it interesting?
- computational gain if the update has a small computational cost (it can even be interesting to deal directly with big data which do not arrive in a stream);
- storage gain.
  • 58-59. Framework of online bagging
Assume $\hat{f}_n = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_n^b$, in which:
- $\hat{f}_n^b$ has been built from a bootstrap sample of $\{1, \ldots, n\}$;
- we know how to update $\hat{f}_n^b$ with new data online.
Question: can we update the bootstrap samples online when new data $(X_i, Y_i)_{i=n+1,\ldots,n+m}$ arrive?
  • 60-62. Online bootstrap using the Poisson bootstrap
1 generate weights for every bootstrap sample and every new observation: $n_i^b \sim \mathrm{Poisson}(1)$ for $i = n+1, \ldots, n+m$ and $b = 1, \ldots, B$;
2 update $\hat{f}_n^b$ with the observations $X_i$ such that $n_i^b > 0$, each repeated $n_i^b$ times;
3 update the predictor: $\hat{f}_{n+m} = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_{n+m}^b$.
  • 63-64. PRF
In Purely Random Forests [Biau et al., 2008], the splits are generated independently from the data:
- splits are obtained by randomly choosing a variable and a splitting point within the range of this variable;
- the decision is made in a standard way.
  • 65-67. Online PRF
A PRF is described by:
- $\forall\, b = 1, \ldots, B$: $\hat{f}_n^b$, the PR tree for bootstrap sample number $b$;
- $\forall\, b = 1, \ldots, B$ and every terminal leaf $l$ of $\hat{f}_n^b$: $\mathrm{obs}_n^{b,l}$, the number of observations among $(X_i)_{i=1,\ldots,n}$ falling in leaf $l$, and $\mathrm{val}_n^{b,l}$, the average of $Y$ over these observations (regression framework).
Online update with the Poisson bootstrap: $\forall\, b = 1, \ldots, B$, $\forall\, i \in \{n+1, \ldots, n+m\}$ such that $n_i^b > 0$, and for the terminal leaf $l$ containing $X_i$:
$\mathrm{val}_i^{b,l} = \frac{\mathrm{val}_{i-1}^{b,l} \times \mathrm{obs}_{i-1}^{b,l} + n_i^b \times Y_i}{\mathrm{obs}_{i-1}^{b,l} + n_i^b}$ (online update of the mean...)
$\mathrm{obs}_i^{b,l} = \mathrm{obs}_{i-1}^{b,l} + n_i^b$
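The two update rules are one line each; a minimal sketch with illustrative names (locating the leaf by dropping $X_i$ down the purely random tree is omitted here):

    update_leaf <- function(leaf, y_new, w) {      # w = n_i^b ~ Poisson(1), w > 0
      leaf$val <- (leaf$val * leaf$obs + w * y_new) / (leaf$obs + w)  # running mean
      leaf$obs <- leaf$obs + w
      leaf
    }
    leaf <- list(val = 0.8, obs = 12)
    leaf <- update_leaf(leaf, y_new = 1.4, w = 2)  # X_i falls in this leaf, weight 2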
  • 68. Online RF
- Developed to handle data streams (data arrive sequentially) in an online manner (we cannot keep all the data from the past): [Saffari et al., 2009].
- Can deal with massive data streams (addressing both the Volume and Velocity characteristics), but also with massive (static) data, by running through the data sequentially.
- In-depth adaptation of Breiman's RF: even the tree-growing mechanism is changed. Main idea: think only in terms of proportions of output classes, instead of observations (classification framework).
- Consistency results in [Denil et al., 2013].
  • 69. Outline
1 Random Forest
2 Strategies to use random forest with big data: Bag of Little Bootstrap (BLB); Map Reduce; Online learning
3 Application
  • 70-72. When should we consider data as "big"?
We deal with Big Data when:
- data are at Google scale (rare);
- data are big compared to our computing capacities... and depending on what we need to do with them.
[R Core Team, 2016, Kane et al., 2013]: "R is not well-suited for working with data structures larger than about 10-20% of a computer's RAM. Data exceeding 50% of available RAM are essentially unusable because the overhead of all but the simplest of calculations quickly consumes all available RAM. Based on these guidelines, we consider a data set large if it exceeds 20% of the RAM on a given machine and massive if it exceeds 50%."
  • 73. Implementation
- standard: the randomForest R package (done);
- bigmemory: the bigrf R package;
- MapReduce: at hand (done), the R packages rmr and rhadoop, or the Mahout implementation;
- Online RF: Python code from [Denil et al., 2013] and C++ code from [Saffari et al., 2009].
  • 74-75. MR-RF in practice: case study [Genuer et al., 2015]
15,000,000 observations generated from: $Y$ with $P(Y = 1) = P(Y = -1) = 0.5$, and the conditional distribution of $(X^{(j)})_{j=1,\ldots,7}$ given $Y = y$:
- with probability 0.7, $X^{(j)} \sim \mathcal{N}(jy, 1)$ for $j \in \{1, 2, 3\}$ and $X^{(j)} \sim \mathcal{N}(0, 1)$ for $j \in \{4, 5, 6\}$;
- with probability 0.3, $X^{(j)} \sim \mathcal{N}(0, 1)$ for $j \in \{1, 2, 3\}$ and $X^{(j)} \sim \mathcal{N}((j-3)y, 1)$ for $j \in \{4, 5, 6\}$;
- $X^{(7)} \sim \mathcal{N}(0, 1)$.
Comparison of subsampling, BLB and MR with well-distributed data within the Map jobs.
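For reference, the simulation model translates directly into R (a sketch with a reduced n; only the data generation, not the forest comparison):

    gen_data <- function(n) {
      y <- sample(c(-1, 1), n, replace = TRUE)          # P(Y = 1) = P(Y = -1) = 0.5
      first <- runif(n) < 0.7                           # which submodel each row follows
      X <- matrix(rnorm(n * 7), n, 7)                   # X^(7) stays N(0, 1)
      for (j in 1:3) X[first, j] <- X[first, j] + j * y[first]
      for (j in 4:6) X[!first, j] <- X[!first, j] + (j - 3) * y[!first]
      data.frame(X, Y = factor(y))
    }
    train <- gen_data(1e5)                              # 15e6 in the actual study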
  • 76. Airline data
Benchmark data in Big Data articles (e.g., [Wang et al., 2015]) containing more than 124 million observations and 29 variables.
Aim: predict delay_status (1 = delayed, 0 = on time) of a flight using 4 explanatory variables (distance, night, week-end, departure_time).
Not really massive data: a 12 GB csv file. Still useful to illustrate some Big Data issues:
- too large to fit in the RAM of most of today's laptops;
- R struggles to perform complex computations unless the data take less than 10-20% of the RAM (the total memory size of the manipulated objects cannot exceed the RAM limit);
- long computation times to deal with this dataset.
Preliminary experiments on a Linux 64-bit server with 8 processors, 32 cores and 256 GB of RAM.
  • 77-78. Comparison of different BDRF on a simulation study
Sequential forest: took approximately 7 hours; the resulting OOB error was equal to $4.564 \times 10^{-3}$.

Method           Comp. time  BDerrForest  errForest  errTest
sampling 10%     3 min       4.622e-3     4.381e-3   4.300e-3
sampling 1%      9 sec       4.586e-3     4.363e-3   4.400e-3
sampling 0.1%    1 sec       5.600e-3     4.714e-3   4.573e-3
sampling 0.01%   0.3 sec     4.666e-3     5.957e-3   5.753e-3
BLB-RF 5/20      1 min       4.138e-3     4.294e-3   4.267e-3
BLB-RF 10/10     3 min       4.138e-3     4.278e-3   4.267e-3
MR-RF 100/1      2 min       1.397e-2     4.235e-3   4.006e-3
MR-RF 100/10     2 min       8.646e-3     4.155e-3   4.293e-3
MR-RF 10/10      6 min       8.501e-3     4.290e-3   4.253e-3
MR-RF 10/100     21 min      4.556e-3     4.249e-3   4.260e-3

- All methods provide satisfactory results, comparable with the sequential RF.
- The average OOB error over the Map forests can be a bad approximation of the true OOB error (sometimes optimistic, sometimes pessimistic).
  • 79-80. What happens if Map jobs are unbalanced between the two submodels? Between the two classes of $Y$?

Method                  Comp. time  BDerrForest  errForest  errTest
unbalanced1 100/1       3 min       3.900e-3     5.470e-3   5.247e-3
unbalanced1 10/10       8 min       2.575e-3     4.714e-3   4.473e-3
unbalanced2 100/1/0.1   2 min       5.880e-3     4.502e-3   4.373e-3
unbalanced2 10/10/0.1   3 min       4.165e-3     4.465e-3   4.260e-3
unbalanced2 100/1/0.01  1 min       1.926e-3     8.734e-2   4.484e-2
unbalanced2 10/10/0.01  4 min       9.087e-4     7.612e-2   7.299e-2
x-biases 100/1          3 min       3.504e-3     1.010e-1   1.006e-1
x-biases 100/10         3 min       2.082e-3     1.010e-1   1.008e-1

- Unbalancing of $Y$ in the different Map jobs does not affect the performance much.
- However, if the relation between $(X, Y)$ differs in different subsets of the data, unbalancing these submodels can strongly deteriorate the performance.
  • 81. Airline dataset

Method          Comp. time  BDerrForest  errForest
sampling 10%    32 min      18.32%       18.32%
sampling 1%     2 min       18.35%       18.33%
sampling 0.1%   7 sec       18.36%       18.39%
sampling 0.01%  2 sec       18.44%       18.49%
BLB-RF 15/7     25 min      18.35%       18.33%
MR-RF 15/7      15 min      18.33%       18.27%
MR-RF 15/20     25 min      18.34%       18.20%
MR-RF 100/10    17 min      18.33%       18.20%

Sequential forest: the RF took 16 hours to be obtained and its OOB error was equal to 18.32% (only 19.3% of the flights are really late).
  • 82. Perspectives
Sampling for MapReduce-RF:
- use a partition into Map jobs stratified on $Y$, or at least a random partition;
- use the Bag of Little Bootstrap from [Kleiner et al., 2012].
Possible variants for MapReduce RF:
- use simplified RF, e.g., Extremely Randomized Trees [Geurts et al., 2006] (as in Online RF);
- see the whole forest as a forest of forests and adapt the majority vote scheme using weights.
  • 83. Have you survived Big Data? Questions?
  • 84-86. References
Biau, G., Devroye, L., and Lugosi, G. (2008). Consistency of random forests and other averaging classifiers. The Journal of Machine Learning Research, 9:2015–2033.
Bickel, P., Götze, F., and van Zwet, W. (1997). Resampling fewer than n observations: gains, losses and remedies for losses. Statistica Sinica, 7(1):1–31.
Boulesteix, A., Kruppa, J., and König, I. (2012). Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(6):493–507.
Breiman, L. (1996). Heuristics of instability in model selection. Annals of Statistics, 24(6):2350–2383.
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Chapman and Hall, Boca Raton, Florida, USA.
Chamandy, N., Muralidharan, O., Najmi, A., and Naidu, S. (2012). Estimating uncertainty for massive data streams. Technical report, Google.
Dean, J. and Ghemawat, S. (2004). MapReduce: simplified data processing on large clusters. In Proceedings of the Sixth Symposium on Operating System Design and Implementation (OSDI 2004).
del Rio, S., López, V., Benítez, J., and Herrera, F. (2014). On the use of MapReduce for imbalanced big data using random forest. Information Sciences, 285:112–137.
Denil, M., Matheson, D., and de Freitas, N. (2013). Consistency of online random forests. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), pages 1256–1264.
Díaz-Uriarte, R. and Alvarez de Andres, S. (2006). Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7(1):3.
Dietterich, T. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2):139–157.
Genuer, R., Poggi, J., Tuleau-Malot, C., and Villa-Vialaneix, N. (2015). Random forests for big data. Preprint arXiv:1511.08327. Submitted for publication.
Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1):3–42.
Ghattas, B. (1999). Prévisions des pics d'ozone par arbres de régression, simples et agrégés par bootstrap. Revue de Statistique Appliquée, 47(2):61–80.
Goldstein, B., Hubbard, A., Cutler, A., and Barcellos, L. (2010). An application of random forests to a genome-wide association dataset: methodological considerations & new findings. BMC Genetics, 11(1):1.
Kane, M., Emerson, J., and Weston, S. (2013). Scalable strategies for computing with massive data. Journal of Statistical Software, 55(14).
Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. (2012). The big data bootstrap. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), Edinburgh, Scotland, UK.
Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. (2014). A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):795–816.
Prasad, A., Iverson, L., and Liaw, A. (2006). Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems, 9(2):181–199.
R Core Team (2016). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Saffari, A., Leistner, C., Santner, J., Godec, M., and Bischof, H. (2009). On-line random forests. In Proceedings of the IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pages 1393–1400. IEEE.
Verikas, A., Gelzinis, A., and Bacauskiene, M. (2011). Mining data with random forests: a survey and results of new tests. Pattern Recognition, 44(2):330–349.
Wang, C., Chen, M., Schifano, E., Wu, J., and Yan, J. (2015). A survey of statistical methods and computing for big data. arXiv preprint arXiv:1502.07989.