Random forest using apache mahout

CS 267 : Data Mining Presentation
Guided by : Dr. Tran
-Gaurav Kasliwal

Outline
 RandomForest Model
 Mahout Overview
 RandomForest using Mahout
 Problem Description
 Working Environment
 Data Preparation
 ML Model Generation
 Demo
 Using Gini Index

RandomForest Model
 Random forests are an ensemble learning method
for classification that operate by constructing a
multitude of decision trees at training time and
outputting the class that is the mode of
the classes output by individual trees.
 Developed by Leo Breiman and Adele Cutler.

Mahout
 Mahout is a library of scalable machine-learning
algorithms, implemented on top of Apache Hadoop
and using the MapReduce paradigm.
 Scalable to large data sets

RandomForest using Mahout
 Generate a file descriptor for the dataset.
 Run the example with train data and build Decision
Forest model.
 Use the Decision Forest model to Classify test data and
get results.
 Tuning the model to get better results.

Problem Definition
 To Benchmark machine learning model for Page-Rank
 Yahoo! Learning to Rank
 Train Data : 34815 Records
 Test Data : 130166 Records
 Data Description :
 {R} | {q_id} | {List: feature_id -> feature_value}
 where R = {0, 1, 2, 3, 4}
 q_id = query id (number)
 feature_id = number feature_value = 0 to 1

Working Environment
 Ubuntu
 Hadoop 1.2.1
 Mahout 0.9

Prepare Dataset
 Take data from input text file
 Make a .csv file
 Make directory in HDFS and upload train.csv and
test.csv to the folder.
 Data Loading (Load data to HDFS)
 #hadoop fs -put train.arff final_data
 #hadoop fs -put test.arff final_data
 #hadoop fs -ls final_data (check by ls command )

Using Mahout
make metadata:
#hadoop jar mahout-core-0.9-job.jar org.apache.mahout.classifier.df.tools.Describe -p
final_data/train.csv -f final_data/train.info1 -d 702 N L
 It creates a metadata train.info1 in final_data folder.

Create Model
make model
#hadoop jar mahout-examples-0.9-job.jar
org.apache.mahout.classifier.df.mapreduce.BuildForest -
Dmapred.max.split.size=1874231 -d final_data/train.arff -ds
final_data/train.info -sl 5 -p -t 100 -o final-forest

Test Model
test model
#hadoop jar mahout-examples-0.9-job.jar
org.apache.mahout.classifier.df.mapreduce.BuildForest -
Dmapred.max.split.size=1874231 -d final_data/train.arff -ds
final_data/train.info -p -t 1000 -o final-forest

Results
Summary results : Confusion Matrix and statistics

Tuning
 (change the parameters -t and -sl) and check the
results.
 --nbtrees (-t) nbtrees Number of trees to grow
 --selection (-sl) m Number of variables to
select randomly at each tree-node.

Results
 #hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -
Dmapred.max.split.size=1874231 -d final_data/train.csv -ds final_data/train.info1 -sl 700 -p -t 600 -o
final-forest2
 #hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i
final_data/test.csv -ds final_data/train.info1 -m final-forest2 -a -mr -o final-pred2

RF Split selection
 Typically we select about square root (K) when there
are K is the total number of predictors available
 If we have 500 columns of predictors we will select
only about 23
 We split our node with the best variable among the 23,
not the best variable among the 500

Using Gini Index
 If a dataset T is split into two subsets T1 and T2 with
sizes N1 and N2 respectively, the gini index of the split
data contains examples from n classes, the gini index
(T) is defined as:
 **The attribute value that provides the smallest SPLIT Gini (T) is chosen to
split the node.

Example
 The example below shows the construction of a single
tree using the dataset .
 Only two of the original four attributes are chosen for
this tree construction.

 tabulates the gini index value for the HOME_TYPE
attribute at all possible splits.
 the split HOME_TYPE <= 10 has the lowest value
Gini SPILT Value
Gini SPILT(HOME_TYPE<=6) 0.4000

Random forest using apache mahout

Random forest using apache mahout

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (20)

Ähnlich wie Random forest using apache mahout

Ähnlich wie Random forest using apache mahout (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Random forest using apache mahout