CS 267: Data Mining Presentation
Guided by: Dr. Tran
- Gaurav Kasliwal
Outline
• RandomForest Model
• Mahout Overview
• RandomForest using Mahout
• Problem Description
• Working Environment
• Data Preparation
• ML Model Generation
• Demo
• Using Gini Index
RandomForest Model
• Random forests are an ensemble learning method for classification that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees.
• Developed by Leo Breiman and Adele Cutler.
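The voting step described above can be sketched in a few lines. This is only an illustration of taking the mode of per-tree outputs; the function name `forest_predict` and the vote values are hypothetical and not part of Mahout.

```python
from collections import Counter


def forest_predict(tree_predictions):
    """Return the mode (most frequent class) of the per-tree predictions."""
    # most_common(1) yields [(label, count)]; on a tie, the label that
    # appeared first in the input wins.
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical outputs of six trees for a single record.
votes = [2, 0, 2, 1, 2, 0]
print(forest_predict(votes))
```

With these votes, class 2 receives three of the six votes and is returned as the forest's prediction.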
Mahout
• Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm.
• Scalable to large data sets.
RandomForest using Mahout
• Generate a file descriptor for the dataset.
• Run the example with the training data to build a Decision Forest model.
• Use the Decision Forest model to classify the test data and get results.
• Tune the model to get better results.
Problem Definition
• Benchmark a machine-learning model for page ranking.
• Dataset: Yahoo! Learning to Rank
• Train data: 34815 records
• Test data: 130166 records
• Data description:
  {R} | {q_id} | {List: feature_id -> feature_value}
  where R = relevance label, one of {0, 1, 2, 3, 4}
  q_id = query id (number)
  feature_id = number
  feature_value = 0 to 1
Working Environment
• Ubuntu
• Hadoop 1.2.1
• Mahout 0.9
Prepare Dataset
• Take the data from the input text file.
• Convert it to a .csv file.
• Make a directory in HDFS and upload train.csv and test.csv to it.
• Data loading (load the data into HDFS):
#hadoop fs -put train.arff final_data
#hadoop fs -put test.arff final_data
#hadoop fs -ls final_data    (check with the ls command)
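The text-to-CSV step can be sketched as follows. The slides do not show the raw file layout, so this assumes each input line looks like `R qid:Q fid:value fid:value ...` with feature ids starting at 1, that missing features should become 0, and that the label belongs in the last column. The function name `record_to_csv_row` and the parameter `n_features` are hypothetical.

```python
def record_to_csv_row(line, n_features):
    """Convert 'R qid:Q fid:val ...' into the CSV row 'Q,f1,...,fn,R'."""
    tokens = line.split()
    label = int(tokens[0])
    q_id = int(tokens[1].split(":", 1)[1])
    # Assumption: features absent from the sparse list are written as 0.
    features = [0.0] * n_features
    for token in tokens[2:]:
        fid, value = token.split(":", 1)
        features[int(fid) - 1] = float(value)  # assume ids start at 1
    cells = [str(q_id)] + [format(v, "g") for v in features] + [str(label)]
    return ",".join(cells)


print(record_to_csv_row("2 qid:7 1:0.5 3:0.25", n_features=4))
```

For the sample line, this prints `7,0.5,0,0.25,0,2`: the query id, four dense feature values, and the label last.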
Using Mahout
Make the metadata (descriptor):
#hadoop jar mahout-core-0.9-job.jar org.apache.mahout.classifier.df.tools.Describe -p final_data/train.csv -f final_data/train.info1 -d 702 N L
• The descriptor 702 N L declares 702 numerical attributes followed by the label.
• This creates the metadata file train.info1 in the final_data folder.
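To make the `-d 702 N L` argument concrete, here is a small sketch that expands such a descriptor string into one type code per column. It reflects my reading of the descriptor syntax (a number repeats the type code that follows it) and is not Mahout's own parser; `expand_descriptor` is a hypothetical helper.

```python
def expand_descriptor(descriptor):
    """Expand a descriptor such as '702 N L' into one type code per column."""
    columns = []
    repeat = 1
    for token in descriptor.split():
        if token.isdigit():
            # A number applies to the next type code only.
            repeat = int(token)
        else:
            columns.extend([token] * repeat)
            repeat = 1
    return columns


cols = expand_descriptor("702 N L")
print(len(cols), cols[0], cols[-1])
```

Under this reading, the training file is expected to have 703 columns: 702 numerical attributes and then the label.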
Create Model
Make the model:
#hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d final_data/train.arff -ds final_data/train.info -sl 5 -p -t 100 -o final-forest
Test Model
Test the model:
#hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d final_data/train.arff -ds final_data/train.info -p -t 1000 -o final-forest
Results
Summary results: confusion matrix and statistics
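As a reminder of what the summary contains, the sketch below builds a confusion matrix and an accuracy figure from actual and predicted labels. The label lists are toy values chosen for illustration, not results from this experiment, and the helper names are hypothetical.

```python
def confusion_matrix(actual, predicted, n_classes):
    """Rows are actual classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of records on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels in the 0..4 relevance range (illustrative only).
actual = [0, 1, 2, 2, 4, 3]
predicted = [0, 1, 2, 1, 4, 4]
matrix = confusion_matrix(actual, predicted, n_classes=5)
for row in matrix:
    print(row)
print(accuracy(matrix))
```

Here four of the six toy records fall on the diagonal, so the accuracy is 4/6.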
Tuning
• Change the parameters -t and -sl and check the results.
• --nbtrees (-t) nbtrees: number of trees to grow
• --selection (-sl) m: number of variables to select randomly at each tree node
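One way to organise the tuning runs is to generate a BuildForest command line for each combination of -t and -sl. The sketch below only builds the command strings from the flags shown in these slides; the grid values and the output directory naming scheme are hypothetical.

```python
from itertools import product


def build_forest_command(trees, selection,
                         data="final_data/train.csv",
                         descriptor="final_data/train.info1"):
    """Return a BuildForest command line for one (-t, -sl) setting."""
    # Hypothetical naming scheme so each run writes to its own directory.
    output = f"forest-t{trees}-sl{selection}"
    return (
        "hadoop jar mahout-examples-0.9-job.jar "
        "org.apache.mahout.classifier.df.mapreduce.BuildForest "
        "-Dmapred.max.split.size=1874231 "
        f"-d {data} -ds {descriptor} -sl {selection} -p -t {trees} -o {output}"
    )


# Illustrative grid: two tree counts and two selection sizes.
for trees, selection in product([100, 600], [5, 27]):
    print(build_forest_command(trees, selection))
```

Each printed line can then be run by hand or from a shell script, and the resulting forests compared on the test data.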
Results
• #hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d final_data/train.csv -ds final_data/train.info1 -sl 700 -p -t 600 -o final-forest2
• #hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i final_data/test.csv -ds final_data/train.info1 -m final-forest2 -a -mr -o final-pred2
RF Split selection
• Typically we select about √K predictors at each node, where K is the total number of predictors available.
• If we have 500 columns of predictors, we select only about 23 (√500 ≈ 22.4).
• We split the node with the best variable among those 23, not the best variable among all 500.
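The subset size and the random draw can be sketched as below. Rounding √K up with `ceil` is an assumption chosen to match the "about 23" figure for K = 500; the function name `candidate_features` is hypothetical.

```python
import math
import random


def candidate_features(n_predictors, rng):
    """Draw about sqrt(K) distinct predictor indices for one tree node."""
    m = math.ceil(math.sqrt(n_predictors))  # assumption: round up
    return rng.sample(range(n_predictors), m)


rng = random.Random(0)  # fixed seed so the draw is repeatable
subset = candidate_features(500, rng)
print(len(subset))  # 23
```

A fresh subset is drawn at every node, so different nodes of the same tree usually consider different predictors.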
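Using Gini Index
• If a dataset T contains examples from n classes, the Gini index Gini(T) is defined as:
$$\mathrm{Gini}(T) = 1 - \sum_{j=1}^{n} p_j^2$$
  where p_j is the relative frequency of class j in T.
• If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the Gini index of the split is defined as:
$$\mathrm{Gini}_{\mathrm{split}}(T) = \frac{N_1}{N}\,\mathrm{Gini}(T_1) + \frac{N_2}{N}\,\mathrm{Gini}(T_2), \qquad N = N_1 + N_2$$
• The attribute value that provides the smallest Gini_split(T) is chosen to split the node.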
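A minimal sketch of the computation, assuming the usual definitions: the Gini index of a set is 1 minus the sum of squared class frequencies, and the Gini index of a split is the size-weighted average over the two subsets. The label lists are toy values and the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left, right):
    """Size-weighted Gini index of a two-way split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Toy split: the right subset is pure, the left one is mixed.
left = [0, 0, 0, 1]
right = [1, 1]
print(gini(left))
print(gini_split(left, right))
```

A pure subset contributes 0, so splits that isolate one class well receive a low score and are preferred.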
Example
• The example below shows the construction of a single tree using the dataset.
• Only two of the original four attributes are chosen for this tree construction.
• The table below lists the Gini index of the HOME_TYPE attribute at all possible splits.
• The split HOME_TYPE <= 10 has the lowest value.

Split               Gini split value
HOME_TYPE <= 6      0.4000
HOME_TYPE <= 10     0.2671
HOME_TYPE <= 15     0.4671
HOME_TYPE <= 30     0.3000
HOME_TYPE <= 31     0.4800
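Choosing the split from the tabulated values amounts to taking the minimum. The sketch below uses the five HOME_TYPE thresholds and Gini values listed above; the helper name `best_split` is hypothetical.

```python
def best_split(gini_by_threshold):
    """Return the threshold whose split has the smallest Gini value."""
    return min(gini_by_threshold, key=gini_by_threshold.get)


# Gini split values for HOME_TYPE <= threshold, taken from the table.
home_type_splits = {6: 0.4000, 10: 0.2671, 15: 0.4671, 30: 0.3000, 31: 0.4800}
print(best_split(home_type_splits))  # 10
```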
Random Forest using Apache Mahout
