SlideShare ist ein Scribd-Unternehmen logo
1 von 9
Downloaden Sie, um offline zu lesen
RED WINGED BLACK BIRD
Sighting PREDICTION
DATA MINING in Spark Mllib USING RANDOM FOREST CLASSIFIER
Ayush Singh & Gautam Vashisht
CS 6240: Parallel big data processing with MapReduce
Prof: Mirek Riedewald
Spring 2017, Northeastern University
Overview
• Predict sightings
• Accuracy on hold out set: 88%
• With minimal pre processing
• Random Forest Classifier Model is
used to predict the final output
EDA
Integration
Selection
Preprocess
Cleaning
Pivot
Engineering
Training
RFC
Evaluation
Model
Selection
Prediction
Accuracy
Exploratory Data Analysis: Tools
Feature Engineering and Findings
• Stratified sampling correlative variable evaluation
• Imputed with minimal values to not disturb distribution
• Binning and Discretization on continuous data
• Pivot on First Order statistics
• Integrate Weather API to fill remaining GIS values
• Replacing by mean reduced accuracy
• Generalized low-rank modelling for compressing
Random Forest Classifier
• Why Spark?
• Random Forest is best suited for parallelism
• Can handle large variance
• Ensemble of randomly built trees
• Parallel Implementation in spark mllib
Evaluation and Selection
• Increase trees only if accuracy increases
• Increasing depth highly correlated with performance
• Cross Validation traded off due to OOB error of RFC
• Optimal Parameters: 50 trees, 20 depth
• Maximum number of bins = 1024
Different AWS Instance Run
AWS Instance Prediction on data Set Time Output Test Error
m4.2xlarge with 16 nodes On split test set from labeled
data.
6.20 hour and still running Aborted the program as it was
exceeding expected time limit.
NA
M4.2xlarge with 11 nodes On unlabeled test set 3.48 hour and still running Aborted the program as it was
exceeding expected time limit.
NA
r4.2xlarge On split test set from labeled
data.
2.28 hour Successful 0.1261512360
r4.2xlarge On unlabeled test set 1.18 hour Successful NA
Conclusion
• Resources vs Performance is a tradeoff
• Occam’s Razor still stands true
• Big data is scary

Weitere ähnliche Inhalte

Ähnlich wie Red Winged Black Bird Sighting Prediction using Spark Mllib Random Forest Classifier

Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Carol McDonald
 
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...VMware Tanzu
 
Week 12 Dimensionality Reduction Bagian 1
Week 12 Dimensionality Reduction Bagian 1Week 12 Dimensionality Reduction Bagian 1
Week 12 Dimensionality Reduction Bagian 1khairulhuda242
 
Best IEEE Projects 2017 -2018 Titles - IEEE Final Year Projects @ Brainrich T...
Best IEEE Projects 2017 -2018 Titles - IEEE Final Year Projects @ Brainrich T...Best IEEE Projects 2017 -2018 Titles - IEEE Final Year Projects @ Brainrich T...
Best IEEE Projects 2017 -2018 Titles - IEEE Final Year Projects @ Brainrich T...Brainrich Technology
 
PPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptx
PPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptxPPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptx
PPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptxneju3
 
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Spark Summit
 
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...Cambridge Semantics
 
Understanding your Data - Data Analytics Lifecycle and Machine Learning
Understanding your Data - Data Analytics Lifecycle and Machine LearningUnderstanding your Data - Data Analytics Lifecycle and Machine Learning
Understanding your Data - Data Analytics Lifecycle and Machine LearningAbzetdin Adamov
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Carol McDonald
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-Systeminside-BigData.com
 
Tuning Java Driver for Apache Cassandra by Nenad Bozic at Big Data Spain 2017
Tuning Java Driver for Apache Cassandra by Nenad Bozic at Big Data Spain 2017Tuning Java Driver for Apache Cassandra by Nenad Bozic at Big Data Spain 2017
Tuning Java Driver for Apache Cassandra by Nenad Bozic at Big Data Spain 2017Big Data Spain
 
Fancy resumeziweimeng0502
Fancy resumeziweimeng0502Fancy resumeziweimeng0502
Fancy resumeziweimeng0502Ziwei(Zoe) Meng
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019VMware Tanzu
 
Predicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningPredicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningLeo Salemann
 
Predicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningPredicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningKarunakar Kotha
 
Predicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningPredicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningWenfan Xu
 
A review on Exploiting experts’ knowledge for structure learning of bayesian ...
A review on Exploiting experts’ knowledge for structure learning of bayesian ...A review on Exploiting experts’ knowledge for structure learning of bayesian ...
A review on Exploiting experts’ knowledge for structure learning of bayesian ...Reza Sadeghi
 

Ähnlich wie Red Winged Black Bird Sighting Prediction using Spark Mllib Random Forest Classifier (20)

Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
 
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
 
Week 12 Dimensionality Reduction Bagian 1
Week 12 Dimensionality Reduction Bagian 1Week 12 Dimensionality Reduction Bagian 1
Week 12 Dimensionality Reduction Bagian 1
 
IDMP
IDMPIDMP
IDMP
 
Best IEEE Projects 2017 -2018 Titles - IEEE Final Year Projects @ Brainrich T...
Best IEEE Projects 2017 -2018 Titles - IEEE Final Year Projects @ Brainrich T...Best IEEE Projects 2017 -2018 Titles - IEEE Final Year Projects @ Brainrich T...
Best IEEE Projects 2017 -2018 Titles - IEEE Final Year Projects @ Brainrich T...
 
PPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptx
PPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptxPPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptx
PPT-UEU-Database-Objek-Terdistribusi-Pertemuan-8.pptx
 
AbhijitTripathy
AbhijitTripathyAbhijitTripathy
AbhijitTripathy
 
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
 
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
 
Understanding your Data - Data Analytics Lifecycle and Machine Learning
Understanding your Data - Data Analytics Lifecycle and Machine LearningUnderstanding your Data - Data Analytics Lifecycle and Machine Learning
Understanding your Data - Data Analytics Lifecycle and Machine Learning
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-System
 
CV_Abhishek
CV_AbhishekCV_Abhishek
CV_Abhishek
 
Tuning Java Driver for Apache Cassandra by Nenad Bozic at Big Data Spain 2017
Tuning Java Driver for Apache Cassandra by Nenad Bozic at Big Data Spain 2017Tuning Java Driver for Apache Cassandra by Nenad Bozic at Big Data Spain 2017
Tuning Java Driver for Apache Cassandra by Nenad Bozic at Big Data Spain 2017
 
Fancy resumeziweimeng0502
Fancy resumeziweimeng0502Fancy resumeziweimeng0502
Fancy resumeziweimeng0502
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
 
Predicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningPredicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine Learning
 
Predicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningPredicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine Learning
 
Predicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningPredicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine Learning
 
A review on Exploiting experts’ knowledge for structure learning of bayesian ...
A review on Exploiting experts’ knowledge for structure learning of bayesian ...A review on Exploiting experts’ knowledge for structure learning of bayesian ...
A review on Exploiting experts’ knowledge for structure learning of bayesian ...
 

Kürzlich hochgeladen

Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...HyderabadDolls
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...gajnagarg
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabiaahmedjiabur940
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...nirzagarg
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...gajnagarg
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxchadhar227
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...nirzagarg
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...HyderabadDolls
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...HyderabadDolls
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...gajnagarg
 

Kürzlich hochgeladen (20)

Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 

Red Winged Black Bird Sighting Prediction using Spark Mllib Random Forest Classifier

  • 1. RED WINGED BLACK BIRD Sighting PREDICTION DATA MINING in Spark Mllib USING RANDOM FOREST CLASSIFIER Ayush Singh & Gautam Vashisht CS 6240: Parallel big data processing with MapReduce Prof: Mirek Riedewald Spring 2017, Northeastern University
  • 2. Overview • Predict sightings • Accuracy on hold out set: 88% • With minimal pre processing • Random Forest Classifier Model is used to predict the final output
  • 5. Feature Engineering and Findings • Stratified sampling correlative variable evaluation • Imputed with minimal values to not disturb distribution • Binning and Discretization on continuous data • Pivot on First Order statistics • Integrate Weather API to fill remaining GIS values • Replacing by mean reduced accuracy • Generalized low-rank modelling for compressing
  • 6. Random Forest Classifier • Why Spark? • Random Forest is best suited for parallelism • Can handle large variance • Ensemble of randomly built trees • Parallel Implementation in spark mllib
  • 7. Evaluation and Selection • Increase trees only if accuracy increases • Increasing depth highly correlated with performance • Cross Validation traded off due to OOB error of RFC • Optimal Parameters: 50 trees, 20 depth • Maximum number of bins = 1024
  • 8. Different AWS Instance Run AWS Instance Prediction on data Set Time Output Test Error m4.2xlarge with 16 nodes On split test set from labeled data. 6.20 hour and still running Aborted the program as it was exceeding expected time limit. NA M4.2xlarge with 11 nodes On unlabeled test set 3.48 hour and still running Aborted the program as it was exceeding expected time limit. NA r4.2xlarge On split test set from labeled data. 2.28 hour Successful 0.1261512360 r4.2xlarge On unlabeled test set 1.18 hour Successful NA
  • 9. Conclusion • Resources vs Performance is a tradeoff • Occam’s Razor still stands true • Big data is scary