We build a model to predict bird sightings from a large, wide dataset (1,700 columns, 8 million rows) using distributed computing with Spark (Scala) and MLlib's Random Forest classifier, chosen for its robustness to high variance in the data.

- For exploratory data analysis, we store the data in HDFS and analyze it with Hive; we also pull the data into RapidMiner for cleaning and feature engineering.
- We also explored running R on a 60 GB EC2 node on AWS to use R packages for building features for the final model training.
- We train the model with H2O Sparkling Water on an AWS EMR cluster.

https://github.com/singhay/ms-courses-code/blob/master/CS6240-Parallel-Data-Processing-in-MapReduce-Spark/Project/SinghVashisht.pdf
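The Spark/MLlib training step described above can be sketched as follows. This is a minimal, self-contained illustration, not the project's actual pipeline: the toy rows, column names (`temp`, `elevation`, `label`), and the object name `BirdSightingModel` are all hypothetical stand-ins for the real 1,700-column dataset, and the session runs in local mode rather than on the EMR cluster.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object BirdSightingModel {
  // Train a small random forest on toy data and return the number of
  // trees in the fitted model. In the real project this logic ran on an
  // AWS EMR cluster over the full dataset.
  def trainedTreeCount(): Int = {
    val spark = SparkSession.builder()
      .appName("bird-sightings-rf")
      .master("local[2]") // local mode for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical toy rows standing in for the wide eBird-style data:
    // (temperature, elevation, label = bird sighted or not).
    val df = Seq(
      (25.0, 100.0, 1.0),
      (5.0, 900.0, 0.0),
      (22.0, 150.0, 1.0),
      (3.0, 1200.0, 0.0)
    ).toDF("temp", "elevation", "label")

    // MLlib expects a single vector column of features, so raw columns
    // are assembled into one "features" vector first.
    val assembler = new VectorAssembler()
      .setInputCols(Array("temp", "elevation"))
      .setOutputCol("features")

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(50)

    val model = rf.fit(assembler.transform(df))
    val n = model.getNumTrees
    spark.stop()
    n
  }

  def main(args: Array[String]): Unit =
    println(s"Trained a forest of ${trainedTreeCount()} trees")
}
```

Random forests average many decorrelated decision trees trained on bootstrapped samples, which is what makes the ensemble robust to variance in a dataset this wide.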