We build a model to predict bird sightings from a large, wide dataset (1,700 columns, 8 million rows) using distributed computing with Spark (Scala) and MLlib's Random Forest classifier, chosen for its robustness to high variance in the data.

- For exploratory data analysis, we store the data in HDFS and analyze it with Hive; we also pull the data into RapidMiner for cleaning and feature engineering.
- We also explored running R on a 60 GB EC2 node on AWS to use R packages for building features for the final model training.
- We train the model with H2O Sparkling Water on an AWS EMR cluster.

https://github.com/singhay/ms-courses-code/blob/master/CS6240-Parallel-Data-Processing-in-MapReduce-Spark/Project/SinghVashisht.pdf
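The Spark/MLlib training step described above can be sketched as follows. This is a minimal, self-contained illustration, not the project's actual pipeline: the toy rows, column names (`temp`, `elevation`, `label`), and the object name `BirdSightingModel` are all hypothetical stand-ins for the real 1,700-column dataset, and the session runs in local mode rather than on the EMR cluster.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object BirdSightingModel {
  // Train a small random forest on toy data and return the number of
  // trees in the fitted model. In the real project this logic ran on an
  // AWS EMR cluster over the full dataset.
  def trainedTreeCount(): Int = {
    val spark = SparkSession.builder()
      .appName("bird-sightings-rf")
      .master("local[2]") // local mode for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical toy rows standing in for the wide eBird-style data:
    // (temperature, elevation, label = bird sighted or not).
    val df = Seq(
      (25.0, 100.0, 1.0),
      (5.0, 900.0, 0.0),
      (22.0, 150.0, 1.0),
      (3.0, 1200.0, 0.0)
    ).toDF("temp", "elevation", "label")

    // MLlib expects a single vector column of features, so raw columns
    // are assembled into one "features" vector first.
    val assembler = new VectorAssembler()
      .setInputCols(Array("temp", "elevation"))
      .setOutputCol("features")

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(50)

    val model = rf.fit(assembler.transform(df))
    val n = model.getNumTrees
    spark.stop()
    n
  }

  def main(args: Array[String]): Unit =
    println(s"Trained a forest of ${trainedTreeCount()} trees")
}
```

Random forests average many decorrelated decision trees trained on bootstrapped samples, which is what makes the ensemble robust to variance in a dataset this wide.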