Predicting rainfall using ensemble of ensembles∗†
Prolok Sundaresan, Varad Meru, and Prateek Jain‡
University of California, Irvine
{sunderap,vmeru,prateekj}@uci.edu
Abstract
Regression is an approach for modeling the relationship between the data X
and the dependent variable y. In this report, we present our experiments with
multiple approaches, ranging from ensembles of learners to deep learning
networks, on weather modeling data to predict rainfall. The competition was
held on the online data science competition portal ‘Kaggle’. The weighted
ensemble of learners gave us a top-10 ranking, with a testing root-mean-squared
error of 0.5878.
1 Introduction
The task of this in-class Kaggle competition was to predict the amount of rainfall
at a particular location using satellite data. We wanted to try various algorithms
and ensembles for regression in order to experiment and learn. The report is
structured in the following manner. Section 2 describes the dataset contents and
the latent structure found using latent variable analysis and clustering; this was
done by Prolok and Prateek. Section 3 describes in detail the various models used
in the project. The neural network/deep learning work was done by Varad, the
random forests by Prolok and Prateek, and the gradient boosting by Prateek and
Varad. Section 4 describes the ensemble-of-ensembles technique we used; this
ensemble sits on top of the different ensembles and learners described in
Section 3, and the work on the final ensemble was done by all three members.
Section 5 presents our learnings and conclusions.
2 Understanding The Data
Visualizing the data was a difficult task, since the data was in 91 dimensions.
In order to look for patterns in the data and visualize it, we applied SVD
(singular value decomposition) to reduce the dimensionality of the features to
two principal dimensions. We then applied k-means clustering with k = 5 on the
data in all 91 dimensions
∗The online competition is available at the Kaggle website
https://inclass.kaggle.com/c/how-s-the-weather. The name of the team was ‘skynet’.
†This work was done as a part of the project for CS 273: Machine Learning, Fall 2014,
taught by Prof. Alexander Ihler.
‡Prolok Sundaresan: Student# 66008474, Varad Meru: Student# 26648958, Prateek Jain:
Student# 28321844
and plotted the assignments in the two-dimensional transformed feature space.
We saw patterns in the data: some points were densely clustered, while others
were sparse.
To visualize this better, we transformed the features into a three-dimensional
space using the first three principal components, and saw that the points were
clustered around three planes.
Figure 1: Visualizing the data in 3 dimensions
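The following is a minimal sketch of this dimensionality-reduction and clustering step, assuming scikit-learn and matplotlib are available; the data-loading line and file name are illustrative placeholders, not the exact code used in the project.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

# X: n x 91 matrix of training features (the file name below is hypothetical)
X = np.loadtxt('kaggle.X1.train.txt', delimiter=',')

# k-means with k = 5 is run on the full 91-dimensional data ...
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# ... and the assignments are plotted in the space of the first two
# principal directions obtained from a truncated SVD of the features.
Z = TruncatedSVD(n_components=2).fit_transform(X)
plt.scatter(Z[:, 0], Z[:, 1], c=labels, s=2)
plt.xlabel('principal component 1')
plt.ylabel('principal component 2')
plt.show()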
3 Machine Learning Models
3.1 Mixture of Experts
As seen from our visualization in Figure 1, we could identify two highly dense
areas of the feature data on either side of a region of sparsely distributed data.
The intuition behind using the mixture-of-experts approach was that a single
regressor would find it difficult to fit the dataset, since the distribution is
non-uniform. We therefore decided to split the data into clusters. To cluster the
data, we used several initializations of the k-means algorithm with k-means++
seeding, and treated the number of clusters as one of the tunable parameters of
our model.
Since each cluster received only a subset of the points from the original dataset,
the number of data points per cluster was not very large. Our concern was that
any model we chose would overfit the data in its cluster. Therefore, we used the
ensemble method of gradient boosting for each of the clusters. Since, in gradient
boosting, we start with an underfitting model and
Figure 2: Visualizing the principal components of the data. (a) Cluster assignments of data points. (b) Mixture-of-experts error.
then gradually add complexity, the chances of overfitting would be lower with
this model. We decided to use decision stumps as the base regressors for the
boosting algorithm.
To evaluate predictions for the validation split and the test data, we first
check which cluster a data point belongs to. We did this by building a
k-nearest-neighbor classifier on the centers of the three clusters created in the
previous step. The classifier predicts the cluster assignment for each test point,
and we then use the boosting ensemble corresponding to that cluster to obtain the
point's prediction.
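A condensed sketch of this mixture-of-experts pipeline, assuming scikit-learn; the cluster count, stump depth, number of boosting rounds, and the variables Xtr/Ytr are illustrative placeholders for the values and data splits we actually used.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor

# Xtr, Ytr: training split (assumed already loaded)
n_clusters = 3                              # tuned as a model parameter
km = KMeans(n_clusters=n_clusters, init='k-means++', n_init=10, random_state=0)
assign = km.fit_predict(Xtr)

# one gradient-boosted ensemble of decision stumps per cluster
experts = []
for c in range(n_clusters):
    idx = (assign == c)
    gbr = GradientBoostingRegressor(max_depth=1, n_estimators=700)
    experts.append(gbr.fit(Xtr[idx], Ytr[idx]))

def moe_predict(Xte):
    # route each point to the expert of its nearest cluster centre
    # (equivalent to a 1-nearest-neighbour classifier on the centres)
    cluster = km.predict(Xte)
    out = np.zeros(len(Xte))
    for c in range(n_clusters):
        mask = (cluster == c)
        if mask.any():
            out[mask] = experts[c].predict(Xte[mask])
    return out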
The model parameters we varied were the number of clusters and the number of
regressors used for boosting. We found that although the test error decreased
considerably as we increased the number of boosting regressors, the validation
error increased after a certain point, as can be seen in Figure 2(b). We obtained
the minimum validation error with 700 regressors.
3.2 Neural Networks
We implemented various types of neural networks, ranging from single-layer
networks to three-layer sigmoidal neural networks.
Single Layer Network
Figure 3: Single Layer Architecture.
We built the neural networks using MATLAB's Neural Network Toolbox and the
PyBrain library in Python. For the MATLAB implementation, runs were made with
different numbers of neurons in the hidden layer. The architecture of the neural
network can be seen in Figure 3, and Figure 4 shows the train-validation-test
plots for the different network architectures. The dataset was split into 70%
(training), 20% (validation), and 10% (testing) sections for the neural network
runs. Table 1 shows the performance of the models learned. It was seen that the
neural networks started to overfit as the number of neurons was increased beyond 40.
# of Neurons | Training Error (RMSE) | Testing Error (RMSE)
10           | 0.5986                | 0.61341
20           | 0.5875                | 0.61301
50           | 0.5852                | 0.62889
Table 1: RMSE error rates for different network architectures.
It was observed that the networks could not learn very accurately, as there was
not enough data for the neural network to learn from.
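For the Python side, a minimal single-hidden-layer sketch with PyBrain might look like the following; the 20-neuron hidden layer matches one of the configurations in Table 1, and the data variables Xtr, Ytr, Xte are assumed to be loaded elsewhere.

from pybrain.datasets import SupervisedDataSet
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.structure import SigmoidLayer, LinearLayer

# Xtr, Ytr, Xte: data splits (assumed already loaded)
ds = SupervisedDataSet(91, 1)                    # 91 features, 1 rainfall target
for x, y in zip(Xtr, Ytr):
    ds.addSample(x, y)

net = buildNetwork(91, 20, 1,                    # 20 hidden neurons (cf. Table 1)
                   hiddenclass=SigmoidLayer, outclass=LinearLayer, bias=True)
trainer = BackpropTrainer(net, ds)
# hold out part of the data and stop when validation error stops improving
trainer.trainUntilConvergence(validationProportion=0.2, maxEpochs=200)

yhat_nn = [net.activate(x)[0] for x in Xte]      # predictions on the test split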
Figure 4: Train-validation-test error plots and error-distribution histograms for hidden layers with 10, 20, and 50 neurons.
Deep Networks
For this project, we tried using deep networks as well. The deep network was
built using PyBrain. We tried different activation functions and architectures
to understand how deep networks would perform. The architecture shown below had
three hidden layers: the visible layer contained 91 neurons, the first hidden
layer (tanh) had 91 neurons, the second hidden layer (sigmoid) had 50 neurons,
the third hidden layer (sigmoid) had 20 neurons, and the output layer had a
single linear node. The testing error of 0.83643 was very high compared to the
other approaches. We concluded that the network was learning the training data
well but was overfitting.
Figure: Deep network architecture with an input layer, a hyperbolic tangent hidden layer, sigmoid hidden layers, and a linear output layer.
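A sketch of how such an architecture could be assembled layer by layer with PyBrain; this is illustrative rather than the exact code used in the project, and training would proceed with a BackpropTrainer as in the previous sketch.

from pybrain.structure import (FeedForwardNetwork, FullConnection,
                               LinearLayer, SigmoidLayer, TanhLayer)

net = FeedForwardNetwork()
inp  = LinearLayer(91)     # visible layer: 91 input features
h1   = TanhLayer(91)       # first hidden layer (hyperbolic tangent)
h2   = SigmoidLayer(50)    # second hidden layer (sigmoid)
h3   = SigmoidLayer(20)    # third hidden layer (sigmoid)
outp = LinearLayer(1)      # single linear output node

net.addInputModule(inp)
for layer in (h1, h2, h3):
    net.addModule(layer)
net.addOutputModule(outp)
for a, b in zip((inp, h1, h2, h3), (h1, h2, h3, outp)):
    net.addConnection(FullConnection(a, b))
net.sortModules()          # finalise the topology; net.activate(x) now works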
3.3 Gradient Boosting
In parallel, we worked on training a gradient boosting model with varying
parameters to get the best fit for the data. We started with basic decision
stumps, with the number of regressors ranging from 1 to 2000. We also varied the
maximum depth of the decision tree used as the base regressor from 3 to 7, and
used alpha = 0.9 for our algorithm. We observed that we got the best performance
with 2000 boosters and a depth of 7 (see Figure 5).
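A minimal sketch of this sweep, assuming scikit-learn; the report only states that alpha = 0.9 was used, so mapping it to the huber-loss quantile of GradientBoostingRegressor is our assumption, and Xtr, Ytr, Xva, Yva are assumed data splits.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Xtr, Ytr, Xva, Yva: training and validation splits (assumed already loaded)
best = None
for depth in range(3, 8):                            # maximum tree depth 3..7
    gbr = GradientBoostingRegressor(n_estimators=2000,
                                    max_depth=depth,
                                    loss='huber',    # assumption: alpha below is
                                    alpha=0.9)       # the huber-loss quantile
    gbr.fit(Xtr, Ytr)
    rmse = mean_squared_error(Yva, gbr.predict(Xva)) ** 0.5
    if best is None or rmse < best[0]:
        best = (rmse, depth, gbr)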
3.4 Random Forests
Several aspects of the random forest technique were explored. The fundamental
idea behind random forests is to take a model that overfits the data and then
use feature and data bagging to bring down the complexity so that it fits the
data better. The model usually used in a random forest is a high-depth regression
tree. We tried to explore other models that overfitted the data.
The first option was simple linear regression with a feature transformation: the
data X was expanded into X and X² features, and linear regression was run on this
expanded feature set.
Figure 5: Train and test error plot for gradient boosting vs. number of learners
This transformation gave significantly better results (the test error improved
from 0.4322 to 0.4181), but performance worsened significantly when X³ features
were added to the feature list. This linear regressor was also used as the base
regressor for the random forest, but the results were better with a tree
regressor. The major takeaway from this analysis was the use of X² features in
the feature list for tree regression. Several other regressors, such as a
k-nearest-neighbor regressor, were also tried, but the tree regressor came out
on top.
Since decision tree regression performed significantly better than linear
regression in the random forest, we decided to proceed with it, with the X²
features also in place (a total of 182 features). nFeatures was set to 150, and
depths of 13, 14, 15, 16, and 17 were tried, of which a maxDepth of 14 gave
optimal performance. 150 decision trees were learned, and the optimum results
were obtained with 90 learners.
Learner                            | Training Error (MSE) | Testing Error (MSE)
Linear Regressor                   | 0.4068               | 0.4243
Linear Regressor with X² features  | 0.3996               | 0.4140
Tree Regressor                     | 0.1951               | 0.3822
Table 2: MSE error rates for random forests.
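A minimal sketch of this random forest setup, assuming scikit-learn; the variables Xtr, Ytr, Xte are assumed data splits, and the squared-feature expansion reproduces the 182-column feature matrix described above.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Xtr, Ytr, Xte: data splits (assumed already loaded)
X182tr = np.hstack([Xtr, Xtr ** 2])          # append squared features -> 182 columns
X182te = np.hstack([Xte, Xte ** 2])

rf = RandomForestRegressor(n_estimators=90,  # optimum found at 90 learners
                           max_features=150, # nFeatures = 150
                           max_depth=14,     # best of depths 13..17
                           random_state=0)
rf.fit(X182tr, Ytr)
yhat_rf = rf.predict(X182te)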
4 Ensemble of all Learners
At the end, since we had trained many learners separately, some of which were
ensembles themselves, we decided to aggregate the results of the learners to
improve our prediction. We also analyzed the variance between the results of our
learners, and an average variance of 0.0204 was obtained. Since the
variance was noticeable, a weighted-average aggregation of the results seemed the
best approach. We chose the model parameters of the best-performing models from
each category to get a consolidated result. Figure 6 shows the architecture of
our ensemble. Initially, we chose a very simple approach of assigning all models
the same weight to get a prediction. We got some improvement, with an RMSE of
0.5908. We saw that this was performing just below our best individual prediction
model, so we decided to increase the weight of our best learner in the ensemble.
This helped improve our aggregated prediction, providing an RMSE of 0.5878.
Figure 6: Ensemble of Learners
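A minimal sketch of the weighted-average aggregation; the prediction vectors (yhat_moe, yhat_nn, yhat_gb, yhat_rf), which learner is treated as the best, and the exact weights are illustrative placeholders.

import numpy as np

# per-model predictions on the test data (assumed computed by the models above)
preds   = np.vstack([yhat_moe, yhat_nn, yhat_gb, yhat_rf])
weights = np.array([1.0, 1.0, 2.0, 1.0])     # best learner given a larger weight
yhat_ensemble = np.average(preds, axis=0, weights=weights)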
5 Conclusion
This project gave us a glimpse of how machine learning techniques are applied
to real-world problems. We applied a variety of techniques, including neural
networks, decision trees, random forests, gradient boosting, k-means clustering,
and PCA. Testing various parameters of the different learner types helped us
identify where each of the models under-fitted and over-fitted the data. Finally,
while modifying the parameters of each model helped us reduce the variance of
the models, we used a final weighted ensemble of various learners to reduce the
bias of the individual learners.