Lab 2: Zillow Dataset Classification and
Regression Prediction Models
MSDS 7331 Data Mining - Section 403 - Lab 2
Team: Ivelin Angelov, Yao Yao, Kaitlin Kirasich, Albert Asuncion
Contents
Imports
Define and Prepare Class Variables
Classification Variables
Regression Variables
Describe the Final Dataset
Classification Dataset
Regression Dataset
Explain Evaluation Metrics
Classification Metrics
Regression Metrics
Training and Testing Splits
For Classification
For Regression
Three Different Classification/Regression Models
Classification Models
K Nearest Neighbors
Random Forest
Naive Bayes
Regression Models
K Nearest Neighbors
Random Forest
Gaussian Regression
Visualizations of Results and Analysis
Analysis of Classification Models
Analysis of K Nearest Neighbors
Analysis of Random Forest
Analysis of Naive Bayes
Regression Models
Analysis of K Nearest Neighbors
Analysis of Random Forest
Analysis of Gaussian Regression
Advantages of Each Model
Classification Models
Regression Models
Important Attributes
Classification Models
Regression Models
Deployment
Exceptional Work
Approaches Considered for Balanced Classification
Feature Elimination
Two dimensional Linear Discriminant Analysis
References
Imports & Custom Functions
We chose to use the same Zillow dataset from Lab 1 for this exploration in regression and
classification. For the origin and purpose of dataset as well as a detailed description of the dataset,
refer to https://github.com/post2web/data_mining_group_project/blob/master/notebooks/lab1.ipynb
(https://github.com/post2web/data_mining_group_project/blob/master/notebooks/lab1.ipynb).
The function output_variables_table lists whether each variable is nominal, ordinal, interval, or ratio for further use in classification or regression. The functions per_class_accuracy and confusion_matrix build the confusion table of correctly and incorrectly identified classification predictions. The function plot_class_acc plots the per-class classification accuracies. The function plot_feature_importance plots the feature importances reported by a tree ensemble. The function print_accuracy prints the accuracy scores of the classification models. The function get_dataset_subset obtains a subset of the full dataset for modeling and prediction.
We will be using a seed of 0. Because our dataset is extremely large, we use 5 cross-validation folds so that CPU usage and runtime remain manageable when running the prediction models for both classification and regression.
In [1]: %matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML
from sklearn.model_selection import train_test_split
from sklearn import metrics as mt
# classification imports
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, KFold, cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA, SparsePCA
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import mean_squared_error, r2_score
# regression imports
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from scipy.ndimage import imread
import warnings
warnings.filterwarnings("ignore")
def output_variables_table(variables):
    variables = variables.sort_index()
    rows = ['<tr><th>Variable</th><th>Type</th><th>Scale</th><th>Description</th></tr>']
    for vname, atts in variables.iterrows():
        if vname not in dataset.columns:
            continue
        atts = atts.to_dict()
        # fill in the scale if it is still marked TBD
        if atts['scale'] == 'TBD':
            if atts['type'] in ['nominal', 'ordinal']:
                uniques = dataset[vname].unique()
                uniques = list(uniques.astype(str))
                if len(uniques) < 10:
                    atts['scale'] = '[%s]' % ', '.join(uniques)
                else:
                    # show the first five values and how many more there are
                    atts['scale'] = '[%s]' % (', '.join(uniques[:5]) + ', ... (%d More)' % (len(uniques) - 5))
            if atts['type'] in ['ratio', 'interval']:
                atts['scale'] = '(%d, %d)' % (dataset[vname].min(), dataset[vname].max())
        row = (vname, atts['type'], atts['scale'], atts['description'])
        rows.append('<tr><td>%s</td><td>%s</td><td>%s</td><td>%s</td></tr>' % row)
    return HTML('<table>%s</table>' % ''.join(rows))
# Define an accuracy plot
def per_class_accuracy(ytrue, yhat):
    conf = mt.confusion_matrix(ytrue, yhat)
    norm_conf = conf.astype('float') / conf.sum(axis=1)[:, np.newaxis]
    return np.diag(norm_conf)

def plot_class_acc(ytrue, yhat, classes, title=''):
    acc_list = per_class_accuracy(ytrue, yhat)
    pd.DataFrame(acc_list, index=pd.Index(classes, name='Classes')).plot(kind='bar')
    plt.xlabel('Class value (one per face)')
    plt.ylabel('Accuracy within class')
    plt.title(title + ", Total Acc=%.1f" % (100 * mt.accuracy_score(ytrue, yhat)))
    plt.grid()
    plt.ylim([0, 1])
    plt.show()

# Plot the feature importances of the forest
def plot_feature_importance(ytrue, yhat, rt, title=''):
    importances = rt.feature_importances_
    std = np.std([tree.feature_importances_ for tree in rt.estimators_], axis=0)
    indices = np.argsort(importances)[::-1]
    for f in range(X.shape[1]):
        print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
    plt.figure()
    plt.title("Feature importances")
    plt.bar(range(X.shape[1]), importances[indices],
            color="r", yerr=std[indices], align="center")
    plt.xticks(range(X.shape[1]), indices)
    plt.xlim([-1, X.shape[1]])
    plt.show()

def print_accuracy(model_name, y_test, yhat, scores):
    scores = np.array(scores)
    print('----------------- %s Evaluation -----------------' % model_name)
    print(" F1 Score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
    print(' Accuracy', mt.accuracy_score(y_test, yhat))
    print(' Precision', mt.precision_score(y_test, yhat, average='weighted'))
    print(' Recall', mt.recall_score(y_test, yhat, average='weighted'))

def confusion_matrix(ytrue, yhat, classes):
    index = pd.MultiIndex.from_product([['True Class'], classes])
    columns = pd.MultiIndex.from_product([['Predicted Class'], classes])
    return pd.DataFrame(mt.confusion_matrix(ytrue, yhat), index=index, columns=columns)

def roc_curve(ytrue, yhat, clf):
    for i, label in enumerate(clf.classes_):
        fpr, tpr, _ = mt.roc_curve(ytrue, yhat_score[:, i], pos_label=label)
        roc_auc = mt.auc(fpr, tpr)
        plt.plot(fpr, tpr, label='class {0} with {1} instances (area = {2:0.2f})'
                 ''.format(label, sum(ytrue == label), roc_auc))
    plt.title('ROC Curve')
    plt.legend(loc="lower right")
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.show()

def get_dataset_subset(dataset, n=1000):
    return {
        'X': dataset['X'].iloc[:n],
        'y': dataset['y'].iloc[:n]
    }

seed = 0
n_splits = 5

Define and Prepare Class Variables
10 points
Description:
Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis.
⏫ Back to Top
Classification Datasets:
The classification dataset removes logerror and transactiondate because they were created for the Kaggle competition and are not complete for the training set. The columns created as "New Features" in Lab 1 (city and price_per_sqft) were also removed so that only original data is used in the prediction process. The table generated below shows the type of each variable used for classification.
The dataset has 58380 rows and 1757 columns. All variables and details about the variables are printed in the table below.
The target class "regionidcounty" has three possible values: 1286, 2061, or 3101, representing three different county codes. The distribution is skewed: code 1286 has 17749 observations, 3101 has 35563, and 2061 has only 5068.
⏫ Back to Top
In [5]:
Dataset shape: (58380, 1757)
regionidcounty
1286 17749
2061 5068
3101 35563
Name: regionidcounty, dtype: int64
variables = pd.read_csv('../../datasets/variables.csv').set_index('name')
dataset = pd.read_csv('../../datasets/train.csv', low_memory=False)

# remove unneeded variables
del dataset['Unnamed: 0']
del dataset['logerror']
del dataset['transactiondate']
del dataset['city']
del dataset['price_per_sqft']

# delete all location information because we want to predict the county
# and those features would give it away too easily
y = dataset['regionidcounty'].copy()
del dataset['regionidcounty']
del dataset['regionidcity']
del dataset['regionidzip']
del dataset['regionidneighborhood']
del dataset['rawcensustractandblock']
del dataset['latitude']
del dataset['longitude']

output_variables = output_variables_table(variables)

nominal = variables[variables['type'].isin(['nominal'])]
nominal = nominal[nominal.index.isin(dataset.columns)]
continuous = variables[~variables['type'].isin(['nominal'])]
continuous = continuous[continuous.index.isin(dataset.columns)]

nominal_data = dataset[nominal.index]
nominal_data = pd.get_dummies(nominal_data, drop_first=True)
nominal_data = nominal_data[nominal_data.columns[~nominal_data.columns.isin(nominal.index)]]
continuous_data = dataset[continuous.index]

dataset = pd.concat([continuous_data, nominal_data], axis=1)
columns = dataset.columns
variables = variables[variables.index.isin(dataset.columns)]

# shuffle the dataset (just in case)
X = dataset.sample(frac=1, random_state=seed)

dataset_class = {
    'X': X,
    'y': y
}

print('Dataset shape:', X.shape)
print(y.groupby(y).size())
output_variables
Out[5]:
| Variable | Type | Scale | Description |
| --- | --- | --- | --- |
| airconditioningtypeid | nominal | [0, 1, 13, 5, 11, 3, 9] | Type of cooling system present in the home (if any) |
| assessmentyear | interval | (2015, 2015) | The year of the property tax assessment |
| bathroomcnt | ordinal | [1.0, 3.5, 2.5, 3.0, 2.0, ... (22 More)] | Number of bathrooms in home including fractional bathrooms |
| bedroomcnt | ordinal | [1, 5, 4, 3, 2, ... (16 More)] | Number of bedrooms in home |
| buildingqualitytypeid | ordinal | [7, 4, 1, 10, 12, 8] | Overall assessment of condition of the building from best (lowest) to worst (highest) |
| calculatedbathnbr | ordinal | [1.0, 3.5, 2.5, 3.0, 2.0, ... (22 More)] | Number of bathrooms in home including fractional bathroom |
| calculatedfinishedsquarefeet | ratio | (0, 10925) | Calculated total finished living area of the home |
| censustractandblock | nominal | [60372040024100.0, 60590991081500.0, 60374078455800.0, 61110052978700.0, 60379010957300.0, ... (445 More)] | Census tract and block ID combined - contains blockgroup assignment by extension |
| finishedsquarefeet12 | ratio | (0, 6615) | Finished living area |
| finishedsquarefeet50 | ratio | (0, 8352) | Size of the finished living area on the first (entry) floor of the home |
| fips | nominal | [6037, 6059, 6111] | Federal Information Processing Standard code - see https://en.wikipedia.org/wiki/FIPS_county_code for more details |
| fireplacecnt | ordinal | [0, 1, 2, 3, 5, 4] | Number of fireplaces in a home (if any) |
| fullbathcnt | ordinal | [1.0, 3.0, 2.0, 6.0, 4.0, ... (17 More)] | Number of full bathrooms (sink, shower + bathtub, and toilet) present in home |
| garagecarcnt | ordinal | [0.0, 2.0, 1.0, 4.0, 3.0, ... (14 More)] | Total number of garages on the lot including an attached garage |
| garagetotalsqft | ratio | (0, 1610) | Total number of square feet of all garages on lot including an attached garage |
| hashottuborspa | ordinal | [0, 1] | Does the home have a hot tub or spa |
| heatingorsystemtypeid | nominal | [7, 0, 2, 6, 24, ... (12 More)] | Type of home heating system |
| landtaxvaluedollarcnt | ratio | (22, 2477536) | The assessed value of the land area of the parcel |
| location_type | nominal | [PRIMARY, nan, NOT ACCEPTABLE, ACCEPTABLE] | Primary, Acceptable, Not Acceptable |
| lotsizesquarefeet | ratio | (0, 1710750) | Area of the lot in square feet |
| numberofstories | ordinal | [1, 2, 3, 4] | Number of stories or levels the home has |
| parcelid | nominal | [11800329, 14058566, 14636635, 17138404, 11270723, ... (49678 More)] | Unique identifier for parcels (lots) |
| poolcnt | ordinal | [0.0, 1.0] | Number of pools on the lot (if any) |
| poolsizesum | ratio | (0, 1476) | Total square footage of all pools on property |
| pooltypeid10 | nominal | [0, 1] | Spa or Hot Tub |
| pooltypeid2 | nominal | [0, 1] | Pool with Spa/Hot Tub |
| pooltypeid7 | nominal | [0, 1] | Pool without hot tub |
| propertycountylandusecode | nominal | [0100, 122, 1, 1111, 010C, ... (71 More)] | County land use code i.e. it's zoning at the county level |
| propertylandusetypeid | nominal | [261, 266, 246, 265, 269, ... (13 More)] | Type of land use the property is zoned for |
| propertyzoningdesc | nominal | [LAR2, 0, LRRA7000*, TOPR-MD, LCA11*, ... (1655 More)] | Description of the allowed land uses (zoning) for that property |
| roomcnt | ordinal | [0, 9, 8, 4, 7, ... (16 More)] | Total number of rooms in the principal residence |
| structuretaxvaluedollarcnt | ratio | (100, 2181198) | The assessed value of the built structure on the parcel |
| taxamount | ratio | (49, 51292) | The total property tax assessed for the assessment year |
| taxdelinquencyflag | nominal | [0, 1] | Property taxes for this parcel are past due as of 2015 |
| taxdelinquencyyear | interval | (0, 26) | Year |
| taxvaluedollarcnt | ratio | (22, 4052186) | The total tax assessed value of the parcel |
| threequarterbathnbr | ordinal | [0, 1, 2, 3, 4] | Number of 3/4 bathrooms in house (shower + sink + toilet) |
| unitcnt | ordinal | [1, 2, 3, 4, 9, 6] | Number of units the structure is built into (i.e. 2 = duplex, 3 = triplex, etc...) |
| yardbuildingsqft17 | interval | (0, 1485) | Patio in yard |
| yardbuildingsqft26 | interval | (0, 1366) | Storage shed/building in yard |
| yearbuilt | interval | (1885, 2015) | The Year the principal residence was built |
| zipcode_type | nominal | [STANDARD, nan, PO BOX, MILITARY, UNIQUE] | Standard, PO BOX Only, Unique, Military (implies APO or FPO) |

Regression Datasets:
The regression dataset removes logerror and transactiondate because they were created for the Kaggle competition and are not complete for the training set. The columns created as "New Features" in Lab 1 (city and price_per_sqft) were also removed so that only original data is used in the prediction process. We are only using nominal and continuous data types for regression purposes.
The dataset has 58380 rows and 1758 columns. All variables and details about the variables are printed in the table below.
⏫ Back to Top
In [13]:
Dataset shape: (58380, 1758)
dataset = pd.read_csv('../../datasets/train.csv', low_memory=False)
variables = pd.read_csv('../../datasets/variables.csv').set_index('name')

# remove unneeded variables
del dataset['logerror']
del dataset['transactiondate']
del dataset['city']
del dataset['price_per_sqft']

output_variables = output_variables_table(variables)

nominal = variables[variables['type'].isin(['nominal'])]
nominal = nominal[nominal.index.isin(dataset.columns)]
continuous = variables[~variables['type'].isin(['nominal'])]
continuous = continuous[continuous.index.isin(dataset.columns)]

nominal_data = dataset[nominal.index]
nominal_data = pd.get_dummies(nominal_data, drop_first=True)
nominal_data = nominal_data[nominal_data.columns[~nominal_data.columns.isin(nominal.index)]]
continuous_data = dataset[continuous.index]

dataset = pd.concat([continuous_data, nominal_data], axis=1)
columns = dataset.columns
variables = variables[variables.index.isin(dataset.columns)]

# shuffle the dataset (just in case)
X = dataset.sample(frac=1, random_state=seed)

y = X['taxamount'].copy()
del X['taxamount']

dataset_reg = {
    'X': X,
    'y': y
}

print('Dataset shape:', X.shape)
plt.title('Distribution of the target variable: taxamount')
y.plot(kind='box')
output_variables
Describe the Final Dataset
5 points
Description:
Describe the final dataset that is used for classification/regression (include a description of any
newly formed variables you created).
⏫ Back to Top
Classification Datasets:
⏫ Back to Top
Since we are using the same Zillow dataset that we used in the previous lab, most of the data was
already cleaned up. However, the purpose of our classification dataset is to predict the county each
property is located in. Therefore our final model removed all columns relating to location such as
latitude, longitude, city, and zipcode. We also removed variables we did not need such as
logerror, transactiondate, and price_per_sqft.
We did not create any new columns for the classification dataset but we did transform the
categorical variables into indicator variables.
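As a tiny illustration of that transformation, the sketch below (using fips, one of the nominal Zillow columns, with made-up rows) shows what pd.get_dummies with drop_first=True produces:

```python
import pandas as pd

# Tiny illustration of the indicator-variable transformation used above.
# fips is one of the nominal Zillow columns; the rows here are made up.
demo = pd.DataFrame({'fips': [6037, 6059, 6111, 6037]})
# drop_first=True removes one redundant indicator level per variable
print(pd.get_dummies(demo, columns=['fips'], drop_first=True))
```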
The final shape of our classification dataset is 58380 instances and 1757 columns. The three counties we are trying to predict have sizes of about 18k, 5k, and 36k, so any model with an accuracy below about 0.61 performs no better than simply predicting the largest county for every property.
Regression Datasets:
The regression dataset removes logerror and transactiondate because they were for the purposes
of the Kaggle competition and were not complete for the training set. The column that was created
for "New Features" from Lab 1 (city and price_per_sqft) were also removed for the sake of
simplicity of only using original data for the prediction process.
We are only using nominal and continuous data types for regression purposes. The final shape of our regression dataset is 58380 instances and 1758 columns. The variable that we are predicting, taxamount, is right skewed, with outlier properties taxed at amounts more than a standard deviation above the mean.
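As a quick check of that skew, here is a minimal sketch, assuming dataset_reg has been built as in the cell above, that prints the skewness and upper quantiles of taxamount:

```python
# Minimal sketch: quantify the right skew of the regression target.
# Assumes dataset_reg was built as in the cell above.
y = dataset_reg['y']
print('skewness:', y.skew())  # positive values indicate a right skew
print(y.describe(percentiles=[0.25, 0.5, 0.75, 0.99]))
# share of properties taxed more than one standard deviation above the mean
print('share above mean + 1 std:', (y > y.mean() + y.std()).mean())
```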
⏫ Back to Top
Out[13]: (variable table for the regression dataset, in the same format as the classification variable table above)
Explain Evaluation Metrics
10 points
Description:
Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-
measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the
results of your modeling? Give a detailed explanation backing up any assertions.
⏫ Back to Top
Classification Metrics:
⏫ Back to Top
Because our class distribution is very skewed, we will optimize the models based on the F1 score.
For our evaluation, we will take into account both accuracy and the F-measure. Computing the F-measure requires precision and recall; because the F-measure combines the two, a better F-measure means the model has a better balance of precision and recall.
Accuracy is the ratio of correct predictions to the total number of observations. It is calculated as: (TP+TN) / (TP+FP+FN+TN). The closer accuracy is to 1, the more accurate the model is, with one caveat: for high accuracy to be a reliable indicator, the class distribution has to be reasonably balanced. Otherwise a model that always predicts the majority class can achieve high accuracy, and we need to review other metrics as well.
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. It is calculated as: TP / (TP+FP).
Recall is the ratio of correctly predicted positive observations to all actual positives. It is calculated as TP / (TP+FN). The consequences of type 2 errors (false negatives) are not extreme here, so we think recall is an appropriate measure of completeness.
Finally, we will also use the F-measure, which is the harmonic mean of precision and recall combined into one statistic. The F-measure is a number between 0 and 1, where closer to 1 is better and approaching 0 is worse. It overcomes the limitations of accuracy when the classes are imbalanced.
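As a concrete illustration of how these quantities are computed, here is a minimal sketch using hypothetical county labels (not our model's output) and the same sklearn calls we use later:

```python
from sklearn import metrics as mt

# Hypothetical true and predicted county labels, only to show the metric calls.
y_true = [3101, 3101, 3101, 1286, 1286, 2061]
y_pred = [3101, 3101, 1286, 1286, 2061, 2061]

print('Accuracy ', mt.accuracy_score(y_true, y_pred))
print('Precision', mt.precision_score(y_true, y_pred, average='weighted'))
print('Recall   ', mt.recall_score(y_true, y_pred, average='weighted'))
# F1 is the harmonic mean of precision and recall, computed per class
# and then weighted by each class's support.
print('F1       ', mt.f1_score(y_true, y_pred, average='weighted'))
```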
Regression Metrics:
For our evaluation of the regression prediction models, we look at mean squared error (MSE) and R^2. Given the large dataset and the right skew of taxamount, we are trying to minimize MSE and obtain an R^2 value close to 1. The model whose optimal parameters reduce MSE and increase R^2 while using less CPU and runtime is the best regression model for this dataset.
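For reference, a minimal sketch of both metrics on hypothetical tax amounts (not our model's output):

```python
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted tax amounts, only to show the metric calls.
y_true = [3000.0, 4500.0, 5200.0, 12000.0]
y_pred = [3200.0, 4300.0, 5000.0, 9000.0]

print('MSE:', mean_squared_error(y_true, y_pred))  # mean of squared residuals
print('R^2:', r2_score(y_true, y_pred))            # 1 is perfect, 0 is no better than predicting the mean
```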
⏫ Back to Top
Training and Testing Splits
10 points
Description:
Choose the method you will use for dividing your data into training and testing splits (i.e., are you
using Stratified 10-fold cross validation? Why?). Explain why your chosen method is appropriate or
use more than one method as appropriate. For example, if you are using time series data then you
should be using continuous training and testing sets across time.
⏫ Back to Top
Because our dataset is extremely large, we use 5 folds so that CPU usage and runtime remain manageable when running the prediction models for both classification and regression. Our data is not a time series, so we did not need to train and test across time with a moving window.
Classification Splits:
⏫ Back to Top
For the classification task we chose Stratified K-Fold cross validation with 5 folds. We chose stratified folds in order to preserve the percentage of samples in each class. Our dataset is also very large, so splitting into more than 5 folds would have been computationally expensive without a large enough return in value. We felt that 5 folds would be enough splits to reduce the weight of any outliers or noise.
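A minimal sketch of why stratification matters here, assuming dataset_class, seed, and n_splits are defined as above: each stratified test fold should show roughly the same county proportions as the full dataset.

```python
from sklearn.model_selection import StratifiedKFold

# Assumes dataset_class, seed, and n_splits were defined in the cells above.
X = dataset_class['X']
y = dataset_class['y']

cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
print('full dataset:', y.value_counts(normalize=True).round(3).to_dict())
for i, (train_index, test_index) in enumerate(cv.split(X, y)):
    fold_props = y.iloc[test_index].value_counts(normalize=True).round(3)
    print('fold %d test split:' % i, fold_props.to_dict())
```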
Regression Splits:
For the regression task we chose K-Fold cross validation with 5 folds. Since the target taxamount is continuous, there are no classes to stratify on, so plain K-Fold is appropriate. As with classification, the dataset is very large, so splitting into more than 5 folds would have been computationally expensive without a large enough return in value, and 5 folds are enough splits to reduce the weight of any outliers or noise.
⏫ Back to Top
Three Different Classification/Regression
Models
20 points
Description:
Create three different classification/regression models for each task (e.g., random forest, KNN, and
SVM for task one and the same or different algorithms for task two). Two modeling techniques must
be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to
increase generalization performance using your chosen metric. You must investigate different
parameters of the algorithms!
⏫ Back to Top
Classification Models:
⏫ Back to Top
Definition and optimization of K Nearest Neighbors (KD Tree)
K Nearest Neighbors predicts the class of each test observation from the classes of its n_neighbors closest training observations, and we use those predictions as the y_hat for the test set. Optimization results for different values of "n_neighbors" are printed below.
When n_neighbors is 12, the f1_score is highest at 0.492.
⏫ Back to Top
In [5]:
Definition and optimization of Random Forest
Random forest predicts values by training an ensemble of decision trees grown to a certain max depth, forming classification models that predict the y_hat for the test set. Optimization results for different test values of "max_depth" are printed below.
By max_depth of 351, the f1_score has plateaued at about 0.4906.
⏫ Back to Top
n_neighbors: 2 , f1_score: 0.470579117819
n_neighbors: 7 , f1_score: 0.491355550804
n_neighbors: 12 , f1_score: 0.492699617518
n_neighbors: 17 , f1_score: 0.492039256254
n_neighbors: 22 , f1_score: 0.490571365928
n_neighbors: 27 , f1_score: 0.488490047836
X = dataset_class['X']
y = dataset_class['y']

result = []
scores = []
for n_neighbors in range(2, 30)[::5]:
    yhat = np.zeros(y.shape)  # we will fill this with predictions
    cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
    for train_index, test_index in cv.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        # in order to reduce the time for training KNeighborsClassifier
        # we reduce the dimensions of the data to 100 with PCA and use a kd_tree
        pca = PCA(n_components=100, random_state=seed)
        pca.fit(X_train)
        X_train = pca.transform(X_train)
        X_test = pca.transform(X_test)
        clf = KNeighborsClassifier(n_neighbors=n_neighbors, algorithm='kd_tree', weights='distance')
        clf.fit(X_train, y_train)
        yhat[test_index] = clf.predict(X_test)
    f1_score = mt.f1_score(y, yhat, average='weighted')
    print('n_neighbors:', n_neighbors, ', f1_score:', f1_score)
In [6]:
Definition and optimization of Naive Bayes (Gaussian)
Gaussian Naive Bayes has no hyper-parameters to optimize; it uses maximum likelihood estimates of the class-conditional distributions to classify the test set. We simply show its F1 score with a confidence interval here and inspect this model in more detail in the next section.
max_depth: 1 F1 score: 0.46120951624
max_depth: 51 F1 score: 0.465099779665
max_depth: 101 F1 score: 0.481927336387
max_depth: 151 F1 score: 0.489103832227
max_depth: 201 F1 score: 0.490156681956
max_depth: 251 F1 score: 0.49059336761
max_depth: 301 F1 score: 0.490213068185
max_depth: 351 F1 score: 0.490623495801
X = dataset_class['X']
y = dataset_class['y']

result = []
index = []
for max_depth in range(1, 401)[::50]:
    yhat = np.zeros(y.shape, dtype=int)  # we will fill this with predictions
    cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
    for train_index, test_index in cv.split(X, y):
        # n_estimators was truncated in the original output; 10 is assumed here
        clf = RandomForestClassifier(max_depth=max_depth, random_state=seed, n_estimators=10)
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        clf.fit(X_train, y_train)
        yhat[test_index] = clf.predict(X_test)
    f1_score = mt.f1_score(y, yhat, average='weighted')
    print('max_depth:', max_depth, 'F1 score:', f1_score)
    result.append(f1_score)
    index.append(max_depth)

plt.title('F1 score for different max_depth')
pd.Series(result, index=pd.Index(index, name='max_depth'), name='f1_score').plot()
Naive Bayes has no parameters to optimize and has an F1 score of 0.46.
⏫ Back to Top
In [8]:
Definition and optimization of Regression models:
⏫ Back to Top
Definition and optimization of K Nearest Neighbors
K Nearest Neighbors predicts the target value of each test observation from the values of its n_neighbors closest training observations, forming regression models that predict the y_hat for the test set. In order to fit the model in a reasonable amount of time, we shrank the dataset to 100 features with PCA. The result of the optimization is printed below. We chose the hyper-parameters with the highest R2 score as the optimal parameters.
When n_neighbors is 11, MSE is at its lowest at 3159735 and R^2 peaks at 0.913.
⏫ Back to Top
F1 score: 0.46 (+/- 0.00)
X = dataset_class['X']
y = dataset_class['y']
yhat = np.zeros(y.shape) # we will fill this with predictions
cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
scores = []
for train_index, test_index in cv.split(X, y):
clf = GaussianNB()
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
clf.fit(X_train, y_train)
yhat[test_index] = clf.predict(X_test)
f1_score = mt.f1_score(y_test, clf.predict(X_test), average='weighted')
scores.append(f1_score)
scores = np.array(scores)
print("F1 score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
In [17]:
Definition and optimization of Random Forest
Random forest predicts values by training an ensemble of decision trees grown to a certain max depth, forming regression models that predict the y_hat for the test set. The result of the optimization is printed below. We chose the hyper-parameters with the highest R2 score as the optimal parameters.
When max_depth is 26, MSE is at its lowest at 2684615 and R^2 peaks at 0.9257.
⏫ Back to Top
n_neighbors: 1, MSE: 4097283, R^2: 0.887
n_neighbors: 6, MSE: 3169526, R^2: 0.912
n_neighbors: 11, MSE: 3159735, R^2: 0.913
n_neighbors: 16, MSE: 3223971, R^2: 0.911
n_neighbors: 21, MSE: 3236105, R^2: 0.910
X = dataset_reg['X']
y = dataset_reg['y']

yhat = np.zeros(y.shape)
cv = KFold(n_splits=n_splits, random_state=seed)
for n_neighbors in range(1, 22)[::5]:
    for train_index, test_index in cv.split(X, y):
        clf = KNeighborsRegressor(n_neighbors=n_neighbors)
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        pca = PCA(n_components=100, random_state=seed)
        pca.fit(X_train)
        X_train = pca.transform(X_train)
        X_test = pca.transform(X_test)
        clf.fit(X_train, y_train)
        yhat[test_index] = clf.predict(X_test)
    print("n_neighbors: %.f, MSE: %.f, R^2: %0.3f" % (n_neighbors, mean_squared_error(y, yhat), r2_score(y, yhat)))
In [21]:
Definition and optimization of Gaussian Regression
Gaussian Regression (GaussianProcessRegressor) models the target as a Gaussian process over the normally distributed features, where the noise parameter alpha can be optimized, forming regression models that predict the y_hat for the test set. The result of the optimization is printed below. We chose the hyper-parameters with the lowest MSE as the optimal parameters.
When alpha is 1e-15, MSE is at its lowest at 35488348 and R^2 peaks at 0.0177.
⏫ Back to Top
max_depth: 1, MSE: 15160579, R^2: 0.5804
max_depth: 6, MSE: 3376013, R^2: 0.9066
max_depth: 11, MSE: 2848552, R^2: 0.9212
max_depth: 16, MSE: 2715775, R^2: 0.9248
max_depth: 21, MSE: 2712884, R^2: 0.9249
max_depth: 26, MSE: 2684615, R^2: 0.9257
max_depth: 31, MSE: 2708345, R^2: 0.9250
max_depth: 36, MSE: 2689061, R^2: 0.9256
max_depth: 41, MSE: 2708431, R^2: 0.9250
X = dataset_reg['X']
y = dataset_reg['y']

yhat = np.zeros(y.shape)
cv = KFold(n_splits=n_splits, random_state=seed)
for max_depth in range(1, 42)[::5]:
    for train_index, test_index in cv.split(X, y):
        clf = RandomForestRegressor(max_depth=max_depth, n_estimators=5, random_state=seed)
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        clf.fit(X_train, y_train)
        yhat[test_index] = clf.predict(X_test)
    print("max_depth: %.f, MSE: %.f, R^2: %0.4f" % (max_depth, mean_squared_error(y, yhat), r2_score(y, yhat)))
In [6]:
Visualizations of Results and Analysis
10 points
Description:
Analyze the results using your chosen method of evaluation. Use visualizations of the results to
bolster the analysis. Explain any visuals and analyze why they are interesting to someone that
might use this model.
⏫ Back to Top
Analysis of Classification model:
For our visualizations for each classification model, we display a bar graph of the count of each
class that was predicted. If the model is good, this should easily show a very small amount for the
center bar (~5%) and a majority in the last bar (61% of values).
The other visualization we display is an ROC curve, which plots the false positive rate on the x axis and the true positive rate on the y axis. In an ROC plot, the area under the curve (AUC) summarizes performance, so we can quickly determine which model's curves sit furthest above the y = x line.
⏫ Back to Top
alpha: 0.000000, MSE: 35488348, R^2: 0.0177
alpha: 0.000250, MSE: 35488410, R^2: 0.0177
alpha: 0.000500, MSE: 35488472, R^2: 0.0177
alpha: 0.000750, MSE: 35488534, R^2: 0.0177
alpha: 0.001000, MSE: 35488596, R^2: 0.0177
X = dataset_reg['X']
y = dataset_reg['y']

yhat = np.zeros(y.shape)
cv = KFold(n_splits=n_splits, random_state=seed)
for alpha in np.linspace(1e-15, 0.001, 5):
    for train_index, test_index in cv.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        # have to train on a subset of the training data because it otherwise requires too much memory
        X_train = X_train.iloc[:2000]
        y_train = y_train.iloc[:2000]
        clf = GaussianProcessRegressor(normalize_y=True, alpha=alpha, random_state=seed)
        clf.fit(X_train, y_train)
        yhat[test_index] = clf.predict(X_test)
    print("alpha: %f, MSE: %.f, R^2: %0.4f" % (alpha, mean_squared_error(y, yhat), r2_score(y, yhat)))
Results and Analysis of a Dummy model
This model only predicts the most frequent class and is used as a baseline to compare the other models against. Because the classes are imbalanced, this dummy model that always predicts county 3101 achieves higher raw accuracy than some of the classification methods.
In [43]:
Results and Analysis of K Nearest Neighbors (KD Tree)
All metrics and analysis of the optimized K Nearest Neighbors (KD Tree) are printed below.
For K Neighbors Classifier when n_neighbors is 12, the F1 Score is 0.49 (+/- 0.01), Accuracy is
0.5422, Precision is 0.4709, and Recall is 0.5422.
⏫ Back to Top
----------------- Dummy Evaluation -----------------
F1 Score: 0.46 (+/- 0.00)
Accuracy 0.609164097294
Precision 0.371080897432
Recall 0.609164097294
X = dataset_class['X']
y = dataset_class['y']
f1_score = mt.f1_score(y, [3101] * len(y), average='weighted')
print_accuracy('Dummy', y, [3101] * len(y), [f1_score])
In [12]:
----------------- KD Tree Classifier Evaluation -----------------
F1 Score: 0.49 (+/- 0.01)
Accuracy 0.54224049332
Precision 0.470945268342
Recall 0.54224049332
X = dataset_class['X']
y = dataset_class['y']

scores = []
yhat = np.zeros(y.shape)
yhat_score = np.zeros((len(y), 3))
cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = KNeighborsClassifier(n_neighbors=12, algorithm='kd_tree', weights='distance')
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    pca = PCA(n_components=100, random_state=seed)
    pca.fit(X_train)
    X_train = pca.transform(X_train)
    X_test = pca.transform(X_test)
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    yhat_score[test_index] = clf.predict_proba(X_test)
    f1_score = mt.f1_score(y_test, clf.predict(X_test), average='weighted')
    scores.append(f1_score)

print_accuracy('KD Tree Classifier', y, yhat, scores)
plot_class_acc(y, yhat, clf.classes_, title="KD Tree Classifier")
confusion_matrix(y, yhat, clf.classes_)
roc_curve(y, yhat, clf)
Results and Analysis of Random Forest
All metrics and analysis of the optimized Random Forest Classifier are printed below.
For Random Forest Classifier when max_depth is 250 and n_estimators is 40, the F1 Score is 0.49
(+/- 0.00), Accuracy is 0.5596, Precision is 0.4708, and Recall is 0.5596.
⏫ Back to Top
In [13]:
----------------- Random Forest Classifier Evaluation -----------------
F1 Score: 0.49 (+/- 0.00)
Accuracy 0.559643713601
Precision 0.470878065706
Recall 0.559643713601
X = dataset_class['X']
y = dataset_class['y']

scores = []
yhat = np.zeros(y.shape)
yhat_score = np.zeros((len(y), 3))
cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = RandomForestClassifier(random_state=seed, max_depth=250, n_estimators=40)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    yhat_score[test_index] = clf.predict_proba(X_test)
    f1_score = mt.f1_score(y_test, clf.predict(X_test), average='weighted')
    scores.append(f1_score)

print_accuracy('Random Forest Classifier', y, yhat, scores)
plot_class_acc(y, yhat, clf.classes_, title="Random Forest Classifier")
confusion_matrix(y, yhat, clf.classes_)
roc_curve(y, yhat, clf)
Results and Analysis Naive Bayes
All metrics and analysis of the Naive Bayes Classifier are printed below.
For Naive Bayes Classifier, the F1 Score is 0.46 (+/- 0.00), Accuracy is 0.6047, Precision is 0.4596,
and Recall is 0.6047.
⏫ Back to Top
In [14]:
----------------- GaussianNB Classifier Evaluation -----------------
F1 Score: 0.46 (+/- 0.00)
Accuracy 0.604727646454
Precision 0.459602898787
Recall 0.604727646454
from sklearn.naive_bayes import GaussianNB

X = dataset_class['X']
y = dataset_class['y']

scores = []
yhat = np.zeros(y.shape)
yhat_score = np.zeros((len(y), 3))
cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = GaussianNB()
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    yhat_score[test_index] = clf.predict_proba(X_test)
    f1_score = mt.f1_score(y_test, clf.predict(X_test), average='weighted')
    scores.append(f1_score)

print_accuracy('GaussianNB Classifier', y, yhat, scores)
plot_class_acc(y, yhat, clf.classes_, title="GaussianNB Classifier")
confusion_matrix(y, yhat, clf.classes_)
roc_curve(y, yhat, clf)
Analysis of Regression models:
⏫ Back to Top
Results and Analysis K Nearest Neighbors
The evaluation metrics for the optimized model are printed below.
After PCA, K Nearest Neighbors Regression with n_neighbors=16 has an MSE of 3223970 (+/- 395957) and an R^2 of 0.91 (+/- 0.01).
⏫ Back to Top
In [16]:
Results and Analysis Random Forest
The evaluation metrics for the optimized model are printed below.
Random Forest Regression with max_depth of 26 and n_estimators of 5 has an MSE of 2684614 (+/- 514568) and an R^2 of 0.93 (+/- 0.02).
⏫ Back to Top
Evaluation metrics:
MSE: 3223970.51 (+/- 395957.39)
R2: 0.91 (+/- 0.01)
X = dataset_reg['X']
y = dataset_reg['y']

mses = []
r2s = []
yhat = np.zeros(y.shape)  # we will fill this with predictions
cv = KFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    pca = PCA(n_components=100, random_state=seed)
    pca.fit(X_train)
    X_train = pca.transform(X_train)
    X_test = pca.transform(X_test)
    clf = KNeighborsRegressor(n_neighbors=16)
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    mses.append(mean_squared_error(y_test, clf.predict(X_test)))
    r2s.append(r2_score(y_test, clf.predict(X_test)))

print('Evaluation metrics:')
print("MSE: %0.2f (+/- %0.2f)" % (np.mean(mses), np.std(mses) * 2))
print("R2: %0.2f (+/- %0.2f)" % (np.mean(r2s), np.std(r2s) * 2))
In [4]:
Residuals distribution plot
The plot shows the residuals for predicting the target variable "taxamount".
In [5]:
Results and Analysis Gaussian Regression
Evaluation metrics:
MSE: 2684614.88 (+/- 514568.63)
R2: 0.93 (+/- 0.02)
X = dataset_reg['X']
y = dataset_reg['y']

mses = []
r2s = []
yhat = np.zeros(y.shape)  # we will fill this with predictions
cv = KFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = RandomForestRegressor(max_depth=26, n_estimators=5, random_state=seed)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    mses.append(mean_squared_error(y_test, clf.predict(X_test)))
    r2s.append(r2_score(y_test, clf.predict(X_test)))

print('Evaluation metrics:')
print("MSE: %0.2f (+/- %0.2f)" % (np.mean(mses), np.std(mses) * 2))
print("R2: %0.2f (+/- %0.2f)" % (np.mean(r2s), np.std(r2s) * 2))

f, ax = plt.subplots(nrows=1, ncols=2, figsize=[15, 7])
residuals = yhat - y
sns.distplot(residuals, ax=ax[0])
sns.boxplot(data=residuals, ax=ax[1]);
The evaluation metrics for the optimized model are printed below.
Gaussian Regression with alpha of 1e-15 and normalize_y set to True has an MSE of 35917993 (+/- 1636329) and an R^2 of 0.01 (+/- 0.00).
⏫ Back to Top
In [18]:
Advantages of Each Model
10 points
Description:
Discuss the advantages of each model for each classification task, if any. If there are not
advantages, explain why. Is any model better than another? Is the difference significant with 95%
confidence? Use proper statistical comparison methods. You must use statistical comparison
techniques—be sure they are appropriate for your chosen method of validation as discussed in unit
7 of the course.
⏫ Back to Top
Advantages of Classification model:
K Nearest Neighbors - KD tree
MSE: 35917993.55 (+/- 1636329.34)
R2: 0.01 (+/- 0.00)
X = dataset_reg['X']
y = dataset_reg['y']

mses = []
r2s = []
yhat = np.zeros(y.shape)  # we will fill this with predictions
cv = KFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = GaussianProcessRegressor(alpha=1e-15, normalize_y=True, random_state=seed)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # we train on a subset because it otherwise requires too much memory
    X_train = X_train.iloc[:1000]
    y_train = y_train.iloc[:1000]
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    mses.append(mean_squared_error(y_test, clf.predict(X_test)))
    r2s.append(r2_score(y_test, clf.predict(X_test)))

print("MSE: %0.2f (+/- %0.2f)" % (np.mean(mses), np.std(mses) * 2))
print("R2: %0.2f (+/- %0.2f)" % (np.mean(r2s), np.std(r2s) * 2))
K nearest neighbors classification is different from other classification models in that it does not fit an explicit model but simply stores the training instances and classifies a new observation by the majority class among its k nearest neighbors. Its advantages are that it is simple, it converges to the correct decision surface as the amount of data goes to infinity, and it handles multiclass problems naturally. In our dataset we used the KD tree algorithm, which speeds up KNN by indexing the training points in a tree so that nearest neighbors can be found without comparing against every observation.
Why we did PCA with KNN
KNN computes distances across every dimension of the data. We have so many dimensions in our dataset that even with 100 neighbors the accuracy was still continuing to grow and each run was taking a long time. We wanted to keep increasing the number of neighbors until the accuracy plateaued, but without reducing the dimensions using PCA it would have taken too long.
Random forest
Random forest is an ensemble classification algorithm, which is by nature a huge advantage: an ensemble of decision trees will usually beat a prediction from a single decision tree. Another advantage is that the forest can often correct an individual tree's overfitting of the training set.
Gaussian Naive Bayes
Naive Bayes' largest advantage is that it is extremely simple: it essentially just counts up probabilities. When training sets are small, Naive Bayes does well because its high bias and low variance keep it from overfitting the training data. However, as datasets grow larger, such as ours, the high bias prevents the model from being powerful enough to reach a high accuracy. Gaussian Naive Bayes is an NB classifier that assumes normally distributed features. Its advantages are that it is fast and can make probabilistic decisions.
Model Comparisons
As stated above, we compare our models based on the F1 and accuracy values. Starting with F1, GaussianNB is statistically significantly lower than the random forest and the KD tree: GaussianNB has an F1 of 0.46 (+/- 0.00) while the other two models have an F1 of 0.49 (+/- 0.01). There is no significant difference in F1 between the KD tree and the random forest.
Next, we compare the accuracy of the KD tree and the random forest. Each was evaluated on the same 58380 instances, with final accuracies of 0.542 and 0.560 respectively. Treating each accuracy as a binomial proportion, the standard errors are sqrt((0.542)(0.458) / 58380) = 0.0021 and sqrt((0.560)(0.440) / 58380) = 0.0021, so the approximate 95% confidence intervals are:
KD Tree accuracy: 0.542 +/- 1.96 * 0.0021 = [0.538, 0.546]
Random forest accuracy: 0.560 +/- 1.96 * 0.0021 = [0.556, 0.564]
The intervals do not overlap, so our final winner is the Random Forest, which is significantly better in F1 than the Gaussian Naive Bayes and significantly better in accuracy than the KD tree.
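A minimal sketch of this interval calculation (the accuracies are taken from the outputs above):

```python
import numpy as np

# Normal-approximation 95% confidence interval for an accuracy estimate.
def accuracy_interval(acc, n, z=1.96):
    se = np.sqrt(acc * (1 - acc) / n)  # standard error of a proportion
    return acc - z * se, acc + z * se

n = 58380  # number of cross-validated predictions
print('KD Tree      ', accuracy_interval(0.542, n))
print('Random Forest', accuracy_interval(0.560, n))
```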
⏫ Back to Top
Advantages of Regression model:
K Nearest Neighbors
The advantages of K nearest neighbors are that it is non-parametric and can capture unusual, non-linear structure in the data when predicting with regression. Dimensionality reduction with PCA can be used to speed up the modeling process, because the neighbor search then runs over far fewer features.
Why we did PCA with KNN
KNN computes distances across every dimension of the data. We have so many dimensions in our dataset that even with 100 neighbors the accuracy was still continuing to grow and each run was taking a long time. We wanted to keep increasing the number of neighbors until the accuracy plateaued, but without reducing the dimensions using PCA it would have taken too long.
Random forest
The advantages of Random forest are that, by averaging multiple trees, it reduces overfitting, reduces the variance caused by outliers, and is therefore more accurate. It gives an unbiased estimate of the generalization error as the forest is built and provides effective methods for estimating missing data. Random forest can also be extended to unlabeled data, leading to unsupervised clustering.
Gaussian Regression
The advantages of Gaussian Regression are that it is fast and uses less CPU and runtime when trained on small subsets, as we did here. However, it is better suited to data with normal distributions. It provides a full probabilistic prediction and interpolates the observations.
Model Comparisons
As stated above, we compare our regression models based on the MSE and R^2 values. Random Forest Regression with a max_depth of 26 and n_estimators of 5 yielded the lowest MSE of 2684614 (+/- 514568) and the highest R^2 of 0.93 (+/- 0.02).
So our final winner is the Random Forest: it is dramatically better than Gaussian Regression on both metrics, and it also beats K Nearest Neighbors, although by a smaller margin since their two-standard-deviation MSE intervals overlap slightly.
⏫ Back to Top
Important Attributes
10 points
Description:
Which attributes from your analysis are most important? Use proper methods discussed in class to
evaluate the importance of different attributes. Discuss the results and hypothesize about why
certain attributes are more important than others for a given classification task.
⏫ Back to Top
Feature importance for classification dataset according to
Random Forest
The top feature was tax amount, with an importance slightly above 0.08. We think this is the most important feature for classifying county because in the state of California, where all of these counties are located, tax rates are set at the county and city level. The next 3 most important features also relate to taxes, each with an importance slightly below 0.08.
The next 3 important features are all related to square footage and year built which we think goes
back to builders and the demographic of the area. Each county could have either one specific
builder for all of their neighborhoods or the builders matched the styles of the homes around them.
The number of bedrooms and bathrooms is probably significant because each county could have
their own demographic of family sizes. If it is close to a larger city, we may see more singles or
couples with fewer numbers of bedrooms and baths and counties farther into suburbia may have
more kids thus more bedrooms and bathrooms.
⏫ Back to Top
In [15]:
Feature importance for regression dataset according to
Random Forest
The top feature for taxamount was tax value dollar count with an importance of just below 0.9. The
next three important features were longitude, latitude, and calculated finished square feet at
significantly lower importance levels (less than 0.1). The tax value is set at the time of the
assessment and tax value dollar count is calculated from the actual taxes and the assessed taxes.
So the tax value to the dollar is important to the tax amount because we would assume these
should be fairly similar.
The longitude and latitude could be of higher importance because in California tax rates are set at
county and city levels so this could vary based on location. The calculated square feet is also
important because a higher square footage could mean a larger home or mansion which is more
likely to have higher tax value than a smaller home.
⏫ Back to Top
Out[15]: <matplotlib.axes._subplots.AxesSubplot at 0x104148ac8>
X = dataset_class['X']
y = dataset_class['y']
clf = RandomForestClassifier(random_state=seed, max_depth=250)
clf.fit(X, y)
importances = clf.feature_importances_
importances = pd.Series(importances, index=X.columns)
importances.sort_values(ascending=False).iloc[:10].plot(kind='bar')
In [16]:
Out[16]: <matplotlib.axes._subplots.AxesSubplot at 0x10a707278>
X = dataset_reg['X']
y = dataset_reg['y']

clf = RandomForestRegressor(max_depth=26, n_estimators=5, random_state=seed)
clf.fit(X, y)

importances = clf.feature_importances_
importances = pd.Series(importances, index=X.columns)
importances.sort_values(ascending=False).iloc[:10].plot(kind='bar')

Deployment
5 points
Description:
How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it was used by these parties? How would you deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.?
⏫ Back to Top
Since Zillow began aggregating publicly available real estate data from disparate sources into a single platform, the gap between sellers' prices and buyers' offer prices has significantly decreased.
The Zillow dataset was provided for the purpose of evaluating Zestimate's accuracy based upon the variable logerror, which is the difference log(Zestimate) - log(SalePrice). For purposes of this lab assignment, we developed regression models with taxamount as the response. In our classification model, we determine important features for the regionidcounty.
For companies in the real estate space, classification models based on physical attributes provide valuable insight for buying, selling, and investment decisions. Our classification model can be adapted to more granular levels such as cities and municipalities. Buyers, sellers, and investors alike
can gain insights into which features have the highest importance to specific locations. This may
drive investment decisions knowing how important certain attributes are for targeted locations.
Knowing which features are highly important in certain locations can drive remodeling decisions to
make properties more attractive to potential buyers. The value-add of this model for these
companies can be measured in terms of returns on investment.
Deployment of the model can be valuable for the rental market as well, where Airbnb can direct
marketing efforts to areas with specific property attributes. Deployment of the model can also be
used to provide the break-even horizon for making rent versus own decisions. In addition, loan
refinancing companies can utilize this model along with Zillow’s liens and taxes database to target
homeowners in specific areas.
To further improve the effectiveness of the model, we should expand the model to include sales
prices, liens, taxes, as well as identify biased data such as short sales, foreclosures, and “arms-
length” transactions (i.e. sales to relatives). All these are readily available from Zillow, as they
collect an enormous amount of data which are updated with high frequency. For our models to be
relevant in this space, they should be updated daily just as Zillow does with their 7 to 11 million
models.
Exceptional Work
10 points
Description:
You have free rein to provide additional analyses. One idea: grid search parameters in a
parallelized fashion and visualize the performances across attributes. Which parameters are most
significant for making a good model for each classification algorithm?
⏫ Back to Top
Approaches Considered for Balanced Classification
⏫ Back to Top
One of the shortcomings of classification, and of K-nearest neighbors in particular, is the tendency to be biased in favor of the majority class. To mitigate this bias, a number of approaches can be used, including StratifiedKFold, which is the approach we use for our classification models. To be thorough, we also explored a set of "imbalanced-learn" algorithms: imblearn.RandomOverSampler, imblearn.SMOTE, and imblearn.ADASYN.
1. StratifiedKFold - A variant of KFold that preserves the class proportions of the full dataset in every fold, so each class is represented in each training and test split. Stratification is performed on the training dataset "on the fly" as opposed to being performed as part of data preprocessing.
2. imblearn.RandomOverSampler - As a separate package, imblearn was developed to
address the problem of imbalanced data sets; it is performed at data preprocessing.
RandomOverSampler, in particular, performs a naive over sampling with replacement,
duplicating original samples from the minority class. (Under sampling is the alternate
approach).
3. imblearn.SMOTE - SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic minority-class samples by interpolating between existing minority observations and their nearest neighbors; imblearn also offers combined variants that follow the over sampling with Tomek's link or edited-nearest-neighbours cleaning for classes that are difficult to separate.
4. imblearn.ADASYN - Adaptive Synthetic Sampling Approach (ADASYN) is similar to SMOTE in that it generates samples by interpolation, but it focuses on generating samples next to minority observations that are wrongly classified by a k-nearest neighbors classifier.
After considering these methods, we settled on the StratifiedKFold for simplicity since accuracies
across the different approaches were practically equivalent.
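For reference, here is a minimal sketch (not part of the original notebook) contrasting the two ideas on the classification data prepared above; dataset_class and seed are the objects defined earlier, and the fold count is arbitrary.
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import StratifiedKFold

X = dataset_class['X']
y = dataset_class['y']

# StratifiedKFold keeps the original class proportions inside every fold
skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):
    print('test-fold class counts:', Counter(y.iloc[test_index]))
    break  # one fold is enough to illustrate the idea

# RandomOverSampler balances the classes by duplicating minority-class rows
ros = RandomOverSampler(random_state=seed)
X_res, y_res = ros.fit_resample(X, y)  # older imblearn releases call this fit_sample
print('before resampling:', Counter(y))
print('after resampling: ', Counter(y_res))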
Below is an illustration of imblearn's RandomOverSampler algorithm in action.
In [9]:
plt.figure(figsize=(12,16))
plt.imshow(imread('../../input/imblearn.png')) # just in case you don't see the image inline
Out[9]: <matplotlib.image.AxesImage at 0x11adee128>
Feature Elimination
Feature ranking with recursive feature elimination:
Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
In this example we first select the top 20 features and then train a Random Forest using only those features. The performance of the model is printed below.
⏫ Back to Top
In [36]:
X = dataset_class['X'].iloc[:2000]
y = dataset_class['y'].iloc[:2000]
estimator = RandomForestClassifier(max_depth=10, random_state=seed, n_estimators=1)  # line truncated in the source; closing parenthesis restored
selector = RFE(estimator, n_features_to_select=20, step=1)
selector = selector.fit(X, y)
X = dataset_class['X']
y = dataset_class['y']
X = X[X.columns[selector.support_]]
scores = []
yhat = np.zeros(y.shape)
yhat_score = np.zeros((len(y), 3))
cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = RandomForestClassifier(random_state=seed, max_depth=250, n_estimators=40)  # line truncated in the source; closing parenthesis restored
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    yhat_score[test_index] = clf.predict_proba(X_test)
    f1_score = mt.f1_score(y_test, clf.predict(X_test), average='weighted')
    scores.append(f1_score)
print_accuracy('KD Tree Classifier', y, yhat, scores)
plot_class_acc(y, yhat, clf.classes_, title="KD Tree Classifier")
confusion_matrix(y, yhat, clf.classes_)
roc_curve(y, yhat, clf)
----------------- KD Tree Classifier Evaluation -----------------
F1 Score: 0.49 (+/- 0.00)
Accuracy 0.559164097294
Precision 0.469948908187
Recall 0.559164097294
Two dimensional Linear Discriminant Analysis
The idea is to see whether there are separable clusters by class. The colors green, blue, and red mark the three counties after LDA projects the data onto a 2D plane, so we can see whether any distinct clusters or definite patterns form.
⏫ Back to Top
In [37]:
X = dataset_class['X']
y = dataset_class['y']
lde = LDA(n_components=2)
X_lde = lde.fit(X, y).transform(X)
colors = y.astype(str)
colors[colors=='3101'] = 'g'
colors[colors=='2061'] = 'b'
colors[colors=='1286'] = 'r'
plt.scatter(X_lde[:, 1], X_lde[:, 0], s=2, c=colors);
References:
Kernels from the Kaggle competition: https://www.kaggle.com/c/zillow-prize-1/kernels
Scikit-learn logistic regression: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
Scikit-learn linear SVC: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
Stack Overflow pandas questions: https://stackoverflow.com/questions/tagged/pandas
Deployment reference: http://www.zdnet.com/article/zillow-machine-learning-and-data-in-real-estate/
Advantages of GaussianProcessRegression: http://scikit-learn.org/stable/modules/gaussian_process.html
Advantages of GaussianProcessRegression: https://stats.stackexchange.com/questions/207183/main-advantages-of-gaussian-process-models
Advantages of GaussianProcessRegression: https://www.quora.com/What-are-some-advantages-of-using-Gaussian-Process-Models-vs-SVMs
Advantages of RandomForestRegression: https://www.quora.com/What-are-some-advantages-of-using-a-random-forest-over-a-decision-tree-given-that-a-decision-tree-is-simpler
Advantages of RandomForestRegression: https://www.datasciencecentral.com/profiles/blogs/random-forests-algorithm
Advantages of KNeighborsRegression: https://stats.stackexchange.com/questions/104255/why-would-anyone-use-knn-for-regression
Advantages of KNeighborsRegression: https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/
Advantages of KNeighborsRegression: https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/#pros-and-cons-of-knn
Imbalanced Learn: http://contrib.scikit-learn.org/imbalanced-learn/stable/install.html
⏫ Back to Top

Weitere ähnliche Inhalte

Was ist angesagt?

R Programming: Mathematical Functions In R
R Programming: Mathematical Functions In RR Programming: Mathematical Functions In R
R Programming: Mathematical Functions In RRsquared Academy
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorVivian S. Zhang
 
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...Vivian S. Zhang
 
3 R Tutorial Data Structure
3 R Tutorial Data Structure3 R Tutorial Data Structure
3 R Tutorial Data StructureSakthi Dasans
 
Logistic Regression using Mahout
Logistic Regression using MahoutLogistic Regression using Mahout
Logistic Regression using Mahouttanuvir
 
B61301007 matlab documentation
B61301007 matlab documentationB61301007 matlab documentation
B61301007 matlab documentationManchireddy Reddy
 
Preparation Data Structures 03 abstract data_types
Preparation Data Structures 03 abstract data_typesPreparation Data Structures 03 abstract data_types
Preparation Data Structures 03 abstract data_typesAndres Mendez-Vazquez
 
R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?Villu Ruusmann
 
Matlab Introduction
Matlab IntroductionMatlab Introduction
Matlab Introductionideas2ignite
 
Evaluating classifierperformance ml-cs6923
Evaluating classifierperformance ml-cs6923Evaluating classifierperformance ml-cs6923
Evaluating classifierperformance ml-cs6923Raman Kannan
 
Random forest using apache mahout
Random forest using apache mahoutRandom forest using apache mahout
Random forest using apache mahoutGaurav Kasliwal
 
5 R Tutorial Data Visualization
5 R Tutorial Data Visualization5 R Tutorial Data Visualization
5 R Tutorial Data VisualizationSakthi Dasans
 
Importance of matlab
Importance of matlabImportance of matlab
Importance of matlabkrajeshk1980
 

Was ist angesagt? (20)

R Programming: Mathematical Functions In R
R Programming: Mathematical Functions In RR Programming: Mathematical Functions In R
R Programming: Mathematical Functions In R
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
 
Programming in R
Programming in RProgramming in R
Programming in R
 
R programming language
R programming languageR programming language
R programming language
 
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
 
3 R Tutorial Data Structure
3 R Tutorial Data Structure3 R Tutorial Data Structure
3 R Tutorial Data Structure
 
R Language Introduction
R Language IntroductionR Language Introduction
R Language Introduction
 
Logistic Regression using Mahout
Logistic Regression using MahoutLogistic Regression using Mahout
Logistic Regression using Mahout
 
B61301007 matlab documentation
B61301007 matlab documentationB61301007 matlab documentation
B61301007 matlab documentation
 
R studio
R studio R studio
R studio
 
Preparation Data Structures 03 abstract data_types
Preparation Data Structures 03 abstract data_typesPreparation Data Structures 03 abstract data_types
Preparation Data Structures 03 abstract data_types
 
R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?
 
Matlab Introduction
Matlab IntroductionMatlab Introduction
Matlab Introduction
 
Data Management in R
Data Management in RData Management in R
Data Management in R
 
Evaluating classifierperformance ml-cs6923
Evaluating classifierperformance ml-cs6923Evaluating classifierperformance ml-cs6923
Evaluating classifierperformance ml-cs6923
 
Random forest using apache mahout
Random forest using apache mahoutRandom forest using apache mahout
Random forest using apache mahout
 
Array i imp
Array  i impArray  i imp
Array i imp
 
5 R Tutorial Data Visualization
5 R Tutorial Data Visualization5 R Tutorial Data Visualization
5 R Tutorial Data Visualization
 
Basic Analysis using Python
Basic Analysis using PythonBasic Analysis using Python
Basic Analysis using Python
 
Importance of matlab
Importance of matlabImportance of matlab
Importance of matlab
 

Ähnlich wie Lab 2: Classification and Regression Prediction Models, training and testing splits, optimization of K Nearest Neighbors (KD tree), optimization of Random Forest, optimization of Naive Bayes (Gaussian), advantages and model comparisons, feature importance

maxbox starter60 machine learning
maxbox starter60 machine learningmaxbox starter60 machine learning
maxbox starter60 machine learningMax Kleiner
 
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using PythonNishantKumar1179
 
Machine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionMachine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionSiddharth Shrivastava
 
wk5ppt1_Titanic
wk5ppt1_Titanicwk5ppt1_Titanic
wk5ppt1_TitanicAliciaWei1
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple stepsRenjith M P
 
Two methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersTwo methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersUniversity of Huddersfield
 
Machine Learning Model for M.S admissions
Machine Learning Model for M.S admissionsMachine Learning Model for M.S admissions
Machine Learning Model for M.S admissionsOmkar Rane
 
maXbox starter65 machinelearning3
maXbox starter65 machinelearning3maXbox starter65 machinelearning3
maXbox starter65 machinelearning3Max Kleiner
 
Pyclustering tutorial - K-means
Pyclustering tutorial - K-meansPyclustering tutorial - K-means
Pyclustering tutorial - K-meansAndrei Novikov
 
Viktor Tsykunov: Azure Machine Learning Service
Viktor Tsykunov: Azure Machine Learning ServiceViktor Tsykunov: Azure Machine Learning Service
Viktor Tsykunov: Azure Machine Learning ServiceLviv Startup Club
 
Machinelearning Spark Hadoop User Group Munich Meetup 2016
Machinelearning Spark Hadoop User Group Munich Meetup 2016Machinelearning Spark Hadoop User Group Munich Meetup 2016
Machinelearning Spark Hadoop User Group Munich Meetup 2016Comsysto Reply GmbH
 
maXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VIImaXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VIIMax Kleiner
 
Next Generation Programming in R
Next Generation Programming in RNext Generation Programming in R
Next Generation Programming in RFlorian Uhlitz
 
Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62Max Kleiner
 
Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)Daniel Chan
 
Competition 1 (blog 1)
Competition 1 (blog 1)Competition 1 (blog 1)
Competition 1 (blog 1)TarunPaparaju
 

Ähnlich wie Lab 2: Classification and Regression Prediction Models, training and testing splits, optimization of K Nearest Neighbors (KD tree), optimization of Random Forest, optimization of Naive Bayes (Gaussian), advantages and model comparisons, feature importance (20)

maxbox starter60 machine learning
maxbox starter60 machine learningmaxbox starter60 machine learning
maxbox starter60 machine learning
 
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using Python
 
Machine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionMachine Learning - Simple Linear Regression
Machine Learning - Simple Linear Regression
 
wk5ppt1_Titanic
wk5ppt1_Titanicwk5ppt1_Titanic
wk5ppt1_Titanic
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple steps
 
Two methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersTwo methods for optimising cognitive model parameters
Two methods for optimising cognitive model parameters
 
Database programming
Database programmingDatabase programming
Database programming
 
wk5ppt2_Iris
wk5ppt2_Iriswk5ppt2_Iris
wk5ppt2_Iris
 
Machine Learning Model for M.S admissions
Machine Learning Model for M.S admissionsMachine Learning Model for M.S admissions
Machine Learning Model for M.S admissions
 
CSL0777-L07.pptx
CSL0777-L07.pptxCSL0777-L07.pptx
CSL0777-L07.pptx
 
maXbox starter65 machinelearning3
maXbox starter65 machinelearning3maXbox starter65 machinelearning3
maXbox starter65 machinelearning3
 
Pyclustering tutorial - K-means
Pyclustering tutorial - K-meansPyclustering tutorial - K-means
Pyclustering tutorial - K-means
 
Viktor Tsykunov: Azure Machine Learning Service
Viktor Tsykunov: Azure Machine Learning ServiceViktor Tsykunov: Azure Machine Learning Service
Viktor Tsykunov: Azure Machine Learning Service
 
Machinelearning Spark Hadoop User Group Munich Meetup 2016
Machinelearning Spark Hadoop User Group Munich Meetup 2016Machinelearning Spark Hadoop User Group Munich Meetup 2016
Machinelearning Spark Hadoop User Group Munich Meetup 2016
 
maXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VIImaXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VII
 
Next Generation Programming in R
Next Generation Programming in RNext Generation Programming in R
Next Generation Programming in R
 
Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62
 
BPstudy sklearn 20180925
BPstudy sklearn 20180925BPstudy sklearn 20180925
BPstudy sklearn 20180925
 
Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)
 
Competition 1 (blog 1)
Competition 1 (blog 1)Competition 1 (blog 1)
Competition 1 (blog 1)
 

Mehr von Yao Yao

Lessons after working as a data scientist for 1 year
Lessons after working as a data scientist for 1 yearLessons after working as a data scientist for 1 year
Lessons after working as a data scientist for 1 yearYao Yao
 
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...Yao Yao
 
Yelp's Review Filtering Algorithm Paper
Yelp's Review Filtering Algorithm PaperYelp's Review Filtering Algorithm Paper
Yelp's Review Filtering Algorithm PaperYao Yao
 
Yelp's Review Filtering Algorithm Poster
Yelp's Review Filtering Algorithm PosterYelp's Review Filtering Algorithm Poster
Yelp's Review Filtering Algorithm PosterYao Yao
 
Yelp's Review Filtering Algorithm Powerpoint
Yelp's Review Filtering Algorithm PowerpointYelp's Review Filtering Algorithm Powerpoint
Yelp's Review Filtering Algorithm PowerpointYao Yao
 
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelYao Yao
 
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelYao Yao
 
Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Yao Yao
 
Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Yao Yao
 
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...Yao Yao
 
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...Yao Yao
 
Prediction of Future Employee Turnover via Logistic Regression
Prediction of Future Employee Turnover via Logistic RegressionPrediction of Future Employee Turnover via Logistic Regression
Prediction of Future Employee Turnover via Logistic RegressionYao Yao
 
Data Reduction and Classification for Lumosity Data
Data Reduction and Classification for Lumosity DataData Reduction and Classification for Lumosity Data
Data Reduction and Classification for Lumosity DataYao Yao
 
Predicting Sales Price of Homes Using Multiple Linear Regression
Predicting Sales Price of Homes Using Multiple Linear RegressionPredicting Sales Price of Homes Using Multiple Linear Regression
Predicting Sales Price of Homes Using Multiple Linear RegressionYao Yao
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
Blockchain Security and Demonstration
Blockchain Security and DemonstrationBlockchain Security and Demonstration
Blockchain Security and DemonstrationYao Yao
 
API Python Chess: Distribution of Chess Wins based on random moves
API Python Chess: Distribution of Chess Wins based on random movesAPI Python Chess: Distribution of Chess Wins based on random moves
API Python Chess: Distribution of Chess Wins based on random movesYao Yao
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
Blockchain Security and Demonstration
Blockchain Security and DemonstrationBlockchain Security and Demonstration
Blockchain Security and DemonstrationYao Yao
 

Mehr von Yao Yao (19)

Lessons after working as a data scientist for 1 year
Lessons after working as a data scientist for 1 yearLessons after working as a data scientist for 1 year
Lessons after working as a data scientist for 1 year
 
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
 
Yelp's Review Filtering Algorithm Paper
Yelp's Review Filtering Algorithm PaperYelp's Review Filtering Algorithm Paper
Yelp's Review Filtering Algorithm Paper
 
Yelp's Review Filtering Algorithm Poster
Yelp's Review Filtering Algorithm PosterYelp's Review Filtering Algorithm Poster
Yelp's Review Filtering Algorithm Poster
 
Yelp's Review Filtering Algorithm Powerpoint
Yelp's Review Filtering Algorithm PowerpointYelp's Review Filtering Algorithm Powerpoint
Yelp's Review Filtering Algorithm Powerpoint
 
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
 
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
 
Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...
 
Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...
 
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
 
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
 
Prediction of Future Employee Turnover via Logistic Regression
Prediction of Future Employee Turnover via Logistic RegressionPrediction of Future Employee Turnover via Logistic Regression
Prediction of Future Employee Turnover via Logistic Regression
 
Data Reduction and Classification for Lumosity Data
Data Reduction and Classification for Lumosity DataData Reduction and Classification for Lumosity Data
Data Reduction and Classification for Lumosity Data
 
Predicting Sales Price of Homes Using Multiple Linear Regression
Predicting Sales Price of Homes Using Multiple Linear RegressionPredicting Sales Price of Homes Using Multiple Linear Regression
Predicting Sales Price of Homes Using Multiple Linear Regression
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Blockchain Security and Demonstration
Blockchain Security and DemonstrationBlockchain Security and Demonstration
Blockchain Security and Demonstration
 
API Python Chess: Distribution of Chess Wins based on random moves
API Python Chess: Distribution of Chess Wins based on random movesAPI Python Chess: Distribution of Chess Wins based on random moves
API Python Chess: Distribution of Chess Wins based on random moves
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Blockchain Security and Demonstration
Blockchain Security and DemonstrationBlockchain Security and Demonstration
Blockchain Security and Demonstration
 

Kürzlich hochgeladen

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 

Kürzlich hochgeladen (20)

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 

Lab 2: Classification and Regression Prediction Models, training and testing splits, optimization of K Nearest Neighbors (KD tree), optimization of Random Forest, optimization of Naive Bayes (Gaussian), advantages and model comparisons, feature importance

  • 1. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 1/43 Lab 2: Zillow Dataset Classification and Regression Prediction Models MSDS 7331 Data Mining - Section 403 - Lab 2 Team: Ivelin Angelov, Yao Yao, Kaitlin Kirasich, Albert Asuncion Contents Imports Define and Prepare Class Variables Classification Variables Regression Variables Describe the Final Dataset Classification Dataset Regression Dataset Explain Evaluation Metrics Classification Metrics Regression Metrics Training and Testing Splits For Classification For Regression Three Different Classification/Regression Models Classification Models K Nearest Neighbors Random Forest Naive Bayes Regression Models K Nearest Neighbors Random Forest Gaussian Regression Visualizations of Results and Analysis Analysis of Classification Models Analysis of K Nearest Neighbors Analysis of Random Forest Analysis of Naive Bayes Regression Models Analysis of K Nearest Neighbors Analysis of Random Forest Analysis of Gaussian Regression Advantages of Each Model Classification Models Regression Models Important Attributes Classification Models Regression Models
  • 2. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 2/43 Deployment Exceptional Work Approaches Considered for Balanced Classification Feature Elimination Two dimensional Linear Discriminant Analysis References Imports & Custom Functions We chose to use the same Zillow dataset from Lab 1 for this exploration in regression and classification. For the origin and purpose of dataset as well as a detailed description of the dataset, refer to https://github.com/post2web/data_mining_group_project/blob/master/notebooks/lab1.ipynb (https://github.com/post2web/data_mining_group_project/blob/master/notebooks/lab1.ipynb). The function output_variables_table shows if the variable is nominal or ordinal for further use on classification or regression. The functions per_class_accuracy and confusion_matrix show the confusion table for correctly and incorrectly identified classification prediction results. The function plot_class_acc shows the visual accuracies of classification. The function plot_feature_importance shows the feature importance of classification values. The function print_accuracy shows the accuracy scores of the classification models. The function get_dataset_subset obtains a subset of the full dataset for modeling and prediction. We will be using a seed of 0. Due to our dataset being extremely large, we are using 5 folds for the CPU usage and runtime to be more manageable to run through the prediction models for both classification and regression.
  • 3. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 3/43 In [1]: %matplotlib inline import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns from IPython.display import display, HTML from sklearn.model_selection import train_test_split from sklearn import metrics as mt # classification imports from sklearn.ensemble import RandomForestClassifier from sklearn.neighbors import KNeighborsClassifier from sklearn.cross_validation import cross_val_score from sklearn.model_selection import StratifiedKFold from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA from sklearn.decomposition import PCA, SparsePCA from sklearn.gaussian_process import GaussianProcessClassifier from sklearn.neighbors import KNeighborsClassifier from sklearn.ensemble import RandomForestClassifier from sklearn.gaussian_process import GaussianProcessRegressor from sklearn.model_selection import train_test_split, cross_val_score, KFold from sklearn.metrics import mean_squared_error, r2_score # regression imports from sklearn.naive_bayes import GaussianNB from sklearn.neighbors import KNeighborsRegressor from sklearn.model_selection import KFold from sklearn.metrics import mean_squared_error, r2_score from sklearn.ensemble import RandomForestRegressor from sklearn.gaussian_process import GaussianProcessRegressor from sklearn.datasets import make_regression from sklearn.feature_selection import RFE from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA from scipy.ndimage import imread import warnings warnings.filterwarnings("ignore") def output_variables_table(variables): variables = variables.sort_index() rows = ['<tr><th>Variable</th><th>Type</th><th>Scale</th><th>Description</th> for vname, atts in variables.iterrows(): if vname not in dataset.columns: continue atts = atts.to_dict() # add scale if TBD if atts['scale'] == 'TBD': if atts['type'] in ['nominal', 'ordinal']: uniques = dataset[vname].unique() uniques = list(uniques.astype(str)) if len(uniques) < 10: atts['scale'] = '[%s]' % ', '.join(uniques)
  • 4. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 4/43 else: atts['scale'] = '[%s]' % (', '.join(uniques[:5]) + ', ... (%d if atts['type'] in ['ratio', 'interval']: atts['scale'] = '(%d, %d)' % (dataset[vname].min(), dataset[vname row = (vname, atts['type'], atts['scale'], atts['description']) rows.append('<tr><td>%s</td><td>%s</td><td>%s</td><td>%s</td></tr>' % row return HTML('<table>%s</table>' % ''.join(rows)) # Define an accuracy plot def per_class_accuracy(ytrue, yhat): conf = mt.confusion_matrix(ytrue,yhat) norm_conf = conf.astype('float') / conf.sum(axis=1)[:, np.newaxis] return np.diag(norm_conf) def plot_class_acc(ytrue, yhat, classes, title=''): acc_list = per_class_accuracy(y, yhat) pd.DataFrame(acc_list, index=pd.Index(classes, name='Classes')).plot(kind='ba plt.xlabel('Class value (one per face)') plt.ylabel('Accuracy within class') plt.title(title+", Total Acc=%.1f"%(100*mt.accuracy_score(ytrue,yhat))) plt.grid() plt.ylim([0,1]) plt.show() # Plot the feature importances of the forest def plot_feature_importance(ytrue, yhat, rt, title=''): importances = rt.feature_importances_ std = np.std([tree.feature_importances_ for tree in rt.estimators_], axis=0) indices = np.argsort(importances)[::-1] for f in range(X.shape[1]): print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]] plt.figure() plt.title("Feature importances") plt.bar(range(X.shape[1]), importances[indices], color="r", yerr=std[indices], align="center") plt.xticks(range(X.shape[1]), indices) plt.xlim([-1, X.shape[1]]) plt.show() def print_accuracy(model_name, y_test, yhat, scores): scores = np.array(scores) print('----------------- %s Evaluation -----------------' % model_name) print(" F1 Score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2)) print(' Accuracy', mt.accuracy_score(y_test, yhat)) print(' Precision', mt.precision_score(y_test, yhat, average='weighted')) print(' Recall', mt.recall_score(y_test, yhat, average='weighted')) def confusion_matrix(ytrue, yhat, classes): index = pd.MultiIndex.from_product([['True Class'], classes]) columns = pd.MultiIndex.from_product([['Predicted Class'], classes]) return pd.DataFrame(mt.confusion_matrix(y, yhat), index=index, columns=column def roc_curve(ytrue, yhat, clf): for i, label in enumerate(clf.classes_):
  • 5. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 5/43 Define and Prepare Class Variables 10 points Description: Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. ⏫ Back to Top Classification Datasets: The classification dataset removes logerror and transactiondate because they were for the purposes of the Kaggle competition and were not complete for the training set. The column that was created for "New Features" from Lab 1 (city and pricepersqft) were also removed for the sake of simplicity of only using original data for the prediction process. The table generated shows the type of data used for classification purposes. The dataset has 58380 rows and 1757 columns. All variables and details about the variables are printed on the table below. /usr/local/lib/python3.6/site-packages/sklearn/cross_validation.py:41: Deprecat ionWarning: This module was deprecated in version 0.18 in favor of the model_se lection module into which all the refactored classes and functions are moved. A lso note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20. "This module will be removed in 0.20.", DeprecationWarning) fpr, tpr, _ = mt.roc_curve(y, yhat_score[:, i], pos_label=label) roc_auc = mt.auc(fpr, tpr) plt.plot(fpr, tpr, label='class {0} with {1} instances (area = {2:0.2f})' ''.format(label, sum(y==label), roc_auc)) plt.title('ROC Curve') plt.legend(loc="lower right") plt.xlabel('False Positive Rate') plt.ylabel('True Positive Rate') plt.show() def get_dataset_subset(dataset, n=1000): return { 'X': dataset['X'].iloc[:n], 'y': dataset['y'].iloc[:n] } seed = 0 n_splits = 5
  • 6. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 6/43 The target class "regionidcounty" has three possible values: 1286, 2061 or 3101, representing three different county codes. The distribution is skewed with code 1286 having 17749 observations, 3101 has 35563, and 2061 only 5068 observations. ⏫ Back to Top
  • 7. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 7/43 In [5]: Dataset shape: (58380, 1757) regionidcounty 1286 17749 2061 5068 3101 35563 Name: regionidcounty, dtype: int64 variables = pd.read_csv('../../datasets/variables.csv').set_index('name') dataset = pd.read_csv('../../datasets/train.csv', low_memory=False) # remove unneeded variables del dataset['Unnamed: 0'] del dataset['logerror'] del dataset['transactiondate'] del dataset['city'] del dataset['price_per_sqft'] # delete all location information because we want to predict the couty # and those feature will give it up to easy y = dataset['regionidcounty'].copy() del dataset['regionidcounty'] del dataset['regionidcity'] del dataset['regionidzip'] del dataset['regionidneighborhood'] del dataset['rawcensustractandblock'] del dataset['latitude'] del dataset['longitude'] output_variables = output_variables_table(variables) nominal = variables[variables['type'].isin(['nominal'])] nominal = nominal[nominal.index.isin(dataset.columns)] continuous = variables[~variables['type'].isin(['nominal'])] continuous = continuous[continuous.index.isin(dataset.columns)] nominal_data = dataset[nominal.index] nominal_data = pd.get_dummies(nominal_data, drop_first=True) nominal_data = nominal_data[nominal_data.columns[~nominal_data.columns.isin(nomina continuous_data = dataset[continuous.index] dataset = pd.concat([continuous_data, nominal_data], axis=1) columns = dataset.columns variables = variables[variables.index.isin(dataset.columns)] # shuffle the dataset (just in case) X = dataset.sample(frac=1, random_state=seed) dataset_class = { 'X': X, 'y': y } print('Dataset shape:', X.shape) print(y.groupby(y).size()) output_variables
  • 8. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 8/43 Out[5]: Variable Type Scale Description airconditioningtypeid nominal [0, 1, 13, 5, 11, 3, 9] Type of cooling system present in the any) assessmentyear interval (2015, 2015) The year of the property tax assessm bathroomcnt ordinal [1.0, 3.5, 2.5, 3.0, 2.0, ... (22 More)] Number of bathrooms in home includi fractional bathrooms bedroomcnt ordinal [1, 5, 4, 3, 2, ... (16 More)] Number of bedrooms in home buildingqualitytypeid ordinal [7, 4, 1, 10, 12, 8] Overall assessment of condition of the from best (lowest) to worst (highest) calculatedbathnbr ordinal [1.0, 3.5, 2.5, 3.0, 2.0, ... (22 More)] Number of bathrooms in home includi fractional bathroom calculatedfinishedsquarefeet ratio (0, 10925) Calculated total finished living area of home censustractandblock nominal [60372040024100.0, 60590991081500.0, 60374078455800.0, 61110052978700.0, 60379010957300.0, ... (445 More)] Census tract and block ID combined - contains blockgroup assignment by ex finishedsquarefeet12 ratio (0, 6615) Finished living area finishedsquarefeet50 ratio (0, 8352) Size of the finished living area on the (entry) floor of the home fips nominal [6037, 6059, 6111] Federal Information Processing Stand - see https://en.wikipedia.org/wiki/FIPS_cou for more details fireplacecnt ordinal [0, 1, 2, 3, 5, 4] Number of fireplaces in a home (if any fullbathcnt ordinal [1.0, 3.0, 2.0, 6.0, 4.0, ... (17 More)] Number of full bathrooms (sink, show bathtub, and toilet) present in home garagecarcnt ordinal [0.0, 2.0, 1.0, 4.0, 3.0, ... (14 More)] Total number of garages on the lot inc attached garage garagetotalsqft ratio (0, 1610) Total number of square feet of all gara lot including an attached garage hashottuborspa ordinal [0, 1] Does the home have a hot tub or spa heatingorsystemtypeid nominal [7, 0, 2, 6, 24, ... (12 More)] Type of home heating system landtaxvaluedollarcnt ratio (22, 2477536) The assessed value of the land area o parcel
  • 9. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 9/43 location_type nominal [PRIMARY, nan, NOT ACCEPTABLE, ACCEPTABLE] Primary, Acceptable, Not Acceptable lotsizesquarefeet ratio (0, 1710750) Area of the lot in square feet numberofstories ordinal [1, 2, 3, 4] Number of stories or levels the home parcelid nominal [11800329, 14058566, 14636635, 17138404, 11270723, ... (49678 More)] Unique identifier for parcels (lots) poolcnt ordinal [0.0, 1.0] Number of pools on the lot (if any) poolsizesum ratio (0, 1476) Total square footage of all pools on pr pooltypeid10 nominal [0, 1] Spa or Hot Tub pooltypeid2 nominal [0, 1] Pool with Spa/Hot Tub pooltypeid7 nominal [0, 1] Pool without hot tub propertycountylandusecode nominal [0100, 122, 1, 1111, 010C, ... (71 More)] County land use code i.e. it's zoning a county level propertylandusetypeid nominal [261, 266, 246, 265, 269, ... (13 More)] Type of land use the property is zoned propertyzoningdesc nominal [LAR2, 0, LRRA7000*, TOPR- MD, LCA11*, ... (1655 More)] Description of the allowed land uses ( for that property roomcnt ordinal [0, 9, 8, 4, 7, ... (16 More)] Total number of rooms in the principal residence structuretaxvaluedollarcnt ratio (100, 2181198) The assessed value of the built struct the parcel taxamount ratio (49, 51292) The total property tax assessed for th assessment year taxdelinquencyflag nominal [0, 1] Property taxes for this parcel are past of 2015 taxdelinquencyyear interval (0, 26) Year taxvaluedollarcnt ratio (22, 4052186) The total tax assessed value of the pa threequarterbathnbr ordinal [0, 1, 2, 3, 4] Number of 3/4 bathrooms in house (s sink + toilet) unitcnt ordinal [1, 2, 3, 4, 9, 6] Number of units the structure is built i = duplex, 3 = triplex, etc...) yardbuildingsqft17 interval (0, 1485) Patio in yard
  • 10. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 10/43 Regression Datasets: The regression dataset removes logerror and transactiondate because they were for the purposes of the Kaggle competition and were not complete for the training set. The column that was created for "New Features" from Lab 1 (city and pricepersqft) were also removed for the sake of simplicity of only using original data for the prediction process. We are only using nominal and continuous data types for regression purposes. The dataset has 58380 rows and 1758 columns. All variables and details about the variables are printed on the table below. ⏫ Back to Top yardbuildingsqft26 interval (0, 1366) Storage shed/building in yard yearbuilt interval (1885, 2015) The Year the principal residence was zipcode_type nominal [STANDARD, nan, PO BOX, MILITARY, UNIQUE] Standard, PO BOX Only, Unique, Military(implies APO or FPO)
  • 11. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 11/43 In [13]: Dataset shape: (58380, 1758) dataset = pd.read_csv('../../datasets/train.csv', low_memory=False) variables = pd.read_csv('../../datasets/variables.csv').set_index('name') # remove unneeded variables del dataset['logerror'] del dataset['transactiondate'] del dataset['city'] del dataset['price_per_sqft'] output_variables = output_variables_table(variables) nominal = variables[variables['type'].isin(['nominal'])] nominal = nominal[nominal.index.isin(dataset.columns)] continuous = variables[~variables['type'].isin(['nominal'])] continuous = continuous[continuous.index.isin(dataset.columns)] nominal_data = dataset[nominal.index] nominal_data = pd.get_dummies(nominal_data, drop_first=True) nominal_data = nominal_data[nominal_data.columns[~nominal_data.columns.isin(nomina continuous_data = dataset[continuous.index] dataset = pd.concat([continuous_data, nominal_data], axis=1) columns = dataset.columns variables = variables[variables.index.isin(dataset.columns)] # shuffle the dataset (just in case) X = dataset.sample(frac=1, random_state=seed) y = X['taxamount'].copy() del X['taxamount'] dataset_reg = { 'X': X, 'y': y } print('Dataset shape:', X.shape) plt.title('Distribution of the target variable: taxamount') y.plot(kind='box') output_variables
Describe the Final Dataset
5 points
Description: Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).
⏫ Back to Top

Classification Datasets:
⏫ Back to Top
Since we are using the same Zillow dataset that we used in the previous lab, most of the data was already cleaned up. However, the purpose of our classification dataset is to predict the county each property is located in, so our final model removed all columns relating to location, such as latitude, longitude, city, and zipcode. We also removed variables we did not need, such as logerror, transactiondate, and price_per_sqft. We did not create any new columns for the classification dataset, but we did transform the categorical variables into indicator variables. The final shape of our classification dataset is 58380 instances and 1757 columns. The three counties we are trying to predict have sizes of about 18k, 5k, and 36k, so an accuracy below 0.61 would mean we are better off simply predicting the majority county for every property.

Regression Datasets:
The regression dataset removes logerror and transactiondate because they were for the purposes of the Kaggle competition and were not complete for the training set. The columns created as "New Features" in Lab 1 (city and price_per_sqft) were also removed for simplicity, so that only original data is used in the prediction process. We are only using nominal and continuous data types for regression purposes. The final shape of our regression dataset is 58380 instances and 1758 columns. The variable that we are predicting, taxamount, is right skewed, with outlier properties costing far more than the typical property.
⏫ Back to Top

Out[13]: [variables table: Variable | Type | Scale | Description — rows listed above]
Explain Evaluation Metrics
10 points
Description: Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.
⏫ Back to Top

Classification Metrics:
⏫ Back to Top
Because our class distribution is very skewed, we will be optimizing the models based on F1 score. For our evaluation, we will take into account both accuracy and the F-measure. Computing the F-measure requires precision and recall; because the F-measure is a weighted average of the two, a better F-measure means the model has better precision and recall overall.

Accuracy is the ratio of correct predictions to the total number of observations. It is calculated as (TP+TN) / (TP+FP+FN+TN). The closer accuracy is to 1, the more accurate the model is, with one caveat: for high accuracy to be a reliable indicator, the errors have to be roughly symmetric, i.e. the number of false positives should be about equal to the number of false negatives. Otherwise, we need to review other metrics as well.

Precision is the ratio of correctly predicted positive observations to all predicted positive observations. It is calculated as TP / (TP+FP). Recall is the ratio of correctly predicted positive observations to all actual positives. It is calculated as TP / (TP+FN). The consequences of type 2 errors (false negatives) are not extreme here, so we think recall is an appropriate measure of completeness.

Finally, we will also use the F-measure, which is essentially a weighted average of precision and recall combined into one statistic. The F-measure is a number between 0 and 1, where closer to 1 is better. It overcomes the limitations of accuracy whenever false positives and false negatives are not roughly equal.

Regression Metrics:
For our evaluation of the regression prediction models, we are looking at mean squared error (MSE) and R^2. With the large data size and right skew of taxamount, we are trying to minimize MSE and obtain an R^2 value close to 1. Whichever model, with optimal parameters, reduces MSE and increases R^2 while using less CPU and runtime would be the best regression model for predicting on this dataset.
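As a minimal sketch (not part of the original notebook), the metrics above can be computed with sklearn.metrics; the label arrays below are made-up placeholders, not Zillow data:

from sklearn import metrics as mt

# toy placeholder labels, only to illustrate the metric calls used throughout this lab
y_true = [3101, 3101, 2061, 1286, 3101, 2061]
y_pred = [3101, 1286, 2061, 1286, 3101, 2061]

print('accuracy :', mt.accuracy_score(y_true, y_pred))
print('precision:', mt.precision_score(y_true, y_pred, average='weighted'))
print('recall   :', mt.recall_score(y_true, y_pred, average='weighted'))
print('f1 score :', mt.f1_score(y_true, y_pred, average='weighted'))

# regression metrics on toy numeric targets
yr_true = [1000.0, 2500.0, 4200.0]
yr_pred = [1100.0, 2300.0, 4000.0]
print('MSE:', mt.mean_squared_error(yr_true, yr_pred))
print('R^2:', mt.r2_score(yr_true, yr_pred))

The weighted average is the same setting used in the model evaluations below, so per-class scores are weighted by class frequency.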
⏫ Back to Top

Training and Testing Splits
10 points
Description: Choose the method you will use for dividing your data into training and testing splits (i.e., are you using Stratified 10-fold cross validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. For example, if you are using time series data then you should be using continuous training and testing sets across time.
⏫ Back to Top

Due to our dataset being extremely large, we are using 5 folds so that CPU usage and runtime stay manageable while running the prediction models for both classification and regression. Our data is not a time series, so we did not need to train and test over time with a moving time window.

Classification Splits:
⏫ Back to Top
For the classification task we chose to use Stratified K-Fold cross validation with 5 folds. We chose stratification in order to preserve the percentage of samples from each class in every fold. We also had a very large dataset, so splitting into more than 5 folds would have been computationally expensive without a large enough return on value. We felt that splitting the data into 5 folds would be enough to reduce the weight of any outliers or noise.

Regression Splits:
For the regression task we chose to use plain K-Fold cross validation with 5 folds, since the target (taxamount) is continuous and there are no classes to stratify on. As with classification, splitting such a large dataset into more than 5 folds would have been computationally expensive without a large enough return on value, and 5 folds is enough to reduce the weight of any outliers or noise. A short sketch contrasting the two splitters follows.
⏫ Back to Top
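A minimal sketch (not from the original notebook) showing that StratifiedKFold keeps the class proportions in each test fold while plain KFold does not; the labels are toy placeholders:

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# toy imbalanced labels: 80 samples of class 0, 20 of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

for name, cv in [('KFold', KFold(n_splits=5)),
                 ('StratifiedKFold', StratifiedKFold(n_splits=5))]:
    print(name)
    for train_idx, test_idx in cv.split(X_toy, y_toy):
        # fraction of the minority class in each test fold
        print('  minority fraction in test fold:', y_toy[test_idx].mean())

With ordered labels, plain KFold produces folds with minority fractions of 0.0 or 1.0, while StratifiedKFold keeps every test fold at 0.2, which is why we stratify for the skewed county classes.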
Three Different Classification/Regression Models
20 points
Description: Create three different classification/regression models for each task (e.g., random forest, KNN, and SVM for task one and the same or different algorithms for task two). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric. You must investigate different parameters of the algorithms!
⏫ Back to Top

Classification Models:
⏫ Back to Top

Definition and optimization of K Nearest Neighbors (KD Tree)
K Nearest Neighbors predicts the class of each test sample from the classes of its n_neighbors closest training samples, so the model is defined by the training data itself and the choice of n_neighbors. The optimization results for different values of n_neighbors are printed below. When n_neighbors is 12, the f1_score is highest at 0.492.
⏫ Back to Top
In [5]:
X = dataset_class['X']
y = dataset_class['y']

result = []
scores = []

for n_neighbors in range(2, 30)[::5]:
    yhat = np.zeros(y.shape)  # we will fill this with predictions
    cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
    for train_index, test_index in cv.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        # in order to reduce the time for training KNeighborsClassifier
        # we reduce the dimensions of the data from 1717 to 100 and we use kd_tree
        pca = PCA(n_components=100, random_state=seed)
        pca.fit(X_train)
        X_train = pca.transform(X_train)
        X_test = pca.transform(X_test)

        clf = KNeighborsClassifier(n_neighbors=n_neighbors, algorithm='kd_tree', weights='distance')
        clf.fit(X_train, y_train)
        yhat[test_index] = clf.predict(X_test)

    f1_score = mt.f1_score(y, yhat, average='weighted')
    print('n_neighbors:', n_neighbors, ', f1_score:', f1_score)

n_neighbors: 2 , f1_score: 0.470579117819
n_neighbors: 7 , f1_score: 0.491355550804
n_neighbors: 12 , f1_score: 0.492699617518
n_neighbors: 17 , f1_score: 0.492039256254
n_neighbors: 22 , f1_score: 0.490571365928
n_neighbors: 27 , f1_score: 0.488490047836

Definition and optimization of Random Forest
Random forest predicts the class of each test sample from an ensemble of decision trees whose depth is limited by max_depth. The optimization results for different test values of max_depth are printed below. By a max_depth of 351, the f1_score has plateaued at about 0.4906.
⏫ Back to Top
In [6]:
X = dataset_class['X']
y = dataset_class['y']

result = []
index = []

for max_depth in range(1, 401)[::50]:
    yhat = np.zeros(y.shape, dtype=int)  # we will fill this with predictions
    cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
    for train_index, test_index in cv.split(X, y):
        # the n_estimators value was cut off in the export; 10 is assumed here
        clf = RandomForestClassifier(max_depth=max_depth, random_state=seed, n_estimators=10)
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        clf.fit(X_train, y_train)
        yhat[test_index] = clf.predict(X_test)

    f1_score = mt.f1_score(y, yhat, average='weighted')
    print('max_depth:', max_depth, 'F1 score:', f1_score)
    result.append(f1_score)
    index.append(max_depth)

plt.title('F1 score for different max_depth')
pd.Series(result, index=pd.Index(index, name='max_depth'), name='f1_score').plot()

max_depth: 1 F1 score: 0.46120951624
max_depth: 51 F1 score: 0.465099779665
max_depth: 101 F1 score: 0.481927336387
max_depth: 151 F1 score: 0.489103832227
max_depth: 201 F1 score: 0.490156681956
max_depth: 251 F1 score: 0.49059336761
max_depth: 301 F1 score: 0.490213068185
max_depth: 351 F1 score: 0.490623495801

Definition and optimization of Naive Bayes (Gaussian)
Naive Bayes has no hyper-parameters to tune; it uses maximum likelihood estimates of the class-conditional distributions to classify the test set. We will report its F1 score with a confidence interval here and inspect the results of this model in more detail in the next section.
Naive Bayes has no parameters to tune and has an F1 score of 0.46.
⏫ Back to Top

In [8]:
X = dataset_class['X']
y = dataset_class['y']

yhat = np.zeros(y.shape)  # we will fill this with predictions
cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
scores = []
for train_index, test_index in cv.split(X, y):
    clf = GaussianNB()
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    f1_score = mt.f1_score(y_test, clf.predict(X_test), average='weighted')
    scores.append(f1_score)

scores = np.array(scores)
print("F1 score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

F1 score: 0.46 (+/- 0.00)

Definition and optimization of Regression models:
⏫ Back to Top

Definition and optimization of K Nearest Neighbors
K Nearest Neighbors regression predicts the target of each test sample from the targets of its n_neighbors closest training samples. In order to fit the model in a reasonable amount of time, we shrank the dataset to 100 features with PCA. The result of the optimization is printed below. We chose the hyper-parameters with the highest R^2 score as the optimal parameters. When n_neighbors is 11, MSE is at its lowest at 3159735 and R^2 peaks at 0.913.
⏫ Back to Top
In [17]:
X = dataset_reg['X']
y = dataset_reg['y']

yhat = np.zeros(y.shape)
cv = KFold(n_splits=n_splits, random_state=seed)

for n_neighbors in range(1, 22)[::5]:
    for train_index, test_index in cv.split(X, y):
        clf = KNeighborsRegressor(n_neighbors=n_neighbors)
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        pca = PCA(n_components=100, random_state=seed)
        pca.fit(X_train)
        X_train = pca.transform(X_train)
        X_test = pca.transform(X_test)

        clf.fit(X_train, y_train)
        yhat[test_index] = clf.predict(X_test)

    print("n_neighbors: %.f, MSE: %.f, R^2: %0.3f" % (n_neighbors, mean_squared_error(y, yhat), r2_score(y, yhat)))

n_neighbors: 1, MSE: 4097283, R^2: 0.887
n_neighbors: 6, MSE: 3169526, R^2: 0.912
n_neighbors: 11, MSE: 3159735, R^2: 0.913
n_neighbors: 16, MSE: 3223971, R^2: 0.911
n_neighbors: 21, MSE: 3236105, R^2: 0.910

Definition and optimization of Random Forest
Random forest regression predicts the target of each test sample from an ensemble of decision trees whose depth is limited by max_depth. The result of the optimization is printed below. We chose the hyper-parameters with the highest R^2 score as the optimal parameters. When max_depth is 26, MSE is at its lowest at 2684615 and R^2 peaks at 0.9257.
⏫ Back to Top
In [21]:
X = dataset_reg['X']
y = dataset_reg['y']

yhat = np.zeros(y.shape)
cv = KFold(n_splits=n_splits, random_state=seed)

for max_depth in range(1, 42)[::5]:
    for train_index, test_index in cv.split(X, y):
        clf = RandomForestRegressor(max_depth=max_depth, n_estimators=5, random_state=seed)
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        clf.fit(X_train, y_train)
        yhat[test_index] = clf.predict(X_test)

    print("max_depth: %.f, MSE: %.f, R^2: %0.4f" % (max_depth, mean_squared_error(y, yhat), r2_score(y, yhat)))

max_depth: 1, MSE: 15160579, R^2: 0.5804
max_depth: 6, MSE: 3376013, R^2: 0.9066
max_depth: 11, MSE: 2848552, R^2: 0.9212
max_depth: 16, MSE: 2715775, R^2: 0.9248
max_depth: 21, MSE: 2712884, R^2: 0.9249
max_depth: 26, MSE: 2684615, R^2: 0.9257
max_depth: 31, MSE: 2708345, R^2: 0.9250
max_depth: 36, MSE: 2689061, R^2: 0.9256
max_depth: 41, MSE: 2708431, R^2: 0.9250

Definition and optimization of Gaussian Regression
Gaussian Process Regression predicts the target by fitting a Gaussian process to the training data, where the noise parameter alpha can be tuned. The result of the optimization is printed below. We chose the hyper-parameters with the lowest MSE as the optimal parameters. When alpha is 1e-15, MSE is at its lowest at 35488348 and R^2 peaks at 0.0177.
⏫ Back to Top
In [6]:
X = dataset_reg['X']
y = dataset_reg['y']

yhat = np.zeros(y.shape)
cv = KFold(n_splits=n_splits, random_state=seed)

for alpha in np.linspace(1e-15, 0.001, 5):
    for train_index, test_index in cv.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        # train on a subset of the training data because it otherwise requires too much memory
        X_train = X_train.iloc[:2000]
        y_train = y_train.iloc[:2000]

        clf = GaussianProcessRegressor(normalize_y=True, alpha=alpha, random_state=seed)
        clf.fit(X_train, y_train)
        yhat[test_index] = clf.predict(X_test)

    print("alpha: %f, MSE: %.f, R^2: %0.4f" % (alpha, mean_squared_error(y, yhat), r2_score(y, yhat)))

alpha: 0.000000, MSE: 35488348, R^2: 0.0177
alpha: 0.000250, MSE: 35488410, R^2: 0.0177
alpha: 0.000500, MSE: 35488472, R^2: 0.0177
alpha: 0.000750, MSE: 35488534, R^2: 0.0177
alpha: 0.001000, MSE: 35488596, R^2: 0.0177

Visualizations of Results and Analysis
10 points
Description: Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.
⏫ Back to Top

Analysis of Classification models:
For each classification model we display a bar graph of the count of each predicted class. A good model should show a very small count for the center bar (~5% of values) and a majority in the last bar (61% of values). The other visualization we display is an ROC curve, which plots the false positive rate on the x axis and the true positive rate on the y axis. In an ROC plot, the area under the curve summarizes performance, so we can quickly determine which model's curves sit farther above the y = x line.
⏫ Back to Top
Results and Analysis of a Dummy model
This model only predicts the most frequent class. It is used as a baseline to compare the other models against. The dummy model, which always predicts class 3101, achieves a higher raw accuracy than some of the classification methods.

In [43]:
X = dataset_class['X']
y = dataset_class['y']

f1_score = mt.f1_score(y, [3101] * len(y), average='weighted')
print_accuracy('Dummy', y, [3101] * len(y), [f1_score])

----------------- Dummy Evaluation -----------------
F1 Score: 0.46 (+/- 0.00)
Accuracy 0.609164097294
Precision 0.371080897432
Recall 0.609164097294

Results and Analysis of K Nearest Neighbors (KD Tree)
All metrics and analysis of the optimized K Nearest Neighbors (KD Tree) model are printed below. For the K Neighbors Classifier with n_neighbors of 12, the F1 Score is 0.49 (+/- 0.01), Accuracy is 0.5422, Precision is 0.4709, and Recall is 0.5422.
⏫ Back to Top
In [12]:
X = dataset_class['X']
y = dataset_class['y']

scores = []
yhat = np.zeros(y.shape)
yhat_score = np.zeros((len(y), 3))
cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = KNeighborsClassifier(n_neighbors=12, algorithm='kd_tree', weights='distance')
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    pca = PCA(n_components=100, random_state=seed)
    pca.fit(X_train)
    X_train = pca.transform(X_train)
    X_test = pca.transform(X_test)

    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    yhat_score[test_index] = clf.predict_proba(X_test)
    f1_score = mt.f1_score(y_test, clf.predict(X_test), average='weighted')
    scores.append(f1_score)

print_accuracy('KD Tree Classifier', y, yhat, scores)
plot_class_acc(y, yhat, clf.classes_, title="KD Tree Classifier")
confusion_matrix(y, yhat, clf.classes_)
roc_curve(y, yhat, clf)

----------------- KD Tree Classifier Evaluation -----------------
F1 Score: 0.49 (+/- 0.01)
Accuracy 0.54224049332
Precision 0.470945268342
Recall 0.54224049332
Results and Analysis of Random Forest
All metrics and analysis of the optimized Random Forest Classifier are printed below. For the Random Forest Classifier when max_depth is 250 and n_estimators is 40, the F1 Score is 0.49 (+/- 0.00), Accuracy is 0.5596, Precision is 0.4708, and Recall is 0.5596.
⏫ Back to Top
In [13]:
X = dataset_class['X']
y = dataset_class['y']

scores = []
yhat = np.zeros(y.shape)
yhat_score = np.zeros((len(y), 3))
cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = RandomForestClassifier(random_state=seed, max_depth=250, n_estimators=40)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    yhat_score[test_index] = clf.predict_proba(X_test)
    f1_score = mt.f1_score(y_test, clf.predict(X_test), average='weighted')
    scores.append(f1_score)

print_accuracy('Random Forest Classifier', y, yhat, scores)
plot_class_acc(y, yhat, clf.classes_, title="Random Forest Classifier")
confusion_matrix(y, yhat, clf.classes_)
roc_curve(y, yhat, clf)

----------------- Random Forest Classifier Evaluation -----------------
F1 Score: 0.49 (+/- 0.00)
Accuracy 0.559643713601
Precision 0.470878065706
Recall 0.559643713601
Results and Analysis of Naive Bayes
All metrics and analysis of the Naive Bayes Classifier are printed below. For the Naive Bayes Classifier, the F1 Score is 0.46 (+/- 0.00), Accuracy is 0.6047, Precision is 0.4596, and Recall is 0.6047.
⏫ Back to Top
In [14]:
from sklearn.naive_bayes import GaussianNB

X = dataset_class['X']
y = dataset_class['y']

scores = []
yhat = np.zeros(y.shape)
yhat_score = np.zeros((len(y), 3))
cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = GaussianNB()
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    yhat_score[test_index] = clf.predict_proba(X_test)
    f1_score = mt.f1_score(y_test, clf.predict(X_test), average='weighted')
    scores.append(f1_score)

print_accuracy('GaussianNB Classifier', y, yhat, scores)
plot_class_acc(y, yhat, clf.classes_, title="GaussianNB Classifier")
confusion_matrix(y, yhat, clf.classes_)
roc_curve(y, yhat, clf)

----------------- GaussianNB Classifier Evaluation -----------------
F1 Score: 0.46 (+/- 0.00)
Accuracy 0.604727646454
Precision 0.459602898787
Recall 0.604727646454
Analysis of Regression models:
⏫ Back to Top

Results and Analysis of K Nearest Neighbors
The evaluation metrics for the optimized model are printed below. After PCA, K Nearest Neighbors Regression with n_neighbors of 16 has an MSE of 3223970 (+/- 395957) and an R^2 of 0.91 (+/- 0.01).
⏫ Back to Top
In [16]:
X = dataset_reg['X']
y = dataset_reg['y']

mses = []
r2s = []
yhat = np.zeros(y.shape)  # we will fill this with predictions
cv = KFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    pca = PCA(n_components=100, random_state=seed)
    pca.fit(X_train)
    X_train = pca.transform(X_train)
    X_test = pca.transform(X_test)

    clf = KNeighborsRegressor(n_neighbors=16)
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    mses.append(mean_squared_error(y_test, clf.predict(X_test)))
    r2s.append(r2_score(y_test, clf.predict(X_test)))

print('Evaluation metrics:')
print("MSE: %0.2f (+/- %0.2f)" % (np.mean(mses), np.std(mses) * 2))
print("R2: %0.2f (+/- %0.2f)" % (np.mean(r2s), np.std(r2s) * 2))

Evaluation metrics:
MSE: 3223970.51 (+/- 395957.39)
R2: 0.91 (+/- 0.01)

Results and Analysis of Random Forest
The evaluation metrics for the optimized model are printed below. Random Forest Regression with max_depth of 26 and n_estimators of 5 has an MSE of 2684614 (+/- 514568) and an R^2 of 0.93 (+/- 0.02).
⏫ Back to Top
In [4]:
X = dataset_reg['X']
y = dataset_reg['y']

mses = []
r2s = []
yhat = np.zeros(y.shape)  # we will fill this with predictions
cv = KFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = RandomForestRegressor(max_depth=26, n_estimators=5, random_state=seed)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    mses.append(mean_squared_error(y_test, clf.predict(X_test)))
    r2s.append(r2_score(y_test, clf.predict(X_test)))

print('Evaluation metrics:')
print("MSE: %0.2f (+/- %0.2f)" % (np.mean(mses), np.std(mses) * 2))
print("R2: %0.2f (+/- %0.2f)" % (np.mean(r2s), np.std(r2s) * 2))

Evaluation metrics:
MSE: 2684614.88 (+/- 514568.63)
R2: 0.93 (+/- 0.02)

Residuals distribution plot
The plot shows the residuals for predicting the target variable "taxamount".

In [5]:
f, ax = plt.subplots(nrows=1, ncols=2, figsize=[15, 7])
residuals = yhat - y
sns.distplot(residuals, ax=ax[0])
sns.boxplot(data=residuals, ax=ax[1]);

Results and Analysis of Gaussian Regression
The evaluation metrics for the optimized model are printed below. Gaussian Regression with alpha of 1e-15 and normalize_y set to True has an MSE of 35917993 (+/- 1636329) and an R^2 of 0.01 (+/- 0.00).
⏫ Back to Top

In [18]:
X = dataset_reg['X']
y = dataset_reg['y']

mses = []
r2s = []
yhat = np.zeros(y.shape)  # we will fill this with predictions
cv = KFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = GaussianProcessRegressor(alpha=1e-15, normalize_y=True, random_state=seed)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # we train on a subset because it otherwise requires too much memory
    X_train = X_train.iloc[:1000]
    y_train = y_train.iloc[:1000]

    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    mses.append(mean_squared_error(y_test, clf.predict(X_test)))
    r2s.append(r2_score(y_test, clf.predict(X_test)))

print("MSE: %0.2f (+/- %0.2f)" % (np.mean(mses), np.std(mses) * 2))
print("R2: %0.2f (+/- %0.2f)" % (np.mean(r2s), np.std(r2s) * 2))

MSE: 35917993.55 (+/- 1636329.34)
R2: 0.01 (+/- 0.00)

Advantages of Each Model
10 points
Description: Discuss the advantages of each model for each classification task, if any. If there are not advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods. You must use statistical comparison techniques—be sure they are appropriate for your chosen method of validation as discussed in unit 7 of the course.
⏫ Back to Top

Advantages of Classification models:

K Nearest Neighbors - KD tree
K nearest neighbors classification is different from other classification models in that it does not attempt to build an explicit model; it only stores instances of the training data. The KD tree variant organizes the training data into a tree in which each node splits the data along one of its dimensions, so the k nearest neighbors of a query point can be found without comparing it against every training sample; each prediction is then a vote among the classes of those neighbors. The advantages of K nearest neighbors are that it is simple and converges to the correct decision surface as the amount of data goes to infinity. It works with multiclass datasets and with faster indexing structures such as the KD tree, which is what we used to speed up the KNN classification.

Why we did PCA with KNN
KNN computes the distance between samples across every dimension. We have so many dimensions in our dataset that even with 100 neighbors the accuracy was still growing and each run was taking a long time. We wanted to keep increasing the number of neighbors until the accuracy gain plateaued, but without reducing the dimensions using PCA it would have taken too long.

Random forest
Random forest is an ensemble classification algorithm, which by its nature is a huge advantage because an ensemble of decision trees will usually outperform a single decision tree. Another advantage is that the forest can often correct an individual tree's overfitting of the training set.

Gaussian Naive Bayes
Naive Bayes' largest advantage is that it is extremely simple; it essentially just counts up probabilities. When training sets are small, Naive Bayes works well because its high bias and low variance will not overfit the training data. However, as datasets grow larger, such as ours, the high bias prevents the model from being powerful enough to reach high accuracy. Gaussian Naive Bayes is an NB classifier that assumes normally distributed features. Its advantages are that it is fast and can make probabilistic decisions.

Model Comparisons
As stated above, we will compare our models based on the F1 and accuracy values. Starting with the F1 values, GaussianNB is statistically significantly lower than the random forest and the KD tree: GaussianNB has an F1 of 0.46 (+/- 0.00) while the other models have an F1 of 0.49 (+/- 0.01). There is no significant difference in F1 between the KD tree and the random forest.

Next, we compare the accuracy between the KD tree and the random forest. Each was run on the same number of instances, 58380, with final accuracies of 0.542 and 0.559 respectively. Using the normal approximation for a proportion, the variance of each accuracy estimate is p(1-p)/n: (0.542)(0.458)/58380 = 0.00000425 and (0.559)(0.441)/58380 = 0.00000422, giving a standard error of about 0.00206 for each. The 95% confidence intervals are therefore:
KD Tree accuracy: 0.5422 +/- 1.96(0.00206) = [0.538, 0.546]
Random forest accuracy: 0.5596 +/- 1.96(0.00206) = [0.556, 0.564]
The intervals do not overlap, so our final winner is the Random Forest, which is significantly better in F1 than Gaussian Naive Bayes and significantly better in accuracy than the KD tree. A short sketch of this interval calculation follows.
⏫ Back to Top
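A minimal sketch (not from the original notebook) of the normal-approximation 95% confidence interval used above; the accuracies and sample size are the ones reported in this section:

import math

def accuracy_ci(acc, n, z=1.96):
    # normal-approximation 95% confidence interval for an accuracy estimate
    se = math.sqrt(acc * (1 - acc) / n)
    return acc - z * se, acc + z * se

n = 58380
for name, acc in [('KD Tree', 0.5422), ('Random Forest', 0.5596)]:
    low, high = accuracy_ci(acc, n)
    print('%s accuracy 95%% CI: [%.4f, %.4f]' % (name, low, high))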
Advantages of Regression models:

K Nearest Neighbors
The advantages of K nearest neighbors are that it is non-parametric and can handle missing and unusual data for regression prediction. Dimensionality reduction can be used to speed up the modeling process, because the model can be trained on the reduced feature space produced by PCA.

Why we did PCA with KNN
KNN computes the distance between samples across every dimension. We have so many dimensions in our dataset that even with 100 neighbors the accuracy was still growing and each run was taking a long time. We wanted to keep increasing the number of neighbors until the accuracy gain plateaued, but without reducing the dimensions using PCA it would have taken too long.

Random forest
The advantages of Random forest are that, by averaging multiple trees, it reduces overfitting, reduces variance from outliers, and is therefore more accurate. It gives an unbiased estimate of the generalization error during the forest-building process and provides effective methods for estimating missing data. Random forest can also be extended to unlabeled data, leading to unsupervised clustering.

Gaussian Regression
The advantages of Gaussian Regression are that it is fast and uses less CPU and runtime; however, it is better suited to data with normal distributions. It provides a full probabilistic prediction and interpolates the observations for faster prediction.

Model Comparisons
As stated above, we will compare our models based on the MSE and R^2 values. Random Forest Regression with a max_depth of 26 and n_estimators of 5 yielded the lowest MSE of 2684614 (+/- 514568) and the highest R^2 of 0.93 (+/- 0.02). So our final winner is the Random Forest, which is significantly better in MSE and R^2 than both K Nearest Neighbors and Gaussian Regression. A per-fold comparison sketch follows.
⏫ Back to Top
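A minimal sketch (not from the original notebook) of a paired comparison on the per-fold MSEs; `mses_rf` and `mses_knn` are hypothetical names for copies of the `mses` lists collected in the Random Forest and KNN regression cells above, since the same K-Fold splits were used for both models:

from scipy import stats

def paired_mse_test(mses_a, mses_b):
    # paired t-test on per-fold MSEs from the same K-Fold splits
    t_stat, p_value = stats.ttest_rel(mses_a, mses_b)
    return t_stat, p_value

# usage, assuming the per-fold MSE lists were saved after each evaluation cell:
# t, p = paired_mse_test(mses_rf, mses_knn)
# print('t = %.3f, p = %.4f' % (t, p))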
Important Attributes
10 points
Description: Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.
⏫ Back to Top

Feature importance for the classification dataset according to Random Forest
The top feature was tax amount, with an importance slightly above 0.08. We think this is the most important feature for classifying county because in the state of California, where all of the counties are located, tax rates are set at the county and city level. The next 3 most important features also relate to taxes, each with an importance slightly below 0.08. The 3 features after that are all related to square footage and year built, which we think goes back to builders and the demographics of the area: each county could have one dominant builder for its neighborhoods, or builders who matched the styles of the homes around them. The number of bedrooms and bathrooms is probably significant because each county could have its own demographic of family sizes. Close to a larger city we may see more singles or couples with fewer bedrooms and baths, while counties farther into suburbia may have more kids and thus more bedrooms and bathrooms.
⏫ Back to Top
In [15]:
X = dataset_class['X']
y = dataset_class['y']

clf = RandomForestClassifier(random_state=seed, max_depth=250)
clf.fit(X, y)

importances = clf.feature_importances_
importances = pd.Series(importances, index=X.columns)
importances.sort_values(ascending=False).iloc[:10].plot(kind='bar')

Out[15]: <matplotlib.axes._subplots.AxesSubplot at 0x104148ac8>
[bar chart of the top 10 feature importances for the classification task]

Feature importance for the regression dataset according to Random Forest
The top feature for taxamount was taxvaluedollarcnt, with an importance just below 0.9. The next three important features were longitude, latitude, and calculated finished square feet, at significantly lower importance levels (less than 0.1). The tax amount is derived from the assessed value, so it makes sense that the total tax assessed value is the dominant predictor: we would expect the two to track each other closely. Longitude and latitude are likely important because in California tax rates are set at the county and city level, so rates vary with location. The calculated square footage is also important because higher square footage usually means a larger home, which is more likely to carry a higher tax value than a smaller home.
⏫ Back to Top
In [16]:
X = dataset_reg['X']
y = dataset_reg['y']

clf = RandomForestRegressor(max_depth=26, n_estimators=5, random_state=seed)
clf.fit(X, y)

importances = clf.feature_importances_
importances = pd.Series(importances, index=X.columns)
importances.sort_values(ascending=False).iloc[:10].plot(kind='bar')

Out[16]: <matplotlib.axes._subplots.AxesSubplot at 0x10a707278>
[bar chart of the top 10 feature importances for the regression task]

Deployment
5 points
Description: How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it was used by these parties? How would you deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.?
⏫ Back to Top

Since Zillow began combining publicly available real estate data from disparate sources into a single platform, the gap between sellers' prices and buyers' offer prices has significantly decreased. The Zillow dataset was provided for the purpose of evaluating Zestimate's accuracy based upon the variable logerror, which is the difference log(Zestimate) - log(SalePrice). For the purposes of this lab assignment, we developed regression models with taxamount as the response. In our
classification model, we determined the important features for predicting regionidcounty.

For companies in the real estate space, classification models based on physical attributes provide valuable insight for buying, selling, and investment decisions. Our classification model can be adapted to more granular levels such as cities and municipalities. Buyers, sellers, and investors alike can gain insight into which features have the highest importance in specific locations. This may drive investment decisions, knowing how important certain attributes are for targeted locations. Knowing which features are highly important in certain locations can also drive remodeling decisions to make properties more attractive to potential buyers. The value-add of this model for these companies can be measured in terms of returns on investment.

Deployment of the model can be valuable for the rental market as well, where Airbnb could direct marketing efforts to areas with specific property attributes. Deployment of the model can also be used to provide the break-even horizon for rent-versus-own decisions. In addition, loan refinancing companies can use this model along with Zillow's liens and taxes database to target homeowners in specific areas.

To further improve the effectiveness of the model, we should expand it to include sales prices, liens, and taxes, as well as identify biased data such as short sales, foreclosures, and non-arm's-length transactions (i.e. sales to relatives). All of these are readily available from Zillow, which collects an enormous amount of data updated with high frequency. For our models to be relevant in this space, they should be updated daily, just as Zillow does with its 7 to 11 million models.

Exceptional Work
10 points
Description: You have free reign to provide additional analyses. One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?
⏫ Back to Top

Approaches Considered for Balanced Classification
⏫ Back to Top
One of the shortcomings of classification, and of K-nearest neighbors in particular, is the tendency to bias in favor of the majority class. To eliminate the bias, a number of approaches can be utilized, including StratifiedKFold, which is the approach we use for our classification models. To be thorough, we also explored a set of "imbalanced learn" algorithms: imblearn.RandomOverSampler, imblearn.SMOTE, and imblearn.ADASYN.

1. StratifiedKFold - A variant of KFold, this ensures each class is represented proportionally as the algorithm performs each fold. Stratification is performed on the
training dataset "on the fly", as opposed to performing it as part of data preprocessing.
2. imblearn.RandomOverSampler - As a separate package, imblearn was developed to address the problem of imbalanced data sets; the resampling is performed at data preprocessing time. RandomOverSampler, in particular, performs naive over-sampling with replacement, duplicating original samples from the minority class. (Under-sampling is the alternate approach.)
3. imblearn.SMOTE - SMOTE compensates for classes that are difficult to separate by generating synthetic minority samples, optionally combined with Tomek's links or edited nearest neighbours cleaning methods.
4. imblearn.ADASYN - Adaptive Synthetic Sampling Approach (ADASYN) is similar to SMOTE in that it generates samples by interpolation, but it focuses on the wrongly classified k-nearest neighbors.

After considering these methods, we settled on StratifiedKFold for simplicity, since accuracies across the different approaches were practically equivalent. A minimal usage sketch follows, and below that is an illustration of imblearn's RandomOverSampler algorithm in action.
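A minimal usage sketch (not from the original notebook) of the over-sampling step, assuming a recent imblearn version where the resampling method is named fit_resample; the toy arrays are placeholders standing in for the county classification data:

import numpy as np
from imblearn.over_sampling import RandomOverSampler

# toy imbalanced data: 15 samples of class 0, 5 of class 1
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.array([0] * 15 + [1] * 5)

ros = RandomOverSampler(random_state=0)
X_res, y_res = ros.fit_resample(X_toy, y_toy)  # minority class is duplicated until the classes balance

print('before:', np.bincount(y_toy))   # [15  5]
print('after :', np.bincount(y_res))   # [15 15]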
In [9]:
plt.figure(figsize=(12,16))
plt.imshow(imread('../../input/imblearn.png'))  # just in case you don't see the image inline

Out[9]: <matplotlib.image.AxesImage at 0x11adee128>
[illustration of imblearn's RandomOverSampler]

Feature ranking with recursive feature elimination
Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached. In this example we first select the top 20 features and then train a Random Forest using only those features. The performance of the model is printed below.
⏫ Back to Top
In [36]:
X = dataset_class['X'].iloc[:2000]
y = dataset_class['y'].iloc[:2000]

# the n_estimators value for the RFE estimator was cut off in the export; 10 is assumed here
estimator = RandomForestClassifier(max_depth=10, random_state=seed, n_estimators=10)
selector = RFE(estimator, n_features_to_select=20, step=1)
selector = selector.fit(X, y)

X = dataset_class['X']
y = dataset_class['y']
X = X[X.columns[selector.support_]]

scores = []
yhat = np.zeros(y.shape)
yhat_score = np.zeros((len(y), 3))
cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = RandomForestClassifier(random_state=seed, max_depth=250, n_estimators=40)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    yhat_score[test_index] = clf.predict_proba(X_test)
    f1_score = mt.f1_score(y_test, clf.predict(X_test), average='weighted')
    scores.append(f1_score)

print_accuracy('Random Forest Classifier (top 20 RFE features)', y, yhat, scores)
plot_class_acc(y, yhat, clf.classes_, title="Random Forest Classifier (top 20 RFE features)")
confusion_matrix(y, yhat, clf.classes_)
roc_curve(y, yhat, clf)

----------------- Random Forest Classifier (top 20 RFE features) Evaluation -----------------
F1 Score: 0.49 (+/- 0.00)
Accuracy 0.559164097294
Precision 0.469948908187
Recall 0.559164097294
Two dimensional Linear Discriminant Analysis
The idea is to see whether there are separable clusters by class. The colors green, blue, and red separate the 3 counties after projection by LDA, to see whether any unique clusters or definite patterns form on a 2D plane.
⏫ Back to Top
In [37]:
X = dataset_class['X']
y = dataset_class['y']

lde = LDA(n_components=2)
X_lde = lde.fit(X, y).transform(X)

colors = y.astype(str)
colors[colors=='3101'] = 'g'
colors[colors=='2061'] = 'b'
colors[colors=='1286'] = 'r'

plt.scatter(X_lde[:, 1], X_lde[:, 0], s=2, c=colors);

References:
Kernels from the Kaggle competition: https://www.kaggle.com/c/zillow-prize-1/kernels
Scikit-learn logistic regression: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
Scikit-learn linear SVC: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
Stack Overflow pandas questions: https://stackoverflow.com/questions/tagged/pandas
Deployment reference: http://www.zdnet.com/article/zillow-machine-learning-and-data-in-real-estate/
Advantages of GaussianProcessRegression: http://scikit-learn.org/stable/modules/gaussian_process.html
Advantages of GaussianProcessRegression: https://stats.stackexchange.com/questions/207183/main-advantages-of-gaussian-process-models
Advantages of GaussianProcessRegression: https://www.quora.com/What-are-some-advantages-of-using-Gaussian-Process-Models-vs-SVMs
Advantages of RandomForestRegression: https://www.quora.com/What-are-some-advantages-of-using-a-random-forest-over-a-decision-tree-given-that-a-decision-tree-is-simpler
Advantages of RandomForestRegression: https://www.datasciencecentral.com/profiles/blogs/random-forests-algorithm
Advantages of KNeighborsRegression: https://stats.stackexchange.com/questions/104255/why-would-anyone-use-knn-for-regression
Advantages of KNeighborsRegression: https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/
Advantages of KNeighborsRegression: https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/#pros-and-cons-of-knn
Imbalanced Learn: http://contrib.scikit-learn.org/imbalanced-learn/stable/install.html
⏫ Back to Top