Lab 2: Zillow Dataset Classification and
Regression Prediction Models
MSDS 7331 Data Mining - Section 403 - Lab 2
Team: Ivelin Angelov, Yao Yao, Kaitlin Kirasich, Albert Asuncion
Contents
Imports
Define and Prepare Class Variables
Classification Variables
Regression Variables
Describe the Final Dataset
Classification Dataset
Regression Dataset
Explain Evaluation Metrics
Classification Metrics
Regression Metrics
Training and Testing Splits
For Classification
For Regression
Three Different Classification/Regression Models
Classification Models
K Nearest Neighbors
Random Forest
Naive Bayes
Regression Models
K Nearest Neighbors
Random Forest
Gaussian Regression
Visualizations of Results and Analysis
Analysis of Classification Models
Analysis of K Nearest Neighbors
Analysis of Random Forest
Analysis of Naive Bayes
Regression Models
Analysis of K Nearest Neighbors
Analysis of Random Forest
Analysis of Gaussian Regression
Advantages of Each Model
Classification Models
Regression Models
Important Attributes
Classification Models
Regression Models
Deployment
Exceptional Work
Approaches Considered for Balanced Classification
Feature Elimination
Two dimensional Linear Discriminant Analysis
References
Imports & Custom Functions
We chose to use the same Zillow dataset from Lab 1 for this exploration in regression and
classification. For the origin and purpose of dataset as well as a detailed description of the dataset,
refer to https://github.com/post2web/data_mining_group_project/blob/master/notebooks/lab1.ipynb
(https://github.com/post2web/data_mining_group_project/blob/master/notebooks/lab1.ipynb).
The function output_variables_table lists whether each variable is nominal, ordinal, interval, or ratio for further use in classification or regression. The functions per_class_accuracy and confusion_matrix build the confusion table of correctly and incorrectly identified classification predictions. The function plot_class_acc plots the per-class classification accuracies. The function plot_feature_importance plots the feature importances reported by a tree ensemble. The function print_accuracy prints the accuracy scores of the classification models. The function get_dataset_subset obtains a subset of the full dataset for modeling and prediction.
We will be using a seed of 0. Because our dataset is extremely large, we use 5 cross-validation folds so that CPU usage and runtime remain manageable when running the prediction models for both classification and regression.
In [1]: %matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML
from sklearn.model_selection import train_test_split
from sklearn import metrics as mt
# classification imports
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, KFold, cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA, SparsePCA
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import mean_squared_error, r2_score
# regression imports
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from scipy.ndimage import imread
import warnings
warnings.filterwarnings("ignore")
def output_variables_table(variables):
    variables = variables.sort_index()
    rows = ['<tr><th>Variable</th><th>Type</th><th>Scale</th><th>Description</th></tr>']
    for vname, atts in variables.iterrows():
        if vname not in dataset.columns:
            continue
        atts = atts.to_dict()
        # fill in the scale if it is still marked TBD
        if atts['scale'] == 'TBD':
            if atts['type'] in ['nominal', 'ordinal']:
                uniques = dataset[vname].unique()
                uniques = list(uniques.astype(str))
                if len(uniques) < 10:
                    atts['scale'] = '[%s]' % ', '.join(uniques)
                else:
                    # show the first five values and how many more there are
                    atts['scale'] = '[%s]' % (', '.join(uniques[:5]) + ', ... (%d More)' % (len(uniques) - 5))
            if atts['type'] in ['ratio', 'interval']:
                atts['scale'] = '(%d, %d)' % (dataset[vname].min(), dataset[vname].max())
        row = (vname, atts['type'], atts['scale'], atts['description'])
        rows.append('<tr><td>%s</td><td>%s</td><td>%s</td><td>%s</td></tr>' % row)
    return HTML('<table>%s</table>' % ''.join(rows))
# Define an accuracy plot
def per_class_accuracy(ytrue, yhat):
    conf = mt.confusion_matrix(ytrue, yhat)
    norm_conf = conf.astype('float') / conf.sum(axis=1)[:, np.newaxis]
    return np.diag(norm_conf)

def plot_class_acc(ytrue, yhat, classes, title=''):
    acc_list = per_class_accuracy(ytrue, yhat)
    pd.DataFrame(acc_list, index=pd.Index(classes, name='Classes')).plot(kind='bar')
    plt.xlabel('Class value (one per face)')
    plt.ylabel('Accuracy within class')
    plt.title(title + ", Total Acc=%.1f" % (100 * mt.accuracy_score(ytrue, yhat)))
    plt.grid()
    plt.ylim([0, 1])
    plt.show()

# Plot the feature importances of the forest
def plot_feature_importance(ytrue, yhat, rt, title=''):
    importances = rt.feature_importances_
    std = np.std([tree.feature_importances_ for tree in rt.estimators_], axis=0)
    indices = np.argsort(importances)[::-1]
    for f in range(X.shape[1]):
        print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
    plt.figure()
    plt.title("Feature importances")
    plt.bar(range(X.shape[1]), importances[indices],
            color="r", yerr=std[indices], align="center")
    plt.xticks(range(X.shape[1]), indices)
    plt.xlim([-1, X.shape[1]])
    plt.show()

def print_accuracy(model_name, y_test, yhat, scores):
    scores = np.array(scores)
    print('----------------- %s Evaluation -----------------' % model_name)
    print(" F1 Score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
    print(' Accuracy', mt.accuracy_score(y_test, yhat))
    print(' Precision', mt.precision_score(y_test, yhat, average='weighted'))
    print(' Recall', mt.recall_score(y_test, yhat, average='weighted'))

def confusion_matrix(ytrue, yhat, classes):
    index = pd.MultiIndex.from_product([['True Class'], classes])
    columns = pd.MultiIndex.from_product([['Predicted Class'], classes])
    return pd.DataFrame(mt.confusion_matrix(ytrue, yhat), index=index, columns=columns)

def roc_curve(ytrue, yhat, clf):
    for i, label in enumerate(clf.classes_):
        fpr, tpr, _ = mt.roc_curve(ytrue, yhat_score[:, i], pos_label=label)
        roc_auc = mt.auc(fpr, tpr)
        plt.plot(fpr, tpr, label='class {0} with {1} instances (area = {2:0.2f})'
                 ''.format(label, sum(ytrue == label), roc_auc))
    plt.title('ROC Curve')
    plt.legend(loc="lower right")
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.show()

def get_dataset_subset(dataset, n=1000):
    return {
        'X': dataset['X'].iloc[:n],
        'y': dataset['y'].iloc[:n]
    }

seed = 0
n_splits = 5

Define and Prepare Class Variables
10 points
Description:
Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis.
⏫ Back to Top
Classification Datasets:
The classification dataset removes logerror and transactiondate because they were created for the Kaggle competition and are not complete for the training set. The columns created as "New Features" in Lab 1 (city and price_per_sqft) were also removed so that only original data is used in the prediction process. The table generated below shows the type of each variable used for classification.
The dataset has 58380 rows and 1757 columns. All variables and details about the variables are printed in the table below.
The target class "regionidcounty" has three possible values: 1286, 2061, or 3101, representing three different county codes. The distribution is skewed: code 1286 has 17749 observations, 3101 has 35563, and 2061 has only 5068.
⏫ Back to Top
In [5]:
Dataset shape: (58380, 1757)
regionidcounty
1286 17749
2061 5068
3101 35563
Name: regionidcounty, dtype: int64
variables = pd.read_csv('../../datasets/variables.csv').set_index('name')
dataset = pd.read_csv('../../datasets/train.csv', low_memory=False)

# remove unneeded variables
del dataset['Unnamed: 0']
del dataset['logerror']
del dataset['transactiondate']
del dataset['city']
del dataset['price_per_sqft']

# delete all location information because we want to predict the county
# and those features would give it away too easily
y = dataset['regionidcounty'].copy()
del dataset['regionidcounty']
del dataset['regionidcity']
del dataset['regionidzip']
del dataset['regionidneighborhood']
del dataset['rawcensustractandblock']
del dataset['latitude']
del dataset['longitude']

output_variables = output_variables_table(variables)

nominal = variables[variables['type'].isin(['nominal'])]
nominal = nominal[nominal.index.isin(dataset.columns)]
continuous = variables[~variables['type'].isin(['nominal'])]
continuous = continuous[continuous.index.isin(dataset.columns)]

nominal_data = dataset[nominal.index]
nominal_data = pd.get_dummies(nominal_data, drop_first=True)
nominal_data = nominal_data[nominal_data.columns[~nominal_data.columns.isin(nominal.index)]]
continuous_data = dataset[continuous.index]

dataset = pd.concat([continuous_data, nominal_data], axis=1)
columns = dataset.columns
variables = variables[variables.index.isin(dataset.columns)]

# shuffle the dataset (just in case)
X = dataset.sample(frac=1, random_state=seed)

dataset_class = {
    'X': X,
    'y': y
}

print('Dataset shape:', X.shape)
print(y.groupby(y).size())
output_variables
Out[5]:
| Variable | Type | Scale | Description |
| --- | --- | --- | --- |
| airconditioningtypeid | nominal | [0, 1, 13, 5, 11, 3, 9] | Type of cooling system present in the home (if any) |
| assessmentyear | interval | (2015, 2015) | The year of the property tax assessment |
| bathroomcnt | ordinal | [1.0, 3.5, 2.5, 3.0, 2.0, ... (22 More)] | Number of bathrooms in home including fractional bathrooms |
| bedroomcnt | ordinal | [1, 5, 4, 3, 2, ... (16 More)] | Number of bedrooms in home |
| buildingqualitytypeid | ordinal | [7, 4, 1, 10, 12, 8] | Overall assessment of condition of the building from best (lowest) to worst (highest) |
| calculatedbathnbr | ordinal | [1.0, 3.5, 2.5, 3.0, 2.0, ... (22 More)] | Number of bathrooms in home including fractional bathroom |
| calculatedfinishedsquarefeet | ratio | (0, 10925) | Calculated total finished living area of the home |
| censustractandblock | nominal | [60372040024100.0, 60590991081500.0, 60374078455800.0, 61110052978700.0, 60379010957300.0, ... (445 More)] | Census tract and block ID combined - contains blockgroup assignment by extension |
| finishedsquarefeet12 | ratio | (0, 6615) | Finished living area |
| finishedsquarefeet50 | ratio | (0, 8352) | Size of the finished living area on the first (entry) floor of the home |
| fips | nominal | [6037, 6059, 6111] | Federal Information Processing Standard code - see https://en.wikipedia.org/wiki/FIPS_county_code for more details |
| fireplacecnt | ordinal | [0, 1, 2, 3, 5, 4] | Number of fireplaces in a home (if any) |
| fullbathcnt | ordinal | [1.0, 3.0, 2.0, 6.0, 4.0, ... (17 More)] | Number of full bathrooms (sink, shower + bathtub, and toilet) present in home |
| garagecarcnt | ordinal | [0.0, 2.0, 1.0, 4.0, 3.0, ... (14 More)] | Total number of garages on the lot including an attached garage |
| garagetotalsqft | ratio | (0, 1610) | Total number of square feet of all garages on lot including an attached garage |
| hashottuborspa | ordinal | [0, 1] | Does the home have a hot tub or spa |
| heatingorsystemtypeid | nominal | [7, 0, 2, 6, 24, ... (12 More)] | Type of home heating system |
| landtaxvaluedollarcnt | ratio | (22, 2477536) | The assessed value of the land area of the parcel |
| location_type | nominal | [PRIMARY, nan, NOT ACCEPTABLE, ACCEPTABLE] | Primary, Acceptable, Not Acceptable |
| lotsizesquarefeet | ratio | (0, 1710750) | Area of the lot in square feet |
| numberofstories | ordinal | [1, 2, 3, 4] | Number of stories or levels the home has |
| parcelid | nominal | [11800329, 14058566, 14636635, 17138404, 11270723, ... (49678 More)] | Unique identifier for parcels (lots) |
| poolcnt | ordinal | [0.0, 1.0] | Number of pools on the lot (if any) |
| poolsizesum | ratio | (0, 1476) | Total square footage of all pools on property |
| pooltypeid10 | nominal | [0, 1] | Spa or Hot Tub |
| pooltypeid2 | nominal | [0, 1] | Pool with Spa/Hot Tub |
| pooltypeid7 | nominal | [0, 1] | Pool without hot tub |
| propertycountylandusecode | nominal | [0100, 122, 1, 1111, 010C, ... (71 More)] | County land use code i.e. it's zoning at the county level |
| propertylandusetypeid | nominal | [261, 266, 246, 265, 269, ... (13 More)] | Type of land use the property is zoned for |
| propertyzoningdesc | nominal | [LAR2, 0, LRRA7000*, TOPR-MD, LCA11*, ... (1655 More)] | Description of the allowed land uses (zoning) for that property |
| roomcnt | ordinal | [0, 9, 8, 4, 7, ... (16 More)] | Total number of rooms in the principal residence |
| structuretaxvaluedollarcnt | ratio | (100, 2181198) | The assessed value of the built structure on the parcel |
| taxamount | ratio | (49, 51292) | The total property tax assessed for the assessment year |
| taxdelinquencyflag | nominal | [0, 1] | Property taxes for this parcel are past due as of 2015 |
| taxdelinquencyyear | interval | (0, 26) | Year |
| taxvaluedollarcnt | ratio | (22, 4052186) | The total tax assessed value of the parcel |
| threequarterbathnbr | ordinal | [0, 1, 2, 3, 4] | Number of 3/4 bathrooms in house (shower + sink + toilet) |
| unitcnt | ordinal | [1, 2, 3, 4, 9, 6] | Number of units the structure is built into (i.e. 2 = duplex, 3 = triplex, etc...) |
| yardbuildingsqft17 | interval | (0, 1485) | Patio in yard |
| yardbuildingsqft26 | interval | (0, 1366) | Storage shed/building in yard |
| yearbuilt | interval | (1885, 2015) | The Year the principal residence was built |
| zipcode_type | nominal | [STANDARD, nan, PO BOX, MILITARY, UNIQUE] | Standard, PO BOX Only, Unique, Military (implies APO or FPO) |

Regression Datasets:
The regression dataset removes logerror and transactiondate because they were created for the Kaggle competition and are not complete for the training set. The columns created as "New Features" in Lab 1 (city and price_per_sqft) were also removed so that only original data is used in the prediction process. We are only using nominal and continuous data types for regression purposes.
The dataset has 58380 rows and 1758 columns. All variables and details about the variables are printed in the table below.
⏫ Back to Top
In [13]:
Dataset shape: (58380, 1758)
dataset = pd.read_csv('../../datasets/train.csv', low_memory=False)
variables = pd.read_csv('../../datasets/variables.csv').set_index('name')

# remove unneeded variables
del dataset['logerror']
del dataset['transactiondate']
del dataset['city']
del dataset['price_per_sqft']

output_variables = output_variables_table(variables)

nominal = variables[variables['type'].isin(['nominal'])]
nominal = nominal[nominal.index.isin(dataset.columns)]
continuous = variables[~variables['type'].isin(['nominal'])]
continuous = continuous[continuous.index.isin(dataset.columns)]

nominal_data = dataset[nominal.index]
nominal_data = pd.get_dummies(nominal_data, drop_first=True)
nominal_data = nominal_data[nominal_data.columns[~nominal_data.columns.isin(nominal.index)]]
continuous_data = dataset[continuous.index]

dataset = pd.concat([continuous_data, nominal_data], axis=1)
columns = dataset.columns
variables = variables[variables.index.isin(dataset.columns)]

# shuffle the dataset (just in case)
X = dataset.sample(frac=1, random_state=seed)

y = X['taxamount'].copy()
del X['taxamount']

dataset_reg = {
    'X': X,
    'y': y
}

print('Dataset shape:', X.shape)
plt.title('Distribution of the target variable: taxamount')
y.plot(kind='box')
output_variables
Describe the Final Dataset
5 points
Description:
Describe the final dataset that is used for classification/regression (include a description of any
newly formed variables you created).
⏫ Back to Top
Classification Datasets:
⏫ Back to Top
Since we are using the same Zillow dataset that we used in the previous lab, most of the data was
already cleaned up. However, the purpose of our classification dataset is to predict the county each
property is located in. Therefore our final model removed all columns relating to location such as
latitude, longitude, city, and zipcode. We also removed variables we did not need such as
logerror, transactiondate, and price_per_sqft.
We did not create any new columns for the classification dataset but we did transform the
categorical variables into indicator variables.
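As a tiny illustration of that transformation, the sketch below (using fips, one of the nominal Zillow columns, with made-up rows) shows what pd.get_dummies with drop_first=True produces:

```python
import pandas as pd

# Tiny illustration of the indicator-variable transformation used above.
# fips is one of the nominal Zillow columns; the rows here are made up.
demo = pd.DataFrame({'fips': [6037, 6059, 6111, 6037]})
# drop_first=True removes one redundant indicator level per variable
print(pd.get_dummies(demo, columns=['fips'], drop_first=True))
```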
The final shape of our classification dataset is 58380 instances and 1757 columns. The three counties we are trying to predict have sizes of about 18k, 5k, and 36k, so any model with an accuracy below about 0.61 performs no better than simply predicting the largest county for every property.
Regression Datasets:
The regression dataset removes logerror and transactiondate because they were for the purposes
of the Kaggle competition and were not complete for the training set. The column that was created
for "New Features" from Lab 1 (city and price_per_sqft) were also removed for the sake of
simplicity of only using original data for the prediction process.
We are only using nominal and continuous data types for regression purposes. The final shape of our regression dataset is 58380 instances and 1758 columns. The variable that we are predicting, taxamount, is right skewed, with outlier properties taxed at amounts more than a standard deviation above the mean.
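As a quick check of that skew, here is a minimal sketch, assuming dataset_reg has been built as in the cell above, that prints the skewness and upper quantiles of taxamount:

```python
# Minimal sketch: quantify the right skew of the regression target.
# Assumes dataset_reg was built as in the cell above.
y = dataset_reg['y']
print('skewness:', y.skew())  # positive values indicate a right skew
print(y.describe(percentiles=[0.25, 0.5, 0.75, 0.99]))
# share of properties taxed more than one standard deviation above the mean
print('share above mean + 1 std:', (y > y.mean() + y.std()).mean())
```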
⏫ Back to Top
Out[13]: (variable table for the regression dataset, in the same format as the classification variable table above)
Explain Evaluation Metrics
10 points
Description:
Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-
measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the
results of your modeling? Give a detailed explanation backing up any assertions.
⏫ Back to Top
Classification Metrics:
⏫ Back to Top
Because our class distribution is very skewed, we will optimize the models based on the F1 score.
For our evaluation, we will take into account both accuracy and the F-measure. Computing the F-measure requires precision and recall; because the F-measure combines the two, a better F-measure means the model has a better balance of precision and recall.
Accuracy is the ratio of correct predictions to the total number of observations. It is calculated as: (TP+TN) / (TP+FP+FN+TN). The closer accuracy is to 1, the more accurate the model is, with one caveat: for high accuracy to be a reliable indicator, the class distribution has to be reasonably balanced. Otherwise a model that always predicts the majority class can achieve high accuracy, and we need to review other metrics as well.
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. It is calculated as: TP / (TP+FP).
Recall is the ratio of correctly predicted positive observations to all actual positives. It is calculated as TP / (TP+FN). The consequences of type 2 errors (false negatives) are not extreme here, so we think recall is an appropriate measure of completeness.
Finally, we will also use the F-measure, which is the harmonic mean of precision and recall combined into one statistic. The F-measure is a number between 0 and 1, where closer to 1 is better and approaching 0 is worse. It overcomes the limitations of accuracy when the classes are imbalanced.
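As a concrete illustration of how these quantities are computed, here is a minimal sketch using hypothetical county labels (not our model's output) and the same sklearn calls we use later:

```python
from sklearn import metrics as mt

# Hypothetical true and predicted county labels, only to show the metric calls.
y_true = [3101, 3101, 3101, 1286, 1286, 2061]
y_pred = [3101, 3101, 1286, 1286, 2061, 2061]

print('Accuracy ', mt.accuracy_score(y_true, y_pred))
print('Precision', mt.precision_score(y_true, y_pred, average='weighted'))
print('Recall   ', mt.recall_score(y_true, y_pred, average='weighted'))
# F1 is the harmonic mean of precision and recall, computed per class
# and then weighted by each class's support.
print('F1       ', mt.f1_score(y_true, y_pred, average='weighted'))
```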
Regression Metrics:
For our evaluation of the regression prediction models, we look at mean squared error (MSE) and R^2. Given the large dataset and the right skew of taxamount, we are trying to minimize MSE and obtain an R^2 value close to 1. The model whose optimal parameters reduce MSE and increase R^2 while using less CPU and runtime is the best regression model for this dataset.
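For reference, a minimal sketch of both metrics on hypothetical tax amounts (not our model's output):

```python
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted tax amounts, only to show the metric calls.
y_true = [3000.0, 4500.0, 5200.0, 12000.0]
y_pred = [3200.0, 4300.0, 5000.0, 9000.0]

print('MSE:', mean_squared_error(y_true, y_pred))  # mean of squared residuals
print('R^2:', r2_score(y_true, y_pred))            # 1 is perfect, 0 is no better than predicting the mean
```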
⏫ Back to Top
Training and Testing Splits
10 points
Description:
Choose the method you will use for dividing your data into training and testing splits (i.e., are you
using Stratified 10-fold cross validation? Why?). Explain why your chosen method is appropriate or
use more than one method as appropriate. For example, if you are using time series data then you
should be using continuous training and testing sets across time.
⏫ Back to Top
Because our dataset is extremely large, we use 5 folds so that CPU usage and runtime remain manageable when running the prediction models for both classification and regression. Our data is not a time series, so we did not need to train and test across time with a moving window.
Classification Splits:
⏫ Back to Top
For the classification task we chose Stratified K-Fold cross validation with 5 folds. We chose stratified folds in order to preserve the percentage of samples in each class. Our dataset is also very large, so splitting into more than 5 folds would have been computationally expensive without a large enough return in value. We felt that 5 folds would be enough splits to reduce the weight of any outliers or noise.
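A minimal sketch of why stratification matters here, assuming dataset_class, seed, and n_splits are defined as above: each stratified test fold should show roughly the same county proportions as the full dataset.

```python
from sklearn.model_selection import StratifiedKFold

# Assumes dataset_class, seed, and n_splits were defined in the cells above.
X = dataset_class['X']
y = dataset_class['y']

cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
print('full dataset:', y.value_counts(normalize=True).round(3).to_dict())
for i, (train_index, test_index) in enumerate(cv.split(X, y)):
    fold_props = y.iloc[test_index].value_counts(normalize=True).round(3)
    print('fold %d test split:' % i, fold_props.to_dict())
```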
Regression Splits:
For the regression task we chose K-Fold cross validation with 5 folds. Since the target taxamount is continuous, there are no classes to stratify on, so plain K-Fold is appropriate. As with classification, the dataset is very large, so splitting into more than 5 folds would have been computationally expensive without a large enough return in value, and 5 folds are enough splits to reduce the weight of any outliers or noise.
⏫ Back to Top
Three Different Classification/Regression
Models
20 points
Description:
Create three different classification/regression models for each task (e.g., random forest, KNN, and
SVM for task one and the same or different algorithms for task two). Two modeling techniques must
be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to
increase generalization performance using your chosen metric. You must investigate different
parameters of the algorithms!
⏫ Back to Top
Classification Models:
⏫ Back to Top
Definition and optimization of K Nearest Neighbors (KD Tree)
K Nearest Neighbors predicts the class of each test observation from the classes of its n_neighbors closest training observations, and we use those predictions as the y_hat for the test set. Optimization results for different values of "n_neighbors" are printed below.
When n_neighbors is 12, the f1_score is highest at 0.492.
⏫ Back to Top
In [5]:
Definition and optimization of Random Forest
Random forest predicts values by training an ensemble of decision trees grown to a certain max depth, forming classification models that predict the y_hat for the test set. Optimization results for different test values of "max_depth" are printed below.
By max_depth of 351, the f1_score has plateaued at about 0.4906.
⏫ Back to Top
n_neighbors: 2 , f1_score: 0.470579117819
n_neighbors: 7 , f1_score: 0.491355550804
n_neighbors: 12 , f1_score: 0.492699617518
n_neighbors: 17 , f1_score: 0.492039256254
n_neighbors: 22 , f1_score: 0.490571365928
n_neighbors: 27 , f1_score: 0.488490047836
X = dataset_class['X']
y = dataset_class['y']

result = []
scores = []
for n_neighbors in range(2, 30)[::5]:
    yhat = np.zeros(y.shape)  # we will fill this with predictions
    cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
    for train_index, test_index in cv.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        # in order to reduce the time for training KNeighborsClassifier
        # we reduce the dimensions of the data to 100 with PCA and use a kd_tree
        pca = PCA(n_components=100, random_state=seed)
        pca.fit(X_train)
        X_train = pca.transform(X_train)
        X_test = pca.transform(X_test)
        clf = KNeighborsClassifier(n_neighbors=n_neighbors, algorithm='kd_tree', weights='distance')
        clf.fit(X_train, y_train)
        yhat[test_index] = clf.predict(X_test)
    f1_score = mt.f1_score(y, yhat, average='weighted')
    print('n_neighbors:', n_neighbors, ', f1_score:', f1_score)
In [6]:
Definition and optimization of Naive Bayes (Gaussian)
Gaussian Naive Bayes has no hyper-parameters to optimize; it uses maximum likelihood estimates of the class-conditional distributions to classify the test set. We simply show its F1 score with a confidence interval here and inspect this model in more detail in the next section.
max_depth: 1 F1 score: 0.46120951624
max_depth: 51 F1 score: 0.465099779665
max_depth: 101 F1 score: 0.481927336387
max_depth: 151 F1 score: 0.489103832227
max_depth: 201 F1 score: 0.490156681956
max_depth: 251 F1 score: 0.49059336761
max_depth: 301 F1 score: 0.490213068185
max_depth: 351 F1 score: 0.490623495801
X = dataset_class['X']
y = dataset_class['y']

result = []
index = []
for max_depth in range(1, 401)[::50]:
    yhat = np.zeros(y.shape, dtype=int)  # we will fill this with predictions
    cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
    for train_index, test_index in cv.split(X, y):
        # n_estimators was truncated in the original output; 10 is assumed here
        clf = RandomForestClassifier(max_depth=max_depth, random_state=seed, n_estimators=10)
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        clf.fit(X_train, y_train)
        yhat[test_index] = clf.predict(X_test)
    f1_score = mt.f1_score(y, yhat, average='weighted')
    print('max_depth:', max_depth, 'F1 score:', f1_score)
    result.append(f1_score)
    index.append(max_depth)

plt.title('F1 score for different max_depth')
pd.Series(result, index=pd.Index(index, name='max_depth'), name='f1_score').plot()
Naive Bayes has no parameters to optimize and has an F1 score of 0.46.
⏫ Back to Top
In [8]:
Definition and optimization of Regression models:
⏫ Back to Top
Definition and optimization of K Nearest Neighbors
K Nearest Neighbors predicts the target value of each test observation from the values of its n_neighbors closest training observations, forming regression models that predict the y_hat for the test set. In order to fit the model in a reasonable amount of time, we shrank the dataset to 100 features with PCA. The result of the optimization is printed below. We chose the hyper-parameters with the highest R2 score as the optimal parameters.
When n_neighbors is 11, MSE is at its lowest at 3159735 and R^2 peaks at 0.913.
⏫ Back to Top
F1 score: 0.46 (+/- 0.00)
X = dataset_class['X']
y = dataset_class['y']
yhat = np.zeros(y.shape) # we will fill this with predictions
cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
scores = []
for train_index, test_index in cv.split(X, y):
clf = GaussianNB()
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
clf.fit(X_train, y_train)
yhat[test_index] = clf.predict(X_test)
f1_score = mt.f1_score(y_test, clf.predict(X_test), average='weighted')
scores.append(f1_score)
scores = np.array(scores)
print("F1 score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
In [17]:
Definition and optimization of Random Forest
Random forest predicts values by training an ensemble of decision trees grown to a certain max depth, forming regression models that predict the y_hat for the test set. The result of the optimization is printed below. We chose the hyper-parameters with the highest R2 score as the optimal parameters.
When max_depth is 26, MSE is at its lowest at 2684615 and R^2 peaks at 0.9257.
⏫ Back to Top
n_neighbors: 1, MSE: 4097283, R^2: 0.887
n_neighbors: 6, MSE: 3169526, R^2: 0.912
n_neighbors: 11, MSE: 3159735, R^2: 0.913
n_neighbors: 16, MSE: 3223971, R^2: 0.911
n_neighbors: 21, MSE: 3236105, R^2: 0.910
X = dataset_reg['X']
y = dataset_reg['y']

yhat = np.zeros(y.shape)
cv = KFold(n_splits=n_splits, random_state=seed)
for n_neighbors in range(1, 22)[::5]:
    for train_index, test_index in cv.split(X, y):
        clf = KNeighborsRegressor(n_neighbors=n_neighbors)
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        pca = PCA(n_components=100, random_state=seed)
        pca.fit(X_train)
        X_train = pca.transform(X_train)
        X_test = pca.transform(X_test)
        clf.fit(X_train, y_train)
        yhat[test_index] = clf.predict(X_test)
    print("n_neighbors: %.f, MSE: %.f, R^2: %0.3f" % (n_neighbors, mean_squared_error(y, yhat), r2_score(y, yhat)))
In [21]:
Definition and optimization of Gaussian Regression
Gaussian Regression (GaussianProcessRegressor) models the target as a Gaussian process over the normally distributed features, where the noise parameter alpha can be optimized, forming regression models that predict the y_hat for the test set. The result of the optimization is printed below. We chose the hyper-parameters with the lowest MSE as the optimal parameters.
When alpha is 1e-15, MSE is at its lowest at 35488348 and R^2 peaks at 0.0177.
⏫ Back to Top
max_depth: 1, MSE: 15160579, R^2: 0.5804
max_depth: 6, MSE: 3376013, R^2: 0.9066
max_depth: 11, MSE: 2848552, R^2: 0.9212
max_depth: 16, MSE: 2715775, R^2: 0.9248
max_depth: 21, MSE: 2712884, R^2: 0.9249
max_depth: 26, MSE: 2684615, R^2: 0.9257
max_depth: 31, MSE: 2708345, R^2: 0.9250
max_depth: 36, MSE: 2689061, R^2: 0.9256
max_depth: 41, MSE: 2708431, R^2: 0.9250
X = dataset_reg['X']
y = dataset_reg['y']

yhat = np.zeros(y.shape)
cv = KFold(n_splits=n_splits, random_state=seed)
for max_depth in range(1, 42)[::5]:
    for train_index, test_index in cv.split(X, y):
        clf = RandomForestRegressor(max_depth=max_depth, n_estimators=5, random_state=seed)
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        clf.fit(X_train, y_train)
        yhat[test_index] = clf.predict(X_test)
    print("max_depth: %.f, MSE: %.f, R^2: %0.4f" % (max_depth, mean_squared_error(y, yhat), r2_score(y, yhat)))
In [6]:
Visualizations of Results and Analysis
10 points
Description:
Analyze the results using your chosen method of evaluation. Use visualizations of the results to
bolster the analysis. Explain any visuals and analyze why they are interesting to someone that
might use this model.
⏫ Back to Top
Analysis of Classification model:
For our visualizations for each classification model, we display a bar graph of the count of each
class that was predicted. If the model is good, this should easily show a very small amount for the
center bar (~5%) and a majority in the last bar (61% of values).
The other visualization we display is an ROC curve, which plots the false positive rate on the x axis and the true positive rate on the y axis. In an ROC plot, the area under the curve (AUC) summarizes performance, so we can quickly determine which model's curves sit furthest above the y = x line.
⏫ Back to Top
alpha: 0.000000, MSE: 35488348, R^2: 0.0177
alpha: 0.000250, MSE: 35488410, R^2: 0.0177
alpha: 0.000500, MSE: 35488472, R^2: 0.0177
alpha: 0.000750, MSE: 35488534, R^2: 0.0177
alpha: 0.001000, MSE: 35488596, R^2: 0.0177
X = dataset_reg['X']
y = dataset_reg['y']

yhat = np.zeros(y.shape)
cv = KFold(n_splits=n_splits, random_state=seed)
for alpha in np.linspace(1e-15, 0.001, 5):
    for train_index, test_index in cv.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        # have to train on a subset of the training data because it otherwise requires too much memory
        X_train = X_train.iloc[:2000]
        y_train = y_train.iloc[:2000]
        clf = GaussianProcessRegressor(normalize_y=True, alpha=alpha, random_state=seed)
        clf.fit(X_train, y_train)
        yhat[test_index] = clf.predict(X_test)
    print("alpha: %f, MSE: %.f, R^2: %0.4f" % (alpha, mean_squared_error(y, yhat), r2_score(y, yhat)))
Results and Analysis of a Dummy model
This model only predicts the most frequent class and is used as a baseline to compare the other models against. Because the classes are imbalanced, this dummy model that always predicts county 3101 achieves higher raw accuracy than some of the classification methods.
In [43]:
Results and Analysis of K Nearest Neighbors (KD Tree)
All metrics and analysis of the optimized K Nearest Neighbors (KD Tree) are printed below.
For K Neighbors Classifier when n_neighbors is 12, the F1 Score is 0.49 (+/- 0.01), Accuracy is
0.5422, Precision is 0.4709, and Recall is 0.5422.
⏫ Back to Top
----------------- Dummy Evaluation -----------------
F1 Score: 0.46 (+/- 0.00)
Accuracy 0.609164097294
Precision 0.371080897432
Recall 0.609164097294
X = dataset_class['X']
y = dataset_class['y']
f1_score = mt.f1_score(y, [3101] * len(y), average='weighted')
print_accuracy('Dummy', y, [3101] * len(y), [f1_score])
In [12]:
----------------- KD Tree Classifier Evaluation -----------------
F1 Score: 0.49 (+/- 0.01)
Accuracy 0.54224049332
Precision 0.470945268342
Recall 0.54224049332
X = dataset_class['X']
y = dataset_class['y']

scores = []
yhat = np.zeros(y.shape)
yhat_score = np.zeros((len(y), 3))
cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = KNeighborsClassifier(n_neighbors=12, algorithm='kd_tree', weights='distance')
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    pca = PCA(n_components=100, random_state=seed)
    pca.fit(X_train)
    X_train = pca.transform(X_train)
    X_test = pca.transform(X_test)
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    yhat_score[test_index] = clf.predict_proba(X_test)
    f1_score = mt.f1_score(y_test, clf.predict(X_test), average='weighted')
    scores.append(f1_score)

print_accuracy('KD Tree Classifier', y, yhat, scores)
plot_class_acc(y, yhat, clf.classes_, title="KD Tree Classifier")
confusion_matrix(y, yhat, clf.classes_)
roc_curve(y, yhat, clf)
Results and Analysis of Random Forest
All metrics and analysis of the optimized Random Forest Classifier are printed below.
For Random Forest Classifier when max_depth is 250 and n_estimators is 40, the F1 Score is 0.49
(+/- 0.00), Accuracy is 0.5596, Precision is 0.4708, and Recall is 0.5596.
⏫ Back to Top
In [13]:
----------------- Random Forest Classifier Evaluation -----------------
F1 Score: 0.49 (+/- 0.00)
Accuracy 0.559643713601
Precision 0.470878065706
Recall 0.559643713601
X = dataset_class['X']
y = dataset_class['y']

scores = []
yhat = np.zeros(y.shape)
yhat_score = np.zeros((len(y), 3))
cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = RandomForestClassifier(random_state=seed, max_depth=250, n_estimators=40)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    yhat_score[test_index] = clf.predict_proba(X_test)
    f1_score = mt.f1_score(y_test, clf.predict(X_test), average='weighted')
    scores.append(f1_score)

print_accuracy('Random Forest Classifier', y, yhat, scores)
plot_class_acc(y, yhat, clf.classes_, title="Random Forest Classifier")
confusion_matrix(y, yhat, clf.classes_)
roc_curve(y, yhat, clf)
Results and Analysis Naive Bayes
All metrics and analysis of the Naive Bayes Classifier are printed below.
For Naive Bayes Classifier, the F1 Score is 0.46 (+/- 0.00), Accuracy is 0.6047, Precision is 0.4596,
and Recall is 0.6047.
⏫ Back to Top
In [14]:
----------------- GaussianNB Classifier Evaluation -----------------
F1 Score: 0.46 (+/- 0.00)
Accuracy 0.604727646454
Precision 0.459602898787
Recall 0.604727646454
from sklearn.naive_bayes import GaussianNB

X = dataset_class['X']
y = dataset_class['y']

scores = []
yhat = np.zeros(y.shape)
yhat_score = np.zeros((len(y), 3))
cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = GaussianNB()
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    yhat_score[test_index] = clf.predict_proba(X_test)
    f1_score = mt.f1_score(y_test, clf.predict(X_test), average='weighted')
    scores.append(f1_score)

print_accuracy('GaussianNB Classifier', y, yhat, scores)
plot_class_acc(y, yhat, clf.classes_, title="GaussianNB Classifier")
confusion_matrix(y, yhat, clf.classes_)
roc_curve(y, yhat, clf)
Analysis of Regression models:
⏫ Back to Top
Results and Analysis K Nearest Neighbors
The evaluation metrics for the optimized model are printed below.
After PCA, K Nearest Neighbors Regression with n_neighbors=16 has an MSE of 3223970 (+/- 395957) and an R^2 of 0.91 (+/- 0.01).
⏫ Back to Top
In [16]:
Results and Analysis Random Forest
The evaluation metrics for the optimized model are printed below.
Random Forest Regression with max_depth of 26 and n_estimators of 5 has an MSE of 2684614 (+/- 514568) and an R^2 of 0.93 (+/- 0.02).
⏫ Back to Top
Evaluation metrics:
MSE: 3223970.51 (+/- 395957.39)
R2: 0.91 (+/- 0.01)
X = dataset_reg['X']
y = dataset_reg['y']

mses = []
r2s = []
yhat = np.zeros(y.shape)  # we will fill this with predictions
cv = KFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    pca = PCA(n_components=100, random_state=seed)
    pca.fit(X_train)
    X_train = pca.transform(X_train)
    X_test = pca.transform(X_test)
    clf = KNeighborsRegressor(n_neighbors=16)
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    mses.append(mean_squared_error(y_test, clf.predict(X_test)))
    r2s.append(r2_score(y_test, clf.predict(X_test)))

print('Evaluation metrics:')
print("MSE: %0.2f (+/- %0.2f)" % (np.mean(mses), np.std(mses) * 2))
print("R2: %0.2f (+/- %0.2f)" % (np.mean(r2s), np.std(r2s) * 2))
In [4]:
Residuals distribution plot
The plot shows the residuals for predicting the target variable "taxamount".
In [5]:
Results and Analysis Gaussian Regression
Evaluation metrics:
MSE: 2684614.88 (+/- 514568.63)
R2: 0.93 (+/- 0.02)
X = dataset_reg['X']
y = dataset_reg['y']

mses = []
r2s = []
yhat = np.zeros(y.shape)  # we will fill this with predictions
cv = KFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = RandomForestRegressor(max_depth=26, n_estimators=5, random_state=seed)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    mses.append(mean_squared_error(y_test, clf.predict(X_test)))
    r2s.append(r2_score(y_test, clf.predict(X_test)))

print('Evaluation metrics:')
print("MSE: %0.2f (+/- %0.2f)" % (np.mean(mses), np.std(mses) * 2))
print("R2: %0.2f (+/- %0.2f)" % (np.mean(r2s), np.std(r2s) * 2))

f, ax = plt.subplots(nrows=1, ncols=2, figsize=[15, 7])
residuals = yhat - y
sns.distplot(residuals, ax=ax[0])
sns.boxplot(data=residuals, ax=ax[1]);
The evaluation metrics for the optimized model are printed below.
Gaussian Regression with alpha of 1e-15 and normalize_y set to True has an MSE of 35917993 (+/- 1636329) and an R^2 of 0.01 (+/- 0.00).
⏫ Back to Top
In [18]:
Advantages of Each Model
10 points
Description:
Discuss the advantages of each model for each classification task, if any. If there are not
advantages, explain why. Is any model better than another? Is the difference significant with 95%
confidence? Use proper statistical comparison methods. You must use statistical comparison
techniques—be sure they are appropriate for your chosen method of validation as discussed in unit
7 of the course.
⏫ Back to Top
Advantages of Classification model:
K Nearest Neighbors - KD tree
MSE: 35917993.55 (+/- 1636329.34)
R2: 0.01 (+/- 0.00)
X = dataset_reg['X']
y = dataset_reg['y']

mses = []
r2s = []
yhat = np.zeros(y.shape)  # we will fill this with predictions
cv = KFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = GaussianProcessRegressor(alpha=1e-15, normalize_y=True, random_state=seed)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # we train on a subset because it otherwise requires too much memory
    X_train = X_train.iloc[:1000]
    y_train = y_train.iloc[:1000]
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    mses.append(mean_squared_error(y_test, clf.predict(X_test)))
    r2s.append(r2_score(y_test, clf.predict(X_test)))

print("MSE: %0.2f (+/- %0.2f)" % (np.mean(mses), np.std(mses) * 2))
print("R2: %0.2f (+/- %0.2f)" % (np.mean(r2s), np.std(r2s) * 2))
K nearest neighbors classification is different from other classification models in that it does not fit an explicit model but simply stores the training instances and classifies a new observation by the majority class among its k nearest neighbors. Its advantages are that it is simple, it converges to the correct decision surface as the amount of data goes to infinity, and it handles multiclass problems naturally. In our dataset we used the KD tree algorithm, which speeds up KNN by indexing the training points in a tree so that nearest neighbors can be found without comparing against every observation.
Why we did PCA with KNN
KNN computes distances across every dimension of the data. We have so many dimensions in our dataset that even with 100 neighbors the accuracy was still continuing to grow and each run was taking a long time. We wanted to keep increasing the number of neighbors until the accuracy plateaued, but without reducing the dimensions using PCA it would have taken too long.
Random forest
Random forest is an ensemble classification algorithm, which is by nature a huge advantage: an ensemble of decision trees will usually beat a prediction from a single decision tree. Another advantage is that the forest can often correct an individual tree's overfitting of the training set.
Gaussian Naive Bayes
Naive Bayes' largest advantage is that it is extremely simple: it essentially just counts up probabilities. When training sets are small, Naive Bayes does well because its high bias and low variance keep it from overfitting the training data. However, as datasets grow larger, such as ours, the high bias prevents the model from being powerful enough to reach a high accuracy. Gaussian Naive Bayes is an NB classifier that assumes normally distributed features. Its advantages are that it is fast and can make probabilistic decisions.
Model Comparisons
As stated above, we compare our models based on the F1 and accuracy values. Starting with F1, GaussianNB is statistically significantly lower than the random forest and the KD tree: GaussianNB has an F1 of 0.46 (+/- 0.00) while the other two models have an F1 of 0.49 (+/- 0.01). There is no significant difference in F1 between the KD tree and the random forest.
Next, we compare the accuracy of the KD tree and the random forest. Each was evaluated on the same 58380 instances, with final accuracies of 0.542 and 0.560 respectively. Treating each accuracy as a binomial proportion, the standard errors are sqrt((0.542)(0.458) / 58380) = 0.0021 and sqrt((0.560)(0.440) / 58380) = 0.0021, so the approximate 95% confidence intervals are:
KD Tree accuracy: 0.542 +/- 1.96 * 0.0021 = [0.538, 0.546]
Random forest accuracy: 0.560 +/- 1.96 * 0.0021 = [0.556, 0.564]
The intervals do not overlap, so our final winner is the Random Forest, which is significantly better in F1 than the Gaussian Naive Bayes and significantly better in accuracy than the KD tree.
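A minimal sketch of this interval calculation (the accuracies are taken from the outputs above):

```python
import numpy as np

# Normal-approximation 95% confidence interval for an accuracy estimate.
def accuracy_interval(acc, n, z=1.96):
    se = np.sqrt(acc * (1 - acc) / n)  # standard error of a proportion
    return acc - z * se, acc + z * se

n = 58380  # number of cross-validated predictions
print('KD Tree      ', accuracy_interval(0.542, n))
print('Random Forest', accuracy_interval(0.560, n))
```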
⏫ Back to Top
Advantages of Regression model:
K Nearest Neighbors
The advantages of K nearest neighbors are that it is non-parametric and can capture unusual, non-linear structure in the data when predicting with regression. Dimensionality reduction with PCA can be used to speed up the modeling process, because the neighbor search then runs over far fewer features.
Why we did PCA with KNN
KNN computes distances across every dimension of the data. We have so many dimensions in our dataset that even with 100 neighbors the accuracy was still continuing to grow and each run was taking a long time. We wanted to keep increasing the number of neighbors until the accuracy plateaued, but without reducing the dimensions using PCA it would have taken too long.
Random forest
The advantages of Random forest are that, by averaging multiple trees, it reduces overfitting, reduces the variance caused by outliers, and is therefore more accurate. It gives an unbiased estimate of the generalization error as the forest is built and provides effective methods for estimating missing data. Random forest can also be extended to unlabeled data, leading to unsupervised clustering.
Gaussian Regression
The advantages of Gaussian Regression are that it is fast and uses less CPU and runtime when trained on small subsets, as we did here. However, it is better suited to data with normal distributions. It provides a full probabilistic prediction and interpolates the observations.
Model Comparisons
As stated above, we compare our regression models based on the MSE and R^2 values. Random Forest Regression with a max_depth of 26 and n_estimators of 5 yielded the lowest MSE of 2684614 (+/- 514568) and the highest R^2 of 0.93 (+/- 0.02).
So our final winner is the Random Forest: it is dramatically better than Gaussian Regression on both metrics, and it also beats K Nearest Neighbors, although by a smaller margin since their two-standard-deviation MSE intervals overlap slightly.
⏫ Back to Top
Important Attributes
10 points
Description:
Which attributes from your analysis are most important? Use proper methods discussed in class to
evaluate the importance of different attributes. Discuss the results and hypothesize about why
certain attributes are more important than others for a given classification task.
⏫ Back to Top
Feature importance for classification dataset according to
Random Forest
The top feature was tax amount, with an importance slightly above 0.08. We think this is the most important feature for classifying county because in the state of California, where all of these counties are located, tax rates are set at the county and city level. The next 3 most important features also relate to taxes, each with an importance slightly below 0.08.
The next 3 important features are all related to square footage and year built which we think goes
back to builders and the demographic of the area. Each county could have either one specific
builder for all of their neighborhoods or the builders matched the styles of the homes around them.
The number of bedrooms and bathrooms is probably significant because each county could have
their own demographic of family sizes. If it is close to a larger city, we may see more singles or
couples with fewer numbers of bedrooms and baths and counties farther into suburbia may have
more kids thus more bedrooms and bathrooms.
⏫ Back to Top
In [15]:
Feature importance for regression dataset according to
Random Forest
The top feature for taxamount was tax value dollar count with an importance of just below 0.9. The
next three important features were longitude, latitude, and calculated finished square feet at
significantly lower importance levels (less than 0.1). The tax value is set at the time of the
assessment and tax value dollar count is calculated from the actual taxes and the assessed taxes.
So the tax value to the dollar is important to the tax amount because we would assume these
should be fairly similar.
The longitude and latitude could be of higher importance because in California tax rates are set at
county and city levels so this could vary based on location. The calculated square feet is also
important because a higher square footage could mean a larger home or mansion which is more
likely to have higher tax value than a smaller home.
⏫ Back to Top
Out[15]: <matplotlib.axes._subplots.AxesSubplot at 0x104148ac8>
X = dataset_class['X']
y = dataset_class['y']
clf = RandomForestClassifier(random_state=seed, max_depth=250)
clf.fit(X, y)
importances = clf.feature_importances_
importances = pd.Series(importances, index=X.columns)
importances.sort_values(ascending=False).iloc[:10].plot(kind='bar')
In [16]:
Out[16]: <matplotlib.axes._subplots.AxesSubplot at 0x10a707278>
X = dataset_reg['X']
y = dataset_reg['y']

clf = RandomForestRegressor(max_depth=26, n_estimators=5, random_state=seed)
clf.fit(X, y)

importances = clf.feature_importances_
importances = pd.Series(importances, index=X.columns)
importances.sort_values(ascending=False).iloc[:10].plot(kind='bar')

Deployment
5 points
Description:
How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it was used by these parties? How would you deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.?
⏫ Back to Top
Since Zillow began aggregating publicly available real estate data from disparate sources into a single platform, the gap between sellers' prices and buyers' offer prices has significantly decreased.
The Zillow dataset was provided for the purpose of evaluating Zestimate's accuracy based upon the variable logerror, which is the difference log(Zestimate) - log(SalePrice). For purposes of this lab assignment, we developed regression models with taxamount as the response. In our classification model, we determine important features for the regionidcounty.
For companies in the real estate space, classification models based on physical attributes provide valuable insight for buying, selling, and investment decisions. Our classification model can be adapted to more granular levels such as cities and municipalities. Buyers, sellers, and investors alike
can gain insights into which features have the highest importance to specific locations. This may
drive investment decisions knowing how important certain attributes are for targeted locations.
Knowing which features are highly important in certain locations can drive remodeling decisions to
make properties more attractive to potential buyers. The value-add of this model for these
companies can be measured in terms of returns on investment.
Deployment of the model can be valuable for the rental market as well, where Airbnb can direct
marketing efforts to areas with specific property attributes. Deployment of the model can also be
used to provide the break-even horizon for making rent versus own decisions. In addition, loan
refinancing companies can utilize this model along with Zillow’s liens and taxes database to target
homeowners in specific areas.
To further improve the effectiveness of the model, we should expand the model to include sales
prices, liens, taxes, as well as identify biased data such as short sales, foreclosures, and “arms-
length” transactions (i.e. sales to relatives). All these are readily available from Zillow, as they
collect an enormous amount of data which are updated with high frequency. For our models to be
relevant in this space, they should be updated daily just as Zillow does with their 7 to 11 million
models.
Exceptional Work
10 points
Description:
You have free rein to provide additional analyses. One idea: grid search parameters in a
parallelized fashion and visualize the performances across attributes. Which parameters are most
significant for making a good model for each classification algorithm?
⏫ Back to Top
Approaches Considered for Balanced Classification
⏫ Back to Top
One of the shortcomings of classification, and of K-nearest neighbors in particular, is the tendency to be biased in favor of the majority class. To mitigate this bias, a number of approaches can be used, including StratifiedKFold, which is the approach we use for our classification models. To be thorough, we also explored a set of "imbalanced-learn" algorithms: imblearn.RandomOverSampler, imblearn.SMOTE, and imblearn.ADASYN.
1. StratifiedKFold - A variant of KFold that preserves the class proportions of the full dataset in every fold, so each class is represented in each training and test split. Stratification is performed on the training dataset "on the fly" as opposed to being performed as part of data preprocessing.
2. imblearn.RandomOverSampler - As a separate package, imblearn was developed to
address the problem of imbalanced data sets; it is performed at data preprocessing.
RandomOverSampler, in particular, performs a naive over sampling with replacement,
duplicating original samples from the minority class. (Under sampling is the alternate
approach).
3. imblearn.SMOTE - SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic minority-class samples by interpolating between existing minority observations and their nearest neighbors; imblearn also offers combined variants that follow the over sampling with Tomek's link or edited-nearest-neighbours cleaning for classes that are difficult to separate.
4. imblearn.ADASYN - Adaptive Synthetic Sampling Approach (ADASYN) is similar to SMOTE in that it generates samples by interpolation, but it focuses on generating samples next to minority observations that are wrongly classified by a k-nearest neighbors classifier.
After considering these methods, we settled on the StratifiedKFold for simplicity since accuracies
across the different approaches were practically equivalent.
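For reference, here is a minimal sketch (not part of the original notebook) contrasting the two ideas on the classification data prepared above; dataset_class and seed are the objects defined earlier, and the fold count is arbitrary.
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import StratifiedKFold

X = dataset_class['X']
y = dataset_class['y']

# StratifiedKFold keeps the original class proportions inside every fold
skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):
    print('test-fold class counts:', Counter(y.iloc[test_index]))
    break  # one fold is enough to illustrate the idea

# RandomOverSampler balances the classes by duplicating minority-class rows
ros = RandomOverSampler(random_state=seed)
X_res, y_res = ros.fit_resample(X, y)  # older imblearn releases call this fit_sample
print('before resampling:', Counter(y))
print('after resampling: ', Counter(y_res))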
Below is an illustration of imblearn's RandomOverSampler algorithm in action.
In [9]:
plt.figure(figsize=(12,16))
plt.imshow(imread('../../input/imblearn.png')) # just in case you don't see the image inline
Out[9]: <matplotlib.image.AxesImage at 0x11adee128>
Feature Elimination
Feature ranking with recursive feature elimination:
Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
In this example we first select the top 20 features and then train a Random Forest using only those features. The performance of the model is printed below.
⏫ Back to Top
In [36]:
X = dataset_class['X'].iloc[:2000]
y = dataset_class['y'].iloc[:2000]
estimator = RandomForestClassifier(max_depth=10, random_state=seed, n_estimators=1)  # line truncated in the source; closing parenthesis restored
selector = RFE(estimator, n_features_to_select=20, step=1)
selector = selector.fit(X, y)
X = dataset_class['X']
y = dataset_class['y']
X = X[X.columns[selector.support_]]
scores = []
yhat = np.zeros(y.shape)
yhat_score = np.zeros((len(y), 3))
cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = RandomForestClassifier(random_state=seed, max_depth=250, n_estimators=40)  # line truncated in the source; closing parenthesis restored
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    yhat_score[test_index] = clf.predict_proba(X_test)
    f1_score = mt.f1_score(y_test, clf.predict(X_test), average='weighted')
    scores.append(f1_score)
print_accuracy('KD Tree Classifier', y, yhat, scores)
plot_class_acc(y, yhat, clf.classes_, title="KD Tree Classifier")
confusion_matrix(y, yhat, clf.classes_)
roc_curve(y, yhat, clf)
----------------- KD Tree Classifier Evaluation -----------------
F1 Score: 0.49 (+/- 0.00)
Accuracy 0.559164097294
Precision 0.469948908187
Recall 0.559164097294
Two dimensional Linear Discriminant Analysis
The idea is to see whether there are separable clusters by class. The colors green, blue, and red mark the three counties after LDA projects the data onto a 2D plane, so we can see whether any distinct clusters or definite patterns form.
⏫ Back to Top
In [37]:
X = dataset_class['X']
y = dataset_class['y']
lde = LDA(n_components=2)
X_lde = lde.fit(X, y).transform(X)
colors = y.astype(str)
colors[colors=='3101'] = 'g'
colors[colors=='2061'] = 'b'
colors[colors=='1286'] = 'r'
plt.scatter(X_lde[:, 1], X_lde[:, 0], s=2, c=colors);
References:
Kernels from the Kaggle competition: https://www.kaggle.com/c/zillow-prize-1/kernels
Scikit-learn logistic regression: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
Scikit-learn linear SVC: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
Stack Overflow pandas questions: https://stackoverflow.com/questions/tagged/pandas
Deployment reference: http://www.zdnet.com/article/zillow-machine-learning-and-data-in-real-estate/
Advantages of GaussianProcessRegression: http://scikit-learn.org/stable/modules/gaussian_process.html
Advantages of GaussianProcessRegression: https://stats.stackexchange.com/questions/207183/main-advantages-of-gaussian-process-models
Advantages of GaussianProcessRegression: https://www.quora.com/What-are-some-advantages-of-using-Gaussian-Process-Models-vs-SVMs
Advantages of RandomForestRegression: https://www.quora.com/What-are-some-advantages-of-using-a-random-forest-over-a-decision-tree-given-that-a-decision-tree-is-simpler
Advantages of RandomForestRegression: https://www.datasciencecentral.com/profiles/blogs/random-forests-algorithm
Advantages of KNeighborsRegression: https://stats.stackexchange.com/questions/104255/why-would-anyone-use-knn-for-regression
Advantages of KNeighborsRegression: https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/
Advantages of KNeighborsRegression: https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/#pros-and-cons-of-knn
Imbalanced Learn: http://contrib.scikit-learn.org/imbalanced-learn/stable/install.html
⏫ Back to Top

Weitere ähnliche Inhalte

Was ist angesagt?

R Programming: Mathematical Functions In R
R Programming: Mathematical Functions In RR Programming: Mathematical Functions In R
R Programming: Mathematical Functions In RRsquared Academy
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorVivian S. Zhang
 
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...Vivian S. Zhang
 
3 R Tutorial Data Structure
3 R Tutorial Data Structure3 R Tutorial Data Structure
3 R Tutorial Data StructureSakthi Dasans
 
Logistic Regression using Mahout
Logistic Regression using MahoutLogistic Regression using Mahout
Logistic Regression using Mahouttanuvir
 
B61301007 matlab documentation
B61301007 matlab documentationB61301007 matlab documentation
B61301007 matlab documentationManchireddy Reddy
 
Preparation Data Structures 03 abstract data_types
Preparation Data Structures 03 abstract data_typesPreparation Data Structures 03 abstract data_types
Preparation Data Structures 03 abstract data_typesAndres Mendez-Vazquez
 
R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?Villu Ruusmann
 
Matlab Introduction
Matlab IntroductionMatlab Introduction
Matlab Introductionideas2ignite
 
Evaluating classifierperformance ml-cs6923
Evaluating classifierperformance ml-cs6923Evaluating classifierperformance ml-cs6923
Evaluating classifierperformance ml-cs6923Raman Kannan
 
Random forest using apache mahout
Random forest using apache mahoutRandom forest using apache mahout
Random forest using apache mahoutGaurav Kasliwal
 
5 R Tutorial Data Visualization
5 R Tutorial Data Visualization5 R Tutorial Data Visualization
5 R Tutorial Data VisualizationSakthi Dasans
 
Importance of matlab
Importance of matlabImportance of matlab
Importance of matlabkrajeshk1980
 

Was ist angesagt? (20)

R Programming: Mathematical Functions In R
R Programming: Mathematical Functions In RR Programming: Mathematical Functions In R
R Programming: Mathematical Functions In R
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
 
Programming in R
Programming in RProgramming in R
Programming in R
 
R programming language
R programming languageR programming language
R programming language
 
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
 
3 R Tutorial Data Structure
3 R Tutorial Data Structure3 R Tutorial Data Structure
3 R Tutorial Data Structure
 
R Language Introduction
R Language IntroductionR Language Introduction
R Language Introduction
 
Logistic Regression using Mahout
Logistic Regression using MahoutLogistic Regression using Mahout
Logistic Regression using Mahout
 
B61301007 matlab documentation
B61301007 matlab documentationB61301007 matlab documentation
B61301007 matlab documentation
 
R studio
R studio R studio
R studio
 
Preparation Data Structures 03 abstract data_types
Preparation Data Structures 03 abstract data_typesPreparation Data Structures 03 abstract data_types
Preparation Data Structures 03 abstract data_types
 
R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?R, Scikit-Learn and Apache Spark ML - What difference does it make?
R, Scikit-Learn and Apache Spark ML - What difference does it make?
 
Matlab Introduction
Matlab IntroductionMatlab Introduction
Matlab Introduction
 
Data Management in R
Data Management in RData Management in R
Data Management in R
 
Evaluating classifierperformance ml-cs6923
Evaluating classifierperformance ml-cs6923Evaluating classifierperformance ml-cs6923
Evaluating classifierperformance ml-cs6923
 
Random forest using apache mahout
Random forest using apache mahoutRandom forest using apache mahout
Random forest using apache mahout
 
Array i imp
Array  i impArray  i imp
Array i imp
 
5 R Tutorial Data Visualization
5 R Tutorial Data Visualization5 R Tutorial Data Visualization
5 R Tutorial Data Visualization
 
Basic Analysis using Python
Basic Analysis using PythonBasic Analysis using Python
Basic Analysis using Python
 
Importance of matlab
Importance of matlabImportance of matlab
Importance of matlab
 

Ähnlich wie Lab 2: Classification and Regression Prediction Models, training and testing splits, optimization of K Nearest Neighbors (KD tree), optimization of Random Forest, optimization of Naive Bayes (Gaussian), advantages and model comparisons, feature importance

maxbox starter60 machine learning
maxbox starter60 machine learningmaxbox starter60 machine learning
maxbox starter60 machine learningMax Kleiner
 
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using PythonNishantKumar1179
 
Machine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionMachine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionSiddharth Shrivastava
 
wk5ppt1_Titanic
wk5ppt1_Titanicwk5ppt1_Titanic
wk5ppt1_TitanicAliciaWei1
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple stepsRenjith M P
 
Two methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersTwo methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersUniversity of Huddersfield
 
Machine Learning Model for M.S admissions
Machine Learning Model for M.S admissionsMachine Learning Model for M.S admissions
Machine Learning Model for M.S admissionsOmkar Rane
 
maXbox starter65 machinelearning3
maXbox starter65 machinelearning3maXbox starter65 machinelearning3
maXbox starter65 machinelearning3Max Kleiner
 
Pyclustering tutorial - K-means
Pyclustering tutorial - K-meansPyclustering tutorial - K-means
Pyclustering tutorial - K-meansAndrei Novikov
 
Viktor Tsykunov: Azure Machine Learning Service
Viktor Tsykunov: Azure Machine Learning ServiceViktor Tsykunov: Azure Machine Learning Service
Viktor Tsykunov: Azure Machine Learning ServiceLviv Startup Club
 
Machinelearning Spark Hadoop User Group Munich Meetup 2016
Machinelearning Spark Hadoop User Group Munich Meetup 2016Machinelearning Spark Hadoop User Group Munich Meetup 2016
Machinelearning Spark Hadoop User Group Munich Meetup 2016Comsysto Reply GmbH
 
maXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VIImaXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VIIMax Kleiner
 
Next Generation Programming in R
Next Generation Programming in RNext Generation Programming in R
Next Generation Programming in RFlorian Uhlitz
 
Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62Max Kleiner
 
Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)Daniel Chan
 
Competition 1 (blog 1)
Competition 1 (blog 1)Competition 1 (blog 1)
Competition 1 (blog 1)TarunPaparaju
 

Ähnlich wie Lab 2: Classification and Regression Prediction Models, training and testing splits, optimization of K Nearest Neighbors (KD tree), optimization of Random Forest, optimization of Naive Bayes (Gaussian), advantages and model comparisons, feature importance (20)

maxbox starter60 machine learning
maxbox starter60 machine learningmaxbox starter60 machine learning
maxbox starter60 machine learning
 
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using Python
 
Machine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionMachine Learning - Simple Linear Regression
Machine Learning - Simple Linear Regression
 
wk5ppt1_Titanic
wk5ppt1_Titanicwk5ppt1_Titanic
wk5ppt1_Titanic
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple steps
 
Two methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersTwo methods for optimising cognitive model parameters
Two methods for optimising cognitive model parameters
 
Database programming
Database programmingDatabase programming
Database programming
 
wk5ppt2_Iris
wk5ppt2_Iriswk5ppt2_Iris
wk5ppt2_Iris
 
Machine Learning Model for M.S admissions
Machine Learning Model for M.S admissionsMachine Learning Model for M.S admissions
Machine Learning Model for M.S admissions
 
CSL0777-L07.pptx
CSL0777-L07.pptxCSL0777-L07.pptx
CSL0777-L07.pptx
 
maXbox starter65 machinelearning3
maXbox starter65 machinelearning3maXbox starter65 machinelearning3
maXbox starter65 machinelearning3
 
Pyclustering tutorial - K-means
Pyclustering tutorial - K-meansPyclustering tutorial - K-means
Pyclustering tutorial - K-means
 
Viktor Tsykunov: Azure Machine Learning Service
Viktor Tsykunov: Azure Machine Learning ServiceViktor Tsykunov: Azure Machine Learning Service
Viktor Tsykunov: Azure Machine Learning Service
 
Machinelearning Spark Hadoop User Group Munich Meetup 2016
Machinelearning Spark Hadoop User Group Munich Meetup 2016Machinelearning Spark Hadoop User Group Munich Meetup 2016
Machinelearning Spark Hadoop User Group Munich Meetup 2016
 
maXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VIImaXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VII
 
Next Generation Programming in R
Next Generation Programming in RNext Generation Programming in R
Next Generation Programming in R
 
Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62
 
BPstudy sklearn 20180925
BPstudy sklearn 20180925BPstudy sklearn 20180925
BPstudy sklearn 20180925
 
Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)
 
Competition 1 (blog 1)
Competition 1 (blog 1)Competition 1 (blog 1)
Competition 1 (blog 1)
 

Mehr von Yao Yao

Lessons after working as a data scientist for 1 year
Lessons after working as a data scientist for 1 yearLessons after working as a data scientist for 1 year
Lessons after working as a data scientist for 1 yearYao Yao
 
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...Yao Yao
 
Yelp's Review Filtering Algorithm Paper
Yelp's Review Filtering Algorithm PaperYelp's Review Filtering Algorithm Paper
Yelp's Review Filtering Algorithm PaperYao Yao
 
Yelp's Review Filtering Algorithm Poster
Yelp's Review Filtering Algorithm PosterYelp's Review Filtering Algorithm Poster
Yelp's Review Filtering Algorithm PosterYao Yao
 
Yelp's Review Filtering Algorithm Powerpoint
Yelp's Review Filtering Algorithm PowerpointYelp's Review Filtering Algorithm Powerpoint
Yelp's Review Filtering Algorithm PowerpointYao Yao
 
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelYao Yao
 
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelYao Yao
 
Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Yao Yao
 
Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Yao Yao
 
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...Yao Yao
 
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...Yao Yao
 
Prediction of Future Employee Turnover via Logistic Regression
Prediction of Future Employee Turnover via Logistic RegressionPrediction of Future Employee Turnover via Logistic Regression
Prediction of Future Employee Turnover via Logistic RegressionYao Yao
 
Data Reduction and Classification for Lumosity Data
Data Reduction and Classification for Lumosity DataData Reduction and Classification for Lumosity Data
Data Reduction and Classification for Lumosity DataYao Yao
 
Predicting Sales Price of Homes Using Multiple Linear Regression
Predicting Sales Price of Homes Using Multiple Linear RegressionPredicting Sales Price of Homes Using Multiple Linear Regression
Predicting Sales Price of Homes Using Multiple Linear RegressionYao Yao
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
Blockchain Security and Demonstration
Blockchain Security and DemonstrationBlockchain Security and Demonstration
Blockchain Security and DemonstrationYao Yao
 
API Python Chess: Distribution of Chess Wins based on random moves
API Python Chess: Distribution of Chess Wins based on random movesAPI Python Chess: Distribution of Chess Wins based on random moves
API Python Chess: Distribution of Chess Wins based on random movesYao Yao
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
Blockchain Security and Demonstration
Blockchain Security and DemonstrationBlockchain Security and Demonstration
Blockchain Security and DemonstrationYao Yao
 

Mehr von Yao Yao (19)

Lessons after working as a data scientist for 1 year
Lessons after working as a data scientist for 1 yearLessons after working as a data scientist for 1 year
Lessons after working as a data scientist for 1 year
 
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
 
Yelp's Review Filtering Algorithm Paper
Yelp's Review Filtering Algorithm PaperYelp's Review Filtering Algorithm Paper
Yelp's Review Filtering Algorithm Paper
 
Yelp's Review Filtering Algorithm Poster
Yelp's Review Filtering Algorithm PosterYelp's Review Filtering Algorithm Poster
Yelp's Review Filtering Algorithm Poster
 
Yelp's Review Filtering Algorithm Powerpoint
Yelp's Review Filtering Algorithm PowerpointYelp's Review Filtering Algorithm Powerpoint
Yelp's Review Filtering Algorithm Powerpoint
 
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
 
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
 
Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...
 
Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...
 
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
 
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
 
Prediction of Future Employee Turnover via Logistic Regression
Prediction of Future Employee Turnover via Logistic RegressionPrediction of Future Employee Turnover via Logistic Regression
Prediction of Future Employee Turnover via Logistic Regression
 
Data Reduction and Classification for Lumosity Data
Data Reduction and Classification for Lumosity DataData Reduction and Classification for Lumosity Data
Data Reduction and Classification for Lumosity Data
 
Predicting Sales Price of Homes Using Multiple Linear Regression
Predicting Sales Price of Homes Using Multiple Linear RegressionPredicting Sales Price of Homes Using Multiple Linear Regression
Predicting Sales Price of Homes Using Multiple Linear Regression
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Blockchain Security and Demonstration
Blockchain Security and DemonstrationBlockchain Security and Demonstration
Blockchain Security and Demonstration
 
API Python Chess: Distribution of Chess Wins based on random moves
API Python Chess: Distribution of Chess Wins based on random movesAPI Python Chess: Distribution of Chess Wins based on random moves
API Python Chess: Distribution of Chess Wins based on random moves
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Blockchain Security and Demonstration
Blockchain Security and DemonstrationBlockchain Security and Demonstration
Blockchain Security and Demonstration
 

Kürzlich hochgeladen

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 

Kürzlich hochgeladen (20)

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 

Lab 2: Classification and Regression Prediction Models, training and testing splits, optimization of K Nearest Neighbors (KD tree), optimization of Random Forest, optimization of Naive Bayes (Gaussian), advantages and model comparisons, feature importance

  • 1. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 1/43 Lab 2: Zillow Dataset Classification and Regression Prediction Models MSDS 7331 Data Mining - Section 403 - Lab 2 Team: Ivelin Angelov, Yao Yao, Kaitlin Kirasich, Albert Asuncion Contents Imports Define and Prepare Class Variables Classification Variables Regression Variables Describe the Final Dataset Classification Dataset Regression Dataset Explain Evaluation Metrics Classification Metrics Regression Metrics Training and Testing Splits For Classification For Regression Three Different Classification/Regression Models Classification Models K Nearest Neighbors Random Forest Naive Bayes Regression Models K Nearest Neighbors Random Forest Gaussian Regression Visualizations of Results and Analysis Analysis of Classification Models Analysis of K Nearest Neighbors Analysis of Random Forest Analysis of Naive Bayes Regression Models Analysis of K Nearest Neighbors Analysis of Random Forest Analysis of Gaussian Regression Advantages of Each Model Classification Models Regression Models Important Attributes Classification Models Regression Models
  • 2. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 2/43 Deployment Exceptional Work Approaches Considered for Balanced Classification Feature Elimination Two dimensional Linear Discriminant Analysis References Imports & Custom Functions We chose to use the same Zillow dataset from Lab 1 for this exploration in regression and classification. For the origin and purpose of dataset as well as a detailed description of the dataset, refer to https://github.com/post2web/data_mining_group_project/blob/master/notebooks/lab1.ipynb (https://github.com/post2web/data_mining_group_project/blob/master/notebooks/lab1.ipynb). The function output_variables_table shows if the variable is nominal or ordinal for further use on classification or regression. The functions per_class_accuracy and confusion_matrix show the confusion table for correctly and incorrectly identified classification prediction results. The function plot_class_acc shows the visual accuracies of classification. The function plot_feature_importance shows the feature importance of classification values. The function print_accuracy shows the accuracy scores of the classification models. The function get_dataset_subset obtains a subset of the full dataset for modeling and prediction. We will be using a seed of 0. Due to our dataset being extremely large, we are using 5 folds for the CPU usage and runtime to be more manageable to run through the prediction models for both classification and regression.
  • 3. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 3/43 In [1]: %matplotlib inline import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns from IPython.display import display, HTML from sklearn.model_selection import train_test_split from sklearn import metrics as mt # classification imports from sklearn.ensemble import RandomForestClassifier from sklearn.neighbors import KNeighborsClassifier from sklearn.cross_validation import cross_val_score from sklearn.model_selection import StratifiedKFold from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA from sklearn.decomposition import PCA, SparsePCA from sklearn.gaussian_process import GaussianProcessClassifier from sklearn.neighbors import KNeighborsClassifier from sklearn.ensemble import RandomForestClassifier from sklearn.gaussian_process import GaussianProcessRegressor from sklearn.model_selection import train_test_split, cross_val_score, KFold from sklearn.metrics import mean_squared_error, r2_score # regression imports from sklearn.naive_bayes import GaussianNB from sklearn.neighbors import KNeighborsRegressor from sklearn.model_selection import KFold from sklearn.metrics import mean_squared_error, r2_score from sklearn.ensemble import RandomForestRegressor from sklearn.gaussian_process import GaussianProcessRegressor from sklearn.datasets import make_regression from sklearn.feature_selection import RFE from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA from scipy.ndimage import imread import warnings warnings.filterwarnings("ignore") def output_variables_table(variables): variables = variables.sort_index() rows = ['<tr><th>Variable</th><th>Type</th><th>Scale</th><th>Description</th> for vname, atts in variables.iterrows(): if vname not in dataset.columns: continue atts = atts.to_dict() # add scale if TBD if atts['scale'] == 'TBD': if atts['type'] in ['nominal', 'ordinal']: uniques = dataset[vname].unique() uniques = list(uniques.astype(str)) if len(uniques) < 10: atts['scale'] = '[%s]' % ', '.join(uniques)
  • 4. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 4/43 else: atts['scale'] = '[%s]' % (', '.join(uniques[:5]) + ', ... (%d if atts['type'] in ['ratio', 'interval']: atts['scale'] = '(%d, %d)' % (dataset[vname].min(), dataset[vname row = (vname, atts['type'], atts['scale'], atts['description']) rows.append('<tr><td>%s</td><td>%s</td><td>%s</td><td>%s</td></tr>' % row return HTML('<table>%s</table>' % ''.join(rows)) # Define an accuracy plot def per_class_accuracy(ytrue, yhat): conf = mt.confusion_matrix(ytrue,yhat) norm_conf = conf.astype('float') / conf.sum(axis=1)[:, np.newaxis] return np.diag(norm_conf) def plot_class_acc(ytrue, yhat, classes, title=''): acc_list = per_class_accuracy(y, yhat) pd.DataFrame(acc_list, index=pd.Index(classes, name='Classes')).plot(kind='ba plt.xlabel('Class value (one per face)') plt.ylabel('Accuracy within class') plt.title(title+", Total Acc=%.1f"%(100*mt.accuracy_score(ytrue,yhat))) plt.grid() plt.ylim([0,1]) plt.show() # Plot the feature importances of the forest def plot_feature_importance(ytrue, yhat, rt, title=''): importances = rt.feature_importances_ std = np.std([tree.feature_importances_ for tree in rt.estimators_], axis=0) indices = np.argsort(importances)[::-1] for f in range(X.shape[1]): print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]] plt.figure() plt.title("Feature importances") plt.bar(range(X.shape[1]), importances[indices], color="r", yerr=std[indices], align="center") plt.xticks(range(X.shape[1]), indices) plt.xlim([-1, X.shape[1]]) plt.show() def print_accuracy(model_name, y_test, yhat, scores): scores = np.array(scores) print('----------------- %s Evaluation -----------------' % model_name) print(" F1 Score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2)) print(' Accuracy', mt.accuracy_score(y_test, yhat)) print(' Precision', mt.precision_score(y_test, yhat, average='weighted')) print(' Recall', mt.recall_score(y_test, yhat, average='weighted')) def confusion_matrix(ytrue, yhat, classes): index = pd.MultiIndex.from_product([['True Class'], classes]) columns = pd.MultiIndex.from_product([['Predicted Class'], classes]) return pd.DataFrame(mt.confusion_matrix(y, yhat), index=index, columns=column def roc_curve(ytrue, yhat, clf): for i, label in enumerate(clf.classes_):
  • 5. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 5/43 Define and Prepare Class Variables 10 points Description: Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. ⏫ Back to Top Classification Datasets: The classification dataset removes logerror and transactiondate because they were for the purposes of the Kaggle competition and were not complete for the training set. The column that was created for "New Features" from Lab 1 (city and pricepersqft) were also removed for the sake of simplicity of only using original data for the prediction process. The table generated shows the type of data used for classification purposes. The dataset has 58380 rows and 1757 columns. All variables and details about the variables are printed on the table below. /usr/local/lib/python3.6/site-packages/sklearn/cross_validation.py:41: Deprecat ionWarning: This module was deprecated in version 0.18 in favor of the model_se lection module into which all the refactored classes and functions are moved. A lso note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20. "This module will be removed in 0.20.", DeprecationWarning) fpr, tpr, _ = mt.roc_curve(y, yhat_score[:, i], pos_label=label) roc_auc = mt.auc(fpr, tpr) plt.plot(fpr, tpr, label='class {0} with {1} instances (area = {2:0.2f})' ''.format(label, sum(y==label), roc_auc)) plt.title('ROC Curve') plt.legend(loc="lower right") plt.xlabel('False Positive Rate') plt.ylabel('True Positive Rate') plt.show() def get_dataset_subset(dataset, n=1000): return { 'X': dataset['X'].iloc[:n], 'y': dataset['y'].iloc[:n] } seed = 0 n_splits = 5
  • 6. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 6/43 The target class "regionidcounty" has three possible values: 1286, 2061 or 3101, representing three different county codes. The distribution is skewed with code 1286 having 17749 observations, 3101 has 35563, and 2061 only 5068 observations. ⏫ Back to Top
  • 7. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 7/43 In [5]: Dataset shape: (58380, 1757) regionidcounty 1286 17749 2061 5068 3101 35563 Name: regionidcounty, dtype: int64 variables = pd.read_csv('../../datasets/variables.csv').set_index('name') dataset = pd.read_csv('../../datasets/train.csv', low_memory=False) # remove unneeded variables del dataset['Unnamed: 0'] del dataset['logerror'] del dataset['transactiondate'] del dataset['city'] del dataset['price_per_sqft'] # delete all location information because we want to predict the couty # and those feature will give it up to easy y = dataset['regionidcounty'].copy() del dataset['regionidcounty'] del dataset['regionidcity'] del dataset['regionidzip'] del dataset['regionidneighborhood'] del dataset['rawcensustractandblock'] del dataset['latitude'] del dataset['longitude'] output_variables = output_variables_table(variables) nominal = variables[variables['type'].isin(['nominal'])] nominal = nominal[nominal.index.isin(dataset.columns)] continuous = variables[~variables['type'].isin(['nominal'])] continuous = continuous[continuous.index.isin(dataset.columns)] nominal_data = dataset[nominal.index] nominal_data = pd.get_dummies(nominal_data, drop_first=True) nominal_data = nominal_data[nominal_data.columns[~nominal_data.columns.isin(nomina continuous_data = dataset[continuous.index] dataset = pd.concat([continuous_data, nominal_data], axis=1) columns = dataset.columns variables = variables[variables.index.isin(dataset.columns)] # shuffle the dataset (just in case) X = dataset.sample(frac=1, random_state=seed) dataset_class = { 'X': X, 'y': y } print('Dataset shape:', X.shape) print(y.groupby(y).size()) output_variables
  • 8. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 8/43 Out[5]: Variable Type Scale Description airconditioningtypeid nominal [0, 1, 13, 5, 11, 3, 9] Type of cooling system present in the any) assessmentyear interval (2015, 2015) The year of the property tax assessm bathroomcnt ordinal [1.0, 3.5, 2.5, 3.0, 2.0, ... (22 More)] Number of bathrooms in home includi fractional bathrooms bedroomcnt ordinal [1, 5, 4, 3, 2, ... (16 More)] Number of bedrooms in home buildingqualitytypeid ordinal [7, 4, 1, 10, 12, 8] Overall assessment of condition of the from best (lowest) to worst (highest) calculatedbathnbr ordinal [1.0, 3.5, 2.5, 3.0, 2.0, ... (22 More)] Number of bathrooms in home includi fractional bathroom calculatedfinishedsquarefeet ratio (0, 10925) Calculated total finished living area of home censustractandblock nominal [60372040024100.0, 60590991081500.0, 60374078455800.0, 61110052978700.0, 60379010957300.0, ... (445 More)] Census tract and block ID combined - contains blockgroup assignment by ex finishedsquarefeet12 ratio (0, 6615) Finished living area finishedsquarefeet50 ratio (0, 8352) Size of the finished living area on the (entry) floor of the home fips nominal [6037, 6059, 6111] Federal Information Processing Stand - see https://en.wikipedia.org/wiki/FIPS_cou for more details fireplacecnt ordinal [0, 1, 2, 3, 5, 4] Number of fireplaces in a home (if any fullbathcnt ordinal [1.0, 3.0, 2.0, 6.0, 4.0, ... (17 More)] Number of full bathrooms (sink, show bathtub, and toilet) present in home garagecarcnt ordinal [0.0, 2.0, 1.0, 4.0, 3.0, ... (14 More)] Total number of garages on the lot inc attached garage garagetotalsqft ratio (0, 1610) Total number of square feet of all gara lot including an attached garage hashottuborspa ordinal [0, 1] Does the home have a hot tub or spa heatingorsystemtypeid nominal [7, 0, 2, 6, 24, ... (12 More)] Type of home heating system landtaxvaluedollarcnt ratio (22, 2477536) The assessed value of the land area o parcel
  • 9. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 9/43 location_type nominal [PRIMARY, nan, NOT ACCEPTABLE, ACCEPTABLE] Primary, Acceptable, Not Acceptable lotsizesquarefeet ratio (0, 1710750) Area of the lot in square feet numberofstories ordinal [1, 2, 3, 4] Number of stories or levels the home parcelid nominal [11800329, 14058566, 14636635, 17138404, 11270723, ... (49678 More)] Unique identifier for parcels (lots) poolcnt ordinal [0.0, 1.0] Number of pools on the lot (if any) poolsizesum ratio (0, 1476) Total square footage of all pools on pr pooltypeid10 nominal [0, 1] Spa or Hot Tub pooltypeid2 nominal [0, 1] Pool with Spa/Hot Tub pooltypeid7 nominal [0, 1] Pool without hot tub propertycountylandusecode nominal [0100, 122, 1, 1111, 010C, ... (71 More)] County land use code i.e. it's zoning a county level propertylandusetypeid nominal [261, 266, 246, 265, 269, ... (13 More)] Type of land use the property is zoned propertyzoningdesc nominal [LAR2, 0, LRRA7000*, TOPR- MD, LCA11*, ... (1655 More)] Description of the allowed land uses ( for that property roomcnt ordinal [0, 9, 8, 4, 7, ... (16 More)] Total number of rooms in the principal residence structuretaxvaluedollarcnt ratio (100, 2181198) The assessed value of the built struct the parcel taxamount ratio (49, 51292) The total property tax assessed for th assessment year taxdelinquencyflag nominal [0, 1] Property taxes for this parcel are past of 2015 taxdelinquencyyear interval (0, 26) Year taxvaluedollarcnt ratio (22, 4052186) The total tax assessed value of the pa threequarterbathnbr ordinal [0, 1, 2, 3, 4] Number of 3/4 bathrooms in house (s sink + toilet) unitcnt ordinal [1, 2, 3, 4, 9, 6] Number of units the structure is built i = duplex, 3 = triplex, etc...) yardbuildingsqft17 interval (0, 1485) Patio in yard
  • 10. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 10/43 Regression Datasets: The regression dataset removes logerror and transactiondate because they were for the purposes of the Kaggle competition and were not complete for the training set. The column that was created for "New Features" from Lab 1 (city and pricepersqft) were also removed for the sake of simplicity of only using original data for the prediction process. We are only using nominal and continuous data types for regression purposes. The dataset has 58380 rows and 1758 columns. All variables and details about the variables are printed on the table below. ⏫ Back to Top yardbuildingsqft26 interval (0, 1366) Storage shed/building in yard yearbuilt interval (1885, 2015) The Year the principal residence was zipcode_type nominal [STANDARD, nan, PO BOX, MILITARY, UNIQUE] Standard, PO BOX Only, Unique, Military(implies APO or FPO)
  • 11. 1/12/2018 final-all http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Lab2/final-all.ipynb 11/43 In [13]: Dataset shape: (58380, 1758) dataset = pd.read_csv('../../datasets/train.csv', low_memory=False) variables = pd.read_csv('../../datasets/variables.csv').set_index('name') # remove unneeded variables del dataset['logerror'] del dataset['transactiondate'] del dataset['city'] del dataset['price_per_sqft'] output_variables = output_variables_table(variables) nominal = variables[variables['type'].isin(['nominal'])] nominal = nominal[nominal.index.isin(dataset.columns)] continuous = variables[~variables['type'].isin(['nominal'])] continuous = continuous[continuous.index.isin(dataset.columns)] nominal_data = dataset[nominal.index] nominal_data = pd.get_dummies(nominal_data, drop_first=True) nominal_data = nominal_data[nominal_data.columns[~nominal_data.columns.isin(nomina continuous_data = dataset[continuous.index] dataset = pd.concat([continuous_data, nominal_data], axis=1) columns = dataset.columns variables = variables[variables.index.isin(dataset.columns)] # shuffle the dataset (just in case) X = dataset.sample(frac=1, random_state=seed) y = X['taxamount'].copy() del X['taxamount'] dataset_reg = { 'X': X, 'y': y } print('Dataset shape:', X.shape) plt.title('Distribution of the target variable: taxamount') y.plot(kind='box') output_variables
Describe the Final Dataset
5 points
Description: Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).
⏫ Back to Top

Classification Datasets:
⏫ Back to Top
Since we are using the same Zillow dataset that we used in the previous lab, most of the data was already cleaned up. However, the purpose of our classification dataset is to predict the county each property is located in, so our final model removed all columns relating to location, such as latitude, longitude, city, and zipcode. We also removed variables we did not need, such as logerror, transactiondate, and price_per_sqft. We did not create any new columns for the classification dataset, but we did transform the categorical variables into indicator variables. The final shape of our classification dataset is 58380 instances and 1757 columns. The three counties we are trying to predict have sizes of about 18k, 5k, and 36k, so an accuracy below 0.61 would mean we are better off simply predicting the majority county for every property.

Regression Datasets:
The regression dataset removes logerror and transactiondate because they were for the purposes of the Kaggle competition and were not complete for the training set. The columns created as "New Features" in Lab 1 (city and price_per_sqft) were also removed for simplicity, so that only original data is used in the prediction process. We are only using nominal and continuous data types for regression purposes. The final shape of our regression dataset is 58380 instances and 1758 columns. The variable that we are predicting, taxamount, is right skewed, with outlier properties costing far more than the typical property.
⏫ Back to Top

Out[13]: [variables table: Variable | Type | Scale | Description — rows listed above]
Explain Evaluation Metrics
10 points
Description: Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.
⏫ Back to Top

Classification Metrics:
⏫ Back to Top
Because our class distribution is very skewed, we will be optimizing the models based on F1 score. For our evaluation, we will take into account both accuracy and the F-measure. Computing the F-measure requires precision and recall; because the F-measure is a weighted average of the two, a better F-measure means the model has better precision and recall overall.

Accuracy is the ratio of correct predictions to the total number of observations. It is calculated as (TP+TN) / (TP+FP+FN+TN). The closer accuracy is to 1, the more accurate the model is, with one caveat: for high accuracy to be a reliable indicator, the errors have to be roughly symmetric, i.e. the number of false positives should be about equal to the number of false negatives. Otherwise, we need to review other metrics as well.

Precision is the ratio of correctly predicted positive observations to all predicted positive observations. It is calculated as TP / (TP+FP). Recall is the ratio of correctly predicted positive observations to all actual positives. It is calculated as TP / (TP+FN). The consequences of type 2 errors (false negatives) are not extreme here, so we think recall is an appropriate measure of completeness.

Finally, we will also use the F-measure, which is essentially a weighted average of precision and recall combined into one statistic. The F-measure is a number between 0 and 1, where closer to 1 is better. It overcomes the limitations of accuracy whenever false positives and false negatives are not roughly equal.

Regression Metrics:
For our evaluation of the regression prediction models, we are looking at mean squared error (MSE) and R^2. With the large data size and right skew of taxamount, we are trying to minimize MSE and obtain an R^2 value close to 1. Whichever model, with optimal parameters, reduces MSE and increases R^2 while using less CPU and runtime would be the best regression model for predicting on this dataset.
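As a minimal sketch (not part of the original notebook), the metrics above can be computed with sklearn.metrics; the label arrays below are made-up placeholders, not Zillow data:

from sklearn import metrics as mt

# toy placeholder labels, only to illustrate the metric calls used throughout this lab
y_true = [3101, 3101, 2061, 1286, 3101, 2061]
y_pred = [3101, 1286, 2061, 1286, 3101, 2061]

print('accuracy :', mt.accuracy_score(y_true, y_pred))
print('precision:', mt.precision_score(y_true, y_pred, average='weighted'))
print('recall   :', mt.recall_score(y_true, y_pred, average='weighted'))
print('f1 score :', mt.f1_score(y_true, y_pred, average='weighted'))

# regression metrics on toy numeric targets
yr_true = [1000.0, 2500.0, 4200.0]
yr_pred = [1100.0, 2300.0, 4000.0]
print('MSE:', mt.mean_squared_error(yr_true, yr_pred))
print('R^2:', mt.r2_score(yr_true, yr_pred))

The weighted average is the same setting used in the model evaluations below, so per-class scores are weighted by class frequency.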
⏫ Back to Top

Training and Testing Splits
10 points
Description: Choose the method you will use for dividing your data into training and testing splits (i.e., are you using Stratified 10-fold cross validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. For example, if you are using time series data then you should be using continuous training and testing sets across time.
⏫ Back to Top

Due to our dataset being extremely large, we are using 5 folds so that CPU usage and runtime stay manageable while running the prediction models for both classification and regression. Our data is not a time series, so we did not need to train and test over time with a moving time window.

Classification Splits:
⏫ Back to Top
For the classification task we chose to use Stratified K-Fold cross validation with 5 folds. We chose stratification in order to preserve the percentage of samples from each class in every fold. We also had a very large dataset, so splitting into more than 5 folds would have been computationally expensive without a large enough return on value. We felt that splitting the data into 5 folds would be enough to reduce the weight of any outliers or noise.

Regression Splits:
For the regression task we chose to use plain K-Fold cross validation with 5 folds, since the target (taxamount) is continuous and there are no classes to stratify on. As with classification, splitting such a large dataset into more than 5 folds would have been computationally expensive without a large enough return on value, and 5 folds is enough to reduce the weight of any outliers or noise. A short sketch contrasting the two splitters follows.
⏫ Back to Top
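A minimal sketch (not from the original notebook) showing that StratifiedKFold keeps the class proportions in each test fold while plain KFold does not; the labels are toy placeholders:

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# toy imbalanced labels: 80 samples of class 0, 20 of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

for name, cv in [('KFold', KFold(n_splits=5)),
                 ('StratifiedKFold', StratifiedKFold(n_splits=5))]:
    print(name)
    for train_idx, test_idx in cv.split(X_toy, y_toy):
        # fraction of the minority class in each test fold
        print('  minority fraction in test fold:', y_toy[test_idx].mean())

With ordered labels, plain KFold produces folds with minority fractions of 0.0 or 1.0, while StratifiedKFold keeps every test fold at 0.2, which is why we stratify for the skewed county classes.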
Three Different Classification/Regression Models
20 points
Description: Create three different classification/regression models for each task (e.g., random forest, KNN, and SVM for task one and the same or different algorithms for task two). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric. You must investigate different parameters of the algorithms!
⏫ Back to Top

Classification Models:
⏫ Back to Top

Definition and optimization of K Nearest Neighbors (KD Tree)
K Nearest Neighbors predicts the class of each test sample from the classes of its n_neighbors closest training samples, so the model is defined by the training data itself and the choice of n_neighbors. The optimization results for different values of n_neighbors are printed below. When n_neighbors is 12, the f1_score is highest at 0.492.
⏫ Back to Top
In [5]:
X = dataset_class['X']
y = dataset_class['y']

result = []
scores = []

for n_neighbors in range(2, 30)[::5]:
    yhat = np.zeros(y.shape)  # we will fill this with predictions
    cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
    for train_index, test_index in cv.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        # in order to reduce the time for training KNeighborsClassifier
        # we reduce the dimensions of the data from 1717 to 100 and we use kd_tree
        pca = PCA(n_components=100, random_state=seed)
        pca.fit(X_train)
        X_train = pca.transform(X_train)
        X_test = pca.transform(X_test)

        clf = KNeighborsClassifier(n_neighbors=n_neighbors, algorithm='kd_tree', weights='distance')
        clf.fit(X_train, y_train)
        yhat[test_index] = clf.predict(X_test)

    f1_score = mt.f1_score(y, yhat, average='weighted')
    print('n_neighbors:', n_neighbors, ', f1_score:', f1_score)

n_neighbors: 2 , f1_score: 0.470579117819
n_neighbors: 7 , f1_score: 0.491355550804
n_neighbors: 12 , f1_score: 0.492699617518
n_neighbors: 17 , f1_score: 0.492039256254
n_neighbors: 22 , f1_score: 0.490571365928
n_neighbors: 27 , f1_score: 0.488490047836

Definition and optimization of Random Forest
Random forest predicts the class of each test sample from an ensemble of decision trees whose depth is limited by max_depth. The optimization results for different test values of max_depth are printed below. By a max_depth of 351, the f1_score has plateaued at about 0.4906.
⏫ Back to Top
In [6]:
X = dataset_class['X']
y = dataset_class['y']

result = []
index = []

for max_depth in range(1, 401)[::50]:
    yhat = np.zeros(y.shape, dtype=int)  # we will fill this with predictions
    cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
    for train_index, test_index in cv.split(X, y):
        # the n_estimators value was cut off in the export; 10 is assumed here
        clf = RandomForestClassifier(max_depth=max_depth, random_state=seed, n_estimators=10)
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        clf.fit(X_train, y_train)
        yhat[test_index] = clf.predict(X_test)

    f1_score = mt.f1_score(y, yhat, average='weighted')
    print('max_depth:', max_depth, 'F1 score:', f1_score)
    result.append(f1_score)
    index.append(max_depth)

plt.title('F1 score for different max_depth')
pd.Series(result, index=pd.Index(index, name='max_depth'), name='f1_score').plot()

max_depth: 1 F1 score: 0.46120951624
max_depth: 51 F1 score: 0.465099779665
max_depth: 101 F1 score: 0.481927336387
max_depth: 151 F1 score: 0.489103832227
max_depth: 201 F1 score: 0.490156681956
max_depth: 251 F1 score: 0.49059336761
max_depth: 301 F1 score: 0.490213068185
max_depth: 351 F1 score: 0.490623495801

Definition and optimization of Naive Bayes (Gaussian)
Naive Bayes has no hyper-parameters to tune; it uses maximum likelihood estimates of the class-conditional distributions to classify the test set. We will report its F1 score with a confidence interval here and inspect the results of this model in more detail in the next section.
Naive Bayes has no parameters to tune and has an F1 score of 0.46.
⏫ Back to Top

In [8]:
X = dataset_class['X']
y = dataset_class['y']

yhat = np.zeros(y.shape)  # we will fill this with predictions
cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
scores = []
for train_index, test_index in cv.split(X, y):
    clf = GaussianNB()
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    f1_score = mt.f1_score(y_test, clf.predict(X_test), average='weighted')
    scores.append(f1_score)

scores = np.array(scores)
print("F1 score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

F1 score: 0.46 (+/- 0.00)

Definition and optimization of Regression models:
⏫ Back to Top

Definition and optimization of K Nearest Neighbors
K Nearest Neighbors regression predicts the target of each test sample from the targets of its n_neighbors closest training samples. In order to fit the model in a reasonable amount of time, we shrank the dataset to 100 features with PCA. The result of the optimization is printed below. We chose the hyper-parameters with the highest R^2 score as the optimal parameters. When n_neighbors is 11, MSE is at its lowest at 3159735 and R^2 peaks at 0.913.
⏫ Back to Top
In [17]:
X = dataset_reg['X']
y = dataset_reg['y']

yhat = np.zeros(y.shape)
cv = KFold(n_splits=n_splits, random_state=seed)

for n_neighbors in range(1, 22)[::5]:
    for train_index, test_index in cv.split(X, y):
        clf = KNeighborsRegressor(n_neighbors=n_neighbors)
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        pca = PCA(n_components=100, random_state=seed)
        pca.fit(X_train)
        X_train = pca.transform(X_train)
        X_test = pca.transform(X_test)

        clf.fit(X_train, y_train)
        yhat[test_index] = clf.predict(X_test)

    print("n_neighbors: %.f, MSE: %.f, R^2: %0.3f" % (n_neighbors, mean_squared_error(y, yhat), r2_score(y, yhat)))

n_neighbors: 1, MSE: 4097283, R^2: 0.887
n_neighbors: 6, MSE: 3169526, R^2: 0.912
n_neighbors: 11, MSE: 3159735, R^2: 0.913
n_neighbors: 16, MSE: 3223971, R^2: 0.911
n_neighbors: 21, MSE: 3236105, R^2: 0.910

Definition and optimization of Random Forest
Random forest regression predicts the target of each test sample from an ensemble of decision trees whose depth is limited by max_depth. The result of the optimization is printed below. We chose the hyper-parameters with the highest R^2 score as the optimal parameters. When max_depth is 26, MSE is at its lowest at 2684615 and R^2 peaks at 0.9257.
⏫ Back to Top
In [21]:
X = dataset_reg['X']
y = dataset_reg['y']

yhat = np.zeros(y.shape)
cv = KFold(n_splits=n_splits, random_state=seed)

for max_depth in range(1, 42)[::5]:
    for train_index, test_index in cv.split(X, y):
        clf = RandomForestRegressor(max_depth=max_depth, n_estimators=5, random_state=seed)
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        clf.fit(X_train, y_train)
        yhat[test_index] = clf.predict(X_test)

    print("max_depth: %.f, MSE: %.f, R^2: %0.4f" % (max_depth, mean_squared_error(y, yhat), r2_score(y, yhat)))

max_depth: 1, MSE: 15160579, R^2: 0.5804
max_depth: 6, MSE: 3376013, R^2: 0.9066
max_depth: 11, MSE: 2848552, R^2: 0.9212
max_depth: 16, MSE: 2715775, R^2: 0.9248
max_depth: 21, MSE: 2712884, R^2: 0.9249
max_depth: 26, MSE: 2684615, R^2: 0.9257
max_depth: 31, MSE: 2708345, R^2: 0.9250
max_depth: 36, MSE: 2689061, R^2: 0.9256
max_depth: 41, MSE: 2708431, R^2: 0.9250

Definition and optimization of Gaussian Regression
Gaussian Process Regression predicts the target by fitting a Gaussian process to the training data, where the noise parameter alpha can be tuned. The result of the optimization is printed below. We chose the hyper-parameters with the lowest MSE as the optimal parameters. When alpha is 1e-15, MSE is at its lowest at 35488348 and R^2 peaks at 0.0177.
⏫ Back to Top
In [6]:
X = dataset_reg['X']
y = dataset_reg['y']

yhat = np.zeros(y.shape)
cv = KFold(n_splits=n_splits, random_state=seed)

for alpha in np.linspace(1e-15, 0.001, 5):
    for train_index, test_index in cv.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        # train on a subset of the training data because it otherwise requires too much memory
        X_train = X_train.iloc[:2000]
        y_train = y_train.iloc[:2000]

        clf = GaussianProcessRegressor(normalize_y=True, alpha=alpha, random_state=seed)
        clf.fit(X_train, y_train)
        yhat[test_index] = clf.predict(X_test)

    print("alpha: %f, MSE: %.f, R^2: %0.4f" % (alpha, mean_squared_error(y, yhat), r2_score(y, yhat)))

alpha: 0.000000, MSE: 35488348, R^2: 0.0177
alpha: 0.000250, MSE: 35488410, R^2: 0.0177
alpha: 0.000500, MSE: 35488472, R^2: 0.0177
alpha: 0.000750, MSE: 35488534, R^2: 0.0177
alpha: 0.001000, MSE: 35488596, R^2: 0.0177

Visualizations of Results and Analysis
10 points
Description: Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.
⏫ Back to Top

Analysis of Classification models:
For each classification model we display a bar graph of the count of each predicted class. A good model should show a very small count for the center bar (~5% of values) and a majority in the last bar (61% of values). The other visualization we display is an ROC curve, which plots the false positive rate on the x axis and the true positive rate on the y axis. In an ROC plot, the area under the curve summarizes performance, so we can quickly determine which model's curves sit farther above the y = x line.
⏫ Back to Top
Results and Analysis of a Dummy model
This model only predicts the most frequent class. It is used as a baseline to compare the other models against. The dummy model, which always predicts class 3101, achieves a higher raw accuracy than some of the classification methods.

In [43]:
X = dataset_class['X']
y = dataset_class['y']

f1_score = mt.f1_score(y, [3101] * len(y), average='weighted')
print_accuracy('Dummy', y, [3101] * len(y), [f1_score])

----------------- Dummy Evaluation -----------------
F1 Score: 0.46 (+/- 0.00)
Accuracy 0.609164097294
Precision 0.371080897432
Recall 0.609164097294

Results and Analysis of K Nearest Neighbors (KD Tree)
All metrics and analysis of the optimized K Nearest Neighbors (KD Tree) model are printed below. For the K Neighbors Classifier with n_neighbors of 12, the F1 Score is 0.49 (+/- 0.01), Accuracy is 0.5422, Precision is 0.4709, and Recall is 0.5422.
⏫ Back to Top
In [12]:
X = dataset_class['X']
y = dataset_class['y']

scores = []
yhat = np.zeros(y.shape)
yhat_score = np.zeros((len(y), 3))
cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = KNeighborsClassifier(n_neighbors=12, algorithm='kd_tree', weights='distance')
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    pca = PCA(n_components=100, random_state=seed)
    pca.fit(X_train)
    X_train = pca.transform(X_train)
    X_test = pca.transform(X_test)

    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    yhat_score[test_index] = clf.predict_proba(X_test)
    f1_score = mt.f1_score(y_test, clf.predict(X_test), average='weighted')
    scores.append(f1_score)

print_accuracy('KD Tree Classifier', y, yhat, scores)
plot_class_acc(y, yhat, clf.classes_, title="KD Tree Classifier")
confusion_matrix(y, yhat, clf.classes_)
roc_curve(y, yhat, clf)

----------------- KD Tree Classifier Evaluation -----------------
F1 Score: 0.49 (+/- 0.01)
Accuracy 0.54224049332
Precision 0.470945268342
Recall 0.54224049332
Results and Analysis of Random Forest
All metrics and analysis of the optimized Random Forest Classifier are printed below. For the Random Forest Classifier when max_depth is 250 and n_estimators is 40, the F1 Score is 0.49 (+/- 0.00), Accuracy is 0.5596, Precision is 0.4708, and Recall is 0.5596.
⏫ Back to Top
In [13]:
X = dataset_class['X']
y = dataset_class['y']

scores = []
yhat = np.zeros(y.shape)
yhat_score = np.zeros((len(y), 3))
cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = RandomForestClassifier(random_state=seed, max_depth=250, n_estimators=40)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    yhat_score[test_index] = clf.predict_proba(X_test)
    f1_score = mt.f1_score(y_test, clf.predict(X_test), average='weighted')
    scores.append(f1_score)

print_accuracy('Random Forest Classifier', y, yhat, scores)
plot_class_acc(y, yhat, clf.classes_, title="Random Forest Classifier")
confusion_matrix(y, yhat, clf.classes_)
roc_curve(y, yhat, clf)

----------------- Random Forest Classifier Evaluation -----------------
F1 Score: 0.49 (+/- 0.00)
Accuracy 0.559643713601
Precision 0.470878065706
Recall 0.559643713601
Results and Analysis of Naive Bayes
All metrics and analysis of the Naive Bayes Classifier are printed below. For the Naive Bayes Classifier, the F1 Score is 0.46 (+/- 0.00), Accuracy is 0.6047, Precision is 0.4596, and Recall is 0.6047.
⏫ Back to Top
In [14]:
from sklearn.naive_bayes import GaussianNB

X = dataset_class['X']
y = dataset_class['y']

scores = []
yhat = np.zeros(y.shape)
yhat_score = np.zeros((len(y), 3))
cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = GaussianNB()
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    yhat_score[test_index] = clf.predict_proba(X_test)
    f1_score = mt.f1_score(y_test, clf.predict(X_test), average='weighted')
    scores.append(f1_score)

print_accuracy('GaussianNB Classifier', y, yhat, scores)
plot_class_acc(y, yhat, clf.classes_, title="GaussianNB Classifier")
confusion_matrix(y, yhat, clf.classes_)
roc_curve(y, yhat, clf)

----------------- GaussianNB Classifier Evaluation -----------------
F1 Score: 0.46 (+/- 0.00)
Accuracy 0.604727646454
Precision 0.459602898787
Recall 0.604727646454
Analysis of Regression models:
⏫ Back to Top

Results and Analysis of K Nearest Neighbors
The evaluation metrics for the optimized model are printed below. After PCA, K Nearest Neighbors Regression with n_neighbors of 16 has an MSE of 3223970 (+/- 395957) and an R^2 of 0.91 (+/- 0.01).
⏫ Back to Top
In [16]:
X = dataset_reg['X']
y = dataset_reg['y']

mses = []
r2s = []
yhat = np.zeros(y.shape)  # we will fill this with predictions
cv = KFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    pca = PCA(n_components=100, random_state=seed)
    pca.fit(X_train)
    X_train = pca.transform(X_train)
    X_test = pca.transform(X_test)

    clf = KNeighborsRegressor(n_neighbors=16)
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    mses.append(mean_squared_error(y_test, clf.predict(X_test)))
    r2s.append(r2_score(y_test, clf.predict(X_test)))

print('Evaluation metrics:')
print("MSE: %0.2f (+/- %0.2f)" % (np.mean(mses), np.std(mses) * 2))
print("R2: %0.2f (+/- %0.2f)" % (np.mean(r2s), np.std(r2s) * 2))

Evaluation metrics:
MSE: 3223970.51 (+/- 395957.39)
R2: 0.91 (+/- 0.01)

Results and Analysis of Random Forest
The evaluation metrics for the optimized model are printed below. Random Forest Regression with max_depth of 26 and n_estimators of 5 has an MSE of 2684614 (+/- 514568) and an R^2 of 0.93 (+/- 0.02).
⏫ Back to Top
In [4]:
X = dataset_reg['X']
y = dataset_reg['y']

mses = []
r2s = []
yhat = np.zeros(y.shape)  # we will fill this with predictions
cv = KFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = RandomForestRegressor(max_depth=26, n_estimators=5, random_state=seed)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    mses.append(mean_squared_error(y_test, clf.predict(X_test)))
    r2s.append(r2_score(y_test, clf.predict(X_test)))

print('Evaluation metrics:')
print("MSE: %0.2f (+/- %0.2f)" % (np.mean(mses), np.std(mses) * 2))
print("R2: %0.2f (+/- %0.2f)" % (np.mean(r2s), np.std(r2s) * 2))

Evaluation metrics:
MSE: 2684614.88 (+/- 514568.63)
R2: 0.93 (+/- 0.02)

Residuals distribution plot
The plot shows the residuals for predicting the target variable "taxamount".

In [5]:
f, ax = plt.subplots(nrows=1, ncols=2, figsize=[15, 7])
residuals = yhat - y
sns.distplot(residuals, ax=ax[0])
sns.boxplot(data=residuals, ax=ax[1]);

Results and Analysis of Gaussian Regression
The evaluation metrics for the optimized model are printed below. Gaussian Regression with alpha of 1e-15 and normalize_y set to True has an MSE of 35917993 (+/- 1636329) and an R^2 of 0.01 (+/- 0.00).
⏫ Back to Top

In [18]:
X = dataset_reg['X']
y = dataset_reg['y']

mses = []
r2s = []
yhat = np.zeros(y.shape)  # we will fill this with predictions
cv = KFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = GaussianProcessRegressor(alpha=1e-15, normalize_y=True, random_state=seed)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # we train on a subset because it otherwise requires too much memory
    X_train = X_train.iloc[:1000]
    y_train = y_train.iloc[:1000]

    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    mses.append(mean_squared_error(y_test, clf.predict(X_test)))
    r2s.append(r2_score(y_test, clf.predict(X_test)))

print("MSE: %0.2f (+/- %0.2f)" % (np.mean(mses), np.std(mses) * 2))
print("R2: %0.2f (+/- %0.2f)" % (np.mean(r2s), np.std(r2s) * 2))

MSE: 35917993.55 (+/- 1636329.34)
R2: 0.01 (+/- 0.00)

Advantages of Each Model
10 points
Description: Discuss the advantages of each model for each classification task, if any. If there are not advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods. You must use statistical comparison techniques—be sure they are appropriate for your chosen method of validation as discussed in unit 7 of the course.
⏫ Back to Top

Advantages of Classification models:

K Nearest Neighbors - KD tree
K nearest neighbors classification is different from other classification models in that it does not attempt to build an explicit model; it only stores instances of the training data. The KD tree variant organizes the training data into a tree in which each node splits the data along one of its dimensions, so the k nearest neighbors of a query point can be found without comparing it against every training sample; each prediction is then a vote among the classes of those neighbors. The advantages of K nearest neighbors are that it is simple and converges to the correct decision surface as the amount of data goes to infinity. It works with multiclass datasets and with faster indexing structures such as the KD tree, which is what we used to speed up the KNN classification.

Why we did PCA with KNN
KNN computes the distance between samples across every dimension. We have so many dimensions in our dataset that even with 100 neighbors the accuracy was still growing and each run was taking a long time. We wanted to keep increasing the number of neighbors until the accuracy gain plateaued, but without reducing the dimensions using PCA it would have taken too long.

Random forest
Random forest is an ensemble classification algorithm, which by its nature is a huge advantage because an ensemble of decision trees will usually outperform a single decision tree. Another advantage is that the forest can often correct an individual tree's overfitting of the training set.

Gaussian Naive Bayes
Naive Bayes' largest advantage is that it is extremely simple; it essentially just counts up probabilities. When training sets are small, Naive Bayes works well because its high bias and low variance will not overfit the training data. However, as datasets grow larger, such as ours, the high bias prevents the model from being powerful enough to reach high accuracy. Gaussian Naive Bayes is an NB classifier that assumes normally distributed features. Its advantages are that it is fast and can make probabilistic decisions.

Model Comparisons
As stated above, we will compare our models based on the F1 and accuracy values. Starting with the F1 values, GaussianNB is statistically significantly lower than the random forest and the KD tree: GaussianNB has an F1 of 0.46 (+/- 0.00) while the other models have an F1 of 0.49 (+/- 0.01). There is no significant difference in F1 between the KD tree and the random forest.

Next, we compare the accuracy between the KD tree and the random forest. Each was run on the same number of instances, 58380, with final accuracies of 0.542 and 0.559 respectively. Using the normal approximation for a proportion, the variance of each accuracy estimate is p(1-p)/n: (0.542)(0.458)/58380 = 0.00000425 and (0.559)(0.441)/58380 = 0.00000422, giving a standard error of about 0.00206 for each. The 95% confidence intervals are therefore:
KD Tree accuracy: 0.5422 +/- 1.96(0.00206) = [0.538, 0.546]
Random forest accuracy: 0.5596 +/- 1.96(0.00206) = [0.556, 0.564]
The intervals do not overlap, so our final winner is the Random Forest, which is significantly better in F1 than Gaussian Naive Bayes and significantly better in accuracy than the KD tree. A short sketch of this interval calculation follows.
⏫ Back to Top
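A minimal sketch (not from the original notebook) of the normal-approximation 95% confidence interval used above; the accuracies and sample size are the ones reported in this section:

import math

def accuracy_ci(acc, n, z=1.96):
    # normal-approximation 95% confidence interval for an accuracy estimate
    se = math.sqrt(acc * (1 - acc) / n)
    return acc - z * se, acc + z * se

n = 58380
for name, acc in [('KD Tree', 0.5422), ('Random Forest', 0.5596)]:
    low, high = accuracy_ci(acc, n)
    print('%s accuracy 95%% CI: [%.4f, %.4f]' % (name, low, high))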
Advantages of Regression models:

K Nearest Neighbors
The advantages of K nearest neighbors are that it is non-parametric and can handle missing and unusual data for regression prediction. Dimensionality reduction can be used to speed up the modeling process, because the model can be trained on the reduced feature space produced by PCA.

Why we did PCA with KNN
KNN computes the distance between samples across every dimension. We have so many dimensions in our dataset that even with 100 neighbors the accuracy was still growing and each run was taking a long time. We wanted to keep increasing the number of neighbors until the accuracy gain plateaued, but without reducing the dimensions using PCA it would have taken too long.

Random forest
The advantages of Random forest are that, by averaging multiple trees, it reduces overfitting, reduces variance from outliers, and is therefore more accurate. It gives an unbiased estimate of the generalization error during the forest-building process and provides effective methods for estimating missing data. Random forest can also be extended to unlabeled data, leading to unsupervised clustering.

Gaussian Regression
The advantages of Gaussian Regression are that it is fast and uses less CPU and runtime; however, it is better suited to data with normal distributions. It provides a full probabilistic prediction and interpolates the observations for faster prediction.

Model Comparisons
As stated above, we will compare our models based on the MSE and R^2 values. Random Forest Regression with a max_depth of 26 and n_estimators of 5 yielded the lowest MSE of 2684614 (+/- 514568) and the highest R^2 of 0.93 (+/- 0.02). So our final winner is the Random Forest, which is significantly better in MSE and R^2 than both K Nearest Neighbors and Gaussian Regression. A per-fold comparison sketch follows.
⏫ Back to Top
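A minimal sketch (not from the original notebook) of a paired comparison on the per-fold MSEs; `mses_rf` and `mses_knn` are hypothetical names for copies of the `mses` lists collected in the Random Forest and KNN regression cells above, since the same K-Fold splits were used for both models:

from scipy import stats

def paired_mse_test(mses_a, mses_b):
    # paired t-test on per-fold MSEs from the same K-Fold splits
    t_stat, p_value = stats.ttest_rel(mses_a, mses_b)
    return t_stat, p_value

# usage, assuming the per-fold MSE lists were saved after each evaluation cell:
# t, p = paired_mse_test(mses_rf, mses_knn)
# print('t = %.3f, p = %.4f' % (t, p))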
Important Attributes
10 points
Description: Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.
⏫ Back to Top

Feature importance for the classification dataset according to Random Forest
The top feature was tax amount, with an importance slightly above 0.08. We think this is the most important feature for classifying county because in the state of California, where all of the counties are located, tax rates are set at the county and city level. The next 3 most important features also relate to taxes, each with an importance slightly below 0.08. The 3 features after that are all related to square footage and year built, which we think goes back to builders and the demographics of the area: each county could have one dominant builder for its neighborhoods, or builders who matched the styles of the homes around them. The number of bedrooms and bathrooms is probably significant because each county could have its own demographic of family sizes. Close to a larger city we may see more singles or couples with fewer bedrooms and baths, while counties farther into suburbia may have more kids and thus more bedrooms and bathrooms.
⏫ Back to Top
In [15]:
X = dataset_class['X']
y = dataset_class['y']

clf = RandomForestClassifier(random_state=seed, max_depth=250)
clf.fit(X, y)

importances = clf.feature_importances_
importances = pd.Series(importances, index=X.columns)
importances.sort_values(ascending=False).iloc[:10].plot(kind='bar')

Out[15]: <matplotlib.axes._subplots.AxesSubplot at 0x104148ac8>
[bar chart of the top 10 feature importances for the classification task]

Feature importance for the regression dataset according to Random Forest
The top feature for taxamount was taxvaluedollarcnt, with an importance just below 0.9. The next three important features were longitude, latitude, and calculated finished square feet, at significantly lower importance levels (less than 0.1). The tax amount is derived from the assessed value, so it makes sense that the total tax assessed value is the dominant predictor: we would expect the two to track each other closely. Longitude and latitude are likely important because in California tax rates are set at the county and city level, so rates vary with location. The calculated square footage is also important because higher square footage usually means a larger home, which is more likely to carry a higher tax value than a smaller home.
⏫ Back to Top
In [16]:
X = dataset_reg['X']
y = dataset_reg['y']

clf = RandomForestRegressor(max_depth=26, n_estimators=5, random_state=seed)
clf.fit(X, y)

importances = clf.feature_importances_
importances = pd.Series(importances, index=X.columns)
importances.sort_values(ascending=False).iloc[:10].plot(kind='bar')

Out[16]: <matplotlib.axes._subplots.AxesSubplot at 0x10a707278>
[bar chart of the top 10 feature importances for the regression task]

Deployment
5 points
Description: How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it was used by these parties? How would you deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.?
⏫ Back to Top

Since Zillow began combining publicly available real estate data from disparate sources into a single platform, the gap between sellers' prices and buyers' offer prices has significantly decreased. The Zillow dataset was provided for the purpose of evaluating Zestimate's accuracy based upon the variable logerror, which is the difference log(Zestimate) - log(SalePrice). For the purposes of this lab assignment, we developed regression models with taxamount as the response. In our
classification model, we determined the important features for predicting regionidcounty.

For companies in the real estate space, classification models based on physical attributes provide valuable insight for buying, selling, and investment decisions. Our classification model can be adapted to more granular levels such as cities and municipalities. Buyers, sellers, and investors alike can gain insight into which features have the highest importance in specific locations. This may drive investment decisions, knowing how important certain attributes are for targeted locations. Knowing which features are highly important in certain locations can also drive remodeling decisions to make properties more attractive to potential buyers. The value-add of this model for these companies can be measured in terms of returns on investment.

Deployment of the model can be valuable for the rental market as well, where Airbnb could direct marketing efforts to areas with specific property attributes. Deployment of the model can also be used to provide the break-even horizon for rent-versus-own decisions. In addition, loan refinancing companies can use this model along with Zillow's liens and taxes database to target homeowners in specific areas.

To further improve the effectiveness of the model, we should expand it to include sales prices, liens, and taxes, as well as identify biased data such as short sales, foreclosures, and non-arm's-length transactions (i.e. sales to relatives). All of these are readily available from Zillow, which collects an enormous amount of data updated with high frequency. For our models to be relevant in this space, they should be updated daily, just as Zillow does with its 7 to 11 million models.

Exceptional Work
10 points
Description: You have free reign to provide additional analyses. One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?
⏫ Back to Top

Approaches Considered for Balanced Classification
⏫ Back to Top
One of the shortcomings of classification, and of K-nearest neighbors in particular, is the tendency to bias in favor of the majority class. To eliminate the bias, a number of approaches can be utilized, including StratifiedKFold, which is the approach we use for our classification models. To be thorough, we also explored a set of "imbalanced learn" algorithms: imblearn.RandomOverSampler, imblearn.SMOTE, and imblearn.ADASYN.

1. StratifiedKFold - A variant of KFold, this ensures each class is represented proportionally as the algorithm performs each fold. Stratification is performed on the
training dataset "on the fly", as opposed to performing it as part of data preprocessing.
2. imblearn.RandomOverSampler - As a separate package, imblearn was developed to address the problem of imbalanced data sets; the resampling is performed at data preprocessing time. RandomOverSampler, in particular, performs naive over-sampling with replacement, duplicating original samples from the minority class. (Under-sampling is the alternate approach.)
3. imblearn.SMOTE - SMOTE compensates for classes that are difficult to separate by generating synthetic minority samples, optionally combined with Tomek's links or edited nearest neighbours cleaning methods.
4. imblearn.ADASYN - Adaptive Synthetic Sampling Approach (ADASYN) is similar to SMOTE in that it generates samples by interpolation, but it focuses on the wrongly classified k-nearest neighbors.

After considering these methods, we settled on StratifiedKFold for simplicity, since accuracies across the different approaches were practically equivalent. A minimal usage sketch follows, and below that is an illustration of imblearn's RandomOverSampler algorithm in action.
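A minimal usage sketch (not from the original notebook) of the over-sampling step, assuming a recent imblearn version where the resampling method is named fit_resample; the toy arrays are placeholders standing in for the county classification data:

import numpy as np
from imblearn.over_sampling import RandomOverSampler

# toy imbalanced data: 15 samples of class 0, 5 of class 1
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.array([0] * 15 + [1] * 5)

ros = RandomOverSampler(random_state=0)
X_res, y_res = ros.fit_resample(X_toy, y_toy)  # minority class is duplicated until the classes balance

print('before:', np.bincount(y_toy))   # [15  5]
print('after :', np.bincount(y_res))   # [15 15]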
In [9]:
plt.figure(figsize=(12,16))
plt.imshow(imread('../../input/imblearn.png'))  # just in case you don't see the image inline

Out[9]: <matplotlib.image.AxesImage at 0x11adee128>
[illustration of imblearn's RandomOverSampler]

Feature ranking with recursive feature elimination
Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached. In this example we first select the top 20 features and then train a Random Forest using only those features. The performance of the model is printed below.
⏫ Back to Top
In [36]:
X = dataset_class['X'].iloc[:2000]
y = dataset_class['y'].iloc[:2000]

# the n_estimators value for the RFE estimator was cut off in the export; 10 is assumed here
estimator = RandomForestClassifier(max_depth=10, random_state=seed, n_estimators=10)
selector = RFE(estimator, n_features_to_select=20, step=1)
selector = selector.fit(X, y)

X = dataset_class['X']
y = dataset_class['y']
X = X[X.columns[selector.support_]]

scores = []
yhat = np.zeros(y.shape)
yhat_score = np.zeros((len(y), 3))
cv = StratifiedKFold(n_splits=n_splits, random_state=seed)
for train_index, test_index in cv.split(X, y):
    clf = RandomForestClassifier(random_state=seed, max_depth=250, n_estimators=40)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    yhat[test_index] = clf.predict(X_test)
    yhat_score[test_index] = clf.predict_proba(X_test)
    f1_score = mt.f1_score(y_test, clf.predict(X_test), average='weighted')
    scores.append(f1_score)

print_accuracy('Random Forest Classifier (top 20 RFE features)', y, yhat, scores)
plot_class_acc(y, yhat, clf.classes_, title="Random Forest Classifier (top 20 RFE features)")
confusion_matrix(y, yhat, clf.classes_)
roc_curve(y, yhat, clf)

----------------- Random Forest Classifier (top 20 RFE features) Evaluation -----------------
F1 Score: 0.49 (+/- 0.00)
Accuracy 0.559164097294
Precision 0.469948908187
Recall 0.559164097294
Two dimensional Linear Discriminant Analysis
The idea is to see whether there are separable clusters by class. The colors green, blue, and red separate the 3 counties after projection by LDA, to see whether any unique clusters or definite patterns form on a 2D plane.
⏫ Back to Top
In [37]:
X = dataset_class['X']
y = dataset_class['y']

lde = LDA(n_components=2)
X_lde = lde.fit(X, y).transform(X)

colors = y.astype(str)
colors[colors=='3101'] = 'g'
colors[colors=='2061'] = 'b'
colors[colors=='1286'] = 'r'

plt.scatter(X_lde[:, 1], X_lde[:, 0], s=2, c=colors);

References:
Kernels from the Kaggle competition: https://www.kaggle.com/c/zillow-prize-1/kernels
Scikit-learn logistic regression: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
Scikit-learn linear SVC: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
Stack Overflow pandas questions: https://stackoverflow.com/questions/tagged/pandas
Deployment reference: http://www.zdnet.com/article/zillow-machine-learning-and-data-in-real-estate/
Advantages of GaussianProcessRegression: http://scikit-learn.org/stable/modules/gaussian_process.html
Advantages of GaussianProcessRegression: https://stats.stackexchange.com/questions/207183/main-advantages-of-gaussian-process-models
Advantages of GaussianProcessRegression: https://www.quora.com/What-are-some-advantages-of-using-Gaussian-Process-Models-vs-SVMs
Advantages of RandomForestRegression: https://www.quora.com/What-are-some-advantages-of-using-a-random-forest-over-a-decision-tree-given-that-a-decision-tree-is-simpler
Advantages of RandomForestRegression: https://www.datasciencecentral.com/profiles/blogs/random-forests-algorithm
Advantages of KNeighborsRegression: https://stats.stackexchange.com/questions/104255/why-would-anyone-use-knn-for-regression
Advantages of KNeighborsRegression: https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/
Advantages of KNeighborsRegression: https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/#pros-and-cons-of-knn
Imbalanced Learn: http://contrib.scikit-learn.org/imbalanced-learn/stable/install.html
⏫ Back to Top