Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regression Model Performance, Optimizing Support Vector Machine Classifier, Accuracy of results and efficiency, Logistic Regression Feature Importance, interpretation, Density Graph

Y

https://github.com/yaowser/data_mining_group_project https://www.kaggle.com/c/zillow-prize-1/data From the Zillow real estate data set of properties in the southern California area, conduct the following data cleaning, data analysis, predictive analysis, and machine learning algorithms: Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regression Model Performance, Optimizing Support Vector Machine Classifier, Accuracy of results and efficiency, Logistic Regression Feature Importance, interpretation of support vectors, Density Graph

1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 1/22
Mini-lab 1: Zillow Dataset Logistic Regression
and SVMs
MSDS 7331 Data Mining - Section 403 - Mini Lab 1
Team: Ivelin Angelov, Yao Yao, Kaitlin Kirasich, Albert Asuncion
Contents
Imports
Models
Advantages of Each Model
Feature Importance
Insights
References
Imports
We chose to use the same Zillow dataset from Lab 1 for this exploration in logistic regression and
SVM. For origin and purpose of dataset as well as a detailed description of the dataset, refer to
https://github.com/post2web/data_mining_group_project/blob/master/notebooks/lab1.ipynb
(https://github.com/post2web/data_mining_group_project/blob/master/notebooks/lab1.ipynb).
In [1]:
Load Data, Create y and X
Since we are using the Zillow dataset from our previous lab, the cleanup files were exported from
lab 1 into mini-lab 1. Note that for logistic regression and support vector classifier models, we
choose to use mostly complete continuous variables as well as create dummy variables for nominal
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from tqdm import tqdm
import time
from collections import OrderedDict
warnings.filterwarnings('ignore')
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 2/22
variables to cross compare the performance, feature importance, and insights of each model. X is
the training set and y is the test set, where we are testing if our models can accurately predict
positive (1) logerrror from that of negative (0).
Data columns that are only available for the training set and not the test set (transaction date) were
removed. parcelid was removed because each individual property has its own ID and does not
correlate well with regression or SVMs. The column that was created for "New Features" from Lab 1
(city and pricepersqft) were also removed for the sake of simplicity of only using original data
for the prediction process.
In [2]:
Dealing with Nominal Data
Nominal data usually has more than two values. For logistic regression and SVMs, we created
dummy variables that only factor in 0s and 1s for the prediction process of logistic regression and
SVMs.
In [3]:
Dealing with Continuous Data
StandardScaler from sklearn was applied to the continuous data columns to standardize the dataset
around center 0 with equal variance for creating normal distributions prior to the application of
logistic regression and SVMs.
In [4]:
Out[2]: 'The dataset has 116761 rows and 49 columns'
# load datasets here:
variables = pd.read_csv('../../datasets/variables.csv').set_index('name')
X = pd.read_csv('../../datasets/train.csv', low_memory=False)
y = (X['logerror'] > 0).astype(np.int32)
del X['logerror']
del X['transactiondate']
del X['parcelid']
del X['city']
del X['price_per_sqft']
'The dataset has %d rows and %d columns' % X.shape
nominal = variables[variables['type'].isin(['nominal'])]
nominal = nominal[nominal.index.isin(X.columns)]
nominal_data = X[nominal.index]
nominal_data = pd.get_dummies(nominal_data, drop_first=True)
nominal_data = nominal_data[nominal_data.columns[~nominal_data.columns.isin(nomina
continuous = variables[~variables['type'].isin(['nominal'])]
continuous = continuous[continuous.index.isin(X.columns)]
continuous_data = X[continuous.index]
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 3/22
Merging the Data
The data was then merged for the application of logistic regression and SVM prediction. The
following shows the final shape of the dataset after the application of dummy variables and
StandardScaler.
In [5]:
Back to Top
Models
[50 points]
Create a logistic regression model and a support vector machine model for the classification task
involved with your dataset. Assess how well each model performs (use 80/20 training/testing split
for your data). Adjust parameters of the models to make them more accurate. If your dataset
size requires the use of stochastic gradient descent, then linear kernel only is fine to use. That is,
the SGDClassifier is fine to use for optimizing logistic regression and linear support vector
machines. For many problems, SGD will be required in order to train the SVM model in a
reasonable timeframe.
Create Models
SGDClassifier Over the Other Sklearn Functions
We tried out a few sklearn support vector machine functions and noticed that the accuracy was
similar for each but with such a large dataset we decided to try to cut down on the time for logistic
regression.
First, we tried SVC setting kernel = 'linear' but waited a long time for it to finish.
Next, we tried LinearSVC because the liblinear library it uses tends to be faster to converge the
larger the number of samples is than the libsvm library.
Finally tried SGDClassifier with loss = 'log' which was exponentially faster than the others so this is
what we use for logistic regression.
Functions to Test Accuracy
These are the functions that we wrote to individually find, visualize, and report the best parameters
per model, where we reuse those parameters for the optimized model.
Out[5]: 'The dataset has 116761 rows and 2107 columns'
X = pd.concat([continuous_data, nominal_data], axis=1)
columns = X.columns
'The dataset has %d rows and %d columns' % X.shape
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 4/22
In [7]:
Logistic Regression
For the logistic regression model, we created a function that took in X_train and Y_train from the
original data set to test for X_test from the modified dataset. The accuracy of the logistic regression
prediction for positive or negative logerror was compared with that of the original, where a
confusion matrix was made to show percentage accuracy. Due to the complexity of the dataset, we
are slightly better than 50% accuracy.
def test_accuracy(model, n_splits=8, print_steps=False, params={}):
accuracies = []
for i in range(1, n_splits+1):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, ra
yhat, _ = model(
X_train=X_train,
y_train=y_train,
X_test=X_test,
**params
)
accuracy = float(sum(yhat==y_test)) / len(y_test)
accuracies.append(accuracy)
if print_steps:
matrix = pd.DataFrame(confusion_matrix(y_test, yhat),
columns=['Predicted 1', 'Predicted 0'],
index=['Actual 1', 'Actual 0'],
)
print('*' * 15 + ' Split %d ' % i + '*' * 15)
print('Accuracy:', accuracy)
print(matrix)
return np.mean(accuracies)
def find_optimal_accuracy(model, param, param_values, params={}):
result = {}
for param_value in tqdm(list(param_values)):
params_local = params.copy()
params_local[param] = param_value
result[param_value] = test_accuracy(model, params=params_local)
result = pd.Series(result).sort_index()
plt.xlabel(param, fontsize=15)
plt.ylabel('Accuracy', fontsize=15)
optimal_param = result.argmax()
optimal_accuracy = result[optimal_param]
if type(param_value) == str:
result.plot(kind='bar')
else:
result.plot()
plt.show()
return optimal_param
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 5/22
In [8]:
*************** Split 1 ***************
Accuracy: 0.5544897871793774
Predicted 1 Predicted 0
Actual 1 5957 4488
Actual 0 5916 6992
*************** Split 2 ***************
Accuracy: 0.5663940393097247
Predicted 1 Predicted 0
Actual 1 1456 8955
Actual 0 1171 11771
*************** Split 3 ***************
Accuracy: 0.5502933241981758
Predicted 1 Predicted 0
Actual 1 5883 4591
Actual 0 5911 6968
*************** Split 4 ***************
Accuracy: 0.5652806919881814
Predicted 1 Predicted 0
Actual 1 1783 8819
Actual 0 1333 11418
*************** Split 5 ***************
Accuracy: 0.5028475998801011
Predicted 1 Predicted 0
Actual 1 8796 1616
Actual 0 9994 2947
*************** Split 6 ***************
Accuracy: 0.5612126921594656
Predicted 1 Predicted 0
Actual 1 2250 8243
Actual 0 2004 10856
*************** Split 7 ***************
Accuracy: 0.5663940393097247
Predicted 1 Predicted 0
Actual 1 2750 7693
Actual 0 2433 10477
*************** Split 8 ***************
Accuracy: 0.5659230077506102
Predicted 1 Predicted 0
def logistic_regression_model(X_train, y_train, X_test, **params):
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
params['loss'] = 'log'
clf = SGDClassifier(**params)
clf.fit(X_train, y_train)
return clf.predict(X_test), clf
best_params_logistic = {}
model = logistic_regression_model
accuracy = test_accuracy(model=model, params=best_params_logistic, print_steps=Tru
print('-' * 50)
'Average unoptimized accuracy: %f' % accuracy
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 6/22
Optimizing the Logistic Regression Model
By running logistic regression one time with the built in parameters, we got an average accuracy of
0.554 from 8 splits. To try to improve this, we want to do a few things.
First, we want to do the 80/20 split 5 times and average those results to get a better accuracy. By
splitting the training and test sets up multiple times, we can minimize the effects of outliers.
Second, we want to see how changing the value of alpha, epsilon, number of iterations, and penalty
will affect the accuracy. To do this we have another for loop which sets alpha and epsilon at ten and
twenty linear increments from 0.00001 to 0.001 and 0.01 to .5, respectively. The number of
iterations could be 1, 3, 6, 10, or 15 and penalty could be L1 or L2.
We found that the optimal value for alpha is 0.00023 and that for epsilon is 0.293. The optimal
penalty is L2 at 15 iterations.
Alpha is just a constant multiplied to the regularization term so our value of 0.00023 is expected.
Alpha could be used again if we set the learning rate to optimal but we will not do that for this mini
lab.
Our value of 0.293 for epsilon is important in the threshold of our model which is why we ran
iterations over values very close to 0. The default is 0.001 but we found values even smaller than
that increased our accuracy.
We found L2, the squared error, is slightly more accurate than L1, the error. This was expected
because L2 is typically better for minimizing error than L1 and L2 is standard for linear SVM models,
where it performed the best for our model.
The iteration number vs accuracy should be a fairly random distribution. We expected to get
different results each time and expected that they would be about our initial accuracy, 0.55 +/- 0.1.
This time, 15 iterations is the optimal number. Although the accuracy per iteration was still going up,
we had to stop at 15 iterations due to running time restraints.
Out[8]: 'Average unoptimized accuracy: 0.554104'
Actual 1 3354 7103
Actual 0 3034 9862
--------------------------------------------------
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 7/22
In [9]:
100%|██████████| 5/5 [06:04<00:00, 76.82s/it]
Best n_iter 15
100%|██████████| 10/10 [17:06<00:00, 102.69s/it]
Best alpha 0.00023
test_params = [
('n_iter', [1, 3, 6, 10, 15]),
('alpha', np.linspace(0.00001, 0.001, 10)),
('epsilon', np.linspace(0.01, .5, 20)),
('penalty', ['l1', 'l2'])
]
for param, param_values in test_params:
best_params_logistic[param] = find_optimal_accuracy(
logistic_regression_model,
param=param,
param_values=param_values,
params=best_params_logistic
)
print("Best", param, best_params_logistic[param])
time.sleep(1)
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 8/22
Optimized Logistic Regression Model Performance
Once we plugged in all optimal values into the model, the final accuracy became 0.568, which is
slightly better than that of 0.554 from default parameters. Due to the dataset being very
complicated, no large improvement in accuracy was expected.
In [10]:
100%|██████████| 20/20 [34:21<00:00, 103.28s/it]
Best epsilon 0.293684210526
100%|██████████| 2/2 [04:19<00:00, 140.15s/it]
Best penalty l2
Optimized Logistic Regression Accuracy 0.568090
1min 4s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%%timeit -n1 -r1
accuracy = test_accuracy(
logistic_regression_model, n_splits=5, params=best_params_logistic)
print('Optimized Logistic Regression Accuracy %f' % accuracy)
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 9/22
Support Vector Machine Classifier
For the support vector machine model, we created a function that took in X_train and Y_train from
the original data set to test for X_test from the modified dataset. The accuracy of the SVM
prediction for positive or negative logerror was compared with that of the original, where a
confusion matrix was made to show percentage accuracy. Due to the complexity of the dataset, we
are again slightly better than 50% accuracy.
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 10/22
In [14]:
*************** Split 1 ***************
Accuracy: 0.532736693358455
Predicted 1 Predicted 0
Actual 1 4716 5729
Actual 0 5183 7725
*************** Split 2 ***************
Accuracy: 0.5036611998458442
Predicted 1 Predicted 0
Actual 1 4259 6152
Actual 0 5439 7503
*************** Split 3 ***************
Accuracy: 0.533978503832484
Predicted 1 Predicted 0
Actual 1 4864 5610
Actual 0 5273 7606
*************** Split 4 ***************
Accuracy: 0.540829871965058
Predicted 1 Predicted 0
Actual 1 4296 6306
Actual 0 4417 8334
*************** Split 5 ***************
Accuracy: 0.5325654091551406
Predicted 1 Predicted 0
Actual 1 4525 5887
Actual 0 5029 7912
*************** Split 6 ***************
Accuracy: 0.5332933670192267
Predicted 1 Predicted 0
Actual 1 4960 5533
Actual 0 5366 7494
*************** Split 7 ***************
Accuracy: 0.5422429666424015
Predicted 1 Predicted 0
Actual 1 3739 6704
Actual 0 3986 8924
*************** Split 8 ***************
Accuracy: 0.5345351774932556
def support_vector_machine_model(X_train, y_train, X_test, **params):
# X = (X - µ) / σ
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
params['loss'] = 'hinge'
clf = SGDClassifier(**params)
clf.fit(X_train, y_train)
return clf.predict(X_test), clf
best_params_svc = {}
model = support_vector_machine_model
accuracy = test_accuracy(model=model, params=best_params_logistic, print_steps=Tru
print('-' * 50)
'Average unoptimized accuracy: %f' % accuracy
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 11/22
Optimizing the Support Vector Machine
Model
By running SVM model one time with the built in parameters, we got an average accuracy of 0.531
from 8 splits. To try to improve this, we will do a few things listed below:
First, we want to do the 80/20 split 5 times and average those results to get a better accuracy. By
splitting the training and test sets up multiple times, we can minimize the effects of outliers.
Second, we want to see how changing the value of alpha, number of iterations, and penalty will
affect the accuracy. To do this we have another for loop which sets alpha at 20 linear increments
from 0.00001 to 0.01. The number of iterations could be 10, 15, 30, 60, or 100 and penalty could be
L1 or L2.
We found that the optimal value for alpha is 0.00421. The optimal penalty is L2 at 100 iterations.
Alpha is just a constant multiplied to the regularization term so our value of 0.00421 is expected.
Alpha could be used again if we set the learning rate to optimal but we will not do that for this mini
lab.
Epsilon was not changed because the results had noisy accuracy and we decided to remove it.
We found L2, the squared error, is slightly more accurate than L1, the error. This was expected
because L2 is typically better for minimizing error than L1 and L2 is standard for linear SVM models,
where it performed the best for our model.
The iteration number vs accuracy should be a fairly random distribution. We expected to get
different results each time and expected that they would be about our initial accuracy, 0.53 +/- 0.1.
This time, 100 iterations is the optimal number. Although the accuracy per iteration was still going
up, we had to stop at 100 iterations due to running time restraints.
Out[14]: 'Average unoptimized accuracy: 0.531730'
Predicted 1 Predicted 0
Actual 1 3902 6555
Actual 0 4315 8581
--------------------------------------------------
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 12/22
In [20]:
100%|██████████| 5/5 [15:18<00:00, 208.83s/it]
Best n_iter 100
100%|██████████| 20/20 [2:06:52<00:00, 384.29s/it]
Best alpha 0.00421631578947
test_params = [
('n_iter', [10, 15, 30, 60, 100]),
('alpha', np.linspace(0.00001, 0.01, 20)),
('penalty', ['l1', 'l2'])
]
model = support_vector_machine_model
for param, test_values in test_params:
best_params_svc[param] = find_optimal_accuracy(
model=model,
param=param,
param_values=test_values,
params=best_params_svc
)
print("Best", param, best_params_svc[param])
time.sleep(1)
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 13/22
Optimized Support Vector Machine Model
Performance
Once we plugged in all optimal values into the model, the final accuracy became 0.560, which is
better than that of 0.531 from default parameters. Due to the dataset being very complicated, no
large improvement in accuracy was expected but we were pleased that this improvement was more
than the improvement from logistic regression.
In [21]:
Comparing the Results of the Two Models
Here is an accuracy vs time comparison of the two models with parameters optimized. Although the
optimized logistic regression model performed better than that of the SVM model, the difference in
accuracy is not significant.
In [25]:
Back to Top
100%|██████████| 2/2 [21:15<00:00, 739.44s/it]
Best penalty l2
Optimized SVC Accuracy 0.560202
3min 59s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Out[25]: Accuracy Time
Logistic Regression 0.568 1 min
Support Vector Machine 0.560 4 min
%%timeit -n1 -r1
accuracy = test_accuracy(
support_vector_machine_model, n_splits=5, params=best_params_svc)
print('Optimized SVC Accuracy %f' % accuracy)
pd.DataFrame([[0.568, '1 min'], [0.56, '4 min']], columns=['Accuracy', 'Time'], i
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 14/22
Advantages of Each Model
[10 points]
Discuss the advantages of each model for each classification task. Does one type of model offer
superior performance over another in terms of prediction accuracy? In terms of training time or
efficiency? Explain in detail.
Model Advantages
Advantages in Accuracy
Logistic regression runs best when there is a single linear decision boundary. However, our dataset
is a fairly hard problem to solve and the decision line is not very smooth. We know this because we
ran logistic regression one time using the built in parameters, we got an accuracy of 0.554. After
optimizing for iterations, alpha, epsilon, and L1 and L2 penalties, we were only able to obtain a final
accuracy of 0.568. This could be an indication of optimizing parameters individually, where
interaction between the terms were not considered and we also have a high risk of overfitting our
model.
The advantages of support vector machines is that we can fit a region for a decision boundary, we
are not constrained to a single line as above. We thought this would be better for our dataset
because we have so many factors and do not think the boundaries are a clear linear line. We were
suprised when we ran SDG with basic parameters (set alpha but did not test for optimization) and
we found an accuracy of 0.531 (less than that of logistic regression but could be due to lack of
optimization). After optimizing, we were able to achieve an accuracy of 0.560. This is a slight
improvement, where we were still optimizing parameters individually, where interaction between the
terms were not considered and we also have a high risk of overfitting our model.
Advantages in Time and Efficiency
For the sklearn functions that were considered, SVC with a linear kernel calculates the distance
between each point in the dataset. Thus, the run time is essentially number of features multiplied by
the number of observations squared. In other words, longer than the patience of some team
members to watch it complete and is the slowest method we used.
As mentioned above this was improved by LinearSVC because it is implemented using liblinear
which uses a linear SVC and a logistic regression. This means run time is log linear times linear
which is better than SVC.
Logistic regression uses the liblinear library and uses a one vs the rest algorithm. This means that
the run time is in log linear time, improving the efficiency from SVC functions.
SGDClassifier is fastest and arguably linear, which to a software engineer a matrix can only run in n
* m for n the number of features and m the number of observations. It's convergence to a solution
depending on the loss setting means it uses only a subset of the dataset also improving time.
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 15/22
In terms of our dataset, we have about 2000 features which is a sparse dataset. Logistic regression
turned out to be the fastest but SDGClassifier is a close second.
Conclusion
SDG with loss = "log" was our best performer in terms of accuracy (0.568) and was our fastest
algorithm. So, we decided that the logistic regression model was best for our dataset.
Back to Top
Feature Importance
[30 points]
Use the weights from logistic regression to interpret the importance of different features for the
classification task. Explain your interpretation in detail. Why do you think some variables are more
important?
Logistic Regression Feature Importance
Above we chose the logistic regression model over the SVM so we have pulled out the top 50
variables in our dataset below.
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 16/22
In [15]:
Interpret Feature Importance
_, clf = logistic_regression_model(X_train=X, y_train=y, X_test=X, **best_params_
abs_coefs = np.abs(clf.coef_[0])
top_50_vars = pd.Series(abs_coefs, index=X.columns).sort_values().index[:50]
importance_top_50 = pd.Series(clf.coef_[0], index=X.columns).loc[top_50_vars]
plt.figure(figsize=(15, 20))
importance_top_50.plot(kind='barh')
plt.title('Logistic Regression Feature Importance (TOP 50 Variables)')
plt.xlabel('Weight', fontsize=15)
plt.ylabel('Feature', fontsize=15);
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 17/22
The features
After scaling the continuous variables, we found that 49 of the top 50 important features were a
flavor of propertyzoningdesc (per county). This means that these variables are the showed the
most importance for predicting logerror, interestingly weighted on both sides of + or -. We think
that this could be due to the fact that neighborhoods based on location and amenities near by highly
dictate the sales price of a property or highly sway the difference in price vs estimation.
Assessmentyear was also an important feature because when the property was assessed could be
a strong indicator if the sales price of a property appreciated or depreciated in price.
Outside of the top 50 important features, propertytax was also a "big" factor. We say big because
after propertyzoningdesc is accounted for the weights become exponentially smaller. Perhaps the
more land owners pay for property tax could better predict property value because more amenities
could be added for a richer neighborhood than that for a poorer.
The weights
There are over 2000 variables in our dataset and we have a good amount of missing values so our
dataset is already fairly sparse. This could be why our largest weight was under 0.002. We also
found that only 36 features had a weight higher than 0.0005. While this sounds like good news (yay
only 36 features to key in on!) all of these were flavors of property zoning.
Back to Top
Insights
[10 points]
Look at the chosen support vectors for the classification task. Do these provide any insight into the
data? Explain. If you used stochastic gradient descent (and therefore did not explicitly solve for
support vectors), try subsampling your data to train the SVC model — then analyze the support
vectors from the subsampled dataset.
Interpret the Support Vectors
We reviewed support vectors for logerror equals 0 (for negative values) versus 1 (for positive
values) for a number of features using Kernel Density Estimation (KDE). From our analysis of
feature importance, we found significance for propertyzoningdesc. Unfortunately, these were
categorical variables so we were unable to make a comparison of how much more effective the
support vectors were at defining the classification boundaries. However, we are able to perform the
KDE for the tax amount features.
We selected features that are intuitively relevant in the real estate industry for predicting sale price,
e.g. bathrooms, bedrooms, square footage, year built. Since we are testing logerror, and not Sale
Price, we didn't see any significant differences between the original and the chosen support vectors
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 18/22
for these features. In other words, it wouldn't be unusual to see the original logerror approximating
the support vectors, since the effects of these features have already been backed into logerror.
These are the features where SVM resembled the original data.
bathroomcnt, fullbathcnt, calculatedbathnbr - For all these features, the original
data had distinct numbers that were in increments of 0.5 while the SVM model had the
same shape as the original data set but with continuous variables. Negative and positive
logerror were consistent between the original and the resulting support vectors.
yearbuilt - Original data had more details with additional curvature while the SVM model
had the same shape as the original data set but with less details in curvature. The curves
for positive (1) and negative (0) logerror follow the same shape, as seen in both graphs.
Negative and positive logerror were consistent between the original and the resulting
support vectors.
Tax related features - taxamount and taxvaluedollarcnt - tell a different story and we found
differences between the original and SVM. The original data had positive (1) and negative (0) error
peaks at around the same dollar amounts with very minor peaks at higher values. The SVM model
did not really follow the original graph shape and instead exaggerated the second peak. Also, it had
logerror 0 surpassing logerror 1 for the highest density, where the original data portrayed the
opposite. The SVM model for taxamount and taxvaluedollarcnt did not approximate the original
data.
In [6]:
Density Graph of Positive (1) and Negative (0) Logerror for
Six Variables
For bathroomcnt, the original data had distinct numbers that were in increments of 0.5 while the
SVM model had the same shape as the original data set but with continuous variables. Negative
logerror had a larger area underneath the curve than positive logerror, as seen in both graphs. The
SVM model for bathroomcnt overall did good in preserving original data integrity.
Number of support vectors for each feature: [333 420]
from sklearn.svm import SVC
clf = SVC(kernel='linear', max_iter=500)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
clf.fit(X_scaled, y)
# this hold the indexes of the support vectors
clf.support_
# this holds a subset of the data which is used for support vectors
support_vectors = pd.DataFrame(clf.support_vectors_, columns=X.columns)
# get number of support vectors for each class
print('Number of support vectors for each feature:', clf.n_support_)
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 19/22
For fullbathcnt, the original data had distinct numbers that were in increments of 0.5 while the
SVM model had the same shape as the original data set but with continuous variables. The curves
for positive and negative logerror follow the same shape, as seen in both graphs. The SVM model
for fullbathcnt overall did good in preserving original data integrity.
For taxamount and taxvaluedollarcnt, the original data had positive and negative error peak at
around the same dollar amount with a very minor peak at an higher value. The SVM model did not
really follow the original graph shape and instead exaggerated the second peak. Also, it had
logerror 1 surpassing logerror 0 for the highest density, where the original data portrayed the
opposite. The SVM model for taxamount and taxvaluedollarcnt did not do well in preserving
original data integrity.
For calculatedbathnbr, the original data had distinct numbers that were in increments of 0.5
while the SVM model had the same shape as the original data set but with continuous variables.
The curves for positive and negative logerror follow the same shape, as seen in both graphs. The
SVM model for calculatedbathnbr overall did good in preserving original data integrity.
For yearbuilt, the original data had more details in additional curvature while the SVM model had
the same shape as the original data set but with lesser details in curvature. The curves for positive
and negative logerror follow the same shape, as seen in both graphs. The SVM model for
yearbuilt overall did good in preserving original data integrity.
Overall, the SVM model kept data integrity for bathroomcnt, fullbathcnt, calculatedbathnbr,
and yearbuilt but not really for taxamount and taxvaluedollarcnt.
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 20/22
In [14]: V_grouped = support_vectors.groupby(y.loc[clf.support_].values)
X_grouped = X.groupby(y.values)
vars_to_plot = ['bathroomcnt','fullbathcnt','calculatedbathnbr',
'yearbuilt','taxamount','taxvaluedollarcnt']
for v in vars_to_plot:
plt.figure(figsize=(10,4)).subplots_adjust(wspace=.4)
plt.subplot(1,2,1)
V_grouped[v].plot.kde()
plt.legend(['logerror 0','logerror 1'])
plt.title(v+' (Instances chosen as Support Vectors)')
plt.subplot(1,2,2)
X_grouped[v].plot.kde()
plt.legend(['logerror 0','logerror 1'])
plt.title(v+' (Original)')
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 21/22
1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion
http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 22/22
Back to Top
References:
Kernels from Kaggle competition: https://www.kaggle.com/c/zillow-prize-1/kernels
(https://www.kaggle.com/c/zillow-prize-1/kernels)
Scikitlearn logistic regression: http://scikit-
learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
(http://scikit-
learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
Scikitlearn linear SVC: http://scikit-
learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html (http://scikit-
learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)
Stackoverflow pandas questions: https://stackoverflow.com/questions/tagged/pandas
(https://stackoverflow.com/questions/tagged/pandas)
Scikitlearn SDGClassfier: http://scikit-
learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html (http://scikit-
learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)

Recomendados

Data preprocessing for Machine Learning with R and Python von
Data preprocessing for Machine Learning with R and PythonData preprocessing for Machine Learning with R and Python
Data preprocessing for Machine Learning with R and PythonAkhilesh Joshi
648 views20 Folien
Map reduce: beyond word count von
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word countJeff Patti
11.3K views29 Folien
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by... von
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Spark Summit
1.3K views37 Folien
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati... von
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...Vivian S. Zhang
1.3K views11 Folien
Java 8 monads von
Java 8   monadsJava 8   monads
Java 8 monadsAsela Illayapparachchi
168 views15 Folien
Enhancing Spark SQL Optimizer with Reliable Statistics von
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsJen Aman
3.1K views24 Folien

Más contenido relacionado

Was ist angesagt?

An introduction to Test Driven Development on MapReduce von
An introduction to Test Driven Development on MapReduceAn introduction to Test Driven Development on MapReduce
An introduction to Test Driven Development on MapReduceAnanth PackkilDurai
879 views24 Folien
Big data unit iv and v lecture notes qb model exam von
Big data unit iv and v lecture notes   qb model examBig data unit iv and v lecture notes   qb model exam
Big data unit iv and v lecture notes qb model examIndhujeni
724 views70 Folien
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell von
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellAn Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellDatabricks
2.5K views31 Folien
Testing Hadoop jobs with MRUnit von
Testing Hadoop jobs with MRUnitTesting Hadoop jobs with MRUnit
Testing Hadoop jobs with MRUnitEric Wendelin
11.6K views27 Folien
Time Series Analysis for Network Secruity von
Time Series Analysis for Network SecruityTime Series Analysis for Network Secruity
Time Series Analysis for Network Secruitymrphilroth
6.3K views46 Folien
AJUG April 2011 Cascading example von
AJUG April 2011 Cascading exampleAJUG April 2011 Cascading example
AJUG April 2011 Cascading exampleChristopher Curtin
475 views6 Folien

Was ist angesagt?(20)

An introduction to Test Driven Development on MapReduce von Ananth PackkilDurai
An introduction to Test Driven Development on MapReduceAn introduction to Test Driven Development on MapReduce
An introduction to Test Driven Development on MapReduce
Big data unit iv and v lecture notes qb model exam von Indhujeni
Big data unit iv and v lecture notes   qb model examBig data unit iv and v lecture notes   qb model exam
Big data unit iv and v lecture notes qb model exam
Indhujeni724 views
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell von Databricks
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellAn Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
Databricks2.5K views
Testing Hadoop jobs with MRUnit von Eric Wendelin
Testing Hadoop jobs with MRUnitTesting Hadoop jobs with MRUnit
Testing Hadoop jobs with MRUnit
Eric Wendelin11.6K views
Time Series Analysis for Network Secruity von mrphilroth
Time Series Analysis for Network SecruityTime Series Analysis for Network Secruity
Time Series Analysis for Network Secruity
mrphilroth6.3K views
Deep Convolutional GANs - meaning of latent space von Hansol Kang
Deep Convolutional GANs - meaning of latent spaceDeep Convolutional GANs - meaning of latent space
Deep Convolutional GANs - meaning of latent space
Hansol Kang657 views
GeoMesa on Apache Spark SQL with Anthony Fox von Databricks
GeoMesa on Apache Spark SQL with Anthony FoxGeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony Fox
Databricks2.7K views
Scott Anderson [InfluxData] | InfluxDB Tasks – Beyond Downsampling | InfluxDa... von InfluxData
Scott Anderson [InfluxData] | InfluxDB Tasks – Beyond Downsampling | InfluxDa...Scott Anderson [InfluxData] | InfluxDB Tasks – Beyond Downsampling | InfluxDa...
Scott Anderson [InfluxData] | InfluxDB Tasks – Beyond Downsampling | InfluxDa...
InfluxData136 views
Machinelearning Spark Hadoop User Group Munich Meetup 2016 von Comsysto Reply GmbH
Machinelearning Spark Hadoop User Group Munich Meetup 2016Machinelearning Spark Hadoop User Group Munich Meetup 2016
Machinelearning Spark Hadoop User Group Munich Meetup 2016
Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016 von Comsysto Reply GmbH
Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016
Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016
Michael Häusler – Everyday flink von Flink Forward
Michael Häusler – Everyday flinkMichael Häusler – Everyday flink
Michael Häusler – Everyday flink
Flink Forward8.5K views
Agile NCR 2013- Anirudh Bhatnagar - Hadoop unit testing agile ncr von AgileNCR2013
Agile NCR 2013- Anirudh Bhatnagar - Hadoop unit testing agile ncr Agile NCR 2013- Anirudh Bhatnagar - Hadoop unit testing agile ncr
Agile NCR 2013- Anirudh Bhatnagar - Hadoop unit testing agile ncr
AgileNCR20131.4K views
Cheat Sheet for Machine Learning in Python: Scikit-learn von Karlijn Willems
Cheat Sheet for Machine Learning in Python: Scikit-learnCheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learn
Karlijn Willems4.3K views
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in... von Data Con LA
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Data Con LA871 views
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn von Arnaud Joly
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnNumerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Arnaud Joly3.2K views
Stratosphere Intro (Java and Scala Interface) von Robert Metzger
Stratosphere Intro (Java and Scala Interface)Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)
Robert Metzger5.4K views
Vector class in C++ von Jawad Khan
Vector class in C++Vector class in C++
Vector class in C++
Jawad Khan2.4K views
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013 von Robert Metzger
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Robert Metzger5K views

Similar a Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regression Model Performance, Optimizing Support Vector Machine Classifier, Accuracy of results and efficiency, Logistic Regression Feature Importance, interpretation, Density Graph

Lab 2: Classification and Regression Prediction Models, training and testing ... von
Lab 2: Classification and Regression Prediction Models, training and testing ...Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Yao Yao
421 views43 Folien
- K-Nearest Neighbours Classifier Now we can start building the actua.pdf von
- K-Nearest Neighbours Classifier Now we can start building the actua.pdf- K-Nearest Neighbours Classifier Now we can start building the actua.pdf
- K-Nearest Neighbours Classifier Now we can start building the actua.pdfinfo893569
2 views4 Folien
maXbox starter69 Machine Learning VII von
maXbox starter69 Machine Learning VIImaXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VIIMax Kleiner
117 views7 Folien
Machine Learning - Simple Linear Regression von
Machine Learning - Simple Linear RegressionMachine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionSiddharth Shrivastava
706 views11 Folien
Decision Tree.pptx von
Decision Tree.pptxDecision Tree.pptx
Decision Tree.pptxRamakrishna Reddy Bijjam
126 views23 Folien
PPT on Data Science Using Python von
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using PythonNishantKumar1179
7.6K views42 Folien

Similar a Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regression Model Performance, Optimizing Support Vector Machine Classifier, Accuracy of results and efficiency, Logistic Regression Feature Importance, interpretation, Density Graph(20)

Lab 2: Classification and Regression Prediction Models, training and testing ... von Yao Yao
Lab 2: Classification and Regression Prediction Models, training and testing ...Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...
Yao Yao421 views
- K-Nearest Neighbours Classifier Now we can start building the actua.pdf von info893569
- K-Nearest Neighbours Classifier Now we can start building the actua.pdf- K-Nearest Neighbours Classifier Now we can start building the actua.pdf
- K-Nearest Neighbours Classifier Now we can start building the actua.pdf
info8935692 views
maXbox starter69 Machine Learning VII von Max Kleiner
maXbox starter69 Machine Learning VIImaXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VII
Max Kleiner117 views
# Produce the features of a testing data instance X_new = np. arr.pdf von info893569
# Produce the features of a testing data instance X_new = np. arr.pdf# Produce the features of a testing data instance X_new = np. arr.pdf
# Produce the features of a testing data instance X_new = np. arr.pdf
info8935692 views
Linear Regression (Machine Learning) von Omkar Rane
Linear Regression (Machine Learning)Linear Regression (Machine Learning)
Linear Regression (Machine Learning)
Omkar Rane122 views
ML-Ops how to bring your data science to production von Herman Wu
ML-Ops  how to bring your data science to productionML-Ops  how to bring your data science to production
ML-Ops how to bring your data science to production
Herman Wu5.7K views
maxbox starter60 machine learning von Max Kleiner
maxbox starter60 machine learningmaxbox starter60 machine learning
maxbox starter60 machine learning
Max Kleiner96 views
Viktor Tsykunov: Azure Machine Learning Service von Lviv Startup Club
Viktor Tsykunov: Azure Machine Learning ServiceViktor Tsykunov: Azure Machine Learning Service
Viktor Tsykunov: Azure Machine Learning Service
Lviv Startup Club140 views
R programming & Machine Learning von AmanBhalla14
R programming & Machine LearningR programming & Machine Learning
R programming & Machine Learning
AmanBhalla14676 views
Spark ml streaming von Adam Doyle
Spark ml streamingSpark ml streaming
Spark ml streaming
Adam Doyle242 views

Más de Yao Yao

Lessons after working as a data scientist for 1 year von
Lessons after working as a data scientist for 1 yearLessons after working as a data scientist for 1 year
Lessons after working as a data scientist for 1 yearYao Yao
440 views42 Folien
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S... von
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...Yao Yao
881 views52 Folien
Yelp's Review Filtering Algorithm Paper von
Yelp's Review Filtering Algorithm PaperYelp's Review Filtering Algorithm Paper
Yelp's Review Filtering Algorithm PaperYao Yao
360 views33 Folien
Yelp's Review Filtering Algorithm Poster von
Yelp's Review Filtering Algorithm PosterYelp's Review Filtering Algorithm Poster
Yelp's Review Filtering Algorithm PosterYao Yao
246 views1 Folie
Yelp's Review Filtering Algorithm Powerpoint von
Yelp's Review Filtering Algorithm PowerpointYelp's Review Filtering Algorithm Powerpoint
Yelp's Review Filtering Algorithm PowerpointYao Yao
307 views33 Folien
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model von
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelYao Yao
137 views21 Folien

Más de Yao Yao(19)

Lessons after working as a data scientist for 1 year von Yao Yao
Lessons after working as a data scientist for 1 yearLessons after working as a data scientist for 1 year
Lessons after working as a data scientist for 1 year
Yao Yao440 views
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S... von Yao Yao
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao MSDS Alum The Job Search Interview Offer Letter Experience for Data S...
Yao Yao881 views
Yelp's Review Filtering Algorithm Paper von Yao Yao
Yelp's Review Filtering Algorithm PaperYelp's Review Filtering Algorithm Paper
Yelp's Review Filtering Algorithm Paper
Yao Yao360 views
Yelp's Review Filtering Algorithm Poster von Yao Yao
Yelp's Review Filtering Algorithm PosterYelp's Review Filtering Algorithm Poster
Yelp's Review Filtering Algorithm Poster
Yao Yao246 views
Yelp's Review Filtering Algorithm Powerpoint von Yao Yao
Yelp's Review Filtering Algorithm PowerpointYelp's Review Filtering Algorithm Powerpoint
Yelp's Review Filtering Algorithm Powerpoint
Yao Yao307 views
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model von Yao Yao
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Yao Yao137 views
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model von Yao Yao
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov ModelAudio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Audio Separation Comparison: Clustering Repeating Period and Hidden Markov Model
Yao Yao68 views
Estimating the initial mean number of views for videos to be on youtube's tre... von Yao Yao
Estimating the initial mean number of views for videos to be on youtube's tre...Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...
Yao Yao129 views
Estimating the initial mean number of views for videos to be on youtube's tre... von Yao Yao
Estimating the initial mean number of views for videos to be on youtube's tre...Estimating the initial mean number of views for videos to be on youtube's tre...
Estimating the initial mean number of views for videos to be on youtube's tre...
Yao Yao46 views
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai... von Yao Yao
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
Lab 3: Attribute Visualization, Continuous Variable Correlation Heatmap, Trai...
Yao Yao210 views
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin... von Yao Yao
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Contin...
Yao Yao277 views
Prediction of Future Employee Turnover via Logistic Regression von Yao Yao
Prediction of Future Employee Turnover via Logistic RegressionPrediction of Future Employee Turnover via Logistic Regression
Prediction of Future Employee Turnover via Logistic Regression
Yao Yao1.2K views
Data Reduction and Classification for Lumosity Data von Yao Yao
Data Reduction and Classification for Lumosity DataData Reduction and Classification for Lumosity Data
Data Reduction and Classification for Lumosity Data
Yao Yao132 views
Predicting Sales Price of Homes Using Multiple Linear Regression von Yao Yao
Predicting Sales Price of Homes Using Multiple Linear RegressionPredicting Sales Price of Homes Using Multiple Linear Regression
Predicting Sales Price of Homes Using Multiple Linear Regression
Yao Yao224 views
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform von Yao Yao
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Yao Yao136 views
Blockchain Security and Demonstration von Yao Yao
Blockchain Security and DemonstrationBlockchain Security and Demonstration
Blockchain Security and Demonstration
Yao Yao556 views
API Python Chess: Distribution of Chess Wins based on random moves von Yao Yao
API Python Chess: Distribution of Chess Wins based on random movesAPI Python Chess: Distribution of Chess Wins based on random moves
API Python Chess: Distribution of Chess Wins based on random moves
Yao Yao793 views
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform von Yao Yao
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Yao Yao457 views
Blockchain Security and Demonstration von Yao Yao
Blockchain Security and DemonstrationBlockchain Security and Demonstration
Blockchain Security and Demonstration
Yao Yao501 views

Último

[DSC Europe 23][DigiHealth] Muthu Ramachandran AI and Blockchain Framework fo... von
[DSC Europe 23][DigiHealth] Muthu Ramachandran AI and Blockchain Framework fo...[DSC Europe 23][DigiHealth] Muthu Ramachandran AI and Blockchain Framework fo...
[DSC Europe 23][DigiHealth] Muthu Ramachandran AI and Blockchain Framework fo...DataScienceConferenc1
9 views77 Folien
Ukraine Infographic_22NOV2023_v2.pdf von
Ukraine Infographic_22NOV2023_v2.pdfUkraine Infographic_22NOV2023_v2.pdf
Ukraine Infographic_22NOV2023_v2.pdfAnastosiyaGurin
1.4K views3 Folien
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion von
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionGames, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionBertram Ludäscher
9 views37 Folien
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an... von
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...StatsCommunications
7 views26 Folien
[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init... von
[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init...[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init...
[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init...DataScienceConferenc1
5 views18 Folien
Amy slides.pdf von
Amy slides.pdfAmy slides.pdf
Amy slides.pdfStatsCommunications
5 views13 Folien

Último(20)

[DSC Europe 23][DigiHealth] Muthu Ramachandran AI and Blockchain Framework fo... von DataScienceConferenc1
[DSC Europe 23][DigiHealth] Muthu Ramachandran AI and Blockchain Framework fo...[DSC Europe 23][DigiHealth] Muthu Ramachandran AI and Blockchain Framework fo...
[DSC Europe 23][DigiHealth] Muthu Ramachandran AI and Blockchain Framework fo...
Ukraine Infographic_22NOV2023_v2.pdf von AnastosiyaGurin
Ukraine Infographic_22NOV2023_v2.pdfUkraine Infographic_22NOV2023_v2.pdf
Ukraine Infographic_22NOV2023_v2.pdf
AnastosiyaGurin1.4K views
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion von Bertram Ludäscher
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionGames, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an... von StatsCommunications
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...
[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init... von DataScienceConferenc1
[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init...[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init...
[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init...
SUPER STORE SQL PROJECT.pptx von khan888620
SUPER STORE SQL PROJECT.pptxSUPER STORE SQL PROJECT.pptx
SUPER STORE SQL PROJECT.pptx
khan88862013 views
Data Journeys Hard Talk workshop final.pptx von info828217
Data Journeys Hard Talk workshop final.pptxData Journeys Hard Talk workshop final.pptx
Data Journeys Hard Talk workshop final.pptx
info82821711 views
6498-Butun_Beyinli_Cocuq-Daniel_J.Siegel-Tina_Payne_Bryson-2011-259s.pdf von 10urkyr34
6498-Butun_Beyinli_Cocuq-Daniel_J.Siegel-Tina_Payne_Bryson-2011-259s.pdf6498-Butun_Beyinli_Cocuq-Daniel_J.Siegel-Tina_Payne_Bryson-2011-259s.pdf
6498-Butun_Beyinli_Cocuq-Daniel_J.Siegel-Tina_Payne_Bryson-2011-259s.pdf
10urkyr347 views
Best Home Security Systems.pptx von mogalang
Best Home Security Systems.pptxBest Home Security Systems.pptx
Best Home Security Systems.pptx
mogalang9 views
Dr. Ousmane Badiane-2023 ReSAKSS Conference von AKADEMIYA2063
Dr. Ousmane Badiane-2023 ReSAKSS ConferenceDr. Ousmane Badiane-2023 ReSAKSS Conference
Dr. Ousmane Badiane-2023 ReSAKSS Conference
AKADEMIYA20635 views
Listed Instruments Survey 2022.pptx von secretariat4
Listed Instruments Survey  2022.pptxListed Instruments Survey  2022.pptx
Listed Instruments Survey 2022.pptx
secretariat493 views
CRIJ4385_Death Penalty_F23.pptx von yvettemm100
CRIJ4385_Death Penalty_F23.pptxCRIJ4385_Death Penalty_F23.pptx
CRIJ4385_Death Penalty_F23.pptx
yvettemm1007 views
[DSC Europe 23][AI:CSI] Dragan Pleskonjic - AI Impact on Cybersecurity and P... von DataScienceConferenc1
[DSC Europe 23][AI:CSI]  Dragan Pleskonjic - AI Impact on Cybersecurity and P...[DSC Europe 23][AI:CSI]  Dragan Pleskonjic - AI Impact on Cybersecurity and P...
[DSC Europe 23][AI:CSI] Dragan Pleskonjic - AI Impact on Cybersecurity and P...

Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regression Model Performance, Optimizing Support Vector Machine Classifier, Accuracy of results and efficiency, Logistic Regression Feature Importance, interpretation, Density Graph

  • 1. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 1/22 Mini-lab 1: Zillow Dataset Logistic Regression and SVMs MSDS 7331 Data Mining - Section 403 - Mini Lab 1 Team: Ivelin Angelov, Yao Yao, Kaitlin Kirasich, Albert Asuncion Contents Imports Models Advantages of Each Model Feature Importance Insights References Imports We chose to use the same Zillow dataset from Lab 1 for this exploration in logistic regression and SVM. For origin and purpose of dataset as well as a detailed description of the dataset, refer to https://github.com/post2web/data_mining_group_project/blob/master/notebooks/lab1.ipynb (https://github.com/post2web/data_mining_group_project/blob/master/notebooks/lab1.ipynb). In [1]: Load Data, Create y and X Since we are using the Zillow dataset from our previous lab, the cleanup files were exported from lab 1 into mini-lab 1. Note that for logistic regression and support vector classifier models, we choose to use mostly complete continuous variables as well as create dummy variables for nominal %matplotlib inline import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns import warnings from sklearn.linear_model import LogisticRegression, SGDClassifier from sklearn.metrics import confusion_matrix from sklearn.model_selection import train_test_split from sklearn.svm import SVC, LinearSVC from sklearn.preprocessing import StandardScaler, MinMaxScaler from tqdm import tqdm import time from collections import OrderedDict warnings.filterwarnings('ignore')
  • 2. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 2/22 variables to cross compare the performance, feature importance, and insights of each model. X is the training set and y is the test set, where we are testing if our models can accurately predict positive (1) logerrror from that of negative (0). Data columns that are only available for the training set and not the test set (transaction date) were removed. parcelid was removed because each individual property has its own ID and does not correlate well with regression or SVMs. The column that was created for "New Features" from Lab 1 (city and pricepersqft) were also removed for the sake of simplicity of only using original data for the prediction process. In [2]: Dealing with Nominal Data Nominal data usually has more than two values. For logistic regression and SVMs, we created dummy variables that only factor in 0s and 1s for the prediction process of logistic regression and SVMs. In [3]: Dealing with Continuous Data StandardScaler from sklearn was applied to the continuous data columns to standardize the dataset around center 0 with equal variance for creating normal distributions prior to the application of logistic regression and SVMs. In [4]: Out[2]: 'The dataset has 116761 rows and 49 columns' # load datasets here: variables = pd.read_csv('../../datasets/variables.csv').set_index('name') X = pd.read_csv('../../datasets/train.csv', low_memory=False) y = (X['logerror'] > 0).astype(np.int32) del X['logerror'] del X['transactiondate'] del X['parcelid'] del X['city'] del X['price_per_sqft'] 'The dataset has %d rows and %d columns' % X.shape nominal = variables[variables['type'].isin(['nominal'])] nominal = nominal[nominal.index.isin(X.columns)] nominal_data = X[nominal.index] nominal_data = pd.get_dummies(nominal_data, drop_first=True) nominal_data = nominal_data[nominal_data.columns[~nominal_data.columns.isin(nomina continuous = variables[~variables['type'].isin(['nominal'])] continuous = continuous[continuous.index.isin(X.columns)] continuous_data = X[continuous.index]
  • 3. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 3/22 Merging the Data The data was then merged for the application of logistic regression and SVM prediction. The following shows the final shape of the dataset after the application of dummy variables and StandardScaler. In [5]: Back to Top Models [50 points] Create a logistic regression model and a support vector machine model for the classification task involved with your dataset. Assess how well each model performs (use 80/20 training/testing split for your data). Adjust parameters of the models to make them more accurate. If your dataset size requires the use of stochastic gradient descent, then linear kernel only is fine to use. That is, the SGDClassifier is fine to use for optimizing logistic regression and linear support vector machines. For many problems, SGD will be required in order to train the SVM model in a reasonable timeframe. Create Models SGDClassifier Over the Other Sklearn Functions We tried out a few sklearn support vector machine functions and noticed that the accuracy was similar for each but with such a large dataset we decided to try to cut down on the time for logistic regression. First, we tried SVC setting kernel = 'linear' but waited a long time for it to finish. Next, we tried LinearSVC because the liblinear library it uses tends to be faster to converge the larger the number of samples is than the libsvm library. Finally tried SGDClassifier with loss = 'log' which was exponentially faster than the others so this is what we use for logistic regression. Functions to Test Accuracy These are the functions that we wrote to individually find, visualize, and report the best parameters per model, where we reuse those parameters for the optimized model. Out[5]: 'The dataset has 116761 rows and 2107 columns' X = pd.concat([continuous_data, nominal_data], axis=1) columns = X.columns 'The dataset has %d rows and %d columns' % X.shape
  • 4. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 4/22 In [7]: Logistic Regression For the logistic regression model, we created a function that took in X_train and Y_train from the original data set to test for X_test from the modified dataset. The accuracy of the logistic regression prediction for positive or negative logerror was compared with that of the original, where a confusion matrix was made to show percentage accuracy. Due to the complexity of the dataset, we are slightly better than 50% accuracy. def test_accuracy(model, n_splits=8, print_steps=False, params={}): accuracies = [] for i in range(1, n_splits+1): X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, ra yhat, _ = model( X_train=X_train, y_train=y_train, X_test=X_test, **params ) accuracy = float(sum(yhat==y_test)) / len(y_test) accuracies.append(accuracy) if print_steps: matrix = pd.DataFrame(confusion_matrix(y_test, yhat), columns=['Predicted 1', 'Predicted 0'], index=['Actual 1', 'Actual 0'], ) print('*' * 15 + ' Split %d ' % i + '*' * 15) print('Accuracy:', accuracy) print(matrix) return np.mean(accuracies) def find_optimal_accuracy(model, param, param_values, params={}): result = {} for param_value in tqdm(list(param_values)): params_local = params.copy() params_local[param] = param_value result[param_value] = test_accuracy(model, params=params_local) result = pd.Series(result).sort_index() plt.xlabel(param, fontsize=15) plt.ylabel('Accuracy', fontsize=15) optimal_param = result.argmax() optimal_accuracy = result[optimal_param] if type(param_value) == str: result.plot(kind='bar') else: result.plot() plt.show() return optimal_param
  • 5. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 5/22 In [8]: *************** Split 1 *************** Accuracy: 0.5544897871793774 Predicted 1 Predicted 0 Actual 1 5957 4488 Actual 0 5916 6992 *************** Split 2 *************** Accuracy: 0.5663940393097247 Predicted 1 Predicted 0 Actual 1 1456 8955 Actual 0 1171 11771 *************** Split 3 *************** Accuracy: 0.5502933241981758 Predicted 1 Predicted 0 Actual 1 5883 4591 Actual 0 5911 6968 *************** Split 4 *************** Accuracy: 0.5652806919881814 Predicted 1 Predicted 0 Actual 1 1783 8819 Actual 0 1333 11418 *************** Split 5 *************** Accuracy: 0.5028475998801011 Predicted 1 Predicted 0 Actual 1 8796 1616 Actual 0 9994 2947 *************** Split 6 *************** Accuracy: 0.5612126921594656 Predicted 1 Predicted 0 Actual 1 2250 8243 Actual 0 2004 10856 *************** Split 7 *************** Accuracy: 0.5663940393097247 Predicted 1 Predicted 0 Actual 1 2750 7693 Actual 0 2433 10477 *************** Split 8 *************** Accuracy: 0.5659230077506102 Predicted 1 Predicted 0 def logistic_regression_model(X_train, y_train, X_test, **params): scaler = MinMaxScaler() X_train = scaler.fit_transform(X_train) X_test = scaler.transform(X_test) params['loss'] = 'log' clf = SGDClassifier(**params) clf.fit(X_train, y_train) return clf.predict(X_test), clf best_params_logistic = {} model = logistic_regression_model accuracy = test_accuracy(model=model, params=best_params_logistic, print_steps=Tru print('-' * 50) 'Average unoptimized accuracy: %f' % accuracy
  • 6. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 6/22 Optimizing the Logistic Regression Model By running logistic regression one time with the built in parameters, we got an average accuracy of 0.554 from 8 splits. To try to improve this, we want to do a few things. First, we want to do the 80/20 split 5 times and average those results to get a better accuracy. By splitting the training and test sets up multiple times, we can minimize the effects of outliers. Second, we want to see how changing the value of alpha, epsilon, number of iterations, and penalty will affect the accuracy. To do this we have another for loop which sets alpha and epsilon at ten and twenty linear increments from 0.00001 to 0.001 and 0.01 to .5, respectively. The number of iterations could be 1, 3, 6, 10, or 15 and penalty could be L1 or L2. We found that the optimal value for alpha is 0.00023 and that for epsilon is 0.293. The optimal penalty is L2 at 15 iterations. Alpha is just a constant multiplied to the regularization term so our value of 0.00023 is expected. Alpha could be used again if we set the learning rate to optimal but we will not do that for this mini lab. Our value of 0.293 for epsilon is important in the threshold of our model which is why we ran iterations over values very close to 0. The default is 0.001 but we found values even smaller than that increased our accuracy. We found L2, the squared error, is slightly more accurate than L1, the error. This was expected because L2 is typically better for minimizing error than L1 and L2 is standard for linear SVM models, where it performed the best for our model. The iteration number vs accuracy should be a fairly random distribution. We expected to get different results each time and expected that they would be about our initial accuracy, 0.55 +/- 0.1. This time, 15 iterations is the optimal number. Although the accuracy per iteration was still going up, we had to stop at 15 iterations due to running time restraints. Out[8]: 'Average unoptimized accuracy: 0.554104' Actual 1 3354 7103 Actual 0 3034 9862 --------------------------------------------------
  • 7. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 7/22 In [9]: 100%|██████████| 5/5 [06:04<00:00, 76.82s/it] Best n_iter 15 100%|██████████| 10/10 [17:06<00:00, 102.69s/it] Best alpha 0.00023 test_params = [ ('n_iter', [1, 3, 6, 10, 15]), ('alpha', np.linspace(0.00001, 0.001, 10)), ('epsilon', np.linspace(0.01, .5, 20)), ('penalty', ['l1', 'l2']) ] for param, param_values in test_params: best_params_logistic[param] = find_optimal_accuracy( logistic_regression_model, param=param, param_values=param_values, params=best_params_logistic ) print("Best", param, best_params_logistic[param]) time.sleep(1)
  • 8. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 8/22 Optimized Logistic Regression Model Performance Once we plugged in all optimal values into the model, the final accuracy became 0.568, which is slightly better than that of 0.554 from default parameters. Due to the dataset being very complicated, no large improvement in accuracy was expected. In [10]: 100%|██████████| 20/20 [34:21<00:00, 103.28s/it] Best epsilon 0.293684210526 100%|██████████| 2/2 [04:19<00:00, 140.15s/it] Best penalty l2 Optimized Logistic Regression Accuracy 0.568090 1min 4s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each) %%timeit -n1 -r1 accuracy = test_accuracy( logistic_regression_model, n_splits=5, params=best_params_logistic) print('Optimized Logistic Regression Accuracy %f' % accuracy)
  • 9. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 9/22 Support Vector Machine Classifier For the support vector machine model, we created a function that took in X_train and Y_train from the original data set to test for X_test from the modified dataset. The accuracy of the SVM prediction for positive or negative logerror was compared with that of the original, where a confusion matrix was made to show percentage accuracy. Due to the complexity of the dataset, we are again slightly better than 50% accuracy.
  • 10. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 10/22 In [14]: *************** Split 1 *************** Accuracy: 0.532736693358455 Predicted 1 Predicted 0 Actual 1 4716 5729 Actual 0 5183 7725 *************** Split 2 *************** Accuracy: 0.5036611998458442 Predicted 1 Predicted 0 Actual 1 4259 6152 Actual 0 5439 7503 *************** Split 3 *************** Accuracy: 0.533978503832484 Predicted 1 Predicted 0 Actual 1 4864 5610 Actual 0 5273 7606 *************** Split 4 *************** Accuracy: 0.540829871965058 Predicted 1 Predicted 0 Actual 1 4296 6306 Actual 0 4417 8334 *************** Split 5 *************** Accuracy: 0.5325654091551406 Predicted 1 Predicted 0 Actual 1 4525 5887 Actual 0 5029 7912 *************** Split 6 *************** Accuracy: 0.5332933670192267 Predicted 1 Predicted 0 Actual 1 4960 5533 Actual 0 5366 7494 *************** Split 7 *************** Accuracy: 0.5422429666424015 Predicted 1 Predicted 0 Actual 1 3739 6704 Actual 0 3986 8924 *************** Split 8 *************** Accuracy: 0.5345351774932556 def support_vector_machine_model(X_train, y_train, X_test, **params): # X = (X - µ) / σ scaler = StandardScaler() X_train = scaler.fit_transform(X_train) X_test = scaler.transform(X_test) params['loss'] = 'hinge' clf = SGDClassifier(**params) clf.fit(X_train, y_train) return clf.predict(X_test), clf best_params_svc = {} model = support_vector_machine_model accuracy = test_accuracy(model=model, params=best_params_logistic, print_steps=Tru print('-' * 50) 'Average unoptimized accuracy: %f' % accuracy
  • 11. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 11/22 Optimizing the Support Vector Machine Model By running SVM model one time with the built in parameters, we got an average accuracy of 0.531 from 8 splits. To try to improve this, we will do a few things listed below: First, we want to do the 80/20 split 5 times and average those results to get a better accuracy. By splitting the training and test sets up multiple times, we can minimize the effects of outliers. Second, we want to see how changing the value of alpha, number of iterations, and penalty will affect the accuracy. To do this we have another for loop which sets alpha at 20 linear increments from 0.00001 to 0.01. The number of iterations could be 10, 15, 30, 60, or 100 and penalty could be L1 or L2. We found that the optimal value for alpha is 0.00421. The optimal penalty is L2 at 100 iterations. Alpha is just a constant multiplied to the regularization term so our value of 0.00421 is expected. Alpha could be used again if we set the learning rate to optimal but we will not do that for this mini lab. Epsilon was not changed because the results had noisy accuracy and we decided to remove it. We found L2, the squared error, is slightly more accurate than L1, the error. This was expected because L2 is typically better for minimizing error than L1 and L2 is standard for linear SVM models, where it performed the best for our model. The iteration number vs accuracy should be a fairly random distribution. We expected to get different results each time and expected that they would be about our initial accuracy, 0.53 +/- 0.1. This time, 100 iterations is the optimal number. Although the accuracy per iteration was still going up, we had to stop at 100 iterations due to running time restraints. Out[14]: 'Average unoptimized accuracy: 0.531730' Predicted 1 Predicted 0 Actual 1 3902 6555 Actual 0 4315 8581 --------------------------------------------------
  • 12. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 12/22 In [20]: 100%|██████████| 5/5 [15:18<00:00, 208.83s/it] Best n_iter 100 100%|██████████| 20/20 [2:06:52<00:00, 384.29s/it] Best alpha 0.00421631578947 test_params = [ ('n_iter', [10, 15, 30, 60, 100]), ('alpha', np.linspace(0.00001, 0.01, 20)), ('penalty', ['l1', 'l2']) ] model = support_vector_machine_model for param, test_values in test_params: best_params_svc[param] = find_optimal_accuracy( model=model, param=param, param_values=test_values, params=best_params_svc ) print("Best", param, best_params_svc[param]) time.sleep(1)
  • 13. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 13/22 Optimized Support Vector Machine Model Performance Once we plugged in all optimal values into the model, the final accuracy became 0.560, which is better than that of 0.531 from default parameters. Due to the dataset being very complicated, no large improvement in accuracy was expected but we were pleased that this improvement was more than the improvement from logistic regression. In [21]: Comparing the Results of the Two Models Here is an accuracy vs time comparison of the two models with parameters optimized. Although the optimized logistic regression model performed better than that of the SVM model, the difference in accuracy is not significant. In [25]: Back to Top 100%|██████████| 2/2 [21:15<00:00, 739.44s/it] Best penalty l2 Optimized SVC Accuracy 0.560202 3min 59s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each) Out[25]: Accuracy Time Logistic Regression 0.568 1 min Support Vector Machine 0.560 4 min %%timeit -n1 -r1 accuracy = test_accuracy( support_vector_machine_model, n_splits=5, params=best_params_svc) print('Optimized SVC Accuracy %f' % accuracy) pd.DataFrame([[0.568, '1 min'], [0.56, '4 min']], columns=['Accuracy', 'Time'], i
  • 14. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 14/22 Advantages of Each Model [10 points] Discuss the advantages of each model for each classification task. Does one type of model offer superior performance over another in terms of prediction accuracy? In terms of training time or efficiency? Explain in detail. Model Advantages Advantages in Accuracy Logistic regression runs best when there is a single linear decision boundary. However, our dataset is a fairly hard problem to solve and the decision line is not very smooth. We know this because we ran logistic regression one time using the built in parameters, we got an accuracy of 0.554. After optimizing for iterations, alpha, epsilon, and L1 and L2 penalties, we were only able to obtain a final accuracy of 0.568. This could be an indication of optimizing parameters individually, where interaction between the terms were not considered and we also have a high risk of overfitting our model. The advantages of support vector machines is that we can fit a region for a decision boundary, we are not constrained to a single line as above. We thought this would be better for our dataset because we have so many factors and do not think the boundaries are a clear linear line. We were suprised when we ran SDG with basic parameters (set alpha but did not test for optimization) and we found an accuracy of 0.531 (less than that of logistic regression but could be due to lack of optimization). After optimizing, we were able to achieve an accuracy of 0.560. This is a slight improvement, where we were still optimizing parameters individually, where interaction between the terms were not considered and we also have a high risk of overfitting our model. Advantages in Time and Efficiency For the sklearn functions that were considered, SVC with a linear kernel calculates the distance between each point in the dataset. Thus, the run time is essentially number of features multiplied by the number of observations squared. In other words, longer than the patience of some team members to watch it complete and is the slowest method we used. As mentioned above this was improved by LinearSVC because it is implemented using liblinear which uses a linear SVC and a logistic regression. This means run time is log linear times linear which is better than SVC. Logistic regression uses the liblinear library and uses a one vs the rest algorithm. This means that the run time is in log linear time, improving the efficiency from SVC functions. SGDClassifier is fastest and arguably linear, which to a software engineer a matrix can only run in n * m for n the number of features and m the number of observations. It's convergence to a solution depending on the loss setting means it uses only a subset of the dataset also improving time.
  • 15. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 15/22 In terms of our dataset, we have about 2000 features which is a sparse dataset. Logistic regression turned out to be the fastest but SDGClassifier is a close second. Conclusion SDG with loss = "log" was our best performer in terms of accuracy (0.568) and was our fastest algorithm. So, we decided that the logistic regression model was best for our dataset. Back to Top Feature Importance [30 points] Use the weights from logistic regression to interpret the importance of different features for the classification task. Explain your interpretation in detail. Why do you think some variables are more important? Logistic Regression Feature Importance Above we chose the logistic regression model over the SVM so we have pulled out the top 50 variables in our dataset below.
  • 16. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 16/22 In [15]: Interpret Feature Importance _, clf = logistic_regression_model(X_train=X, y_train=y, X_test=X, **best_params_ abs_coefs = np.abs(clf.coef_[0]) top_50_vars = pd.Series(abs_coefs, index=X.columns).sort_values().index[:50] importance_top_50 = pd.Series(clf.coef_[0], index=X.columns).loc[top_50_vars] plt.figure(figsize=(15, 20)) importance_top_50.plot(kind='barh') plt.title('Logistic Regression Feature Importance (TOP 50 Variables)') plt.xlabel('Weight', fontsize=15) plt.ylabel('Feature', fontsize=15);
  • 17. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 17/22 The features After scaling the continuous variables, we found that 49 of the top 50 important features were a flavor of propertyzoningdesc (per county). This means that these variables are the showed the most importance for predicting logerror, interestingly weighted on both sides of + or -. We think that this could be due to the fact that neighborhoods based on location and amenities near by highly dictate the sales price of a property or highly sway the difference in price vs estimation. Assessmentyear was also an important feature because when the property was assessed could be a strong indicator if the sales price of a property appreciated or depreciated in price. Outside of the top 50 important features, propertytax was also a "big" factor. We say big because after propertyzoningdesc is accounted for the weights become exponentially smaller. Perhaps the more land owners pay for property tax could better predict property value because more amenities could be added for a richer neighborhood than that for a poorer. The weights There are over 2000 variables in our dataset and we have a good amount of missing values so our dataset is already fairly sparse. This could be why our largest weight was under 0.002. We also found that only 36 features had a weight higher than 0.0005. While this sounds like good news (yay only 36 features to key in on!) all of these were flavors of property zoning. Back to Top Insights [10 points] Look at the chosen support vectors for the classification task. Do these provide any insight into the data? Explain. If you used stochastic gradient descent (and therefore did not explicitly solve for support vectors), try subsampling your data to train the SVC model — then analyze the support vectors from the subsampled dataset. Interpret the Support Vectors We reviewed support vectors for logerror equals 0 (for negative values) versus 1 (for positive values) for a number of features using Kernel Density Estimation (KDE). From our analysis of feature importance, we found significance for propertyzoningdesc. Unfortunately, these were categorical variables so we were unable to make a comparison of how much more effective the support vectors were at defining the classification boundaries. However, we are able to perform the KDE for the tax amount features. We selected features that are intuitively relevant in the real estate industry for predicting sale price, e.g. bathrooms, bedrooms, square footage, year built. Since we are testing logerror, and not Sale Price, we didn't see any significant differences between the original and the chosen support vectors
  • 18. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 18/22 for these features. In other words, it wouldn't be unusual to see the original logerror approximating the support vectors, since the effects of these features have already been backed into logerror. These are the features where SVM resembled the original data. bathroomcnt, fullbathcnt, calculatedbathnbr - For all these features, the original data had distinct numbers that were in increments of 0.5 while the SVM model had the same shape as the original data set but with continuous variables. Negative and positive logerror were consistent between the original and the resulting support vectors. yearbuilt - Original data had more details with additional curvature while the SVM model had the same shape as the original data set but with less details in curvature. The curves for positive (1) and negative (0) logerror follow the same shape, as seen in both graphs. Negative and positive logerror were consistent between the original and the resulting support vectors. Tax related features - taxamount and taxvaluedollarcnt - tell a different story and we found differences between the original and SVM. The original data had positive (1) and negative (0) error peaks at around the same dollar amounts with very minor peaks at higher values. The SVM model did not really follow the original graph shape and instead exaggerated the second peak. Also, it had logerror 0 surpassing logerror 1 for the highest density, where the original data portrayed the opposite. The SVM model for taxamount and taxvaluedollarcnt did not approximate the original data. In [6]: Density Graph of Positive (1) and Negative (0) Logerror for Six Variables For bathroomcnt, the original data had distinct numbers that were in increments of 0.5 while the SVM model had the same shape as the original data set but with continuous variables. Negative logerror had a larger area underneath the curve than positive logerror, as seen in both graphs. The SVM model for bathroomcnt overall did good in preserving original data integrity. Number of support vectors for each feature: [333 420] from sklearn.svm import SVC clf = SVC(kernel='linear', max_iter=500) scaler = StandardScaler() X_scaled = scaler.fit_transform(X) clf.fit(X_scaled, y) # this hold the indexes of the support vectors clf.support_ # this holds a subset of the data which is used for support vectors support_vectors = pd.DataFrame(clf.support_vectors_, columns=X.columns) # get number of support vectors for each class print('Number of support vectors for each feature:', clf.n_support_)
  • 19. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 19/22 For fullbathcnt, the original data had distinct numbers that were in increments of 0.5 while the SVM model had the same shape as the original data set but with continuous variables. The curves for positive and negative logerror follow the same shape, as seen in both graphs. The SVM model for fullbathcnt overall did good in preserving original data integrity. For taxamount and taxvaluedollarcnt, the original data had positive and negative error peak at around the same dollar amount with a very minor peak at an higher value. The SVM model did not really follow the original graph shape and instead exaggerated the second peak. Also, it had logerror 1 surpassing logerror 0 for the highest density, where the original data portrayed the opposite. The SVM model for taxamount and taxvaluedollarcnt did not do well in preserving original data integrity. For calculatedbathnbr, the original data had distinct numbers that were in increments of 0.5 while the SVM model had the same shape as the original data set but with continuous variables. The curves for positive and negative logerror follow the same shape, as seen in both graphs. The SVM model for calculatedbathnbr overall did good in preserving original data integrity. For yearbuilt, the original data had more details in additional curvature while the SVM model had the same shape as the original data set but with lesser details in curvature. The curves for positive and negative logerror follow the same shape, as seen in both graphs. The SVM model for yearbuilt overall did good in preserving original data integrity. Overall, the SVM model kept data integrity for bathroomcnt, fullbathcnt, calculatedbathnbr, and yearbuilt but not really for taxamount and taxvaluedollarcnt.
  • 20. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 20/22 In [14]: V_grouped = support_vectors.groupby(y.loc[clf.support_].values) X_grouped = X.groupby(y.values) vars_to_plot = ['bathroomcnt','fullbathcnt','calculatedbathnbr', 'yearbuilt','taxamount','taxvaluedollarcnt'] for v in vars_to_plot: plt.figure(figsize=(10,4)).subplots_adjust(wspace=.4) plt.subplot(1,2,1) V_grouped[v].plot.kde() plt.legend(['logerror 0','logerror 1']) plt.title(v+' (Instances chosen as Support Vectors)') plt.subplot(1,2,2) X_grouped[v].plot.kde() plt.legend(['logerror 0','logerror 1']) plt.title(v+' (Original)')
  • 22. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasi… 22/22 Back to Top References: Kernels from Kaggle competition: https://www.kaggle.com/c/zillow-prize-1/kernels (https://www.kaggle.com/c/zillow-prize-1/kernels) Scikitlearn logistic regression: http://scikit- learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html (http://scikit- learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) Scikitlearn linear SVC: http://scikit- learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html (http://scikit- learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) Stackoverflow pandas questions: https://stackoverflow.com/questions/tagged/pandas (https://stackoverflow.com/questions/tagged/pandas) Scikitlearn SDGClassfier: http://scikit- learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html (http://scikit- learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)