https://github.com/yaowser/data_mining_group_project https://www.kaggle.com/c/zillow-prize-1/data From the Zillow real estate data set of properties in the southern California area, conduct the following data cleaning, data analysis, predictive analysis, and machine learning algorithms: Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regression Model Performance, Optimizing Support Vector Machine Classifier, Accuracy of results and efficiency, Logistic Regression Feature Importance, interpretation of support vectors, Density Graph


- 1. 1/12/2018 Mini-Lab1Angelov_Yao_Kirasich_Asuncion http://localhost:8888/notebooks/Documents/GitHub/data_mining_group_project-master/notebooks/Mini%20Lab%201/Mini-Lab1Angelov_Yao_Kirasic… 1/22

Mini-lab 1: Zillow Dataset Logistic Regression and SVMs
MSDS 7331 Data Mining - Section 403 - Mini Lab 1
Team: Ivelin Angelov, Yao Yao, Kaitlin Kirasich, Albert Asuncion

Contents
Imports
Models
Advantages of Each Model
Feature Importance
Insights
References

Imports
We chose to use the same Zillow dataset from Lab 1 for this exploration of logistic regression and SVMs. For the origin and purpose of the dataset, as well as a detailed description of it, refer to https://github.com/post2web/data_mining_group_project/blob/master/notebooks/lab1.ipynb.

In [1]:
    %matplotlib inline
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import warnings
    from sklearn.linear_model import LogisticRegression, SGDClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC, LinearSVC
    from sklearn.preprocessing import StandardScaler, MinMaxScaler
    from tqdm import tqdm
    import time
    from collections import OrderedDict
    warnings.filterwarnings('ignore')

- 2. Load Data, Create y and X
Since we are using the Zillow dataset from our previous lab, the cleaned files were exported from Lab 1 into Mini-Lab 1. For the logistic regression and support vector classifier models, we chose to use the mostly complete continuous variables and to create dummy variables for the nominal variables, so that we can compare the performance, feature importance, and insights of each model.

X is the feature matrix and y is the binary target: we are testing whether our models can accurately distinguish positive logerror (1) from negative logerror (0). Columns that are only available for the training set and not the test set (transactiondate) were removed. parcelid was removed because it is a unique ID per property and is not a meaningful predictor for regression or SVMs. The columns created as "New Features" in Lab 1 (city and price_per_sqft) were also removed so that only the original data is used for prediction.

In [2]:
    # load datasets here:
    variables = pd.read_csv('../../datasets/variables.csv').set_index('name')
    X = pd.read_csv('../../datasets/train.csv', low_memory=False)
    # binary target: 1 if logerror is positive, else 0
    y = (X['logerror'] > 0).astype(np.int32)
    del X['logerror']
    del X['transactiondate']
    del X['parcelid']
    del X['city']
    del X['price_per_sqft']
    'The dataset has %d rows and %d columns' % X.shape

Out[2]: 'The dataset has 116761 rows and 49 columns'

Dealing with Nominal Data
Nominal data usually has more than two values. For logistic regression and SVMs, we created dummy variables that take only the values 0 and 1.

In [3]:
    nominal = variables[variables['type'].isin(['nominal'])]
    nominal = nominal[nominal.index.isin(X.columns)]
    nominal_data = X[nominal.index]
    nominal_data = pd.get_dummies(nominal_data, drop_first=True)
    nominal_data = nominal_data[nominal_data.columns[~nominal_data.columns.isin(nomina

Dealing with Continuous Data
StandardScaler from sklearn was applied to the continuous data columns to center each column at 0 with unit variance before applying logistic regression and SVMs.

In [4]:
    continuous = variables[~variables['type'].isin(['nominal'])]
    continuous = continuous[continuous.index.isin(X.columns)]
    continuous_data = X[continuous.index]
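As a rough illustration of the two preprocessing steps described above, the sketch below reproduces by hand what dummy encoding with the first level dropped and standardization compute on a tiny made-up column. The toy values, the zoning codes, and the helper names `one_hot_drop_first` and `standardize` are assumptions for illustration and are not part of the notebook; the notebook itself relies on `pd.get_dummies` and `StandardScaler`.

```python
from statistics import mean, pstdev

def one_hot_drop_first(values):
    """Encode a nominal column as 0/1 indicators, dropping the first level.

    With k levels, k - 1 indicators are enough: the dropped level is the
    row where every indicator is 0.
    """
    levels = sorted(set(values))
    kept = levels[1:]  # the first level (in sorted order) becomes the baseline
    return {level: [1 if v == level else 0 for v in values] for level in kept}

def standardize(values):
    """Center to mean 0 and scale to unit variance (population standard deviation)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical toy data, not taken from the Zillow files
zoning = ["LAR1", "LAR3", "LAR1", "LCR1"]
taxamount = [2000.0, 4000.0, 6000.0, 8000.0]

dummies = one_hot_drop_first(zoning)
scaled = standardize(taxamount)
```

Standardizing changes the location and scale of a column but not the shape of its distribution, so a skewed column stays skewed after this step.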
- 3. Merging the Data
The data was then merged for the application of logistic regression and SVM prediction. The following shows the final shape of the dataset after the application of dummy variables and StandardScaler.

In [5]:
    X = pd.concat([continuous_data, nominal_data], axis=1)
    columns = X.columns
    'The dataset has %d rows and %d columns' % X.shape

Out[5]: 'The dataset has 116761 rows and 2107 columns'

Models
[50 points] Create a logistic regression model and a support vector machine model for the classification task involved with your dataset. Assess how well each model performs (use 80/20 training/testing split for your data). Adjust parameters of the models to make them more accurate. If your dataset size requires the use of stochastic gradient descent, then linear kernel only is fine to use. That is, the SGDClassifier is fine to use for optimizing logistic regression and linear support vector machines. For many problems, SGD will be required in order to train the SVM model in a reasonable timeframe.

Create Models

SGDClassifier Over the Other Sklearn Functions
We tried a few sklearn support vector machine functions and noticed that the accuracy was similar for each, but with such a large dataset we decided to try to cut down on the training time. First, we tried SVC with kernel = 'linear' but waited a long time for it to finish. Next, we tried LinearSVC because the liblinear library it uses tends to converge faster than the libsvm library as the number of samples grows. Finally, we tried SGDClassifier with loss = 'log', which was much faster than the others, so this is what we use for logistic regression.

Functions to Test Accuracy
These are the functions that we wrote to find, visualize, and report the best value of each parameter per model. We then reuse those values for the optimized model.
- 4. In [7]:
    def test_accuracy(model, n_splits=8, print_steps=False, params={}):
        # average the accuracy over n_splits random 80/20 train/test splits
        accuracies = []
        for i in range(1, n_splits+1):
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, ra
            yhat, _ = model(
                X_train=X_train,
                y_train=y_train,
                X_test=X_test,
                **params
            )
            accuracy = float(sum(yhat==y_test)) / len(y_test)
            accuracies.append(accuracy)
            if print_steps:
                # confusion_matrix orders classes ascending (row/column 0 is class 0),
                # so the labels below are swapped relative to that layout
                matrix = pd.DataFrame(confusion_matrix(y_test, yhat),
                    columns=['Predicted 1', 'Predicted 0'],
                    index=['Actual 1', 'Actual 0'],
                )
                print('*' * 15 + ' Split %d ' % i + '*' * 15)
                print('Accuracy:', accuracy)
                print(matrix)
        return np.mean(accuracies)

    def find_optimal_accuracy(model, param, param_values, params={}):
        # sweep one parameter, plot accuracy per value, return the best value
        result = {}
        for param_value in tqdm(list(param_values)):
            params_local = params.copy()
            params_local[param] = param_value
            result[param_value] = test_accuracy(model, params=params_local)
        result = pd.Series(result).sort_index()
        plt.xlabel(param, fontsize=15)
        plt.ylabel('Accuracy', fontsize=15)
        # idxmax returns the index label (the parameter value) with the best accuracy
        optimal_param = result.idxmax()
        optimal_accuracy = result[optimal_param]
        if isinstance(param_value, str):
            result.plot(kind='bar')
        else:
            result.plot()
        plt.show()
        return optimal_param

Logistic Regression
For the logistic regression model, we created a function that fits on X_train and y_train and predicts the labels for the held-out X_test. The predicted sign of logerror was compared with the actual labels, and a confusion matrix was printed alongside the accuracy. Because the problem is hard, we are only slightly better than 50% accuracy.
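The confusion matrix printed by `test_accuracy` is only readable if the row and column labels match the order in which the classes are counted. The stdlib-only sketch below counts a 2x2 matrix with an explicit label order so that the names cannot drift from the numbers. The function names and the toy labels are assumptions for illustration; scikit-learn's `confusion_matrix` accepts a `labels` argument that plays the same role.

```python
def confusion_counts(y_true, y_pred, labels=(1, 0)):
    """Rows are actual classes, columns are predicted classes, both in `labels` order."""
    position = {label: k for k, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[position[actual]][position[predicted]] += 1
    return matrix

def accuracy_from_matrix(matrix):
    """Accuracy is the sum of the diagonal divided by the total count."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical labels, not notebook output
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

# With labels=(1, 0) the first row is "Actual 1" and the first column is "Predicted 1"
matrix = confusion_counts(y_true, y_pred, labels=(1, 0))
accuracy = accuracy_from_matrix(matrix)
```

Accuracy does not depend on the label order, since swapping both rows and columns leaves the diagonal sum unchanged; only the interpretation of the off-diagonal cells does.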
- 5. In [8]:
    def logistic_regression_model(X_train, y_train, X_test, **params):
        # scale every feature to [0, 1] using the training split only
        scaler = MinMaxScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)
        params['loss'] = 'log'
        clf = SGDClassifier(**params)
        clf.fit(X_train, y_train)
        return clf.predict(X_test), clf

    best_params_logistic = {}
    model = logistic_regression_model
    accuracy = test_accuracy(model=model, params=best_params_logistic, print_steps=Tru
    print('-' * 50)
    'Average unoptimized accuracy: %f' % accuracy

    *************** Split 1 ***************
    Accuracy: 0.5544897871793774
              Predicted 1  Predicted 0
    Actual 1         5957         4488
    Actual 0         5916         6992
    *************** Split 2 ***************
    Accuracy: 0.5663940393097247
              Predicted 1  Predicted 0
    Actual 1         1456         8955
    Actual 0         1171        11771
    *************** Split 3 ***************
    Accuracy: 0.5502933241981758
              Predicted 1  Predicted 0
    Actual 1         5883         4591
    Actual 0         5911         6968
    *************** Split 4 ***************
    Accuracy: 0.5652806919881814
              Predicted 1  Predicted 0
    Actual 1         1783         8819
    Actual 0         1333        11418
    *************** Split 5 ***************
    Accuracy: 0.5028475998801011
              Predicted 1  Predicted 0
    Actual 1         8796         1616
    Actual 0         9994         2947
    *************** Split 6 ***************
    Accuracy: 0.5612126921594656
              Predicted 1  Predicted 0
    Actual 1         2250         8243
    Actual 0         2004        10856
    *************** Split 7 ***************
    Accuracy: 0.5663940393097247
              Predicted 1  Predicted 0
    Actual 1         2750         7693
    Actual 0         2433        10477
    *************** Split 8 ***************
    Accuracy: 0.5659230077506102
              Predicted 1  Predicted 0
    Actual 1         3354         7103
    Actual 0         3034         9862
    --------------------------------------------------

Out[8]: 'Average unoptimized accuracy: 0.554104'

- 6. Optimizing the Logistic Regression Model
Running logistic regression once with the default parameters gave an average accuracy of 0.554 over 8 splits. To try to improve this, we want to do a few things.

First, we want to do the 80/20 split 5 times and average the results to get a more stable accuracy estimate. By splitting the data into training and test sets multiple times, we reduce the influence of outliers in any single split.

Second, we want to see how changing alpha, epsilon, the number of iterations, and the penalty affects the accuracy. To do this we have another for loop which tries alpha at 10 linearly spaced values from 0.00001 to 0.001 and epsilon at 20 linearly spaced values from 0.01 to 0.5. The number of iterations could be 1, 3, 6, 10, or 15 and the penalty could be L1 or L2.

We found that the optimal value for alpha is 0.00023 and that for epsilon is 0.293. The optimal penalty is L2 at 15 iterations.

Alpha is the constant that multiplies the regularization term, so a small value such as 0.00023 is expected. scikit-learn also uses alpha to compute the step size when learning_rate is set to 'optimal'; we did not tune the learning-rate schedule in this mini lab.

The search returned 0.293 for epsilon. In scikit-learn, epsilon only affects the huber and epsilon-insensitive losses, so with loss = 'log' the differences we saw across epsilon values most likely reflect noise between runs rather than a real effect.

We found that the L2 penalty (sum of squared weights) is slightly more accurate than the L1 penalty (sum of absolute weights). This was expected because L2 is the standard regularizer for linear SVM models, and it performed the best for our model.

We expected accuracy to vary fairly randomly with the number of iterations and to stay around our initial accuracy, 0.55 +/- 0.1. This time, 15 iterations is the optimal number. Although the accuracy was still going up with more iterations, we had to stop at 15 iterations due to running time constraints.
- 7. In [9]:
    test_params = [
        ('n_iter', [1, 3, 6, 10, 15]),
        ('alpha', np.linspace(0.00001, 0.001, 10)),
        ('epsilon', np.linspace(0.01, .5, 20)),
        ('penalty', ['l1', 'l2'])
    ]
    # tune one parameter at a time, carrying the best values found so far forward
    for param, param_values in test_params:
        best_params_logistic[param] = find_optimal_accuracy(
            logistic_regression_model,
            param=param,
            param_values=param_values,
            params=best_params_logistic
        )
        print("Best", param, best_params_logistic[param])
        time.sleep(1)

    100%|██████████| 5/5 [06:04<00:00, 76.82s/it]
    Best n_iter 15
    100%|██████████| 10/10 [17:06<00:00, 102.69s/it]
    Best alpha 0.00023
    100%|██████████| 20/20 [34:21<00:00, 103.28s/it]
    Best epsilon 0.293684210526
    100%|██████████| 2/2 [04:19<00:00, 140.15s/it]
    Best penalty l2

- 8. Optimized Logistic Regression Model Performance
Once we plugged all the optimal values into the model, the final accuracy became 0.568, which is slightly better than the 0.554 obtained with default parameters. Because the dataset is very complicated, no large improvement in accuracy was expected.

In [10]:
    %%timeit -n1 -r1
    accuracy = test_accuracy(
        logistic_regression_model,
        n_splits=5,
        params=best_params_logistic)
    print('Optimized Logistic Regression Accuracy %f' % accuracy)

    Optimized Logistic Regression Accuracy 0.568090
    1min 4s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
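The search loop in In [9] tunes one parameter at a time and carries each winner forward. A minimal stdlib-only sketch of that greedy scheme, with a made-up score function standing in for `test_accuracy`, shows how it can miss a combination that an exhaustive grid would find. The names `greedy_search`, `grid_search`, and `toy_score`, and the scores themselves, are hypothetical.

```python
from itertools import product

def greedy_search(score, search_space, defaults):
    """Tune parameters one at a time, in order, keeping earlier winners fixed."""
    best = dict(defaults)
    for name, candidates in search_space:
        best[name] = max(candidates, key=lambda value: score({**best, name: value}))
    return best

def grid_search(score, search_space):
    """Evaluate every combination of candidate values and keep the best one."""
    names = [name for name, _ in search_space]
    grids = [candidates for _, candidates in search_space]
    combos = (dict(zip(names, combo)) for combo in product(*grids))
    return max(combos, key=score)

def toy_score(params):
    # Made-up score with an interaction: 'a' and 'b' only pay off when they match
    a, b = params["a"], params["b"]
    return 0.55 + 0.02 * (a == b) - 0.01 * a

space = [("a", [0, 1]), ("b", [0, 1])]
greedy = greedy_search(toy_score, space, defaults={"a": 0, "b": 1})
full = grid_search(toy_score, space)
```

In this toy case the greedy pass settles on a different combination than the full grid because the first parameter is chosen while the second is still at its default. The trade-off is cost: the greedy pass evaluates the sum of the candidate counts, while the grid evaluates their product.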
- 9. Support Vector Machine Classifier
For the support vector machine model, we created a function that fits on X_train and y_train and predicts the labels for the held-out X_test. The predicted sign of logerror was compared with the actual labels, and a confusion matrix was printed alongside the accuracy. Because the problem is hard, we are again only slightly better than 50% accuracy.
- 10. In [14]:
    def support_vector_machine_model(X_train, y_train, X_test, **params):
        # X = (X - µ) / σ
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)
        params['loss'] = 'hinge'
        clf = SGDClassifier(**params)
        clf.fit(X_train, y_train)
        return clf.predict(X_test), clf

    best_params_svc = {}
    model = support_vector_machine_model
    # note: this call passes best_params_logistic (tuned for logistic regression), not best_params_svc
    accuracy = test_accuracy(model=model, params=best_params_logistic, print_steps=Tru
    print('-' * 50)
    'Average unoptimized accuracy: %f' % accuracy

    *************** Split 1 ***************
    Accuracy: 0.532736693358455
              Predicted 1  Predicted 0
    Actual 1         4716         5729
    Actual 0         5183         7725
    *************** Split 2 ***************
    Accuracy: 0.5036611998458442
              Predicted 1  Predicted 0
    Actual 1         4259         6152
    Actual 0         5439         7503
    *************** Split 3 ***************
    Accuracy: 0.533978503832484
              Predicted 1  Predicted 0
    Actual 1         4864         5610
    Actual 0         5273         7606
    *************** Split 4 ***************
    Accuracy: 0.540829871965058
              Predicted 1  Predicted 0
    Actual 1         4296         6306
    Actual 0         4417         8334
    *************** Split 5 ***************
    Accuracy: 0.5325654091551406
              Predicted 1  Predicted 0
    Actual 1         4525         5887
    Actual 0         5029         7912
    *************** Split 6 ***************
    Accuracy: 0.5332933670192267
              Predicted 1  Predicted 0
    Actual 1         4960         5533
    Actual 0         5366         7494
    *************** Split 7 ***************
    Accuracy: 0.5422429666424015
              Predicted 1  Predicted 0
    Actual 1         3739         6704
    Actual 0         3986         8924
    *************** Split 8 ***************
    Accuracy: 0.5345351774932556
              Predicted 1  Predicted 0
    Actual 1         3902         6555
    Actual 0         4315         8581
    --------------------------------------------------

Out[14]: 'Average unoptimized accuracy: 0.531730'

- 11. Optimizing the Support Vector Machine Model
Running the SVM model once before tuning it (the call above reuses the parameters found for logistic regression) gave an average accuracy of 0.531 over 8 splits. To try to improve this, we will do a few things listed below.

First, we want to do the 80/20 split 5 times and average the results to get a more stable accuracy estimate. By splitting the data into training and test sets multiple times, we reduce the influence of outliers in any single split.

Second, we want to see how changing alpha, the number of iterations, and the penalty affects the accuracy. To do this we have another for loop which tries alpha at 20 linearly spaced values from 0.00001 to 0.01. The number of iterations could be 10, 15, 30, 60, or 100 and the penalty could be L1 or L2.

We found that the optimal value for alpha is 0.00421. The optimal penalty is L2 at 100 iterations.

Alpha is the constant that multiplies the regularization term, so a small value such as 0.00421 is expected. scikit-learn also uses alpha to compute the step size when learning_rate is set to 'optimal'; we did not tune the learning-rate schedule in this mini lab.

Epsilon was not tuned because accuracy across its values was noisy, so we removed it from the search.

We found that the L2 penalty (sum of squared weights) is slightly more accurate than the L1 penalty (sum of absolute weights). This was expected because L2 is the standard regularizer for linear SVM models, and it performed the best for our model.

We expected accuracy to vary fairly randomly with the number of iterations and to stay around our initial accuracy, 0.53 +/- 0.1. This time, 100 iterations is the optimal number. Although the accuracy was still going up with more iterations, we had to stop at 100 iterations due to running time constraints.
- 12. In [20]:
    test_params = [
        ('n_iter', [10, 15, 30, 60, 100]),
        ('alpha', np.linspace(0.00001, 0.01, 20)),
        ('penalty', ['l1', 'l2'])
    ]
    model = support_vector_machine_model
    # tune one parameter at a time, carrying the best values found so far forward
    for param, test_values in test_params:
        best_params_svc[param] = find_optimal_accuracy(
            model=model,
            param=param,
            param_values=test_values,
            params=best_params_svc
        )
        print("Best", param, best_params_svc[param])
        time.sleep(1)

    100%|██████████| 5/5 [15:18<00:00, 208.83s/it]
    Best n_iter 100
    100%|██████████| 20/20 [2:06:52<00:00, 384.29s/it]
    Best alpha 0.00421631578947
    100%|██████████| 2/2 [21:15<00:00, 739.44s/it]
    Best penalty l2

- 13. Optimized Support Vector Machine Model Performance
Once we plugged all the optimal values into the model, the final accuracy became 0.560, which is better than the 0.531 obtained before tuning. Because the dataset is very complicated, no large improvement in accuracy was expected, but we were pleased that this improvement was larger than the improvement for logistic regression.

In [21]:
    %%timeit -n1 -r1
    accuracy = test_accuracy(
        support_vector_machine_model,
        n_splits=5,
        params=best_params_svc)
    print('Optimized SVC Accuracy %f' % accuracy)

    Optimized SVC Accuracy 0.560202
    3min 59s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Comparing the Results of the Two Models
Here is an accuracy vs time comparison of the two models with parameters optimized. Although the optimized logistic regression model performed better than the SVM model, the difference in accuracy is not significant.

In [25]:
    pd.DataFrame([[0.568, '1 min'], [0.56, '4 min']], columns=['Accuracy', 'Time'], i

Out[25]:
                            Accuracy   Time
    Logistic Regression        0.568  1 min
    Support Vector Machine     0.560  4 min
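Both models compared above are SGDClassifier fits that differ only in the loss function. The simplified sketch below shows that difference for a linear model with an L2 penalty: the log loss always produces a (possibly small) update, while the hinge loss stops updating a sample once its margin exceeds 1. This is an illustration under stated assumptions, not scikit-learn's implementation: the constant step size `eta`, the function names, and the toy data are all hypothetical, and scikit-learn uses a decaying learning-rate schedule by default.

```python
import math
import random

def sgd_linear_classifier(X, y, loss="log", alpha=1e-4, eta=0.1, epochs=20, seed=0):
    """Fit weights w and bias b for labels in {0, 1} with one-sample SGD updates."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t = 1.0 if y[i] == 1 else -1.0  # signed target in {-1, +1}
            margin = t * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            if loss == "log":
                # derivative of log(1 + exp(-margin)) with respect to the score
                g = -t / (1.0 + math.exp(min(margin, 700.0)))
            elif loss == "hinge":
                # derivative of max(0, 1 - margin): zero once the margin reaches 1
                g = -t if margin < 1.0 else 0.0
            else:
                raise ValueError("loss must be 'log' or 'hinge'")
            # loss gradient for this one sample plus L2 shrinkage of the weights
            w = [wj - eta * (g * xj + alpha * wj) for wj, xj in zip(w, X[i])]
            b -= eta * g
    return w, b

def predict(w, b, X):
    """Predict class 1 when the linear score is positive, else class 0."""
    return [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0 for x in X]

# Hypothetical, linearly separable toy data
X_toy = [[-2.0, -1.0], [-1.0, -1.5], [1.0, 1.5], [2.0, 1.0]]
y_toy = [0, 0, 1, 1]

w_log, b_log = sgd_linear_classifier(X_toy, y_toy, loss="log")
w_hinge, b_hinge = sgd_linear_classifier(X_toy, y_toy, loss="hinge")
```

Either loss yields a single linear decision boundary, which is one reason the two models can be expected to land close to each other on the same features.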
- 14. Advantages of Each Model
[10 points] Discuss the advantages of each model for each classification task. Does one type of model offer superior performance over another in terms of prediction accuracy? In terms of training time or efficiency? Explain in detail.

Model Advantages

Advantages in Accuracy
Logistic regression works best when the classes can be separated by a single linear decision boundary. However, our dataset is a fairly hard problem and the classes do not appear to be linearly separable. We know this because running logistic regression once with the default parameters gave an accuracy of 0.554, and after optimizing the number of iterations, alpha, epsilon, and the L1/L2 penalty we were only able to obtain a final accuracy of 0.568. This could be a consequence of optimizing the parameters individually, so that interactions between them were not considered, and we also have a high risk of overfitting our model.

An advantage of support vector machines is that they maximize the margin around the decision boundary and, with a non-linear kernel, are not constrained to a single hyperplane as above. Note that the hinge-loss SGDClassifier we used is still a linear model. We thought an SVM would be better for our dataset because we have so many features and do not think the boundary is linear. We were surprised when we ran SGD with basic parameters (alpha set but not tuned for the SVM) and found an accuracy of 0.531, which is lower than logistic regression but could be due to the lack of optimization. After optimizing, we were able to achieve an accuracy of 0.560. This is a slight improvement, but we were still optimizing the parameters individually, so interactions between them were not considered, and we also have a high risk of overfitting our model.

Advantages in Time and Efficiency
Among the sklearn functions that we considered, SVC with a linear kernel evaluates the kernel between pairs of points in the dataset. Thus, the run time is roughly the number of features multiplied by the number of observations squared. In other words, it took longer than some team members had the patience to watch, and it is the slowest method we used.

As mentioned above, LinearSVC improved on this because it is implemented with liblinear, a library for linear SVMs and logistic regression. We took its run time to be log-linear times linear, which is better than SVC.

sklearn's LogisticRegression also uses the liblinear library, with a one-vs-rest scheme. We took its run time to be log-linear, which improves on the efficiency of the SVC functions.

SGDClassifier is the fastest and arguably linear: one pass over the data matrix costs on the order of n * m operations, for n the number of features and m the number of observations. Each update uses a single observation rather than the whole dataset, which also improves the time.
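To put the scaling argument above in numbers, the sketch below plugs the dataset shape reported earlier (116761 rows and 2107 columns) into two rough cost models: one SGD pass touching every entry of the data matrix once, and a kernel SVC whose cost grows at least quadratically with the number of samples. These are order-of-magnitude operation counts under those assumptions, not measured run times, and the variable names are illustrative.

```python
n_samples = 116761   # rows reported by the notebook
n_features = 2107    # columns after dummy encoding

# One pass of SGD touches every entry of the data matrix once
sgd_ops_per_epoch = n_samples * n_features

# Kernel SVC: at least quadratic in the number of samples
svc_ops_lower_bound = n_features * n_samples ** 2

# How many SGD passes fit into the SVC lower bound
ratio = svc_ops_lower_bound // sgd_ops_per_epoch

print(f"SGD, one epoch:       {sgd_ops_per_epoch:.3e} operations")
print(f"SVC, quadratic bound: {svc_ops_lower_bound:.3e} operations")
print(f"Ratio: {ratio}x")
```

Under these assumptions the ratio equals the number of samples, which is why the gap between SVC and SGD widens as the dataset grows.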
- 15. In terms of our dataset, we have about 2000 features, most of them dummy variables, which makes the dataset sparse. Logistic regression (SGDClassifier with loss = 'log') turned out to be the fastest, with the hinge-loss SGDClassifier second.

Conclusion
SGD with loss = "log" was our best performer in terms of accuracy (0.568) and was our fastest algorithm. So, we decided that the logistic regression model was best for our dataset.

Feature Importance
[30 points] Use the weights from logistic regression to interpret the importance of different features for the classification task. Explain your interpretation in detail. Why do you think some variables are more important?

Logistic Regression Feature Importance
Above we chose the logistic regression model over the SVM, so we have pulled out the top 50 variables in our dataset below.
- 16. Interpret Feature Importance

In [15]:
    _, clf = logistic_regression_model(X_train=X, y_train=y, X_test=X, **best_params_
    abs_coefs = np.abs(clf.coef_[0])
    # note: sort_values() sorts ascending, so index[:50] selects the 50 smallest
    # absolute weights; sort_values(ascending=False) would select the 50 largest
    top_50_vars = pd.Series(abs_coefs, index=X.columns).sort_values().index[:50]
    importance_top_50 = pd.Series(clf.coef_[0], index=X.columns).loc[top_50_vars]
    plt.figure(figsize=(15, 20))
    importance_top_50.plot(kind='barh')
    plt.title('Logistic Regression Feature Importance (TOP 50 Variables)')
    plt.xlabel('Weight', fontsize=15)
    plt.ylabel('Feature', fontsize=15);
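Ranking features by the magnitude of their weights depends on the sort direction, which is easy to get backwards. The stdlib-only sketch below ranks a handful of made-up weights by absolute value in descending order and contrasts that with an ascending sort. The feature names, the weights, and the helper `top_k_by_magnitude` are hypothetical and are not the notebook's fitted coefficients.

```python
def top_k_by_magnitude(weights, k):
    """Return the k (feature, weight) pairs with the largest absolute weight."""
    ranked = sorted(weights.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]

# Hypothetical weights for illustration only
weights = {
    "zoning_A": -0.0018,
    "zoning_B": 0.0012,
    "taxamount": 0.0003,
    "yearbuilt": -0.00001,
}

top2 = top_k_by_magnitude(weights, k=2)

# An ascending sort followed by a head slice returns the least important features
bottom2 = sorted(weights.items(), key=lambda item: abs(item[1]))[:2]
```

The signed weights are kept in the result, so a plot built from `top2` still shows whether each feature pushes the prediction toward positive or negative logerror.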
- 17. The features
After scaling the continuous variables, we found that 49 of the top 50 important features were dummy variables derived from propertyzoningdesc (per county). This means that these variables showed the most importance for predicting the sign of logerror, with weights on both the positive and negative sides. We think this could be because neighborhoods, through their location and nearby amenities, strongly dictate the sales price of a property or strongly sway the difference between the price and the estimate.

assessmentyear was also an important feature, because when the property was assessed could be a strong indicator of whether its sales price appreciated or depreciated.

Outside of the top 50 important features, propertytax was also a "big" factor. We say big because once propertyzoningdesc is accounted for, the weights become much smaller. Perhaps the amount owners pay in property tax helps predict property value because wealthier neighborhoods can fund more amenities than poorer ones.

The weights
There are over 2000 variables in our dataset and we have a good amount of missing values, so our dataset is already fairly sparse. This could be why our largest weight was under 0.002. We also found that only 36 features had a weight higher than 0.0005. While this sounds like good news (only 36 features to key in on), all of these were dummy variables for property zoning.

Insights
[10 points] Look at the chosen support vectors for the classification task. Do these provide any insight into the data? Explain. If you used stochastic gradient descent (and therefore did not explicitly solve for support vectors), try subsampling your data to train the SVC model, then analyze the support vectors from the subsampled dataset.
Interpret the Support Vectors
We reviewed the support vectors for logerror class 0 (negative values) versus class 1 (positive values) for a number of features using Kernel Density Estimation (KDE).

From our analysis of feature importance, we found significance for propertyzoningdesc. Unfortunately, these are categorical variables, so we were unable to compare how much more effective the support vectors were at defining the classification boundaries. However, we are able to perform the KDE for the tax amount features.

We selected features that are intuitively relevant in the real estate industry for predicting sale price, e.g. bathrooms, bedrooms, square footage, year built. Since we are testing logerror, and not sale price, we did not see any significant differences between the original data and the chosen support vectors for these features. In other words, it would not be unusual to see the support vectors approximate the original logerror distribution, since the effects of these features have already been baked into logerror.

- 18. These are the features where the support vectors resembled the original data.

bathroomcnt, fullbathcnt, calculatedbathnbr: For all these features, the original data had distinct values in increments of 0.5, while the SVM model had the same shape as the original data set but with continuous values. Negative and positive logerror were consistent between the original data and the resulting support vectors.

yearbuilt: The original data had more detail, with additional curvature, while the SVM model had the same overall shape but with less detail. The curves for positive (1) and negative (0) logerror follow the same shape, as seen in both graphs. Negative and positive logerror were consistent between the original data and the resulting support vectors.

The tax related features, taxamount and taxvaluedollarcnt, tell a different story, and we found differences between the original data and the SVM. The original data had positive (1) and negative (0) error peaks at around the same dollar amounts, with very minor peaks at higher values. The SVM model did not really follow the original graph shape and instead exaggerated the second peak. Also, it had logerror 0 surpassing logerror 1 for the highest density, where the original data portrayed the opposite. The SVM model for taxamount and taxvaluedollarcnt did not approximate the original data.
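The assignment prompt suggests subsampling the data before fitting SVC. One way to do that while keeping the balance between positive and negative logerror is a stratified random subsample, sketched below with the standard library only. The helper name `stratified_subsample`, the 10% fraction, and the toy labels are assumptions for illustration; the notebook itself fits SVC on the full data with a cap on iterations.

```python
import random
from collections import defaultdict

def stratified_subsample(labels, fraction, seed=0):
    """Return sorted row indices of a random subsample that keeps class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for indices in by_class.values():
        # draw the same fraction from every class, at least one row each
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)

# Hypothetical labels: 60 negatives (0) followed by 40 positives (1)
labels = [0] * 60 + [1] * 40
subset = stratified_subsample(labels, fraction=0.1)
```

With pandas objects, the returned positions could be applied as `X.iloc[subset]` and `y.iloc[subset]` before calling `fit`, so that the support vectors come from a model trained on a manageable number of rows.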
In [6]:
    from sklearn.svm import SVC
    clf = SVC(kernel='linear', max_iter=500)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    clf.fit(X_scaled, y)
    # this holds the indexes of the support vectors
    clf.support_
    # this holds a subset of the data which is used for support vectors
    support_vectors = pd.DataFrame(clf.support_vectors_, columns=X.columns)
    # get number of support vectors for each class (the printed label says 'feature')
    print('Number of support vectors for each feature:', clf.n_support_)

    Number of support vectors for each feature: [333 420]

Density Graph of Positive (1) and Negative (0) Logerror for Six Variables
For bathroomcnt, the original data had distinct values in increments of 0.5, while the SVM model had the same shape as the original data set but with continuous values. Negative logerror had a larger area underneath the curve than positive logerror, as seen in both graphs. The SVM model for bathroomcnt did well overall in preserving the integrity of the original data.
- 19. For fullbathcnt, the original data had distinct values in increments of 0.5, while the SVM model had the same shape as the original data set but with continuous values. The curves for positive and negative logerror follow the same shape, as seen in both graphs. The SVM model for fullbathcnt did well overall in preserving the integrity of the original data.

For taxamount and taxvaluedollarcnt, the original data had positive and negative error peaks at around the same dollar amount, with a very minor peak at a higher value. The SVM model did not really follow the original graph shape and instead exaggerated the second peak. Also, it had logerror 1 surpassing logerror 0 for the highest density, where the original data portrayed the opposite. The SVM model for taxamount and taxvaluedollarcnt did not do well in preserving the integrity of the original data.

For calculatedbathnbr, the original data had distinct values in increments of 0.5, while the SVM model had the same shape as the original data set but with continuous values. The curves for positive and negative logerror follow the same shape, as seen in both graphs. The SVM model for calculatedbathnbr did well overall in preserving the integrity of the original data.

For yearbuilt, the original data had more detail, with additional curvature, while the SVM model had the same overall shape but with less detail. The curves for positive and negative logerror follow the same shape, as seen in both graphs. The SVM model for yearbuilt did well overall in preserving the integrity of the original data.

Overall, the SVM model kept data integrity for bathroomcnt, fullbathcnt, calculatedbathnbr, and yearbuilt, but not really for taxamount and taxvaluedollarcnt.
- 20. In [14]:
    # group the support vectors and the original rows by logerror class (0 or 1)
    V_grouped = support_vectors.groupby(y.loc[clf.support_].values)
    X_grouped = X.groupby(y.values)
    vars_to_plot = ['bathroomcnt','fullbathcnt','calculatedbathnbr',
                    'yearbuilt','taxamount','taxvaluedollarcnt']
    for v in vars_to_plot:
        plt.figure(figsize=(10,4)).subplots_adjust(wspace=.4)
        # left panel: density of the rows chosen as support vectors
        plt.subplot(1,2,1)
        V_grouped[v].plot.kde()
        plt.legend(['logerror 0','logerror 1'])
        plt.title(v+' (Instances chosen as Support Vectors)')
        # right panel: density of the original data
        plt.subplot(1,2,2)
        X_grouped[v].plot.kde()
        plt.legend(['logerror 0','logerror 1'])
        plt.title(v+' (Original)')
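The density curves compared above are kernel density estimates, which smooth every observation into a small bump and add the bumps up. That is one reason a feature recorded in steps of 0.5 can still produce a continuous curve. A minimal stdlib-only sketch with Gaussian kernels follows; the fixed bandwidth, the toy bathroom counts, and the function name are assumptions for illustration, whereas pandas delegates the estimate to SciPy and picks the bandwidth automatically.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` with Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # sum of one Gaussian bump per observation, centred on that observation
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Hypothetical bathroom counts in steps of 0.5
bathrooms = [1.0, 1.0, 2.0, 2.0, 2.0, 2.5, 3.0]
density = gaussian_kde(bathrooms, bandwidth=0.3)

# Crude Riemann sum over a wide grid: a density should integrate to about 1
step = 0.01
grid = [i * step for i in range(-200, 701)]  # from -2.0 to 7.0
area = sum(density(x) for x in grid) * step
```

A larger bandwidth blurs the peaks together and a smaller one makes each recorded value stand out, so differences in curve detail between two panels can come from the sample size and bandwidth as well as from the data.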
- 22. References:
Kernels from Kaggle competition: https://www.kaggle.com/c/zillow-prize-1/kernels
Scikit-learn logistic regression: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
Scikit-learn linear SVC: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
Stack Overflow pandas questions: https://stackoverflow.com/questions/tagged/pandas
Scikit-learn SGDClassifier: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html