# Exploratory Data Analysis (EDA) by Melvin Ott, PhD


*Melvin Ott, PhD. September 2017.*

## Introduction

The Masters in Predictive Analytics program at Northwestern University offers graduate courses that cover predictive modeling using several software products, such as SAS, R, and Python. Predict 410 is one of the core courses, and this section focuses on using Python.

Predict 410 follows a sequence in the assignments. The first assignment asks you to perform an EDA (see Ratner [1], Chapters 1 and 2) on the Ames Housing Data dataset to determine the best single-variable model. It is followed by an assignment to expand to a multivariable model. Python software for boxplots, scatterplots, and more will help you identify the single variable. However, it is easy to get lost in the programming and lose sight of the objective: namely, which of the candidate variables best explains the variability in the response variable? You will need to be familiar with data types and levels of measurement; this is critical in deciding when to use a dummy variable for model building. If this topic is new to you, review the definitions in the Types of Data section before reading further.

This report will help you become familiar with some of the tools for EDA and lets you interact with the data through links to a software product, Shiny, which produces various plots of the data. Shiny is located on a cloud server and allows you to make choices in looking at the plots. Study the plots carefully: this is your initial EDA tool, and it leads to your model building and your overall understanding of predictive analytics.

## Single Variable Linear Regression EDA

**1. Become familiar with the data.** Identify the variables that are categorical and the variables that are quantitative. For the Ames Housing Data, you should review the Ames Data Description PDF file.

**2. Look at plots of the data.** For the quantitative variables, look at scatterplots vs. the response variable, saleprice. For the categorical variables, look at boxplots vs. saleprice. You have sample Python code to help with the EDA, and the links below demonstrate these relationships for a different dataset, building_prices.
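The two plot types in step 2 can be sketched in a few lines of pandas/matplotlib. The frame below is a made-up stand-in, not the actual Ames or building_prices data:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in data: one quantitative and one categorical predictor
df = pd.DataFrame({
    "saleprice": [208500, 181500, 223500, 140000, 250000, 143000],
    "grlivarea": [1710, 1262, 1786, 1717, 2198, 1362],
    "bldgtype":  ["1Fam", "1Fam", "1Fam", "TwnhsE", "1Fam", "TwnhsE"],
})

# Quantitative predictor: scatterplot vs. the response
ax = df.plot.scatter(x="grlivarea", y="saleprice")

# Categorical predictor: boxplot of the response within each category
bx = df.boxplot(column="saleprice", by="bldgtype")
plt.close("all")
```

The same two calls, pointed at the real Ames columns, reproduce what the Shiny apps show interactively.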
- Boxplots with Shiny: http://melvin.shinyapps.io/SboxPlot
- Scatterplots with Shiny: http://melvin.shinyapps.io/SScatter/

**3. Begin writing Python code.** Start with the shell code and improve on the model provided.

## Single Variable Logistic Regression EDA

**1. Become familiar with the data.** In 411 you will have an introduction to logistic regression and will again be asked to perform an EDA. See the credit data file for more information. Make sure you recognize which variables are quantitative and which are categorical, and, for several of these variables, what the level of measurement is.

**2. Look at plots of the data.** For logistic regression the response variable is of the yes/no type; in this dataset it is coded as good/bad. The EDA may therefore include histograms for quantitative variables, with a separate histogram for each of the response values. For numerically coded explanatory categorical variables, if the response good/bad is recoded as 0/1, then the mean of the response variable within each category will indicate whether there is a relationship. See the Shiny links below for the histograms and the means.

**3. Begin writing Python code.** You have looked at the plots; which variable do you think will be most useful for predicting or explaining bad credit? After you answer this question, begin
writing Python code to see if you can replicate these plots.

- Histograms with Shiny: http://melvin.shinyapps.io/SHisto
- Means with Shiny: http://melvin.shinyapps.io/SCRMeans

## The CREDIT Data Set

The data set CREDIT contains information on 1000 customers. There are 21 variables in the data set:

| Name | Model Role | Measurement Level | Description |
|------|------------|-------------------|-------------|
| AGE | Input | Interval | Age in years |
| AMOUNT | Input | Interval | Amount of credit requested |
| CHECKING | Input | Nominal or Ordinal | Balance in existing checking account: 1 = less than 0 DM; 2 = more than 0 but less than 200 DM; 3 = at least 200 DM; 4 = no checking account |
| COAPP | Input | Nominal | Other debtors or guarantors: 1 = none; 2 = co-applicant; 3 = guarantor |
| DEPENDS | Input | Interval | Number of dependents |
| DURATION | Input | Interval | Length of loan in months |
| EMPLOYED | Input | Ordinal | Time at present employment: 1 = unemployed; 2 = less than 1 year; 3 = at least 1, but less than 4 years; 4 = at least 4, but less than 7 years; 5 = at least 7 years |
| EXISTCR | Input | Interval | Number of existing accounts at this bank |
| FOREIGN | Input | Binary | Foreign worker: 1 = yes; 2 = no |
| GOOD_BAD | Target | Binary | Credit rating status (good or bad) |
| HISTORY | Input | Ordinal | Credit history: 0 = no loans taken / all loans paid back in full and on time; 1 = all loans at this bank paid back in full and on time; 2 = all loans paid back on time until now; 3 = late payments on previous loans; 4 = critical account / loans in arrears at other banks |
| HOUSING | Input | Nominal | Rent/own: 1 = rent; 2 = own; 3 = free housing |
| INSTALLP | Input | Interval | Debt as a percent of disposable income |
| JOB | Input | Ordinal | Employment status: 1 = unemployed / unskilled non-resident; 2 = unskilled resident; 3 = skilled employee / official; 4 = management / self-employed / highly skilled employee / officer |
| MARITAL | Input | Nominal | Marital status and gender: 1 = male, divorced/separated; 2 = female, divorced/separated/married; 3 = male, single; 4 = male, married/widowed; 5 = female, single |
| OTHER | Input | Nominal or Ordinal | Other installment loans: 1 = bank; 2 = stores; 3 = none |
| PROPERTY | Input | Nominal or Ordinal | Collateral property for loan: 1 = real estate; 2 = if not 1, building society savings agreement / life insurance; 3 = if not 1 or 2, car or others; 4 = unknown / no property |
| PURPOSE | Input | Nominal | Reason for loan request: 0 = new car; 1 = used car; 2 = furniture/equipment; 3 = radio/television; 4 = domestic appliances; 5 = repairs; 6 = education; 7 = vacation; 8 = retraining; 9 = business; x = other |
| RESIDENT | Input | Interval | Years at current address |
| SAVINGS | Input | Nominal or Ordinal | Savings account balance: 1 = less than 100 DM; 2 = at least 100, but less than 500 DM; 3 = at least 500, but less than 1000 DM; 4 = at least 1000 DM; 5 = unknown / no savings account |
| TELEPHON | Input | Binary | Telephone: 1 = none; 2 = yes, registered under the customer's name |

## Exploratory Data Analysis (EDA)

Ratner [1] describes data mining "as any process that finds unexpected structures in data and uses the EDA framework to ensure that the process explores the data, not exploits it." "Unexpected" suggests that the word "exploratory" is very appropriate to this process. Tukey [2], in his book and in many presentations, gave structure to EDA. Others have extended it to include 'big' data, which has arisen from our ability to capture huge datasets, store them on servers cost-effectively, and analyze them with software that can handle them.

## Shiny Apps

To learn more about Shiny applications with RStudio, click on the link below:
http://rstudio.github.io/shiny/tutorial/

## Types of Data

Quantitative data are numeric and represent counts or measurements. Categorical data are names or labels, such as a, b, c, but can often be coded as 1, 2, 3; they do not represent counts or measurements. Discrete data are finite or countable numeric data. Continuous data are values on a continuous scale of measurement.

A nominal level of measurement suggests names or categories, with no apparent order. Ordinal level data suggest a sequential ordering, but mathematical calculations should not be performed on them. Interval level data are ordinal, plus the difference between two data values is meaningful; however, there is no natural zero level. Ratio level data are interval data with a zero level, so differences and ratios may be calculated.

## References

1. Ratner, B. (2012). *Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data* (2nd ed.). New York: CRC Press. [ISBN-13: 9781439860915]
2. Tukey, J. W. (1977). *Exploratory Data Analysis*. Addison-Wesley.

## Ames Housing OLS Regression Project (300 Points)

The ames_train data set contains approximately 2039 records. See the data
description in the file Introduction_to_Ames_Housing_Data. This is a random selection of training data selected from the full dataset. Note that the index numbers have been randomized and the split between train and test is also random, so you will not be able to match the test data with sale price values.

You are to use OLS ("linear") regression to predict the sale price for homes in the ames_test_sfam dataset by building two models using the ames_train data. Note that the test data set is single-family homes, while the training data is all homes.

### DELIVERABLES

- … zip files.
- Your write-up should have five sections. Each section should have enough detail so that I can follow your logic and someone else can replicate your work. (150 Points)
- … analysis: I should be able to run this file and get all the output that you got.
- … ames_test_sfam: there will be only two columns in this file, index and p_saleprice. You will be graded on how your model performs versus my model and those of other students in the class, by submitting your csv file to Kaggle at https://www.kaggle.com/t/0415308f8dd54fc4abed54bef75448bf. It is OK to submit to Kaggle many times. You will have to tell me your alias for Kaggle so I can see the score.
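The two-column csv itself is straightforward to produce with pandas. A minimal sketch, with placeholder index values and predictions standing in for real model output:

```python
import io
import pandas as pd

# Placeholder predictions for three hypothetical test records
scored = pd.DataFrame({
    "index": [101, 102, 103],
    "p_saleprice": [185000.0, 210000.0, 162500.0],
})

# Exactly two columns; index=False suppresses the extra pandas row index
buf = io.StringIO()
scored.to_csv(buf, index=False)
print(buf.getvalue().splitlines()[0])  # index,p_saleprice
```

In practice you would write to a real file, e.g. `scored.to_csv("my_scored_file.csv", index=False)`, and submit that csv at the Kaggle link above.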
### WRITE UP (200 POINTS)

**1. First Steps (40 points)**

Describe the ames_train data set so that I am convinced you understand it. Use my shell code as a start to explore the data; apply your creativity and go from there. If you know how to do pivot tables in Excel, they are a great tool for Exploratory Data Analysis (EDA). EDA was well established by John Tukey, who was a great advocate for it and developed much of what we do today.
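The pivot-table style of EDA mentioned above has a direct pandas equivalent, `pd.pivot_table`. A sketch on a made-up stand-in for ames_train (the column names follow the Ames-style names used in the shell code later; the rows are invented):

```python
import pandas as pd

# Toy stand-in for ames_train (a few made-up rows; the real set has ~2039)
train = pd.DataFrame({
    "overallqual": [5, 6, 7, 5, 6, 7],
    "bldgtype":    ["1Fam", "1Fam", "1Fam", "TwnhsE", "TwnhsE", "1Fam"],
    "saleprice":   [140000, 170000, 210000, 130000, 160000, 230000],
})

# Excel-style pivot table: mean sale price by quality and building type
pivot = pd.pivot_table(train, values="saleprice",
                       index="overallqual", columns="bldgtype",
                       aggfunc="mean")
print(pivot)
```

A table like this quickly shows whether sale price moves with a categorical variable, which is exactly what the boxplot view shows graphically.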
Knowing your data typically consists of three components: (a) a data survey, (b) a data quality check, and (c) an initial exploratory data analysis.

**(a) A Data Survey**

- Take a broad overview of the Ames housing data set. Read over the data documentation. What data do you have, and what is it supposed to represent?
- In the linear regression component of this course you build linear regression models to predict the value of a property (single-family home). Do you have the right data to properly address the problem? Are there observations in the data that should be excluded?
- What kinds of problems can you properly address given the data that you have? In particular, if you were to build a regression model with the variable SalePrice as the response variable, what types of properties would you be valuing? Be careful about what you are doing here.
**(b) Define the Sample Population**

- When building statistical models you have to define the population of interest, and then sample from THAT population. Frequently you will not actively perform the sampling function; instead, the data will be made available and you will have to sample from it retrospectively, i.e. you will need to carve out the population of interest. In this assignment the objective is to provide estimates of home values for 'typical' homes in Ames, Iowa. You may not be able to define what 'typical' is, but you can use the data to find out what is atypical. Any values which are not atypical are then considered to be typical.
- Define the appropriate sample population for your statistical problem. Hint: you are building regression models for the response variable SalePrice. Are all properties the same? Would you want to include an apartment building in the same sample as a single-family residence? A warehouse or a shopping center? Condominiums?
- Define your sample using 'drop conditions'. Create a list of the drop conditions and include it in your report so that it is clear to any reader what you are excluding from the data set when defining your sample population. The definition of your sample data should be clearly noted in your assignment report.

**(c) A Data Quality Check**

- In practice your data will not be 'clean'. You will need to examine your data for errors and outliers. Errors will not always show as outliers, and outliers are not necessarily errors.
- If you have a data dictionary that states the set of proper values for each field, you will want to check your data against the data dictionary. If you do not have a data dictionary, you will need to reason and explore your way to a proper data set.

Example 1: In this project you will be modeling the sale price of housing transactions. It should be obvious that none of these sale prices should be zero or negative; observations with a zero or negative sale price should logically be considered errors.

Example 2: Suppose we had a 'small' number of housing transactions with a sale price over one million dollars. Should we consider these sale prices to be valid? In this case these values could be valid data points, which would make them outliers, or they could be errors, such as 140,000.00 entered as 1,400,000. In either case
they are not relevant data points if the objective is to model the 'typical' home price for the area.

**2. EDA (30 Points)**

Pick ten variables from the data quality check to explore in your initial exploratory data analysis, and perform that analysis. How do you perform an exploratory data analysis for continuous versus discrete (or categorical) data? Consider the use of scatterplots, scatterplot smoothers such as LOESS, and boxplots to produce relevant graphics when appropriate. Note that you are particularly interested in the relationships between the response variable and the predictor variables. I suggest you split your EDA into two sections in your report: one section for continuous variables and one section for discrete variables.
**3. BUILD MODELS (100 Points)**

Build at least four different LINEAR REGRESSION models. The first model should be a simple (single-predictor-variable) model: find the best single-variable model. The next model should be a multiple regression model with two predictor variables: find the best two-variable model. You do not need to build more complex models for this assignment; more complex models will be the topic of hw02.

Show all of your models and the statistical significance of the input variables. Discuss the quality of fit (R-squared and adjusted R-squared), parsimony, and anything else you can think of that might be of value to share. Discuss the coefficients in the model you select: do they make sense? Are you keeping the model even though it is counterintuitive?

**4. SELECT MODELS (20 Points)**

Decide on the criteria for selecting the "best model". Will you use a metric such as adjusted R-squared or AIC? Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model. Put the metrics in a table to display the results.

**5. WRITE MODEL FORMULA (10 Points)**

Write a mathematical formula that shows the model you selected, and explain it. Make sure you include this as a section in your report. Do not
expect that I will search your report to find it. This step should allow someone else to deploy your model. The variable with the predicted saleprice should be named p_saleprice.

### SCORED DATA FILE (100 POINTS)

Use the Python model that you selected to score the data file ames_test_sfam. Overall scoring for your model is based on providing a prediction for every record in the test data. Make sure you have not deleted any records in the test data and that none of your predictions are out of range. Create a file that has only TWO variables for each record:

- index
- p_saleprice

The first variable, index, will allow me to match my grading key to your predicted value. If I cannot do this, you won't get a grade, so please include this value. The second value, p_saleprice, is the predicted price for a property per your model. Your values will be compared against a baseline that predicts the average value for everybody (MEAN). If your model is not better than simply using an AVERAGE value, you will lose points.

### BONUS

If you want bonus points, write a brief section at the top of your write-up document and tell me exactly what you did and how many points you are attempting.
If I cannot see your bonus work, I cannot give you credit. Bonus is difficult to grade and I don't have time to go back looking for it. If you don't tell me it's there, I cannot give you points. The policy with bonus is: all sales are final!

- … Are the results the same? Are there any differences?
- … run with it. I might give you points.

### PENALTY BOX

- … the file names of any files you hand in
- … you hand in

## Assignment Template (New and Revised for September 2017)

In the real world, you will be building predictive models and doing analytic work. But that is not your only function. After you do the work, you need to explain it to other people (most of whom will not understand analytics). Therefore, it is critical that you are able to explain your results in such a way that non-analytic people can
understand it. If you dump 20 or 30 pages of output on someone and say "it's all in here", they won't read it. In fact, that person will likely just ignore your results and go about their day-to-day business without giving your work a second thought. This is not a desirable outcome.

You must write your report so that it can be understood by others, and it must contain enough detail that it can be replicated. In my work I am often handed the work of others and asked to provide a critique; if I am unable to replicate their work because it is lacking in detail, the critique will be very negative. It is not enough that you build a great model. You also have to sell it.

### DOs AND DON'Ts

- … your document in PDF format (e.g. "Homework_03_Fred_Smith.pdf")
- … explaining the output
- … discussion format
- … "Homework_03.pdf"
- … scroll through it discussing it
- … diagram at the end of the document (… UNLESS IT IS ABSOLUTELY NECESSARY to do that)
## Example Report (with a lot of commentary)

**Assignment #3, Fred Smith, PREDICT 410 Section 58**

### INTRODUCTION

The introduction should describe the purpose of the assignment and what you are going to do in order to complete it. It should be clear that you understand why you are performing certain steps in an analysis.

BAD INTRODUCTION: The purpose of this report is to analyze baseball data.

GOOD INTRODUCTION: The purpose of the assignment is to analyze data from somewhere in order to predict the number of something. This will be accomplished by generating simple and multivariate regression models using different variable selection techniques including, but not limited to, forward, stepwise, and backward regression. From these techniques, the best model will be selected. This best model will then be further analyzed to determine whether it is an adequate model for prediction or whether further analysis is necessary.

Make sure you follow the assignment instructions. To get points for each of these sections, you have to show them in your report. Each assignment will require a different type of report; this template is fairly generic, so adjust it to the assignment instructions. If I don't see a section in your report, you will get 0 points for it.

**1. Data Exploration**

An important step. This is where you make or break model building. Spend time on this.
- Do many charts: bar charts, box plots, scatter plots
- … of the data variables?
- … be imputed, "fixed"?
- … will cause test records to be deleted; fix them.

**2. Data Preparation**

Also a critical section. Experiment with this step. Be creative; I like creative ideas even if they don't work.

- Fix outliers
- … variables (such as ratios, or adding or multiplying) to create new variables

**3. Build Models**

These are instructions from Assignment 1 but will be similar in the other assignments. Build at least two different LINEAR REGRESSION models using different variables. Show all of your models and the statistical significance of the input variables. Discuss the coefficients in the model: do they make sense? Are you keeping the model even though it is counterintuitive? Why?

Display the Python results for your assignment and comment on the results. Your
discussion of the results should be intertwined with (or linked to) the Python output, i.e. the discussion should be on or near the page containing the output. You should not be showing a lot of unnecessary Python output.

Discuss the results thoroughly. Include such discussion points as: What is observed in the graph / table / output? Does it make sense? … be done?

GOOD DESCRIPTION OF A DIAGRAM: The analysis continues by examining the plot of the residual values versus the predicted values given in Figure 1. In this type of analysis, a visual inspection of the chart is conducted to determine whether or not any patterns exist in the residuals. Some patterns might include errors that increase or decrease with larger predicted values, or some other type of pattern such as a curve. In an ideal situation, the data will appear to be random. An inspection of Figure 1 suggests that the data points are randomly distributed and no obvious patterns exist in the data; therefore, there are no immediate concerns with the distribution of the errors. (Figure 1: Housing Data Predicted vs. Residual Graph)

BAD DESCRIPTION OF A DIAGRAM: I examined the output at the end of the document. There are no patterns in the data.

GOOD DESCRIPTION OF AN EQUATION
The model chosen from the different candidates was the XXX model because it had the highest adjusted R-squared value and the lowest AIC and SBC values. Using these metrics, it was far superior to the other models. The formula given for the predicted sale price is:

p_saleprice = 50000 + 5000 * X1 (LotFrontage) + 6000 * X2 (LotArea) + 3000 * X3 (OverallCond)

The formula makes intuitive sense for the most part because the coefficients reflect that size and condition add to the value of a property. However, the data should be analyzed for multicollinearity, which can result in sign changes. Also, it might be wise to remove a variable from the model if no explanation can be found.

BAD DESCRIPTION OF AN EQUATION: This is the formula I chose.

p_saleprice = 10.4901 + 3.11867 * X1 + 5.24082 * X2 + 1.76700 * X4 + 2.65534 * X5 - 3.21636 * X6 - 1.94656 * X8 + 2.35175 * X9

Additionally, it is important to note that this model was developed on data from XXXX years, so it is unknown whether it will translate into years in the future. Further analysis will need to be done to determine whether this model will be robust and translate outside the XXXX-year time window. (NOTE: This is a made-up formula, so don't go investing in housing in New York based on this model. Come to think of it, it's probably not a good idea to invest in New York unless you are very familiar with New York.)

**4. Select Models**

Decide on the criteria for selecting the "best model". Will you use a metric such as adjusted R-squared or AIC? Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model. Put the results in a table to display and discuss.

**5. Model Formula**

If you expect points for this step, show it in your report and explain it. You will get 0 points if it is only somewhere in your code and left out of the report; don't expect that I will search your code for it. Write Python code that will score new data and predict the sale price. The variable with the predicted sale price should be named:
p_saleprice

**6. Scored Data File**

Make sure you submit it as a csv file. Use the stand-alone program that you wrote in the previous section to score the data file ames_test. Create a file that has only TWO variables for each record: index and p_saleprice. The first variable, index, will allow me to match my grading key to your predicted value; if I cannot do this, you won't get a grade. The second value, p_saleprice, is the predicted sale price of a home based on the data given to you. Your values will be compared against a baseline that predicts the average value for everybody (MEAN). If your model is not better than simply using an AVERAGE value, you will lose points.

**CONCLUSION**

A short wrap-up of the assignment, including a discussion of results and what was learned.

GOOD CONCLUSION: Several models were developed to predict the sale price of a home using Ames Housing data. The best model was derived using XXXX. Although there were no problems with the model from a statistical standpoint, the winning model did have a sign issue with one of the variables, where seemingly bad construction would result in a higher sale price. This issue needs further investigation but is beyond the scope of this document.

BAD CONCLUSION: I built some models that were good and I learned a lot.

**CODE**: Attach as a separate file or paste your code in at the end.

**BONUS**: Place all bonus work at the end of the document. Clearly identify what you are
doing and how many points you are trying to earn.

## Example Report: Exploratory Data Analysis (EDA), Assignment #1

**Jahee Koo, PREDICT 410 Section 58**

(This is just a template example, so change all contents based on the written instructions.) The remaining sections of this example (introduction, data exploration, data preparation, build models, select models, model formula, scored data file, conclusion, code, and bonus) follow the same template as the example report above.
• 47. # -*- coding: utf-8 -*- """ Created on Tue Jan 16 22:58:46 2018 @author: Paul Lee """ # Using Linear Regression to predict # family home sale prices in Ames, Iowa # Packages import pandas as pd import numpy as np import statsmodels.api as sm import statsmodels.formula.api as smf import matplotlib.pyplot as plt from scipy import stats from sklearn import linear_model, metrics # Set some options for the output pd.set_option('display.notebook_repr_html', False) pd.set_option('display.max_columns', 40) pd.set_option('display.max_rows', 10) pd.set_option('display.width', 120) # Read in the data train = pd.read_csv('C:/Users/Jahee Koo/Desktop/AMES_TRAIN.csv') test = pd.read_csv('C:/Users/Jahee Koo/Desktop/AMES_TEST_SFAM.csv') # Convert all variable names to lower case train.columns = [col.lower() for col in train.columns] test.columns = [col.lower() for col in test.columns]
• 48. # EDA print('n----- Summary of Train Data -----n') print('Object type: ', type(train)) print('Number of observations & variables: ', train.shape) # Variable names and information print(train.info()) print(train.dtypes.value_counts()) # Descriptive statistics print(train.describe()) # show a portion of the beginning of the DataFrame print(train.head(10)) print(train.shape) train.loc[:, train.isnull().any()].isnull().sum().sort_values(ascending=False) train[train == 0].count().sort_values(ascending=False) t_null = train.isnull().sum() t_zero = train[train == 0].count() t_good = train.shape[0] - (t_null + t_zero) xx = range(train.shape[1]) plt.figure(figsize=(8,8)) plt.bar(xx, t_good, color='g', width=1, bottom=t_null+t_zero) plt.bar(xx, t_zero, color='y', width=1, bottom=t_null) plt.bar(xx, t_null, color='r', width=1) plt.show() print(t_null[t_null > 1000].sort_values(ascending=False)) print(t_zero[t_zero > 1900].sort_values(ascending=False))
• 49.
# Drop columns dominated by nulls or zeros
drop_cols = (t_null > 1000) | (t_zero > 1900)
train = train.loc[:, ~drop_cols]

# Some quick plots of the data
train.hist(figsize=(18, 14))
train.plot(kind='box', subplots=True, layout=(5, 9),
           sharex=False, sharey=False, figsize=(18, 14))
train.plot.scatter(x='grlivarea', y='saleprice')
train.boxplot(column='saleprice', by='yrsold')
train.plot.scatter(x='subclass', y='saleprice')
train.boxplot(column='saleprice', by='overallqual')
train.boxplot(column='saleprice', by='overallcond')
train.plot.scatter(x='overallcond', y='saleprice')
train.plot.scatter(x='lotarea', y='saleprice')

# Replace NaN values with medians (numeric) and modes (remaining) in train data
train = train.fillna(train.median())
train = train.apply(lambda col: col.fillna(col.value_counts().index[0]))
train.head()

# Re-check the null/zero counts after imputation
t_null = train.isnull().sum()
t_zero = train[train == 0].count()
t_good = train.shape[0] - (t_null + t_zero)

xx = range(train.shape[1])
plt.figure(figsize=(14, 14))
plt.bar(xx, t_good, color='g', width=.8, bottom=t_null + t_zero)
plt.bar(xx, t_zero, color='y', width=.8, bottom=t_null)
plt.bar(xx, t_null, color='r', width=.8)
plt.show()

train.bldgtype.unique()
train.housestyle.unique()

# Goal is the typical family home:
# drop observations above Tukey's upper fence (Q3 + 1.5 * IQR)
iqr = np.percentile(train.saleprice, 75) - np.percentile(train.saleprice, 25)
drop_rows = train.saleprice > iqr * 1.5 + np.percentile(train.saleprice, 75)
train = train.loc[~drop_rows, :]

iqr = np.percentile(train.grlivarea, 75) - np.percentile(train.grlivarea, 25)
drop_rows = train.grlivarea > iqr * 1.5 + np.percentile(train.grlivarea, 75)
train = train.loc[~drop_rows, :]

iqr = np.percentile(train.lotarea, 75) - np.percentile(train.lotarea, 25)
drop_rows = train.lotarea > iqr * 1.5 + np.percentile(train.lotarea, 75)
train = train.loc[~drop_rows, :]

iqr = np.percentile(train.totalbsmtsf, 75) - np.percentile(train.totalbsmtsf, 25)
drop_rows = train.totalbsmtsf > iqr * 1.5 + np.percentile(train.totalbsmtsf, 75)
train = train.loc[~drop_rows, :]

# Replace 0 values in living area with the median of the nonzero values
m = np.median(train.grlivarea[train.grlivarea > 0])
train = train.replace({'grlivarea': {0: m}})
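The four outlier filters above repeat one pattern (drop rows above Tukey's upper fence, Q3 + 1.5·IQR). That pattern could be factored into a small helper; the function name `drop_upper_outliers` and the toy data below are illustrative, not part of the original script.

```python
# Sketch: one reusable upper-fence filter instead of four copied blocks.
import numpy as np
import pandas as pd

def drop_upper_outliers(df, col, k=1.5):
    """Drop rows where df[col] exceeds Q3 + k * IQR (Tukey's upper fence)."""
    q1, q3 = np.percentile(df[col], [25, 75])
    fence = q3 + k * (q3 - q1)
    return df.loc[df[col] <= fence, :]

# Usage sketch on toy data (the 900 is an obvious upper outlier)
toy = pd.DataFrame({'saleprice': [100, 110, 120, 130, 900]})
trimmed = drop_upper_outliers(toy, 'saleprice')
```

With real data this would be called once per column, e.g. for 'saleprice', 'grlivarea', 'lotarea', and 'totalbsmtsf' in turn.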
• 51.
# Discrete variables
plt.figure()
g = sns.PairGrid(train,
                 x_vars=["bldgtype", "exterqual", "centralair",
                         "kitchenqual", "salecondition"],
                 y_vars=["saleprice"],
                 aspect=.75, size=3.5)
g.map(sns.violinplot, palette="pastel")

# Print correlations
corr_matrix = train.corr()
print(corr_matrix["saleprice"].sort_values(ascending=False).head(10))
print(corr_matrix["saleprice"].sort_values(ascending=True).head(10))

# Pick 10 variables to focus on (plus the response)
pick_10 = ['saleprice', 'grlivarea', 'overallqual', 'garagecars',
           'yearbuilt', 'totalbsmtsf', 'salecondition', 'bldgtype',
           'kitchenqual', 'exterqual', 'centralair']
corr = train[pick_10].corr()
• 52.
# Mask the upper triangle of the correlation matrix and draw a heatmap
blank = np.zeros_like(corr, dtype=bool)
blank[np.triu_indices_from(blank)] = True
fig, ax = plt.subplots(figsize=(10, 10))
corr_map = sns.diverging_palette(255, 133, l=60, n=7,
                                 center="dark", as_cmap=True)
sns.heatmap(corr, mask=blank, cmap=corr_map, square=True,
            vmax=.3, linewidths=0.25, cbar_kws={"shrink": .5})

# Quick plots
for variable in pick_10[1:]:
    if train[variable].dtype.name == 'object':
        plt.figure()
        sns.stripplot(y="saleprice", x=variable, data=train, jitter=True)
        plt.show()
        plt.figure()
        sns.factorplot(y="saleprice", x=variable, data=train, kind="box")
        plt.show()
    else:
        fig, ax = plt.subplots()
        ax.set_ylabel('Sale Price')
        ax.set_xlabel(variable)
        scatter_plot = ax.scatter(y=train['saleprice'], x=train[variable],
                                  facecolors='none', edgecolors='blue')
        plt.show()

plt.figure()
sns.factorplot(x="bldgtype", y="saleprice", col="exterqual",
               row="kitchenqual", hue="overallqual",
               data=train, kind="swarm")
• 53.
plt.figure()
sns.countplot(y="overallqual", hue="exterqual", data=train,
              palette="Greens_d")

# Run simple models
model1 = smf.ols(formula='saleprice ~ grlivarea', data=train).fit()
model2 = smf.ols(formula='saleprice ~ grlivarea + overallqual',
                 data=train).fit()
model3 = smf.ols(formula='saleprice ~ grlivarea + overallqual + garagecars',
                 data=train).fit()
model4 = smf.ols(formula='saleprice ~ grlivarea + overallqual + garagecars + yearbuilt',
                 data=train).fit()
model5 = smf.ols(formula='saleprice ~ grlivarea + overallqual + garagecars + yearbuilt'
                         ' + totalbsmtsf + kitchenqual + exterqual + centralair',
                 data=train).fit()

print('\n\nmodel 1----------\n', model1.summary())
print('\n\nmodel 2----------\n', model2.summary())
print('\n\nmodel 3----------\n', model3.summary())
print('\n\nmodel 4----------\n', model4.summary())
print('\n\nmodel 5----------\n', model5.summary())

# Collect fit statistics for all five models into one comparison table
out = [model1, model2, model3, model4, model5]
out_df = pd.DataFrame()
out_df['labels'] = ['rsquared', 'rsquared_adj', 'fstatistic', 'aic']
i = 0
for model in out:
    train['pred'] = model.fittedvalues
    plt.figure()
    train.plot.scatter(x='saleprice', y='pred', title='model' + str(i + 1))
    plt.show()
    out_df['model' + str(i + 1)] = [model.rsquared.round(3),
                                    model.rsquared_adj.round(3),
                                    model.fvalue.round(3),
                                    model.aic.round(3)]
    i += 1

train['predictions'] = model5.fittedvalues
print(train['predictions'])

# Clean test data
test.info()
test[3:] = test[3:].fillna(test[3:].median())
test["kitchenqual"] = test["kitchenqual"].fillna(
    test["kitchenqual"].value_counts().index[0])
test["exterqual"] = test["exterqual"].fillna(
    test["exterqual"].value_counts().index[0])
m = np.median(test.grlivarea[test.grlivarea > 0])
test = test.replace({'grlivarea': {0: m}})
print(test)

# Score the test data; floor any negative predictions at the training minimum
test_predictions = model5.predict(test)
test_predictions[test_predictions < 0] = train['saleprice'].min()
print(test_predictions)
• 55.
# Convert the predictions to a data frame, then merge with the test index
dat = {'p_saleprice': test_predictions}
df1 = test[['index']]
df2 = pd.DataFrame(data=dat)
submission = pd.concat([df1, df2], axis=1, join_axes=[df1.index])
print(submission)
submission.to_csv('C:/Users/Jahee Koo/Desktop/hw01_predictions.csv')

NAME: AmesHousing.txt
TYPE: Population
SIZE: 2930 observations, 82 variables
ARTICLE TITLE: Ames Iowa: Alternative to the Boston Housing Data Set

DESCRIPTIVE ABSTRACT: Data set contains information from the Ames Assessor's Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010.

SOURCES: Ames, Iowa Assessor's Office
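Since AmesHousing.txt uses tab characters to separate variables, a hedged sketch of loading it with pandas follows. The inline two-row sample stands in for the real 2930-row file so the snippet runs on its own; real usage would pass the file path instead.

```python
# Sketch: read a tab-delimited file like AmesHousing.txt with pandas.
import io
import pandas as pd

# Stand-in for the real file so this sketch is self-contained;
# column names and values here are illustrative.
sample = "Order\tPID\tSalePrice\n1\t526301100\t215000\n2\t526350040\t105000\n"
ames = pd.read_csv(io.StringIO(sample), sep='\t')

# With the actual file on disk, the call would be:
# ames = pd.read_csv('AmesHousing.txt', sep='\t')
print(ames.shape)
```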
• 56.
VARIABLE DESCRIPTIONS:
Tab characters are used to separate variables in the data file. The data has 82 columns which include 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables (and 2 additional observation identifiers).

Order (Discrete): Observation number

PID (Nominal): Parcel identification number - can be used with city web site for parcel review.

MS SubClass (Nominal): Identifies the type of dwelling involved in the sale.
  020  1-STORY 1946 & NEWER ALL STYLES
  030  1-STORY 1945 & OLDER
  040  1-STORY W/FINISHED ATTIC ALL AGES
  045  1-1/2 STORY - UNFINISHED ALL AGES
  050  1-1/2 STORY FINISHED ALL AGES
  060  2-STORY 1946 & NEWER
  070  2-STORY 1945 & OLDER
  075  2-1/2 STORY ALL AGES
  080  SPLIT OR MULTI-LEVEL
  085  SPLIT FOYER
  090  DUPLEX - ALL STYLES AND AGES
  120  1-STORY PUD (Planned Unit Development) - 1946 & NEWER
  150  1-1/2 STORY PUD - ALL AGES
  160  2-STORY PUD - 1946 & NEWER
  180  PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
  190  2 FAMILY CONVERSION - ALL STYLES AND AGES

MS Zoning (Nominal): Identifies the general zoning classification of the sale.
  A    Agriculture
  C    Commercial
  FV   Floating Village Residential
  I    Industrial
  RH   Residential High Density
  RL   Residential Low Density
  RP   Residential Low Density Park
  RM   Residential Medium Density

Lot Frontage (Continuous): Linear feet of street connected to property

Lot Area (Continuous): Lot size in square feet

Street (Nominal): Type of road access to property
  Grvl Gravel
  Pave Paved

Alley (Nominal): Type of alley access to property
  Grvl Gravel
  Pave Paved
  NA   No alley access

Lot Shape (Ordinal): General shape of property
  Reg  Regular
  IR1  Slightly irregular
  IR2  Moderately Irregular
  IR3  Irregular

Land Contour (Nominal): Flatness of the property
  Lvl  Near Flat/Level
  Bnk  Banked - Quick and significant rise from street grade to building
  HLS  Hillside - Significant slope from side to side
  Low  Depression

Utilities (Ordinal): Type of utilities available
  AllPub All public Utilities (E,G,W,& S)
  NoSewr Electricity, Gas, and Water (Septic Tank)
  NoSeWa Electricity and Gas Only
  ELO    Electricity only

Lot Config (Nominal): Lot configuration
  Inside  Inside lot
  Corner  Corner lot
  CulDSac Cul-de-sac
  FR2     Frontage on 2 sides of property
  FR3     Frontage on 3 sides of property

Land Slope (Ordinal): Slope of property
  Gtl  Gentle slope
  Mod  Moderate Slope
  Sev  Severe Slope

Neighborhood (Nominal): Physical locations within Ames city limits (map available)
  Blmngtn Bloomington Heights
  Blueste Bluestem
  BrDale  Briardale
  BrkSide Brookside
  ClearCr Clear Creek
  CollgCr College Creek
  Crawfor Crawford
  Edwards Edwards
  Gilbert Gilbert
  Greens  Greens
  GrnHill Green Hills
  IDOTRR  Iowa DOT and Rail Road
  Landmrk Landmark
  MeadowV Meadow Village
  Mitchel Mitchell
  NAmes   North Ames
  NoRidge Northridge
  NPkVill Northpark Villa
  NridgHt Northridge Heights
  NWAmes  Northwest Ames
  OldTown Old Town
  SWISU   South & West of Iowa State University
  Sawyer  Sawyer
  SawyerW Sawyer West
  Somerst Somerset
  StoneBr Stone Brook

  RRNn Within 200' of North-South Railroad
  RRAn Adjacent to North-South Railroad
  PosN Near positive off-site feature--park, greenbelt, etc.
  PosA Adjacent to positive off-site feature
  RRNe Within 200' of East-West Railroad
  RRAe Adjacent to East-West Railroad

Bldg Type (Nominal): Type of dwelling
  1Fam   Single-family Detached
  2FmCon Two-family Conversion; originally built as one-family dwelling
  Duplx  Duplex
  TwnhsE Townhouse End Unit
  TwnhsI Townhouse Inside Unit

House Style (Nominal): Style of dwelling
  1Story One story
  1.5Fin One and one-half story: 2nd level finished
  1.5Unf One and one-half story: 2nd level unfinished
  2Story Two story
  2.5Fin Two and one-half story: 2nd level finished
  2.5Unf Two and one-half story: 2nd level unfinished
  SFoyer Split Foyer
  SLvl   Split Level

Overall Qual (Ordinal): Rates the overall material and finish of the house
  10 Very Excellent
  9  Excellent
  8  Very Good
  7  Good
  6  Above Average
  5  Average
  4  Below Average
  3  Fair
  2  Poor
  1  Very Poor

Overall Cond (Ordinal): Rates the overall condition of the house
  10 Very Excellent
  9  Excellent
  8  Very Good
  7  Good
  6  Above Average
  5  Average
  4  Below Average
  3  Fair
  2  Poor
  1  Very Poor

Year Built (Discrete): Original construction date

Year Remod/Add (Discrete): Remodel date (same as construction date if no remodeling or additions)

Roof Style (Nominal): Type of roof
  Flat    Flat
  Gable   Gable
  Gambrel Gambrel (Barn)
  Hip     Hip
  Mansard Mansard
  Shed    Shed

Roof Matl (Nominal): Roof material
  ClyTile Clay or Tile
  CompShg Standard (Composite) Shingle
  Membran Membrane
  Metal   Metal
  Roll    Roll
  Tar&Grv Gravel & Tar
  WdShake Wood Shakes
  WdShngl Wood Shingles

Exterior 1 (Nominal): Exterior covering on house
  AsbShng Asbestos Shingles
  AsphShn Asphalt Shingles
  BrkComm Brick Common
  BrkFace Brick Face
  CBlock  Cinder Block
  CemntBd Cement Board
  HdBoard Hard Board
  ImStucc Imitation Stucco
  MetalSd Metal Siding
  Other   Other
  Plywood Plywood
  PreCast PreCast
  Stone   Stone
  Stucco  Stucco
  VinylSd Vinyl Siding
  Wd Sdng Wood Siding
  WdShing Wood Shingles

Exterior 2 (Nominal): Exterior covering on house (if more than one material)
  AsbShng Asbestos Shingles
  AsphShn Asphalt Shingles
  BrkComm Brick Common
  BrkFace Brick Face
  CBlock  Cinder Block
  CemntBd Cement Board
  HdBoard Hard Board
  ImStucc Imitation Stucco
  MetalSd Metal Siding
  Other   Other
  Plywood Plywood
  PreCast PreCast
  Stone   Stone
  Stucco  Stucco
  VinylSd Vinyl Siding
  Wd Sdng Wood Siding
  WdShing Wood Shingles

Mas Vnr Type (Nominal): Masonry veneer type
  BrkCmn  Brick Common
  BrkFace Brick Face
  CBlock  Cinder Block
  None    None
  Stone   Stone

Mas Vnr Area (Continuous): Masonry veneer area in square feet

Exter Qual (Ordinal): Evaluates the quality of the material on the exterior
  Ex Excellent
  Gd Good
  TA Average/Typical
  Fa Fair
  Po Poor

Exter Cond (Ordinal): Evaluates the present condition of the material on the exterior
  Ex Excellent
  Gd Good
  TA Average/Typical
  Fa Fair
  Po Poor

Foundation (Nominal): Type of foundation
  BrkTil Brick & Tile
  CBlock Cinder Block
  PConc  Poured Concrete
  Slab   Slab
  Stone  Stone
  Wood   Wood

Bsmt Qual (Ordinal): Evaluates the height of the basement
  Ex Excellent (100+ inches)
  Gd Good (90-99 inches)
  TA Typical (80-89 inches)
  Fa Fair (70-79 inches)
  Po Poor (<70 inches)
  NA No Basement

Bsmt Cond (Ordinal): Evaluates the general condition of the basement
  Ex Excellent
  Gd Good
  TA Typical - slight dampness allowed
  Fa Fair - dampness or some cracking or settling
  Po Poor - Severe cracking, settling, or wetness
  NA No Basement

Bsmt Exposure (Ordinal): Refers to walkout or garden level walls
  Gd Good Exposure
  Av Average Exposure (split levels or foyers typically score average or above)
  Mn Minimum Exposure
  No No Exposure
  NA No Basement

BsmtFin Type 1 (Ordinal): Rating of basement finished area
  GLQ Good Living Quarters
  ALQ Average Living Quarters
  BLQ Below Average Living Quarters
  Rec Average Rec Room
  LwQ Low Quality
  Unf Unfinished
  NA  No Basement

BsmtFin SF 1 (Continuous): Type 1 finished square feet

BsmtFin Type 2 (Ordinal): Rating of basement finished area (if multiple types)
  GLQ Good Living Quarters
  ALQ Average Living Quarters
  BLQ Below Average Living Quarters
  Rec Average Rec Room
  LwQ Low Quality
  Unf Unfinished
  NA  No Basement

BsmtFin SF 2 (Continuous): Type 2 finished square feet

Bsmt Unf SF (Continuous): Unfinished square feet of basement area

Total Bsmt SF (Continuous): Total square feet of basement area

Heating (Nominal): Type of heating
  Floor Floor Furnace
  GasA  Gas forced warm air furnace
  GasW  Gas hot water or steam heat
  Grav  Gravity furnace
  OthW  Hot water or steam heat other than gas
  Wall  Wall furnace

HeatingQC (Ordinal): Heating quality and condition
  Ex Excellent
  Gd Good
  TA Average/Typical
  Fa Fair
  Po Poor

Central Air (Nominal): Central air conditioning
  N No
  Y Yes

Electrical (Ordinal): Electrical system
  SBrkr Standard Circuit Breakers & Romex
  FuseA Fuse Box over 60 AMP and all Romex wiring (Average)
  FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
  FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
  Mix   Mixed

1st Flr SF (Continuous): First Floor square feet

2nd Flr SF (Continuous): Second floor square feet

Low Qual Fin SF (Continuous): Low quality finished square feet (all floors)

Gr Liv Area (Continuous): Above grade (ground) living area square feet

Bsmt Full Bath (Discrete): Basement full bathrooms

Bsmt Half Bath (Discrete): Basement half bathrooms

Full Bath (Discrete): Full bathrooms above grade

Half Bath (Discrete): Half baths above grade

Bedroom (Discrete): Bedrooms above grade (does NOT include basement bedrooms)

Kitchen (Discrete): Kitchens above grade

KitchenQual (Ordinal): Kitchen quality
  Ex Excellent
  Gd Good
  TA Typical/Average
  Fa Fair
  Po Poor

TotRmsAbvGrd (Discrete): Total rooms above grade (does not include bathrooms)

Functional (Ordinal): Home functionality (Assume typical unless deductions are warranted)
  Typ  Typical Functionality
  Min1 Minor Deductions 1
  Min2 Minor Deductions 2
  Mod  Moderate Deductions
  Maj1 Major Deductions 1
  Maj2 Major Deductions 2
  Sev  Severely Damaged
  Sal  Salvage only

Fireplaces (Discrete): Number of fireplaces

FireplaceQu (Ordinal): Fireplace quality
  Ex Excellent - Exceptional Masonry Fireplace
  Gd Good - Masonry Fireplace in main level
  TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
  Fa Fair - Prefabricated Fireplace in basement
  Po Poor - Ben Franklin Stove
  NA No Fireplace

Garage Type (Nominal): Garage location
  2Types  More than one type of garage
  Attchd  Attached to home
  Basment Basement Garage
  BuiltIn Built-In (Garage part of house - typically has room above garage)
  CarPort Car Port
  Detchd  Detached from home
  NA      No Garage

Garage Yr Blt (Discrete): Year garage was built

Garage Finish (Ordinal): Interior finish of the garage
  Fin Finished
  RFn Rough Finished
  Unf Unfinished
  NA  No Garage

Garage Cars (Discrete): Size of garage in car capacity

Garage Area (Continuous): Size of garage in square feet

Garage Qual (Ordinal): Garage quality
  Ex Excellent
  Gd Good
  TA Typical/Average
  Fa Fair
  Po Poor
  NA No Garage

Garage Cond (Ordinal): Garage condition
  Ex Excellent
  Gd Good
  TA Typical/Average
  Fa Fair
  Po Poor
  NA No Garage

Paved Drive (Ordinal): Paved driveway
  Y Paved
  P Partial Pavement
  N Dirt/Gravel

Wood Deck SF (Continuous): Wood deck area in square feet

Open Porch SF (Continuous): Open porch area in square feet

Enclosed Porch (Continuous): Enclosed porch area in square feet

3-Ssn Porch (Continuous): Three season porch area in square feet

Screen Porch (Continuous): Screen porch area in square feet

Pool Area (Continuous): Pool area in square feet

Pool QC (Ordinal): Pool quality
  Ex Excellent
  Gd Good
  TA Average/Typical
  Fa Fair
  NA No Pool

Fence (Ordinal): Fence quality
  GdPrv Good Privacy
  MnPrv Minimum Privacy
  GdWo  Good Wood
  MnWw  Minimum Wood/Wire
  NA    No Fence

Misc Feature (Nominal): Miscellaneous feature not covered in other categories
  Elev Elevator
  Gar2 2nd Garage (if not described in garage section)
  Othr Other
  Shed Shed (over 100 SF)
  TenC Tennis Court
  NA   None

Misc Val (Continuous): $Value of miscellaneous feature

Mo Sold (Discrete): Month Sold (MM)

Yr Sold (Discrete): Year Sold (YYYY)

Sale Type (Nominal): Type of sale
  WD    Warranty Deed - Conventional
  CWD   Warranty Deed - Cash
  VWD   Warranty Deed - VA Loan
  New   Home just constructed and sold
  COD   Court Officer Deed/Estate
  Con   Contract 15% Down payment regular terms
  ConLw Contract Low Down payment and low interest
  ConLI Contract Low Interest
  ConLD Contract Low Down
  Oth   Other

Sale Condition (Nominal): Condition of sale
  Normal  Normal Sale
  Abnorml Abnormal Sale - trade, foreclosure, short sale
  AdjLand Adjoining Land Purchase
  Alloca  Allocation - two linked properties with separate deeds, typically condo with a garage unit
  Family  Sale between family members
  Partial Home was not completed when last assessed (associated with New Homes)

SalePrice (Continuous): Sale price $$

I have to complete an EDA assignment using Python before Jan. 20th. I have approximate Python code and a report template example. I would like to ask you to complete the report based on these. I will attach the necessary files for analysis and report generation. After completion, please send me a .doc and a .py file.