SAS Notes Shane Zhang
Table of Contents
  SAS Input/Output
  Functions in Data Step
  Simple Statistics Procedures
  Hypothesis testing: mean and proportion
  Multiple linear regression
  Generalized linear regression
  Cluster Analysis
  Association Analysis
  Logistic Regression

SAS Input/Output

  /* output csv file */
  data _null_;
    set a;
    file 'xx.csv' dsd dlm=',';
    put x y z;
  run;

  /* define ftp connection as filename */
  filename test ftp '.cshrc' cd='/export/home/sz325584'
    host='forecast.marketing.fedex.com'
    user="sz325584"
    pass="xxxx";

  /* read one line into a variable */
  data flist;
    infile "ls ./" pipe length=len;
    input @;
    input fname $varying200. len;
  run;

  /* create graphs */
  goptions reset=all;
  proc gplot data=cars;
    plot price*(citympg hwympg cylinders enginesize);
    symbol v=circle;
  run;
  quit;

  /* read from ls -l output */
  data aa;
    infile 'ls -l' pipe dsd dlm=' ' missover;
    input dir $ owner $;
  run;

  /* read multiple files with the same layout */
  filename in ('200306.csv','200309.csv');
  data base;
    informat shp_dt mmddyy8.;
    format sshipym date9.;
    infile in dsd delimiter=',';
    /* the input statement listing the file's columns goes here */
  run;

  /* read in only part of the file, useful for large mainframe tape files */
  data base(drop=i);
    infile '1.csv' end=eof dsd firstobs=3 delimiter=',' missover;
    retain i 0;
    do while (i<20);
      input aa $;
      output;
      i=i+1;
    end;
    stop;   /* stop after the first 20 records instead of looping again */
  run;

Functions in Data Step
  mdy(month, day, year)
  intnx('month', sshipym, 1)
  put(variable, format)
  name=scan(string, 2, '&=');      /* split string on & or = and take the second item */
  substr(yyyymm, 5, 2)
  index(address, 'NY')             /* find position of a pattern in a string */
  call symput('numobs', numobs);   /* put a data step variable value into a SAS macro variable */

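As a minimal sketch of how the functions listed above behave, the step below applies them to literal values and writes the results to the log. The date, the query-style string, the address and the macro variable value are made-up inputs chosen only for illustration.

```sas
data _null_;
  sshipym = mdy(6, 15, 2003);                 /* SAS date for 15JUN2003 */
  next    = intnx('month', sshipym, 1);       /* start of the following month: 01JUL2003 */
  yyyymm  = put(sshipym, yymmn6.);            /* character value 200306 */
  mm      = substr(yyyymm, 5, 2);             /* characters 5-6: 06 */
  name    = scan('id=42&name=Ann', 2, '&=');  /* second token when splitting on & or =: 42 */
  pos     = index('Albany NY 12207', 'NY');   /* case-sensitive search, first match at 8 */
  call symput('numobs', '25');                /* macro variable name first, value second */
  put next= date9. yyyymm= mm= name= pos=;
run;

%put numobs=&numobs;   /* the macro variable is available after the step ends */
```

Note that the first argument of call symput is the macro variable name and the second is the value to store.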
Simple Statistic Procedures

  /* create histogram */
  proc univariate data=one noprint;
    var v1;
    histogram v1 / normal;
  run;

  /* create means, standard deviations */
  PROC MEANS DATA=volume;
    VAR adv;
    OUTPUT OUT=volume_stat(KEEP=MEAN STD MAX MIN)
      MEAN(adv)=MEAN STD(adv)=STD MAX(adv)=MAX MIN(adv)=MIN;
  run;

  /* random selection; with a strata statement the input must be sorted by segment and n=15 is drawn per stratum */
  proc surveyselect data=trees method=srs n=15 out=sample1;
    strata segment;
  run;

Variable Types
Categorical (nominal) variables, such as favorite color, have two or more categories and no way to order the values. Other examples include gender, blood type and favorite ice cream flavor.
Ordinal variables can be ordered, but are similar to categorical variables in that there are clear categories. The spacing between adjacent values is not necessarily uniform.
Continuous/interval variables are similar to ordinal variables, except that values are measured so that their differences are meaningful. The finishing place of runners in a race is an ordinal scale, but if we consider the runners' actual times rather than their place, this is an interval scale.

Hypothesis testing
A statistical test is a quantitative way to decide whether there is enough evidence to reasonably believe a conjecture to be true.
Each test has a null hypothesis H0 and an alternative hypothesis Ha.
H0 normally assumes no difference in means or, in regression analysis, no relationship between the predictor and the response variable, i.e. coefficient=0.
To control type I error, we often set the threshold at 5% and reject the null hypothesis only when p<0.05. In other words, we accept Ha (there is a difference or there is a relationship) only when the evidence is very strong.

One-tailed or two-tailed hypothesis testing
To obtain correct results, it is important to determine whether the hypothesis test is one-tailed or two-tailed.
When the null and alternative hypotheses are of the form H0: x1 = x2 with Ha: x1 > x2 or Ha: x1 < x2, we call that a one-tailed test. When the alternative hypothesis is of the form Ha: x1 ≠ x2, we call that a two-tailed test.

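The difference between the two kinds of test shows up in how the p-value is computed from the same t statistic. The sketch below uses the probt function (the cumulative t distribution); the statistic t=1.8 and the 24 degrees of freedom are hypothetical values chosen only for illustration.

```sas
data _null_;
  t  = 1.8;    /* hypothetical test statistic */
  df = 24;     /* hypothetical degrees of freedom */
  p_upper = 1 - probt(t, df);              /* one-tailed, Ha: x1 > x2 */
  p_lower = probt(t, df);                  /* one-tailed, Ha: x1 < x2 */
  p_two   = 2 * (1 - probt(abs(t), df));   /* two-tailed, Ha: x1 not equal to x2 */
  put p_upper= p_lower= p_two=;
run;
```

For a positive t the two-tailed p-value is twice the upper-tailed one, so a result can be significant at 5% one-tailed and not significant two-tailed.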
Hypothesis testing on means - Ttest
We can use t-tests in the following three situations:
  We want to test whether the mean is significantly different from a hypothesized value.
  We want to test whether means for two independent groups are significantly different.
  We want to test whether means for dependent or paired groups are significantly different.

Ttest
The two-sample t-test is a special form of one-way ANOVA in which the class variable has only two values.

Is the average cereal box weight different from 15 ounces? (two-sided)
  PROC TTEST DATA=datasetname H0=15;   /* can also be done with proc univariate */
    VAR weight;
  run;

Is the average cereal box weight above 15 ounces? (one-sided)
  ods graphics on;
  proc ttest data=datasetname h0=15 plots(showh0) sides=u alpha=0.1;
    var weight;
  run;

Test whether the means of two independent groups are the same
(control group vs. target group, or different brands of cereal box).
  PROC TTEST DATA=datasetname;
    CLASS brand;
    VAR weight;
  run;

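As a rough, self-contained sketch of the calls above, the following code builds a small made-up data set and runs the one-sample and two-sample tests on it. The weights and the brand labels A and B are hypothetical and are not from the original notes.

```sas
/* hypothetical cereal box weights in ounces for two brands */
data cereal;
  input brand $ weight @@;
  datalines;
A 15.1 A 14.8 A 15.3 A 15.0 A 14.9
B 15.4 B 15.6 B 15.2 B 15.5 B 15.3
;
run;

/* one-sample, two-sided: is the overall mean weight different from 15? */
proc ttest data=cereal h0=15 sides=2;
  var weight;
run;

/* two-sample: do brands A and B have the same mean weight? */
proc ttest data=cereal;
  class brand;
  var weight;
run;
```

The two-sample output reports both the pooled and the Satterthwaite test, together with a test for equality of variances that helps choose between them.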
Paired Ttest
Tests two attributes that belong to the same object, e.g.
  Same account, pre-campaign sales and post-campaign sales.
  Same student, reading and writing scores.

Test whether account sales are different after a marketing campaign. Note: pre and post sales are dependent groups.
  PROC TTEST DATA=datasetname;
    PAIRED pre_sale*post_sale;
  run;

Or test whether students' reading and writing scores are significantly different.
  PROC TTEST DATA=datasetname;
    PAIRED read_score*write_score;
  run;

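A paired t-test is a one-sample t-test on the within-pair differences. The sketch below illustrates this on made-up account data (the six accounts and their sales figures are hypothetical): both procedure calls should report the same t statistic and p-value.

```sas
/* hypothetical pre- and post-campaign sales per account */
data sales;
  input account pre_sale post_sale;
  diff = post_sale - pre_sale;   /* within-account change */
  datalines;
1 100 112
2 250 245
3 80 95
4 130 150
5 60 58
6 210 230
;
run;

/* paired test: analyzes post_sale - pre_sale */
proc ttest data=sales;
  paired post_sale*pre_sale;
run;

/* equivalent one-sample test on the differences */
proc ttest data=sales h0=0;
  var diff;
run;
```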
ANOVA
When comparing means from more than two groups, use one-way ANOVA.
Two-way ANOVA means there are two CLASS variables (e.g. CLASS SEGMENT INDUSTRY).
There are two common ways to run ANOVA in SAS. The seemingly obvious way is PROC ANOVA, which is intended for balanced designs. The other is PROC GLM, which also handles unbalanced data and supports a few more options.

ANOVA
H0: All means are equal across brands.
Ha: At least one brand's mean weight differs from the others.
  PROC ANOVA DATA=cereal;
    CLASS brand;
    MODEL weight=brand;
    MEANS brand;
  run;
  quit;

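For comparison, here is a minimal sketch of the same one-way ANOVA run through PROC GLM. The data set is made up (three hypothetical brands A, B and C with four boxes each), and the tukey option on the means statement is an addition that requests pairwise comparisons between brands.

```sas
/* hypothetical cereal box weights for three brands */
data cereal;
  input brand $ weight @@;
  datalines;
A 15.1 A 14.8 A 15.3 A 15.0
B 15.4 B 15.6 B 15.2 B 15.5
C 14.7 C 14.9 C 14.6 C 15.0
;
run;

proc glm data=cereal;
  class brand;
  model weight = brand;    /* overall F test of H0: all brand means are equal */
  means brand / tukey;     /* Tukey pairwise comparisons of brand means */
run;
quit;
```

The overall F test only says that some difference exists; the pairwise comparisons indicate which brands differ.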
Nonparametric ANOVA
Used when we cannot assume a normal distribution, for example when the sample size is too small for the distribution of the sample mean to be approximately normal.
  proc npar1way data=sasdata;
    class variable;
    var variables;
  run;

Hypothesis testing on proportion
A chi-square goodness-of-fit test allows us to test whether the observed proportions for a categorical variable differ from hypothesized proportions.
For example, suppose we believe that the general population consists of 10% Hispanic, 10% Asian, 10% African American and 70% White people.
  proc freq data=mydata;
    tables race / chisq testp=(10 10 10 70);
  run;

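The goodness-of-fit test can also be run on pre-aggregated counts by adding a weight statement. In the sketch below the counts are made up (200 hypothetical respondents), and order=data is used so that the testp percentages line up with the order in which the levels appear in the data set.

```sas
/* hypothetical observed counts per group */
data race_counts;
  length race $ 16;
  input race $ count;
  datalines;
Hispanic 18
Asian 22
AfricanAmerican 25
White 135
;
run;

proc freq data=race_counts order=data;
  weight count;                              /* each row stands for count observations */
  tables race / chisq testp=(10 10 10 70);   /* percentages follow the level order */
run;
```

Without order=data the levels are sorted by their formatted values, so the testp list must then be given in that sorted order.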
One-way MANOVA
MANOVA (multivariate analysis of variance) is like ANOVA, except that there are two or more dependent variables. In a one-way MANOVA, there is one categorical independent variable and two or more dependent variables.
E.g. examine the differences in read, write and math broken down by program type (prog).
  proc glm data="c:ydatasb2";
    class prog;
    model read write math = prog;
    manova h=prog;
  run;
  quit;

Multiple Linear Regression
A powerful tool to understand the relationship between predictor and response variables and to predict future values.
Linear is in terms of the coefficients. The following are both multiple linear models:
  y=b0+b1*x1+b2*x2
  y=b0+b1*x1^2+b2*x1*x2
Assumptions: linear relationship; the error term is normally distributed and independent.
For N independent variables, the number of possible models is 2^N, which is very computing intensive. We can use stepwise selection to find a model quickly.
Plot the chart before trying models:
  goptions reset=all;
  proc gplot data=paper;
    symbol i=rq;              /* overlay a quadratic regression fit on the chart */
    title 'Quadratic model';
    plot strength*amount;
  run;
    symbol i=rc;              /* overlay a cubic regression fit on the chart */
    title 'Cubic model';
    plot strength*amount;
  run;
  quit;

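A minimal sketch of stepwise selection with proc reg follows. The data are simulated (the variable names x1 to x3, the true coefficients and the seed are assumptions made for illustration), with y depending on x1 and x2 only, so x3 should usually be left out of the selected model.

```sas
/* simulated data: y depends on x1 and x2, x3 is pure noise */
data sim;
  do i = 1 to 100;
    x1 = rannor(1);
    x2 = rannor(1);
    x3 = rannor(1);
    y  = 1 + 2*x1 - x2 + rannor(1);
    output;
  end;
  drop i;
run;

proc reg data=sim;
  /* a variable enters at p<0.05 and is removed again if its p-value rises above 0.05 */
  model y = x1 x2 x3 / selection=stepwise slentry=0.05 slstay=0.05;
run;
quit;
```

Stepwise selection examines only a small part of the 2^N candidate models, so the result is a reasonable model rather than a guaranteed best one.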
Evaluate Model Assumption
  Normality
    Normal probability plots of the residuals using proc univariate
  Independent observations
    Plot residuals vs. time and other ordering components
    Durbin-Watson statistic or the first-order autocorrelation statistic for time series data
  Constant variance
    Plot residuals vs. predicted values
    Spearman rank correlation coefficient between the absolute value of the residuals and the predicted values

Model fitness
Examine model-fitting statistics such as R^2, adjusted R^2, AIC, SBC.
If the overall model p-value<0.05 then at least one of the predictors is significant.
Examine each coefficient's p-value.
Examine residual plots and validate the normality assumption.
  proc reg data=mydata;
    model reading = age gender;
    plot r.*p.;                  /* plot the residuals vs. the predicted values */
    output out=out r=residuals;
  run;
  quit;
  proc univariate data=out;
    var residuals;
    histogram / normal;
  run;

Remedial measures
  When a straight line is inappropriate
    Transform the independent variables to obtain linearity
    Fit a polynomial regression model
    Fit a nonlinear regression model using proc nlin
  When there is multicollinearity
    Exclude redundant independent variables
    Center the independent variables in a polynomial regression model
  When there are influential observations
    Make sure there are no data errors
    Investigate the cause of the data
    Delete the observations if appropriate and document the situation
  Transforming the dependent variable
    Transforming the dependent variable is one of the common approaches to deal with non-normal data and/or non-constant variances.

Regression with Categorical Predictors
In proc reg, a categorical predictor needs to be coded into dummy variables as input. In proc glm, this is done automatically when the variable is listed in the class statement.
This data step shows how to code dummy variables (0/1 values) from the category variable "catvar", which has values 1, 2, 3:
  DATA mydata;
    set mydata;
    array dum(3) dum1-dum3;
    do i = 1 to 3;
      dum(i)=(catvar=i);
    end;
    drop i;
  run;
If an ordinal predictor has only three or four levels, then it should clearly be coded using dummy coding. There are times when an ordinal predictor can be treated as if it were interval (this is called quasi-interval), especially if the variable has more than five or six levels.

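When the dummy variables go into proc reg together with an intercept, one level has to be left out as the reference; otherwise the dummies sum to 1 and duplicate the intercept. The sketch below uses a made-up response y and nine hypothetical observations to illustrate this, with level 3 as the reference.

```sas
/* hypothetical data: three observations per category */
data mydata;
  input catvar y @@;
  datalines;
1 10 1 12 1 11 2 15 2 17 2 16 3 20 3 22 3 21
;
run;

/* code the 0/1 dummy variables as in the data step above */
data mydata;
  set mydata;
  array dum(3) dum1-dum3;
  do i = 1 to 3;
    dum(i) = (catvar = i);
  end;
  drop i;
run;

proc reg data=mydata;
  model y = dum1 dum2;   /* dum3 omitted: level 3 is the reference level */
run;
quit;
```

In this coding the intercept estimates the mean of y at level 3, and each dummy coefficient estimates the difference between its level and level 3.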
Polynomial Regression
A polynomial regression is a special type of multiple linear regression where powers of variables and cross-product (interaction) terms are included in the model.
  Y=b0+b1*X1+b2*X2+b3*X1^2+b4*X1*X2
  data a;
    set a;
    var2=var1**2;
    var3=var1**3;
  run;
  proc reg data=a;
    model y=var1 var2 var3;
  run;
  quit;

Collinearity
Detect collinearity
  proc reg data=mydata;
    model oxygen_consumption=runtime age weight / vif collin collinoint;
  run;
  quit;
The vif and collin options provide collinearity diagnostics. A VIF>10 or a condition index (from collin) >100 indicates strong collinearity.
Treatment
Polynomial terms often introduce collinearity. One treatment is to center the data, so that some values become negative, which reduces the correlation between x and its powers x^2 and x^3.
  proc stdize data=paper method=mean out=paper1;
    var var1;
  run;
  data paper1;
    set paper1;
    mvar2=var1**2;
    mvar3=var1**3;
  run;

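The effect of centering can be checked directly with proc corr. The sketch below uses a made-up predictor x taking the values 1 to 20 (an assumption for illustration) and compares the correlation between x and x^2 before and after centering.

```sas
/* hypothetical predictor and its square, uncentered */
data raw;
  do x = 1 to 20;
    x2 = x**2;
    output;
  end;
run;

/* subtract the mean from x (method=mean centers without rescaling) */
proc stdize data=raw method=mean out=centered;
  var x;
run;

/* square the centered values */
data centered;
  set centered;
  cx2 = x**2;
run;

proc corr data=raw;        /* x and x2 are very strongly correlated */
  var x x2;
run;

proc corr data=centered;   /* centered x and its square are close to uncorrelated */
  var x cx2;
run;
```

The correlation drops to about zero here because the centered values are symmetric around zero; with skewed data centering reduces the correlation but does not remove it.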
Multivariate multiple regression
Multivariate multiple regression is used when you have two or more variables that are to be predicted from two or more predictor variables. In our example, we will predict write and read from female, math, science and social studies (socst) scores.
The mtest statement in proc reg is used to test hypotheses in multivariate regression models where several dependent variables are fit to the same regressors.
  proc reg data="c:ydatasb2";
    model write read = female math science socst;
    female:  mtest female;
    math:    mtest math;
    science: mtest science;
    socst:   mtest socst;
  run;
  quit;

Autoregression
If the errors (residuals) are not independent, an autoregression model should be used. By simultaneously estimating the regression coefficients and the autoregressive error model parameters, the autoreg procedure corrects the regression estimates.
  yt = b0 + b1*x1 + ... + bk*xk + vt
  vt = -p1*v(t-1) - p2*v(t-2) - ... + et
  proc autoreg data=sales;
    model sales=price promotion / nlag=3 method=ml dwprob;
  run;

Autoregression and ARIMA for time series forecast
t is the month order, t=_n_;
The order of the AR(p) model is chosen by a backward elimination search.
  proc autoreg data=taxrevenue;
    model rev = t d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 / nlag=4 dw=4 dwprob
          method=ml backstep slstay=0.05;
  run;
Use of proc arima to fit ARMA models consists of 3 steps.
  The first step is model identification, in which the observed series is transformed to be stationary. The only transformation available within proc arima is differencing.
  The second step is model estimation, in which the orders p and q are selected and the corresponding parameters are estimated.
  The third step is forecasting, in which the estimated model is used to forecast future values of the observable time series.
  proc arima data=history;
    i var=adv nlag=15;                               /* i = identify */
    e p=1 q=3;                                       /* e = estimate */
    f lead=12 interval=month id=sshipym out=fcst;    /* f = forecast */
  run;
  quit;
If the data is not stationary, we can use the differenced series adv(1) instead of adv.

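The three proc arima steps can be written with the full statement names and a differenced series as follows. This is only a sketch on a simulated monthly series: the trend, the noise, the seed and the start date are made up, and a smaller AR(1) model is fitted here rather than the p=1 q=3 model above, because the toy series is short.

```sas
/* simulated monthly series with a trend, so it is not stationary */
data history;
  do t = 1 to 60;
    sshipym = intnx('month', '01jan2003'd, t-1);   /* first day of each month */
    adv = 100 + 2*t + 5*rannor(123);
    output;
  end;
  format sshipym date9.;
run;

proc arima data=history;
  identify var=adv(1) nlag=15;    /* step 1: first difference, inspect the ACF and PACF */
  estimate p=1;                   /* step 2: fit an AR(1) model to the differenced series */
  forecast lead=12 interval=month id=sshipym out=fcst;   /* step 3: forecast 12 months ahead */
run;
quit;
```

The forecasts in the fcst data set are reported on the original scale of adv, since proc arima undoes the differencing when forecasting.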
Generalized linear model
No longer requires that the response variable, conditional on the predictor variables, follow a normal distribution with constant variance.
Examples:
  Model                  Distribution
  Linear regression      normal
  Logistic regression    binomial
  Poisson regression     Poisson
  Gamma regression       gamma

Poisson Regression
The Poisson distribution is used to model counts (non-negative integers) or occurrence rates of rare events (as the mean gets larger, the Poisson distribution approaches the normal distribution).
  Number of ear infections in infants
  Number of equipment failures
  Rate of insurance claims
Differences from the normal distribution
  Not symmetrical, skewed to the right for rare events
  Has only one parameter (the mean); the variance equals the mean.
Example
  proc genmod data=skincancer;
    class city age;
    model cases=city age / offset=log_pop dist=poi link=log type3;
    title 'Poisson regression model for skin cancer rates';
  run;
The offset lets genmod model rates instead of counts.

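The offset variable has to be on the log scale to match link=log, because log(cases/pop) = log(cases) - log(pop). The sketch below shows how log_pop might be prepared; the cities, age groups, case counts and populations are made up for illustration.

```sas
/* hypothetical case counts and populations by city and age group */
data skincancer;
  input city $ age $ cases pop;
  log_pop = log(pop);   /* offset on the log scale, matching link=log */
  datalines;
North young 5 120000
North mid 30 100000
North old 80 60000
South young 12 150000
South mid 70 130000
South old 160 70000
;
run;

proc genmod data=skincancer;
  class city age;
  model cases = city age / offset=log_pop dist=poi link=log type3;
run;
```

With the offset in place, the exponentiated coefficients can be read as rate ratios relative to the reference levels of city and age.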
Proc reg and proc glm
  Proc reg is more convenient for regression analysis because of its options for plots and automatic model selection (no class statement).
  Proc glm is more convenient for ANOVA because of its class statement (no plot or model selection options).

Cluster Analysis
The primary goal of market segmentation is to better satisfy customer needs or wants. The firm does not want to
  Use the same marketing program for all customers
  Incur the high cost of a unique program for each customer
The same items can be grouped in more than one way. A deck of 52 cards can be grouped as
  26 red and 26 black
  13 each of spades, hearts, diamonds and clubs

Cluster Analysis
The scale of measurement will impact the grouping. It is better to standardize input variables before clustering.
Two types of clustering
  Hierarchical clustering
    Used for small data sets.
    Can determine the number of clusters by finding the local peaks of the pseudo F and pseudo t-squared statistics.
    There may be no theoretical reason to expect a hierarchical structure.
  Non-hierarchical clustering
    Scales well with large or complex data
    The number of clusters needs to be specified in advance
    Initial seeds are required
Combination
  A two-step method is used. First, a hierarchical method is applied to the training data to decide the number of clusters and the initial choice of seeds. These are then fed into a non-hierarchical method (such as k-means), which is applied to the whole data set.
  SAS Enterprise Miner automates this two-step process.

Cluster Analysis

Association Analysis
Chi-Square Test
  proc freq data=mydata;
    table gender*purchase / chisq expected cellchi2 nocol nopercent;
    title1 'Association between Gender and Purchase';
  run;
When more than 20% of cells have expected counts less than five, the chi-square test may not be valid.

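As a self-contained sketch of the call above, the step below enters a made-up 2x2 table as cell counts and uses a weight statement so that each row counts as that many customers. The counts are hypothetical.

```sas
/* hypothetical cell counts for gender by purchase */
data mydata;
  input gender $ purchase $ count;
  datalines;
Female Yes 40
Female No 60
Male Yes 25
Male No 75
;
run;

proc freq data=mydata;
  weight count;   /* each row stands for count customers */
  table gender*purchase / chisq expected cellchi2 nocol nopercent;
run;
```

The expected option prints the expected cell counts, which makes it easy to check the rule about cells with expected counts below five.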
Regression Overview

SAS Logistic Procedures
  proc logistic data=abc.logisticd descending;
    class gender (ref='Male') income (ref='Low') / param=ref;
    model purchase=gender income / rsquare lackfit ctable selection=stepwise;
    output out=probs predicted=phat;
  run;
"descending" makes the procedure model P(Y=1).
Using the estimates of b0, b1, ... from the training data, we can calculate P for the population data set. We can use P>0.5 as the threshold to predict that the event happens (i.e. will purchase, etc.).
  P = 1/(1 + exp(-(b0 + b1*X1 + b2*X2 + ...)))
Model fit. When comparing models, the model with the lower AIC is better.
                Intercept only    Intercept and covariates
  AIC           560               550
  SC            570               560
  -2 Log L      575               550

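The scoring formula above can be applied in a data step once the coefficients are known. In the sketch below the coefficient values b0, b1 and b2 and the three input rows are hypothetical; in practice the estimates would come from the proc logistic output.

```sas
data scored;
  /* hypothetical coefficient estimates from a fitted model */
  b0 = -1.2;
  b1 = 0.8;
  b2 = 0.5;
  input x1 x2;
  p = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2)));   /* predicted probability of the event */
  will_purchase = (p > 0.5);                  /* 1 if predicted to purchase, 0 otherwise */
  datalines;
0 0
1 1
1 0
;
run;

proc print data=scored;
  var x1 x2 p will_purchase;
run;
```

P>0.5 is equivalent to the linear predictor b0 + b1*x1 + b2*x2 being positive, so only the second row (0.1) is classified as a purchase.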
SAS Regression Guide

  • 2. Table of Contents SAS Input/Output Functions in Data Step Simple Statistics Procedures Hypothesis testing – mean and proportion Multiple linear regression Generalized linear regression Cluster Analysis Association Analysis Logistic Regression
  • 3. SAS Input/Output /*output csv file*/ data _null_; set a; file 'xx.csv' dsddlm=','; put x y z; /*define ftp connection as filename*/ filename test ftp '.cshrc' cd ='/export/home/sz325584' host='forecast.marketing.fedex.com' user="sz325584" pass="xxxx"; /*read one line into a variable*/ data flist; infile "ls ./" pipe length=len; input @; input fname $varying200. len; /*create graphs*/ goptions reset=all; proc gplot data=cars; plot price*(citympg, hwympg, cylinders enginesize); Symbol v=circle; /*read from ls –l output*/ data aa; infile 'ls -l ' pipe dsddlm=' ' missover; input dir $ owner $; /*read multiple file with same layout */ filename in ('200306.csv','200309.csv' ); data base; informatshp_dt mmddyy8.; format sshipym date9.; infile in dsd delimiter=',' ; /*read in only part of the file,useful for large mainframe tape file*/ data base(drop=i); infile '1.csv' end=eofdsdfirstobs=3 delimiter=',' missover; retain i 0; do while (i<20); input aa $; output; i=i+1; end;
  • 4. Functions in Data Step mdy intnx(‘month’,sshipym,1) put(variable, format) Name=scan(string,2,”&=“) /*split string by &= and find the second item*/ substr(yyyymm,5,2) index(address,’NY’) /*find position of a pattern in a string*/ call symput(numobs, ‘numobs’) /*put data step variable value into a sasmacro variable*/
  • 5. Simple Statistic Procedures /*create histogram*/ proc univariate data=one noprint; var v1; histogram v1 / normal; run; /*create means, standard deviations */ PROC MEANS DATA=volume; VAR adv; OUTPUT OUT=volume_stat(KEEP=MEAN STD MAX MIN) MEAN(adv)=MEAN STD(adv)=STD MAX(adv)=MAX MIN(adv)=MIN; /*Random select*/ Proc surveyselect data = trees Method = SRS n = 15 out = sample1; strata segment; run;
  • 6. Variables Types Categorical or nominal variables are ones such as favorite color, which have two or more categories and no way to order the values. Other examples of categorical variables include gender, blood type and favorite ice cream flavor. Ordinal variables can be ordered, but are similar to categorical variables in that there are clear categories. The relative distances or spacing between variables values is not uniform. Continuous/Interval variables are similar to ordinal variables, except that values are measured in a way where their differences are meaningful. The place number of runners in a race is considered an ordinal scale, but if we consider the actual times of runners rather than their place, this would be an interval scale.
  • 7. Hypothesis testing A statistical test is a quantitative way to decide whether there is enough evidence to reasonably believe a conjecture to be true. null hypothesis H0, and the alternative hypothesis Ha. H0 normally assumes no difference in means or in regression analysis, no relationship between predicator and response variable, i.e. coefficient=0 To control type I error, we often set threshold to be 5%, only reject null hypothesis when p<0.05. Or in other words, only accept Ha (there is difference or there is relationship) when evidence is very strong.
  • 8. One tail or two tailed hypothesis testing To obtain correct results, it is important to determine whether the hypothesis tests are one or two-tailed. When the null and alternative hypotheses are of the form H0: x1= x2, with Ha: x1> x2 or Ha: x1< x2, we call that a one-tailed test, and when the null hypothesis is of the form x1 x2, we call that a two tailed test.
  • 9. Hypothesis testing on means - Ttest We can use t-tests in the following three situations; We want to test whether the mean is significantly different than a hypothesized value. We want to test whether means for two independent groups are significantly different. We want to test whether means for dependent or paired groups are significantly different.
  • 10. Ttest Ttest is a special form of one way ANOVA where category variable has only two values. Whether the cereal box avg weight is different from 15 ounce? (two sided) PROC TTEST DATA= datasetnameH0=15; can also be done with proc univariate. VAR weight; Whether the cereal box avg weight is above 15 ounce? (one sided) ods graphics on; proc ttest h0=15 plots(showh0) sides=u alpha=0.1; var weight; Test whether the means of two independent group are the same. (control group vs. target group or different brands of cereal box) PROC TTEST DATA= datasetname; CLASS brand; VAR weight;
  • 11. Paired Ttest Test two attributes belong to the same object Eg. Same account, pre campaign sales and post campaign sale. Same student, reading and writing scores. Test whether account sales different after marketing campaign? Note: pre and post sales are dependent groups. PROC TTEST DATA= datasetname; PAIRED pre_sale*post_sale; Or test whether students reading and writing scores are significant different. PROC TTEST DATA= datasetname; PAIRED read_score*write_score;
  • 12. ANOVA When comparing means from more than two groups, use one way ANOVA. Two way ANOVA means there are two CLASS variables (eg CLASS SEMENT INDUSTRY). There are two common ways to run ANOVA in SAS. A seemingly obvious way is PROC ANOVA, the other is PROC GLM, which has the added advantage of allowing with a few more SAS options.
  • 13. ANOVA
    H0: The mean weight is equal across all brands.
    Ha: At least one brand has a different mean weight.
      PROC ANOVA DATA=cereal;
        CLASS brand;
        MODEL weight = brand;
        MEANS brand;
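The same one-way ANOVA can be run with PROC GLM, which also accepts options on the MEANS statement. In this minimal sketch the `cereal` values are invented, and the `tukey` and `hovtest=levene` options are added here as examples of the extra options rather than taken from the slides.

```sas
/* Hypothetical cereal data: box weight in ounces for three brands */
data cereal;
  input brand $ weight @@;
  datalines;
A 15.1 A 14.9 A 15.3 A 15.0
B 15.4 B 15.6 B 15.2 B 15.5
C 14.8 C 14.7 C 15.0 C 14.9
;
run;

proc glm data=cereal;
  class brand;
  model weight = brand;
  /* tukey: pairwise comparisons of brand means          */
  /* hovtest=levene: test the equal-variance assumption  */
  means brand / tukey hovtest=levene;
run;
quit;
```

The overall F test answers whether any brand differs. The Tukey comparisons then show which pairs of brands differ.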
  • 14. Nonparametric ANOVA
    Used when we cannot assume a normal distribution, for example when the sample size is too small for the distribution of the mean to be approximately normal.
      proc npar1way data=sasdata;
        class variable;
        var variables;
  • 15. Hypothesis testing on proportion
    A chi-square goodness-of-fit test allows us to test whether the observed proportions for a categorical variable differ from hypothesized proportions. For example, suppose we believe that the general population is 10% Hispanic, 10% Asian, 10% African American and 70% White.
      proc freq data=mydata;
        tables race / chisq testp=(10 10 10 70);
      run;
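For reference, the goodness-of-fit statistic behind `chisq testp=` compares observed and expected counts. The symbols below are notation introduced here, not taken from the slides: O_i is the observed count in category i, p_i the hypothesized proportion, n the total sample size and k the number of categories.

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},
\qquad E_i = n\,p_i,
\qquad \text{df} = k - 1
```

In the race example k = 4 and the hypothesized proportions are 0.10, 0.10, 0.10 and 0.70, so the statistic is compared with a chi-square distribution with 3 degrees of freedom.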
  • 16. One-way MANOVA
    MANOVA (multivariate analysis of variance) is like ANOVA, except that there are two or more dependent variables. In a one-way MANOVA, there is one categorical independent variable and two or more dependent variables. E.g. examine the differences in read, write and math broken down by program type (prog).
      proc glm data = "c:ydatasb2";
        class prog;
        model read write math = prog;
        manova h=prog;
      run;
      quit;
  • 17. Multiple Linear Regression
    A powerful tool to understand the relationship between predictor and response variables and to predict future values.
    "Linear" refers to the coefficients. Both of the following are multiple linear models:
      y = b0 + b1*x1 + b2*x2
      y = b0 + b1*x1^2 + b2*x1*x2
    Assumptions: the relationship is linear, and the error term is normally distributed and independent.
    For N independent variables, the number of possible models is 2^N, which is very computing intensive. We can use stepwise selection to find a model quickly.
    Plot the data before trying models:
      goptions reset=all;
      proc gplot data=paper;
        symbol i=rq;   /* impose quadratic regression model on the chart */
        title 'Quadratic model';
        plot strength*amount;
      run;
        symbol i=rc;   /* impose cubic regression model on the chart */
        title 'Cubic model';
        plot strength*amount;
      run;
      quit;
  • 18. Evaluate Model Assumption
    Normality
      - Normal probability plots of the residuals using proc univariate
    Independent observations
      - Plot residuals vs. time and other ordering components
      - Durbin-Watson statistic or the first-order autocorrelation statistic for time series data
    Constant variance
      - Plot residuals vs. predicted values
      - Spearman rank correlation coefficient between the absolute values of the residuals and the predicted values
  • 19. Model fitness
      - Examine model-fitting statistics such as R^2, adjusted R^2, AIC and SBC
      - If the overall model p-value < 0.05, then at least one of the predictors is significant
      - Check each coefficient's p-value
      - Examine residual plots and validate the normality assumption
      proc reg data=mydata;
        model reading = age gender;
        plot r.*p.;   /* plot the residuals vs. the predicted values */
        output out=out r=residuals;
      run;
      quit;
      proc univariate data=out;
        var residuals;
        histogram / normal;
      run;
  • 20. Remedial measures
    When a straight line is inappropriate
      - Transform the independent variables to obtain linearity
      - Fit a polynomial regression model
      - Fit a nonlinear regression model using proc nlin
    When there is multicollinearity
      - Exclude redundant independent variables
      - Center the independent variables in a polynomial regression model
    When there are influential observations
      - Make sure there are no data errors
      - Investigate the cause of the influential observations
      - Delete the observations if appropriate and document the situation
    Transforming the dependent variable
      - Transforming the dependent variable is one of the common approaches to deal with non-normal data and/or non-constant variance.
  • 21. Regression with Categorical Predictors
    In proc reg, a categorical predictor needs to be coded into dummy variables before it is used as input. In proc glm, this is done automatically when the variable is listed in the class statement.
    This data step shows how to code dummy variables (0/1 values) from the category variable "catvar", which has the values 1, 2, 3:
      DATA mydata;
        set mydata;
        array dum(3) dum1-dum3;
        do i = 1 to 3;
          dum(i) = (catvar = i);
        end;
        drop i;
      run;
    If an ordinal predictor has only three or four levels, it should be coded using dummy coding. An ordinal predictor can sometimes be treated as if it were interval (this is called quasi-interval), especially if the variable has more than five or six levels.
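The sketch below contrasts the two approaches on a tiny invented data set. The response `y` and all data values are hypothetical. Only two of the three dummies enter the proc reg model, because with an intercept the third dummy is redundant, and level 3 becomes the reference level.

```sas
/* Hypothetical data: a response y and a three-level category catvar */
data mydata;
  input catvar y @@;
  dum1 = (catvar = 1);   /* 1 when catvar is 1, otherwise 0 */
  dum2 = (catvar = 2);   /* 1 when catvar is 2, otherwise 0 */
  datalines;
1 10 1 12 1 11 2 15 2 14 2 16 3 20 3 19 3 21
;
run;

/* proc reg: dummies coded by hand, level 3 is the reference */
proc reg data=mydata;
  model y = dum1 dum2;
run;
quit;

/* proc glm: the class statement does the coding automatically */
proc glm data=mydata;
  class catvar;
  model y = catvar / solution;
run;
quit;
```

Because proc glm sets the last class level to zero in its `solution` output, the two sets of estimates should agree.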
  • 22. Polynomial Regression
    A polynomial regression is a special type of multiple linear regression where powers of variables and cross-product (interaction) terms are included in the model.
      Y = b0 + b1*X1 + b2*X2 + b3*X1^2 + b4*X1*X2
      data a;
        set a;
        var2 = var1**2;
        var3 = var1**3;
      run;
      proc reg data=a;
        model y = var1 var2 var3;
      run;
      quit;
  • 23. Collinearity
    Detect collinearity
      proc reg data=mydata;
        model oxygen_consumption = runtime age weight / vif collin collinoint;
    The vif and collin options provide collinearity diagnostics. A VIF above 10 or a condition index (from collin) above 100 indicates strong collinearity.
    Treatment
    Polynomial terms often introduce collinearity. One treatment is to center the data, so that some values become negative, which reduces the correlation between x and its powers x^2 and x^3.
      proc stdize data=paper method=mean out=paper1;
        var var1;
      run;
      data paper1;
        set paper1;
        mvar2 = var1**2;
        mvar3 = var1**3;
      run;
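A small experiment shows why centering helps. The sketch below builds a made-up predictor `var1` taking the values 1 to 20 and compares the correlation between the variable and its square before and after centering. The data set names `raw` and `centered` are hypothetical.

```sas
/* Hypothetical predictor: the integers 1 to 20 */
data raw;
  do var1 = 1 to 20;
    var2 = var1**2;        /* square of the uncentered variable */
    output;
  end;
run;

/* Center var1 by subtracting its mean */
proc stdize data=raw(keep=var1) method=mean out=centered;
  var var1;
run;

data centered;
  set centered;
  mvar2 = var1**2;         /* square of the centered variable */
run;

/* Correlation before centering */
proc corr data=raw;
  var var1 var2;
run;

/* Correlation after centering */
proc corr data=centered;
  var var1 mvar2;
run;
```

Before centering, the correlation between `var1` and `var2` is high. After centering it drops to essentially zero here, because the centered values are symmetric around zero.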
  • 24. Multivariate multiple regression
    Multivariate multiple regression is used when two or more variables are to be predicted from two or more predictor variables. In our example, we predict write and read from female, math, science and social studies (socst) scores. The mtest statement in proc reg is used to test hypotheses in multivariate regression models where several dependent variables are fit to the same regressors.
      proc reg data = "c:ydatasb2";
        model write read = female math science socst;
        female:  mtest female;
        math:    mtest math;
        science: mtest science;
        socst:   mtest socst;
      run;
      quit;
  • 25. Autoregression
    If the errors (residuals) are not independent, an autoregressive error model should be used. By simultaneously estimating the regression coefficients and the autoregressive error model parameters, the autoreg procedure corrects the regression estimates.
      Yt = b0 + b1*x1 + ... + bk*xk + Vt
      Vt = -p1*V(t-1) - p2*V(t-2) - ... + Et
      proc autoreg data=sales;
        model sales = price promotion / nlag=3 method=ml dwprob;
      run;
  • 26. Autoregression and Arima for time series forecast
    t is the month order (t=_n_). The order of the AR(p) model is chosen by a backward elimination search.
      proc autoreg data=taxrevenue;
        model rev = t d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 / nlag=4 dw=4 dwprob method=ml backstep slstay=0.05;
      run;
    Using proc arima to fit ARMA models consists of 3 steps.
      - The first step is model identification, in which the observed series is transformed to be stationary. The only transformation available within proc arima is differencing.
      - The second step is model estimation, in which the orders p and q are selected and the corresponding parameters are estimated.
      - The third step is forecasting, in which the estimated model is used to forecast future values of the observable time series.
      proc arima data=history;
        i var=adv nlag=15;                                 /* i = identify */
        e p=1 q=3;                                         /* e = estimate */
        f lead=12 interval=month id=sshipym out=fcst;      /* f = forecast */
      run;
      quit;
    If the data is not stationary, we can use the differenced series adv(1) instead of adv.
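As a sketch of the differencing variant, the same three steps can be written with the full statement names and with `adv(1)` in the identify step. It assumes the `history` data set from the slide exists, with the series `adv` and the monthly date variable `sshipym`. The orders p=1 and q=3 are carried over from the slide and would normally be revisited after looking at the identify output for the differenced series.

```sas
/* Assumes data set history with variables adv and sshipym, as on the slide */
proc arima data=history;
  /* Step 1: identification on the first difference of adv */
  identify var=adv(1) nlag=15;
  /* Step 2: estimation; orders taken from the slide, not tuned here */
  estimate p=1 q=3;
  /* Step 3: forecast 12 months ahead and save the results */
  forecast lead=12 interval=month id=sshipym out=fcst;
run;
quit;
```

The forecasts written to `fcst` are for the original series `adv`, not for the differences, because proc arima integrates the differenced model back automatically.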
  • 27. Generalized linear model
    A generalized linear model no longer requires the response variable to follow a normal distribution with constant variance, conditional on the predictor variables.
    Examples:
      Model                 Distribution
      Linear regression     normal
      Logistic regression   binomial
      Poisson regression    Poisson
      Gamma regression      gamma
  • 28. Poisson Regression
    The Poisson distribution is used to model counts (non-negative integers) or occurrence rates of rare events. As the mean gets larger, the Poisson distribution approaches the normal distribution. Examples:
      - Number of ear infections in infants
      - Number of equipment failures
      - Rate of insurance claims
    How it differs from the normal distribution
      - It is not symmetrical. It is skewed to the right for rare events.
      - It has only one parameter (the mean). The variance equals the mean.
    Example
      proc genmod data=skincancer;
        class city age;
        model cases = city age / offset=log_pop dist=poi link=log type3;
        title 'Poisson regression model for skin cancer rates';
      run;
    The offset lets genmod model rates instead of counts.
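The offset variable has to be created before the model is fit. The sketch below uses invented counts and populations (the variable `pop` and all values are hypothetical) to show where `log_pop` comes from and why it turns a count model into a rate model.

```sas
/* Hypothetical data: cases and population at risk by city and age group */
data skincancer;
  input city $ age $ cases pop;
  log_pop = log(pop);    /* offset: log of the population at risk */
  datalines;
A 15-24   2 170000
A 25-34  15 120000
A 35-44  40 100000
A 45-54  70  90000
B 15-24   5 180000
B 25-34  35 145000
B 35-44  75 120000
B 45-54 110 100000
;
run;

/* With a log link: log(E[cases]) = log_pop + b0 + city and age effects, */
/* which is the same as modeling log(E[cases]/pop), the log rate.        */
proc genmod data=skincancer;
  class city age;
  model cases = city age / offset=log_pop dist=poi link=log type3;
run;
```

Because the offset enters with a fixed coefficient of 1, the exponentiated estimates can be read as rate ratios between cities or age groups.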
  • 29. Proc reg and proc glm
      - Proc reg is more convenient for regression analysis because of its options for plots and automatic model selection (but it has no class statement).
      - Proc glm is more convenient for ANOVA because of its class statement (but it has no plot or model selection options).
  • 30. Cluster Analysis
    The primary goal of market segmentation is to better satisfy customer needs or wants. The firm does not want to:
      - Use the same marketing program for all customers
      - Incur the high cost of a unique program for each customer
    A deck of 52 cards can be grouped as:
      - 26 red and 26 black
      - 13 each of spades, hearts, diamonds and clubs
  • 31. Cluster Analysis
    The scale of measurement affects the grouping, so it is better to standardize the input variables before clustering.
    Two types of clustering
    Hierarchical clustering
      - Used for small data sets
      - The number of clusters can be determined by finding the local peaks of the pseudo F and pseudo T-squared statistics
      - There may be no theoretical reason to expect a hierarchical structure
    Non-hierarchical clustering
      - Scales up well with large or complex data
      - The number of clusters needs to be specified in advance
      - Initial seeds are required
    Combination
      - A two-step method is used. First, a hierarchical method is applied to the training data to decide the number of clusters and the initial choice of seeds. These are then fed into a non-hierarchical method (such as k-means), which is applied to the whole data set. SAS Enterprise Miner automates this two-step process.
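The two-step method can be sketched in Base SAS/STAT as follows. This is only an illustration under stated assumptions: it uses the `sashelp.iris` sample data set (assumed to be installed), a made-up sample size of 60 for the training subset, and a cluster count of 3 that stands in for whatever the hierarchical output suggests. For simplicity the initial seeds are left to the proc fastclus defaults rather than taken from the hierarchical step.

```sas
/* Standardize the inputs so that no variable dominates the distances */
proc stdize data=sashelp.iris method=std out=iris_std;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Step 1a: draw a small training sample (size 60 is an arbitrary choice) */
proc surveyselect data=iris_std method=srs n=60 seed=1234 out=train;
run;

/* Step 1b: hierarchical clustering; pseudo and ccc print the statistics */
/* used to choose the number of clusters                                 */
proc cluster data=train method=ward pseudo ccc outtree=tree;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Step 2: k-means on the whole data set with the chosen cluster count */
/* (3 is assumed here, not derived from real output)                   */
proc fastclus data=iris_std maxclusters=3 maxiter=20 out=clusters;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

The output data set `clusters` contains a CLUSTER variable with the assignment for every observation, which can then be profiled with proc means or proc freq.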
  • 33. Association Analysis
    Chi-Square Test
      proc freq data=mydata;
        table gender*purchase / chisq expected cellchi2 nocol nopercent;
        title1 'Association between Gender and Purchase';
      run;
    When more than 20% of cells have expected counts less than five, the chi-square test may not be valid.
  • 35. SAS Logistic Procedure
      proc logistic data=abc.logisticd descending;
        class gender (ref='Male') income (ref='Low') / param=ref;
        model purchase = gender income / rsquare lackfit ctable selection=stepwise;
        output out=probs predicted=phat;
      run;
    The "descending" option makes the procedure model P(Y=1).
    Using the estimates of b0, b1, ... from the training data, we can calculate P for the population data set. We can use P > 0.5 as the threshold to predict that the event happens (i.e. the customer will purchase).
      P = 1 / (1 + exp(-(b0 + b1*X1 + b2*X2 + ...)))
    Model fit. When comparing models, the model with the lower AIC is better.
      Criterion   Intercept only   Intercept and covariates
      AIC         560              550
      SC          570              560
      -2 Log L    575              550
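The scoring formula can be applied in a data step once the coefficients are known. In the sketch below the data set `population`, the predictors `x1` and `x2`, and the coefficient values are all invented for illustration. In practice the coefficients would come from the proc logistic output.

```sas
/* Hypothetical population with two 0/1 predictors */
data population;
  input x1 x2 @@;
  datalines;
0 0  1 0  0 1  1 1
;
run;

data scored;
  set population;
  /* Hypothetical coefficient estimates, not from a fitted model */
  b0 = -1.2;
  b1 =  0.8;
  b2 =  0.9;
  xb   = b0 + b1*x1 + b2*x2;      /* linear predictor          */
  phat = 1 / (1 + exp(-xb));      /* P = 1 / (1 + exp(-xb))    */
  will_purchase = (phat > 0.5);   /* 1 if predicted to purchase */
run;

proc print data=scored;
  var x1 x2 phat will_purchase;
run;
```

Proc logistic also provides a SCORE statement that applies a fitted model to a new data set directly, which avoids copying coefficients by hand.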