SAS Regression Guide

Table of Contents
  SAS Input/Output
  Functions in Data Step
  Simple Statistics Procedures
  Hypothesis testing: mean and proportion
  Multiple linear regression
  Generalized linear regression
  Cluster Analysis
  Association Analysis
  Logistic Regression

SAS Input/Output

    /*output csv file*/
    data _null_;
      set a;
      file 'xx.csv' dsd dlm=',';
      put x y z;
    run;

    /*define ftp connection as filename*/
    filename test ftp '.cshrc' cd='/export/home/sz325584'
      host='forecast.marketing.fedex.com' user="sz325584" pass="xxxx";

    /*read one line into a variable*/
    data flist;
      infile "ls ./" pipe length=len;
      input @;
      input fname $varying200. len;
    run;

    /*create graphs*/
    goptions reset=all;
    symbol v=circle;
    proc gplot data=cars;
      plot price*(citympg hwympg cylinders enginesize);
    run;
    quit;

    /*read from ls -l output*/
    data aa;
      infile 'ls -l ' pipe dsd dlm=' ' missover;
      input dir $ owner $;
    run;

    /*read multiple files with the same layout*/
    filename in ('200306.csv','200309.csv');
    data base;
      informat shp_dt mmddyy8.;
      format sshipym date9.;
      infile in dsd delimiter=',';

    /*read in only part of the file; useful for large mainframe tape files*/
    data base(drop=i);
      infile '1.csv' end=eof dsd firstobs=3 delimiter=',' missover;
      retain i 0;
      do while (i<20);
        input aa $;
        output;
        i=i+1;
      end;
      stop;   /*end the step once 20 records are read, so it does not iterate again without reading*/
    run;

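Functions in Data Step

    mdy(month, day, year)             /*build a SAS date value*/
    intnx('month',sshipym,1)          /*move a date forward by one month*/
    put(variable, format)             /*convert a value to character using a format*/
    Name=scan(string,2,"&=")          /*split string by & and = and return the second item*/
    substr(yyyymm,5,2)                /*extract 2 characters starting at position 5*/
    index(address,'NY')               /*find position of a pattern in a string*/
    call symput('numobs', numobs)     /*put a data step variable value into a SAS macro variable*/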
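The functions above can be tried together in one short data step. This is a minimal sketch with made-up literal values (the date, the strings and the macro variable content are assumptions chosen for illustration, not data from this guide):

```sas
/* Illustrative values only; none of these literals come from the guide. */
data _null_;
  sshipym    = mdy(6, 15, 2003);                /* SAS date for 15 June 2003 */
  next_month = intnx('month', sshipym, 1);      /* first day of the following month */
  yyyymm     = put(sshipym, yymmn6.);           /* date formatted as yyyymm text */
  name       = scan('a=1&b=2', 2, '&=');        /* second item when splitting on & and = */
  mm         = substr(yyyymm, 5, 2);            /* month part of yyyymm */
  pos        = index('Albany NY', 'NY');        /* starting position of 'NY' */
  call symput('numobs', strip(put(_n_, 8.)));   /* macro variable name first, value second */
  put next_month= date9. yyyymm= name= mm= pos=;
run;

%put numobs=&numobs;
```

The `put` statement writes each result to the log, which is a quick way to check what a function returns before using it in a larger step.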

Simple Statistics Procedures

    /*create histogram*/
    proc univariate data=one noprint;
      var v1;
      histogram v1 / normal;
    run;

    /*create means and standard deviations*/
    PROC MEANS DATA=volume;
      VAR adv;
      OUTPUT OUT=volume_stat(KEEP=MEAN STD MAX MIN)
        MEAN(adv)=MEAN STD(adv)=STD MAX(adv)=MAX MIN(adv)=MIN;
    RUN;

    /*random selection; with a strata statement, n=15 rows are drawn from each segment,
      and the input data must be sorted by segment*/
    proc surveyselect data=trees method=srs n=15 out=sample1;
      strata segment;
    run;

Variable Types

Categorical (nominal) variables, such as favorite color, have two or more categories and no way to order the values. Other examples include gender, blood type and favorite ice cream flavor.

Ordinal variables can be ordered, but are similar to categorical variables in that there are clear categories. The spacing between adjacent values is not uniform.

Continuous (interval) variables are similar to ordinal variables, except that the values are measured so that their differences are meaningful. The finishing place of runners in a race is on an ordinal scale; the actual finishing times of the same runners are on an interval scale.

Hypothesis testing

A statistical test is a quantitative way to decide whether there is enough evidence to reasonably believe a conjecture to be true.

Every test has a null hypothesis H0 and an alternative hypothesis Ha. H0 normally assumes no difference in means or, in regression analysis, no relationship between the predictor and the response variable, i.e. coefficient=0.

To control type I error, we often set the threshold at 5% and reject the null hypothesis only when p<0.05. In other words, we accept Ha (there is a difference, or there is a relationship) only when the evidence is very strong.

One-tailed or two-tailed hypothesis testing

To obtain correct results, it is important to determine whether the hypothesis test is one-tailed or two-tailed. When the hypotheses are of the form H0: x1 = x2 with Ha: x1 > x2 or Ha: x1 < x2, we call that a one-tailed test. When the alternative hypothesis is of the form Ha: x1 ≠ x2, we call that a two-tailed test.

Hypothesis testing on means - Ttest

We can use t-tests in the following three situations:
  We want to test whether the mean is significantly different from a hypothesized value.
  We want to test whether the means for two independent groups are significantly different.
  We want to test whether the means for dependent or paired groups are significantly different.

Ttest

The two-sample t-test is a special case of one-way ANOVA in which the category variable has only two values.

Is the average cereal box weight different from 15 ounces? (two-sided; this can also be done with proc univariate)

    PROC TTEST DATA=datasetname H0=15;
      VAR weight;
    RUN;

Is the average cereal box weight above 15 ounces? (one-sided)

    ods graphics on;
    proc ttest h0=15 plots(showh0) sides=u alpha=0.1;
      var weight;
    run;

Are the means of two independent groups the same? (control group vs. target group, or different brands of cereal box)

    PROC TTEST DATA=datasetname;
      CLASS brand;
      VAR weight;
    RUN;
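For reference, the statistic behind the one-sample tests above can be written out as follows. The symbols are the standard textbook ones and are not defined elsewhere in this guide: \bar{x} is the sample mean, s the sample standard deviation, n the sample size, and \mu_0 the hypothesized value (15 in the cereal example).

```latex
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad \text{with } n - 1 \text{ degrees of freedom}
```

A two-sided test rejects H0 when |t| is large; the upper one-sided test (sides=u) rejects only when t is large and positive.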

Paired Ttest

Tests two attributes that belong to the same object, e.g. pre-campaign and post-campaign sales for the same account, or reading and writing scores for the same student.

Did account sales change after the marketing campaign? Note: pre and post sales are dependent groups.

    PROC TTEST DATA=datasetname;
      PAIRED pre_sale*post_sale;
    RUN;

Or test whether students' reading and writing scores are significantly different.

    PROC TTEST DATA=datasetname;
      PAIRED read_score*write_score;
    RUN;

ANOVA

When comparing means from more than two groups, use one-way ANOVA. Two-way ANOVA means there are two CLASS variables (e.g. CLASS SEGMENT INDUSTRY).

There are two common ways to run ANOVA in SAS. The seemingly obvious one is PROC ANOVA; the other is PROC GLM, which has the added advantage of allowing a few more options.

ANOVA

H0: All mean weights are equal across brands.
Ha: At least one brand's mean weight differs from the others.

    PROC ANOVA DATA=cereal;
      CLASS brand;
      MODEL weight=brand;
      MEANS brand;
    RUN;
    QUIT;

Nonparametric ANOVA

Used when we cannot assume a normal distribution, for example when the sample size is too small for the distribution of the sample mean to be approximately normal.

    proc npar1way data=sasdata;
      class variable;
      var variables;
    run;

Hypothesis testing on proportion

A chi-square goodness-of-fit test lets us test whether the observed proportions for a categorical variable differ from hypothesized proportions. For example, suppose we believe that the general population is 10% Hispanic, 10% Asian, 10% African American and 70% White.

    proc freq data=mydata;
      tables race / chisq testp=(10 10 10 70);
    run;

One-way MANOVA

MANOVA (multivariate analysis of variance) is like ANOVA, except that there are two or more dependent variables. In a one-way MANOVA, there is one categorical independent variable and two or more dependent variables. E.g. examine the differences in read, write and math broken down by program type (prog).

    proc glm data = "c:ydatasb2";
      class prog;
      model read write math = prog;
      manova h=prog;
    run;
    quit;

Multiple Linear Regression

A powerful tool for understanding the relationship between predictor and response variables and for predicting future values.

"Linear" refers to the coefficients. Both of the following are multiple linear models:
  y=b0+b1*x1+b2*x2
  y=b0+b1*x1^2+b2*x1*x2

Assumptions: linear relationship; error term is normally distributed and independent.

For N independent variables, the number of possible models is 2^N, which is very computing intensive. We can use stepwise selection to find a model quickly.

Plot the chart before trying models:

    goptions reset=all;
    proc gplot data=paper;
      symbol i=rq;            /*impose quadratic regression model on the chart*/
      title 'Quadratic model';
      plot strength*amount;
    run;
      symbol i=rc;            /*impose cubic regression model on the chart*/
      title 'Cubic model';
      plot strength*amount;
    run;
    quit;

Evaluate Model Assumptions

Normality
  Normal probability plots of the residuals using proc univariate
Independent observations
  Plot residuals vs. time and other ordering components
  Durbin-Watson statistic or the first-order autocorrelation statistic for time series data
Constant variance
  Plot residuals vs. predicted values
  Spearman rank correlation coefficient between the absolute values of the residuals and the predicted values

Model fit

Examine model-fitting statistics such as R^2, adjusted R^2, AIC, SBC.
If the overall model p-value<0.05, then at least one of the predictors is significant.
Check each coefficient's p-value.
Examine residual plots and validate the normality assumption.

    proc reg data=mydata;
      model reading=age gender;
      plot r.*p.;                 /*plot the residuals vs. the predicted values*/
      output out=out r=residuals;
    run;
    quit;

    proc univariate data=out;
      var residuals;
      histogram residuals / normal;
    run;

Remedial measures

When a straight line is inappropriate
  Transform the independent variables to obtain linearity
  Fit a polynomial regression model
  Fit a nonlinear regression model using proc nlin
When there is multicollinearity
  Exclude redundant independent variables
  Center the independent variables in a polynomial regression model
When there are influential observations
  Make sure there are no data errors
  Investigate the cause of the unusual values
  Delete the observations if appropriate and document the situation
Transforming the dependent variable
  Transforming the dependent variable is one of the common approaches to dealing with non-normal data and/or non-constant variance.
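As one illustration of transforming the dependent variable, a log transform is a common choice when the response is positive and right-skewed. The sketch below is an assumption for illustration only: the data set mydata, the response y and the predictor x are hypothetical names, and the log is just one of several possible transforms.

```sas
/* Hypothetical names: mydata, y (positive response), x (predictor). */
data mydata_log;
  set mydata;
  if y > 0 then log_y = log(y);   /* natural log; left missing for non-positive y */
run;

proc reg data=mydata_log;
  model log_y = x;
  plot r.*p.;                     /* re-check residuals vs. predicted values */
run;
quit;
```

After the transform, the residual plot should be inspected again; if the spread is still uneven, a different transform may be more suitable.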

Regression with Categorical Predictors

In proc reg, a categorical predictor needs to be coded into dummy variables before it is used as input. In proc glm, this is done automatically when the variable is listed in the class statement.

This data step shows how to code dummy variables (0/1 values) from the category variable "catvar", which takes the values 1, 2, 3:

    DATA mydata;
      set mydata;
      array dum(3) dum1-dum3;
      do i = 1 to 3;
        dum(i)=(catvar=i);
      end;
      drop i;
    RUN;

If an ordinal predictor has only three or four levels, it should be coded using dummy coding. There are times when an ordinal predictor can be treated as if it were interval (this is called quasi-interval), especially if the variable has more than five or six levels.
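For comparison with the manual dummy coding, the proc glm route might look like the sketch below. The response name y is a hypothetical placeholder; catvar and mydata follow the example above.

```sas
/* y is a hypothetical response variable. */
proc glm data=mydata;
  class catvar;                   /* proc glm builds the dummy variables itself */
  model y = catvar / solution;    /* solution prints the parameter estimates */
run;
quit;
```

With the class statement, proc glm sets the parameter for the last level of catvar to zero, so the other estimates are differences from that level.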

Polynomial Regression

A polynomial regression is a special type of multiple linear regression in which powers of variables and cross-product (interaction) terms are included in the model.

  Y=b0+b1*X1+b2*X2+b3*X1^2+b4*X1*X2

    data a;
      set a;
      var2=var1**2;
      var3=var1**3;
    run;

    proc reg data=a;
      model y=var1 var2 var3;
    run;
    quit;

Collinearity

Detect collinearity

    proc reg data=mydata;
      model oxygen_consumption=runtime age weight / vif collin collinoint;
    run;
    quit;

The vif and collin options provide collinearity diagnostics. VIF>10 or a condition index (collin) >100 indicates strong collinearity.

Treatment

Polynomial terms often introduce collinearity. One treatment is to center the data, so that some values become negative, which reduces the correlation between x, x^2 and x^3.

    proc stdize data=paper method=mean out=paper1;
      var var1;
    run;

    data paper1;
      set paper1;
      mvar2=var1**2;
      mvar3=var1**3;
    run;
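To make the VIF>10 rule of thumb concrete, the variance inflation factor for predictor j can be written as below. The notation is an assumption introduced here: R_j^2 is the R^2 obtained by regressing predictor j on all the other predictors.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}, \qquad \mathrm{VIF}_j > 10 \iff R_j^2 > 0.9
```

So a VIF above 10 means that more than 90% of the variation in that predictor is explained by the other predictors.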

Multivariate multiple regression

Multivariate multiple regression is used when two or more variables are to be predicted from two or more predictor variables. In our example, we predict write and read from female, math, science and social studies (socst) scores.

The mtest statement in proc reg is used to test hypotheses in multivariate regression models where several dependent variables are fit to the same regressors.

    proc reg data = "c:ydatasb2";
      model write read = female math science socst;
      female:  mtest female;
      math:    mtest math;
      science: mtest science;
      socst:   mtest socst;
    run;
    quit;

Autoregression

If the errors (residuals) are not independent, an autoregression model should be used. By simultaneously estimating the regression coefficients and the autoregressive error model parameters, proc autoreg corrects the regression estimates.

  y(t) = b0 + b1*x1(t) + ... + bk*xk(t) + v(t)
  v(t) = -p1*v(t-1) - p2*v(t-2) - ... + e(t)

    proc autoreg data=sales;
      model sales=price promotion / nlag=3 method=ml dwprob;
    run;

Autoregression and ARIMA for time series forecasting

t is the month order, t=_n_. The order of the AR(p) model is chosen by a backward elimination search.

    proc autoreg data=taxrevenue;
      model rev = t d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12
            / nlag=4 dw=4 dwprob method=ml backstep slstay=0.05;
    run;

Using proc arima to fit ARMA models consists of 3 steps.
  The first step is model identification, in which the observed series is transformed to be stationary. The only transformation available within proc arima is differencing.
  The second step is model estimation, in which the orders p and q are selected and the corresponding parameters are estimated.
  The third step is forecasting, in which the estimated model is used to forecast future values of the observed time series.

    proc arima data=history;
      i var=adv nlag=15;                               /*i statement means identify*/
      e p=1 q=3;                                       /*e statement means estimate*/
      f lead=12 interval=month id=sshipym out=fcst;    /*f statement means forecast*/
    run;
    quit;

If the data is not stationary, use differencing by writing var=adv(1) instead of var=adv in the i statement.

Generalized linear model

No longer requires that the response variable follow a normal distribution with constant variance, conditional on the predictor variables.

Examples:

    Model                  Distribution
    Linear regression      normal
    Logistic regression    binomial
    Poisson regression     Poisson
    Gamma regression       gamma

Poisson Regression

The Poisson distribution is used to model counts (non-negative integers) or occurrence rates of rare events. As the mean gets larger, the Poisson distribution approaches the normal distribution. Examples:
  Number of ear infections in infants
  Number of equipment failures
  Rate of insurance claims

How it differs from the normal distribution:
  Not symmetrical; skewed to the right for rare events
  Has only one parameter (the mean); the variance equals the mean

Example

    proc genmod data=skincancer;
      class city age;
      model cases=city age / offset=log_pop dist=poi link=log type3;
      title 'Poisson regression model for skin cancer rates';
    run;

The offset lets genmod model the rate instead of the count.
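The offset variable has to exist in the data set before proc genmod runs. A minimal sketch of how log_pop could be created is shown below; the variable name population (the population at risk in each city and age group) is an assumption, since the guide does not show the raw columns.

```sas
/* population is a hypothetical column holding the population at risk. */
data skincancer;
  set skincancer;
  log_pop = log(population);   /* offset must be on the log scale to match link=log */
run;

/* With this offset the model is
     log(E[cases]) = log_pop + b0 + b1*x1 + ...
   which is the same as
     log(E[cases]/population) = b0 + b1*x1 + ...
   so the coefficients describe the rate per person rather than the raw count. */
```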

Proc reg and proc glm

Proc reg is more convenient for regression analysis because of its options for plots and automatic model selection (no class statement).

Proc glm is more convenient for ANOVA because of its class statement (no plot or model selection options).

Cluster Analysis

The primary goal of market segmentation is to better satisfy customer needs or wants. The firm does not want to
  Use the same marketing program for all customers
  Incur the high cost of a unique program for each customer

The same items can be grouped in more than one way. A deck of 52 cards can be grouped as
  26 red and 26 black
  13 each of spades, hearts, diamonds and clubs

Cluster Analysis

The scale of measurement affects the grouping, so it is better to standardize the input variables before clustering.

Two types of clustering

Hierarchical clustering
  Used for small data sets
  The number of clusters can be determined by finding the local peaks of the pseudo F and pseudo t-squared statistics
  There may be no theoretical reason to expect a hierarchical structure
Non-hierarchical clustering
  Scales well to large or complex data
  The number of clusters needs to be specified in advance
  Initial seeds are required
Combination
  A two-step method is used. First, a hierarchical method is applied to the training data to decide the number of clusters and the initial choice of seeds. These are then fed into a non-hierarchical method (such as k-means), which is applied to the whole data set. SAS Enterprise Miner automates this two-step process.
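Outside Enterprise Miner, the two-step idea could be sketched with base SAS/STAT procedures roughly as follows. All names here are hypothetical (data set customers, variables x1-x3, sample size 500, and the choice of 4 clusters), and for brevity the sketch passes only the number of clusters to the second step, not the initial seeds.

```sas
/* Hypothetical names and sizes throughout. */

/* Standardize inputs so that no variable dominates because of its scale */
proc stdize data=customers method=std out=cust_std;
  var x1-x3;
run;

/* Draw a small training sample for the hierarchical step */
proc surveyselect data=cust_std method=srs n=500 seed=12345 out=train_sample;
run;

/* Step 1: hierarchical clustering; inspect pseudo F and pseudo t-squared */
proc cluster data=train_sample method=ward pseudo outtree=tree;
  var x1-x3;
run;

/* Step 2: k-means on the full data with the number of clusters chosen above */
proc fastclus data=cust_std maxclusters=4 maxiter=50 out=cust_clusters;
  var x1-x3;
run;
```

The value passed to maxclusters= would be read off the step 1 output rather than fixed in advance.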

Association Analysis

Chi-Square Test

    proc freq data=mydata;
      table gender*purchase / chisq expected cellchi2 nocol nopercent;
      title1 'Association between Gender and Purchase';
    run;

When more than 20% of cells have expected counts of less than five, the chi-square test may not be valid.


SAS Logistic Procedure

    proc logistic data=abc.logisticd descending;
      class gender (ref='Male') income (ref='Low') / param=ref;
      model purchase=gender income / rsquare lackfit ctable selection=stepwise;
      output out=probs predicted=phat;
    run;

"descending" makes the procedure model P(Y=1).

Using the estimates of b0, b1, ... from the training data, we can calculate P for the population data set. We can use P>0.5 as the threshold to predict that the event happens (i.e. will purchase, etc.).

  P=1/(1+e^-(b0+b1*X1+b2*X2+...))

Model fit. When comparing models, the one with the lower AIC is the better model.

    Criterion    Intercept only    Intercept and covariates
    AIC          560               550
    SC           570               560
    -2 Log L     575               550
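Applying the formula for P to a population data set can be sketched in a data step. Everything specific here is an assumption for illustration: the data set name population, the 0/1 predictors x1 and x2, and the coefficient values, which in practice would be taken from the proc logistic output.

```sas
/* Hypothetical data set, predictors and coefficient values. */
data scored;
  set population;
  b0 = -1.2;  b1 = 0.8;  b2 = 0.5;                 /* placeholders for the fitted estimates */
  phat = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2)));     /* P(Y=1) from the logistic formula */
  will_purchase = (phat > 0.5);                    /* 1 if predicted to purchase, else 0 */
  drop b0 b1 b2;
run;
```

The 0.5 cutoff follows the rule stated above; the ctable output from proc logistic can help judge whether a different cutoff classifies better.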
