Econometrics
ASSIGNMENT
ON
‘Multiple Regression Analysis’
Prepared For:-
Dr. Md. Kamal Uddin
Professor
Department of International Business
University of Dhaka
Prepared By:-
Hazera Akter
Roll No: 01
8th Semester, BBA 1st Batch
Department of International Business
University of Dhaka
Date of Submission
7th April, 2012
Assignment Topic
‘Multiple Regression Analysis with Tests of
Heteroskedasticity, Autocorrelation and Multicollinearity’
Table of Contents
Topics
Analysis Summary
Data Set
ANALYSIS SUMMARY
In multiple regression analysis, we study the relationship between an explained
variable and a number of explanatory variables. In this assignment, the current
salary structure is analyzed in terms of several factors that influence how
salary is set. The purposes of this kind of analysis include:
Cause analysis: Learn more about the relationship between several independent
variables and a dependent variable.
Impact analysis: Assess the impact of a change in an independent variable on the
value of the dependent variable.
Time series analysis: Predict values of a time series, using either previous values
of just that one series, or values from other series as well.
The detailed analysis of the multiple regression leads to the following interpretation:
• Considering the R2 value (0.491), the model explains only about half of the
variation in salary, so for overall estimation it is not strong.
• The model for estimating the salary of employees of the Coca-Cola Company
includes highly collinear variables, notably Age and Work Experience.
• The model nevertheless shows very little heteroskedasticity or
autocorrelation.
So, these overall results should help the management of the Coca-Cola
Company to set or estimate salaries in a revised decision round.
Data Set
A multinational corporation, “The Coca-Cola Company”, would like to study the
salary structure of the employees in its Bangladesh subsidiary by predicting
salary from influential factors such as the gender, age and education level of
the employees. A sample of 30 employees’ current salary data is randomly drawn
to perform a regression analysis. The data set is shown below.
In this data set,
Dependent Variable, Y = Current Salary
ID  Current Salary (Tk)  Gender  Job Seniority  Age  Education Level  Work Experience  Minority Class
1 16080 0 81 28.50 16 0.25 0
2 41400 0 73 40.33 16 12.50 1
3 21960 1 83 31.08 15 4.08 0
4 19200 0 93 31.17 16 1.83 1
5 28350 0 83 41.92 19 13.00 0
6 27250 1 80 29.50 18 2.42 0
7 16080 0 79 28.00 15 3.17 0
8 14100 0 67 28.75 15 0.50 0
9 12420 1 96 27.42 15 1.17 1
10 12300 1 77 52.92 12 26.42 0
11 15720 0 84 33.50 15 6.00 1
12 8880 1 88 54.33 12 27.00 0
13 22000 0 93 32.33 17 2.67 0
14 22800 0 98 41.17 15 12.00 0
15 19020 1 64 31.92 19 2.25 1
16 12300 1 94 46.25 12 20.00 0
17 22200 1 81 30.75 19 5.17 0
18 10380 1 72 32.67 15 6.92 1
19 8520 0 70 58.50 15 31.00 0
20 27500 0 89 34.17 17 3.17 0
21 11460 1 79 46.58 15 21.75 1
22 20500 0 83 35.17 16 5.75 0
23 27700 0 85 43.25 20 11.17 1
24 28000 1 65 28.00 16 1.58 1
25 22000 1 65 39.75 19 10.75 0
26 27250 0 78 30.08 19 2.92 0
27 27000 0 83 30.17 17 0.75 1
28 9000 1 70 44.50 12 18.00 0
29 31300 0 91 30.17 18 3.92 1
30 11760 0 70 26.83 15 1.25 0
Independent Variables,
X1 = Sex of Employee
X2 = Job Seniority
X3 = Age of Employee
X4 = Education Level
X5 = Work Experience
X6 = Minority Classification
Types of Scales Used Here
The attributes in this analysis are measured on different types of scales:
Nominal Scale: X1 = Sex of Employee, where Male = 0 and Female = 1
X6 = Minority Classification, where White = 0 and Nonwhite = 1
Ratio Scale: X2 = Job Seniority (years in Coca-Cola only)
X3 = Age of Employee (years)
X4 = Education Level (scores)
X5 = Work Experience (years, overall working life)
Each of the ratio-scale variables takes numeric values and has an absolute zero.
So, for this multivariate data set we perform a multiple regression
analysis to predict the current salary of an employee.
NOTE: All of the analysis was performed with the “SPSS” software. For
ease of presentation, the variables are referred to by their full
names/meanings.
MULTIPLE REGRESSION ANALYSIS RESULTS
Variables Entered/Removed
Model  Variables Entered  Variables Removed  Method
1  MINORITY CLASSIFICATION, JOB SENIORITY, AGE OF EMPLOYEE, SEX OF EMPLOYEE, EDUCATIONAL LEVEL, WORK EXPERIENCE (a)  .  Enter
a. All requested variables entered.
Model Summary (b)
Model  R  R Square  Adjusted R Square  Std. Error of the Estimate
1  .701 (a)  .491  .358  6458.883
a. Predictors: (Constant), MINORITY CLASSIFICATION, JOB
SENIORITY, AGE OF EMPLOYEE, SEX OF EMPLOYEE,
EDUCATIONAL LEVEL, WORK EXPERIENCE
b. Dependent Variable: CURRENT SALARY
ANOVA (b)
Model  Sum of Squares  df  Mean Square  F  Sig.
1 Regression 9.246E8 6 1.541E8 3.694 .010 (a)
Residual 9.595E8 23 4.172E7
Total 1.884E9 29
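The Model Summary and ANOVA figures are tied together by simple arithmetic, which the following sketch checks. The variable names are illustrative only, and the inputs are the rounded values printed in the tables, so agreement with the SPSS output can only be expected to about three significant decimals.

```python
# Rounded inputs copied from the SPSS ANOVA table.
ss_regression = 9.246e8
ss_residual = 9.595e8
ss_total = 1.884e9
n = 30  # sample size
k = 6   # number of explanatory variables

# R Square is the share of total variation explained by the regression.
r_squared = ss_regression / ss_total

# Adjusted R Square penalizes the number of regressors.
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# F is the ratio of the two mean squares.
ms_regression = ss_regression / k
ms_residual = ss_residual / (n - k - 1)
f_stat = ms_regression / ms_residual

# Std. Error of the Estimate is the square root of the residual mean square.
std_error = ms_residual ** 0.5

print(round(r_squared, 3), round(adj_r_squared, 3), round(f_stat, 3), round(std_error, 1))
```

The gap between R Square and Adjusted R Square is fairly wide here because six regressors are fitted to only 30 observations.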
Coefficients (a)
Model  Unstandardized B  Std. Error  Standardized Beta  t  Sig.
1 (Constant) -25969.540 23234.542 -1.118 .275
SEX OF EMPLOYEE -2126.081 2778.333 -.133 -.765 .452
JOB SENIORITY 82.398 130.286 .100 .632 .533
AGE OF EMPLOYEE 263.053 829.669 .286 .317 .754
EDUCATIONAL LEVEL 2026.429 707.189 .564 2.865 .009
WORK EXPERIENCE -298.406 870.804 -.329 -.343 .735
MINORITY CLASSIFICATION 1846.496 2528.644 .112 .730 .473
a. Dependent Variable: CURRENT SALARY
Thus, the estimated multiple regression equation (regression of Y on X) is
Y = −25969.540 − 2126.081 X1 + 82.398 X2 + 263.053 X3 + 2026.429 X4
− 298.406 X5 + 1846.496 X6 + Ui
where Ui = errors and R2 = 0.491.
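As an illustration of how the estimated equation is used, the sketch below computes the fitted salary and the residual for employee ID 1 of the data set. The function and dictionary names are hypothetical; only the coefficients and the employee's values come from the tables.

```python
# Unstandardized coefficients (B) copied from the Coefficients table.
INTERCEPT = -25969.540
COEFFS = {
    "sex": -2126.081,
    "job_seniority": 82.398,
    "age": 263.053,
    "education": 2026.429,
    "work_experience": -298.406,
    "minority": 1846.496,
}


def predict_salary(employee):
    """Return the fitted salary (Tk) for a dict of explanatory values."""
    return INTERCEPT + sum(COEFFS[name] * employee[name] for name in COEFFS)


# Employee ID 1 from the data set; the observed salary is Tk 16080.
employee_1 = {"sex": 0, "job_seniority": 81, "age": 28.50,
              "education": 16, "work_experience": 0.25, "minority": 0}

fitted = predict_salary(employee_1)
residual = 16080 - fitted  # observed minus fitted

print(round(fitted, 2), round(residual, 2))
```

For this employee the equation overstates the observed salary by several thousand taka, which is in line with the modest R2 of the model.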
Commentary on the Resulting Model
This equation suggests that Education Level is far more important than all other
independent variables: it has the largest standardized coefficient (Beta = .564)
and is the only coefficient significant at the 5% level (Sig. = .009). The
equation says that one more score on educational background, holding all other
independent variables constant, results in an increase in salary of TK. 2026.
That is, if we consider persons who are alike in all other respects, the one
with one more score of education can be expected to have a salary higher by
TK. 2026.
After Education Level, Minority Classification also enters the salary structure
with a sizeable coefficient. If we consider people who are alike in all other
independent variables (held constant), a Nonwhite employee (X6 = 1) can be
expected to have a salary higher by TK. 1846 than a White employee (X6 = 0),
although this coefficient is not statistically significant (Sig. = .473).
The equation also says that one more year of job seniority, holding all other
independent variables constant, results in an increase in salary of TK. 82. That
is, if we consider persons who are alike in all other respects, the one with
one more year on the job in the Coca-Cola Company can be expected to have a
salary higher by TK. 82.
This equation also shows that one more year of age, holding all other
independent variables constant, results in an increase in salary of TK. 263. That
is, if we consider persons who are alike in all other respects, the one who is
one year older can be expected to have a salary higher by TK. 263. On these
point estimates the age of the employee appears more influential than years of
service in the company, although neither coefficient is statistically
significant (Sig. = .754 for age and .533 for job seniority).
If we consider people who are alike in all other independent variables (held
constant), a female employee (X1 = 1) can be expected to have a salary lower by
TK. 2126 than a male employee (X1 = 0), which would suggest a discriminatory
salary structure. Of course, all these numbers are subject to uncertainty: the
coefficient is not statistically significant (Sig. = .452), so the variable X1
is a candidate for being dropped from the model.
Similarly, if we consider two people with the same education level, holding
all other independent variables constant, the one with one more year of
experience can be expected to have a salary lower by TK. 298. Again, this
number is subject to uncertainty: the coefficient is not statistically
significant (Sig. = .735), so the variable X5 is also a candidate for being
dropped.
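The uncertainty attached to each coefficient can be read from its t ratio, which is the coefficient divided by its standard error. The sketch below recomputes the t column from the B and Std. Error columns of the Coefficients table. The critical value 2.069 is the standard two-tailed 5% value of Student's t with 30 − 6 − 1 = 23 degrees of freedom, taken from t tables rather than from the SPSS output, and the variable names are illustrative.

```python
# (B, Std. Error) pairs copied from the Coefficients table.
estimates = {
    "(Constant)": (-25969.540, 23234.542),
    "SEX OF EMPLOYEE": (-2126.081, 2778.333),
    "JOB SENIORITY": (82.398, 130.286),
    "AGE OF EMPLOYEE": (263.053, 829.669),
    "EDUCATIONAL LEVEL": (2026.429, 707.189),
    "WORK EXPERIENCE": (-298.406, 870.804),
    "MINORITY CLASSIFICATION": (1846.496, 2528.644),
}

# Assumption: two-tailed 5% critical value of t with 23 degrees of freedom.
T_CRITICAL = 2.069

# t ratio = coefficient / standard error.
t_ratios = {name: b / se for name, (b, se) in estimates.items()}

# A coefficient is significant at 5% when |t| exceeds the critical value.
significant = [name for name, t in t_ratios.items() if abs(t) > T_CRITICAL]

for name, t in t_ratios.items():
    print(f"{name:25s} t = {t:7.3f}")
print("Significant at 5%:", significant)
```

On this criterion only Educational Level is distinguishable from zero, which agrees with the Sig. column of the table.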
Interpretation of the constant term:
In principle, the constant is the salary one would get with a value of zero on
every explanatory variable, that is, with only the minimum qualities needed to
be recruited by the company. Here the constant is negative (TK. −25969.54), but
a negative salary is not possible. So what would the salary be for a person who
has just joined the firm?
In conclusion, we have to state that the sample is not fully representative
of all people working in the company: no employee in it has values near zero on
all of the variables. We cannot extrapolate the results too far out of this
sample range, and we cannot use the equation to predict what a new entrant
would earn. So we infer that this regression equation should also not be used
for making generalized decisions about any salary structure.
Simple Regressions for the Negatively Influencing Factors
Variables Entered/Removed (b)
Model  Variables Entered  Variables Removed  Method
1  SEX OF EMPLOYEE (a)  .  Enter
a. All requested variables entered.
b. Dependent Variable: CURRENT SALARY
Model Summary
Model  R  R Square  Adjusted R Square  Std. Error of the Estimate
1  .343 (a)  .118  .086  7705.174
a. Predictors: (Constant), SEX OF EMPLOYEE
Coefficients (a)
Model  Unstandardized B  Std. Error  Standardized Beta  t  Sig.
1 (Constant) 22191.765 1868.779 11.875 .000
SEX OF EMPLOYEE -5486.380 2838.880 -.343 -1.933 .063
a. Dependent Variable: CURRENT SALARY
The simple regression of Current Salary on Sex of Employee still shows a
negative influence when all the other variables are left out. But the initial
salary (α, the constant) is positive here.
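Because Sex of Employee is a 0/1 variable, this simple regression has a direct reading: the constant is the mean salary of the male employees and the slope is the difference between the female and male means. The following sketch refits the regression from the data set in plain Python. The helper name simple_ols is hypothetical, and the salary and sex lists are copied from the data table in ID order.

```python
# Current salary (Tk) and sex (0 = male, 1 = female), in ID order 1..30.
salary = [16080, 41400, 21960, 19200, 28350, 27250, 16080, 14100, 12420, 12300,
          15720, 8880, 22000, 22800, 19020, 12300, 22200, 10380, 8520, 27500,
          11460, 20500, 27700, 28000, 22000, 27250, 27000, 9000, 31300, 11760]
sex = [0, 0, 1, 0, 0, 1, 0, 0, 1, 1,
       0, 1, 0, 0, 1, 1, 1, 1, 0, 0,
       1, 0, 0, 1, 1, 0, 0, 1, 0, 0]


def simple_ols(x, y):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = simple_ols(sex, salary)

# Group means for comparison with the fitted coefficients.
male_mean = sum(s for s, g in zip(salary, sex) if g == 0) / sex.count(0)
female_mean = sum(s for s, g in zip(salary, sex) if g == 1) / sex.count(1)

print(round(intercept, 3), round(slope, 3))
print(round(male_mean, 3), round(female_mean - male_mean, 3))
```

Both printed lines should show the same pair of numbers, matching the B column of the Coefficients table for this simple regression.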
Now, for Work Experience,
Variables Entered/Removed (b)
Model  Variables Entered  Variables Removed  Method
1  WORK EXPERIENCE (a)  .  Enter
a. All requested variables entered.
b. Dependent Variable: CURRENT SALARY
Model Summary
Model  R  R Square  Adjusted R Square  Std. Error of the Estimate
1  .391 (a)  .153  .123  7549.967
a. Predictors: (Constant), WORK EXPERIENCE
Coefficients (a)
Model  Unstandardized B  Std. Error  Standardized Beta  t  Sig.
1 (Constant) 22884.178 1940.377 11.794 .000
WORK EXPERIENCE -355.087 157.964 -.391 -2.248 .033
a. Dependent Variable: CURRENT SALARY
Again, the simple regression of Current Salary on Work Experience still shows a
negative influence when all the other variables are left out, and the initial
salary (α) is also positive here.
So Sex of Employee and Work Experience are associated with lower salary both in
the simple regressions and in the multiple regression equation. Omitting the
other variables turns the initial salary (α) positive, but leaves the direction
of these two effects unchanged.
HETEROSKEDASTICITY IN MULTIPLE REGRESSION
In multiple regression, one of the assumptions we have made until now is that the
errors have a common variance. This is known as the homoskedasticity
assumption. If the errors do not have a constant variance, we say they are
heteroskedastic.
Analyzing our data set through SPSS, we get,
Descriptive Statistics
Mean Std. Deviation N
CURRENT SALARY 19814.33 8060.314 30
SEX OF EMPLOYEE .43 .504 30
JOB SENIORITY 80.47 9.748 30
AGE OF EMPLOYEE 36.3227 8.76549 30
EDUCATIONAL LEVEL 16.00 2.244 30
WORK EXPERIENCE 8.6453 8.87542 30
MINORITY CLASSIFICATION .37 .490 30
Residuals Statistics (a)
Minimum Maximum Mean Std. Deviation N
Predicted Value 10342.00 29286.66 19814.33 5313.421 30
Residual -8926.251 21585.666 .000 6061.042 30
Std. Predicted Value -1.783 1.783 .000 1.000 30
Std. Residual -1.447 3.499 .000 .983 30
a. Dependent Variable: CURRENT SALARY
Here, a trumpet-shaped residual plot means that the residuals do not have
constant variance. The histogram is drawn from the residuals of the dependent
variable alone, leaving out the independent variables, so that the error
variance is easier to see. The graph shows that the distribution is not exactly
normal; there are some disturbances in this data set. So a heteroskedasticity
problem is present here, but it is small.
Model Summary (b)
Model  R  R Square  Adjusted R Square  Std. Error of the Estimate
1  .701 (a)  .491  .358  6458.883
a. Predictors: (Constant), MINORITY CLASSIFICATION, JOB
SENIORITY, AGE OF EMPLOYEE, SEX OF EMPLOYEE,
EDUCATIONAL LEVEL, WORK EXPERIENCE
b. Dependent Variable: CURRENT SALARY
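The White and Glejser tests judge heteroskedasticity from the R2 of an auxiliary
regression of the squared (White) or absolute (Glejser) residuals on the
explanatory variables, not from the R2 of the salary model itself. That
auxiliary regression is not reported here, so the R2 of 0.491 cannot by itself
decide the hypothesis of homoskedasticity; on the graphical evidence we do not
reject it.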
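The idea behind such auxiliary-regression tests can be sketched as follows. This is a simplified illustration with one regressor, not the full White test, which also adds squares and cross-products of all regressors. The residuals and x values are made up for the example and are not the residuals of the salary model. The statistic is n times the auxiliary R2, compared with the 5% chi-square critical value for one degree of freedom (3.841, from standard tables).

```python
def r_squared(x, y):
    """R^2 of the simple least-squares regression of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def lm_statistic(residuals, x):
    """n * R^2 from regressing the squared residuals on x."""
    squared = [e ** 2 for e in residuals]
    return len(x) * r_squared(x, squared)


# Assumption: 5% critical value of chi-square with 1 degree of freedom.
CHI2_CRITICAL_1DF = 3.841

x = [1, 2, 3, 4, 5, 6, 7, 8]
# Hypothetical residuals whose spread grows with x (trumpet shape).
fanning = [0.5, -1.0, 1.5, -2.0, 2.5, -3.0, 3.5, -4.0]
# Hypothetical residuals whose spread is unrelated to x.
steady = [1.0, -2.0, 1.0, -2.0, 2.0, -1.0, 2.0, -1.0]

for label, resid in (("fanning", fanning), ("steady", steady)):
    lm = lm_statistic(resid, x)
    verdict = "reject" if lm > CHI2_CRITICAL_1DF else "do not reject"
    print(f"{label}: LM = {lm:.3f}, {verdict} homoskedasticity")
```

Applied to this assignment, the same calculation would need the saved residuals from SPSS and all six explanatory variables in the auxiliary regression.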
In the Normal P-P Plot, the points lie very near the straight line expected
under normality. So here also we find a very low heteroskedasticity problem.
Again, plotting the Standardized Residual against the Standardized Predicted
Value, we find a very low heteroskedasticity problem, since the plot shows no
particular trend.
Although we have a very low heteroskedasticity problem, we can deal with the
remainder by
“Possible correction => log transformation of the dependent variable”
The R2 of this log-linear form is not comparable with the R2 of the linear
form, since the dependent variable, and hence its variance, is different.
AUTOCORRELATION IN MULTIPLE REGRESSION
In multiple regression analysis, correlation between the error terms is called
autocorrelation. For detecting an autocorrelation problem, the Durbin-Watson
test is the simplest and most commonly used. Here ϕ denotes the correlation
between successive error terms, and the hypothesis tested is ϕ = 0, that is,
no autocorrelation in the data set.
Model Summary (b)
Model  Durbin-Watson
1  2.168 (a)
a. Predictors: (Constant), MINORITY CLASSIFICATION, JOB
SENIORITY, AGE OF EMPLOYEE, SEX OF EMPLOYEE,
EDUCATIONAL LEVEL, WORK EXPERIENCE
b. Dependent Variable: CURRENT SALARY
Coefficients (a)
Model  Zero-order Correlation  Partial Correlation  Part Correlation
1 SEX OF EMPLOYEE -.343 -.158 -.114
JOB SENIORITY .094 .131 .094
AGE OF EMPLOYEE -.313 .066 .047
EDUCATIONAL LEVEL .659 .513 .426
WORK EXPERIENCE -.391 -.071 -.051
MINORITY CLASSIFICATION .224 .151 .109
a. Dependent Variable: CURRENT SALARY
Residuals Statistics (a)
Minimum Maximum Mean Std. Deviation N
Predicted Value 8323.94 31453.22 19814.33 5646.471 30
Residual -7812.773 20206.270 .000 5752.046 30
Std. Predicted Value -2.035 2.061 .000 1.000 30
Std. Residual -1.210 3.128 .000 .891 30
a. Dependent Variable: CURRENT SALARY
Correlations
Columns, in order: EDUCATIONAL LEVEL, WORK EXPERIENCE, CURRENT SALARY, SEX OF EMPLOYEE, JOB SENIORITY, AGE OF EMPLOYEE, MINORITY CLASSIFICATION
Pearson Correlation
CURRENT SALARY .659 -.391 1.000 -.343 .094 -.313 .224
SEX OF EMPLOYEE -.274 .271 -.343 1.000 -.225 .183 .033
JOB SENIORITY -.085 -.035 .094 -.225 1.000 .003 .000
AGE OF EMPLOYEE -.411 .979 -.313 .183 .003 1.000 -.196
EDUCATIONAL LEVEL 1.000 -.497 .659 -.274 -.085 -.411 .188
WORK EXPERIENCE -.497 1.000 -.391 .271 -.035 .979 -.200
MINORITY CLASSIFICATION .188 -.200 .224 .033 .000 -.196 1.000
Sig. (1-tailed)
CURRENT SALARY .000 .016 . .032 .311 .046 .117
SEX OF EMPLOYEE .071 .074 .032 . .116 .166 .432
JOB SENIORITY .327 .428 .311 .116 . .494 .498
AGE OF EMPLOYEE .012 .000 .046 .166 .494 . .150
EDUCATIONAL LEVEL . .003 .000 .071 .327 .012 .160
WORK EXPERIENCE .003 . .016 .074 .428 .000 .144
MINORITY CLASSIFICATION .160 .144 .117 .432 .498 .150 .
Here the D-W statistic is 2.168, which is very near to 2. We know that a D-W
statistic of 2 indicates zero correlation (ϕ = 0) between the error terms. So
in our data set there is a very low autocorrelation problem.
Further diagnostic tests, such as the LM test and the BKW test, can also be
applied.
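For reference, the Durbin-Watson statistic is the sum of squared successive differences of the residuals divided by the sum of squared residuals, and it is approximately 2(1 − ϕ). The sketch below implements that definition and backs out the ϕ implied by the reported value of 2.168. The function names are illustrative, and the short residual lists are artificial examples, not the SPSS residuals.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: squared successive differences over squared residuals."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def implied_phi(d):
    """Approximate first-order autocorrelation from d, using d = 2 * (1 - phi)."""
    return 1 - d / 2


# Artificial residuals: alternating signs give d well above 2 (negative phi).
print(durbin_watson([1, -1, 1, -1]))
# Artificial residuals: identical values give d = 0 (strong positive phi).
print(durbin_watson([1, 1, 1, 1]))
# The value reported by SPSS for the salary model.
print(round(implied_phi(2.168), 3))
```

The implied ϕ for d = 2.168 is slightly negative and close to zero. For cross-sectional data such as this sample, d also depends on the order in which the employees are listed.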
MULTICOLLINEARITY IN MULTIPLE REGRESSION
One important problem in the application of multiple regression analysis involves
the possible collinearity of the explanatory variables. This condition refers to
situations in which some of the explanatory variables are highly correlated with
each other.
One method of measuring multicollinearity uses the Variance Inflation
Factor (VIF) for each explanatory variable. The VIF values obtained through
SPSS are shown below,
Coefficients (a)
Model  Tolerance  VIF
1 SEX OF EMPLOYEE .734 1.362
JOB SENIORITY .939 1.065
AGE OF EMPLOYEE .033 29.964
WORK EXPERIENCE .032 31.372
MINORITY CLASSIFICATION .950 1.053
a. Dependent Variable: EDUCATIONAL LEVEL
Coefficients (a)
Model  Tolerance  VIF
1 SEX OF EMPLOYEE .848 1.179
JOB SENIORITY .924 1.082
AGE OF EMPLOYEE .810 1.235
MINORITY CLASSIFICATION .937 1.068
EDUCATIONAL LEVEL .756 1.322
a. Dependent Variable: WORK EXPERIENCE
Coefficients (a)
Model  Tolerance  VIF
1 JOB SENIORITY .918 1.089
AGE OF EMPLOYEE .031 32.365
MINORITY CLASSIFICATION .947 1.056
EDUCATIONAL LEVEL .572 1.749
WORK EXPERIENCE .028 35.927
a. Dependent Variable: SEX OF EMPLOYEE
Coefficients (a)
Model  Tolerance  VIF
1 AGE OF EMPLOYEE .028 35.540
MINORITY CLASSIFICATION .938 1.066
EDUCATIONAL LEVEL .602 1.662
WORK EXPERIENCE .025 40.063
SEX OF EMPLOYEE .755 1.324
a. Dependent Variable: JOB SENIORITY
Coefficients (a)
Model  Tolerance  VIF
1 MINORITY CLASSIFICATION .938 1.066
EDUCATIONAL LEVEL .721 1.388
WORK EXPERIENCE .718 1.392
SEX OF EMPLOYEE .890 1.124
a. Dependent Variable: AGE OF EMPLOYEE
Coefficients (a)
Model  Tolerance  VIF
1 EDUCATIONAL LEVEL .610 1.641
WORK EXPERIENCE .025 40.044
SEX OF EMPLOYEE .763 1.311
AGE OF EMPLOYEE .028 35.539
a. Dependent Variable: MINORITY CLASSIFICATION
The tolerance for a variable is (1 - R-squared) for the regression of that variable
on all the other independents, ignoring the dependent. When tolerance is close
to 0 there is high multicollinearity of that variable with other independents and
the coefficients will be unstable.
VIF is the variance inflation factor, which is simply the reciprocal of tolerance.
Therefore, when VIF is high there is high multicollinearity and instability of the
coefficients.
As a rule of thumb, if tolerance is less than .20 (that is, VIF above 5), a
problem with multicollinearity is indicated.
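The relationship between R-squared, tolerance and VIF, together with the .20 rule of thumb, can be sketched as below. The function names are illustrative. The tolerances are the rounded values from the first collinearity table (dependent variable: Educational Level), so the recomputed VIFs for the very small tolerances differ somewhat from the SPSS VIF column, which is based on unrounded tolerances.

```python
def tolerance_from_r2(r2):
    """Tolerance = 1 - R^2 of a regressor on the other regressors."""
    return 1.0 - r2


def vif_from_tolerance(tolerance):
    """VIF is the reciprocal of tolerance."""
    return 1.0 / tolerance


# Rounded tolerances copied from the first collinearity table.
tolerances = {
    "SEX OF EMPLOYEE": 0.734,
    "JOB SENIORITY": 0.939,
    "AGE OF EMPLOYEE": 0.033,
    "WORK EXPERIENCE": 0.032,
    "MINORITY CLASSIFICATION": 0.950,
}

# Rule of thumb from the text: tolerance below .20 signals a problem.
flagged = [name for name, tol in tolerances.items() if tol < 0.20]

for name, tol in tolerances.items():
    print(f"{name:25s} tolerance = {tol:.3f}  VIF = {vif_from_tolerance(tol):.3f}")
print("Flagged:", flagged)

# Two-variable illustration: Age and Work Experience correlate at .979,
# so regressing one on the other alone already gives a large VIF.
pair_vif = vif_from_tolerance(tolerance_from_r2(0.979 ** 2))
print(round(pair_vif, 2))
```

Only Age of Employee and Work Experience are flagged, which points to this pair as the source of the collinearity.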
From the results above and considering the VIF values, we can infer that there
is very high collinearity among some of the independent variables, chiefly
between Age of Employee and Work Experience (tolerance below .04, and a
correlation of .979 between the two).
We can address this problem through,
• Ridge regression
• Principal component regression
• Dropping one of the highly collinear variables
• Using ratios or first differences
• Using extraneous estimates
• Getting more data
Concluding Comments:
Analyzing the multiple regression and considering the R2 value (0.491), we can
infer that for overall estimation this model is not strong.
Again, we have found that the model for estimating the salary of employees of
the Coca-Cola Company includes highly collinear variables. But the model has
the merit of showing very little heteroskedasticity or autocorrelation.