Alumni Donation - Complete exploration and analysis report
Alumni Giving Rate Linear Regression Analysis
Introduction:
This study investigates the factors that influence alumni donations. We use exploratory data analysis and multiple linear regression to predict the alumni giving rate. After analyzing relationships between the variables, we fit linear regression models and perform residual diagnostics to check that the assumptions of linear regression are not violated. We find that a lower student-faculty ratio is associated with higher alumni donations and that class size is not a significant predictor of alumni donations. In our final model we include an additional variable, graduation rate, which is highly correlated with alumni giving rate, and an interaction between student-faculty ratio and the dummy variable Private. The final model has an adjusted R-squared of 75.76%.
Data Description:
In this study we use data from 48 US universities collected from America's Best Colleges, Year 2000 Ed.
• School: University name
• % of Classes Under 20: Percentage of classes offered with fewer than 20 students
• Student/Faculty Ratio: Ratio of students to faculty
• Alumni Giving Rate: Percentage of alumni who donated to the university
• Private: A categorical variable for private or public universities, with 1 for private and 0 for public
We observed that there are no null values or outliers in the dataset. Descriptive statistics of all the variables are presented in Table 1.
TABLE 1: Summary statistics for the response and predictor variables
Variable                     Minimum  1st Quartile  Median  Mean   3rd Quartile  Maximum
Alumni Giving Rate           7.00     18.75         29.00   29.27  38.50         67.00
Percent Class Size under 20  29.00    44.75         59.50   55.73  66.25         77.00
Student Faculty Ratio        3.00     8.00          10.50   11.54  13.50         23.00
Private                      0.00     0.00          1.00    0.688  1.00          1.00
We observe that Alumni Giving Rate has a positive correlation with Percent Class Size under 20 and a strong negative correlation with Student Faculty Ratio, with correlation coefficients of 0.646 and -0.742, respectively. Further, Percent Class Size under 20 and Student Faculty Ratio have a correlation of -0.786, which can lead to multicollinearity issues.
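As a rough gauge of how much this correlation could inflate coefficient variances, one can compute a variance inflation factor (VIF). The sketch below assumes a model containing only these two predictors, in which case VIF = 1 / (1 - r^2); the function name is illustrative, and the resulting value is not reported in the study.

```python
def vif_two_predictors(r: float) -> float:
    """VIF for either predictor in a model with exactly two predictors.

    With two predictors, the R-squared from regressing one on the other
    equals the squared pairwise correlation r**2.
    """
    if abs(r) >= 1:
        raise ValueError("correlation must lie strictly between -1 and 1")
    return 1.0 / (1.0 - r ** 2)


# Reported correlation between Percent Class Size under 20
# and Student Faculty Ratio.
vif = vif_two_predictors(-0.786)
print(round(vif, 2))
```

Common rules of thumb flag VIF values above 5 or 10, so this check on its own would point to moderate rather than severe collinearity. Adding further predictors (such as Private) would change the VIFs.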
Figure 1: Pairwise scatter plots of all the variables
Figure 2: Box plot for Alumni Giving Rate, Percent class under 20 and Student-Faculty Ratio
Methodology:
Research shows that students who are more satisfied with their contact with teachers are more likely to graduate. As a result, one might suspect that smaller class sizes and lower student-faculty ratios might lead to a higher percentage of satisfied graduates, which in turn might lead to increases in the percentage of alumni who donate.
First, we fit a linear regression model with class size and student-faculty ratio as predictors. The fitted regression equation is:
$$\widehat{\text{Alumni Giving Rate}} = 39.66 + 0.17 \times \text{Percent Class Size under 20} - 1.7 \times \text{Student Faculty Ratio}$$
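From the results of the hypothesis tests on the significance of the coefficients ($H_0: \beta_j = 0$ versus $H_a: \beta_j \neq 0$ for $j = 0, 1, 2$), we conclude that the intercept $\beta_0$ and the Student Faculty Ratio coefficient $\beta_2$ are significant, and that the Percent Class Size under 20 coefficient $\beta_1$ is insignificant.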
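To make the fitting step concrete, here is a minimal least-squares sketch that solves the normal equations in plain Python. It is not the study's code. The data are synthetic and generated from known coefficients, so the fit can be checked against them. The function name `fit_ols` is an illustrative choice.

```python
def fit_ols(rows, y):
    """Least-squares coefficients [b0, b1, ..., bk] via the normal equations.

    rows: one tuple of predictor values per observation; an intercept
    column of ones is added here.
    """
    X = [[1.0, *r] for r in rows]
    k = len(X[0])
    # Build the augmented system [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Synthetic (made-up) observations:
# (percent of classes under 20, student-faculty ratio).
rows = [(30, 20), (45, 15), (60, 10), (70, 8), (55, 12), (65, 5)]
# Responses generated exactly from y = 40 + 0.2*x1 - 1.5*x2,
# so the fit should recover these coefficients up to rounding error.
y = [40 + 0.2 * x1 - 1.5 * x2 for x1, x2 in rows]
b0, b1, b2 = fit_ols(rows, y)
print(round(b0, 2), round(b1, 2), round(b2, 2))
```

In practice a statistics package would also report standard errors and the t-tests used above. This sketch covers only the coefficient estimates.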
Figure 3: Scatter plots grouped by public and private universities
From Figure 3, we observe that alumni donations for private and public schools are significantly different. Now, we include the dummy variable Private in our initial model. The fitted regression equations are:
Private Schools:
$$\widehat{\text{Alumni Giving Rate}} = 43.07 + 0.08 \times \text{Percent Class Size under 20} - 1.4 \times \text{Student Faculty Ratio}$$
Public Schools:
$$\widehat{\text{Alumni Giving Rate}} = 36.78 + 0.08 \times \text{Percent Class Size under 20} - 1.4 \times \text{Student Faculty Ratio}$$
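The two equations above differ only in their intercepts, which is what a dummy variable without an interaction term implies. Under that reading they can be combined into a single equation whose Private coefficient equals the intercept gap, 43.07 - 36.78 = 6.29. This value is derived here from the rounded intercepts and is not reported directly in the study.

$$\widehat{\text{Alumni Giving Rate}} = 36.78 + 0.08 \times \text{Percent Class Size under 20} - 1.4 \times \text{Student Faculty Ratio} + 6.29 \times \text{Private}$$

Setting Private = 0 recovers the public-school equation, and Private = 1 recovers the private-school equation.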
From the results of the hypothesis tests ($H_0: \beta_j = 0$ versus $H_a: \beta_j \neq 0$ for $j = 0, 1, 2, 3$), we conclude that only $\beta_0$ and $\beta_2$ are significant. The adjusted R-squared of the model is 0.5747, indicating that the model explains only about 57% of the variation in alumni giving rate, which leaves considerable room for improvement in predictive power.
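For reference, adjusted R-squared penalizes the ordinary R-squared for the number of predictors. The sketch below shows the standard formula. The sample size (48) and the predictor count (3) come from the text. The raw R-squared of 0.60 is a hypothetical input chosen for illustration, not a value reported in the study.

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for k predictors (plus an intercept) and n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# n = 48 universities and k = 3 predictors come from the text;
# r2 = 0.60 is a hypothetical raw R-squared, not a reported value.
adj = adjusted_r_squared(0.60, n=48, k=3)
print(round(adj, 4))
```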
Residual analysis:
The Q-Q plot of the model residuals indicates that the normality assumption of the error term is violated. When the residuals are plotted against Percent Class Size under 20, they appear to show increasing variance, whereas against Student Faculty Ratio they show constant variance. The NCV test ($H_0: \sigma_i^2 = \sigma^2$ (constant); $H_a: \sigma_i^2 \neq \sigma^2$) gives a large p-value (0.29), so we fail to reject the hypothesis of constant variance and conclude that there is no issue of heteroscedasticity. The results of the Durbin-Watson test indicate that there is no first-order autocorrelation (D-W statistic: 1.61378; p-value: 0.172).
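The Durbin-Watson statistic itself is simple to compute from the ordered residuals. The sketch below uses made-up residuals for illustration, not the study's residuals, and it does not reproduce the p-value calculation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: squared successive differences over the sum of squares."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Made-up residuals for illustration only.
example_residuals = [1.2, -0.8, 0.5, -1.1, 0.9, -0.3, 0.4, -0.7]
dw = durbin_watson(example_residuals)
print(round(dw, 3))
```

The statistic lies between 0 and 4. Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.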
Figure 4: Residual plots
Outliers and Influential Points:
Table 2: Outliers and influential points
Measure            Condition                                Flagged universities
Y Outlier          Studentized residual, $|r_i| \geq 3$     Princeton University
X Outlier          Leverage, $h_{ii} \geq 2p/n$             Boston College, U. of Washington, UCB
Influential Point  Cook's D, $D_i \sim F_{p,\,n-p}$         NYU, Princeton, U. of Florida, U. of Notre Dame, U. of Washington
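The first two screening rules in Table 2 can be sketched as follows. The helper names and the diagnostic values are hypothetical. The choice p = 4 (intercept plus three predictors) for the study's model is an assumption, since the report does not state p explicitly.

```python
def leverage_cutoff(p: int, n: int) -> float:
    """Rule-of-thumb cutoff 2p/n for hat values (p = number of estimated coefficients)."""
    return 2.0 * p / n


def flag_points(names, studentized, leverage, p):
    """Return (y_outliers, x_outliers) using |r_i| >= 3 and h_ii >= 2p/n."""
    cutoff = leverage_cutoff(p, len(names))
    y_out = [nm for nm, r in zip(names, studentized) if abs(r) >= 3]
    x_out = [nm for nm, h in zip(names, leverage) if h >= cutoff]
    return y_out, x_out


# Assumption: p = 4 coefficients and n = 48 universities.
cutoff = leverage_cutoff(p=4, n=48)
print(round(cutoff, 4))

# Hypothetical diagnostics for eight made-up schools (p = 2, so the cutoff is 0.5).
names = list("ABCDEFGH")
studentized = [0.5, -3.2, 1.1, 2.0, -0.4, 0.9, -1.5, 0.3]
leverage = [0.15, 0.20, 0.60, 0.20, 0.15, 0.25, 0.25, 0.20]
y_out, x_out = flag_points(names, studentized, leverage, p=2)
print(y_out, x_out)
```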
After removing the outliers and influential points, the adjusted R-squared improved from 57.47% to 63.67%. However, $\beta_1$ and $\beta_3$ are still insignificant. Since Percent Class Size under 20 and Student Faculty Ratio are highly correlated and the coefficient of Percent Class Size under 20 is insignificant, we remove this predictor and fit a new regression model, with the adjusted R-squared improved to 64.25%. Results of the partial F-test (Table 3) also confirm that Percent Class Size under 20 is insignificant, as the p-value is large.
Table 3: Partial F-Test results
Model 1: Alumni.Giving.rate ~ SFratio + Private
Model 2: Alumni.Giving.rate ~ Per.under.20 + SFratio + Private
Res.Df   RSS      Df   Sum of Sq   F        Pr(>F)
38       1991.6
37       1971.2   1    20.477      0.3844   0.5391
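The F value in Table 3 can be approximately reproduced from the residual sums of squares and degrees of freedom. The sketch below uses the rounded RSS values from the table, so the result differs slightly from the reported 0.3844. The function name is illustrative.

```python
def partial_f(rss_reduced, df_reduced, rss_full, df_full):
    """Partial F statistic comparing a reduced model nested in a full model."""
    extra_terms = df_reduced - df_full
    if extra_terms <= 0:
        raise ValueError("the full model must use more parameters than the reduced one")
    return ((rss_reduced - rss_full) / extra_terms) / (rss_full / df_full)


# Values from Table 3 (RSS is rounded there).
f_stat = partial_f(rss_reduced=1991.6, df_reduced=38, rss_full=1971.2, df_full=37)
print(round(f_stat, 3))
```

The statistic is far below 1, which is consistent with the large p-value reported for Percent Class Size under 20.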
The regression equation is:
$$\widehat{\text{Alumni Giving Rate}} = 43.36 - 1.61 \times \text{Student Faculty Ratio} + 5.53 \times \text{Private}$$
Discussion:
For the final model, we added a variable, grad_rate, which is part of the original America's Best Colleges, Year 2000 Ed. dataset (with a correlation coefficient of 0.756 with Alumni Giving Rate), to the initial data set and fitted a linear regression model with all other variables. The model's adjusted R-squared improves to 69.79%.
By adding an interaction term, Student Faculty Ratio : Private, the model's adjusted R-squared improves to 71.74%. All the regression coefficients are significant except for that of Student Faculty Ratio.
Fitted Model
$$\widehat{\text{Alumni Giving Rate}} = -24.39 - 0.04 \times \text{Student Faculty Ratio} + 26.27 \times \text{Private} - 1.52 \times \text{Student Faculty Ratio} \times \text{Private} + 0.53 \times \text{grad\_rate}$$
Now, we fit a regression model without the Student Faculty Ratio main effect (keeping its interaction with Private), which gives a higher adjusted R-squared of 75.76%. Residual diagnostics show that the error terms follow a normal distribution with constant variance (Shapiro-Wilk test: p-value 0.16).
Our final regression model is:
$$\widehat{\text{Alumni Giving Rate}} = -25.32 + 26.92 \times \text{Private} - 1.55 \times \text{Student Faculty Ratio} \times \text{Private} + 0.54 \times \text{grad\_rate}$$
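To illustrate how the interaction term behaves, the final equation can be evaluated directly. The sketch below hard-codes the reported coefficients. The input values are hypothetical, and treating grad_rate as a percentage is an assumption about how the variable is scaled.

```python
def predict_giving_rate(private: int, sf_ratio: float, grad_rate: float) -> float:
    """Predicted alumni giving rate (%) from the final fitted equation."""
    return -25.32 + 26.92 * private - 1.55 * sf_ratio * private + 0.54 * grad_rate


# Hypothetical inputs: student-faculty ratio of 10, graduation rate of 90 (percent assumed).
private_pred = predict_giving_rate(private=1, sf_ratio=10, grad_rate=90)
public_pred = predict_giving_rate(private=0, sf_ratio=10, grad_rate=90)
print(round(private_pred, 2), round(public_pred, 2))
```

Because Student Faculty Ratio enters only through the interaction, this model implies that the ratio changes the prediction for private schools (by -1.55 per unit) and has no effect for public schools.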