1. 1: George and McCulloch (1993) "Variable Selection via Gibbs Sampling" JASA, 88, 881-889
Kimberly Nguyen MA 576: Project Due: Friday 5/29/16
The Pursuit of Happiness and Statistics
Introduction
Everything we do in life is for the pursuit of happiness. We spend countless nights
cramming for Professor Carvalho’s exams, recopying his notes, and attending every one of his
office hours, meanwhile dreaming about graduating college with a high grade point average and
having a high paying job. But is this how we achieve happiness? Is the isolation we feel from
friends and family while we are studying in Mugar actually hindering our happiness? If we
dedicate some of that time to having more sex, could we elevate our emotional state? And once
we are done with college, how do we choose between careers that we actually enjoy and
careers that pay well? Which of these factors truly determine one’s state of happiness? Based on
a survey of 39 employed students in an MBA class at the University of Chicago, these are the
questions we will attempt to answer.
From these students, a total of five variables were collected. The response we are
considering is level of happiness which was measured on a 10 point scale with 1 representing a
suicidal state, 5 representing a state of just “surviving life”, and 10 representing a euphoric state.
Four potential predictors were collected: money, sex, love, and work. Money is a continuous variable
measuring family income in thousands of dollars. Sex refers not to gender but to a 0/1 dummy
variable where 1 represents a satisfactory level of sexual activity. Love is a factor variable
ranging from 1-3 measuring a student’s feeling of belonging in the context of family, friends,
and community. 1 indicates a feeling of loneliness and isolation, 2 indicates the student has a few
secure relationships, and 3 indicates a deep feeling of belonging. And lastly, work is a factor
variable ranging from 1-5, with 1 indicating the student is seeking other employment, 3 being
indifferent about their job, and 5 indicating that the student enjoys their job.1
In real life, I believe all four of these predictors are important contributors to one’s
emotional state, so my preliminary expectation is that all of them will be statistically significant in
our model. In the context of college students, these factors are even more of a priority.
Students are still young enough to be financially dependent on their families, and thus the money
variable, which measures their family’s income could still be a significant factor in these
students’ lives. Love measures the social aspect of one’s life which is especially important in
college years where one is developing the friendships they’ll have for the rest of their lives. And
work is where these students spend a significant amount of time, and obtaining a good job is the
reason students are even attending college, thus whether or not they enjoy their job should have a
significant effect on their state of happiness. If I were to throw any variable out, it would be sex
because having satisfactory sex is not necessarily a priority shared amongst all college students.
We will allow the data to verify or disprove these assumptions.
Data Clean Up
Due to the limited sample size, we had to do a bit of house cleaning with our data. Since
we had so many variables/levels in our predictors, and only 39 observations, we had to collapse
the levels of a majority of our predictors and our response to avoid sparsity. The response
happiness went from having levels 1 through 10, to having 4 levels: 1-4, 5-6, 7-8, and 9-10.
Money and sex were not changed. Love was transformed from having 3 levels to 2 levels: 1-2,
and 3. Lastly, work went from having a scale of 1-5 to 1-2, 3, and 4-5. The levels still represent
the same aspect of their respective variables. For example, in work, 1-2 means the student is
either looking for a new job or currently dislikes their job, 3 means the student is indifferent,
and 4-5 means the student enjoys their job.
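The collapsing rules above can be written down as a small sketch. The function names are hypothetical and chosen only for illustration; the cut points are the ones stated in this section.

```python
def collapse_happy(score):
    """Map a 1-10 happiness score to one of the four collapsed levels."""
    if not 1 <= score <= 10:
        raise ValueError("happiness score must be between 1 and 10")
    if score <= 4:
        return "1-4"
    if score <= 6:
        return "5-6"
    if score <= 8:
        return "7-8"
    return "9-10"


def collapse_love(level):
    """Map a 1-3 love level to the two collapsed levels."""
    if level not in (1, 2, 3):
        raise ValueError("love level must be 1, 2, or 3")
    return "1-2" if level <= 2 else "3"


def collapse_work(level):
    """Map a 1-5 work level to the three collapsed levels."""
    if level not in (1, 2, 3, 4, 5):
        raise ValueError("work level must be between 1 and 5")
    if level <= 2:
        return "1-2"
    if level == 3:
        return "3"
    return "4-5"


# Example: a student with happiness 7, love 3, and work 2.
print(collapse_happy(7), collapse_love(3), collapse_work(2))
```

Money and sex are left untouched, so they need no mapping.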
Preliminary Data Analysis
Looking at the pairwise graph between all the variables, we can gain some idea about the
possible correlation between the predictors. Simply from observing the graph, there does not seem
to be any correlation between the predictors. This makes sense logically because
sex, family income, love, and work are all very different components of life that generally would
not affect one another.
We will start by looking at the mosaic plots between the factor variables and our
response. There does not appear to be a clear difference in happiness level between students who are
having satisfactory sex and those who are not. If anything, the effect of satisfactory sex on happiness
seems to run counter to intuition: more people who have satisfactory sex fall in the lower
levels of happiness than people who don’t have satisfactory sex.
Observing the mosaic plot between happiness and love, it is apparent that those who feel
most loved experience a higher level of happiness than those who feel less loved. Most of the
people who are in love level 3 fall into levels 7-8 and 9-10 of happiness.
There also seems to be a significant difference in level of happiness between the people
who enjoy their jobs and those who don’t. A larger proportion of the people who don’t enjoy
their jobs fall into the 1-4 level of happiness.
And a larger proportion of people who enjoy their job fall into the 7-8 and 9-10 levels of
happiness.
Disqualification of Linear Model
Running a linear model here would not make sense for multiple reasons. When running a
linear model, we assume that our response is a continuous and normally distributed random
variable. However, here our response is categorical, not continuous. And not only is it not
continuous, but the response is ordinal, i.e. a categorical variable that has an order and this order
is important information that should not be ignored. R does not even allow you to run a linear
model on ordinal data.
However, if we ignore all this, revert to the response on its original scale from 1 to
10, and treat it as a continuous response, we can run a linear model. But even then, the
assumption of a null residual plot is violated: the residuals fan out to the left and right from the
middle of the plot.
Fitting the Model
Fortunately, through the power of generalized linear models, we can correctly model
ordinal data using a proportional odds model. After fitting a proportional odds model with
happy as the response and the four predictors (money, sex, love, and work), we put that model
through AIC-based model selection and ended up with a model that predicts the probability of being
in each level of happiness based on love and work.
As I expected, sex was not a valuable variable in predicting the probability of being in a
certain level of happiness. This agrees with our previous discussion where we claimed that
having satisfactory sex is not a priority in many people’s lives, especially considering these are
MBA students at a top-ranking university who have a lot more things to worry about.
Surprisingly, the amount of income a student’s family earns also does not have a significant hand
in determining happiness level. Looking back, this makes sense as these are MBA students, not
undergraduates and thus they have probably already financially separated themselves from their
families. And they are also employed MBA students, further distancing their connection to their
families’ incomes.
We did not attempt to fit any interactions because qualitatively, it would not make sense
for any of the predictors to interact with each other. As previously mentioned, in the preliminary
data analysis, logically, satisfactory sex, family income, job satisfaction, and feeling of
belonging generally do not have an effect on each other. In addition, we had no reason to
believe we needed to transform any variables and thus no transformations were attempted. Any
transformations would have complicated the interpretation of the model.
Coefficient Interpretation
In a proportional odds model, the probability of being at most in category j is:
$$\gamma_j = P(y \le j) = \operatorname{logit}^{-1}\left(\theta_j - X^T \beta\right)$$
And thus, the log odds of being at most in category j relative to being above category j is:
$$\operatorname{logit}(\gamma_j) = \log\left(\frac{\gamma_j}{1 - \gamma_j}\right) = \theta_j - X^T \beta$$
From this we can derive that the odds ratio is:
$$OR_j(X_1, X_2) = \frac{\text{odds}_j(X_1)}{\text{odds}_j(X_2)} = e^{(X_2 - X_1)^T \beta} \quad \text{for all categories } j$$
and thus
$$\text{odds}_j(X_1) = \text{odds}_j(X_2) \cdot e^{(X_2 - X_1)^T \beta}$$
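To make the cumulative formula concrete, here is a minimal sketch that turns the cumulative probabilities into a probability for each happiness level. It plugs in the threshold and coefficient estimates quoted in the coefficient interpretation of this report; the function and variable names are hypothetical, and the coding of love and work as strings is an assumption made for readability.

```python
import math

# Estimates quoted in this report (thresholds theta_j and coefficients beta).
THRESHOLDS = [0.9121, 2.9392, 8.5585]  # 1-4|5-6, 5-6|7-8, 7-8|9-10
BETA = {"love3": 4.033, "work3": 1.871, "work4-5": 3.448}
LEVELS = ["1-4", "5-6", "7-8", "9-10"]


def inv_logit(x):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-x))


def linear_predictor(love, work):
    """X^T beta, with love=1-2 and work=1-2 as the baseline levels."""
    eta = 0.0
    if love == "3":
        eta += BETA["love3"]
    if work == "3":
        eta += BETA["work3"]
    elif work == "4-5":
        eta += BETA["work4-5"]
    return eta


def happiness_probabilities(love, work):
    """Return P(y = level) for each collapsed happiness level."""
    eta = linear_predictor(love, work)
    # gamma_j = P(y <= j) = logit^{-1}(theta_j - X^T beta); the last is 1.
    cumulative = [inv_logit(theta - eta) for theta in THRESHOLDS] + [1.0]
    probabilities = {}
    previous = 0.0
    for level, gamma in zip(LEVELS, cumulative):
        probabilities[level] = gamma - previous
        previous = gamma
    return probabilities


for love, work in [("1-2", "1-2"), ("3", "4-5")]:
    probs = happiness_probabilities(love, work)
    print(love, work, {k: round(v, 4) for k, v in probs.items()})
```

Differencing adjacent cumulative probabilities is what turns P(y ≤ j) into P(y = j), which is the quantity compared against observed proportions later in the report.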
We will interpret the coefficients in terms of log odds and odds ratios:
Intercepts
o 1-4 | 5-6: An individual who feels the lowest level of love and belonging (love=1-2)
and dislikes/is looking for a new job (work=1-2) has a log odds of being at
most at level 1-4 of happiness of 0.9121.
o 5-6 | 7-8: An individual who feels the lowest level of love and belonging (love=1-2)
and dislikes/is looking for a new job (work=1-2) has a log odds of being at
most at level 5-6 of happiness of 2.9392.
o 7-8 | 9-10: An individual who feels the lowest level of love and belonging
(love=1-2), and dislikes/is looking for a new job (work =1-2) has a log odds of
being at most at level 7-8 of happiness of 8.5585.
LOVE3: Holding work constant, for a specific level of happiness, the log odds of being
at most that happy is 4.033 lower for people who have a higher sense of love and
belonging (love=3) than for those who have a lower sense of love and belonging (love=1-2).
In other words, for a specific level of happiness, holding work constant, the odds of a
person in love level 1-2 being at most in that specific level of happiness are $e^{4.033}$
times the odds for a person in love level 3. In simpler terms, this means that a person who
has a higher sense of love and belonging has a higher probability of being in a higher
level of happiness.
WORK3: Holding love constant, for a specific level of happiness, the log odds of being
at most that happy is 1.871 less for those who are indifferent about their current job
(work=3) than for those who either dislike or are looking for a new job (work=1-2). In
other words, for a specific level of happiness, holding love constant, the odds of a person
in work level 1-2 being at most in that specific level of happiness are $e^{1.871}$ times the
odds for a person in work level 3. In simpler terms, this means that a person who is
indifferent about their current job has a higher probability of being in a higher level of
happiness than a person who is looking for a new job/dislikes their job.
WORK4-5: Holding love constant, for a specific level of happiness, the log odds of being
at most that happy is 3.448 less for those who enjoy their job (work=4-5) than for those who
dislike their job or are looking for new employment (work=1-2). In other words, for a
specific level of happiness, holding love constant, the odds of a person in work level 1-2
being at most in that specific level of happiness are $e^{3.448}$ times the odds for a person in
work level 4-5. In simpler terms, a person who enjoys their job has a higher probability of
being in a higher happiness level than a person who dislikes or is looking for a new job.
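The odds ratios quoted in these interpretations can be evaluated directly. The sketch below exponentiates the three reported coefficients (giving roughly 56, 6.5, and 31) and also checks the defining feature of the proportional odds model: the ratio of odds between two groups is the same at every threshold. The helper names are hypothetical.

```python
import math

# Threshold and coefficient estimates quoted in this report.
THRESHOLDS = [0.9121, 2.9392, 8.5585]
COEFFICIENTS = {"love3": 4.033, "work3": 1.871, "work4-5": 3.448}


def odds_at_most(theta, eta):
    """Odds of being at most in category j: exp(theta_j - X^T beta)."""
    return math.exp(theta - eta)


def odds_ratio(beta):
    """Odds ratio of the baseline group versus the group with coefficient beta."""
    return math.exp(beta)


for name, beta in COEFFICIENTS.items():
    print(f"{name}: baseline odds are {odds_ratio(beta):.2f} times larger")
    for theta in THRESHOLDS:
        # Baseline (eta = 0) versus the non-baseline level (eta = beta).
        ratio = odds_at_most(theta, 0.0) / odds_at_most(theta, beta)
        # The threshold cancels, so the ratio does not depend on j.
        assert math.isclose(ratio, odds_ratio(beta))
```

Because the threshold cancels in the ratio, a single coefficient describes the shift across all happiness levels, which is why each predictor level gets one interpretation rather than three.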
Significance of Coefficients
We will now test the statistical significance of the coefficients. All tests in this paper will
be conducted at a significance level of 0.05.
Love3
$H_0: \beta_{\text{Love3}} = 0$
$H_1: \beta_{\text{Love3}} \neq 0$
Test statistic: t = 3.204245
P-value: 0.00135417
The conclusion from this test is that we reject the null and say that this coefficient is
statistically significant. The coefficient being statistically different from zero means that, holding
work constant, there is a significant difference in the log odds of being at most in a specific
happiness level between individuals who feel more loved and less loved.
Work3
$H_0: \beta_{\text{Work3}} = 0$
$H_1: \beta_{\text{Work3}} \neq 0$
Test statistic: t = 1.688251
P-value: 0.09136302
The conclusion from this test is that we fail to reject the null and conclude that the coefficient
is not statistically significant. This means that there is not a significant difference in the log odds
of being at most in a specific happiness level between individuals who are looking for a new
job/dislike their job and individuals who are indifferent about their job.
Work4-5
$H_0: \beta_{\text{Work4-5}} = 0$
$H_1: \beta_{\text{Work4-5}} \neq 0$
Test statistic: t = 2.945382
P-value: 0.003225561
In this test, we reject the null and conclude that this coefficient is statistically significant.
This means that there is a significant difference in the log odds of being at most in a specific
happiness level between individuals who dislike/are looking for a new job versus those who
actually enjoy their jobs.
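The reported p-values appear consistent with comparing each test statistic to a standard normal distribution in a two-sided test. That reference distribution is an assumption on my part, since the report does not state it; the sketch below shows the calculation under that assumption, with a hypothetical helper name.

```python
import math

ALPHA = 0.05

# Test statistics reported above for each coefficient.
TEST_STATISTICS = {"Love3": 3.204245, "Work3": 1.688251, "Work4-5": 2.945382}


def two_sided_p_value(statistic):
    """Two-sided p-value under a standard normal reference distribution.

    Uses the identity 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    """
    return math.erfc(abs(statistic) / math.sqrt(2.0))


for name, statistic in TEST_STATISTICS.items():
    p_value = two_sided_p_value(statistic)
    decision = "reject H0" if p_value < ALPHA else "fail to reject H0"
    print(f"{name}: p = {p_value:.6f} -> {decision}")
```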
Goodness of Fit
We will now conduct a goodness of fit test of our model. Asymptotically the deviance
has a chi-square distribution. Although the size of our dataset is not ideal, we will continue with
the test anyway:
$H_0$: Current model has an adequate fit
$H_1$: Saturated model
Test statistic: $\chi^2 = \text{Deviance}_{\text{Current Model}} - \text{Deviance}_{\text{Saturated Model}} = 57.36982 - 0 = 57.36982$
Degrees of freedom: $DF_{\text{Current Model}} - DF_{\text{Saturated Model}} = (n-p) - (n-n) = (39-6) - (39-39) = 33$
P-value: 0.005355549
From the goodness of fit test, we strongly reject the null. Unfortunately this means that
the model we fit is not sufficient in predicting the happiness level of a student based on love and
work. This is highly likely due to the small sample size of the dataset.
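For readers who want to check the p-value without a statistics package, the upper tail of the chi-square distribution can be approximated numerically. The sketch below integrates the chi-square density with composite Simpson's rule; this is a stand-in for a library routine, not the method used to produce the number in this report, and the function names are hypothetical.

```python
import math


def chi_square_pdf(x, df):
    """Chi-square density, computed on the log scale for stability (df > 2)."""
    if x <= 0.0:
        return 0.0
    half = df / 2.0
    log_pdf = (
        (half - 1.0) * math.log(x)
        - x / 2.0
        - half * math.log(2.0)
        - math.lgamma(half)
    )
    return math.exp(log_pdf)


def chi_square_upper_tail(statistic, df, intervals=2000):
    """Approximate P(X > statistic) as 1 minus a Simpson's-rule CDF on [0, statistic]."""
    if intervals % 2:
        intervals += 1  # Simpson's rule needs an even number of intervals
    step = statistic / intervals
    total = chi_square_pdf(0.0, df) + chi_square_pdf(statistic, df)
    for i in range(1, intervals):
        weight = 4.0 if i % 2 else 2.0
        total += weight * chi_square_pdf(i * step, df)
    return 1.0 - total * step / 3.0


deviance = 57.36982
degrees_of_freedom = 39 - 6  # (n - p) - (n - n) = 33
p_value = chi_square_upper_tail(deviance, degrees_of_freedom)
print(f"p-value = {p_value:.6f}")
```

The result falls below the 0.05 significance level used throughout, in line with the rejection reported above.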
Prediction versus Observations
In the following chart, we plotted the predictions for the probability of being in each level
of happiness for each combination of love and work level, and plotted the actual observed
proportions from our data. Consistent with our goodness of fit test, a majority of our predictions
were far from the observed proportions.
Conclusion
From the beginning there were quite a few issues with this dataset. The small size of the
dataset, coupled with the relatively large quantity of groups that each person could be placed in
created by the various combinations of the levels of predictors and response made our dataset
extremely sparse. Even after combining levels of the predictors and response, we still ended up
with groups that contained either few or no observations. As seen in the chart below under
“Freq”, there were 24 total possible categories but 10 of them contained no observations.
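The count of 24 categories follows from crossing the collapsed levels of happiness, love, and work. A short sketch of that arithmetic, using only the figures stated in this report (39 observations, 10 empty cells) and hypothetical variable names:

```python
from itertools import product

HAPPY_LEVELS = ["1-4", "5-6", "7-8", "9-10"]
LOVE_LEVELS = ["1-2", "3"]
WORK_LEVELS = ["1-2", "3", "4-5"]

# Every combination of happiness, love, and work level is one cell.
cells = list(product(HAPPY_LEVELS, LOVE_LEVELS, WORK_LEVELS))

n_observations = 39  # sample size stated in the report
n_empty_cells = 10   # empty cells stated in the report

n_cells = len(cells)
n_nonempty_cells = n_cells - n_empty_cells
average_per_cell = n_observations / n_cells

print(f"{n_cells} cells, {n_nonempty_cells} non-empty, "
      f"{average_per_cell:.3f} observations per cell on average")
```

With fewer than two observations per cell on average, empty cells are close to unavoidable at this sample size.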
Unfortunately, I would hold the sparsity/small sample size of our data set responsible for the lack
of fit we discovered in our model. To improve the model we need to collect more observations
and perhaps more relevant predictors such as grade point average or state of health. Thus I would
not rely on the specific numbers we obtained in our model to make any real predictions.
Nonetheless, the qualitative conclusions that we can draw from this model can still be
valuable. Making a lot of money is the end goal in a lot of people’s lives; however, our model
supports the cliché that “money doesn’t buy you happiness.” Money is not a significant
factor in determining happiness, both in our model and in our lives. In addition, whether or not
one has satisfactory sex is not important in determining happiness either. An individual could
have good sex, but still lack the emotional connection necessary to fully enjoy the sex. Thus it
makes sense that the variable love was significant in our model. The variable love represented an
individual’s feeling of love and belonging with their friends, family, and community. Knowing
you have a group of people that you can trust and rely on is invaluable. Money cannot buy you
that type of security. In addition, it makes sense that the variable work was significant in our
model. An individual generally spends 8 hours a day at their job. That’s half of the time we are
awake each day. Thus whether or not one enjoys one’s job has a considerable effect on a person’s mood.
An individual could have a very high paying job, but the euphoria of depositing a paycheck does
not last long enough to dull the pain of an unstimulating, monotonous, and unfulfilling job.
Receiving high marks in school, making enough money to live comfortably, and having
satisfying sex are all important parts of life. However, we should not let the pursuit of these
things hinder us from the things that truly make us happy: friends, family, and community.
Especially for those of us about to graduate college, we should keep this in mind when we are
choosing between the career we want to pursue, and the career that pays a lot. If these two things
do not coincide, then it may be more valuable to us to take the job that we would enjoy more.
But we should bear in mind the most crucial factor in determining emotional state:
sample size. Having an adequate sample size has a 100% chance of elevating a statistician’s
level of happiness.