- 1. 1: George and McCulloch (1993) "Variable Selection via Gibbs Sampling" JASA, 88, 881-889
Kimberly Nguyen
MA 576: Project
Due: Friday 5/29/16

The Pursuit of Happiness and Statistics

Introduction

Everything we do in life is for the pursuit of happiness. We spend countless nights cramming for Professor Carvalho’s exams, recopying his notes, and attending every one of his office hours, all while dreaming about graduating college with a high grade point average and landing a high-paying job. But is this how we achieve happiness? Is the isolation we feel from friends and family while we are studying in Mugar actually hindering our happiness? If we dedicated some of that time to having more sex, could we elevate our emotional state? And once we are done with college, how do we choose between careers that we actually enjoy and careers that pay well? Which of these factors truly determine one’s state of happiness?

Based on a survey of 39 employed students in an MBA class at the University of Chicago, these are the questions we will attempt to answer. A total of five variables were collected from these students. The response we are considering is level of happiness, measured on a 10-point scale with 1 representing a suicidal state, 5 representing a state of just “surviving life”, and 10 representing a euphoric state. Four potential predictors were collected: money, sex, love, and work. Money is a continuous variable measuring family income in thousands of dollars. Sex refers not to gender but to a 0/1 dummy variable where 1 represents a satisfactory level of sexual activity. Love is a factor variable ranging from 1 to 3 that measures a student’s feeling of belonging in the context of family, friends, and community: 1 indicates a feeling of loneliness and isolation, 2 indicates the student has a few secure relationships, and 3 indicates a deep feeling of belonging.
Lastly, work is a factor variable ranging from 1 to 5, with 1 indicating the student is seeking other employment, 3 indicating the student is indifferent about their job, and 5 indicating that the student enjoys their job (see footnote 1).

In real life, I believe all four of these predictors are important contributors to one’s emotional state, so my preliminary expectation is that they will all be statistically significant in our model. In the context of college students, these factors are even more of a priority. Students are still young enough to be financially dependent on their families, and thus the money variable, which measures their family’s income, could still be a significant factor in these
- 2. students’ lives. Love measures the social aspect of one’s life, which is especially important in the college years, when one is developing the friendships one will have for the rest of one’s life. Work is where these students spend a significant amount of time, and obtaining a good job is the reason students are even attending college, so whether or not they enjoy their job should have a significant effect on their state of happiness. If I were to drop any variable, it would be sex, because having satisfactory sex is not necessarily a priority shared among all college students. We will allow the data to verify or disprove these assumptions.

Data Clean Up

Due to the limited sample size, we had to do a bit of housecleaning with our data. Since we had so many variables and levels in our predictors and only 39 observations, we had to collapse the levels of most of our predictors and of our response to avoid sparsity. The response, happiness, went from having levels 1 through 10 to having 4 levels: 1-4, 5-6, 7-8, and 9-10. Money and sex were not changed. Love was reduced from 3 levels to 2 levels: 1-2 and 3. Lastly, work went from a scale of 1-5 to the levels 1-2, 3, and 4-5. The levels still represent the same aspect of their respective variables. For example, in work, 1-2 means the student is either looking for a new job or currently dislikes their job, 3 means the student is indifferent, and 4-5 means the student enjoys their job.

Preliminary Data Analysis

Looking at the pairwise graph between all the variables, we can gain some idea about possible correlation between the predictors. Simply from observing the graph, there does not seem to be correlation between any of the predictors. This makes sense logically, because sex, family income, love, and work are all very different components of life that generally would not have an effect on one another.
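The recoding described under Data Clean Up can be sketched in a few lines. This is only an illustration of the stated cut points: the function names and the example respondent are hypothetical, and the source does not show how the recoding was actually carried out.

```python
# Hypothetical helper names; only the cut points come from the text.

def collapse_happiness(score: int) -> str:
    """Map a 1-10 happiness score to one of the four ordered bins."""
    if not 1 <= score <= 10:
        raise ValueError("happiness score must be between 1 and 10")
    if score <= 4:
        return "1-4"
    if score <= 6:
        return "5-6"
    if score <= 8:
        return "7-8"
    return "9-10"


def collapse_love(level: int) -> str:
    """Merge love levels 1 and 2; keep level 3 on its own."""
    if level not in (1, 2, 3):
        raise ValueError("love level must be 1, 2, or 3")
    return "3" if level == 3 else "1-2"


def collapse_work(level: int) -> str:
    """Merge work levels into 1-2, 3, and 4-5."""
    if level not in (1, 2, 3, 4, 5):
        raise ValueError("work level must be between 1 and 5")
    if level <= 2:
        return "1-2"
    if level == 3:
        return "3"
    return "4-5"


# A made-up respondent, used only to show the mapping.
example = {"happy": 7, "love": 2, "work": 5}
recoded = {
    "happy": collapse_happiness(example["happy"]),
    "love": collapse_love(example["love"]),
    "work": collapse_work(example["work"]),
}
print(recoded)
```

After recoding there are 4 x 2 x 3 = 24 possible combinations of happiness, love, and work levels, which is still a lot of cells for 39 observations.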
- 3. We will start by looking at the mosaic plots between the factor variables and our response. There seems to be no difference in happiness level whether or not a student is having satisfactory sex. In fact, the effect of satisfactory sex on happiness seems to be counterintuitive: there are more people who have satisfactory sex in the lower levels of happiness than people who don’t have satisfactory sex. In the mosaic plot between happiness and love, it is apparent that those who feel most loved experience a higher level of happiness than those who feel less loved. Most of the people who are in love level 3 fall into levels 7-8 and 9-10 of happiness. There also seems to be a substantial difference in level of happiness between the people who enjoy their jobs and those who don’t. A larger proportion of the people who don’t enjoy their jobs fall into the 1-4 level of happiness.
- 4. And a larger proportion of people who enjoy their jobs fall into the 7-8 and 9-10 levels of happiness.

Disqualification of Linear Model

Running a linear model here would not make sense for multiple reasons. When running a linear model, we assume that our response is a continuous and normally distributed random variable. Here, however, our response is categorical, not continuous. Moreover, the response is ordinal, i.e. a categorical variable whose categories have an order, and this order is important information that should not be ignored. R does not even allow you to run a linear model on ordinal data. If we ignore all this, revert to the response on its original scale from 1 to 10, and treat it as continuous, we can run a linear model. But even then, the assumption of a null plot of residuals is violated: the plot fans out to the left and right from the middle.
- 5. Fitting the Model

Fortunately, through the power of generalized linear models, we can correctly model ordinal data using a proportional odds model. After fitting a proportional odds model with happy as the response and the 4 predictors (money, sex, love, and work), we put that model through AIC-based model selection and ended up with a model that predicts the probability of being in each level of happiness based on love and work. As I expected, sex was not a valuable variable in predicting the probability of being in a certain level of happiness. This agrees with our previous discussion, where we claimed that having satisfactory sex is not a priority in many people’s lives, especially considering these are MBA students at a top-ranking university who have a lot more to worry about. Surprisingly, the amount of income a student’s family earns also does not play a significant part in determining happiness level. Looking back, this makes sense: these are MBA students, not undergraduates, so they have probably already separated themselves financially from their families. They are also employed MBA students, which further weakens their connection to their families’ incomes. We did not attempt to fit any interactions because, qualitatively, it would not make sense for any of the predictors to interact with each other. As previously mentioned in the preliminary
- 6. data analysis, satisfactory sex, family income, job satisfaction, and feeling of belonging generally do not have an effect on each other. In addition, we had no reason to believe we needed to transform any variables, so no transformations were attempted. Any transformation would have complicated the interpretation of the model.

Coefficient Interpretation

In a proportional odds model, the probability of being at most in category j is:

$\gamma_j = P(y \le j) = \operatorname{logit}^{-1}(\theta_j - X^T \beta)$

Thus, the log odds of being at most in category j, relative to being above category j, is:

$\operatorname{logit}(\gamma_j) = \log\left(\frac{\gamma_j}{1 - \gamma_j}\right) = \theta_j - X^T \beta$

From this we can derive that the odds ratio is:

$OR_j(X_1, X_2) = \frac{odds_j(X_1)}{odds_j(X_2)} = e^{(X_2 - X_1)^T \beta}$

for all categories j, and thus

$odds_j(X_1) = odds_j(X_2) \cdot e^{(X_2 - X_1)^T \beta}$

We will interpret the coefficients in terms of log odds and odds ratios:

Intercepts
o 1-4 | 5-6: An individual who feels the lowest level of love and belonging (love=1-2) and dislikes their job or is looking for a new one (work=1-2) has a log odds of being at most at level 1-4 of happiness of 0.9121.
o 5-6 | 7-8: An individual who feels the lowest level of love and belonging (love=1-2) and dislikes their job or is looking for a new one (work=1-2) has a log odds of being at most at level 5-6 of happiness of 2.9392.
- 7. o 7-8 | 9-10: An individual who feels the lowest level of love and belonging (love=1-2) and dislikes their job or is looking for a new one (work=1-2) has a log odds of being at most at level 7-8 of happiness of 8.5585.

Love3: Holding work constant, for a specific level of happiness, the log odds of being at most that happy is 4.033 lower for people who have a higher sense of love and belonging (love=3) than for those who have a lower sense of love and belonging (love=1-2). In other words, for a specific level of happiness, holding work constant, the odds of a person in love level 1-2 being at most in that level of happiness are $e^{4.033}$ (roughly 56) times the odds for a person in love level 3. In simpler terms, a person who has a higher sense of love and belonging has a higher probability of being in a higher level of happiness.

Work3: Holding love constant, for a specific level of happiness, the log odds of being at most that happy is 1.871 lower for those who are indifferent about their current job (work=3) than for those who either dislike their job or are looking for a new one (work=1-2). In other words, for a specific level of happiness, holding love constant, the odds of a person in work level 1-2 being at most in that level of happiness are $e^{1.871}$ (roughly 6.5) times the odds for a person in work level 3. In simpler terms, a person who is indifferent about their current job has a higher probability of being in a higher level of happiness than a person who dislikes their job or is looking for a new one.

Work4-5: Holding love constant, for a specific level of happiness, the log odds of being at most that happy is 3.448 lower for those who enjoy their job (work=4-5) than for those who dislike their job or are looking for new employment (work=1-2).
In other words, for a specific level of happiness, holding love constant, the odds of a person in work level 1-2 being at most in that level of happiness are $e^{3.448}$ (roughly 31) times the odds for a person in work level 4-5. In simpler terms, a person who enjoys their job has a higher probability of being in a higher happiness level than a person who dislikes their job or is looking for a new one.

Significance of Coefficients

We will now test the statistical significance of the coefficients. All tests in this paper are conducted at a significance level of .05.
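The coefficient interpretation above can be checked numerically. The following is a minimal sketch, assuming the reported estimates plug into $P(y \le j) = \operatorname{logit}^{-1}(\theta_j - X^T \beta)$ with love=1-2 and work=1-2 as the baseline levels; the function and variable names are illustrative and do not come from the original analysis.

```python
import math

# Estimates as reported in the text.
THETA = [0.9121, 2.9392, 8.5585]   # cut points 1-4|5-6, 5-6|7-8, 7-8|9-10
BETA = {"love3": 4.033, "work3": 1.871, "work4-5": 3.448}
LEVELS = ["1-4", "5-6", "7-8", "9-10"]


def inv_logit(z: float) -> float:
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))


def linear_predictor(love: str, work: str) -> float:
    """X^T beta; any label other than the named levels counts as baseline."""
    eta = BETA["love3"] if love == "3" else 0.0
    if work == "3":
        eta += BETA["work3"]
    elif work == "4-5":
        eta += BETA["work4-5"]
    return eta


def category_probs(love: str, work: str) -> dict:
    """Probability of each happiness level for one love/work combination."""
    eta = linear_predictor(love, work)
    # Cumulative probabilities P(y <= j); the last category closes at 1.
    cumulative = [inv_logit(theta - eta) for theta in THETA] + [1.0]
    probs = [cumulative[0]]
    for j in range(1, len(cumulative)):
        probs.append(cumulative[j] - cumulative[j - 1])
    return dict(zip(LEVELS, probs))


# Odds ratios of being "at most level j" for baseline versus each level.
odds_ratios = {name: math.exp(b) for name, b in BETA.items()}
print({name: round(value, 2) for name, value in odds_ratios.items()})

for love in ("1-2", "3"):
    for work in ("1-2", "3", "4-5"):
        probs = category_probs(love, work)
        print(love, work, {k: round(v, 3) for k, v in probs.items()})
```

Because the same $\beta$ is subtracted from every cut point, the odds ratios do not depend on j; that is the proportional odds assumption in action.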
- 8. Love3
$H_0: \beta_{\text{Love3}} = 0$
$H_1: \beta_{\text{Love3}} \neq 0$
Test statistic: t = 3.204245
P-value: 0.00135417
The conclusion from this test is that we reject the null and say that this coefficient is statistically significant. The coefficient being statistically different from zero means that, holding work constant, there is a significant difference in the log odds of being at most in a specific happiness level between individuals who feel more loved and those who feel less loved.

Work3
$H_0: \beta_{\text{Work3}} = 0$
$H_1: \beta_{\text{Work3}} \neq 0$
Test statistic: t = 1.688251
P-value: 0.09136302
The conclusion from this test is that we fail to reject the null and conclude that this coefficient is not statistically significant. This means that, holding love constant, there is not a significant difference in the log odds of being at most in a specific happiness level between individuals who dislike their job or are looking for a new one and individuals who are indifferent about their job.

Work4-5
$H_0: \beta_{\text{Work4-5}} = 0$
$H_1: \beta_{\text{Work4-5}} \neq 0$
Test statistic: t = 2.945382
P-value: 0.003225561
In this test, we reject the null and conclude that this coefficient is statistically significant. This means that, holding love constant, there is a significant difference in the log odds of being at most in a specific happiness level between individuals who dislike their job or are looking for a new one and those who actually enjoy their jobs.
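As a quick check on the three tests above, the two-sided p-values can be recomputed from the reported test statistics. This sketch assumes the statistics are referred to a standard normal distribution, which the source does not state explicitly; the helper name is illustrative.

```python
import math


def two_sided_p(stat: float) -> float:
    """Two-sided p-value for a statistic under a standard normal reference."""
    return math.erfc(abs(stat) / math.sqrt(2.0))


ALPHA = 0.05

# (test statistic, p-value) pairs as reported in the text.
reported = {
    "Love3": (3.204245, 0.00135417),
    "Work3": (1.688251, 0.09136302),
    "Work4-5": (2.945382, 0.003225561),
}

decisions = {}
for name, (stat, p_reported) in reported.items():
    p = two_sided_p(stat)
    decisions[name] = "reject H0" if p < ALPHA else "fail to reject H0"
    print(f"{name}: recomputed p = {p:.6f}, reported p = {p_reported}, {decisions[name]}")
```

Under that assumption the decisions at the .05 level come out the same as in the text: Love3 and Work4-5 are significant, Work3 is not.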
- 9. Goodness of Fit

We will now conduct a goodness of fit test of our model. Asymptotically, the deviance has a chi-square distribution. Although the size of our dataset is not ideal, we will continue with the test anyway:

$H_0$: The current model has an adequate fit
$H_1$: The saturated model is needed

Test statistic: $\chi^2 = \text{Deviance}_{\text{Current Model}} - \text{Deviance}_{\text{Saturated Model}} = 57.36982 - 0 = 57.36982$
Degrees of freedom: $DF_{\text{Current Model}} - DF_{\text{Saturated Model}} = (n - p) - (n - n) = (39 - 6) - (39 - 39) = 33$
P-value: 0.005355549

From the goodness of fit test, we strongly reject the null. Unfortunately, this means that the model we fit is not adequate for predicting the happiness level of a student based on love and work. This is very likely due to the small sample size of the dataset.

Prediction versus Observations

In the following chart, we plotted the predicted probability of being in each level of happiness for each combination of love and work level, together with the observed proportions from our data. In agreement with our goodness of fit test, a majority of our predictions were far from the observed proportions.
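The goodness-of-fit p-value reported above can be reproduced without a statistics library. The sketch below uses the standard finite-sum closed forms for the upper tail of a chi-square distribution with integer degrees of freedom; the function name `chi2_sf` is illustrative and is not taken from the original analysis.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable, integer df."""
    if x < 0 or df < 1:
        raise ValueError("x must be non-negative and df a positive integer")
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term, total = 1.0, 0.0
        for r in range(df // 2):
            total += term
            term *= x / (2 * (r + 1))
        return math.exp(-x / 2.0) * total
    # Odd df: 2*Q(chi) + 2*phi(chi) * sum_{r=1}^{(df-1)/2} chi^(2r-1) / (2r-1)!!
    chi = math.sqrt(x)
    density = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)  # phi(chi)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return math.erfc(chi / math.sqrt(2.0)) + 2.0 * density * total


n, p = 39, 6                 # observations; 3 cut points plus 3 slopes
deviance = 57.36982          # residual deviance reported in the text
df = (n - p) - (n - n)       # current-model df minus saturated-model df
p_value = chi2_sf(deviance, df)
print(df, p_value)
```

The chi-square reference distribution is only an asymptotic approximation, so with 39 observations spread over sparse cells this p-value should be read with caution, as the text itself notes.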
- 10. Conclusion

From the beginning there were quite a few issues with this dataset. The small size of the dataset, coupled with the relatively large number of groups that each person could be placed in (created by the various combinations of the levels of the predictors and the response), made our dataset extremely sparse. Even after combining levels of the predictors and response, we still ended up with groups that contained few or no observations. As seen in the chart below under “Freq”, there were 24 possible categories in total, but 10 of them contained no observations.
- 11. Unfortunately, I would hold the sparsity and small sample size of our dataset responsible for the lack of fit we discovered in our model. To improve the model we need to collect more observations and perhaps more relevant predictors, such as grade point average or state of health. Thus I would not rely on the specific numbers we obtained in our model to make any real predictions. Nonetheless, the qualitative conclusions that we can draw from this model can still be valuable. Making a lot of money is the end goal in a lot of people’s lives; however, our model supports the cliché “money doesn’t buy you happiness”. Money is not a significant factor in determining happiness, in both our model and in our lives. In addition, whether or not one has satisfactory sex is not important in determining happiness either. An individual could have good sex but still lack the emotional connection necessary to fully enjoy it. Thus it makes sense that the variable love was significant in our model. The variable love represented an individual’s feeling of love and belonging with their friends, family, and community. Knowing you have a group of people that you can trust and rely on is invaluable. Money cannot buy you that type of security. It also makes sense that the variable work was significant in our model. An individual generally spends 8 hours a day at their job. That’s half of the time we are awake each day. Thus whether or not one enjoys one’s job has a substantial effect on one’s mood. An individual could have a very high-paying job, but the euphoria of depositing a paycheck does not last long enough to dull the pain of an unstimulating, monotonous, and unfulfilling job. Receiving high marks in school, making enough money to live comfortably, and having satisfying sex are all important parts of life.
However, we should not let the pursuit of these things hinder us from the things that truly make us happy: friends, family, and community. Especially for those of us about to graduate college, we should keep this in mind when we are choosing between the career we want to pursue, and the career that pays a lot. If these two things do not coincide, then it may be more valuable to us to take the job that we would enjoy more. But we should bear in mind the most crucial factor in determining emotional state: sample size. Having an adequate sample size has a 100% chance of elevating a statistician’s level of happiness.