This document discusses a case study analyzing gender trends in baby names:
1. It focuses on analyzing a smaller subset of the most popular names over time.
2. The document computes summary statistics like the proportion of babies given each name that were boys or girls and the number of years each name was in the top 1000 for boys and girls.
3. Names are classified as either having dual or separate gender usage over time based on the ratio of boy to girl name assignments each year.
1. Stat405 ddply case study
Hadley Wickham
Tuesday, 5 October 2010
2. 1. Homework
2. Project
3. Case study: gender trends
   1. Focus on smaller subset
   2. Develop summary statistic
   3. Classify names
3. Homework
Explain your code!
Comments should explain why not what
Check your indenting - if it’s not indented
correctly, it’s very hard to read
4. # Really bad:
# Set x equal to ten.
x <- 10
# Bad:
# Figure out if all windows are bars
allbars <- all(windows %in% c("B", "BB", "BBB"))
# Better:
# all() / any() combination used to prevent errors in the
# case of three DDs.
# Better:
# Check to see if DD will create a triple
# if (length(unique(windows)) == 2)
5. # Best (but still not perfect):
## DD wild 4 cases and subcases
#### 1c) 3 DD's
#### 2c) 2 DD's
#### the prize is quadrupled
#### 3c) 1 DD
#### prize doubled
## 3c.1) 1 DD and 2 of a kind
## 3c.2) 1 DD for any bars
## 3c.3) 1 DD for Cherries
#### 4c) NO DD's
## 4c.1) Just any bar
## 4c.2) Just cherries
7. Tips from last year
Proofread - far too many projects had
obvious mistakes.
Include a section on the data, giving a quick
English run-down of what you did to the
data. Only the appendix should contain technical details.
Presentation matters - you should be proud
of your work, so take a little time to put it in a
nice wrapper.
8. Easy ways to lose
points
Overplotting
Code style violations
Forgetting about the denominator of a
ratio
9. Team Assessment
Your individual grades will be weighted by
effort.
Each team member should turn in a
(confidential) team evaluation sheet.
Don’t forget to assess yourself.
11. Questions
For names that are used for both boys
and girls, how has usage changed?
Can we use names that clearly have the
incorrect sex to estimate error rates over
time?
12. Getting started
options(stringsAsFactors = FALSE)
library(plyr)
library(ggplot2)
bnames <- read.csv("baby-names2.csv.bz2")
14. First task
Too many names (~7000): need to identify
smaller subset (~100) likely to be
interesting.
Outside of class, would look at more, but
starting with a subset for easier
exploration is a good idea.
For this task, what attributes of a name are
likely to be useful?
15. Your turn
For each name, calculate the total proportion
of boys, the total proportion of girls, the
number of years the name was in the top
1000 as a girl's name, and the number of years
the name was in the top 1000 as a boy's
name.
Hint: Start with a single name and figure out
how to solve the problem. Hint: Use
summarise
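The first hint can be sketched on a toy data frame. This is only an illustration in base R: the data frame `toy`, its values, and the name "Casey" are made up, and the layout (one row per year, name and sex, with a `prop` column) is assumed to match `bnames`.

```r
# Toy data in the same long layout as bnames (all values are invented).
toy <- data.frame(
  year = c(1990, 1990, 1991, 1991, 1991),
  name = c("Casey", "Casey", "Casey", "Casey", "Mary"),
  sex  = c("boy", "girl", "boy", "girl", "girl"),
  prop = c(0.002, 0.001, 0.003, 0.001, 0.010),
  stringsAsFactors = FALSE
)

# Solve the problem for a single name first.
one <- subset(toy, name == "Casey")

one_summary <- data.frame(
  name    = "Casey",
  boys    = sum(one$prop[one$sex == "boy"]),   # total proportion as a boy's name
  boys_n  = sum(one$sex == "boy"),             # years listed as a boy's name
  girls   = sum(one$prop[one$sex == "girl"]),  # total proportion as a girl's name
  girls_n = sum(one$sex == "girl")             # years listed as a girl's name
)
one_summary
```

Once the expressions work for one name, the same expressions can be handed to summarise so that they are evaluated once per name.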
16. times <- ddply(bnames, "name", summarise,
  boys = sum(prop[sex == "boy"]),
  boys_n = sum(sex == "boy"),
  girls = sum(prop[sex == "girl"]),
  girls_n = sum(sex == "girl"),
  .progress = "text"  # Useful for slow operations
)
# But this is rather painful
17. # For this task, the data are much easier to work with
# if we put sex in columns instead of rows. We'll learn
# more about reshaping in a couple of weeks
# install.packages("reshape2")
library(reshape2)
bnames2 <- dcast(bnames, year + name ~ sex,
  value.var = "prop")
# No information unless we have both boys and
# girls for that name in that year
both <- subset(bnames2, !is.na(boy) & !is.na(girl))
dim(both)
head(both)
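To see what the long-to-wide step produces, here is a minimal sketch on a made-up data frame. It uses base R's merge() rather than dcast(), purely so that it runs without extra packages; the toy values and the names `toy`, `wide` and `both_toy` are assumptions for illustration.

```r
# Toy long-format data: one row per year, name and sex (values invented).
toy <- data.frame(
  year = c(1990, 1990, 1991, 1991),
  name = c("Casey", "Casey", "Casey", "Mary"),
  sex  = c("boy", "girl", "boy", "girl"),
  prop = c(0.002, 0.001, 0.003, 0.010),
  stringsAsFactors = FALSE
)

# Split by sex and rename prop so each sex gets its own column.
boys  <- subset(toy, sex == "boy",  select = c(year, name, prop))
girls <- subset(toy, sex == "girl", select = c(year, name, prop))
names(boys)[3]  <- "boy"
names(girls)[3] <- "girl"

# all = TRUE keeps name-years seen for only one sex, filled with NA.
wide <- merge(boys, girls, by = c("year", "name"), all = TRUE)

# Keep only name-years that appear for both sexes.
both_toy <- subset(wide, !is.na(boy) & !is.na(girl))
both_toy
```

Only Casey in 1990 survives the filter here, because the other name-years are missing either the boy or the girl proportion.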
18. Your turn
Summarise each name with the number
of years it has made the list for both boys
and girls, and the average proportion of
babies given that name.
Which names would you include for
further investigation?
19. both_sum <- ddply(both, "name", summarise,
years = length(name),
avg_usage = mean(boy + girl) / 2
)
# No point in looking at names that only appear once
both_sum <- subset(both_sum, years > 1)
qplot(years, avg_usage, data = both_sum)
20. # Now save our selections
selected_names <- subset(both_sum,
years > 20 & avg_usage > 0.005)$name
selected <- subset(both, name %in% selected_names)
nrow(selected) / nrow(both)
21. Your turn
Explore how the gender assignment of
these names has changed over time.
What is a good summary to use to
compare boy popularity to girl popularity?
22. qplot(year, boy - girl, data = selected,
geom = "line", group = name)
qplot(year, abs(boy - girl), data = selected,
geom = "line", group = name,
colour = sign(boy - girl))
qplot(year, boy / girl, data = selected,
geom = "line", group = name)
qplot(year, log10(boy / girl), data = selected,
geom = "line", group = name)
selected$lratio <- with(selected, log10(boy / girl))
qplot(lratio, name, data = selected)
qplot(lratio, reorder(name, lratio), data = selected)
qplot(abs(lratio), reorder(name, lratio),
data = selected)
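One reason the log ratio may be the more convenient summary is that it treats the two sexes symmetrically around zero, whereas the raw ratio squeezes every girl-dominated name into the interval between 0 and 1. A small sketch with made-up proportions (the vectors `boy` and `girl` below are illustrative, not taken from the data):

```r
# Three invented cases: mostly boys, mostly girls, evenly split.
boy  <- c(0.010, 0.001, 0.005)
girl <- c(0.001, 0.010, 0.005)

ratio  <- boy / girl          # 10, 0.1, 1
lratio <- log10(boy / girl)   # 1, -1, 0

data.frame(boy, girl, ratio, lratio)
```

A name ten times more common for boys and one ten times more common for girls sit at equal distances from zero on the log scale, which is also why abs(lratio) works as a measure of how one-sided a name is.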
23. Your turn
Compute the mean and range of lratio for
each name.
Plot and come up with cutoffs that you
think separate the two groups.
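One possible shape for this step, sketched on toy data with plyr. The data frame `toy`, the column names `mean_lratio` and `range_lratio`, the group labels, and especially the cutoff of 1 are all assumptions for illustration; a real cutoff should come from looking at the plot.

```r
library(plyr)

# Invented lratio values: "Casey" hovers near 0, "John" is almost always a boy.
toy <- data.frame(
  name   = rep(c("Casey", "John"), each = 3),
  year   = rep(c(1950, 1970, 1990), times = 2),
  lratio = c(0.2, 0.1, -0.1, 2.1, 2.0, 1.9),
  stringsAsFactors = FALSE
)

# Mean and range (max minus min) of lratio for each name.
rng <- ddply(toy, "name", summarise,
  mean_lratio  = mean(lratio),
  range_lratio = diff(range(lratio))
)

# Hypothetical cutoff: a mean log ratio within one order of magnitude
# of zero is treated as dual usage, anything else as separate usage.
rng$group <- ifelse(abs(rng$mean_lratio) < 1, "dual", "separate")
rng
```

With the real data, the same summary could be plotted first, for example mean against range, and the cutoff chosen by eye.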