Stat405: ddply case study

Hadley Wickham
Tuesday, 5 October 2010
1. Homework
2. Project
3. Case study: gender trends
   1. Focus on smaller subset
   2. Develop summary statistic
   3. Classify names


Homework

                    Explain your code!
                    Comments should explain why, not what
                    Check your indenting - if it’s not indented
                    correctly, it’s very hard to read




# Really bad:
# Set x equal to ten.
x <- 10

# Bad:
# Figure out if all windows are bars
allbars <- all(windows %in% c("B", "BB", "BBB"))

# Better:
# all() / any() combination used to prevent errors in the
# case of three DDs.

# Better:
# Check to see if DD will create a triple
# if (length(unique(windows)) == 2)

# Best (but still not perfect):

## DD wild 4 cases and subcases
#### 1c) 3 DD's
#### 2c) 2 DD's
#### 2c) 2 DD's
####     the prize is quadrupled
#### 3c) 1 DD
####     prize doubled
##       3c.1) 1 DD and 2 of a kind
##       3c.2) 1 DD for any bars
##       3c.3) 1 DD for Cherries
#### 4c) NO DD's
##       4c.1) Just any bar
##       4c.2) Just cherries


Project

Tips from last year
                    Proofread - far too many projects with
                    obvious mistakes.
                    Include a section on the data, giving a quick
                    English run-down of what you did to the
                    data. Only the appendix should contain
                    technical details.
                    Presentation matters - you should be proud
                    of your work, so take a little time to put it in a
                    nice wrapper.


Easy ways to lose points

                    Overplotting
                    Code style violations
                    Forgetting about the denominator of a
                    ratio




Team Assessment

                    Your individual grades will be weighted by
                    effort.
                    Each team member should turn in a
                    (confidential) team evaluation sheet.
                    Don’t forget to assess yourself.




Case study

Questions

                  For names that are used for both boys
                  and girls, how has usage changed?
                  Can we use names that clearly have the
                  incorrect sex to estimate error rates over
                  time?




Getting started

                options(stringsAsFactors = FALSE)
                library(plyr)
                library(ggplot2)

                bnames <- read.csv("baby-names2.csv.bz2")




First task
                    Too many names (~7000): need to identify
                    smaller subset (~100) likely to be
                    interesting.
                    Outside of class, would look at more, but
                    starting with a subset for easier
                    exploration is a good idea.

                      For this task, what attributes of a name are
                      likely to be useful?

Your turn
                    For each name, calculate the total proportion
                    of boys, the total proportion of girls, the
                    number of years the name was in the top
                    1000 as a girl's name, and the number of
                    years the name was in the top 1000 as a
                    boy's name.
                    Hint: Start with a single name and figure out
                    how to solve the problem. Hint: Use
                    summarise.
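
One way to follow the first hint is to work out the four summaries for a single name before handing the same expressions to summarise. The sketch below uses a small made-up data frame (`toy`, with invented values and hypothetical names) that mimics the `year`, `name`, `sex` and `prop` columns of bnames. It assumes each row is one name/sex/year entry from the top 1000, so counting rows counts years on the list. It is an illustration, not the course solution.

```r
# Toy stand-in for bnames; all values are invented for illustration.
toy <- data.frame(
  year = c(1990, 1991, 1990, 1991, 1991),
  name = c("Alex", "Alex", "Alex", "Alex", "Sam"),
  sex  = c("boy", "boy", "girl", "girl", "boy"),
  prop = c(0.004, 0.005, 0.001, 0.002, 0.003),
  stringsAsFactors = FALSE
)

# Step 1: pull out one name and solve the problem for it alone.
one <- subset(toy, name == "Alex")

# Step 2: total proportion and number of years on the list, by sex.
boys    <- sum(one$prop[one$sex == "boy"])
boys_n  <- sum(one$sex == "boy")
girls   <- sum(one$prop[one$sex == "girl"])
girls_n <- sum(one$sex == "girl")

print(c(boys = boys, boys_n = boys_n, girls = girls, girls_n = girls_n))
```

Once the expressions work for one name, the same right-hand sides (without the `one$` prefix) can be used as the summary columns for every name.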


times <- ddply(bnames, "name", summarise,
  boys = sum(prop[sex == "boy"]),
  boys_n = sum(sex == "boy"),
  girls = sum(prop[sex == "girl"]),
  girls_n = sum(sex == "girl"),
  .progress = "text"  # Useful for slow operations
)

# But this is rather painful




# For this task, the data are much easier to work with
# if we put sex in columns instead of rows. We'll learn
# more about reshaping in a couple of weeks.
# install.packages("reshape2")
library(reshape2)
bnames2 <- dcast(bnames, year + name ~ sex,
  value.var = "prop")

# No information unless we have both boys and
# girls for that name in that year
both <- subset(bnames2, !is.na(boy) & !is.na(girl))
dim(both)
head(both)
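
To see what the reshape does, here is a minimal sketch on a made-up four-row data frame (`toy`; values and names are invented). It assumes a current release of reshape2, where the argument naming the value column is spelled `value.var`. Name/year combinations that exist for only one sex come out as NA in the other column, which is why the subset step is needed.

```r
library(reshape2)

# Toy stand-in for bnames; all values are invented for illustration.
toy <- data.frame(
  year = c(1990, 1991, 1990, 1991),
  name = c("Alex", "Alex", "Alex", "Sam"),
  sex  = c("boy", "boy", "girl", "boy"),
  prop = c(0.004, 0.005, 0.001, 0.003),
  stringsAsFactors = FALSE
)

# One row per year/name, with one column of prop for each sex.
wide <- dcast(toy, year + name ~ sex, value.var = "prop")
print(wide)

# Keep only the rows where the name was on both lists that year.
both_toy <- subset(wide, !is.na(boy) & !is.na(girl))
print(both_toy)
```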

Your turn

                    Summarise each name with the number
                    of years it has made the list for both boys
                    and girls, and the average proportion of
                    babies given that name.
                    Which names would you include for
                    further investigation?



both_sum <- ddply(both, "name", summarise,
  years = length(name),
  avg_usage = mean(boy + girl) / 2
)

# No point in looking at names that only appear once
both_sum <- subset(both_sum, years > 1)

qplot(years, avg_usage, data = both_sum)




# Now save our selections

selected_names <- subset(both_sum,
  years > 20 & avg_usage > 0.005)$name

selected <- subset(both, name %in% selected_names)

nrow(selected) / nrow(both)




Your turn

                    Explore how the gender assignment of
                    these names has changed over time.
                    What is a good summary to use to
                    compare boy popularity to girl popularity?




qplot(year, boy - girl, data = selected,
  geom = "line", group = name)
qplot(year, abs(boy - girl), data = selected,
  geom = "line", group = name,
  colour = sign(boy - girl))

qplot(year, boy / girl, data = selected,
  geom = "line", group = name)
qplot(year, log10(boy / girl), data = selected,
  geom = "line", group = name)

selected$lratio <- with(selected, log10(boy / girl))
qplot(lratio, name, data = selected)
qplot(lratio, reorder(name, lratio), data = selected)
qplot(abs(lratio), reorder(name, lratio),
  data = selected)
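
A small worked example may help show why the plots move from the difference to the ratio and then to the log ratio. The proportions below are invented: the first name is ten times more common for boys, the second ten times more common for girls, and the third is evenly split.

```r
# Invented proportions for three hypothetical names.
boy  <- c(0.010, 0.001, 0.005)
girl <- c(0.001, 0.010, 0.005)

difference <- boy - girl         # size depends on overall popularity
ratio      <- boy / girl         # asymmetric: 10 versus 0.1
lratio     <- log10(boy / girl)  # symmetric around zero

print(data.frame(boy, girl, difference, ratio, lratio))
```

On the log10 scale, "ten times more boys" and "ten times more girls" sit the same distance from zero (1 and -1), and an even split is exactly 0, so boy-leaning and girl-leaning names can be compared on one axis.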

Your turn

                Compute the mean and range of lratio for
                each name.
                Plot and come up with cutoffs that you
                think separate the two groups.




rng <- ddply(selected, "name", summarise,
  diff = diff(range(lratio, na.rm = TRUE)),
  mean = mean(lratio, na.rm = TRUE)
)

qplot(diff, abs(mean), data = rng)
qplot(diff, abs(mean), data = rng, geom = "text",
  label = name)

rng$dual <- abs(rng$mean) < 2
arrange(rng, mean, dual)

selected <- join(selected, rng[c("name", "dual")])
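
The classify-and-join step can be checked on a tiny example. The sketch below assumes plyr is installed and uses made-up log ratios for two hypothetical names: one hovers around zero, the other is strongly girl-leaning. The cutoff of 2 is the one used on the slide; `by = "name"` is spelled out here only to make the join key explicit.

```r
library(plyr)

# Invented log ratios for two hypothetical names over three years.
toy <- data.frame(
  year   = rep(2000:2002, times = 2),
  name   = rep(c("Jordan", "Mary"), each = 3),
  lratio = c(0.2, 0.0, -0.3, -2.5, -2.7, -2.6),
  stringsAsFactors = FALSE
)

# Per-name spread and centre of the log ratio.
rng_toy <- ddply(toy, "name", summarise,
  diff = diff(range(lratio, na.rm = TRUE)),
  mean = mean(lratio, na.rm = TRUE)
)

# A mean log ratio within 2 of zero is treated as a dual-use name.
rng_toy$dual <- abs(rng_toy$mean) < 2

# Attach the flag to every yearly row; join() defaults to a left join.
toy <- join(toy, rng_toy[c("name", "dual")], by = "name")
print(toy)
```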


qplot(year, lratio, data = selected, geom = "line",
  group = name) + facet_wrap(~ dual)

qplot(year, lratio, data = subset(selected, dual),
  geom = "line") + facet_wrap(~ name)

qplot(year, boy / (boy + girl),
  data = subset(selected, dual), geom = "line") +
  facet_wrap(~ name)




Next time


                    Now that we’ve separated the two
                    groups, we’ll explore each in more detail.



