SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Downloaden Sie, um offline zu lesen
Hadley Wickham
Stat405ddply case study
Wednesday, 7 October 2009
1. Feedback & homework & project
2. Overall goal: dual-sex names vs.
errors
3. Selecting smaller subset
4. Classification
5. Individual exploration
Wednesday, 7 October 2009
I’ll try and go slower when writing things
on the board. Remind me!
Too much homework? Will try to reduce
from now on. This week’s homework is a
bit different.
Feedback
Wednesday, 7 October 2009
Homework
If you need more practice, all function
drills, along with answers, are available on
line.
Running behind on grading, sorry :(
Common mistakes
Wednesday, 7 October 2009
even <- function(x) {
is_even <- x %% 2 == 0
if (is_even) {
print("Even!")
} else {
print("Odd!")
}
}
# Problems
# * does it work with vectors?
# * can we easily define odd in terms of even?
Wednesday, 7 October 2009
even <- function(x) {
x %% 2 == 0
}
even(1:10)
odd <- function(x) {
!even(x)
}
# In general, always should return something useful
# from functions, rather than printing or plotting
Wednesday, 7 October 2009
area <- function(r) {
a <- pi * r ^ 2
a
}
# Not necessary!
area <- function(r) {
pi * r ^ 2
}
Wednesday, 7 October 2009
# Choose from a, b and c with equal probability
x <- runif(1)
if (x < 1/3) {
"a"
} else (x < 2/3) {
"b"
} else {
"c"
}
# OR
sample(c("a","b","c"), 1)
Wednesday, 7 October 2009
Still working on grading. Will have back
to you by next Wednesday (no class on
Monday).
Next project due Oct 30.
Basically same as last time, but working
with baby names and you need to include
an external data source.
Project
Wednesday, 7 October 2009
For names that are used for both boys
and girls, how has usage changed?
Can we use names that clearly have the
incorrect sex to estimate error rates over
time?
Questions
Wednesday, 7 October 2009
Getting started
options(stringsAsFactors = FALSE)
library(plyr)
library(ggplot2)
bnames <- read.csv("baby-names.csv")
Wednesday, 7 October 2009
First task
Identify a smaller subset of names that
been in the top 1000 for both boys and
girls. ~7000 names in total, we want to
focus on ~100.
In real-life would probably use more, but
starting with a subset for easier
exploration is still a good idea.
Wednesday, 7 October 2009
First task
Identify a smaller subset of names that
been in the top 1000 for both boys and
girls. ~7000 names in total, we want to
focus on ~100.
In real-life would probably use more, but
starting with a subset for easier
exploration is still a good idea.
Take two minutes to brainstorm what
variables we might to create to do this.
Wednesday, 7 October 2009
Your turn
Summarise each name with: the total
proportion of boys, the total proportion of
girls, the number of years the name was in
the top 1000 as a girls name, the number of
years the name was in the top 1000 as a
boys name
Hint: Start with a single name and figure out
how to solve the problem. Hint: Use
summarise
Wednesday, 7 October 2009
times <- ddply(bnames, c("name"), summarise,
boys = sum(prop[sex == "boy"]),
boys_n = sum(sex == "boy"),
girls = sum(prop[sex == "girl"]),
girls_n = sum(sex == "girl"),
.progress = "text"
)
nrow(times)
times <- subset(times, boys_n > 1 & girls_n > 1)
Wednesday, 7 October 2009
boys_n
girls_n
20
40
60
80
100
120
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
● ●
●
●
●
●
●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
20 40 60 80 100 120
Wednesday, 7 October 2009
pmin(boys_n, girls_n)
count
0
20
40
60
80
20 40 60 80 100 120
New functions:
pmin(a, b)
pmax(a,b)
Wednesday, 7 October 2009
qplot(boys_n, girls_n, data = times)
qplot(pmin(boys_n, girls_n), data = times,
binwidth = 1)
times$both <- with(times, boys_n > 10 & girls_n > 10)
# Still a few too many names. Lets focus on names
# that have managed a certain level of popularity.
qplot(pmin(boys, girls), data = subset(times, both),
binwidth = 0.01)
qplot(pmax(boys, girls), data = subset(times, both),
binwidth = 0.1)
qplot(boys + girls, data = subset(times, both),
binwidth = 0.1)
Wednesday, 7 October 2009
# Now save our selections
both_sexes <- subset(times, both &
boys + girls > 0.4)
selected_names <- both_sexes$name
selected <- subset(bnames, name %in% selected_names)
nrow(selected) / nrow(bnames)
Wednesday, 7 October 2009
Next problem is to classify which names
are dual-sex, and which are errors.
To do that, we’ll need to calculate yearly
summaries for each of those names, and
use our knowledge of names to come up
with a good classification criterion.
Yearly summaries
Wednesday, 7 October 2009
Your turn
For each name, in each year, figure out
the total number of boys and girls.
Think of ways to summarise the
difference between the number of boys
and girls, and start visualising the data.
Wednesday, 7 October 2009
bysex <- ddply(selected, c("name", "year"),
summarise,
boys = sum(prop[sex == "boy"]),
girls = sum(prop[sex == "girl"]),
.progress = "text"
)
# It's useful to have a symmetric means of comparing
# the relative abundance of boys and girls - the log
# ratio is good for this.
bysex$lratio <- log10(bysex$boys / bysex$girls)
bysex$lratio[!is.finite(bysex$lratio)] <- NA
Wednesday, 7 October 2009
year
lratio
−2
−1
0
1
2
1880 1900 1920 1940 1960 1980 2000
Wednesday, 7 October 2009
lratio
reorder(name,lratio,na.rm=T)
Susan
Linda
Karen
Lisa
Barbara
Sandra
Donna
Patricia
Amanda
Jennifer
Nancy
Melissa
Jessica
Sharon
Michelle
Betty
Mary
Dorothy
Virginia
Helen
Margaret
Ruth
Elizabeth
Sarah
Anna
Alice
Mildred
Emma
Marie
Martha
Lillian
Bertha
Clara
Grace
Minnie
Edna
Annie
Kimberly
Edith
Ethel
Florence
Rose
Louise
Irene
Doris
Julia
Frances
Carol
Ashley
Shirley
Willie
Jerry
Ryan
Joe
Louis
Anthony
Daniel
Eric
Joshua
Jason
Fred
Henry
Jack
Christopher
Kevin
George
Matthew
Arthur
Walter
Harold
Kenneth
Brian
Michael
Paul
Albert
Charles
Frank
Joseph
James
Harry
Robert
John
David
Donald
Thomas
Edward
William
Richard
Larry
Mark
Ronald
●●●●●●●●●● ●●●● ●●●●●●●●●●●
●●●●●●●●●●●●● ●●●●● ●● ●●●●● ●●
●●●●●●●●● ●●●● ● ●●●● ●
●● ●●● ●●● ●●●●●● ●●
●●●●●●●● ●●●●●●●●●●●●●●●●●● ●●●● ●● ●●
●●● ●● ●●●●●●● ●●●● ●
●●●●●●●●● ●●●● ●
●●● ●●●●●●●●●●●●●●●●●● ●● ●●● ●● ●●● ● ●●●● ●● ●● ●
●●● ●●●●●●●●●
●● ●● ●●●● ●●●●●●●●●●● ●● ● ●
●●●● ● ●●●●●●●●●●● ●●
●●●● ●●●●●●●●●● ●●●●
●●●●●●●● ●●●● ●●●
●●●●●●●●●●● ●●● ● ●●
●●●●● ●●●●●●●● ●●●
●●●●●●●●●●●●●●●●● ●●●●●●●●
● ●● ●●●● ●●●● ●● ●●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●● ●● ●●●●●● ●● ●●●●●●
●● ●●●● ●● ●●● ●●●●●●● ●●●●●●● ●●● ●●●● ●●● ●●●●● ●● ●●●●● ●
●● ●● ●●●●●●●●●●●●
●●●●● ●●●● ●● ●●●● ● ●●● ●●● ●●● ●●●●● ●● ●●●●●● ●●● ●●●●● ●●● ●● ●●●●
●● ●●●● ●● ●●●● ●●●●● ●●● ●●●●●● ●●●●●●●●●●●●●●●● ●●●●●●●● ●●●●●●●●●● ●●
●● ●● ● ●●● ● ●●●● ●●●●● ●●●●●● ●●● ●●● ●●●●● ●●●●● ●● ●●●●● ●
●● ●●● ● ●● ●●●● ●● ●●● ●●●● ●●●●●●●●●●● ●● ●●●●●●●●●●●●●●●●●●●●● ●●●●●● ●
●● ●● ●● ● ●●●●● ●●●●●●● ●
● ●●●●●● ●●●●●● ●●● ● ●●●●● ●● ●● ●● ●●●●●●●●●●
●● ●●●●●●● ●● ● ●●●● ●●●● ●●●● ●●●●●●●●●●
●●●●● ● ●● ●●● ●●● ●●●●●●●●●●●
●●●● ● ●● ●●● ●●● ●●●●● ● ●● ●● ●● ● ●
●●●●●●●● ●●● ●●●●●● ●● ●● ●●●●
●●● ●●● ●●● ●● ●●●●●●●●●
●●●●●● ●●●●● ●● ●●●● ●●●●● ●●● ●●● ●●
● ●● ● ●●●●●● ● ●● ●●●●● ●● ●● ●●● ●●
●● ●●● ●●●● ●●●● ● ●● ● ●● ● ●● ●●●●●
●● ●● ●●●●●●● ●● ●●●●● ●●● ●●●
●●● ●● ● ●●● ●●● ●● ● ●● ●● ●● ●●●
● ●●●●● ●●●●● ● ●●●●● ●●● ● ●●●●
●● ●● ●● ●● ●●●● ●● ●● ●●● ●● ●● ●● ●● ● ●● ●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●● ●●●●●
● ●● ●●●● ●● ● ●●● ●● ●● ●●
●●● ●● ●● ●● ● ●●● ●● ● ●●●●●●●●●● ●●●●● ●
●●● ● ●● ●●●● ●●●● ● ●● ●● ●● ●●●● ●●● ●●●●● ●
●●●● ● ●● ●●● ● ●● ●●● ●● ●●●●●●●● ●●●●●
●●● ●●● ●● ●●● ●● ●●●●● ●●● ●● ●●●●●●●●●●● ●●●●
●● ●●●● ●●● ●●●●
●●●●●●●● ●●●●● ●●●●●●●●● ●●●● ●● ●●●●●●● ●●●●
●●●●● ● ●● ●●●
●●● ●●●●●● ●●●● ●●●●●●● ● ●●● ●●● ●●● ●●●●●●●●●●●●●●● ●●●●● ●●●● ●●● ●●●●●●● ●●
●●●● ●●● ●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●● ●●●●●
●●●●●●●●● ●●●●●●●●●●●●●●●● ●●●●●●
●●●● ●●●●● ●●● ● ● ●● ●●● ●● ●●●● ●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●
●●●●●●●●●●●●●●●●●●● ●●●●●● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●● ●● ●●●●
●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●
●●●● ●●● ●● ●●● ●●●● ●● ●●● ●●● ●●● ●●● ●●●●●● ●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●● ●● ●●● ●
●●● ●● ●●●● ●● ●● ●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●● ●
●●●●●●●●●●●●●●●●● ●
●●●●●●●●●●●●●●●
●●●●●●●●● ●● ●
●●●●●●●●●●●●●●●●
●● ●●● ●●●●●●●●●●●●●●●●
●●● ● ●● ● ●● ●●●●● ●●●●●●● ● ●● ●●●●●●● ●●●●●●●● ●● ●●●●●● ●●●●●●●●●● ●●●●
●●●●●●●●●●●●●● ●●
●●●●●●●●●●●●●●●●●●●●●●● ● ●
●●●●●●●●●●●●
●●● ●● ●●●●● ● ●● ●●● ● ●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●● ●●●●● ●●●●● ●●●●● ●●●●●●●●●●
●●●●●●●●●●●●●● ●
● ●●●●● ●● ●●●●●●●●●●●●●●●●●●●●
●● ● ●●● ●●●●●●●●●● ●●●●● ●●●●●●●●●●●●●● ●●●●●
●●● ●●●●●●●●●● ●●●
●●●●●● ●●●●●●●●
●●●●●●●●●●●●●●
●● ●●●● ● ● ● ●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●● ● ●●●●●
●●● ●●●●● ●●●●●●●● ●● ●● ●●●●●●●●
●● ●●●● ●●●●●● ●●●
●●● ●●●●● ●●●●● ●●● ●●●●●● ●● ●● ●●●●●● ●●●●●●●●●● ●●●●●●●●●●● ●●●● ●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●
● ●●●● ●● ●●●● ●● ●●● ●●●● ●● ●●●● ●●● ●●● ●●●●●● ●● ●●●●●●●●●● ●● ●● ●●●●●
●● ●● ●● ●●● ●●●● ●●●●●● ●●●●●● ●●● ●● ●●●●●●●●●●●●●●●● ●●●●●●●● ●●●● ●●●●●●●●●●●●●●●●●●●● ●●●●●●● ●
●● ●●● ●● ●● ●●● ●●●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●● ●
●● ●● ●●● ●● ●●● ●●●
● ●●● ●●● ●● ● ● ●●●●●● ●●●●●● ●●●●●●●● ●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●● ●●●●●●●●●● ●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●● ●
● ●●●●●●● ●●● ●●●●● ●●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●● ●● ●●●●●●●●● ●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●
●●● ●●●●● ●●●● ● ●● ●● ●●●●●●●●●●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●● ●●● ●●●●●
●●●●●●●●●● ●● ●● ●●●●● ●●●●
●●● ●● ●●●●●● ●● ●●●●●●●●●● ●●●●●●●●●●●●●●●● ●●● ●●● ●●●● ●● ●●●● ●●●●●●●●●●●●●●●●●●
●●●●●● ●●● ●●●●●●●●●●●●● ●●●●●●● ●●● ●●●●●●●●
●●●● ●●●● ●● ●● ● ●●●● ●● ●●● ●● ●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●● ●● ●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●● ●●●●●●●
● ● ●●●●●●●●● ●●●●●● ●●●●●●●● ●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●
● ●●●● ●● ●●●●●●●●●
●●●●●●●●●●●●●
● ●●●●●●●●●●
−2 −1 0 1 2
Wednesday, 7 October 2009
abs(lratio)
reorder(name,lratio,na.rm=T)
Susan
Linda
Karen
Lisa
Barbara
Sandra
Donna
Patricia
Amanda
Jennifer
Nancy
Melissa
Jessica
Sharon
Michelle
Betty
Mary
Dorothy
Virginia
Helen
Margaret
Ruth
Elizabeth
Sarah
Anna
Alice
Mildred
Emma
Marie
Martha
Lillian
Bertha
Clara
Grace
Minnie
Edna
Annie
Kimberly
Edith
Ethel
Florence
Rose
Louise
Irene
Doris
Julia
Frances
Carol
Ashley
Shirley
Willie
Jerry
Ryan
Joe
Louis
Anthony
Daniel
Eric
Joshua
Jason
Fred
Henry
Jack
Christopher
Kevin
George
Matthew
Arthur
Walter
Harold
Kenneth
Brian
Michael
Paul
Albert
Charles
Frank
Joseph
James
Harry
Robert
John
David
Donald
Thomas
Edward
William
Richard
Larry
Mark
Ronald
● ●● ● ●● ● ●● ●● ●●●● ●●●●●●●●●●
● ●●●● ● ● ●●● ● ●●● ●●●●● ●● ● ●●●● ●
●● ●●● ●●● ●● ●●●●● ●●●●
● ●●●●● ● ●●●●● ● ●●●
●●●●● ●● ●●●●●●●● ● ●● ● ●●●●●● ●●●● ●●●●●
● ● ●● ●● ●● ● ● ●●●●●●●
● ● ● ● ●● ●● ●●●●●●
● ● ●● ●●● ●●●●●● ● ● ● ●● ●● ●● ●● ●●●●● ● ●●●●● ●● ●●●●
● ● ●●●●●● ●●●●
● ●● ●● ●● ●● ● ●●●●●●●●●● ●●●
● ● ● ●●●● ●● ● ●●●● ● ●●●
●●●●●●●● ●●●●● ●●● ●●
●●● ●●●● ●● ●●●● ●●
● ● ●● ● ●●● ●● ●● ●●●●●
●●●●●●●●●● ●●●●● ●
● ●●●●●●●●●●●●● ●● ●●●● ● ● ● ● ●
●● ●●●● ●● ● ● ●●●● ●●●● ●●●● ●● ●●● ●●●● ● ● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●● ● ● ●●● ●●●●●●●● ●●●●● ●●●●●●●●●●
● ●● ● ● ●● ●● ● ●● ●● ● ● ● ●● ● ●●●● ●●●●●●●●●● ●●●●● ●● ●● ● ●● ●●
●●● ●●● ●● ●●●●● ●●●
● ●●● ●● ●● ●● ●●●● ●●●● ●●● ●●● ●● ●● ● ●● ●● ●●●●●●●●●●●● ●●● ●● ●●● ●●
● ●● ●● ●● ●●● ● ●●● ● ●●●● ●● ●● ● ●●● ●● ● ● ● ●● ●●●●● ●●●●●●●● ●●●● ●●●● ●●●● ●●●
●●● ●●● ● ●●●● ●●●● ●● ●● ● ● ●● ●● ● ●● ●●● ●●●●● ●●●●● ●● ● ●● ●●
● ●●● ●●● ●● ● ●●● ●● ●●●● ● ●● ●●●● ● ●●● ● ●●●● ●●● ●●●● ●●● ●● ●●●●●●●●● ●●●● ●●
● ●● ●● ●●● ● ●● ●●●●● ●●●●
●●●●● ● ●●● ● ●● ●● ●●●● ● ●● ●● ●● ●● ●● ● ● ● ●●●●●●
● ●●●● ●●● ●● ●●● ● ● ●●● ●●●● ● ●● ● ● ●●●●●●●
● ● ●● ●●● ●● ● ●● ● ●● ●● ●●●●●●●●
●● ●●●● ●●● ●● ● ●●● ●● ●●● ●●●● ●●●
●●●● ● ●● ●● ●●●●● ● ●●●●● ●● ● ● ●
●● ●● ● ●●● ●● ●● ●●●●● ●●●
●● ●●●●●● ●● ●● ●●● ●●● ●● ●●●● ●● ● ●● ●
●● ●●● ●●●● ●●● ●● ●● ●●● ●●●● ●●● ●
● ●●● ●● ●●●●● ●●●● ●●● ●●● ●●●●●●
●●● ●● ●●●● ● ●●●● ● ●●●● ●●● ●●
●●●● ●●●●●● ● ●● ●●● ●● ●●●● ●●
●● ● ●●●● ●●● ●●● ●●● ●●● ●●●● ● ●
● ●●●● ●● ●● ●●●●●● ●● ●●● ●● ●● ●● ●●● ●●●●●●●●●●
● ● ● ● ● ● ● ● ● ●● ●●●● ●●●●●●●●●●●●
●● ●● ● ●●● ●●● ●●● ●● ●●●
● ●●● ●●●● ●●● ● ●● ●●● ●●●●● ● ●● ●●●● ● ●●
●●●●● ●●● ●●●●● ●●● ●● ●● ●●● ● ●● ●●● ●● ● ●●
●● ●●●● ●● ●●●● ●● ●●● ●●●● ●● ●● ●●● ●●●
●●●● ● ●● ●● ●●● ●●● ●● ●● ● ●● ●● ● ● ●●●●●●● ●●●●●
●●● ●● ●● ● ●● ●●●
●● ● ●●● ●●● ● ●● ●● ●●● ● ●●● ●●●●●● ●● ●●●●●●● ●● ●
●●●● ●●● ●● ●●
● ● ●● ● ● ●●●● ● ● ●● ●●●●● ●●● ● ●●● ●● ● ●● ● ●●● ●● ●●●●●●●●●●●● ●●●● ●● ● ●● ●●●● ●●● ●
● ● ● ●●● ●● ● ●●●●●●● ● ●●●● ●●●●● ● ●●● ●● ● ●● ● ● ● ● ● ● ● ●● ●● ●●● ●● ●● ●●● ●●●●●●
● ● ● ● ●● ● ●●● ● ●●● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●●
● ● ● ●●● ●● ●●● ●●●● ●● ● ●● ●● ●● ●● ●●● ●●●● ●● ● ● ● ● ●● ● ● ● ● ●●●●● ●● ● ● ●●● ●●● ●● ● ●● ● ● ●●●● ●●●●
●●● ●● ●●●●●●● ●●● ●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●● ●●●●● ● ●●● ●●●●● ●● ●●●●●●●● ● ● ● ●● ● ●●
●● ●●●● ●●●●●●●●●●●●● ●●● ●●● ● ● ●●● ●●●●●●●● ●●●● ●● ● ●● ●●●●●● ● ●● ●● ●●
●●●●● ●●● ●● ●●●● ● ● ●● ●●●●●●● ●●●●●●●●● ●
●●●● ●●● ●● ●●● ●● ●● ●● ●●● ●● ● ●●● ●●● ●●● ●●● ● ●●●●●●●●●●● ●●●●●●● ●●●●● ●●●●●●●●●●●● ●●● ●● ● ●● ●
●●● ●● ●●●● ●● ●● ●●●●●●●●●●●●
●● ●● ● ●● ●●●● ●●●●●●● ● ●●●● ●
●●●●●● ●●●●●●●●●●● ●
●●● ●●●●●● ●●●●●●
●● ●●●●●●● ●● ●
●● ●●●●●●●● ●●●●●●
●● ●● ● ●●●● ●●● ●●●●●●●● ●
●●● ● ●● ● ●● ●●●●● ● ●●●●● ● ● ●● ●●●●● ●● ●●●●●●●● ●● ●●●●●● ●●●● ● ●●● ●● ●●● ●
● ●● ●● ●●● ●●● ●●● ●●
●●●●●● ●●●●●●●●● ●●●●●●●● ● ●
●●●●●● ●●●●●●
●●● ●● ●● ● ●● ● ●● ●●● ● ●● ●● ●● ●●●●● ● ●● ●●●●● ●●●●●●● ●●●●●●●● ●●●●● ●●●● ● ●●● ●● ● ●● ●●●●●●●
● ●● ●●● ●●●●●●● ● ●
● ●●●●● ●● ●●●●●●●●●●●●●●●● ●●●●
●● ● ●●● ●● ●●●●●●● ● ●●●●● ●●●●●●●●●●● ● ●● ●●●●●
●●● ●● ●●●●●●●● ●●●
●●●● ● ● ● ●● ●●●●●
●●●●● ●●●●●●●●●
●● ●●●● ● ● ● ●●● ● ●●● ●●●●●●●●●● ●●●●●●●●●●●●●●●●● ●●● ●●●● ● ● ●●●●●
●●● ●●● ●● ●● ●●●●●● ●● ● ● ●●●● ●●●●
●● ● ●●● ● ●● ●●● ●●●
●●● ●●● ●● ●●●● ● ●●● ● ●●●●● ●● ●● ●●●●●● ●●●●●●●●●● ●●●●● ●● ●● ●● ● ●●● ●●●●●● ●● ●●● ●●● ●●●●●● ●● ● ●●●●●●●●●●●●●
● ●●●● ●● ● ● ●● ●● ●●● ●●●● ●● ●●●● ●●● ●●● ●●●●●● ●● ●● ●●●●●●●● ●● ●● ●●●●●
●● ●● ●● ●●● ●●●● ●●● ●●● ●●● ● ●● ● ●● ●● ●●● ●●●● ●●●●●●●●● ●●● ● ● ● ●● ●●●● ●●●● ●●●●● ●●●● ●●●●●●● ●●●●●●● ●
●● ●●● ●● ●● ●●● ●●●● ●● ●●●● ●● ●●●● ●●●●●●●●●●● ●● ●●●●●●●●●● ● ●● ● ●●●●●●●●● ●● ●●●●●●● ●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●● ●●●●● ●
●● ●● ●● ● ●● ●●● ●●●
● ●● ● ●●● ●● ● ● ●●●● ●● ●●● ●● ● ●●● ●●●●● ●●●●●●●●●●●●●●●●● ●● ●● ●●●●●● ●●● ●● ●●●●●●●● ●● ●●●●●● ●●●●●●●●●●●●●●●●●●● ●●●●●●● ●
● ●●● ●●●● ●●● ●●●●● ●●● ● ●● ●●●●●● ●●● ●●●●●●● ●● ●●●●●●●●●●● ● ● ●● ●●●●● ●●● ● ●● ● ●●●●●● ●●●●●●●●●● ●●●●●●●●●●●●●●●●●● ●●● ●●● ● ●
●●● ●●●●● ●●●● ● ●● ●● ●●●● ● ●● ●● ●● ●●●● ●●●●● ●●● ●●●●●●●●●●●●●●●●●● ●●● ●●●●●
● ●●●●●●●● ● ●● ●● ●●●●● ●● ●●
●●● ●● ●● ●●●● ●● ●●●●●● ●●●● ●●●●●● ● ●●● ●●●● ●● ●●● ●●● ●●●● ●● ●●● ● ●●● ●●●●●●●● ●●● ●●●●
●●●●●● ●●● ●●● ●● ● ●●●●●●● ●●●●●●● ●● ● ●●● ● ●● ●●
●● ●● ●●●● ●● ●● ● ●●●● ●● ●●● ●● ●●●●● ●●●●● ●●●●●●● ●●●●●●●●● ● ●● ● ●●●● ●●● ●●● ●●● ●● ●● ●● ●●●●●●●●● ●● ●●●●●●●●● ●●●● ●●●● ●●● ●● ●●
● ● ●●●●●●●●● ●● ●●●● ●●● ●●●● ● ●● ●●●● ● ●●● ●●● ●●●●●●● ●●●●●●●●●●●●
● ●●●● ●● ●●●●●●●● ●
●●●● ●●●●●●●●●
● ●●● ●●● ●●● ●
0.5 1.0 1.5 2.0 2.5
Wednesday, 7 October 2009
theme_set(theme_grey(10))
qplot(year, lratio, data = bysex, group = name,
geom = "line")
qplot(lratio, reorder(name, lratio, na.rm = T),
data = bysex)
qplot(abs(lratio), reorder(name, lratio, na.rm = T),
data = bysex)
qplot(abs(lratio), reorder(name, lratio, na.rm = T),
data = bysex) +
geom_point(data = both_sexes, colour = "red")
Wednesday, 7 October 2009
year
lratio
−2
−1
0
1
2
1880 1900 1920 1940 1960 1980 2000
What characteristics of
each name might we want
to use to classify them into
dual-sex with sex-errors?
Wednesday, 7 October 2009
Your turn
Compute the mean and range of lratio for
each name.
Plot and come up with cutoffs that you
think separate the two groups.
Wednesday, 7 October 2009
rng <- ddply(bysex, "name", summarise,
diff = diff(range(lratio, na.rm = T)),
mean = mean(lratio, na.rm = T)
)
qplot(diff, abs(mean), data = rng)
qplot(diff, abs(mean), data = rng,
colour = abs(mean) < 1.75 | diff > 0.9)
shared_names <- subset(rng, abs(mean) < 1.75 |
diff > 0.9)$name
qplot(abs(lratio), reorder(name, lratio, na.rm=T),
data = subset(bysex, name %in% shared_names))
qplot(year, lratio, geom = "line", group = name,
data = subset(bysex, name %in% shared_names))
Wednesday, 7 October 2009
Now that we’ve separated the two
groups, we’ll explore each in more detail.
Next time
Wednesday, 7 October 2009

Weitere ähnliche Inhalte

Ähnlich wie 13 Ddply

Optimal Nudging. Presentation UD.
Optimal Nudging. Presentation UD.Optimal Nudging. Presentation UD.
Optimal Nudging. Presentation UD.r-uribe
 
5 Ways to Awesome-ize Your (PHP) Code
5 Ways to Awesome-ize Your (PHP) Code5 Ways to Awesome-ize Your (PHP) Code
5 Ways to Awesome-ize Your (PHP) CodeJeremy Kendall
 
Think Like a Programmer
Think Like a ProgrammerThink Like a Programmer
Think Like a Programmerdaoswald
 
jQuery Plugin Creation
jQuery Plugin CreationjQuery Plugin Creation
jQuery Plugin Creationbenalman
 
Regular Expressions: Backtracking, and The Little Engine that Could(n't)?
Regular Expressions: Backtracking, and The Little Engine that Could(n't)?Regular Expressions: Backtracking, and The Little Engine that Could(n't)?
Regular Expressions: Backtracking, and The Little Engine that Could(n't)?daoswald
 
Dear compiler please don't be my nanny v2
Dear compiler  please don't be my nanny v2Dear compiler  please don't be my nanny v2
Dear compiler please don't be my nanny v2Dino Dini
 
Scalability without going nuts
Scalability without going nutsScalability without going nuts
Scalability without going nutsJames Cox
 
BangaloreJUG introduction to kotlin
BangaloreJUG   introduction to kotlinBangaloreJUG   introduction to kotlin
BangaloreJUG introduction to kotlinChandra Sekhar Nayak
 
ATS/LF for Coq users
ATS/LF for Coq usersATS/LF for Coq users
ATS/LF for Coq usersKiwamu Okabe
 
Metasepi team meeting #14: ATS programming on MCU
Metasepi team meeting #14: ATS programming on MCUMetasepi team meeting #14: ATS programming on MCU
Metasepi team meeting #14: ATS programming on MCUKiwamu Okabe
 
Simple regret bandit algorithms for unstructured noisy optimization
Simple regret bandit algorithms for unstructured noisy optimizationSimple regret bandit algorithms for unstructured noisy optimization
Simple regret bandit algorithms for unstructured noisy optimizationOlivier Teytaud
 
Unit Testing Tool Competition-Eighth Round
Unit Testing Tool Competition-Eighth RoundUnit Testing Tool Competition-Eighth Round
Unit Testing Tool Competition-Eighth RoundSebastiano Panichella
 
Drupal camp paris 2013: présentation de Behat, Mink et Drupal extension
Drupal camp paris 2013: présentation de Behat, Mink et Drupal extensionDrupal camp paris 2013: présentation de Behat, Mink et Drupal extension
Drupal camp paris 2013: présentation de Behat, Mink et Drupal extensionDidier B2F
 
Översättning av django-program
Översättning av django-programÖversättning av django-program
Översättning av django-programmikaelmoutakis
 
[E-Dev-Day 2014][5/16] C++ and JavaScript bindings for EFL and Elementary
[E-Dev-Day 2014][5/16] C++ and JavaScript bindings for EFL and Elementary	[E-Dev-Day 2014][5/16] C++ and JavaScript bindings for EFL and Elementary
[E-Dev-Day 2014][5/16] C++ and JavaScript bindings for EFL and Elementary EnlightenmentProject
 
Its not about the tooling
Its not about the toolingIts not about the tooling
Its not about the toolingBram Vogelaar
 
Can we fix dev-oops ?
Can we fix dev-oops ?Can we fix dev-oops ?
Can we fix dev-oops ?Kris Buytaert
 

Ähnlich wie 13 Ddply (20)

Optimal Nudging. Presentation UD.
Optimal Nudging. Presentation UD.Optimal Nudging. Presentation UD.
Optimal Nudging. Presentation UD.
 
20 Polishing
20 Polishing20 Polishing
20 Polishing
 
5 Ways to Awesome-ize Your (PHP) Code
5 Ways to Awesome-ize Your (PHP) Code5 Ways to Awesome-ize Your (PHP) Code
5 Ways to Awesome-ize Your (PHP) Code
 
Think Like a Programmer
Think Like a ProgrammerThink Like a Programmer
Think Like a Programmer
 
jQuery Plugin Creation
jQuery Plugin CreationjQuery Plugin Creation
jQuery Plugin Creation
 
Regular Expressions: Backtracking, and The Little Engine that Could(n't)?
Regular Expressions: Backtracking, and The Little Engine that Could(n't)?Regular Expressions: Backtracking, and The Little Engine that Could(n't)?
Regular Expressions: Backtracking, and The Little Engine that Could(n't)?
 
Dear compiler please don't be my nanny v2
Dear compiler  please don't be my nanny v2Dear compiler  please don't be my nanny v2
Dear compiler please don't be my nanny v2
 
Scalability without going nuts
Scalability without going nutsScalability without going nuts
Scalability without going nuts
 
BangaloreJUG introduction to kotlin
BangaloreJUG   introduction to kotlinBangaloreJUG   introduction to kotlin
BangaloreJUG introduction to kotlin
 
ATS/LF for Coq users
ATS/LF for Coq usersATS/LF for Coq users
ATS/LF for Coq users
 
Metasepi team meeting #14: ATS programming on MCU
Metasepi team meeting #14: ATS programming on MCUMetasepi team meeting #14: ATS programming on MCU
Metasepi team meeting #14: ATS programming on MCU
 
Simple regret bandit algorithms for unstructured noisy optimization
Simple regret bandit algorithms for unstructured noisy optimizationSimple regret bandit algorithms for unstructured noisy optimization
Simple regret bandit algorithms for unstructured noisy optimization
 
Unit Testing Tool Competition-Eighth Round
Unit Testing Tool Competition-Eighth RoundUnit Testing Tool Competition-Eighth Round
Unit Testing Tool Competition-Eighth Round
 
PyOhio Recursion Slides
PyOhio Recursion SlidesPyOhio Recursion Slides
PyOhio Recursion Slides
 
14 Ddply
14 Ddply14 Ddply
14 Ddply
 
Drupal camp paris 2013: présentation de Behat, Mink et Drupal extension
Drupal camp paris 2013: présentation de Behat, Mink et Drupal extensionDrupal camp paris 2013: présentation de Behat, Mink et Drupal extension
Drupal camp paris 2013: présentation de Behat, Mink et Drupal extension
 
Översättning av django-program
Översättning av django-programÖversättning av django-program
Översättning av django-program
 
[E-Dev-Day 2014][5/16] C++ and JavaScript bindings for EFL and Elementary
[E-Dev-Day 2014][5/16] C++ and JavaScript bindings for EFL and Elementary	[E-Dev-Day 2014][5/16] C++ and JavaScript bindings for EFL and Elementary
[E-Dev-Day 2014][5/16] C++ and JavaScript bindings for EFL and Elementary
 
Its not about the tooling
Its not about the toolingIts not about the tooling
Its not about the tooling
 
Can we fix dev-oops ?
Can we fix dev-oops ?Can we fix dev-oops ?
Can we fix dev-oops ?
 

Mehr von Hadley Wickham (20)

27 development
27 development27 development
27 development
 
24 modelling
24 modelling24 modelling
24 modelling
 
23 data-structures
23 data-structures23 data-structures
23 data-structures
 
Graphical inference
Graphical inferenceGraphical inference
Graphical inference
 
R packages
R packagesR packages
R packages
 
22 spam
22 spam22 spam
22 spam
 
21 spam
21 spam21 spam
21 spam
 
20 date-times
20 date-times20 date-times
20 date-times
 
19 tables
19 tables19 tables
19 tables
 
18 cleaning
18 cleaning18 cleaning
18 cleaning
 
17 polishing
17 polishing17 polishing
17 polishing
 
16 critique
16 critique16 critique
16 critique
 
15 time-space
15 time-space15 time-space
15 time-space
 
14 case-study
14 case-study14 case-study
14 case-study
 
13 case-study
13 case-study13 case-study
13 case-study
 
12 adv-manip
12 adv-manip12 adv-manip
12 adv-manip
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
10 simulation
10 simulation10 simulation
10 simulation
 
10 simulation
10 simulation10 simulation
10 simulation
 

13 Ddply

  • 1. Hadley Wickham Stat405ddply case study Wednesday, 7 October 2009
  • 2. 1. Feedback & homework & project 2. Overall goal: dual-sex names vs. errors 3. Selecting smaller subset 4. Classification 5. Individual exploration Wednesday, 7 October 2009
  • 3. I’ll try and go slower when writing things on the board. Remind me! Too much homework? Will try to reduce from now on. This week’s homework is a bit different. Feedback Wednesday, 7 October 2009
  • 4. Homework If you need more practice, all function drills, along with answers, are available on line. Running behind on grading, sorry :( Common mistakes Wednesday, 7 October 2009
  • 5. even <- function(x) { is_even <- x %% 2 == 0 if (is_even) { print("Even!") } else { print("Odd!") } } # Problems # * does it work with vectors? # * can we easily define odd in terms of even? Wednesday, 7 October 2009
  • 6. even <- function(x) { x %% 2 == 0 } even(1:10) odd <- function(x) { !even(x) } # In general, always should return something useful # from functions, rather than printing or plotting Wednesday, 7 October 2009
  • 7. area <- function(r) { a <- pi * r ^ 2 a } # Not necessary! area <- function(r) { pi * r ^ 2 } Wednesday, 7 October 2009
  • 8. # Choose from a, b and c with equal probability x <- runif(1) if (x < 1/3) { "a" } else (x < 2/3) { "b" } else { "c" } # OR sample(c("a","b","c"), 1) Wednesday, 7 October 2009
  • 9. Still working on grading. Will have back to you by next Wednesday (no class on Monday). Next project due Oct 30. Basically same as last time, but working with baby names and you need to include an external data source. Project Wednesday, 7 October 2009
  • 10. For names that are used for both boys and girls, how has usage changed? Can we use names that clearly have the incorrect sex to estimate error rates over time? Questions Wednesday, 7 October 2009
  • 11. Getting started options(stringsAsFactors = FALSE) library(plyr) library(ggplot2) bnames <- read.csv("baby-names.csv") Wednesday, 7 October 2009
  • 12. First task Identify a smaller subset of names that been in the top 1000 for both boys and girls. ~7000 names in total, we want to focus on ~100. In real-life would probably use more, but starting with a subset for easier exploration is still a good idea. Wednesday, 7 October 2009
  • 13. First task Identify a smaller subset of names that been in the top 1000 for both boys and girls. ~7000 names in total, we want to focus on ~100. In real-life would probably use more, but starting with a subset for easier exploration is still a good idea. Take two minutes to brainstorm what variables we might to create to do this. Wednesday, 7 October 2009
  • 14. Your turn Summarise each name with: the total proportion of boys, the total proportion of girls, the number of years the name was in the top 1000 as a girls name, the number of years the name was in the top 1000 as a boys name Hint: Start with a single name and figure out how to solve the problem. Hint: Use summarise Wednesday, 7 October 2009
  • 15. times <- ddply(bnames, c("name"), summarise, boys = sum(prop[sex == "boy"]), boys_n = sum(sex == "boy"), girls = sum(prop[sex == "girl"]), girls_n = sum(sex == "girl"), .progress = "text" ) nrow(times) times <- subset(times, boys_n > 1 & girls_n > 1) Wednesday, 7 October 2009
  • 16. boys_n girls_n 20 40 60 80 100 120 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 20 40 60 80 100 120 Wednesday, 7 October 2009
  • 17. pmin(boys_n, girls_n) count 0 20 40 60 80 20 40 60 80 100 120 New functions: pmin(a, b) pmax(a,b) Wednesday, 7 October 2009
  • 18. qplot(boys_n, girls_n, data = times) qplot(pmin(boys_n, girls_n), data = times, binwidth = 1) times$both <- with(times, boys_n > 10 & girls_n > 10) # Still a few too many names. Lets focus on names # that have managed a certain level of popularity. qplot(pmin(boys, girls), data = subset(times, both), binwidth = 0.01) qplot(pmax(boys, girls), data = subset(times, both), binwidth = 0.1) qplot(boys + girls, data = subset(times, both), binwidth = 0.1) Wednesday, 7 October 2009
  • 19. # Now save our selections both_sexes <- subset(times, both & boys + girls > 0.4) selected_names <- both_sexes$name selected <- subset(bnames, name %in% selected_names) nrow(selected) / nrow(bnames) Wednesday, 7 October 2009
  • 20. Next problem is to classify which names are dual-sex, and which are errors. To do that, we’ll need to calculate yearly summaries for each of those names, and use our knowledge of names to come up with a good classification criterion. Yearly summaries Wednesday, 7 October 2009
  • 21. Your turn For each name, in each year, figure out the total number of boys and girls. Think of ways to summarise the difference between the number of boys and girls, and start visualising the data. Wednesday, 7 October 2009
  • 22. bysex <- ddply(selected, c("name", "year"), summarise, boys = sum(prop[sex == "boy"]), girls = sum(prop[sex == "girl"]), .progress = "text" ) # It's useful to have a symmetric means of comparing # the relative abundance of boys and girls - the log # ratio is good for this. bysex$lratio <- log10(bysex$boys / bysex$girls) bysex$lratio[!is.finite(bysex$lratio)] <- NA Wednesday, 7 October 2009
  • 23. year lratio −2 −1 0 1 2 1880 1900 1920 1940 1960 1980 2000 Wednesday, 7 October 2009
  • 24. lratio reorder(name,lratio,na.rm=T) Susan Linda Karen Lisa Barbara Sandra Donna Patricia Amanda Jennifer Nancy Melissa Jessica Sharon Michelle Betty Mary Dorothy Virginia Helen Margaret Ruth Elizabeth Sarah Anna Alice Mildred Emma Marie Martha Lillian Bertha Clara Grace Minnie Edna Annie Kimberly Edith Ethel Florence Rose Louise Irene Doris Julia Frances Carol Ashley Shirley Willie Jerry Ryan Joe Louis Anthony Daniel Eric Joshua Jason Fred Henry Jack Christopher Kevin George Matthew Arthur Walter Harold Kenneth Brian Michael Paul Albert Charles Frank Joseph James Harry Robert John David Donald Thomas Edward William Richard Larry Mark Ronald ●●●●●●●●●● ●●●● ●●●●●●●●●●● ●●●●●●●●●●●●● ●●●●● ●● ●●●●● ●● ●●●●●●●●● ●●●● ● ●●●● ● ●● ●●● ●●● ●●●●●● ●● ●●●●●●●● ●●●●●●●●●●●●●●●●●● ●●●● ●● ●● ●●● ●● ●●●●●●● ●●●● ● ●●●●●●●●● ●●●● ● ●●● ●●●●●●●●●●●●●●●●●● ●● ●●● ●● ●●● ● ●●●● ●● ●● ● ●●● ●●●●●●●●● ●● ●● ●●●● ●●●●●●●●●●● ●● ● ● ●●●● ● ●●●●●●●●●●● ●● ●●●● ●●●●●●●●●● ●●●● ●●●●●●●● ●●●● ●●● ●●●●●●●●●●● ●●● ● ●● ●●●●● ●●●●●●●● ●●● ●●●●●●●●●●●●●●●●● ●●●●●●●● ● ●● ●●●● ●●●● ●● ●●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●● ●● ●●●●●● ●● ●●●●●● ●● ●●●● ●● ●●● ●●●●●●● ●●●●●●● ●●● ●●●● ●●● ●●●●● ●● ●●●●● ● ●● ●● ●●●●●●●●●●●● ●●●●● ●●●● ●● ●●●● ● ●●● ●●● ●●● ●●●●● ●● ●●●●●● ●●● ●●●●● ●●● ●● ●●●● ●● ●●●● ●● ●●●● ●●●●● ●●● ●●●●●● ●●●●●●●●●●●●●●●● ●●●●●●●● ●●●●●●●●●● ●● ●● ●● ● ●●● ● ●●●● ●●●●● ●●●●●● ●●● ●●● ●●●●● ●●●●● ●● ●●●●● ● ●● ●●● ● ●● ●●●● ●● ●●● ●●●● ●●●●●●●●●●● ●● ●●●●●●●●●●●●●●●●●●●●● ●●●●●● ● ●● ●● ●● ● ●●●●● ●●●●●●● ● ● ●●●●●● ●●●●●● ●●● ● ●●●●● ●● ●● ●● ●●●●●●●●●● ●● ●●●●●●● ●● ● ●●●● ●●●● ●●●● ●●●●●●●●●● ●●●●● ● ●● ●●● ●●● ●●●●●●●●●●● ●●●● ● ●● ●●● ●●● ●●●●● ● ●● ●● ●● ● ● ●●●●●●●● ●●● ●●●●●● ●● ●● ●●●● ●●● ●●● ●●● ●● ●●●●●●●●● ●●●●●● ●●●●● ●● ●●●● ●●●●● ●●● ●●● ●● ● ●● ● ●●●●●● ● ●● ●●●●● ●● ●● ●●● ●● ●● ●●● ●●●● ●●●● ● ●● ● ●● ● ●● ●●●●● ●● ●● ●●●●●●● ●● ●●●●● ●●● ●●● ●●● ●● ● ●●● ●●● ●● ● ●● ●● ●● ●●● ● ●●●●● ●●●●● ● ●●●●● ●●● ● ●●●● ●● ●● ●● ●● ●●●● ●● ●● ●●● ●● ●● ●● ●● ● ●● ●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●● ●●●●● ● ●● ●●●● ●● ● ●●● ●● ●● ●● ●●● ●● ●● ●● ● ●●● ●● ● ●●●●●●●●●● ●●●●● ● ●●● ● ●● ●●●● ●●●● ● ●● ●● ●● ●●●● ●●● ●●●●● ● ●●●● ● ●● ●●● ● ●● ●●● ●● ●●●●●●●● ●●●●● ●●● ●●● ●● ●●● ●● ●●●●● ●●● ●● ●●●●●●●●●●● ●●●● ●● ●●●● ●●● ●●●● ●●●●●●●● ●●●●● ●●●●●●●●● ●●●● ●● ●●●●●●● ●●●● ●●●●● ● ●● ●●● ●●● ●●●●●● ●●●● ●●●●●●● ● ●●● ●●● ●●● ●●●●●●●●●●●●●●● ●●●●● ●●●● ●●● ●●●●●●● ●● ●●●● ●●● ●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●● ●●●●● ●●●●●●●●● ●●●●●●●●●●●●●●●● ●●●●●● ●●●● ●●●●● ●●● ● ● ●● ●●● ●● ●●●● ●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●● ●●●●●●●●●●●●●●●●●●● ●●●●●● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●● ●● ●●●● ●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●● ●●●● ●●● ●● ●●● ●●●● ●● ●●● ●●● ●●● ●●● ●●●●●● ●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●● ●● ●●● ● ●●● ●● ●●●● ●● ●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●● ● ●●●●●●●●●●●●●●●●● ● ●●●●●●●●●●●●●●● ●●●●●●●●● ●● ● ●●●●●●●●●●●●●●●● ●● ●●● ●●●●●●●●●●●●●●●● ●●● ● ●● ● ●● ●●●●● ●●●●●●● ● ●● ●●●●●●● ●●●●●●●● ●● ●●●●●● ●●●●●●●●●● ●●●● ●●●●●●●●●●●●●● ●● ●●●●●●●●●●●●●●●●●●●●●●● ● ● ●●●●●●●●●●●● ●●● ●● ●●●●● ● ●● ●●● ● ●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●● ●●●●● ●●●●● ●●●●● ●●●●●●●●●● ●●●●●●●●●●●●●● ● ● ●●●●● ●● ●●●●●●●●●●●●●●●●●●●● ●● ● ●●● ●●●●●●●●●● ●●●●● ●●●●●●●●●●●●●● ●●●●● ●●● ●●●●●●●●●● ●●● ●●●●●● ●●●●●●●● ●●●●●●●●●●●●●● ●● ●●●● ● ● ● ●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●● ● ●●●●● ●●● ●●●●● ●●●●●●●● ●● ●● ●●●●●●●● ●● ●●●● ●●●●●● ●●● ●●● ●●●●● ●●●●● ●●● ●●●●●● ●● ●● ●●●●●● ●●●●●●●●●● ●●●●●●●●●●● ●●●● ●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●● ● ●●●● ●● ●●●● ●● ●●● ●●●● ●● ●●●● ●●● ●●● ●●●●●● ●● ●●●●●●●●●● ●● ●● ●●●●● ●● ●● ●● ●●● ●●●● ●●●●●● ●●●●●● ●●● ●● ●●●●●●●●●●●●●●●● ●●●●●●●● ●●●● ●●●●●●●●●●●●●●●●●●●● ●●●●●●● ● ●● ●●● ●● ●● ●●● ●●●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●● ● ●● ●● ●●● ●● ●●● ●●● ● ●●● ●●● ●● ● ● ●●●●●● ●●●●●● ●●●●●●●● ●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●● ●●●●●●●●●● ●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ●●●●●●● ●●● ●●●●● ●●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●● ●● ●●●●●●●●● ●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●● ●●● ●●●●● ●●●● ● ●● ●● ●●●●●●●●●●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●● ●●● ●●●●● ●●●●●●●●●● ●● ●● ●●●●● ●●●● ●●● ●● ●●●●●● ●● ●●●●●●●●●● ●●●●●●●●●●●●●●●● ●●● ●●● ●●●● ●● ●●●● ●●●●●●●●●●●●●●●●●● ●●●●●● ●●● ●●●●●●●●●●●●● ●●●●●●● ●●● ●●●●●●●● ●●●● ●●●● ●● ●● ● ●●●● ●● ●●● ●● ●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●● ●● ●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●● ●●●●●●● ● ● ●●●●●●●●● ●●●●●● ●●●●●●●● ●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●● ● ●●●● ●● ●●●●●●●●● ●●●●●●●●●●●●● ● ●●●●●●●●●● −2 −1 0 1 2 Wednesday, 7 October 2009
  • 25. abs(lratio) reorder(name,lratio,na.rm=T) Susan Linda Karen Lisa Barbara Sandra Donna Patricia Amanda Jennifer Nancy Melissa Jessica Sharon Michelle Betty Mary Dorothy Virginia Helen Margaret Ruth Elizabeth Sarah Anna Alice Mildred Emma Marie Martha Lillian Bertha Clara Grace Minnie Edna Annie Kimberly Edith Ethel Florence Rose Louise Irene Doris Julia Frances Carol Ashley Shirley Willie Jerry Ryan Joe Louis Anthony Daniel Eric Joshua Jason Fred Henry Jack Christopher Kevin George Matthew Arthur Walter Harold Kenneth Brian Michael Paul Albert Charles Frank Joseph James Harry Robert John David Donald Thomas Edward William Richard Larry Mark Ronald ● ●● ● ●● ● ●● ●● ●●●● ●●●●●●●●●● ● ●●●● ● ● ●●● ● ●●● ●●●●● ●● ● ●●●● ● ●● ●●● ●●● ●● ●●●●● ●●●● ● ●●●●● ● ●●●●● ● ●●● ●●●●● ●● ●●●●●●●● ● ●● ● ●●●●●● ●●●● ●●●●● ● ● ●● ●● ●● ● ● ●●●●●●● ● ● ● ● ●● ●● ●●●●●● ● ● ●● ●●● ●●●●●● ● ● ● ●● ●● ●● ●● ●●●●● ● ●●●●● ●● ●●●● ● ● ●●●●●● ●●●● ● ●● ●● ●● ●● ● ●●●●●●●●●● ●●● ● ● ● ●●●● ●● ● ●●●● ● ●●● ●●●●●●●● ●●●●● ●●● ●● ●●● ●●●● ●● ●●●● ●● ● ● ●● ● ●●● ●● ●● ●●●●● ●●●●●●●●●● ●●●●● ● ● ●●●●●●●●●●●●● ●● ●●●● ● ● ● ● ● ●● ●●●● ●● ● ● ●●●● ●●●● ●●●● ●● ●●● ●●●● ● ● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●● ● ● ●●● ●●●●●●●● ●●●●● ●●●●●●●●●● ● ●● ● ● ●● ●● ● ●● ●● ● ● ● ●● ● ●●●● ●●●●●●●●●● ●●●●● ●● ●● ● ●● ●● ●●● ●●● ●● ●●●●● ●●● ● ●●● ●● ●● ●● ●●●● ●●●● ●●● ●●● ●● ●● ● ●● ●● ●●●●●●●●●●●● ●●● ●● ●●● ●● ● ●● ●● ●● ●●● ● ●●● ● ●●●● ●● ●● ● ●●● ●● ● ● ● ●● ●●●●● ●●●●●●●● ●●●● ●●●● ●●●● ●●● ●●● ●●● ● ●●●● ●●●● ●● ●● ● ● ●● ●● ● ●● ●●● ●●●●● ●●●●● ●● ● ●● ●● ● ●●● ●●● ●● ● ●●● ●● ●●●● ● ●● ●●●● ● ●●● ● ●●●● ●●● ●●●● ●●● ●● ●●●●●●●●● ●●●● ●● ● ●● ●● ●●● ● ●● ●●●●● ●●●● ●●●●● ● ●●● ● ●● ●● ●●●● ● ●● ●● ●● ●● ●● ● ● ● ●●●●●● ● ●●●● ●●● ●● ●●● ● ● ●●● ●●●● ● ●● ● ● ●●●●●●● ● ● ●● ●●● ●● ● ●● ● ●● ●● ●●●●●●●● ●● ●●●● ●●● ●● ● ●●● ●● ●●● ●●●● ●●● ●●●● ● ●● ●● ●●●●● ● ●●●●● ●● ● ● ● ●● ●● ● ●●● ●● ●● ●●●●● ●●● ●● ●●●●●● ●● ●● ●●● ●●● ●● ●●●● ●● ● ●● ● ●● ●●● ●●●● ●●● ●● ●● ●●● ●●●● ●●● ● ● ●●● ●● ●●●●● ●●●● ●●● ●●● ●●●●●● ●●● ●● ●●●● ● ●●●● ● ●●●● ●●● ●● ●●●● ●●●●●● ● ●● ●●● ●● ●●●● ●● ●● ● ●●●● ●●● ●●● ●●● ●●● ●●●● ● ● ● ●●●● ●● ●● ●●●●●● ●● ●●● ●● ●● ●● ●●● ●●●●●●●●●● ● ● ● ● ● ● ● ● ● ●● ●●●● ●●●●●●●●●●●● ●● ●● ● ●●● ●●● ●●● ●● ●●● ● ●●● ●●●● ●●● ● ●● ●●● ●●●●● ● ●● ●●●● ● ●● ●●●●● ●●● ●●●●● ●●● ●● ●● ●●● ● ●● ●●● ●● ● ●● ●● ●●●● ●● ●●●● ●● ●●● ●●●● ●● ●● ●●● ●●● ●●●● ● ●● ●● ●●● ●●● ●● ●● ● ●● ●● ● ● ●●●●●●● ●●●●● ●●● ●● ●● ● ●● ●●● ●● ● ●●● ●●● ● ●● ●● ●●● ● ●●● ●●●●●● ●● ●●●●●●● ●● ● ●●●● ●●● ●● ●● ● ● ●● ● ● ●●●● ● ● ●● ●●●●● ●●● ● ●●● ●● ● ●● ● ●●● ●● ●●●●●●●●●●●● ●●●● ●● ● ●● ●●●● ●●● ● ● ● ● ●●● ●● ● ●●●●●●● ● ●●●● ●●●●● ● ●●● ●● ● ●● ● ● ● ● ● ● ● ●● ●● ●●● ●● ●● ●●● ●●●●●● ● ● ● ● ●● ● ●●● ● ●●● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ● ● ●●● ●● ●●● ●●●● ●● ● ●● ●● ●● ●● ●●● ●●●● ●● ● ● ● ● ●● ● ● ● ● ●●●●● ●● ● ● ●●● ●●● ●● ● ●● ● ● ●●●● ●●●● ●●● ●● ●●●●●●● ●●● ●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●● ●●●●● ● ●●● ●●●●● ●● ●●●●●●●● ● ● ● ●● ● ●● ●● ●●●● ●●●●●●●●●●●●● ●●● ●●● ● ● ●●● ●●●●●●●● ●●●● ●● ● ●● ●●●●●● ● ●● ●● ●● ●●●●● ●●● ●● ●●●● ● ● ●● ●●●●●●● ●●●●●●●●● ● ●●●● ●●● ●● ●●● ●● ●● ●● ●●● ●● ● ●●● ●●● ●●● ●●● ● ●●●●●●●●●●● ●●●●●●● ●●●●● ●●●●●●●●●●●● ●●● ●● ● ●● ● ●●● ●● ●●●● ●● ●● ●●●●●●●●●●●● ●● ●● ● ●● ●●●● ●●●●●●● ● ●●●● ● ●●●●●● ●●●●●●●●●●● ● ●●● ●●●●●● ●●●●●● ●● ●●●●●●● ●● ● ●● ●●●●●●●● ●●●●●● ●● ●● ● ●●●● ●●● ●●●●●●●● ● ●●● ● ●● ● ●● ●●●●● ● ●●●●● ● ● ●● ●●●●● ●● ●●●●●●●● ●● ●●●●●● ●●●● ● ●●● ●● ●●● ● ● ●● ●● ●●● ●●● ●●● ●● ●●●●●● ●●●●●●●●● ●●●●●●●● ● ● ●●●●●● ●●●●●● ●●● ●● ●● ● ●● ● ●● ●●● ● ●● ●● ●● ●●●●● ● ●● ●●●●● ●●●●●●● ●●●●●●●● ●●●●● ●●●● ● ●●● ●● ● ●● ●●●●●●● ● ●● ●●● ●●●●●●● ● ● ● ●●●●● ●● ●●●●●●●●●●●●●●●● ●●●● ●● ● ●●● ●● ●●●●●●● ● ●●●●● ●●●●●●●●●●● ● ●● ●●●●● ●●● ●● ●●●●●●●● ●●● ●●●● ● ● ● ●● ●●●●● ●●●●● ●●●●●●●●● ●● ●●●● ● ● ● ●●● ● ●●● ●●●●●●●●●● ●●●●●●●●●●●●●●●●● ●●● ●●●● ● ● ●●●●● ●●● ●●● ●● ●● ●●●●●● ●● ● ● ●●●● ●●●● ●● ● ●●● ● ●● ●●● ●●● ●●● ●●● ●● ●●●● ● ●●● ● ●●●●● ●● ●● ●●●●●● ●●●●●●●●●● ●●●●● ●● ●● ●● ● ●●● ●●●●●● ●● ●●● ●●● ●●●●●● ●● ● ●●●●●●●●●●●●● ● ●●●● ●● ● ● ●● ●● ●●● ●●●● ●● ●●●● ●●● ●●● ●●●●●● ●● ●● ●●●●●●●● ●● ●● ●●●●● ●● ●● ●● ●●● ●●●● ●●● ●●● ●●● ● ●● ● ●● ●● ●●● ●●●● ●●●●●●●●● ●●● ● ● ● ●● ●●●● ●●●● ●●●●● ●●●● ●●●●●●● ●●●●●●● ● ●● ●●● ●● ●● ●●● ●●●● ●● ●●●● ●● ●●●● ●●●●●●●●●●● ●● ●●●●●●●●●● ● ●● ● ●●●●●●●●● ●● ●●●●●●● ●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●● ●●●●● ● ●● ●● ●● ● ●● ●●● ●●● ● ●● ● ●●● ●● ● ● ●●●● ●● ●●● ●● ● ●●● ●●●●● ●●●●●●●●●●●●●●●●● ●● ●● ●●●●●● ●●● ●● ●●●●●●●● ●● ●●●●●● ●●●●●●●●●●●●●●●●●●● ●●●●●●● ● ● ●●● ●●●● ●●● ●●●●● ●●● ● ●● ●●●●●● ●●● ●●●●●●● ●● ●●●●●●●●●●● ● ● ●● ●●●●● ●●● ● ●● ● ●●●●●● ●●●●●●●●●● ●●●●●●●●●●●●●●●●●● ●●● ●●● ● ● ●●● ●●●●● ●●●● ● ●● ●● ●●●● ● ●● ●● ●● ●●●● ●●●●● ●●● ●●●●●●●●●●●●●●●●●● ●●● ●●●●● ● ●●●●●●●● ● ●● ●● ●●●●● ●● ●● ●●● ●● ●● ●●●● ●● ●●●●●● ●●●● ●●●●●● ● ●●● ●●●● ●● ●●● ●●● ●●●● ●● ●●● ● ●●● ●●●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●● ● ●●●●●●● ●●●●●●● ●● ● ●●● ● ●● ●● ●● ●● ●●●● ●● ●● ● ●●●● ●● ●●● ●● ●●●●● ●●●●● ●●●●●●● ●●●●●●●●● ● ●● ● ●●●● ●●● ●●● ●●● ●● ●● ●● ●●●●●●●●● ●● ●●●●●●●●● ●●●● ●●●● ●●● ●● ●● ● ● ●●●●●●●●● ●● ●●●● ●●● ●●●● ● ●● ●●●● ● ●●● ●●● ●●●●●●● ●●●●●●●●●●●● ● ●●●● ●● ●●●●●●●● ● ●●●● ●●●●●●●●● ● ●●● ●●● ●●● ● 0.5 1.0 1.5 2.0 2.5 Wednesday, 7 October 2009
  • 26. theme_set(theme_grey(10)) qplot(year, lratio, data = bysex, group = name, geom = "line") qplot(lratio, reorder(name, lratio, na.rm = T), data = bysex) qplot(abs(lratio), reorder(name, lratio, na.rm = T), data = bysex) qplot(abs(lratio), reorder(name, lratio, na.rm = T), data = bysex) + geom_point(data = both_sexes, colour = "red") Wednesday, 7 October 2009
  • 27. year lratio −2 −1 0 1 2 1880 1900 1920 1940 1960 1980 2000 What characteristics of each name might we want to use to classify them into dual-sex with sex-errors? Wednesday, 7 October 2009
  • 28. Your turn Compute the mean and range of lratio for each name. Plot and come up with cutoffs that you think separate the two groups. Wednesday, 7 October 2009
  • 29. rng <- ddply(bysex, "name", summarise, diff = diff(range(lratio, na.rm = T)), mean = mean(lratio, na.rm = T) ) qplot(diff, abs(mean), data = rng) qplot(diff, abs(mean), data = rng, colour = abs(mean) < 1.75 | diff > 0.9) shared_names <- subset(rng, abs(mean) < 1.75 | diff > 0.9)$name qplot(abs(lratio), reorder(name, lratio, na.rm=T), data = subset(bysex, name %in% shared_names)) qplot(year, lratio, geom = "line", group = name, data = subset(bysex, name %in% shared_names)) Wednesday, 7 October 2009
  • 30. Now that we’ve separated the two groups, we’ll explore each in more detail. Next time Wednesday, 7 October 2009