4. Fair Good Very Good
6000
5000
4000
3000
2000
1000
0
count
Premium Ideal
6000
5000
4000
3000
2000
1000
0
0 5000 10000 15000 0 5000 10000 15000 0 5000 10000 15000
price
qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)
Tuesday, 31 August 2010
5. Fair Good Very Good
6000
5000
4000
3000
2000
1000
0
count
Premium Ideal
6000
What makes it
5000
difficult to
compare the
4000
distributions?
3000
2000
Brainstorm for 1
minute.
1000
0
0 5000 10000 15000 0 5000 10000 15000 0 5000 10000 15000
price
qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)
Tuesday, 31 August 2010
6. Problems
Each histogram far away from the others,
but we know stacking is hard to read →
use another way of displaying densities
Varying relative abundance makes
comparisons difficult → rescale to ensure
constant area
Tuesday, 31 August 2010
7. # Large distances make comparisons hard
qplot(price, data = diamonds, binwidth = 500) +
facet_wrap(~ cut)
# Stacked heights hard to compare
qplot(price, data = diamonds, binwidth = 500, fill = cut)
# Much better - but still have differing relative abundance
qplot(price, data = diamonds, binwidth = 500,
geom = "freqpoly", colour = cut)
# Instead of displaying count on y-axis, display density
# .. indicates that variable isn't in original data
qplot(price, ..density.., data = diamonds, binwidth = 500,
geom = "freqpoly", colour = cut)
# To use with histogram, you need to be explicit
qplot(price, ..density.., data = diamonds, binwidth = 500,
geom = "histogram") + facet_wrap(~ cut)
Tuesday, 31 August 2010
8. Your turn
Use this technique to explore the
relationship between price and clarity,
and carat and clarity.
Tuesday, 31 August 2010
10. Idea ggplot
Small points shape = I(".")
Transparency alpha = I(1/50)
Jittering geom = "jitter"
Smooth curve geom = "smooth"
geom = "bin2d" or
2d bins geom = "hex"
Density contours geom = "density2d"
Tuesday, 31 August 2010
11. # There are two ways to add additional geoms
# 1) A vector of geom names:
qplot(price, carat, data = diamonds,
geom = c("point", "smooth"))
# 2) Add on extra geoms
qplot(price, carat, data = diamonds) + geom_smooth()
# This how you get help about a specific geom:
?geom_smooth
# or go to http://had.co.nz/ggplot2/geom_smooth.html
Tuesday, 31 August 2010
12. # To set aesthetics to a particular value, you need
# to wrap that value in I()
qplot(price, carat, data = diamonds, colour = "blue")
qplot(price, carat, data = diamonds, colour = I("blue"))
# Practical application: varying alpha
qplot(price, carat, data = diamonds, alpha = I(1/10))
qplot(price, carat, data = diamonds, alpha = I(1/50))
qplot(price, carat, data = diamonds, alpha = I(1/100))
qplot(price, carat, data = diamonds, alpha = I(1/250))
Tuesday, 31 August 2010
13. Your turn
Explore the relationship between carat,
price and clarity, using these techniques.
Which did you find most useful?
Tuesday, 31 August 2010
15. Motivation
Look at histograms and scatterplots of x,
y, z from the diamonds dataset
Which values are clearly incorrect? Which
values might we be able to correct?
(Remember measurements are in millimetres,
1 inch = 25 mm)
Tuesday, 31 August 2010
16. Plots
qplot(x, data = diamonds, binwidth = 0.1)
qplot(y, data = diamonds, binwidth = 0.1)
qplot(z, data = diamonds, binwidth = 0.1)
qplot(x, y, data = diamonds)
qplot(x, z, data = diamonds)
qplot(y, z, data = diamonds)
Tuesday, 31 August 2010
17. Modifying data
To modify, must first know how to extract,
or subset. Many different methods
available in R. We’ll start with most
explicit then learn some shortcuts next
time.
Basic structure:
df$varname
df[row index, column index]
Tuesday, 31 August 2010
18. $
Remember str(diamonds) ?
That hints at how to extract individual
variables:
diamonds$carat
diamonds$price
Tuesday, 31 August 2010
19. blank include all
integer +ve: include
-ve: exclude
logical include TRUEs
character lookup by name
Tuesday, 31 August 2010
24. # The most complicated to understand, but
# the most powerful. Lets you extract a
# subset defined by some characteristic of
# the data
x_big <- diamonds$x > 10
head(x_big)
sum(x_big)
mean(x_big)
table(x_big)
diamonds$x[x_big]
diamonds[x_big, ]
Tuesday, 31 August 2010
25. small <- diamonds[diamonds$carat < 1, ]
lowqual <- diamonds[diamonds$clarity
%in% c("I1", "SI2", "SI1"), ]
# Comparison functions:
# < > <= >= != == %in%
a
# Boolean operators: & | ! b
small <- diamonds$carat < 1 & a | b
diamonds$price > 500 a & b
lowqual <- diamonds$colour == "D" | a & !b
diamonds$cut == "Fair" xor(a, b)
Tuesday, 31 August 2010
26. Useful table(zeros)
functions for sum(zeros)
logical vectors mean(zeros)
TRUE = 1; FALSE = 0
Tuesday, 31 August 2010
27. Your turn
Select the diamonds that have:
Equal x and y dimensions.
Depth between 55 and 70.
Carat smaller than the mean.
Cost more than $10,000 per carat.
Are of good quality or better.
Tuesday, 31 August 2010
28. Saving results
# Prints to screen
diamonds[diamonds$x > 10, ]
# Saves to new data frame
big <- diamonds[diamonds$x > 10, ]
# Overwrites existing data frame. Dangerous!
diamonds <- diamonds[diamonds$x < 10,]
Tuesday, 31 August 2010
30. Your turn
Create a logical vector that selects
diamonds with equal x & y. Create a new
dataset that only contains these values.
Create a logical vector that selects
diamonds with incorrect/unusual x, y, or z
values. Create a new dataset that omits
these values. (Hint: do this one variable
at a time)
Tuesday, 31 August 2010