SlideShare ist ein Scribd-Unternehmen logo
1 von 31
Downloaden Sie, um offline zu lesen
Stat405
     Graphical extensions & missing values



                          Hadley Wickham
Tuesday, 31 August 2010
1. Graphical extensions (1d & 2d)
                2. Subsetting




Tuesday, 31 August 2010
1d
                 extensions
Tuesday, 31 August 2010
Fair                    Good                 Very Good

         6000

         5000

         4000

         3000

         2000

         1000

            0
 count




                                 Premium                   Ideal

         6000

         5000

         4000

         3000

         2000

         1000

            0
                0         5000    10000 15000   0   5000   10000 15000   0   5000   10000 15000
                                price
qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)
Tuesday, 31 August 2010
Fair                    Good                 Very Good

         6000

         5000

         4000

         3000

         2000

         1000

            0
 count




                                 Premium                   Ideal

         6000
                                                                         What makes it
         5000
                                                                         difficult to
                                                                         compare the
         4000
                                                                         distributions?
         3000

         2000
                                                                         Brainstorm for 1
                                                                         minute.
         1000

            0
                0         5000    10000 15000   0   5000   10000 15000   0   5000   10000 15000
                                price
qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)
Tuesday, 31 August 2010
Problems

                    Each histogram far away from the others,
                    but we know stacking is hard to read →
                    use another way of displaying densities
                    Varying relative abundance makes
                    comparisons difficult → rescale to ensure
                    constant area



Tuesday, 31 August 2010
# Large distances make comparisons hard
     qplot(price, data = diamonds, binwidth = 500) +
       facet_wrap(~ cut)

     # Stacked heights hard to compare
     qplot(price, data = diamonds, binwidth = 500, fill = cut)

     # Much better - but still have differing relative abundance
     qplot(price, data = diamonds, binwidth = 500,
       geom = "freqpoly", colour = cut)

     # Instead of displaying count on y-axis, display density
     # .. indicates that variable isn't in original data
     qplot(price, ..density.., data = diamonds, binwidth = 500,
       geom = "freqpoly", colour = cut)

     # To use with histogram, you need to be explicit
     qplot(price, ..density.., data = diamonds, binwidth = 500,
       geom = "histogram") + facet_wrap(~ cut)

Tuesday, 31 August 2010
Your turn


                    Use this technique to explore the
                    relationship between price and clarity,
                    and carat and clarity.




Tuesday, 31 August 2010
2d
                 extensions
Tuesday, 31 August 2010
Idea             ggplot
                     Small points       shape = I(".")

                   Transparency        alpha = I(1/50)

                          Jittering    geom = "jitter"

                  Smooth curve         geom = "smooth"
                                       geom = "bin2d" or
                          2d bins         geom = "hex"

             Density contours         geom = "density2d"
Tuesday, 31 August 2010
# There are two ways to add additional geoms
     # 1) A vector of geom names:
     qplot(price, carat, data = diamonds,
       geom = c("point", "smooth"))

     # 2) Add on extra geoms
     qplot(price, carat, data = diamonds) + geom_smooth()

     # This how you get help about a specific geom:
     ?geom_smooth
     # or go to http://had.co.nz/ggplot2/geom_smooth.html



Tuesday, 31 August 2010
# To set aesthetics to a particular value, you need
     # to wrap that value in I()

     qplot(price, carat, data = diamonds, colour = "blue")
     qplot(price, carat, data = diamonds, colour = I("blue"))

     # Practical application:   varying alpha
     qplot(price, carat, data   = diamonds, alpha   =   I(1/10))
     qplot(price, carat, data   = diamonds, alpha   =   I(1/50))
     qplot(price, carat, data   = diamonds, alpha   =   I(1/100))
     qplot(price, carat, data   = diamonds, alpha   =   I(1/250))




Tuesday, 31 August 2010
Your turn

                    Explore the relationship between carat,
                    price and clarity, using these techniques.
                    Which did you find most useful?




Tuesday, 31 August 2010
Subsetting

Tuesday, 31 August 2010
Motivation

                    Look at histograms and scatterplots of x,
                    y, z from the diamonds dataset
                    Which values are clearly incorrect? Which
                    values might we be able to correct?
                    (Remember measurements are in millimetres,
                    1 inch = 25 mm)




Tuesday, 31 August 2010
Plots

                qplot(x,   data = diamonds, binwidth = 0.1)
                qplot(y,   data = diamonds, binwidth = 0.1)
                qplot(z,   data = diamonds, binwidth = 0.1)
                qplot(x,   y, data = diamonds)
                qplot(x,   z, data = diamonds)
                qplot(y,   z, data = diamonds)




Tuesday, 31 August 2010
Modifying data
                    To modify, must first know how to extract,
                    or subset. Many different methods
                    available in R. We’ll start with most
                    explicit then learn some shortcuts next
                    time.
                    Basic structure:
                    df$varname
                    df[row index, column index]


Tuesday, 31 August 2010
$

                    Remember str(diamonds) ?
                    That hints at how to extract individual
                    variables:
                    diamonds$carat
                    diamonds$price



Tuesday, 31 August 2010
blank     include all


                          integer   +ve: include
                                    -ve: exclude

                          logical   include TRUEs


                          character lookup by name


Tuesday, 31 August 2010
Integer subsetting



Tuesday, 31 August 2010
# Nothing
     str(diamonds[, ])

     # Positive integers & nothing
     diamonds[1:6, ] # same as head(diamonds)
     diamonds[, 1:4] # watch out!

     # Two positive integers in rows & columns
     diamonds[1:10, 1:4]

     # Repeating input repeats output
     diamonds[c(1,1,1,2,2), 1:4]

     # Negative integers drop values
     diamonds[-(1:53900), -1]


Tuesday, 31 August 2010
# Useful technique: Order by one or more columns
     diamonds <- diamonds[order(diamonds$price), ]

     # Useful technique: Combine two tables
     carats <- data.frame(table(carat = diamonds$carat))
     mtch <- match(diamonds$carat, carats$carat)
     diamonds$carat_count <- carats$Freq[mtch]




Tuesday, 31 August 2010
Logical subsetting



Tuesday, 31 August 2010
# The most complicated to understand, but
     # the most powerful. Lets you extract a
     # subset defined by some characteristic of
     # the data
     x_big <- diamonds$x > 10

     head(x_big)
     sum(x_big)
     mean(x_big)
     table(x_big)

     diamonds$x[x_big]
     diamonds[x_big, ]


Tuesday, 31 August 2010
small <- diamonds[diamonds$carat < 1, ]
     lowqual <- diamonds[diamonds$clarity
       %in% c("I1", "SI2", "SI1"), ]

     # Comparison functions:
     # < > <= >= != == %in%
                                                  a
     # Boolean operators: & | !                   b
     small <- diamonds$carat < 1 &              a | b
       diamonds$price > 500                     a & b
     lowqual <- diamonds$colour == "D" |        a & !b
       diamonds$cut == "Fair"                  xor(a, b)




Tuesday, 31 August 2010
Useful                table(zeros)
         functions for                sum(zeros)
       logical vectors                mean(zeros)
                TRUE = 1; FALSE = 0



Tuesday, 31 August 2010
Your turn
                    Select the diamonds that have:
                    Equal x and y dimensions.
                    Depth between 55 and 70.
                    Carat smaller than the mean.
                    Cost more than $10,000 per carat.
                    Are of good quality or better.


Tuesday, 31 August 2010
Saving results
                # Prints to screen
                diamonds[diamonds$x > 10, ]

                # Saves to new data frame
                big <- diamonds[diamonds$x > 10, ]

                # Overwrites existing data frame. Dangerous!
                diamonds <- diamonds[diamonds$x < 10,]



Tuesday, 31 August 2010
diamonds <- diamonds[1, 1]
     diamonds

     # Uh oh!

     rm(diamonds)
     str(diamonds)

     # Phew!




Tuesday, 31 August 2010
Your turn
                    Create a logical vector that selects
                    diamonds with equal x & y. Create a new
                    dataset that only contains these values.
                    Create a logical vector that selects
                    diamonds with incorrect/unusual x, y, or z
                    values. Create a new dataset that omits
                    these values. (Hint: do this one variable
                    at a time)


Tuesday, 31 August 2010
equal_dim <- diamonds$x == diamonds$y
     equal <- diamonds[equal_dim, ]

     y_big <- diamonds$y > 10
     z_big <- diamonds$z > 6

     x_zero <- diamonds$x == 0
     y_zero <- diamonds$y == 0
     z_zero <- diamonds$z == 0
     zeros <- x_zero | y_zero | z_zero

     bad <- y_big | z_big | zeros
     good <- diamonds[!bad, ]


Tuesday, 31 August 2010

Weitere ähnliche Inhalte

Was ist angesagt?

Datamining 6th svm
Datamining 6th svmDatamining 6th svm
Datamining 6th svmsesejun
 
Vibrational Rotational Spectrum of HCl and DCl
Vibrational Rotational Spectrum of HCl and DClVibrational Rotational Spectrum of HCl and DCl
Vibrational Rotational Spectrum of HCl and DClTianna Drew
 
Finite Element Analysis Made Easy Lr
Finite Element Analysis Made Easy LrFinite Element Analysis Made Easy Lr
Finite Element Analysis Made Easy Lrguesta32562
 
Admissions in india 2015
Admissions in india 2015Admissions in india 2015
Admissions in india 2015Edhole.com
 
Huang_presentation.pdf
Huang_presentation.pdfHuang_presentation.pdf
Huang_presentation.pdfgrssieee
 
4831603 physics-formula-list-form-4
4831603 physics-formula-list-form-44831603 physics-formula-list-form-4
4831603 physics-formula-list-form-4hongtee82
 
Datamining 6th Svm
Datamining 6th SvmDatamining 6th Svm
Datamining 6th Svmsesejun
 

Was ist angesagt? (12)

Datamining 6th svm
Datamining 6th svmDatamining 6th svm
Datamining 6th svm
 
Vibrational Rotational Spectrum of HCl and DCl
Vibrational Rotational Spectrum of HCl and DClVibrational Rotational Spectrum of HCl and DCl
Vibrational Rotational Spectrum of HCl and DCl
 
SSA slides
SSA slidesSSA slides
SSA slides
 
Finite Element Analysis Made Easy Lr
Finite Element Analysis Made Easy LrFinite Element Analysis Made Easy Lr
Finite Element Analysis Made Easy Lr
 
Table 7a,7b kft 131
Table 7a,7b  kft 131Table 7a,7b  kft 131
Table 7a,7b kft 131
 
Cost
CostCost
Cost
 
Image denoising
Image denoisingImage denoising
Image denoising
 
Admissions in india 2015
Admissions in india 2015Admissions in india 2015
Admissions in india 2015
 
Huang_presentation.pdf
Huang_presentation.pdfHuang_presentation.pdf
Huang_presentation.pdf
 
4831603 physics-formula-list-form-4
4831603 physics-formula-list-form-44831603 physics-formula-list-form-4
4831603 physics-formula-list-form-4
 
Datamining 6th Svm
Datamining 6th SvmDatamining 6th Svm
Datamining 6th Svm
 
Math IA
Math IAMath IA
Math IA
 

Andere mochten auch

Andere mochten auch (6)

16 Git
16 Git16 Git
16 Git
 
Yet another object system for R
Yet another object system for RYet another object system for R
Yet another object system for R
 
13 case-study
13 case-study13 case-study
13 case-study
 
07 Problem Solving
07 Problem Solving07 Problem Solving
07 Problem Solving
 
05 subsetting
05 subsetting05 subsetting
05 subsetting
 
06 Data
06 Data06 Data
06 Data
 

Ähnlich wie 03 extensions

Startup Cofounder Wanted (And Awesome Geeky Puzzles)
Startup Cofounder Wanted (And Awesome Geeky Puzzles)Startup Cofounder Wanted (And Awesome Geeky Puzzles)
Startup Cofounder Wanted (And Awesome Geeky Puzzles)YourFutureCofounder
 
Sequential Selection of Correlated Ads by POMDPs
Sequential Selection of Correlated Ads by POMDPsSequential Selection of Correlated Ads by POMDPs
Sequential Selection of Correlated Ads by POMDPsShuai Yuan
 
Q plot tutorial
Q plot tutorialQ plot tutorial
Q plot tutorialAbhik Seal
 
Introduction to Raphaël
Introduction to RaphaëlIntroduction to Raphaël
Introduction to RaphaëlSencha
 

Ähnlich wie 03 extensions (10)

24 modelling
24 modelling24 modelling
24 modelling
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
Startup Cofounder Wanted (And Awesome Geeky Puzzles)
Startup Cofounder Wanted (And Awesome Geeky Puzzles)Startup Cofounder Wanted (And Awesome Geeky Puzzles)
Startup Cofounder Wanted (And Awesome Geeky Puzzles)
 
Sequential Selection of Correlated Ads by POMDPs
Sequential Selection of Correlated Ads by POMDPsSequential Selection of Correlated Ads by POMDPs
Sequential Selection of Correlated Ads by POMDPs
 
11 Data Structures
11 Data Structures11 Data Structures
11 Data Structures
 
Q plot tutorial
Q plot tutorialQ plot tutorial
Q plot tutorial
 
RBootcamp Day 4
RBootcamp Day 4RBootcamp Day 4
RBootcamp Day 4
 
Introduction to Raphaël
Introduction to RaphaëlIntroduction to Raphaël
Introduction to Raphaël
 
Clustering
ClusteringClustering
Clustering
 

Mehr von Hadley Wickham (20)

27 development
27 development27 development
27 development
 
27 development
27 development27 development
27 development
 
23 data-structures
23 data-structures23 data-structures
23 data-structures
 
Graphical inference
Graphical inferenceGraphical inference
Graphical inference
 
R packages
R packagesR packages
R packages
 
22 spam
22 spam22 spam
22 spam
 
21 spam
21 spam21 spam
21 spam
 
20 date-times
20 date-times20 date-times
20 date-times
 
19 tables
19 tables19 tables
19 tables
 
18 cleaning
18 cleaning18 cleaning
18 cleaning
 
17 polishing
17 polishing17 polishing
17 polishing
 
16 critique
16 critique16 critique
16 critique
 
15 time-space
15 time-space15 time-space
15 time-space
 
14 case-study
14 case-study14 case-study
14 case-study
 
12 adv-manip
12 adv-manip12 adv-manip
12 adv-manip
 
10 simulation
10 simulation10 simulation
10 simulation
 
10 simulation
10 simulation10 simulation
10 simulation
 
09 bootstrapping
09 bootstrapping09 bootstrapping
09 bootstrapping
 
08 functions
08 functions08 functions
08 functions
 
07 problem-solving
07 problem-solving07 problem-solving
07 problem-solving
 

03 extensions

  • 1. Stat405 Graphical extensions & missing values Hadley Wickham Tuesday, 31 August 2010
  • 2. 1. Graphical extensions (1d & 2d) 2. Subsetting Tuesday, 31 August 2010
  • 3. 1d extensions Tuesday, 31 August 2010
  • 4. Fair Good Very Good 6000 5000 4000 3000 2000 1000 0 count Premium Ideal 6000 5000 4000 3000 2000 1000 0 0 5000 10000 15000 0 5000 10000 15000 0 5000 10000 15000 price qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut) Tuesday, 31 August 2010
  • 5. Fair Good Very Good 6000 5000 4000 3000 2000 1000 0 count Premium Ideal 6000 What makes it 5000 difficult to compare the 4000 distributions? 3000 2000 Brainstorm for 1 minute. 1000 0 0 5000 10000 15000 0 5000 10000 15000 0 5000 10000 15000 price qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut) Tuesday, 31 August 2010
  • 6. Problems Each histogram far away from the others, but we know stacking is hard to read → use another way of displaying densities Varying relative abundance makes comparisons difficult → rescale to ensure constant area Tuesday, 31 August 2010
  • 7. # Large distances make comparisons hard qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut) # Stacked heights hard to compare qplot(price, data = diamonds, binwidth = 500, fill = cut) # Much better - but still have differing relative abundance qplot(price, data = diamonds, binwidth = 500, geom = "freqpoly", colour = cut) # Instead of displaying count on y-axis, display density # .. indicates that variable isn't in original data qplot(price, ..density.., data = diamonds, binwidth = 500, geom = "freqpoly", colour = cut) # To use with histogram, you need to be explicit qplot(price, ..density.., data = diamonds, binwidth = 500, geom = "histogram") + facet_wrap(~ cut) Tuesday, 31 August 2010
  • 8. Your turn Use this technique to explore the relationship between price and clarity, and carat and clarity. Tuesday, 31 August 2010
  • 9. 2d extensions Tuesday, 31 August 2010
  • 10. Idea ggplot Small points shape = I(".") Transparency alpha = I(1/50) Jittering geom = "jitter" Smooth curve geom = "smooth" geom = "bin2d" or 2d bins geom = "hex" Density contours geom = "density2d" Tuesday, 31 August 2010
  • 11. # There are two ways to add additional geoms # 1) A vector of geom names: qplot(price, carat, data = diamonds, geom = c("point", "smooth")) # 2) Add on extra geoms qplot(price, carat, data = diamonds) + geom_smooth() # This how you get help about a specific geom: ?geom_smooth # or go to http://had.co.nz/ggplot2/geom_smooth.html Tuesday, 31 August 2010
  • 12. # To set aesthetics to a particular value, you need # to wrap that value in I() qplot(price, carat, data = diamonds, colour = "blue") qplot(price, carat, data = diamonds, colour = I("blue")) # Practical application: varying alpha qplot(price, carat, data = diamonds, alpha = I(1/10)) qplot(price, carat, data = diamonds, alpha = I(1/50)) qplot(price, carat, data = diamonds, alpha = I(1/100)) qplot(price, carat, data = diamonds, alpha = I(1/250)) Tuesday, 31 August 2010
  • 13. Your turn Explore the relationship between carat, price and clarity, using these techniques. Which did you find most useful? Tuesday, 31 August 2010
  • 15. Motivation Look at histograms and scatterplots of x, y, z from the diamonds dataset Which values are clearly incorrect? Which values might we be able to correct? (Remember measurements are in millimetres, 1 inch = 25 mm) Tuesday, 31 August 2010
  • 16. Plots qplot(x, data = diamonds, binwidth = 0.1) qplot(y, data = diamonds, binwidth = 0.1) qplot(z, data = diamonds, binwidth = 0.1) qplot(x, y, data = diamonds) qplot(x, z, data = diamonds) qplot(y, z, data = diamonds) Tuesday, 31 August 2010
  • 17. Modifying data To modify, must first know how to extract, or subset. Many different methods available in R. We’ll start with most explicit then learn some shortcuts next time. Basic structure: df$varname df[row index, column index] Tuesday, 31 August 2010
  • 18. $ Remember str(diamonds) ? That hints at how to extract individual variables: diamonds$carat diamonds$price Tuesday, 31 August 2010
  • 19. blank include all integer +ve: include -ve: exclude logical include TRUEs character lookup by name Tuesday, 31 August 2010
  • 21. # Nothing str(diamonds[, ]) # Positive integers & nothing diamonds[1:6, ] # same as head(diamonds) diamonds[, 1:4] # watch out! # Two positive integers in rows & columns diamonds[1:10, 1:4] # Repeating input repeats output diamonds[c(1,1,1,2,2), 1:4] # Negative integers drop values diamonds[-(1:53900), -1] Tuesday, 31 August 2010
  • 22. # Useful technique: Order by one or more columns diamonds <- diamonds[order(diamonds$price), ] # Useful technique: Combine two tables carats <- data.frame(table(carat = diamonds$carat)) mtch <- match(diamonds$carat, carats$carat) diamonds$carat_count <- carats$Freq[mtch] Tuesday, 31 August 2010
  • 24. # The most complicated to understand, but # the most powerful. Lets you extract a # subset defined by some characteristic of # the data x_big <- diamonds$x > 10 head(x_big) sum(x_big) mean(x_big) table(x_big) diamonds$x[x_big] diamonds[x_big, ] Tuesday, 31 August 2010
  • 25. small <- diamonds[diamonds$carat < 1, ] lowqual <- diamonds[diamonds$clarity %in% c("I1", "SI2", "SI1"), ] # Comparison functions: # < > <= >= != == %in% a # Boolean operators: & | ! b small <- diamonds$carat < 1 & a | b diamonds$price > 500 a & b lowqual <- diamonds$colour == "D" | a & !b diamonds$cut == "Fair" xor(a, b) Tuesday, 31 August 2010
  • 26. Useful table(zeros) functions for sum(zeros) logical vectors mean(zeros) TRUE = 1; FALSE = 0 Tuesday, 31 August 2010
  • 27. Your turn Select the diamonds that have: Equal x and y dimensions. Depth between 55 and 70. Carat smaller than the mean. Cost more than $10,000 per carat. Are of good quality or better. Tuesday, 31 August 2010
  • 28. Saving results # Prints to screen diamonds[diamonds$x > 10, ] # Saves to new data frame big <- diamonds[diamonds$x > 10, ] # Overwrites existing data frame. Dangerous! diamonds <- diamonds[diamonds$x < 10,] Tuesday, 31 August 2010
  • 29. diamonds <- diamonds[1, 1] diamonds # Uh oh! rm(diamonds) str(diamonds) # Phew! Tuesday, 31 August 2010
  • 30. Your turn Create a logical vector that selects diamonds with equal x & y. Create a new dataset that only contains these values. Create a logical vector that selects diamonds with incorrect/unusual x, y, or z values. Create a new dataset that omits these values. (Hint: do this one variable at a time) Tuesday, 31 August 2010
  • 31. equal_dim <- diamonds$x == diamonds$y equal <- diamonds[equal_dim, ] y_big <- diamonds$y > 10 z_big <- diamonds$z > 6 x_zero <- diamonds$x == 0 y_zero <- diamonds$y == 0 z_zero <- diamonds$z == 0 zeros <- x_zero | y_zero | z_zero bad <- y_big | z_big | zeros good <- diamonds[!bad, ] Tuesday, 31 August 2010