Stat405                 Data


                            Hadley Wickham
Monday, 14 September 2009
               1. Group work
               2. Motivating problem
               3. Loading & saving data
               4. Factors & characters




Group project
                   Want to help your groups become
                   effective teams.
                   We’ll spend 15 minutes getting you into
                   teams, and establishing expectations.
                   See handouts.
                   Final project weighting for team
                   citizenship.


Firing & Quitting
                   You may fire a non-participating team
                   member, but you need to meet with me
                   and issue a written warning.
                   If you feel that you are doing all the work
                   in your team, you may quit. You’ll also
                   need to meet with me and give a written
                   warning to the rest of your team.


State regulated payoffs: how can we be
sure they’re honest?
CC by-nc-nd: http://www.flickr.com/photos/amoleji/2979221622/

Where are we going?
                   In the next few weeks we will be
                   focussing our attention on some slot
                   machine data. We want to figure out if
                   the slot machine is paying out at the rate
                   the manufacturer claims.
                   To do this, we’ll need to learn more about
                   data formats and how to write functions.


Loading data
                   read.table(): white space separated
                   read.table(sep = "\t"): tab separated
                   read.csv(): comma separated
                   read.fwf(): fixed width
                   load(): R binary format
                   All take file argument
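
As a rough illustration of the readers listed above, the sketch below writes a tiny table to temporary files in three delimited formats and reads each one back. The file contents and column names are invented for the example and are not the course data.

```r
# Hypothetical three-row table, written in three delimited formats.
lines_ws  <- c("w1 w2 prize", "B BB 0", "DD DD 800", "C 0 2")
lines_tab <- gsub(" ", "\t", lines_ws)
lines_csv <- gsub(" ", ",", lines_ws)

f_ws  <- tempfile(fileext = ".txt")
f_tab <- tempfile(fileext = ".txt")
f_csv <- tempfile(fileext = ".csv")
writeLines(lines_ws, f_ws)
writeLines(lines_tab, f_tab)
writeLines(lines_csv, f_csv)

# header = TRUE tells read.table() that the first line holds column names;
# read.csv() assumes that by default.
d_ws  <- read.table(f_ws, header = TRUE)
d_tab <- read.table(f_tab, header = TRUE, sep = "\t")
d_csv <- read.csv(f_csv)

str(d_csv)
```

All three calls take the file path as their first argument and should return the same three-row data frame.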


Why csv?

                   Simple.
                   Compatible with all statistics software.
                   Human readable (in 20 years’ time you will
                   still be able to extract data from it).




Your turn
                   Download baseball and slots csv files from
                   website. Practice using read.csv() to
                   load into R.
                   Guess the name of the function you might
                   use to write the R object back to a csv file
                   on disk. Practice using it.
                   What happens if you read in a file you
                   wrote with this method?


     batting <- read.csv("batting.csv")
     players <- read.csv("players.csv")
     slots <- read.csv("slots.csv")

     # Round trip through write.csv(), then compare the structures
     write.csv(slots, "slots-2.csv")
     slots2 <- read.csv("slots-2.csv")
     str(slots)
     str(slots2)

     # Better: leave the row names out
     write.table(slots, file = "slots-3.csv",
       sep = ",", row.names = FALSE)
     slots3 <- read.csv("slots-3.csv")
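
To see what happens on the round trip without the course files, here is a small self-contained sketch. The data frame `slots_demo` is invented for illustration; only the behaviour of `write.csv()` and `read.csv()` is the point.

```r
# Small stand-in for the slots data (values invented for illustration).
slots_demo <- data.frame(w1 = c("B", "DD"), prize = c(0, 800))

f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")

# Default write.csv(): row names are written as an unnamed first column,
# which read.csv() then names "X".
write.csv(slots_demo, f1)
names(read.csv(f1))   # "X" "w1" "prize"

# Without row names the columns survive the round trip unchanged.
write.csv(slots_demo, f2, row.names = FALSE)
names(read.csv(f2))   # "w1" "prize"
```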


Working directory
                   Remember to set your working directory.
                   From the terminal (Linux or Mac): the
                   working directory is the directory you’re in
                   when you start R.
                   On Windows: setwd(choose.dir())
                   On the Mac: ⌘-D
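
A platform-independent way to check and change the working directory is `getwd()` and `setwd()`. The sketch below uses `tempdir()` only because it is a directory that is sure to exist; in practice you would give the folder that holds your data files.

```r
# Where is R currently looking for files?
old_dir <- getwd()

# Switch to another directory (tempdir() is just a placeholder here).
setwd(tempdir())
list.files()          # files that read.csv("name.csv") could now find

# Go back to where we started.
setwd(old_dir)
```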


Saving data

               # For long-term storage
               write.table(slots, file = "slots-3.csv",
                 sep = ",", row.names = FALSE)

               # For short-term caching
               save(slots, file = "slots.rdata")
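
The caching half of this can be sketched as follows. `slots_demo` is an invented stand-in for the result of an expensive computation, and the cache goes to a temporary file rather than `slots.rdata`.

```r
# Stand-in for the result of an expensive computation (invented values).
slots_demo <- data.frame(w1 = c("B", "DD"), prize = c(0, 800))

cache <- tempfile(fileext = ".rdata")
save(slots_demo, file = cache)

# Remove the object, then restore it from the cache.
rm(slots_demo)
restored <- load(cache)   # load() returns the names of the objects it restored
restored                  # "slots_demo"
str(slots_demo)
```

Note that `load()` puts the object back under its original name, so there is nothing to assign.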




                                   .csv                     .rdata

               Read with           read.csv()               load()
               Write with          write.table(sep = ",",   save()
                                     row.names = FALSE)
               Stores              Only data frames         Any R object
               Readable by         Any program              Only R
               Use for             Long-term storage        Short-term caching of
                                                            expensive computations

Cleaning
                   I cleaned up slots.csv for you to practice
                   with. The original data was slots.txt.
                   Your next task is to perform the
                   cleaning yourself.
                   This should always be the first step in an
                   analysis: ensure that your data is available
                   as a clean csv file. Do this once, in a
                   file called clean.r.


Your turn

                   Take two minutes to find as many
                   differences as possible between
                   slots.txt and slots.csv.
                   What did I do to clean up the file?




Cleaning

                   • Convert from space delimited to csv
                   • Add variable names
                   • Convert uninformative numbers to
                     informative labels




Variable names
                   names(slots)
                   names(slots) <- c("w1", "w2", "w3",
                     "prize", "night")
                   dput(names(slots))


                   This is a general pattern we’ll see a lot of.
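
One reading of the "general pattern" is the pairing of a function that reads a property with a replacement form, `f(x) <- value`, that sets it; `dput()` then prints an object as R code that can be pasted into a script. A minimal sketch with an invented data frame:

```r
# Columns get default names V1, V2, ... when a file has no header line.
df <- data.frame(V1 = c(1, 5), V2 = c(0, 800))

names(df)                        # get:  "V1" "V2"
names(df) <- c("w1", "prize")    # set via the replacement form
dput(names(df))                  # prints c("w1", "prize")

# The same get/set pairing exists for other properties, e.g. levels().
f <- factor(c("a", "b", "a"))
levels(f)                        # "a" "b"
levels(f) <- c("low", "high")
as.character(f)                  # "low" "high" "low"
```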


Factors
                   • R’s way of storing categorical data
                   • Have ordered levels() which:
                        • Control order on plots and in table()
                        • Are preserved across subsets
                        • Affect contrasts in linear models



         # Creating a factor
         x <- sample(5, 20, replace = TRUE)
         a <- factor(x)
         b <- factor(x, levels = 1:10)
         # levels = 1:5 keeps this valid even if a value is never drawn
         c <- factor(x, levels = 1:5, labels = letters[1:5])

         levels(a); levels(b); levels(c)
         table(a); table(b); table(c)




         # Subsets
         b2 <- b[1:5]
         levels(b2)
         table(b2)

         # Remove extra levels
         b2[, drop = TRUE]
         factor(b2)

         # Convert to character
         b3 <- as.character(b)
         table(b3)
         table(b3[1:5])
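
Because `sample()` gives different values on every run, a fixed example may make the subsetting behaviour easier to follow. The sketch below uses invented values and also shows base R's `droplevels()`, which is not on the slide but does the same job as the two idioms above.

```r
# Deterministic example: only 1 and 3 appear, but ten levels are declared.
b <- factor(c(1, 3, 3, 1), levels = 1:10)

nlevels(b)               # 10: unused levels are kept
table(b)[1:4]            # counts 2 0 2 0 for levels 1 to 4

# droplevels() discards unused levels, like factor(b) or b[, drop = TRUE].
b_small <- droplevels(b)
levels(b_small)          # "1" "3"

# A character vector has no memory of values that are absent.
table(as.character(b))   # only "1" and "3" appear
```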

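         # as.numeric() returns the integer codes, not the labels
         as.numeric(a)
         as.numeric(b)
         as.numeric(c)

         d <- factor(x, levels = 1:5, labels = 2^(1:5))
         as.numeric(d)
         as.character(d)
         # Go via character to recover the labels as numbers
         as.numeric(as.character(d))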
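
A worked version with fixed input, so the outputs can be checked by hand. The values in `x` are invented; the factor is built the same way as `d` above, with `levels = 1:5` added so the labels always line up.

```r
x <- c(2, 5, 2, 1)             # fixed values so the output is predictable
d <- factor(x, levels = 1:5, labels = 2^(1:5))

levels(d)                      # "2" "4" "8" "16" "32"
as.numeric(d)                  # 2 5 2 1   (integer codes: position in levels)
as.character(d)                # "4" "32" "4" "2"
as.numeric(as.character(d))    # 4 32 4 2  (the labels as numbers)
```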




Character vs. factor
                   Characters don’t remember all levels.
                   Tables of characters are always ordered
                   alphabetically.
                   By default, strings are converted to factors
                   when loading data frames (this default
                   changed to FALSE in R 4.0.0).
                   Use stringsAsFactors = F to turn off for
                   one data frame, or
                   options(stringsAsFactors = F)
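
Passing `stringsAsFactors` explicitly makes the result independent of the default in your R version. A minimal sketch, reading invented csv text through the `text` argument of `read.csv()` instead of a file:

```r
csv_text <- "w1,prize\nB,0\nDD,800\nB,5"

as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$w1)    # "character"
class(as_fct$w1)    # "factor"
levels(as_fct$w1)   # "B" "DD"
```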


Character vs. factor

                   Use a factor when there is a well-defined
                   set of all possible values.
                   Use a character vector when there are
                   potentially infinite possibilities.




Quiz
                   Take one minute to decide which data
                   type is most appropriate for each of the
                   following variables collected in a medical
                   experiment:
                   Subject id, name, treatment, sex,
                   address, race, eye colour, birth city, birth
                   state.


Your turn
                   Convert w1, w2 and w3 to factors with
                   labels from the table below.
                   Rearrange levels in terms of value:
                   DD, 7, BBB, BB, B, C, 0.
                   Save as a csv file.
                   Read in and look at levels.
                   Compare to input with
                   stringsAsFactors = F.

                   0 Blank (0)
                   1 Single Bar (B)
                   2 Double Bar (BB)
                   3 Triple Bar (BBB)
                   5 Double Diamond (DD)
                   6 Cherries (C)
                   7 Seven (7)

     slots <- read.table("slots.txt")
     names(slots) <- c("w1", "w2", "w3", "prize", "night")

     # Numeric codes in slots.txt and the symbols they stand for
     levels <- c(0, 1, 2, 3, 5, 6, 7)
     labels <- c("0", "B", "BB", "BBB", "DD", "C", "7")

     slots$w1 <- factor(slots$w1, levels = levels, labels = labels)
     slots$w2 <- factor(slots$w2, levels = levels, labels = labels)
     slots$w3 <- factor(slots$w3, levels = levels, labels = labels)

     write.table(slots, "slots.csv", sep = ",", row.names = FALSE)
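
The code above keeps the levels in code order. One possible way to get the value ordering the exercise asks for (DD, 7, BBB, BB, B, C, 0) is to list the codes in that order when building the factor. This is a sketch on invented codes, not the course's own solution; the vector `w` and the name `codes` are made up for the example.

```r
# Invented window codes standing in for one column of slots.txt.
w <- c(0, 5, 7, 1, 6, 5)

# Codes listed from most to least valuable symbol, with matching labels.
codes  <- c(5, 7, 3, 2, 1, 6, 0)
labels <- c("DD", "7", "BBB", "BB", "B", "C", "0")
w_f <- factor(w, levels = codes, labels = labels)

levels(w_f)         # "DD" "7" "BBB" "BB" "B" "C" "0"
as.character(w_f)   # "0" "DD" "7" "B" "C" "DD"

# A csv file stores only the labels, so the ordering is lost on re-reading.
f <- tempfile(fileext = ".csv")
write.table(data.frame(w1 = w_f), f, sep = ",", row.names = FALSE)
back <- read.csv(f, stringsAsFactors = TRUE)
levels(back$w1)     # sorted order, not the value order above
```

The last step suggests why the level order has to be reapplied after loading a csv file: only the labels are stored, not the order of the levels.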



