SlideShare a Scribd company logo
1 of 24
Download to read offline
Data manipulation in R
APAM E4990
Modeling Social Data
Jake Hofman
Columbia University
February 3, 2017
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 1 / 20
The good, the bad, & the ugly
• R isn’t the best programming language out there
• But it happens to be great for data analysis
• The result is a steep learning curve with a high payoff
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 2 / 20
For instance . . .
• You’ll see a mix of camelCase, this.that, and snake case
conventions
• Dots (.) (mostly) don’t mean anything special
• Likewise, $ gets used in funny ways
• R is loosely typed, which can lead to unexpected coercions
and silent fails
• It also tries to be clever about variable scope, which can
backfire if you’re not careful
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 3 / 20
But it will help you . . .
• Do extremely fast exploratory data analysis
• Easily generate high-quality data visualizations
• Fit and evaluate pretty much any statistical model you can
think of
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 4 / 20
But it will help you . . .
• Do extremely fast exploratory data analysis
• Easily generate high-quality data visualizations
• Fit and evaluate pretty much any statistical model you can
think of
This will change the way you do data analysis, because you’ll ask
questions you wouldn’t have bothered to otherwise
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 4 / 20
Basic types
• int, double: for numbers
• character: for strings
• factor: for categorical variables (∼ struct or ENUM)
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 5 / 20
Basic types
• int, double: for numbers
• character: for strings
• factor: for categorical variables (∼ struct or ENUM)
Factors are handy, but take some getting used to
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 5 / 20
Containers
• vector: for multiple values of the same type (∼ array)
• list: for multiple values of different types (∼ dictionary)
• data.frame: for tables of rectangular data of mixed types (∼
matrix)
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 6 / 20
Containers
• vector: for multiple values of the same type (∼ array)
• list: for multiple values of different types (∼ dictionary)
• data.frame: for tables of rectangular data of mixed types (∼
matrix)
We’ll mostly work with data frames, which themselves are lists of
vectors
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 6 / 20
The tidyverse
The tidyverse is a collection of packages that work together to
make data analysis easier:
• dplyr for split / apply / combine type counting
• ggplot2 for making plots
• tidyr for reshaping and “tidying” data
• readr for reading and writing files
• ...
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 7 / 20
Tidy data
The core philosophy is that your data should be in a “tidy” table
with:
• One observation per row
• One variable per column
• One measured value per cell
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 8 / 20
Tidy data
• Most of the work goes into getting your data into shape
• After which descriptives statistics, modeling, and visualization
are easy
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 9 / 20
dplyr: a grammar of data manipulation
dplyr implements the split / apply / combine framework
discussed in the last lecture
• Its “grammar” has five main verbs used in the “apply” phase:
• filter: restrict rows based on a condition (N → N )
• arrange: reorder rows by a variable (N → N )
• select: pick out specific columns (K → K )
• mutate: create new or change existing columns (K → K )
• summarize: collapse a column into one value (N → 1)
• The group by function creates indices to take care of the
split and combine phases
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 10 / 20
dplyr: a grammar of data manipulation
dplyr implements the split / apply / combine framework
discussed in the last lecture
• Its “grammar” has five main verbs used in the “apply” phase:
• filter: restrict rows based on a condition (N → N )
• arrange: reorder rows by a variable (N → N )
• select: pick out specific columns (K → K )
• mutate: create new or change existing columns (K → K )
• summarize: collapse a column into one value (N → 1)
• The group by function creates indices to take care of the
split and combine phases
The cost is that you have to think “functionally”, in terms of
“vectorized” operations
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 10 / 20
filter
filter(trips, start_station_name == "Broadway & E 14 St")
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 11 / 20
arrange
arrange(trips, starttime)
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 12 / 20
select
select(trips, starttime, stoptime,
start_station_name, end_station_name)
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 13 / 20
mutate
mutate(trips, time_in_min = tripduration / 60)
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 14 / 20
mutate
summarize(trips, mean_duration = mean(tripduration) / 60,
sd_duration = sd(tripduration) / 60)
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 15 / 20
group by
trips_by_gender <- group_by(trips, gender)
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 16 / 20
group by
trips_by_gender <- group_by(trips, gender)
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 17 / 20
group by + summarize
summarize(trips_by_gender,
count = n(),
mean_duration = mean(tripduration) / 60,
sd_duration = sd(tripduration) / 60)
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 18 / 20
%>%: the pipe operator
trips %>%
group_by(gender) %>%
summarize(count = n(),
mean_duration = mean(tripduration) / 60,
sd_duration = sd(tripduration) / 60)
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 19 / 20
r4ds.had.co.nz
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 20 / 20

More Related Content

Viewers also liked

Viewers also liked (20)

Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1
 
Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overview
 
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
 
Motivación laboral
Motivación laboralMotivación laboral
Motivación laboral
 
IBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TBIBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TB
 
ระบบสารสนเทศ
ระบบสารสนเทศระบบสารสนเทศ
ระบบสารสนเทศ
 
2016 Results & Outlook
2016 Results & Outlook 2016 Results & Outlook
2016 Results & Outlook
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and Erasmus
 
Blistering fast access to Hadoop with SQL
Blistering fast access to Hadoop with SQLBlistering fast access to Hadoop with SQL
Blistering fast access to Hadoop with SQL
 
tarea 7 gabriel
tarea 7 gabrieltarea 7 gabriel
tarea 7 gabriel
 
Your moment is Waiting
Your moment is WaitingYour moment is Waiting
Your moment is Waiting
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
JSON-LD Update
JSON-LD UpdateJSON-LD Update
JSON-LD Update
 
Agile analytics applications on hadoop
Agile analytics applications on hadoopAgile analytics applications on hadoop
Agile analytics applications on hadoop
 
Enabling Multimodel Graphs with Apache TinkerPop
Enabling Multimodel Graphs with Apache TinkerPopEnabling Multimodel Graphs with Apache TinkerPop
Enabling Multimodel Graphs with Apache TinkerPop
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics Applications
 
ConsumerLab: The Self-Driving Future
ConsumerLab: The Self-Driving FutureConsumerLab: The Self-Driving Future
ConsumerLab: The Self-Driving Future
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 

Similar to Modeling Social Data, Lecture 3: Data manipulation in R

Karen Tran - ENGE 4994 Paper
Karen Tran - ENGE 4994 PaperKaren Tran - ENGE 4994 Paper
Karen Tran - ENGE 4994 Paper
Karen Tran
 
MultiC2: an Optimization Framework for Learning from Task and Worker Dual Het...
MultiC2: an Optimization Framework for Learning from Task and Worker Dual Het...MultiC2: an Optimization Framework for Learning from Task and Worker Dual Het...
MultiC2: an Optimization Framework for Learning from Task and Worker Dual Het...
collwe
 
JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...
JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...
JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...
GUANGYUAN PIAO
 

Similar to Modeling Social Data, Lecture 3: Data manipulation in R (20)

Algorithms - Jeff Erickson.pdf
Algorithms - Jeff Erickson.pdfAlgorithms - Jeff Erickson.pdf
Algorithms - Jeff Erickson.pdf
 
Table Retrieval and Generation
Table Retrieval and GenerationTable Retrieval and Generation
Table Retrieval and Generation
 
[SIGIR17] Learning to Rank Using Localized Geometric Mean Metrics
[SIGIR17] Learning to Rank Using Localized Geometric Mean Metrics[SIGIR17] Learning to Rank Using Localized Geometric Mean Metrics
[SIGIR17] Learning to Rank Using Localized Geometric Mean Metrics
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown Bag
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)
 
Token
TokenToken
Token
 
Karen Tran - ENGE 4994 Paper
Karen Tran - ENGE 4994 PaperKaren Tran - ENGE 4994 Paper
Karen Tran - ENGE 4994 Paper
 
Are You Ready to Write Up Your Quantitative Data?
 Are You Ready to Write Up Your Quantitative Data? Are You Ready to Write Up Your Quantitative Data?
Are You Ready to Write Up Your Quantitative Data?
 
MultiC2: an Optimization Framework for Learning from Task and Worker Dual Het...
MultiC2: an Optimization Framework for Learning from Task and Worker Dual Het...MultiC2: an Optimization Framework for Learning from Task and Worker Dual Het...
MultiC2: an Optimization Framework for Learning from Task and Worker Dual Het...
 
Assignment#11
Assignment#11Assignment#11
Assignment#11
 
Ch03 Mining Massive Data Sets stanford
Ch03 Mining Massive Data Sets  stanfordCh03 Mining Massive Data Sets  stanford
Ch03 Mining Massive Data Sets stanford
 
Data Science as a Career and Intro to R
Data Science as a Career and Intro to RData Science as a Career and Intro to R
Data Science as a Career and Intro to R
 
Using Linked Data to Mine RDF from Wikipedia's Tables
Using Linked Data to Mine RDF from Wikipedia's TablesUsing Linked Data to Mine RDF from Wikipedia's Tables
Using Linked Data to Mine RDF from Wikipedia's Tables
 
R programming
R programmingR programming
R programming
 
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...
 
What's "For Free" on Craigslist?
What's "For Free" on Craigslist? What's "For Free" on Craigslist?
What's "For Free" on Craigslist?
 
JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...
JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...
JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...
 
Data analysis02 twovariables
Data analysis02 twovariablesData analysis02 twovariables
Data analysis02 twovariables
 
CMI2012 CCSS for Mathematics
CMI2012 CCSS for MathematicsCMI2012 CCSS for Mathematics
CMI2012 CCSS for Mathematics
 
Chapter-8 Relational Database Design
Chapter-8 Relational Database DesignChapter-8 Relational Database Design
Chapter-8 Relational Database Design
 

More from jakehofman

NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Science
jakehofman
 
Computational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: ClassificationComputational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: Classification
jakehofman
 
Computational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: RegressionComputational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: Regression
jakehofman
 
Computational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online ExperimentsComputational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online Experiments
jakehofman
 
Computational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part IIComputational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part II
jakehofman
 
Computational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part IComputational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part I
jakehofman
 
Computational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIComputational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part II
jakehofman
 
Computational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part IComputational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part I
jakehofman
 
Computational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part IIComputational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part II
jakehofman
 
Computational Social Science, Lecture 03: Counting at Scale, Part I
Computational Social Science, Lecture 03: Counting at Scale, Part IComputational Social Science, Lecture 03: Counting at Scale, Part I
Computational Social Science, Lecture 03: Counting at Scale, Part I
jakehofman
 

More from jakehofman (20)

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networks
 
Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classification
 
Modeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation SystemsModeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation Systems
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayes
 
Modeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at ScaleModeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at Scale
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studies
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Science
 
Computational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: ClassificationComputational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: Classification
 
Computational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: RegressionComputational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: Regression
 
Computational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online ExperimentsComputational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online Experiments
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wrangling
 
Computational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part IIComputational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part II
 
Computational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part IComputational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part I
 
Computational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIComputational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part II
 
Computational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part IComputational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part I
 
Computational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part IIComputational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part II
 
Computational Social Science, Lecture 03: Counting at Scale, Part I
Computational Social Science, Lecture 03: Counting at Scale, Part IComputational Social Science, Lecture 03: Counting at Scale, Part I
Computational Social Science, Lecture 03: Counting at Scale, Part I
 

Recently uploaded

Recently uploaded (20)

HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Plant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxPlant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptx
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 

Modeling Social Data, Lecture 3: Data manipulation in R

  • 1. Data manipulation in R APAM E4990 Modeling Social Data Jake Hofman Columbia University February 3, 2017 Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 1 / 20
  • 2. The good, the bad, & the ugly • R isn’t the best programming language out there • But it happens to be great for data analysis • The result is a steep learning curve with a high payoff Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 2 / 20
  • 3. For instance . . . • You’ll see a mix of camelCase, this.that, and snake case conventions • Dots (.) (mostly) don’t mean anything special • Likewise, $ gets used in funny ways • R is loosely typed, which can lead to unexpected coercions and silent fails • It also tries to be clever about variable scope, which can backfire if you’re not careful Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 3 / 20
  • 4. But it will help you . . . • Do extremely fast exploratory data analysis • Easily generate high-quality data visualizations • Fit and evaluate pretty much any statistical model you can think of Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 4 / 20
  • 5. But it will help you . . . • Do extremely fast exploratory data analysis • Easily generate high-quality data visualizations • Fit and evaluate pretty much any statistical model you can think of This will change the way you do data analysis, because you’ll ask questions you wouldn’t have bothered to otherwise Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 4 / 20
  • 6. Basic types • int, double: for numbers • character: for strings • factor: for categorical variables (∼ struct or ENUM) Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 5 / 20
  • 7. Basic types • int, double: for numbers • character: for strings • factor: for categorical variables (∼ struct or ENUM) Factors are handy, but take some getting used to Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 5 / 20
  • 8. Containers • vector: for multiple values of the same type (∼ array) • list: for multiple values of different types (∼ dictionary) • data.frame: for tables of rectangular data of mixed types (∼ matrix) Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 6 / 20
  • 9. Containers • vector: for multiple values of the same type (∼ array) • list: for multiple values of different types (∼ dictionary) • data.frame: for tables of rectangular data of mixed types (∼ matrix) We’ll mostly work with data frames, which themselves are lists of vectors Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 6 / 20
  • 10. The tidyverse The tidyverse is a collection of packages that work together to make data analysis easier: • dplyr for split / apply / combine type counting • ggplot2 for making plots • tidyr for reshaping and “tidying” data • readr for reading and writing files • ... Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 7 / 20
  • 11. Tidy data The core philosophy is that your data should be in a “tidy” table with: • One observation per row • One variable per column • One measured value per cell Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 8 / 20
  • 12. Tidy data • Most of the work goes into getting your data into shape • After which descriptives statistics, modeling, and visualization are easy Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 9 / 20
  • 13. dplyr: a grammar of data manipulation dplyr implements the split / apply / combine framework discussed in the last lecture • Its “grammar” has five main verbs used in the “apply” phase: • filter: restrict rows based on a condition (N → N ) • arrange: reorder rows by a variable (N → N ) • select: pick out specific columns (K → K ) • mutate: create new or change existing columns (K → K ) • summarize: collapse a column into one value (N → 1) • The group by function creates indices to take care of the split and combine phases Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 10 / 20
  • 14. dplyr: a grammar of data manipulation dplyr implements the split / apply / combine framework discussed in the last lecture • Its “grammar” has five main verbs used in the “apply” phase: • filter: restrict rows based on a condition (N → N ) • arrange: reorder rows by a variable (N → N ) • select: pick out specific columns (K → K ) • mutate: create new or change existing columns (K → K ) • summarize: collapse a column into one value (N → 1) • The group by function creates indices to take care of the split and combine phases The cost is that you have to think “functionally”, in terms of “vectorized” operations Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 10 / 20
  • 15. filter filter(trips, start_station_name == "Broadway & E 14 St") Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 11 / 20
  • 16. arrange arrange(trips, starttime) Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 12 / 20
  • 17. select select(trips, starttime, stoptime, start_station_name, end_station_name) Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 13 / 20
  • 18. mutate mutate(trips, time_in_min = tripduration / 60) Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 14 / 20
  • 19. mutate summarize(trips, mean_duration = mean(tripduration) / 60, sd_duration = sd(tripduration) / 60) Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 15 / 20
  • 20. group by trips_by_gender <- group_by(trips, gender) Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 16 / 20
  • 21. group by trips_by_gender <- group_by(trips, gender) Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 17 / 20
  • 22. group by + summarize summarize(trips_by_gender, count = n(), mean_duration = mean(tripduration) / 60, sd_duration = sd(tripduration) / 60) Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 18 / 20
  • 23. %>%: the pipe operator trips %>% group_by(gender) %>% summarize(count = n(), mean_duration = mean(tripduration) / 60, sd_duration = sd(tripduration) / 60) Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 19 / 20
  • 24. r4ds.had.co.nz Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 20 / 20