The Importance of Cleaning,
Organizing and Analyzing Data
in Research
Michael Blecher
Why Data is Important To
Researchers
 What makes research scientific is that it relies on statistics and data.
 Researchers do not speculate; they cannot simply claim, for example, that colder temperatures make people smarter.
 To turn hypotheses and theories into working truths, large sets of data have to be collected (i.e. by running the experiment) and analyzed.
 Only with such empirical evidence can we establish whether our theories have validity.
 Thus, there may be nothing more important for people interested in research than a background in statistics and knowledge of how to manipulate data (the latter is where R comes in!)
The Plan
 Background into the psychology experiment I
programmed
 Amazon Mechanical Turk and what data
often looks like in academia.
 Making sense of the data and cleaning it
 Organizing the data.
 Analyzing the data
 Making the data look pretty!
My Experiment-The Famous DRM
Paradigm
 When people are asked to remember a list of strongly related words, they have a high rate of falsely remembering a word that semantically embodies all the presented words (the category word).
 Ex: If I asked you to remember the words mad, fear, hate, rage and temper (related words), most of you would likely misremember seeing the word anger as well, which is the category word for this list (there are 24 lists altogether, each with a different category word and set of related words).
 Hence, this experiment crossed two factors into four conditions: whether a list was presented before (pre-lure) or after (post-lure) the category word it is associated with, and the type of word tested (related or category).
What Data Often Looks Like
 The experiment was run through Amazon Mechanical Turk, an internet site that allows people to participate in psychological experiments (among other things).
 And now the fun begins! Looking at the data.
 C:\Users\Michael\Documents\R datascience notes
 The data in a more organized fashion (but this was done manually).
 One goal is going to be using R to make our raw data look like that excel file.
Where to Start
 Before anything, we have to set the working directory to the folder where our
data is-
setwd("C:/Users/Michael/Documents/R datascience notes")
 Next, let's just make some empty variables. They won't contain anything at the moment, but they will be able to hold multiple items at once.
 Country<-c()
 Gender<-c()
 Subject<-c()
 Age<-c()
 Handedness<-c()
 Many more variables in the actual R file!
Creating a Loop
 One of the most important things about programming experiments is knowing ahead of time how you plan to extract the data. For instance, we ensured that certain punctuation (a colon, in this case) would appear before the data of each trial was recorded. A colon also appeared before the demographic information was recorded.
 Now we can make a loop
for (subnum in 5:55) #50 participants; only data for subjects 5-55 was recorded
{
test<-scan(file="raw data.txt", what = "character", sep = " ", skip=(0+subnum), nlines=1) #skip and nlines let us select which lines are read
data<-unlist(strsplit(test,split=":")) #splits the data at each colon; unlist turns the pieces into a single vector so we can extract any element just through an index
if (length(data)==172) { #assures we only extract a participant's data when they completed the entire experiment (there were 172 trials)
Loop (continued)
 As you will see, our variable data contains many different pieces of data. Indices 13-172 contain all the info that was recorded for every trial.
 For instance:
 [14] "concert,music,Related,Pre-
Lure,9,1398816317238,1398816319063,old,4,1,1"
 On trial 14, the word presented was concert, the category word
that is associated with this word is music. Thus, it was a Related Pre-
Lure word, etc.
 So now we can split these things up even more with this code
 CurrentTrial<-unlist(strsplit(data[i],split=","))
 Now each piece of each trial has its own index.
Putting Data into our variables
 And now we can put each piece of our data into the correct
variables we made earlier.
 Ex:
CategoryWord<-c(CategoryWord,CurrentTrial[2])
RelatedWord<-c(RelatedWord,CurrentTrial[1]) etc.
And we can also do the same with demographic information.
for (i in 12:12){
CurrentTrial2<-unlist(strsplit(data[i],split=","))
Country<-c(Country,CurrentTrial2[2])
Age<-c(Age,CurrentTrial2[4])
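Putting the fragments from the last few slides together, the extraction pattern looks roughly like this. (A sketch: the two raw lines below are synthetic stand-ins for rows of raw data.txt, which the real loop reads with scan(); field positions are illustrative only.)

```r
# Synthetic "raw" lines standing in for rows of raw data.txt;
# the real loop reads one line per subject with scan(..., skip = subnum, nlines = 1).
raw_lines <- c(
  "hdr:demo,United States,Male,25:concert,music,Related,Pre-Lure",
  "hdr:demo,Canada,Female,31:mad,anger,Related,Post-Lure")

Country <- c(); RelatedWord <- c(); CategoryWord <- c()
for (line in raw_lines) {
  data  <- unlist(strsplit(line, split = ":"))    # split the line on colons
  demo  <- unlist(strsplit(data[2], split = ",")) # demographic fields
  trial <- unlist(strsplit(data[3], split = ",")) # one trial's fields
  Country      <- c(Country, demo[2])
  RelatedWord  <- c(RelatedWord, trial[1])
  CategoryWord <- c(CategoryWord, trial[2])
}
```

The key idea is the two-level split: first on the colon separating sections, then on the commas within a section, accumulating each field into its own growing vector.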
Making data.frames
 First, before we forget, we should convert certain variables so R reads them as numeric, e.g. Memory<-as.numeric(Memory)
 Next, let’s make some neat data.frames
 One is a simple truncated version and the other is a longer version.
 AllData<-data.frame(Subject,Phase,Type,Memory,Confidence)
 AllData2<-data.frame(Subject,Phase,Type,Memory,Confidence,CategoryWord,RelatedWord,OnsetTime,ResponseTime)
Adding columns to our data
 Still, so much of our data is missing. For instance, the times we recorded do not yet give us a true reaction time; we only have the exact time when the stimulus was presented and the time when a button was clicked.
 So let’s add to our data.frame by making a new column that
creates reaction time by finding the difference between the time-
related variables we do have.
 Ex: library(dplyr)
 AllData3<-mutate(AllData2,Reaction_Time=(ResponseTime-OnsetTime))
 Let’s also make it into a separate variable -
Reaction_Time<-AllData3$Reaction_Time
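If dplyr isn't installed, base R's transform() adds the same column. A minimal sketch with made-up timestamps (milliseconds):

```r
# Toy timestamps standing in for the recorded onset/response values
AllData2 <- data.frame(OnsetTime    = c(1398816317238, 1398816317900),
                       ResponseTime = c(1398816319063, 1398816318500))
# Reaction time = response click time minus stimulus onset time
AllData3 <- transform(AllData2, Reaction_Time = ResponseTime - OnsetTime)
```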
Putting it in excel
 Our data as an excel file-
 write.table(AllData3, file='data.csv', sep=",",col.names=TRUE,row.names=FALSE)
 Doesn't look half bad, but there's much more work to do.
 For starters, let’s make the data more concise by calculating the means for each
subject by type and phase-
 AllData<-aggregate(Memory~Subject*Phase*Type,AllData,mean)
 library(reshape2)
 Can also use this code from reshape2:
 organize_data<-dcast(AllData,Subject*Phase*Type~.,value.var='Memory')
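To see what the aggregate() call does, here is a toy run on made-up data with two subjects; it collapses the trial-level rows to one mean per Subject x Phase x Type cell:

```r
# Synthetic stand-in for the trial-level AllData frame
AllData <- data.frame(
  Subject = rep(1:2, each = 4),
  Phase   = rep(c("Pre-Lure", "Post-Lure"), 4),
  Type    = rep(c("Related", "Category"), each = 2, times = 2),
  Memory  = c(1, 0, 1, 1, 0, 0, 1, 0))
# One row per Subject x Phase x Type combination (2 x 2 x 2 = 8 rows)
means <- aggregate(Memory ~ Subject * Phase * Type, AllData, mean)
```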
Splitting Data
 We can even split the data into smaller subsets; for instance, we can separate the data for each subject.
 Ex:
library(plyr) #dlply comes from plyr
Average4 <- dlply(.data=AllData, .variables='Subject')
 The code below labels our fourth column Memory.
names(organize_data)[4]="Memory"
 Next comes merging the demographic information into our data. We start by reshaping the data so that each Phase and Type combination gets its own column.
Ex:
clean_data<-dcast(organize_data, Subject ~ Phase + Type, value.var="Memory")
Subsetting Data
 Now we are going to combine all those demographic variables we made earlier
with the most recent data frame.
 demographic_info<-data.frame(Country,Vision,Handedness,Gender,clean_data)
 Now we can find pretty much anything in our data. Ex: we can get all the memory scores for the Post-Lure Category condition for male participants.
 Ex: demographic_info$Post.Lure_Category[demographic_info$Gender=="Male"]
 If we examine which country participants were from, we will see that many were from the United States, but not all of them wrote that exact term in their questionnaire (some wrote U.S., America, etc.)
Recoding Data
 demographic_info$Country[demographic_info$Country=="usa"|
 demographic_info$Country=="USA"|
 demographic_info$Country=="America"|
 demographic_info$Country=="United States of America"|
 demographic_info$Country=="US"|
 demographic_info$Country=="United states"|
 demographic_info$Country=="us"]<-"United States"
The code above makes the adjustment so that, when we examine the frequencies, people who wrote America do not appear to be from a different country than those who wrote United States.
Analyzing our data
 Let's do a two-way ANOVA to see if there is actually an interaction effect!
 aov.out<-aov(Memory~Phase*Type,AllData) #We use * to get the interaction effect as well as the main effects of the two independent variables.
print(summary(aov.out),digits=10)
print(model.tables(aov.out,"means"),digits=3) #gives us summary tables for the analysis we ran
We can also use this code to get the table of means:
bar=tapply(AllData$Memory,list(AllData$Type,AllData$Phase),mean)
And it turns out our interaction was significant.
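A self-contained toy of the same 2 x 2 design (synthetic Memory scores, not the experiment's data) shows the shape of the aov() call and where the interaction row lands in its output table:

```r
set.seed(1)
# Fake 2 x 2 design: 20 subjects, Phase x Type fully crossed
toy <- expand.grid(Subject = 1:20,
                   Phase = c("Pre-Lure", "Post-Lure"),
                   Type  = c("Related", "Category"))
toy$Memory <- rbinom(nrow(toy), 10, 0.5) / 10  # fake proportion scores

aov.toy <- aov(Memory ~ Phase * Type, data = toy)
tab <- summary(aov.toy)[[1]]  # ANOVA table with Phase, Type, Phase:Type rows
```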
Showing the
Interaction Graphically
 And we can make different graphs from this code too!
Ex: library(sciplot)
lineplot.CI(x.factor=AllData$Type,response=AllData$Memory,group=AllData$Phase,
trace.label="Phase",xlab="Type",ylab="Percentage of Wrong Answers",
main="Interaction Effect")
Or if you want a bar graph-ex:
graph2 = barplot(bar, beside=T, ylim=c(0.2,1),xpd=FALSE,
space=c(.1,.8), main="TIP Effect in DRM Paradigm",
xlab="Phase", ylab="Percentage of Wrong Answers", legend =T,
args.legend = list(x="topleft"), col=c("red","blue"))
Applying this to
other parts of the data
 We can use the same code to find the difference in reaction times between the four conditions and run an ANOVA on that to see if there's a significant difference in reaction time.
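That reaction-time analysis might look like the sketch below; the AllData3 here is a hypothetical stand-in with the same column names as the frame built earlier, filled with fake RTs:

```r
set.seed(2)
# Hypothetical stand-in for AllData3: 10 subjects x 4 conditions
AllData3 <- data.frame(
  Subject = rep(1:10, each = 4),
  Phase   = rep(c("Pre-Lure", "Post-Lure"), times = 20),
  Type    = rep(c("Related", "Category"), each = 2, times = 10),
  Reaction_Time = rexp(40, rate = 1/800))  # fake RTs, roughly 800 ms mean

# Mean RT per subject and condition, then the same two-way ANOVA
RT <- aggregate(Reaction_Time ~ Subject * Phase * Type, AllData3, mean)
rt.aov <- aov(Reaction_Time ~ Phase * Type, RT)
```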
 We can also analyze demographic information-first let’s again
create isolated variables to represent certain columns that we
made in our demographic data.
 Now, let’s create a new data.frame but with a structure that looks
more familiar to us.
 Ex:
demographic_info2=data.frame(Gender,Post.Lure_Category,Pre.Lure_Category,Pre.Lure_Related,Post.Lure_Related)
Organizing the data
 by_gender<-melt(data=demographic_info2,id="Gender")
dcast(data=by_gender, formula=Gender~variable, value.var='value',fun=mean)
The code above is exactly what we need to find means by gender.
 In fact, we can use this approach to merge all this data into one really organized package.
demographic_info3=data.frame(Gender, Country, Handedness,Vision,
Post.Lure_Category,Pre.Lure_Category,Pre.Lure_Related,Post.Lure_Related)
by_gender2<-melt(data=demographic_info3,id.vars=c("Gender","Country","Handedness","Vision"))
Running an ANCOVA
 However, if we want to run an ANCOVA and examine if we should
control for gender, we should merge the demographic data with the
original data.frame we created-just so type and phase are separate
variables again.
 Ex:
combine_everything<-data.frame(by_gender2$Gender,by_gender2$Country,by_gender2$Handedness,by_gender2$Vision,AllData)
ANCOVA<-aov(Memory~Phase*Type+by_gender2.Gender,combine_everything)
print(summary(ANCOVA),digits=10)
Can even examine which was the
most luring category word!
 bark=tapply(AllData2$Memory,list(AllData2$CategoryWord),mean)
###Will get the mean for each category word
 library(plyr)
 organize_bark<-ddply(AllData2, "CategoryWord", summarise, mean=mean(Memory))
##Can then make this data into a nice data.frame using summarise.
 organize_bark <- organize_bark[order(organize_bark$mean,
decreasing=TRUE),]
 ##Can make it where it’s sorted too
 And we can graph this as well!
Conclusion
 R makes the process of cleaning, organizing and analyzing data quite enjoyable.
 The fact that we can take raw data that looks like gibberish and easily translate it into something coherent is quite amazing.
 R allows us to subset our data in various ways and explore whether certain categorical independent variables (e.g. gender) influence the effects our investigation revolves around.
 R gives us the ability to run statistical analyses to see whether our results are actually significant.
 R can easily depict our findings through graphs, facilitating a shared understanding of the results among researchers.

Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 

Data Science Academy Student Demo Day: Michael Blecher, The Importance of Cleaning, Organizing and Analyzing Data in Research

What Data Often Looks Like
 The experiment was run through Amazon Mechanical Turk, an internet site that allows people to participate in psychological experiments (among other things).
 And now the fun begins! Looking at the data.
 C:\Users\Michael\Documents\R datascience notes
 The data in a more organized fashion, but this was done manually.
 One goal is going to be using R to make our raw data look like that excel file.
Where to Start
 Before anything, we have to set the working directory to the folder where our data is:
setwd("C:/Users/Michael/Documents/R datascience notes")
 Next, let's just make a list of different variables. They won't contain anything at the moment, but they have the ability to hold multiple items at once.
Country <- c()
Gender <- c()
Subject <- c()
Age <- c()
Handedness <- c()
 Many more variables in the actual R file!
Creating a Loop
 One of the most important things about programming experiments is knowing ahead of time how you plan to extract the data. For instance, we assured that certain punctuation would appear before the data of each trial in the experiment was recorded (a colon in this case). A colon also appeared before the demographic information was recorded.
 Now we can make a loop:
for (subnum in 5:55) { # 50 participants and only data for subjects 5-55 was recorded
  # skip and nlines allow us to select which lines we want read
  test <- scan(file = "raw data.txt", what = "character", sep = " ",
               skip = (0 + subnum), nlines = 1)
  # split the data by the colon; unlist makes all the pieces of the data into a
  # single vector, which lets us extract any data we like just through an index
  data <- unlist(strsplit(test, split = ":"))
  # assure that we only extract participants' data when they completed the
  # entire experiment (there were 172 trials)
  if (length(data) == 172) { ...
Loop (continued)
 As you will see, our variable data contains many different pieces of data. Indices 13-172 contain all the info that was recorded for every trial.
 For instance:
[14] "concert,music,Related,Pre-Lure,9,1398816317238,1398816319063,old,4,1,1"
 On trial 14, the word presented was concert, and the category word associated with this word is music. Thus, it was a Related Pre-Lure word, etc.
 So now we can split these things up even more with this code:
CurrentTrial <- unlist(strsplit(data[i], split = ","))
 Now each piece of each trial has its own index.
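The two-level split above can be sketched in a few self-contained lines. The trial string is the one from the slide; the field names below are assumptions based on the slide's description of what each position holds.

```r
# One trial, exactly as it appears after splitting the raw line on ":"
raw_trial <- "concert,music,Related,Pre-Lure,9,1398816317238,1398816319063,old,4,1,1"

# Split the trial into its comma-separated fields
fields <- unlist(strsplit(raw_trial, split = ","))

related_word  <- fields[1]  # the word actually presented
category_word <- fields[2]  # the lure it is associated with
word_type     <- fields[3]  # Related or Category
phase         <- fields[4]  # Pre-Lure or Post-Lure

# The two large numbers are millisecond timestamps, so their difference
# is a reaction time
reaction_time <- as.numeric(fields[7]) - as.numeric(fields[6])
```

Indexing into `fields` this way is exactly what the loop does with `CurrentTrial[1]`, `CurrentTrial[2]`, and so on.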
Putting Data into our Variables
 And now we can put each piece of our data into the correct variables we made earlier. Ex:
CategoryWord <- c(CategoryWord, CurrentTrial[2])
RelatedWord <- c(RelatedWord, CurrentTrial[1])
 And we can also do the same with demographic information:
for (i in 12:12) {
  CurrentTrial2 <- unlist(strsplit(data[i], split = ","))
  Country <- c(Country, CurrentTrial2[2])
  Age <- c(Age, CurrentTrial2[4])
}
Making data.frames
 First, before we forget, we should change certain variables so R reads them as numeric. Ex: Memory <- as.numeric(Memory)
 Next, let's make some neat data.frames. One is a simple truncated version and the other is a longer version.
AllData <- data.frame(Subject, Phase, Type, Memory, Confidence)
AllData2 <- data.frame(Subject, Phase, Type, Memory, Confidence,
                       CategoryWord, RelatedWord, OnsetTime, ResponseTime)
Adding Columns to our Data
 Still, so much of our data is missing. For instance, the times we have recorded do not yet provide us with a true reaction time. Instead, we only have the exact time when the stimulus was presented and when a button was clicked.
 So let's add to our data.frame by making a new column that creates reaction time by finding the difference between the time-related variables we do have. Ex:
library(dplyr)
AllData3 <- mutate(AllData2, Reaction_Time = (ResponseTime - OnsetTime))
 Let's also make it into a separate variable:
Reaction_Time <- AllData3$Reaction_Time
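The same derived column can be sketched without dplyr, using base R's `transform()`; the toy timestamps below are invented for illustration.

```r
# Toy data standing in for AllData2's time columns
trials <- data.frame(
  Subject      = c(1, 1, 2),
  OnsetTime    = c(1000, 2000, 1500),   # ms when the stimulus appeared
  ResponseTime = c(1825, 2600, 2900)    # ms when the button was clicked
)

# Same idea as mutate(AllData2, Reaction_Time = ResponseTime - OnsetTime),
# but with no package dependency
trials <- transform(trials, Reaction_Time = ResponseTime - OnsetTime)
```

`mutate()` and `transform()` do the same job here; `mutate()` becomes more convenient once several new columns depend on each other.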
Putting it in Excel
 Our data as an excel file:
write.table(AllData3, file = 'data.csv', sep = ",",
            col.names = TRUE, row.names = FALSE)
 Doesn't look half bad, but so much more work to go.
 For starters, let's make the data more concise by calculating the means for each subject by type and phase:
AllData <- aggregate(Memory ~ Subject * Phase * Type, AllData, mean)
 Can also use this code from reshape2:
library(reshape2)
organize_data <- dcast(AllData, Subject * Phase * Type ~ .,
                       value.var = 'Memory', row.names = TRUE)
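A minimal, self-contained sketch of what `aggregate()` does to the trial-level data; the memory scores below are invented for illustration.

```r
# Four toy trials for one subject (1 = falsely remembered the word)
toy <- data.frame(
  Subject = c(1, 1, 1, 1),
  Phase   = c("Pre-Lure", "Pre-Lure", "Post-Lure", "Post-Lure"),
  Type    = c("Related", "Related", "Category", "Category"),
  Memory  = c(1, 0, 1, 1)
)

# Collapse trials into one mean per Subject x Phase x Type cell
means <- aggregate(Memory ~ Subject + Phase + Type, toy, mean)
```

Only the condition combinations that actually occur in the data appear in the result, which is why the full experiment yields one row per subject per condition.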
Splitting Data
 We can even split the data into smaller subsets; for example, we can separate the data for each subject. Ex:
library(plyr)
Averag4 <- dlply(.data = AllData, .variables = 'Subject')
 The code below will make our last column labeled Memory:
names(organize_data)[4] = "Memory"
 Next comes trying to merge the demographic information into our data. We start by changing our data structure so that Type and Phase take their own columns. Ex:
clean_data <- dcast(organize_data, Subject ~ Phase + Type, value.var = "Memory")
Subsetting Data
 Now we are going to combine all those demographic variables we made earlier with the most recent data frame:
demographic_info <- data.frame(Country, Vision, Handedness, Gender, clean_data)
 Now we can find pretty much anything in our data. Ex: we can get the values of all memory scores during the Post-Lure Category phase when the participant was male:
demographic_info$Post.Lure_Category[demographic_info$Gender == "Male"]
 If we examine the country participants were from, we will see that many participants were from the United States, but they didn't all write that exact term in their questionnaire (some wrote U.S., America, etc.)
Recoding Data
demographic_info$Country[demographic_info$Country == "usa" |
  demographic_info$Country == "USA" |
  demographic_info$Country == "America" |
  demographic_info$Country == "United States of America" |
  demographic_info$Country == "US" |
  demographic_info$Country == "United states" |
  demographic_info$Country == "us"] <- "United States"
 The code above makes the adjustment so that, when we examine the frequencies, it will not appear that people who wrote America are from a different country than those who wrote the United States.
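The same recode can be written more compactly with `%in%`, which replaces the long chain of `==` comparisons; the country vector below is invented for illustration.

```r
# Toy responses standing in for demographic_info$Country
Country <- c("usa", "United States", "America", "US", "India", "us")

# Every spelling we want to collapse into one canonical label
us_spellings <- c("usa", "USA", "America", "United States of America",
                  "US", "United states", "us")
Country[Country %in% us_spellings] <- "United States"

# Frequencies now group every spelling under one label
table(Country)
```

`%in%` also makes it easy to extend the list later when a new spelling (e.g. "U.S.A.") shows up in the data.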
Analyzing our Data
 Let's do a 2-way ANOVA to see if there is actually an interaction effect!
# We use * to get the interaction effect as well as the effects of the two
# independent variables
aov.out <- aov(Memory ~ Phase * Type, AllData)
# will give us summary tables for the analysis we ran
print(summary(aov.out), digits = 10)
print(model.tables(aov.out, "means"), digits = 3)
 Can also use this code to get the means table:
bar = tapply(AllData$Memory, list(AllData$Type, AllData$Phase), mean)
 And it turns out our interaction was significant.
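The formula interface can be demonstrated on invented data; since the scores below are random, nothing here is expected to be significant, but the structure of the output matches the real analysis.

```r
# Fake crossed design: 40 "trials" over two factors
set.seed(1)
toy <- data.frame(
  Phase  = rep(c("Pre-Lure", "Post-Lure"), each = 20),
  Type   = rep(c("Related", "Category"), times = 20),
  Memory = runif(40)   # fake proportion-wrong scores
)

# Phase * Type expands to Phase + Type + Phase:Type, so both main effects
# and the interaction are tested in one call
aov.out <- aov(Memory ~ Phase * Type, data = toy)
summary(aov.out)
```

The summary table has one row per term (Phase, Type, Phase:Type) plus Residuals, which is the table the slide reads the interaction effect from.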
Showing the Interaction Graphically
 And we can make different graphs from this code too! Ex:
library(sciplot)
lineplot.CI(x.factor = AllData$Type, response = AllData$Memory,
            group = AllData$Phase, trace.label = "Phase", xlab = "Type",
            ylab = "Percentage of Wrong Answers", main = "Interaction Effect")
 Or if you want a bar graph, ex:
graph2 = barplot(bar, beside = T, ylim = c(0.2, 1), xpd = FALSE,
                 space = c(.1, .8), main = "TIP Effect in DRM Paradigm",
                 xlab = "Phase", ylab = "Percentage of Wrong Answers",
                 legend = T, args.legend = list(x = "topleft"),
                 col = c("red", "blue"))
Applying this to Other Parts of the Data
 Can use the same code to also find the difference in reaction times between the four different conditions, and run an ANOVA on that to see if there's a significant difference in reaction time.
 We can also analyze demographic information. First, let's again create isolated variables to represent certain columns that we made in our demographic data.
 Now, let's create a new data.frame, but with a structure that looks more familiar to us. Ex:
Demographic_info2 = data.frame(Gender, Post.Lure_Category, Pre.Lure_Category,
                               Pre.Lure_Related, Post.Lure_Related)
Organizing the Data
by_gender <- melt(data = demographic_info2, id = "Gender")
dcast(data = by_gender, formula = Gender ~ variable,
      value.var = 'value', fun = mean)
 The code above is exactly what we need to now find means by gender.
 In fact, we can use this code to merge all this data into a really organized package:
demographic_info3 = data.frame(Gender, Country, Handedness, Vision,
                               Post.Lure_Category, Pre.Lure_Category,
                               Pre.Lure_Related, Post.Lure_Related)
by_gender2 <- melt(data = demographic_info3,
                   id.vars = c("Gender", "Country", "Handedness", "Vision"))
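The same "means by gender" result can be sketched without reshape2, using base R's `aggregate()` with a dot formula; the scores below are invented for illustration.

```r
# Toy version of demographic_info2: one row per participant
demo <- data.frame(
  Gender             = c("Male", "Male", "Female", "Female"),
  Post.Lure_Category = c(0.8, 0.6, 0.9, 0.7),
  Pre.Lure_Category  = c(0.5, 0.7, 0.6, 0.8)
)

# ". ~ Gender" means: average every other column within each gender,
# which is what the melt + dcast pair accomplishes
by_gender_means <- aggregate(. ~ Gender, data = demo, FUN = mean)
```

The melt/dcast route scales better once there are many id variables (as in demographic_info3), but for a single grouping variable the one-liner above is equivalent.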
Running an ANCOVA
 However, if we want to run an ANCOVA and examine whether we should control for gender, we should merge the demographic data with the original data.frame we created, just so Type and Phase are separate variables again. Ex:
combine_everything <- data.frame(by_gender2$Gender, by_gender2$Country,
                                 by_gender2$Handedness, by_gender2$Vision,
                                 AllData)
ANCOVA <- aov(Memory ~ Phase * Type + by_gender2.Gender, combine_everything)
print(summary(ANCOVA), digits = 10)
Can Even Examine Which was the Most Luring Category Word!
# will get the mean for each category word
bark = tapply(AllData2$Memory, list(AllData2$CategoryWord), mean)
 Can then make this data into a nice data.frame using summarise:
library(plyr)
organize_bark <- ddply(AllData2, "CategoryWord", summarise, mean = mean(Memory))
 Can make it sorted too:
organize_bark <- organize_bark[order(organize_bark$mean, decreasing = TRUE), ]
 And we can graph this as well!
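Ranking category words by how often they lured a false memory can be sketched in base R without plyr; the data below is invented for illustration.

```r
# Toy trials: 1 = the category word was falsely remembered
toy <- data.frame(
  CategoryWord = c("anger", "anger", "music", "music", "sleep", "sleep"),
  Memory       = c(1, 1, 1, 0, 0, 0)
)

# Mean false-memory rate per category word (same job as ddply + summarise)
word_means <- aggregate(Memory ~ CategoryWord, toy, mean)

# Sort so the most luring category word comes first
word_means <- word_means[order(word_means$Memory, decreasing = TRUE), ]
```

`order()` on the mean column is exactly the sorting trick the slide applies to organize_bark.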
Conclusion
 R makes the process of cleaning, organizing and analyzing data quite enjoyable.
 The fact that we can take raw data that looks like gibberish and easily translate it into something coherent is quite amazing.
 R allows us to subset our data in various ways, and explore whether certain categorical independent variables (e.g. gender) are influencing the effects our investigation may revolve around.
 R gives us the ability to run statistical analyses to see if our results are actually significant.
 R can easily depict our findings through graphs, facilitating a communal understanding of the results among researchers.