Data Science Academy Student Demo Day: Michael Blecher, The Importance of Cleaning, Organizing and Analyzing Data in Research
1. The Importance of Cleaning,
Organizing and Analyzing Data
in Research
Michael Blecher
2. Why Data is Important To
Researchers
What makes research scientific is the fact it relies on the use of statistics
and data.
Researchers do not speculate. For example, they can’t just claim
something like colder temperatures make people smarter.
To make hypotheses and theories into working truths, large sets of data
have to be collected (i.e. running the experiment), and analyzed.
Only with such empirical evidence can we establish whether our
theories have validity.
Thus, there may be nothing more important for people interested in
research than a background in statistics and knowledge of
how to manipulate data (the latter is where R comes in!)
3. The Plan
Background into the psychology experiment I
programmed
Amazon Mechanical Turk and what data
often looks like in academia.
Making sense of the data and cleaning it
Organizing the data.
Analyzing the data
Making the data look pretty!
4. My Experiment-The Famous DRM
Paradigm
When people are asked to remember a list of strongly related words, they
have a high rate of falsely remembering a word that semantically
embodies all of the presented words (the category word).
Ex: If I asked you to remember the words mad, fear, hate, rage and
temper (related words), most of you would likely misremember seeing
the word anger as well, which is considered the category word for this
list (there are 24 lists altogether, each with a different category word
and set of related words).
Hence, this experiment had four conditions, formed by crossing two factors:
the phase, meaning whether a list's words were presented before the category
word they are associated with (Pre-Lure) or after it (Post-Lure), and
the type of word (Related or Category).
5. What Data Often Looks Like
The experiment was run through Amazon Mechanical Turk, an internet site
that allows people to participate in psychological experiments
(among other things).
And now the fun begins! Looking at the data.
Here is the data in a more organized fashion, but this was done manually.
One goal is to use R to make our raw data look like that
Excel file.
6. Where to Start
Before anything, we have to set the working directory to the folder where our
data is:
setwd("C:/Users/Michael/Documents/R datascience notes")
Next, let's create a set of empty vectors. They won't contain anything at
the moment, but each one can hold multiple items at once.
Country<-c()
Gender<-c()
Subject<-c()
Age<-c()
Handedness<-c()
Many more variables in the actual R file!
7. Creating a Loop
One of the most important things about programming experiments is knowing ahead of time how you
plan to extract the data. For instance, we ensured that a specific punctuation mark (a colon in this case)
would appear before the data of each trial in the experiment was recorded. A colon also appeared before
the demographic information was recorded.
Now we can make a loop
for (subnum in 5:55) #50 participants and only data for subjects 5-55 was recorded
{
#skip and nlines allow us to select which lines we want read
test<-scan(file="raw data.txt", what = "character", sep = " ", skip=(0+subnum), nlines=1)
#will split the data by the colon; we use unlist to make all the pieces of the data into
#a single vector, which lets us extract any data we like just through the use of an index
data<-unlist(strsplit(test,split=":"))
#created to ensure that we only extract participants' data when they
#completed the entire experiment (there were 172 trials).
if (length(data)==172)…{
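The read, split and check steps above can be sketched on a tiny stand-in file. This is only an illustration under stated assumptions: the two lines of text, the temporary file, and the expected count of 4 pieces are invented (the real file uses 172), and sep = "\n" is used here so that each line arrives as a single string.

```r
# Hypothetical two-line stand-in for "raw data.txt"; the field contents are invented.
raw_lines <- c(
  "hdr:USA,Male,25:concert,music,Related,Pre-Lure:anger,anger,Category,Post-Lure",
  "hdr:USA,Female,31:concert,music,Related,Pre-Lure"   # incomplete record
)
tmp <- tempfile(fileext = ".txt")
writeLines(raw_lines, tmp)

expected_pieces <- 4   # assumption for this toy file; the slides check for 172
kept <- list()
for (linenum in 0:1) {
  # skip = number of lines to pass over, nlines = 1 reads exactly one line
  one_line <- scan(file = tmp, what = "character", sep = "\n",
                   skip = linenum, nlines = 1, quiet = TRUE)
  pieces <- unlist(strsplit(one_line, split = ":"))
  # keep only complete records
  if (length(pieces) == expected_pieces) {
    kept[[length(kept) + 1]] <- pieces
  }
}
length(kept)
```

Only the first line survives the length check, which is the same filter the loop applies to participants who did not finish.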
8. Loop (continued)
As you will see, our variable data contains many different pieces of
data. Elements 13-172 contain all the info that was recorded for every trial.
For instance:
[14] "concert,music,Related,Pre-Lure,9,1398816317238,1398816319063,old,4,1,1"
On trial 14, the word presented was concert, and the category word
that is associated with this word is music. Thus, it was a Related Pre-Lure
word, etc.
So now we can split these things up even more with this code
CurrentTrial<-unlist(strsplit(data[i],split=","))
Now each piece of each trial has its own index.
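As a quick check of the comma split, here is the example record from this slide run through the same call. The meaning of positions 1 and 2 comes from the slides; nothing is assumed about the later positions.

```r
# The example record shown on the slide
trial <- "concert,music,Related,Pre-Lure,9,1398816317238,1398816319063,old,4,1,1"
CurrentTrial <- unlist(strsplit(trial, split = ","))
length(CurrentTrial)   # number of pieces in one trial record
CurrentTrial[1]        # the word presented
CurrentTrial[2]        # the category word it is associated with
```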
9. Putting Data into our variables
And now we can put each piece of our data into the correct
variables we made earlier.
Ex:
CategoryWord<-c(CategoryWord,CurrentTrial[2])
RelatedWord<-c(RelatedWord,CurrentTrial[1]) etc.
And we can also do the same with demographic information.
for (i in 12:12){
CurrentTrial2<-unlist(strsplit(data[i],split=","))
Country<-c(Country,CurrentTrial2[2])
Age<-c(Age,CurrentTrial2[4])
}
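A minimal sketch of the append pattern, using three invented trial strings in the same comma-separated layout. The last two lines show one possible alternative (not the approach used in the slides) that splits everything at once and binds the pieces into a matrix.

```r
# Hypothetical trial strings; only the layout follows the slides
trials <- c("concert,music,Related,Pre-Lure",
            "music,music,Category,Pre-Lure",
            "rage,anger,Related,Post-Lure")

RelatedWord <- c()
CategoryWord <- c()
for (i in seq_along(trials)) {
  CurrentTrial <- unlist(strsplit(trials[i], split = ","))
  RelatedWord  <- c(RelatedWord,  CurrentTrial[1])  # append piece 1
  CategoryWord <- c(CategoryWord, CurrentTrial[2])  # append piece 2
}

# Alternative: split all trials, then stack them as rows of a character matrix
pieces <- do.call(rbind, strsplit(trials, split = ","))
identical(pieces[, 2], CategoryWord)
```

Growing vectors with c() is fine at this scale; the matrix version mainly saves typing when there are many fields per trial.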
10. Making data.frames
First, before we forget, we should convert certain variables so R
reads them as numeric. Ex: Memory<-as.numeric(Memory)
Next, let's make some neat data.frames
One is a simple truncated version and the other is a longer version.
AllData<-data.frame(Subject,Phase,Type,Memory,Confidence)
AllData2<-data.frame(Subject,Phase,Type,Memory,Confidence,CategoryWord,
RelatedWord,OnsetTime,ResponseTime)
12. Adding columns to our data
Still, so much of our data is missing. For instance, the times we have
recorded do not yet give us a true reaction time. Instead,
we only have the exact time when the stimulus was presented and
when a button was clicked.
So let's add to our data.frame by making a new column that
computes reaction time as the difference between the two time-related
variables we do have.
Ex:
library(dplyr)
AllData3<-mutate(AllData2,Reaction_Time=(ResponseTime-OnsetTime))
Let's also make it into a separate variable:
Reaction_Time<-AllData3$Reaction_Time
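The same subtraction can be sketched in base R, which avoids the dplyr dependency. The first pair of timestamps is copied from the example trial; the second row is invented. The sketch assumes the timestamps are still text after strsplit, so they are converted first.

```r
# Row 1 uses the timestamps from the example trial; row 2 is invented
AllData2 <- data.frame(
  OnsetTime    = c("1398816317238", "1398816320000"),
  ResponseTime = c("1398816319063", "1398816321500"),
  stringsAsFactors = FALSE
)
# strsplit() returns text, so convert before subtracting.
# as.numeric() is used because these values are too large for R's 32-bit integers.
AllData2$OnsetTime    <- as.numeric(AllData2$OnsetTime)
AllData2$ResponseTime <- as.numeric(AllData2$ResponseTime)
AllData2$Reaction_Time <- AllData2$ResponseTime - AllData2$OnsetTime
AllData2$Reaction_Time   # 1825 1500
```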
13. Putting it in Excel
Our data as a CSV file that Excel can open:
write.table(AllData3, file='data.csv', sep=",",col.names=TRUE,row.names=FALSE)
Doesn't look half bad, but there is so much more work to go.
For starters, let's make the data more concise by calculating the means for each
subject by type and phase:
AllData<-aggregate(Memory~Subject+Phase+Type,AllData,mean)
Can also use this code from reshape2:
library(reshape2)
organize_data<-dcast(AllData,Subject+Phase+Type~.,value.var='Memory')
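To see what the aggregate step produces, here is a small sketch on invented 0/1 scores, two trials per subject and phase. The values and the single Type level are assumptions for illustration only.

```r
# Invented scores, two trials per Subject x Phase cell
AllData <- data.frame(
  Subject = rep(c(1, 2), each = 4),
  Phase   = rep(c("Pre-Lure", "Pre-Lure", "Post-Lure", "Post-Lure"), times = 2),
  Type    = "Related",
  Memory  = c(1, 0, 1, 1, 0, 0, 1, 0)
)
# One row per Subject x Phase x Type combination, holding the mean score
means <- aggregate(Memory ~ Subject + Phase + Type, data = AllData, FUN = mean)
means
```

Eight trial rows collapse to four rows of means, one per subject and phase.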
14. Splitting Data
We can even split the data into smaller subsets, for example a separate
piece for each subject
Ex:
library(plyr)
Averag4 <- dlply(.data=AllData, .variables='Subject')
The code below labels the last column of organize_data as Memory.
names(organize_data)[4]="Memory"
Next comes trying to merge the demographic information into our
data. We start by changing our data structure so that each combination of
Phase and Type takes its own column.
Ex:
clean_data<-dcast(organize_data, Subject ~ Phase + Type,value.var="Memory")
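The long-to-wide step can be illustrated without reshape2. The sketch below uses tapply as a base-R stand-in for dcast (it returns a matrix rather than a data.frame), and all of the numbers are invented.

```r
# Invented per-subject means in long form (one row per Subject x Phase x Type)
organize_data <- data.frame(
  Subject = rep(c(1, 2), each = 4),
  Phase   = rep(c("Pre-Lure", "Pre-Lure", "Post-Lure", "Post-Lure"), times = 2),
  Type    = rep(c("Related", "Category"), times = 4),
  Memory  = c(0.2, 0.6, 0.3, 0.8, 0.1, 0.5, 0.4, 0.9)
)
# One column per Phase_Type combination, one row per subject
condition <- paste(organize_data$Phase, organize_data$Type, sep = "_")
wide <- tapply(organize_data$Memory, list(organize_data$Subject, condition), mean)
wide["1", "Post-Lure_Category"]
```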
15. Subsetting Data
Now we are going to combine all those demographic variables we made earlier
with the most recent data frame.
demographic_info<-data.frame(Country,Vision,Handedness,Gender,clean_data)
Now we can find pretty much anything in our data. Ex: we can get all the
memory scores in the Post-Lure Category condition for male
participants.
Ex: demographic_info$Post.Lure_Category[demographic_info$Gender=="Male"]
If we examine the country participants were from, we will see that many
participants were from the United States, but not all of them wrote that exact term
in their questionnaire (some wrote U.S., America, etc.)
16. Recoding Data
demographic_info$Country[demographic_info$Country=="usa"|
demographic_info$Country=="USA"|
demographic_info$Country=="America"|
demographic_info$Country=="United States of America"|
demographic_info$Country=="US"|
demographic_info$Country=="United states"|
demographic_info$Country=="us"]<-"United States"
The code above makes the adjustment so that, when we examine the frequencies, it will not
appear that people who wrote America are from a different country than those who wrote the
United States.
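One possible shorter form of the same recoding uses %in% with a vector of spellings. This is a sketch on invented answers, not the code from the slides; the as.character() call is a precaution in case Country was stored as a factor.

```r
# Invented questionnaire answers
Country <- c("usa", "India", "America", "United states", "US", "Canada")
us_spellings <- c("usa", "USA", "America", "United States of America",
                  "US", "United states", "us")
# as.character() guards against Country being a factor, where assigning
# a value that is not an existing level would produce NA
Country <- as.character(Country)
Country[Country %in% us_spellings] <- "United States"
table(Country)   # frequencies after recoding
```

Keeping the spellings in one vector makes it easy to add a new variant when another one turns up in the frequencies.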
17. Analyzing our data
Let's do a 2-way ANOVA to see if there is actually an interaction
effect!
#We use * to get the interaction effect as well as the effects of the two
#independent variables.
aov.out<-aov(Memory~Phase*Type,AllData)
print(summary(aov.out),digits=10)
#will give us summary tables for the analysis we ran.
print(model.tables(aov.out,"means"),digits= 3)
Can also use this code to get the table of cell means:
bar=tapply(AllData$Memory,list(AllData$Type,AllData$Phase),mean)
And it turns out our interaction was significant.
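For anyone who wants to try the call without the real data, here is a sketch on an invented 2 x 2 data set with three scores per cell. The numbers are made up, so the output says nothing about the actual experiment.

```r
# Invented scores for a 2 x 2 design, three observations per cell
AllData <- data.frame(
  Phase  = rep(c("Pre-Lure", "Post-Lure"), each = 6),
  Type   = rep(rep(c("Related", "Category"), each = 3), times = 2),
  Memory = c(0.20, 0.25, 0.30,  0.60, 0.65, 0.70,
             0.25, 0.30, 0.35,  0.85, 0.90, 0.95)
)
# Phase * Type expands to Phase + Type + Phase:Type
aov.out <- aov(Memory ~ Phase * Type, data = AllData)
anova_table <- summary(aov.out)[[1]]
trimws(rownames(anova_table))   # the terms in the ANOVA table
# Cell means, arranged Type x Phase
cell_means <- tapply(AllData$Memory, list(AllData$Type, AllData$Phase), mean)
cell_means
```

The Phase:Type row of the table is the interaction test.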
18. Showing the
Interaction Graphically
And we can make different graphs from this code too!
Ex: library(sciplot)
lineplot.CI(x.factor=AllData$Type,response=AllData$Memory,group=AllData$Phase,
trace.label="Phase",xlab="Type",ylab="Percentage of Wrong Answers",
main="Interaction Effect")
Or if you want a bar graph, ex:
graph2 = barplot(bar, beside=TRUE, ylim=c(0.2,1),xpd=FALSE,
space=c(.1,.8), main="TIP Effect in DRM Paradigm",
xlab="Phase", ylab="Percentage of Wrong Answers", legend.text=TRUE,
args.legend = list(x="topleft"), col=c("red","blue"))
19. Applying this to
other parts of the data
We can use the same code to find the difference in reaction times
between the four conditions and run an ANOVA on that to
see if there's a significant difference in reaction time.
We can also analyze demographic information. First, let's again
create isolated variables to represent certain columns that we
made in our demographic data.
Now, let's create a new data.frame but with a structure that looks
more familiar to us.
Ex:
demographic_info2=data.frame(Gender,Post.Lure_Category,Pre.Lure_Category,
Pre.Lure_Related,Post.Lure_Related)
20. Organizing the data
by_gender<-melt(data=demographic_info2,id.vars="Gender")
dcast(data=by_gender, formula=Gender~variable, value.var='value',fun.aggregate=mean)
The code above is exactly what we need to find means by gender.
In fact, we can use this code to merge all this data into a really organized package.
demographic_info3=data.frame(Gender, Country, Handedness,Vision,
Post.Lure_Category,Pre.Lure_Category,Pre.Lure_Related,Post.Lure_Related)
by_gender2<-melt(data=demographic_info3,id.vars=c("Gender","Country","Handedness","Vision"))
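The means-by-gender table can also be sketched in base R. The example below swaps the melt and dcast pair for a single aggregate call, and the four participants and their scores are invented.

```r
# Invented wide-format scores for four participants
demographic_info2 <- data.frame(
  Gender             = c("Male", "Female", "Male", "Female"),
  Pre.Lure_Related   = c(0.2, 0.3, 0.4, 0.1),
  Post.Lure_Category = c(0.8, 0.7, 0.6, 0.9)
)
# "." on the left means: average every column that is not Gender
by_gender_means <- aggregate(. ~ Gender, data = demographic_info2, FUN = mean)
by_gender_means
```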
21. Running an ANCOVA
However, if we want to run an ANCOVA and examine if we should
control for gender, we should merge the demographic data with the
original data.frame we created-just so type and phase are separate
variables again.
Ex:
combine_everything<-data.frame(by_gender2$Gender,by_gender2$Country,
by_gender2$Handedness,by_gender2$Vision,AllData)
ANCOVA<-aov(Memory~Phase*Type+by_gender2.Gender,combine_everything)
print(summary(ANCOVA),digits=10)
22. Can even examine which was the
most luring category word!
###Will get the mean for each category word
bark=tapply(AllData2$Memory,list(AllData2$CategoryWord),mean)
library(plyr)
##Can then make this data into a nice data.frame using summarise.
organize_bark<-ddply(AllData2, "CategoryWord", summarise, mean=mean(Memory))
##Can sort it too
organize_bark <- organize_bark[order(organize_bark$mean,decreasing=TRUE),]
And we can graph this as well!
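Here is a small base-R sketch of the same ranking on invented data, using tapply in place of ddply. The category words and scores are made up purely to show the sorting.

```r
# Invented trials: two per category word
AllData2 <- data.frame(
  CategoryWord = c("anger", "anger", "music", "music", "sleep", "sleep"),
  Memory       = c(1, 0, 1, 1, 0, 0)
)
# Mean score per category word, then into a data.frame sorted high to low
bark <- tapply(AllData2$Memory, AllData2$CategoryWord, mean)
organize_bark <- data.frame(CategoryWord = names(bark), mean = as.vector(bark))
organize_bark <- organize_bark[order(organize_bark$mean, decreasing = TRUE), ]
organize_bark$CategoryWord[1]   # the word with the highest mean
```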
23. Conclusion
R makes the process of cleaning, organizing and analyzing data
quite enjoyable.
The fact that we can take raw data that looks like gibberish and
easily translate it into something coherent is quite amazing.
R allows us to subset our data in various ways, and explore whether certain
categorical independent variables (e.g. gender) are influencing the
effects our investigation revolves around.
R gives us the ability to run statistical analyses to see if
our results are actually significant.
R can easily depict our findings through graphs, facilitating a
shared understanding of the results among researchers.