Detecting Bad Data
CARMA Research Module
Jeff Stanton
May 18-20, 2006 Internet Data Collection Methods (Day 2-2)

Sources of Data Problems in Online Studies
  • Technical errors:
      • Programming errors: not common, but damaging when they occur
      • Server errors: can halt the collection of data
      • Transmission errors: uncommon and usually isolated to one record or field
  • Response fraud:
      • Inadvertent multiple response and malicious multiple response
      • Missing data
      • Intentionally malicious patterns of response leading to outliers or self-contradictory data

Response Fraud
  • Deindividuation: anonymous respondents, working at a distance from the researcher, have limited accountability to the research process
  • Participant incentives introduce mixed motives: respondents need to complete the instrument, but not to any particular level of quality
  • Minimal frauds: skipping questions, not thinking through the answers
  • Maximal frauds: a robot that answers randomly

Duplicate Detection
  • Fingerprint each row, e.g., with the sum of its numeric columns multiplied by the SD of the same columns
  • Create a new variable that holds this "checksum" value for each row/case
  • Sort the dataset on the checksum
  • Create a lag difference variable: each row's checksum minus the checksum of the neighboring row
  • Sort on the lag variable and investigate all cases with zero or small differences
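
The duplicate-detection steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the tolerance `tol`, and the sample rows are hypothetical, and the sample SD is used where the slide does not say which SD is meant. A matching checksum marks a candidate only, since two rows holding the same values in a different order also collide, so flagged pairs still need to be inspected by hand.

```python
from statistics import stdev

def checksum(row):
    """Fingerprint one case: sum of its numeric values times their sample SD."""
    return sum(row) * stdev(row)

def near_duplicates(rows, tol=1e-9):
    """Sort cases by checksum and return index pairs whose lag difference is <= tol."""
    keyed = sorted((checksum(r), i) for i, r in enumerate(rows))
    pairs = []
    for (c_prev, i_prev), (c_next, i_next) in zip(keyed, keyed[1:]):
        if abs(c_next - c_prev) <= tol:   # zero or small lag difference
            pairs.append((i_prev, i_next))
    return pairs

# Hypothetical responses: row 2 is an exact resubmission of row 0.
responses = [
    [1, 2, 3, 4, 5],
    [5, 5, 4, 4, 3],
    [1, 2, 3, 4, 5],
    [2, 2, 2, 3, 1],
]
flagged = near_duplicates(responses)
print(flagged)
```

Raising `tol` widens the net to near-duplicates, at the cost of more false alarms to review.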

Bogus Response Detection
  • Calculate common univariate statistics using the complete row of responses for each subject
  • Create new variables for the univariate summaries (mean, sd, skew, kurt, max, min)
  • Sort the cases by the mean value
  • Look for extreme outliers on the high and low ends
  • Sort the cases by standard deviation, skewness, kurtosis, maximum, minimum
  • Look for anomalies and trace them back to the original data for that subject
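
As a rough sketch of the row-wise summaries described above, the following Python computes the six statistics per subject and sorts cases by one of them. The subject IDs and responses are made up, and the skewness and kurtosis here are simple population moment estimates (an assumption; statistical packages often apply small-sample corrections, so values may differ slightly).

```python
from statistics import mean, pstdev

def row_summary(row):
    """Univariate summaries computed across one subject's responses."""
    m = mean(row)
    sd = pstdev(row)
    if sd == 0:
        # Constant responding: shape statistics are undefined.
        skew = kurt = float("nan")
    else:
        z = [(x - m) / sd for x in row]
        skew = mean(v ** 3 for v in z)
        kurt = mean(v ** 4 for v in z) - 3   # excess kurtosis
    return {"mean": m, "sd": sd, "skew": skew, "kurt": kurt,
            "max": max(row), "min": min(row)}

# Hypothetical data: one row of item responses per subject.
responses = {
    "s01": [3, 4, 2, 5, 3, 4],
    "s02": [5, 5, 5, 5, 5, 5],   # same option every time
    "s03": [1, 5, 1, 5, 1, 5],   # alternating extremes
    "s04": [2, 3, 3, 4, 2, 3],
}
summaries = {sid: row_summary(r) for sid, r in responses.items()}

# Sort cases by standard deviation and inspect both ends of the list.
by_sd = sorted(summaries, key=lambda sid: summaries[sid]["sd"])
print(by_sd[0], by_sd[-1])
```

The same sort can be repeated with `"mean"`, `"skew"`, `"kurt"`, `"max"`, or `"min"` as the key; anything at the extremes is traced back to the raw row.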

Multivariate Outlier Detection
  • Use Mahalanobis distance to detect outliers
      • Regress an arbitrary dependent variable on a set of related items (the items act as predictors, and the distance is computed over them)
      • Sort by Mahalanobis distance: larger distances are suggestive of outliers
  • Use autocorrelation to detect unusual data patterns
      • Flip the data: cases become variables and variables become cases
      • Run an autocorrelation function
      • Look at the ACF graphs to find oddly regular patterns of responding (autocorrelations in excess of .5 at one or more lags)
  • I have provided example SPSS code in the utilities area of the LMS for each of these tests
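
For readers without SPSS, here is a minimal sketch of the Mahalanobis step. It is not the SPSS code mentioned above: instead of obtaining the distance as a by-product of a regression, it computes the squared distance of each case from the centroid of the items directly, using the sample covariance matrix. All function names and the example data are hypothetical.

```python
def mean_vector(rows):
    """Column means (the centroid of the cases)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance_matrix(rows, mu):
    """Sample covariance matrix (divisor n - 1)."""
    n, p = len(rows), len(mu)
    cov = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [x - m for x, m in zip(r, mu)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += d[i] * d[j] / (n - 1)
    return cov

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def mahalanobis_sq(rows):
    """Squared Mahalanobis distance of every case from the centroid."""
    mu = mean_vector(rows)
    cov = covariance_matrix(rows, mu)
    out = []
    for r in rows:
        d = [x - m for x, m in zip(r, mu)]
        out.append(sum(di * yi for di, yi in zip(d, solve(cov, d))))
    return out

# Hypothetical responses to three related items; the last case is high on
# item 1 but low on the other two, against the general trend.
items = [[1, 2, 1], [2, 2, 3], [3, 3, 2], [4, 4, 4],
         [5, 4, 5], [2, 3, 2], [4, 5, 4], [5, 1, 1]]
d2 = mahalanobis_sq(items)
ranked = sorted(range(len(items)), key=lambda i: d2[i], reverse=True)
print(ranked[0], round(d2[ranked[0]], 2))
```

Sorting by `d2` in descending order puts the most suspicious cases first; where to draw the cutoff is left to the analyst, as on the slide.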

[Figure] Mahalanobis

[Figure] Plot, Sort, and Examine

[Figure] An ACF Indicating No Pattern

[Figure] An ACF with a Suspicious Pattern
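
The autocorrelation check can also be sketched without a statistics package. The code below treats one respondent's answers, in questionnaire order, as a series and reports the lags whose autocorrelation exceeds .5 in absolute value. The function names, the maximum lag of 5, and the two example series are assumptions for illustration; treating strongly negative autocorrelations as suspicious too is likewise an assumption that goes slightly beyond the "in excess of .5" rule.

```python
def autocorr(series, lag):
    """Sample autocorrelation of one respondent's answers at the given lag."""
    n = len(series)
    m = sum(series) / n
    denom = sum((x - m) ** 2 for x in series)
    if denom == 0:
        return float("nan")   # constant responding: the ACF is undefined
    num = sum((series[t] - m) * (series[t + lag] - m) for t in range(n - lag))
    return num / denom

def suspicious_lags(series, max_lag=5, threshold=0.5):
    """Lags (1..max_lag) whose autocorrelation exceeds the threshold in size."""
    return [lag for lag in range(1, max_lag + 1)
            if abs(autocorr(series, lag)) > threshold]

# Hypothetical respondents answering 20 five-point items.
cycling = [1, 2, 3, 4, 5] * 4   # walks through the options over and over
ordinary = [3, 4, 2, 4, 3, 5, 2, 3, 4, 1, 3, 2, 4, 3, 5, 3, 2, 3, 1, 3]

print(suspicious_lags(cycling))    # the cycle repeats every 5 items
print(suspicious_lags(ordinary))
```

A respondent who picks the same option throughout has no variance, so the ACF cannot flag them; the per-row standard deviation is the better screen for that case.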

Common Missing Data Mitigation Techniques
  • Item imputation: for composite scales expressed as the average of a set of items, ignore any missing values that appear on a small subset of the items
  • Mean substitution: suppresses variability
  • Time series imputation: mean of neighboring points; suppresses spikes
  • Regression imputation: works well for highly intercorrelated variables
  • Full information maximum likelihood imputation: available in some SEM programs
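
The three simplest techniques in this list can be illustrated with a short Python sketch. Missing values are represented as `None`; the function names and the `max_missing` cutoff are hypothetical, since the slide does not say how small a "small subset" is. Regression and full information maximum likelihood approaches need a statistics package and are not shown.

```python
def scale_score(items, max_missing=1):
    """Item imputation: average the answered items, tolerating a few gaps."""
    present = [x for x in items if x is not None]
    if not present or len(items) - len(present) > max_missing:
        return None   # too much missing to trust a composite score
    return sum(present) / len(present)

def mean_substitute(column):
    """Mean substitution: fill gaps with the variable's observed mean."""
    present = [x for x in column if x is not None]
    m = sum(present) / len(present)
    return [m if x is None else x for x in column]

def neighbor_mean(series):
    """Time series imputation: fill a gap with the mean of its two neighbors."""
    out = list(series)
    for i in range(1, len(series) - 1):
        before, after = series[i - 1], series[i + 1]
        if series[i] is None and before is not None and after is not None:
            out[i] = (before + after) / 2
    return out

print(scale_score([4, None, 5, 3]))        # one gap: mean of the three answers
print(scale_score([4, None, None, 3]))     # two gaps: no score
print(mean_substitute([2, None, 4]))
print(neighbor_mean([1, None, 5, None]))   # a trailing gap has no right neighbor
```

Because every filled value in `mean_substitute` sits exactly at the mean, the variable's variance shrinks, which is the "suppresses variability" caveat noted above.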

Excel Tips
  • Your friend the "fill" function
  • The power of "Paste Special"
  • Sorting: click on Data/Sort

Excel Statistical Formulas
  • =find(<find text>, <within text>, <start>)
      Looks for the string <find text> within the string <within text> and returns the position of the first occurrence at or after position <start>
      Example: =find("=", "fish=head", 1) returns 5
  • =Len(<string>)
      Returns the number of characters in a string
      Example: =Len("Ouch") returns 4
  • =Right(<string>,<length>)
      Returns the rightmost <length> characters in the string
      Example: =Right("fishhead",4) returns "head"
      =Left(<string>,<length>) works similarly
  • =average(value, value…)
      Gives the arithmetic mean of a collection of cells and/or numeric values
  • =stdev(value, value…) // stdevp(value, value…)
      Gives the sample/population standard deviation of a collection of cells and/or numeric values
  • =sum(value, value…)
      Gives the sum of a collection of cells and/or numeric values
  • =correl(vector1, vector2)
      Gives the Pearson correlation between two vectors
  • =if(<test>,<value if true>,<value if false>)
      Makes a logical test and returns a different value depending on whether the test is true or false
      Example: =if(1=1, "Yes!", "No…") returns "Yes!"
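
For anyone who later moves these checks from a spreadsheet into a script, the formulas above have close Python counterparts. The sketch below is a rough mapping, not part of the workshop materials; `excel_find` and `correl` are hypothetical helper names, and `excel_find` returns `None` where Excel would show an error for text that is not found.

```python
from statistics import mean, pstdev, stdev

def excel_find(find_text, within_text, start=1):
    """1-based position of find_text at or after start, like Excel's FIND."""
    pos = within_text.find(find_text, start - 1)
    return None if pos == -1 else pos + 1

def correl(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

values = [2, 4, 4, 4, 5, 5, 7, 9]

print(excel_find("=", "fish=head", 1))      # FIND
print(len("Ouch"))                          # LEN
print("fishhead"[-4:], "fishhead"[:4])      # RIGHT and LEFT with length 4
print(mean(values), sum(values))            # AVERAGE and SUM
print(stdev(values), pstdev(values))        # STDEV and STDEVP
print(correl([1, 2, 3], [2, 4, 6]))         # CORREL
print("Yes!" if 1 == 1 else "No…")          # IF
```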

Summary of Bad Data Problems
  • Multiple submissions: the same participant clicks on Submit, then Back, then Submit, then Back…
  • Unmotivated responding: the participant uses the same option over and over again
  • Malicious patterns: the participant enters some unusually regular pattern of responses
  • There are at least five errors of these kinds in the exercise dataset (see below)
