52-Week Biotech Stock Price
100 Years of “Emma”
17 Years Super Bowl Viewership
What is common among these time series data?
All wrong!
    All these time series are fabrications
    All these time series are “random walks”
Welcome to secondary data analysis.
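The "random walk" point in the opening slides can be reproduced with a few lines of code. This is a minimal sketch, not the method used to draw the original slides: the function name `random_walk`, the starting value, and the Gaussian step size are all illustrative assumptions.

```python
import random


def random_walk(n_steps, start=100.0, step_sd=1.0, seed=0):
    """Return a path whose steps are independent Gaussian noise.

    All parameters are illustrative assumptions, not values from the slides.
    """
    rng = random.Random(seed)  # seeded so the fabricated series is reproducible
    path = [start]
    for _ in range(n_steps):
        # Each new value is the previous value plus pure noise.
        path.append(path[-1] + rng.gauss(0.0, step_sd))
    return path


# 52 weekly steps, loosely echoing the "52-week stock price" slide.
walk = random_walk(52, seed=1)
```

Plotted, a series like `walk` can look like a convincing trend even though nothing but noise produced it.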
Secondary Data Analysis
    B. Rey de Castro, Sc.D.
    Guest Researcher, CDC National Center for Health Statistics
    University of Maryland College Park School of Public Health
    FMSC 720 Study Design in MCH Epidemiology
    November 30, 2010
Secondary Data Analysis
    Data that you did not collect yourself
    Both the data and study design are givens
    The statistical analysis is up to you
Uses for Secondary Data
    Hypothesis generation/testing
    Pilot data for grant proposals
    Expanding knowledge
    Publications
National Health and Nutrition Examination Survey (NHANES)
    http://www.cdc.gov/nchs/nhanes.htm
    Population
        Children, adults nationwide
    Method
        Face-to-face interview
        Physical exams
    Content
        Chronic and infectious disease
        Mental health and cognitive functioning
        Energy balance
        Reproductive history and sexual behavior
        Respiratory disease
    Data
        N ~ 5,000 annually
        Initiated in 1960s; annual since 1999
        On-line tutorial
National Health Interview Survey (NHIS)
    http://www.cdc.gov/nchs/nhis.htm
    Population
        Households, families, adults, children nationwide
    Method
        Face-to-face interview
    Content
        Health conditions and behaviors, access to and use of health services
        Cancer Control Module (1987, 1992, 2000, 2003, and 2005)
            Energy balance
            Cancer screening
            Sun avoidance
            Tobacco use and control
            Genetic testing
    Data
        N ~ 40,000 households (~87,000 individuals) annually
        Initiated in 1957
Other Federal Surveys
    National Longitudinal Mortality Study
        http://www.census.gov/nlms/
    National Health Care Survey
        http://www.cdc.gov/nchs/nhcs.htm
    National Ambulatory Medical Care Survey
        http://www.cdc.gov/nchs/about/major/ahcd/ahcd1.htm
    Medical Expenditure Panel Survey
        http://www.meps.ahrq.gov/
    Medicare Current Beneficiary Survey
        http://www.cms.hhs.gov/MCBS/
    Medicare Health Outcomes Survey
        http://www.hosonline.org/
    National Survey on Drug Use and Health
        http://www.oas.samhsa.gov/nhsda.htm
    National Survey of Family Growth
        http://www.cdc.gov/nchs/about/major/nsfg/nsfgbiblio.htm
Strengths
    Inexpensive data collection and design costs
    More statistical power: larger samples
    Broader geographic area
    Generalizable to national population
    Improves understanding of hypothesis
    Test trends over time
    Potential for linkage
        Person
        Geographically
Limitations 1
    Substantial time spent on statistical analysis
    Cross-sectional
    Recall bias
    Mismatch: ideal and feasible hypothesis
    Mismatch: hypothesis and original purpose
    Generalizability to small areas impossible
    Specialized statistical techniques
Limitations 2
    Quality
    Validity & reliability
    Changes to survey over time
    Poor documentation
    Restricted/conditional access
    Confidentiality
Recap
    Just a few examples of publicly available data
    Most are cross-sectional
    All employ a complex sampling design
    Many use multi-stage sampling
    Requires special software to analyze, e.g., SUDAAN
    Use of weighting, clustering, and stratification
    Differences in variance estimation methods
Complex Surveys
Statistical Weight
    The statistical weight of a sampled person is the number of people in the population that the person represents.
    Weights derived from
        Selection probabilities
        Response rates
        Post-stratification adjustment, e.g., gender, education, income, region
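The three ingredients of a weight listed above can be combined as in the following sketch. It is a simplification under stated assumptions: the function name `final_weight`, the multiplicative form of the adjustments, and the example numbers are illustrative and are not taken from any particular survey's documentation.

```python
def final_weight(selection_prob, response_rate, poststrat_factor=1.0):
    """Combine the three weight components named on the slide.

    Hypothetical helper: base weight = 1 / selection probability,
    inflated for nonresponse, then scaled by a post-stratification factor.
    """
    if not (0 < selection_prob <= 1) or not (0 < response_rate <= 1):
        raise ValueError("probabilities must be in (0, 1]")
    base = 1.0 / selection_prob          # people represented before adjustment
    adjusted = base / response_rate      # respondents also stand in for nonrespondents
    return adjusted * poststrat_factor   # align with known population totals


# Made-up numbers: selected with probability 1 in 1,000, 80% response rate,
# and a post-stratification factor of 1.1.
w = final_weight(selection_prob=0.001, response_rate=0.8, poststrat_factor=1.1)
```

With these made-up inputs the sampled person represents roughly 1,375 people in the population.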
Stratification
    Population divided before sampling into disjoint, exhaustive groups (strata)
    Members termed primary sampling units (PSUs)
    Independent samples are taken in each stratum
    Strata formed by similar geographic areas
        e.g., NHANES: partition US counties into 49 strata based on region and economic/racial characteristics
        Sample 2 counties (PSUs) from each stratum
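Drawing a fixed number of PSUs independently within each stratum might be sketched as follows. The stratum and county names are hypothetical, and the sketch uses simple random sampling without replacement inside each stratum; real surveys often select PSUs with probability proportional to size, which this example does not attempt.

```python
import random


def sample_psus(strata, n_per_stratum=2, seed=0):
    """Independently sample PSUs within each stratum.

    `strata` maps a stratum label to its list of PSUs (all names hypothetical).
    """
    rng = random.Random(seed)
    return {
        name: rng.sample(psus, n_per_stratum)  # without replacement, within stratum only
        for name, psus in strata.items()
    }


# Toy frame: two strata of made-up counties.
strata = {
    "stratum_A": ["county_1", "county_2", "county_3", "county_4"],
    "stratum_B": ["county_5", "county_6", "county_7"],
}
chosen = sample_psus(strata)
```

Because every stratum contributes its own sample, no stratum can be missed by chance, which is the point of dividing the population before sampling.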
Clustering
    Persons residing in a small area may have similar characteristics
    Thus, responses of subjects in a small area are potentially correlated
    Correlation must be accounted for in the analysis
    Survey analysis programs do this through strata/PSU information
Variance Estimation for Surveys
    Linearization: uses a Taylor series expansion to estimate variance of non-linear estimators
        Default method for most programs
        Requires stratification and PSU information
    Replication: calculates parameter estimates for each replicate and combines them to estimate variance
        Jackknife with replicate weights available for SUDAAN, STATA, SAS, and WesVar
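As an illustration of linearization, the sketch below estimates the variance of a weighted mean (a ratio of two weighted totals, hence non-linear) from strata and PSU identifiers. It assumes the common with-replacement approximation for PSU sampling and no finite population correction; the function name and the toy records are made up, and production analyses should rely on survey software rather than this sketch.

```python
from collections import defaultdict


def weighted_mean_linearized(records):
    """records: iterable of (stratum, psu, weight, y).

    Returns (weighted mean, Taylor-linearized variance estimate).
    """
    records = list(records)
    total_w = sum(w for _, _, w, _ in records)
    mean = sum(w * y for _, _, w, y in records) / total_w

    # Linearized value for each person, summed up to the PSU level.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in records:
        psu_totals[stratum][psu] += w * (y - mean) / total_w

    # Between-PSU variability within each stratum, summed over strata.
    variance = 0.0
    for stratum, psus in psu_totals.items():
        n_h = len(psus)
        if n_h < 2:
            raise ValueError(f"stratum {stratum!r} needs at least 2 PSUs")
        z_bar = sum(psus.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in psus.values())
    return mean, variance


# Toy data (made up): two strata, two PSUs each.
toy = [
    ("s1", "a", 10.0, 1.0),
    ("s1", "a", 10.0, 0.0),
    ("s1", "b", 20.0, 1.0),
    ("s2", "c", 15.0, 0.0),
    ("s2", "d", 5.0, 1.0),
]
est_mean, est_var = weighted_mean_linearized(toy)
```

The square root of `est_var` is the design-based standard error of the weighted mean under these assumptions.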
Replication vs. Linearization
    If survey doesn’t have replicate weights, use the full sample weights and linearization
    If survey has replicate weights, use them with the jackknife procedure
    Most software uses the linearization method
    Only SUDAAN, STATA, SAS, and WesVar can incorporate replicate weights
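The replication idea can be sketched with a delete-one-PSU jackknife. Note the assumption: surveys that publish replicate weights supply them ready-made, whereas this sketch builds its own replicates from stratum and PSU identifiers. Function names and toy data are hypothetical.

```python
from collections import defaultdict


def weighted_mean(rows, weights):
    """Weighted mean of y (fourth field of each row) under the given weights."""
    return sum(w * r[3] for r, w in zip(rows, weights)) / sum(weights)


def jackknife_variance(rows):
    """rows: list of (stratum, psu, weight, y).

    Delete-one-PSU jackknife: drop each PSU in turn, reweight the rest of
    its stratum, re-estimate, and combine the squared deviations.
    """
    full = weighted_mean(rows, [r[2] for r in rows])
    psus_by_stratum = defaultdict(set)
    for stratum, psu, _, _ in rows:
        psus_by_stratum[stratum].add(psu)

    variance = 0.0
    for stratum, psus in psus_by_stratum.items():
        n_h = len(psus)
        if n_h < 2:
            raise ValueError(f"stratum {stratum!r} needs at least 2 PSUs")
        for dropped in psus:
            rep_w = []
            for s, p, w, _ in rows:
                if s == stratum and p == dropped:
                    rep_w.append(0.0)                    # dropped PSU gets zero weight
                elif s == stratum:
                    rep_w.append(w * n_h / (n_h - 1))    # remaining PSUs are scaled up
                else:
                    rep_w.append(w)                      # other strata are untouched
            replicate = weighted_mean(rows, rep_w)
            variance += (n_h - 1) / n_h * (replicate - full) ** 2
    return full, variance


# Toy data (made up): two strata, two PSUs each.
toy_rows = [
    ("s1", "a", 10.0, 1.0),
    ("s1", "a", 10.0, 0.0),
    ("s1", "b", 20.0, 1.0),
    ("s2", "c", 15.0, 0.0),
    ("s2", "d", 5.0, 1.0),
]
jk_mean, jk_var = jackknife_variance(toy_rows)
```

Each replicate is simply the full-sample estimator rerun with a different weight column, which is why software that accepts replicate weights needs no strata or PSU information at analysis time.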
Complex Survey Design
    Correct variance estimates
    Proper hypothesis testing
    Standard errors will tend to be larger
    Less likely to make Type I error
Statistical Software for Analyzing Health Surveys
    Specifically designed for analyzing data utilizing complex sampling designs:
        SUDAAN
        WesVar
    Others that can be used:
        SAS
        STATA
        SPSS
        Mplus
Data/Research Resources
    Univ. of Michigan Consortium for social research: http://www.icpsr.umich.edu/
    UCLA Statistical Computing: http://www.ats.ucla.edu/stat/
    BRFSS Maps: http://apps.nccd.cdc.gov/gisbrfss/default.aspx
    State Cancer Profiles: http://statecancerprofiles.cancer.gov/
References
    Korn, E.L. and Graubard, B.I. (1999). Analysis of Health Surveys. New York: John Wiley
    State Cancer Profiles: http://statecancerprofiles.cancer.gov/
    SUDAAN: http://www.rti.org/SUDAAN/
    SAS: http://www.sas.com/
    SPSS: http://www.spss.com/
    STATA: http://www.stata.com/
    WesVar: http://www.westat.com/wesvar/
    Mplus: http://www.statmodel.com/
Other Data Sources
    State registries
        Birth
        Death
        Cancer
    Emergency room admissions
        Acute outcomes
Intermission
Secondary Data Analysis
    Data that you did not collect yourself
    Both the data and study design are givens
    The statistical analysis is up to you
Lesson One
    Integrity
Dirty Data
    Key-punch errors
    Invalid data
    Missing data
    Mislabeled variables
    Unknown variables
Preparing Data
Processing Data
    Recode data
    Label variables
    Format data
Investigation
    Reality checks
        Out-of-range values
    Descriptive statistics
        Ranges: out-of-range or improbable values
        Frequencies: missing values or classes
    Simple graphical display
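The reality checks listed under Investigation can be sketched in a few lines. The helper names, the use of `None` for missing values, and the example age range are illustrative assumptions.

```python
from collections import Counter


def out_of_range(values, low, high):
    """Return (index, value) pairs that fall outside [low, high].

    Missing values (None) are skipped here and counted separately below.
    """
    return [
        (i, v)
        for i, v in enumerate(values)
        if v is not None and not (low <= v <= high)
    ]


def frequencies(values):
    """Frequency table that makes missing values visible as their own class."""
    return Counter("missing" if v is None else v for v in values)


# Made-up ages checked against a hypothetical valid range of 3 to 17 years.
ages = [4, 12, None, 170, 9, -1]
bad = out_of_range(ages, 3, 17)

# Made-up categorical responses.
freq = frequencies(["yes", "no", None, "yes"])
```

Here `bad` flags the values 170 and -1 with their positions, and `freq` shows one missing response alongside the valid classes.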
Normal Ranges
Imputing Missing Values
    Increases available data
    Statistically more complex
    Defensibility
    Useful
Lesson Two
    Spend time up-front being sure about your data
    Foundation of sand or stone?
    Crystal clear case definition & recodes
    More time preparing than analyzing
    Prevents problems
    Simplifies analysis
Statistical Analysis Plan
Outcome
Design
Clustered Data
Longitudinal
Hierarchical
Diagnostics
    Independence
    Homoskedasticity
    Skewness
    Influential observations
Lesson Three
    Plan, then execute the plan
    Conform statistical technique to outcome and design
    Diagnostics
Case Study
    Ongoing spatial epidemiology project
    Complex survey
    Cross-sectional
    Data linkage
    Childhood asthma episodes
    Air pollution exposure
Case Study
    Air pollutant: acrolein
    EPA attributes >90% non-cancer respiratory health effects to acrolein
    No epidemiology to date
Data Linkage
National Health Interview Survey
    Health outcome
        Asthma episode in last 12 months
    2000 – 2004
    Children 3 – 17 years old
    Parents of ~66,000 kids surveyed
    Nationally representative sample
    Complex survey weighting
National Health Interview Survey
    Potential Confounders
        Smoking household
        Acrolein industry household
        Age, sex, race
        Education, income, single-parent family
        Access to care, insurance
        Urban/rural
        Census regional division
National Air Toxics Assessment
    Air pollutant
        Acrolein
        Strong respiratory irritant
        Cigarette smoke; industrial emissions
    2002
    Modeled exposure assessment
    Census tracts nationwide
How would you link these two databases?
Geographic Linkage
But, requires access to confidential NHIS data.
NCHS Says
    Orient to data structure and contents
    Locate variables
    Download data
    Append & merge data
    Clean & recode data
    Format & label variables
NHIS Data Processing
    Extract and compile data by year
        Multiple files
        2004 redesign
    Compile data 2000 – 2004
        Formatting and variable names a pain
    Identify records with complete data
    Link to NATA
        Done confidentially by NCHS staff
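The geographic linkage step, joining survey records to modeled exposure by census tract, might look like the sketch below. All field names (`tract_id`, `asthma_episode`, `acrolein`) and all values are made up for illustration; they are not the real NHIS or NATA variable names, and the actual linkage was done confidentially by NCHS staff.

```python
def link_by_tract(survey_rows, exposure_by_tract):
    """Attach a tract-level exposure estimate to each survey record.

    Returns (linked, unmatched) so that records without an exposure
    estimate are reported rather than silently dropped.
    """
    linked, unmatched = [], []
    for row in survey_rows:
        exposure = exposure_by_tract.get(row["tract_id"])
        if exposure is None:
            unmatched.append(row)
        else:
            linked.append({**row, "acrolein": exposure})
    return linked, unmatched


# Toy inputs with hypothetical tract identifiers and exposure values.
survey = [
    {"child_id": 1, "tract_id": "T001", "asthma_episode": 1},
    {"child_id": 2, "tract_id": "T002", "asthma_episode": 0},
    {"child_id": 3, "tract_id": "T999", "asthma_episode": 0},
]
exposure = {"T001": 0.12, "T002": 0.05}

linked, unmatched = link_by_tract(survey, exposure)
```

Keeping the unmatched records visible supports the "identify records with complete data" step, since the analyst can see how many children lack an exposure estimate.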
Analysis Plan
    Hypothesis
        “Childhood asthma episodes are associated with census-tract-level estimates of acrolein exposure”
    Descriptive statistics
    Logistic regression
    Complex weighted variance estimation
    SAS-callable SUDAAN
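The weighted logistic regression in the plan above can be illustrated with a small Newton-Raphson fit for one predictor. This is only a sketch of the point estimates: it is not SUDAAN, it omits the confounders, and it produces no design-based standard errors, which still require strata and PSU information. The function name and the toy data are made up.

```python
import math


def fit_weighted_logistic(x, y, w, n_iter=25):
    """Fit logit P(y=1) = b0 + b1*x by maximizing the weighted log-likelihood.

    Newton-Raphson with a 2x2 information matrix; returns (b0, b1).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, wi in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += wi * (yi - p)          # weighted score for the intercept
            g1 += wi * (yi - p) * xi     # weighted score for the slope
            v = wi * p * (1.0 - p)
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * step = score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Toy data (made up): x = 1 for high exposure, y = 1 for an asthma episode,
# w = survey weight of each record.
x = [0, 0, 1, 1]
y = [1, 0, 1, 0]
w = [1.0, 3.0, 2.0, 2.0]

b0, b1 = fit_weighted_logistic(x, y, w)
odds_ratio = math.exp(b1)
```

In this toy example the weighted odds of an episode are 1 to 3 in the low-exposure group and 2 to 2 in the high-exposure group, so the fitted odds ratio comes out at about 3.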
Wisdom
    Network
    Cultivate relationships
        Front-line staff
        Principal investigators
Secondary Data Analysis

  • 2. 100 Years of “Emma” 2
  • 3. 17 Years Superbowl Viewership 3
  • 4. 4 What is common among these time series data?
  • 5. All wrong! All these time series are fabrications All these time series are “random walks” 5
  • 6. 6 Welcome to secondary data analysis.
  • 7. Secondary Data Analysis B. Rey de Castro, Sc.D. Guest Researcher CDC National Center for Health Statistics University of Maryland College Park School of Public Health FMSC 720 Study Design in MCH Epidemiology November 30, 2010
  • 8. Secondary Data Analysis Data that you did not collect yourself Both the data and study design are givens The statistical analysis is up to you 8
  • 9. Uses for Secondary Data Hypothesis generation/testing Pilot data for grant proposals Expanding knowledge Publications
  • 10. National Health and Nutrition Examination Survey (NHANES) http://www.cdc.gov/nchs/nhanes.htm Population Children, adults nationwide Method Face to face interview Physical exams Content Chronic and Infectious Disease Mental health and cognitive functioning Energy Balance Reproductive history and sexual behavior Respiratory disease Data N ~ 5,000 annually Initiated in 1960’s; Annual since 1999 On-line tutorial
  • 11. National Health Interview Survey (NHIS) http://www.cdc.gov/nchs/nhis.htm Population Households, families, adults, children nationwide Method Face to face interview Content Health conditions and behaviors, access to and use of health services Cancer Control Module (1987, 1992, 2000, 2003, and 2005) Energy Balance Cancer Screening Sun Avoidance Tobacco Use and Control Genetic Testing Data N ~ 40,000 households (~87,000 individuals) annually Initiated in 1957
  • 12. Other Federal Surveys National Longitudinal Mortality Study http://www.census.gov/nlms/ National Health Care Survey http://www.cdc.gov/nchs/nhcs.htm National Ambulatory Medical Care Survey http://www.cdc.gov/nchs/about/major/ahcd/ahcd1.htm Medical Expenditure Panel Survey http://www.meps.ahrq.gov/ Medicare Current Beneficiary Survey http://www.cms.hhs.gov/MCBS/ Medicare Health Outcomes Survey http://www.hosonline.org/ National Survey on Drug Use and Health http://www.oas.samhsa.gov/nhsda.htm National Survey of Family Growth http://www.cdc.gov/nchs/about/major/nsfg/nsfgbiblio.htm
  • 13. Strengths Inexpensive data collection and design costs More statistical power: larger samples Broader geographic area Generalizable to national population Improves understanding of hypothesis Test trends over time Potential for linkage Person Geographically
  • 14. Limitations 1 Substantial time spent on statistical analysis Cross-sectional Recall bias Mismatch: ideal and feasible hypothesis Mismatch: hypothesis and original purpose Generalizability to small areas impossible Specialized statistical techniques
  • 15. Limitations 2 Quality Validity & reliability Changes to survey over time Poor documentation Restricted/conditional access Confidentiality
  • 16. Recap Just a few examples of publicly available data Most are cross-sectional All employ a complex sampling design Many use multi-stage sampling Requires special software to analyze e.g., SUDAAN Use of weighting, clustering, and stratification Differences in variance estimation methods
  • 18. Statistical Weight The statistical weight of a sampled person is the number of people in the population that the person represents. Weights derived from Selection probabilities Response rates Post-stratification adjustment e.g., gender, education, income, region
  • 19. Stratification Population divided before sampling into disjoint, exhaustive groups (strata) Members termed primary sampling units (PSUs) Independent samples are taken in each stratum Strata formed by similar geographic areas e.g., NHANES: partition US counties into 49 strata based on region and economic/racial characteristics Sample 2 counties (PSUs) from each stratum
  • 20. Clustering Persons residing in a small area may have similar characteristics Thus, responses of subjects in small area are potentially correlated Correlation must be accounted for in the analysis Survey analysis programs do this through strata/PSU information
  • 21. Variance Estimation for Surveys Linearization: Uses a Taylor series expansion to estimate variance of non-linear estimators Default method for most programs Requires stratification and PSU information Replication: Calculates parameter estimates for each replicate and combines to estimate variance Jackknife with replicate weights available for SUDAAN, STATA, SAS and WesVar
  • 22. Replication vs. Linearization If survey doesn’t have replicate weights, use the full sample weights and linearization If survey has replicate weights, use them with the jackknife procedure Most software uses the linearization method Only SUDAAN, STATA, SAS, and WesVar can incorporate replicate weights
  • 23. Complex Survey Design Correct variance estimates Proper hypothesis testing Standard errors will tend to be larger Less likely to make Type I error
  • 24. Statistical Software for Analyzing Health Surveys Specifically designed for analyzing data utilizing complex sampling designs: SUDAAN WesVar Others that can be used: SAS STATA SPSS Mplus
  • 25. Data/Research Resources Univ. of Michigan Consortium for social research: http://www.icpsr.umich.edu/ UCLA Statistical Computing: http://www.ats.ucla.edu/stat/ BRFSS Maps http://apps.nccd.cdc.gov/gisbrfss/default.aspx State Cancer Profiles http://statecancerprofiles.cancer.gov/
  • 26. References Korn, E.L. and Graubard, B.I. (1999). Analysis of Health Surveys. New York: John Wiley State Cancer Profiles: http://statecancerprofiles.cancer.gov/ SUDAAN: http://www.rti.org/SUDAAN/ SAS: http://www.sas.com/ SPSS: http://www.spss.com/ STATA: http://www.stata.com/ WesVar: http://www.westat.com/wesvar/ Mplus: http://www.statmodel.com/
  • 27. Other Data Sources State registries Birth Death Cancer Emergency room admissions Acute outcomes
  • 29. Secondary Data Analysis Data that you did not collect yourself Both the data and study design are givens The statistical analysis is up to you
  • 30. Lesson One Integrity
  • 31. Dirty Data Key-punch errors Invalid data Missing data Mislabeled variables Unknown variables
  • 33. Processing Data Recode data Label variables Format data
  • 34. Investigation Reality checks Out-of-range values Descriptive statistics Ranges: out-of-range or improbable values Frequencies: missing values or classes Simple graphical display
  • 36. Imputing Missing Values Increases available data Statistically more complex Defensibility Useful
  • 37. Lesson Two Spend time up-front being sure about your data Foundation of sand or stone? Crystal clear case definition & recodes More time preparing than analyzing Prevents problems Simplifies analysis
  • 44. Diagnostics Independence Homoskedasticity Skewness Influential observations
  • 45. Lesson Three Plan, then execute the plan Conform statistical technique to outcome and design Diagnostics
  • 46. Case Study Ongoing spatial epidemiology project Complex survey Cross-sectional Data linkage Childhood asthma episodes Air pollution exposure
  • 47. Case Study Air pollutant: acrolein EPA attributes >90% non-cancer respiratory health effects to acrolein No epidemiology to date
  • 49. National Health Interview Survey Health outcome Asthma episode in last 12 months 2000 – 2004 Children 3 – 17 years-old Parents of ~66,000 kids surveyed Nationally representative sample Complex survey weighting
  • 50. National Health Interview Survey Potential Confounders Smoking household Acrolein industry household Age, sex, race Education, income, single-parent family Access to care, insurance Urban/rural Census regional division
  • 51. National Air Toxics Assessment Air pollutant Acrolein Strong respiratory irritant Cigarette smoke; industrial emissions 2002 Modeled exposure assessment Census tracts nationwide
  • 52. How would you link these two databases?
  • 54. But, requires access to confidential NHIS data.
  • 55. NCHS Says Orient to data structure and contents Locate variables Download data Append & merge data Clean & recode data Format & label variables
  • 56. NHIS Data Processing Extract and compile data by year Multiple files 2004 redesign Compile data 2000 – 2004 Formatting and variable names a pain Identify records with complete data Link to NATA Done confidentially by NCHS staff
  • 57. Analysis Plan Hypothesis “Childhood asthma episodes are associated with census-tract-level estimates of acrolein exposure” Descriptive statistics Logistic regression Complex weighted variance estimation SAS-callable SUDAAN
  • 58. Wisdom Network Cultivate relationships Front-line staff Principal investigators
  • 59. Wisdom No one cares more about your problem than you. Or, you should.
  • 60. Wisdom Teach yourself Learn to learn
  • 61. Contact B. Rey de Castro, Sc.D. jsq7@cdc.gov http://www.slideshare.net/intelligo/secondary-data-analysis-5972949
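
Slides 18 to 21 describe statistical weights, strata, PSUs, and linearization in words only. The sketch below is one way those pieces could fit together for a weighted mean. It is not the SUDAAN procedure named in the slides: it assumes the common "PSUs sampled with replacement within strata" approximation, and the records, field names, and function names are made up for illustration.

```python
from collections import defaultdict
from math import sqrt

def weighted_mean(records):
    """Weighted mean: each person counts for `weight` people in the population."""
    total_weight = sum(r["weight"] for r in records)
    return sum(r["weight"] * r["y"] for r in records) / total_weight

def linearized_se(records):
    """Taylor-linearization standard error of the weighted mean.

    Assumption: PSUs are treated as sampled with replacement within each
    stratum, so only between-PSU variation inside a stratum is used.
    """
    total_weight = sum(r["weight"] for r in records)
    mean = weighted_mean(records)

    # Sum the linearized values within each PSU, grouped by stratum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for r in records:
        z = r["weight"] * (r["y"] - mean) / total_weight
        psu_totals[r["stratum"]][r["psu"]] += z

    variance = 0.0
    for psus in psu_totals.values():
        n_psu = len(psus)
        if n_psu < 2:
            raise ValueError("each stratum needs at least two PSUs")
        psu_mean = sum(psus.values()) / n_psu
        squared_dev = sum((z - psu_mean) ** 2 for z in psus.values())
        variance += n_psu / (n_psu - 1) * squared_dev
    return sqrt(variance)

# Hypothetical toy data: 2 strata, 2 PSUs per stratum, y = 1 if outcome present.
records = [
    {"stratum": 1, "psu": 1, "weight": 100, "y": 1},
    {"stratum": 1, "psu": 1, "weight": 100, "y": 0},
    {"stratum": 1, "psu": 2, "weight": 200, "y": 1},
    {"stratum": 1, "psu": 2, "weight": 200, "y": 1},
    {"stratum": 2, "psu": 1, "weight": 150, "y": 0},
    {"stratum": 2, "psu": 1, "weight": 150, "y": 0},
    {"stratum": 2, "psu": 2, "weight": 50, "y": 1},
    {"stratum": 2, "psu": 2, "weight": 50, "y": 0},
]

print("weighted prevalence:", weighted_mean(records))
print("linearized SE:", linearized_se(records))
```

The variance is built from PSU totals rather than individual responses. That is how the within-area correlation described on the Clustering slide enters the standard error.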

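The opening slides state that the three example series are fabricated "random walks". The slides do not say how the series were produced, so the following is only a sketch of how one could be generated. The Gaussian step, the starting value, and the function name are assumptions.

```python
import random

def random_walk(n_steps, start=0.0, step_sd=1.0, seed=None):
    """Return a list of n_steps + 1 values, each the previous value plus noise."""
    rng = random.Random(seed)
    values = [start]
    for _ in range(n_steps):
        # Each step depends only on the previous value and an independent shock.
        values.append(values[-1] + rng.gauss(0.0, step_sd))
    return values

# A hypothetical "52-week stock price": 52 weekly steps from a starting price of 10.
weekly_prices = random_walk(52, start=10.0, step_sd=0.5, seed=2010)
print(len(weekly_prices), "points, first value", weekly_prices[0])
```

Plotting a few such series with different seeds tends to produce apparent trends and cycles even though none exist by construction.
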
Editor's Notes

  1. Stage 1: Primary sampling units (PSUs) are selected. These are mostly single counties or, in a few cases, groups of contiguous counties with probability proportional to a measure of size (PPS).
     Stage 2: The PSUs are divided up into segments (generally city blocks or their equivalent). As with each PSU, sample segments are selected with PPS.
     Stage 3: Households within each segment are listed, and a sample is randomly drawn. In geographic areas where the proportion of age, ethnic, or income groups selected for oversampling is high, the probability of selection for those groups is greater than in other areas.
     Stage 4: Individuals are chosen to participate in NHANES from a list of all persons residing in selected households. Individuals are drawn at random within designated age-sex-race/ethnicity screening subdomains. On average, 1.6 persons are selected per household.
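
The note above says PSUs and segments are selected with probability proportional to size (PPS) but does not say which PPS scheme is used. As an illustration only, the sketch below uses systematic PPS sampling, one common scheme. The county names, sizes, and function name are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def pps_systematic(sizes, n, rng):
    """Return indices of n units chosen by systematic PPS sampling.

    Units are laid end to end on a line, each occupying a length equal to its
    size. A random start plus equally spaced points then picks the units.
    A unit larger than the sampling interval can be picked more than once.
    """
    total = sum(sizes)
    interval = total / n
    start = rng.random() * interval
    cumulative = list(accumulate(sizes))
    picks = []
    for k in range(n):
        point = start + k * interval
        index = bisect_right(cumulative, point)
        # Guard against floating-point rounding at the far end of the line.
        picks.append(min(index, len(sizes) - 1))
    return picks

# Hypothetical stratum of four counties with a made-up measure of size.
county_names = ["County A", "County B", "County C", "County D"]
sizes = [400, 300, 200, 100]

sample = pps_systematic(sizes, 2, random.Random(2010))
print([county_names[i] for i in sample])
```

With these sizes and two draws, a county's chance of selection is twice its share of the total size, so County A is selected in about 80% of samples and County D in about 20%.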