Stories from the Field: Data are Messy and that’s
(kind of) ok
Jude Towers, Lecturer in Sociology and Quantitative Methods
David Ellis, Lecturer in Computational Social Science
Introductions: who we are and why
we care about (even messy) data
Jude Towers
• Doctor of Applied Social Statistics, Lecturer in Sociology and
Quantitative Methods, Associate Director of the Violence &
Society UNESCO Centre and lead for the N8 Policing Research
Partnership, Training and Learning strand
• Current research is focused on the measurement of violence
• Work with data that are highly confidential and very, very
‘messy’ (e.g. individualised police records, NGO datasets)
• Teach Making Research Count: Engaging with Quantitative
Data – Faculty of Arts & Social Sciences ‘prequel to technical
methods courses’ - thinking critically about data
• JISC-sponsored Data Champion
Introductions: why we care about
(even messy) data
David Ellis
• Doctor of Psychology, Lecturer in Computational Social Science
at Lancaster, Core Researcher as part of CREST Research Centre,
Honorary Research Fellow at Lincoln
• Current research considers the measurement of digital traces
• Data collected is often messy and cloud-based
• JISC-sponsored Data Champion
Data: what counts?
• Inclusive understanding of ‘data’ - the
collection, use and management of a
myriad of forms of data
– ‘field’ data
• Policing
• Health
• Replication crisis within
Why bother with (messy) data?
Data, and the analysis of data, can entrench or contest our
understanding of the world – we can neither accept them at
face value nor dismiss them as positivistic and of no use for
progressive social change…
• Need to better support academics, students, policy-makers,
practitioners and the general public to better understand the
implications of the construction and analysis of data, the
presentation of data, especially statistical findings, and the use
and interpretation of ‘evidence’
-> key tool is robust management of data
Contribution to a progressive society, the common good,
a public academia
Messy Data:
• All data are ‘messy’ to some degree: data from ‘the field’ can be
especially messy
• Concepts and definitions can be wildly different
• Getting data is hard
– Sources; collection methods; confidentiality and anonymity;
access; sampling frames -> consequences of explicit and implicit
inclusions and exclusions
• ‘Cleaning’ data is time-consuming and can be highly political
– E.g. outliers: important anomalies or data ‘mistakes’?
• Units of measurement
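One way to keep the outlier question open during cleaning is to flag unusual values for review instead of deleting them. The sketch below uses the common 1.5 × IQR rule as an illustrative assumption; the function name `flag_outliers`, the variable names and the toy values are hypothetical and do not come from the talk.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR]; nothing is deleted."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [{"value": v, "outlier": not (low <= v <= high)} for v in values]

# Hypothetical toy data: appointments per year for nine patients.
annual_appointments = [4, 5, 5, 6, 6, 7, 7, 8, 42]

flagged = flag_outliers(annual_appointments)
for_review = [row["value"] for row in flagged if row["outlier"]]
print(for_review)
```

Every record is kept, so the decision about whether a flagged value is an important anomaly or a mistake stays visible and can be documented.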
Data are messy
– but that’s (kind of) OK
GOAL: Distinguish between the signal and
the noise
• SIGNAL: real variation we want to explain
• NOISE: random variation probably caused by the
process of collecting and using data, e.g.
measurement, sampling and human error (caveat:
tomorrow, with new knowledge or new techniques /
technology, we might return to this seemingly random ‘noise’
and impose a new meaning)
Nate Silver (2012) The Signal and The Noise: The Art and Science of Prediction. London, Penguin.
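As a rough illustration of the signal/noise distinction, the sketch below simulates a straight-line 'signal', adds random 'noise', and recovers the signal with ordinary least squares. All numbers (slope, intercept, noise level, seed) are illustrative assumptions, not values from the talk.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

random.seed(1)  # fixed seed so the run is repeatable
TRUE_INTERCEPT, TRUE_SLOPE = 10.0, 0.5  # the "signal" (assumed values)
xs = list(range(200))
noise = [random.gauss(0, 5) for _ in xs]  # the "noise": random error
ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + e for x, e in zip(xs, noise)]

intercept, slope = fit_line(xs, ys)
# What the fitted line leaves unexplained is treated as noise (for now).
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(f"estimated slope: {slope:.3f} (true value {TRUE_SLOPE})")
print(f"largest residual: {max(abs(r) for r in residuals):.1f}")
```

The residuals are what this particular model calls noise; a different model or new knowledge could later find structure in them, which is the caveat made above.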
Learning
GOAL: to expand the current knowledge base to improve
understanding of a particular issue/topic: learning is more than
collecting or producing (new) data -> data need to be integrated
into, and to change, the existing knowledge base
Example 1. NHS Administrative
Data
Ellis, McQueenie, McConnachie et al., (2017). The Lancet Public
Health
[Flowchart: data preparation pipeline]
• Input files
– appointments.csv (N=892,216)
– patients.csv (N=73,012)
– clinical.csv (N=704,828)
• Appointments
– remove administrative/secretary appointments: N=891,921 remaining after (<.01%) removed
– code appointments: attended = 830,039; DNA = 56,441
– remove non-appointments based on time rules
– reclassify based on codes of interest
– N=825,784 remaining after (7.4%) removed
– compute number of appointments attended/missed for each patient
• appointmenthistory dataframe
– patient ID, DNA, attended, total, percentage missed, annual DNA rate
– categorise each patient: zero, low, medium, high
– Zero N = 44,685 (63.7%); Low N = 19,281 (27.5%); Medium N = 5,097 (7.3%); High N = 1,102 (1.6%)
• Patients
– remove duplicate patients: N=2,356
– N = 491 patients (<1%) with no appointment data removed
• Appointment history merged with patients file (using patient ID as link)
– patientappointments dataset (N=70,165)
– ID, sex, age, distance, Rur8, PracticeRur8, SIMD, PracticeSIMD, Ethnic, attended, DNA, total, percentage missed, category, annual rate (attended)
– patients classified as frequent/non-frequent attenders (10th centile (annual attendance rate >= 8.66)): Yes = 7,283; No = 62,882
• Final steps
– subset to remove
– remove ethnicity data
– add age categories
– remove patients with missing data: N=2,460 (3.5%)
• Ready for analysis and visualization (N=67,705)
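The per-patient steps in the flowchart (count attended and missed appointments, compute the percentage missed, then link to the patients file by patient ID) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the field names (`patient_id`, `status`), the function names and the toy records are hypothetical, and the real files are not reproduced here.

```python
from collections import defaultdict

# Hypothetical toy records standing in for appointments.csv and patients.csv.
appointments = [
    {"patient_id": 1, "status": "attended"},
    {"patient_id": 1, "status": "DNA"},
    {"patient_id": 1, "status": "attended"},
    {"patient_id": 2, "status": "attended"},
    {"patient_id": 3, "status": "DNA"},
]
patients = [
    {"patient_id": 1, "sex": "F", "age": 34},
    {"patient_id": 2, "sex": "M", "age": 61},
    {"patient_id": 4, "sex": "F", "age": 47},  # has no appointment data
]

def appointment_history(rows):
    """Count attended and DNA (did not attend) appointments per patient."""
    counts = defaultdict(lambda: {"attended": 0, "DNA": 0})
    for row in rows:
        counts[row["patient_id"]][row["status"]] += 1
    history = {}
    for pid, c in counts.items():
        total = c["attended"] + c["DNA"]
        history[pid] = {
            "attended": c["attended"],
            "DNA": c["DNA"],
            "total": total,
            "percentage_missed": 100.0 * c["DNA"] / total,
        }
    return history

def merge_with_patients(history, patient_rows):
    """Link history to patient rows by ID; count patients with no history."""
    merged, dropped = [], 0
    for p in patient_rows:
        h = history.get(p["patient_id"])
        if h is None:
            dropped += 1  # report the exclusion instead of hiding it
            continue
        merged.append({**p, **h})
    return merged, dropped

history = appointment_history(appointments)
patient_appointments, n_dropped = merge_with_patients(history, patients)
print(patient_appointments)
print("patients removed (no appointment data):", n_dropped)
```

Returning the number of dropped patients alongside the merged rows mirrors the flowchart's habit of reporting an N at every exclusion step.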
Example 2. Problems within Social
Science
Shaw, Ellis, Kendrick et al., (2016). Cyberpsychology, Behavior and
Social Networking
Thank you!
j.towers1@Lancaster.ac.uk (@towersjude)
d.a.ellis@Lancaster.ac.uk (@davidaellis)
rdm@lancaster.ac.uk

Advance Mobile Application Development class 07
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptx
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 

Stories from the Field: Data are Messy and that's (kind of) ok

  • 1. Stories from the Field: Data are Messy and that’s (kind of) ok
    Jude Towers, Lecturer in Sociology and Quantitative Methods
    David Ellis, Lecturer in Computational Social Science
  • 2. Introductions: who we are and why we care about (even messy) data
    Jude Towers
    • Doctor of Applied Social Statistics, Lecturer in Sociology and Quantitative Methods, Associate Director of the Violence & Society UNESCO Centre and lead for the N8 Policing Research Partnership, Training and Learning strand
    • Current research is focused on the measurement of violence
    • Work with data which is highly confidential and very, very ‘messy’ (e.g. individualised police records, NGO datasets)
    • Teach Making Research Count: Engaging with Quantitative Data – Faculty of Arts & Social Sciences ‘prequel to technical methods courses’ - thinking critically about data
    • JISC-sponsored Data Champion
  • 3. Introductions: why we care about (even messy) data
    David Ellis
    • Doctor of Psychology, Lecturer in Computational Social Science at Lancaster, Core Researcher as part of CREST Research Centre, Honorary Research Fellow at Lincoln
    • Current research considers the measurement of digital traces
    • Data collected is often messy and cloud-based
    • JISC-sponsored Data Champion
  • 4. Data: what counts?
    • Inclusive understanding of ‘data’ - the collection, use and management of a myriad of forms of data
      – ‘field’ data
    • Policing
    • Health
    • Replication crisis within
  • 5. Why bother with (messy) data?
    Data, and the analysis of data, can entrench or contest our understanding of the world – we can neither accept them at face value nor dismiss them as positivistic and of no use for progressive social change…
    • Need to better support academics, students, policy-makers, practitioners and the general public to understand the implications of the construction and analysis of data, the presentation of data (especially statistical findings), and the use and interpretation of ‘evidence’
    -> key tool is robust management of data
    Contribution to a progressive society, the common good, a public academia
  • 6. Messy Data:
    • All data are ‘messy’ to some degree: data from ‘the field’ can be especially messy
    • Concepts and definitions can be wildly different
    • Getting data is hard
      – Sources; collection methods; confidentiality and anonymity; access; sampling frames -> consequences of explicit and implicit inclusions and exclusions
    • ‘Cleaning’ data is time consuming and can be highly political
      – E.g. Outliers: important anomalies or data ‘mistakes’?
    • Units of measurement
  • 7. Data are messy – but that’s (kind of) OK
    GOAL: Distinguish between the signal and the noise
    • SIGNAL: real variation we want to explain
    • NOISE: random variation probably caused by the process of collecting and using data, e.g. measurement, sampling and human error (caveat: tomorrow, with new knowledge or new techniques / technology, we might return to this seemingly random ‘noise’ and impose a new meaning)
    Nate Silver (2012) The Signal and The Noise: The Art and Science of Prediction. London, Penguin.
  • 8. Learning GOAL: to expand the current knowledge base to improve understanding of a particular issue/topic: learning is more than collecting or producing (new) data -> data needs to be integrated into and to change the existing knowledge base
  • 9. Example 1. NHS Administrative Data Ellis, McQueenie, McConnachie et al., (2017). The Lancet Public Health
  • 10. Example 1. NHS Administrative Data
  • 11. Example 1. NHS Administrative Data [flowchart of the data-processing steps; labels as extracted]
    Input files
    • appointments.csv N=892,216
    • patients.csv N=73,012
    • clinical.csv N=704,828
    Appointment processing
    • remove non-appointments based on time rules
    • remove administrative/secretary appointments
    • reclassify based on codes of interest
    • code appointments: attended = 830,039; DNA = 56,441
    • counts shown on the chart: N=891,921 remaining after (<.01%) removed; N=825,784 remaining after (7.4%) removed
    • compute number of appointments attended/missed for each patient
    appointmenthistory dataframe
    • patient ID, DNA, attended, total, percentage missed, annual DNA rate
    • Categorise each patient: zero, low, medium, high
    Patient processing
    • remove duplicate patients N=2,356
    • subset to remove; remove ethnicity data; add age categories
    • appointment history merged with patients file (using patient ID as link)
    • N = 491 patients (<1%) with no appointment data removed
    patientappointments dataset (N=70,165)
    • ID, sex, age, distance, Rur8, PracticeRur8, SIMD, PracticeSIMD, Ethnic, attended, DNA, total, percentage missed, category, annual rate (attended)
    • Zero N = 44,685 (63.7%); Low N = 19,281 (27.5%); Medium N = 5,097 (7.3%); High N = 1,102 (1.6%)
    • patients classified as frequent/non frequent attenders (10th centile (annual attendance rate>=8.66)): Yes = 7,283; No = 62,882
    • remove patients with missing data N=2,460 (3.5%)
    Ready for analysis and visualization (N=67,705)
  • 12. Example 1. NHS Administrative Data
  • 13. Example 1. NHS Administrative Data
  • 14. Example 2. Problems within Social Science
  • 15. Example 2. Problems within Social Science
  • 16. Example 2. Problems within Social Science 5 4 3 2 1
  • 17. Example 2. Problems within Social Science 5 Shaw, Ellis, Kendrick et al., (2016). Cyberpsychology, Behavior and Social Networking
  • 18. Example 2. Problems within Social Science
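
The processing steps listed on slide 11 (count attended and missed appointments per patient, categorise each patient, merge with the patients file on patient ID, drop patients with no appointment data) can be sketched roughly as below. This is a minimal illustration only and not the authors' code: the toy records, the function names and the category cut-offs (`low_cut`, `high_cut`) are all assumptions, the slides do not state the rule used to assign zero/low/medium/high, and "DNA" is read here as "did not attend".

```python
# Toy appointment records: (patient_id, status). Invented for illustration.
appointments = [
    ("p1", "attended"), ("p1", "attended"),
    ("p2", "attended"), ("p2", "DNA"), ("p2", "attended"), ("p2", "attended"),
    ("p3", "DNA"), ("p3", "DNA"), ("p3", "attended"),
    ("p9", "attended"),  # appointment with no matching patient row
]

# Toy patients file keyed by patient ID. p4 has no appointment data.
patients = {
    "p1": {"sex": "F", "age": 34},
    "p2": {"sex": "M", "age": 51},
    "p3": {"sex": "F", "age": 27},
    "p4": {"sex": "M", "age": 63},
}


def build_history(records):
    """Count attended and missed (DNA) appointments for each patient."""
    history = {}
    for patient_id, status in records:
        row = history.setdefault(patient_id, {"attended": 0, "DNA": 0})
        row[status] += 1
    for row in history.values():
        row["total"] = row["attended"] + row["DNA"]
        row["pct_missed"] = 100 * row["DNA"] / row["total"]
    return history


def categorise(pct_missed, low_cut=30.0, high_cut=60.0):
    """Assign a category. The cut-offs are hypothetical, not from the slides."""
    if pct_missed == 0:
        return "zero"
    if pct_missed < low_cut:
        return "low"
    if pct_missed < high_cut:
        return "medium"
    return "high"


def merge(patient_rows, history):
    """Join on patient ID; report patients dropped for lack of appointment data."""
    merged, dropped = {}, []
    for patient_id, info in patient_rows.items():
        if patient_id not in history:
            dropped.append(patient_id)
            continue
        row = history[patient_id]
        merged[patient_id] = {**info, **row, "category": categorise(row["pct_missed"])}
    return merged, dropped


history = build_history(appointments)
merged, dropped = merge(patients, history)
print(len(merged), "patients kept;", len(dropped), "dropped:", dropped)
```

Returning the dropped IDs alongside the merged rows mirrors the way the flowchart reports an N for every removal, so each exclusion stays visible rather than silently disappearing.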

Editor's Notes

  1. My idea for this slide would be for David to give an example / exemplar from Health and I will give one from Crime to illustrate, as per below. We could talk to one slide, or separate each bullet point into its own slide and add examples, whichever you think is best. [I’ve just added some slides pointing to 2 examples, but the points you raise here apply to everything]
    • Messy data: e.g. 80% of respondents reporting domestic violence to the Crime Survey for England and Wales (CSEW) have not reported to the police
    • Concepts and definitions: ‘what is violence?’ is the most controversial question in the field. Is it narrow and specific (e.g. a physical act which causes injury, fear or distress), or is it wide (e.g. Zizek ‘violence of capitalism’, Galtung ‘any unnecessary civilian death’)? Either choice has implications for data
    • Getting data (confidentiality and access): often clashes with the new ‘open data’ agenda, e.g. CSEW Intimate Violence data: need to be certified; access via secure server from a PC with a static IP address, in a locked room with no public access; all outputs have to be checked and signed off before removal from the server; only those certified can see data in ‘raw’ form or during the analysis process
    • Sampling frame: the CSEW excludes groups most likely to be victims of crime: homeless people, anyone in an institutional setting (e.g. prison, hospital, refuge), and anyone staying temporarily with friends or family (insecurely housed)
    • Outliers: remove serial killers from homicide trends
    • Unit of measurement: violent crime in the UK is going up if counted as crimes, going down if counted as victims
  2. Success stories…
  3. The truth is often far messier than what is presented within a journal https://psychology.shinyapps.io/example3/ https://psychology.shinyapps.io/smartphonepersonality/ https://t.co/DurJDuJHQM
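
The unit-of-measurement point in note 1 (a trend can rise when counting crimes and fall when counting victims) can be shown with a small sketch. The records below are invented purely for illustration and are not CSEW or police data; `count_by_unit` is a hypothetical helper name.

```python
# Toy incident records: (year, victim_id). Invented for illustration only.
incidents = [
    (2015, "a"), (2015, "b"), (2015, "c"), (2015, "d"),
    (2016, "a"), (2016, "a"), (2016, "a"), (2016, "b"), (2016, "b"),
]


def count_by_unit(records):
    """Count the same records two ways: per crime and per distinct victim."""
    crimes = {}
    victims = {}
    for year, victim_id in records:
        crimes[year] = crimes.get(year, 0) + 1
        victims.setdefault(year, set()).add(victim_id)
    return {
        year: {"crimes": crimes[year], "victims": len(victims[year])}
        for year in sorted(crimes)
    }


for year, counts in count_by_unit(incidents).items():
    print(year, counts)
```

In this toy data the number of crimes rises from 4 to 5 while the number of distinct victims falls from 4 to 2, because repeat victimisation is concentrated on fewer people.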