This document introduces Jude Towers and David Ellis, who are lecturers focused on quantitative methods and computational social science. They discuss how data can be messy, including inconsistencies in concepts and definitions, difficulties in data collection, and the politics of data cleaning. They argue that while data is imperfect, it is still useful for understanding society when the signal is distinguished from the noise. They provide two examples of working with messy real-world data: administrative health records from the NHS and social science replication problems. Their overall goal is to help people critically engage with quantitative data.
1. Stories from the Field: Data are Messy and that’s (kind of) ok
Jude Towers, Lecturer in Sociology and Quantitative Methods
David Ellis, Lecturer in Computational Social Science
2. Introductions: who we are and why we care about (even messy) data
Jude Towers
• Doctor of Applied Social Statistics, Lecturer in Sociology and Quantitative Methods, Associate Director of the Violence & Society UNESCO Centre and lead for the N8 Policing Research Partnership, Training and Learning strand
• Current research is focused on the measurement of violence
• Work with data which are highly confidential and very, very ‘messy’ (e.g. individualised police records, NGO datasets)
• Teach Making Research Count: Engaging with Quantitative Data – Faculty of Arts & Social Sciences ‘prequel to technical methods courses’ – thinking critically about data
• JISC-sponsored Data Champion
3. Introductions: why we care about (even messy) data
David Ellis
• Doctor of Psychology, Lecturer in Computational Social Science at Lancaster, Core Researcher as part of CREST Research Centre, Honorary Research Fellow at Lincoln
• Current research considers the measurement of digital traces
• Data collected is often messy and cloud-based
• JISC-sponsored Data Champion
4. Data: what counts?
• Inclusive understanding of ‘data’ – the collection, use and management of a myriad of forms of data
– ‘field’ data
• Policing
• Health
• Replication crisis within
5. Why bother with (messy) data?
Data, and the analysis of data, can entrench or contest our understanding of the world – we can neither accept them at face value nor dismiss them as positivistic and of no use for progressive social change…
• Need to better support academics, students, policy-makers,
practitioners and the general public to better understand the
implications of the construction and analysis of data, the
presentation of data, especially statistical findings, and the use
and interpretation of ‘evidence’
-> key tool is robust management of data
Contribution to a progressive society, the common good,
a public academia
6. Messy Data:
• All data are ‘messy’ to some degree: data from ‘the field’ can be
especially messy
• Concepts and definitions can be wildly different
• Getting data is hard
– Sources; collection methods; confidentiality and anonymity;
access; sampling frames -> consequences of explicit and implicit
inclusions and exclusions
• ‘Cleaning’ data is time-consuming and can be highly political
– E.g. Outliers: important anomalies or data ‘mistakes’?
• Units of measurement
7. Data are messy – but that’s (kind of) OK
GOAL: Distinguish between the signal and the noise
• SIGNAL: real variation we want to explain
• NOISE: random variation probably caused by the process of collecting and using data, e.g. measurement, sampling and human error (caveat: tomorrow, with new knowledge or new techniques/technology, we might return to this seemingly random ‘noise’ and impose a new meaning)
Nate Silver (2012) The Signal and The Noise: The Art and Science of Prediction. London, Penguin.
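The signal/noise distinction can be illustrated with a minimal sketch on simulated data. Everything here is an assumption made for illustration (the straight-line "signal", the size of the random "noise", and the function names `simulate` and `fit_line`); none of it comes from the talk. The idea is simply that a fitted trend recovers the real variation, and the spread left over is an estimate of the noise.

```python
import random
import statistics


def simulate(n_points=200, slope=0.5, noise_sd=5.0, seed=42):
    """Return x values and observations = signal + noise (all values invented)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    xs = list(range(n_points))
    # Signal: a straight-line trend. Noise: random error added to each observation.
    observed = [slope * x + rng.gauss(0, noise_sd) for x in xs]
    return xs, observed


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a straight line."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs, observed = simulate()
estimated_slope, intercept = fit_line(xs, observed)

# Residuals are what the fitted line does not explain: our estimate of the noise.
residuals = [y - (intercept + estimated_slope * x) for x, y in zip(xs, observed)]
noise_estimate = statistics.stdev(residuals)

print(f"estimated signal (slope): {estimated_slope:.3f}")
print(f"estimated noise (residual sd): {noise_estimate:.2f}")
```

With real field data the true signal is of course unknown, which is why the caveat above matters: what is treated as residual noise today may turn out to carry meaning later.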
8. Learning
GOAL: to expand the current knowledge base to improve understanding of a particular issue/topic. Learning is more than collecting or producing (new) data -> data need to be integrated into, and to change, the existing knowledge base
9. Example 1. NHS Administrative Data
Ellis, McQueenie, McConnachie et al. (2017). The Lancet Public Health
11. Example 1. NHS Administrative Data
[Flowchart: data preparation steps]
• Input files
– appointments.csv (N=892,216)
– patients.csv (N=73,012)
– clinical.csv (N=704,828)
• Appointment-level steps
– Remove administrative/secretary appointments: N=891,921 remaining after (<.01%) removed
– Remove non-appointments based on time rules: N=825,784 remaining after (7.4%) removed
– Code appointments: attended = 830,039; DNA = 56,441
– Reclassify based on codes of interest
– Compute number of appointments attended/missed for each patient
• appointmenthistory dataframe: patient ID, DNA, attended, total, percentage missed, annual DNA rate
• Categorise each patient: zero, low, medium, high
– Zero N = 44,685 (63.7%)
– Low N = 19,281 (27.5%)
– Medium N = 5,097 (7.3%)
– High N = 1,102 (1.6%)
• Patient-level steps
– Remove duplicate patients: N=2,356
– N = 491 patients (<1%) with no appointment data removed
• Appointment history merged with patients file (using patient ID as link)
• patientappointments dataset (N=70,165): ID, sex, age, distance, Rur8, PracticeRur8, SIMD, PracticeSIMD, Ethnic, attended, DNA, total, percentage missed, category, annual rate (attended)
• Patients classified as frequent/non-frequent attenders (10th centile; annual attendance rate >= 8.66): Yes = 7,283; No = 62,882
• Subset
– Remove ethnicity data
– Add age categories
– Remove patients with missing data: N=2,460 (3.5%)
• Ready for analysis and visualization (N=67,705)
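The core of the flowchart above (drop non-appointments, count attended and missed appointments per patient, categorise, then merge with the patients file on patient ID) can be sketched roughly as follows. This is a toy illustration using only the Python standard library, not the authors' code: the records are invented, "DNA" is assumed to mean "did not attend", the three-year observation window is an assumption, and the zero/low/medium/high thresholds in `categorise` are hypothetical placeholders because the slides do not state them.

```python
from collections import Counter

# Invented stand-in for appointments.csv: (patient_id, status)
appointments = [
    ("p1", "attended"), ("p1", "attended"), ("p1", "DNA"),
    ("p2", "attended"), ("p2", "attended"),
    ("p3", "DNA"), ("p3", "DNA"), ("p3", "DNA"), ("p3", "attended"),
    ("p4", "admin"),  # administrative entry, not a real appointment
]

# Invented stand-in for patients.csv, keyed by patient ID
patients = {
    "p1": {"sex": "F", "age": 34},
    "p2": {"sex": "M", "age": 51},
    "p3": {"sex": "F", "age": 27},
    "p4": {"sex": "M", "age": 66},  # ends up with no appointment data
}

YEARS_OBSERVED = 3  # assumption: length of the observation window in years


def categorise(annual_dna_rate):
    """Hypothetical thresholds; the slides do not give the real cut-offs."""
    if annual_dna_rate == 0:
        return "zero"
    if annual_dna_rate < 1:
        return "low"
    if annual_dna_rate < 2:
        return "medium"
    return "high"


def build_history(records):
    """Count attended/missed appointments per patient, skipping non-appointments."""
    counts = {}
    for patient_id, status in records:
        if status not in ("attended", "DNA"):
            continue  # e.g. administrative/secretary entries
        counts.setdefault(patient_id, Counter())[status] += 1

    history = {}
    for patient_id, c in counts.items():
        total = c["attended"] + c["DNA"]
        annual_dna_rate = c["DNA"] / YEARS_OBSERVED
        history[patient_id] = {
            "attended": c["attended"],
            "DNA": c["DNA"],
            "total": total,
            "percentage_missed": 100 * c["DNA"] / total,
            "annual_DNA_rate": annual_dna_rate,
            "category": categorise(annual_dna_rate),
        }
    return history


def merge(history, patient_table):
    """Inner join on patient ID: patients with no appointment data drop out."""
    return {
        pid: {**patient_table[pid], **row}
        for pid, row in history.items()
        if pid in patient_table
    }


patientappointments = merge(build_history(appointments), patients)
for pid, row in sorted(patientappointments.items()):
    print(pid, row["category"], row["attended"], row["DNA"])
```

Each step that drops rows (the `continue` for non-appointments, the inner join in `merge`) is exactly the kind of explicit or implicit exclusion the flowchart records counts for, so in practice it is worth logging how many rows each step removes.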
• Messy data: e.g. 80% of respondents who report domestic violence to the Crime Survey for England and Wales (CSEW) have not reported it to the police
• Concepts and definitions: what counts as violence is the most controversial question in the field. Is it narrow and specific (e.g. a physical act which causes injury, fear or distress), or is it wide (e.g. Zizek’s ‘violence of capitalism’, Galtung’s ‘any unnecessary civilian death’)? The choice has implications for the data
• Getting data: access requirements often clash with the new ‘open data’ agenda. E.g. for the CSEW Intimate Violence data, researchers need to be certified; access is via a secure server from a PC with a static IP address, in a locked room with no public access; all outputs have to be checked and signed off before removal from the server; and only those certified can see the data in ‘raw’ form or during the analysis process. The CSEW sampling frame also excludes the groups most likely to be victims of crime: the homeless, anyone in an institutional setting (e.g. prison, hospital, refuge), and anyone staying temporarily with friends or family (the insecurely housed)
• Outliers: e.g. removing serial killers from homicide trends
• Unit of measurement: violent crime in the UK is going up if crimes are counted, and going down if victims are counted
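The unit-of-measurement point can be made concrete with a small sketch. The incident records below are invented purely for illustration (they are not CSEW or police data), and the year labels 1 and 2 are arbitrary. The sketch shows how the same records can give a rising trend when crimes are counted and a falling trend when victims are counted, which happens when incidents become concentrated on fewer people.

```python
from collections import defaultdict

# Invented toy incident records: (year, victim_id). Not real crime data.
incidents = [
    (1, "a"), (1, "b"), (1, "c"), (1, "d"), (1, "e"),
    (2, "a"), (2, "a"), (2, "a"), (2, "a"),
    (2, "b"), (2, "b"), (2, "b"),
    (2, "f"),
]

crimes_per_year = defaultdict(int)    # unit of measurement: the crime
victims_per_year = defaultdict(set)   # unit of measurement: the victim

for year, victim_id in incidents:
    crimes_per_year[year] += 1
    victims_per_year[year].add(victim_id)

for year in sorted(crimes_per_year):
    print(
        f"year {year}: crimes = {crimes_per_year[year]}, "
        f"victims = {len(victims_per_year[year])}"
    )
```

Neither count is the ‘right’ one; which unit is reported is an analytical choice, and it should be stated alongside the trend.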
Success stories…
The truth is often far messier than what is presented within a journal
https://psychology.shinyapps.io/example3/
https://psychology.shinyapps.io/smartphonepersonality/
https://t.co/DurJDuJHQM