Stories from the Field: Data are Messy and that's (kind of) ok

Jude Towers, Lecturer in Sociology and Quantitative Methods
David Ellis, Lecturer in Computational Social Science

Introductions: who we are and why
we care about (even messy) data
Jude Towers
• Doctor of Applied Social Statistics, Lecturer in Sociology and
Quantitative Methods, Associate Director of the Violence &
Society UNESCO Centre and lead for the N8 Policing Research
Partnership, Training and Learning strand
• Current research is focused on the measurement of violence
• Work with data which are highly confidential and very, very
‘messy’ (e.g. individualised police records, NGO datasets)
• Teach Making Research Count: Engaging with Quantitative
Data – Faculty of Arts & Social Sciences ‘prequel to technical
methods courses’ - thinking critically about data
• JISC-sponsored Data Champion

Introductions: why we care about
(even messy) data
David Ellis
• Doctor of Psychology, Lecturer in Computational Social Science
at Lancaster, Core Researcher as part of CREST Research Centre,
Honorary Research Fellow at Lincoln
• Current research considers the measurement of digital traces
• Data collected is often messy and cloud-based
• JISC-sponsored Data Champion

Data: what counts?
• Inclusive understanding of ‘data’ - the
collection, use and management of a
myriad of forms of data
– ‘field’ data
• Policing
• Health
• Replication crisis within

Why bother with (messy) data?
Data, and the analysis of data, can entrench or contest our
understanding of the world – we can neither accept them at
face value nor dismiss them as positivistic and of no use for
progressive social change…
• Need to better support academics, students, policy-makers,
practitioners and the general public to better understand the
implications of the construction and analysis of data, the
presentation of data, especially statistical findings, and the use
and interpretation of ‘evidence’
-> key tool is robust management of data
Contribution to a progressive society, the common good,
a public academia

Messy Data:
• All data are ‘messy’ to some degree: data from ‘the field’ can be
especially messy
• Concepts and definitions can be wildly different
• Getting data is hard
– Sources; collection methods; confidentiality and anonymity;
access; sampling frames -> consequences of explicit and implicit
inclusions and exclusions
• ‘Cleaning’ data is time consuming and can be highly political
– E.g. Outliers: important anomalies or data ‘mistakes’?
• Units of measurement
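
The outlier question above can be made concrete with a minimal sketch. It flags unusual values for review instead of deleting them, so the decision about whether they are anomalies or mistakes stays with the analyst. The counts, the function name `flag_outliers` and the 3.5 cut-off are illustrative assumptions, not taken from the studies discussed here.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return (value, is_outlier) pairs using a median-based (modified) z-score.

    The threshold of 3.5 is a common rule of thumb, used here as an assumption.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        # No spread around the median: nothing can be flagged on this rule.
        return [(v, False) for v in values]
    return [(v, abs(0.6745 * (v - med) / mad) > threshold) for v in values]

# Hypothetical annual appointment counts for ten patients (invented values).
counts = [3, 4, 2, 5, 3, 4, 60, 3, 2, 4]
flagged = flag_outliers(counts)
outliers = [v for v, is_out in flagged if is_out]
print(outliers)  # values to inspect by hand, not to drop automatically
```

Flagging keeps every record in the dataset; whether a value of 60 is a frequent attender or a data-entry error is a question for the field, not for the code.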

Data are messy
– but that’s (kind of) OK
GOAL: Distinguish between the signal and
the noise
• SIGNAL: real variation we want to explain
• NOISE: random variation probably caused by the
process of collecting and using data, e.g.
measurement, sampling and human error (caveat:
tomorrow, with new knowledge or new techniques /
technology, we might return to this seemingly random ‘noise’
and impose a new meaning)
Nate Silver (2012) The Signal and The Noise: The Art and Science of Prediction. London, Penguin.
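
A minimal simulation sketch of the signal/noise distinction, assuming sampling error is the only source of noise. The 7% rate, the sample sizes and the name `observed_rate` are hypothetical and chosen only for illustration.

```python
import random
from statistics import pstdev

random.seed(42)  # fixed seed so the sketch is repeatable

TRUE_RATE = 0.07  # hypothetical "signal": the real share of missed appointments

def observed_rate(n):
    """Share of missed appointments in one random sample of size n."""
    missed = sum(1 for _ in range(n) if random.random() < TRUE_RATE)
    return missed / n

# Repeat the same "study" 200 times at two sample sizes.
small = [observed_rate(50) for _ in range(200)]
large = [observed_rate(5000) for _ in range(200)]

# The signal is the same in both cases; only the noise around it differs.
print(f"spread across small samples: {pstdev(small):.4f}")
print(f"spread across large samples: {pstdev(large):.4f}")
```

The estimates from small samples scatter far more widely around the same underlying rate, which is why a single small study can look like a finding when it is mostly noise.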

Learning
GOAL: to expand the current knowledge base to improve
understanding of a particular issue/topic: learning is more than
collecting or producing (new) data -> data need to be integrated
into, and to change, the existing knowledge base

Example 1. NHS Administrative
Data
Ellis, McQueenie, McConnachie et al., (2017). The Lancet Public
Health

Data
[Flowchart: data preparation pipeline]
• Source files
– appointments.csv (N=892,216)
– patients.csv (N=73,012)
– clinical.csv (N=704,828)
• Appointment-level steps
– Remove administrative/secretary appointments: N=891,921 remaining after (<.01%) removed
– Remove non-appointments based on time rules: N=825,784 remaining after (7.4%) removed
– Code appointments: attended = 830,039; DNA = 56,441
– Compute number of appointments attended/missed for each patient
• appointmenthistory dataframe
– Fields: patient ID, DNA, attended, total, percentage missed, annual DNA rate
• Patient-level steps
– Remove duplicate patients: N=2,356
– Appointment history merged with patients file (using patient ID as link)
– N = 491 patients (<1%) with no appointment data removed
• patientappointments dataset (N=70,165)
– Fields: ID, sex, age, distance, Rur8, PracticeRur8, SIMD, PracticeSIMD, Ethnic, attended, DNA, total, percentage missed, category, annual rate (attended)
– Categorise each patient (zero, low, medium, high): Zero N = 44,685 (63.7%); Low N = 19,281 (27.5%); Medium N = 5,097 (7.3%); High N = 1,102 (1.6%)
– Patients classified as frequent/non-frequent attenders (10th centile (annual attendance rate>=8.66)): Yes = 7,283; No = 62,882
– Remove patients with missing data: N=2,460 (3.5%)
• Other steps shown on the slide
– Reclassify based on codes of interest
– Subset to remove
– Remove ethnicity data
– Add age categories
• Ready for analysis and visualization (N=67,705)
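
The patient-level part of this pipeline (count attended and missed appointments per patient, categorise, merge on patient ID, then drop duplicates, patients without appointments and patients with missing data) can be sketched roughly as follows. This is not the authors' code: the toy records, the function names and the category cut-offs are all assumptions made for illustration, and "DNA" is read as a missed appointment because the slide pairs attended with missed.

```python
# Toy stand-ins for appointments.csv and patients.csv (all values invented).
appointments = [
    {"patient_id": 1, "status": "attended"},
    {"patient_id": 1, "status": "attended"},
    {"patient_id": 2, "status": "attended"},
    {"patient_id": 2, "status": "DNA"},
    {"patient_id": 3, "status": "DNA"},
    {"patient_id": 3, "status": "DNA"},
    {"patient_id": 3, "status": "attended"},
    {"patient_id": 3, "status": "DNA"},
]
patients = [
    {"patient_id": 1, "sex": "F", "age": 34},
    {"patient_id": 2, "sex": "M", "age": 61},
    {"patient_id": 2, "sex": "M", "age": 61},    # duplicate patient row
    {"patient_id": 3, "sex": "F", "age": None},  # missing data
    {"patient_id": 4, "sex": "M", "age": 45},    # no appointment data
]

def categorise(missed):
    """Hypothetical cut-offs; the slide does not state the real ones."""
    if missed == 0:
        return "zero"
    if missed == 1:
        return "low"
    if missed == 2:
        return "medium"
    return "high"

def appointment_history(rows):
    """Count attended and missed (DNA) appointments for each patient."""
    history = {}
    for row in rows:
        rec = history.setdefault(row["patient_id"], {"attended": 0, "DNA": 0})
        rec[row["status"]] += 1
    for rec in history.values():
        rec["total"] = rec["attended"] + rec["DNA"]
        rec["pct_missed"] = 100 * rec["DNA"] / rec["total"]
        rec["category"] = categorise(rec["DNA"])
    return history

def build_dataset(patient_rows, history):
    """Merge on patient ID and log every exclusion instead of dropping silently."""
    log = {"duplicates": 0, "no_appointments": 0, "missing": 0}
    seen, merged = set(), []
    for p in patient_rows:
        pid = p["patient_id"]
        if pid in seen:
            log["duplicates"] += 1
            continue
        seen.add(pid)
        if pid not in history:
            log["no_appointments"] += 1
            continue
        merged.append({**p, **history[pid]})
    complete = [r for r in merged if all(v is not None for v in r.values())]
    log["missing"] = len(merged) - len(complete)
    return merged, complete, log

history = appointment_history(appointments)
merged, complete, log = build_dataset(patients, history)
print(log)
print([r["patient_id"] for r in complete])
```

Returning an exclusion log alongside the data mirrors the flowchart's habit of reporting how many records each step removed, which is what makes the explicit and implicit exclusions auditable later.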

Example 2. Problems within Social
Science

Shaw, Ellis, Kendrick et al., (2016). Cyberpsychology, Behavior and
Social Networking

Thank you!
j.towers1@Lancaster.ac.uk (@towersjude)
d.a.ellis@Lancaster.ac.uk (@davidaellis)
rdm@lancaster.ac.uk
