This document summarizes the statistical challenges in estimating the health effects of air pollution using areal unit study designs. It discusses how air pollution and health data are spatially misaligned at different geographical scales. It also describes potential sources of uncertainty and bias, like measurement error in pollution estimates and ecological bias. As an example, it presents results from a study that used a log-linear Poisson model to examine the relationship between PM10 and respiratory hospitalizations in Italy, finding a 3.5% increased risk per 1 µg/m³ increase in PM10. Overall, the document critiques statistical approaches for analyzing spatially-referenced air pollution and health data.
Statistical models reveal air pollution's health impacts
1. Statistical models for environmental studies: air pollution and its adverse health effects
LUIGI IPPOLITI
Professor of Statistics
Department of Economics,
University G.d’Annunzio, CHIETI-PESCARA
30.11.2021
2. Many epidemiological analyses have shown that air pollution is a cross-border problem with direct negative
effects on human health and the environment
According to the World Health Organisation (WHO), ambient air pollution:
o is the biggest environmental risk to human health globally (it is considered a major factor for
premature death);
o has indirect but tangible adverse effects on economies and societies more generally
With the aim of securing good air quality status for its citizens and the environment, the EU has established
a policy framework that employs legal regulation as the main policy instrument
Currently, the main EU strategic document with a specific focus on air quality is the 2013 clean air
programme for Europe. Its main objective is to ensure that by 2030, the number of premature deaths
caused by exposure to ground level ozone and fine particulate matter (PM2.5) is reduced by half as
compared to 2005 levels
Very recently, the European Green Deal provided for the adoption of a zero pollution action plan, expected
to include air quality improvement across the EU among its key objectives
Introduction
STATISTICAL MODELS FOR ENVIRONMENTAL STUDIES: AIR POLLUTION AND ITS ADVERSE HEALTH EFFECTS | LUIGI IPPOLITI
3. Because many regulatory policies have been implemented, air quality has generally improved over recent
decades. However, as regulatory actions become more costly, policy makers, industry, and the public are
questioning whether lower levels of air pollution have actually yielded demonstrable improvements in public
health
Debates surrounding the control of air quality have thus increasingly emphasized the need for evidence of
the effectiveness of specific regulatory policies
Yet significant gaps in knowledge remain, particularly with regard to:
o the health effects of long-term exposure to lower levels of air pollution
o the adverse consequences of climate change (more frequent and intense heat waves, wildfires and
droughts are expected, and increased exposure to many environmental stressors will adversely affect
human health)
4. It is important to recognize that several types of epidemiological studies have been introduced for
estimating health effects of air pollution. Most of the air pollution epidemiological studies can be broadly
classified according to the following designs:
o cohort studies: use individual-level data and quantify the health effects resulting from long-term
exposure to pollution
o time series studies: associate time-varying pollution exposures with time-varying event counts.
These are a type of ecologic study because they analyze daily population-averaged health outcomes
and exposure levels
o areal unit studies: are the spatial analogue of time series studies, and estimate the effects of air
pollution based on spatial contrasts in disease risk and pollution concentrations across a set of
contiguous areal units. Like time series studies they use population-level rather than individual-level
disease data, and cannot be used to quantify individual level cause and effect.
The choice of an optimal design depends upon the research question and the availability of data. No single
design is best for all applications. Each design targets specific types of effects, outcomes, and exposure
sources. An optimal design should have sufficient power to detect the effect of exposure; this depends on
the variability of exposure and the size of the study.
5. Here we provide a critique of the statistical and epidemiological challenges faced by researchers
conducting areal/spatial unit studies
We are primarily interested in estimating and understanding the association between exposure
to an environmental agent (e.g. PM10) and an outcome (e.g. hospitalization or mortality). One question
of scientific interest might be
"Are changes in the PM10 series associated with changes in the mortality series?"
We thus focus on the features of spatial data that allow us to build good statistical models and to ultimately
estimate and explain the health effects of environmental exposures accounting for all the sources of
uncertainty
These are ecological studies, in which we analyze population-averaged health outcomes and
exposure levels
Generalized linear models (GLM) are widely used to estimate the effects associated with exposure to air
pollution while accounting for smooth fluctuations in the mortality that confound estimates of the pollution
effects
6. Suppose that a study region A is a large geographical region such as a State or country, partitioned into n
areal units A = (A1; … ; An) such as local authorities or census tracts. The areal units are typically defined
by administrative boundaries, and the populations living in each one will be of different sizes and
demographic structures
The differences in the population sizes and demographics between areal units are accounted for by
computing the expected number of disease cases based on national disease rates, which are denoted
here by e = {e(A1); … ; e(An)}. For this calculation the population in each areal unit is split into a total of R
strata based on age, sex and possibly ethnicity, so let Nik denote the number of people from areal unit
Ai in stratum k. If rk denotes the stratum-specific disease risk for the entire population, then
$$e(A_i) = \sum_{k=1}^{R} N_{ik} r_k$$
In areal unit studies of air pollution and health, we typically model the outcome as a spatial series of counts
representing the number of times a particular event has occurred in a given areal unit Ai. For example,
each observation of the outcome, Y(Ai), could be a count of the deaths or hospitalizations
for heart failure that occurred in unit Ai
Then we can define the Standardised Morbidity/Mortality Ratio (SMR) as the ratio qi = Y(Ai) / e(Ai),
which is a simple estimate of disease risk in areal unit Ai. An SMR value of one represents an average risk,
while an SMR value of 1.3 means an area has a 30% increased risk of disease.
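The expected-count and SMR definitions above can be sketched in a few lines of Python. This is only an illustration: the stratum populations, national rates and observed counts below are made-up numbers, and the function names `expected_counts` and `smr` are hypothetical helpers, not part of the study.

```python
import numpy as np

def expected_counts(N, r):
    """Expected cases e(A_i) = sum_k N_ik * r_k for every areal unit.

    N : (n, R) array, people in areal unit i and stratum k
    r : (R,) array, stratum-specific disease risk for the whole population
    """
    N = np.asarray(N, dtype=float)
    r = np.asarray(r, dtype=float)
    return N @ r

def smr(y, e):
    """Standardised morbidity/mortality ratio q_i = Y(A_i) / e(A_i)."""
    return np.asarray(y, dtype=float) / np.asarray(e, dtype=float)

# Hypothetical data: 3 areal units, R = 2 strata
N = np.array([[1000, 500],
              [2000, 1000],
              [500, 1500]])
r = np.array([0.01, 0.04])   # assumed national stratum-specific risks
y = np.array([39, 60, 52])   # assumed observed case counts

e = expected_counts(N, r)    # expected cases per areal unit
q = smr(y, e)                # SMR per areal unit
print(e, q)
```

With these illustrative numbers the first unit has an SMR of 1.3 (a 30% excess over the expected count), the second sits at the average risk of 1, and the third is below average.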
Models for Air Pollution and Health: outcomes and SMRs
7. Models for Air Pollution and Health: outcomes and SMRs
Example of maps of the standardised morbidity ratio (SMR) for hospital admissions in Piemonte and
Lombardia due to cardiovascular (left) and respiratory (right) diseases in 2011. The labels within each
district are the acronyms of the 28 districts (ASL).
8. A vector of representative pollution concentrations for the n areal units is denoted by x = (x1; … ; xn),
and is typically measured in micrograms per cubic metre (µg m⁻³).
For simplicity of exposition, we work with one particular pollutant, which can also be taken as a continuous
index of air quality, but the methodology can be easily extended to include multiple pollutants in which case
each xi will be a vector of pollution concentrations.
The most harmful pollutants to human health in Europe are particulate matter (PM), nitrogen dioxide (NO₂)
and ground-level ozone (O₃)
Pollution concentration data can be available from:
o a pollution monitoring network: concentrations can be obtained with little error, but they usually do
not have good spatial coverage and in particular some of the n areal units may not have any air pollution
monitoring site at all;
o air pollution computer dispersion models: estimate pollution concentrations on a regular grid, and
give complete spatial coverage of the study region without any missing observation. However,
modelled concentrations are known to contain errors and biases, and are less accurate than the
monitored values.
Models for Air Pollution and Health: air pollution
9. Models for Air Pollution and Health: air pollution
Map of the 28 districts and the complete monitoring network (dots). The thicker border in the middle
separates Piemonte from Lombardia.
Daily measurements of nitrogen dioxide (NO2) for July 7, 2016 on an 82 × 128 regular grid, with a spatial
resolution of 5 × 5 km². The data came from the NINFA2015 (Northern Italy Network to Forecast
Photochemical and Aerosol pollution) model
10. At the most basic level, we are trying to model the relationship between outcome Y and exposure X
in the presence of potential confounding factors Z such as meteorological and socio-economic
deprivation variables (e.g., income, unemployment, house price and other deprivation indices)
Models for Air Pollution and Health: confounding factors
11. Let Yi = Y(Ai) and ei = e(Ai). With spatial series of counts, the most commonly used model is the log-linear
Poisson model which takes the general form
$$Y_i \sim \mathrm{Poisson}(e_i \theta_i), \quad i = 1, \dots, n$$
$$\ln \theta_i = \beta_0 + x_i \beta_x + \mathbf{z}_i' \boldsymbol{\beta}_z + \delta_i$$
This model takes the outcome Y(Ai) to be Poisson with mean m(Ai) = $e_i \theta_i$, and the log risk is modelled as a
linear combination of an overall intercept term $\beta_0$, air pollution concentrations ($x_i \beta_x$), measured confounding
factors ($\mathbf{z}_i' \boldsymbol{\beta}_z$) and a random effect $\delta_i$ accounting for: 1) the spatial autocorrelation remaining in the data after
the covariate effects have been removed, 2) any overdispersion resulting from the restrictive Poisson
assumption and 3) unmeasured confounding, which occurs when an important spatially correlated
covariate is either unmeasured or unknown
The regression parameter $\beta_x$ quantifies the relationship between air pollution and disease risk on
the log scale, and is transformed to a relative risk for the purposes of interpretation. Hence,
$100 \times (e^{\beta_x} - 1)$ measures the percent increase in disease risk per unit increase in the pollutant, i.e. a relative risk of
1.05 means a 5% increase in disease risk when the pollution level increases by 1 µg m⁻³.
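To make the role of the offset e_i and the transformation to a relative risk concrete, here is a minimal sketch that fits a simplified version of the model by maximum likelihood. It is not the Bayesian hierarchical fit discussed in these slides: the random effect δ_i and the confounders are omitted, the data are simulated, and `fit_poisson_loglinear` is a hypothetical helper written for this illustration.

```python
import numpy as np

def fit_poisson_loglinear(y, e, X, n_iter=50, tol=1e-10):
    """Newton-Raphson fit of Y_i ~ Poisson(e_i * exp(X_i beta)).

    X must contain a column of ones for the intercept.
    Simplification: no random effect delta_i is included.
    """
    y = np.asarray(y, dtype=float)
    e = np.asarray(e, dtype=float)
    X = np.asarray(X, dtype=float)
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.sum() / e.sum())      # start at the overall log SMR
    for _ in range(n_iter):
        mu = e * np.exp(X @ beta)            # Poisson means e_i * theta_i
        score = X.T @ (y - mu)               # gradient of the log-likelihood
        info = X.T @ (X * mu[:, None])       # Fisher information matrix
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated (hypothetical) data: n areal units, one pollutant
rng = np.random.default_rng(0)
n = 500
x = rng.normal(20.0, 5.0, size=n)            # assumed pollution concentrations
e = rng.uniform(50.0, 200.0, size=n)         # assumed expected counts
beta_true = np.array([-0.6, 0.03])           # assumed true (beta_0, beta_x)
X = np.column_stack([np.ones(n), x])
y = rng.poisson(e * np.exp(X @ beta_true))

beta_hat = fit_poisson_loglinear(y, e, X)
relative_risk = np.exp(beta_hat[1])               # RR per unit increase
percent_increase = 100.0 * (relative_risk - 1.0)  # 100 * (exp(beta_x) - 1)
print(beta_hat, relative_risk, percent_increase)
```

The expected counts enter as a multiplicative offset, so the fitted coefficients describe risk relative to the national rates rather than raw counts.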
Models for Air Pollution and Health: the log-linear Poisson model
12. The log-linear Poisson model:
$$Y_i \sim \mathrm{Poisson}(e_i \theta_i), \quad i = 1, \dots, n$$
$$\ln \theta_i = \beta_0 + x_i \beta_x + \mathbf{z}_i' \boldsymbol{\beta}_z + \delta_i$$
A Bayesian approach is the most popular inferential framework in these studies, because the models used
are typically hierarchical in nature and include spatial autocorrelation and different levels of variation
Furthermore, various factors for which no measurements are available can also confound the relationship
between pollution and health, so we must also gauge the sensitivity of estimates of $\beta_x$ to potential
unmeasured confounding. Unfortunately, without direct measurements, we cannot include such potential
confounders in the model. However, we can assume that these other factors vary smoothly
in space.
We may not be able to specify exactly how they vary, but we might assume that they are not too
different from site to site. One common approach is to model the vector of random effects with a
Conditional Autoregressive (CAR) prior
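The slides do not specify which CAR prior is used, so the following is only a sketch of one common choice, assumed here for illustration: a proper CAR prior in which the random effects are jointly Gaussian with precision matrix τ(D − ρW), where W is a binary neighbourhood matrix and D holds the number of neighbours of each areal unit. The adjacency structure, τ and ρ below are hypothetical.

```python
import numpy as np

def car_precision(W, tau=1.0, rho=0.9):
    """Precision matrix Q = tau * (D - rho * W) of a proper CAR prior.

    W   : (n, n) symmetric binary adjacency matrix (1 if units share a border)
    tau : precision parameter (assumed value)
    rho : spatial dependence parameter, |rho| < 1 (assumed value)
    """
    W = np.asarray(W, dtype=float)
    if not np.allclose(W, W.T):
        raise ValueError("adjacency matrix must be symmetric")
    D = np.diag(W.sum(axis=1))               # number of neighbours per unit
    return tau * (D - rho * W)

# Hypothetical map: 4 areal units in a row, 1-2-3-4
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

Q = car_precision(W, tau=2.0, rho=0.9)
cov = np.linalg.inv(Q)
cov = (cov + cov.T) / 2.0                    # remove round-off asymmetry

# One draw of the random effects delta ~ N(0, Q^{-1})
rng = np.random.default_rng(1)
delta = rng.multivariate_normal(np.zeros(4), cov)
print(Q)
print(delta)
```

Under this specification the conditional mean of each random effect is ρ times the average of its neighbours' values, which is how the "not too different from site to site" assumption is encoded.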
Models for Air Pollution and Health: confounding factors
13. The log-linear Poisson model:
$$Y_i \sim \mathrm{Poisson}(e_i \theta_i), \quad i = 1, \dots, n$$
$$\ln \theta_i = \beta_0 + x_i \beta_x + \mathbf{z}_i' \boldsymbol{\beta}_z + \delta_i$$
Estimating representative pollution concentrations: the disease and pollution data are spatially misaligned,
because the geographical scales at which the data are measured are different:
o The disease data are available as a summary measure for each areal unit Ai; these units are typically
defined by administrative boundaries and are of irregular shapes and sizes
o In contrast, monitored and modelled pollution data are available at point and grid locations within the
study region A: monitoring sites are typically irregularly spaced, while modelled values lie on a regular grid
This spatial misalignment has been termed the change of support problem: the pollution concentration
needed for areal unit Ai must be obtained from data measured at point or grid locations
Models for Air Pollution and Health: spatial misalignment
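The formula shown on the original slide is not reproduced in this text. A common way to move from grid support to areal support, assumed here purely for illustration, is block averaging: the concentration for areal unit Ai is taken as the mean of the gridded values whose cells fall inside Ai. The grid values, the cell-to-unit assignment and the helper `areal_average` are all hypothetical.

```python
import numpy as np

def areal_average(grid_values, grid_area_id, n_areas):
    """Average gridded concentrations over the cells assigned to each areal unit.

    grid_values  : array of modelled concentrations, one per grid cell
    grid_area_id : integer array of the same shape, index of the areal unit
                   containing each cell (-1 for cells outside the study region)
    n_areas      : number of areal units n
    Returns an (n_areas,) array; units without any grid cell get NaN.
    """
    values = np.asarray(grid_values, dtype=float).ravel()
    ids = np.asarray(grid_area_id).ravel()
    inside = ids >= 0
    sums = np.bincount(ids[inside], weights=values[inside], minlength=n_areas)
    counts = np.bincount(ids[inside], minlength=n_areas)
    out = np.full(n_areas, np.nan)
    has_cells = counts > 0
    out[has_cells] = sums[has_cells] / counts[has_cells]
    return out

# Hypothetical 2 x 3 grid of concentrations and cell-to-unit assignment
grid = np.array([[10.0, 12.0, 30.0],
                 [14.0, 16.0, 34.0]])
area_id = np.array([[0, 0, 1],
                    [0, 0, 1]])

x = areal_average(grid, area_id, n_areas=3)
print(x)   # unit 2 contains no grid cell, so its value is NaN
```

In practice the cell-to-unit assignment would come from overlaying the grid on the district boundaries, and cells could be weighted by population or by the fraction of their area inside each unit.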
14. The change of support problem implies that the desired pollution concentration for areal unit Ai is
The Poisson log-linear model treats the vector of estimated pollution concentrations x = (x1; … ; xn), as
known constants, so that xi is the known and constant pollution concentration for areal unit Ai
This assumption ignores two different sources of uncertainty in x when estimating its health effects.
1) The first source is related to measurement errors which occur because the true constant pollution
exposure in each areal unit is unknown and its estimated value is subject to error and uncertainty.
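This leads to the following measurement model:
$$z(\mathbf{s}_j) = x(\mathbf{s}_j) + \varepsilon(\mathbf{s}_j), \quad \varepsilon(\mathbf{s}_j) \sim N(0, \sigma^2)$$
$$x(\mathbf{s}_j) = \mathbf{v}(\mathbf{s}_j)' \mathbf{a} + f(\mathbf{s}_j)$$
where $\mathbf{s}_j$ represents sites of a prediction grid over A, $\mathbf{v}(\mathbf{s}_j)$ contains regressors and/or trend components
and $f(\mathbf{s}_j)$ is a correlated geostatistical spatial process with a valid spatial correlation function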
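A small simulation can help to read the two equations of the measurement model. The sketch below generates a latent concentration surface x(s_j) as a linear trend plus a Gaussian spatial process, and then adds independent measurement noise to obtain z(s_j). The exponential correlation function, the choice of regressors and all parameter values are assumptions made for the example; the slides do not state which correlation function is used.

```python
import numpy as np

def exponential_correlation(coords, phi):
    """Exponential correlation exp(-d / phi) between all pairs of sites."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return np.exp(-d / phi)

rng = np.random.default_rng(42)
m = 50                                           # number of sites s_j
coords = rng.uniform(0.0, 100.0, size=(m, 2))    # hypothetical coordinates (km)

# Regressors v(s_j): intercept and easting (assumed trend components)
V = np.column_stack([np.ones(m), coords[:, 0]])
a = np.array([20.0, 0.05])                       # assumed trend coefficients

sigma2_f = 9.0     # assumed variance of the spatial process f
phi = 25.0         # assumed correlation range parameter (km)
sigma2 = 1.0       # assumed measurement error variance

# Spatial process f ~ N(0, sigma2_f * R), drawn via a Cholesky factor
C = sigma2_f * exponential_correlation(coords, phi)
L = np.linalg.cholesky(C + 1e-10 * np.eye(m))    # small jitter for stability
f = L @ rng.standard_normal(m)

x_true = V @ a + f                                       # x(s_j) = v(s_j)'a + f(s_j)
z_obs = x_true + rng.normal(0.0, np.sqrt(sigma2), m)     # z(s_j) = x(s_j) + eps(s_j)
print(z_obs[:5])
```

Fitting the model reverses this process: given the noisy values z, one estimates the trend, the spatial process and the error variance, and carries the resulting uncertainty about x into the health model.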
Models for Air Pollution and Health: spatial misalignment
15. Models for Air Pollution and Health: spatial misalignment
The posterior predictive mean average PM10
concentration obtained from the pollution model
shows that the highest concentrations are
observed in the main cities, with Lombardia more
polluted than Piemonte
Possible prediction grids for the pollutant
16. 2) The second source of variation is that the pollution concentration is not constant across each areal unit,
meaning that there is within-area variability in exposure.
This within-area variability in exposure means that the population level risk model has a different algebraic
form compared to what one would obtain by aggregating an individual level risk model to the population
scale.
The difference between the estimated individual and population level relationships is known as ecological
bias, and has been the subject of extensive study
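The change in algebraic form can be seen numerically. Suppose, as an assumption for this sketch, that each individual j in areal unit Ai has risk exp(β0 + βx·x_ij) and that exposures vary within the area. Averaging the individual risks is then not the same as plugging the area-mean exposure into the same formula. The parameter values and the normal within-area exposure distribution below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

beta0, beta_x = -3.0, 0.05      # assumed individual-level parameters
mu_i, sd_i = 25.0, 8.0          # assumed within-area exposure mean and spread

# Hypothetical individual exposures within one areal unit
x_ij = rng.normal(mu_i, sd_i, size=200_000)

# Population-level risk obtained by aggregating the individual-level model
aggregated_risk = np.mean(np.exp(beta0 + beta_x * x_ij))

# Naive ecological model: same functional form applied to the mean exposure
naive_risk = np.exp(beta0 + beta_x * x_ij.mean())

# For normally distributed exposure the aggregated risk has a closed form,
# exp(beta0 + beta_x * mu + 0.5 * beta_x**2 * sd**2), with an extra variance term
normal_theory = np.exp(beta0 + beta_x * mu_i + 0.5 * beta_x**2 * sd_i**2)

print(aggregated_risk, naive_risk, normal_theory)
```

The aggregated risk exceeds the naive one because the exponential function is convex. The gap grows with the within-area variance and with the size of βx, which is why ignoring within-area variability can distort the estimated effect.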
Models for Air Pollution and Health: ecological bias
17. We briefly discuss the fit of the log-linear Poisson model by presenting a study examining the impact of
long-term exposure to PM10 on respiratory hospitalisation risk in Piemonte and Lombardia, which
are partitioned into 28 districts (ASL).
The monitoring network for PM10 (considered as annual average) consists of 96 sites
Measured confounders are given by temperature and humidity measurements (socio-economic deprivation
variables on disease risk have not been used here)
The disease data are counts of the numbers of hospital admissions due to respiratory disease in 2011, and
the spatial pattern in the standardised morbidity ratio (SMR) is displayed in the previous slides
The estimated relative risk and 95% credible interval corresponding to a 1 µg m⁻³ increase in PM10
concentrations are 1.035 (1.013, 1.059), suggesting that areal units with higher concentrations of PM10
exhibit higher risks of respiratory disease, with an increase of about 3.5%
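The reported relative risk can be moved back and forth between the log scale and the percent scale with a few lines of arithmetic. The sketch below uses only the three numbers quoted above. The rescaling to a 10-unit increase is an added illustration, not a result reported in the study, and the helper names are hypothetical.

```python
import math

# Reported relative risk and 95% credible interval per 1 unit increase in PM10
rr, rr_low, rr_high = 1.035, 1.013, 1.059

# Back to the log scale: beta_x = ln(RR)
beta_x = math.log(rr)
beta_low, beta_high = math.log(rr_low), math.log(rr_high)

def percent_increase(relative_risk):
    """Percent increase in risk, 100 * (RR - 1)."""
    return 100.0 * (relative_risk - 1.0)

def rescale_rr(relative_risk, delta):
    """Relative risk for an increase of `delta` units: exp(delta * ln(RR))."""
    return relative_risk ** delta

# Illustration only: what the log-linear model implies for a 10-unit increase
rr_10 = rescale_rr(rr, 10)
interval_10 = (rescale_rr(rr_low, 10), rescale_rr(rr_high, 10))

print(round(beta_x, 4), round(percent_increase(rr), 1))
print(round(rr_10, 3), tuple(round(v, 3) for v in interval_10))
```

Because the rescaling is a monotone transformation of βx, applying it to the interval endpoints gives the corresponding interval for the larger increment, under the assumption that the log-linear relationship holds over that range.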
Models for Air Pollution and Health: fitting the log-linear Poisson model
18. o This work has critiqued the statistical challenges involved in estimating the long-term health impact of air
pollution using an ecological areal unit study design
o A multivariate specification of the model is possible but requires the use of more complex spatial correlation
structures (e.g. the Linear Model of Coregionalization)
o Time series studies pose similar problems (i.e. understanding the relationship between day-to-day changes
in air pollution levels and day-to-day changes in mortality counts). The statistical analysis can be made richer
with the use of Distributed Lag Models and Impulse Response Functions
o Spatiotemporal studies are likely to throw up a number of additional modeling challenges
Models for Air Pollution and Health: discussion
19. Bruno F., Cameletti M., Franco-Villoria M., Greco F., Ignaccolo R., Ippoliti L., Valentini P., Ventrucci M. (2016) A survey on
ecological regression for health hazard associated with air pollution. SPATIAL STATISTICS, vol. 18, p. 276-299, ISSN:
2211-6753, doi: 10.1016/j.spasta.2016.05.003
M. Blangiardo, M. Pirani, L. Kanapka, A. Hansell, and G. Fuller (2019). A hierarchical modelling approach to assess
multi-pollutant effects in time-series studies. Plos One, 14:1–16.
N. Cressie, Statistics for Spatial Data (1993), revised ed., Wiley, New York.
Banerjee, S., Carlin, B., Gelfand, A. (2004). Hierarchical Modeling and Analysis for Spatial Data. In: Monographs on
Statistics and Applied Probability, Chapman and Hall, New York.
Besag, J., York, J., Mollié, A. (1991). Bayesian image restoration, with two applications in spatial statistics. Ann. Inst.
Statist. Math. 43 (1), 1–20.
F. Dominici, M. Daniels, S.L. Zeger, and J.M. Samet (2002). Air pollution and mortality: estimating regional and national
dose-response relationships. Journal of the American Statistical Association, 97:100–111.
F. Dominici, R.D. Peng, C.D. Barr, and M.L. Bell (2010). Protecting human health from air pollution: shifting from a
single-pollutant to a multipollutant approach. Epidemiology, 21:187–194.
European Environment Agency (2020). Air quality in Europe: 2020 report. EEA Report No 09/2020.
Models for Air Pollution and Health: some references
20. EPRS | European Parliamentary Research Service (2021). EU policy on air quality: Implementation of
selected EU legislation. European Implementation Assessment
Gelfand, A., Diggle, P., Fuentes, M., Guttorp, P. (Eds.), (2010). Handbook of Spatial Statistics. Chapman &
Hall.
Gelfand, A. (2012). Hierarchical modeling for spatial data problems. Spat. Stat. 1, 30–39.
Ippoliti L., Martin R.J., Romagnoli L. (2018) Efficient likelihood computations for some multivariate Gaussian
Markov random fields. JOURNAL OF MULTIVARIATE ANALYSIS, vol. 168, p. 185-200.
Ippoliti, L., Valentini, P., Gamerman, D. (2012). Space–time modelling of coupled spatio-temporal
environmental variables. J. Roy. Statist. Soc. Ser. C 61, 175–200.
Reich, B., Fuentes, M., Burke, J. (2008). Analysis of the effects of ultrafine particulate matter while
accounting for human exposure. Environmetrics 20 (2), 131–146.
H. Rue, L. Held (2005) Gaussian Markov Random Fields: Theory and Applications, Chapman and Hall/CRC,
Boca Raton, FL.
Models for Air Pollution and Health: references
21. Shaddick, G., Zidek, J. (2015). Spatio-Temporal Methods in Environmental Epidemiology. Chapman & Hall
Waller, L., Gotway, C. (2004). Applied Spatial Statistics for Public Health Data. Wiley.
Wakefield, J. (2007). Disease mapping and spatial regression with count data. Biostatistics 8 (2), 158–183
WHO (2015). Economic cost of the health impact of air pollution in Europe: Clean air, health and wealth.
Tech. rep., WHO Regional Office for Europe. URL:
http://www.euro.who.int/en/media-centre/events/events/2015/04/ehp-mid-termreview/publications/economic-cost-of-the-health-impact-of-air-pollution-in-europe
Directive 2008/50/EC of the European Parliament and of the Council of 21 May 2008 on ambient air quality and cleaner
air for Europe (the 2008 AAQ directive) and Directive 2004/107/EC of the European Parliament and of the Council of 15
December 2004 relating to arsenic, cadmium, mercury, nickel and polycyclic aromatic hydrocarbons in ambient air (the
2004 AAQ directive).
Directive (EU) 2016/2284 of the European Parliament and of the Council of 14 December 2016 on the reduction of
national emissions of certain atmospheric pollutants, amending Directive 2003/35/EC and repealing Directive
2001/81/EC.
Directive 2010/75/EU of the European Parliament and of the Council of 24 November 2010 on industrial emissions
(integrated pollution prevention and control directive, IPPCD).
Models for Air Pollution and Health: references