This document summarizes modeling methods for ground-level ozone concentrations in the contiguous United States. It describes four modeling methods tested: inverse distance weighting (IDW), ordinary kriging, generalized linear models (GLM), and geographically weighted regression (GWR). IDW and kriging account for spatial autocorrelation in the data. GLM and GWR use solar radiation and relative humidity as predictor variables. Kriging had the lowest error when validated against new data points, followed closely by the GLM combined with IDW, though all models have limitations due to the characteristics and amount of input data. The document emphasizes that statistical models are abstractions of reality and should adhere to principles like parsimony.
Modeling Ground Ozone Levels Across the Contiguous United States
1. Modeling Ground Ozone for the Contiguous United States
By
Michael Tuffly, Ph.D.
ERIA Consultants, LLC
GIS in the Rockies 2013
Cable Center
Denver, Colorado
10/9/2013
http://www.eriaconsultants.com
mtuffly@eriaconsultants.com
2. What is Ozone
Chemically
It is a molecule containing three oxygen atoms, also known as triatomic oxygen (O3).
Ozone is a powerful oxidizer (i.e., it readily accepts electrons from other substances).
Examples of Oxidation
Rust on metal objects
Fire
“Oxidation is an increase in the oxidation number or a real or
apparent loss of one or more electrons.” (Miller 1981).
Miller, G. T. 1981. Chemistry: A Basic Introduction, Second Edition.
Wadsworth Publishing Company, Belmont, California, USA.
3. Ozone’s Location
Ozone located in the lower stratosphere (20 to 50 km in elevation)
is beneficial to life on Earth.
In the lower stratosphere ozone molecules form a
protective layer that filters out much of the high-energy
solar ultraviolet radiation.
$3\,\mathrm{O}_2 \xrightarrow{\text{ultraviolet radiation}} 2\,\mathrm{O}_3$
4. Ground Ozone
Ozone at ground level can harm the health of plants and animals.
One way ground ozone is formed is via a reaction of NOx, VOCs, and
sunlight.
The primary sources of NOx are internal combustion engines (e.g., cars) and coal-fired power plants.
There are many sources of VOCs
Methane, CFCs, benzene, methylene chloride, etc.
VOCs have a high vapor pressure, which corresponds to a low boiling point
A low boiling point allows VOCs to escape to the atmosphere
5. Some Effects of Ground Ozone
In animals
Lung tissue damage can result from inhalation of ozone
In plants
Leaf surface damage (oxidation)
Disruption of stomatal cell function,
causing excessive water loss that emulates drought conditions
(Smith et al. 2008).
Smith, G. C., J. W. Coulston, and B. M. O'Connell. 2008. Ozone Bioindicators and Forest Health: A Guide to the
Evaluation, Analysis and Interpretation of the Ozone Injury Data in the Forest Inventory and Analysis Program.
United States Department of Agriculture, Forest Service General Technical Report 34
6. Other ways ozone can be formed
Lightning (natural) (small contributor)
Shorts in electrical equipment (anthropogenic)
Provides that unique smell (very small contributor)
Ozone is also used as a replacement for chlorine
(potentially high contributor, but really unknown)
In swimming pools
In sewage treatment plants
In domestic water supply as a disinfectant
7. Modeling Ozone
Source ozone data are from EPA CASTNET
ftp://ftp.epa.gov/castnet/data/
Data are from a single year 2010
In the summer months during the “Ozone Activity Envelope” (OAE)
June – August from 1:00 PM – 5:00 PM
Base data for ozone are recorded every hour
Only 73 ground ozone collection sites were used
This is part of a larger study over a ten-year time period. These 73 sites were the only
sites consistent from 2002 to 2011.
Five variables were extracted from these data for the OAE and averaged:
Ozone (ppb)
Wind Speed (m/s)
Relative Humidity (% * 100)
Solar Radiation (watts per m²)
Temperature (degrees C * 10)
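The OAE filtering and averaging step can be sketched as below. This is a minimal illustration, not the author's actual processing code: the record layout, the site names, and the decision to include the 5:00 PM hour are all assumptions.

```python
from datetime import datetime
from statistics import mean

# Hypothetical hourly records: (site id, timestamp, ozone in ppb).
records = [
    ("SITE_A", datetime(2010, 7, 1, 12), 40.0),  # before the 1:00 PM window
    ("SITE_A", datetime(2010, 7, 1, 13), 50.0),
    ("SITE_A", datetime(2010, 7, 1, 17), 60.0),
    ("SITE_A", datetime(2010, 9, 1, 14), 90.0),  # outside June-August
    ("SITE_B", datetime(2010, 6, 15, 15), 30.0),
]

def in_oae(ts):
    """True if the timestamp falls in June-August between 1:00 PM and 5:00 PM.

    Including the 5:00 PM reading itself is an assumption of this sketch.
    """
    return ts.month in (6, 7, 8) and 13 <= ts.hour <= 17

def oae_means(rows):
    """Average each site's values over the records inside the OAE."""
    by_site = {}
    for site, ts, value in rows:
        if in_oae(ts):
            by_site.setdefault(site, []).append(value)
    return {site: mean(values) for site, values in by_site.items()}

site_means = oae_means(records)
```

The same function could be applied to each of the five variables in turn to produce one averaged value per site.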
8. Modeling Methods
Four different modeling methods were investigated:
Ordinary Kriging
Generalized Linear Model (GLM)
Inverse Distance Weighting (IDW)
Geographically Weighted Regression (GWR)
Results for all four modeling methods were compared with a set of
sample data not used in model creation, via the Mean Squared Error
Predicted (MSEP) method.
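A minimal sketch of the MSEP calculation, assuming it is the average squared difference between held-out observations and model predictions. The numbers are hypothetical and are not results from this study.

```python
def msep(observed, predicted):
    """Mean squared error of prediction over held-out points."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return sum(squared) / len(squared)

# Hypothetical held-out ozone values (ppb) and model predictions.
held_out = [40.0, 50.0, 60.0]
model_predictions = [42.0, 47.0, 60.0]
score = msep(held_out, model_predictions)
```

A lower score means the model predicts the unseen points more closely, which is how the four methods can be ranked against each other.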
9. Autocorrelation
First, need to know if the data are autocorrelated
If the data are autocorrelated then we can use:
IDW
Kriging
Results from Moran's I (a test for autocorrelation) (Moran 1950):
Data have a strong positive autocorrelation
Data points that are close together have similar values
Index = 0.421; p-value = 0
If data were not autocorrelated
Our best estimate using IDW or Kriging would be the mean for the whole
study site.
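A rough sketch of how Moran's I can be computed. The deck does not state which spatial weights were used, so the inverse-distance weights between every pair of points are an assumption, and the coordinates and values are made up.

```python
import math

def morans_i(points, values):
    """Moran's I with inverse-distance weights between all pairs of points."""
    n = len(values)
    mean_value = sum(values) / n
    deviations = [v - mean_value for v in values]
    weight_sum = 0.0
    cross_product = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a point is not its own neighbor
            weight = 1.0 / math.dist(points[i], points[j])
            weight_sum += weight
            cross_product += weight * deviations[i] * deviations[j]
    sum_of_squares = sum(dev * dev for dev in deviations)
    return (n / weight_sum) * (cross_product / sum_of_squares)

# Hypothetical layout: two pairs of nearby points, far from each other.
locations = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
clustered = morans_i(locations, [10.0, 10.0, 20.0, 20.0])  # neighbors alike
dispersed = morans_i(locations, [10.0, 20.0, 10.0, 20.0])  # neighbors differ
```

With similar values sitting next to each other the index comes out positive, and with unlike neighbors it comes out negative, which matches the interpretation given above.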
Moran, P. A. P. 1950. Notes on continuous stochastic phenomena. Biometrika 37: 17-23.
10. IDW
Called a deterministic function
Using the same input parameters will get the same results.
Data need to be spatially autocorrelated
Three Basic parameters are required
Number of nearest neighbors
Power
Study area boundary
Useful for Continuous data (e.g. rainfall, elevation)
Not useful for: Categorical, Binary, Ordinal
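The IDW estimate can be sketched as follows, assuming weights of 1 / distance^power over the nearest neighbors. The study area boundary is not modeled here, and the sample points are hypothetical.

```python
import math

def idw(target, points, values, n_neighbors, power):
    """Inverse distance weighted estimate at target from the nearest samples."""
    nearest = sorted(
        (math.dist(target, p), v) for p, v in zip(points, values)
    )[:n_neighbors]
    if nearest[0][0] == 0.0:
        # The target sits on a sample point, so return the observed value.
        return nearest[0][1]
    weights = [1.0 / d ** power for d, _ in nearest]
    weighted_sum = sum(w * v for w, (_, v) in zip(weights, nearest))
    return weighted_sum / sum(weights)

# Hypothetical samples at distances 1 and 2 from the target.
sample_points = [(1.0, 0.0), (0.0, 2.0)]
sample_values = [10.0, 20.0]
estimate = idw((0.0, 0.0), sample_points, sample_values, n_neighbors=2, power=2)
```

Because there is no random component, calling the function again with the same inputs returns the same estimate, which is what makes the method deterministic.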
11. Identifying IDW Parameters
Cross Validation
Remove one data point at one location
Calculate a new value for that point using the neighboring points
Repeat this for all points
Calculate the mean squared error and variance
Mean Squared Error Predicted (MSEP) gives:
The best number of nearest neighbors
The best power
A smaller number of nearest neighbors produces good local
estimates but poor global estimates.
A larger number of nearest neighbors produces good global
estimates but poor local estimates.
Need to balance between local and global estimates.
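The cross validation search described above can be sketched as a leave-one-out loop over candidate parameters. This is an illustrative outline only: the sample coordinates, values, and candidate grids are hypothetical, and the helper names are not from the original analysis.

```python
import math
from itertools import product

def idw(target, points, values, n_neighbors, power):
    """Inverse distance weighted estimate from the nearest samples."""
    nearest = sorted(
        (math.dist(target, p), v) for p, v in zip(points, values)
    )[:n_neighbors]
    if nearest[0][0] == 0.0:
        return nearest[0][1]
    weights = [1.0 / d ** power for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)

def loocv_msep(points, values, n_neighbors, power):
    """Leave each point out, re-estimate it, and average the squared errors."""
    squared_errors = []
    for i in range(len(points)):
        other_points = points[:i] + points[i + 1:]
        other_values = values[:i] + values[i + 1:]
        estimate = idw(points[i], other_points, other_values, n_neighbors, power)
        squared_errors.append((values[i] - estimate) ** 2)
    return sum(squared_errors) / len(squared_errors)

def best_parameters(points, values, neighbor_options, power_options):
    """Return the (n_neighbors, power) pair with the lowest MSEP."""
    return min(
        product(neighbor_options, power_options),
        key=lambda pair: loocv_msep(points, values, pair[0], pair[1]),
    )

# Hypothetical sample locations and ozone values (ppb).
xy = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (0, 1), (1, 1), (2, 1)]
ozone = [30.0, 32.0, 35.0, 39.0, 44.0, 31.0, 33.0, 36.0]
neighbor_grid = (2, 3, 4)
power_grid = (1, 2, 3)
best_neighbors, best_power = best_parameters(xy, ozone, neighbor_grid, power_grid)
```

Plotting the MSEP against the number of neighbors for each power is a common way to see the local versus global trade-off before settling on a pair.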
13. Distance is calculated using the Pythagorean Theorem
$a^2 + b^2 = c^2$
For the distance from A to x (c):
$1.58^2 + 1.58^2 = c^2$
$2.4964 + 2.4964 = 4.9928$
$c = 4.9928^{0.5} \approx 2.23$
[Figure: points A, B, C, and D, with a right triangle of sides a, b, and c]
16. Ordinary Kriging
(Krige 1951) (Matheron 1962)
A stochastic (non-deterministic) interpolation process
Estimates at an unobserved location are a weighted average of the values at observed locations
Weights are based upon
The distance separating points
The function for the variogram
A variogram is used to identify key Kriging parameters:
Sill, Range, Nugget, and covariance
Assumes an unknown stationary mean.
Stationary mean means that the mean over the area behaves predictably (i.e., it is constant across the study area).
Considered unbiased
Mean residuals sum to zero
Variance of the error is minimized
BLUE
Best Linear Unbiased Estimator (Isaaks and Srivastava 1989)
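The weighted-average idea behind ordinary kriging can be sketched as below. This is a bare-bones illustration rather than the R workflow used in the study. It uses the spherical variogram parameters reported in this deck, treats the reported sill as the total sill (an assumption), and uses hypothetical sample points. Coincident sample points would make the system singular and are not handled.

```python
import math

# Spherical variogram parameters reported in this deck. Treating "Sill" as
# the total sill (nugget included) is an assumption of this sketch.
NUGGET, SILL, RANGE = 7.7377, 47.48165, 1_100_000.0

def spherical(h):
    """Spherical semivariance at lag distance h (same units as RANGE)."""
    if h == 0.0:
        return 0.0
    if h >= RANGE:
        return SILL
    ratio = h / RANGE
    return NUGGET + (SILL - NUGGET) * (1.5 * ratio - 0.5 * ratio ** 3)

def solve(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def ordinary_kriging(target, points, values, variogram=spherical):
    """Return the kriging estimate at target and the sample weights."""
    n = len(points)
    # Semivariances between samples, bordered by ones so that the weights
    # are forced to sum to one (the unbiasedness condition).
    matrix = [
        [variogram(math.dist(p, q)) for q in points] + [1.0] for p in points
    ]
    matrix.append([1.0] * n + [0.0])
    rhs = [variogram(math.dist(target, p)) for p in points] + [1.0]
    weights = solve(matrix, rhs)[:n]  # last entry is the Lagrange multiplier
    estimate = sum(w * v for w, v in zip(weights, values))
    return estimate, weights

# Hypothetical samples (coordinates in meters, ozone in ppb).
sites = [(0.0, 0.0), (200_000.0, 0.0)]
ozone = [40.0, 50.0]
estimate, weights = ordinary_kriging((100_000.0, 0.0), sites, ozone)
```

In this symmetric example the two samples are equally far from the target, so each receives half the weight and the estimate is their average.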
Isaaks, E. H., and R. Srivastava. 1989. An Introduction to Applied Geostatistics. Oxford, UK:
Oxford University Press.
Krige, D. G. 1951. A statistical approach to some basic mine valuation problems on the
Witwatersrand. Journal of the Chemical, Metallurgical and Mining Society of South Africa 52 (6): 119-139.
Matheron, G. 1962. Traité de géostatistique appliquée. Editions Technip.
17. R output from Variogram
Least Squares Estimates
Model         Nugget     Sill       Range      AICC
Spherical     7.7377     47.48165   1100000    125.5306
Gaussian      13.6845    52.25631   1100000    128.4038
Exponential   9.2776     71.61078   1100000    132.1289
Estimates:
Nugget = 15
Sill = 30
Range = 1,100,000
The Spherical and Gaussian models have AICC values less than 3 units
apart, so there is no meaningful difference between them.
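The three variogram shapes and the AICC comparison can be sketched as follows. The functional forms are common textbook parameterizations and may differ from the ones the R fitting routine used (for example in how the range is scaled). Treating the reported sill as the total sill is also an assumption. The fitted numbers are the ones reported on this slide.

```python
import math

def spherical(h, nugget, sill, rng):
    """Spherical model: rises to the sill and stays flat beyond the range."""
    if h == 0.0:
        return 0.0
    if h >= rng:
        return sill
    ratio = h / rng
    return nugget + (sill - nugget) * (1.5 * ratio - 0.5 * ratio ** 3)

def gaussian(h, nugget, sill, rng):
    """Gaussian model: smooth near the origin, approaches the sill."""
    if h == 0.0:
        return 0.0
    return nugget + (sill - nugget) * (1.0 - math.exp(-((h / rng) ** 2)))

def exponential(h, nugget, sill, rng):
    """Exponential model: steep near the origin, approaches the sill."""
    if h == 0.0:
        return 0.0
    return nugget + (sill - nugget) * (1.0 - math.exp(-h / rng))

# Least squares fits and AICC values reported on this slide.
fits = {
    "spherical": {"nugget": 7.7377, "sill": 47.48165, "aicc": 125.5306},
    "gaussian": {"nugget": 13.6845, "sill": 52.25631, "aicc": 128.4038},
    "exponential": {"nugget": 9.2776, "sill": 71.61078, "aicc": 132.1289},
}
RANGE = 1_100_000.0

best = min(fits, key=lambda name: fits[name]["aicc"])
delta_aicc = {name: fit["aicc"] - fits[best]["aicc"] for name, fit in fits.items()}
# Models within 3 AICC units of the best are treated as indistinguishable.
indistinguishable = sorted(name for name, d in delta_aicc.items() if d < 3.0)
```

Under the 3-unit rule of thumb used on the slide, the spherical and Gaussian fits land in the same group while the exponential fit does not.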
19. Number of Nearest Neighbors
[Figure: Kriging Cross Validation, Gaussian Model. x-axis: No. of Neighbors (5 to 30); y-axis: var(crossidw$resid) (35 to 41)]
21. Generalized Linear Models (GLM)
Similar to linear regression
Different than IDW and Kriging
Needs predictor input variables
Solar radiation (SR) and relative humidity (RH) proved to be significant predictor
variables.
Need to create the solar radiation and relative humidity surfaces via IDW
as input into the GLM equation.
The GLM equation is:
Ozone = 45.35 + (SR * 0.0332) + (RH * -0.235)
R² = 0.58
The GLM describes the “Large Scale Variability”
The “Small Scale Variability” is computed by calculating the differences
between the observed values and the (GLM) predicted values.
Adding the “Large Scale Variability” to the “Small Scale Variability” can
produce a good predictive surface.
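The two-part estimate can be sketched as below, using the coefficients of the GLM equation on this slide. The input values, their units and scaling, and the interpolated residual are hypothetical placeholders. In the study the residual surface would come from an interpolator such as IDW.

```python
def glm_ozone(solar_radiation, relative_humidity):
    """Large scale ozone estimate from the GLM equation on this slide."""
    return 45.35 + 0.0332 * solar_radiation - 0.235 * relative_humidity

def combined_estimate(solar_radiation, relative_humidity, interpolated_residual):
    """Large scale GLM value plus the small scale (residual) value."""
    return glm_ozone(solar_radiation, relative_humidity) + interpolated_residual

# Hypothetical monitoring site: predictors and observed ozone (ppb).
site_sr, site_rh, site_observed = 600.0, 50.0, 55.0
# Small scale variability at the site: observed minus GLM prediction.
site_residual = site_observed - glm_ozone(site_sr, site_rh)

# Hypothetical grid cell: predictor surfaces plus an interpolated residual.
cell_estimate = combined_estimate(620.0, 48.0, 1.2)
```

Interpolating the site residuals to every cell and adding them to the GLM surface gives the "GLM + IDW" combination listed in the results.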
23. Geographically Weighted Regression (GWR)
A powerful modeling method that includes:
Linear Regression
Space
In a nutshell
GWR creates a series of local linear equations based upon the spatial parameters of the independent
variables:
Kernel Function
Fixed Search Radius
Variable (number of neighbors) (AKA Adaptive)
Bandwidth Method (fixed radius)
Cells located within the search radius will have the same coefficients.
Best if sample points are located in a systematic method (e.g., on a grid with fixed distances).
Bandwidth Method (Adaptive or variable search radius)
One that uses the number of nearest neighbors from user input
One that uses a cross validation method which attempts to minimize the collinearity
Best if sample points are randomly located in the study area.
A sample point will be used multiple times to construct multiple linear equations
Each cell may contain different regression coefficients
Each linear equation (fixed radius or adaptive) uses the same global predictor variables as GLM
Solar Radiation and Relative Humidity proved to be the best global independent variables.
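One local GWR equation can be sketched as a weighted least squares fit centered on a target location. This is an outline under stated assumptions, not the software used in the study: the Gaussian kernel, the adaptive bandwidth taken as the distance to the k-th nearest sample, and all sample values are hypothetical. The response is generated from the GLM equation in this deck so that the local fit has a known answer.

```python
import math

def solve(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def gwr_coefficients(target, points, predictors, response, n_neighbors):
    """Fit one local regression at target; returns [intercept, slopes...]."""
    distances = [math.dist(target, p) for p in points]
    # Adaptive bandwidth: distance to the k-th nearest sample point.
    bandwidth = sorted(distances)[n_neighbors - 1]
    if bandwidth == 0.0:
        raise ValueError("bandwidth is zero; use more neighbors")
    # Gaussian kernel: nearby samples count more than distant ones.
    weights = [math.exp(-0.5 * (d / bandwidth) ** 2) for d in distances]
    rows = [[1.0] + list(x) for x in predictors]
    k = len(rows[0])
    # Weighted normal equations: (X'WX) beta = X'Wy.
    xtwx = [
        [sum(w * r[i] * r[j] for w, r in zip(weights, rows)) for j in range(k)]
        for i in range(k)
    ]
    xtwy = [
        sum(w * r[i] * y for w, r, y in zip(weights, rows, response))
        for i in range(k)
    ]
    return solve(xtwx, xtwy)

# Hypothetical samples: coordinates, (solar radiation, relative humidity).
xy = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0), (2, 1)]
sr_rh = [(500, 40), (600, 55), (550, 70), (700, 35), (650, 60), (480, 50)]
# Response built from the deck's GLM equation, so the true coefficients are known.
ozone = [45.35 + 0.0332 * sr - 0.235 * rh for sr, rh in sr_rh]

intercept, sr_slope, rh_slope = gwr_coefficients((1.0, 0.5), xy, sr_rh, ozone, 4)
```

Repeating the fit at every grid cell yields a surface of coefficients. With real data the local coefficients would vary from cell to cell instead of matching the global equation.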
27. Results
Test                          Residuals Autocorrelated   MSE     MSE New Points
GLM + IDW                     No                         0.54    196.06
GWR using AICC and 25 nn      No                         21.98   265.09
GWR using CV                  No                         38.43   241.2
IDW                           No                         0.6     204.45
Kriging                       No                         6.48    191.86
Data Issues
1) Should have more data points to create and test the models.
2) Data points should be more evenly distributed over the study area
(e.g., there are no points in Oregon, Idaho, etc., and few points in the
center of the nation).
3) IDW MSE values for the observed points should be zero, because IDW
reproduces the observed values at the sample locations. The nonzero
values are likely due to cell size and rounding errors.
4) The variables temperature and wind speed were tested in the
GWR model. Tests with these covariates used both the CV method and
the number of nearest neighbors method. Results were
very poor and are not shown here.
28. Take Home Message Final
Statistical models are an abstraction of reality.
No statistical model is perfect. (e.g. errors)
Some models are better than others (Crawley 2007).
The correct model can never be known with complete certainty (Crawley 2007).
The simpler the model the better it is (Crawley 2007).
Models should include the Principle of Parsimony (Occam’s Razor)
Use the fewest number of variables
The correct explanation is the simplest explanation
Make sure that the assumptions of the model are followed.
Are the data IID?
Are the data spatially autocorrelated?
Are the input variables correct?
Errors in measurement
Using temperature when solar radiation is a better independent variable.
How were the data collected?
Random Sample, Systematic, etc.
Is there bias in the sample data?
Always ask yourself: does this model make sense?
Is the model predicting something where it should not?
Example: a fish population on land.
Crawley, M. J. 2007. The R Book. Imperial College London at Silwood Park, UK.
29. Final Quote
“Son you're going to drive me to drinking… if you don’t stop
driving that hot rod Lincoln.”
1971.