CLIM Undergraduate Workshop: Introduction to Spatial Data Analysis with R - Maggie Johnson, Oct 23, 2017
1. An Introduction to Spatial Data Analysis
Maggie Johnson
Statistical and Applied Mathematical Sciences Institute
North Carolina State University
mjohnson@samsi.info
CLIM Undergraduate Workshop
M. Johnson (SAMSI) CLIM Undergrad Wksh October 23, 2017 1 / 27
2. Dependent Data
The First Law of Geography
“Everything is related to everything else, but near things are more related than
distant things.” – Waldo Tobler
Figure: Time series data (AirPassengers against time, 1950 to 1960)
Figure: Spatial data (ozone concentration, June 18, 1987)
3. Spatial Data
The term spatial data is often used to refer to data that are connected to
physical geographical locations.
Notation:
$D \subset \mathbb{R}^d$ represents the spatial domain, usually $d = 2$
$s \in D$ is a d-dimensional vector representing a “location” in space, e.g.
$s \equiv (\text{longitude}, \text{latitude})$
Three main types of spatial data
Point-referenced (geostatistical) data
Areal-referenced data
Point process data
4. Point-Referenced Data
Features:
Data are observations of a continuous spatial process
We only observe data at a fixed subset of locations
Goals:
Main goal is often prediction at unobserved locations
Examples:
Daily maximum temperature data collected at land surface monitoring
stations across the US
Ozone concentration measured at stations
5. Point-Referenced (Geostatistical) Data
Figure: Maps of GHCN station locations and July average maximum temperature
6. Focus for Today
Point referenced data (geostatistics)
Prediction at unobserved locations
7. Data
Average Maximum July Temperature and Elevation
Figure: Maps of average maximum July temperature and elevation
8. Correlation
Correlation: A numeric measure of the strength and direction of the linear
relationship between two variables. It ranges between -1 and 1.
If two variables are correlated, knowing the value of one variable provides
information about what we expect the value of the other variable should be.
Figure: Scatterplots of y against x for two examples, Corr = 0.81 and Corr = −0.56
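The correlation described on this slide can be computed directly from its definition. The sketch below is a minimal illustration in Python (the workshop itself uses R); the function name `pearson_correlation` and the toy data are assumptions made for this example, not values from the slides.

```python
import math

def pearson_correlation(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Assumes neither variable is constant (otherwise the ratio is undefined)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy data: y tends to fall as x rises, so the result is negative
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [4.8, 4.1, 3.9, 2.2, 1.5]
print(round(pearson_correlation(x, y), 3))
```

A value near 1 or -1 means knowing x tells us a lot about y; a value near 0 means it tells us little about a linear trend.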
9. Average Maximum July Temperature and Elevation
Figure: Scatterplot of average July maximum temperature against elevation, Corr = −0.82
10. Exploring Spatial Dependence
Correlogram
An exploratory visualization of the correlation between locations as a function of
distance.
1 Compute the pairwise distance between all locations
$\mathrm{dist}(s_1, s_2) = \sqrt{(\mathrm{lat}_1 - \mathrm{lat}_2)^2 + (\mathrm{lon}_1 - \mathrm{lon}_2)^2}$
2 Bin the distances into a set of groups and estimate the correlation within each group
3 Plot the estimated correlations against distance
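The three correlogram steps can be sketched in a few lines of Python. The slides do not say which estimator is used within each bin; this sketch assumes one simple choice, the Pearson correlation of the value pairs whose locations fall in the bin, with each pair counted in both orders. The function names, bin settings and toy data are all hypothetical.

```python
import math

def pair_correlation(pairs):
    """Pearson correlation of a list of (a, b) value pairs; NaN if undefined."""
    if len(pairs) < 2:
        return float("nan")
    mean_a = sum(a for a, _ in pairs) / len(pairs)
    mean_b = sum(b for _, b in pairs) / len(pairs)
    sab = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    saa = sum((a - mean_a) ** 2 for a, _ in pairs)
    sbb = sum((b - mean_b) ** 2 for _, b in pairs)
    if saa == 0.0 or sbb == 0.0:
        return float("nan")
    return sab / math.sqrt(saa * sbb)

def correlogram(coords, values, bin_width, n_bins):
    """Return (bin midpoint, estimated correlation, number of pairs) per bin."""
    binned = [[] for _ in range(n_bins)]
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            # Step 1: Euclidean distance between locations i and j
            d = math.dist(coords[i], coords[j])
            # Step 2: assign the pair to a distance bin
            b = int(d // bin_width)
            if b < n_bins:
                # Store both orderings so the estimate is symmetric in (i, j)
                binned[b].append((values[i], values[j]))
                binned[b].append((values[j], values[i]))
    # Step 3 (plotting) is left out; we return the points to plot
    return [((b + 0.5) * bin_width, pair_correlation(p), len(p) // 2)
            for b, p in enumerate(binned)]

# Hypothetical toy data: six locations on a line with smoothly changing values
coords = [(float(i), 0.0) for i in range(6)]
values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
for midpoint, corr, n_pairs in correlogram(coords, values, bin_width=1.5, n_bins=4):
    print(f"distance ~{midpoint:.2f}: corr = {corr:.2f} ({n_pairs} pairs)")
```

With spatially dependent data the estimates should be high in the nearest bins and fall off with distance; bins with very few pairs give unreliable estimates.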
11. Exploring Spatial Dependence
Figure: Empirical correlograms (correlation against distance) for average July temperature and for independent data
12. Linear Regression Model
The classical simple linear regression model assumes
$\text{ObservedValue} = \beta_0 + \text{Covariate} \times \beta_1 + \text{error}$
For example,
$\text{ObservedTemperature}(s) = \beta_0 + \text{Elevation}(s) \times \beta_1 + \text{error}(s)$
is the linear model defining temperature at a location $s$ as a linear function of
elevation at that location.
Errors are assumed independent and normally distributed, $N(0, \sigma^2)$
$\beta_0$, $\beta_1$ and $\sigma^2$ are unknown, so to use the model we need to estimate them
(use R!)
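The slides leave estimation to R. As a rough illustration of what that estimation involves, the Python sketch below computes the usual least squares estimates of β0 and β1 and an estimate of σ2 from the residuals. The function name and the toy elevation and temperature values are assumptions for this example, not the workshop data.

```python
def fit_simple_linear_regression(x, y):
    """Least squares estimates (beta0, beta1, sigma2) for y = beta0 + beta1 * x + error."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta1 = sxy / sxx                  # slope
    beta0 = mean_y - beta1 * mean_x    # intercept
    # Error variance estimate: residual sum of squares over (n - 2)
    rss = sum((b - (beta0 + beta1 * a)) ** 2 for a, b in zip(x, y))
    sigma2 = rss / (n - 2)
    return beta0, beta1, sigma2

# Hypothetical toy data: temperature decreasing with elevation
elevation = [100.0, 400.0, 700.0, 1000.0, 1300.0]
temperature = [31.5, 30.2, 28.1, 26.4, 24.0]
beta0, beta1, sigma2 = fit_simple_linear_regression(elevation, temperature)
print(beta0, beta1, sigma2)
```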
13. Average Maximum July Temperature and Elevation
The idea is to find the “best fit line”, $y = mx + b$, to the data
Using R, we get
$\text{Temp} = 32.666 + \text{Elev} \times (-0.0065)$
Figure: Scatterplot of average July maximum temperature against elevation
14. Prediction using the simple linear regression model
Once we’ve estimated the model, as long as we have a value of elevation at a new
location s0 we can predict temperature at that location.
$\text{PredictedTemp}(s_0) = 32.366 + \text{Elevation}(s_0) \times (-0.0065)$
Figure: Map of predictions
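Prediction with the fitted line is a single arithmetic step. The Python sketch below plugs an elevation into the equation on this slide; the default coefficients are the ones printed on this slide, while the function name and the elevation of 500 are hypothetical.

```python
def predict_temperature(elevation, beta0=32.366, beta1=-0.0065):
    """Predicted temperature at a new location from its elevation.

    Defaults are the coefficients shown on this slide.
    """
    return beta0 + elevation * beta1

# Hypothetical new location with elevation 500 (same units as the data)
print(round(predict_temperature(500.0), 3))
```

The same function can be applied to every cell of an elevation grid to produce a prediction map.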
15. How reasonable are the predictions?
Look at the residuals, Observed Temp(s) - Predicted Temp(s)
Residuals indicate the “errors” made by our model
Remember the model assumes errors are random and independent of each
other
Figure: Map of residuals from the elevation model
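Residuals are simply observed minus predicted values at the observed locations. A minimal Python sketch, with a hypothetical function name, hypothetical coefficients and made-up toy observations:

```python
def compute_residuals(observed, elevation, beta0, beta1):
    """Observed temperature minus the regression prediction at each location."""
    return [obs - (beta0 + beta1 * elev) for obs, elev in zip(observed, elevation)]

# Hypothetical observations at three stations, with assumed coefficients
observed = [31.0, 28.5, 25.0]
elevation = [200.0, 600.0, 1100.0]
res = compute_residuals(observed, elevation, beta0=32.0, beta1=-0.006)
print([round(r, 2) for r in res])
```

If the model assumptions hold, mapping these residuals should show no spatial pattern; clusters of same-signed residuals suggest leftover spatial dependence.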
16. Spatial correlation in the residuals?
Look at a correlogram of the residuals
Figure: Empirical correlogram of the residuals (correlation against distance)
17. Add Latitude and Longitude
$\text{PredTemp}(s) = 49.037 + \text{Lat}(s) \times (-0.48) + \text{Long}(s) \times (-0.13) + \text{Elev}(s) \times (-0.006)$
Figure: Map of predictions
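With several covariates the least squares estimates come from solving the normal equations. The Python sketch below does this with a small Gaussian elimination routine so that it needs no external libraries; in practice R's regression tools would be used instead. The function names and the six toy stations are assumptions for illustration only.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_multiple_regression(covariates, y):
    """Least squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in covariates]  # add intercept column
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Hypothetical stations: (latitude, longitude, elevation) and temperature
stations = [(35.0, -84.0, 300.0), (36.5, -82.5, 900.0), (38.0, -80.0, 600.0),
            (39.0, -85.0, 200.0), (34.5, -79.0, 50.0), (37.0, -83.0, 1200.0)]
temps = [30.1, 26.0, 27.2, 29.0, 31.5, 24.3]
print(fit_multiple_regression(stations, temps))
```

The normal equations can be numerically delicate when covariates have very different scales, which is one reason to rely on a tested regression routine for real data.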
18. Look at the residuals
Figure: Map of residuals from the longitude, latitude and elevation model
19. Spatial correlation in the residuals?
Look at a correlogram of the residuals
Figure: Empirical correlogram of the residuals (correlation against distance)
How do we incorporate the remaining dependence between locations into the
model?
20. Additive Geostatistical Modeling
An additive spatial regression model includes an additional component to model
the remaining spatial dependence in the residuals.
$\text{Observation}(s) = \text{Regression Terms}(s) + g(s) + \text{error}(s)$
The g(s) term is a spatial process model which allows us to model the dependence
between any two locations as a function of the distance between them.
g(s) is assumed to be a Gaussian process
It models the dependence between any two locations through a specified
correlation (or covariance) function, which has additional parameters that
need to be estimated (use R!)
Think of this as fitting a curve to the correlogram
Commonly used functions are the exponential and Matérn
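To make the two named correlation functions concrete, the Python sketch below evaluates the exponential function and one special case of the Matérn family. The general Matérn involves a modified Bessel function; the smoothness 3/2 case used here has a simple closed form. Several parameterizations are in use, so the `range_param` scaling is an assumption of this sketch.

```python
import math

def exponential_correlation(distance, range_param):
    """Exponential correlation: exp(-d / range)."""
    return math.exp(-distance / range_param)

def matern32_correlation(distance, range_param):
    """Matern correlation with smoothness 3/2 (closed-form special case)."""
    scaled = math.sqrt(3.0) * distance / range_param
    return (1.0 + scaled) * math.exp(-scaled)

# Both functions equal 1 at distance 0 and decay towards 0 as distance grows
for d in [0.0, 0.5, 1.0, 2.0, 4.0]:
    print(d, round(exponential_correlation(d, 1.0), 3),
          round(matern32_correlation(d, 1.0), 3))
```

Estimating `range_param` amounts to choosing the curve of this shape that best matches the correlogram.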
21. Additive Geostatistical Modeling
Prediction
Prediction at a new location s0 is
Prediction = Regression Terms + Weighted Sum of Observations
Same idea as with the independent linear model, except now an additional
weighted average of the observed data at all locations is included in the
prediction.
Data observed at locations closest to the prediction location have highest
weights.
Under this model, predictions can be obtained even in the absence of
covariates!
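The weighted-sum idea can be sketched as follows. This Python example assumes an exponential covariance with known parameters and applies the weights to the observations after subtracting their regression terms (the simple kriging form); the slides do not specify these details, and in practice the parameters are estimated. All names and toy values are hypothetical.

```python
import math

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def kriging_weights(coords, s0, range_param, sill=1.0, nugget=0.0):
    """Weights w solving C w = c0 under an exponential covariance."""
    n = len(coords)
    cov = [[sill * math.exp(-math.dist(coords[i], coords[j]) / range_param)
            + (nugget if i == j else 0.0) for j in range(n)] for i in range(n)]
    cov0 = [sill * math.exp(-math.dist(c, s0) / range_param) for c in coords]
    return solve_linear_system(cov, cov0)

def predict(coords, values, means, s0, mean0, range_param, sill=1.0, nugget=0.0):
    """Regression term at s0 plus a weighted sum of the observed deviations."""
    w = kriging_weights(coords, s0, range_param, sill, nugget)
    return mean0 + sum(wi * (z - m) for wi, z, m in zip(w, values, means))

# Hypothetical example: three observed locations, constant regression term 28
coords = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
values = [29.0, 27.5, 30.0]
means = [28.0, 28.0, 28.0]
s0 = (0.9, 0.0)
print(kriging_weights(coords, s0, range_param=1.0))
print(predict(coords, values, means, s0, mean0=28.0, range_param=1.0))
```

In this toy case the location nearest to s0 receives the largest weight, matching the statement that the closest observations contribute most.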
22. Geostatistical Model with Long, Lat as Covariates
Figure: Maps of predictions and standard errors
23. Geostatistical Model with Long, Lat as Covariates
24. Geostatistical Model with Long, Lat as Covariates
Figure: Empirical correlogram of the residuals and map of the residuals
25. Geostatistical Model with Long, Lat, Elevation as Covariates
Figure: Map of predictions
26. Geostatistical Model with Long, Lat as Covariates
27. Geostatistical Model with Long, Lat, Elevation as Covariates
Figure: Empirical correlogram of the residuals and map of the residuals