-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathspatialStats_cadmium.rmd
803 lines (585 loc) · 32.1 KB
/
spatialStats_cadmium.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
---
##-- arkriger: October 2023
title: "Spatial Statistics with R"
output:
html_document:
df_print: paged
html_notebook: default
---
<style>
p.comment {
background-color: #DBDBDB;
padding: 10px;
border: 1px solid black;
margin-left: 25px;
border-radius: 5px;
font-style: italic;
}
</style>
<div class="alert alert-info"> <strong>NAME: </strong> _[write your name.surname between the brackets (like that name.surname)]_
</div>
Welcome to this exercise which explores **(Geo)statistical analysis with R**.
This assignment is an opportunity for you to dive into the `R` statistical package for spatial interpolation and in particular, to start to learn some of the capabilities of the `gstat` package (_and the various commands and some of the graphical outputs which can be produced)_.
<div class="alert alert-block alert-success"> **Our focus is understanding concepts, the analysis, potential and application of these methods.** </div>
We do not cover all the details of the `gstat`. If you are unsure what the parameters for a particular function are or if you wish to explore other arguments related to a function, please consult the `help` files or the relevant package vignettes.
In this assignment, you will work through the two major _**interpolation**_ categories. These are:
**1. Deterministic Methods**
a. Thiessen polygons
c. Linear Regression
b. Inverse Distance Weighting
d. Ordinary Least Squares
**2. Stochastic Methods**
a. Variograms and Kriging
**3.** we also briefly highlight technics to **interrogate the quality of an interpolation** with
a. $n$-fold cross Validation and
b. Residual Mean Squared Error (rmse).
<div class="alert alert-danger">
<strong>REQUIRED!</strong>
You are required to insert your outputs and any comment into this document. The document you submit should therefore contain the existing text in addition to:
- Plots and other outputs from executing the code chunks
- Discussion of your plots and other outputs as well as conclusions reached.
- This should also include any hypotheses and assumptions made as well as factors that may affect your conclusions.
</div>
To help you interpret and understand your results, please consult the following resources:
|||
|-|:---|
|**Chapter 15:**|Introduction to Geographic Information Systems [9th ed.] Kang-tsung Chan.|
|**Chapter 8:**|Applied Spatial Data Analysis with R by Bivand et. al. (2008).|
A copy of these are available on [Amathuba](https://amathuba.uct.ac.za/d2l/login?sessionExpired=0&target=%2fd2l%2fle%2flessons%2f30111%2ffolders%2f1057404) inside the `Resources` folder, Study Material & Books
----
```{r install }
options(prompt="> ", continue="+ ", digits=3, width=70, show.signif.stars=F, repr.plot.width=7, repr.plot.height=7)
rm(list=ls())
# Install necessary packages: You only need to run this part once
##- install.packages(c("sf", "gstat", "lattice", "automap"))))
#library(sp)
library(sf) # 'simple features' representations of spatial objects
library(gstat) # geostatistics
library(ggplot2) # plotting
#library(lattice)
#library(automap)
```
```{r help }
#- help
help(package="gstat")
```
```{r data }
#data(package="sp")
```
<div class="alert alert-success">
<strong>THE DATASET: This assignment is based on the meuse data set.</strong>
The data set which can be found in the `gstat` package consists of 155 samples of top soil heavy metal concentrations (ppm), along with a number of soil and landscape variables. The samples were collected in a flood plain of the river Meuse, near the village Stein in the Netherlands. Historic metal mining has caused the widespread dispersal of lead, zinc, copper and cadmium in the alluvial soil.
</div>
<div class="alert alert-warning"> <strong>QUESTION!</strong> </div>
- **Question 1.** What would be the purpose of producing a map of zinc or cadmium deposits? Who would use a map of this nature?
<p class="comment">
[ Answer 1. click in this cell and type your answer here. your answer must be between the outer [] brackets ]
</p>
```{r load-data }
data("meuse", package = "sp")
```
```{r descr-data }
#- describe data
meuse
```
```{r class }
class(meuse)
```
```{r summary }
summary(meuse)
```
Notice the spatial aspect of the data (x and y) are fields. We need to explicitly define the data as a spatial object. We do this through identifying the coordinates.
```{r to-sf }
meuse.sf <- st_as_sf(meuse, coords = c("x","y"))
class(meuse.sf)
```
```{r structure}
str(meuse.sf)
```
<div class="alert alert-warning"> <strong>QUESTION!</strong> </div>
- **Question 2.** Which field refers to the spatial object’s geometry? What is its data type?
<p class="comment">
[ Answer 2. click in this cell and type your answer here. your answer must be between the outer [] brackets ]
</p>
Although we have coordinate values we have yet to define a coordinate reference system (crs)
```{r crs }
st_crs(meuse.sf) <- 28992
print(st_crs(meuse.sf))
```
-----
#### Explore the dataset
<div class="alert alert-warning"> <strong>QUESTION!</strong> </div>
- **Question 3.** Execute the following 5 `blocks (chunks)` of `code` and briefly describe **ALL** the outputs
```{r cadmium-summary }
# Summary Statistics
summary(meuse.sf$cadmium)
summary(log(meuse.sf$cadmium))
```
```{r plot-stem-leaf }
# Steam and leaf plot
stem(meuse.sf$cadmium)
stem(log(meuse.sf$cadmium))
```
```{r plot-hist }
# Histogram and Q-Q Plots
par(mfrow=c(1,2)) # arranged in 1 row and 2 columns
hist(meuse.sf$cadmium, n=20, main = "Histogram of cadmium (ppm)")
abline(v=mean(meuse.sf$cadmium), col="red")-
abline(v=median(meuse.sf$cadmium), col="blue")
hist(log(meuse.sf$cadmium), n=20, main = "Histogram of Log-cadmium (ppm)")
abline(v=mean(log(meuse.sf$cadmium)), col="red")
abline(v=median(log(meuse.sf$cadmium)), col="blue", main = NULL, sub = NULL)
```
Mean (red) and Median (blue) have been highlighted.
```{r plot-box }
par(mfrow=c(1,2))
boxplot(meuse.sf$cadmium, main = "Boxplot of cadmium (ppm)")
boxplot(log(meuse.sf$cadmium), main = "Boxplot of Log-cadmium (ppm)")
```
```{r plot-qq }
par(mfrow=c(1,2))
qqnorm(meuse$cadmium, main = "Q-Q Plot of cadmium (ppm)")
qqnorm(log(meuse$cadmium), main = "Q-Q Plot of Log-cadmium (ppm)")
par(mfrow=c(1,1)) # Reset to default plotting of 1 figure per page
```
<div class="alert alert-info"> <strong>HINT</strong> Describe the original and log transformed variables with respect to data distribution _(symmetry, modality, distribution, variance [or homoescedascity] presence of outliers, etc.)_. Also indicate if the transformation was appropriate for the data.</div>
<p class="comment">
[ Answer 3. click in this cell and type your answers here. your answers must be between the outer [] brackets
- the summary:
- the histogram:
- the boxplot:
- the qq plot:
- the mean, median, distribution, etc.
]</p>
To add context; the dataset includes a river object.
```{r river }
data(meuse.riv, package="sp")
class(meuse.riv)
```
```{r river-crs }
meuse.riv.sf <- st_sfc(st_linestring(meuse.riv, dim="XY"), crs = st_crs(meuse.sf))
class(meuse.riv.sf)
```
```{r river-summary }
summary(meuse.riv.sf)
```
<div class="alert alert-warning"> <strong>QUESTION!</strong> </div>
- **Question 4.** Execute the following `code chunk` and briefly describe what you observe with respect to the spatial distribution of Cadmium.
```{r points-plot }
#basic plot
plot(meuse.sf['cadmium'], asp = 1, pch = 1)
```
</style>
<div class="alert alert-info"> <strong>HINT:</strong> How are the points distributed over the study area? For example, are they evenly-spaced (gridded)? Random? Denser in some areas?</div>
<p class="comment">
[ Answer 4. click in this cell and type your answer here. your answer must be between the outer [] brackets ]
</p>
We can add context with the river
```{r points-plot02 }
plot(meuse.sf["cadmium"], reset = FALSE, nbreaks = 64, pch = 20,
cex=4*meuse.sf$cadmium/max(meuse.sf$cadmium),
#col=rainbow(100),
main = "Cd concentration [ppm]")
plot(meuse.riv.sf, add = TRUE)
```
<div class="alert alert-warning"> <strong>QUESTION!</strong> </div>
- **Question 5.** Talk about positions of the data in relation to the bounds of the region, the locations of high and low values and the presence of possible outliers.
<p class="comment">
[ Answer 5. click in this cell and type your answer here. your answer must be between the outer [] brackets ]
</p>
Or we could `plot()` the dataset differently
```{r points-plot03 }
ggplot(data = meuse.sf) +
geom_sf(mapping = aes(size=cadmium, color=dist.m)) +
labs(x = "Longitude", y = "Latitude", title = "Meuse River (NL)",
color="Distance to river [m]", size = "Cd concentration, ppm")
```
This specific dataset comes with a 40*40-m interpolation grid (a set of regularly-spaced points that covers the study area and specify locations were predictions will be done)
```{r grid }
data(meuse.grid, package="sp")
class(meuse.grid)
```
```{r grid-names }
names(meuse.grid)
```
Like the sample points; the grid needs to be explicitly defined
```{r grid-crs }
meuse.grid.sf <- st_as_sf(meuse.grid, coords = c("x", "y"))
st_crs(meuse.grid.sf) <- st_crs(meuse.sf)
summary(meuse.grid.sf)
```
If you explore the grid object you'll notice it contains several attributes
```{r grid-summary }
summary(meuse.grid.sf)
```
Its even possible to plot some of the attributes
```{r ffreq-plot }
plot(meuse.grid.sf["ffreq"], pch = 15,
main = "Meuse River, flooding frequency classes")
```
<div class="alert alert-warning"> <strong>QUESTION!</strong> </div>
- **Question 6.** What is the meaning of flood frequency class 1? What is its spatial distribution?
</style>
<div class="alert alert-info"> <strong>HINT:</strong> Consider using the `help` function.</div>
<p class="comment">
[ Answer 6. click in this cell and type your answer here. your answer must be between the outer [] brackets ]
</p>
___
<div class="alert alert-success">
<strong>Why do we need to interpolate?</strong>
Generally we want to _interpolate_ because **we either do not have the resources _(money and time)_ or its simply not feasible** to collect data everywhere.
Our data will have gaps. When these cases arrive, and they often do, we need a method to _predict_ values where we have none.
</div>
## 1. Deterministic (non-geostatistical) Methods
**These are models where arbitrary** _(random)_ **or empirical** _(observed)_ **parameters are used with no estimates of the model error and no assumptions about the variability of a feature.**
In this exercise we will cover four.
### 1. a. Thiessen polygons
Also called a voronoi diagram these methods are characterized by abrupt edge changes with only one point for each prediction.
```{r voronoi-vector }
#- Voronoi tesselation
pnts <- st_union(meuse.sf)
voronoi_grid <- st_voronoi(pnts)
meuse.sf$lcd <- log(meuse.sf$cadmium)
voronoi_sf <- st_sf(geometry = st_collection_extract(voronoi_grid, "POLYGON")) |>
st_join(meuse.sf[,"lcd"])
```
```{r plot-voronoi-vector }
#-- plot
plot(voronoi_sf[,"lcd"],
xlim=c(178605,181390),
ylim=c(329714,333611), main="Thiessen Polygons")
```
Or we could create a tessellated surface directly
```{r voronoi }
# Voronoi tesselation
thiessen = krige(log10(cadmium) ~ 1, meuse.sf, meuse.grid.sf, nmax = 1)
```
```{r plot-voronoi }
#-- plot
pts.s <- list("sp.points", meuse.sf, col="white",pch=20)
plot(thiessen["var1.pred"], pch=15, main="Thiessen Polygons")
```
Notice we have **not _interpolated_**. We have assigned a value to an area.
___
## 1. b. Regression Modelling
A common non-spatial (feature-space) approach to **_approximation_** is to model one variables’ distribution (the dependent or response variable) by one or more other variables (the independent or predictor variables). This is sometimes commonly called **_'regression modelling'_**.
<div class="alert alert-success"> <strong>CONCEPT CHECK: Regression Modeling and Interpolation</strong>
While this might seem excessive; the words we use here are relevant and their application is determined by the challenge.
When we **interpolate** we look for a function that fits the values of a dataset **exactly**. Here we say: given $n$ $(x_i, y_i)$ points we look for a function $F$ that satisfies $F(x_i, y_i) = z_i$.
Very often, within the spatial domain what this mean is; we generally have a coordinate value $(x, y)$ and want to predict a $z$. Additionally $F$ is typically a polynomial (1st, 2nd, etc) or a Spline.
**Regression** refers to fitting a function to minimise some cost; typically **_sum of squares error_**. What this means is given a dataset of $n$ points regression looks to **_find the line of best fit_**.
Regression can approximate while Interpolation can predict. Interpolation goes further than regression because it typically uses a regression model to predict values **between** _(at upsampled locations)_ a given dataset.
We start with Regression and follow on with Interpolation.
</div>
<div class="alert alert-warning"> <strong>QUESTION!</strong> </div>
- **Question 7.** Use the following 8 `code chunks` to perform the various analyses required for a regression model with distance to river as an independent variable.
```{r plot-dist }
# Map of distance to river:
meuse.grid.sf$sqrtdist <- sqrt(meuse.grid.sf$dist)
plot(meuse.grid.sf["sqrtdist"], pch = 15,
#sp.layout = list("sp.points", meuse.sf, col = 3, cex=.5),
main = "Distance to river")
```
```{r plot-scatter }
# Map of distance to river:
plot(log(cadmium)~sqrt(dist), as.data.frame(meuse.sf),
main="Scatterplot of log(Cd) vs. Squ. dist.")
abline(lm(log(cadmium)~sqrt(dist), meuse.sf))#, col="blue")
```
```{r fit-regression }
# Fit the regression model
cadmium.lm <- lm(log(cadmium)~sqrt(dist), meuse.sf)
```
```{r summary-reg }
# Get a summary of the regression model
summary(cadmium.lm)
```
```{r plot-diagnostic }
# Get diagnostic plots
layout(matrix(1:4, ncol=2))
plot(cadmium.lm, add.smooth = FALSE)
layout(1)
```
<!-- # ```{r plot-diagnostic02 } -->
<!-- # # Get diagnostic plots -->
<!-- # layout(matrix(1:4, ncol=2)) -->
<!-- # plot(cadmium.lm, add.smooth = FALSE) -->
<!-- # layout(1) -->
<!-- # ``` -->
```{r predict-std-error }
# Get Predicted Values and Standard Error of Fit for all locations on the grid
meuse.grid.sf$lzn.fit <- predict(cadmium.lm, meuse.grid.sf)
meuse.grid.sf$se.fit <- predict(cadmium.lm, meuse.grid.sf, se.fit=TRUE)$se.fit
```
```{r plot-preds }
# Plot the predicted values
plot(meuse.grid.sf['lzn.fit'], pch = 15, #"lzn.fit", #sp.layout = meuse.lt,
main = "Log(Cadmium) - ppm: Regression model values")
```
```{r plot-std-fit }
# Plot the Standard Error of fit
plot(meuse.grid.sf["se.fit"], pch = 15,
main = "Log(Cadmium) - ppm: Regression model Standard Error of fit")
```
<div class="alert alert-info"> <strong>HINT</strong> You should include a discussion on the nature of relationship between the dependent and independent variable, the model fit: model summary and regression diagnostics [say which, (if any), of the assumptions underlying linear regression are violated and the quality of interpolation with respect to the distribution and density of controls points etc. </div>
<p class="comment">
[ Answer 7. click in this cell and type your answers here. your answers must be between the outer [] brackets
- the distance to river:
- the scatter plot:
- the regression summary:
- the diagnostic plots:
- the standard error, prediction and standard fit:
] </p>
___
### 1. b. Inverse Distance Weighting (IDW)
With IDW we start **_interpolating_**. This means we convert point data, of numerical values, into a continuous surface and visualize how the data may be distributed across space.
IDW _interpolates_ point data by using a weighted average of a variable from nearby points to predict the value of that variable for each location. The weighting of the points is determined by their inverse distances; drawing on [Tobler’s first law of geography](https://en.wikipedia.org/wiki/Tobler%27s_first_law_of_geography) which states:
<div align="center">
_'everything is related to everything else, but near things are more related than distant things'_. </div>
____
<div class="alert alert-success"> <strong>CONCEPT CHECK:</strong> This is a concept we will come back to when we introduce spatial correlation.
</div>
<div class="alert alert-warning"> <strong>QUESTION!</strong> </div>
- **Question 8.** Use the following `code chunk` to generate interpolated surfaces of log (cadmium) concentrations obtained using Inverse-Distance Weighting (IDW) with different power functions (`idp`). Discuss all outputs and draw conclusions. Give reasons for your conclusions.
```{r idw }
# Run the IDW interpolation. check the effect of power (p) on the IDW interpolation by changing the value of "idp"
meuse.grid.sf$idwp05 = idw(log(cadmium) ~ 1, meuse.sf, meuse.grid.sf, idp = 0.5)$var1.pred
meuse.grid.sf$idwp25 = idw(log(cadmium) ~ 1, meuse.sf, meuse.grid.sf, idp = 2.5)$var1.pred
meuse.grid.sf$idwp5 = idw(log(cadmium) ~ 1, meuse.sf, meuse.grid.sf, idp = 5)$var1.pred
meuse.grid.sf$idwp10 = idw(log(cadmium) ~ 1, meuse.sf, meuse.grid.sf, idp = 10)$var1.pred
plot(meuse.grid.sf[c("idwp05", "idwp25", "idwp5", "idwp10")], pch=15,
main = "Log(Cadmium) - ppm , IDW Interpolation")
```
<div class="alert alert-info"> <strong>HINT</strong> You should say something about the role of the power function and the quality/behaviour of the predictions with respect to the distribution and density of controls points.</div>
<p class="comment">
[ Answer 8. click in this cell and type your answer here. your answer must be between the outer [] brackets ] </p>
___
## 1. c. Trend Surface Analysis with Ordinary Least Squares (OLS)
We are now firmly in the realm of _Trend Surface Analysis_. What we mean by a **_trend surface_** is: a variable (say $z$, cadmium deposits, elevation, etc.) can be expressed as some smooth function of coordinates ($x_i, y_i$). A polynomial function of the coordinates similar to:
$$
Z = 𝜷_0 + 𝜷_1x + 𝜷_2y + ϵ
$$
where: $Z$ is the is the elevation of the surface at $(x,y)$; $𝜷$ are possible coefficients representing the average elevation _(and possibly slope and curvature)_ in the $x$ and $y$ directions and $ε$ is the error term accounting for small fluctuations and deviations from the average.
In other words we; _predict_ values at upsampled locations (the `meuse grid`) with a 1st-, 2nd- or 3rd-order polynomial. The higher the degree of the polynomial the more the surface matches the original data. We should however take care.
We want realistic results within our dataset. **The higher the degree the more extreme the extrapolations.**
<div class="alert alert-warning"> <strong>QUESTION!</strong> </div>
- **Question 9.** Run the following `code chunk` to perform trend surface interpolation. Compare the three interpolation maps and discuss the results. Which one gives a better fit? Why?
```{r ols }
#-- trend surface up to degree 3
#- Note: tr1, tr2 and tr3 are names for Trend surface of order 1, 2 and 3 respectively
meuse.grid.sf$tr1 = krige(log(cadmium) ~ 1, meuse.sf, meuse.grid.sf, degree = 1)$var1.pred
meuse.grid.sf$tr2 = krige(log(cadmium) ~ 1, meuse.sf, meuse.grid.sf, degree = 2)$var1.pred
meuse.grid.sf$tr3 = krige(log(cadmium) ~ 1, meuse.sf, meuse.grid.sf, degree = 3)$var1.pred
plot(meuse.grid.sf[c("tr1", "tr2", "tr3")], pch=15,
main = "Log(Cadmium) - ppm Trend Surface Interpolation") #main="IDW") #"idwp1",
```
<div class="alert alert-info"> <strong>HINT</strong> Talk about the appropriateness of trend orders with respect to the nature of the interpolated surface, the quality of interpolation with respect to the distribution and density of controls points.</div>
<p class="comment">
[ Answer 9. click in this cell and type your answer here. your answer must be between the outer [] brackets ] </p>
___
### 2. Variogram and Kriging
While Ordinary Least Squares (OLS) predicts with a polynomial and applies a general error term we can in fact go further.
One of the principles of Spatial Data Science is; **_data is spatially correlated_**. Patterns and particularly residuals (errors) have a spatial structure / are connected.
We can model the spatial structure of residuals with a **variogram**; and then use this model to refine a trend surface with Kriging; a powerful geostatistical interpolation method.
First; we investigate the variance of the dataset _(how the observed values change)_ then model and later fit this model to a special form of linear interpolation: Kriging.
<div class="alert alert-success"> <strong>CONCEPT CHECK: Think about what this means.</strong>
We determine the structure (spatial correlation) of the dataset. And we use this knowledge to refine our prediction.
</div>
### - The Variogram
In geostatistics, **spatial correlation** is modeled with a **variogram**. What it does is explain the degree to which nearby locations have similar values using _semi-variance_.
<div class="alert alert-warning"> <strong>QUESTION!</strong> </div>
- **Question 10.** Use the next set of commands to estimate, model and fit the variogram and comment on the output, plots and results.
```{r h-scatter }
###################################################
### 8.4 Estimating Spatial Correlation: The Variogram - Bivand et al. (2008)
### 8.4.1 Exploratory Variogram Analysis
###################################################
### h-scatterplots/ lagged scatterplots
hscat(log(cadmium)~1, meuse.sf, (0:8)*100)
```
```{r v-cloud }
### Variogram cloud
plot(variogram(log(cadmium) ~ 1, meuse.sf, cloud = TRUE))
```
```{r binned-variogram }
###################################################
### Sample variogram (binned variogram) plot of (8.4)
###################################################
plot(variogram(log(cadmium) ~ 1, meuse.sf))
```
```{r variogram-anistropy }
###################################################
### Variograms in four different angles
###################################################
plot(variogram(log(cadmium) ~ 1, meuse.sf, alpha = c(0, 45, 90, 135)))
```
```{r variogram-cut }
###################################################
### Override the default cutoff and interval width values ### See Bivand et al. (2008)
###################################################
plot(variogram(log(cadmium) ~ 1, meuse.sf, cutoff = 1000, width = 50))
```
```{r variogram-distance }
###################################################
### Specifying interval for the distance vector - See Bivand et al. (2008)
###################################################
variogram(log(cadmium) ~ 1, meuse.sf, boundaries = c(0,50,100,seq(250,1500,250)))
```
```{r variogram }
######################################### ##########
### Variogram in Fig. 8.6 - Bivand et al. (2008)
###################################################
v <- variogram(log(cadmium) ~ 1, meuse.sf)
plot(v)
```
```{r variogram-initial }
###################################################
### Initial values for the variogram fit
###################################################
v.fit <- fit.variogram(v, vgm(1, "Sph", 800, 1))
plot(v, pl = T, model = v.fit)
```
```{r variogram-partial }
###################################################
### Partial fitting of variogram coefficients : PAGE 204 - Bivand et al. (2008)
###################################################
fit.variogram(v, vgm(1, "Sph", 800, 0.06), fit.sills = c(FALSE, TRUE))
```
```{r variogram-anistropy-plot }
###################################################
### 8.4.4 Anisotropy - Bivand et al. (2008)
###################################################
v.dir <- variogram(log(cadmium)~1, meuse.sf, alpha=(0:3)*45)
v.anis <- vgm(.6, "Sph", 1600, .05, anis=c(45, 0.3))
###################################################
### Fig. 8.7 - Bivand et al. (2008)
###################################################
plot(v.dir, v.anis)
```
```{r variogram-map }
###################################################
### variogram map - Bivand et al. (2008)
###################################################
plot(variogram(log(cadmium)~1, meuse.sf, map=TRUE, cutoff=1000, width=100))
```
<div class="alert alert-info"> <strong>HINT</strong> Page numbers or plot numbers from sections with similar outputs from Bivand et al. (2008) have been indicated in the comments attached to the codes. Please refer to those sections/plots for guidance. In general, discuss all outputs (text and plots). What they are, what they do, what assumptions have been made etc.</div>
<p class="comment">
[ Answer 10. click in this cell and type your answer here. your answer must be between the outer [] brackets
- the lagged scatterplots:
- the variogram cloud and sample variogram:
- variogram in four directions and what these indicate:
- the variogram model and fit. its sill range and nugget:
]</p>
### - Kriging
Now that we have modelled the structure of our dataset _(the spatial correlation / variance)_, with a variogram, we use this knowledge and refine our prediction. We now predict with Kriging.
Kriging allows variance to be non-constant.If you recall the general form of a Least Squares solution from earlier. Notice the error term: $ϵ$. One value applied to the entire dataset.
$$
Z = 𝜷_0 + 𝜷_1x + 𝜷_2y + ϵ
$$
The error term applied with Kriging is dependent on the distance between the points; as modeled by the variogram. It are the spatially correlated residuals that we modeled in the previous step that allows us to refine our _prediction_.
<div class="alert alert-warning"> <strong>QUESTION!</strong> </div>
- **Question 11.** Use the next `Chunck` to perform Kriging interpolation and comment on the output, plot and results.
```{r osk-ok-kriging }
###################################################
### 8.5.1 Simple Kriging and Ordinary Kriging - Bivand et al. (2008)
###################################################
lz.sk <- krige(log(cadmium)~1, meuse.sf, meuse.grid.sf, v.fit, beta = 5.9)
lz.ok <- krige(log(cadmium)~1, meuse.sf, meuse.grid.sf, v.fit)
par(mfrow=c(1,2)) # arranged in 1 row and 2 columns
plot(lz.sk['var1.pred'], pch=15, nbreaks = 64, main ="Simple Kriging [log10(Cd ppm)]",
reset = FALSE)
plot(lz.ok['var1.pred'], pch=15, nbreaks = 64, main = "Ordinary Kriging [log10(Cd ppm)]")
par(mfrow=c(1,1)) # Reset to default plotting of 1 figure per page
```
<p class="comment">
[ Answer 11. click in this cell and type your answer here. your answer must be between the outer [] brackets ] </p>
<div class="alert alert-warning"> <strong>QUESTION!</strong> </div>
- **Question 11.** What other output(s) would you generate to investigate the quality of the Simple and Ordinary Kriging Interpolation?
<div class="alert alert-info"> <strong>HINT</strong> How would you judge the quality of the interpolation? </div>
<p class="comment">
[ Answer 11. click in this cell and type your answer here. your answer must be between the outer [] brackets ] </p>
- **Question 12.** Do a visual comparison of **_ALL_** the interpolation results from the previous methods and discuss your observations
<div class="alert alert-info"> <strong>HINT</strong> What differences do you see from the interpolated maps you produced in terms of the behaviour of the predictions with respect to the distribution and density of controls points etc. </div>
<p class="comment">
[ Answer 12. click in this cell and type your answer here. your answer must be between the outer [] brackets ] </p>
- **Question 13.** How else would you compare the results from the various interpolation methods?
<div class="alert alert-info"> <strong>HINT</strong> Summarise section 15.5 of [9th ed.] Kang-tsung Chan. </div>
<p class="comment">
[ Answer 13. click in this cell and type your answer here. your answer must be between the outer [] brackets ] </p>
### 3. Validation
While we have _**looked**_ at, and performed a **_qualitative analysis_** of, how the different interpolations differ we now perform a statistical analysis and execute a **_quantitative assessment_**.
We evaluate the quality of our predictions with two measures.
- $N$-fold cross validation and
- Residual Mean Squared Error (RMSE)
```{r cross }
idw.cv.a <- krige.cv(log(cadmium) ~ 1, meuse.sf, nmax = 7, nfold=5, set = list(idp = 2.5))
ols.cv.a <- krige.cv(log(cadmium) ~ 1, meuse.sf, degree = 2, nmax = 40, nfold=5)
ok.cv.a <- krige.cv(log(cadmium)~1, meuse.sf, v.fit, nmax = 40, nfold=5)
```
```{r idw-cv }
#- idw-2.5 cv at 5-fold
idw.cv.a[1:5,]
```
```{r ols-cv }
#- 2nd-ols cv at 5-fold
ols.cv.a[1:5,]
```
```{r iok-cv }
#- ok cv at 5-fold
ok.cv.a[1:5,]
```
<div class="alert alert-warning"> <strong>QUESTION!</strong> </div>
- **Question 14.** Explain what we mean with $N$-fold validation and comment on the other validation method offered by the `krige.cv()` function of the `gstat` package.
<p class="comment">
[ Answer 14. click in this cell and type your answer here. your answer must be between the outer [] brackets ] </p>
```{r cv-correlation}
par(mfrow=c(1,3))
plot(var1.pred ~ observed, idw.cv.a, main="IDW 2.5 ", ylab="IDW pred.")
plot(var1.pred ~ observed, ols.cv.a, main="2nd-order OLS", ylab="OLS pred.")
plot(var1.pred ~ observed, ok.cv.a, main="OK", ylab="OK pred.")
par(mfrow=c(1,1)) # Reset to default plotting of 1 figure per page
```
<div class="alert alert-warning"> <strong>QUESTION!</strong> </div>
- **Question 14.** Discuss the correlation between the predicted and the observed values.
<div class="alert alert-info"> <strong>HINT</strong> Imagine a _**line-of-best-fit**_ running through the correlation. </div>
<p class="comment">
[ Answer 14. click in this cell and type your answer here. your answer must be between the outer [] brackets ] </p>
#### Examine the distribution _(and magnitude)_ of the residuals .
```{r cv-bubble-idw }
plot(idw.cv.a["residual"], main = "log(cadmium): IDW 2.5 5-fold CV residuals", #reset = FALSE,
cex=4*idw.cv.a$residual/max(idw.cv.a$residual), reset = FALSE)
plot(meuse.riv.sf, add = TRUE)
```
```{r cv-bubble-ols }
plot(ols.cv.a["residual"], main = "log(cadmium): 2nd-order OLS 5-fold CV residuals",
cex=4*ols.cv.a$residual/max(ols.cv.a$residual), reset = FALSE)
plot(meuse.riv.sf, add = TRUE)
```
```{r cv-bubble-ok }
plot(ok.cv.a["residual"], main = "log(cadmium): OK 5-fold CV residuals",
cex=4*ok.cv.a$residual/max(ok.cv.a$residual), reset = FALSE)
plot(meuse.riv.sf, add = TRUE)
```
```{r cv-box }
boxplot(idw.cv.a$residual, ols.cv.a$residual, ok.cv.a$residual, main="IDW 2.5 , 2nd-order OLS, OK")
```
```{r rmse-idw}
#- idw 2.5 rmse
sqrt(sum(idw.cv.a$residual^2)/length(idw.cv.a$residual))
```
```{r rmse-ols}
#- 2nd-order ols rmse
sqrt(sum(ols.cv.a$residual^2)/length(ols.cv.a$residual))
```
```{r rmse-iok}
#- ok rmse
sqrt(sum(ok.cv.a$residual^2)/length(ok.cv.a$residual))
```
<div class="alert alert-warning"> <strong>QUESTION!</strong> </div>
- **Question 15.** Explain the residuals. Comment on their magnitude, location and distribution.
<p class="comment">
[ Answer 15. click in this cell and type your answer here. your answer must be between the outer [] brackets ] </p>
<div class="alert alert-block alert-success"> <strong>DISCUSSION!</strong> </div>
- **Question 16.** Having gone through a variety of interpolation methods along with both a **qualitative and quantitative evaluation**; which method would you choose? Motivate your choice. Your answer must be between 100 and 150 words.
<p class="comment">
[ Answer 16. click in this cell and type your answer here. your answer must be between the outer [] brackets ] </p>