-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path08-site_comparisons.Rmd
248 lines (146 loc) · 9.72 KB
/
08-site_comparisons.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
# Site comparisons
```{r, include=F}
knitr::opts_chunk$set(echo = TRUE, warning = F, message = F)
dt_options <- list(scrollX = T, autoWidth = T, searching = F, ordering = F, lengthChange = F, paginate = F, info = F)
```
![](images/cover8.jpg){align="right" style="margin-top: -4px; margin-left: 24px; width: 38%;"}
__Recommended steps__
- Start by charting data with box and whisker plots. Some site differences may be easily apparent and will not require statistical comparisons. The analyst can use their judgement if site means are obviously different. In some cases a statistical test may not be necessary.
- Even when site means appear to be nearly identical, a statistical test should be used to confirm. Boxplots can be deceiving when trying to determine that two means are not significantly different.
- Check for normality Chi-squared test. If the data are normal, and there is equal variance complete an ANOVA to compare sites. If there are more than 2 sites, follow with a post hoc test.
- If the data are not normal, and there is not equal variance, complete a Kruskal Wallis test (or other test depending on the number of valid samples for each site). If there are more than 2 sites to compare, use Dunn's test for post hoc site by site comparisons.
## Confidence intervals
If the data are not normally distributed, use bootstraping to generate means for each site and take the differences of those means. If the lower bound of the 95% confidence interval for those differences is greater than zero, or the upper bound is less than zero, then the site means are significantly different. You cannot just generate confidence intervals for the means individually and see if they overlap since the joint probability of both means being in the same region is lower than the marginal probabilities of each mean being in the region.
It is not effective to do pairwise comparisons of individual samples due to null data and non-detects. You cannot automatically use an ANOVA test since most ambient air concentration data more closely resembles a log-normal distribution than a normal distribution and for a log-normal distribution, the variance is not independent of the mean which means that transforming the data and doing an ANOVA test would bias the confidence interval for the difference. MPCA uses bootstrap sampling to generate confidence intervals since bootstrap sampling is non-parametric and makes no assumptions about the distribution of the data or its variance.
The script below uses bootstrap sampling to generate 1,000 differences in site means which are then ordered to create a 95% confidence interval for the difference in means. If the lower bound of the interval is greater than zero, then the second site has a higher mean concentration than the first site. If the upper bound of the confidence interval is less than zero, then the second site has a lower mean concentration than the first site.
<br>
_Click below to view an example calculating confidence intervals._
<div class="toggle">
<button class="btn_code">Show __R__ code</button>
<br>
```{r, eval=F, warning=F}
library(tidyverse)
library(EnvStats)
site_compare <- function(data, site_number, Boot_Repeats = 1000) {
library(EnvStats)
set.seed(2017)
annual_AT_means <- function(air_toxics) {
air_toxics <- mutate(air_toxics, Year = year(ymd(Date)), Quarter = quarter(ymd(Date)) )
sample_complete <- air_toxics %>%
group_by(AQSID, Pollutant, Year, Quarter, MDL) %>%
summarise(Complete = ( (sum(!is.na(Result) ) / length(Result) ) >= 0.75 ) ) %>%
mutate(Complete = ifelse(is.na(Complete), F, Complete) ) %>%
group_by(AQSID, Pollutant, Year, MDL) %>%
summarise(Complete = all(Complete) )
enough_detects <- air_toxics %>%
group_by(AQSID, Pollutant, Year, MDL) %>%
summarise(Detected = mean(Censored, na.rm = T) <= 0.8 )
site_means <- air_toxics %>%
group_by(AQSID, Pollutant, Year, MDL) %>%
summarise(Mean = ifelse(length(unique(Result[!is.na(Result) & !Censored] ) ) < 2, NA,
ifelse (any(Censored, na.rm = T), elnormAltCensored(Result, Censored, method = "impute.w.mle", ci = F)$parameters[[1]], mean(Result, na.rm = T) ) ) )
site_means <- left_join(site_means, sample_complete, by = c("AQSID", "Pollutant", "Year", "MDL") ) %>%
left_join(enough_detects, by = c("AQSID", "Pollutant", "Year", "MDL") ) %>%
mutate(Mean = ifelse(Complete & Detected, Mean, NA), ID = paste(AQSID, Pollutant, Year) )
return(site_means)
}
MLE_est <- function(data){
results <- data$Result
censored <- data$Censored
n <- sum(!is.na(results))
if (length(unique(results[!is.na(results) & !censored] ) ) < 2 ) {
MLE_means <- NA
}
else {
random.rows = NULL
random.rows = sample(which(!is.na(censored) & (!censored) & !duplicated(results) ), 2, replace = FALSE)
random.rows = c(random.rows, sample(which(!is.na(censored)), n-2, replace = TRUE))
MLE_means <- ifelse(sum(censored[random.rows], na.rm = T) == 0, mean(results[random.rows]), elnormAltCensored(results[random.rows], censored[random.rows], method = "impute.w.mle", ci = F)$parameters[[1]] )
}
return(MLE_means)
}
data <- mutate(data, Result = ifelse(Censored, MDL, Result), ID = paste(AQSID, Pollutant, Year))
Bootstrap_means = replicate(Boot_Repeats, (by(data, data$ID, MLE_est) ) )
Bootstrap_means = rownames_to_column(as.data.frame(Bootstrap_means), "ID" )
Bootstrap_means = right_join(annual_AT_means(data), Bootstrap_means, by = "ID")
Bootstrap_means = Bootstrap_means %>% group_by(Pollutant, Year) %>% arrange(desc(AQSID == site_number), .by_group = T ) %>%
group_by(Pollutant, Year) %>% mutate_at(vars(num_range("V", 1:Boot_Repeats)), funs(c(first(.), (. - first(.))[-1])) ) %>% ungroup()
LB = select(Bootstrap_means, num_range("V", 1:Boot_Repeats) ) %>% apply(1, function(x) sort(-x)[floor(0.025 * Boot_Repeats)] )
UB = select(Bootstrap_means, num_range("V", 1:Boot_Repeats) ) %>% apply(1, function(x) sort(-x)[ceiling(0.975 * Boot_Repeats)] )
CI = data.frame(Lower = LB, Upper = UB)
CI = bind_cols(CI, Bootstrap_means) %>% select(Lower:ID) %>% mutate(Lower = ifelse(any(AQSID == site_number & Complete & Detected) & AQSID != site_number & Complete & Detected, Lower, NA),
Upper = ifelse(any(AQSID == site_number & Complete & Detected) & AQSID != site_number & Complete & Detected, Upper, NA),
Comparison = ifelse(Lower > 0, "Higher", ifelse(Upper < 0, "Lower", "Same") ) )
return(CI)
}
data <- read_csv('https://raw.githubusercontent.com/MPCA-air/air-methods/master/airtoxics_data_2009_2013.csv')
names(data)[1:10] <- c("AQSID", "POC", "Parameter", "Date","Result",
"Null_Data_Code", "MDL", "Pollutant", "Year", "CAS")
data <- data %>%
filter(AQSID %in% c(270370020, 270370470, 271230871) ) %>%
mutate(Censored = Result < MDL)
site_number = 271230871
seed = 2017
set.seed(seed)
compare = site_compare(data, site_number, 50) #minimum 50 repeats
```
</div>
## Tools to visualize differences
### Beeswarm Plots {-}
Beeswarm plots are useful for visualizing concentrations measured at different sites with more detail than boxplots.
_Click below to view an example R script._
<div class="toggle">
<button class="btn_code">Show __R__ code</button>
<br>
```{r}
library(tidyverse)
beeswarm_plot <- function(data, pollutant, year) {
library(dplyr)
library(ggbeeswarm)
library(ggplot2)
data <- mutate(data, AQSID = as.factor(AQSID))
filter(data, Pollutant == pollutant, Year == year) %>%
ggplot(aes(y = AQSID, x = Result, color = Censored) ) +
geom_quasirandom(groupOnX = F) +
labs(title = paste(pollutant, year), x = "Result (ug/m^3)")
}
data <- read_csv('https://raw.githubusercontent.com/MPCA-air/air-methods/master/airtoxics_data_2009_2013.csv')
names(data)[1:10] <- c("AQSID", "POC", "Param_Code", "Date","Result",
"Null_Data_Code", "MDL", "Pollutant", "Year", "CAS")
data <- mutate(data, Censored = Result < MDL)
beeswarm_plot(data, "Lead", 2013)
beeswarm_plot(data, "Arsenic", 2013)
beeswarm_plot(data, "Benzene", 2013)
```
</div>
### Correlation matrices {-}
Correlation matrices help visualize the relationship between different pollutants at a site.
_Click below to view an example R script._
<div class="toggle">
<button class="btn_code">Show __R__ code</button>
```{r}
correlation_plots = function(data, site) {
library(tidyverse)
library(corrplot)
data_site <- filter(data, AQSID %in% site) %>% select(Date, Pollutant, Result)
analytes <- spread(data_site, Pollutant, Result, drop = T)
analytes$Date <- NULL
coranalytes <- cor(analytes, method="kendall", use="pairwise.complete.obs") %>% as.data.frame()
coranalytes <- select_if(coranalytes, function(x) !all(is.na(x))) %>%
filter_all(any_vars(!is.na(.))) %>%
as.matrix()
rownames(coranalytes) <- colnames(coranalytes)
corrplot(coranalytes, method = "circle", type="lower", tl.cex=0.6) #plot matrix
}
data <- read_csv('https://raw.githubusercontent.com/MPCA-air/air-methods/master/airtoxics_data_2009_2013.csv')
names(data)[1:10] <- c("AQSID", "POC", "Param_Code", "Date","Result",
"Null_Data_Code", "MDL", "Pollutant", "Year", "CAS")
data %>% correlation_plots(270530962)
```
</div>
<br>
## _References_ {-}
- [Visualizing uncertainty in housing data](http://lenkiefer.com/2017/04/26/housing-data-uncertainty/)
- R packages: [Beeswarm](https://github.com/eclarke/ggbeeswarm)
<br>
<br> [Back to top](#site-comparisons)