-
Notifications
You must be signed in to change notification settings - Fork 16
/
Copy pathdata-ordinal.qmd
407 lines (308 loc) · 18.4 KB
/
data-ordinal.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
---
title: "Ordinal scales"
editor_options:
chunk_output_type: inline
---
## Introduction
Ordinal scales are organized as rank-ordered numeric classes, with a finite number of such classes. The utilization of ordinal scales is often due to their convenience and speed of rating [@madden2007]. In fact, plant pathologists often encounter situations where direct estimates of the nearest percent severity are time-consuming or impractical, besides being unreliably and inacuratelly estimated. In plant pathological research, there are two commonly used types of ordinal scales: quantitative and qualitative [@Chiang2022].
### Quantitative ordinal
In the quantitative ordinal scale, each score signifies a defined interval of the percentage scale. The most renowned quantitative ordinal scale is the Horsfall-Barratt (HB) scale, which was developed in the early 1940s when the science of plant pathology was transitioning towards more quantitative methodologies [@hebert1982]. The HB scale partitions the percentage scale into twelve successive, logarithmic-based intervals of severity ranging from 0 to 100%. The intervals increase in size from 0 to 50% and decrease from 50 to 100%.
::: callout-warning
## Controversy of the H-B scale
The divisions of the H-B scale were established on two assumptions. The first was the logarithmic relationship between the intensity of a stimulus and the subsequent sensation. The second was the propensity of a rater to focus on smaller objects when observing objects of two colors [@madden2007]. This foundation is based on the so-called Weber-Fechner law. However, there is limited experimental evidence supporting these assumptions. Current evidence indicates a linear relationship, rather than a logarithmic one, between visually estimated and actual severity [@nutter2006a]. Additionally, these authors demonstrated that raters more accurately discriminated disease severity between 25% and 50% than what the H-B scale allowed. New scale structures have been proposed to address the issues associated with the H-B scale [@liu2019; @chiang2014]. The Chiang scale follows a linear relationship with the percentage area diseased at severities greater than 10% (class 6 on the scale).
:::
Let's input the HB scale data and store as a data frame in R so we can prepare a table and a plot.
```{r}
#| label: tbl-HB
#| tbl-cap: "The Horsfal-Barrat quantitative ordinal scale used as a tool for assessing plant disease severity "
HB <- tibble::tribble(
~ordinal, ~'range', ~midpoint,
0, '0', 0,
1, '0+ to 3', 1.5,
2, '3+ to 6', 4.5,
3, '6+ to 12', 9.0,
4, '12+ to 25', 18.5,
5, '25+ to 50', 37.5,
6, '50+ to 75', 62.5,
7, '75+ to 88', 81.5,
8, '88+ to 94', 91.0,
9, '94+ to 97', 95.5,
10,'97+ to 100', 98.5,
11, '100', 100
)
knitr::kable(HB, align = "c")
```
Let's visualize the different sizes of the percent interval encompassing each score.
```{r}
#| warning: false
#| message: false
#| label: fig-hb
#| fig-cap: "Ordinal scores of the Horsfal-Barrat scale"
#| code-fold: true
library(tidyverse)
library(r4pde)
HB |>
ggplot(aes(midpoint, ordinal))+
geom_point(size =2)+
geom_line()+
scale_x_continuous(breaks = c(0, 3, 6, 12, 25, 50, 75, 88, 94, 97))+
scale_y_continuous(breaks = c(1:12))+
geom_vline(aes(xintercept = 3), linetype = 2)+
geom_vline(aes(xintercept = 6), linetype = 2)+
geom_vline(aes(xintercept = 12), linetype = 2)+
geom_vline(aes(xintercept = 25), linetype = 2)+
geom_vline(aes(xintercept = 50), linetype = 2)+
geom_vline(aes(xintercept = 75), linetype = 2)+
geom_vline(aes(xintercept = 88), linetype = 2)+
geom_vline(aes(xintercept = 94), linetype = 2)+
geom_vline(aes(xintercept = 97), linetype = 2)+
labs(x = "Percent severity", y = "HB score")+
theme_r4pde()
```
We can repeat those procedures to visualize the Chiang scale.
```{r}
#| label: tbl-chiang
#| tbl-cap: "The Chiang quantitative ordinal scale used as a tool for assessing plant disease severity "
chiang <- tibble::tribble(
~ordinal, ~'range', ~midpoint,
0, '0', 0,
1, '0+ to 0.1', 0.05,
2,'0.1+ to 0.5', 0.3,
3, '0.5+ to 1', 0.75,
4, '1+ to 2', 1.5,
5, '2+ to 5', 3,
6, '5+ to 10', 7.5,
7, '10+ to 20', 15,
8, '20+ to 30', 25,
9, '30+ to 40', 35,
10, '40+ to 50', 45,
11, '50+ to 60', 55,
12, '60+ to 70', 65,
13, '70+ to 80', 75,
14, '80+ to 90', 85,
15,'90+ to 100', 95
)
knitr::kable(chiang, align = "c")
```
```{r}
#| warning: false
#| message: false
#| label: fig-chiagn
#| fig-cap: "Ordinal scores of the Chiang scale"
#| code-fold: false
chiang |>
ggplot(aes(midpoint, ordinal))+
geom_point(size =2)+
geom_line()+
scale_y_continuous(breaks = c(0:15))+
scale_x_continuous(breaks = c(0, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100))+
geom_vline(aes(xintercept = 0), linetype = 2)+
geom_vline(aes(xintercept = 0.1), linetype = 2)+
geom_vline(aes(xintercept = 0.5), linetype = 2)+
geom_vline(aes(xintercept = 1), linetype = 2)+
geom_vline(aes(xintercept = 2), linetype = 2)+
geom_vline(aes(xintercept = 5), linetype = 2)+
geom_vline(aes(xintercept = 10), linetype = 2)+
geom_vline(aes(xintercept = 20), linetype = 2)+
geom_vline(aes(xintercept = 30), linetype = 2)+
geom_vline(aes(xintercept = 40), linetype = 2)+
geom_vline(aes(xintercept = 50), linetype = 2)+
geom_vline(aes(xintercept = 60), linetype = 2)+
geom_vline(aes(xintercept = 70), linetype = 2)+
geom_vline(aes(xintercept = 80), linetype = 2)+
geom_vline(aes(xintercept = 90), linetype = 2)+
geom_vline(aes(xintercept = 100), linetype = 2)+
labs(x = "Percent severity", y = "Chiang score")+
theme_r4pde()
```
### Qualitative ordinal
In the qualitative ordinal scale, each class provides a description of the symptoms. An example is the ordinal 0-3 scale for rating eyespot of wheat developed by [@scott1974].
| Class | Description |
|--------------------------|----------------------------------------------|
| 0 | uninfected |
| 1 | slight eyespot (or or more small lesion occupying in total less than half of the circumference of the stem) |
| 2 | moderate eyespot (one or more lesions occupying at least half the circumference of the stem) |
| 3 | severe eyespot (stem completely girdled by lesions; tissue softened so that lodging would really occur) |
: Ordinal scale for rating eyespot of wheat [@scott1974]
## Disease severity index (DSI)
Sometimes, when quantitative or qualitative ordinal scales are used, the scores given to various individual specimens (the observational units) are transformed into an index on a percentage basis, such as the disease severity index (DSI) which is used as a value for the experimental unit further in data analysis. The DSI is a single number that summarizes a large amount of information on disease severity [@chester1950]. The formula for a DSI (%) can be written as follows:
$DSI = \frac{∑(class \ freq. \ ✕ \ score \ of \ class)} {total \ n \ ✕ \ maximal \ class} ✕ 100$
The `DSI()` and `DSI2()` are part of the *r4pde* package. Let's see how each function works.
The `DSI()` allows to automate the calculation of the disease severity index (DSI) in a series of units (e.g. leaves) that are further classified according to ordinal scores. The function requires three arguments:
- unit = the vector of the number of each unit
- class = the vector of the scores for the units
- max = the maximum value of the scale
Let's create a toy data set composed of 12 units where each received an ordinal score. The vectors were arranged as a data frame named scores.
```{r}
unit <- c(1:12)
class <- c(2,3,1,1,3,4,5,0,2,5,2,1)
ratings <- data.frame(unit, class)
knitr::kable(ratings)
```
The ordinal score used in this example has 6 as the maximum score. The function returns the DSI value.
```{r}
library(r4pde)
DSI(ratings$unit, ratings$class, 6)
```
Let's now deal with a situation of multiple plots (five replicates) where a fixed number of 12 samples were taken and assessed using an ordinal score. Let's input the data using the `tribble()` function. Note that the data is in the wide format.
```{r}
exp <- tibble::tribble(
~rep, ~`1`, ~`2`, ~`3`, ~`4`, ~`5`, ~`6`, ~`7`, ~`8`, ~`9`, ~`10`, ~`11`,~`12`,
1, 2, 3, 1, 1, 3, 4, 5, 0, 2, 5, 2, 1,
2, 3, 4, 4, 6, 5, 4, 4, 0, 2, 1, 1, 5,
3, 5, 6, 6, 5, 4, 2, 0, 0, 0, 0, 2, 0,
4, 5, 6, 0, 0, 0, 3, 3, 2, 1, 0, 2, 3,
5, 0, 0, 0, 0, 2, 3, 2, 5, 6, 2, 1, 0,
)
knitr::kable(exp)
```
After reshaping the data to the long format, we can calculate the DSI for each plot/replicate as follows:
```{r}
res <- exp |>
pivot_longer(2:13, names_to = "unit", values_to = "class") |>
group_by(rep) |>
summarise(DSI = DSI(unit, class, 6))
```
And here we have the results of the DSI for each replicate.
```{r}
knitr::kable(res, align = "c")
```
Now our data set is organized as the frequency of each class as follows:
```{r}
ratings2 <- ratings |>
dplyr::count(class)
ratings2
```
Now we can apply the `DSI2()` function. The function requires three arguments:
- class = the number of the respective class
- freq = the frequency of the class
- max = the maximum value of the scale
```{r}
library(r4pde)
DSI2(ratings2$class, ratings2$n, 6)
```
## Analysis of ordinal data
Ordinal score data typically do not align well with the assumptions of traditional parametric statistical methods. Given this challenge, non-parametric methods have emerged as a compelling alternative for handling ordinal plant pathological data [@shah2004]. Unlike their parametric counterparts, these methods do not rest on the presumption of a specific distribution for the underlying population, offering greater flexibility in accommodating the intricacies inherent to ordinal data. On the other hand, when the conditions are right, parametric methods can also be harnessed effectively.
A common strategy, particularly for ordinal scores, involves converting these values into the mid-points of their corresponding percent scales. This transformation renders the data more amenable to parametric analyses. However, the mid-point conversion has been criticized in the literature as it may amplify the imprecision, especially when the interval size is wide, and because it does not really reflect a true value, but an interval for each estimate [@chiang2023a; @onofri2018].
Let's see some examples of analysis using the mid-point conversion in a parametric framework as well as the non-parametric tests.
### Example data
We will use a data set made available in an article on the application of the survival analysis technique to test hypotheses when using quantitative ordinal scales [@chiang2023a]. The data and codes used in the paper can be found in the specified GitHub [repository](https://github.com/StatisticalMethodsinPlantProtection/CompMuCens). The data is structured into four treatments, each with 30 ratings ranging from a score of 1 to 6 on the H-B scale.
```{r}
# Create the vectors for the treatments
trAs <- c(5,4,2,5,5,4,4,2,5,2,2,3,4,3,2,
2,6,2,2,4,2,4,2,4,5,3,4,2,2,3)
trBs <- c(5,3,2,4,4,5,4,5,4,4,6,4,5,5,5,
2,6,2,3,5,2,6,4,3,2,5,3,5,4,5)
trCs <- c(2,3,1,4,1,1,4,1,1,3,2,1,4,1,1,
2,5,2,1,3,1,4,2,2,2,4,2,3,2,2)
trDs <- c(5,5,4,5,5,6,6,4,6,4,3,5,5,6,4,
6,5,6,5,4,5,5,5,3,5,6,5,5,5,6)
# Create the tibble
dat_ordinal <- tibble::tibble(
treatment = rep(c("A", "B", "C", "D"),
each = 30),
score = c(trAs, trBs, trCs, trDs)
)
dat_ordinal
```
Because the ordinal response was obtained using an interval scale, the mid-point of the score was also obtained.
```{r}
# Create the vectors for the treatments
trAm <- c(18.5,9,1.5,18.5,18.5,9,9,1.5,
18.5,1.5,1.5,4.5,9,4.5,1.5,1.5,
37.5,1.5,1.5,9,1.5,9,1.5,9,18.5,
4.5,9,1.5,1.5,4.5)
trBm <- c(18.5,4.5,1.5,9,9,18.5,9,18.5,9,
9,37.5,9,18.5,18.5,18.5,1.5,37.5,
1.5, 4.5,18.5,1.5,37.5,9,4.5,1.5,
18.5,4.5,18.5,9,18.5)
trCm <- c(1.5,4.5,0,9,0,0,9,0,0,4.5,1.5,0,9
,0,0,1.5,18.5,1.5,0,4.5,0,9,1.5,1.5,
1.5,9,1.5,4.5,1.5,1.5)
trDm <- c(18.5,18.5,9,18.5,18.5,37.5,37.5,9,
37.5,9,4.5,18.5,18.5,37.5,9,37.5,
18.5,37.5,18.5,9,18.5,18.5,18.5,4.5,18.5,
37.5,18.5,18.5,18.5,37.5)
# Create the tibble
dat_ordinal_mp <- tibble::tibble(
treatment = rep(c("A", "B", "C", "D"),
each = 30),
midpoint = c(trAm, trBm, trCm, trDm)
)
dat_ordinal_mp
```
The visualization using box-plots suggests differences between the treatments.
```{r}
p_score <- dat_ordinal |>
ggplot(aes(treatment, score))+
geom_boxplot(outlier.colour = NA)+
geom_point()+
theme_r4pde()
p_mp <- dat_ordinal_mp |>
ggplot(aes(treatment, midpoint))+
geom_boxplot(outlier.colour = NA)+
geom_point()+
theme_r4pde()
library(patchwork)
p_score | p_mp
```
### Parametric (mid-point conversion)
Parametric methods can be employed when analyzing data, provided that the underlying assumptions of these methods are satisfied. For data that is based on ordinal scores, one way to apply parametric methods is to convert these scores into the mid-points of their corresponding percent scales. Once converted, the data is better suited for parametric analyses. Among the available techniques, the most frequently utilized is the application of an ANOVA (Analysis of Variance) model. This model is particularly useful for testing the null hypothesis, which posits that there are no significant differences among the treatments being studied.
We can use the `aov` function to fit the model and check the parametric assumptions using the `DHARMa` package.
```{r}
#| warning: false
#| message: false
m1_mp <- aov(midpoint ~ treatment, data = dat_ordinal_mp)
# normality and homocedasticity tests
library(DHARMa)
plot(simulateResiduals(m1_mp))
```
Since both normality and homoscedasticity assumptions are violated in our initial model, we will halt further analysis using this model. Instead, we will explore an alternative approach that includes data transformation (logit).
```{r}
#| warning: false
#| message: false
# data transformation
library(car)
# transform midpoint to proportion
dat_ordinal_mp$midpoint2 <- dat_ordinal_mp$midpoint/100
m2_mp <- aov(logit(midpoint2) ~ treatment, data = dat_ordinal_mp)
plot(simulateResiduals(m2_mp)) # assumptions are met
```
Now that both assumptions have been met, we can proceed to use a means comparison test to determine which treatments differ from one another, with the assistance of the {emmeans} package. It's important to note that the results are displayed in the original scale after transformation, specifically when using `type = "response`.
```{r}
library(emmeans)
means_m2_mp <- emmeans(m2_mp, "treatment", type = "response")
means_m2_mp
```
The best way to visualize the differences between treatments is using the `pwpm` function to display a matrix of estimates, pairwise differences, and P values. Most commonly, the compact letter display is used for means comparison.
```{r}
#| warning: false
#| message: false
# matrix
pwpm(means_m2_mp)
# compact letter display
library(multcomp)
cld(means_m2_mp)
```
### Non-parametric (ordinal score)
Because ordinal score data generally do not meet the assumptions of traditional parametric statistical methods, non-parametric methods can be considered as an alternative. Such methods have been proposed for analyzing ordinal data in plant pathology [@shah2004]. For our example, when more than two treatments are involved, a Kruskal-Wallis test can be utilized. The `kruskal` function of the {agricolae} package does the job we want. Note that in this case we use the ordinal score directly and not the mid-point value.
```{r}
library(agricolae)
kruskal(dat_ordinal$score, dat_ordinal$treatment, console = TRUE)
```
### Survival analysis
A method based on survival analysis was more recently introduced for the analysis of ordinal data [@chiang2023a]. The authors developed an R script called "CompMuCens", which facilitates the comparison of multiple treatment means for plant disease severity data. This is achieved through nonparametric survival analysis using class- or interval-based data derived from a quantitative ordinal scale. The code is accessible in [this repository](https://github.com/StatisticalMethodsinPlantProtection/CompMuCens/tree/main) and has been incorporated into the {r4pde} package.
We continue with our working example data. Since the expected variable name for the score in the function is `x`, we must adjust our data frame accordingly.
```{r}
names(dat_ordinal) <- c("treatment", "x")
```
The `CompMuCens()` function uses the `ictest()` from the {interval} package to conduct nonparametric survival analysis. Detailed explanation of the function's input and output can be found [here](https://github.com/StatisticalMethodsinPlantProtection/CompMuCens/blob/main/Supplementary%20Data.pdf). We just need to set the `dat` and the `scale` arguments. The scale will be used to convert the scores in to the defined interval which is used as response variable in the analysis. For example, if the score is 2, the respective limits of the interval will be 3 and 6. For 7, the limits will be 75 and 88. The function takes care of this conversion based on the inputted scale values.
```{r}
library(interval)
library(r4pde)
scale <- c(0,3,6,12,25,50,75,88,94,97,100, 100)
CompMuCens(dat_ordinal, scale)
```
The outcomes are three: the ordered scores for each treatment, the pairwise comparison between treatments and the significance levels, followed by a conclusion section. In this example all treatments differ from each other.
### Interpretation
Upon comparing the three methods, it becomes evident that there is a marked distinction in their outcomes. The Kruskal-Wallis and the survival analysis tests suggested a significant difference between treatment A and B, whereas the parametric counterpart did not. However, it's crucial to note that the P-value is borderline significant at 0.0530, being just slightly above the conventional threshold of 0.05. Given this, it appears that the conversion to the mid-point might have resulted in a type II error, wherein a genuine difference between the treatments might have been missed or overlooked. A detailed comparison between the mid-point and survival analysis has been presented in the paper, which is worth a reading [@chiang2023a]