-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathevaluation.qmd
707 lines (557 loc) · 35.7 KB
/
evaluation.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
# Evaluating the integration results {#sec-evaluation}
```{r}
#| child: "_setup.qmd"
```
```{r loading-packages}
#| include: false
library(targets)
library(moiraine)
library(purrr)
library(dplyr)
library(ggplot2)
```
```{r setup-visible}
#| eval: false
library(targets)
library(moiraine)
## For data-frames manipulation
library(dplyr)
## For colour palettes
library(ggplot2)
```
After having investigated the results of an integration method (@sec-interpretation), we will show in this chapter how to evaluate the integration results against single-omics analyses results or prior biological knowledge.
As a reminder, here is what the `_targets.R` script should look like so far:
<details>
<summary>`_targets.R` script</summary>
```{r previous-vignettes}
#| echo: false
#| results: asis
knitr::current_input() |>
get_previous_content() |>
cat()
```
</details>
## Introduction
Evaluating the results of a multi-omics integration analysis is not an easy task, mainly because when analysing biological data there is no ground truth about the underlying biological mechanisms at play. However, it is possible to gain some measure of confidence in our results by comparing them to what we already know about the biological system studied, or to other analyses (typically single-omics analyses) performed on the dataset. In the latter case, this can also show us what insights are gained by analysing the omics datasets together that would have been missed by looking at each omics dataset separately. Note that in this vignette, we refer to the **evaluation** of an integration method as comparison with prior knowledge or previous studies, while in @sec-comparison we will talk about the **comparison** of results obtained with different integration tools.
When evaluating integration results, we can compare the features contribution (see @sec-interpretation-dim-reduction) to some prior information that we have about the features: for example, their score in a differential expression analysis, whether they were found differentially expressed or not, or the group to which they belong (e.g. biological pathway or chemical class). On the other hand, we can assess whether the latent dimensions created separate different sample groups (e.g. control vs treatment, or different time points).
For this vignette, we will use the results from the DIABLO (@sec-diablo) and MOFA (@sec-mofa) analyses as examples. We will use the dimension reduction output objects that we created in @sec-interpretation, which contains the results of the DIABLO and MOFA runs stored in a standardised format (@sec-interpretation-standard-output):
```{r load-methods-output}
tar_load(diablo_output)
tar_load(mofa_output)
```
Note that for the DIABLO run, the samples score are computed for the weighted average of each latent component, rather than for the dataset-specific latent components (this is controlled through the `use_average_dimensions` argument from the `get_output()` function).
We will also use the `MultiDataSet` object that contains information about the samples and features:
```{r load-mo-set-de}
tar_load(mo_set_complete)
```
## Evaluating features prioritisation
First, we can evaluate the integration results in terms of the features prioritisation, i.e. the ranking of the features for each latent dimension in terms of their importance score. We can for example assess whether the features that were given the highest scores for some latent components were also highlighted through single-omics analyses, or compare the importance score of groups of features, for example compounds from different chemical classes or genes from different gene families.
Here, we will compare the DIABLO results to the results of traditional single-omics analyses. In our case, a GWAS and eQTL analysis were performed to detect genomic variants associated with the disease status of the samples. The mode of action of significant markers (i.e. QTL, cis-eQTL or trans-eQTL). The GWAS status (high-scoring or not) of each marker was recorded in the genomics features metadata table:
```{r geno-fmeta}
get_features_metadata(mo_set_complete)[["snps"]] |>
pull(qtl_type) |>
unique()
```
For the transcriptomics and metabolomics dataset, a differential expression (DE) analysis was performed on each of them to compare control and BRD animals. The resulting adjusted p-value, log2-fold change or t-value, status (upregulated, downregulated or not differentially expressed) and significance (differentially expressed or not) of each gene or compound are recorded in each dataset's features metadata:
```{r transcripto-metabo-fmeta}
get_features_metadata(mo_set_complete)[["rnaseq"]] |>
select(feature_id, log_fc:de_status) |>
head()
get_features_metadata(mo_set_complete)[["metabolome"]] |>
select(feature_id, t_value:de_status) |>
head()
```
### Table of counts of selected genes against a grouping
As DIABLO performs feature selection, it is of interest to assess how many (e)QTLs and DE genes and compounds were retained for each latent component. We can use the `evaluate_feature_selection_table()` function to generate a table of counts comparing the features selected for each latent component to a features grouping (in our case, the results of the single-omics analyses). The features grouping is obtained from the features metadata stored in the `MultiDataSet` object. The function takes as input the integration result object, the `MultiDataSet` object and a named list giving for each dataset the name of the column from their features metadata table containing the features grouping:
```{r evaluate-feature-selection-table}
diablo_evaluation_counts_table <- evaluate_feature_selection_table(
diablo_output,
mo_data = mo_set_complete,
col_names = list(
"snps" = "qtl_type",
"rnaseq" = "de_signif",
"metabolome" = "de_signif"
)
)
diablo_evaluation_counts_table |>
filter(latent_dimension == "Component 1")
diablo_evaluation_counts_table |>
filter(latent_dimension == "Component 2")
```
We can see that for the first latent component, only three out of the eight (e)QTLs were selected ^[Note that in the full genomics dataset, 36 markers were detected as QTLs or eQTLs, however only eight of them were retained in the features preselection step ( @sec-prefiltering-supervised). For a real analysis, all 36 markers would have also been included in the prefiltered dataset even if they did not pass the preselection.]; two-third of the genes selected were found differentially expressed; and all ten of the selected compounds were found differentially expressed. Conversely, for the second latent component, none of the selected markers were QTLs, and none of the selected genes were differentially expressed. This makes sense, as the DIABLO analysis was run with the goal of separating the control and infected animals; therefore it makes sense that the features selected for the first latent component, which is the one best able to separate the groups, are the features detected as differentially expressed. On the contrary, the second latent component is constructed to be orthogonal to the first one, therefore the features selected will likely not be differentially expressed.
With this function, we can focus on only some datasets, by changing which datasets are present in the list passed to `col_names`. Similarly, we can select specific latent dimensions through the `latent_dimensions` parameter. For example, let us count the number of DE genes selected with the first latent component:
```{r evaluate-feature-selection-table-subset}
evaluate_feature_selection_table(
diablo_output,
mo_data = mo_set_complete,
col_names = list("rnaseq" = "de_signif"),
latent_dimensions = "Component 1"
)
```
Note that this function is useful for integration methods that perform feature selection, such as DIABLO or sO2PLS, but will be irrelevant for methods that do not perform feature selection (for example MOFA tends to assign a non-null weight to most features).
### Plotting features weight against a covariate
We can go further and compare the weights or importance scores given to the features for each latent component to the outcome of the single-omics results. This can be done with the `plot_features_weight_covariate()` function. As its name suggests, this function displays the features weight for each latent dimension against some covariate obtained from the features metadata tables. Again, we will look at the results of the GWAS/QTL and DE analyses and see whether (e)QTLs, DE genes and compounds were assigned higher weights is some of the latent components constructed with DIABLO. The input parameters for this function are similar to those for the `evaluate_feature_selection_table()` function: we need to pass the results of the integration method, the `MultiDataSet` object, and the named list giving the column in each features metadata table containing information about the features grouping. The latter is passed to the `covariate` argument of the function. If some datasets are not present in this list, they won't be present in the plot.
```{r plot-features-weight-covariate-all}
plot_features_weight_covariate(
diablo_output,
mo_data = mo_set_complete,
covariate = list(
"snps" = "qtl_type",
"rnaseq" = "de_status",
"metabolome" = "de_status"
)
)
```
By default, the function plots on the y-axis the signed importance score of the features, that is, their importance score to which the sign of their weight was added. This can be controlled via the `features_metric` argument. We see that most of the points have an importance score of 0, because most of the features were not selected. We can focus on only those features with a non-null weight with the `remove_null_weight` argument:
```{r plot-features-weight-covariate-nonnull}
plot_features_weight_covariate(
diablo_output,
mo_data = mo_set_complete,
covariate = list(
"snps" = "qtl_type",
"rnaseq" = "de_status",
"metabolome" = "de_status"
),
remove_null_weight = TRUE
)
```
Here, we see that for the first latent component, selected genes that were also found up-regulated were assigned on average a higher importance score that selected genes that were found non DE. Also, for the metabolomics dataset, selected compounds that were found up-regulated were assigned a positive weight while those found down-regulated were assigned a negative weight.
Note that with the `plot_features_weight_covariate()` function, we can also plot the features against a continuous covariate, e.g. the log2-fold change obtained with the differential expressions:
```{r plot-features-weight-covariate-continuous}
plot_features_weight_covariate(
diablo_output,
mo_data = mo_set_complete,
covariate = list(
"rnaseq" = "log_fc",
"metabolome" = "t_value"
)
)
```
We can see a nice trend between compounds' t-value and their signed importance score in the differential expression analysis for the first latent component.
We can further change the colour or shape of the points according to some information from the features metadata (information must be passed as a named list as for the covariate).
```{r plot-features-weight-covariate-continuous-colour}
plot_features_weight_covariate(
diablo_output,
mo_data = mo_set_complete,
covariate = list(
"rnaseq" = "log_fc",
"metabolome" = "t_value"
),
colour_by = list(
"rnaseq" = "fdr",
"metabolome" = "padj"
)
) +
scale_colour_viridis_c(aesthetics = c("colour", "fill"))
```
Note that for continuous covariates, as with the function `plot_samples_score_covariate()`, a loess curve is fit to summarise the trend between the covariates and the features weight. If `colour_by` is used, and the corresponding variables are categorical, a different loess curve is fitted for each category. If instead the `colour_by` variables are numeric, a single loess curve will be plotted for each latent dimension and dataset.
## Features set enrichment
In addition to looking at the distribution of features importance score or counting the number of selected features, we can also perform an enrichment to assess whether each of the latent dimensions constructed is enriched for some features set of interest. Features sets are groups of features that belong in a similar category: e.g. all features involved in a same biological pathway, or with the same GO term annotation, or even all up- or down-regulated features. They are built using prior knowledge, for example information extracted from databases, or constructed from the results of previous analyses. Importantly, a feature can belong to more than one set (for example a compound can be involved in several biological pathways).
### Constructing feature sets
With the `moiraine` package, there are two ways to obtain feature sets to perform enrichment or for plots. The first one is to extract information from the features metadata tables stored in our `MultiDataSet` object; which is done with the `make_feature_sets_from_fm()` function. Note that if some features grouping is stored in the features metadata, it means that features can only belong to one set. For example, we will use the results of the single-omics analyses to group the features into sets: QTLs, cis-eQTLs, trans-eQTLs and non significant genomics markers, and up-regulated, down-regulated or non DE genes and compounds. Again, the function accepts a named list giving for each dataset the name of the column in the corresponding features metadata table to use to generate the groups, passed to the `col_names` argument of the function. It returns a named list where each element is a feature set, and contains the ID of the features in the set:
::: {.targets-chunk}
```{targets make-feature-sets-from-fm}
tar_target(
sets_single_omics,
make_feature_sets_from_fm(
mo_set_complete,
col_names = list(
"snps" = "qtl_type",
"rnaseq" = "de_status",
"metabolome" = "de_status"
)
)
)
```
:::
```{r show-sets-single-omics}
tar_read(sets_single_omics) |>
str()
```
By default, the function keeps the different omics features separate: here for example, there is a "Not DE" set for the transcriptomics dataset, and a "Not DE" set for the metabolomics dataset. When there is no ambiguity as to which omics dataset they represent, the name of the dataset is not included in the set's name. It is possible to merge feature sets with the same name between the omics datasets, by setting the `combine_omics_sets` argument to `TRUE`:
::: {.targets-chunk}
```{targets make-feature-sets-from-fm-merge}
tar_target(
sets_single_omics_merged,
make_feature_sets_from_fm(
mo_set_complete,
col_names = list(
"snps" = "qtl_type",
"rnaseq" = "de_status",
"metabolome" = "de_status"
),
combine_omics_sets = TRUE
)
)
```
:::
```{r show-sets-single-omics-merged}
tar_read(sets_single_omics_merged) |>
str()
```
Now the `not DE` set contains all genes and compounds ID that have not been found differentially expressed.
Alternatively, feature sets can be constructed from an input data-frame. This is useful when looking at pathways or GO annotations, for which features can belong to several sets at once, as this information cannot be stored in a features metadata table. Here, we will use the GO terms assigned to the genes. Note that in this example, the feature sets will contain features from only one omics dataset, but it doesn't have to be; instead the sets could contain features from different omics datasets.
We start by reading in the data-frame of GO terms. We will focus on GO terms relating to biological processes for this example:
::: {.targets-chunk}
```{targets reading-go-file}
list(
tar_target(
rnaseq_go_terms_file,
system.file(
"extdata/transcriptomics_go_annotation.csv",
package = "moiraine"
),
format = "file"
),
tar_target(
rnaseq_go_df,
read_csv(rnaseq_go_terms_file) |>
filter(go_domain == "Biological process")
)
)
```
:::
```{r loading-go-terms}
tar_load(rnaseq_go_df)
dim(rnaseq_go_df)
head(rnaseq_go_df)
```
The data-frame contains one row per gene (`gene_id`) / GO term (`go_id`) pair. In addition, the name of each GO term and the category (biological process, molecular function or cellular component) are indicated in the `go_name` and `go_domain` columns, respectively. When reading in the file, we've restricted the GO terms to only those that correspond to biological processes.
We can turn this data-frame into a list of feature sets with the `make_feature_sets_from_df()` function. The function takes as input the data-frame, as well as the name of the column in the data-frame that contains the features ID (`col_id` argument) and the name of the column in the data-frame that contains the feature sets ID or name (`col_set` argument):
::: {.targets-chunk}
```{targets go-sets}
tar_target(
go_sets,
make_feature_sets_from_df(
rnaseq_go_df,
col_id = "gene_id",
col_set = "go_id"
)
)
```
:::
```{r}
tar_load(go_sets)
length(go_sets)
head(go_sets) |>
str()
```
### Filtering feature sets against the datasets
Before performing any enrichment, we want to make sure that the only features in the sets are the ones that were measured in the datasets. This is part of ensuring that an appropriate background or reference features set is used when performing enrichment (see for example @timmons2015). To this end, the `reduce_feature_sets_data()` function filters a list of feature sets so that only features present in a given `MultiDataSet` object are kept in the sets; it then removes empty feature sets. Note that for this filtering, we are using the `MultiDataSet` object that contains the full omics datasets, before pre-filtering:
::: {.targets-chunk}
```{targets reduce-feature-sets-data}
tar_target(
go_sets_filtered,
reduce_feature_sets_data(go_sets, mo_set_complete)
)
```
:::
The function yields the following message:
```{r show-message-reduce-feature-sets-data}
#| echo: false
temp <- reduce_feature_sets_data(go_sets, mo_set_complete)
```
```{r read-go-sets-filtered}
tar_load(go_sets_filtered)
```
### Checking the number of features assigned to sets
We can then check the number of features that are assigned to at least one set with the `check_feature_sets()` function, to get an idea of the coverage of our annotation. We will restrict the check to the transcriptomics dataset, since the GO annotation only covers genes:
```{r check-feature-sets}
sets_check <- check_feature_sets(
go_sets_filtered,
mo_set_complete,
datasets = "rnaseq"
)
sets_check
```
The function returns a tibble giving the name of the datasets, the number and fraction of features from the dataset that are present in at least one feature set (`n_annotated` and `frac_annotated` columns), the total number of features in the corresponding dataset (`n` column), as well as sentence summarising this information (`message` column) that is useful for reporting. In this example, 70.6% of the genes from the transcriptomics dataset are assigned to at least one GO term. This is important to check as the annotation coverage will impact the quality of the enrichment results that we get.
### Enrichment of latent dimensions
We are now ready to perform an enrichment of the latent dimensions. There are many ways to perform functional enrichment (e.g. see @zhao2023): based on over-representation, feature ranking, or even topology when the feature sets are representing biological pathways. Here, we are not trying to provide an exhaustive list of options; however it is relatively straightforward to extract from an integration method output the features importance or list of selected features for a specific latent dimension, and use it for enrichment with the package of your choice.
In `moiraine`, we use the `gage` package [@luo2009] to perform the enrichment of each latent dimension against a list of feature sets. It uses a two-sample t-test to assess whether features in a set have higher importance scores than features not in the set. The `evaluate_method_enrichment()` function takes as input the results from an integration method in a standardised format (here our `diablo_output` object), and performs an enrichment for each latent dimension using the features sets provided as a list. We can focus on specific datasets or latent dimensions by passing their name to the `datasets` or `latent_dimensions` arguments, respectively. There are a number of parameters that need to be set:
- `use_abs`: by default, the function uses the absolute value of the importance score to perform the enrichment, which allows it to detect features sets in which features are given strong weights that are both positive and negative. If we instead want to search for features sets in which the weight of the features are coordinated (i.e. either all positive or negative), we can set the `use_abs` argument to `FALSE`.
- `min_set_size`: the minimum number of features in a set needed for the set to be considered for enrichment. The default value is 5, i.e. sets with less than 5 features will not be tested for enrichment.
- `add_missing_features` and `mo_data`: when performing multi-omics integration, it is highly likely that the omics datasets have been pre-filtered prior to the analysis, which means that some features will not have received a weight in the results of the integration analysis. This can bias the enrichment results, as these are based on only features that have a weight. So if a features set contains 50 features, but 40 of these were discarded during the pre-filtering step, only the remaining 10 features will be considered when performing an enrichment. This might lead to finding a latent dimension enriched for this set, even though the set might not be biologically relevant since 80% of its features did not even pass the pre-filtering step. To account for that, it is possible to add to the integration results all of the features in a `MultiDataSet` object that are not present in the results. These features will be assigned a weight of 0 to emphasise that they were not found to play any role for the latent dimensions. This is done by setting `add_missing_features` to `TRUE`, and passing to `mo_data` a `MultiDataSet` object that contains the omics datasets that have not been filtered. This way, the background set used in the enrichment will be all features measured in the omics datasets, and not only the ones that passed the pre-filtering step.
- `sets_info_df` and `col_set`: the feature sets are passed as a named list, therefore the only information that we have about these sets are their names as found in the list. To facilitate the interpretation of the enrichment results, it is possible to pass a data-frame giving information about the sets, through the `sets_info_df` argument. This information will be added to the enrichment results table. In that case, the `col_set` argument must also be specified: it takes the name of the column in the data-frame passed to `sets_info_df` that contains the sets ID or named as seen in the list. Here we will create this data-frame as follows:
::: {.targets-chunk}
```{targets go-sets-info}
tar_target(
go_sets_info,
rnaseq_go_df |>
dplyr::select(go_id, go_name) |>
dplyr::distinct()
)
```
:::
```{r show-go-sets-info}
tar_read(go_sets_info) |>
head()
```
We can run the enrichment analysis:
::: {.targets-chunk}
```{targets evaluate-method-enrichment}
tar_target(
diablo_enrichment_results,
evaluate_method_enrichment(
diablo_output,
go_sets_filtered,
datasets = "rnaseq",
use_abs = TRUE,
min_set_size = 10,
add_missing_features = TRUE,
mo_data = mo_set_complete,
sets_info_df = go_sets_info,
col_set = "go_id"
)
)
```
:::
The function returns a table giving for each latent component (`latent_dimension`) and feature set (`set_id`) the enrichment statistics (`stat_mean`), p-value (`pvalue`), adjusted p-value (`adj_pvalue`) as well as the number of features in the set that are present in the integration results (`set_size`). The table is arranged by increasing adjusted p-value, such that the significant results are at the top of the table. It is important to note that the multiple testing correction applied to the p-values occurs independently for each latent dimension, but there is no multiple testing correction performed across the latent dimensions. This can easily be done by the user, and should be considered when reporting the results.
```{r show-diablo-enrichment-results}
tar_read(diablo_enrichment_results) |>
head()
```
In that case, unsurprisingly, neither of the latent components were enriched for any GO term. This makes sense since DIABLO performed a strong feature selection, so very few genes had non-null weights. We can repeat the enrichment with the MOFA integration results to show more interesting results. We will run the enrichment only for the first three factors for conciseness:
::: {.targets-chunk}
```{targets evaluate-method-enrichment-mofa}
tar_target(
mofa_enrichment_results,
evaluate_method_enrichment(
mofa_output,
go_sets_filtered,
datasets = "rnaseq",
latent_dimensions = paste("Factor", 1:3),
use_abs = TRUE,
min_set_size = 10,
add_missing_features = TRUE,
mo_data = mo_set_complete,
sets_info_df = go_sets_info,
col_set = "go_id"
)
)
```
:::
This time, there are a couple of small p-values (even though the adjusted p-values are all quite large):
```{r show-mofa-enrichment-results}
tar_load(mofa_enrichment_results)
mofa_enrichment_results |> head()
```
We can see for example that MOFA factor 3 seems enriched in genes associated with the keratinization GO term (`GO:0031424`). For factor 1 (the factor separating the healthy and infected animals), the keratinization term also yields the smallest p-value, together with several GO terms related to inflammation (even though after correction for multiple testing these p-values are no longer significant):
```{r show-mofa-enrichment-results-factor1}
mofa_enrichment_results |>
filter(latent_dimension == "Factor 1") |>
head()
```
### Plotting features weight for feature sets
We can dig into the results of the enrichment by visualising the distribution of features weight or importance score between features that are in a specific features set vs features that are not in the set. For example, to continue with the MOFA example, let's look at the importance score of genes associated with the GO term `GO:0031424`. We'll only look at factors 1 and 3 for this example:
```{r plot-features-weight-set}
#| fig.height: 8
plot_features_weight_set(
mofa_output,
go_sets_filtered[["GO:0031424"]],
set_name = "GO:0031424 (keratinization)",
features_metric = "importance",
datasets = "rnaseq",
latent_dimensions = paste("Factor", c(1, 3)),
point_alpha = 0.2
)
```
In the resulting plot, we can see for each dataset (in this case, we're only looking at the transcriptomics dataset since we have gene sets) and latent dimensions (rows of the plot) the distribution of features importance for genes in the set vs not in the set. We can see that the GO term investigated contains 20 genes, and in total, 994 genes have been assigned a weight with MOFA (because we have performed some pre-filtering on the transcriptomics dataset prior to running MOFA). Therefore, out of all the genes in the set, 9 of them have no weight (they did not pass the pre-filtering step). We can add these features that did not make it into the MOFA input data in the same way as we did for the enrichment function, namely by setting the `add_missing_features` argument to TRUE and passing the non-filtering `MultiDataSet` object through `mo_data`:
```{r plot-features-weight-set-with-missing}
#| fig.height: 8
plot_features_weight_set(
mofa_output,
go_sets_filtered[["GO:0031424"]],
set_name = "GO:0031424 (keratinization)",
features_metric = "importance",
datasets = "rnaseq",
latent_dimensions = paste("Factor", c(1, 3)),
point_alpha = 0.2,
add_missing_features = TRUE,
mo_data = mo_set_complete
)
```
This is the distribution of values that were used for the enrichment test, but it's not very useful to look at since the vast majority of the genes have a weight of zero.
## Assessing samples clustering
Another aspect that we can evaluate is whether the integration method was able to separate different groups of samples, for example to identify differences between control and treatment samples. This is done by comparing the coordinates of the samples in the space spanned by the latent dimensions created (i.e. the samples score) to a samples label that is obtained from the samples metadata. We can use for that the silhouette score, which is a metric that quantifies how well-defined different clusters are in a given space, i.e. how well-grouped points from a same cluster (here samples group) are, and how separate they are from points from other clusters. First, a silhouette score (or silhouette width) is computed for each individual point (here sample). Then, these values are averaged over all points in a given cluster to yield a cluster silhouette score. The resulting values, both at the point- and cluster- level, can range from -1 to 1. Negative values indicate that the point is grouped with points from another cluster, while positive values indicate that the point is grouped with other points from its cluster. Values close to 0 show that the clustering is ambiguous, while scores close to 1 are indicative of well-defined clusters.
The `compute_samples_silhouette()` function relies on the `cluster::silhouette()` function from the `cluster` package to compute silhouette score. It takes as input the integration results in a standardised format, as well as the `MultiDataSet` object (to access the samples metadata) and the name of the column in the samples metadata table that contains the samples grouping information. In addition, it is possible to specify the distance metric that should be used to compute distances between the samples (through the `distance_metric` argument). By default, the function uses euclidean distances, but other options are available.
Here, we will check how well DIABLO managed to separate the bruising groups:
::: {.targets-chunk}
```{targets compute-samples-silhouette-diablo}
tar_target(
diablo_silhouette,
compute_samples_silhouette(
diablo_output,
mo_set_complete,
"status"
)
)
```
:::
```{r load-diablo-silhouette}
tar_load(diablo_silhouette)
diablo_silhouette
```
The function returns a list with two elements: a tibble of samples silhouette widths (`samples_silhouette`), and a tibble of groups average silhouette (`groups_average_silhouette`). In the first table, the `group` column indicates for each sample the group to which it is assigned (based on the samples metadata), and the `neighbour_group` column indicates the closest other group in the latent dimensions space (which is not very useful in this case since we only have two groups). The samples silhouette width is shown in the `silhouette_width` column. This is very useful to identify "outlier" samples that were not grouped as expected:
```{r diablo-silhouette-outliers}
diablo_silhouette$samples_silhouette |>
filter(silhouette_width < 0)
```
Here, there is one sample from an infected animal that clusters with the control samples in the latent dimensions space.
The average silhouette score for the two disease status groups are both positive, although not close to 1, indicating that DIABLO did a reasonable job of identifying differences between the two groups:
```{r diablo-average-silhouette}
diablo_silhouette$groups_average_silhouette
```
This is expected, since DIABLO is a supervised integration method which tries to maximise the separation between groups of samples.
We can repeat the operation for the MOFA results. By default, the `compute_samples_silhouette()` function uses all latent dimensions to compute distances between samples, but it is possible to specify which latent dimensions should be considered through the `latent_dimensions` argument. For example, we will see how the first three MOFA factors separate the disease status groups:
::: {.targets-chunk}
```{targets compute-samples-silhouette-mofa}
tar_target(
mofa_silhouette,
compute_samples_silhouette(
mofa_output,
mo_set_complete,
"status",
latent_dimensions = paste("Factor", 1:3)
)
)
```
:::
```{r show-mofa-silhouette}
tar_load(mofa_silhouette)
mofa_silhouette$groups_average_silhouette
```
Both groups' silhouette scores are positive, although the BRD group silhouette is not very high. Also, more samples have a silhouette width below 0.
```{r mofa-silhouette-outliers}
mofa_silhouette$samples_silhouette |>
filter(silhouette_width < 0)
```
Note that with MOFA, since the different factors represent different trends within the data, it is possible that using more factors will yield smaller average silhouette groups. For example if we used all constructed factors:
```{r mofa-average-silhouette}
compute_samples_silhouette(
mofa_output,
mo_set_complete,
"status"
)$groups_average_silhouette
```
In such cases, it makes more sense to evaluate the ability of one or two latent dimensions to separate the samples group.
## Recap -- targets list
For convenience, here is the list of targets that we created in this section:
<details>
<summary>Targets list for evaluating the output of an integration method:</summary>
::: {.targets-chunk}
```{targets recap-targets-list}
list(
## Evaluating DIABLO selected features against single-omics results
tar_target(
diablo_selected_vs_singleomics_table,
evaluate_feature_selection_table(
diablo_output,
mo_data = mo_set_complete,
col_names = list(
"snps" = "qtl_type",
"rnaseq" = "de_signif",
"metabolome" = "de_signif"
)
)
),
## Plotting DIABLO features weight against single-omics results
tar_target(
diablo_features_weight_vs_singleomics_plot,
plot_features_weight_covariate(
diablo_output,
mo_data = mo_set_complete,
covariate = list(
"snps" = "qtl_type",
"rnaseq" = "de_status",
"metabolome" = "de_status"
),
remove_null_weight = TRUE
)
),
## Genes GO annotation file
tar_target(
rnaseq_go_terms_file,
system.file(
"extdata/transcriptomics_go_annotation.csv",
package = "moiraine"
),
format = "file"
),
## Genes GO annotation data-frame
tar_target(
rnaseq_go_df,
read_csv(rnaseq_go_terms_file) |>
filter(go_domain == "Biological process")
),
## GO term sets
tar_target(
go_sets,
make_feature_sets_from_df(
rnaseq_go_df,
col_id = "gene_id",
col_set = "go_id"
)
),
## Filtering GO term sets against measured features
tar_target(
go_sets_filtered,
reduce_feature_sets_data(go_sets, mo_set_complete)
),
## Checking genes GO term sets against datasets
tar_target(
go_sets_check,
check_feature_sets(
go_sets_filtered,
mo_set_complete,
datasets = "rnaseq"
)
),
## Table of information about GO terms
tar_target(
go_sets_info,
rnaseq_go_df |>
dplyr::select(go_id, go_name) |>
dplyr::distinct()
),
## MOFA latent components enrichment analysis
tar_target(
mofa_enrichment_results,
evaluate_method_enrichment(
mofa_output,
go_sets_filtered,
datasets = "rnaseq",
latent_dimensions = paste("Factor", 1:3),
use_abs = TRUE,
min_set_size = 10,
add_missing_features = TRUE,
mo_data = mo_set_complete,
sets_info_df = go_sets_info,
col_set = "go_id"
)
),
## Plotting features weight for GO term 'GO:0031424'
tar_target(
mofa_enrichment_go0031424_plot,
plot_features_weight_set(
mofa_output,
go_sets_filtered[["GO:0031424"]],
set_name = "GO:0031424 (keratinization)",
features_metric = "importance",
datasets = "rnaseq",
latent_dimensions = paste("Factor", c(1, 3)),
point_alpha = 0.2
)
),
## Assessing DIABLO samples clustering
tar_target(
diablo_silhouette,
compute_samples_silhouette(
diablo_output,
mo_set_complete,
"status"
)
)
)
```
:::
</details>