-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy path28.fst_vs_dxy.Rmd
114 lines (89 loc) · 4.52 KB
/
28.fst_vs_dxy.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
---
title: "Comparison of Fst and dxy"
bibliography: bibliography.bib
output:
github_document:
pandoc_args: --webtex
---
```{r setup, include=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = FALSE,message=FALSE,warning = FALSE)
library(tidyverse)
library(PopGenome)
library(ggpubr)
```
With the development of whole genome or genome-wide marker sequencing a suite of studies have identified so called "islands of speciation" or "islands of differentiation" [@Turner2005-pj], [@Turner2010-wn]. These genomic islands are especially interesting because the provide a potential explanation for the widespread observation of sympatric or parapatric speciation. Later reanalysis however revealed that many instances where such islands had been observed probably correspond to regions of low diversity not low gene flow [@Cruickshank2014-jr].
Here, we adopt the suggested methodology of [@Cruickshank2014-jr], and implemented in [@Malinsky2015-qq] to check for islands of differentiation. The central idea is that two statistics are required, Fst, which measures relative levels of divergence and which is typically associated with putative barrier loci, and dxy which measures absolute levels of divergence. In genuine instances of heterogenous gene flow extreme values of Fst should also be associated with extreme values of dxy.
To implement this check we used the PopGenome package to calculate Fst and dxy across the genome in windows of 50kb with a window jump size of 10kb. We then classified windows into two categories, "high" (top 5% based on Fst) and "low" (bottom 95%). In theory the "high" category should include the majority of putative barrier loci and should therefore be associated with high dxy. As can be seen in the plot below this is not the case.
```{r}
sample_data <- read_tsv("data/hpc/vcf_file/Adigi.samplenames.txt", col_names = FALSE) %>%
as.matrix() %>%
t() %>%
as.tibble() %>%
filter(row_number()>=10) %>%
extract(V1,"prefix","([^_]*)",remove = FALSE) %>%
mutate(pop = case_when(
prefix=="AI" ~ "IN",
prefix=="BR" ~ "IN",
prefix %in% c("RS1","RS2","RS3") ~ "SO",
prefix == "AR" ~ "NO"
))
```
```{r}
chrdata <- read_delim("data/genome/ncbi_lengths.txt",col_names = c("scaffold","length")) %>%
arrange(desc(length)) %>%
as.data.frame()
```
```{r}
get_fst_dxy <- function(chr,chrlen){
vcf <- readVCF("data/hpc/vcf_file/Adigi.v2.filtered.vcf.gz",10000,tid = chr,frompos=1, topos=chrlen, samplenames = sample_data$V1)
populations <- split(sample_data$V1,sample_data$pop)
vcf <- set.populations(vcf,populations,diploid = T)
window_size <- 50000
window_jump <- 10000
window_start <- seq(from=1,to=chrlen,by=window_jump)
window_stop <- window_start + window_size
window_start <- window_start[which(window_stop < chrlen)]
window_stop <- window_stop[which(window_stop < chrlen)]
windows <- data.frame(start = window_start, stop = window_stop,
mid = window_start + (window_stop-window_start)/2)
vcf_sw <- sliding.window.transform(vcf, width = 50000, jump = 10000, type = 2)
vcf_sw <- diversity.stats(vcf_sw, pi = TRUE)
vcf_sw <- F_ST.stats(vcf_sw, mode = "nucleotide")
fst <- [email protected]_ST.pairwise %>% t()
dxy <- get.diversity(vcf_sw, between = T)[[2]]/50000
fst_tib <- fst %>% as.tibble() %>% add_column(pos=windows$mid)
dxy_tib <- dxy %>% as.tibble() %>% add_column(pos=windows$mid)
fst_tib %>% pivot_longer(-pos,names_to = "pair",values_to = "Fst") %>%
left_join(pivot_longer(dxy_tib,-pos,names_to = "pair",values_to = "dxy")) %>%
add_column(scaffold = chr)
}
bigchr <- chrdata %>% head(n=20) %>% as.tibble()
if ( !file.exists("cache/fst_dxy.rds")){
fst_dxy <- map2_dfr(bigchr$scaffold,bigchr$length,get_fst_dxy)
write_rds(fst_dxy,"cache/fst_dxy.rds")
} else {
fst_dxy <- read_rds("cache/fst_dxy.rds")
}
```
```{r}
fst_top5pc <- fst_dxy %>% arrange(desc(Fst)) %>% head(n=floor(nrow(fst_dxy)/20)) %>% tail(n=1) %>% pull(Fst)
fst_dxy %>%
filter(Fst>0) %>%
filter(dxy>0) %>%
mutate(category = case_when(
Fst > fst_top5pc ~ "high",
Fst <=fst_top5pc ~ "low"
)) %>%
mutate(pair = case_when(
pair=="pop1/pop2" ~ "IN/NO",
pair=="pop1/pop3" ~ "IN/SO",
pair=="pop2/pop3" ~ "NO/SO"
)) %>%
pivot_longer(c("Fst","dxy"),names_to = "stat", values_to = "value") %>%
ggplot(aes()) +
geom_violin(aes(y=value,x=category,fill=category)) +
facet_grid(stat~pair, scales = "free") +
theme_pubclean() +
ylab("") + theme(legend.position = "none") + xlab("")
#ggsave("figures/fig-S16B.png",width = 7,height = 5)
```