-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy path07.demographic_history.Rmd
340 lines (265 loc) · 15.4 KB
/
07.demographic_history.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
---
title: "Demographic history with SMC"
output:
github_document:
toc: TRUE
toc_depth: 3
bibliography: bibliography.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, fig.retina = 2)
library(tidyverse)
library(ggsci)
```
## SNP data for Sequentially Markovian Coalescent (SMC) analyses
We only used variants from scaffold with a length greater than 1Mbp in which account for approximately 75% of genome size (N75=983,972, first 142 scaffolds). Additionally, different methods have different limitations in number of samples, we used:
- Three individuals from each location in **PSMC analysis**.
- All samples (unphased, 74 indvs) except the mislabeled sample from each location in **SMC++**.
<!-- - Eight haplotypes (4 samples) from each reef in **MSMC**. -->
## 0. Mutation rate estimate
The per generation mutation rate of *A.digitifera* estimated based on a divergence time of 15.5 Million years to *A. tenuis* is 2.89e-8 in [@Mao2018]. A more recent study that used numerous fossil calibration points and 818 orthologous genes points toward a much older divergence time of 50Mya [@Shinzato2020]. Given the inherent uncertainties in these estimates we chose a value of 37.5Mya as this lies at the midpoint between the youngest Acropora fossils (25 Mya) [@Santodomingo2015-yl] and more ancient estimates from whole genome analyses [@Shinzato2020]. To translate these divergence times to mutation rates we updated calculations in [@Mao2018] as follows;
```bash
# Original calculation in Mao et al
#(79427941/(79427942+363368171)/(2 × 15.5)) x 5 × 10−6 = 2.89 × 10−8
#calibrate divergence time to 37.5
(79427941/(79427942+363368171)/(2 × 37.5)) x 5 × 10−6 = 1.20 x 10-8
```
This gave a per generation mutation rate of 1.2e-8 that we used for all analyses.
## 1. PSMC analysis
We first used bcftools mpileup and call functions to get sample-specific SNP data sets, in which we removed sites with mean map and base quality < 30. Then appied psmc to each sample with 64 atomic time (-p 4+25*2+4+6).
```bash
bcftools mpileup -Q 30 -q 30 -C 50 -f {reference}.fa -r {chr} {sample}.bam | \
bcftools call -c | \
vcfutils.pl vcf2fq -d 10 -D 80 -Q 30 > {sample}_{chr}.fq
cat {sample}_*.fq > {sample}_consensus.fq
#fq2psmcfa
fq2psmcfa {sample}_consensus.fq > {sample}.psmcfa
## run psmc
psmc -p 4+25*2+4+6 -o {sample}.psmc {sample}.psmcfa
```
To perform bootstrapping, we run splitfa to split long scaffold to shorter segments and applied psmc with `-b` option to allow it randomly sample with replacement from these segments for 100 times.
```bash
splitfa {sample}.psmcfa > split.{sample}.psmcfa
psmc -b -p 4+25*2+4+6 -o bootstrap/round-{n}.psmc split.{sample}.psmc.fa # n=1:100
cat bootstrap/round-*.psmc > {sample}_bs_combined.psmc
```
```{r}
##### this is how I made the plot in R, however, I include the final plot directly here to avoid including so many temp files, which is not in cloudstor.
#fn <-list.files(path="data/hpc/demography/psmc/psmc_plot_temp", pattern = "final_bp.*.txt",full.names = T)
#ggplot(aes(x=X1,y=X2)) + geom_step()+ scale_x_log10()
#read_single_psmc <- function(.x) {
# bs_n <- gsub("data/hpc/demography/psmc/psmc_plot_temp/final_bp.(\\d+).txt","\\1",.x) %>% as.integer()
# pop <- case_when(bs_n<=100~"inshore", bs_n>100 && bs_n<=201 ~ "northoffshore", bs_n>=202~"southoffshore")
# tb<-read_tsv(.x,col_names = F) %>% select(X1,X2) %>% add_column(pop=pop,bs_n=bs_n)
# tb
#}
#df <- map_df(.x=fn, .f = read_single_psmc)
#psmc_plot<-ggplot(df,aes(x=X1,y=X2,group=bs_n)) + geom_step(aes(color=pop),alpha=0.3) + geom_step(data=df %>% filter(bs_n %in% c(0,101,202)), color="black", size=.8) +
# scale_x_log10(breaks=c(1e+4,1e+5,1e+6,1e+7), labels=c("10kya","100kya","1mya","10mya"), limits=c(1e4,3e6)) +
# scale_color_startrek(guide=F) + theme_test() + labs(x="",y="")
#psmc_plot
#ggsave(psmc_plot, filename = "figures/psmc.jpg", width = 6, height = 4)
knitr::include_graphics("data/r_essentials/psmc.jpg")
```
**Figure1: The demographic history inferred by PSMC for inshore, north offshore, and south offshore**
## 2. SMC++ analysis
Firstly, a single VCF file for each scaffold was extracted and indexed.
```bash
bcftools view -r {chr} -Oz -o {chr}.vcf.gz Adigi.v2.filtered.74.vcf.gz
tabix {chr}.vcf.gz
```
To distinguish invariant positions with missing data, we mask genome regions where:
a) with low or super high coverage across all samples (less than three reads covered or greater than 3000);
b) We generated the genome mask files created using Heng Li's SNPable tools and extracted the ambiguous positions (c=="0" or c=="1").
```bash
samtools depth -r {chr} -aa -f bamfiles.txt | \
awk '{sum=0; for(i=3; i<=NF; i++) {sum+=$i}; print $1"\t"$2"\t"sum }' | \
awk '{if($3<3 || $3>3000) print $1"\t"$2"\t"$2}' | \
bedtools merge -i stdin | bgzip > {chr}.low_masked.bed.gz
zcat {chr}.low_masked.bed.gz Adigi_{chr}.mask.bed.gz | bedtools sort | bedtools merge |bgzip > {chr}.masked.bed.gz
```
Mask files of all scaffolds were concatenated together and we keep blocks with a length greater than 2Kb. Next, vcf files were converted into a SMC format file using `vcf2smc` in which we send in a vcf file of one scaffold and specify a list samples in the population. All smc+ input files were used together in a single run by varying the choice of distinguished individuals(8 from each pop) and resulting in a composite likelihood estimate.
```bash
smc++ vcf2smc -d {sample} {sample} \
--mask all.masked.bed.gz {chr}.vcf.gz {chr}_{sample}.smc.gz {chr} \
POP:$(cat {sample}.txt | paste -s -d ',')
```
smc++ `estimate` command was ran to estimate population history for each population.
```bash
## cubic spine version
smc++ estimate --cores 30 -o estimate --base {pop} --spline cubic \
--timepoints 20 200000 --em-iterations 50 --thinning 3000 --knots 10 1.2e-8 {pop}.*.smc.gz
## piecewise(default) spine version
smc++ estimate --cores 30 -o estimate --base {pop} \
--em-iterations 50 --timepoints 20 200000 --thinning 3000 \
--knots 40 1.2e-8 {pop}.*.smc.gz
```
Eventually, we chose the piecewise spline for better resemble to PSMC plot and more details. However, the result file are available for plotting ([csv](data/hpc/demography/estimate_em50_cubic_k10/em50_cubic_k10.csv),[cubic spline plot](data/hpc/demography/estimate_em50_cubic_k10/em50_cubic_k10.png)).
### 2.1 Demographic history of three reefs
```{r plot-seting}
mu <- 1.2e-8
gen <- 5
#dtime = data.frame(name=c("LGM","Holocene"), start=c(1.2e+4,0), end=c(1.1e+5,1.2e+4))
#smc <- read_csv("data/hpc/demography/estimate_em50_cubic_k10/em50_cubic_k10.csv")
smc_main <- read_csv("data/hpc/demography/estimate_em50_k40/em50_k40.csv") %>% select(label,x,y) %>% add_column(type="main")
xbreaks <- c(1e+2*1:9, 1e+3*1:9 , 1e+4*1:9 , 1e+5*1:9 , 1e+6*1:9 , 1e+7)
xlabels <- as.character(xbreaks)
xlabels[!(xbreaks%in%c(1e+3,1e+4,1e+5,1e+6,1e+7))] <- ''
xlabels[xbreaks%in%c(1e+3,1e+4,1e+5,1e+6,1e+7)] <-c("1kya","10kya","100kya","1mya","10mya")
ybreaks <- c(1e+3*2:9 , 1e+4*1:9 , 1e+5*1:9 , 1e+6*1:2)
ylabels <- as.character(ybreaks)
ylabels[!(ybreaks%in%c(1e+4,1e+5,1e+6))] <- ''
ylabels[ybreaks%in%c(1e+4,1e+5,1e+6)] <- c("1","10","100")
```
```{r smc-plot, fig.align='left',fig.width=7,fig.height=5}
ggplot(smc_main) + geom_line(aes(x=x,y=y,color=label),size=0.8,show.legend = FALSE) +
#geom_rect(data=dtime,aes(xmin=start,xmax=end,fill=name),ymin=-Inf, ymax=Inf,alpha=0.2,show.legend = FALSE) +
scale_fill_manual(values = c("yellow","grey")) +
#geom_text(data=dtime, aes(x=c(3e+4,2e+3),y=2e+6,label=name), size=4, fontface="bold") +
scale_x_log10(breaks=xbreaks, labels=xlabels)+
scale_y_log10(breaks=ybreaks,labels=ylabels,limits=c(2e+3,2e+6)) +
scale_color_startrek() + theme_test() +
labs(x=expression(paste("Years Ago (g=5, ",mu, "=1.20e-8)")),
y=expression(paste("Effective Population Size ",N[e]," (x", 10^4 ,")"))) +
ggtitle("Historial effective population size estimated by SMC++")
```
**Figure 2: The effective population size estimate using SMC++ approach. The red, blue, and green lines represent the results of inshore, offshore north, and offshore south respectively.**
To do bootstrapping, we performed 20(due to computational constraint) bootstrap estimates in SMC++ with fragmented and re-sampled scaffolds (generate by script [bootstrap_smcpp.py](scripts/bootstrap_smcpp.py)). SMC++ `estimate` was ran with these data sets with the same parameters as before.
```{r}
smc_main <- read_csv("data/hpc/demography/estimate_em50_k40/em50_k40.csv") %>% select(label,x,y) %>% add_column(type="main")
if ( !file.exists("data/r_essentials/smc_bs.rds")){
read_bootstrap <- function(pop, n) {
read_csv(paste0("data/hpc/demography/smc_bootstrap/",pop,"_",n,".csv" )) %>% select(label,x,y) %>% add_column(n=paste0("bs",n))
}
smc_bs <- rbind(map_df(1:20, ~read_bootstrap(pop="inshore",n=.x)),
map_df(1:20, ~read_bootstrap(pop="northoffshore",n=.x)),
map_df(1:20, ~read_bootstrap(pop="southoffshore",n=.x)))
write_rds(smc_bs,"data/r_essentials/smc_bs.rds")
} else {
smc_bs <- read_rds("data/r_essentials/smc_bs.rds")
}
```
```{r bootstrap-plot}
#smc_bootstrap_plot
smc_bootstrap_plot <- ggplot() + geom_line(data=smc_main,aes(x=x,y=y,color=label),size=1,show.legend = FALSE) +
geom_line(data = smc_bs, aes(x=x,y=y,color=label,group=interaction(label,n)),alpha=0.1,show.legend = FALSE)+
scale_x_log10(breaks=xbreaks, labels=xlabels)+
scale_y_log10(breaks=ybreaks,labels=ylabels,limits=c(2e+3,2e+6)) +
scale_color_startrek() + theme_test() +
labs(x=expression(paste("Years Ago (g=5, ",mu, "=1.20e-8)")),
y=expression(paste("Effective Population Size ",N[e]," (x", 10^4 ,")")))
```
### 2.2 Divergence time among populations
The smc++ can estimate joint demography and speculate the splitting time between pair of populations. To do this, we first create a joint frequency spectrum for every pair of populations using `vcf2smc`.
```bash
smc++ vcf2smc {chr}.vcf.gz {pop1}_{pop2}.smc.gz {chr} POP1:{sampleid},.. POP2:{sampleid},..
```
Next,we ran smc++ `split` with same parameters we used in `estimate`.
```bash
smc++ split -o split --base {pop1}_{pop2} \
--timepoints 20 200000 --em-iterations 50 --thinning 3000 \
{pop1}.final.json {pop2}.final.json {pop1}_{pop2}.smc.gz
```
Eventually, we made the pairwise plot using `smc plot` tool with a five year generation time.
```{r split, fig.show="hold", out.width="33%"}
library(png)
library(knitr)
#split_plot_cubic <- list.files(path = "figures/smc_split_cubic",pattern = "*.split.png" ,full.names = TRUE)
#include_graphics(split_plot_cubic)
split_plot_pw <- list.files(path = "figures/smc_split_piecewise",pattern = "*.split.png" ,full.names = TRUE)
include_graphics(split_plot_pw)
```
**Figure 3: Three figures above show the joint demography between any two populations (IN: Inshore, NO: North Offshore, SO: South Offshore) and the estimated splitting time (blue vertical line)**
## Add climate data in plot
```{r}
climate_data <- read_tsv("data/nature07158-s2.txt",skip = 14)
library(cowplot)
smc_plot <- ggplot() + geom_line(data=smc_main,aes(x=x,y=y,color=label),size=1,show.legend = FALSE) +
geom_line(data = smc_bs, aes(x=x,y=y,color=label,group=interaction(label,n)),alpha=0.1,show.legend = FALSE)+
scale_x_log10(breaks=c(1e+3,1e+4,1e+5,1e+6), labels=c("1kya","10kya","100kya","1mya"), limits=c(1e+3,1.3e+6),expand = c(0, 0))+
scale_y_log10(breaks=ybreaks,labels=ylabels,limits=c(2e+3,2e+6)) +
scale_color_startrek() + theme_test() +
labs(x=expression(paste("Years Ago (g=5, ",mu, "=1.20e-8)")),
y=expression(paste("Effective Population Size ",N[e]," (x", 10^4 ,")")))
cp <- ggplot(climate_data %>% filter(Time>1),aes(x=Time*1e3,y=-Ice_tot)) +
geom_line() +
scale_x_log10(breaks=c(1e+3,1e+4,1e+5,1e+6), labels=c("1kya","10kya","100kya","1mya"), limits=c(1e+3,1.3e+6),expand = c(0, 0)) +
theme_minimal() +
ylab("Sea Level") + xlab("Years Ago")
plot_grid(smc_plot,cp, ncol = 1, align = "hv", axis = "lr", rel_heights = c(0.3,0.3,0.3))
```
## Evaluating the uncertainty in mutation rate and generation time.
In SMC++, we re-ran the estimate using three mutation rates/generation time (3, 5, and 7 years) and also the splitting time.
- 1.2e-8 calculated in this study.
- 1.86e-8 from Cooke et al 2020
- 2.89e-8 from Matz et al 2018
```{r}
if ( !file.exists("data/r_essentials/07_main_data.rds")){
all_csv_files <- list.files("data/hpc/demography/smc++_varying_mu",recursive = TRUE,"*.csv",full.names = TRUE)
read_main <- function(path){
pieces <- basename(path) %>% str_match("(mu[123])_main_(g[357])")
read_csv(path) %>%
add_column(mu=pieces[1,2]) %>%
add_column(g=pieces[1,3])
}
main_data <- all_csv_files[grepl(all_csv_files,pattern = "main")] %>%
map_dfr(read_main)
write_rds(main_data,"data/r_essentials/07_main_data.rds")
} else {
main_data <- read_rds("data/r_essentials/07_main_data.rds")
}
botmin <- main_data %>%
group_by(label,mu,g) %>%
mutate(boty=min(y)) %>%
mutate(botx= ifelse(y==boty,x,NA)) %>%
na.omit() %>%
summarise(boty=first(boty),botx=mean(botx))
#ungroup() %>% group_by(label) %>% summarise(xmin=min(botx),xmax=max(botx)) %>%
#add_column(y=10^c(3.3,3.4,3.5))
g.labs <- c("generation: 3 years", "generation: 5 years", "generation: 7 years")
names(g.labs) <- c("g3", "g5", "g7")
bot_plot <- botmin %>%
unite(col = "gid",mu,g,remove = FALSE) %>%
ggplot(aes(x=botx,y=label)) +
geom_point(aes(group=gid,color=mu)) +
geom_line(aes(group=gid,color=mu)) +
scale_color_hue(labels = c("1.2e-8", "1.86e-8","2.89e-8")) +
facet_wrap(~g, ncol = 1,labeller = labeller(g = g.labs)) +
labs(x="Years ago",y ="", color="Mutation rate") +
theme(legend.position = "none",plot.margin = margin(0,0,0,0))
```
```{r}
if (!file.exists("data/r_essentials/07_split_data.rds")){
all_csv_files <- list.files("data/hpc/demography/smc++_varying_mu",recursive = TRUE,"*.csv",full.names = TRUE)
read_split <- function(path){
pieces <- path %>% str_match("(mu[123])/(.+)_(g[357])\\.csv")
read_csv(path) %>%
add_column(mu=pieces[2],split=pieces[3]) %>%
add_column(g=pieces[4])
}
split_data <- all_csv_files[grepl(all_csv_files,pattern = "shore")] %>%
map_dfr(read_split)
write_rds(split_data,"data/r_essentials/07_split_data.rds")
} else {
split_data <- read_rds("data/r_essentials/07_split_data.rds")
}
split_time <- split_data %>%
group_by(label,mu,g,split) %>%
mutate(max_x=max(x)) %>% ungroup() %>%
group_by(split,mu,g) %>% summarise(t=min(max_x))
split_plot <- split_time %>%
unite(col = "gid",mu,g,remove = FALSE) %>%
ggplot(aes(x=t,y=split)) +
geom_point(aes(group=gid,color=mu)) +
geom_line(aes(group=gid,color=mu)) +
scale_color_hue(labels = c("1.2e-8", "1.86e-8","2.89e-8")) +
scale_y_discrete(labels=c("split. IN-NO","split. IN-SO","split. NO-SO")) +
facet_wrap(~g, ncol = 1,labeller = labeller(g = g.labs)) +
labs(x="Years ago",y ="", color="Mutation rate") +
theme(plot.margin = margin(0,0,0,0))
plot_grid(bot_plot,split_plot, nrow = 1, label_size = 12,align = "h", axis = "b", rel_widths = c(0.37,0.63), labels = c("A","B"))
#ggsave("fig-s10.pdf",width = 6.75,height = 3.6)
#ggsave("fig-s10.jpg",width = 6.75,height = 3.6)
```
Figure 3: Variation in the estimated timing of key demographic events under different mutation rates and generation times. All estimates were obtained using SMC++. A. the bottleneck time of inshore (IN), north offshore (NO), and south offshore (SO) A. digitifera populations. B. the split time between each pair of populations.
## References