forked from EpiCoronaHack/Hackathon2020
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Tianjin.Rmd
337 lines (243 loc) · 13.4 KB
/
Tianjin.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
---
title: "Tianjin"
author: "Caroline Colijn"
date: "25/02/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(survminer)
library(survival)
library(ggplot2)
library(icenReg)
library(igraph)
library(visNetwork)
library(stringr)
options(digits=3)
```
## Data
Thanks to Dongxuan Chen and Louxin Zhang. These data are from three main sources:
* source1: http://wsjk.tj.gov.cn/col/col87/index.html#!uid=259&pageNum=1 (Tianjin health commission official website, for daily announcements)
* source2: https://weibo.com/u/2967529507 (Jinyun News, Tianjin offical local media weibo account, for patient symptom onset reference)
* source3: https://m.weibo.cn/status/IrrHI1FHm?jumpfrom=weibocom (another Tianjin local media weibo link, for mall cluster reference)
```{r}
tdata=read.csv("Tianjin135casesFeb22.csv",na.strings = "", stringsAsFactors = F)
tdata$symptom_onset=as.Date(tdata$symptom_onset, format = "%d/%m/%Y")
tdata$start_source=as.Date(tdata$start_source, format = "%d/%m/%Y")
tdata$end_source=as.Date(tdata$end_source,format = "%d/%m/%Y" )
tdata$confirm_date=as.Date(tdata$confirm_date,format = "%d/%m/%Y" )
str(tdata)
```
## Incubation period
The incubation period is the time between exposure and the onset of symptoms. We estimate this directly from the stated start and end times for cases' exposure windows. Because it is explicitly about the symptom onset, we remove those who don't have symptom onset defined. These are a small minority of cases and the alternative would be to impute their symptom onset time using the others' delay to confirmation time. For now, we remove them. Then, if no other end time for the exposure is given or if the end of the exposure time is after the time of symptom onset, set the last exposure time to the symptom onset time. This is because they must have been exposed before symptom onset. If no other start time is given, they must have been exposed since the start of the outbreak (Dec 1, 2019). These give us the maximum and minimun incubation times.
```{r}
goodii=which(!is.na(tdata$symptom_onset))
tdata$end_source[which(is.na(tdata$end_source))]=tdata$symptom_onset[which(is.na(tdata$end_source))] # if no end exposure: set to symptom onset
tdata$end_source = pmin(tdata$end_source, tdata$symptom_onset) # if end exposure after onset, set to onset
tdata$start_source[which(is.na(tdata$start_source))]=as.Date("2020-01-05") # start date
tdata$maxIncTimes=tdata$symptom_onset-tdata$start_source
tdata$minIncTimes = tdata$symptom_onset-tdata$end_source
tdata$maxIncTimes
tdata$minIncTimes
tdata$maxIncTimes[27] = 50 # for some reason this was coming up negative
```
We use survival analysis in the icenReg package to make parametric estimates, and we use the regular survival package to estimate the time to onset of symptoms.
```{r}
ggsurvplot(
fit=survfit(Surv(tdata$minIncTimes, tdata$maxIncTimes, type="interval2")~1, data=tdata),
xlab="Days",
ylab = "Overall probability of no symptoms yet")
```
The median is about 8 days. For a parametric estimate we remove any remaining NAs and use interval censoring, because we know only that exposure was some time between the minimum and maximum possible values.
```{r}
reddata=tdata[which(!is.na(tdata$minIncTimes)),]
myfit = ic_par(Surv(reddata$minIncTimes, reddata$maxIncTimes,type="interval2")~1, data = reddata,dist="weibull")
myfit_gamma<- ic_par(Surv(reddata$minIncTimes, reddata$maxIncTimes, type="interval2") ~ 1, data = reddata, dist = "gamma")
summary(myfit_gamma)
myfit_lnorm = ic_par(Surv(reddata$minIncTimes, reddata$maxIncTimes, type="interval2") ~ 1, data = reddata, dist = "lnorm")
summary(myfit_lnorm)
```
The log of the shape parameter is `r myfit$coefficients[1]` $\pm$ `r sqrt(myfit$var[1,1])`, which gives a shape parameter of `r exp(myfit$coefficients[1])` with a 1.96-sd (in the log) giving the range (`r exp(myfit$coefficients[1]-1.96*sqrt(myfit$var[1,1]))`, `r exp(myfit$coefficients[1]+1.96*sqrt(myfit$var[1,1]))`).
Similarly the log scale parameter is `r myfit$coefficients[2]` $\pm$ `r sqrt(myfit$var[2,2])`, which gives a scale parameter of `r exp(myfit$coefficients[2])` with a one-sd (in the log) giving the range (`r exp(myfit$coefficients[2]-1.96*sqrt(myfit$var[2,2]))`, `r exp(myfit$coefficients[2]+1.96*sqrt(myfit$var[2,2]))`).
If the earliest date of exposure is Dec 1 instead of Dec 31 these are: shape 2.28 (1.8, 2.8), scale 10.0 (8.86, 11.38). If it's Jan 5 they are shape: 2.28 (1.86, 2.84) and scale 9.79 (8.69, 11.03).
```{r}
interqs=getFitEsts(myfit, newdata = NULL, p=c(0.025,0.05, 0.25, 0.5, 0.75,0.95,0.975)) #
interqs
interqs_gamma <- getFitEsts(myfit_gamma, newdata=NULL, p
=c(0.025, 0.05, 0.25, 0.5, 0.75, 0.95, 0.975))
interqs_gamma
interqs_lnorm <- getFitEsts(myfit_lnorm, newdata=NULL, p
=c(0.025, 0.05, 0.25, 0.5, 0.75, 0.95, 0.975))
interqs_lnorm
```
The median is `r interqs[4]` days and the 0.95 at `r interqs[6]`. These are longer than my estimate from the line list data, which matched Backer et al's analysis of the same data. Here is a plot of the estimated distribution together with the empirical survival curve from the data.
Information for a table for the paper:
```{r}
# weibull shape scale then quantiles:
c( exp(myfit$coefficients[1]), scale = exp(myfit$coefficients[2]) , getFitEsts(myfit, newdata = NULL, p =c(0.025, 0.5, 0.975)))
# gamma shape, scale then quants
c( shape = exp(myfit_gamma$coefficients[1]), scale = exp(myfit_gamma$coefficients[2]), getFitEsts(myfit_gamma, newdata=NULL, p =c(0.025, 0.5, 0.975)))
# lnorm meanlog, sdlog and quants
c( meanlog = myfit_lnorm$coefficients[1], sdlog = exp(myfit_lnorm$coefficients[2]), getFitEsts(myfit_lnorm, newdata=NULL, p =c(0.025, 0.5, 0.975)))
```
```{r}
days=seq(0,20,by=0.05)
density=dweibull(days, shape = exp(myfit$coefficients[1]), scale = exp(myfit$coefficients[2]))
ggs = ggsurvplot(
fit=survfit(Surv(tdata$minIncTimes, tdata$maxIncTimes, type="interval2")~1, data=tdata),
xlab="Days", ylab = "Overall probability of no symptoms yet")
pdata <- data.frame(days=rep(days,3),
fitsurv=c(1-pweibull(days, shape = exp(myfit$coefficients[1]), scale = exp(myfit$coefficients[2])),
1-pgamma(days, shape = exp(myfit_gamma$coefficients[1]), scale = exp(myfit_gamma$coefficients[2])),
1-plnorm(days, meanlog = myfit_lnorm$coefficients[1], sdlog = exp(myfit_lnorm$coefficients[2]))),distn=c(rep("Weibull", length(days)), rep("Gamma",length(days)), rep("Lognorm", length(days)) )) # i know, i know...
tmp=data.frame(days=days, fitsurv=1-pweibull(days, shape = exp(myfit$coefficients[1]),
scale = exp(myfit$coefficients[2])))
ggs$plot + geom_line(data = tmp, aes(x = days, y = fitsurv))
# ggsave(filename = "inc_Tianjin.pdf", width = 8, height = 6)
```
## Serial interval
We will estimate the serial interval using the 'interval case to case' approach given in Vink et al (https://academic.oup.com/aje/article/180/9/865/2739204).
The dataset has quite a few instances where a putative infector or contact is known. These are listed in the 'Infection_source' column. We first make a graph in which nodes are individuals and edges are present from cases listed as possible sources, to the cases for whom they are possible sources.
```{r}
mynodes = tdata$case_id
edges = data.frame(from=mynodes[9],to=mynodes[21],stringsAsFactors = F ) # i read this one manually
for (id in 1:nrow(tdata)) {
tonode=tdata$case_id[id]
fromnodes=str_extract_all(tdata$Infection_source[id], "TJ\\d+", simplify = T)
if (length(fromnodes)>0) {
for (k in 1:length(fromnodes)) {
edges=rbind(edges, c(fromnodes[k], tonode))
}
}
}
head(edges)
edges=edges[-1,]
edges=edges[-which(is.na(edges[,1])),] # NAs arose from a few empty entries for Infection_source
```
From this edge list we can use visNetwork to visualise the graph. Colours are from the infection source column (but we should have a better colour scheme, like date of symptom onset).
```{r}
edges$arrows="to"
nodes = data.frame(id=tdata$case_id, label=tdata$case_id,
group=tdata$Infection_source)
visNetwork(nodes,edges)
```
The interval case to case (ICC) data are the times between the (presumed) index case for a small cluster and the other cases in the cluster. The Vink et al approach allows these intervals to be one of 4 types, and estimates the serial interval and the probability of each type. To extract ICC intervals, we let the clusters be the components of the graph, and we let the presumed index case be the first to develop symptoms. For each cluster, we subtract the index cases' symptom time from the symtom times of the rest of the cluster (or just the first few; it turns out that the estimate is not sensitive to this). This results in a list of time intervals between symptom onset in presumed index cases and symptom onset in other cases in the same cluster (graph component).
First construct the graph
```{r}
#serialdata=edges # REMOVE?
#serialdata$symps_from = tdata$symptom_onset[match(edges$from, tdata$case_id)]
#serialdata$symps_to=tdata$symptom_onset[match(edges$to, tdata$case_id)]
tgraph = graph_from_edgelist(as.matrix(edges[,1:2]), directed = FALSE)
ccs=components(tgraph)
tdata$component=vapply(tdata$case_id, function(x)
{ if (x %in% names(ccs$membership)) { return(ccs$membership[match(x, names(ccs$membership))])
} else {
return(NA)}}, FUN.VALUE = 3)
```
Extract ICC interval data: a function
```{r}
getICCs <- function(thisdata, ccs, K, orderby= "onset" ) {
iccs=1
for (n in 1:max(ccs$membership)) {
mycases = which(thisdata$component==n)
if (orderby == "onset")
{ myonsets = sort(thisdata$symptom_onset[mycases])[1:min(K, length(mycases))]}
if (orderby == "exposure") {
myonsets =thisdata$symptom_onset[mycases][order(thisdata$end_source[mycases])][1:min(K,length(mycases))]
}
iccs =c(iccs, myonsets[-1]-myonsets[1])
}
return(iccs[-1])
}
```
```{r}
stdata = tdata[which(!is.na(tdata$symptom_onset)),]
icc3 = getICCs(stdata,ccs,3)
icc4 = getICCs(stdata,ccs,4)
icc5 = getICCs(stdata,ccs,5)
icc6 = getICCs(stdata,ccs,6)
icc_expose = getICCs(stdata, ccs, 4, orderby ="exposure")
```
Perform the estimate using the Vink et al method, and display the result:
```{r}
source("TianjinSI_VinkWallinga_CC.R")
myestimate = serial_mix_est(data=icc4, N=100, startmu=10, startsig =4)
myestimate
myest3 = serial_mix_est(data=icc3, N=100, startmu=10, startsig =4)
myest4 = serial_mix_est(data=icc4, N=100, startmu=10, startsig =4)
myest5 = serial_mix_est(data=icc5, N=100, startmu=10, startsig =4)
myest6 = serial_mix_est(data=icc6, N=100, startmu=10, startsig =4)
myest_exp= serial_mix_est(data=icc_expose, N=100, startmu=10, startsig =4)
mm=rbind(myestimate,myest3, myest4, myest5,myest6, myest_exp)
colnames(mm)=c("mu","sig")
mm=as.data.frame(mm)
mm$K=c(9, 3, 4, 5, 6, 4)
mm$ordering = c("Onset","Onset","Onset","Onset","Onset","LastExposure")
print(mm[,c(4,3,1,2)])
```
```{r,eval=FALSE}
days = seq(from=0, to=10, by=0.1)
density= dnorm(days, mean = myestimate[1], sd = myestimate[2])
ggplot(data=data.frame(days=days, density=density), aes(x=days,y=density)) + geom_line() + ggtitle("ICC estimate of the Tianjin cluster serial interval")
```
Bootstrap analysis - have left off in the Rmd because it takes time.
```{r, eval=FALSE}
# bootstrap analysis
Nboot=100
bestimates=myestimate
for (kk in 1:Nboot) {
bdata = sample(x=icc4, size = length(iccall), replace = T)
bestimates = rbind(bestimates, serial_mix_est(data=bdata, N=50, startmu=10, startsig =4))
}
```
```{r,eval=FALSE}
mean(bestimates[,1]) # mean of the mean serial intervals
median(bestimates[,1])
mean(bestimates[,2]) # sd of the sd serial intervals
sd(bestimates[,1]) # sd of the sd serial intervals
```
```{r, eval=FALSE}
hist(bestimates[,1],breaks = 10)
bootdf=data.frame(mu=bestimates[,1], sig=bestimates[,2])
ggplot(bootdf, aes(x=mu, y=sig))+geom_point()
ggplot(bootdf, aes(x=mu))+geom_histogram()
ggsave(file = "bootst_SI_tianjin.pdf", width = 6, height = 4)
```
The direct infections from case 34 according to the figure at https://mp.weixin.qq.com/s/x4HBXGFw5vnWU7nXXdyWVg.
```{r,eval=FALSE}
tdata$direct34=FALSE
contacts_fig = c(43,37,53,83,89,131,124,48,71,51,57,58,
66,68,50,73,74,87,78,80,36,92,110,111)
contacts_id=paste("TJ",contacts_fig,sep="")
tdata$direct34[match(contacts_id, tdata$case_id)]=TRUE
# now i need to subtract 34's onset time from these infectees' onset times
SI34s= as.numeric(tdata$symptom_onset[which(tdata$direct34)])-as.numeric(tdata$symptom_onset[which(tdata$case_id=="TJ34")])
mean(as.numeric((SI34s)))# don't need all the as.numerics
sd(SI34s)
```
We estimate R0 from Wallinga and Lipsitch Proc. Roy. Soc. B 2007 using the equation $R=\exp{r \mu - 1/2 r^2 \sigma^2}$. To obtain CIs for R, we could use our bootstrap estimates of $\mu$ and $\sigma^2$ and simply resample R using this equation.
Jung et al Scenario 1
```{r,eval=FALSE}
# load("tianjin_bootstraps.Rdata") # in case in Rmd with above evals set to FALSE
myrate=0.15
Rs=0*(1:100)
for (n in 1:100) {
Rs[n]= exp(myrate*bestimates[n,1] - 0.5*(myrate)^2*bestimates[n,2]^2)
}
hist(Rs,breaks = 30)
mean(Rs)
sd(Rs)
hist(Rs)
quantile(Rs, probs = c(0.025, 0.975))
```
Scenario 2: leads to high R values, not in keeping with most analyses to date.
```{r,eval=FALSE}
myrate=0.29
Rs=0*(1:100)
for (n in 1:100) {
Rs[n]= exp(myrate*bestimates[n,1] - 0.5*(myrate)^2*bestimates[n,2]^2)
}
hist(Rs,breaks = 30)
mean(Rs)
quantile(Rs, probs = c(0.025, 0.975))
```