Tianjin.Rmd

---
title: "Tianjin"
author: "Caroline Colijn"
date: "25/02/2020"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(survminer)
library(survival)
library(ggplot2)
library(icenReg)
library(igraph)
library(visNetwork)
library(stringr)
options(digits=3)

```

## Data 

Thanks to Dongxuan Chen and Louxin Zhang. These data are from three main sources:

* source1: http://wsjk.tj.gov.cn/col/col87/index.html#!uid=259&pageNum=1 (Tianjin health commission official website, for daily announcements)

* source2: https://weibo.com/u/2967529507 (Jinyun News, Tianjin offical local media weibo account, for patient symptom onset reference)

* source3: https://m.weibo.cn/status/IrrHI1FHm?jumpfrom=weibocom (another Tianjin local media weibo link, for mall cluster reference)


```{r}
tdata=read.csv("Tianjin135casesFeb22.csv",na.strings = "", stringsAsFactors = F)
tdata$symptom_onset=as.Date(tdata$symptom_onset, format = "%d/%m/%Y")
tdata$start_source=as.Date(tdata$start_source, format = "%d/%m/%Y")
tdata$end_source=as.Date(tdata$end_source,format = "%d/%m/%Y" )
tdata$confirm_date=as.Date(tdata$confirm_date,format = "%d/%m/%Y" )
str(tdata)
```


## Incubation period

The incubation period is the time between exposure and the onset of symptoms. We estimate this directly from the stated start and end times for cases' exposure windows. Because it is explicitly about the symptom onset, we remove those who don't have symptom onset defined. These are a small minority of cases and the alternative would be to impute their symptom onset time using the others' delay to confirmation time.  For now, we remove them.  Then,  if no other end time for the exposure is given or  if the end of the exposure time is after the time of symptom onset, set the last exposure time to the symptom onset time. This is because they must have been exposed before symptom onset.  If no other start time is given, they must have been exposed since the start of the outbreak (Dec 1, 2019). These give us  the maximum and minimun incubation times. 

```{r}
goodii=which(!is.na(tdata$symptom_onset))

tdata$end_source[which(is.na(tdata$end_source))]=tdata$symptom_onset[which(is.na(tdata$end_source))]  # if no end exposure: set to symptom onset 
tdata$end_source = pmin(tdata$end_source, tdata$symptom_onset) # if end exposure after onset, set to onset 
tdata$start_source[which(is.na(tdata$start_source))]=as.Date("2020-01-05") # start date 

tdata$maxIncTimes=tdata$symptom_onset-tdata$start_source 
tdata$minIncTimes = tdata$symptom_onset-tdata$end_source

tdata$maxIncTimes
tdata$minIncTimes
tdata$maxIncTimes[27] = 50 # for some reason this was coming up negative 
```

We use survival analysis in the icenReg package to make parametric estimates, and we use the regular survival package to estimate the time to onset of symptoms. 

```{r}
ggsurvplot(
  fit=survfit(Surv(tdata$minIncTimes, tdata$maxIncTimes, type="interval2")~1, data=tdata), 
  xlab="Days",
  ylab = "Overall probability of no symptoms yet")
```

The median is about 8 days. For a parametric estimate we remove any remaining NAs and use interval censoring, because we know only that exposure was some time between the minimum and maximum possible values. 

```{r}
reddata=tdata[which(!is.na(tdata$minIncTimes)),]

myfit = ic_par(Surv(reddata$minIncTimes, reddata$maxIncTimes,type="interval2")~1, data = reddata,dist="weibull")


myfit_gamma<- ic_par(Surv(reddata$minIncTimes, reddata$maxIncTimes, type="interval2") ~ 1, data = reddata, dist = "gamma")
summary(myfit_gamma)

myfit_lnorm =  ic_par(Surv(reddata$minIncTimes, reddata$maxIncTimes, type="interval2") ~ 1, data = reddata, dist = "lnorm")
summary(myfit_lnorm)
```

The log of the shape parameter is `r myfit$coefficients[1]` $\pm$ `r sqrt(myfit$var[1,1])`, which gives a shape parameter of `r exp(myfit$coefficients[1])` with a 1.96-sd (in the log) giving the range (`r exp(myfit$coefficients[1]-1.96*sqrt(myfit$var[1,1]))`, `r exp(myfit$coefficients[1]+1.96*sqrt(myfit$var[1,1]))`).

Similarly the log scale parameter is `r myfit$coefficients[2]` $\pm$ `r sqrt(myfit$var[2,2])`, which gives a scale parameter of `r exp(myfit$coefficients[2])` with a one-sd (in the log) giving the range (`r exp(myfit$coefficients[2]-1.96*sqrt(myfit$var[2,2]))`, `r exp(myfit$coefficients[2]+1.96*sqrt(myfit$var[2,2]))`). 

If the earliest date of exposure is Dec 1 instead of Dec 31 these are: shape 2.28 (1.8, 2.8), scale 10.0 (8.86, 11.38). If it's Jan 5 they are shape: 2.28 (1.86, 2.84) and scale 9.79 (8.69, 11.03). 

```{r}
interqs=getFitEsts(myfit, newdata = NULL, p=c(0.025,0.05, 0.25, 0.5, 0.75,0.95,0.975)) #

interqs
interqs_gamma <- getFitEsts(myfit_gamma, newdata=NULL,  p
                      =c(0.025, 0.05, 0.25, 0.5, 0.75, 0.95, 0.975))
interqs_gamma
interqs_lnorm <- getFitEsts(myfit_lnorm, newdata=NULL,  p
                      =c(0.025, 0.05, 0.25, 0.5, 0.75, 0.95, 0.975))
interqs_lnorm
```

The median is `r interqs[4]` days and the  0.95 at `r interqs[6]`. These are longer than my estimate from the line list data, which matched Backer et al's analysis of the same data.  Here is a plot of the estimated distribution together with the empirical survival curve from the data. 

Information for a table for the paper: 

```{r}
# weibull shape scale then quantiles: 
c( exp(myfit$coefficients[1]), scale = exp(myfit$coefficients[2])  , getFitEsts(myfit, newdata = NULL, p =c(0.025,  0.5,  0.975)))

# gamma shape, scale then quants 
c( shape = exp(myfit_gamma$coefficients[1]), scale = exp(myfit_gamma$coefficients[2]), getFitEsts(myfit_gamma, newdata=NULL,  p =c(0.025,  0.5,  0.975)))

# lnorm meanlog, sdlog and quants 
c( meanlog = myfit_lnorm$coefficients[1], sdlog = exp(myfit_lnorm$coefficients[2]), getFitEsts(myfit_lnorm, newdata=NULL,  p =c(0.025,  0.5,  0.975))) 
```


```{r}
days=seq(0,20,by=0.05)
density=dweibull(days, shape = exp(myfit$coefficients[1]), scale = exp(myfit$coefficients[2]))

ggs = ggsurvplot(
  fit=survfit(Surv(tdata$minIncTimes, tdata$maxIncTimes, type="interval2")~1, data=tdata), 
  xlab="Days",  ylab = "Overall probability of no symptoms yet")

pdata <- data.frame(days=rep(days,3),  
            fitsurv=c(1-pweibull(days, shape = exp(myfit$coefficients[1]), scale = exp(myfit$coefficients[2])),
        1-pgamma(days,  shape = exp(myfit_gamma$coefficients[1]), scale = exp(myfit_gamma$coefficients[2])),
        1-plnorm(days,  meanlog = myfit_lnorm$coefficients[1], sdlog = exp(myfit_lnorm$coefficients[2]))),distn=c(rep("Weibull", length(days)), rep("Gamma",length(days)), rep("Lognorm", length(days)) ))  # i know, i know... 


tmp=data.frame(days=days,  fitsurv=1-pweibull(days, shape = exp(myfit$coefficients[1]),
                      scale = exp(myfit$coefficients[2])))
ggs$plot + geom_line(data = tmp, aes(x = days, y = fitsurv))
# ggsave(filename = "inc_Tianjin.pdf", width = 8, height = 6)
```


## Serial interval 


We will estimate the serial interval using the 'interval case to case' approach given in Vink et al (https://academic.oup.com/aje/article/180/9/865/2739204). 

The dataset has quite a few instances where a putative infector or contact is known. These are listed in the 'Infection_source' column. We first make a graph in which nodes are individuals and edges are present from cases listed as possible sources, to the cases for whom they are possible sources. 

```{r}
mynodes = tdata$case_id
edges = data.frame(from=mynodes[9],to=mynodes[21],stringsAsFactors = F ) # i read this one manually 
for (id in 1:nrow(tdata)) {
tonode=tdata$case_id[id]
fromnodes=str_extract_all(tdata$Infection_source[id], "TJ\\d+", simplify = T)
  if (length(fromnodes)>0) {
    for (k in 1:length(fromnodes)) {
      edges=rbind(edges, c(fromnodes[k], tonode))
    }
  }
}
head(edges)
edges=edges[-1,]
edges=edges[-which(is.na(edges[,1])),] # NAs arose from a few empty entries for Infection_source 
```

From this edge list we can use visNetwork to visualise the graph. Colours are from the infection source column (but we should have a better colour scheme, like date of symptom onset). 

```{r}
edges$arrows="to"
nodes = data.frame(id=tdata$case_id, label=tdata$case_id,
                   group=tdata$Infection_source)
visNetwork(nodes,edges)
```

The interval case to case (ICC) data are the times between the (presumed) index case for a small cluster and the other cases in the cluster. The Vink et al approach allows these intervals to be one of 4 types, and estimates the serial interval and the probability of each type. To extract ICC intervals, we let the clusters be the components of the graph, and we let the presumed index case be the first to develop symptoms. For each cluster, we subtract the index cases' symptom time from the symtom times of the rest of the cluster (or just the first few; it turns out that the estimate is not sensitive to this). This results in a list of time intervals between symptom onset in presumed index cases and symptom onset in other cases in the same cluster (graph component). 


First construct the graph

```{r}
#serialdata=edges # REMOVE? 
#serialdata$symps_from = tdata$symptom_onset[match(edges$from, tdata$case_id)]
#serialdata$symps_to=tdata$symptom_onset[match(edges$to, tdata$case_id)]
tgraph = graph_from_edgelist(as.matrix(edges[,1:2]), directed = FALSE)
ccs=components(tgraph)

tdata$component=vapply(tdata$case_id, function(x)
  { if (x %in% names(ccs$membership)) { return(ccs$membership[match(x, names(ccs$membership))])
  } else { 
    return(NA)}}, FUN.VALUE = 3)
```


Extract ICC interval data: a function 

```{r}
 getICCs <- function(thisdata, ccs, K, orderby= "onset" ) {
  iccs=1
for (n in 1:max(ccs$membership)) {
  mycases  = which(thisdata$component==n)
  if (orderby == "onset")
  {  myonsets = sort(thisdata$symptom_onset[mycases])[1:min(K, length(mycases))]}
  if (orderby == "exposure") {
 myonsets =thisdata$symptom_onset[mycases][order(thisdata$end_source[mycases])][1:min(K,length(mycases))]
  }
  iccs =c(iccs, myonsets[-1]-myonsets[1])
}
  return(iccs[-1]) 
  }
```


```{r}
stdata = tdata[which(!is.na(tdata$symptom_onset)),]

icc3 = getICCs(stdata,ccs,3)
icc4 = getICCs(stdata,ccs,4)
icc5 = getICCs(stdata,ccs,5)
icc6 = getICCs(stdata,ccs,6)
icc_expose = getICCs(stdata, ccs, 4, orderby ="exposure")
```

Perform the estimate using the Vink et al method, and display the result:

```{r}
source("TianjinSI_VinkWallinga_CC.R")
myestimate = serial_mix_est(data=icc4, N=100, startmu=10, startsig =4)
myestimate
myest3 = serial_mix_est(data=icc3, N=100, startmu=10, startsig =4)
myest4 = serial_mix_est(data=icc4, N=100, startmu=10, startsig =4)
myest5 = serial_mix_est(data=icc5, N=100, startmu=10, startsig =4)
myest6 = serial_mix_est(data=icc6, N=100, startmu=10, startsig =4)
myest_exp= serial_mix_est(data=icc_expose, N=100, startmu=10, startsig =4)

mm=rbind(myestimate,myest3, myest4, myest5,myest6, myest_exp)
colnames(mm)=c("mu","sig")
mm=as.data.frame(mm)
mm$K=c(9, 3, 4, 5, 6, 4) 
mm$ordering = c("Onset","Onset","Onset","Onset","Onset","LastExposure")
print(mm[,c(4,3,1,2)]) 
```


```{r,eval=FALSE} 
days = seq(from=0, to=10, by=0.1) 
density= dnorm(days, mean = myestimate[1], sd = myestimate[2])
ggplot(data=data.frame(days=days, density=density), aes(x=days,y=density)) + geom_line() + ggtitle("ICC estimate of the Tianjin cluster serial interval")
```


Bootstrap analysis - have left off in the Rmd because it takes time. 

```{r, eval=FALSE}
# bootstrap analysis
Nboot=100
bestimates=myestimate 
for (kk in 1:Nboot) {
bdata = sample(x=icc4, size = length(iccall), replace = T)
bestimates = rbind(bestimates, serial_mix_est(data=bdata, N=50, startmu=10, startsig =4))
}
```

```{r,eval=FALSE}
mean(bestimates[,1]) # mean of the mean serial intervals
median(bestimates[,1])
mean(bestimates[,2]) # sd of the sd serial intervals 
sd(bestimates[,1]) # sd of the sd serial intervals 
```


```{r, eval=FALSE}
hist(bestimates[,1],breaks = 10)
bootdf=data.frame(mu=bestimates[,1], sig=bestimates[,2])
ggplot(bootdf, aes(x=mu, y=sig))+geom_point()

ggplot(bootdf, aes(x=mu))+geom_histogram()
 ggsave(file = "bootst_SI_tianjin.pdf", width = 6, height = 4)
```


The direct infections from case 34 according to the figure at https://mp.weixin.qq.com/s/x4HBXGFw5vnWU7nXXdyWVg. 

```{r,eval=FALSE}
tdata$direct34=FALSE
contacts_fig = c(43,37,53,83,89,131,124,48,71,51,57,58,
                 66,68,50,73,74,87,78,80,36,92,110,111)
contacts_id=paste("TJ",contacts_fig,sep="")
tdata$direct34[match(contacts_id, tdata$case_id)]=TRUE
# now i need to subtract 34's onset time from these infectees' onset times 
SI34s= as.numeric(tdata$symptom_onset[which(tdata$direct34)])-as.numeric(tdata$symptom_onset[which(tdata$case_id=="TJ34")])
mean(as.numeric((SI34s)))# don't need all the as.numerics 
sd(SI34s)

```


We estimate R0 from Wallinga and Lipsitch Proc. Roy. Soc. B 2007 using the equation $R=\exp{r \mu - 1/2 r^2 \sigma^2}$. To obtain CIs for R, we could use our bootstrap estimates of $\mu$ and $\sigma^2$ and simply resample R using this equation. 

Jung et al Scenario 1

```{r,eval=FALSE}
# load("tianjin_bootstraps.Rdata") # in case in Rmd with above evals set to FALSE 
myrate=0.15

Rs=0*(1:100) 
for (n in 1:100) {
  Rs[n]= exp(myrate*bestimates[n,1] - 0.5*(myrate)^2*bestimates[n,2]^2)
}
hist(Rs,breaks = 30)
mean(Rs)
sd(Rs)
hist(Rs)
quantile(Rs, probs = c(0.025, 0.975))
```
Scenario 2: leads to  high R values, not in keeping with most analyses to date. 

```{r,eval=FALSE}
myrate=0.29
Rs=0*(1:100) 
for (n in 1:100) {
  Rs[n]= exp(myrate*bestimates[n,1] - 0.5*(myrate)^2*bestimates[n,2]^2)
}
hist(Rs,breaks = 30)
mean(Rs)
quantile(Rs, probs = c(0.025, 0.975))
```