singapore.Rmd

---
title: "Singapore"
author: "Caroline Colijn and Michelle Coombe"
date: "25/02/2020"
output: html_document
---
  
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(survminer)
library(survival)
library(tidyverse)
library(lubridate)
library(icenReg)
library(igraph)
library(visNetwork)
library(stringr)
options(digits=3)
```

## Data 

Thanks to EpiCoronaHack Cluster team. These data are manually entered from postings from the Government of Singapore website:
  
  * source1: TO CONFIRM WITH LAUREN


```{r}
spdata <- read_csv("~/EpiCoronaHack_Cluster/Clustering/data/COVID-19_Singapore.csv")
spdata <- read_csv("~/Dropbox/Transmission/nCov2019/Singapore/COVID-19_Singapore_feb29_1pm.csv")
# Ensure properly imported
glimpse(spdata)
colSums(is.na(spdata))
# Rename columns 2, 3 and 4 so no spaces
spdata <- rename(spdata, related_cases = starts_with("Related"),
                 cluster_links = "Cluster links",
                 relationship_notes = starts_with("Relation"))
# Change date columns into date objects
spdata <- mutate(spdata, presumed_infected_date = dmy(presumed_infected_date),
                 last_poss_exposure = dmy(last_poss_exposure),
                 symp_presumed_infector = dmy(symp_presumed_infector),
                 date_onset_symptoms = dmy(date_onset_symptoms),
                 date_quarantine = dmy(date_quarantine),
                 date_hospital = dmy(date_hospital),
                 date_confirmation = dmy(date_confirmation),
                 date_discharge = dmy(date_discharge))
# make sure dates parsed properly
range(spdata$presumed_infected_date, na.rm = T)
range(spdata$last_poss_exposure, na.rm = T)
range(spdata$symp_presumed_infector, na.rm = T)
range(spdata$date_onset_symptoms, na.rm = T)
range(spdata$date_quarantine, na.rm = T)
range(spdata$date_hospital, na.rm = T)
range(spdata$date_confirmation, na.rm = T)
range(spdata$date_discharge, na.rm = T)
# Note that case 36 is listed has having symptoms 16 days AFTER being hospitalized; suspect a typo in the month, fixing: 
# spdata$date_onset_symptoms[spdata$CaseID==36] <- ymd("2020-01-24")
# Note that the date of symp_presumed_infector for CaseID 79 changed was originally listed as 2020-02-07 (based on online visualizations) but was changed to 2020-02-10, due to Feb 10, 2020 being on the earliest date of onset of symptoms from case 72, as from online info provided, presumed infective contact for CaseID 79 is from 72 (family member), rather than directly from case 52
spdata$symp_presumed_infector[spdata$CaseID == 79] <- ymd("2020-02-10")
# Change symp_presumed_infector to Feb 10, 2020 (date of symptom onset from caseID 72, the presumed infector)
spdata <- filter(spdata, !is.na(date_onset_symptoms)) #Remove all the cases that do not have info on date of symptom onset 
# NOTE NOTE 12 of these, but they have a date of confiramation and dates of presumed infection - COULD FIX 
```


## Incubation period

The incubation period is the time between exposure and the onset of symptoms. We estimate this directly from the stated start and end times for cases' exposure windows. These are explicitly listed for the Tianjin dataset but in Singapore they are approximated using contact tracing and the route by which a case was exposed. Because it is explicitly about the symptom onset, we remove those who don't have symptom onset defined. (These are a small minority of 12 cases and the alternative would be to impute their symptom onset time using the others' delay to confirmation time.  For now, we remove them).   

Then, if no other end time for the exposure is given or if the end of the exposure time is after the time of symptom onset, set the last exposure time to the symptom onset time. This is because they must have been exposed before symptom onset.   We use four ideas to set the end time for the exposure window: 

* 1: the end source is last possible exposure, if this is given 

* 2:  if it is not given, then we set the end of the exposure window to the time of  symptoms of the presumed infector plus a noise term epsilon (eps)

* 3: and if neither the last possible expsure or the symptom time of the presumed infector are given, the last exposure time is set to the time of symptom onset. 

* 4 Finally, we do not let the last possible exposure time be later than the time of symptom onset 

```{r}
spdata$end_source = spdata$last_poss_exposure # 1 above 

eps=4
hasPresInf = which(is.na(spdata$last_poss_exposure) & !(is.na(spdata$symp_presumed_infector))) # 2 above 
spdata$end_source[hasPresInf] = spdata$presumed_infected_date[hasPresInf]+eps

hasNone = which(is.na(spdata$last_poss_exposure) & is.na(spdata$symp_presumed_infector)) # 3 above 
spdata$end_source[hasNone] = spdata$date_onset_symptoms[hasNone]

spdata$end_source = pmin(spdata$end_source, spdata$date_onset_symptoms) # 4
```

Model the start source 

* 1 if the time of presumed infector is given, use that - epsilon 

* If it is not given use symptom onset minus say 20 days, based on prior 
knowledge 

```{r}
spdata$start_source = spdata$presumed_infected_date - eps # 1
spdata$start_source[is.na(spdata$presumed_infected_date)] = spdata$date_onset_symptoms[is.na(spdata$presumed_infected_date)]-20
```


```{r}
spdata$minIncTimes <- spdata$date_onset_symptoms - spdata$end_source
spdata$maxIncTimes <- spdata$date_onset_symptoms - spdata$start_source
```

We use survival analysis in the icenReg package to make parametric estimates, and we use the regular survival package to estimate the time to onset of symptoms. 

```{r}
ggsurvplot(
fit <- survfit(Surv(spdata$minIncTimes, spdata$maxIncTimes, type="interval2") ~ 1, data = spdata), 
xlab="Days",
ylab = "Overall probability of no symptoms yet")
```


I'll just try one where I stratify by whether the person has a last possible exposure given, or not. 

```{r}
spcopy = spdata; spcopy$has_last = as.factor(!(is.na(spdata$last_poss_exposure)))
spcopyfit <- ic_par(Surv(spcopy$minIncTimes, spcopy$maxIncTimes, type="interval2") ~ has_last, data = spcopy, dist = "weibull")
summary(spcopyfit) 

getFitEsts(spcopyfit, newdata = data.frame(has_last=as.factor(TRUE)), p
                      =c(0.025, 0.05, 0.25, 0.5, 0.75, 0.95, 0.975))
getFitEsts(spcopyfit, newdata = data.frame(has_last=as.factor(FALSE)), p
                      =c(0.025, 0.05, 0.25, 0.5, 0.75, 0.95, 0.975))
# OK - so for those who have a last poss exposure we have inc of 5.22 days , and for everyone, 7.46 days (!) suggesting that using the infected times for those presumed
# infectors is not correct. there are missing intermediate cases. 
ggsurvplot(
fit <- survfit(Surv(spcopy$minIncTimes, spcopy$maxIncTimes, type="interval2") ~ spcopy$has_last), data = spcopy, 
xlab="Days",
ylab = "Overall probability of no symptoms yet",
surv.median.line = c('hv'))
ggsave("inc_sing_by_haslastexp.pdf", height = 6, width = 8)
```

Based on this result I am going to create a supplementary  analysis using only those people who had a last possible exposure; then can create the same for all cases or for just those who don't. 

 We use interval censoring, because we know only that exposure was some time between the minimum and maximum possible values. 

```{r}
# sum(is.na(spdata$minIncTimes)) # 0

# to switch: choose from these two lines

# spfirst = spcopy[which(spcopy$has_last ==TRUE),]
spfirst = spdata 

spfit <- ic_par(Surv(spfirst$minIncTimes, spfirst$maxIncTimes, type="interval2") ~ 1, data = spdata, dist = "weibull")
summary(spfit)

spfit_gamma<- ic_par(Surv(spfirst$minIncTimes, spfirst$maxIncTimes, type="interval2") ~ 1, data = spdata, dist = "gamma")
summary(spfit_gamma)

spfit_lnorm =  ic_par(Surv(spfirst$minIncTimes, spfirst$maxIncTimes, type="interval2") ~ 1, data = spdata, dist = "lnorm")
summary(spfit_lnorm)
```

The log of the shape parameter is `r spfit$coefficients[1]` $\pm$ `r sqrt(spfit$var[1,1])`, which gives a shape parameter of `r exp(spfit$coefficients[1])` with a 1.96-sd (in the log) giving the range (`r exp(spfit$coefficients[1]-1.96*sqrt(spfit$var[1,1]))`, `r exp(spfit$coefficients[1]+1.96*sqrt(spfit$var[1,1]))`).

Similarly the log scale parameter is `r spfit$coefficients[2]` $\pm$ `r sqrt(spfit$var[2,2])`, which gives a scale parameter of `r exp(spfit$coefficients[2])` with a one-sd (in the log) giving the range (`r exp(spfit$coefficients[2]-1.96*sqrt(spfit$var[2,2]))`, `r exp(spfit$coefficients[2]+1.96*sqrt(spfit$var[2,2]))`). 


```{r}
interqs <- getFitEsts(spfit, newdata = NULL, p
                      =c(0.025, 0.05, 0.25, 0.5, 0.75, 0.95, 0.975)) #
interqs
interqs_gamma <- getFitEsts(spfit_gamma, newdata=NULL,  p
                      =c(0.025, 0.05, 0.25, 0.5, 0.75, 0.95, 0.975))
interqs_gamma
interqs_lnorm <- getFitEsts(spfit_lnorm, newdata=NULL,  p
                      =c(0.025, 0.05, 0.25, 0.5, 0.75, 0.95, 0.975))
interqs_lnorm
```
Information for a table for the paper: 

```{r}
# weibull shape scale then quantiles: 
c( exp(spfit$coefficients[1]), scale = exp(spfit$coefficients[2])  , getFitEsts(spfit, newdata = NULL, p =c(0.025,  0.5,  0.975)))

# gamma shape, scale then quants 
c( shape = exp(spfit_gamma$coefficients[1]), scale = exp(spfit_gamma$coefficients[2]), getFitEsts(spfit_gamma, newdata=NULL,  p =c(0.025,  0.5,  0.975)))

# lnorm meanlog, sdlog and quants 
c( meanlog = spfit_lnorm$coefficients[1], sdlog = exp(spfit_lnorm$coefficients[2]), getFitEsts(spfit_lnorm, newdata=NULL,  p =c(0.025,  0.5,  0.975))) 
```

The median is `r interqs[4]` days and the  0.95 at `r interqs[6]`. 

Here is a plot of the estimated distribution together with the empirical survival curve from the data. 

```{r}
spdays <- seq(0,20, by=0.05)
spdensity <- dweibull(spdays, shape = exp(spfit$coefficients[1]), scale = exp(spfit$coefficients[2]))
spdens_gamma=dgamma(spdays, shape = exp(spfit_gamma$coefficients[1]), scale = exp(spfit_gamma$coefficients[2]))
spdens_lnorm=dlnorm(spdays, meanlog = spfit_lnorm$coefficients[1], sdlog = exp(spfit_lnorm$coefficients[2]))

ggsp = ggsurvplot(
fit=survfit(Surv(spfirst$minIncTimes, spfirst$maxIncTimes, type="interval2")~1, data=spfirst), 
xlab="Days",  ylab = "Overall probability of no symptoms yet")
pdata <- data.frame(days=rep(spdays,3),  
            fitsurv=c(1-pweibull(spdays, shape = exp(spfit$coefficients[1]), scale = exp(spfit$coefficients[2])),
        1-pgamma(spdays,  shape = exp(spfit_gamma$coefficients[1]), scale = exp(spfit_gamma$coefficients[2])),
        1-plnorm(spdays,  meanlog = spfit_lnorm$coefficients[1], sdlog = exp(spfit_lnorm$coefficients[2]))),distn=c(rep("Weibull", length(spdays)), rep("Gamma",length(spdays)), rep("Lognorm", length(spdays)) )) 
                                                            

ggsp$plot + geom_line(data = pdata, aes(x = days, y = fitsurv,color=distn))
  # ggsave(filename = "inc_Sing_firstonly.pdf", width = 8, height = 6)
```


## Serial interval 

The simplest serial interval estimate we can make with these data is a direct estimate based on the time of symptoms of the presumed infector, and the time of symptoms of the case. However, this does not account for the fact that the presumed infector is not necessarily the infector. 

```{r}
directSI=spdata$date_onset_symptoms - spdata$symp_presumed_infector
directSI=as.numeric(directSI[!is.na(directSI)])
hist(directSI,breaks = 10)
mean(directSI)
sd(directSI)
```


We will estimate the serial interval using the 'interval case to case' approach given in Vink et al (https://academic.oup.com/aje/article/180/9/865/2739204). 

The dataset has several instances where a putative infector or contact is known. These are listed in the 'related_cases' column. We first make a graph in which nodes are individuals and edges are present from cases listed as possible sources, to the cases for whom they are possible sources. 

```{r}
spnodes <- spdata$CaseID
## How to extract caseIDs from related_cases column - there are multiple values in some cells, separated by commas
spdata$related_cases #7 max within one cell
# Split into separate columns
spdata <- separate(spdata,
col = related_cases,
into = paste("contactID", 1:7, sep = "_"),
fill = "right")
# Turn into numeric values
spdata <- mutate(spdata, 
contactID_1 = as.numeric(contactID_1),
contactID_2 = as.numeric(contactID_2),
contactID_3 = as.numeric(contactID_3),
contactID_4 = as.numeric(contactID_4),
contactID_5 = as.numeric(contactID_5),
contactID_6 = as.numeric(contactID_6),
contactID_7 = as.numeric(contactID_7))
# Select down to columns of interest
spedges <- select(spdata, c(CaseID, starts_with("contactID")))
# Remove rows with NAs for at least one contact
spedges <- filter(spedges, !is.na(spedges$contactID_1)) #43 CasesIDs with 1 or more possible contacts
```

That is nice but visNetwork and igraph require an edge list with from, to nodes. So for each row of spedges we create entries like these.

NOTE still need to check whether the related cases came prior to the stated cases.. (but this may come out in the wash, in the ICC method) 

```{r}
singedges = data.frame(from=2,to=1) 
for (n in 1:nrow(spedges)) {
 for (k in 2:ncol(spedges)) { 
   if (!is.na(spedges[n,k])) {
     singedges=rbind(singedges, c(spedges[[n,k]],spedges[[n,1]])) 
   }  
   }
}
singedges=singedges[-1,]
# create undirected graph by removing duplicates
undir=data.frame(from = pmin(singedges[,1],singedges[,2]), 
                 to=pmax(singedges[,1], singedges[,2]))
undir=unique(undir)
undir = undir[-which(undir[,1]==undir[,2]),]
fedges = data.frame(from=paste("case",undir[,1],sep=""), 
               to=paste("case",undir[,2],sep=""))
```


From this edge list we can use visNetwork to visualise the graph. Make 'group' based on source of probably infection. Colours are from the infection source column (but we could have a better colour scheme, like date of symptom onset). 

```{r}
# Turn 'presumed_reason' into lower case and get trim any whitespace so don't have issues with case sensitivity, etc
spdata$presumed_reason <- str_to_lower(spdata$presumed_reason)
spdata$presumed_reason <- str_trim(spdata$presumed_reason)
table(spdata$presumed_reason)
sum(is.na(spdata$presumed_reason)) #15 NAs
# Make a new column where we group the 'presumed_reason' under a label (known relationship, gathering, wuhan travel) for each of the above three groups
spdata <- mutate(spdata, presumed_reason_group = case_when(!is.na(str_match(presumed_reason, "symptom onset|via")) ~ "Known relationship",
                                                           !is.na(str_match(presumed_reason, "grace")) ~ "Grace Assembly of God",
                                                           !is.na(str_match(presumed_reason, "grand")) ~ "Grand Hyatt Singapore",
                                                           !is.na(str_match(presumed_reason, "life")) ~ "Life Church",
                                                           !is.na(str_match(presumed_reason, "seletar")) ~ "Seletar Aerospace Heights",
                                                           !is.na(str_match(presumed_reason, "yong")) ~ "Yong Thai Hang",
                                                           !is.na(str_match(presumed_reason, "wuhan|airport")) ~ "Wuhan travel", #'airport' case (CaseID 17) does not have 'wuhan' in reason but does have it under 'Case' column that they are from Wuhan
                                                           is.na(presumed_reason) ~ "Unknown",
                                                           TRUE ~ "other")) #should not be any other, so is just a double check this has run correctly, especially as dataset grows
table(spdata$presumed_reason_group)
```

```{r}
nodes.df <- data.frame(id=paste("case",spdata$CaseID,sep=""), label=spdata$CaseID, group=spdata$presumed_reason_group)
glimpse(nodes.df)
spdata$graphID = paste("case",spdata$CaseID,sep="")
visNetwork(nodes.df, fedges) %>% visLegend() 
```

Now we estimate the serial interval using the ICC method; for this we first construct a graph. The "interval case to case" data are from identifying a putative first infector each small cluster in the graph, and finding the times between symptom onset in the first observed case and the others. See Vink et al. 


```{r}

sgraph = graph_from_edgelist(as.matrix(fedges[,1:2]), directed = FALSE)
ccs=components(sgraph)

spdata$component=vapply(spdata$graphID, function(x)
  { if (x %in% names(ccs$membership)) { return(ccs$membership[match(x, names(ccs$membership))])
  } else { 
    return(NA)}}, FUN.VALUE = 3)
```

Now knowing the components of the graph I can extract the ICC intervals. 
I did this in a few ways (commented out lines): taking the first 
case for each cluster to be the first reported symptoms (I get a 5 day serial interval); the first start exposure time (now there are negative ICCs so I get a 4.5 day serial interval) and the latest end exposure time.


Extract ICC interval data: a function 

```{r}
 getICCs <- function(thisdata, ccs, K, orderby= "onset" ) {
  iccs=1
for (n in 1:max(ccs$membership)) {
  mycases  = which(thisdata$component==n)
  if (orderby == "onset")
  {  myonsets = sort(thisdata$date_onset_symptoms[mycases])[1:min(K, length(mycases))]}
  if (orderby == "exposure") {
 myonsets =thisdata$date_onset_symptoms[mycases][order(thisdata$end_source[mycases])][1:min(K,length(mycases))]
 # myonsets =  spdata$date_onset_symptoms[mycases[order(spdata$start_source[mycases])]] # alternative also ORDERS by earliest exposure 
 
 }
  iccs =c(iccs, myonsets[-1]-myonsets[1])
}
  return(iccs[-1]) 
  }
```


```{r}
iccall = getICCs(spdata,ccs,9)
icc3 = getICCs(spdata,ccs,3)
icc4 = getICCs(spdata,ccs,4)
icc5 = getICCs(spdata,ccs,5)
icc6 = getICCs(spdata,ccs,6)
icc_expose = getICCs(spdata, ccs, 4, orderby ="exposure")
```

```{r}
source("TianjinSI_VinkWallinga_CC.R")
myestimate = serial_mix_est(data=icc4, N=100, startmu=10, startsig =4)
myestimate

myest3 = serial_mix_est(data=icc3, N=50, startmu=10, startsig =4)
myest4 = serial_mix_est(data=icc4, N=50, startmu=10, startsig =4)
myest5 = serial_mix_est(data=icc5, N=50, startmu=10, startsig =4)
myest6 = serial_mix_est(data=icc6, N=50, startmu=10, startsig =4)
myest_exp= serial_mix_est(data=icc_expose, N=50, startmu=10, startsig =4)

mm=rbind(myestimate,myest3, myest4, myest5,myest6, myest_exp)
colnames(mm)=c("mu","sig")
mm=as.data.frame(mm)
mm$K=c(4, 3,4, 5, 6, 4) 
mm$ordering = c("Onset","Onset","Onset","Onset","Onset","LastExposure")
print(mm[,c(4,3,1,2)]) 
```


```{r,eval=FALSE}
days = seq(from=0, to=10, by=0.1) 
 density= dnorm(days, mean = myestimate[1], sd = myestimate[2])
ggplot(data=data.frame(days=days, density=density), aes(x=days,y=density)) + geom_line() + ggtitle("ICC estimate of the Singapore cluster serial interval")
ggsave(file="sing_serialint.pdf", height = 4, width = 6)
```

I note that the serial interval gets longer if we include more cases per cluster (because the mixture of 4 pathways in Vink et al does not include longer transmission chains, which forces the assumption that everyone in the cluster was infected by the initial case, which in turn lengthens the estimated serial interval). We do not know the true infection pathways but it is reasonable not to constrain the model to enforce that most are infected by the first few cases. 


```{r, eval=FALSE}
# bootstrap analysis
Nboot=100
bestimates=myestimate # NOTE this loop had errors a few times; I just restarted it. 
for (kk in 1:Nboot) {
bdata = sample(x=icc4, size = length(icc4), replace = T)
bestimates = rbind(bestimates, serial_mix_est(data=bdata, N=50, startmu=10, startsig =4))
}

#The mean of the mean serial intervals is`r mean(bestimates[,1])` days and the standard deviation of these means is `r sd(bestimates[,1])`. 

```

```{r, eval=FALSE}
hist(bestimates[,1],breaks = 30)
bootdf=data.frame(mu=bestimates[,1], sig=bestimates[,2])
ggplot(bootdf, aes(x=mu, y=sig))+geom_point()
ggplot(bootdf, aes(x=mu))+geom_histogram()
 ggsave(file = "bootst_SI_sing.pdf", width = 6, height = 4)
mean(bestimates[,1]) # higher estimates also have high uncertainty 
median(bestimates[,1])
sd(bestimates[,1])
mean(bestimates[,2])
sd(bestimates[,2])
# range for mean serial interval in the paper is 4.49 (result for k=4) plus/min 1.96*sd(bestimates) 
```

We estimate R0 from Wallinga and Lipsitch Proc. Roy. Soc. B 2007 using the equation $R=\exp{r \mu - 1/2 r^2 \sigma^2}$. To obtain CIs for R, we could use our bootstrap estimates of $\mu$ and $\sigma^2$ and simply resample R using this equation. 

Jung et al Scenario 1

```{r,eval=FALSE}
load("sing_boots_100.Rdata") # in case in Rmd with above evals set to FALSE 
myrate=0.15

Rs=0*(1:100) 
for (n in 1:100) {
  Rs[n]= exp(myrate*bestimates[n,1] - 0.5*(myrate)^2*bestimates[n,2]^2)
}
hist(Rs,breaks = 30)
mean(Rs)
sd(Rs)
hist(Rs)
quantile(Rs, probs = c(0.025, 0.975))
```

```{r,eval=FALSE}
myrate=0.29
Rs=0*(1:100) 
for (n in 1:100) {
  Rs[n]= exp(myrate*bestimates[n,1] - 0.5*(myrate)^2*bestimates[n,2]^2)
}
hist(Rs,breaks = 30)
mean(Rs)
quantile(Rs, probs = c(0.025, 0.975))
```