-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy patharvinUserGuide.Rmd
179 lines (143 loc) · 11.5 KB
/
arvinUserGuide.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
---
title: "ARVIN : Identifying Risk Noncoding Variants Using Disease-relevant Gene Regulatory Networks"
author: "Long Gao, Yasin Uzun, Kai Tan"
date: "May 18, 2018"
output:
pdf_document:
toc: true
toc_depth: 4
---
```{r setup, include=FALSE, warning=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## 1 Introduction
Identifying causal noncoding variants remains a daunting task. Because noncoding variants exert their effects in the context of a gene regulatory network (GRN), we hypothesize that explicit use of disease-relevant GRN can significantly improve the inference accuracy of noncoding risk variants. We describe Annotation of Regulatory Variants using Integrated Networks (ARVIN), a general computational framework for predicting causal noncoding variants. For each disease, ARVIN first constructs a GRN using multi-dimensional omics data oncell/tissue-type relevant to the disease. ARVIN then uses a set of novel regulatory network-based features, combined with sequence-based features to make predictions.
This user guide explains ARVIN package. If you want to make a quick start and run ARVIN, please refer to the related document in this repository (ARVIN_Quick_Start.docx).
ARVIN uses genome annotation and motif data. You need to download http://tanlab4generegulation.org/arvin_annotation_data.tar.gz to run ARVIN. In addition, please make sure following dependency packages also installed including "caret", "randomForest", and "igraph".
## 2 Network construction
As a first step of ARVIN, an integrative GRN for each disease-relevant cell/tissue type should be constructed. In this network, there are two types of nodes which are genes and snps/variants as well as two types of edges which are gene-gene interacstions and snp-gene interactions. Gene interaction network is used as the network backbone, and then this backbone is integrated with enhancer-promoter interactions(snp-gene interactions). For each type of nodes and edges, normalized scores are computed using cell/tissue speicific information.
### 2.1 Enhancer-promoter interaction
ARVIN uses enhancer-promoter interaction for mapping SNPS to genes via enhancers. The enhancer-promoter interaction data must be in tab separated format as follows:
#Chr Start End Target Score
chr9 22124001 22126001 ENST00000452276 0.93
chr2 242792001 242794001 ENST00000485966 0.792
You can use IM-PET software for predicting enhancer-promoter interactions. For this purpose, you can download IM-PET from http://tanlab4generegulation.org/IM-PET.html . IM-PET uses enhancer predictions (computed using CSI-ANN) and gene expression data to compute enhancer-promoter interactions. You can use the output of IM-PET (which is as shown above) as input for ARVIN.
### 2.2 Obtain gene-gene interaction network
Gene interaction network can be obtained from multiple sources including protein-protein interaction networks and functional interaction networks such as BioGRID, BIND, HumantNet, STRING and etc.
### 2.3 Network scoring
To make different types of scores comparable, we used a min-max normalization to normalize scores within each category. For gene interaction network, the interaction score indicating how strong the interactions between genes can be directly used from corresponding database followed by normalization. SNP-gene interaction score can be assigned with enhancer-promoter score or scores indicating the interaction strength between enhancer and target promoters. SNPs weight can be represented by TF binding disruption score computed using our script/function. Genes can be weighted using differential expression information.
### 2.4 Network input file format
There are two types of network input files users need to prepare. One is the node attribute file and the other is network/edge attribute file. The node attribute file has 3 columns. The first column denotes snp id or gene id, and the second column denotes the score of this snp or gene. The third column specifiy if this node is a snp or gene. In the network file, there are 4 columns. The first two columns list two nodes of a given edge. The third collumn has the normalized score for this edge. The forth column specifies edge type indicating if this interaction is between snps and genes or genes.
```{r, message=FALSE, warning=FALSE}
edgeFile <- "example_1/EdgeFile.txt"
nodeFile <- "example_1/NodeFile.txt"
SNP_pFile <- "example_1/snp_pval.txt"
edge_data <- read.table(edgeFile, sep="\t")
node_data <- read.table(nodeFile, sep="\t")
colnames(node_data) <- c("Node", "Node score", "Node type")
colnames(edge_data) <- c("First node", "Second node", "Edge score", "Edge type")
edge_data[98:103,]
node_data[98:103,]
```
```{r, echo=FALSE, message=FALSE, warning=FALSE}
devtools::load_all(".")
library(ARVIN)
library(igraph)
Net <- makeNet(edgeFile, nodeFile)
node_data <- read.table(nodeFile)
node_data <- subset(node_data, node_data$V3 == "eSNP")
eSNP_seeds <- as.character(node_data$V1)#use all SNPs as seeds for search
node_data <- NULL#free memory
edge_data <- read.table(edgeFile)
V_weight <- V(Net)$weight#put the node weight into an array to save time
names(V_weight) <- V(Net)$name#set the names for the array
E_adj <- as_adj(Net,attr="weight")#use adj matrix to save time for getting edge weight
colnames(E_adj) <- names(V_weight)
row.names(E_adj) <- names(V_weight)
Adj_List <- c()
batch <- 1000#split the matrix into a few parts in case of a memory error
index <- floor(dim(E_adj)[1]/batch)
for(i in 1:index){
ifelse(i != index,
end <- i * batch,
end <- dim(E_adj)[1]
)
start <- (i-1) * batch + 1
cur_list <- apply(E_adj[start:end,], 1, function(x) x[x!=0])
Adj_List <- append(Adj_List, cur_list)
}
```
## 3 Prepare features for risk variants prediction
### 3.1 Network-based features
For most of network features such as the centrality, we wrapped up functions from "igraph" package to calculate their values. We also implemented our module identification algorithm to find modules containing snps we are intereted in. To calculate all network based features, users can simply call NetFeature(). SNPs in the same enhancer usually have similar topological features. However, we can further distinguish them using their TF motif breaking scores.
```{r, message=FALSE, warning=FALSE}
Nodes <- as.character(edge_data[,2])
Net <- makeNet(edgeFile, nodeFile)
topoFeature <-NetFeature(Net, nodeFile, edgeFile, SNP_pFile)
head(topoFeature)
```
#### 3.1.1 Betweenness centrality
Betweenness is a centrality measure of a vertex within a graph. Betweenness centrality quantifies the number of times a node acts as a bridge along the shortest path between two other nodes.
```{r, message=FALSE, warning=FALSE}
bet_vals <- BetFeature(Net, edge_data)
head(bet_vals)
```
#### 3.1.2 Closeness centrality
Closeness is a measure of the degree to which an individual is near all other individuals in a network. It is the inverse of the sum of the shortest distances between each node and every other node in the network. Closeness is the reciprocal of farness.
```{r, message=FALSE, warning=FALSE}
close_vals <- CloseFeature(Net, edge_data)
head(close_vals)
```
#### 3.1.3 Pagerank centrality
PageRank (PR) is an algorithm used by Google Search to rank websites in their search engine results. PageRank is a way of measuring the importance of website pages. In the biological networks, we can also use this algorithm to measure the importance of genes/nodes.
```{r, message=FALSE, warning=FALSE}
page_vals <- PageFeature(Net, edge_data)
head(page_vals)
```
#### 3.1.4 Weighted degree
The weighted degree of a node is like the degree. It's based on the number of edge for a node, but ponderated by the weigtht of each edge. It's doing the sum of the weight of the edges.
```{r, message=FALSE, warning=FALSE}
wd_vals <- WDFeature(Adj_List, edge_data)
head(wd_vals)
```
#### 3.1.5 Module score
Gene modules downstream of an eSNP. Our overall hypothesis is that a causal eSNP contributes to disease risk by directly causing expression changes in genes of diseaserelevant pathways. Thus, in addition to the direct target gene of the eSNP, other genes in the same pathway can also provide discriminative information. With the weighted GRN, our goal is to identify “heavy” gene modules in the network that connects a given eSNP to a set of genes
```{r, message=FALSE, warning=FALSE}
mod_vals <- ModuleFeature(Adj_List, E_adj, eSNP_seeds, V_weight, Nodes)
head(mod_vals)
```
### 3.2 GWAVA features
ARVIN uses sequence features for the input SNPs generated by GWAVA. GWAVA is an open-source software developed by Sanger Institute. You can either upload the SNPs to GWAVA web page and get the output or download the source and run locally.
For running GWAVA online navigate to https://www.sanger.ac.uk/sanger/StatGen_Gwava, upload the list of input SNPs and get the features in csv format, which will be input for ARVIN.
If you prefer to run it locally, you need to dowload the source code from ftp://ftp.sanger.ac.uk/pub/resources/software/gwava/v1.0/src/ and annotation data from ftp://ftp.sanger.ac.uk/pub/resources/software/gwava/v1.0/source_data/ . Then you can run it local by running gwava_annotate.py and generate the features, which will be input for ARVIN.
```{r, message=F, warning=FALSE}
gwava <- read.table("example_1/gwava_matrix.txt", header=T, sep="\t")
gwava[1:8,168:174]
```
### 3.3 FunSeq features
ARVIN also uses sequence features generated by FunSeq. FunSeq can also be run online or binaries can be downloaded to run locally.
For running FunSeq online, navigate to http://funseq.gersteinlab.org/analysis and upload the list of SNPs that you want to analyze. In the web page, it is noted that the input SNPs can be uploaded in bed format, SNP coordinates followed by reference and alternate alleles; but we discovered that it fails to process bed input. In order to have it run, you the first two seperators need to be two spaces and last two separators need to be tabs, as follows:
chr16··4526757··4526758 G A
chr14··52733136··52733137 C A
where each dot (·) represents a space.
Then, FunSeq will generate the features by selecting “bed” as the output format., which will be used as input by ARVIN.
If you prefer to run FunSeq locally, you can download FunSeq binaries from http://funseq.gersteinlab.org/static/funseq-0.1.tar.gz and extract it into your local. You will also need to download FunSeq annotation data from http://funseq.gersteinlab.org/static/data/data.tar.gz , extract it into directory that you saved the binaries. Then you can run FunSeq binary file by setting the output format to bed.
```{r, message=F, warning=FALSE}
funseq <- read.table("example_1/funseq_matrix.txt", header=T, sep="\t")
head(funseq)
```
## 4 Build a classifier for prioritizing risk varints
### 4.1 Train a random forest classifier
Combining network, GWAVA features and FunSeq features, a random forest model can be trained to predict risk SNPs.
```{r, message=FALSE, warning=FALSE}
#combine 3 types of features
group <- as.character(read.table("example_1/snp_labels.txt")[,1])
features <- data.frame(topoFeature, gwava, funseq, group)
RFmodel <- trainMod(features)#train a random forest classifier
```
### 4.2 Predict causal disease variants
By providing the trained random forest model with feature values, users can compute the prediction score for a list of snps or candidate variants for being risk snps or non-risk snps.
```{r, message=FALSE, warning=FALSE}
prob <- predMod(features[,-dim(features)[2]], RFmodel)#estimate the probablity score using trained model after removing the label/group informatin
head(prob)#display the probablity score as positive or negative snps
```