forked from liulab-dfci/RIMA
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path08-HLA.Rmd
95 lines (75 loc) · 9.69 KB
/
08-HLA.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
# HLA Typing {#HLA}
Tumor mutations can result in altered proteins, which might act as neoantigens and potentially elicit an immune response. Algorithms, such as those available in <a href="https://github.com/griffithlab/pVACtools">pVAC-Tools</a> can be used to predict neoantigen peptides.
In order to elicit an adaptive immune response, a peptide must be presented on a cell's surface bound to major histocompatibility (MHC) protein complexes. The human leukocyte antigen (HLA) complex on chromosome 6 encodes most of the proteins that make up the human MHCs. The HLA complex is highly polymorphic and encodes MHCs that have different propensities to present different peptides. Two major classes of human MHC, class I and class II, (encoded by HLA-I and HLA-II alleles, respectively) are involved in antigen presentation. MHC class I protein complexes are expressed on all normal cells and present internally degraded proteins, while MHC class II protein complexes are normally expressed on professional antigen-presenting cells and present processed external antigens. The antigen binding portions of the MHC class I protein complexes are encoded by the HLA-I genes HLA-A, HLA-B and HLA-C. The antigen binding portions of the MHC class II proteins are encoded by alpha (A) and beta (B) genes in the DP, DQ and DR regions of the HLA complexes. These HLA-II genes include HLA-DQA1, HLA-DQB1 and HLA-DRB1.
Neoantigen prediction algorithms usually integrate peptide/MHC binding predictions using tools such as <a href="https://services.healthtech.dtu.dk/service.php?NetMHCpan-4.1">NetMHCpan</a>, <a href= "https://github.com/openvax/mhcflurry">MHCflurry</a>, and <a href="https://pubmed.ncbi.nlm.nih.gov/17608956/">SMMalign</a>. Knowing the HLA type is necessary to identify potential neoantigens for targeted immunotherapy. Current alignment-based HLA typing methods require DNA or RNA sequencing inputs and predict HLA-I class only or both of HLA-I and HLA-II classes. For example, <a href="https://github.com/RabadanLab/arcasHLA">arcasHLA</a>, <a href="https://github.com/ExpressionAnalysis/HLAProfiler">HLAProfiler</a>, and <a href="https://github.com/TRON-Bioinformatics/seq2HLA">seq2HLA</a> have been developed to perform high-solution HLA typing from RNA-seq data. <a href="https://github.com/FRED-2/OptiType">OptiType</a> and <a href="https://pubmed.ncbi.nlm.nih.gov/29858810/">PHLAT</a> are other HLA identification tools for RNA, whole-exome, and whole-genome sequencing data. HLA reference sequences can be obtained from the <a href="http://www.imgt.org/">ImMunoGeneTics (IMGT) database</a>.
## Identify HLA type
HLA Typing is a part of the neoantigen prediction module of RIMA.
RIMA uses <a href="https://github.com/RabadanLab/arcasHLA">arcasHLA</a> to predict HLA types for both MHC Class I & Class II from the bulk RNA-seq data. The sorted alignment BAM files generated by STAR are used for input to arcasHLA. Here we use one sample from the Zhao trial as an example. We first extract the HLA reads from the alignment file:
```
### Extract fastq reads
arcasHLA extract analysis/STAR/SRR8281218/SRR8281218.sorted.bam -t 16 -v --sample SRR8281218 -o analysis/neoantigen/SRR8281218
### Output from extraction
analysis/neoantigen/SRR8281218/SRR8281218.extracted.1.fq.gz
analysis/neoantigen/SRR8281218/SRR8281218.extracted.2.fq.gz
```
Then we identify the HLA alleles using the extracted reads:
```
arcasHLA genotype analysis/neoantigen/SRR8281218/SRR8281218.extracted.1.fq.gz analysis/neoantigen/SRR8281218/SRR8281218.extracted.2.fq.gz -g A,B,C,DQA1,DQB1,DRB1 -t 16 -v -o analysis/neoantigen/SRR8281218
###Output from extraction
analysis/neoantigen/SRR8281218/SRR8281218.genotype.json
cat analysis/neoantigen/SRR8281218/SRR8281218.genotype.json
###
{"A": ["A*26:01:01", "A*03:01:01"], "B": ["B*35:01:01", "B*07:02:01"], "C": ["C*07:02:01", "C*04:01:01"], "DQA1": ["DQA1*02:01:01", "DQA1*03:01:01"], "DQB1": ["DQB1*03:02:01"], "DRB1": ["DRB1*04:02:01", "DRB1*07:01:01"]}
###
```
**Merge individual HLAs**
RIMA also merges the individual HLA results from arcasHLA into a summary file named 'genotypes.tsv':
```
subject A1 A2 B1 B2 C1 C2 DQA11 DQA12 DQB11 DQB12 DRB11 DRB12
SRR8281238 A*01:01:01 A*02:01:01 B*35:01:01 B*08:01:01 C*07:01:01 C*04:01:01 DQA1*05:01:01 DQA1*01:01:01 DQB1*02:01:01 DQB1*05:01:01 DRB1*01:01:01 DRB1*03:01:01
SRR8281233 A*01:01:01 A*02:01:01 B*57:01:01 B*44:02:01 C*05:01:01 C*06:02:01 DQA1*01:02:01 DQA1*03:01:01 DQB1*06:02:01 DQB1*03:02:01 DRB1*04:01:01 DRB1*15:01:01
SRR8281236 A*33:01:01 A*24:02:01 B*14:02:01 B*15:01:01 C*06:02:01 C*08:02:01 DQA1*01:02:02 DQA1*03:01:01 DQB1*03:02:01 DQB1*05:02:01 DRB1*04:03:01 DRB1*16:02:01
SRR8281243 A*01:01:01 A*24:02:01 B*35:02:01 B*41:01:01 C*04:01:01 C*17:01:01 DQA1*01:02:01 DQA1*01:05:01 DQB1*05:01:01 DQB1*06:09:01 DRB1*10:01:01 DRB1*13:02:01
SRR8281251 A*24:02:01 A*01:01:01 B*35:02:01 B*41:01:01 C*04:01:01 C*17:01:01 DQA1*01:02:01 DQA1*01:05:01 DQB1*05:01:01 DQB1*06:09:01 DRB1*10:01:01 DRB1*13:02:01
SRR8281230 A*01:01:01 A*02:01:01 B*57:01:01 B*44:02:01 C*05:01:01 C*06:02:01 DQA1*03:01:01 DQA1*01:02:01 DQB1*03:02:01 DQB1*06:02:01 DRB1*04:01:01 DRB1*15:01:01
SRR8281250 A*01:01:01 A*02:01:01 B*35:01:01 B*08:01:01 C*07:01:01 C*04:01:01 DQA1*05:01:01 DQA1*01:01:01 DQB1*02:01:01 DQB1*05:01:01 DRB1*01:01:01 DRB1*03:01:01
SRR8281244 A*25:01:01 A*02:01:01 B*18:01:01 B*08:01:01 C*07:02:01 C*12:03:01 DQA1*01:02:01 DQA1*01:02:01 DQB1*06:02:01 DQB1*06:02:01 DRB1*15:01:01 DRB1*15:01:01
SRR8281218 A*26:01:01 A*03:01:01 B*35:01:01 B*07:02:01 C*07:02:01 C*04:01:01 DQA1*02:01:01 DQA1*03:01:01 DQB1*03:02:01 DQB1*03:02:01 DRB1*04:02:01 DRB1*07:01:01
```
## HLA Oncoplot
To better visualize and compare the HLA distribution in different samples, we use arcasHLA to convert HLA results to P-group format. The P-group format shows the alleles which share the same amino acid sequence in the antigen-binding region. (eg. A*01:01P, 'P' indicates p-group nomenclature)
```
###
#HLA summary file: genotypes.tsv
arcasHLA convert -r p-group genotypes.tsv -o genotypes.p-group.tsv
###
```
```
subject A1 A2 B1 B2 C1 C2 DQA11 DQA12 DQB11 DQB12 DRB11 DRB12
SRR8281233 A*01:01P A*02:01P B*57:01P B*44:02P C*05:01P C*06:02P DQA1*01:02P DQA1*03:01P DQB1*06:02P DQB1*03:02P DRB1*04:01P DRB1*15:01P
SRR8281236 A*33:01P A*24:02P B*14:02P B*15:01P C*06:02P C*08:02P DQA1*01:02P DQA1*03:01P DQB1*03:02P DQB1*05:02P DRB1*04:03P DRB1*16:02P
SRR8281243 A*01:01P A*24:02P B*35:02P B*41:01P C*04:01P C*17:01P DQA1*01:02P DQA1*01:01P DQB1*05:01P DQB1*06:09P DRB1*10:01P DRB1*13:02P
SRR8281244 A*25:01P A*02:01P B*18:01P B*08:01P C*07:02P C*12:03P DQA1*01:02P DQA1*01:02P DQB1*06:02P DQB1*06:02P DRB1*15:01P DRB1*15:01P
SRR8281250 A*01:01P A*02:01P B*35:01P B*08:01P C*07:01P C*04:01P DQA1*05:01P DQA1*01:01P DQB1*02:01P DQB1*05:01P DRB1*01:01P DRB1*03:01P
SRR8281219 A*03:01P A*26:01P B*35:01P B*07:02P C*07:02P C*04:01P DQA1*02:01P DQA1*03:01P DQB1*03:02P DQB1*03:03P DRB1*04:02P DRB1*07:01P
SRR8281245 A*25:01P A*02:01P B*18:01P B*08:01P C*07:02P C*12:03P DQA1*01:02P DQA1*01:02P DQB1*06:02P DQB1*06:02P DRB1*15:01P DRB1*15:01P
SRR8281230 A*01:01P A*02:01P B*57:01P B*44:02P C*05:01P C*06:02P DQA1*03:01P DQA1*01:02P DQB1*06:02P DQB1*03:02P DRB1*04:01P DRB1*15:01P
SRR8281251 A*24:02P A*01:01P B*35:02P B*41:01P C*04:01P C*17:01P DQA1*01:02P DQA1*01:01P DQB1*05:01P DQB1*06:09P DRB1*10:01P DRB1*13:02P
SRR8281218 A*26:01P A*03:01P B*35:01P B*07:02P C*07:02P C*04:01P DQA1*02:01P DQA1*03:01P DQB1*03:02P DQB1*03:02P DRB1*04:02P DRB1*07:01P
SRR8281238 A*01:01P A*02:01P B*35:01P B*08:01P C*07:01P C*04:01P DQA1*05:01P DQA1*01:01P DQB1*02:01P DQB1*05:01P DRB1*01:01P DRB1*03:01P
SRR8281226 A*33:01P A*24:02P B*14:02P B*15:01P C*06:02P C*08:02P DQA1*03:01P DQA1*01:02P DQB1*05:02P DQB1*03:02P DRB1*04:03P DRB1*16:02P
```
RIMA generates an HLA oncoplot which shows the most frequent HLA alleles across patient cohorts and the expression level of HLA genes (HLA-A, HLA-B, HLA-C, HLA-DQA1, HLA-DQB1, and HLA-DRB1). The percentages on the left of the heatmap represent the frequency of the allele among samples. The colors in the heatmap indicate which allele (e.g. A = HLA-A) and whether the allele was found in the first (A1), the second (A2) or both copies of the patient's gene. The colored bars to the right indicate the absolute number of alleles found. The x axis represents patient samples. The samples are sorted by group and then by the expression of HLA genes within the group. The HLA expression shown on the top represents the mean TPM of HLA genes in each sample.
```{r}
#load HLA p-group summary file
hla <- read.table("genotypes.p-group.tsv", header = TRUE, sep = "\t")
#load metasheet
meta <- read.table("metasheet.csv", sep = ",", header = TRUE, row.names = 1)
#load gene expression matrix
exprn <- read.table("tpm_convertID.txt", header = TRUE, row.names = 1, sep = ",", check.names = FALSE)
source("hla_oncoplot.R")
#group indicates which columns are used for hla comparison
index <- c("Responder", "Gender")
p <- hla_oncoplot(hla, exprn, meta, groups = index)
print(p)
```