citation: Li R, Liao B, Wang B, Dai C, Liang X, Tian G, Wu F. Identification of Tumor Tissue of Origin with RNA-Seq Data and Using Gradient Boosting Strategy. Biomed Res Int. 2021 Feb 17;2021:6653793. doi: 10.1155/2021/6653793. PMID: 33681364; PMCID: PMC7904362.
Abstract
- Cancer of unknown primary
- Metastatic carcinoma
- Tissue of origin
- Accurately inferring the tissue of origin in CUP
Methods
- Gradient boosting framework
- 20 types of solid tumors used
- Expression sequencing data from TCGA (the cancer genome atlas) used
- In my own project: 30/70 split for training and testing data, using SVC
- RNA-seq data from 79 tumor samples (6 cancer types) with known origins were downloaded from GEO (Gene Expression Omnibus) and used as the independent dataset in the study
Results
- 400 genes selected to train gradient boosting model for identification of primary site tumor
- Overall 10-fold cross-validation accuracy of the method was 96.1% across 20 types of cancer
- Accuracy on the independent data set reached 83.5%
Conclusion
- Gradient boosting framework has potential practical usage in identifying tumor tissue of origin based on training data and independent testing data
Introduction
- CUP accounts for 3%-5% of tumors (less than 50% of CUPs could be accurately diagnosed)
- Many cancerous cells retain features of their primary tissue of origin (TOO) during metastasis
- Gene expression of metastatic cancer should be consistent with gene expression of its primary tissue
- A gene expression profile of the tissue origin is always retained during the process of tumor occurrence, development, and metastasis
- Differential expression (CancerTYPEID)
- Gene expression profile analysis by using microarray data provided diagnoses of cancer types with high accuracy
- Pathwork Tissue of Origin (TOO)
- formalin-fixed, paraffin-embedded (FFPE) tissues
- This method primarily included two algorithms, one for standardization and the other for classification
- RNA-seq is a high-throughput sequencing approach that sequences mRNA, small RNA, and noncoding RNA
- "Here, we conducted an experiment to identify the tissue of origin with a gradient boosting classifier [17] and RNA-seq technique"
Materials & Methods
- Data downloaded from the ICGC Data Portal (https://dcc.icgc.org/releases/release_26/)
- "M∗N matrix where M represents the sample size and N represents the number of genes. The matrix was generated by normalizing the expression value of each sample and each gene from TCGA"
- Gradient boosting (GBDT) is a machine learning method for regression and classification; in the study it is used for gene selection and for the final classification, evaluated with cross-validation
- Combines multiple weak learners (usually decision trees) into a single prediction model
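To make the "combines multiple weak learners" idea concrete, here is a minimal sketch of gradient boosting written from scratch. It is not the paper's implementation: it assumes a regression task with squared-error loss, a single feature, and one-split decision stumps as the weak learners. All function names, the toy data, and the learning rate are my own assumptions for illustration.

```python
# Minimal gradient boosting sketch (assumptions: squared-error loss,
# one feature, decision stumps as weak learners; names are hypothetical).

def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    # The largest x is skipped because it would leave the right side empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]  # (threshold, left_value, right_value)


def predict_stump(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv


def fit_gradient_boosting(xs, ys, n_estimators=200, learning_rate=0.1):
    """Start from the mean, then add stumps fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, lr = model
    return base + lr * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data: a step-like signal that one weak learner cannot fit well alone.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

one_tree = fit_gradient_boosting(xs, ys, n_estimators=1)
many_trees = fit_gradient_boosting(xs, ys, n_estimators=200)
print(mse(one_tree, xs, ys), mse(many_trees, xs, ys))
```

Each stump on its own is a poor model, but summing many small corrections drives the training error down. A classifier version replaces the squared-error residual with the gradient of a classification loss.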
Workflow
- Data preprocessing -> GBDT for selection of gene features -> number of genetic characteristics and feature selection -> Gradient boosting classifier for identification -> Results
Data Prep
- RNA-seq data downloaded for 21 common cancers (20 after SKCM was removed)
- Data was then normalized with RSEM
- SKCM was removed from the dataset because the proportion of metastatic cases originating from SKCM is relatively higher than for the other cancer types
- 400 genes selected for future prediction; genes were ranked by importance scores calculated by the gradient boosting algorithm (400 genes give >95% accuracy)
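The ranking step above can be sketched as follows, assuming the importance scores are already available as a mapping from gene ID to score. The gene names, scores, and helper names here are made up for illustration; the paper keeps the top 400 genes, while this toy example keeps the top 2.

```python
# Sketch of "rank genes by importance, keep the top k".
# Assumption: importance scores were already computed by a fitted model.

def select_top_genes(importances, k):
    """Return the k gene IDs with the highest importance scores."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]


def subset_matrix(matrix, gene_ids, selected):
    """Keep only the selected gene columns of a samples-by-genes matrix."""
    cols = [gene_ids.index(g) for g in selected]
    return [[row[c] for c in cols] for row in matrix]


# Hypothetical scores and a tiny 2-sample x 4-gene expression matrix.
importances = {"GENE_A": 0.05, "GENE_B": 0.40, "GENE_C": 0.10, "GENE_D": 0.45}
gene_ids = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
matrix = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
]

top = select_top_genes(importances, k=2)
reduced = subset_matrix(matrix, gene_ids, top)
print(top)      # ['GENE_D', 'GENE_B']
print(reduced)  # [[4.0, 2.0], [8.0, 6.0]]
```

The reduced matrix is what would then be passed to the final classifier.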
Classification
- n_estimators set to 200, meaning 200 decision trees (weak classifiers)
- 10-fold cross-validation: the data set was divided into 10 subsets, 9 merged into a training set and 1 used as the test set; average accuracy was 96.1% after repeating the procedure 10 times
- Confusion matrix made across the cancer types
- Other methods tested - K-nearest neighbor - decision tree - Adaboost - SVM
- GO enrichment analysis
- "The F1 score is equivalent to the harmonic average of [recall] and precision. If any number of the recall and precision decreases, the F1 score will decrease"
- The training data in this experiment was not parallel to the testing data. Therefore, the results of this study might be influenced by the training method
- Ensemble learning
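The evaluation ideas in this section (10-fold splitting, a confusion matrix, and the F1 score as the harmonic mean of precision and recall) can be sketched in plain Python. This is only an illustration under my own assumptions: the fold assignment is a simple round-robin rather than whatever the paper used, and the labels and predictions are invented.

```python
from collections import Counter


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx); every sample is in a test fold exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall, and their harmonic mean (F1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented example: two cancer-type labels, one misclassified sample.
labels = ["BRCA", "LUAD"]
y_true = ["BRCA", "BRCA", "BRCA", "LUAD", "LUAD", "LUAD"]
y_pred = ["BRCA", "BRCA", "LUAD", "LUAD", "LUAD", "LUAD"]

cm = confusion_matrix(y_true, y_pred, labels)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, "BRCA")
splits = list(k_fold_indices(20, k=10))

print(cm)                     # [[2, 1], [0, 3]]
print(precision, recall, f1)
print(len(splits))            # 10
```

For BRCA, precision is 1.0 and recall is 2/3, so F1 is 0.8. Lowering either precision or recall lowers F1, which matches the quoted statement.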
Support Vector Machines and project plan
- set of supervised learning methods used for classification, regression and outliers detection
- Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
- Read paper
- Data will not be loaded into memory all at once; it will be parsed line by line, which is more memory efficient (‘data structures’)
- SVM - read doc then use small example
- Once complete switch out data with genomic data
- Classifiers in dictionaries
- SVM - supervised machine learning technique - creates a hyperplane to separate the classes
- Extracting ICGC sample ID, Gene ID, Normalized read count from each of these files - use only one ID for all the genes associated with it
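A rough sketch of the line-by-line parsing plan, assuming the expression files are tab-separated with a header row. The column names `icgc_sample_id`, `gene_id`, and `normalized_read_count` are my assumption about the file layout and should be checked against the actual download; the sample rows and helper names are invented.

```python
import csv
import io
from collections import defaultdict


def stream_expression(lines):
    """Yield (sample_id, gene_id, value) one row at a time, without loading the file."""
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        # Column names are assumed; adjust to the real header.
        yield (
            row["icgc_sample_id"],
            row["gene_id"],
            float(row["normalized_read_count"]),
        )


def build_profiles(records):
    """Group values so each sample ID appears once, with all of its genes under it."""
    profiles = defaultdict(dict)
    for sample_id, gene_id, value in records:
        profiles[sample_id][gene_id] = value
    return profiles


# Invented stand-in for a real file; extra columns are simply ignored.
sample_file = io.StringIO(
    "icgc_sample_id\tgene_id\tnormalized_read_count\traw_read_count\n"
    "SA1\tTP53\t0.5\t120\n"
    "SA1\tBRCA1\t0.25\t60\n"
    "SA2\tTP53\t0.75\t180\n"
)

profiles = build_profiles(stream_expression(sample_file))
print(dict(profiles))
```

With a real file, the same functions would be called on a handle from `open(path, newline="")`, so only one row is held in memory during parsing. The resulting dictionary of dictionaries is one possible way to keep the classifier inputs in dictionaries.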