TOO_paper_notes

citation: Li R, Liao B, Wang B, Dai C, Liang X, Tian G, Wu F. Identification of Tumor Tissue of Origin with RNA-Seq Data and Using Gradient Boosting Strategy. Biomed Res Int. 2021 Feb 17;2021:6653793. doi: 10.1155/2021/6653793. PMID: 33681364; PMCID: PMC7904362.

Abstract
- Cancer of unknown primary 
- Metastatic carcinoma 
- Tissue of origin
- Accurately inferring the tissue of origin in CUP

Methods
- Gradient boosting framework 
- 20 types of solid tumors used 
- Expression sequencing data from TCGA (The Cancer Genome Atlas) used 
- In my own version: used a 30/70 split to train and test the data with SVC 
- RNA-seq data from 79 tumor samples (from 6 cancer types) with known origins were downloaded from GEO (Gene Expression Omnibus) and used as the independent dataset in the study
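
A minimal sketch of the SVC split from my own version, not the paper's method. The note does not say which side of the 30/70 split is the training set; this assumes 70% train and 30% test. The data is synthetic (scikit-learn's `make_classification`) standing in for an expression matrix, and the linear kernel is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: 300 "samples", 20 "genes", 3 "cancer types".
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=6,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Assumption: 70% of samples for training, 30% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = SVC(kernel="linear")  # kernel choice is an assumption
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)           # fraction of test samples classified correctly
n_support = clf.support_vectors_.shape[0]      # training points kept as support vectors
print(f"test accuracy: {accuracy:.3f}, support vectors: {n_support}")
```

`stratify=y` keeps the class proportions the same in both halves, which matters when some cancer types have far fewer samples than others.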

Results
- 400 genes selected to train gradient boosting model for identification of primary site tumor 
- Overall 10-fold cross-validation accuracy of method was 96.1% across 20 types of cancer 
- While accuracy for independent data set reached 83.5%

Conclusion
- Gradient boosting framework has potential practical usage in identifying tumor tissue of origin based on training data and independent testing data 

Introduction
- CUP accounts for 3% - 5% of tumors (less than 50% of CUPs could be accurately diagnosed) 
- Many cancerous cells retain features of their primary TOOs during metastasis 
- Gene expression of metastatic cancer should be consistent with gene expression of its primary tissue 
- A gene expression profile of the tissue origin is always retained during the process of tumor occurrence, development, and metastasis 
- Differential expression (CancerTYPEID)
- Gene expression profile analysis by using microarray data provided diagnoses of cancer types with high accuracy
- Pathwork Tissue of Origin (TOO)
- formalin-fixed, paraffin-embedded (FFPE) tissues
- This method primarily included two algorithms, one for standardization and the other for classification
- RNA-seq is a high-throughput sequencing approach that sequences mRNA, small RNA, and noncoding RNA by using high-throughput sequencing technology
- "Here, we conducted an experiment to identify the tissue of origin with a gradient boosting classifier [17] and RNA-seq technique"

Materials & Methods
- ICGC Data Portal (https://dcc.icgc.org/releases/release_26/) download
- "M∗N matrix where M represents the sample size and N represents the number of genes. The matrix was generated by normalizing the expression value of each sample and each gene from TCGA"
- Gradient boosting (GBDT) is a machine learning method for regression and classification; in this study it is used for gene selection and for the final classification, with cross-validation 
- Combines multiple weak learners (usually decision trees) into a single prediction model
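
To make "combines multiple weak learners" concrete, here is a toy sketch in plain Python. It is not the paper's classifier: it does regression with squared loss on one feature, and each weak learner is a one-split decision stump fitted to the current residuals. All function names and the toy data are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single split (threshold, left value, right value) with the lowest squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    if best is None:  # all x values identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv


def fit_boosted(xs, ys, n_estimators=50, learning_rate=0.1):
    """Start from the mean, then add stumps one at a time, each fitted to the remaining error."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_estimators):
        # For squared loss the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, lr = model
    return base + lr * sum(stump_predict(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: a noisy step function.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

one_tree = fit_boosted(xs, ys, n_estimators=1)
many_trees = fit_boosted(xs, ys, n_estimators=100)
print(mse(one_tree, xs, ys), mse(many_trees, xs, ys))
```

A single stump scaled by the learning rate barely moves the prediction; the error only becomes small once many stumps are summed, which is the point of the ensemble.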

Workflow 
- Data preprocessing -> GBDT for selection of gene features -> # of genetic characteristics and feature selection -> Gradient boosting classifier for identification -> Results 

Data Prep
- RNA-seq data downloaded for 21 common cancers (20 after SKCM was removed)
- Data was then normalized with RSEM
- SKCM was taken out of the dataset because the share of metastatic cases originating from SKCM is relatively higher than for the other cancer types
- 400 genes selected for future prediction - genes were ranked by importance scores calculated by the gradient boosting algorithm (400 genes have >95% accuracy) 
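
A sketch of ranking genes by gradient boosting importance scores and keeping the top k, using scikit-learn's `GradientBoostingClassifier` and its `feature_importances_` attribute. Whether the paper used scikit-learn is an assumption. The matrix is synthetic, the gene names are placeholders, and k is 10 rather than 400 only because the toy matrix has 50 columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the expression matrix: rows are samples, columns are "genes".
X, y = make_classification(
    n_samples=200, n_features=50, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
gene_names = np.array([f"gene_{i}" for i in range(X.shape[1])])  # placeholder names

# Fit a gradient boosting model only to obtain per-gene importance scores.
ranker = GradientBoostingClassifier(n_estimators=50, random_state=0)
ranker.fit(X, y)

k = 10  # the paper keeps 400 genes
order = np.argsort(ranker.feature_importances_)[::-1]  # highest importance first
top_idx = order[:k]
top_genes = gene_names[top_idx]
X_selected = X[:, top_idx]  # reduced matrix for the final classifier

print(top_genes)
```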

Classification 
- N_estimators set to 200 (200 weak classifiers) - meaning 200 decision trees
- 10-fold cross-validation - divided the data set into 10 subsets - 9 were merged into a training set and 1 was used as the test set - average precision was 96.1% after repeating the algorithm 10 times 
- Confusion matrix made across the cancer types (true type vs. predicted type) 
- Other methods tested - K-nearest neighbor - decision tree - Adaboost - SVM
- GO enrichment analysis 
- "The F1 score is equivalent to the harmonic average of precision and [recall]. If any number of the recall and precision decreases, the F1 score will decrease"
- The testing data in this experiment was not parallel to the training data. Therefore, the results of this study might be influenced by the training method
- Ensemble learning 
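
A plain-Python sketch of three pieces from the notes above: splitting samples into 10 folds, building a confusion matrix, and computing F1 as the harmonic mean of precision and recall per class. The function names, the three cancer labels, and the predictions are made up for illustration; they are not results from the paper.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; every sample lands in the test fold exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, test_idx


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_f1(matrix):
    """F1 = 2 * precision * recall / (precision + recall), computed for each class."""
    scores = []
    for c in range(len(matrix)):
        tp = matrix[c][c]
        predicted = sum(row[c] for row in matrix)  # column total
        actual = sum(matrix[c])                    # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores


labels = ["BRCA", "LUAD", "COAD"]  # illustrative class labels
y_true = ["BRCA", "BRCA", "LUAD", "LUAD", "COAD", "COAD"]
y_pred = ["BRCA", "LUAD", "LUAD", "LUAD", "COAD", "BRCA"]

cm = confusion_matrix(y_true, y_pred, labels)
f1 = per_class_f1(cm)
splits = list(k_fold_indices(20, k=10))

print(cm)
print(f1)
```

Because F1 is a harmonic mean, a drop in either precision or recall pulls the score down, which matches the quoted sentence.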

Support Vector Machines and project plan
- A set of supervised learning methods used for classification, regression and outlier detection
- Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
- Read paper 
- Data will not be loaded into memory all at once - it will be parsed line by line - this is more efficient - ‘data structures’
- SVM - read the docs, then try a small example 
- Once complete, swap the example data for the genomic data 
- Classifiers in dictionaries 
- SVM - supervised machine learning technique - creates hyperplane to divide possible outcomes
- Extracting ICGC sample ID, Gene ID, Normalized read count from each of these files - use only one ID for all the genes associated with it
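
A sketch of the line-by-line parsing plan using only the standard library. The column names (`icgc_sample_id`, `gene_id`, `normalized_read_count`) and the tab-separated layout are assumptions about the ICGC expression files and should be checked against the real header. I read "use only one ID for all the genes" as storing each sample ID once as a dictionary key; that reading is also an assumption. The rows below are invented.

```python
import csv
import io
from collections import defaultdict


def read_expression(lines):
    """Stream rows one at a time, keeping only sample ID, gene ID, and normalized read count.

    `lines` can be any iterable of text lines, e.g. an open file handle,
    so the whole file is never held in memory as raw text.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    table = defaultdict(dict)  # sample ID -> {gene ID -> normalized read count}
    for row in reader:
        sample = row["icgc_sample_id"]
        gene = row["gene_id"]
        table[sample][gene] = float(row["normalized_read_count"])
    return table


# Invented example rows standing in for one downloaded file.
example = io.StringIO(
    "icgc_sample_id\tgene_id\tnormalized_read_count\traw_read_count\n"
    "SA1\tGENE_A\t0.5\t10\n"
    "SA1\tGENE_B\t1.5\t30\n"
    "SA2\tGENE_A\t0.25\t5\n"
)

table = read_expression(example)
print(dict(table))
```

With a real file the call would look like `with open(path, newline="") as handle: table = read_expression(handle)`, where `path` is whichever file was downloaded.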