Details of preprocessing steps for data #1

Open
jaxensmith opened this issue Jun 24, 2024 · 3 comments

Comments

@jaxensmith

Hello! Great tool you have developed.

I was curious about the specific preprocessing steps for the proteomic data. For instance, CPTAC LUAD in the tool seems to contain 11,485 proteins, whereas the original data had 12,400+. What thresholds were used for missing values, what imputation strategy was used, and why?

Thanks.

@WangJin93
Owner

Hello!

Thank you for your interest in our work and for being the first to raise this question. Indeed, data sources are crucial. I have verified the issue you mentioned, but it seems there is a discrepancy with the information you provided.

I downloaded the file CPTAC3_Lung_Adeno_Carcinoma_Proteome.tmt10.tsv, which contains only 11,032 protein entries, so I am not sure where the 11,485 proteins you mentioned came from. Additionally, at this link https://proteomic.datacommons.cancer.gov/pdc/analysis/f1c59a53-ab7c-11e9-9a07-0a80fada099c?StudyName=CPTAC%20LUAD%20Discovery%20Study%20-%20Proteome we can view the heatmap for this dataset, which includes 11,029 protein entries. The original data used by our tool is entirely consistent with the data in this heatmap.

@jaxensmith
Author

jaxensmith commented Jun 25, 2024

Hi, thanks for the quick reply.

I have attached the CPTAC proteomic data for tumour samples, which contains data for 12,433 proteins. Also, the file you provided contains many missing values. So, what imputation strategy is employed in your workflow?

https://pdc.cancer.gov/pdc/cptac-pancancer under proteome
LUAD_proteomics_gene_abundance_log2_reference_intensity_normalized_Tumor.txt
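
For anyone comparing the two files, here is a minimal sketch of how the protein count and the per-protein fraction of missing values could be tallied. It is only an illustration: the inline table, the sample names, and the set of missing-value markers are assumptions, not the contents of the actual CPTAC file.

```python
import csv
import io

# Hypothetical stand-in for a gene-by-sample abundance table;
# the real file is not reproduced here.
SAMPLE = (
    "gene\tS1\tS2\tS3\n"
    "TP53\t0.12\tNA\t-0.40\n"
    "EGFR\t1.05\t0.98\t1.10\n"
    "KRAS\tNA\tNA\t0.33\n"
)

# Assumed markers for a missing value; adjust to the file at hand.
MISSING = {"", "NA", "NaN"}


def missing_summary(text):
    """Return {protein: fraction of samples with a missing value}."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    n_samples = len(header) - 1  # first column holds the identifier
    summary = {}
    for row in reader:
        values = row[1:]
        n_missing = sum(v.strip() in MISSING for v in values)
        summary[row[0]] = n_missing / n_samples
    return summary


summary = missing_summary(SAMPLE)
print(len(summary), "proteins")
for gene, fraction in summary.items():
    print(gene, round(fraction, 2))
```

Running the same tally on both files would show whether the difference in protein counts comes from a missing-value filter or from a different source table.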

@WangJin93
Owner

Hi,
I couldn't find the data you provided on the CPTAC website. Where did you get it from? I noticed that there are also many missing values in that file. As I mentioned earlier, the data I used is the raw data downloaded from the CPTAC website. For proteomics data, CPTAC provides log2 (Unshared peptide) and log2 (Shared peptide) values, and the tool uses the log2 (Unshared peptide) values. The protein expression levels are kept to 5 decimal places. This data is exactly the same as the data in the heatmap provided by the CPTAC website, and I did not perform any additional processing, so no missing-value filtering or imputation was applied. Please click on this link to view the data
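
As a rough illustration of the selection described above (keep only the unshared-peptide log2 values, keep 5 decimal places, leave missing values untouched), a sketch could look like the following. The column names such as `S1 Unshared Log Ratio`, the `Gene` identifier column, and the inline data are assumptions for the example and may not match the real file layout; this is not the tool's actual code.

```python
import csv
import io

# Hypothetical excerpt; real column names and values may differ.
SAMPLE = (
    "Gene\tS1 Log Ratio\tS1 Unshared Log Ratio\tS2 Log Ratio\tS2 Unshared Log Ratio\n"
    "TP53\t0.1234567\t0.1111119\t-0.5\t-0.4999999\n"
    "EGFR\t1.0\t\t0.9\t0.8765432\n"
)


def unshared_matrix(text, suffix=" Unshared Log Ratio", digits=5):
    """Keep only unshared-peptide columns and round values to `digits` places.

    Empty cells are kept as None, i.e. nothing is imputed.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    columns = [c for c in reader.fieldnames if c.endswith(suffix)]
    matrix = {}
    for row in reader:
        matrix[row["Gene"]] = {
            # Strip the suffix so the key is just the sample name.
            c[: -len(suffix)]: (round(float(row[c]), digits) if row[c] else None)
            for c in columns
        }
    return matrix


matrix = unshared_matrix(SAMPLE)
for gene, values in matrix.items():
    print(gene, values)
```

Missing cells stay missing in this sketch, which matches the statement that no additional processing was performed.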
