pieterlukasse edited this page Mar 23, 2016 · 199 revisions

Introduction

This page describes the file formats that cancer study data should assume in order to be successfully imported into the database. Unless otherwise noted, all data files are in tabular-TSV (tab separated value) format and have an associated metadata file which is in a multiline record format. The metadata and data files should follow a few rules documented in the Data Loading page.

Formats

Cancer Study

As described in the Data Loading tool page, the following file is needed to describe the cancer study:

Meta file

This file contains metadata about the cancer study. The file contains the following fields:

  1. type_of_cancer: The cancer type abbreviation, e.g., "brca". This should be the same cancer type as specified in the meta_cancer_type.txt file, if available.
  2. cancer_study_identifier: A string used to uniquely identify this cancer study within the database, e.g., "brca_joneslab_2013".
  3. name: The name of the cancer study, e.g., "Breast Cancer (Jones Lab 2013)".
  4. description: A description of the cancer study, e.g., "Comprehensive profiling of 103 breast cancer samples. Generated by the Jones Lab 2013". This description may contain one or more URLs to relevant information.
  5. citation (optional): A relevant citation, e.g., "TCGA, Nature 2012".
  6. pmid (optional): A relevant pubmed id.
  7. short_name: A short name used for display on various web pages within the cBioPortal, e.g., "BRCA (Jones)".
  8. groups (optional): When using an authenticating cBioPortal, lists the user groups that are allowed access to this study. Multiple groups are separated with a semicolon ";", e.g., "PUBLIC;GDAC;SU2C-PI3K". The study will be invisible to users not in at least one of the listed groups, as if it wasn't loaded at all. See User-Authorization for more information on groups.
  9. add_global_case_list (optional): Set to 'true' if you would like the "All samples" case list to be generated automatically for you. See also Case lists.
Example

An example meta_study.txt file would be:

type_of_cancer: brca
cancer_study_identifier: brca_joneslab_2013
name: Breast Cancer (Jones Lab 2013)
description: Comprehensive profiling of 103 breast cancer samples. Generated by the Jones Lab 2013.
add_global_case_list: true
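
The multiline "field: value" layout of a meta file is simple enough to check before running the importer. The following is a minimal sketch, not the importer's own parser. The function names are hypothetical, and the set of required fields is an assumption read off the "(optional)" markers in the field list above; under that reading the example file, which has no short_name, is reported as incomplete.

```python
# Assumption: every field not marked "(optional)" in the list above is required.
REQUIRED_STUDY_FIELDS = {
    "type_of_cancer",
    "cancer_study_identifier",
    "name",
    "description",
    "short_name",
}


def parse_meta(text):
    """Parse 'field: value' lines into a dict, splitting on the first colon only."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'field: value' line: {line!r}")
        fields[key.strip()] = value.strip()
    return fields


def missing_study_fields(fields):
    """Return the required study fields that are absent, sorted by name."""
    return sorted(REQUIRED_STUDY_FIELDS.difference(fields))
```

Splitting on the first colon only keeps values that themselves contain colons, such as a URL in the description, intact.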

Cancer Type

If the type_of_cancer specified in the meta_study.txt does not yet exist in the type_of_cancer database table, a meta_cancer_type.txt file is also mandatory.

Meta file

The file is comprised of the following fields:

  1. genetic_alteration_type: CANCER_TYPE
  2. datatype: CANCER_TYPE
  3. data_filename: <your datafile>
Example

An example meta_cancer_type.txt file would be:

genetic_alteration_type: CANCER_TYPE
datatype: CANCER_TYPE
data_filename: cancer_type.txt

Data file

The file is comprised of the following columns in the order specified:

  1. type_of_cancer: The cancer type abbreviation, e.g., "brca".
  2. name: The name of the cancer type, e.g., "Breast Invasive Carcinoma".
  3. clinical_trial_keywords: A comma separated list of keywords used to help associate clinical trial data with this cancer study, e.g., "breast,breast invasive".
  4. dedicated_color: The color associated with this cancer study, e.g., "HotPink". We follow the awareness ribbons color schema. This color is associated with the cancer study on various web pages within the cBioPortal.
  5. parent_type_of_cancer: The type_of_cancer field of the cancer type of which this is a subtype, e.g., "Breast".
Example

An example record would be:

brca<TAB>Breast Invasive Carcinoma<TAB>breast,breast invasive<TAB>HotPink<TAB>Breast

Clinical Data

The clinical data is used to capture both clinical attributes and the mapping between patient and sample ids. The software supports multiple samples per patient.

As of March 2016, the clinical file is split into a patient file and a sample file. The sample file is required, whereas the patient file is optional.

Meta files

The two clinical metadata files (or just one metadata file if you choose to leave the patient file out) have to contain the following fields:

  1. cancer_study_identifier: same value specified in meta_study.txt
  2. genetic_alteration_type: CLINICAL
  3. datatype: PATIENT_ATTRIBUTES or SAMPLE_ATTRIBUTES
  4. data_filename: <your datafile>
Examples

An example metadata file, e.g. named meta_clinical_sample.txt, would be:

cancer_study_identifier: brca_tcga_pub
genetic_alteration_type: CLINICAL
datatype: SAMPLE_ATTRIBUTES
data_filename: data_clinical_samples.txt

An example metadata file, e.g. named meta_clinical_patient.txt, would be:

cancer_study_identifier: brca_tcga_pub
genetic_alteration_type: CLINICAL
datatype: PATIENT_ATTRIBUTES
data_filename: data_clinical_patients.txt

Data files

For both patients and samples, the clinical data file is a two dimensional matrix with multiple clinical attributes. When the attributes are defined in the patient file they are considered to be patient attributes; when they are defined in the sample file they are considered to be sample attributes.

The first four rows of the clinical data file contain tab-delimited metadata about the clinical attributes. These rows have to start with a '#' symbol. Each of these four rows contains a different type of information about the attributes that are defined in the fifth row:

  • Row 1: The attribute Display Names: The display name for each clinical attribute
  • Row 2: The attribute Descriptions: Long(er) description of each clinical attribute
  • Row 3: The attribute Datatype: The datatype of each clinical attribute (must be one of: STRING, NUMBER, BOOLEAN)
  • Row 4: The attribute Priority: A number which indicates the importance of each attribute. In the future, higher priority attributes will appear in more prominent places than lower priority ones on relevant pages (such as the Study View). A lower number indicates a higher priority.

Example metadata rows

Below is an example of the first 4 rows with the respective metadata for the attributes defined in the 5th row.

#Patient Identifier<TAB>Overall Survival Status<TAB>Overall Survival (Months)<TAB>Disease Free Status<TAB>Disease Free (Months)<TAB>...
#Patient identifier<TAB>Overall survival status<TAB>Overall survival in months since diagnosis<TAB>Disease free status<TAB>Disease free in months since treatment<TAB>...
#STRING<TAB>STRING<TAB>NUMBER<TAB>STRING<TAB>NUMBER<TAB>...
#1<TAB>1<TAB>1<TAB>1<TAB>1<TAB>...
PATIENT_ID<TAB>OS_STATUS<TAB>OS_MONTHS<TAB>DFS_STATUS<TAB>DFS_MONTHS<TAB>...
....
data - see examples below
....

Following the metadata rows comes a tab delimited list of clinical attributes (column headers). The sixth row is the first row to contain actual data.
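
The five header rows can be read into one record per attribute, which makes it easy to check the datatype and priority rows before loading. This is a minimal sketch under stated assumptions: the function name and the output dictionary keys are hypothetical, and the rows are assumed to hold exactly one tab-separated value per attribute.

```python
VALID_DATATYPES = {"STRING", "NUMBER", "BOOLEAN"}


def read_clinical_header(lines):
    """Combine the four '#' metadata rows and the column-header row per attribute."""
    meta_rows = []
    for line in lines[:4]:
        if not line.startswith("#"):
            raise ValueError("the first four rows must start with '#'")
        # Drop the leading '#' and split on tabs.
        meta_rows.append(line[1:].rstrip("\n").split("\t"))
    attribute_ids = lines[4].rstrip("\n").split("\t")
    if any(len(row) != len(attribute_ids) for row in meta_rows):
        raise ValueError("every metadata row needs one value per attribute")
    names, descriptions, datatypes, priorities = meta_rows
    attributes = {}
    for i, attr_id in enumerate(attribute_ids):
        if datatypes[i] not in VALID_DATATYPES:
            raise ValueError(f"invalid datatype for {attr_id}: {datatypes[i]}")
        attributes[attr_id] = {
            "display_name": names[i],
            "description": descriptions[i],
            "datatype": datatypes[i],
            "priority": int(priorities[i]),
        }
    return attributes
```

Data rows start at the sixth line, so they can be read from lines[5:] once the header has been validated.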

The patient file

The file containing the patient attributes has one required column:

  • PATIENT_ID (required): a unique patient ID.

The following columns are used by the study view as well as the patient view. In the study view they are used to create the survival plots. In the patient view they are used to add information to the header (http://www.cbioportal.org/case.do?cancer_study_id=lgg_ucsf_2014&case_id=P05).

  • OS_STATUS: Overall patient survival status
    • Possible values: DECEASED, LIVING
    • In the patient view, LIVING creates a green label, DECEASED a red label.
    • In visualisation of Timeline data, DECEASED will result in a new event of type STATUS
  • OS_MONTHS (required if OS_STATUS is DECEASED): Overall survival in months since initial diagnosis
  • DFS_STATUS: Disease free status since initial treatment
    • Possible values: DiseaseFree, Recurred/Progressed
    • In the patient view, DiseaseFree creates a green label, Recurred/Progressed a red label.
  • DFS_MONTHS: Disease free (months) since initial treatment

Optional attributes:

  • Other Clinical Attribute Headers: Clinical attribute headers are free-form. You can add any additional clinical attribute and cBioPortal will add them to the database. Be sure to provide the correct 'Datatype', as described above, for optimal search, sorting, filtering (in clinical data tab) and display.
Example patient data file
#Patient Identifier<TAB>Overall Survival Status<TAB>Overall Survival (Months)<TAB>Disease Free Status<TAB>Disease Free (Months)<TAB>...
#Patient identifier<TAB>Overall survival status<TAB>Overall survival in months since diagnosis<TAB>Disease free status<TAB>Disease free in months since treatment<TAB>...
#STRING<TAB>STRING<TAB>NUMBER<TAB>STRING<TAB>NUMBER<TAB>...
#1<TAB>1<TAB>1<TAB>1<TAB>1<TAB>...
PATIENT_ID<TAB>OS_STATUS<TAB>OS_MONTHS<TAB>DFS_STATUS<TAB>DFS_MONTHS<TAB>...
PATIENT_ID_1<TAB>DECEASED<TAB>17.97<TAB>Recurred/Progressed<TAB>30.98<TAB>...
PATIENT_ID_2<TAB>LIVING<TAB>63.01<TAB>DiseaseFree<TAB>63.01<TAB>...
...

The samples file

The file containing the sample attributes has two required columns:

  • PATIENT_ID (required): A patient ID.
  • SAMPLE_ID (required): A sample ID.

By adding PATIENT_ID here, cBioPortal will map the given sample to this patient. This makes it possible to associate multiple samples with one patient. For example, a single patient may have had multiple biopsies, each of which has been genomically profiled. See this example for a patient with multiple samples.

The following columns are required if you want the pan-cancer summary statistics tab in a pan-cancer study:

  • CANCER_TYPE: Cancer Type
  • CANCER_TYPE_DETAILED: Cancer Type Detailed, a sub-type of the specified CANCER_TYPE

The following columns affect the header of the patient view by adding text to the samples:

  • KNOWN_MOLECULAR_CLASSIFIER
  • GLEASON_SCORE
  • GLEASON_SCORE_1 and GLEASON_SCORE_2: if both are defined, overwrites GLEASON_SCORE
  • HISTOLOGY
  • TUMOR_STAGE_2009
  • TUMOR_GRADE
  • ETS/RAF/SPINK1_STATUS
  • TMPRSS2-ERG_FUSION_STATUS
  • ERG-FUSION_ACGH
  • SERUM_PSA
  • DRIVER_MUTATIONS

The following columns affect the Timeline data visualization:

  • OTHER_SAMPLE_ID: sometimes the timeline data (see the timeline data section) will not have the SAMPLE_ID but instead an alias to the sample (in the field SPECIMEN_REFERENCE_NUMBER). To ensure that the timeline data field SPECIMEN_REFERENCE_NUMBER is correctly linked to this sample, be sure to add this column OTHER_SAMPLE_ID as an attribute to your sample attributes file.
  • SAMPLE_TYPE: gives sample icon in the timeline a color.
    • If set to recurrence, progressed, progression or recurred: orange
    • If set to metastatic or metastasis: red
    • Otherwise: black
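
The SAMPLE_TYPE colour rule above amounts to a small lookup. The sketch below only illustrates the rule and is not the portal's code; the function name is hypothetical, and matching the value case-insensitively is an assumption.

```python
def timeline_sample_color(sample_type):
    """Return the timeline icon colour for a SAMPLE_TYPE value."""
    # Assumption: values are compared case-insensitively, ignoring surrounding spaces.
    value = (sample_type or "").strip().lower()
    if value in {"recurrence", "progressed", "progression", "recurred"}:
        return "orange"
    if value in {"metastatic", "metastasis"}:
        return "red"
    return "black"
```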

Optional attributes

  • Other Clinical Attribute Headers: Clinical attribute headers are free-form. You can add any additional clinical attribute you have tracked and cBioPortal will add them to the database. Be sure to provide the correct 'Datatype', as described above (for the header lines), for optimal search, sorting, filtering (in clinical data tab) and display.
Example sample data file
#Patient Identifier<TAB>Sample Identifier<TAB>Subtype<TAB>...
#Patient identifier<TAB>Sample Identifier<TAB>Subtype description<TAB>...
#STRING<TAB>STRING<TAB>STRING<TAB>...
#1<TAB>1<TAB>1<TAB>...
PATIENT_ID<TAB>SAMPLE_ID<TAB>SUBTYPE<TAB>...
PATIENT_ID_1<TAB>SAMPLE_ID_1<TAB>basal-like<TAB>...
PATIENT_ID_2<TAB>SAMPLE_ID_2<TAB>Her2 enriched<TAB>...
...

Discrete Copy Number Data

The discrete copy number data file contains values derived from copy-number analysis algorithms like GISTIC or RAE. GISTIC can be installed or run online using the GISTIC 2.0 module on GenePattern. For some help on using GISTIC, check the Data Loading Tips and Best Practices page.

Meta file

The meta file is comprised of the following fields:

  1. cancer_study_identifier: same value as specified in study meta file
  2. genetic_alteration_type: COPY_NUMBER_ALTERATION
  3. datatype: DISCRETE
  4. stable_id: gistic, cna, cna_rae or cna_consensus
  5. show_profile_in_analysis_tab: true
  6. profile_name: A name for the discrete copy number data, e.g., "Putative copy-number alterations from GISTIC"
  7. profile_description: A description of the copy number data, e.g., "Putative copy-number from GISTIC 2.0. Values: -2 = homozygous deletion; -1 = hemizygous deletion; 0 = neutral / no change; 1 = gain; 2 = high level amplification."
  8. data_filename: <your datafile>
Example

An example metadata file could be named meta_CNA.txt and its contents could be:

cancer_study_identifier: brca_tcga_pub
genetic_alteration_type: COPY_NUMBER_ALTERATION
datatype: DISCRETE
stable_id: gistic
show_profile_in_analysis_tab: true
profile_name: Putative copy-number alterations from GISTIC
profile_description: Putative copy-number from GISTIC 2.0. Values: -2 = homozygous deletion; -1 = hemizygous deletion; 0 = neutral / no change; 1 = gain; 2 = high level amplification.
data_filename: data_CNA.txt

Data file

For each gene (row) in the data file, the following columns are required in the order specified:

One or both of:

  • Hugo_Symbol: A HUGO gene symbol.
  • Entrez_Gene_Id: An Entrez Gene identifier.

And:

  • An additional column for each sample in the dataset using the sample id as the column header.

For each gene-sample combination, a copy number level is specified:

  • "-2" is a deep loss, possibly a homozygous deletion
  • "-1" is a single-copy loss (heterozygous deletion)
  • "0" is diploid
  • "1" indicates a low-level gain
  • "2" is a high-level amplification.

Example

An example data file which includes the required column header would look like:

Hugo_Symbol<TAB>Entrez_Gene_Id<TAB>SAMPLE_ID_1<TAB>SAMPLE_ID_2<TAB>...
ACAP3<TAB>116983<TAB>0<TAB>-1<TAB>...
AGRN<TAB>375790<TAB>2<TAB>0<TAB>...
...
...
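
Because only five copy number levels are defined, a discrete copy number file can be screened for stray values before import. The following is a minimal sketch with a hypothetical function name. It assumes the gene columns come first in the header, and it flags anything that is not one of the five levels, including empty cells; whether the importer tolerates missing values is not covered here.

```python
ALLOWED_LEVELS = {"-2", "-1", "0", "1", "2"}
GENE_COLUMNS = {"Hugo_Symbol", "Entrez_Gene_Id"}


def check_discrete_cna(lines):
    """Return (row_number, sample_id, value) for every value outside the five levels."""
    header = lines[0].rstrip("\n").split("\t")
    # Count the leading gene identifier columns (one or both may be present).
    n_gene_cols = 0
    while n_gene_cols < len(header) and header[n_gene_cols] in GENE_COLUMNS:
        n_gene_cols += 1
    if n_gene_cols == 0:
        raise ValueError("expected Hugo_Symbol and/or Entrez_Gene_Id as leading columns")
    sample_ids = header[n_gene_cols:]
    problems = []
    # Row numbers are 1-based and count the header as row 1.
    for row_number, line in enumerate(lines[1:], start=2):
        values = line.rstrip("\n").split("\t")[n_gene_cols:]
        for sample_id, value in zip(sample_ids, values):
            if value.strip() not in ALLOWED_LEVELS:
                problems.append((row_number, sample_id, value))
    return problems
```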

Continuous Copy Number Data

Meta file

The continuous copy number metadata file should contain the following fields:

  1. cancer_study_identifier: same value as specified in study meta file
  2. genetic_alteration_type: COPY_NUMBER_ALTERATION.
  3. datatype: CONTINUOUS
  4. stable_id: linear_CNA
  5. show_profile_in_analysis_tab: false.
  6. profile_name: A name for the copy number data, e.g., "copy-number values".
  7. profile_description: A description of the copy number data, e.g., "copy-number values for each gene (from Affymetrix SNP6).".
  8. data_filename: <your datafile>

cBioPortal also supports log2 copy number data. If your data is in log2, change the following fields:

  1. datatype: LOG2-VALUE
  2. stable_id: log2CNA

TODO: In issue #571 log2 is changed to linear. This means the information that it is a log value is now lost. It should be discussed, as this is probably not a good idea.

Example

An example metadata file, e.g. meta_CNA_log2.txt, would be:

cancer_study_identifier: brca_tcga_pub
genetic_alteration_type: COPY_NUMBER_ALTERATION
datatype: LOG2-VALUE
stable_id: log2CNA
show_profile_in_analysis_tab: false
profile_description: Log2 copy-number values for each gene (from Affymetrix SNP6).
profile_name: Log2 copy-number values
data_filename: data_log2CNA.txt

Data file

The log2 copy number data file follows the same format as expression data files. See Expression Data for a description of the expression data file format.

Segmented Data

A SEG file (segmented data; .seg or .cbs) is a tab-delimited text file that lists loci and associated numeric values. The segmented data file format is the output of the Circular Binary Segmentation algorithm (Olshen et al., 2004). Segment data for import into the cBioPortal should be based on build 37 (hg19). This Segment data enables the 'CNA' lane in the Genomic overview of the Patient view (as can be seen in this example).

Meta file

The segmented metadata file should contain the following fields:

  1. cancer_study_identifier: same value as specified in study meta file
  2. genetic_alteration_type: COPY_NUMBER_ALTERATION
  3. datatype: SEG
  4. reference_genome_id: Reference genome version. Supported values: "hg19"
  5. description: A description of the segmented data, e.g., "Segment data for the XYZ cancer study.".
  6. data_filename: <your datafile>

Example:

An example metadata file, e.g. meta_cna_seg.txt, would be:

cancer_study_identifier: brca_tcga_pub
genetic_alteration_type: COPY_NUMBER_ALTERATION
datatype: SEG
reference_genome_id: hg19
description: Somatic CNA data (copy number ratio from tumor samples minus ratio from matched normals) from TCGA.
data_filename: brca_tcga_data_cna_hg19.seg

Data file

The first row contains column headings and each subsequent row contains a locus and an associated numeric value. See also the Broad IGV page on this format.

Example:

An example data file which includes the required column header would look like:

ID<TAB>chrom<TAB>loc.start<TAB>loc.end<TAB>num.mark<TAB>seg.mean
SAMPLE_ID_1<TAB>1<TAB>3208470<TAB>245880329<TAB>128923<TAB>0.0025
SAMPLE_ID_2<TAB>2<TAB>474222<TAB>5505492<TAB>2639<TAB>-0.0112
SAMPLE_ID_2<TAB>2<TAB>5506070<TAB>5506204<TAB>2<TAB>-1.5012
SAMPLE_ID_2<TAB>2<TAB>5512374<TAB>159004775<TAB>80678<TAB>-0.0013
...
...
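
A SEG file with the six columns shown above can be read into typed records as follows. This is a sketch only: the Segment type and parse_seg function are hypothetical names, and requiring the header to match the example exactly is an assumption that may be stricter than the importer.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    sample_id: str
    chrom: str
    start: int
    end: int
    num_mark: int
    seg_mean: float


# Assumption: the header matches the example above exactly.
SEG_HEADER = ["ID", "chrom", "loc.start", "loc.end", "num.mark", "seg.mean"]


def parse_seg(lines):
    """Parse tab-delimited SEG lines (header first) into Segment records."""
    header = lines[0].rstrip("\n").split("\t")
    if header != SEG_HEADER:
        raise ValueError(f"unexpected SEG header: {header}")
    segments = []
    for line in lines[1:]:
        if not line.strip():
            continue  # skip blank lines
        sample_id, chrom, start, end, num_mark, seg_mean = line.rstrip("\n").split("\t")
        segments.append(
            Segment(sample_id, chrom, int(start), int(end), int(num_mark), float(seg_mean))
        )
    return segments
```

The chromosome is kept as a string so that values such as "X" do not need special handling.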

Expression Data

An expression data file is a two dimensional matrix with a gene per row and a sample per column. For each gene-sample pair, a real number represents the gene expression in that sample.

Meta file

The expression metadata file should contain the following fields:

  1. cancer_study_identifier: same value as specified in study meta file
  2. genetic_alteration_type: MRNA_EXPRESSION
  3. datatype: CONTINUOUS, DISCRETE or Z-SCORE
  4. stable_id: see table below.
  5. show_profile_in_analysis_tab: false (you can set to true if Z-SCORE to enable it in the oncoprint, for example).
  6. profile_name: A name for the expression data, e.g., "mRNA expression (microarray)".
  7. profile_description: A description of the expression data, e.g., "Expression levels (Agilent microarray).".
  8. data_filename: <your datafile>

Supported stable_id values for MRNA_EXPRESSION

For historical reasons, the following static set of stable_id values is expected.

genetic_alteration_type  datatype    stable_id                       description
MRNA_EXPRESSION          CONTINUOUS  mrna_U133                       Affymetrix U133 Array
MRNA_EXPRESSION          Z-SCORE     mrna_U133_Zscores               Affymetrix U133 Array
MRNA_EXPRESSION          Z-SCORE     rna_seq_mrna_median_Zscores     RNA-seq data
MRNA_EXPRESSION          Z-SCORE     mrna_median_Zscores             mRNA data
MRNA_EXPRESSION          CONTINUOUS  rna_seq_mrna                    RNA-seq data
MRNA_EXPRESSION          CONTINUOUS  rna_seq_v2_mrna                 RNA-seq data
MRNA_EXPRESSION          Z-SCORE     rna_seq_v2_mrna_median_Zscores  RNA-seq data
MRNA_EXPRESSION          CONTINUOUS  mirna                           MicroRNA data
MRNA_EXPRESSION          Z-SCORE     mirna_median_Zscores            MicroRNA data
MRNA_EXPRESSION          Z-SCORE     mrna_merged_median_Zscores      ?
MRNA_EXPRESSION          CONTINUOUS  mrna                            mRNA data
MRNA_EXPRESSION          DISCRETE    mrna_outliers                   mRNA data of outliers
MRNA_EXPRESSION          Z-SCORE     mrna_zbynorm                    ?
MRNA_EXPRESSION          CONTINUOUS  rna_seq_mrna_capture            data from Roche mRNA Capture Kit
MRNA_EXPRESSION          Z-SCORE     rna_seq_mrna_capture_Zscores    data from Roche mRNA Capture Kit

Example

An example metadata file, e.g. meta_expression_file.txt, would be:

cancer_study_identifier: brca_tcga_pub
genetic_alteration_type: MRNA_EXPRESSION
datatype: CONTINUOUS
stable_id: rna_seq_mrna
show_profile_in_analysis_tab: false
profile_name: mRNA expression 
profile_description: Expression levels 
data_filename: data_expression_file.txt

Data file

For each gene (row) in the data file, the following columns are required in the order specified:

One or both of:

  • Hugo_Symbol: A HUGO gene symbol.
  • Entrez_Gene_Id: An Entrez Gene identifier.

And:

  • An additional column for each sample in the dataset using the sample id as the column header.

For each gene-sample combination, a value is specified:

  • A real number for each sample id (column) in the dataset, representing the expression value for the gene in the respective sample.
z-score instructions

For mRNA expression data, we typically expect the expression of an individual gene in a tumor relative to the gene's expression distribution in a reference population. That reference population is either all tumors that are diploid for the gene in question, or, when available, normal adjacent tissue. The returned value indicates the number of standard deviations away from the mean of expression in the reference population (Z-score). This measure is useful to determine whether a gene is up- or down-regulated relative to the normal samples or all other tumor samples. Note that the importer tool can create normalized (z-score) expression data on your behalf. Please visit the Z-Score normalization script wiki page for more information. A corresponding z-score metadata file would be something like:

cancer_study_identifier: brca_tcga_pub
genetic_alteration_type: MRNA_EXPRESSION
datatype: Z-SCORE
stable_id: rna_seq_mrna_median_Zscores
show_profile_in_analysis_tab: true
profile_name: mRNA expression z-scores
profile_description: Expression levels z-scores
data_filename: data_expression_zscores_file.txt
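
The z-score described in this section is the distance of a tumor's expression value from the reference population mean, in units of the reference standard deviation. The sketch below illustrates that arithmetic for a single gene. It is not the importer's normalization script: the function name is hypothetical, and using the sample standard deviation (rather than the population one) is an assumption.

```python
import statistics


def expression_zscores(tumor_values, reference_values):
    """Z-scores of one gene's tumor values relative to a reference population."""
    mean = statistics.mean(reference_values)
    # Assumption: sample standard deviation; needs at least two reference values.
    sd = statistics.stdev(reference_values)
    if sd == 0:
        raise ValueError("reference population has zero variance for this gene")
    return [(value - mean) / sd for value in tumor_values]
```

For example, with reference values 1.0, 2.0 and 3.0 (mean 2.0, sample standard deviation 1.0), a tumor value of 4.0 gets a z-score of 2.0.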

Examples of data files:

An example data file which includes the required column header and leaves out Hugo_Symbol (recommended) would look like:

Entrez_Gene_Id<TAB>SAMPLE_ID_1<TAB>SAMPLE_ID_2<TAB>...
116983<TAB>-0.005<TAB>-0.550<TAB>...
375790<TAB>0.142<TAB>0.091<TAB>...
...
...

An example data file which includes both Hugo_Symbol and Entrez_Gene_Id would look like (supported, but not recommended as it increases the chances of errors regarding ambiguous Hugo symbols):

Hugo_Symbol<TAB>Entrez_Gene_Id<TAB>SAMPLE_ID_1<TAB>SAMPLE_ID_2<TAB>...
ACAP3<TAB>116983<TAB>-0.005<TAB>-0.550<TAB>...
AGRN<TAB>375790<TAB>0.142<TAB>0.091<TAB>...
...
...

An example data file with only Hugo_Symbol column (supported, but not recommended as it increases the chances of errors regarding ambiguous Hugo symbols):

Hugo_Symbol<TAB>SAMPLE_ID_1<TAB>SAMPLE_ID_2<TAB>...
ACAP3<TAB>-0.005<TAB>-0.550<TAB>...
AGRN<TAB>0.142<TAB>0.091<TAB>...
...
...

Mutation Data

The mutation data file extends the Mutation Annotation Format (MAF) created as part of the Cancer Genome Atlas project, by adding extra annotations to each mutation record. If your mutation data is already in VCF format (which by default most variant callers produce) you can use this vcf2maf converter. Please note that all data should be mapped to UniProt canonical isoforms. This can be done by calling the vcf2maf or maf2maf with the --custom-enst flag and the mapping file available here. This will ensure the SWISSPROT column, which contains the UniProt canonical isoform, can be used correctly by cBioPortal.

Meta file

The mutation metadata file should contain the following fields:

  1. cancer_study_identifier: same value as specified in study meta file
  2. genetic_alteration_type: MUTATION_EXTENDED
  3. datatype: MAF
  4. stable_id: mutations
  5. show_profile_in_analysis_tab: true
  6. profile_name: A name for the mutation data, e.g., "Mutations".
  7. profile_description: A description of the mutation data, e.g., "Mutation data from whole exome sequencing.".
  8. data_filename: <your datafile>

An example metadata file would be:

cancer_study_identifier: brca_tcga_pub
genetic_alteration_type: MUTATION_EXTENDED
datatype: MAF
stable_id: mutations
show_profile_in_analysis_tab: true
profile_description: Mutation data from whole exome sequencing.
profile_name: Mutations
data_filename: brca_tcga_pub.maf

Data file

A data file contains at least the MAF file annotation results. The minimal mutation annotations file can contain just three of the MAF columns plus one annotation column (which is normally added to the end of each MAF row):

  • Hugo_Symbol: (MAF column) A HUGO gene symbol.
  • Tumor_Sample_Barcode: (MAF column) This is the sample ID. Either a TCGA barcode (patient identifier will be extracted), or for non-TCGA data, a literal SAMPLE_ID as listed in the clinical data file.
  • Variant_Classification: (MAF column) Translational effect of variant allele, e.g. Missense_Mutation, Silent, etc. cBioPortal skips the following types during the import: Silent, Intron, 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR and RNA
  • HGVSp_Short: (annotation column) Amino Acid Change, e.g. p.V600E.
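
To preview which records of a minimal mutation file would survive the import, the skipped Variant_Classification values listed above can be applied as a filter. This is a minimal sketch, not the importer's logic: the function name is hypothetical, and the file is assumed to start directly with the column header (no leading comment lines).

```python
import csv
import io

SKIPPED_CLASSIFICATIONS = {
    "Silent", "Intron", "3'UTR", "3'Flank", "5'UTR", "5'Flank", "IGR", "RNA",
}
MINIMAL_COLUMNS = [
    "Hugo_Symbol", "Tumor_Sample_Barcode", "Variant_Classification", "HGVSp_Short",
]


def importable_mutations(maf_text):
    """Return the rows whose Variant_Classification is not in the skipped set."""
    # QUOTE_NONE keeps quote characters in values as literal text.
    reader = csv.DictReader(io.StringIO(maf_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = [col for col in MINIMAL_COLUMNS if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return [
        row for row in reader
        if row["Variant_Classification"] not in SKIPPED_CLASSIFICATIONS
    ]
```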

Note: next to Hugo_Symbol, it is recommended to have the Entrez gene ID:

  • Entrez_Gene_Id (Optional, but recommended): An Entrez Gene identifier.

The following extra annotation columns are also important for making sure mutation specific UI functionality works well in the portal:

  • Protein_position: (annotation column) Required to initialize the 3D viewer in mutations view
  • SWISSPROT: (annotation column) swissprot code, e.g. O11H1_HUMAN. Is not absolutely required, but not having it may result in inconsistent PDB structure matching in mutations view. ⚠️ After running vcf2maf (or VEP) the SWISSPROT column contains the uniprot accession and NOT the entry name (e.g. for TP53 the column will contain P04637 and not P53_HUMAN). cBioPortal currently only supports the entry name.

MAF format

You can also add your mutation annotation columns to the complete MAF rows. In this way, the portal will parse and store the extra MAF fields as well. For example, mutation data that you find on cBioPortal.org comes from MAF files that have been further enriched with information from mutationassessor.org, which leads to a "Mutation Assessor" column in the mutation table.

The MAF format recognized by the portal (excluding the annotation columns already mentioned above) has 32 columns + 4 columns with information on reference and variant allele counts in tumor and normal samples. A more detailed example MAF can be found on our Downloads page. Description of each column is provided below:

  1. Hugo_Symbol (Required): A HUGO gene symbol.
  2. Entrez_Gene_Id (Optional, but desired): An Entrez Gene identifier.
  3. Center (Optional): The sequencing center.
  4. NCBI_Build (Optional): Must be "37".
  5. Chromosome (Optional): A chromosome number, e.g., "7".
  6. Start_Position (Optional): Start position of event.
  7. End_Position (Optional): End position of event.
  8. Strand (Optional): We assume that the mutation is reported for the + strand.
  9. Variant_Classification (Required): Translational effect of variant allele, e.g. Missense_Mutation, Silent, etc.
  10. Variant_Type (Optional): Variant Type, e.g. SNP, DNP, etc.
  11. Reference_Allele (Optional): The plus strand reference allele at this position.
  12. Tumor_Seq_Allele1 (Optional): Primary data genotype.
  13. Tumor_Seq_Allele2 (Optional): Primary data genotype.
  14. dbSNP_RS (Optional): Latest dbSNP rs ID.
  15. dbSNP_Val_Status (Optional): dbSNP validation status.
  16. Tumor_Sample_Barcode (Required): This is the sample ID. Either a TCGA barcode (patient identifier will be extracted), or for non-TCGA data, a literal SAMPLE_ID as listed in the clinical data file.
  17. Matched_Norm_Sample_Barcode (Optional): The sample ID for the matched normal sample.
  18. Match_Norm_Seq_Allele1 (Optional): Primary data.
  19. Match_Norm_Seq_Allele2 (Optional): Primary data.
  20. Tumor_Validation_Allele1 (Optional): Secondary data from orthogonal technology.
  21. Tumor_Validation_Allele2 (Optional): Secondary data from orthogonal technology.
  22. Match_Norm_Validation_Allele1 (Optional): Secondary data from orthogonal technology.
  23. Match_Norm_Validation_Allele2 (Optional): Secondary data from orthogonal technology.
  24. Verification_Status (Optional): Second pass results from independent attempt using same methods as primary data source.
  25. Validation_Status (Optional): "Valid" or "Unknown".
  26. Mutation_Status (Optional): Ideally "Somatic".
  27. Sequencing_Phase (Optional): Indicates current sequencing phase.
  28. Sequence_Source (Optional): Molecular assay type used to produce the analytes used for sequencing.
  29. Validation_Method (Optional): The assay platforms used for the validation call.
  30. Score (Optional): Not in use.
  31. BAM_File (Optional): Not used.
  32. Sequencer (Optional): Instrument used to produce primary data.
  33. t_alt_count (Optional): Variant allele count (tumor).
  34. t_ref_count (Optional): Reference allele count (tumor).
  35. n_alt_count (Optional): Variant allele count (normal).
  36. n_ref_count (Optional): Reference allele count (normal).

Methylation Data

The Portal expects a single value for each gene in each sample, usually a beta-value from the Infinium methylation array platform.

Meta file

The methylation metadata file should contain the following fields:

  1. cancer_study_identifier: same value as specified in study meta file
  2. genetic_alteration_type: METHYLATION
  3. datatype: CONTINUOUS
  4. stable_id: "methylation_hm27" or "methylation_hm450" (depending on platform).
  5. show_profile_in_analysis_tab: false
  6. profile_name: A name for the methylation data, e.g., "Methylation (HM27)".
  7. profile_description: A description of the methylation data, e.g., "Methylation beta-values (HM27 platform). For genes with multiple methylation probes, the probe least correlated with expression is selected.".
  8. data_filename: <your datafile>

Example

An example metadata file would be:

cancer_study_identifier: brca_tcga_pub
genetic_alteration_type: METHYLATION
datatype: CONTINUOUS
stable_id: methylation_hm27
show_profile_in_analysis_tab: false
profile_name: Methylation (HM27)
profile_description: Methylation beta-values (HM27 platform). For genes with multiple methylation probes, the probe least correlated with expression is selected.
data_filename: data_methylation_hm27.txt

Data file

The methylation data file follows the same format as expression data files. See Expression Data for a description of the expression data file format. The Portal expects a single value for each gene in each sample, usually a beta-value from the Infinium methylation array platform.

RPPA Data

RPPA data describe protein expression measured by reverse-phase protein array. The data consist of antibody-sample pairs, with a real number representing the RPPA level for that sample.

Meta file

The RPPA metadata file should contain the following fields:

  1. cancer_study_identifier: same value as specified in study meta file
  2. genetic_alteration_type: PROTEIN_LEVEL
  3. datatype: LOG2-VALUE or Z-SCORE
  4. stable_id: rppa or rppa_Zscores
  5. show_profile_in_analysis_tab: false (true for Z-SCORE datatype)
  6. profile_name: A name for the RPPA data, e.g., "RPPA data".
  7. profile_description: A description of the RPPA data, e.g., "RPPA levels.".
  8. data_filename: <your datafile>

An example metadata file would be:

cancer_study_identifier: brca_tcga
genetic_alteration_type: PROTEIN_LEVEL
datatype: LOG2-VALUE
stable_id: rppa
show_profile_in_analysis_tab: false
profile_description: Protein expression measured by reverse-phase protein array
profile_name: Protein expression (RPPA)
data_filename: data_rppa.txt

NB: You also need a Z-SCORE file if you want RPPA to be available in query UI and in Oncoprint visualization. E.g.:

cancer_study_identifier: brca_tcga
genetic_alteration_type: PROTEIN_LEVEL
datatype: Z-SCORE
data_filename: data_rppa.txt
stable_id: rppa_Zscores
show_profile_in_analysis_tab: true
profile_description: Protein expression Z-scores (RPPA)
profile_name: Protein expression Z-scores (RPPA)

Data file

An RPPA data file is a two dimensional matrix with an antibody per row and a sample per column. For each antibody-sample pair, a real number represents the RPPA level for that sample. The antibody information should contain a HUGO gene symbol and an antibody ID pair separated by the "|" symbol.

Example

An example data file which includes the required column header would look like:

Composite.Element.REF<TAB>SAMPLE_ID_1<TAB>SAMPLE_ID_2<TAB>...
BRAF|B-Raf-M-NA<TAB>1.09506676325<TAB>0.5843256495<TAB>...
EGFR|EGFR-R-C<TAB>1.70444582025<TAB>1.0982864685<TAB>...
...
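
The first column of an RPPA data file combines a HUGO gene symbol and an antibody ID, separated by "|". Splitting it can be sketched as below. The function name is hypothetical, and the sketch assumes exactly one gene symbol before the separator.

```python
def split_composite_element(ref):
    """Split a Composite.Element.REF value into (gene_symbol, antibody_id)."""
    gene, sep, antibody = ref.partition("|")
    if not sep or not gene or not antibody:
        raise ValueError(f"expected 'GENE|ANTIBODY', got {ref!r}")
    return gene, antibody
```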

Fusion Data

This data type is not yet validated. It can, however, be uploaded.

Meta file

The fusion metadata file should contain the following fields:

  1. cancer_study_identifier: same value as specified in study meta file
  2. genetic_alteration_type: FUSION
  3. datatype: FUSION
  4. stable_id: fusion
  5. show_profile_in_analysis_tab: true
  6. profile_name: A name for the fusion data, e.g., "Fusions.".
  7. profile_description: A description of the fusion data.
  8. data_filename: <your datafile>

Example

An example metadata file would be:

cancer_study_identifier: brca_tcga_pub
genetic_alteration_type: FUSION
datatype: FUSION
stable_id: fusion
profile_description: Fusions.
show_profile_in_analysis_tab: true
profile_name: Fusions
data_filename: data_fusions.txt

Data file

A fusion data file is a two dimensional matrix with one gene per row. For each gene (row) in the data file, the following tab-delimited values are required in the order specified:

  1. Hugo_Symbol: A HUGO gene symbol.
  2. Entrez_Gene_Id: An Entrez Gene identifier.
  3. Center: The sequencing center.
  4. Tumor_Sample_Barcode: This is the sample ID.
  5. Fusion: A description of the fusion, e.g., "TMPRSS2-ERG fusion".
  6. DNA support: Fusion detected from DNA sequence data, "yes" or "no".
  7. RNA support: Fusion detected from RNA sequence data, "yes" or "no".
  8. Method: The algorithm or tool used to detect the fusion.
  9. Frame: "in-frame" or "frameshift".

An example data file which includes the required column header would look like:

Hugo_Symbol<TAB>Entrez_Gene_Id<TAB>Center<TAB>Tumor_Sample_Barcode<TAB>Fusion<TAB>DNA support<TAB>RNA support<TAB>Method<TAB>Frame
ALK<TAB>238<TAB>center.edu<TAB>SAMPLE_ID_1<TAB>Fusion<TAB>unknown<TAB>yes<TAB>unknown<TAB>in-frame
ALK<TAB>238<TAB>center.edu<TAB>SAMPLE_ID_2<TAB>Fusion<TAB>unknown<TAB>yes<TAB>unknown<TAB>in-frame
RET<TAB>5979<TAB>center.edu<TAB>SAMPLE_ID_3<TAB>Fusion<TAB>unknown<TAB>yes<TAB>unknown<TAB>in-frame
...
...
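
Since fusion data is not validated on upload, a quick local check of the column order can catch mistakes early. This is a sketch only: `read_fusions` is a hypothetical helper, and the strict header comparison reflects the "in the order specified" requirement above rather than any known importer behaviour.

```python
import csv
import io

# The nine columns, in the order given in the list above.
FUSION_COLUMNS = [
    "Hugo_Symbol", "Entrez_Gene_Id", "Center", "Tumor_Sample_Barcode",
    "Fusion", "DNA support", "RNA support", "Method", "Frame",
]


def read_fusions(handle):
    """Read a fusion data file and fail if the header differs from the spec."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != FUSION_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return list(reader)


example = (
    "\t".join(FUSION_COLUMNS) + "\n"
    "ALK\t238\tcenter.edu\tSAMPLE_ID_1\tFusion\tunknown\tyes\tunknown\tin-frame\n"
    "RET\t5979\tcenter.edu\tSAMPLE_ID_3\tFusion\tunknown\tyes\tunknown\tin-frame\n"
)

rows = read_fusions(io.StringIO(example))
for row in rows:
    print(row["Tumor_Sample_Barcode"], row["Hugo_Symbol"], row["Frame"])
```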

Case Lists

There should be one or more case lists associated with each cancer study. You should provide at least one case list which contains all sample ids (the importer can generate this for you if you set the attribute add_global_case_list to 'true' in the Study metadata).

When not using the add_global_case_list attribute in Study metadata, or if you want to add custom case lists:

  • the case list files should be placed in a sub-directory called "case_lists" which exists alongside all the other cancer study data.

The case list file should contain the following fields:

  1. cancer_study_identifier: same value as specified in study meta file
  2. stable_id: typically the cancer_study_identifier with a relevant suffix, e.g., "_custom". There are some naming rules to follow if you want the case list to be selected automatically in the query UI based on the selected sample profiles. See subsection below.
  3. case_list_name: A name for the patient list, e.g., "All Tumors".
  4. case_list_description: A description of the patient list, e.g., "All tumor samples (825 samples).".
  5. case_list_ids: A tab-delimited list of sample ids from the dataset.
  6. case_list_category: Optional alternative way of linking your case list to a specific molecular profile. E.g., setting this to all_cases_with_cna_data will signal to the portal that this is the list of samples to be associated with CNA data in some of the analyses.

Example

An example case list file would be:

cancer_study_identifier: brca_tcga_pub
stable_id: brca_tcga_pub_custom
case_list_name: Custom subset of samples
case_list_description: Custom subset of samples (825 samples)
case_list_ids: SAMPLE_ID_1<TAB>SAMPLE_ID_2<TAB>SAMPLE_ID_3<TAB>...
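
A case list file like this one is easy to generate from a list of sample ids. The sketch below assumes the same `key: value` layout as the example; `format_case_list` is a hypothetical helper, and the optional case_list_category field is left out.

```python
def format_case_list(study_id, suffix, name, description, sample_ids):
    """Build the text of a case list file; sample ids are tab-delimited."""
    lines = [
        f"cancer_study_identifier: {study_id}",
        f"stable_id: {study_id}{suffix}",
        f"case_list_name: {name}",
        f"case_list_description: {description}",
        "case_list_ids: " + "\t".join(sample_ids),
    ]
    return "\n".join(lines) + "\n"


sample_ids = ["SAMPLE_ID_1", "SAMPLE_ID_2", "SAMPLE_ID_3"]
case_list_text = format_case_list(
    "brca_tcga_pub",
    "_custom",
    "Custom subset of samples",
    f"Custom subset of samples ({len(sample_ids)} samples)",
    sample_ids,
)
print(case_list_text)
```

Deriving the sample count in the description from the list itself keeps the two from drifting apart when the list changes.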

⚠️ In order for sample counts to propagate to the data sets widget on the home page and the table on the Data Sets page, the following case list suffixes need to be used in the stable_id property. This is also needed for correct statistics in the Study view page when calculating the frequency of CNA and of mutations per gene in the respective summary tables.

  • Sequenced Samples: "_sequenced" e.g. "brca_tcga_pub_sequenced".
  • CNA Patients: "_cna" e.g. "brca_tcga_pub_cna" (⚠️ the size of this list is used for determining the frequency (%) of CNA in genes in the study view. If this case list is not given, the system will assume that all samples have been sequenced and will calculate the frequency according to that).
  • mRNA (RNA-SeqV2): "_rna_seq_v2_mrna" e.g. "brca_tcga_pub_rna_seq_v2_mrna".
  • mRNA (microarray): "_mrna" e.g. "brca_tcga_pub_mrna".
  • Methylation (HM27): "_methylation_hm27" e.g. "brca_tcga_pub_methylation_hm27".
  • RPPA: "_rppa" e.g. "brca_tcga_pub_rppa".
  • Complete: "_3way_complete" e.g. "brca_tcga_pub_3way_complete", (mRNA, CNA, & sequencing).

Finally, if you are not using the add_global_case_list attribute in the Study metadata, you need to generate the "All samples" case list as well and give it the following stable_id:

  • All Samples: "_all" e.g. "brca_tcga_pub_all".
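
The suffix rules above can be expressed as a lookup table. The following sketch strips the study identifier first and then matches the remainder exactly, which avoids confusing "_mrna" with "_rna_seq_v2_mrna" (both end in "_mrna"). The function name `case_list_kind` is hypothetical.

```python
# Suffixes listed above, mapped to the kind of case list they denote.
CASE_LIST_SUFFIXES = {
    "_all": "All Samples",
    "_sequenced": "Sequenced Samples",
    "_cna": "CNA Patients",
    "_rna_seq_v2_mrna": "mRNA (RNA-SeqV2)",
    "_mrna": "mRNA (microarray)",
    "_methylation_hm27": "Methylation (HM27)",
    "_rppa": "RPPA",
    "_3way_complete": "Complete",
}


def case_list_kind(stable_id, study_id):
    """Return the kind of case list a stable_id denotes, or None if unrecognized."""
    if not stable_id.startswith(study_id):
        return None
    suffix = stable_id[len(study_id):]
    return CASE_LIST_SUFFIXES.get(suffix)


print(case_list_kind("brca_tcga_pub_rna_seq_v2_mrna", "brca_tcga_pub"))
print(case_list_kind("brca_tcga_pub_custom", "brca_tcga_pub"))
```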

Timeline Data

The timeline data is a representation of the various events that occur during the course of treatment for a patient from initial diagnosis. In cBioPortal timeline data is represented as one or more tracks in the patient view. Each main track is based on an event type, such as "Specimen", "Imaging", "Lab_test", etc.

⚠️ Some clinical attributes affect the timeline visualization. Please check the Clinical Data section for more information.

This type of data is not yet validated. It can, however, be uploaded.

Meta file

Each event type requires its own meta file. A timeline meta file should contain the following fields:

  1. cancer_study_identifier: same value as specified in study meta file
  2. genetic_alteration_type: TIMELINE
  3. datatype: TIMELINE
  4. data_filename: <your datafile>

An example metadata file would be:

cancer_study_identifier: brca_tcga
genetic_alteration_type: TIMELINE
datatype: TIMELINE
data_filename: data_timeline_imaging.txt

Data file

Each event type requires its own data file, which contains all the events that each patient undergoes. The data format used for timeline data is extremely flexible. There are three required columns:

  1. PATIENT_ID: the patient ID from the dataset
  2. START_DATE: the start point of any event, calculated in days from the date of diagnosis (which will act as point zero on the timeline scale)
  3. EVENT_TYPE: the category of the event. You are free to define any type of event here. For several event types cBioPortal has column naming suggestions and for several events there are column names which have special effects. See event types for more information.

There is one conditionally required column:

  1. STOP_DATE: the end point of the event, calculated in days from the date of diagnosis (which will act as point zero on the timeline scale). If the event occurs over time (e.g. a Treatment, ...), STOP_DATE should be used. If the event occurs at a time point (e.g. a Lab_test, Imaging, ...), STOP_DATE should not be used.

And one optional column with a special effect:

  1. SPECIMEN_REFERENCE_NUMBER: when this column has values that match the SAMPLE_ID/OTHER_SAMPLE_ID (defined in the clinical data file), the timeline will show case labels with black/red/etc 1, 2, 3, 4 circles. This only works for the first track and only if no STOP_DATE is set.
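
Because START_DATE and STOP_DATE are day offsets rather than calendar dates, source dates usually need converting first. Here is a minimal sketch of that arithmetic; the calendar dates, the patient, and the helper names `days_from_diagnosis` and `timeline_row` are made up for illustration.

```python
from datetime import date


def days_from_diagnosis(diagnosis, event):
    """Number of days between the diagnosis date (day 0) and an event date."""
    return (event - diagnosis).days


def timeline_row(patient_id, event_type, diagnosis, start, stop=None):
    """Build the required timeline columns; STOP_DATE only for events over time."""
    row = {
        "PATIENT_ID": patient_id,
        "START_DATE": str(days_from_diagnosis(diagnosis, start)),
        "EVENT_TYPE": event_type,
    }
    if stop is not None:
        row["STOP_DATE"] = str(days_from_diagnosis(diagnosis, stop))
    return row


diagnosis = date(2015, 1, 10)  # hypothetical date of diagnosis

# A point event (no STOP_DATE) and an event that spans a period.
lab_test = timeline_row("CACO2", "LAB_TEST", diagnosis, date(2015, 4, 20))
treatment = timeline_row("CACO2", "TREATMENT", diagnosis,
                         date(2015, 4, 20), stop=date(2015, 5, 20))
print(lab_test)
print(treatment)
```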

Event Types

As previously mentioned, the EVENT_TYPE can be anything. However, several event types have columns with special effects. Furthermore, for some event types cBioPortal has column naming suggestions.

EVENT_TYPE: TREATMENT

Suggested columns

  • TREATMENT_TYPE: This can be either Medical Therapy or Radiation Therapy.
  • SUBTYPE: Depending upon the TREATMENT_TYPE, this can either be Chemotherapy, Hormone Therapy, Targeted Therapy etc. (for Medical Therapies) or WPRT, IVRT etc. (for Radiation Therapies).
  • AGENT: for medical therapies, the agent is defined with number of cycles if applicable and for radiation therapy, the agent is defined as standard dose given to the patient during the course.
  • Based on different cancer types you can add additional data here.

Special: When using the AGENT and SUBTYPE columns, each agent and subtype will be split into its own track.

EVENT_TYPE: LAB_TEST

Suggested columns

  • TEST: type of test performed
  • RESULT: corresponding value of the test
  • Based on different cancer types you can add additional data here.

Special: When using the TEST and RESULT columns, each test gets its own track and the dots are sized by the values of the RESULT if the TEST is PSA, ALK, TEST, HGB, PHOS or LDH.

EVENT_TYPE: IMAGING

Suggested columns

  • DIAGNOSTIC_TYPE: This attribute will cover the different diagnostics tools used (for example: MRI, CT scan etc.)
  • DIAGNOSTIC_TYPE_DETAILED: Detailed description of the event type.
  • RESULT: Results of the diagnostic tests
  • SOURCE: Where the imaging was done.
  • Based on different cancer types you can add additional data here.

Special: all dots in the IMAGING track are squares.

EVENT_TYPE: STATUS

Suggested columns

  • STATUS: If the EVENT_TYPE is status, data is entered under STATUS to define either the best response from the treatment or if there is a diagnosis of any stage progression etc.
  • SOURCE: Where the status was monitored.
  • Based on different cancer types you can add additional data here.

EVENT_TYPE: SPECIMEN

Suggested columns

  • SPECIMEN_REFERENCE_NUMBER: This corresponds to the SAMPLE_ID/OTHER_SAMPLE_ID
  • SPECIMEN_SITE: This is the site from where the specimen was collected.
  • SPECIMEN_TYPE: This can either be tissue or blood.
  • SOURCE: Where the specimen was collected.
  • Based on different cancer types you can add additional data here.

Special: when the SPECIMEN_REFERENCE_NUMBER column has values that match the SAMPLE_ID/OTHER_SAMPLE_ID (defined in the clinical data file), the timeline will show case labels with black/red/etc 1, 2, 3, 4 circles. This only works for the first track and only if no STOP_DATE is set.

Clinical Track Ordering

Clinical tracks are ordered as follows (if available):

  1. Specimen
  2. Surgery
  3. Status
  4. Diagnostics
  5. Diagnostic
  6. Imaging
  7. Lab_test
  8. Treatment
  9. First custom event
  10. etc.
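
The ordering above can be sketched as a sort with a fixed rank table, with custom event types placed after the known ones in the order they were given. Matching names case-insensitively is an assumption made here (the examples on this page use both "Lab_test" and "LAB_TEST"); `order_tracks` is a hypothetical helper, not portal code.

```python
# Known track names in the order listed above.
TRACK_ORDER = [
    "Specimen", "Surgery", "Status", "Diagnostics",
    "Diagnostic", "Imaging", "Lab_test", "Treatment",
]


def order_tracks(event_types):
    """Sort event types by the fixed order; custom types keep their input order."""
    rank = {name.lower(): i for i, name in enumerate(TRACK_ORDER)}
    # sorted() is stable, so all custom types (same rank) stay in input order.
    return sorted(event_types, key=lambda t: rank.get(t.lower(), len(TRACK_ORDER)))


ordered = order_tracks(["TREATMENT", "MyEvent", "SPECIMEN", "LAB_TEST", "Other"])
print(ordered)
```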

Example

An example timeline file for SPECIMEN would be:

PATIENT_ID<TAB>START_DATE<TAB>EVENT_TYPE<TAB>SPECIMEN_REFERENCE_NUMBER<TAB>SPECIMEN_SITE<TAB>SPECIMEN_TYPE<TAB>SOURCE<TAB>MyCustomColumn
CACO2<TAB>0<TAB>SPECIMEN<TAB>CACO2_S1<TAB>liver<TAB>tissue<TAB>hospital<TAB>T1
CACO2<TAB>100<TAB>SPECIMEN<TAB>CACO2_S2<TAB>lung<TAB>tissue<TAB>hospital<TAB>T2
...

Assuming the sample identifiers were also defined in the clinical file, this will lead to a timeline track with numbered specimen samples.

An example timeline file for Lab_test would be:

PATIENT_ID<TAB>START_DATE<TAB>EVENT_TYPE<TAB>TEST<TAB>RESULT
CACO2<TAB>100<TAB>LAB_TEST<TAB>PSA<TAB>10
CACO2<TAB>250<TAB>LAB_TEST<TAB>PSA<TAB>100
...

This will lead to a timeline track for Lab_test with an additional subtrack specifically for PSA. PSA's events will be sized based on the result.

Gistic Data

Running GISTIC 2.0 on e.g. GenePattern not only provides the Discrete Copy Number Data, but also provides an amp_genes and a del_genes file. These cannot be directly imported into cBioPortal, but first have to be converted to a different file format. Currently, there is no easy way available to do this. However, the cBioPortal team is aiming to make the necessary cbioportal_pipelines functionality available via issue #873.

After uploading a gistic_amp and/or gistic_del file, a new button becomes available in the Enter Gene Set section, called "Select Genes from Recurrent CNAs (Gistic)".

Meta file

The Gistic metadata file should contain the following fields:

  1. cancer_study_identifier: same value as specified in study meta file
  2. genetic_alteration_type: GISTIC_GENES_AMP or GISTIC_GENES_DEL
  3. datatype: Q-VALUE
  4. reference_genome_id: reference genome version. Supported values: "hg19"
  5. data_filename: <your datafile>

An example metadata file would be:

cancer_study_identifier: brca_tcga
genetic_alteration_type: GISTIC_GENES_AMP
datatype: Q-VALUE
reference_genome_id: hg19
data_filename: data_gistic_genes_amp.txt

Data file

The following fields from the generated Gistic file are used by the cBioPortal importer:

  • chromosome: chromosome on which the region was found, without the chr prefix
  • peak_start: start coordinate of the region of maximal amplification or deletion within the significant region
  • peak_end: end coordinate of the region of maximal amplification or deletion within the significant region
  • genes_in_region: comma-separated list of HUGO gene symbols in the "wide peak" (allowing for single-sample errors in the peak boundaries)
  • amp: 1 for amp, 0 for del
  • cytoband: cytogenetic band specification of the region, including chromosome (Giemsa stain)
  • q_value: the q-value of the peak region

Example

An example data file which includes the required column header would look like:

chromosome<TAB>peak_start<TAB>peak_end<TAB>genes_in_region<TAB>amp<TAB>cytoband<TAB>q_value<TAB>
1<TAB>150563314<TAB>150621176<TAB>SNORA40|ENSG00000253047.1,RN7SL600P,RN7SL473P,C1orf138,LINC00568,CTSS,ECM1,ENSA,MCL1,RPRD2,ADAMTSL4,GOLPH3L,TARS2,HORMAD1,MIR4257,<TAB>1<TAB>1q21.3<TAB>2.7818E-43<TAB>
1<TAB>85988564<TAB>85991712<TAB>DDAH1,<TAB>1<TAB>1p22.3<TAB>4.1251E-13<TAB>
...
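
Note that the genes_in_region value in the example ends with a trailing comma, and each line ends with a trailing tab. A reader for this file has to tolerate both. The sketch below shows one way to do so; `read_gistic` is a hypothetical helper, and the gene list in the first row is shortened for readability.

```python
import csv
import io


def read_gistic(handle):
    """Read a converted GISTIC genes file into a list of region dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    regions = []
    for row in reader:
        regions.append({
            "chromosome": row["chromosome"],
            "peak_start": int(row["peak_start"]),
            "peak_end": int(row["peak_end"]),
            # Drop the empty entry left by the trailing comma.
            "genes": [g for g in row["genes_in_region"].split(",") if g],
            "amp": row["amp"] == "1",  # 1 for amp, 0 for del
            "cytoband": row["cytoband"],
            "q_value": float(row["q_value"]),
        })
    return regions


# Trailing tabs are kept to mirror the example above.
example = (
    "chromosome\tpeak_start\tpeak_end\tgenes_in_region\tamp\tcytoband\tq_value\t\n"
    "1\t150563314\t150621176\tCTSS,ECM1,MCL1,\t1\t1q21.3\t2.7818E-43\t\n"
    "1\t85988564\t85991712\tDDAH1,\t1\t1p22.3\t4.1251E-13\t\n"
)

regions = read_gistic(io.StringIO(example))
for region in regions:
    print(region["cytoband"], region["genes"], region["q_value"])
```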

MutSig Data

MutSig stands for "Mutation Significance". MutSig analyzes lists of mutations discovered in DNA sequencing to identify genes that were mutated more often than expected by chance, given background mutation processes. You can download MutSig from the Broad Institute (MutSigCV 1.4 is available) or run MutSig (MutSigCV 1.2 is available) using GenePattern.

Note: The TCGA files that are uploaded to cBioPortal are generated using MutSig2.0. This version is not available outside the Broad Institute.

The header of the MutSigCV 1.2 output differs from the MutSig2.0 header. TODO: test the 1.4 version. Requires > 10 GB of memory.

After uploading a MutSig file, a new button becomes available in the Enter Gene Set section, called "Select From Recurrently Mutated Genes (MutSig)".

This type of data is not yet validated. It can, however, be uploaded.

Meta file

The MutSig metadata file should contain the following fields:

  1. cancer_study_identifier: same value as specified in study meta file
  2. genetic_alteration_type: MUTSIG
  3. datatype: Q-VALUE
  4. data_filename: <your datafile>

An example metadata file would be:

cancer_study_identifier: brca_tcga
genetic_alteration_type: MUTSIG
datatype: Q-VALUE
data_filename: data_mutsig.txt

Data file

The following fields from a MutSig file are used by the cBioPortal importer:

  • rank
  • gene: this is the HUGO symbol
  • N (or Nnon): bases covered
  • n (or nnon): number of mutations
  • p: result of testing the hypothesis that all of the observed mutations in this gene are a consequence of random background mutation processes, taking into account the list of bases that are successfully interrogated by sequencing (i.e., “covered”) and the list of observed somatic mutations, as well as the length and composition of the gene in addition to the background mutation rates in different sequence contexts (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3059829/)
  • q: p-value corrected for multiple testing

Example

An example data file which includes the required column header would look like:

rank<TAB>gene<TAB>N<TAB>n<TAB>p<TAB>q
1<TAB>RUNX1<TAB>1051659<TAB>29<TAB>1.11E-16<TAB>1.88E-12
2<TAB>PIK3CA<TAB>3200341<TAB>351<TAB><1.00e-15<TAB><2.36e-12
...
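
The p and q columns may contain values such as "<1.00e-15", so they cannot always be passed straight to a float parser. The sketch below treats such a value as its upper bound, which is an assumption made for illustration and not a description of how the importer handles it. The names `parse_bounded` and `significant_genes` and the thresholds are likewise hypothetical.

```python
import csv
import io


def parse_bounded(text):
    """Parse a p or q value; "<x" is read as the upper bound x (an assumption)."""
    return float(text.strip().lstrip("<"))


def significant_genes(handle, threshold=0.1):
    """Return the genes whose q value is at or below the given threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["gene"] for row in reader
            if parse_bounded(row["q"]) <= threshold]


example = (
    "rank\tgene\tN\tn\tp\tq\n"
    "1\tRUNX1\t1051659\t29\t1.11E-16\t1.88E-12\n"
    "2\tPIK3CA\t3200341\t351\t<1.00e-15\t<2.36e-12\n"
)

all_hits = significant_genes(io.StringIO(example))
strict_hits = significant_genes(io.StringIO(example), threshold=2e-12)
print(all_hits, strict_hits)
```

With the stricter threshold only RUNX1 remains, because the upper bound given for PIK3CA (2.36e-12) lies above 2e-12.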