Chapter 11 TPC Pipelines(1)
11.1 Extensive tumor data
11.1.1 Compiling types of identifiers
- For TCGA samples, there are 5 main data types and 23 subtypes (excluding Custom metadata).
- For PCAWG samples, there are 5 main data types and 17 subtypes (excluding Custom metadata).
- For CCLE samples, there are 3 main data types and 6 subtypes (excluding Custom metadata).
For the core data type, Molecular profile, the first step of every analysis pipeline is to check and, if needed, modify the molecular datasets through the “Modify datasets[opt]” widget. For some molecular types in the TPC databases, alternative datasets are available for users to select, which broadens the analytical possibilities.
- Among the 7 types of TCGA molecules, 4 have alternative datasets.
Tips: For DNA Methylation, the mean value of all CpG sites under one gene is calculated for that gene by default. In the “Modify datasets[opt]” widget, users can further limit the CpG sites and change the aggregation method (e.g. median).
- Among the 5 types of PCAWG molecules, 2 have alternative datasets.
- Among the 4 types of CCLE molecules, 1 has alternative datasets.
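
The DNA Methylation tip above describes collapsing per-site values into one gene-level value. A minimal Python sketch of that idea follows. The function name `aggregate_cpg`, the site IDs, and the beta values are hypothetical, and the application's actual implementation may differ.

```python
from statistics import mean, median

def aggregate_cpg(beta_by_site, sites=None, method="mean"):
    """Collapse per-CpG beta values of one gene into a single value.

    beta_by_site: dict mapping CpG site ID -> beta value for one sample.
    sites: optional collection of CpG IDs to keep; None keeps every site.
    method: "mean" (the default described in the tip) or "median".
    """
    aggregators = {"mean": mean, "median": median}
    if method not in aggregators:
        raise ValueError(f"unsupported method: {method}")
    kept = [value for site, value in beta_by_site.items()
            if sites is None or site in sites]
    if not kept:
        raise ValueError("no CpG sites left after filtering")
    return aggregators[method](kept)

# Hypothetical beta values for three CpG sites of one gene in one sample.
example = {"cg001": 0.10, "cg002": 0.20, "cg003": 0.90}
default_value = aggregate_cpg(example)                    # mean of all sites
median_value = aggregate_cpg(example, method="median")    # median of all sites
subset_value = aggregate_cpg(example, sites={"cg001", "cg002"})
```

With one outlying site, as in this toy example, the median is less affected than the mean, which is one reason to switch the aggregation method.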
11.1.2 Reference of full identifiers
- In the “Help→ID query” page, all identifiers for the “Molecular profile”, “Tumor index”, “Immune Infiltration”, “Pathway activity”, and “Phenotype data” types of TPC samples are compiled for easy lookup before analysis.
- For example, valid identifiers under the TCGA mRNA Expression subtype are shown in the following figure.
11.1.3 User-defined identifiers
11.1.3.1 Upload of random identifiers
- Users can upload random identifiers for joint analysis in the “Upload metadata” part of S1 step.
- An example file can be downloaded via the button on the right to check the format requirements:
- The file must be in .CSV format;
- The first column is named “Sample” and contains sample IDs (e.g. “TCGA-OR-A5J1-01” for TCGA samples). Only the samples shared between this column and all TPC samples are considered;
- The other columns are random identifiers with continuous or categorical values.
- After a successful upload, the identifiers are added to the “Custom metadata” subtype of “Phenotype data”.
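
The format requirements above can be checked before uploading. The following is a minimal Python sketch under stated assumptions: `read_custom_metadata`, the column names `my_score` and `my_class`, and the set of known sample IDs are all hypothetical, and the application performs its own validation.

```python
import csv
import io

def read_custom_metadata(csv_text, known_samples):
    """Parse metadata CSV text and keep only rows for known samples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = reader.fieldnames or []
    if not fields or fields[0] != "Sample":
        raise ValueError('the first column must be named "Sample"')
    if len(fields) < 2:
        raise ValueError("at least one identifier column is required")
    kept = {}
    for row in reader:
        sample_id = row["Sample"]
        # Only the intersection with the known samples is considered.
        if sample_id in known_samples:
            kept[sample_id] = {name: row[name] for name in fields[1:]}
    return kept

# Hypothetical upload: one valid TCGA sample ID and one unknown ID.
upload = (
    "Sample,my_score,my_class\n"
    "TCGA-OR-A5J1-01,1.5,high\n"
    "NOT-A-SAMPLE,0.2,low\n"
)
known = {"TCGA-OR-A5J1-01", "TCGA-19-1787-01"}
metadata = read_custom_metadata(upload, known)
```

In this sketch the values stay as strings; deciding whether a column is continuous or categorical is left to a later step.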
11.1.3.2 Design of custom signature
- Users can also design a custom signature score based on a self-defined molecular formula in the “Add signature” part of S1 step.
- After the signature is designed and its scores are calculated, it is also added to the “Custom metadata” subtype of “Phenotype data”.
- As shown in the figure below, the signature can be set up with the following steps after clicking the “Edit” button.
- First, modify the name and select the molecular type;
- Then, design the molecular formula, which comprises multiple molecules with their respective coefficients;
- The coefficient of each molecule is divided into two parts (Directions and Absolute Coef).
- The “Del” button removes the most recently added molecule, and the “Reset” button clears all selected molecules.
- The formula is displayed in real time at the bottom.
- Finally, click the “Query data” button to calculate the signature scores, then click the button on the right to add the signature to the “Custom metadata” subtype of “Phenotype data”. In addition, users can download the result locally via the bottom button.
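
To make the two-part coefficient concrete, here is a minimal Python sketch. It assumes the formula is a signed, weighted sum of molecule values, which the text above does not state explicitly. The function names, the chosen molecules, and all numbers are hypothetical.

```python
def signature_score(values, terms):
    """Score one sample as a signed, weighted sum of molecule values.

    values: dict mapping molecule name -> value for one sample.
    terms: list of (molecule, direction, abs_coef); direction is "+" or "-".
    """
    score = 0.0
    for molecule, direction, abs_coef in terms:
        if direction not in ("+", "-"):
            raise ValueError(f"direction must be '+' or '-': {direction}")
        if abs_coef < 0:
            raise ValueError("the absolute coefficient must be non-negative")
        sign = 1.0 if direction == "+" else -1.0
        score += sign * abs_coef * values[molecule]
    return score

def format_formula(terms):
    """Render the formula as text, similar in spirit to a live preview."""
    return " ".join(f"{direction} {abs_coef} * {molecule}"
                    for molecule, direction, abs_coef in terms)

# Hypothetical two-molecule signature and one sample's values.
terms = [("TP53", "+", 0.5), ("KRAS", "-", 2.0)]
sample_values = {"TP53": 4.0, "KRAS": 1.5}
formula_text = format_formula(terms)            # "+ 0.5 * TP53 - 2.0 * KRAS"
score = signature_score(sample_values, terms)   # 0.5 * 4.0 - 2.0 * 1.5 = -1.0
```

Keeping the direction separate from the absolute coefficient mirrors the two input parts described above and makes it harder to enter a sign by mistake.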
11.2 Sample processing
11.2.1 Sample filtering
After one tumor is selected, all of its samples are selected by default. Here, we provide two methods to obtain a specific subpopulation in the “Filter samples” part of S1 step.
11.2.1.1 Quick filter
Users can make quick selections directly based on the tissue (pathological) code types of samples. Samples are categorized into different types by the 14th and 15th characters of the TCGA sample ID (e.g., “01” in TCGA-19-1787-01). The common code types involved in UCSCXenaShiny analysis are shown in the following table.
Code | Definition | Short Letter Code |
---|---|---|
01 | Primary Solid Tumor | TP |
02 | Recurrent Solid Tumor | TR |
03 | Primary Blood Derived Cancer - Peripheral Blood | TB |
05 | Additional - New Primary | TAP |
06 | Metastatic | TM |
07 | Additional Metastatic | TAM |
11 | Solid Tissue Normal | NT |
For the PCAWG database, a similar operation quickly selects tumor or normal samples. For the CCLE database, no quick filter is available.
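
The code lookup for TCGA samples can be sketched in a few lines of Python. The mapping below copies the table above; the function names `sample_type` and `quick_filter` are hypothetical and only illustrate the idea.

```python
# Sample type codes copied from the table above.
SAMPLE_TYPE_CODES = {
    "01": ("Primary Solid Tumor", "TP"),
    "02": ("Recurrent Solid Tumor", "TR"),
    "03": ("Primary Blood Derived Cancer - Peripheral Blood", "TB"),
    "05": ("Additional - New Primary", "TAP"),
    "06": ("Metastatic", "TM"),
    "07": ("Additional Metastatic", "TAM"),
    "11": ("Solid Tissue Normal", "NT"),
}

def sample_type(sample_id):
    """Return (code, definition, short letter code) for a TCGA sample ID."""
    code = sample_id[13:15]  # 14th and 15th characters, 1-based
    if code not in SAMPLE_TYPE_CODES:
        raise KeyError(f"unknown sample type code: {code!r}")
    definition, short = SAMPLE_TYPE_CODES[code]
    return code, definition, short

def quick_filter(sample_ids, short_codes):
    """Keep the sample IDs whose short letter code is in short_codes."""
    return [s for s in sample_ids if sample_type(s)[2] in short_codes]

ids = ["TCGA-19-1787-01", "TCGA-OR-A5J1-01", "TCGA-OR-A5J1-11"]
primary_only = quick_filter(ids, {"TP"})
```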
11.2.1.2 Exact filter
Through this module, users can perform personalized sample filtering. The operational steps are as follows:
- Set the candidate conditions from the comprehensive identifiers. Each added identifier is renamed as “Condition_Num”.
- When a molecular identifier is selected as a condition, downloading it from UCSC Xena may take some time, depending on the network.
- For TPC samples, some basic clinical variables are already included in the candidate pool. For TCGA samples, they are Age, Code, Gender, Stage_ajcc, Stage_clinical, and Grade.
- Get an overview of the distribution of each candidate condition to guide the filtering.
- For continuous variables, the mean and the distribution of multiple percentiles are displayed.
- For categorical variables, the frequency of each category is displayed.
- Filter on one or more conditions from the candidate list above. In total, 6 operators are provided for versatile filtering, depending on the type of each condition.
- For categorical variables, “+” and “-” retain or discard specific categories of samples.
- For continuous variables, “>” and “<” set an absolute cutoff, while “%>” and “%<” set a percentile cutoff.
Tips: If a percentile cutoff is involved, the results may vary with the order of filtering, as illustrated in the figure above.
- The first instance: among female samples, those in the top 50% by TP53 expression. → 23 samples
- The second instance: females among the top 50% of all samples by TP53 expression. → 28 samples
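
The order dependence can be reproduced on a toy dataset. The sketch below is an illustration only: the eight samples and their values are made up, and the percentile is computed by linear interpolation, which may not match how the application computes it.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)

def keep_category(rows, key, category):
    """The "+" operator: retain rows whose categorical value matches."""
    return [row for row in rows if row[key] == category]

def keep_above_percentile(rows, key, pct):
    """The "%>" operator: retain rows above the percentile of the CURRENT rows."""
    if not rows:
        return []
    cutoff = percentile([row[key] for row in rows], pct)
    return [row for row in rows if row[key] > cutoff]

# Made-up samples: gender and a TP53 expression value.
rows = [
    {"id": "S1", "gender": "female", "TP53": 1.0},
    {"id": "S2", "gender": "female", "TP53": 2.0},
    {"id": "S3", "gender": "female", "TP53": 3.0},
    {"id": "S4", "gender": "female", "TP53": 6.0},
    {"id": "S5", "gender": "male", "TP53": 4.0},
    {"id": "S6", "gender": "male", "TP53": 5.0},
    {"id": "S7", "gender": "male", "TP53": 7.0},
    {"id": "S8", "gender": "male", "TP53": 8.0},
]

# Females first, then the top 50% within females (cutoff 2.5).
female_then_top = keep_above_percentile(
    keep_category(rows, "gender", "female"), "TP53", 50)
# Top 50% of all samples first (cutoff 4.5), then females.
top_then_female = keep_category(
    keep_above_percentile(rows, "TP53", 50), "gender", "female")

ids_a = [row["id"] for row in female_then_top]   # ["S3", "S4"]
ids_b = [row["id"] for row in top_then_female]   # ["S4"]
```

The two orders differ because the percentile cutoff is recomputed on whatever samples remain at that step.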
11.2.2 Sample grouping
For binary comparison or survival analysis, the samples must be divided into two groups based on one specific condition. Here, we also provide a personalized grouping module to flexibly set two ranges of sample populations.
Generally, it has two main steps:
Select one condition and observe its distribution. As in the filtering module, observing the distribution of a continuous or categorical condition gives a clear picture before grouping.
Set two non-overlapping sample ranges to group.
- For a categorical variable, assign one or more categories to each group (Left figure).
- For a continuous variable, set the minimum and maximum thresholds for each group.
- By default, the first group (Group1) has the lower values, with a left-closed, right-open range, e.g. [0, 1). If its left bound is blank, it defaults to the minimum value of the condition.
- The second group (Group2) has the larger values, with a left-closed, right-closed range, e.g. [0, 1]. If its right bound is blank, it defaults to the maximum value of the condition.
- The thresholds can be percentile values (Middle figure) or absolute values (Right figure), depending on the purpose.
Tips: By default, Group1 comes first during analysis and visualization. You can click the switch “Whether reverse levels” to change the order.
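
The interval rules for continuous variables can be sketched as follows. This is a minimal Python illustration with hypothetical names and values. It assumes absolute thresholds, uses `None` for a blank bound, and leaves samples that fall in neither range ungrouped, which is an assumption about the application's behavior.

```python
def group_samples(values, group1, group2):
    """Assign samples to Group1 [low, high) or Group2 [low, high].

    values: dict mapping sample ID -> continuous value.
    group1, group2: (low, high) tuples; None stands for a blank bound.
    A blank left bound of Group1 becomes the minimum of the condition;
    a blank right bound of Group2 becomes the maximum.
    """
    lowest, highest = min(values.values()), max(values.values())
    g1_low = lowest if group1[0] is None else group1[0]
    g1_high = group1[1]
    g2_low = group2[0]
    g2_high = highest if group2[1] is None else group2[1]
    if g1_high > g2_low:
        raise ValueError("the two ranges must not overlap")
    groups = {}
    for sample, value in values.items():
        if g1_low <= value < g1_high:        # left-closed, right-open
            groups[sample] = "Group1"
        elif g2_low <= value <= g2_high:     # left-closed, right-closed
            groups[sample] = "Group2"
    return groups

# Made-up values; S3 sits exactly on the open end of Group1.
values = {"S1": 0.2, "S2": 1.0, "S3": 1.5, "S4": 3.0, "S5": 4.0}
groups = group_samples(values, (None, 1.5), (3.0, None))
```

Here S3 is excluded from Group1 because the right end of that range is open, and it is below the Group2 range, so it ends up in neither group.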
11.3 Data fetch
After sample selection and processing, we usually need to fetch data for downstream analysis.
11.3.1 Single data
Most of the time, we just need to select one identifier and fetch its values. Following the hierarchical organization (Data type → Data subtype → Identifier), select the identifier of interest and then click the “Query” button to fetch the data. Note that for Molecular profile data types, downloading from UCSC Xena may take some time, depending on the network.
11.3.2 Multiple data
For batch analysis, we can select multiple identifiers in one of three ways.
- In the first way (Left figure), users can select multiple identifiers of one specific subtype one by one.
- In the second way (Middle figure), users can select all identifiers of one specific subtype. Notably, for some molecular types with many identifiers, all the identifiers (genes/transcripts) belonging to one MSigDB signature can be selected.
- These molecular types are “mRNA Expression”, “DNA Methylation”, “Mutation status”, “Copy Number Variation”, and “Transcript Expression”.
- The MSigDB gene signatures are obtained via the msigdbr package and fall into 9 main categories (see the table at the bottom).
- Two links are available for detailed information on the selected signature and the MSigDB database.
- In the third way (Right figure), users can directly upload an identifier file with the following requirements:
- TXT format;
- One column without a column name;
- Only the valid identifiers that overlap with the selected data subtype are retained.
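
The identifier file described in the third way can be pre-checked with a short script. The Python sketch below uses a hypothetical function name and a made-up set of valid identifiers. Dropping blank lines and duplicates is an added convenience, not something the text above specifies.

```python
def read_identifier_file(text, valid_identifiers):
    """Read one identifier per line and keep those valid for the subtype."""
    kept = []
    seen = set()
    for line in text.splitlines():
        identifier = line.strip()
        if not identifier or identifier in seen:
            continue  # skip blank lines and repeated identifiers
        seen.add(identifier)
        if identifier in valid_identifiers:
            kept.append(identifier)
    return kept

# Hypothetical file content (no header line) and valid identifier set.
file_text = "TP53\nKRAS\nNOT_A_GENE\nTP53\n"
valid = {"TP53", "KRAS", "EGFR"}
selected = read_identifier_file(file_text, valid)
```

The order of the retained identifiers follows the file, which keeps the result predictable when the list is later used for batch analysis.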
Category | Notes | Subcategory |
---|---|---|
H | hallmark gene sets | / |
C1 | positional gene sets | / |
C2 | curated gene sets | CGP, CP, CP:BIOCARTA, CP:KEGG, CP:PID, CP:REACTOME, CP:WIKIPATHWAYS |
C3 | regulatory target gene sets | MIR:MIRDB, MIR:MIR_Legacy, TFT:GTRD, TFT:TFT_Legacy |
C4 | computational gene sets | CGN, CM |
C5 | ontology gene sets | GO:BP, GO:CC, GO:MF, HPO |
C6 | oncogenic signature gene sets | / |
C7 | immunologic signature gene sets | IMMUNESIGDB, VAX |
C8 | cell type signature gene sets | / |