Chapter 4 Molecular Data Query
4.1 TCGA query
- Search molecular identifiers for TCGA samples via the list pancan_identifier_help (see Figure 3.2)
Database | Type | Datasets | Function |
---|---|---|---|
TCGA | mRNA | 3 | get_pancan_gene_value() |
TCGA | transcript | 3 | get_pancan_transcript_value() |
TCGA | protein | 1 | get_pancan_protein_value() |
TCGA | mutation | 1 | get_pancan_mutation_status() |
TCGA | cnv | 3 | get_pancan_cn_value() |
TCGA | methylation | 2 | get_pancan_methylation_value() |
TCGA | miRNA | 1 | get_pancan_miRNA_value() |
4.1.1 get_pancan_gene_value()
get_pancan_gene_value(identifier, norm = c("tpm", "fpkm", "nc"))
data.list = get_pancan_gene_value("TP53", norm = "tpm")
data = data.list$expression
head(data.frame(value=data))
## value
## GTEX-S4Q7-0003-SM-3NM8M 4.785
## TCGA-19-1787-01 5.887
## TCGA-S9-A7J2-01 5.517
## GTEX-QV31-1626-SM-2S1QC 4.431
## TCGA-G3-A3CH-11 2.382
## TCGA-B5-A5OE-01 5.765
Information of available datasets:
Xena Hub | Xena Datasets | Sample Size | Unit |
---|---|---|---|
toilHub | TcgaTargetGtex_rsem_gene_tpm | 19131 | log2(tpm+0.001) |
toilHub | TcgaTargetGtex_rsem_gene_fpkm | 19131 | log2(fpkm+0.001) |
toilHub | TcgaTargetGtex_RSEM_Hugo_norm_count | 19120 | log2(norm_count+1) |
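The toil datasets mix TCGA and GTEx samples in one vector, as the output above shows. The following is a minimal sketch of separating them by sample ID. It rebuilds a small named vector from the printed values instead of calling the package, and it assumes the standard TCGA barcode layout in which characters 14-15 encode the sample type (01-09 tumor, 10-19 normal).

```r
# Named numeric vector mimicking the structure of data.list$expression
# (IDs and values copied from the printed output above).
expr <- c(
  "GTEX-S4Q7-0003-SM-3NM8M" = 4.785,
  "TCGA-19-1787-01"         = 5.887,
  "TCGA-S9-A7J2-01"         = 5.517,
  "GTEX-QV31-1626-SM-2S1QC" = 4.431,
  "TCGA-G3-A3CH-11"         = 2.382,
  "TCGA-B5-A5OE-01"         = 5.765
)
ids <- names(expr)

# Project prefix before the first hyphen, e.g. "TCGA" or "GTEX".
source_db <- sub("-.*$", "", ids)
is_tcga <- source_db == "TCGA"

# Sample type code, extracted for TCGA barcodes only.
type_code <- rep(NA_integer_, length(ids))
type_code[is_tcga] <- as.integer(substr(ids[is_tcga], 14, 15))

# Assumption: codes below 10 are tumor, 10-19 are normal tissue.
sample_group <- ifelse(
  is_tcga,
  ifelse(type_code < 10, "TCGA tumor", "TCGA normal"),
  source_db
)

annotated <- data.frame(sample = ids, value = unname(expr), group = sample_group)

# Mean expression per group.
group_means <- aggregate(value ~ group, data = annotated, FUN = mean)
print(group_means)
```

The same grouping can be applied to the full vector returned by get_pancan_gene_value() before plotting or testing.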
4.1.2 get_pancan_transcript_value()
get_pancan_transcript_value(identifier, norm = c("tpm", "fpkm", "nc"))
data.list = get_pancan_transcript_value("ENST00000456328", norm = "tpm")
data = data.list$expression
head(data.frame(value=data))
## value
## GTEX-S4Q7-0003-SM-3NM8M -5.012
## TCGA-19-1787-01 -9.966
## TCGA-S9-A7J2-01 -4.035
## GTEX-QV31-1626-SM-2S1QC -9.966
## TCGA-G3-A3CH-11 -9.966
## GTEX-13OVI-1026-SM-5L3EM -9.966
Information of available datasets:
Xena Hub | Xena Datasets | Sample Size | Unit |
---|---|---|---|
toilHub | TcgaTargetGtex_rsem_isoform_tpm | 19131 | log2(tpm+0.001) |
toilHub | TcgaTargetGtex_RSEM_isoform_fpkm | 19129 | log2(fpkm+0.001) |
toilHub | TcgaTargetGtex_rsem_isopct | 19131 | IsoPct |
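The repeated value -9.966 in the output above follows from the unit log2(tpm+0.001): a transcript with zero TPM gives log2(0.001), which rounds to -9.966. A short sketch of back-transforming to TPM and flagging undetected samples, using only the values printed above:

```r
# log2(tpm + 0.001) values copied from the printed output above.
log2_tpm <- c(-5.012, -9.966, -4.035, -9.966, -9.966, -9.966)
pseudocount <- 0.001

# Value produced by a sample with TPM = 0.
floor_value <- log2(0 + pseudocount)

# Back-transform; clamp tiny negative values caused by rounding to zero.
tpm <- pmax(2^log2_tpm - pseudocount, 0)

# Flag samples sitting at the floor (tolerance covers 3-decimal rounding).
not_detected <- abs(log2_tpm - floor_value) < 1e-3

print(data.frame(log2_tpm, tpm = signif(tpm, 3), not_detected))
```

The same arithmetic applies to the fpkm dataset, which uses the same pseudocount according to the table.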
4.1.3 get_pancan_protein_value()
get_pancan_protein_value(identifier)
data.list = get_pancan_protein_value("ACC_pS79")
data = data.list$expression
head(data.frame(value=data))
## value
## TCGA-FI-A2EY-01 2.2170
## TCGA-DF-A2KS-01 0.4139
## TCGA-A5-A1OH-01 0.0000
## TCGA-AX-A2H7-01 0.3248
## TCGA-AX-A2HA-01 -1.2410
## TCGA-A5-A2K4-01 -0.2814
Information of available datasets:
Xena Hub | Xena Datasets | Sample Size | Unit |
---|---|---|---|
pancanAtlasHub | TCGA-RPPA-pancan-clean.xena | 7744 | norm_value |
4.1.4 get_pancan_mutation_status()
get_pancan_mutation_status(identifier)
## value
## TCGA-02-0003-01 1
## TCGA-02-0033-01 1
## TCGA-02-0047-01 0
## TCGA-02-0055-01 1
## TCGA-02-2470-01 0
## TCGA-02-2483-01 1
Information of available datasets:
Xena Hub | Xena Datasets | Sample Size | Unit |
---|---|---|---|
pancanAtlasHub | mc3.v0.2.8.PUBLIC.nonsilentGene.xena | 9104 | NA |
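Because the mutation status is a 0/1 vector named by sample, it can be joined to any other per-sample vector by ID. The sketch below compares expression between mutated and wild-type samples. The mutation vector is copied from the output above; the expression values and the extra sample "TCGA-06-0150-01" are hypothetical and serve only to illustrate the join.

```r
# Mutation status copied from the printed output above (1 = mutated).
mut_status <- c(
  "TCGA-02-0003-01" = 1, "TCGA-02-0033-01" = 1, "TCGA-02-0047-01" = 0,
  "TCGA-02-0055-01" = 1, "TCGA-02-2470-01" = 0, "TCGA-02-2483-01" = 1
)

# Hypothetical expression values; not every sample has both data types.
expr <- c(
  "TCGA-02-0003-01" = 5.1, "TCGA-02-0033-01" = 4.8, "TCGA-02-0047-01" = 6.2,
  "TCGA-02-2470-01" = 6.0, "TCGA-06-0150-01" = 5.5
)

# Keep only samples present in both vectors.
common <- intersect(names(mut_status), names(expr))

merged <- data.frame(
  sample = common,
  status = factor(mut_status[common], levels = c(0, 1),
                  labels = c("wild-type", "mutated")),
  expr   = unname(expr[common])
)

# Mean expression per mutation group and overall mutation frequency.
group_means <- tapply(merged$expr, merged$status, mean)
mut_freq <- mean(mut_status == 1)
print(group_means)
print(mut_freq)
```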
4.1.5 get_pancan_cn_value()
get_pancan_cn_value(identifier, gistic2 = TRUE, use_thresholded_data = FALSE)
## value
## TCGA-A5-A0GI-01 0.014
## TCGA-S9-A7J2-01 0.068
## TCGA-06-0150-01 0.015
## TCGA-AR-A1AH-01 -0.761
## TCGA-EK-A2RE-01 -0.024
## TCGA-44-6778-01 -0.317
Information of available datasets:
Xena Hub | Xena Datasets | Sample Size | Unit |
---|---|---|---|
tcgaHub | …_Gistic2_all_data_by_genes | 10845 | Gistic2 copy number |
tcgaHub | …_Gistic2_all_thresholded.by_genes | 10845 | -2, -1, 0, 1, 2: 2 copy del, 1 copy del, no change, amplification, high-amplification |
pancanAtlasHub | …_SNP_6_whitelisted.gene.xena | 10873 | log(tumor/normal) |
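The thresholded GISTIC2 dataset encodes copy-number states as the integers -2 to 2 listed in the table. A minimal sketch of turning such codes into labelled categories follows; the input codes are hypothetical, and the labels are taken from the Unit column above.

```r
# Hypothetical thresholded GISTIC2 codes for seven samples.
codes <- c(-2, -1, 0, 1, 2, 0, 1)

# Labels in the order given by the Unit column of the table above.
state_labels <- c("2 copy del", "1 copy del", "no change",
                  "amplification", "high-amplification")

# An ordered factor keeps the deletion-to-amplification order in tables and plots.
cn_state <- factor(codes, levels = -2:2, labels = state_labels, ordered = TRUE)

print(table(cn_state))
```

An ordered factor also allows comparisons such as cn_state >= "amplification" when selecting amplified samples.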
4.1.6 get_pancan_methylation_value()
get_pancan_methylation_value(
identifier,
type = c("450K", "27K"),
rule_out = NULL,
aggr = c("NA", "mean", "Q0", "Q25", "Q50", "Q75", "Q100")
)
- rule_out: exclude some CpG site(s) under one gene
- aggr: select one aggregation method to calculate gene-level methylation (default: "NA", i.e. mean)
## value
## TCGA-S6-A8JX-01 0.07085
## TCGA-SO-A8JP-01 0.08410
## TCGA-YU-A90Q-01 0.08465
## TCGA-2G-AAH8-01 0.09373
## TCGA-2G-AAGY-05 0.09546
## TCGA-XE-AAOL-01 0.09774
Information of available datasets:
Xena Hub | Xena Datasets | Sample Size | Unit |
---|---|---|---|
gdcHub | GDC-PANCAN.methylation450.tsv | 9736 | beta value |
gdcHub | GDC-PANCAN.methylation27.tsv | 2595 | beta value |
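To make the roles of rule_out and aggr concrete, here is a sketch of gene-level aggregation over CpG-level beta values. It is not the package's implementation: aggregate_cpg() is a hypothetical helper, the CpG and sample names are made up, and it assumes that Q0 to Q100 denote the 0th to 100th percentiles.

```r
# Hypothetical beta values: rows are samples, columns are CpG sites of one gene.
beta <- matrix(
  c(0.10, 0.20, 0.30,
    0.20, 0.40, 0.60,
    0.90, 0.80, 0.70,
    0.40, 0.40, 0.40),
  nrow = 3,
  dimnames = list(c("sample_1", "sample_2", "sample_3"),
                  c("cg_a", "cg_b", "cg_c", "cg_d"))
)

# Hypothetical helper mirroring the documented arguments.
aggregate_cpg <- function(beta, rule_out = NULL, aggr = "NA") {
  # Drop excluded CpG sites first.
  keep <- setdiff(colnames(beta), rule_out)
  beta <- beta[, keep, drop = FALSE]
  probs <- c(Q0 = 0, Q25 = 0.25, Q50 = 0.5, Q75 = 0.75, Q100 = 1)
  if (aggr %in% c("NA", "mean")) {
    rowMeans(beta, na.rm = TRUE)
  } else {
    # Assumption: Qxx means the xx-th percentile across CpG sites.
    apply(beta, 1, quantile, probs = probs[[aggr]], na.rm = TRUE, names = FALSE)
  }
}

print(aggregate_cpg(beta))                      # mean over all CpG sites
print(aggregate_cpg(beta, rule_out = "cg_c"))   # mean without cg_c
print(aggregate_cpg(beta, aggr = "Q50"))        # median across CpG sites
```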
4.1.7 get_pancan_miRNA_value()
get_pancan_miRNA_value(identifier)
data.list = get_pancan_miRNA_value("hsa-let-7a-2-3p")
data = data.list$expression
head(data.frame(value=data))
## value
## TCGA-C4-A0F6-01 0.99
## TCGA-CU-A0YO-01 1.91
## TCGA-BT-A0S7-01 3.02
## TCGA-CU-A0YR-01 0.85
## TCGA-BL-A0C8-01 0.85
## TCGA-C4-A0F0-01 2.70
Information of available datasets:
Xena Hub | Xena Datasets | Sample Size | Unit |
---|---|---|---|
pancanAtlasHub | pancanMiRs_EBadjOnProtocolPlatformWithoutRepsWithUnCorrectMiRs_08_04_16.xena | 10818 | log2(norm_value+1) |
4.2 PCAWG query
- Search molecular identifiers for PCAWG samples via the list pcawg_identifier (see Figure 3.3)
Database | Type | Datasets | Function |
---|---|---|---|
PCAWG | mRNA | 1 | get_pcawg_gene_value() |
PCAWG | fusion | 1 | get_pcawg_fusion_value() |
PCAWG | miRNA | 2 | get_pcawg_miRNA_value() |
PCAWG | promoter | 3 | get_pcawg_promoter_value() |
PCAWG | APOBEC | 1 | get_pcawg_APOBEC_mutagenesis_value() |
4.2.1 get_pcawg_gene_value()
get_pcawg_gene_value(identifier)
## value
## SP89389 1.798
## SP21193 6.542
## SP13206 4.690
## SP103623 4.143
## SP47089 4.846
## SP32742 5.010
Information of available datasets:
Xena Hub | Xena Datasets | Sample Size | Unit |
---|---|---|---|
pcawgHub | tophat_star_fpkm_uq.v2_aliquot_gl.sp.log | 1521 | log2(fpkm-uq+0.001) |
4.2.2 get_pcawg_fusion_value()
get_pcawg_fusion_value(identifier)
## value
## SP23639 0
## SP23769 0
## SP23925 0
## SP24129 0
## SP24236 0
## SP24565 0
Information of available datasets:
Xena Hub | Xena Datasets | Sample Size | Unit |
---|---|---|---|
pcawgHub | pcawg3_fusions_PKU_EBI.gene_centric.sp.xena | 1359 | binary fusion call, 1 fusion, 0 otherwise |
4.2.3 get_pcawg_miRNA_value()
get_pcawg_miRNA_value(identifier, norm = c("TMM", "UQ"))
data.list = get_pcawg_miRNA_value("hsa-let-7a-2-3p")
data = data.list$data
head(data.frame(value=data))
## value
## SP1029 2.894
## SP1588 2.286
## SP119599 2.484
## SP1437 1.801
## SP1347 1.529
## SP106899 1.788
Information of available datasets:
Xena Hub | Xena Datasets | Sample Size | Unit |
---|---|---|---|
pcawgHub | x3t2m1.mature.TMM.mirna.matrix.log | 1524 | log2(cpm-TMM+0.1) |
pcawgHub | x3t2m1.mature.UQ.mirna.matrix.log | 1524 | log2(cpm-uq+0.1) |
4.2.4 get_pcawg_promoter_value()
get_pcawg_promoter_value(identifier, type = c("raw", "relative", "outlier"))
## value
## SP23639 34.51
## SP23769 35.16
## SP23925 24.63
## SP24129 44.71
## SP24236 172.40
## SP24565 15.51
Information of available datasets:
Xena Hub | Xena Datasets | Sample Size | Unit |
---|---|---|---|
pcawgHub | rawPromoterActivity.sp | 1359 | raw promoter activity |
pcawgHub | promoterCentricTable_0.2_1.0.sp | 1359 | -1 (low expression), 0 (normal), 1 (high expression) |
pcawgHub | relativePromoterActivity.sp | 1359 | portion of transcription activity of the gene driven by the promoter |
4.2.5 get_pcawg_APOBEC_mutagenesis_value()
get_pcawg_APOBEC_mutagenesis_value(identifier)
data.list = get_pcawg_APOBEC_mutagenesis_value("A3A_or_A3B")
data = data.list$data
head(data.frame(value=data))
## value
## SP117425 0
## SP117332 0
## SP117655 1
## SP99293 1
## SP99329 1
## SP99309 1
Information of available datasets:
Xena Hub | Xena Datasets | Sample Size | Unit |
---|---|---|---|
pcawgHub | MAF_Aug31_2016_sorted_A3A_A3B_comparePlus.sp | 2072 | NA |
4.3 CCLE query
- Search molecular identifiers for CCLE samples via the list ccle_identifier (see Figure 3.3)
Database | Type | Datasets | Function |
---|---|---|---|
CCLE | mRNA | 2 | get_ccle_gene_value() |
CCLE | protein | 1 | get_ccle_protein_value() |
CCLE | mutation | 1 | get_ccle_mutation_status() |
CCLE | cnv | 1 | get_ccle_cn_value() |
4.3.1 get_ccle_gene_value()
get_ccle_gene_value(identifier, norm = c("rpkm", "nc"))
data.list = get_ccle_gene_value("TP53", norm = "rpkm")
data = data.list$expression
head(data.frame(value=data))
## value
## 22RV1_PROSTATE 7.537
## 2313287_STOMACH 45.590
## 253JBV_URINARY_TRACT 28.510
## 253J_URINARY_TRACT 28.040
## 42MGBA_CENTRAL_NERVOUS_SYSTEM 13.920
## 5637_URINARY_TRACT 33.350
Information of available datasets:
Xena Hub | Xena Datasets | Sample Size | Unit |
---|---|---|---|
publicHub | ccle/CCLE_DepMap_18Q2_RNAseq_RPKM_20180502 | 1076 | RPKM |
publicHub | ccle/CCLE_DepMap_18Q2_RNAseq_reads_20180502.log2 | 1076 | log2(count+1) |
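CCLE sample IDs in the output above follow the pattern cell line name, underscore, tissue. The sketch below splits a few of those IDs into the two parts, assuming the cell line name itself never contains an underscore.

```r
# IDs copied from the printed output above.
ids <- c("22RV1_PROSTATE", "2313287_STOMACH", "253JBV_URINARY_TRACT",
         "253J_URINARY_TRACT", "42MGBA_CENTRAL_NERVOUS_SYSTEM")

# Everything before the first underscore is taken as the cell line name.
cell_line <- sub("_.*$", "", ids)

# Everything after the first underscore is taken as the tissue of origin.
tissue <- sub("^[^_]*_", "", ids)

annotation <- data.frame(id = ids, cell_line = cell_line, tissue = tissue)
print(annotation)

# Number of cell lines per tissue.
print(table(annotation$tissue))
```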
4.3.2 get_ccle_protein_value()
get_ccle_protein_value(identifier)
data.list = get_ccle_protein_value("14-3-3_beta")
data = data.list$expression
head(data.frame(value=data))
## value
## DMS53_LUNG -0.10490
## SW1116_LARGE_INTESTINE 0.35850
## NCIH1694_LUNG 0.02874
## P3HR1_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE 0.12000
## HUT78_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE -0.26900
## UMUC3_URINARY_TRACT -0.17120
Information of available datasets:
Xena Hub | Xena Datasets | Sample Size | Unit |
---|---|---|---|
publicHub | ccle/CCLE_RPPA_20180123 | 899 | NA |
4.3.3 get_ccle_mutation_status()
get_ccle_mutation_status(identifier)
data = get_ccle_mutation_status("TP53")
data = data[data$genes=="TP53",c("sampleID", "genes")]
head(na.omit(data))
## # A tibble: 6 × 2
## sampleID genes
## <chr> <chr>
## 1 22RV1_PROSTATE TP53
## 2 22RV1_PROSTATE TP53
## 3 A253_SALIVARY_GLAND TP53
## 4 A431_SKIN TP53
## 5 A4FUK_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE TP53
## 6 A673_BONE TP53
Information of available datasets:
Xena Hub | Xena Datasets | Sample Size | Unit |
---|---|---|---|
publicHub | ccle/CCLE_DepMap_18Q2_maf_20180502 | 1549 | NA |
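Unlike the TCGA function, the CCLE result above is a long table with one row per mutation, so a cell line can appear more than once (22RV1_PROSTATE has two rows). A sketch of collapsing it into per-line counts and a 0/1 status vector follows; the table is rebuilt from the printed rows, and "HYPOTHETICAL_WT_LINE" is an invented wild-type cell line added for illustration.

```r
# Long-format mutation table rebuilt from the printed output above.
mut <- data.frame(
  sampleID = c("22RV1_PROSTATE", "22RV1_PROSTATE", "A253_SALIVARY_GLAND",
               "A431_SKIN", "A4FUK_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE",
               "A673_BONE"),
  genes = "TP53",
  stringsAsFactors = FALSE
)

# Number of recorded mutations per cell line.
n_per_line <- table(mut$sampleID)
print(n_per_line)

# Collapse to a binary status over a set of cell lines of interest.
mutated_lines <- unique(mut$sampleID)
all_lines <- c(mutated_lines, "HYPOTHETICAL_WT_LINE")
status <- as.integer(all_lines %in% mutated_lines)
names(status) <- all_lines
print(status)
```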
4.3.4 get_ccle_cn_value()
get_ccle_cn_value(identifier)
## value
## LOUNH91_LUNG -0.0709
## T98G_CENTRAL_NERVOUS_SYSTEM 0.2473
## IPC298_SKIN -0.7917
## RPMI8226_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE -0.5341
## MIAPACA2_PANCREAS 0.1259
## HS695T_SKIN 0.2196
Information of available datasets:
Xena Hub | Xena Datasets | Sample Size | Unit |
---|---|---|---|
publicHub | ccle/CCLE_copynumber_byGene_2013-12-03 | 1043 | log(copy number/2) |
4.4 General query
4.4.1 query_pancan_value()
A function that integrates all of the above functions for quick querying of TPC (TCGA, PCAWG, CCLE) molecular data
query_pancan_value(
molecule,
data_type = c("mRNA", "transcript", "protein", "mutation", "cnv", "methylation",
"miRNA", "fusion", "promoter", "APOBEC"),
database = c("toil", "ccle", "pcawg"),
reset_id = NULL,
opt_pancan = .opt_pancan
)
- Default options stored in .opt_pancan
## $toil_mRNA
## $toil_mRNA$norm
## [1] "tpm"
##
##
## $toil_transcript
## list()
##
## $toil_protein
## list()
##
## $toil_mutation
## list()
##
## $toil_cnv
## $toil_cnv$gistic2
## [1] TRUE
##
## $toil_cnv$use_thresholded_data
## [1] FALSE
##
##
## $toil_methylation
## $toil_methylation$type
## [1] "450K"
##
## $toil_methylation$rule_out
## NULL
##
## $toil_methylation$aggr
## [1] "NA"
##
##
## $toil_miRNA
## list()
##
## $pcawg_mRNA
## list()
##
## $pcawg_fusion
## list()
##
## $pcawg_miRNA
## $pcawg_miRNA$norm
## [1] "TMM"
##
##
## $pcawg_promoter
## $pcawg_promoter$type
## [1] "relative"
##
##
## $pcawg_APOBEC
## list()
##
## $ccle_mRNA
## $ccle_mRNA$norm
## [1] "rpkm"
##
##
## $ccle_protein
## list()
##
## $ccle_mutation
## list()
##
## $ccle_cnv
## list()
- Single molecule query with modified opt_pancan
opt_pancan = .opt_pancan
opt_pancan$toil_mRNA$norm = "nc"
data.list = query_pancan_value(
molecule = "TP53",
data_type = "mRNA",
database = "toil",
opt_pancan = opt_pancan
)
data = data.list$expression
head(data.frame(value=data))
## value
## GTEX-S4Q7-0003-SM-3NM8M 11.130
## TCGA-S9-A7J2-01 11.350
## GTEX-QV31-1626-SM-2S1QC 10.160
## TCGA-G3-A3CH-11 9.632
## GTEX-13OVI-1026-SM-5L3EM 9.761
## GTEX-13OW5-0626-SM-5J2N2 9.609
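When several options need to change at once, assigning them one by one becomes verbose. The sketch below uses base R's modifyList() to merge overrides into a nested option list. The defaults object here is a hand-written excerpt that copies a few entries from the printed .opt_pancan structure, so the example runs without the package.

```r
# Excerpt of the default option structure, copied from the printed output above.
defaults <- list(
  toil_mRNA   = list(norm = "tpm"),
  toil_cnv    = list(gistic2 = TRUE, use_thresholded_data = FALSE),
  pcawg_miRNA = list(norm = "TMM")
)

# Only the entries that should differ from the defaults.
overrides <- list(
  toil_mRNA = list(norm = "nc"),
  toil_cnv  = list(use_thresholded_data = TRUE)
)

# modifyList() merges nested lists recursively and leaves other entries intact.
# Note: a NULL value in overrides removes the entry unless keep.null = TRUE.
opt <- utils::modifyList(defaults, overrides)
str(opt)
```

With the package loaded, the same pattern should apply to the real object, for example utils::modifyList(.opt_pancan, overrides), and the result can be passed as the opt_pancan argument.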
- Molecular signature query
# a space must exist in the signature string
signature <- "TP53 + 2*KRAS - 1.3*PTEN"
data.list = query_pancan_value(
molecule = signature,
data_type = "mRNA",
database = "toil",
opt_pancan = opt_pancan
)
data = data.list$value
head(data.frame(value=data))
## value
## GTEX-S4Q7-0003-SM-3NM8M 15.756
## TCGA-S9-A7J2-01 18.465
## GTEX-QV31-1626-SM-2S1QC 15.402
## TCGA-G3-A3CH-11 13.944
## GTEX-13OVI-1026-SM-5L3EM 13.439
## GTEX-13OW5-0626-SM-5J2N2 13.699
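A signature string is an arithmetic expression over gene names, evaluated per sample. The sketch below shows what that means by evaluating the same formula on a small table of made-up values. It illustrates the arithmetic only and makes no claim about how query_pancan_value() parses the string internally.

```r
# Hypothetical per-sample expression values for the three genes.
genes <- data.frame(
  TP53 = c(11.1, 11.4, 10.2),
  KRAS = c(9.0, 9.5, 8.8),
  PTEN = c(10.0, 9.0, 11.0),
  row.names = c("sample_1", "sample_2", "sample_3")
)

signature <- "TP53 + 2*KRAS - 1.3*PTEN"

# Gene names referenced by the formula.
parsed <- parse(text = signature)
needed <- all.vars(parsed)
stopifnot(all(needed %in% colnames(genes)))

# Evaluate the formula with the columns of `genes` as variables.
score <- eval(parsed, envir = genes)
names(score) <- rownames(genes)
print(score)
```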
4.4.2 query_molecule_value()
A function to query molecular data from most matrix datasets in the UCSC Xena repository
- Genomic matrix repository
data_meta = UCSCXenaTools::XenaData
data_meta_gm = subset(data_meta, Type=="genomicMatrix")
# see the 'XenaDatasets' column
head(data_meta_gm[,c("XenaHostNames","XenaCohorts","XenaDatasets","DataSubtype")])
## # A tibble: 6 × 4
## XenaHostNames XenaCohorts XenaDatasets DataSubtype
## <chr> <chr> <chr> <chr>
## 1 publicHub Breast Cancer Cell Lines (Neve 2006) ucsfNeve_pub… gene expre…
## 2 publicHub Glioma (Kotliarov 2006) kotliarov200… copy number
## 3 publicHub Lung Cancer CGH (Weir 2007) weir2007_pub… copy number
## 4 publicHub Cancer Cell Line Encyclopedia (Breast) ccle/CCLE_co… copy number
## 5 publicHub Breast Cancer (Chin 2006) chin2006_pub… gene expre…
## 6 publicHub Breast Cancer (Chin 2006) chin2006_pub… copy number
query_molecule_value(dataset, molecule)
dataset <- "TCGA-BRCA.htseq_fpkm.tsv"
data <- query_molecule_value(dataset, "TP53") # also supports signatures
head(data.frame(value=data))
## value
## TCGA-E9-A1NI-01A 4.854
## TCGA-A1-A0SP-01A 2.554
## TCGA-BH-A1EU-11A 4.515
## TCGA-A8-A06X-01A 3.844
## TCGA-E2-A14T-01A 4.255
## TCGA-AC-A8OS-01A 3.655
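The sample IDs above carry a vial letter (for example -01A), whereas the toil datasets in Section 4.1 use 15-character IDs (for example -01). Before combining results from the two sources, the IDs need to be harmonized. A minimal sketch, in which the toil_ids vector is an illustrative assumption and not actual query output:

```r
# IDs with vial letter, copied from the printed output above.
gdc_ids <- c("TCGA-E9-A1NI-01A", "TCGA-A1-A0SP-01A", "TCGA-BH-A1EU-11A")

# Illustrative 15-character IDs in the style of the toil datasets.
toil_ids <- c("TCGA-E9-A1NI-01", "TCGA-BH-A1EU-11", "TCGA-19-1787-01")

# Truncate to the first 15 characters (project-TSS-participant-sample type).
gdc_short <- substr(gdc_ids, 1, 15)

# Several vials of one sample collapse to the same short ID; check for that.
if (any(duplicated(gdc_short))) {
  warning("Some samples have multiple vials; decide how to resolve them.")
}

# Samples available in both sources.
shared <- intersect(gdc_short, toil_ids)
print(shared)
```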