Chapter 4 Molecular data query

4.1 TCGA query

  • Look up molecular identifiers for TCGA samples in the list pancan_identifier_help (see Figure 3.2)
Table 4.1: Specialized functions to query TCGA molecular data
Database  Type         Datasets  Function
TCGA      mRNA         3         get_pancan_gene_value()
TCGA      transcript   3         get_pancan_transcript_value()
TCGA      protein      1         get_pancan_protein_value()
TCGA      mutation     1         get_pancan_mutation_status()
TCGA      cnv          3         get_pancan_cn_value()
TCGA      methylation  2         get_pancan_methylation_value()
TCGA      miRNA        1         get_pancan_miRNA_value()

4.1.1 get_pancan_gene_value()

  • get_pancan_gene_value(identifier, norm = c("tpm", "fpkm", "nc"))
data.list = get_pancan_gene_value("TP53", norm = "tpm")
data = data.list$expression
head(data.frame(value=data))
##                         value
## GTEX-S4Q7-0003-SM-3NM8M 4.785
## TCGA-19-1787-01         5.887
## TCGA-S9-A7J2-01         5.517
## GTEX-QV31-1626-SM-2S1QC 4.431
## TCGA-G3-A3CH-11         2.382
## TCGA-B5-A5OE-01         5.765

Information of available datasets:

Xena Hub  Xena Datasets                        Sample Size  Unit
toilHub   TcgaTargetGtex_rsem_gene_tpm         19131        log2(tpm+0.001)
toilHub   TcgaTargetGtex_rsem_gene_fpkm        19131        log2(fpkm+0.001)
toilHub   TcgaTargetGtex_RSEM_Hugo_norm_count  19120        log2(norm_count+1)
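
Because the toil datasets mix TCGA, TARGET and GTEx samples, the returned vector usually has to be split by sample source before any comparison. The sketch below does this from the sample IDs alone, using the values printed above. classify_sample() is a hypothetical helper, not part of the package, and it relies on the TCGA barcode convention that sample-type codes 01-09 mark tumors and 10-19 mark normals.

```r
# Expression values copied from the TP53 example output above.
expr <- c(
  "GTEX-S4Q7-0003-SM-3NM8M" = 4.785,
  "TCGA-19-1787-01"         = 5.887,
  "TCGA-S9-A7J2-01"         = 5.517,
  "GTEX-QV31-1626-SM-2S1QC" = 4.431,
  "TCGA-G3-A3CH-11"         = 2.382,
  "TCGA-B5-A5OE-01"         = 5.765
)

# Hypothetical helper (not part of UCSCXenaShiny): label each sample ID.
classify_sample <- function(ids) {
  type <- rep("other", length(ids))
  type[startsWith(ids, "GTEX")] <- "GTEx normal"
  is_tcga <- startsWith(ids, "TCGA")
  # Characters 14-15 of a TCGA barcode hold the sample-type code.
  code <- suppressWarnings(as.integer(substr(ids, 14, 15)))
  type[which(is_tcga & code < 10)] <- "TCGA tumor"
  type[which(is_tcga & code >= 10 & code < 20)] <- "TCGA normal"
  type
}

groups <- classify_sample(names(expr))
group_means <- tapply(expr, groups, mean)
print(group_means)
```

Any ID that matches neither pattern (for example a TARGET sample) falls into "other" rather than being guessed.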

4.1.2 get_pancan_transcript_value()

  • get_pancan_transcript_value(identifier, norm = c("tpm", "fpkm", "nc"))
data.list = get_pancan_transcript_value("ENST00000456328", norm = "tpm")
data = data.list$expression
head(data.frame(value=data))
##                           value
## GTEX-S4Q7-0003-SM-3NM8M  -5.012
## TCGA-19-1787-01          -9.966
## TCGA-S9-A7J2-01          -4.035
## GTEX-QV31-1626-SM-2S1QC  -9.966
## TCGA-G3-A3CH-11          -9.966
## GTEX-13OVI-1026-SM-5L3EM -9.966

Information of available datasets:

Xena Hub  Xena Datasets                     Sample Size  Unit
toilHub   TcgaTargetGtex_rsem_isoform_tpm   19131        log2(tpm+0.001)
toilHub   TcgaTargetGtex_RSEM_isoform_fpkm  19129        log2(fpkm+0.001)
toilHub   TcgaTargetGtex_rsem_isopct        19131        IsoPct
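
The repeated value -9.966 in the output above follows directly from the unit log2(tpm+0.001): it is the value obtained when TPM is zero. A short check, using three of the printed values and inverting the transform stated in the Unit column:

```r
# Values copied from the transcript example output above (unit: log2(tpm + 0.001)).
log_tpm <- c(
  "GTEX-S4Q7-0003-SM-3NM8M" = -5.012,
  "TCGA-19-1787-01"         = -9.966,
  "TCGA-S9-A7J2-01"         = -4.035
)

# Invert the transform to recover TPM.
tpm <- 2^log_tpm - 0.001

# The lowest possible value on this scale corresponds to TPM = 0.
floor_value <- log2(0 + 0.001)

print(round(floor_value, 3))
print(round(tpm, 4))
```

Samples sitting at this floor therefore have no detected expression of the transcript, which matters when filtering or when averaging on the log scale.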

4.1.3 get_pancan_protein_value()

  • get_pancan_protein_value(identifier)
data.list = get_pancan_protein_value("ACC_pS79")
data = data.list$expression
head(data.frame(value=data))
##                   value
## TCGA-FI-A2EY-01  2.2170
## TCGA-DF-A2KS-01  0.4139
## TCGA-A5-A1OH-01  0.0000
## TCGA-AX-A2H7-01  0.3248
## TCGA-AX-A2HA-01 -1.2410
## TCGA-A5-A2K4-01 -0.2814

Information of available datasets:

Xena Hub        Xena Datasets                Sample Size  Unit
pancanAtlasHub  TCGA-RPPA-pancan-clean.xena  7744         norm_value

4.1.4 get_pancan_mutation_status()

  • get_pancan_mutation_status(identifier)
data = get_pancan_mutation_status("TP53")
head(data.frame(value=data))
##                 value
## TCGA-02-0003-01     1
## TCGA-02-0033-01     1
## TCGA-02-0047-01     0
## TCGA-02-0055-01     1
## TCGA-02-2470-01     0
## TCGA-02-2483-01     1

Information of available datasets:

Xena Hub        Xena Datasets                         Sample Size  Unit
pancanAtlasHub  mc3.v0.2.8.PUBLIC.nonsilentGene.xena  9104         NA
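
Since the status is returned as a 0/1 vector, a mutation frequency is simply its mean. The sketch below uses the six values printed above and assumes, based on the dataset name (nonsilentGene), that 1 marks a sample with a non-silent mutation in the gene.

```r
# Values copied from the TP53 example output above.
status <- c(
  "TCGA-02-0003-01" = 1,
  "TCGA-02-0033-01" = 1,
  "TCGA-02-0047-01" = 0,
  "TCGA-02-0055-01" = 1,
  "TCGA-02-2470-01" = 0,
  "TCGA-02-2483-01" = 1
)

# Fraction of samples carrying a mutation (assumption: 1 = mutated).
mut_fraction <- mean(status)

# Counts per status, e.g. for a contingency table against another variable.
mut_counts <- table(status)

print(mut_fraction)
print(mut_counts)
```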

4.1.5 get_pancan_cn_value()

  • get_pancan_cn_value(identifier, gistic2 = TRUE, use_thresholded_data = FALSE)
data.list = get_pancan_cn_value("TP53")
data = data.list$data
head(data.frame(value=data))
##                  value
## TCGA-A5-A0GI-01  0.014
## TCGA-S9-A7J2-01  0.068
## TCGA-06-0150-01  0.015
## TCGA-AR-A1AH-01 -0.761
## TCGA-EK-A2RE-01 -0.024
## TCGA-44-6778-01 -0.317

Information of available datasets:

Xena Hub        Xena Datasets                       Sample Size  Unit
tcgaHub         …_Gistic2_all_data_by_genes         10845        Gistic2 copy number
tcgaHub         …_Gistic2_all_thresholded.by_genes  10845        -2,-1,0,1,2: 2 copy del,1 copy del,no change,amplification,high-amplification
pancanAtlasHub  …_SNP_6_whitelisted.gene.xena       10873        log(tumor/normal)
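
The thresholded GISTIC2 dataset encodes copy-number state as integers, so it is often convenient to turn the values into a labeled factor. The sketch below uses the labels from the Unit column above; the calls vector is made up for illustration and does not come from a real query.

```r
# Labels taken from the Unit column of the thresholded dataset.
gistic_labels <- c(
  "2 copy del", "1 copy del", "no change",
  "amplification", "high-amplification"
)

# Hypothetical thresholded calls for five samples (illustration only).
calls <- c(sample_A = -1, sample_B = 0, sample_C = 2, sample_D = 0, sample_E = -2)

# Map the integer codes -2..2 onto their labels.
cn_status <- factor(calls, levels = -2:2, labels = gistic_labels)
names(cn_status) <- names(calls)

print(table(cn_status))
```

Keeping all five levels in the factor means empty categories still show up as zero counts in tables and plots.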

4.1.6 get_pancan_methylation_value()

get_pancan_methylation_value(
  identifier,
  type = c("450K", "27K"),
  rule_out = NULL,
  aggr = c("NA", "mean", "Q0", "Q25", "Q50", "Q75", "Q100")
)
  • rule_out: CpG site(s) of the gene to exclude from the calculation;
  • aggr: the aggregation method used to compute gene-level methylation (default: "NA", which takes the mean).
data.list = get_pancan_methylation_value("TP53")
data = data.list$data
head(data.frame(value=data))
##                   value
## TCGA-S6-A8JX-01 0.07085
## TCGA-SO-A8JP-01 0.08410
## TCGA-YU-A90Q-01 0.08465
## TCGA-2G-AAH8-01 0.09373
## TCGA-2G-AAGY-05 0.09546
## TCGA-XE-AAOL-01 0.09774

Information of available datasets:

Xena Hub  Xena Datasets                  Sample Size  Unit
gdcHub    GDC-PANCAN.methylation450.tsv  9736         beta value
gdcHub    GDC-PANCAN.methylation27.tsv   2595         beta value
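
To make the roles of rule_out and aggr concrete, here is a minimal sketch of gene-level aggregation over CpG probes. It is not the package's implementation: aggregate_cpg() and the beta matrix are hypothetical, and reading "Q0" to "Q100" as quantiles (Q50 being the median) is an assumption that the text above does not confirm.

```r
# Hypothetical beta values: rows are CpG probes of one gene, columns are samples.
beta <- matrix(
  c(0.10, 0.20, 0.60,   # sample_A
    0.05, 0.15, 0.40),  # sample_B
  nrow = 3,
  dimnames = list(c("cg0001", "cg0002", "cg0003"), c("sample_A", "sample_B"))
)

# Hypothetical helper (not part of UCSCXenaShiny).
aggregate_cpg <- function(beta, rule_out = NULL, aggr = "mean") {
  # Drop the excluded probes first.
  keep <- setdiff(rownames(beta), rule_out)
  beta <- beta[keep, , drop = FALSE]
  if (aggr == "mean") {
    return(colMeans(beta, na.rm = TRUE))
  }
  # Assumption: "Q25" means the 25% quantile across probes, and so on.
  prob <- as.numeric(sub("^Q", "", aggr)) / 100
  apply(beta, 2, quantile, probs = prob, na.rm = TRUE, names = FALSE)
}

gene_mean <- aggregate_cpg(beta)
gene_median_filtered <- aggregate_cpg(beta, rule_out = "cg0003", aggr = "Q50")

print(gene_mean)
print(gene_median_filtered)
```

Excluding a probe before aggregating changes the result for every sample, so the same rule_out set should be used throughout an analysis.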

4.1.7 get_pancan_miRNA_value()

  • get_pancan_miRNA_value(identifier)
data.list = get_pancan_miRNA_value("hsa-let-7a-2-3p")
data = data.list$expression
head(data.frame(value=data))
##                 value
## TCGA-C4-A0F6-01  0.99
## TCGA-CU-A0YO-01  1.91
## TCGA-BT-A0S7-01  3.02
## TCGA-CU-A0YR-01  0.85
## TCGA-BL-A0C8-01  0.85
## TCGA-C4-A0F0-01  2.70

Information of available datasets:

Xena Hub        Xena Datasets                                                                 Sample Size  Unit
pancanAtlasHub  pancanMiRs_EBadjOnProtocolPlatformWithoutRepsWithUnCorrectMiRs_08_04_16.xena  10818        log2(norm_value+1)

4.2 PCAWG query

  • Look up molecular identifiers for PCAWG samples in the list pcawg_identifier (see Figure 3.3)
Table 4.2: Specialized functions to query PCAWG molecular data
Database  Type      Datasets  Function
PCAWG     mRNA      1         get_pcawg_gene_value()
PCAWG     fusion    1         get_pcawg_fusion_value()
PCAWG     miRNA     2         get_pcawg_miRNA_value()
PCAWG     promoter  3         get_pcawg_promoter_value()
PCAWG     APOBEC    1         get_pcawg_APOBEC_mutagenesis_value()

4.2.1 get_pcawg_gene_value()

  • get_pcawg_gene_value(identifier)
data.list = get_pcawg_gene_value("TP53")
data = data.list$data
head(data.frame(value=data))
##          value
## SP89389  1.798
## SP21193  6.542
## SP13206  4.690
## SP103623 4.143
## SP47089  4.846
## SP32742  5.010

Information of available datasets:

Xena Hub  Xena Datasets                             Sample Size  Unit
pcawgHub  tophat_star_fpkm_uq.v2_aliquot_gl.sp.log  1521         log2(fpkm-uq+0.001)

4.2.2 get_pcawg_fusion_value()

  • get_pcawg_fusion_value(identifier)
data.list = get_pcawg_fusion_value("SAMD11")
data = data.list$data
head(data.frame(value=data))
##         value
## SP23639     0
## SP23769     0
## SP23925     0
## SP24129     0
## SP24236     0
## SP24565     0

Information of available datasets:

Xena Hub  Xena Datasets                                Sample Size  Unit
pcawgHub  pcawg3_fusions_PKU_EBI.gene_centric.sp.xena  1359         binary fusion call, 1 fusion, 0 otherwise

4.2.3 get_pcawg_miRNA_value()

  • get_pcawg_miRNA_value(identifier, norm = c("TMM", "UQ"))
data.list = get_pcawg_miRNA_value("hsa-let-7a-2-3p")
data = data.list$data
head(data.frame(value=data))
##          value
## SP1029   2.894
## SP1588   2.286
## SP119599 2.484
## SP1437   1.801
## SP1347   1.529
## SP106899 1.788

Information of available datasets:

Xena Hub  Xena Datasets                       Sample Size  Unit
pcawgHub  x3t2m1.mature.TMM.mirna.matrix.log  1524         log2(cpm-TMM+0.1)
pcawgHub  x3t2m1.mature.UQ.mirna.matrix.log   1524         log2(cpm-uq+0.1)

4.2.4 get_pcawg_promoter_value()

  • get_pcawg_promoter_value(identifier, type = c("raw", "relative", "outlier"))
data.list = get_pcawg_promoter_value("prmtr.1")
data = data.list$data
head(data.frame(value=data))
##          value
## SP23639  34.51
## SP23769  35.16
## SP23925  24.63
## SP24129  44.71
## SP24236 172.40
## SP24565  15.51

Information of available datasets:

Xena Hub  Xena Datasets                    Sample Size  Unit
pcawgHub  rawPromoterActivity.sp           1359         raw promoter activity
pcawgHub  promoterCentricTable_0.2_1.0.sp  1359         -1 (low expression), 0 (normal), 1 (high expression)
pcawgHub  relativePromoterActivity.sp      1359         portion of transcription activity of the gene driven by the promoter

4.2.5 get_pcawg_APOBEC_mutagenesis_value()

  • get_pcawg_APOBEC_mutagenesis_value(identifier)
data.list = get_pcawg_APOBEC_mutagenesis_value("A3A_or_A3B")
data = data.list$data
head(data.frame(value=data))
##          value
## SP117425     0
## SP117332     0
## SP117655     1
## SP99293      1
## SP99329      1
## SP99309      1

Information of available datasets:

Xena Hub  Xena Datasets                                 Sample Size  Unit
pcawgHub  MAF_Aug31_2016_sorted_A3A_A3B_comparePlus.sp  2072         NA

4.3 CCLE query

  • Look up molecular identifiers for CCLE samples in the list ccle_identifier (see Figure 3.3)
Table 4.3: Specialized functions to query CCLE molecular data
Database  Type      Datasets  Function
CCLE      mRNA      2         get_ccle_gene_value()
CCLE      protein   1         get_ccle_protein_value()
CCLE      mutation  1         get_ccle_mutation_status()
CCLE      cnv       1         get_ccle_cn_value()

4.3.1 get_ccle_gene_value()

  • get_ccle_gene_value(identifier, norm = c("rpkm", "nc"))
data.list = get_ccle_gene_value("TP53", norm = "rpkm")
data = data.list$expression
head(data.frame(value=data))
##                                value
## 22RV1_PROSTATE                 7.537
## 2313287_STOMACH               45.590
## 253JBV_URINARY_TRACT          28.510
## 253J_URINARY_TRACT            28.040
## 42MGBA_CENTRAL_NERVOUS_SYSTEM 13.920
## 5637_URINARY_TRACT            33.350

Information of available datasets:

Xena Hub   Xena Datasets                                    Sample Size  Unit
publicHub  ccle/CCLE_DepMap_18Q2_RNAseq_RPKM_20180502       1076         RPKM
publicHub  ccle/CCLE_DepMap_18Q2_RNAseq_reads_20180502.log2  1076         log2(count+1)

4.3.2 get_ccle_protein_value()

  • get_ccle_protein_value(identifier)
data.list = get_ccle_protein_value("14-3-3_beta")
data = data.list$expression
head(data.frame(value=data))
##                                             value
## DMS53_LUNG                               -0.10490
## SW1116_LARGE_INTESTINE                    0.35850
## NCIH1694_LUNG                             0.02874
## P3HR1_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE  0.12000
## HUT78_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE -0.26900
## UMUC3_URINARY_TRACT                      -0.17120

Information of available datasets:

Xena Hub   Xena Datasets            Sample Size  Unit
publicHub  ccle/CCLE_RPPA_20180123  899          NA

4.3.3 get_ccle_mutation_status()

  • get_ccle_mutation_status(identifier)
data = get_ccle_mutation_status("TP53")
data = data[data$genes=="TP53",c("sampleID", "genes")]
head(na.omit(data))
## # A tibble: 6 × 2
##   sampleID                                 genes
##   <chr>                                    <chr>
## 1 22RV1_PROSTATE                           TP53 
## 2 22RV1_PROSTATE                           TP53 
## 3 A253_SALIVARY_GLAND                      TP53 
## 4 A431_SKIN                                TP53 
## 5 A4FUK_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE TP53 
## 6 A673_BONE                                TP53

Information of available datasets:

Xena Hub   Xena Datasets                       Sample Size  Unit
publicHub  ccle/CCLE_DepMap_18Q2_maf_20180502  1549         NA
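
Unlike the TCGA function, the CCLE result is a table with one row per mutation record, so a cell line can appear more than once (22RV1_PROSTATE appears twice above). The sketch below collapses such a table into a per-sample 0/1 status. The mut data frame mimics the first rows of the output above, and HYPOTHETICAL_WILDTYPE_LINE is an invented sample name added only to show a 0.

```r
# Rows modeled on the TP53 example output above.
mut <- data.frame(
  sampleID = c("22RV1_PROSTATE", "22RV1_PROSTATE", "A253_SALIVARY_GLAND", "A431_SKIN"),
  genes = "TP53",
  stringsAsFactors = FALSE
)

# Full sample list; the last name is invented for illustration.
all_samples <- c(unique(mut$sampleID), "HYPOTHETICAL_WILDTYPE_LINE")

# 1 if the sample has at least one record for the gene, 0 otherwise.
status <- as.integer(all_samples %in% mut$sampleID)
names(status) <- all_samples

# Number of records per sample, to see which lines carry several mutations.
n_records <- table(mut$sampleID)

print(status)
print(n_records)
```

In practice the full sample list would have to come from the dataset itself, since samples without a record for the gene are absent from the filtered table.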

4.3.4 get_ccle_cn_value()

  • get_ccle_cn_value(identifier)
data.list = get_ccle_cn_value("TP53")
data = data.list$data
head(data.frame(value=data))
##                                               value
## LOUNH91_LUNG                                -0.0709
## T98G_CENTRAL_NERVOUS_SYSTEM                  0.2473
## IPC298_SKIN                                 -0.7917
## RPMI8226_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE -0.5341
## MIAPACA2_PANCREAS                            0.1259
## HS695T_SKIN                                  0.2196

Information of available datasets:

Xena Hub   Xena Datasets                           Sample Size  Unit
publicHub  ccle/CCLE_copynumber_byGene_2013-12-03  1043         log(copy number/2)

4.4 General query

4.4.1 query_pancan_value()

A function that integrates all of the functions above for quick queries of TPC (TCGA, PCAWG, CCLE) molecular data

query_pancan_value(
  molecule,
  data_type = c("mRNA", "transcript", "protein", "mutation", "cnv", "methylation",
    "miRNA", "fusion", "promoter", "APOBEC"),
  database = c("toil", "ccle", "pcawg"),
  reset_id = NULL,
  opt_pancan = .opt_pancan
)
.opt_pancan
## $toil_mRNA
## $toil_mRNA$norm
## [1] "tpm"
## 
## 
## $toil_transcript
## list()
## 
## $toil_protein
## list()
## 
## $toil_mutation
## list()
## 
## $toil_cnv
## $toil_cnv$gistic2
## [1] TRUE
## 
## $toil_cnv$use_thresholded_data
## [1] FALSE
## 
## 
## $toil_methylation
## $toil_methylation$type
## [1] "450K"
## 
## $toil_methylation$rule_out
## NULL
## 
## $toil_methylation$aggr
## [1] "NA"
## 
## 
## $toil_miRNA
## list()
## 
## $pcawg_mRNA
## list()
## 
## $pcawg_fusion
## list()
## 
## $pcawg_miRNA
## $pcawg_miRNA$norm
## [1] "TMM"
## 
## 
## $pcawg_promoter
## $pcawg_promoter$type
## [1] "relative"
## 
## 
## $pcawg_APOBEC
## list()
## 
## $ccle_mRNA
## $ccle_mRNA$norm
## [1] "rpkm"
## 
## 
## $ccle_protein
## list()
## 
## $ccle_mutation
## list()
## 
## $ccle_cnv
## list()
  • Single molecule query with modified opt_pancan
opt_pancan = .opt_pancan
opt_pancan$toil_mRNA$norm = "nc"
data.list = query_pancan_value(
  molecule = "TP53",
  data_type = "mRNA",
  database = "toil",
  opt_pancan = opt_pancan
)
data = data.list$expression
head(data.frame(value=data))
##                           value
## GTEX-S4Q7-0003-SM-3NM8M  11.130
## TCGA-S9-A7J2-01          11.350
## GTEX-QV31-1626-SM-2S1QC  10.160
## TCGA-G3-A3CH-11           9.632
## GTEX-13OVI-1026-SM-5L3EM  9.761
## GTEX-13OW5-0626-SM-5J2N2  9.609
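
The option object printed earlier is an ordinary nested list, so several settings can be changed in one step with utils::modifyList() instead of assigning them one by one. The sketch below works on opt_default, a hand-written stand-in that copies three entries from the printout; with the package loaded, .opt_pancan would presumably be used in its place.

```r
# Stand-in for part of .opt_pancan, copied from the printout above.
opt_default <- list(
  toil_mRNA   = list(norm = "tpm"),
  toil_cnv    = list(gistic2 = TRUE, use_thresholded_data = FALSE),
  pcawg_miRNA = list(norm = "TMM")
)

# modifyList() merges recursively, so untouched entries keep their defaults.
opt_custom <- utils::modifyList(opt_default, list(
  toil_mRNA = list(norm = "nc"),
  toil_cnv  = list(use_thresholded_data = TRUE)
))

str(opt_custom)
```

One caveat: modifyList() removes an entry when the new value is NULL, so options whose default is NULL (such as rule_out) are better left out of the override list.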
  • Molecular signature query
# the signature string must contain at least one space
signature <- "TP53 + 2*KRAS - 1.3*PTEN" 
data.list = query_pancan_value(
  molecule = signature,
  data_type = "mRNA",
  database = "toil",
  opt_pancan = opt_pancan
)
data = data.list$value
head(data.frame(value=data))
##                           value
## GTEX-S4Q7-0003-SM-3NM8M  15.756
## TCGA-S9-A7J2-01          18.465
## GTEX-QV31-1626-SM-2S1QC  15.402
## TCGA-G3-A3CH-11          13.944
## GTEX-13OVI-1026-SM-5L3EM 13.439
## GTEX-13OW5-0626-SM-5J2N2 13.699
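
A signature of this form reads as a per-sample arithmetic combination of the individual gene values. The sketch below shows that arithmetic on made-up numbers; it does not reproduce how the package parses the string internally, and the genes data frame is hypothetical.

```r
# Hypothetical per-gene values for two samples (illustration only).
genes <- data.frame(
  TP53 = c(5.0, 4.0),
  KRAS = c(3.0, 2.5),
  PTEN = c(4.0, 5.0),
  row.names = c("sample_A", "sample_B")
)

signature <- "TP53 + 2*KRAS - 1.3*PTEN"

# Evaluate the formula with the gene columns as variables.
score <- eval(parse(text = signature), envir = genes)
names(score) <- rownames(genes)

print(score)
```

Because the terms are combined on whatever scale the dataset uses (here log-transformed values in the real query), the weights act on log values rather than on raw abundances.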

4.4.2 query_molecule_value()

A function to query molecular data from most genomic matrix datasets in the UCSC Xena repository

  • Genomic matrix repository
data_meta = UCSCXenaTools::XenaData
data_meta_gm = subset(data_meta, Type=="genomicMatrix")
# see the 'XenaDatasets' column
head(data_meta_gm[,c("XenaHostNames","XenaCohorts","XenaDatasets","DataSubtype")])
## # A tibble: 6 × 4
##   XenaHostNames XenaCohorts                            XenaDatasets  DataSubtype
##   <chr>         <chr>                                  <chr>         <chr>      
## 1 publicHub     Breast Cancer Cell Lines (Neve 2006)   ucsfNeve_pub… gene expre…
## 2 publicHub     Glioma (Kotliarov 2006)                kotliarov200… copy number
## 3 publicHub     Lung Cancer CGH (Weir 2007)            weir2007_pub… copy number
## 4 publicHub     Cancer Cell Line Encyclopedia (Breast) ccle/CCLE_co… copy number
## 5 publicHub     Breast Cancer (Chin 2006)              chin2006_pub… gene expre…
## 6 publicHub     Breast Cancer (Chin 2006)              chin2006_pub… copy number
  • query_molecule_value(dataset, molecule)
dataset <- "TCGA-BRCA.htseq_fpkm.tsv"
data <- query_molecule_value(dataset, "TP53") # signatures are also supported
head(data.frame(value=data))
##                  value
## TCGA-E9-A1NI-01A 4.854
## TCGA-A1-A0SP-01A 2.554
## TCGA-BH-A1EU-11A 4.515
## TCGA-A8-A06X-01A 3.844
## TCGA-E2-A14T-01A 4.255
## TCGA-AC-A8OS-01A 3.655