Chapter 1 Introduction

1.1 UCSC Xena Datasets

The UCSC Xena platform, developed by the UC Santa Cruz Genomics Institute, is a comprehensive repository that provides thousands of processed omics datasets from large cancer research projects (e.g. TCGA, PCAWG and CCLE) and from individual research groups, opening up many research opportunities.

The hierarchy for storing and querying datasets is as follows:

  • The UCSC Xena repository comprises 11 data hubs from various portals;
  • Each data hub can include many sample cohorts;
  • Each sample cohort can have multiple omics profiles or phenotype data;
  • Each omics profile can have several datasets, owing to different normalization methods or other factors.

Figure 1.1: The hierarchy of datasets in UCSC Xena
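The four-level hierarchy described above (hub, cohort, omics profile, dataset) can be sketched as nested records. This is only an illustrative model, not how UCSC Xena or UCSCXenaShiny stores its metadata; the class names and the example hub, cohort and dataset names are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Profile:
    """One omics profile (or phenotype block) within a cohort."""
    name: str
    datasets: List[str] = field(default_factory=list)


@dataclass
class Cohort:
    """One sample cohort within a data hub."""
    name: str
    profiles: List[Profile] = field(default_factory=list)


@dataclass
class Hub:
    """One data hub of the repository."""
    name: str
    cohorts: List[Cohort] = field(default_factory=list)


def iter_datasets(hubs: List[Hub]) -> Iterator[Tuple[str, str, str, str]]:
    """Walk hub -> cohort -> profile -> dataset, yielding one row per dataset."""
    for hub in hubs:
        for cohort in hub.cohorts:
            for profile in cohort.profiles:
                for dataset in profile.datasets:
                    yield (hub.name, cohort.name, profile.name, dataset)


# Hypothetical names, for illustration only.
example = [
    Hub("exampleHub", [
        Cohort("Example Cohort", [
            # One profile with two datasets that differ by normalization method.
            Profile("gene expression", ["expr_norm_method_A", "expr_norm_method_B"]),
            Profile("phenotype", ["clinical_table"]),
        ]),
    ]),
]

rows = list(iter_datasets(example))
for row in rows:
    print(" > ".join(row))
```

Flattening the hierarchy into one row per dataset, as `iter_datasets` does, gives a table that can be filtered by hub, cohort or profile.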

Note: In UCSCXenaShiny, we group all data (sub)types into four main types: clinicalMatrix, genomicMatrix, genomicSegment and mutationVector, of which genomicMatrix is the key component.


Figure 1.2: The main types and subtypes of datasets

The following figure shows the number of cohorts and datasets in each data hub of UCSC Xena. Although some hubs (such as toilHub and pancanAtlasHub) have relatively few datasets, they focus mainly on pan-cancer integration, which can be more valuable for some research. Here, we briefly introduce each hub.


Figure 1.3: The statistics for cohorts and datasets of each data hub

(1) tcgaHub (TCGA hub)

  • Statistics: 38 cohorts, 715 datasets;
  • Source: TCGA Data Coordinating Center (DCC), Jan 2016
  • Description: The hub is specific to the TCGA project and contains both individual tumor and integrative pan-cancer cohorts.

(2) gdcHub (GDC hub)

  • Statistics: 42 cohorts, 534 datasets;
  • Source: GDC Data Portal (GDC), v18.0, 2019-08-28
  • Description: The hub incorporates the TCGA project, with both individual tumor and integrative pan-cancer cohorts, as well as the TARGET project (childhood cancers).

(3) pancanAtlasHub (Pan-Cancer Atlas Hub)

  • Statistics: 1 cohort, 22 datasets;
  • Source: Pan-Cancer Atlas publications in Cell.
  • Description: The hub collects the curated pan-cancer TCGA data generated by the PanCan Atlas consortium working groups.

(4) toilHub (UCSC Toil RNA-seq Recompute)

(5) publicHub (UCSC Public Hub)

  • Statistics: 37 cohorts, 114 datasets;
  • Source: An extensive collection of public resources.
  • Description: The hub collects omics data for various cancers (or cell lines, e.g. CCLE, CMAP) from other public studies.

(6) icgcHub (ICGC Xena Hub)

  • Statistics: 3 cohorts, 23 datasets;
  • Source: International Cancer Genome Consortium (ICGC).
  • Description: The hub hosts genomics data from ICGC, but its gene/protein expression data cover only the US (TCGA) related projects.

(7) pcawgHub (PCAWG Hub)

(8) atacseqHub (ATAC-seq Hub)

(9) kidfirstHub (Kids First Xena Hub)

(10) treehouseHub

(11) singlecellHub

  • Statistics: 16 cohorts, 71 datasets
  • Source: Human Cell Atlas (HCA)
  • Description: The hub includes several scRNA-seq datasets from HCA, involving cancer or normal tissues of human or mouse origin.
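As a rough illustration of why a hub with few cohorts can still be data-rich, the sketch below computes datasets per cohort from the statistics listed above. Only the six hubs with stated statistics are included; the dictionary layout and the function name are assumptions made for this example.

```python
# (cohorts, datasets) per hub, copied from the hub list above.
# Hubs without listed statistics are omitted.
hub_stats = {
    "tcgaHub": (38, 715),
    "gdcHub": (42, 534),
    "pancanAtlasHub": (1, 22),
    "publicHub": (37, 114),
    "icgcHub": (3, 23),
    "singlecellHub": (16, 71),
}


def datasets_per_cohort(stats):
    """Return the average number of datasets per cohort for each hub."""
    return {hub: datasets / cohorts for hub, (cohorts, datasets) in stats.items()}


ratios = datasets_per_cohort(hub_stats)

# Print hubs from the most to the fewest datasets per cohort.
for hub, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{hub:<16}{ratio:6.1f}")
```

Among the hubs with listed statistics, pancanAtlasHub ranks first with 22 datasets in its single cohort, which is consistent with its focus on pan-cancer integration.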

1.2 TCGA abbreviations

Table 1.1: TCGA abbreviations
No. Abbreviation Name
1 LAML Acute Myeloid Leukemia
2 ACC Adrenocortical carcinoma
3 BLCA Bladder Urothelial Carcinoma
4 LGG Brain Lower Grade Glioma
5 BRCA Breast invasive carcinoma
6 CESC Cervical squamous cell carcinoma and endocervical adenocarcinoma
7 CHOL Cholangiocarcinoma
8 LCML Chronic Myelogenous Leukemia
9 COAD Colon adenocarcinoma
10 CNTL Controls
11 ESCA Esophageal carcinoma
12 FPPP FFPE Pilot Phase II
13 GBM Glioblastoma multiforme
14 HNSC Head and Neck squamous cell carcinoma
15 KICH Kidney Chromophobe
16 KIRC Kidney renal clear cell carcinoma
17 KIRP Kidney renal papillary cell carcinoma
18 LIHC Liver hepatocellular carcinoma
19 LUAD Lung adenocarcinoma
20 LUSC Lung squamous cell carcinoma
21 DLBC Lymphoid Neoplasm Diffuse Large B-cell Lymphoma
22 MESO Mesothelioma
23 MISC Miscellaneous
24 OV Ovarian serous cystadenocarcinoma
25 PAAD Pancreatic adenocarcinoma
26 PCPG Pheochromocytoma and Paraganglioma
27 PRAD Prostate adenocarcinoma
28 READ Rectum adenocarcinoma
29 SARC Sarcoma
30 SKCM Skin Cutaneous Melanoma
31 STAD Stomach adenocarcinoma
32 TGCT Testicular Germ Cell Tumors
33 THYM Thymoma
34 THCA Thyroid carcinoma
35 UCS Uterine Carcinosarcoma
36 UCEC Uterine Corpus Endometrial Carcinoma
37 UVM Uveal Melanoma

1.3 PCAWG abbreviations

Table 1.2: PCAWG abbreviations
No. Abbreviation Name
1 BLCA-US Bladder Urothelial Cancer - TCGA, US
2 BRCA-US Breast Cancer - TCGA, US
3 CESC-US Cervical Squamous Cell Carcinoma - TCGA, US
4 CLLE-ES Chronic Lymphocytic Leukemia - ES
5 COAD-US Colon Adenocarcinoma - TCGA, US
6 DLBC-US Lymphoid Neoplasm Diffuse Large B-cell Lymphoma - TCGA, US
7 ESAD-UK Esophageal Adenocarcinoma - UK
8 GBM-US Brain Glioblastoma Multiforme - TCGA, US
9 HNSC-US Head and Neck Squamous Cell Carcinoma - TCGA, US
10 KICH-US Kidney Chromophobe - TCGA, US
11 KIRC-US Kidney Renal Clear Cell Carcinoma - TCGA, US
12 KIRP-US Kidney Renal Papillary Cell Carcinoma - TCGA, US
13 LAML-US Acute Myeloid Leukemia - TCGA, US
14 LGG-US Brain Lower Grade Glioma - TCGA, US
15 LIHC-US Liver Hepatocellular Carcinoma - TCGA, US
16 LIRI-JP Liver Cancer - RIKEN, JP
17 LUAD-US Lung Adenocarcinoma - TCGA, US
18 LUSC-US Lung Squamous Cell Carcinoma - TCGA, US
19 MALY-DE Malignant Lymphoma - DE
20 OV-AU Ovarian Cancer - AU
21 OV-US Ovarian Serous Cystadenocarcinoma - TCGA, US
22 PACA-AU Pancreatic Cancer - AU
23 PRAD-US Prostate Adenocarcinoma - TCGA, US
24 READ-US Rectum Adenocarcinoma - TCGA, US
25 RECA-EU Renal Cell Cancer - EU/FR
26 SARC-US Sarcoma - TCGA, US
27 SKCM-US Skin Cutaneous Melanoma - TCGA, US
28 STAD-US Gastric Adenocarcinoma - TCGA, US
29 THCA-US Head and Neck Thyroid Carcinoma - TCGA, US
30 UCEC-US Uterine Corpus Endometrial Carcinoma - TCGA, US
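The PCAWG project codes in Table 1.2 follow a `<cancer>-<region>` pattern, and every `-US` project there is labelled "TCGA, US" with a prefix that also appears in Table 1.1. A minimal sketch of splitting such a code and recovering the TCGA abbreviation could look like this. The function names are hypothetical, and the `-US` rule is an assumption drawn only from the two tables, not a documented convention.

```python
def split_pcawg_code(code):
    """Split a PCAWG project code such as 'LIRI-JP' into (cancer, region)."""
    cancer, sep, region = code.rpartition("-")
    if not sep or not cancer or not region:
        raise ValueError(f"not a PCAWG-style project code: {code!r}")
    return cancer, region


def tcga_abbreviation(code):
    """Return the TCGA abbreviation for a '-US' project, otherwise None.

    Assumption: based on Table 1.2, every '-US' project comes from TCGA
    and its prefix matches an abbreviation in Table 1.1.
    """
    cancer, region = split_pcawg_code(code)
    return cancer if region == "US" else None


for code in ["BRCA-US", "LIRI-JP", "OV-AU"]:
    print(code, split_pcawg_code(code), tcga_abbreviation(code))
```

Non-US projects such as LIRI-JP or OV-AU return `None`, since their prefixes are not guaranteed to match a TCGA abbreviation.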