Single-cell analysis tools: the CELLxGENE database

Website: https://cellxgene.cziscience.com/

API: https://chanzuckerberg.github.io/cellxgene-census/

Schema: https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md

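CELLxGENE is a suite of tools that helps scientists find, download, explore, analyze, annotate, and publish single-cell datasets.

CELLxGENE is developed by the Chan Zuckerberg Initiative (CZI), a philanthropic organization founded in 2015 by Mark Zuckerberg and his wife Priscilla Chan. Its mission is to advance human potential and promote equality by supporting science, education, and community development.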

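At its core is a large integrated collection of single-cell transcriptome datasets with the following characteristics:

- The species are mainly Homo sapiens and Mus musculus;
- Samples are mainly normal (non-disease) cells, covering most tissues;
- Expression values are raw, unnormalized counts;
- In the latest release at the time of writing (2024-07-01), the totals are about 74M cells (human) and 41M cells (mouse).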

Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable is_primary_data which is described in the Census schema.
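As a minimal sketch of the note above, the flag can be appended to any `value_filter` string so that each cell is returned only once. The helper names `primary_only` and `read_primary_obs` are hypothetical and not part of the Census API; `cellxgene_census` is imported inside the function because actually running the query needs the package and network access.

```python
def primary_only(value_filter: str = "") -> str:
    """Return a value_filter that also requires is_primary_data == True."""
    clause = "is_primary_data == True"
    if not value_filter.strip():
        return clause
    # Parentheses keep any `or` inside the original filter grouped together.
    return f"({value_filter}) and {clause}"


def read_primary_obs(census_version: str, value_filter: str = ""):
    """Read human cell metadata with duplicate cells filtered out."""
    import cellxgene_census  # lazy import: only needed when the query runs

    with cellxgene_census.open_soma(census_version=census_version) as census:
        obs = census["census_data"]["homo_sapiens"].obs
        table = obs.read(value_filter=primary_only(value_filter)).concat()
        return table.to_pandas()
```

For example, `primary_only("tissue_general == 'prostate gland'")` yields a filter that keeps prostate cells from their primary dataset only.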

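Built on these data, CELLxGENE offers two main functions: Download and Explore. The rest of this post focuses on how to obtain data.

Method 1 (see the figure above): on the Datasets page of the website, use the available filters to find the target dataset, then click the download button. Two formats are offered: h5ad (AnnData v0.8), for analysis with scanpy in Python, and rds (Seurat v5), for analysis with Seurat in R.

Method 2: use the API provided by the site for customized downloads. The notes below cover the basics of the Python client.

Put simply, the Census can be thought of as CELLxGENE's integration of all its cell data. We can select the single-cell data we want by slicing on cell or gene metadata conditions.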
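To make the idea of a slice concrete, here is a small sketch that assembles a `value_filter` string from keyword arguments. `build_value_filter` is a hypothetical helper written for this post, not a Census function, and it only covers equality and membership tests.

```python
def build_value_filter(**conditions) -> str:
    """Turn column=value pairs into a Census value_filter string.

    A list or tuple becomes an `in [...]` test; anything else becomes `==`.
    """
    parts = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            parts.append(f"{column} in {list(value)!r}")
        else:
            parts.append(f"{column} == {value!r}")
    return " and ".join(parts)


example = build_value_filter(sex="female", cell_type=["microglial cell", "neuron"])
print(example)
# sex == 'female' and cell_type in ['microglial cell', 'neuron']
```

The resulting string can be passed as `value_filter` to `obs.read()` or as `obs_value_filter` to `get_anndata()`.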

# pip install -U cellxgene-census
import cellxgene_census

List all cell metadata fields (release 2023-05-15)

census = cellxgene_census.open_soma(census_version="2023-05-15")
keys = list(census["census_data"]["homo_sapiens"].obs.keys())
keys
# ['soma_joinid',
#  'dataset_id',
#  'assay',
#  'assay_ontology_term_id',
#  'cell_type',
#  'cell_type_ontology_term_id',
#  'development_stage',
#  'development_stage_ontology_term_id',
#  'disease',
#  'disease_ontology_term_id',
#  'donor_id',
#  'is_primary_data',
#  'self_reported_ethnicity',
#  'self_reported_ethnicity_ontology_term_id',
#  'sex',
#  'sex_ontology_term_id',
#  'suspension_type',
#  'tissue',
#  'tissue_ontology_term_id',
#  'tissue_general',
#  'tissue_general_ontology_term_id']

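Fetch the metadata of every cell in this release (filtering on specific conditions is recommended; the full table is read here only to explore all category values below)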

with cellxgene_census.open_soma(census_version="2023-05-15") as census:
    # Reads SOMADataFrame as a slice
    cell_metadata = census["census_data"]["homo_sapiens"].obs.read(
        # value_filter = "sex == 'female' and cell_type in ['microglial cell', 'neuron']",
        # column_names = ["assay", "cell_type", "tissue", "tissue_general", "suspension_type", "disease"]
        # value_filter = "tissue_general == 'prostate gland'"
    )
    # Concatenates results to pyarrow.Table
    cell_metadata = cell_metadata.concat()
    # Converts to pandas.DataFrame
    cell_metadata = cell_metadata.to_pandas()
# 1m31.6s
    
cell_metadata.shape
# (53794728, 21)

cell_metadata['dataset_id'][:200000].drop_duplicates()
# 0         9d8e5dca-03a3-457d-b7fb-844c75735c83
# 72335     a6388a6f-6076-401b-9b30-7d4306a20035
# 103124    842c6f5d-4a94-4eef-8510-8c792d1124bc
# Name: dataset_id, dtype: object

cell_metadata['donor_id'][:10000].drop_duplicates()
# 0       donor-GOLD
# 869     donor-BOAT
# 4369    donor-KEYS
# 8675    donor-PINK
# Name: donor_id, dtype: object

# Tabulate the category distribution of the remaining metadata columns
from pathlib import Path
out_dir = Path('census_20230515')
out_dir.mkdir(exist_ok=True)  # to_csv does not create missing directories
filtered_keys = [item for item in keys if item not in ['soma_joinid','dataset_id','donor_id']]
for item in filtered_keys:
    file_path = out_dir / f'stat_{item}.txt'
    print(file_path)
    cell_metadata[item].value_counts().to_csv(file_path, sep = '\t')
## e.g. stat_sex.txt
# sex	count
# male	28197731
# female	22513226
# unknown	3083771 

For the full results, see: https://github.com/lishensuo/utils/tree/main/CELLxGene/census_20230515
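Loading tens of millions of rows into pandas takes a lot of memory. As an alternative sketch: `obs.read()` yields `pyarrow.Table` chunks, so category counts can be accumulated chunk by chunk without concatenating everything first. `count_values` and `count_column` are hypothetical names used for illustration.

```python
from collections import Counter


def count_values(tables, column):
    """Accumulate value counts for one column over an iterable of Arrow tables."""
    counts = Counter()
    for table in tables:
        counts.update(table.column(column).to_pylist())
    return counts


def count_column(column, census_version="2023-05-15"):
    """Count the categories of a single obs column for all human cells."""
    import cellxgene_census  # lazy import: needs the package and network access

    with cellxgene_census.open_soma(census_version=census_version) as census:
        obs = census["census_data"]["homo_sapiens"].obs
        # Read only the column of interest; chunks are consumed inside `with`.
        return count_values(obs.read(column_names=[column]), column)
```

Calling `count_column("sex")` should give the same totals as `value_counts()` on the full table, while holding only one chunk in memory at a time.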

Fetch the metadata of every cell in the latest release at the time of writing (2024-07-01)

census = cellxgene_census.open_soma(census_version="2024-07-01")
# The "stable" release is currently 2024-07-01. 
# Specify 'census_version="2024-07-01"' in future calls to open_soma() to ensure data consistency.
keys_latest = list(census["census_data"]["homo_sapiens"].obs.keys())
list(set(keys_latest) - set(keys))
# ['raw_mean_nnz',
#  'observation_joinid',
#  'raw_variance_nnz',
#  'raw_sum',
#  'tissue_type',
#  'n_measured_vars', 
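#  'nnz']  # mostly QC-like metrics

# Pin the release explicitly, as the message above recommends
with cellxgene_census.open_soma(census_version="2024-07-01") as census: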
    cell_metadata_latest = census["census_data"]["homo_sapiens"].obs.read()
    cell_metadata_latest = cell_metadata_latest.concat()
    cell_metadata_latest = cell_metadata_latest.to_pandas()
#2m30s

cell_metadata_latest.shape
# (74322510, 28)

# `keys` is the 2023-05-15 column list, so the columns added in 2024-07-01 are skipped here
out_dir = Path('census_20240701')
out_dir.mkdir(exist_ok=True)  # to_csv does not create missing directories
filtered_keys = [item for item in keys if item not in ['soma_joinid','dataset_id','donor_id']]
for item in filtered_keys:
    file_path = out_dir / f'stat_{item}.txt'
    print(file_path)
    cell_metadata_latest[item].value_counts().to_csv(file_path, sep = '\t')

For the full results, see: https://github.com/lishensuo/utils/tree/main/CELLxGene/census_20240701
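The release tags used so far ("2023-05-15", "2024-07-01") do not have to be hard-coded. As a rough sketch, `cellxgene_census.get_census_version_directory()` returns a dictionary keyed by release tag, which also contains aliases such as "stable". The helper `dated_releases` below is a hypothetical name; it simply keeps the tags that parse as dates.

```python
from datetime import date


def dated_releases(directory):
    """Return the date-like release tags of a version directory, newest first."""
    tags = []
    for tag in directory:
        try:
            date.fromisoformat(tag)
        except ValueError:
            continue  # skip aliases such as "stable" or "latest"
        tags.append(tag)
    return sorted(tags, reverse=True)


def list_census_releases():
    """Query the Census release directory (needs network access)."""
    import cellxgene_census  # lazy import: only needed for the real lookup

    return dated_releases(cellxgene_census.get_census_version_directory())
```

Pinning one of these tags in `open_soma(census_version=...)` keeps results reproducible when a newer release becomes the default.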

Next, a quick look at the metadata of prostate gland cells

with cellxgene_census.open_soma(census_version="2024-07-01") as census:
    # Reads SOMADataFrame as a slice
    cell_metadata = census["census_data"]["homo_sapiens"].obs.read(
        # value_filter = "sex == 'female' and cell_type in ['microglial cell', 'neuron']",
        # column_names = ["assay", "cell_type", "tissue", "tissue_general", "suspension_type", "disease"]
        value_filter = "tissue_general == 'prostate gland'"
    )
    # Concatenates results to pyarrow.Table
    cell_metadata = cell_metadata.concat()
    # Converts to pandas.DataFrame
    cell_metadata = cell_metadata.to_pandas()

cell_metadata.shape
# (348664, 28)

cell_metadata['disease'].value_counts()[lambda x: x > 0]
# disease
# normal                          227680
# benign prostatic hyperplasia    120984
# Name: count, dtype: int64

cell_metadata['cell_type'].value_counts()[lambda x: x > 0].head(10)
# cell_type
# basal cell of prostate epithelium                                   138418
# luminal cell of prostate epithelium                                  29128
# epithelial cell of urethra                                           23634
# epithelial cell                                                      20552
# secretory cell                                                       16662
# prostate gland microvascular endothelial cell                        14230
# smooth muscle cell of prostate                                        8702
# fibroblast of connective tissue of nonglandular part of prostate      7654
# basal epithelial cell of prostatic duct                               7295
# CD1c-positive myeloid dendritic cell                                  7210
# Name: count, dtype: int64

Downloading expression data, using the fibroblasts (669 cells) from the prostate tissue above as an example

with cellxgene_census.open_soma(census_version="2024-07-01") as census:
    adata = cellxgene_census.get_anndata(
        census = census,
        organism = "Homo sapiens",
        # Prostate fibroblasts; all genes and all obs columns are kept (matches the output below)
        obs_value_filter = "tissue_general == 'prostate gland' and cell_type == 'fibroblast'",
        # Optional: restrict genes or obs columns
        # var_value_filter = "feature_id in ['ENSG00000161798', 'ENSG00000188229']",
        # column_names = {"obs": ["assay", "cell_type", "tissue", "tissue_general", "suspension_type", "disease"]},
    )

    print(adata)
# 2m58s
adata
# AnnData object with n_obs × n_vars = 669 × 60530
#     obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 
#          'is_primary_data', 'observation_joinid', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 
#          'tissue_type', 'tissue_general', 'tissue_general_ontology_term_id', 'raw_sum', 'nnz', 'raw_mean_nnz', 'raw_variance_nnz', 'n_measured_vars'
#     var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length', 'nnz', 'n_measured_obs'

adata.var.head()
#   soma_joinid  feature_id       feature_name  feature_length    nnz         n_measured_obs
#   0            ENSG00000000003  TSPAN6        4530              4530448     73855064
#   1            ENSG00000000005  TNMD          1476              236059      61201828
#   2            ENSG00000000419  DPM1          9276              17576462    74159149
#   3            ENSG00000000457  SCYL3         6883              9117322     73988868
#   4            ENSG00000000460  C1orf112      5970              6287794     73636201

adata.layers['counts'] = adata.X.copy() 
adata.layers['counts']
# <Compressed Sparse Row sparse matrix of dtype 'float32'
# 	with 1591668 stored elements and shape (669, 60530)>
adata.write_h5ad("cellxgene_prostate.h5ad")

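Another way to download (the approach used by scGPT)
- First query the soma_joinid values of the target cells
- Then fetch the expression data for those cells directly by ID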

with cellxgene_census.open_soma(census_version="2024-07-01") as census:
    cell_metadata = census["census_data"]["homo_sapiens"].obs.read(
        value_filter = "tissue_general == 'prostate gland' and cell_type == 'fibroblast'"
    )
    cell_metadata = cell_metadata.concat()
    cell_metadata = cell_metadata.to_pandas()

    adata2 = cellxgene_census.get_anndata(
        census = census,
        organism = "Homo sapiens",
        # obs_value_filter = "tissue_general == 'prostate gland' and cell_type == 'fibroblast'"
        obs_coords = list(cell_metadata['soma_joinid'])
    )
# 2m52s
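With a much longer ID list, one option is to split the soma_joinid values into batches and save each batch to its own file. This is only a sketch under that assumption: `batched` and `download_in_batches` are hypothetical helpers, the batch size is arbitrary, and this is not a reproduction of the scGPT code.

```python
def batched(ids, size):
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def download_in_batches(joinids, out_prefix, census_version="2024-07-01", size=100_000):
    """Fetch cells by soma_joinid in batches and write one h5ad file per batch."""
    import cellxgene_census  # lazy import: needs the package and network access

    with cellxgene_census.open_soma(census_version=census_version) as census:
        for i, batch in enumerate(batched(list(joinids), size)):
            adata = cellxgene_census.get_anndata(
                census=census,
                organism="Homo sapiens",
                obs_coords=batch,
            )
            adata.write_h5ad(f"{out_prefix}_{i}.h5ad")
```

Smaller batches mean that a failed request costs less to repeat, and the files already written stay usable.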

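In my experience so far, API download speed is rather unstable and sometimes very slow. If time allows, write a script and let the download run slowly in the background.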
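One possible shape for such a script is a retry wrapper with exponential backoff around the download call. The sketch below is an assumption about how one might do it, not something the Census client provides: `with_retries` and `download_prostate_fibroblasts` are hypothetical names, and the attempt count and delays are arbitrary.

```python
import time


def with_retries(task, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({exc!r}); retrying in {delay:.0f}s")
            sleep(delay)


def download_prostate_fibroblasts(path="cellxgene_prostate.h5ad"):
    """Download the prostate fibroblast slice and save it as h5ad."""
    import cellxgene_census  # lazy import: needs the package and network access

    with cellxgene_census.open_soma(census_version="2024-07-01") as census:
        adata = cellxgene_census.get_anndata(
            census=census,
            organism="Homo sapiens",
            obs_value_filter="tissue_general == 'prostate gland' and cell_type == 'fibroblast'",
        )
    adata.write_h5ad(path)
    return path
```

Saved as a script that calls `with_retries(download_prostate_fibroblasts)`, this can be left running in the background, for example under `nohup` or inside a `tmux` session.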