scVI (single-cell variational inference)是2018年发表的一项单细胞分析工具。它主要基于VAE变分自编码器的思想,计算去除批次效应后的细胞低维空间表示。而后于2022年被整合到单细胞组学分析工具包scvi-tool中。

image-20241103145224602

上图为18年发表文章的Figure,简单理解如下

观测/输入数据

  • Xn,1: Cell N的Gene 1的观测表达值

  • Sn:Cell N的批次标签

潜变量:

  • Zn:Cell N的低维表示空间,即我们最关心的结果。

  • ln:Cell N的library size,可以理解为UMI counts

变分后验(Variational posterior)表示对观测数据 (X,S) 所假设的潜在变量 (Z, l) 的真实后验分布的近似(编码器)。在VAE中,通常将其建模为高斯分布,其参数(均值和方差)由神经网络输出。然后生成模型部分(解码器)负责从潜在空间中生成新的数据样本X,应与真实数据相似。

上述过程的可优化损失主要来自两方面:(1)重构损失,即解码器的输出与输入X的差异;(2)通过KL散度计算变分后验分布p(z|x)与先验分布p(z)的差异,主要确保潜在空间的分布结构合理。

详细解释参考:https://docs.scvi-tools.org/en/stable/user_guide/models/scvi.html

1. 安装

1
2
3
4
5
6
7
8
9
conda create -n scvi python=3.10 -y
conda activate scvi
# conda install scvi-tools -c conda-forge

pip install -U scvi-tools -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install -U scanpy -i https://pypi.tuna.tsinghua.edu.cn/simple

# 在具体分析过程中,还遇到部分缺少依赖包的报错,根据提示安装即可
# 例如:scib-metrics leidenalg pymde scib_metrics

2. 初始设置

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
scvi.settings.seed = 0
print("Last run with scvi-tools version:", scvi.__version__) # 1.2.0

sc.set_figure_params(figsize=(6, 6), frameon=False)
sns.set_theme()
torch.set_float32_matmul_precision("high")
save_dir = "./data/"

%config InlineBackend.print_figure_kwargs={"facecolor": "w"}
%config InlineBackend.figure_format="retina"

3. 读取数据及预处理

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
# adata = scvi.data.heart_cell_atlas_subsampled(save_path=save_dir.name)
adata = sc.read(os.path.join(save_dir, "hca_subsampled_20k.h5ad"))
remove = ["doublets", "NotAssigned"]
keep = [c not in remove for c in adata.obs.cell_type.values]
adata = adata[keep, :].copy()
adata
# AnnData object with n_obs × n_vars = 18641 × 26662
#     obs: 'NRP', 'age_group', 'cell_source', 'cell_type', 'donor', 'gender', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'region', 'sample', 'scrublet_score', 'source', 'type', 'version', 'cell_states', 'Used'
#     var: 'gene_ids-Harvard-Nuclei', 'feature_types-Harvard-Nuclei', 'gene_ids-Sanger-Nuclei', 'feature_types-Sanger-Nuclei', 'gene_ids-Sanger-Cells', 'feature_types-Sanger-Cells', 'gene_ids-Sanger-CD45', 'feature_types-Sanger-CD45', 'n_counts'
#     uns: 'cell_type_colors'
  • 质控/标准化/相关备份
1
2
3
4
5
6
7
sc.pp.filter_genes(adata, min_counts=3)

adata.layers["counts"] = adata.X.copy()  # 保存备份整个对象的状态.X 矩阵的副本
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

adata.raw = adata #保存备份整个对象的状态
  • 鉴定高变基因
1
2
3
4
5
6
7
8
sc.pp.highly_variable_genes(
    adata,
    n_top_genes=1200,
    subset=True,
    layer="counts",
    flavor="seurat_v3",
    batch_key="cell_source",
)

3. 计算潜变量

  • 实例化模型
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    categorical_covariate_keys=["cell_source", "donor"],
    continuous_covariate_keys=["percent_mito", "percent_ribo"],
)

model = scvi.model.SCVI(adata) #可在此调整n_latent等参数
model
# SCVI model with the following parameters: 
# n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, dispersion: gene, gene_likelihood: zinb, 
# latent_distribution: normal.
# Training status: Not Trained
# Model's adata is minified?: False
  • VAE属于自监督学习算法,scVI基于Pytorch实现的神经网络架构,支持GPU加速
1
2
3
4
5
6
model.train()

# model_dir = os.path.join(save_dir, "scvi_model")
# model.save(model_dir, overwrite=True)

# model = scvi.model.SCVI.load(model_dir, adata=adata)
  • 计算潜变量结果
1
2
3
4
5
6
SCVI_LATENT_KEY = "X_scVI"

latent = model.get_latent_representation()
adata.obsm[SCVI_LATENT_KEY] = latent
latent.shape
# (18641, 10)
  • 计算 normalized (decoded) gene expression.
1
2
3
4
adata_subset = adata[adata.obs.cell_type == "Fibroblast"]
latent_subset = model.get_latent_representation(adata_subset) # (2446, 10)
denoised = model.get_normalized_expression(adata_subset, library_size=1e4, ) # reconstructed output
denoised.iloc[:5, :5]

4. 可视化

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
SCVI_NORMALIZED_KEY = "scvi_normalized"
adata.layers[SCVI_NORMALIZED_KEY] = model.get_normalized_expression(library_size=10e4)

sc.pp.neighbors(adata, use_rep=SCVI_LATENT_KEY)

sc.tl.umap(adata, min_dist=0.3)

sc.pl.umap(
    adata,
    color=["donor", "cell_source"],
    ncols=2,
    frameon=False,
)
image-20241103161347836
1
2
3
# neighbors were already computed using scVI
SCVI_CLUSTERS_KEY = "leiden_scVI"
sc.tl.leiden(adata, key_added=SCVI_CLUSTERS_KEY, resolution=0.5)
image-20241103161510785

5. 鉴定差异基因

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
# one vs one
de_df = model.differential_expression(
    groupby="cell_type", group1="Endothelial", group2="Fibroblast"
)
de_df.head()

# one vs rest
de_df = model.differential_expression(
    groupby="cell_type",
)
de_df.head()

6. SCANVI去批次

  • Now, we assume that all of our data is annotated. This can lead to a more accurate integration result when using scANVI
  • 如果已经有注释好的细胞类型,使用scANVI去批次会更有效。
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
adata.obs.cell_type.unique().tolist()
# ['Myeloid',
#  'Ventricular_Cardiomyocyte',
#  'Fibroblast',
#  'Endothelial',
#  'Adipocytes',
#  'Pericytes',
#  'Atrial_Cardiomyocyte',
#  'Smooth_muscle_cells',
#  'Neuronal',
#  'Lymphoid',
#  'Mesothelial']

scanvi_model = scvi.model.SCANVI.from_scvi_model(
    model,
    adata=adata,
    labels_key="cell_type",
    unlabeled_category="Unknown",
)

scanvi_model.train()

SCANVI_LATENT_KEY = "X_scANVI"
adata.obsm[SCANVI_LATENT_KEY] = scanvi_model.get_latent_representation(adata)
adata.obsm[SCANVI_LATENT_KEY].shape
# (18641, 10)
1
2
# 传统PCA
sc.tl.pca(adata)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
from scib_metrics.benchmark import Benchmarker

bm = Benchmarker(
    adata,
    batch_key="cell_source",
    label_key="cell_type",
    embedding_obsm_keys=["X_pca", SCVI_LATENT_KEY, SCANVI_LATENT_KEY],
    n_jobs=-1,
)
bm.benchmark()

bm.plot_results_table(min_max_scale=False)

image-20241103163422118