RNA velocity(RNA速率)分析是基于单细胞转录组测序数据分析细胞发育状态动力学的方法。简单来说,该模型根据测序read片段属于unspliced pre-mrna以及spliced mature mrna的比例推测单细胞的发育轨迹。 高比例的unspliced pre-mrna的占比越高,velocity速率越大,表明在之后阶段中将产生高表达趋势。

如下将主要学习分析流程,算法原理可参考原始论文:

分析流程参考教程:

  • 上游比对:https://velocyto.org/velocyto.py/index.html
  • 下游分析:https://smorabit.github.io/tutorials/8_velocyto/
An external file that holds a picture, illustration, etc. Object name is MSB-17-e10282-g002.jpg

1、上游比对

  • 单细胞测序技术有很多,如下以常用的10X genomics的测序结果为例。
  • 首先根据原始测序结果fastq.gz,经10X官方的cellranger软件进行比对(使用方法参考之前笔记)
    • 根据filtered_feature_bc_matrix 文件夹结果使用Seurat包进行单细胞常规分析;
    • 根据比对的bam文件,使用velocyto软件分别识别出比对到unspliced与spliced mRNA。

示例数据是GSE178911中的GSM5400792样本,为人体急性髓系白血病(AML)骨髓细胞。已经过cellranger比对,结果保存在201008_D0_scRNA_fastq文件夹。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
tree -L 1 ./201008_D0_scRNA_fastq/outs/
# ./201008_D0_scRNA_fastq/outs/
# ├── cellsorted_possorted_genome_bam.bam
# ├── filtered_feature_bc_matrix
# ├── filtered_feature_bc_matrix.h5
# ├── filtered_feature_bc_matrix.tar.gz
# ├── metrics_summary.csv
# ├── molecule_info.h5
# ├── possorted_genome_bam.bam
# ├── possorted_genome_bam.bam.bai
# ├── raw_feature_bc_matrix
# ├── raw_feature_bc_matrix.h5
# └── web_summary.html

1.1 velocyto分析

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
# (1) 参考下面链接,建立软件环境
# https://velocyto.org/velocyto.py/install/index.html

# (2) 准备三份数据
## 10X比对结果文件夹 
cellranger_outDir=/path/to/201008_D8Ven_scRNA_fastq
## 10X参考数据集中的gtf文件
cellranger_gtf=/path/to/refdata-gex-GRCh38-2020-A/genes/genes.gtf
## 重复序列注释文件(optiona;)
rmsk_gtf=/path/to/hg38_repeat_rmsk.gtf

rmsk_gtf下载链接:https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=611454127_NtvlaW6xBSIRYJEBI0iRDEWisITa&clade=mammal&org=Human&db=0&hgta_group=allTracks&hgta_track=rmsk&hgta_table=rmsk&hgta_regionType=genome&position=&hgta_outputType=gff&hgta_outFileName=mm10_rmsk.gtf

1
2
3
4
5
6
7
# (3) 开始分析
velocyto run10x -m $rmsk_gtf  $cellranger_outDir $cellranger_gtf

# (4) 分析结果
tree ./201008_D0_scRNA_fastq/velocyto/
# ./201008_D0_scRNA_fastq/velocyto/
# └── 201008_D0_scRNA_fastq.loom

如上,每个样本都会产生一个loom文件结果。

1.2 Seurat分析

  • 使用Seurat包完成基本分析,主要包括UMAP降维与分群两步
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
library(Seurat)
library(tidyverse)

counts = Read10X("filtered_feature_bc_matrix/")
sce = CreateSeuratObject(counts)
dim(sce)
# [1] 36601  6005

sce = sce %>%
  NormalizeData() %>%
  FindVariableFeatures(nfeatures = 2000) %>%
  ScaleData()
sce = sce %>% 
  RunPCA() %>% 
  RunUMAP(dims = 1:30) %>%  #RunTSNE
  FindNeighbors(dims = 1:30) %>% 
  FindClusters(resolution = c(0.5))
table(sce$seurat_clusters)
#    0    1    2    3    4    5    6    7    8    9 
# 1250 1077  827  621  599  541  397  386  215   92 

DimPlot(sce)
saveRDS(sce, file = "sce.rds")
image-20230126113814256

2、下游分析

以下在jupyter notebook中执行

2.1 读入数据

  • 参考之前scanpy学习笔记,导入Seurat对象为Anndata
1
2
3
4
5
6
7
8
9
## 准备Python环境
import scanpy as sc
import numpy as np
import anndata2ri
# Activate the anndata2ri conversion between SingleCellExperiment and AnnData
anndata2ri.activate()
#Loading the rpy2 extension enables cell magic to be used
#This runs R code in jupyter notebook cells
%load_ext rpy2.ipython
1
2
3
4
%%R
library(Seurat)
sce = readRDS("./201008_D0_scRNA_fastq/sce.rds")
sce
1
2
3
4
%%R -o adata
#convert the Seurat object to a SingleCellExperiment object
adata = as.SingleCellExperiment(sce)
adata
1
2
3
4
5
adata
# AnnData object with n_obs × n_vars = 6005 × 36601
#     obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'RNA_snn_res.0.5', 'seurat_clusters', 'ident'
#     obsm: 'X_pca', 'X_umap'
#     layers: 'logcounts'
  • 读入velocyto分析的loom文件
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
import scvelo as scv
import scanpy as sc
import numpy as np
import pandas as pd
import anndata as ad

ldata1 = scv.read('./201008_D0_scRNA_fastq/velocyto/201008_D0_scRNA_fastq.loom')
# AnnData object with n_obs × n_vars = 6005 × 36601
#     var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
#     layers: 'matrix', 'ambiguous', 'spliced', 'unspliced'

## 修改细胞名barcode,与Seurat保持一致
barcodes = [bc.split(':')[1] for bc in ldata1.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '-1' for bc in barcodes]
barcodes
# ['AACAAGACATGCGGTC-1',
#  'AACAGGGCACGACCTG-1',
#  'AAAGAACGTTATGTGC-1',
#  'AACAACCAGCAATAAC-1',
#  'AACAAGAGTCTCTCCA-1']
ldata1.obs.index = barcodes
ldata1.var_names_make_unique()
  • 合并上述两个数据
1
2
3
4
5
6
7
8
9
adata = scv.utils.merge(adata, ldata1)
adata
# AnnData object with n_obs × n_vars = 6005 × 36591
#     obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'RNA_snn_res.0.5', 'seurat_clusters', 'ident', 'initial_size_unspliced', 'initial_size_spliced', 'initial_size'
#     var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
#     obsm: 'X_pca', 'X_umap'
#     layers: 'logcounts', 'matrix', 'ambiguous', 'spliced', 'unspliced'
sc.pl.umap(adata, color='seurat_clusters', 
		   legend_loc='on data')
image-20230126125018274

2.2 scvelo速率分析

  • (1)统计每个cluster的spliced与unspliced比例
1
scv.pl.proportions(adata, groupby='seurat_clusters')

image-20230126125232129

  • (2)速率分析
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
scv.pp.filter_and_normalize(adata)
scv.pp.moments(adata)

scv.tl.velocity(adata, mode='stochastic')
scv.tl.velocity_graph(adata)
adata
# AnnData object with n_obs × n_vars = 6005 × 36591
#     obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'RNA_snn_res.0.5', 'seurat_clusters', 'ident', 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts'
#     var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'gene_count_corr', 'velocity_gamma', 'velocity_qreg_ratio', 'velocity_r2', 'velocity_genes'
#     uns: 'seurat_clusters_colors', 'neighbors', 'velocity_params'
#     obsm: 'X_pca', 'X_umap'
#     layers: 'logcounts', 'matrix', 'ambiguous', 'spliced', 'unspliced', 'Ms', 'Mu', 'velocity', 'variance_velocity'
#     obsp: 'distances', 'connectivities'
  • (3)速率分析结果可视化
1
2
3
scv.pl.velocity_embedding_grid(adata, basis='umap', color='seurat_clusters', 
							   scale=0.25)
# 可通过多个参数调整图形(箭头)细节展示
image-20230126130013583
1
2
3
scv.pl.velocity_embedding_stream(adata, basis='umap', 
								 color='seurat_clusters')
# 可通过多个参数调整图形(箭头)细节展示
image-20230126130132801
1
2
3
4
5
6
scv.tl.rank_velocity_genes(adata, groupby='seurat_clusters', min_corr=.3)
df = scv.DataFrame(adata.uns['rank_velocity_genes']['names'])
df.head()

scv.pl.scatter(adata, df['2'][:3],  frameon=True, 
			   color='seurat_clusters', size=10, linewidth=1.5)

image-20230126130540798

image-20230126130631766
1
2
scv.tl.velocity_pseudotime(adata)  # 由0到1
scv.pl.scatter(adata, color='velocity_pseudotime', cmap='gnuplot')
image-20230126131215163
1
2
3
scv.tl.paga(adata, groups='seurat_clusters')
scv.pl.paga(adata, basis='umap', size=50, alpha=.1,
            min_edge_width=2, node_size_scale=1.5)
image-20230126131321685
  • (4)特定细胞群速率分析
1
2
3
cur_celltypes = ["0"]
adata_subset = adata[adata.obs['seurat_clusters'].isin(cur_celltypes)]
sc.pl.umap(adata_subset)
image-20230126131634828
1
2
3
4
5
6
sc.pp.neighbors(adata_subset, n_neighbors=15, use_rep='X_pca')
scv.pp.filter_and_normalize(adata_subset)
scv.pp.moments(adata_subset)
scv.tl.recover_dynamics(adata_subset)

scv.pl.velocity_embedding_stream(adata_subset, basis='umap')
image-20230126132125090
1
2
3
4
df = adata_subset.var
df = df[(df['fit_likelihood'] > .1) & df['velocity_genes'] == True]
top_genes = adata_subset.var['fit_likelihood'].sort_values(ascending=False).index[:300]
scv.pl.scatter(adata_subset,  basis=top_genes[:15], ncols=5, frameon=True)

image-20230126132228007