R包DoubletFinder可用于检测基于Droplet单细胞测序技术(10X)的双细胞。如下简单学习识别原理以用法。

  • 原始论文:https://doi.org/10.1016/j.cels.2019.03.003
  • 官方手册:https://github.com/chris-mcginnis-ucsf/DoubletFinder
  • 视频教程:https://www.youtube.com/watch?v=NqvAS4HgmrE
Screenshot_2021-03-30_131342.png
1
2
remotes::install_github('chris-mcginnis-ucsf/DoubletFinder')
library(DoubletFinder)

1、原理简介

  • 如下示意图可分为4个步骤

(1)根据随机两个单细胞的平均表达水平模拟出若干artificial doublet;

(2)从模拟双细胞的角度,计算出这些模拟双细胞的最近邻细胞群;

(3)从真实单细胞数据角度,计算每个细胞与模拟双细胞群的距离"pANN";

(4)最后通过期望的双细胞比例,设置"pANN"阈值,划分出双细胞群。

image-20230128150550760

  • 在上述计算过程中涉及3个重要参数

(1)pN:模拟出的artificial doublet数量。不同取值对识别结果影响不大,默认为0.25。

(2)pK:计算每个双细胞的最近邻居细胞数量,不同取值对识别结果影响很大,需结合数据选最佳值。

(3)nExp:期望的双细胞群比例。如下图所示(10X v3.1),不同测序数据体量的预期占比不同。

image-20230128152528719

例如对于10000个细胞的单细胞表达矩阵, 当pN=0.25、pK=0.1、nExp=0.08,表示如下含义:

(1)随机模拟出3333个双细胞(总共13333);(2)计算每个模拟双细胞的1333个最近邻细胞;

(3)最终期望识别出800个双细胞。

  • 双细胞群根据组成细胞的转录相似性可分为如下两大类

    (1)heterotypic doublets :doublets formed from transcriptionally-distinct cell states – but is insensitive to homotypic doublets

    (2)homotypic doublets :doublets formed from transcriptionally-similar cell states

作者认为上述识别算法适合于检测heterotypic doublets,因此在计算过程中有必要考虑homotypic doublets所占比例的影响。

2、计算过程

  • 示例数据:10X官方提供的PBMC8K测序数据
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
# https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/pbmc8k
untar("pbmc8k_filtered_gene_bc_matrices.tar.gz")
counts = Read10X("filtered_gene_bc_matrices/GRCh38/")
sce = CreateSeuratObject(counts)
dim(sce)
# [1] 33694  8381

## (1) 预处理:质控过滤 + 降维分群
library(Seurat)
library(tidyverse)

sce = PercentageFeatureSet(sce, "^MT-", col.name = "percent_mito")
sce = sce %>% 
  subset(., nFeature_RNA > 500) %>% 
  subset(., nCount_RNA > 800) %>%   
  subset(., percent_mito < 10)
sce = sce %>%
  NormalizeData() %>%
  FindVariableFeatures() %>%
  ScaleData()
sce = sce %>% 
  RunPCA() %>% 
  RunUMAP(dims = 1:30) %>%  #RunTSNE
  FindNeighbors(dims = 1:30) %>% 
  FindClusters(resolution = 0.1)

table(sce$seurat_clusters)
#    0    1    2    3    4    5    6    7 
# 2330 2039 1230 1142  970  339  213  108

DimPlot(sce, reduction = "umap", group.by = "orig.ident")
image-20230128160039740
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
## (2) DoubletFinder分析
library(DoubletFinder)

# 首先获得最佳的pK值
sweep.res.list <- paramSweep_v3(sce, PCs = 1:30, sct = FALSE)
sweep.stats <- summarizeSweep(sweep.res.list, GT = FALSE)
bcmvn <- find.pK(sweep.stats)
pk_best = bcmvn %>% 
  dplyr::arrange(desc(BCmetric)) %>% 
  dplyr::pull(pK) %>% 
  .[1] %>% as.character() %>% as.numeric()
# [1] 0.28
ggplot(bcmvn, aes(x=pK, y=BCmetric, group=1)) + 
  geom_point() +
  geom_line()
image-20230128160711436
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
# 然后估算出双细胞群中,homotypic doublets的比例(optional)
annotations <- sce$seurat_clusters
homotypic.prop <- modelHomotypic(annotations)  
# [1] 0.1928876

# 根据10X V3.1,对于8K+的单细胞数据,期望的双细胞占比为7%左右
nExp_poi <- round(0.07*nrow(sce@meta.data))        #586
nExp_poi.adj <- round(nExp_poi*(1-homotypic.prop)) #473

# 最后根据上述参数(pN=0.25, pK=0.28, nExp=0.07), 识别出双细胞
sce <- doubletFinder_v3(sce, PCs = 1:30, 
                        pN = 0.25, pK = pk_best, nExp = nExp_poi, # nExp = nExp_poi.adj, 
                        reuse.pANN = FALSE, sct = FALSE)
head(sce@meta.data[,c("pANN_0.25_0.28_586",
                      "DF.classifications_0.25_0.28_586")])
#                    pANN_0.25_0.28_586 DF.classifications_0.25_0.28_586
# AAACCTGAGCATCATC-1            0.21824                          Singlet
# AAACCTGAGCTAACTC-1            0.35488                          Singlet
# AAACCTGAGCTAGTGG-1            0.18624                          Singlet
# AAACCTGCACATTAGC-1            0.10816                          Singlet
# AAACCTGCACTGTTAG-1            0.34528                          Singlet
# AAACCTGCATAGTAAG-1            0.35904                          Singlet

DimPlot(sce, reduction = "umap", 
        group.by = "DF.classifications_0.25_0.28_586")

VlnPlot(sce, group.by = "DF.classifications_0.25_0.28_586", 
        features = c("nCount_RNA", "nFeature_RNA"), 
        pt.size = 0, ncol = 2)
image-20230128161303842 image-20230128161330809