1、背景知识

(1)两种富集分析

img
  • 基于超几何检验的ORA(over representation analysis)富集分析

① 假设对转录组分组测序的10000个基因表达数据进行差异分析;

② 按照规定cutoff阈值(logFC/Pvalue)筛选得到100个差异表达基因;

③ 对于特定通路,包含500个基因;其中有30个属于差异基因

img

④ 差异基因是否富集到该通路,即表示30/70与470/9430相比,是否具有显著意义。

img

1
2
3
4
# 如上公式:N=10000,M=500,n=100,k=30
d <- data.frame(non_DEG=c(470, 9430), DEG=c(30, 70))
row.names(d) <- c("In set", "not in category")
fisher.test(d, alternative = "less")

img

如上计算结果,p值越小(<0.05),说明得到的差异基因是显著富集到这个通路基因集上的。然后就可以这个基因集的生物学意义讲述后续的故事了。

  • 基于置换检验的GSEA(gene set enrichment analysis)富集分析

① 假设对转录组分组测序的10000个基因表达数据进行差异分析,一般按照差异倍数从高到低进行排序;

② 假设对于包含500个基因的特定通路,遍历上述10000个基因的排序列表;

③ 在遍历过程中,如基因属于通路集则加分(hit),不属于则减分;加减分的权重与差异倍数先关。

④ ES(enrichment score)表示遍历过程中产生的绝对值最大值。ES大于0,表明通路集富集到排序列表的顶部;反之则富集到排序列表的底部。

⑤ 随机选择500个基因若干次,计算ES背景分布,从而计算相应的P值。

Gene set enrichment analysis: A knowledge-based approach for interpreting  genome-wide expression profiles | PNAS

(2)通路基因集

常见的通路基因集包括GO、KEGG、Reactome等。MsigDB数据库则综合收集了多种通路基因集。4

  • GO

    • Gene Ontology 基因本体论
    • http://www.geneontology.org/
    • GO 分别定义了三种类型的基因集:①细胞组成(cellular component,CC);②生物过程(biological process,BP);③分子功能(Molecular Function,MF)。详见http://geneontology.org/docs/ontology-documentation/
    • 举个例子来说:基因A(其蛋白产物是cytochrome c(细胞色素c)),该基因的具有氧化还原活性(molecular function);参与氧化磷酸化生物学过程(molecular function);而发挥作用的位置位于细胞线粒体基质(cellular component)
  • KEGG

    • Kyoto Encyclopedia of Genes and Genomes 京都基因与基因组百科全书
    • https://www.genome.jp/kegg/
    • KEGG是系统分析基因功能,联系基因组信息和功能信息,以探索分子相互作用和反应网络的知识库(人工绘制),其中包含有大量的通路(PATHWAY)图,涉及metabolism, genetic and environmental information processing, cellular processes, organismal systems, human diseases, and drug development七大类。
  • Reactome https://reactome.org/

  • WikiPathway https://www.wikipathways.org/

  • Msigdb

    • Molecular Signatures Database 分子特征数据库
    • https://www.gsea-msigdb.org/gsea/msigdb/index.jsp
    • 截止目前该数据库收集有32284个基因集,可分为9大类 — H: hallmark gene sets,C1: positional gene sets,C2: curated gene sets,C3: regulatory target gene sets,C4: computational gene sets,C5: ontology gene sets,C6: oncogenic signature gene sets,C7: immunologic signature gene sets,C8: cell type signature gene sets

此外还有其它类型通路基因集,例如疾病基因集DO,就不一一列举了。

(3)clusterProfiler包

R包clusterProfiler由南方医科大学余光创团队开发,是一个常用的富集分析工具包,目前已更新到4.x系列版本。该包包含了富集分析相关的方方面面函数功能,从上游分析到下游可视化等一应俱全。后续笔记主要参考官方教程学习该包的基础用法。

image-20230113092610681

2、富集分析

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
## 加载包
library(tidyverse)
library(clusterProfiler)
library(org.Hs.eg.db)

## 示例基因数据
# (1) 降序差异基因列表用于GSEA分析
data(geneList, package="DOSE")
str(geneList)
# Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ...
# - attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ...
gene_list = geneList

# (2) 显著差异基因集用于ORA分析
gene_deg = names(geneList)[abs(geneList) > 2]
head(gene_deg)
# [1] "4312"  "8318"  "10874" "55143" "55388" "991" 

需要使用ENTREZID的基因ID。如果是其它格式,可使用clusterProfiler::bitr()函数或者之前笔记提到的方式进行转换。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
keytypes(org.Hs.eg.db)
# [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
# [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
# [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
# [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
# [21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
# [26] "UNIPROT"     
ids <- bitr(x, fromType="SYMBOL", 
            toType="ENTREZID", 
            OrgDb="org.Hs.eg.db")

(1)GO

  • ORA
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
ego <- enrichGO(gene          = gene_deg,
                OrgDb         = org.Hs.eg.db,
                ont           = "CC", # "BP","MF","ALL"
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05,
                readable      = TRUE)
t(ego@result[1,])
#              GO:0005819                                                                                                                                                 
# ID          "GO:0005819"                                                                                                                                               
# Description "spindle"                                                                                                                                                  
# GeneRatio   "26/201"                                                                                                                                                   
# BgRatio     "426/19869"                                                                                                                                                
# pvalue      "2.149144e-13"                                                                                                                                             
# p.adjust    "6.339976e-11"                                                                                                                                             
# qvalue      "5.610398e-11"                                                                                                                                             
# geneID      "CDCA8/CDC20/KIF23/CENPE/ASPM/DLGAP5/SKA1/NUSAP1/TPX2/TACC3/NEK2/CDK1/MAD2L1/KIF18A/BIRC5/KIF11/TRAT1/TTK/AURKB/PRC1/KIFC1/KIF18B/KIF20A/AURKA/CCNB1/KIF4A"
# Count       "26"  

## *补充:如果输入基因集是SYMBOL格式,可以设置参数keyType = "SYMBOL"直接分析。但只有GO支持。
  • GO 去冗余
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
## (1) GOSemSim包可以计算GO term相似度
library(GOSemSim)
hsGO <- godata('org.Hs.eg.db', ont="MF")

goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Jiang")
go1 = c("GO:0004022","GO:0004024","GO:0004174")
go2 = c("GO:0009055","GO:0005515")
mgoSim(go1, go2, semData=hsGO, measure="Wang", combine=NULL)  # pair
mgoSim(go1, go2, semData=hsGO, measure="Wang", combine="BMA") # whole

# 基因(set)相似度
GOSemSim::geneSim("241", "251", semData=hsGO, measure="Wang", combine="BMA")
mgeneSim(genes=c("835", "5261","241", "994"),
         semData=hsGO, measure="Wang",verbose=FALSE)  # pair-wise

gs1 <- c("835", "5261","241", "994", "514", "533")
gs2 <- c("578","582", "400", "409", "411")
clusterSim(gs1, gs2, semData=hsGO, measure="Wang", combine="BMA")
mclusterSim(list(gs1, gs2, gs3), semData=hsGO, measure="Wang", combine="BMA")


## (2) 富集结果去冗余
dim(ego)
# [1] 23  9
ego_sim <- clusterProfiler::simplify(ego, cutoff=0.7, measure = "Wang",
                                     by="p.adjust", select_fun=min)
dim(ego_sim)
# [1] 8 9
# 同样也支持对GSEA结果的去冗余处理
  • GSEA
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
## GSEA
gseago <- gseGO(geneList     = gene_list,
                OrgDb        = org.Hs.eg.db,
                ont          = "CC",
                minGSSize    = 100,
                maxGSSize    = 500,
                pvalueCutoff = 0.05,
                verbose      = FALSE)
gseago = setReadable(gseago, OrgDb = org.Hs.eg.db)
t(gseago@result[1,])
#                  GO:0000775                                                                                                                                                                                                                  
# ID              "GO:0000775"                                                                                                                                                                                                                
# Description     "chromosome, centromeric region"                                                                                                                                                                                            
# setSize         "188"                                                                                                                                                                                                                       
# enrichmentScore "0.5970689"                                                                                                                                                                                                                 
# NES             "2.676869"                                                                                                                                                                                                                  
# pvalue          "1e-10"                                                                                                                                                                                                                     
# p.adjust        "1.62e-09"                                                                                                                                                                                                                  
# qvalue          "1.021053e-09"                                                                                                                                                                                                              
# rank            "530"                                                                                                                                                                                                                       
# leading_edge    "tags=20%, list=4%, signal=19%"                                                                                                                                                                                             
# core_enrichment "CDCA8/CENPE/NDC80/TOP2A/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/EZH2/TTK/NCAPG/AURKB/CCNB1/KIF2C/PLK1/BUB1B/ZWINT/CENPU/SPC25/SPAG5/DSCC1/CENPI/OIP5/HELLS/NCAPD2/CENPA/BUB1/CENPF/ZWILCH/CEBPB/H2BC11"

(2)KEGG

  • ORA
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
# R.utils::setOption("clusterProfiler.download.method","auto")
ekg <- enrichKEGG(gene         = gene_deg,
                  organism     = 'hsa',
                  pvalueCutoff = 0.05)
ekg <- setReadable(ekg, OrgDb = org.Hs.eg.db, keyType = "ENTREZID")
t(ekg@result[1,])
#              hsa04110                                                        
# ID          "hsa04110"                                                      
# Description "Cell cycle"                                                    
# GeneRatio   "11/94"                                                         
# BgRatio     "127/8218"                                                      
# pvalue      "1.809545e-07"                                                  
# p.adjust    "3.781948e-05"                                                  
# qvalue      "3.695281e-05"                                                  
# geneID      "CDC45/CDC20/CCNB2/CCNA2/CDK1/MAD2L1/TTK/CHEK1/CCNB1/MCM5/PTTG1"
# Count       "11" 
browseKEGG(ekg, 'hsa04110')
library("pathview")
pathview(gene.data  = gene_list,
         pathway.id = "hsa04110",
         species    = "hsa",
         limit      = list(gene=max(abs(geneList)), cpd=1))

当出现报错类似于No gene can be mapped when using enrichKEGG,且检查输入基因无误后,一个可能的解决方案是自己构建本地KEGG数据库。参考https://github.com/YuLab-SMU/clusterProfiler/issues/561中undo6411解答。或者也可以直接从MsigDB数据库下载,再使用GSEA()函数。

  • GSEA
1
2
3
4
5
6
7
gseakg <- gseKEGG(geneList     = gene_list,
                  organism     = 'hsa',
                  minGSSize    = 120,
                  pvalueCutoff = 0.05,
                  verbose      = FALSE)
gseakg <- setReadable(gseakg, OrgDb = org.Hs.eg.db, keyType = "ENTREZID")
t(gseakg@result[1,])

针对其它通路集的富集分析函数

通路集 ORA函数 GSEA函数
WikiPathway enrichWP gseWP
Reactome enrichPathway gsePathway
DO(disease ontology) enrichDO gseDO
NCG(Network of Cancer Gene) enrichNCG gseNCG
DisGeNET(disease gene network) enrichDGN gseDGN
Mesh(医学主题词) enrichMeSH gseMeSH

(3)自定义基因集

  • clusterProfiler包提供的enricher()GSEA()函数可实现对自定义基因集进行富集分析
  • 主要通过TERM2GENE参数提供基因集数据框,一列名为term代表通路名;一列为gene代表组成基因
  • 如下以MsigDB数据库为例,学习用法。该库的详细介绍参见笔记
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
# install.packages("msigdbr")
library(msigdbr)
hm_df = msigdbr(species = "Homo sapiens") %>% as.data.frame()
head(hm_df)
colnames(hm_df)
# [1] "gs_cat"             "gs_subcat"          "gs_name"            "gene_symbol"       
# [5] "entrez_gene"        "ensembl_gene"       "human_gene_symbol"  "human_entrez_gene" 
# [9] "human_ensembl_gene" "gs_id"              "gs_pmid"            "gs_geoid"          
# [13] "gs_exact_source"    "gs_url"             "gs_description"   
unique(hm_df$gs_cat)
# [1] "C3" "C2" "C8" "C6" "C7" "C4" "C1" "C5" "H" 
unique(hm_df$gs_subcat)
# [1] "MIR:MIR_Legacy"  "TFT:TFT_Legacy"  "CGP"             "TFT:GTRD"       
# [5] ""                "VAX"             "CP:BIOCARTA"     "CGN"            
# [9] "GO:BP"           "GO:CC"           "IMMUNESIGDB"     "GO:MF"          
# [13] "HPO"             "CP:KEGG"         "MIR:MIRDB"       "CM"             
# [17] "CP"              "CP:PID"          "CP:REACTOME"     "CP:WIKIPATHWAYS"

hm_H = msigdbr(species = "Homo sapiens",
        category = "H",
        subcategory = NULL) %>% as.data.frame() %>% 
  dplyr::select(gs_cat, gs_subcat, gs_name, gene_symbol)

hm_H = hm_H %>% 
  dplyr::select(gs_name, gene_symbol) %>% 
  dplyr::rename("term" = "gs_name", "gene" = "gene_symbol")

head(hm_H)
#                    term  gene
# 1 HALLMARK_ADIPOGENESIS ABCA1
# 2 HALLMARK_ADIPOGENESIS ABCB8
# 3 HALLMARK_ADIPOGENESIS ACAA2
# 4 HALLMARK_ADIPOGENESIS ACADL
# 5 HALLMARK_ADIPOGENESIS ACADM
# 6 HALLMARK_ADIPOGENESIS ACADS
  • 富集分析
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
# 注意上述选择的是Symbol格式基因,输入基因也需要与之匹配
res_go <- enricher(gene_deg2, TERM2GENE = hm_H)
dim(res_go)
dim(res_go@result)
res_go_df = res_go@result

res_gsea <- GSEA(gene_list2, TERM2GENE = hm_H)
dim(res_gsea)
dim(res_gsea@result)
res_gsea_df = res_gsea@result

3、可视化

  • 以上面分析的ego与gseago为例
1
2
library(enrichplot)
# 除第一个barplot函数,其余均基于ggplot绘图体系

(1)柱状图/点图

1
2
3
4
5
6
7
8
9
## barplot (R基础函数)
barplot(ego, showCategory=10) 
# 指定GO通路集
pw2show = c("GO:0005819","GO:0072686","GO:0098687","GO:0000779")
barplot(ego, 
        showCategory=ego@result$Description[which(rownames(ego@result) %in% pw2show)]) 
# 针对ont = "ALL"的情况
barplot(ego, split = "ONTOLOGY") +
  facet_grid(ONTOLOGY~., scale = "free")
image-20230114080237928
1
2
3
4
5
dotplot(ego)
## 默认如下对应关系,可设置参数修改这些关系:
# 颜色映射P值
# 大小映射交集基因数(差异基因与通路基因集)
# 横轴表示比例(count/geneset)
image-20230114081856363
1
2
edox <- pairwise_termsim(ego)
treeplot(edox, cluster.params = list(method="ward.D", n = 4, label_words_n = 2))
image-20230617155433789

(2)GSEA打分

  • 单个通路GSEA
1
2
3
4
5
gseaplot(gseago, 
         geneSetID = 1, 
         title = gseago$Description[1])
# by = "preranked"只展示上半部分
# by = "runningScore"只展示下半部分
image-20230114083100951
  • 多条通路结果
1
2
3
gseaplot2(gseago, 
          geneSetID = 1:3, 
          pvalue_table = T)
image-20230114083247332

(3)通路与基因

  • 热图
1
2
3
heatplot(ego, 
         foldChange=gene_list,
         showCategory = 5)
image-20230114083736463
  • 网络图
1
2
3
4
5
6
library(patchwork)
p1 = cnetplot(ego, foldChange=gene_list,showCategory = 3)
p2 = cnetplot(ego, foldChange=gene_list,showCategory = 3, circular = T)
p3 = cnetplot(ego, foldChange=gene_list,showCategory = 3, colorEdge = T, node_label="gene")
p4 = cnetplot(ego, foldChange=gene_list,showCategory = 3, layout = "gem")
(p1 + p2) / (p3 + p4)
image-20230114083957103

(4)通路与通路

1
2
3
4
5
ego_pair <- enrichplot::pairwise_termsim(ego)
emapplot(ego_pair, showCategory = 15, layout="kk", cex_category=1.5,min_edge = 0.8) 

gseago_pair <- enrichplot::pairwise_termsim(gseago)
emapplot(gseago_pair, showCategory = 15, layout="kk", min_edge = 0.1, color = "NES") 
image-20230114084348140

4、多基因集分析

  • clusterProfiler包的compareCluster()函数提供了针对多组实验设计得到的若干差异基因列表的同时富集分析(ORA)结果,及友好的可视化。
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
## (1) 直接多基因集的富集通路比较
data(gcSample)
str(gcSample[5:8]) #取后4个差异基因列表做演示
# List of 4
# $ X5: chr [1:929] "5982" "7318" "6352" "2101" ...
# $ X6: chr [1:585] "5337" "9295" "4035" "811" ...
# $ X7: chr [1:582] "2621" "2665" "5690" "3608" ...
# $ X8: chr [1:237] "2665" "4735" "1327" "3192" ...

## 两个主要参数
# 多个差异基因列表
# 富集分析method (One of "groupGO", "enrichGO", "enrichKEGG", "enrichDO" or "enrichPathway")
ck <- compareCluster(geneCluster = gcSample[5:8], fun = enrichKEGG)
ck <- setReadable(ck, OrgDb = org.Hs.eg.db, keyType="ENTREZID")
table(ck@compareClusterResult$Cluster)
# X5 X6 X7 X8 
# 10  1 15 19

dotplot(ck)
img
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
## (2) 假如分别用两种药物A、B进行干扰细胞系,经1h、3h后得到4组差异基因
group_A_1h=gcSample[[2]]
group_A_3h=gcSample[[3]]
group_B_1h=gcSample[[4]]
group_B_3h=gcSample[[5]]

design=c(rep("A_1h",length(group_A_1h)),
          rep("A_3h",length(group_A_3h)),
          rep("B_1h",length(group_B_1h)),
          rep("B_3h",length(group_B_3h)))
mydf = data.frame(Drug = stringr::str_split(design,"_",simplify = T)[,1],
                  Time = stringr::str_split(design,"_",simplify = T)[,2],
                  DEG = c(group_A_1h,group_A_3h,group_B_1h,group_B_3h))
#   Drug Time   DEG
# 1    A   1h 23450
# 2    A   1h  5160
# 3    A   1h  7126

formula_res <- compareCluster(DEG~Time+Drug, data=mydf, fun="enrichKEGG")
table(formula_res@compareClusterResult$Cluster)
# 1h.A 1h.B 3h.A 3h.B 
#   3   18    3   10

dotplot(formula_res) + 
  dotplot(formula_res, x="Time") + facet_grid(~Drug)
img