R包MARVEL是由牛津大学MRC Weatherall分子医学研究所团队开发的,用于分析单细胞水平的可变剪切事件。相关文章于2023年1月在Nucleic Acids Research期刊发表,在其github页面分享了MARVEL工具的分析流程,在此学习、记录如下。

  • Pape:https://doi.org/10.1093/nar/gkac1260
  • Github:https://github.com/wenweixiong/MARVEL
  • Plate pipeline:https://wenweixiong.github.io/MARVEL_Plate.html
  • Droplet pipeline:https://wenweixiong.github.io/MARVEL_Droplet.html
image-20230226161252918

可变剪切

  • 关于可变剪切(Alternative splicing,AS)的基础,可以参看一篇中文教程;其中关于基因结构组成,可以参考之前的笔记

  • 过去,针对Bulk RNA-seq数据的可变剪切分析工具已经有很多,例如之前学习的IsoformSwitchAnalyzeR包。MARVEL是个人目前了解到的第一个针对单细胞转录数据AS分析工具。

单细胞数据类型

  • AS分析的核心是通过检测比对到splice junction位点的read(例如,跨越两个外显子)进行分析。
  • 目前主流的单细胞测序技术主要分为Droplet-based 10X方法与Plate-based Smart-seq方法。
  • 10X技术由于仅对cDNA的3’端部分进行测序,只能对该区域涉及的splice junction进行分析,且无法具体分析事件类型
  • Smart-seq技术虽然通量低,但是全长转录本的单细胞测序原理,可以进行更为深入的AS分析。

一、Plate-based pipeline

1、上游分析

1.1 工作环境

两个同级的目录(1)basic:参考数据以及代码脚本 (2)proj1:具体的分析项目文件

  • 后续分析均在 proj1/ 路径下操作
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
tree basic/ -L 2
# basic/
# ├── gtf
# │   ├── gencode.v31.annotation.gtf
# │   ├── GRCh38.primary_assembly.genome.fa
# │   ├── Log.out
# │   ├── rsem_index
# │   └── star_index
# ├── rscript_Anno_SJ_rMATs.R
# ├── rscript_bedtools_input.R
# └── single_sample_upstream.sh
tree proj1/ -L 2
# proj1/
# ├── SJ_phenoData.txt
# └── work
#     ├── fq
#     ├── intron
#     ├── merge
#     ├── rmats
#     ├── rsem
#     └── star

1.2 conda环境

1
2
3
4
5
6
7
# 名为marvel_plate的分析环境
conda create -n marvel_plate python=3.7
conda activate marvel_plate
conda install -c hcc aspera-cli
conda install -c bioconda star=2.7.1a rmats=4.1.0 -y 
conda install -c bioconda bedtools samtools rsem -y 
conda install -c r r-base r-essentials r-tidyverse -y

1.2 准备数据

(1)测序数据

  • 官方手册分析了如下数据集的 192个单细胞数据,以其中一个演示流程
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
# https://www.ebi.ac.uk/ena/browser/view/ERR1562083
id=ERR1562083
# id=$1
mkdir -p ./work/fq/${id}
ascp -QT -l 300m -P33001  \
-i ~/miniconda3/envs/marvel_plate/etc/asperaweb_id_dsa.openssh   \
era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/${id:0:6}/00${id:0-1}/${id}/${id}_1.fastq.gz  ./work/fq/${id}

ascp -QT -l 300m -P33001  \
-i ~/miniconda3/envs/marvel_plate/etc/asperaweb_id_dsa.openssh   \
era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/${id:0:6}/00${id:0-1}/${id}/${id}_2.fastq.gz  ./work/fq/${id}

(2)基因组数据

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
## STAR索引
STAR  --runMode genomeGenerate \
--genomeDir star_index \
--genomeFastaFiles GRCh38.primary_assembly.genome.fa \
--sjdbGTFfile gencode.v31.annotation.gtf \
--runThreadN 40

## RSEM索引
rsem-prepare-reference \
--gtf gencode.v31.annotation.gtf \
--star  -p 20 \
GRCh38.primary_assembly.genome.fa \
rsem_index/hg38

(3)教程数据

1
2
ls Data
# GTF  MARVEL  rMATS  R Object  RSEM  SJ

1.3 分析步骤

1.3.1 单样本4步分析

(1)单样本STAR比对

将每个样本的Fastq文件使用STAR软件进行比对,产生每个splice junction count matrix

1
2
# Reference
less -SN ./Data/SJ/SJ.txt
image-20230305130339185
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
mkdir -p ./work/star/${id}
ref_idx_star=../basic/gtf/star_index
ref_gtf=../basic/gtf/gencode.v31.annotation.gtf
echo ${id} is runing......
#1st pass mode
STAR --runThreadN 16 \
--genomeDir ${ref_idx_star} \
--readFilesCommand zcat \
--readFilesIn ./work/fq/${id}/${id}_1.fastq.gz ./work/fq/${id}/${id}_2.fastq.gz \
--outFileNamePrefix ./work/star/${id}/${id}. \
--outSAMtype None
#2nd pass mode
STAR --runThreadN 16 \
--genomeDir ${ref_idx_star} \
--readFilesCommand zcat \
--readFilesIn ./work/fq/${id}/${id}_1.fastq.gz ./work/fq/${id}/${id}_2.fastq.gz \
--outFileNamePrefix ./work/star/${id}/${id}. \
--sjdbFileChrStartEnd ./work/star/${id}/*SJ.out.tab \
--sjdbGTFfile ${ref_gtf} \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM XS \
--quantMode TranscriptomeSAM

(2)单样本rMATs注释

根据上述比对BAM结果,注释出每个基因(转录本)涉及的可变剪切事件类型。使用rMATs软件,可注释出5种类型:SE、RI、MXE、A5SS、A3SS。

1
2
3
4
5
6
7
find ./Data/rMATS/ -name *featureData.txt
# ./Data/rMATS/SE/SE_featureData.txt
# ./Data/rMATS/A3SS/A3SS_featureData.txt
# ./Data/rMATS/MXE/MXE_featureData.txt
# ./Data/rMATS/RI/RI_featureData.txt
# ./Data/rMATS/A5SS/A5SS_featureData.txt
less -SN ./Data/rMATS/SE/SE_featureData.txt

image-20230305131524024

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
mkdir -p ./work/rmats/${id}

echo ${id} is running......
echo "./work/star/${id}/${id}.Aligned.sortedByCoord.out.bam" > ./work/star/${id}/BAM_fls.txt

rmats.py \
--b1 ./work/star/${id}//BAM_fls.txt \
--gtf ${ref_gtf} \
--od ./work/rmats/${id} \
--tmp ./work/rmats/${id}/tmp \
-t paired \
--readLength 125 \
--variable-read-length \
--nthread 16 \
--statoff
# readLength参考如下
# zcat ./work/fq/${id}_1.fastq.gz | awk '{if(NR%4==2) print NR"\t"$0"\t"length($0)}' | less

(3)单样本Intron矩阵

统计Reads对于每个内含子的比对情况,逐碱基累加作为比对结果。

1
2
# reference
less -SN Data/MARVEL/PSI/RI/Counts_by_Region.txt
image-20230305132411262
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
mkdir -p ./work/intron/${id}
echo ${id} is running......
samtools view -H ./work/star/${id}/${id}.Aligned.sortedByCoord.out.bam | \
  grep SQ | cut -f 2 | awk '{ sub(/^SN:/, ""); print;}' > ./work/star/${id}/sorted_chr_in_bam.txt
# 预定义R脚本,生成bedtools计算所需的两个文件:染色体大小以及内含子坐标
# https://github.com/lishensuo/utils/blob/main/marvel/rscript_bedtools_input.R
Rscript ../basic/rscript_bedtools_input.R ${id}

bedtools coverage \
-g ./work/intron/${id}/hg38.chrom.sizes.txt \
-split \
-sorted \
-a ./work/intron/${id}/RI_Coordinates_sorted.bed \
-b ./work/star/${id}/${id}.Aligned.sortedByCoord.out.bam > \
./work/intron/${id}/intron_count.txt \
-d

(4)单样本gene矩阵

最后需要对基因表达水平进行定量。教程推荐使用RSEM软件,并进行TPM类似的标准化处理

1
2
# reference
less -SN ./Data/RSEM/TPM.txt
image-20230305133226554
1
2
3
4
5
6
7
8
mkdir -p ./work/rsem/${id}
rsem-calculate-expression \
--bam \
--paired-end \
-p 16 \
./work/star/${id}/${id}.Aligned.toTranscriptome.out.bam \
../basic/gtf/rsem_index/hg38 \
./work/rsem/${id}/${id}

(1)如上,由于4步的分析过程中产生的中间数据可能超过数百兆;且SMART-seq往往涉及成百上千的细胞。为节约空间,可在每个样本分析完成后,删除不必要的数据。

1
2
3
4
5
rm ./work/fq/${id}/${id}*
rm  ./work/star/${id}/*bam
rm -rf ./work/rmats/${id}/tmp/*
rm ./work/rsem/${id}/${id}.transcript.bam
rm ./work/rsem/${id}/${id}.isoforms.results

(2)多样本批量后台运行方式

1
2
3
4
5
6
7
8
9
cat ../basic/single_sample_upstream.sh
chmod u+x ../basic/single_sample_upstream.sh
# 单样本测试
# ../basic/single_sample_upstream.sh ERR1562273

# parallel多线程批量运行
nohup parallel -j 5 ../basic/single_sample_upstream.sh {} ::: \
  $(grep -v "sample" ./SJ_phenoData.txt | awk  '{print $1}') \
  1> ./work/single_sample_upstream.log 2>&1 &

1.3.2 整合多样本结果

即分步整合上游4个步骤中的每个步骤的所有样本结果;均使用R语言完成

1
2
3
4
library(tidyverse)
library(parallel)
sample_meta = data.table::fread("./SJ_phenoData.txt",data.table=F)
samples=sample_meta$sample.id
  • step1: 汇总剪切位点矩阵
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
sj.list = mclapply(samples, function(sample){   
     sj_sp = data.table::fread(paste0("./work/star/",sample,"/",sample,".SJ.out.tab"), data.table=F)
     if(nrow(sj_sp)==0) return(NULL)
     sj_sp = sj_sp %>% 
       dplyr::mutate(`coord.intron`=paste0(V1,":",V2,":",V3)) %>% 
       dplyr::select(`coord.intron`, V7)
     colnames(sj_sp)[2] = sample
     return(sj_sp)
},mc.cores=10)
names(sj.list) = samples
sj.list = sj.list[!unlist(lapply(sj.list, is.null))]

sj_res = sj.list[[1]]
for(x in sj.list[-1]){
     sj_res = dplyr::full_join(sj_res, x)
}

for(sample in setdiff(samples,names(sj.list))){
     sj_res[,sample]=NA
}
dim(sj_res)

write.table(sj_res, file=paste0("./work/merge/SJ.txt"),
     quote=F, row.names=F, sep="\t")
  • step2:整理AS注释结果
    • 这一步最为复杂,需要理解每一种剪切事件的定义以及正负链的差异
    • 参考官方示意:https://wenweixiong.github.io/Splicing_Nomenclature
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
# https://github.com/lishensuo/utils/blob/main/marvel/rscript_Anno_SJ_rMATs.R
source(paste0("../basic/rscript_Anno_SJ_rMATs.R"))

AS_types = c("SE", "RI", "MXE", "A3SS", "A5SS")

for(AS_type in AS_types){
     # AS_type = AS_types[1]
     print(AS_type)
     fls_AS = list.files(paste0("./work/rmats"), pattern=paste0("fromGTF.",AS_type,".txt"),
          recursive=T, full.name=T)

     if (AS_type=="SE"){
          func = se_func
     } else if (AS_type=="RI"){
          func = ri_func 
     } else if (AS_type=="MXE"){
          func = mxe_func 
     } else if (AS_type=="A3SS"){
          func = a3ss_func 
     } else if (AS_type=="A5SS"){
          func = a5ss_func
     }

     df_AS = mclapply(fls_AS, function(fls){
          rmats_AS = data.table::fread(fls, data.table=F)
          rmats_AS = rmats_AS %>%
               dplyr::mutate(tran_id = func(.)) %>% 
               dplyr::rename(gene_id=GeneID,gene_short_name=geneSymbol) %>% 
               dplyr::select(tran_id, gene_id, gene_short_name)
          return(rmats_AS)
     },mc.cores=10) %>% do.call(rbind, .) %>% 
          dplyr::distinct()
     write.table(df_AS, row.names=F, sep="\t",quote=F,
          file=paste0("./work/merge/as_",AS_type,"_featureData.txt"))
}
  • step3:汇总intron矩阵
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
int_mt.list = mclapply(samples, function(sample){
     # sample = samples[1]
     int_mt = data.table::fread(paste0("./work/intron/",sample,"/intron_count.txt")) 
     int_mt = int_mt %>% 
          dplyr::mutate(`coord.intron`=paste(V1,V2,V3, sep=":")) %>% 
          dplyr::group_by(coord.intron) %>% 
          dplyr::summarize(count=sum(V5)) %>% 
          dplyr::select(coord.intron, count) %>% as.data.frame()
     colnames(int_mt)[2] = sample
     return(int_mt)
},mc.cores=10)

intron_res = int_mt.list[[1]]
for(x in int_mt.list[-1]){
     intron_res = dplyr::full_join(intron_res, x)
}
dim(intron_res)

write.table(intron_res, file=paste0("./work/merge/intron_count_by_region.txt"),
     quote=F, row.names=F, sep="\t")
  • 汇总gene矩阵
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
gene_mt.list = mclapply(samples, function(sample){
     fls=paste0("./work/rsem/",sample,"/",sample,".genes.results")
     gene_mt = data.table::fread(fls, data.table=F) %>% 
          dplyr::select(gene_id, TPM)
     colnames(gene_mt)[2] = sample
     return(gene_mt)
},mc.cores=10)
gene_res = gene_mt.list[[1]]
for(x in gene_mt.list[-1]){
     gene_res = dplyr::full_join(gene_res, x)
}
dim(gene_res)
write.table(gene_res, file=paste0("./work/merge/rsem_tpm.txt"),,
     quote=F, row.names=F, sep="\t")

2、下游分析

后续根据教程提供的数据学习,需要安装的主要的MARVEL包以及其它辅助分析包,具体可参考教程。

1
2
# 建议再单独建立一个conda R环境
install.packages("MARVEL")

2.1 创建marvel对象

需要准备七类文件,读入为R语言对象

  • 样本表型data.frame:第一列名为sample.id
  • 剪切位点计数矩阵:第一列名为coord.intron
  • 剪切事件类型注释列表:由数据框包装而出的list,每个数据库至少包含tran_id与gene_id
  • 内含子计数矩阵:第一列名为coord.intron
  • 基因表达矩阵:经标准化处理(RPKM/FPKM/TPM),暂时不要log转换
  • 基因类型data.frame:至少包含3列,gene_id, gene_short_name, gene_type
  • 上游分析所用到的GTF数据
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
library(tidyverse)
library(MARVEL)

# (1) 细胞表型分组
df.pheno = data.table::fread("./SJ_phenoData.txt",data.table=F)

# (2) Gene文件
df.tpm <- read.table("./work/merge/rsem_tpm.txt", sep="\t", header=TRUE)
dim(df.tpm)


# (3) GTF文件
gtf <- data.table::fread("../basic/gtf/gencode.v31.annotation.gtf", sep="\t", header=FALSE, na.strings="NA",quote="\"") %>% 
     as.data.frame()

# (4) Gene类型
gene.feature = subset(gtf, V3=="gene") 
df.tpm.feature <- str_split(gene.feature$V9, ";", simplify=T) %>% 
     as.data.frame() %>%
     dplyr::mutate(gene_id=str_match(V1, '"(.*)"')[,2]) %>% 
     dplyr::mutate(gene_short_name=str_match(V3, '"(.*)"')[,2]) %>%      
     dplyr::mutate(gene_type=str_match(V2, '"(.*)"')[,2]) %>% 
     dplyr::select(gene_id, gene_short_name, gene_type) %>% 
     dplyr::filter(gene_id %in% df.tpm$gene_id)
df.tpm.feature = df.tpm.feature[match(df.tpm$gene_id, df.tpm.feature$gene_id),]
rownames(df.tpm.feature) = seq(nrow(df.tpm.feature))

# (5) SJ文件
sj = data.table::fread("./work/merge/SJ.txt", sep="\t", header=TRUE, na.strings="NA") %>% 
     as.data.frame()
dim(sj)

# (6) AS文件
df.feature.list = lapply(c("SE", "MXE", "RI", "A5SS", "A3SS"), function(x){
     # x = "SE"
     df.feature = read.table(paste0("./work/merge/as_",x,"_featureData.txt"), 
          sep="\t", header=TRUE, na.strings="NA") %>% 
          dplyr::distinct(tran_id, .keep_all=TRUE) %>% 
          dplyr::left_join(df.tpm.feature)
     return(df.feature)
})
names(df.feature.list) <- c("SE", "MXE", "RI", "A5SS", "A3SS")

# (7) Intron文件
df.intron.counts <- data.table::fread("./work/merge/intron_count_by_region.txt", 
     sep="\t", header=TRUE, na.strings="NA") %>% 
     as.data.frame()
dim(df.intron.counts)
## MARVEL object
marvel <- CreateMarvelObject(SpliceJunction=sj,
                             SplicePheno=df.pheno,
                             SpliceFeature=df.feature.list,
                             IntronCounts=df.intron.counts,
                             GeneFeature=df.tpm.feature,
                             Exp=df.tpm,
                             GTF=gtf
                             )

2.2 预处理步骤

(1)基因表达log转换

1
2
3
4
5
marvel <- TransformExpValues(MarvelObject=marvel,
                             offset=1,
                             transformation="log2",
                             threshold.lower=1
                            )

(2)检测额外AS事件

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
# 如果基因表达水平整体较低,则不考虑该基因的可变剪切事件
# min.expr:判定单个细胞是否表达基因的阈值
# min.cells: 基于上述条件,判定细胞群表达该基因的阈值
marvel <- DetectEvents(MarvelObject=marvel,
                       min.cells=50, #细胞群的表达百分比
                       min.expr=1,   #认为细胞表达该基因的阈值
                       track.progress=FALSE,
                       EventType="AFE"
                       )
marvel <- DetectEvents(MarvelObject=marvel,
                       min.cells=50,
                       min.expr=1,
                       track.progress=FALSE,
                       EventType="ALE"
                       )
length(marvel$SpliceFeature)
# 7

(3)计算PSI值

计算一个基因对于特定剪切事件的PSI值,这是后续分析的基础

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
# CoverageThreshold: 支持该剪切事件的最小reads数,小于该阈值标记为NA
# UnevenCoverageMultiplier: 针对SE与MXE
marvel <- ComputePSI(MarvelObject=marvel,
                     CoverageThreshold=10,
                     UnevenCoverageMultiplier=10,
                     EventType="SE"
                     )

marvel <- ComputePSI(MarvelObject=marvel,
                     CoverageThreshold=10,
                     UnevenCoverageMultiplier=10,
                     EventType="MXE"
                     )

marvel <- ComputePSI(MarvelObject=marvel,
                     CoverageThreshold=10,
                     EventType="RI",
                     thread=4  # only support RI
                     )

marvel <- ComputePSI(MarvelObject=marvel,
                     CoverageThreshold=10,
                     EventType="A5SS"
                     )

marvel <- ComputePSI(MarvelObject=marvel,
                     CoverageThreshold=10,
                     EventType="A3SS"
                     )
 
marvel <- ComputePSI(MarvelObject=marvel,
                     CoverageThreshold=10,
                     EventType="AFE"
                     )
    
marvel <- ComputePSI(MarvelObject=marvel,
                     CoverageThreshold=10,
                     EventType="ALE"
                     )

(4)保存中间结果

  • 首先检查不同矩阵与对应的meta行列名信息一致
1
2
3
4
marvel <- CheckAlignment(MarvelObject=marvel, level="SJ")
marvel <- CheckAlignment(MarvelObject=marvel, level="splicing")
marvel <- CheckAlignment(MarvelObject=marvel, level="gene")
marvel <- CheckAlignment(MarvelObject=marvel, level="splicing and gene")
  • 可进一步选取感兴趣的细胞群
1
2
3
4
5
6
7
index.1 <- which(df.pheno$cell.type %in% c("iPSC", "Endoderm"))
index.2 <- which(df.pheno$qc.seq=="pass")
index <- intersect(index.1, index.2)
sample.ids <- df.pheno[index, "sample.id"]
marvel <- SubsetSamples(MarvelObject=marvel,
                        sample.ids=sample.ids
                        )
  • 保存中间结果,方便后续分析
1
2
3
4
save(marvel, file="./MARVEL2.RData")

# 后续以官方提供的示例数据学习
# load("../Data/MARVEL.RData")

2.3 细胞群AS特征分析

以其中84个iPSC细胞群进行演示

1
2
3
sample.ids = marvel$SplicePheno  %>% 
    dplyr::filter(cell.type=="iPSC") %>% 
    dplyr::pull(sample.id) 

(1)AS类型占比

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
marvel <- CountEvents(MarvelObject=marvel,
                      sample.ids=sample.ids,
                      min.cells=25
                      )
# 对于一个基因至少有25个细胞的AS PSI值为非缺失值,则判定发生了可变剪切事件
marvel$N.Events$Table
#   event_type freq        pct
# 1         SE 5270 40.1523810
# 3         RI 2030 15.4666667
# 6        AFE 1789 13.6304762
# 5       A3SS 1746 13.3028571
# 4       A5SS 1494 11.3828571
# 7        ALE  724  5.5161905
# 2        MXE   72  0.5485714
marvel$N.Events$Plot 
# ggplot绘图体系,下同
image-20230305142515213

(2)AS modality

判断细胞群中每个基因的每种可变剪切的PSI分布类型,如下可分为7类

image-20230305142905888
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
marvel <- AssignModality(MarvelObject=marvel,
                         sample.ids=sample.ids,
                         min.cells=25,
                         seed=1
                         )
marvel$Modality$Results[1:5, c("tran_id", "event_type", "gene_id", "gene_short_name", "modality.bimodal.adj")]
table(marvel$Modality$Results$modality.bimodal.adj, marvel$Modality$Results$event_type)
#                    A3SS A5SS  AFE  ALE  MXE   RI   SE
# Bimodal              20    7    4    6    0    4   17
# Excluded.Dispersed  386  484  538  189   18  783 1260
# Excluded.Primary    289  282  477  101   22  712 1049
# Included.Dispersed  563  385  372  237   12  352 1685
# Included.Primary    439  291  346  176   17   47 1184
# Middle                9   10    9    3    2   51   15
# Multimodal           40   35   43   12    1   81   60
  • 后续有两种统计方式:仅按modality分,同时按modality与AS type分。以后者作为演示
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
marvel <- PropModality(MarvelObject=marvel,
                       modality.column="modality.bimodal.adj",
                       modality.type="extended",
                       event.type=c("SE", "MXE", "RI", "A5SS", "A3SS", "AFE", "ALE"),
                       across.event.type=TRUE,
                       prop.test="chisq",
                       prop.adj="fdr",
                       xlabels.size=8
                       )
marvel$Modality$Prop$BarChart$Stats
head(marvel$Modality$Prop$BarChart$Table)
marvel$Modality$Prop$BarChart$Plot
image-20230305143612067

2.4 细胞群差异分析

(1)差异基因分析

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
cell.group.g1 = marvel$SplicePheno  %>% 
    dplyr::filter(cell.type=="iPSC") %>% 
    dplyr::pull(sample.id)  # (reference)
cell.group.g2 = marvel$SplicePheno  %>% 
    dplyr::filter(cell.type=="Endoderm") %>% 
    dplyr::pull(sample.id)

marvel <- CompareValues(MarvelObject=marvel,
                        cell.group.g1=cell.group.g1,
                        cell.group.g2=cell.group.g2,
                        min.cells=3,
                        method="wilcox",
                        method.adjust="fdr",
                        level="gene",
                        show.progress=FALSE
                        )
marvel$DE$Exp$Table[1:5, ]

(2)差异AS分析

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
# time consuming
marvel <- CompareValues(MarvelObject=marvel,
                        cell.group.g1=cell.group.g1,
                        cell.group.g2=cell.group.g2,
                        min.cells=25,
                        method=c("ad", "dts"),  #同时使用两种的推荐差异方法
                        method.adjust="fdr",
                        level="splicing",
                        event.type=c("SE", "MXE", "RI", "A5SS", "A3SS", "ALE", "AFE"),
                        show.progress=FALSE
                        )
names(marvel$DE$PSI$Table)
head(marvel$DE$PSI$Table[["ad"]])
head(marvel$DE$PSI$Table[["dts"]])

(3)差异基因&AS分析

  • 对差异可变剪切的基因执行基因表达差异分析
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
marvel <- CompareValues(MarvelObject=marvel,
                        cell.group.g1=cell.group.g1,
                        cell.group.g2=cell.group.g2,
                        psi.method=c("ad", "dts"),
                        psi.pval=c(0.10, 0.10),
                        psi.delta=0,
                        method.de.gene="wilcox",
                        method.adjust.de.gene="fdr",
                        downsample=FALSE,
                        show.progress=FALSE,
                        level="gene.spliced"
                        )
head(marvel$DE$Exp.Spliced$Table)

对于上述差异分析结果均可快速火山图,标记marker基因等

2.5 差异AS进阶分析

(1)差异PSI分类

  • Explicit:明显的变化;Implicit:有限的变化;Restricted:微小的变化
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
marvel <- ModalityChange(MarvelObject=marvel,
                     method=c("ad", "dts"),
                     psi.pval=c(0.10, 0.10)
                     )
table(marvel$DE$Modality$Table$modality.change)
# Explicit   Implicit Restricted 
#    161        280       1113
head(marvel$DE$Modality$Table)
marvel$DE$Modality$Plot

cell.group.list <- list("iPSC"=cell.group.g1,
                        "Endoderm"=cell.group.g2
                        )
tran_id <- "chr3:129171771:129171634|129171655:-@chr3:129171446:129171538"
# tran_id <- "chr6:21596763:21598248:+@chr6:21598321:21599378"
# tran_id <- "chr3:109337464:109337652:-@chr3:109333920|109333993:109333870"
marvel <- PlotValues(MarvelObject=marvel,
                     cell.group.list=cell.group.list,
                     feature=tran_id,
                     xlabels.size=5,
                     level="splicing",
                     min.cells=25
                     )
marvel$adhocPlot$PSI 
image-20230305150246079

(2)差异基因与AS

  • Coordinated:基因高表达,相应的PSI也增高;
  • Opposing:基因表达与PSI变化方向相反;
  • Isoform-switching :基因表达不变,PSI发生变化
  • Complex:其它复杂情况
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
marvel <- IsoSwitch(MarvelObject=marvel,
                    method=c("ad", "dts"),
                    psi.pval=c(0.10, 0.10),
                    psi.delta=0,
                    gene.pval=0.10,
                    gene.log2fc=0.5
                    )

marvel$DE$Cor$Plot
head(marvel$DE$Cor$Table_Raw)
table(marvel$DE$Cor$Table$cor)
# Coordinated    Opposing  Iso-Switch     Complex 
#         192         158         331         113

## 以Coordinated的DHX9与相应的AFE剪切事件为例
# gene
df.feature <- marvel$GeneFeature
gene_id <- df.feature[which(df.feature$gene_short_name=="DHX9"), "gene_id"]
marvel <- PlotValues(MarvelObject=marvel,
                   cell.group.list=cell.group.list,
                   feature=gene_id,
                   maintitle="gene_short_name",
                   xlabels.size=7,
                   level="gene"
                   )
plot.1_gene <- marvel$adhocPlot$Exp
# Splicing
tran_id <- "chr1:182839347:182839456|182839734:182839935:+@chr1:182842545:182842677"
marvel <- PlotValues(MarvelObject=marvel,
                   cell.group.list=cell.group.list,
                   feature=tran_id,
                   xlabels.size=7,
                   level="splicing",
                   min.cells=25
                   )
plot.1_splicing <- marvel$adhocPlot$PSI
library(patchwork)
plot.1_gene + plot.1_splicing
image-20230305151250520

(3)NMD预测

  • nonsense-mediated decay (NMD)无义介导的mRNA降解预测
  • 相比对照组,观测组所剪切的外显子可能由于包含premature stop codons (PTCs),导致编码提前结束,从而进一步使得表达量降低。
  • 本质上是对可产生PTC的可变剪切与下调基因取交集。
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
marvel <- ParseGTF(MarvelObject=marvel)
marvel <- FindPTC(MarvelObject=marvel,
                  method=c("ad", "dts"),
                  pval=c(0.10, 0.10),
                  delta=5
                  )

marvel <- PropPTC(MarvelObject=marvel,
                  xlabels.size=8,
                  show.NovelSJ.NoCDS=FALSE, #仅展示Non PTC与PTC
                  prop.test="chisq"
                  )
marvel$NMD$PTC.Prop$Table[1:5, ]

marvel <- CompareExpr(MarvelObject=marvel,
                      xlabels.size=8
                      )
marvel$NMD$NMD.Expr$Plot

marvel <- AnnoVolcanoPlot(MarvelObject=marvel,
                          anno=FALSE,
                          xlabel.size=7
                          )

marvel$NMD$AnnoVolcanoPlot$Plot
marvel$NMD$AnnoVolcanoPlot$Table

image-20230305152637830

二、Drop-baed pipeline

1、上游分析

1.1 conda环境

1
2
3
4
5
conda create -n marvel_drop
conda activate marvel_drop

conda install -c hcc aspera-cli
conda install -c bioconda STAR=2.7.8a samtools

1.2 下载数据

(1)测序数据

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
ascp -QT -l 300m -P33001  \
-i ~/miniconda3/envs/marvel/etc/asperaweb_id_dsa.openssh   \
era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR900/003/SRR9008752/SRR9008752_1.fastq.gz ./work/fq/

ascp -QT -l 300m -P33001  \
-i ~/miniconda3/envs/marvel/etc/asperaweb_id_dsa.openssh   \
era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR900/003/SRR9008753/SRR9008753_2.fastq.gz ./work/fq/

mv ./work/fq/SRR9008753_1.fastq.gz ./work/fq/SRR9008753_S1_L001_R1_001.fastq.gz
mv ./work/fq/SRR9008753_2.fastq.gz ./work/fq/SRR9008753_S1_L001_R2_001.fastq.gz

(2)基因组数据

还是在 ./basic/gtf 文件夹下,构建starsolo版本的索引文件。

cellranger软件及相关数据下载参考之前的笔记。

1
2
3
4
5
6
STAR  \
--runMode genomeGenerate \
--runThreadN 20 \
--genomeDir ./starsolo_index \
--genomeFastaFiles GRCh38.primary_assembly.genome.fa  \
--sjdbGTFfile gencode.v31.annotation.gtf

(3)教程数据

1
2
ls Data/
# Gene_SingCellaR  Gene_STARsolo  GTF  R Object  SJ_STARsolo

1.3 比对分析

(1)cellranger

使用cellranger软件初次比对产生bam文件,该软件用法可参考之前的笔记

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
bin=/home/lishensuo/Softerware/cellranger-7.1.0/bin/cellranger 
ref=/home/lishensuo/Softerware/cellranger-7.1.0/refer/refdata-gex-GRCh38-2020-A
fq_dir=/home/lishensuo/STUDY/marvel/proj2/work/fq
sample=SRR9008752

${bin} count --id=${sample} \
            --fastqs=${fq_dir} \
            --sample=${sample} \
            --localcores=10 \
            --transcriptome=${ref}
            
mkdir -p ./work/cellranger
mv ${sample} ./work/cellranger/${sample}

(2)STARsolo

使用STAR软件针对单细胞数据的模式,分析其基因表达以及剪切位点矩阵

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
# https://kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist-
# Cell Ranger 4 and later: cellranger-x.y.z/lib/python/cellranger/barcodes/
# Cell Ranger 3 and earlier: cellranger-x.y.z/cellranger-cs/x.y.z/lib/python/cellranger/barcodes/
ls /home/lishensuo/Softerware/cellranger-7.1.0/lib/python/cellranger/barcodes/
whitelist=/home/lishensuo/Softerware/cellranger-7.1.0/lib/python/cellranger/barcodes/737K-august-2016.txt
sample=SRR9008752

STAR --runThreadN 16 \
     --genomeDir ../basic/gtf/starsolo_index \
     --soloType CB_UMI_Simple \
     --readFilesIn ./work/cellranger/${sample}/outs/possorted_genome_bam.bam \
     --readFilesCommand samtools view -F 0x100 \
     --readFilesType SAM SE \
     --soloInputSAMattrBarcodeSeq CR UR \
     --soloInputSAMattrBarcodeQual CY UY \
     --soloCBwhitelist ${whitelist} \
     --soloFeatures Gene SJ

mkdir ./work/star/${sample}
mv Aligned.out.sam Log.final.out  Log.out  Log.progress.out  SJ.out.tab ./Solo.out
mv Solo.out ./work/star/${sample}

2、下游分析

2.1 整理比对结果

1
2
3
4
5
6
library(tidyverse)
library(Seurat)
library(R.utils) # ※
library(Matrix)

samples=c("SRR9008752","SRR9008753")

(1)Raw gene count

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
dir.create("./work/gene")
dir.create("./work/gene/raw")
dir.create("./work/gene/norm")

sce.list = lapply(samples, function(x){
    print(x)
    fls = list.files(paste0("./work/star/",x,"/Solo.out/Gene/filtered"), full.name=T)
    if(length(grep("gz",fls))!=3){
        for(fl in fls){ gzip(fl, remove=FALSE) }
    }
    count = Read10X(paste0("./work/star/",x,"/Solo.out/Gene/filtered"))
    sce = CreateSeuratObject(count)
    sce = RenameCells(sce, add.cell.id=x)
    sce$Group = x
    return(sce)
})
names(sce.list) = samples
# 若多个样本则合并处理
sce = merge(sce.list[[1]], sce.list[-1])
table(sce$Group)

count_merge = sce@assays$RNA@counts
dim(count_merge)

writeMM(count_merge, file = './work/gene/raw/matrix.mtx')

count_merge_phe = sce@meta.data %>% 
    tibble::rownames_to_column("cell.id") %>% 
    dplyr::select(cell.id)  # 仅需要1列
write.table(count_merge_phe, sep="\t", row.names=F, 
    file="./work/gene/raw/phenoData.txt")


count_merge_feat = data.frame(gene_short_name=rownames(sce))
write.table(count_merge_feat, sep="\t", row.names=F, 
    file="./work/gene/raw/featureData.txt")

(2)Norm gene count

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
sce = NormalizeData(sce)

count_merge_norm = Matrix::Matrix(exp(sce@assays$RNA@data)-1, sparse=T)  # 非log转换
count_merge_norm[1:4,1:4]
writeMM(count_merge_norm, file = './work/gene/norm/matrix_normalised.mtx')


count_merge_norm_phe = sce@meta.data %>% 
    tibble::rownames_to_column("cell.id") %>% 
    dplyr::select(cell.id, Group)  # 需要细胞分组等表型信息
write.table(count_merge_norm_phe, sep="\t", row.names=F, 
    file="./work/gene/norm/phenoData.txt")

count_merge_norm_feat = data.frame(gene_short_name=rownames(sce))
write.table(count_merge_norm_feat, sep="\t", row.names=F, 
    file="./work/gene/norm/featureData.txt")

(3)Raw SJ count

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
dir.create("./work/sj/raw")

# 整理SJ meta
for(x in samples){
sj_name = data.table::fread(paste0("./work/star/",x,"/Solo.out/SJ.out.tab"), data.table=F) %>% 
    dplyr::mutate(coord.intron=paste(V1, V2, V3, sep=":")) %>% 
    dplyr::select(coord.intron)
write.table(sj_name, sep="\t", row.names=F, col.names=F, quote=F,
    file=paste0("./work/star/",x,"/Solo.out/SJ/raw/features.tsv"))
}

sce_sj.list = lapply(samples, function(x){
    # x="SRR15319877"
    print(x)
    fls = list.files(paste0("./work/star/",x,"/Solo.out/SJ/raw"), full.name=T)
    if(length(grep("gz",fls))!=3){
        for(fl in fls){ gzip(fl, remove=FALSE) }
    }
    count = Read10X(paste0("./work/star/",x,"/Solo.out/SJ/raw"),gene.column=1)
    sce_sj = CreateSeuratObject(count)
    sce_sj = RenameCells(sce_sj, add.cell.id=x)
    sce_sj = sce_sj[,colnames(sce.list[[x]])]
    return(sce_sj)
})

sce_sj = merge(sce_sj.list[[1]], sce_sj.list[-1])
sj_count_merge = sce_sj@assays$RNA@counts
writeMM(sj_count_merge, file = './work/sj/raw/matrix.mtx')

sj_count_merge_phe = sce_sj@meta.data %>% 
    tibble::rownames_to_column("cell.id") %>% 
    dplyr::select(cell.id)  # 仅需要1列
write.table(sj_count_merge_phe, sep="\t", row.names=F, 
    file="./work/sj/raw/phenoData.txt")

sj_count_merge_feat = data.frame(coord.intron=rownames(sce_sj)) %>% 
    dplyr::mutate(coord.intron=gsub("-","_", coord.intron))
write.table(sj_count_merge_feat, sep="\t", row.names=F, 
    file="./work/sj/raw/featureData.txt")

(4)Cell dimension reduction

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
dir.create("./work/umap")

sce = sce %>%
  NormalizeData() %>%
  FindVariableFeatures() %>%
  ScaleData() %>% 
  Seurat::RunPCA() %>% 
  Seurat::RunUMAP(dims = 1:30)

umap_embed = sce@reductions$umap@cell.embeddings %>% 
    as.data.frame() %>% 
    tibble::rownames_to_column("cell.id") %>%
    dplyr::rename(x=UMAP_1, y=UMAP_2) 
write.table(umap_embed, row.names=F, sep="\t",
    file="./work/umap/UMAP_coordinate.txt")

2.2 创建marvel对象

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
# (1) norm gene的三文件
df.gene.norm <- readMM("./work/gene/norm/matrix_normalised.mtx")
df.gene.norm.pheno <- read.table("./work/gene/norm/phenoData.txt", sep="\t", header=TRUE)
df.gene.norm.feature <- read.table("./work/gene/norm/featureData.txt", sep="\t", header=TRUE)

# (2) raw gene的三文件
df.gene.count <- readMM("./work/gene/raw/matrix.mtx")
df.gene.count.pheno <- read.table("./work/gene/raw/phenoData.txt", sep="\t", header=TRUE)
df.gene.count.feature <- read.table("./work/gene/raw/featureData.txt", sep="\t", header=TRUE)

# (3) SJ gene的三文件
df.sj.count <- readMM("./work/sj/raw/matrix.mtx")
df.sj.count.pheno <- read.table("./work/sj/raw/phenoData.txt", sep="\t", header=TRUE)
df.sj.count.feature  <- read.table("./work/sj/raw/featureData.txt", sep="\t", header=TRUE)

rownames(df.sj.count) = df.sj.count.feature[,1]
colnames(df.sj.count) = df.sj.count.pheno[,1]

# (4) 细胞降维
df.coord <- read.table("./work/umap/UMAP_coordinate.txt", sep="\t", header=TRUE)

# (5) GTF
gtf <- data.table::fread("../basic/gtf/gencode.v31.annotation.gtf", sep="\t", header=FALSE) %>% 
     as.data.frame() 
## 由于不同来源gtf版本差异而踩的坑,需要修改如下
gtf$V9 = gsub("gene_type","gene_biotype",gtf$V9)
gtf$V1 = gsub("chr","",gtf$V1)


library(MARVEL)
marvel <- CreateMarvelObject.10x(gene.norm.matrix=df.gene.norm,
                                 gene.norm.pheno=df.gene.norm.pheno,
                                 gene.norm.feature=df.gene.norm.feature,
                                 gene.count.matrix=df.gene.count,
                                 gene.count.pheno=df.gene.count.pheno,
                                 gene.count.feature=df.gene.count.feature,
                                 sj.count.matrix=df.sj.count,
                                 sj.count.pheno=df.sj.count.pheno,
                                 sj.count.feature=df.sj.count.feature,
                                 pca=df.coord,
                                 gtf=gtf
                                 )

2.3 预处理步骤

  • 筛选有效的剪切位点
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
# 注释基因类型
marvel <- AnnotateGenes.10x(MarvelObject=marvel)
head(marvel$gene.metadata)

## SJ涉及两边分属一个蛋白编码基因
marvel <- AnnotateSJ.10x(MarvelObject=marvel)  #基因类型
head(marvel$sj.metadata)
marvel <- ValidateSJ.10x(MarvelObject=marvel)  #同一基因
head(marvel$sj.metadata)
marvel <- FilterGenes.10x(MarvelObject=marvel,
                          gene.type="protein_coding"  #蛋白基因
                          )
head(marvel$sj.metadata)
  • 分析基因在细胞群的表达率以及SJ在细胞群的发生率
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
cell.group.list = split(marvel$sample.metadata$cell.id,marvel$sample.metadata$Group)
## 基因在细胞群的表达率
marvel <- PlotPctExprCells.Genes.10x(MarvelObject=marvel,
                                     cell.group.g1=cell.group.list[[1]],
                                     cell.group.g2=cell.group.list[[2]],
                                     min.pct.cells=5
                                     )
head(marvel$pct.cells.expr$Gene$Data)
p1 = marvel$pct.cells.expr$Gene$Plot

## 剪切位点在不同细胞群的发生率
marvel <- PlotPctExprCells.SJ.10x(MarvelObject=marvel,
                                  cell.group.g1=cell.group.list[[1]],
                                  cell.group.g2=cell.group.list[[2]],
                                  min.pct.cells.genes=5,
                                  min.pct.cells.sj=5,
                                  downsample=TRUE,   
                                  downsample.pct.sj=10 # 抽样10%分析
                                  )
head(marvel$pct.cells.expr$SJ$Data)
p2 = marvel$pct.cells.expr$SJ$Plot
library(patchwork)
p1 + p2
image-20230306091952794
  • 保存中间结果
1
2
marvel <- CheckAlignment.10x(MarvelObject=marvel)
save(marvel, file="***.rda")

2.4 SJ差异分析

(1)差异SJ与基因

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
## 首先分析差异SJ
marvel <- CompareValues.SJ.10x(MarvelObject=marvel,
                               cell.group.g1=cell.group.list[[1]],
                               cell.group.g2=cell.group.list[[2]],
                               min.pct.cells.genes=10, #至少10%细胞表达
                               min.pct.cells.sj=10,    #至少10%细胞剪切
                               min.gene.norm=1.0,
                               seed=1,
                               n.iterations=100,
                               downsample=TRUE,  #根据样本量少的一组下采样
                               show.progress=FALSE
                               )
## 然后进一步分析差异基因
marvel <- CompareValues.Genes.10x(MarvelObject=marvel,
                                  show.progress=FALSE
                                  )

head(marvel$DE$SJ$Table)

(2)SJ与基因关系

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
marvel <- IsoSwitch.10x(MarvelObject=marvel,
                        pval.sj=0.05,
                        delta.sj=5,
                        min.gene.norm=1.0,
                        pval.adj.gene=0.05,
                        log2fc.gene=0.5
                        )
head(marvel$SJ.Gene.Cor$Data[,c("coord.intron", "gene_short_name", "cor.complete")])
marvel$SJ.Gene.Cor$Proportion$Table
#   sj.gene.cor freq       pct
# 2 Coordinated   26  8.387097
# 4    Opposing   54 17.419355
# 3  Iso-Switch  226 72.903226
# 1     Complex    4  1.290323
marvel$SJ.Gene.Cor$Proportion$Plot
  • 降维可视化
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
# Plot cell groups
marvel <- PlotValues.PCA.CellGroup.10x(MarvelObject=marvel,
                                       cell.group.list=cell.group.list[c(1,4)],
                                       legendtitle="Cell group",
                                       type="tsne"
                                       )

plot_group <- marvel$adhocPlot$PCA$CellGroup

# Plot gene expression
marvel <- PlotValues.PCA.Gene.10x(MarvelObject=marvel,
                                  gene_short_name="VIM",
                                  color.gradient=c("grey","cyan","green","yellow","red"),
                                  type="tsne"
                                  )
plot_gene <- marvel$adhocPlot$PCA$Gene

# Plot PSI
marvel <- PlotValues.PCA.PSI.10x(MarvelObject=marvel,
                                 coord.intron="chr10:17235390:17235845",
                                 min.gene.count=3,
                                 log2.transform=FALSE,
                                 color.gradient=c("grey","cyan","green","yellow","red"),
                                 type="tsne"
                                 )
plot_sj <- marvel$adhocPlot$PCA$PSI

library(patchwork)
p = plot_group + plot_gene + plot_sj

2.5 特定基因分析

(1)分组表达

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
# 该基因在不同阶段的表达
marvel <- adhocGene.TabulateExpression.Gene.10x(MarvelObject=marvel,
                                                cell.group.list=cell.group.list,
                                                gene_short_name="TPM2",
                                                min.pct.cells=10,
                                                downsample=TRUE
                                                )
marvel$adhocGene$Expression$Gene$Table
marvel$adhocGene$Expression$Gene$Plot

# 该基因的不同剪切位点在不同阶段的count表达
marvel <- adhocGene.TabulateExpression.PSI.10x(MarvelObject=marvel,
                                               min.pct.cells=10
                                               )
head(marvel$adhocGene$Expression$PSI$Table)
marvel$adhocGene$Expression$PSI$Plot

(2)组间差异

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
# 组间两两差异
marvel <- adhocGene.DE.Gene.10x(MarvelObject=marvel)
marvel <- adhocGene.DE.PSI.10x(MarvelObject=marvel)
results <- marvel$adhocGene$DE$PSI$Data
head(results)

# 选择特定SJ绘图 
# SJ-1
# Define SJ
coord.intron <- results[which(results$figure.column=="SJ-1"), "coord.intron"]
coord.intron <- unique(coord.intron)
coord.intron
marvel <- adhocGene.PlotDEValues.10x(MarvelObject=marvel,
                                  coord.intron=coord.intron,
                                  log2fc.gene=0.5,
                                  delta.sj=5,
                                  label.size=2,
                                  point.size=2,
                                  xmin=-2.0,
                                  xmax=2.0,
                                  ymin=-25,
                                  ymax=25
                                  )
marvel$adhocGene$DE$VolcanoPlot$Plot
image-20230306092629789
  • 使用wiggleplotr包可视化某剪切位点在基因不同转录本的可视化
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
# http://datashare.molbiol.ox.ac.uk/public/wwen/wiggleplotr_1.18.0.tar.gz
#install.packages("wiggleplotr_1.18.0.tar.gz", repos = NULL, type="source")

marvel <- adhocGene.PlotSJPosition.10x(MarvelObject=marvel,
                                      coord.intron=coord.intron,
                                      rescale_introns=FALSE,
                                      show.protein.coding.only=TRUE,
                                      anno.label.size=1.5
                                      )

marvel$adhocGene$SJPosition$Plot
image-20230306092824217 image-20230401100858958