基于先前的hECA文献笔记,学习使用python工具ECAUGHT高效提取特定类型的人类单细胞图谱数据。值得注意的是hECA对不同来源数据集仅进行了测序文库的标准化以及log转换,用户可根据特定应用场景进行适当的批次校正处理。

  • 官方教程:http://eca.xglab.tech/ecaugt/index.html

    1
    
    pip install ECAUGT
    
image-20230118211337588

1、前期准备

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
## 加载包
import sys
import pandas as pd
import ECAUGT
import time
import multiprocessing
import numpy as np

## 建立API链接
# set parameters
endpoint = "https://HCAd-Datasets.cn-beijing.ots.aliyuncs.com"
access_id = "LTAI5t7t216W9amUD1crMVos" #enter your id and keys
access_key = "ZJPlUbpLCij5qUPjbsU8GnQHm97IxJ"
instance_name = "HCAd-Datasets"
table_name = 'HCA_d'
# setup client
ECAUGT.Setup_Client(endpoint, access_id, access_key, instance_name, table_name)

2、表型筛选

hECA数据储存方式是行名是细胞id,列名包括基因名(43878) + 表型信息(18),共43896列的巨大矩阵。

首先可根据表型信息(meta.data)筛选目标细胞群,常用的两个条件是器官(organ)与细胞(cell_type)类型

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
## 如下查询肺组织的T细胞
rows_to_get = ECAUGT.search_metadata("organ == Lung && cell_type == T cell")
# 14894 cells found
rows_to_get[:5]
# [[('cid', 2000932)],
#  [('cid', 2000962)],
#  [('cid', 2000971)],
#  [('cid', 2000978)],
#  [('cid', 2000987)]]
## cell id格式为长度为2的元组,第一个元素是'cid',第二个元素是数字序号


## 常用逻辑操作符
# '==' means equal
# '<>' means unequal
# '&&' means AND operation
# '||' means OR operation
# '!' means not NOT operation

3、下载数据

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
df_result = ECAUGT.get_columnsbycell_para(
	rows_to_get = rows_to_get[:20], ## 返回指定细胞id
	cols_to_get=None,          ## 返回指定基因/表型列
	col_filter=None,           ## 特定基因表达模式
	do_transfer = True,        ## 是否返回DataFrame格式
	thread_num = multiprocessing.cpu_count()-1)
df_result.shape
# (20, 43896) 43878个基因信息 + 18列表型信息(metadata)

genes    = df_result.columns[:43878]         # 基因名
metaCols = df_result.columns[43878:43878+18] # 表型名
# Index(['cell_id', 'cell_type', 'cl_name', 'donor_age', 'donor_gender',
#        'donor_id', 'hcad_name', 'marker_gene', 'organ', 'original_name',
#        'region', 'sample_status', 'seq_tech', 'study_id', 'subregion',
#        'tissue_type', 'uHAF_name', 'user_id'],
#       dtype='object')

meta = df_result.loc[:,metaCols]
meta.reset_index(inplace=True)

expr = df_result.loc[:,genes]
expr.reset_index(inplace=True)
expr=expr.drop(['cid'], axis=1)
  • 筛选特定基因表达模式,并返回指定列的细胞数据
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
gene_condition = ECAUGT.set_gene_condition("PTPRC > 0.1 && CD3D>=0.1")

df_result = ECAUGT.get_columnsbycell_para(
	rows_to_get = rows_to_get[:20], 
	cols_to_get=['CD3D','PTPRC','donor_id','hcad_name'], 
	col_filter=gene_condition, 
	do_transfer = True, 
	thread_num = multiprocessing.cpu_count()-1)
df_result.shape
# (5, 4)

4、分批下载

  • 若目标细胞群过多,可分成小批量多次下载
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
from tqdm import tqdm
import pickle

for chunk in tqdm(range(int(1+len(rows_to_get)/500)),ncols=80):
    # split batches
    lb, rb = chunk*500, (chunk+1)*500
    rows = rows_to_get[lb:rb]
    if len(rows)<=0:break

    # download rows from the unified Giant Table (uGT)
    result = ECAUGT.get_columnsbycell_para(rows_to_get = rows,
                                           cols_to_get = None, # download all columns
                                           col_filter  = None,
                                           do_transfer = True,
                                           thread_num  = 24)
    result.to_pickle("__temp_%d_%d.pk"%(lb,rb))
    #print("downloading %d~%d"%(lb, rb))
    #print(len(rows))

# load split batches
giant_table_list = []
for chunk in tqdm(range(int(1+len(rows_to_get)/500)),ncols=80):
    lb, rb = chunk*500, (chunk+1)*500
    fname = "__temp_%d_%d.pk" % (lb, rb)
    with open(fname,'rb') as f:
        df=pickle.load(f)
        giant_table_list.append(df)

giant_table= giant_table_list[0]

for i in range(1, len(giant_table_list)):
    giant_table = pd.concat([ giant_table, giant_table_list[i] ])

# remove intermediate results
del giant_table_list
import gc
gc.collect()

#giant_table.to_pickle("sorted_tcells_raw.pk")
df_result = giant_table
df_result.shape
# (14894, 43896)