1. Preparation

(1) Install the packages

pip install padelpy
pip install pandas

(2) XML descriptor files for the 12 fingerprint types: https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip

unzip fingerprints_xml.zip
mkdir fingerprints_xml
mv ./*.xml ./fingerprints_xml/
ls ./fingerprints_xml
# AtomPairs2DFingerprintCount.xml  ExtendedFingerprinter.xml   KlekotaRothFingerprintCount.xml  PubchemFingerprinter.xml
# AtomPairs2DFingerprinter.xml     Fingerprinter.xml           KlekotaRothFingerprinter.xml     SubstructureFingerprintCount.xml
# EStateFingerprinter.xml          GraphOnlyFingerprinter.xml  MACCSFingerprinter.xml           SubstructureFingerprinter.xml
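
If you would rather stay in Python (for example on Windows, where unzip and mv may not be available), the same step can be sketched with the standard library. The function name extract_xml and its out_dir default are illustrative choices, not part of padelpy; the sketch assumes the archive has already been downloaded to the working directory.

```python
import zipfile
from pathlib import Path


def extract_xml(zip_path, out_dir="fingerprints_xml"):
    """Extract every .xml member of the archive into out_dir, flattening folders."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            name = Path(member).name  # drop any directory prefix inside the zip
            # Skip non-XML members and hidden files such as macOS resource forks
            if not name.lower().endswith(".xml") or name.startswith("."):
                continue
            (out / name).write_bytes(archive.read(member))
            written.append(name)
    return sorted(written)
```

Calling extract_xml("fingerprints_xml.zip") should leave the 12 XML files in ./fingerprints_xml, matching the listing above.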

(3) Example small-molecule SMILES file: https://raw.githubusercontent.com/dataprofessor/data/master/HCV_NS5B_Curated.csv

2. Computing molecular fingerprints

(1) Map fingerprint names to their XML files

import glob
xml_files = glob.glob("./fingerprints_xml/*.xml")
xml_files.sort()
xml_files

FP_list = ['AtomPairs2DCount',
 'AtomPairs2D',
 'EState',
 'CDKextended',
 'CDK',
 'CDKgraphonly',
 'KlekotaRothCount',
 'KlekotaRoth',
 'MACCS',
 'PubChem',
 'SubstructureCount',
 'Substructure']
 
fp = dict(zip(FP_list, xml_files))
fp
# {'AtomPairs2DCount': './fingerprints_xml/AtomPairs2DFingerprintCount.xml',
#  'AtomPairs2D': './fingerprints_xml/AtomPairs2DFingerprinter.xml',
#  'EState': './fingerprints_xml/EStateFingerprinter.xml',
#  'CDKextended': './fingerprints_xml/ExtendedFingerprinter.xml',
#  'CDK': './fingerprints_xml/Fingerprinter.xml',
#  'CDKgraphonly': './fingerprints_xml/GraphOnlyFingerprinter.xml',
#  'KlekotaRothCount': './fingerprints_xml/KlekotaRothFingerprintCount.xml',
#  'KlekotaRoth': './fingerprints_xml/KlekotaRothFingerprinter.xml',
#  'MACCS': './fingerprints_xml/MACCSFingerprinter.xml',
#  'PubChem': './fingerprints_xml/PubchemFingerprinter.xml',
#  'SubstructureCount': './fingerprints_xml/SubstructureFingerprintCount.xml',
#  'Substructure': './fingerprints_xml/SubstructureFingerprinter.xml'}
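
The zip() call above pairs names with files purely by position, so it silently produces a wrong mapping if a file is missing or the sort order changes. A more defensive variant, sketched below, spells out each pairing and checks that the files exist. FP_FILES and build_fp_map are hypothetical names introduced here; the file names are copied from the output above.

```python
from pathlib import Path

# Fingerprint name -> XML file name, copied from the mapping printed above.
FP_FILES = {
    "AtomPairs2DCount": "AtomPairs2DFingerprintCount.xml",
    "AtomPairs2D": "AtomPairs2DFingerprinter.xml",
    "EState": "EStateFingerprinter.xml",
    "CDKextended": "ExtendedFingerprinter.xml",
    "CDK": "Fingerprinter.xml",
    "CDKgraphonly": "GraphOnlyFingerprinter.xml",
    "KlekotaRothCount": "KlekotaRothFingerprintCount.xml",
    "KlekotaRoth": "KlekotaRothFingerprinter.xml",
    "MACCS": "MACCSFingerprinter.xml",
    "PubChem": "PubchemFingerprinter.xml",
    "SubstructureCount": "SubstructureFingerprintCount.xml",
    "Substructure": "SubstructureFingerprinter.xml",
}


def build_fp_map(xml_dir="./fingerprints_xml"):
    """Return {fingerprint name: XML path}; raise if any XML file is missing."""
    folder = Path(xml_dir)
    missing = [f for f in FP_FILES.values() if not (folder / f).is_file()]
    if missing:
        raise FileNotFoundError(f"missing XML files in {folder}: {missing}")
    return {name: str(folder / f) for name, f in FP_FILES.items()}
```

With this in place, fp = build_fp_map() can stand in for the glob-and-zip version, and a missing download fails loudly instead of shifting every name by one.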

(2) Prepare the SMILES file for the target molecules

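import pandas as pd

df = pd.read_csv('./HCV_NS5B_Curated.csv')
# Keep only the SMILES and compound ID columns
df2 = df[['CANONICAL_SMILES','CMPD_CHEMBLID']]
df2.head()
# Save the first 50 molecules as a tab-separated .smi file with no header
df2.head(50).to_csv('molecule.smi', sep='\t', index=False, header=False)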
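
Before handing molecule.smi to PaDEL it can be worth reading it back to confirm that each line really is a SMILES string, a tab, and an ID. The helper below is a minimal sketch using only the standard library; read_smi is a name made up for this example.

```python
import csv


def read_smi(path):
    """Read a tab-separated .smi file into a list of (smiles, name) tuples."""
    records = []
    with open(path, newline="", encoding="utf-8") as handle:
        for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
            if not row:
                continue  # skip blank lines
            if len(row) != 2 or not row[0].strip() or not row[1].strip():
                raise ValueError(f"line {line_no}: expected 'SMILES<TAB>ID', got {row!r}")
            records.append((row[0], row[1]))
    return records
```

For the file written above, len(read_smi('molecule.smi')) should be 50, assuming none of the first 50 rows has an empty SMILES or ID field.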

(3) Compute the chosen fingerprint type

import pandas as pd
from padelpy import padeldescriptor

# Suppose we want the Substructure fingerprint
fingerprint = 'Substructure'
fingerprint_output_file = ''.join([fingerprint,'.csv']) # output file name: Substructure.csv
fingerprint_descriptortypes = fp[fingerprint] # path to the matching XML file

padeldescriptor(mol_dir='molecule.smi', 
                d_file=fingerprint_output_file, #'Substructure.csv'
                #descriptortypes='SubstructureFingerprint.xml', 
                descriptortypes= fingerprint_descriptortypes,
                detectaromaticity=True,
                standardizenitro=True,
                standardizetautomers=True,
                threads=2,
                removesalt=True,
                log=True,
                fingerprints=True)

descriptors = pd.read_csv(fingerprint_output_file)
descriptors
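
To compute all 12 fingerprint types rather than one, the same call can be repeated over the fp dictionary. The sketch below separates building the keyword arguments (plan_jobs) from running them (run_all) so the arguments can be inspected first. Both function names are hypothetical; the keyword arguments are the ones used in the call above. run_all needs padelpy and a Java runtime and has not been benchmarked here.

```python
def plan_jobs(fp, mol_file="molecule.smi", threads=2):
    """Build one dict of padeldescriptor keyword arguments per fingerprint."""
    jobs = []
    for name, xml_path in fp.items():
        jobs.append({
            "mol_dir": mol_file,
            "d_file": f"{name}.csv",       # e.g. Substructure.csv
            "descriptortypes": xml_path,   # XML file selecting the fingerprint
            "detectaromaticity": True,
            "standardizenitro": True,
            "standardizetautomers": True,
            "threads": threads,
            "removesalt": True,
            "log": True,
            "fingerprints": True,
        })
    return jobs


def run_all(fp):
    """Run PaDEL once per fingerprint; requires padelpy and Java."""
    # Imported lazily so plan_jobs can be used without padelpy installed
    from padelpy import padeldescriptor
    for job in plan_jobs(fp):
        padeldescriptor(**job)
```

After run_all(fp), each result can be loaded with pd.read_csv, for example pd.read_csv('MACCS.csv').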

3. Computing molecular descriptors

padelpy can also compute 1,875 molecular descriptors directly from each molecule's SMILES string.

import pandas as pd
from padelpy import from_smiles

df = pd.read_csv('./HCV_NS5B_Curated.csv')
smi = list(df["CANONICAL_SMILES"][0:50])
descriptors = from_smiles(smi)
descriptors_df = pd.DataFrame(descriptors)
descriptors_df.shape
# (50, 1875)
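
One thing to check before modelling: the descriptor values may come back as strings rather than numbers, with empty strings where PaDEL could not compute a value. That is an assumption about padelpy's return type, so inspect descriptors_df.dtypes to confirm. If it holds, a conversion along these lines can be applied to the list returned by from_smiles; to_numeric_rows is a name invented for this sketch.

```python
import math


def to_numeric_rows(rows):
    """Convert each descriptor value to float; unparseable values become NaN."""
    cleaned = []
    for row in rows:
        converted = {}
        for key, value in row.items():
            try:
                converted[key] = float(value)
            except (TypeError, ValueError):
                # Empty strings, None, or other non-numeric entries
                converted[key] = math.nan
        cleaned.append(converted)
    return cleaned
```

The result can be passed to pd.DataFrame in the same way as the raw list, for example pd.DataFrame(to_numeric_rows(descriptors)).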