Li's Bioinfo-Blog

📖 Bioinformatics data analysis -- pipelines, toolkits, and more

Machine Learning with sklearn (4): Common Regression Learners

Goal: demonstrate how to use several common regressors and which candidate hyperparameters to consider when tuning them.

0. Example data

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

diabetes = datasets.load_diabetes()
diabetes_X, diabetes_y = diabetes.data, diabetes.target
diabetes_X = StandardScaler().fit_transform(diabetes_X)
train_X, test_X, train_y, test_y = train_test_split(diabetes_X, diabetes_y, test_size=0.3)
train_X.shape, test_X.shape, train_y.shape, test_y.shape
# ((309, 10), (133, 10), (309,), (133,))
```

1. K-nearest neighbors

```python
from sklearn.neighbors import KNeighborsRegressor
model_knn = KNeighborsRegressor()
param_grid = {"n_neighbors": [3, 5, 7, 10, 20],
              "p": [1, 2],
              "weights": ["uniform", "distance"]}
grid_search = GridSearchCV(model_knn, param_grid, cv=5,
                           scoring="neg_root_mean_squared_error", n_jobs=-1)
grid_search.fit(train_X, train_y)
print(grid_search.best_params_)
print(grid_search.best_score_)
knn_grid_search = grid_search
# {'n_neighbors': 10, 'p': 2, 'weights': 'distance'}
# -58.18148180421127
```

2. Linear regression

```python
from sklearn.linear_model import LinearRegression
model_linear = LinearRegression()
scores = cross_val_score(model_linear, train_X, train_y,
                         scoring="neg_root_mean_squared_error", cv=10)
print(scores.mean())
linear_cv = scores
# -56.297486183914245
```

3. Support vector machine

```python
from sklearn.svm import SVR
model_svm = SVR()
param_grid = [{'C': [0.01, 0.1, 1, 10, 100],
               'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
               'gamma': ['scale', 'auto']}]
grid_search = GridSearchCV(model_svm, param_grid, cv=5,
                           scoring="neg_root_mean_squared_error", n_jobs=1)
grid_search.fit(train_X, train_y)
print(grid_search.best_params_)
print(grid_search.best_score_)
svm_grid_search = grid_search
# {'C': 1, 'gamma': 'scale', 'kernel': 'linear'}
# -56.48299266830155
```

4. Random forest

```python
from sklearn.ensemble import RandomForestRegressor
model_rf = RandomForestRegressor()
param_grid = [{'n_estimators': [100, 200, 300, 500, 1000],
               'criterion': ["squared_error", "absolute_error"],  # differs from the classification task
               'max_depth': [4, 8, 16, 32],
               'max_features': ["sqrt", "log2"]}]
grid_search = GridSearchCV(model_rf, param_grid, cv=5,
                           scoring="neg_root_mean_squared_error", n_jobs=10)
grid_search.fit(train_X, train_y)
print(grid_search.best_params_)
print(grid_search.best_score_)
rf_grid_search = grid_search
# {'criterion': 'squared_error', 'max_depth': 4, 'max_features': 'sqrt', 'n_estimators': 500}
# -56.98892728898064
```

5. Gradient boosting machine

```python
from sklearn.ensemble import GradientBoostingRegressor
model_gbm = GradientBoostingRegressor()
param_grid = [{'loss': ['squared_error', 'absolute_error', 'huber', 'quantile'],  # differs from the classification task
               'learning_rate': [0.001, 0.01, 0.1],
               'n_estimators': [100, 200, 300, 500],
               'subsample': [0.5, 0.7, 1]}]
grid_search = GridSearchCV(model_gbm, param_grid, cv=5,
                           scoring="neg_root_mean_squared_error", n_jobs=10)
grid_search.fit(train_X, train_y)
print(grid_search.best_params_)
print(grid_search.best_score_)
gbm_grid_search = grid_search
# {'learning_rate': 0.01, 'loss': 'absolute_error', 'n_estimators': 500, 'subsample': 0.5}
# -57.07526918837941
```

6. XGBoost

```python
from xgboost import XGBRegressor
model_xgb = XGBRegressor()
param_grid = [{'n_estimators': [10, 30, 50],
               'learning_rate': [0.01, 0.1],
               'subsample': [0.5, 0.7, 1],
               'colsample_bytree': [0.5, 0.7, 1]}]
grid_search = GridSearchCV(model_xgb, param_grid, cv=5,
                           scoring="neg_root_mean_squared_error", n_jobs=10)
grid_search.fit(train_X, train_y)
print(grid_search.best_params_)
print(grid_search.best_score_)
xgb_grid_search = grid_search
# {'colsample_bytree': 0.5, 'learning_rate': 0.1, 'n_estimators': 50, 'subsample': 0.5}
# -58.57262307535045
```

Quick comparison

```python
import pandas as pd
pd.DataFrame({
    "KNN": knn_grid_search.best_score_,
    "Linear": linear_cv.mean(),
    "SVM": svm_grid_search.best_score_,
    "RF": rf_grid_search.best_score_,
    "GBM": gbm_grid_search.best_score_,
    "XGB": xgb_grid_search.best_score_
}, index=["score"]).T.plot.line()
```

...
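All of the grids above use `neg_root_mean_squared_error`, i.e. the negated RMSE, so scores closer to zero are better. A minimal pure-Python sketch of the metric itself (independent of sklearn; the sample values are illustrative):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the mean squared residual."""
    residuals = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(residuals) / len(residuals))

# sklearn's scorer negates the value so that "greater is better" holds
# uniformly across all scoring functions.
print(-rmse([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))
```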

Create: 2022-09-04 | Update: 2022-09-04 | Words: 578 | 2 min | Lishensuo

Machine Learning -- the AutoML Tool autogluon

I first learned about autogluon from one of Mu Li's talks on Bilibili. It is an automated machine learning tool that can be applied to text and image recognition, tabular tasks, and more. It reportedly performs remarkably well -- the claim is that "3 lines of code beat 99% of machine learning models", with some even saying it marks the end of the manual-tuning era. ...

Create: 2022-09-17 | Update: 2022-09-17 | Words: 1667 | 4 min | Lishensuo

Databases -- Drugs and Drug Targets: TTD

1. Overview of the TTD database. First, the biological definition of a target: a biological target is a structure within an organism that can be recognized or bound by other substances (ligands, drugs, etc.). Common drug targets include proteins, nucleic acids, and ion channels. (Wikipedia) ...

Create: 2022-05-03 | Update: 2022-05-03 | Words: 1517 | 4 min | Lishensuo

Organizing and Using the CMap Database

The CMap LINCS project uses the L1000 technology to run large-scale perturbation experiments on cell lines and derive differentially expressed genes. The data span two stages, Phase-1 and Phase-2, and have been organized and uploaded to Aliyun Drive. This note summarizes how to work with and use the data. ...

Create: 2022-04-21 | Update: 2022-04-21 | Words: 2399 | 5 min | Lishensuo

The MSigDB Gene Set Database

Official introduction: https://www.gsea-msigdb.org/gsea/msigdb/ Download page: http://www.gsea-msigdb.org/gsea/downloads.jsp ...
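MSigDB gene sets are distributed in the GMT format: one set per line, tab-separated, with the set name, a description (usually a URL), and then the member genes. A minimal parser sketch in Python (the example line is illustrative, not real MSigDB content):

```python
def parse_gmt(lines):
    """Parse GMT-format lines into {set_name: [genes]}.

    GMT layout per line: name <TAB> description <TAB> gene1 <TAB> gene2 ...
    The description field is skipped here.
    """
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # malformed: need name, description, and at least one gene
        name, _description, *genes = fields
        gene_sets[name] = genes
    return gene_sets

example = ["HALLMARK_APOPTOSIS\thttp://example.org/set\tCASP3\tBAX\tTP53"]
print(parse_gmt(example))
# {'HALLMARK_APOPTOSIS': ['CASP3', 'BAX', 'TP53']}
```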

Create: 2022-04-21 | Update: 2022-05-06 | Words: 2279 | 5 min | Lishensuo

Network Analysis and Visualization with the igraph Package

1. Creating and inspecting an igraph object. 1.1 Example data. The igraph package offers many functions and approaches for creating igraph objects. Here we use the common data.frame-based approach. The example data come from STRINGdb PPI protein-protein interaction data plus up/down-regulation information for the corresponding genes.

```r
library(STRINGdb)
library(tidyverse)
string_db <- STRINGdb$new(version="11", species=9606,
                          score_threshold=200, input_directory="")
data(diff_exp_example1)
genes = rbind(head(diff_exp_example1, 30), tail(diff_exp_example1, 30))
head(genes)
genes_mapped <- string_db$map(genes, "gene")
head(genes_mapped)
ppi = string_db$get_interactions(genes_mapped$STRING_id) %>% distinct()
edges = ppi %>%
  dplyr::left_join(genes_mapped[, c(1, 4)], by=c('from'='STRING_id')) %>%
  dplyr::rename(Gene1=gene) %>%
  dplyr::left_join(genes_mapped[, c(1, 4)], by=c('to'='STRING_id')) %>%
  dplyr::rename(Gene2=gene) %>%
  dplyr::select(Gene1, Gene2, combined_score)
nodes = genes_mapped %>%
  dplyr::filter(gene %in% c(edges$Gene1, edges$Gene2)) %>%
  dplyr::mutate(log10P = -log10(pvalue),
                direction = ifelse(logFC > 0, "Up", "Down")) %>%
  dplyr::select(gene, log10P, logFC, direction)

### Edge information
head(edges)
#   Gene1  Gene2  combined_score
# 1 UPK3B  PTS    244
# 2 GSTM5  ACOT12 204
# 3 GRHL3  IGDCC4 238
# 4 TNNC1  ATP13A1 222
# 5 NNAT   VSTM2L 281
# 6 EZH2   RBBP7  996

### Node information
head(nodes)
#   gene    log10P   logFC    direction
# 1 VSTM2L  3.992252 3.333461 Up
# 2 TNNC1   3.534468 2.932060 Up
# 3 MGAM    3.515558 2.369738 Up
# 4 IGDCC4  3.290137 2.409806 Up
# 5 UPK3B   3.248490 2.073072 Up
# 6 SLC52A1 3.227019 3.214998 Up
```

1.2 Creating the object: use the graph_from_data_frame() function ...

Create: 2022-04-16 | Update: 2022-05-17 | Words: 2209 | 5 min | Lishensuo

Random Walk with Restart and the RandomWalkRestartMH Package

1. About RWR. 1.1 Algorithm overview. Random Walk with Restart (RWR): given a network of nodes and edges (a PPI protein-protein interaction network is used as the running example below), select one gene or a group of genes. We want to know which of the remaining genes are most closely related to the selected one(s). RWR addresses exactly this; the basic idea is as follows: ...

Create: 2022-04-19 | Update: 2022-04-19 | Words: 1887 | 4 min | Lishensuo

Computing Gene Representations from Gene Pairs with the Gene2vec Algorithm

While studying GeneCompass, I learned that gene regulatory network information (gene pairs) can be used to compute gene embeddings. https://github.com/jingcheng-du/Gene2vec The key is to install gensim 3.4.0 under a python=3.7 environment. gensim is a popular toolkit in the NLP field. ...

Create: 2025-01-23 | Update: 2025-01-23 | Words: 959 | 2 min | Lishensuo

Converting Small-Molecule Chemical Formats with obabel

Install via conda

```shell
conda install -c conda-forge openbabel
obabel
# Open Babel 3.1.0 -- Nov  2 2021 -- 08:43:45
```

List the supported formats

```shell
obabel -L
# charges
# descriptors
# fingerprints
# forcefields
# formats
# loaders
# ops
obabel -L formats | head
# acesin   -- ACES input format [Write-only]
# acesout  -- ACES output format [Read-only]
# acr      -- ACR format [Read-only]
# adf      -- ADF cartesian input format [Write-only]
# adfband  -- ADF Band output format [Read-only]
# adfdftb  -- ADF DFTB output format [Read-only]
# adfout   -- ADF output format [Read-only]
# alc      -- Alchemy format
# aoforce  -- Turbomole AOFORCE output format [Read-only]
```

Format conversion ...

Create: 2022-04-16 | Update: 2022-04-16 | Words: 1296 | 3 min | Lishensuo

Tools for Generating Compound Fingerprints and Descriptors

1. rdkit

```python
# conda install -c conda-forge rdkit
from rdkit import Chem
from rdkit.Chem import MACCSkeys
from rdkit import DataStructs
from rdkit.Chem import Draw
```

1.1 Fingerprint encodings. (1) Topological Fingerprints

```python
m = Chem.MolFromSmiles('CCOC')
# Chem.MolToSmiles(mol)
fp = Chem.RDKFingerprint(m, fpSize=1024)  # fpSize sets the length; default is 2048
fp.GetNumBits()
# 1024
fp.ToBitString()
```
...
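Fingerprint bit strings like the one above are typically compared with the Tanimoto coefficient: the number of bits set in both fingerprints divided by the number set in either, which is the metric behind RDKit's DataStructs similarity functions. A pure-Python sketch on plain '0'/'1' strings (the inputs are illustrative, not real fingerprints):

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two equal-length '0'/'1' bit strings."""
    on_a = {i for i, b in enumerate(bits_a) if b == "1"}
    on_b = {i for i, b in enumerate(bits_b) if b == "1"}
    union = on_a | on_b
    if not union:
        return 0.0  # convention chosen here for two all-zero fingerprints
    return len(on_a & on_b) / len(union)

print(tanimoto("1101", "1001"))
# 0.6666666666666666  (2 shared bits / 3 total set bits)
```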

Create: 2023-03-20 | Update: 2023-03-20 | Words: 1070 | 3 min | Lishensuo
© 2025 Li's Bioinfo-Blog Powered by Hugo & PaperMod